2024-04-23 LHCb WGprod Echo overload
On the evening of the 23rd, LHCb started running a new job type called WGprod at RAL at approximately 5PM.
These jobs pulled large amounts of data through the WN gateways. At 6PM the WN gateways started passing vector reads through to the cluster, probably because the XCaches were running out of resources to cache new requests.
The number of client operations on Echo quickly spiked to >100k IOPS, which is the approximate limit of the cluster.
The client operation time slowed down dramatically and transfers started failing. The cluster recovered by itself after the load subsided.
This was by far the highest sustained IO that has ever been requested from Echo by an LHC VO.
Alex R suggested that reducing prefetch on the WN XCaches would reduce the amount of passthrough. We could also potentially increase the amount of memory available to the XCaches.
The prefetch-off change was deployed on the batch farm, but the large number of LHCb jobs caused the '21 generation worker nodes to stall on Friday (25/04/24).
LHCb WGProduction Job failures. Most of the jobs failed due to vector read timeouts.
It looks like around 15k jobs were enough to overload Echo and cause roughly a third of requests to fail.
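
As a rough back-of-envelope check, the two approximate figures above (the ~100k IOPS cluster limit and the ~15k jobs) can be combined into a per-job load estimate. This is only a sketch: it assumes the IOPS spike and the job count refer to the same period and that load scales linearly with job count, neither of which is confirmed here. The 50% headroom value is purely hypothetical.

```python
# Back-of-envelope estimate using the approximate figures quoted in this entry.
ECHO_IOPS_LIMIT = 100_000   # approximate client-op limit of the cluster
JOBS_AT_OVERLOAD = 15_000   # approximate number of jobs when Echo overloaded


def iops_per_job(total_iops: float, jobs: int) -> float:
    """Average client operations per second attributable to one job."""
    return total_iops / jobs


def max_jobs(iops_limit: float, per_job: float, headroom: float) -> int:
    """Job count that keeps the load at about headroom * iops_limit.

    Assumes load grows linearly with the number of jobs.
    """
    return int(iops_limit * headroom / per_job)


per_job = iops_per_job(ECHO_IOPS_LIMIT, JOBS_AT_OVERLOAD)
print(f"Estimated load per job: {per_job:.1f} IOPS")

# Hypothetical target: keep this job type at or below 50% of the cluster limit.
print(f"Job cap at 50% headroom: {max_jobs(ECHO_IOPS_LIMIT, per_job, headroom=0.5)}")
```

Since the observed load was above 100k IOPS, the per-job figure is a lower bound, and any cap derived from it should be treated as a starting point rather than a safe limit.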