...
XrootD functional test results (Icinga)
Kibana dashboard for WN/tranche IOPS monitoring
Echo storage node IOPS (per generation)
XrootD production changes
External gateways
9/05/23 - pgwrite bugfix rollout on external gateways | deemed irrelevant to the incident
Batch farm:
9th May (9:30 am): Draining of first half of worker-nodes
11th May (9:30 am): Update drained worker-nodes
Bring back online updated tranches
12th May (16:00): Drain remaining half of worker-nodes
15th May (14:00): Update drained worker-nodes
Bring back online updated tranches
Health check entire farm
Plots and associated info
...
The problem was first detected when the gateway functional tests started failing on Friday evening.
Unexpected failures were found in the XrootD logs, such as:
CephIOAdapterRaw::read: Error in read: -16
LoadCache Error: -16

Non expected offset: -1 8388608 41943040
Error trying to write out of order: expeted at: 41943040 got offset8388608 of len 8388608
XrdCephOssBufferedFile::Write: Write error fd: 437 rc:-22 off:8388608 len:8388608
230512 19:50:36 4111476 ofs_write: patls002.4684:11466@lcg2290 Unable to write atlas:datadisk/rucio/mc23_13p6TeV/17/17/EVNT.33427665._009684.pool.root.1; invalid argument

Error trying to write out of order: expeted at: 16777216 got offset41943040 of len 8388608
XrdCephOssBufferedFile::Write: Write error fd: 3494 rc:-22 off:41943040 len:8388608
XrdCephOssBufferedFile::Close: flush Error fd: 3494 rc:-16
'file already open for write' type errors
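The return codes in these log lines can be read as negative errno values, which is consistent with rc:-22 appearing next to "invalid argument". The sketch below decodes such codes and models an out-of-order write being rejected. It assumes the usual Linux errno numbering; `describe_rc` and `SequentialWriteBuffer` are hypothetical names used for illustration and are not the XrdCeph implementation.

```python
import errno
import os


def describe_rc(rc: int) -> str:
    """Translate a negative return code such as -22 into an errno name and message."""
    code = abs(rc)
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{name}: {os.strerror(code)}"


class SequentialWriteBuffer:
    """Hypothetical model of a buffer that only accepts strictly sequential writes."""

    def __init__(self) -> None:
        self.expected_offset = 0

    def write(self, offset: int, length: int) -> int:
        # A write that does not start where the previous one ended is rejected,
        # mirroring the "expeted at ... got offset ..." messages in the logs.
        if offset != self.expected_offset:
            return -errno.EINVAL
        self.expected_offset += length
        return length


if __name__ == "__main__":
    chunk = 8388608  # 8 MiB, the write length seen in the logs
    buf = SequentialWriteBuffer()
    for i in range(5):
        buf.write(i * chunk, chunk)  # next expected offset becomes 41943040
    rc = buf.write(chunk, chunk)     # replay of the offending offset 8388608
    print(rc, describe_rc(rc))
    print(-16, describe_rc(-16))
```

Under this reading, rc:-16 corresponds to EBUSY (resource busy) and rc:-22 to EINVAL (invalid argument).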
Restarting the gateways didn’t fix the issue. Memory spikes correlate with increased connections in the XrootD report monitoring shown below.
...
The recovery traffic was a red herring. The root cause appears to be the increased number of I/O operations caused by switching to atomic reads in the batch farm.
...
Most storage nodes coped well. The exception was the dell-2019 generation with a specific SSD/HDD type, which produced enough slow operations to trigger general timeouts and errors on the gateways, along with hanging socket connections, and this resulted in the failure.
...
=====> ofs.xattrlib /usr/lib64/libXrdCephXattr.so
Plugin version XrdOss v5.3.3 is incompatible with XrdCephOss v5.5.4 (must be <= 5.3.x) in osslib /usr/lib64/libXrdCeph-5.so
++++++ Checkpoint initialization started.
This was traced to a line change that was missed in the Dockerfile.
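A version mismatch like the one in the log above could be caught before deployment with a simple check on the image contents. The following is a minimal sketch of the rule as the error message words it (the plugin's major.minor must not exceed that of the host XrdOss). The function names are hypothetical, and the actual check inside XRootD may differ.

```python
def parse_version(text: str) -> tuple:
    """Parse a string such as 'v5.3.3' into (major, minor, patch) integers."""
    major, minor, patch = (int(part) for part in text.lstrip("v").split(".")[:3])
    return major, minor, patch


def plugin_is_compatible(host: str, plugin: str) -> bool:
    """Assumed rule: same major version, and plugin minor <= host minor."""
    host_major, host_minor, _ = parse_version(host)
    plugin_major, plugin_minor, _ = parse_version(plugin)
    return plugin_major == host_major and plugin_minor <= host_minor


if __name__ == "__main__":
    # Versions taken from the log line above.
    print(plugin_is_compatible("v5.3.3", "v5.5.4"))
```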
In the Ceph monitoring, the hard limit for read IOPS before the crash seems to be 150k, with a desirable rate of <100k. The current rate (without readV) is 30k.
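These read IOPS figures could be turned into alert levels for the monitoring. The sketch below is illustrative only: the thresholds are the approximate values quoted above, and the function name and level labels are assumptions, not an existing check.

```python
HARD_LIMIT_IOPS = 150_000     # approximate rate observed before the crash
DESIRABLE_MAX_IOPS = 100_000  # the desirable rate is below this value


def classify_read_iops(rate: float) -> str:
    """Map a cluster-wide read IOPS rate to a hypothetical alert level."""
    if rate >= HARD_LIMIT_IOPS:
        return "critical"
    if rate >= DESIRABLE_MAX_IOPS:
        return "warning"
    return "ok"


if __name__ == "__main__":
    current = 30_000  # current rate without readV
    print(classify_read_iops(current))
    print(f"headroom to hard limit: {HARD_LIMIT_IOPS / current:.1f}x")
```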