Other related links:
XRootD: Code review for ReadV implementation in XrdCeph (March 2023) (code review summary)
https://stfc.atlassian.net/wiki/spaces/GRIDPP/pages/137265343/Non-striper+read+v+implementation+for+WN+s+xrootd+gateways (Change control document)
Useful pages for monitoring
xrootd functional test results icinga
kibana dashboard for WN/tranche IOPS monitoring
Echo storage node IOPS (per generation)
XrootD production changes
External gateways
9/05/23 - pgwrite bugfix rollout on external gateways | deemed irrelevant to the incident
Batch farm:
9th May (9:30 am): Draining of first half of worker-nodes
11th May (9:30 am): Update drained worker-nodes
Bring back online updated tranches
12th May (16:00): Drain remaining half of worker-nodes
15th May (14:00): Update drained worker-nodes
Bring back online updated tranches
Health check entire farm
Plots and associated info
Please add a screenshot (or more) of each plot, together with the timestamped URL from which it was obtained.
For URLs that are generally useful, please also add them to the section above, with a brief description.
The problem was first noticed when the gateway functional tests started failing on Friday evening.
Unexpected failures were found in the xrootd logs, such as:
CephIOAdapterRaw::read: Error in read: -16 LoadCache Error: -16
Non expected offset: -1 8388608 41943040 Error trying to write out of order: expeted at: 41943040 got offset8388608 of len 8388608 XrdCephOssBufferedFile::Write: Write error fd: 437 rc:-22 off:8388608 len:8388608 230512 19:50:36 4111476 ofs_write: patls002.4684:11466@lcg2290 Unable to write atlas:datadisk/rucio/mc23_13p6TeV/17/17/EVNT.33427665._009684.pool.root.1; invalid argument
Error trying to write out of order: expeted at: 16777216 got offset41943040 of len 8388608 XrdCephOssBufferedFile::Write: Write error fd: 3494 rc:-22 off:41943040 len:8388608 XrdCephOssBufferedFile::Close: flush Error fd: 3494 rc:-16
'file already open for write' type errors
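The rc values in these messages look like negated Linux errno codes, which is the usual convention for Ceph client libraries. Under that assumption -16 is EBUSY and -22 is EINVAL, which matches the "invalid argument" text in the ofs_write line. Below is a minimal sketch for decoding them when scanning gateway logs. The regular expression and the helper names are illustrative and are not part of XrdCeph.

```python
import errno
import os
import re

# Pattern for the "Write error" lines quoted above (illustrative, not an official log format).
WRITE_ERROR = re.compile(
    r"Write error fd: (?P<fd>\d+) rc:(?P<rc>-?\d+) off:(?P<off>\d+) len:(?P<len>\d+)"
)

MIB = 1024 * 1024


def decode_rc(rc: int) -> str:
    """Translate a negated errno-style return code into its symbolic name and text."""
    code = abs(rc)
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{name} ({os.strerror(code)})"


def parse_write_error(line: str):
    """Extract fd, rc, offset and length from a write-error line, or return None."""
    match = WRITE_ERROR.search(line)
    if match is None:
        return None
    fields = {key: int(value) for key, value in match.groupdict().items()}
    fields["errno_name"] = errno.errorcode.get(abs(fields["rc"]), "UNKNOWN")
    # Offsets in the quoted logs are multiples of 8 MiB, so MiB is a convenient unit.
    fields["off_mib"] = fields["off"] / MIB
    fields["len_mib"] = fields["len"] / MIB
    return fields


if __name__ == "__main__":
    sample = ("XrdCephOssBufferedFile::Write: Write error fd: 437 "
              "rc:-22 off:8388608 len:8388608")
    print(parse_write_error(sample))
    print(decode_rc(-16))
```

Read this way, the second example shows a write arriving at offset 8 MiB when the buffered file expected the next write at 40 MiB, which was rejected with EINVAL; the -16 codes on read and close correspond to EBUSY.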
Restarting the gateways did not fix the issue. Memory spikes correlated with increased connection counts in the xrootd report monitoring shown below,
which suggested a slowdown or other issue in Ceph. This was further supported by the slow operation monitoring.
The recovery traffic turned out to be a red herring; the root cause appears to be the increased number of I/O operations that resulted from switching to atomic reads in the batch farm.
Most storage nodes (SNs) coped well, except for dell-2019 devices with a specific HDD type. These generated enough slow operations to trigger general timeouts and errors on the gateways, as well as hanging socket connections, which resulted in the failure.
The issue was exacerbated by an incorrect version of xrdceph (without the xrdceph-side buffers) being deployed on the worker nodes, which further increased the number of I/O operations sent to the cluster.
During recovery, an attempt was made to deploy xrdceph-buffered on the worker nodes, but the plugin threw a version conflict error when compiling the worker-node container:
=====> ofs.xattrlib /usr/lib64/libXrdCephXattr.so
Plugin version XrdOss v5.3.3 is incompatible with XrdCephOss v5.5.4 (must be <= 5.3.x) in osslib /usr/lib64/libXrdCeph-5.so
++++++ Checkpoint initialization started.
This was found to be due to a missed line change in the Dockerfile.
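A mismatch like this could in principle be caught before rollout by comparing the two versions named in the message. The sketch below parses the quoted error line and applies a simplified reading of "(must be <= 5.3.x)": same major version, and plugin minor version no higher than the host's. This is an assumption for illustration, not XRootD's actual compatibility rule, and the function names are hypothetical. It also assumes the first version in the message is the host interface and the second is the plugin.

```python
import re

# Matches version strings of the form vMAJOR.MINOR.PATCH, as they appear in the error message.
VERSION = re.compile(r"v(\d+)\.(\d+)\.(\d+)")


def parse_versions(message: str):
    """Return every vMAJOR.MINOR.PATCH triple found in the message, in order."""
    return [tuple(int(part) for part in m.groups()) for m in VERSION.finditer(message)]


def plugin_fits_host(host, plugin) -> bool:
    """Simplified check (assumption): same major, plugin minor not above host minor."""
    return plugin[0] == host[0] and plugin[1] <= host[1]


MESSAGE = (
    "Plugin version XrdOss v5.3.3 is incompatible with XrdCephOss v5.5.4 "
    "(must be <= 5.3.x) in osslib /usr/lib64/libXrdCeph-5.so"
)

# Assumed order: host interface version first, plugin version second.
host_version, plugin_version = parse_versions(MESSAGE)

if __name__ == "__main__":
    print(host_version, plugin_version, plugin_fits_host(host_version, plugin_version))
```

For the quoted message this reports the pair as incompatible, since the plugin's minor version (5) is above the host's (3).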
The hard limit for read IOPS before the crash, as seen in the Ceph monitoring, seems to be 150k, with a desirable rate of <100k. The current rate (without readV) is 30k.
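To make the headroom explicit, the figures above can be turned into a simple classification. This is only a sketch: the two thresholds come from the numbers quoted here, while the "ok" / "warning" / "critical" labels and the function names are hypothetical and do not correspond to any existing alert configuration.

```python
# Thresholds taken from the figures quoted above (read IOPS across the cluster).
HARD_LIMIT_IOPS = 150_000   # approximate rate seen before the crash
DESIRABLE_IOPS = 100_000    # rate we would like to stay below


def classify_read_iops(rate: float) -> str:
    """Label a read IOPS rate against the two thresholds (labels are illustrative)."""
    if rate >= HARD_LIMIT_IOPS:
        return "critical"
    if rate >= DESIRABLE_IOPS:
        return "warning"
    return "ok"


def headroom_factor(rate: float, limit: float) -> float:
    """How many times the current rate could grow before reaching the limit."""
    return limit / rate


if __name__ == "__main__":
    current = 30_000  # current rate without readV
    print(classify_read_iops(current))
    print(round(headroom_factor(current, DESIRABLE_IOPS), 2))
    print(headroom_factor(current, HARD_LIMIT_IOPS))
```

At the current 30k the rate sits at 30% of the desirable ceiling and 20% of the hard limit, so read IOPS could roughly triple before leaving the desirable range.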