12–16 May 2023 Echo instability following readV rollout

Other related links:

 

Useful pages for monitoring

Gateway status monitoring

XrootD report monitoring

Slow I/O monitoring

xrootd functional test results icinga

legacy network traffic

new network traffic

storage monitoring

kibana dashboard for WN/tranche IOPS monitoring

Echo storage node IOPS (per generation)

XrootD production changes

External gateways

9 May 2023: pgwrite bugfix rollout on external gateways (deemed irrelevant to the incident)

Batch farm:

  • 9th May (9:30 am): Draining of first half of worker-nodes

  • 11th May (9:30 am): Update drained worker-nodes

  • Bring updated tranches back online

  • 12th May (16:00): Drain remaining half of worker-nodes

  • 15th May (14:00): Update drained worker-nodes

  • Bring updated tranches back online

  • Health check entire farm

Plots and associated info

Please add a screenshot (or more) of each plot, along with the timestamped URL from which it was obtained.
For URLs that are generally useful, please also add them to the section above with a brief description.

 

The problem was first detected when the gateway functional tests started failing on Friday evening (12 May).
Unexpected failures were found in the XRootD logs, such as:

  • CephIOAdapterRaw::read: Error in read: -16
    LoadCache Error: -16
  • Non expected offset: -1 8388608 41943040
    Error trying to write out of order: expeted at: 41943040 got offset8388608 of len 8388608
    XrdCephOssBufferedFile::Write: Write error fd: 437 rc:-22 off:8388608 len:8388608
    230512 19:50:36 4111476 ofs_write: patls002.4684:11466@lcg2290 Unable to write atlas:datadisk/rucio/mc23_13p6TeV/17/17/EVNT.33427665._009684.pool.root.1; invalid argument
  • Error trying to write out of order: expeted at: 16777216 got offset41943040 of len 8388608
    XrdCephOssBufferedFile::Write: Write error fd: 3494 rc:-22 off:41943040 len:8388608
    XrdCephOssBufferedFile::Close: flush Error fd: 3494 rc:-16
  • 'file already open for write' type errors
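
When triaging log lines like the ones above, it can help to pull out the offsets and translate the return codes. The following is a minimal sketch, assuming the rc values are negated Linux errno codes (which is consistent with the "invalid argument" text on the rc:-22 line). The helper names describe_rc and summarise are hypothetical and are not part of XRootD or xrdceph.

```python
import errno
import os
import re

# Matches the out-of-order write message as it appears in the gateway logs,
# including the "expeted" spelling and the missing space after "offset".
OUT_OF_ORDER = re.compile(
    r"Error trying to write out of order: expeted at: (\d+) "
    r"got offset\s*(\d+) of len (\d+)"
)
RC = re.compile(r"rc:\s*(-?\d+)")


def describe_rc(rc):
    """Translate a negated errno-style return code (assumption) to text."""
    code = abs(rc)
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{name} ({os.strerror(code)})"


def summarise(line):
    """Extract return codes and, if present, the out-of-order offsets."""
    result = {"rcs": [int(value) for value in RC.findall(line)]}
    match = OUT_OF_ORDER.search(line)
    if match:
        expected, got, length = (int(g) for g in match.groups())
        result.update(
            expected=expected,
            got=got,
            length=length,
            # How many buffers ahead (+) or behind (-) the write arrived.
            gap_in_buffers=(got - expected) // length,
        )
    return result


line = (
    "Error trying to write out of order: expeted at: 16777216 "
    "got offset41943040 of len 8388608 "
    "XrdCephOssBufferedFile::Write: Write error fd: 3494 rc:-22 "
    "off:41943040 len:8388608 "
    "XrdCephOssBufferedFile::Close: flush Error fd: 3494 rc:-16"
)
info = summarise(line)
print(info)
print([describe_rc(rc) for rc in info["rcs"]])
```

Under this reading, rc:-16 corresponds to EBUSY and rc:-22 to EINVAL; the 8388608-byte length is 8 MiB, so the example write arrived three buffers ahead of the expected offset.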

 

Restarting the gateways did not fix the issue. Memory spikes on the gateways correlated with increased connection counts in the XRootD report monitoring, which suggested a slowdown or other issue in Ceph. The slow operation monitoring supported this interpretation.

Recovery traffic turned out to be a red herring; the root cause appears to be the increased number of I/O operations caused by the switch to atomic reads (the readV rollout) in the batch farm.

 

Most storage nodes (SNs) coped well, except for dell-2019 devices with a specific HDD type. These generated enough slow operations to trigger general timeouts and errors on the gateways, as well as hanging socket connections, which resulted in the failure.

The issue was exacerbated by the incorrect version of xrdceph being deployed on the worker nodes (one without xrdceph-side buffers), which further increased the number of I/O operations sent to the cluster.

During recovery, an attempt was made to deploy xrdceph-buffered on the worker nodes, but the plugin threw a version conflict error when building the worker-node container:

=====> ofs.xattrlib /usr/lib64/libXrdCephXattr.so
Plugin version XrdOss v5.3.3 is incompatible with XrdCephOss v5.5.4 (must be <= 5.3.x) in osslib /usr/lib64/libXrdCeph-5.so
++++++ Checkpoint initialization started.

This was found to be caused by a missed line change in the Dockerfile.
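
A check of this kind could catch such a mismatch before a container is rolled out. The sketch below scans captured startup output for the incompatibility line quoted above. The pattern is derived only from the wording in this incident's log and may not match other XRootD versions; the function name find_version_conflicts is hypothetical.

```python
import re

# Pattern based on the incompatibility line seen in this incident.
# Assumption: other XRootD releases may word this message differently.
CONFLICT = re.compile(
    r"Plugin version (\S+) v(\d+)\.(\d+)\.(\d+) is incompatible with "
    r"(\S+) v(\d+)\.(\d+)\.(\d+)"
)


def find_version_conflicts(log_text):
    """Return one dict per incompatibility line found in log_text."""
    conflicts = []
    for match in CONFLICT.finditer(log_text):
        conflicts.append({
            "first_name": match.group(1),
            "first_version": tuple(int(x) for x in match.group(2, 3, 4)),
            "second_name": match.group(5),
            "second_version": tuple(int(x) for x in match.group(6, 7, 8)),
        })
    return conflicts


sample = """=====> ofs.xattrlib /usr/lib64/libXrdCephXattr.so
Plugin version XrdOss v5.3.3 is incompatible with XrdCephOss v5.5.4 (must be <= 5.3.x) in osslib /usr/lib64/libXrdCeph-5.so
++++++ Checkpoint initialization started.
"""

conflicts = find_version_conflicts(sample)
for conflict in conflicts:
    print(conflict)
```

A build or smoke-test step could fail whenever the returned list is non-empty, so that a stale version pin is noticed before the tranche is brought back online.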

From the Ceph monitoring, the hard limit for read IOPS before the crash appears to be 150k, with a desirable rate below 100k. The current rate (without readV) is 30k.
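
These figures can be turned into a simple threshold check, for example as a starting point for an alert rule. This is only a sketch: the 150k and 100k values are the approximate numbers observed in this incident, and the status labels and function names are hypothetical.

```python
# Approximate values observed in this incident; not tuned limits.
HARD_LIMIT_IOPS = 150_000   # read IOPS at which the cluster fell over
TARGET_MAX_IOPS = 100_000   # desirable ceiling for read IOPS


def classify_read_iops(rate):
    """Map a cluster-wide read IOPS rate to a coarse status label."""
    if rate >= HARD_LIMIT_IOPS:
        return "critical"
    if rate >= TARGET_MAX_IOPS:
        return "warning"
    return "ok"


def utilisation(rate):
    """Fraction of the observed hard limit currently in use."""
    return rate / HARD_LIMIT_IOPS


current = 30_000  # rate reported without readV
print(classify_read_iops(current), f"{utilisation(current):.0%}")
```

At 30k read IOPS the cluster sits at roughly a fifth of the observed hard limit, which leaves room for a staged reintroduction of readV while watching for the 100k mark.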