Other related links:

Useful pages for monitoring

Gateway status monitoring

XrootD report monitoring

Slow I/O monitoring

xrootd functional test results icinga

legacy network traffic

new network traffic

storage monitoring

Echo storage node IOPS (per generation)

Plots and associated info

Please add a screenshot (or more) of the plot, and the timestamped URL from which it was obtained.
For URLs which are generally useful, please also add them to the section above, with a brief description.

The problem was initially found when the gateway functional tests started failing on Friday evening.
Unexpected failures were found in the xrootd logs, such as:

  • CephIOAdapterRaw::read: Error in read: -16
    LoadCache Error: -16
  • Non expected offset: -1  8388608  41943040
    Error trying to write out of order: expeted at: 41943040 got offset8388608 of len 8388608
    XrdCephOssBufferedFile::Write: Write error  fd: 437 rc:-22 off:8388608 len:8388608
    230512 19:50:36 4111476 ofs_write: patls002.4684:11466@lcg2290 Unable to write atlas:datadisk/rucio/mc23_13p6TeV/17/17/EVNT.33427665._009684.pool.root.1; invalid argument
  • Error trying to write out of order: expeted at: 16777216 got offset41943040 of len 8388608
    XrdCephOssBufferedFile::Write: Write error  fd: 3494 rc:-22 off:41943040 len:8388608
    XrdCephOssBufferedFile::Close: flush Error fd: 3494 rc:-16
  • 'file already open for write' type errors
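As a rough aid for triaging logs like those above, the following sketch counts the return codes that appear in a set of log lines and maps them to errno names. It assumes the negative codes are negated Linux errno values (consistent with `rc:-22` being reported as "invalid argument" above); the function names and the regular expression are illustrative and were written against these excerpts only, not taken from xrootd or xrdceph.

```python
import errno
import re
from collections import Counter

# Matches return codes as they appear in the excerpts above,
# e.g. "rc:-22", "Error in read: -16" or "LoadCache Error: -16".
# Assumption: other message formats may exist and are not covered.
RC_PATTERN = re.compile(r"(?:rc:|Error in read:|Error:)\s*(-\d+)")


def decode_rc(rc):
    """Map a negative return code to an errno name, e.g. -16 -> 'EBUSY'."""
    return errno.errorcode.get(-rc, "UNKNOWN")


def count_return_codes(lines):
    """Count errno names over an iterable of log lines."""
    counts = Counter()
    for line in lines:
        for match in RC_PATTERN.finditer(line):
            counts[decode_rc(int(match.group(1)))] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "CephIOAdapterRaw::read: Error in read: -16",
        "XrdCephOssBufferedFile::Write: Write error  fd: 437 rc:-22 off:8388608 len:8388608",
    ]
    print(count_return_codes(sample))
```

Under this reading, -16 is EBUSY (device or resource busy) and -22 is EINVAL (invalid argument).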

Restarting the gateways did not fix the issue. Memory spikes correlated with increased connections in the XrootD report monitoring (see the plot below), which suggested a slowdown or other issue in Ceph.

The slow operation monitoring provided further evidence of this.

The recovery traffic turned out to be a red herring. The root cause appears to be the increased number of I/O operations caused by switching to atomic reads in the batch farm.

Most storage nodes (SNs) coped well, except for dell-2019 devices with a specific HDD type. These generated enough slow operations to trigger general timeouts and errors on the gateways, as well as hanging socket connections, which resulted in the failure.

The issue was exacerbated by the incorrect version of xrdceph being deployed on the workernodes (one without the xrdceph-side buffers), which further increased the number of I/O operations sent to the cluster.
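To give a feel for why missing client-side buffers matter, here is a deliberately simplified model of a sequential read. It is a sketch, not a description of how xrdceph behaves: the function name, the 64 KiB client read size and the 1 GiB file size are hypothetical, and the 8 MiB buffer size is borrowed from the `len:8388608` values in the gateway log excerpts rather than from any confirmed workernode setting.

```python
import math


def backend_reads(file_size, client_read_size, buffer_size=None):
    """Estimate backend read requests for one sequential pass over a file.

    Without a buffer, every client read is forwarded to the backend.
    With a buffer, the backend sees roughly one read per buffer fill.
    All sizes are in bytes.
    """
    if buffer_size is None:
        chunk = client_read_size
    else:
        chunk = max(buffer_size, client_read_size)
    return math.ceil(file_size / chunk)


if __name__ == "__main__":
    GiB = 1024 ** 3
    MiB = 1024 ** 2
    KiB = 1024
    # Hypothetical workload: 1 GiB file read in 64 KiB requests.
    print("unbuffered:", backend_reads(1 * GiB, 64 * KiB))
    print("buffered:  ", backend_reads(1 * GiB, 64 * KiB, buffer_size=8 * MiB))
```

In this toy model the unbuffered case issues two orders of magnitude more requests than the buffered one. Real access patterns (non-sequential or sparse reads) would change the ratio.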

During recovery, an attempt was made to deploy xrdceph-buffered on the workernodes, but the plugin threw a version conflict error when compiling the workernode container:

=====> ofs.xattrlib /usr/lib64/libXrdCephXattr.so
Plugin version XrdOss v5.3.3 is incompatible with XrdCephOss v5.5.4 (must be <= 5.3.x) in osslib /usr/lib64/libXrdCeph-5.so
++++++ Checkpoint initialization started.

This was found to be caused by a line in the Dockerfile that had not been updated.
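A check along the following lines could flag this kind of mismatch automatically by scanning captured xrootd startup output. It is only a sketch: the regular expression is modelled on the single error line quoted above, other XRootD versions may word the message differently, and the function name is hypothetical.

```python
import re

# Pattern modelled on the one error line quoted above (assumption:
# the wording is stable enough to match on).
CONFLICT = re.compile(
    r"Plugin version (?P<first_name>\S+) v(?P<first_version>[\d.]+) "
    r"is incompatible with (?P<second_name>\S+) v(?P<second_version>[\d.]+) "
    r"\(must be (?P<constraint>[^)]+)\)"
)


def find_version_conflicts(log_text):
    """Return one dict per plugin version conflict found in log_text."""
    return [match.groupdict() for match in CONFLICT.finditer(log_text)]


if __name__ == "__main__":
    line = (
        "Plugin version XrdOss v5.3.3 is incompatible with XrdCephOss v5.5.4 "
        "(must be <= 5.3.x) in osslib /usr/lib64/libXrdCeph-5.so"
    )
    for conflict in find_version_conflicts(line):
        print(conflict)
```

The sketch only extracts the fields from the message; it does not interpret which side needs to change.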
