Other related links:
XRootD: Code review for ReadV implementation in XrdCeph (March 2023) (code review summary)
https://stfc.atlassian.net/wiki/spaces/GRIDPP/pages/137265343/Non-striper+read+v+implementation+for+WN+s+xrootd+gateways (Change control document)
Useful pages for monitoring
Plots and associated info
Please add a screenshot (or more) of each plot, together with the timestamped URL from which it was obtained.
For URLs that are generally useful, please also add them to the section above with a brief description.
The problem was first detected when the gateway functional tests started failing on Friday evening. Restarting the gateways did not fix the issue. Memory spikes correlated with increased connections in the xrootd report monitoring shown below,
which suggested a slowdown or other issue in Ceph. This was further supported by the slow operation monitoring.
The recovery traffic was a red herring; the root cause appears to be the increased number of I/O operations caused by switching to atomic reads in the batch farm.
Most sns coped well, except for dell-2019 devices with a specific SSD type. These generated enough slow operations to trigger general timeouts and errors on the gateways, as well as hanging socket connections, which resulted in the failure.
The issue was exacerbated by the incorrect version of xrdceph (without xrdceph-side buffers) being deployed on the workernodes, which further increased the number of I/O operations sent to the cluster.
During recovery, an attempt was made to deploy xrdceph-buffered on the workernodes, but the plugin threw a version conflict error when compiling the workernode container:
=====> ofs.xattrlib /usr/lib64/libXrdCephXattr.so
Plugin version XrdOss v5.3.3 is incompatible with XrdCephOss v5.5.4 (must be <= 5.3.x) in osslib /usr/lib64/libXrdCeph-5.so
++++++ Checkpoint initialization started.
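A conflict like the one in the log above could be caught automatically before a container is rolled out. The following is a minimal sketch only, assuming the message format is exactly as it appears in the log excerpt. The function name `find_version_conflicts` and the field names are hypothetical and are not part of XRootD or XrdCeph.

```python
import re

# Hypothetical helper: the pattern is derived only from the single log line
# quoted in this page, so other XRootD message variants may not match.
CONFLICT_PATTERN = re.compile(
    r"Plugin version (?P<first_name>\S+) v(?P<first_version>[\d.]+) "
    r"is incompatible with (?P<second_name>\S+) v(?P<second_version>[\d.]+) "
    r"\(must be <= (?P<max_allowed>[\d.x]+)\)"
)


def find_version_conflicts(log_text):
    """Return one dict per version-conflict line found in the log text."""
    conflicts = []
    for line in log_text.splitlines():
        match = CONFLICT_PATTERN.search(line)
        if match:
            conflicts.append(match.groupdict())
    return conflicts


# The log excerpt from this incident, used as sample input.
SAMPLE_LOG = """=====> ofs.xattrlib /usr/lib64/libXrdCephXattr.so
Plugin version XrdOss v5.3.3 is incompatible with XrdCephOss v5.5.4 (must be <= 5.3.x) in osslib /usr/lib64/libXrdCeph-5.so
++++++ Checkpoint initialization started."""

if __name__ == "__main__":
    for conflict in find_version_conflicts(SAMPLE_LOG):
        print(
            f"{conflict['first_name']} v{conflict['first_version']} vs "
            f"{conflict['second_name']} v{conflict['second_version']} "
            f"(limit {conflict['max_allowed']})"
        )
```

A check of this kind could be run against the xrootd startup log as part of the container build, so that a mismatched plugin fails the build instead of reaching the workernodes.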