...
which suggested a slowdown or other issue within Ceph. This was further corroborated by the slow operation monitoring
...
There was a red herring from the recovery traffic, but the root cause appears to be the increased number of I/O operations caused by the switch to atomic reads on the batch farm.
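To make the amplification concrete, the arithmetic looks roughly as follows. The request sizes here are purely illustrative assumptions, not values measured during the incident:

// io_amplification.cpp -- illustrative arithmetic only; the 64 KiB and 4 MiB
// request sizes are assumptions, not measurements from the incident.
#include <cstdint>
#include <iostream>

int main() {
    const std::uint64_t file_size  = 1ULL << 30;        // 1 GiB file read by a batch job
    const std::uint64_t small_read = 64 * 1024;         // small, unbuffered "atomic" reads
    const std::uint64_t coalesced  = 4 * 1024 * 1024;   // reads coalesced by a buffering layer

    const std::uint64_t ops_small     = file_size / small_read;
    const std::uint64_t ops_coalesced = file_size / coalesced;

    std::cout << "small reads    : " << ops_small     << " backend operations\n";   // 16384
    std::cout << "coalesced reads: " << ops_coalesced << " backend operations\n";   // 256
    std::cout << "amplification  : x" << ops_small / ops_coalesced << "\n";          // x64
    return 0;
}

With these example sizes, the same file generates 64 times as many operations for the cluster to service, which is the kind of load increase the slow operation monitoring would pick up.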
...
Most storage nodes coped well, except for the dell-2019 devices with a particular SSD type, which generated enough slow operations to trigger widespread timeouts and errors on the gateways, along with hanging socket connections, which ultimately resulted in the failure.
The issue was exacerbated by an incorrect version of xrdceph being deployed on the workernodes (one without the xrdceph-side buffers), which further increased the number of I/O operations sent to the cluster.
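For reference, the xrdceph-side buffering coalesces a job's many small reads into fewer, larger reads against the cluster. The following is only a minimal sketch of that idea, assuming a generic backend_read callback as a stand-in for the real librados path; it is not taken from the actual XrdCeph code:

// readahead_sketch.cpp -- minimal read-coalescing sketch; NOT the XrdCeph implementation.
#include <cstdint>
#include <cstring>
#include <functional>
#include <vector>

class ReadBuffer {
public:
    using BackendRead =
        std::function<std::int64_t(std::uint64_t off, char* buf, std::uint64_t len)>;

    ReadBuffer(BackendRead backend, std::uint64_t block_size = 4 * 1024 * 1024)
        : backend_(std::move(backend)), block_(block_size), cache_(block_size) {}

    // Serve a small client read; only hit the backend when the requested
    // offset is not covered by the currently cached block.
    std::int64_t read(std::uint64_t off, char* buf, std::uint64_t len) {
        if (!valid_ || off < cache_off_ || off >= cache_off_ + cache_len_) {
            cache_off_ = (off / block_) * block_;                 // align to block boundary
            std::int64_t got = backend_(cache_off_, cache_.data(), block_);
            if (got <= 0) { valid_ = false; return got; }         // error or EOF
            cache_len_ = static_cast<std::uint64_t>(got);
            valid_ = true;
            if (off >= cache_off_ + cache_len_) return 0;         // offset past EOF
        }
        std::uint64_t avail = cache_off_ + cache_len_ - off;
        std::uint64_t n = len < avail ? len : avail;              // may be a short read
        std::memcpy(buf, cache_.data() + (off - cache_off_), n);
        return static_cast<std::int64_t>(n);
    }

private:
    BackendRead   backend_;
    std::uint64_t block_;
    std::vector<char> cache_;
    std::uint64_t cache_off_ = 0;
    std::uint64_t cache_len_ = 0;
    bool valid_ = false;
};

With a layer like this in place, sequential 64 KiB client reads are served from the cached block and only one backend read per 4 MiB block reaches the cluster; without it, every client read becomes a separate operation against Ceph.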
During recovery, an attempt was made to deploy xrdceph-buffered on the workernodes, but the plugin produced a version-conflict error when the workernode container was compiled:
=====> ofs.xattrlib /usr/lib64/libXrdCephXattr.so
Plugin version XrdOss v5.3.3 is incompatible with XrdCephOss v5.5.4 (must be <= 5.3.x) in osslib /usr/lib64/libXrdCeph-5.so
++++++ Checkpoint initialization started.
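This message appears to come from XRootD's standard plugin version guard: the shared library records the XRootD version it was built against, and a loader from an older release (here 5.3.3) refuses a plugin stamped with a newer one (here 5.5.4). The sketch below shows the usual stamping mechanism via XrdVersion.hh; the surrounding factory code is illustrative and the DemoCephOss name is hypothetical, not the real XrdCephOss source:

// version_stamp_sketch.cpp -- illustrative only; not the real XrdCephOss source.
// Compiling this against XRootD 5.5.4 headers stamps the library with v5.5.4,
// which a 5.3.3 loader then rejects, as in the log above.
#include "XrdVersion.hh"      // provides the XrdVERSIONINFO macro
#include "XrdOss/XrdOss.hh"   // OSS plugin interface

class XrdSysLogger;           // forward declaration from the XrdSys headers

// Standard entry point an OSS-style plugin exports; body omitted in this sketch.
extern "C" XrdOss* XrdOssGetStorageSystem(XrdOss*       native_oss,
                                          XrdSysLogger* logger,
                                          const char*   config_fn,
                                          const char*   parms)
{
    (void)native_oss; (void)logger; (void)config_fn; (void)parms;
    return nullptr;           // a real plugin returns its storage-system object here
}

// Embeds the XRootD version the library was compiled against; the server
// compares this value with its own version before accepting the plugin.
XrdVERSIONINFO(XrdOssGetStorageSystem, DemoCephOss);

The practical implication, given the "must be <= 5.3.x" constraint in the error, is presumably that xrdceph-buffered needs to be rebuilt against the 5.3.x XRootD headers used in the workernode container, or the container's XRootD upgraded to match, before the plugin can be loaded.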