...
The problem was first detected when the gateway functional tests started failing on Friday evening.
Unexpected failures were found in the xrootd logs, such as:
    CephIOAdapterRaw::read: Error in read: -16
    LoadCache Error: -16

    Non expected offset: -1 8388608 41943040
    Error trying to write out of order: expeted at: 41943040 got offset8388608 of len 8388608
    XrdCephOssBufferedFile::Write: Write error fd: 437 rc:-22 off:8388608 len:8388608
    230512 19:50:36 4111476 ofs_write: patls002.4684:11466@lcg2290 Unable to write atlas:datadisk/rucio/mc23_13p6TeV/17/17/EVNT.33427665._009684.pool.root.1; invalid argument

    Error trying to write out of order: expeted at: 16777216 got offset41943040 of len 8388608
    XrdCephOssBufferedFile::Write: Write error fd: 3494 rc:-22 off:41943040 len:8388608
    XrdCephOssBufferedFile::Close: flush Error fd: 3494 rc:-16
There were also errors of the 'file already open for write' type.
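The return codes in these log lines (-16, -22) look like negative errno values. That is an assumption rather than something the logs state outright, although the "invalid argument" text logged next to rc:-22 is consistent with it. Under that assumption, a minimal Python sketch for translating them into symbolic names (decode_rc is a hypothetical helper, not part of xrootd or Ceph):

```python
import errno
import os


def decode_rc(rc: int) -> str:
    """Map a return code to its symbolic errno name.

    Assumes the negative-errno convention (rc == -errno); this is an
    assumption about the logged values, not something the logs confirm.
    """
    return errno.errorcode.get(abs(rc), "UNKNOWN")


# Return codes that appear in the log excerpts above.
for rc in (-16, -22):
    # os.strerror gives the platform's human-readable description.
    print(rc, decode_rc(rc), os.strerror(abs(rc)))
```

On Linux this maps -16 to EBUSY and -22 to EINVAL. EBUSY would be consistent with the 'file already open for write' errors, and EINVAL with the out-of-order write rejections.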
Restarting the gateways didn't fix the issue. Memory spikes correlate with increased connections in the xrootd report monitoring, shown below.
...