...
XrootD functional test results (Icinga)
Kibana dashboard for WN/tranche IOPS monitoring
Echo storage node IOPS (per generation)
XrootD production changes
External gateways
9/05/23 - pgwrite bugfix rollout on external gateways (deemed irrelevant to the incident)
Batch farm (a sketch of scripting this drain/update cycle follows the list):
9th May (9:30 am): Drain the first half of the worker nodes
11th May (9:30 am): Update the drained worker nodes
Bring the updated tranches back online
12th May (16:00): Drain the remaining half of the worker nodes
15th May (14:00): Update the drained worker nodes
Bring the updated tranches back online
Health check the entire farm
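A rolling drain/update cycle like the one above is usually scripted. The sketch below is a rough illustration only: it assumes an HTCondor batch farm (condor_drain and its -cancel option are standard HTCondor tools, but their use here is an assumption) and uses a placeholder command for the update step.

    import subprocess

    # Hypothetical tranche of worker nodes; real hostnames would come from inventory.
    TRANCHE = ["wn-2019-001.example.ac.uk", "wn-2019-002.example.ac.uk"]

    def run(cmd):
        """Run a command, reporting failures without aborting the whole rollout."""
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            print(f"FAILED: {' '.join(cmd)}: {result.stderr.strip()}")
        return result.returncode == 0

    # 1. Drain the tranche: condor_drain stops new jobs landing and lets
    #    running jobs finish (assumes an HTCondor farm).
    for node in TRANCHE:
        run(["condor_drain", node])

    # 2. Update the drained nodes -- placeholder; the real update mechanism
    #    (configuration management, package updates, reboots) is not described here.
    for node in TRANCHE:
        run(["ssh", node, "yum", "-y", "update"])

    # 3. Bring the updated tranche back online by cancelling the drain.
    for node in TRANCHE:
        run(["condor_drain", "-cancel", node])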
Plots and associated info
...
The problem was initially found by gateway functional tests failing Friday evening.
Unexpected failures were found in the xrootd logs, for example (a short illustration of the out-of-order write check follows these excerpts):
    CephIOAdapterRaw::read: Error in read: -16
    LoadCache Error: -16

    Non expected offset: -1 8388608 41943040
    Error trying to write out of order: expeted at: 41943040 got offset8388608 of len 8388608
    XrdCephOssBufferedFile::Write: Write error fd: 437 rc:-22 off:8388608 len:8388608
    230512 19:50:36 4111476 ofs_write: patls002.4684:11466@lcg2290 Unable to write atlas:datadisk/rucio/mc23_13p6TeV/17/17/EVNT.33427665._009684.pool.root.1; invalid argument

    Error trying to write out of order: expeted at: 16777216 got offset41943040 of len 8388608
    XrdCephOssBufferedFile::Write: Write error fd: 3494 rc:-22 off:41943040 len:8388608
    XrdCephOssBufferedFile::Close: flush Error fd: 3494 rc:-16
'file already open for write' type errors
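The "write out of order" errors come from the buffered-write path, which expects each write to start where the previous one ended; a write at any other offset is rejected with EINVAL (the rc:-22 above) and the later flush on close then fails as well. A minimal toy illustration of that kind of check (names invented for illustration, not taken from XrdCephOssBufferedFile):

    import errno

    class BufferedFileWriter:
        """Toy model of a strictly sequential buffered writer."""

        def __init__(self):
            self.expected_offset = 0  # next offset we are prepared to accept

        def write(self, offset: int, data: bytes) -> int:
            # Reject writes that do not start where the previous one ended,
            # mirroring the "Error trying to write out of order" messages above.
            if offset != self.expected_offset:
                print(f"write out of order: expected {self.expected_offset}, "
                      f"got {offset} (len {len(data)})")
                return -errno.EINVAL  # corresponds to rc:-22 in the logs
            self.expected_offset += len(data)
            return len(data)

    w = BufferedFileWriter()
    w.write(0, b"x" * 8388608)          # accepted: starts at the expected offset
    w.write(41943040, b"x" * 8388608)   # rejected: skips ahead, returns -22

In the first excerpt above, offset 8388608 arrived when 41943040 was expected, so the write was refused and, in the second excerpt, the flush on close failed too.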
Restarting the gateways did not fix the issue. Memory spikes correlate with increased connection counts in the xrootd report monitoring shown below.
...
Most storage nodes coped well, except for the dell-2019 generation with a specific SSD/HDD type, which produced enough slow operations to trigger general timeouts and errors on the gateways, as well as hanging socket connections, which resulted in the failures.
...
    =====> ofs.xattrlib /usr/lib64/libXrdCephXattr.so
    Plugin version XrdOss v5.3.3 is incompatible with XrdCephOss v5.5.4 (must be <= 5.3.x) in osslib /usr/lib64/libXrdCeph-5.so
    ++++++ Checkpoint initialization started.
This was found to be due to a missed line change in the Dockerfile.
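For context, the message above is xrootd's plugin version gate: the XrdCephOss osslib and the xrootd server were built against different XrdOss interface versions (5.5.x vs 5.3.x), and the loader refuses the plugin unless the versions agree at the major.minor level ("must be <= 5.3.x"). A toy illustration of that style of check (hypothetical helper, not the actual xrootd loader code):

    def versions_compatible(server_version: str, plugin_version: str) -> bool:
        """Toy major.minor check: the plugin must not be newer than the
        interface the server was built against (patch level is ignored)."""
        srv_major, srv_minor, _ = (int(p) for p in server_version.split("."))
        plg_major, plg_minor, _ = (int(p) for p in plugin_version.split("."))
        return plg_major == srv_major and plg_minor <= srv_minor

    # The combination reported in the log above: 5.3.3 vs 5.5.4.
    print(versions_compatible("5.3.3", "5.5.4"))  # False -> plugin load refused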
The hard limit for read IOPS before the crash, as seen in the Ceph monitoring, appears to be 150k, with a desirable rate of <100k. The current rate (without readV) is 30k.
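A minimal sketch of how these thresholds could be encoded in a monitoring check (the numbers come from the observation above; the function and constants are hypothetical, not an existing alert definition):

    # Thresholds taken from the observations above (cluster-wide read IOPS).
    HARD_LIMIT = 150_000    # level at which the crash occurred
    DESIRED_MAX = 100_000   # level we would like to stay below
    CURRENT_RATE = 30_000   # roughly the current rate without readV

    def classify_read_iops(read_iops: float) -> str:
        """Map a measured read-IOPS value onto a rough alert level."""
        if read_iops >= HARD_LIMIT:
            return "CRITICAL: at or above the level seen at the crash"
        if read_iops >= DESIRED_MAX:
            return "WARNING: above the desired <100k ceiling"
        return "OK"

    print(classify_read_iops(CURRENT_RATE))  # OK
    print(classify_read_iops(120_000))       # WARNING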