Currently the farm is suffering from an xrootd proxy issue whose root cause is yet to be identified. Symptoms are a high number of connections in CLOSE_WAIT and xrootd unresponsiveness. It has previously been observed that more than ~400 CLOSE_WAIT connections cause xrootd to behave erratically.
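A minimal sketch of a check for this symptom, assuming the default xrootd port 1094 and the ~400 threshold noted above (both are assumptions, adjust for the local setup): it counts CLOSE_WAIT sockets on a proxy node by reading /proc/net/tcp directly.

#!/usr/bin/env python3
# Illustrative check only: count TCP sockets in CLOSE_WAIT on the proxy host
# and warn when the approximate ~400 threshold is exceeded.
THRESHOLD = 400          # level at which xrootd has been seen to misbehave
XROOTD_PORT = 1094       # default xrootd port (assumption)
CLOSE_WAIT = "08"        # TCP state code for CLOSE_WAIT in /proc/net/tcp*

def count_close_wait(port=XROOTD_PORT):
    total = 0
    for path in ("/proc/net/tcp", "/proc/net/tcp6"):
        try:
            with open(path) as fh:
                next(fh)  # skip the header line
                for line in fh:
                    fields = line.split()
                    local_addr, state = fields[1], fields[3]
                    local_port = int(local_addr.rsplit(":", 1)[1], 16)
                    if state == CLOSE_WAIT and local_port == port:
                        total += 1
        except FileNotFoundError:
            pass  # e.g. no IPv6 table
    return total

if __name__ == "__main__":
    n = count_close_wait()
    status = "WARNING" if n > THRESHOLD else "OK"
    print(f"{status}: {n} CLOSE_WAIT connections on port {XROOTD_PORT}")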
As of 16/11/23:
lcg2631
lcg2635
lcg2638
lcg2617
The above nodes show this xrootd proxy issue.
[plot: ATLAS activity over the last 24 hrs]
VOs running on affected nodes:
LHCB (tlhcb006)
ATLAS (patls002)
NA62 (tna62a001)
Biomed (bio045)
enmr022
File read timeouts in the proxy logs:
Nov 16 04:20:34 lcg2631.gridpp.rl.ac.uk docker[1542921]: 231116 04:20:34 92982 XrootdAioTask: async read failed for tlhcb006.2794:170@htcjob4969334_0_slot1_246_pid3678980.ralworker; aio file read timed out /lhcb:buffer/lhcb/MC/2016/SIM/00204827/0008/00204827_00084722_1.sim
Nov 16 04:21:08 lcg2631.gridpp.rl.ac.uk docker[1542921]: 231116 04:21:08 92969 XrootdAioTask: async read failed for tlhcb006.301:77@htcjob5638675_0_slot1_14_pid2226160.ralworker; aio file read timed out /lhcb:buffer/lhcb/MC/2018/SIM/00204836/0009/00204836_00096682_1.sim
futex_wait
Nov 16 06:21:38 lcg2631.gridpp.rl.ac.uk docker[1542921]: 231116 06:21:38 96524 oss_Open_ufs: Unable to reloc FD /xcache/lhcb:buffer/lhcb/MC/2018/SIM/00204818/0008/00204818_00083792_1.sim.cinfo; invalid argument
The logs cycle between authentication messages; no read activity is being logged.
Files already in the cache can be downloaded.
Proxy cache discovery
Alex R noticed that files already cached on the xrootd proxy are retrieved successfully, indicating the issue lies with the proxy recalling data from the gateway.
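A minimal sketch of how this observation could be reproduced, assuming xrdcp is available and pointing at the affected proxy lcg2631; the "cached" path is a hypothetical placeholder, while the "uncached" path is one of the files from the timeout logs above.

#!/usr/bin/env python3
# Copy one file known to be in the proxy cache and one that is not, both via
# the proxy, and compare the outcomes: a cached file should succeed, while a
# file that needs recall from the gateway is expected to hang/time out.
import subprocess

PROXY = "root://lcg2631.gridpp.rl.ac.uk:1094"  # affected proxy node
CACHED = "/lhcb:buffer/path/to/already-cached-file.sim"  # placeholder: pick a file present under /xcache
UNCACHED = "/lhcb:buffer/lhcb/MC/2016/SIM/00204827/0008/00204827_00084722_1.sim"  # file from the timeout log above

def try_copy(path, timeout=120):
    """Attempt an xrdcp through the proxy; return True if it completes in time."""
    cmd = ["xrdcp", "-f", f"{PROXY}/{path}", "/tmp/xcache-test.out"]
    try:
        result = subprocess.run(cmd, timeout=timeout, capture_output=True)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False

if __name__ == "__main__":
    print("cached file copy ok:  ", try_copy(CACHED))    # expected to succeed
    print("uncached file copy ok:", try_copy(UNCACHED))  # expected to time out if recall is stuck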