Currently the farm is suffering from xrootd proxy issue, yet to be identified? Symptoms are high CLOSE_WAIT connection value and xrootd unresponsiveness. Previously it’s been identified that more than ~400 CLOSE_WAIT connections causes xrootd to act erratically.
As of 16/11/23:
lcg2631
lcg2635
lcg2638
lcg2617
The above nodes how this xrootd proxy issue.
last 24hrs atlas:
VO’s running on affected nodes:
LHCB (tlhcb006)
ATLAS (patls002)
NA62 (tna62a001)
Biomed (bio045)
enmr022
file read timeouts on proxy logs
Nov 16 04:20:34 lcg2631.gridpp.rl.ac.uk docker[1542921]: 231116 04:20:34 92982 XrootdAioTask: async read failed for tlhcb006.2794:170@htcjob4969334_0_slot1_246_pid3678980.ralworker; aio file read timed out /lhcb:buffer/lhcb/MC/2016/SIM/00204827/0008/00204827_00084722_1.sim
Nov 16 04:21:08 lcg2631.gridpp.rl.ac.uk docker[1542921]: 231116 04:21:08 92969 XrootdAioTask: async read failed for tlhcb006.301:77@htcjob5638675_0_slot1_14_pid2226160.ralworker; aio file read timed out /lhcb:buffer/lhcb/MC/2018/SIM/00204836/0009/00204836_00096682_1.sim
futex_wait
Nov 16 06:21:38 lcg2631.gridpp.rl.ac.uk docker[1542921]: 231116 06:21:38 96524 oss_Open_ufs: Unable to reloc FD /xcache/lhcb:buffer/lhcb/MC/2018/SIM/00204818/0008/00204818_00083792_1.sim.cinfo; invalid argument
logs cycle between authentications - no read logs happening
files in cache can be downloaded
Proxy cache discovery
Alex R noticed that cached files from the xrootd-proxy are retrieved successfully, indicating the issue exists with the proxy recalling data from the gateway.
looks like the cache FD cannot be freed?
Nov 15 21:37:07 lcg2631.gridpp.rl.ac.uk docker[1542921]: CEPHSUM-2023-11-15 21:37:07,527-67018-DEBUG-Converted /lhcb:user/lhcb/user/k/kmattiol/GangaJob_350/InputFiles/diracInputFiles_350_dba170b4-0658-444a-bbeb-adb973370624.tgz to lhcb, user/lhcb/user/k/kmattiol/GangaJob_350/InputFiles/diracInputFiles_350_dba170b4-0658-444a-bbeb-adb973370624.tgz
Nov 15 21:37:07 lcg2631.gridpp.rl.ac.uk docker[1542921]: CEPHSUM-2023-11-15 21:37:07,706-67018-DEBUG-Striper: Object size:67108864, Total size:5227299, Num Stripes:1, Last Stripe size:5227299
Nov 15 21:37:07 lcg2631.gridpp.rl.ac.uk docker[1542921]: CEPHSUM-2023-11-15 21:37:07,706-67018-DEBUG-adler32, 1700083974, 1, 0, b'\x00', b'\x04', 4, b'\xf9\xe4\xe9d'
Nov 15 21:37:07 lcg2631.gridpp.rl.ac.uk docker[1542921]: CEPHSUM-2023-11-15 21:37:07,706-67018-DEBUG-Time info: 1700083974, 2023-11-15 21:32:54, 1, 0:00:01, 2023-11-15 21:32:55
Nov 15 21:37:07 lcg2631.gridpp.rl.ac.uk docker[1542921]: CEPHSUM-2023-11-15 21:37:07,706-67018-INFO-XrdCks.adler32: f9e4e964 ; 2023-11-15 21:32:54; 0:00:01; 4
Nov 15 21:37:07 lcg2631.gridpp.rl.ac.uk docker[1542921]: CEPHSUM-2023-11-15 21:37:07,706-67018-INFO-Path:user/lhcb/user/k/kmattiol/GangaJob_350/InputFiles/diracInputFiles_350_dba170b4-0658-444a-bbeb-adb973370624.tgz; From:metadata; Checksum:f9e4e964
Nov 15 21:37:07 lcg2631.gridpp.rl.ac.uk docker[1542921]: CEPHSUM-2023-11-15 21:37:07,728-67018-INFO-Result:Done, pool:lhcb, path:/lhcb:user/lhcb/user/k/kmattiol/GangaJob_350/InputFiles/diracInputFiles_350_dba170b4-0658-444a-bbeb-adb973370624.tgz, checksum:f9e4e964, time_s:0.20102, filesize_bytes:5227299, source:metadata, exit_code:0, srccks:N/A
-----READ FINISHED---------
Nov 16 09:59:34 lcg2631.gridpp.rl.ac.uk docker[1542921]: 231116 09:59:34 102659 acc_Audit: xrootd.102663:6647@localhost grant gsi 0495af69.0@localhost read /lhcb:user/lhcb/user/k/kmattiol/GangaJob_350/InputFiles/diracInputFiles_350_dba170b4-0658-444a-bbeb-adb973370624.tgz
Nov 16 09:59:34 lcg2631.gridpp.rl.ac.uk docker[1542921]: 231116 09:59:34 102659 oss_Open_ufs: Unable to reloc FD /xcache/lhcb:user/lhcb/user/k/kmattiol/GangaJob_350/InputFiles/diracInputFiles_350_dba170b4-0658-444a-bbeb-adb973370624.tgz.cinfo; invalid argument
Nov 16 09:59:34 lcg2631.gridpp.rl.ac.uk docker[1542921]: 231116 09:59:34 102659 oss_Open_ufs: Unable to reloc FD /xcache/lhcb:user/lhcb/user/k/kmattiol/GangaJob_350/InputFiles/diracInputFiles_350_dba170b4-0658-444a-bbeb-adb973370624.tgz; invalid argument
Nov 16 09:59:34 lcg2631.gridpp.rl.ac.uk docker[1542921]: 231116 09:59:34 102659 oss_Open_ufs: Unable to reloc FD /xcache/lhcb:user/lhcb/user/k/kmattiol/GangaJob_350/InputFiles/diracInputFiles_350_dba170b4-0658-444a-bbeb-adb973370624.tgz.cinfo; invalid argument
---after manually clearing file from cache and attempting to read----
Nov 16 10:12:13 lcg2631.gridpp.rl.ac.uk docker[1542921]: 231116 10:12:13 103058 XrootdAioTask: aioR overdue 8 inflight requests for xrootd.103062:6877@localhost /lhcb:user/lhcb/user/k/kmattiol/GangaJob_350/InputFiles/diracInputFiles_350_dba170b4-0658-444a-bbeb-adb973370624.tgz
Tcpdump shows that gateway for new connections closes the channes on its side for some reason immediately after connection establishment:
10:31:09.812402 IP (tos 0x0, ttl 64, id 55828, offset 0, flags [DF], proto TCP (6), length 60) 172.28.1.1.36580 > 172.28.1.2.nicelink: Flags [S], cksum 0x5a6a (incorrect -> 0x539f), seq 1788468839, win 64240, options [mss 1460,sackOK,TS val 2782928166 ecr 0,nop,wscale 8], length 0 10:31:09.812498 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 60) 172.28.1.2.nicelink > 172.28.1.1.36580: Flags [S.], cksum 0x5a6a (incorrect -> 0x3470), seq 1456405645, ack 1788468840, win 65160, options [mss 1460,sackOK,TS val 4249275106 ecr 2782928166,nop,wscale 8], length 0 10:31:09.812515 IP (tos 0x0, ttl 64, id 55829, offset 0, flags [DF], proto TCP (6), length 52) 172.28.1.1.36580 > 172.28.1.2.nicelink: Flags [.], cksum 0x5a62 (incorrect -> 0x60cb), seq 1, ack 1, win 251, options [nop,nop,TS val 2782928166 ecr 4249275106], length 0 10:31:39.842286 IP (tos 0x0, ttl 64, id 51510, offset 0, flags [DF], proto TCP (6), length 52) 172.28.1.2.nicelink > 172.28.1.1.36580: Flags [F.], cksum 0x5a62 (incorrect -> 0xeb77), seq 1, ack 1, win 255, options [nop,nop,TS val 4249305136 ecr 2782928166], length 0 10:31:39.843151 IP (tos 0x0, ttl 64, id 55830, offset 0, flags [DF], proto TCP (6), length 52) 172.28.1.1.36580 > 172.28.1.2.nicelink: Flags [.], cksum 0x5a62 (incorrect -> 0x762c), seq 1, ack 2, win 251, options [nop,nop,TS val 2782958197 ecr 4249305136], length 0