<various theories and attempted mitigations during the week> - see the 2024-06-12/13 Manager Upgrade failure entry
20/06/24 Thursday
08:26 - Alex R - failures seen this morning and the previous afternoon
08:28 - Jyothish - svc18 identified as being stuck in authentication
09:00 - Alex R - problems still happening
09:07 - Jyothish - svc14 also found stuck after the previous check; load patterns identified that look like crashes:
...
09:11 - Jyothish - found that the crash trigger seems to be XRootD trying to access a specific memory location
09:13 - Jyothish - found that the coredump location changed on Rocky 8, and only the 10 most recent dumps are kept (hence why this was not detected sooner)
09:41 - Jyothish - asked Alex R for help with debugging; pointed to ceph-svc14 for checking
09:50 - Jyothish - spike in write IOPS detected during the problematic periods:
...
09:55 - Brij - checking on the ATLAS side whether a 'vector write' type operation was triggered
10:26 - Alex R - asks if there is any way to reproduce the crash; Jyothish replies that none has been found yet
12:09 - Jyothish - after seeing a few checksum comparison errors on the ATLAS monitoring (~8 within 6 hours), wrote a test that writes a test file into prod and compares its checksum. This test reported a checksum mismatch ~20-30% of the time; flagged in data services and caused mild panic.
12:10 - 12:50 - Jyothish, Alex R, James W - the checksum mismatch turned out to be a red herring: the high failure rate reported by the test was due to wrongly cached metadata checksums during the overwrite. adler32 is still unreliable (e.g. the 8 reported mismatches in 6 hours were due to adler32 inaccuracy), but the files themselves were safe
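A write-then-verify probe like the one described can be sketched as below. This is a minimal reconstruction, not the actual test: the real one wrote into prod, while this sketch uses a temporary directory, and all names here are hypothetical.

```python
import os
import tempfile
import zlib


def adler32_of_file(path, chunk_size=1 << 20):
    """Stream the file from disk and compute its Adler-32 checksum."""
    value = 1  # Adler-32 seed value
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            value = zlib.adler32(chunk, value)
    return value & 0xFFFFFFFF


def write_and_verify(path, payload):
    """Write payload, re-read it from disk, and compare checksums.

    A mismatch would indicate either real corruption or, as in this
    incident, a stale cached metadata checksum after an overwrite.
    """
    with open(path, "wb") as f:
        f.write(payload)
    expected = zlib.adler32(payload) & 0xFFFFFFFF
    return adler32_of_file(path) == expected


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        probe = os.path.join(d, "probe.dat")
        print(write_and_verify(probe, os.urandom(4 << 20)))
```

Note that comparing against a checksum recomputed from the payload (rather than one cached in metadata) is what distinguishes real corruption from the stale-cache effect seen here; Adler-32 itself is a weak checksum and is known to collide more readily than CRC-32 on short inputs.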
12:46 - Alastair - offers help with service status decisions; joins the XRootD meeting
Mitigation for stuck auth deemed sufficient for overnight
13:48 - Alex R - identified the IPv6 address contacted at the time of the crash; Tom Birkett confirms this is not a RAL workernode
15:00 - Brian D + Jyothish - check for gateways stuck in authentication enabled, with automatic restart
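The stuck-auth check with automatic restart presumably follows a pattern like the sketch below. The probe, the consecutive-failure threshold, and the restart command are all assumptions for illustration, not the actual script.

```python
import subprocess
from collections import deque

# Assumed value: restart only after this many consecutive failed probes,
# so a single slow auth handshake does not trigger a restart.
FAILURE_THRESHOLD = 3


def should_restart(probe_history, threshold=FAILURE_THRESHOLD):
    """Return True when the last `threshold` probes all failed."""
    recent = list(probe_history)[-threshold:]
    return len(recent) == threshold and not any(recent)


def check_and_restart(probe, service="xrootd@gateway", history=None):
    """Run one probe; restart the service if auth looks stuck.

    `probe` is a callable returning True on a successful auth handshake;
    the service name and restart command are hypothetical.
    """
    if history is None:
        history = deque(maxlen=10)
    history.append(probe())
    if should_restart(history):
        subprocess.run(["systemctl", "restart", service], check=False)
        history.clear()  # start counting afresh after a restart
    return history
```

Requiring several consecutive failures before restarting is a common way to avoid flapping; the trade-off is slower reaction to a genuinely stuck gateway.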
...
15:40 - Jyothish - attempted to build an RPM with a possible fix for the underlying auth problem; failed due to ongoing VPN issues. Postponed to the following day
17:56 - Alex - identified the IPs doing the transfers at crash times to be XCaches
Overnight, frequent restarts occurred, partly due to crashes and partly due to the restart script. A higher failure rate was seen in production
21/06/24 Friday
07:00 - Andy H - replies to the email; the issue was not seen elsewhere
...
16:38 - Jyothish, Alex R - monitored the situation; no failures seen since the rollout
Takeaways
Good reporting and communication from everyone involved. The emergency meeting proved useful in informing the Tier1 production team and management about the nature of the issue and the attempted fix. Previous experience was useful in devising the patch, and the RPM creation and deployment pipeline was quick enough to deploy the change in time for sufficient observation before the end of the day.
Switch testing (Switch Testing) proved vital for rapid testing of attempted fixes throughout the week, and taking a host temporarily out of prod for the attempted fixes made verifying and deploying each fix much quicker.