
<various theories and attempted mitigations during the week> - see 2024-06-12/13 Manager Upgrade failure

20/06/24 Thursday

08:26 - Alex R - failures seen this morning and previous afternoon

08:28 - Jyothish - svc18 identified as being in stuck auth

09:00 - Alex R - problems still happening

09:07 - Jyothish - svc14 also went stuck after the previous check. Load patterns identified that look like crashes:

...

09:11 - Jyothish - found that the crash appears to be triggered by XRootD trying to access a specific memory location

09:13 - Jyothish - found that the coredump location changed for Rocky 8 and only the 10 most recent dumps are kept (which is why this was not detected sooner)
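
Where the dumps end up on Rocky 8 can be checked with something like the snippet below (a minimal sketch, assuming the systemd-coredump default directory; the actual location and retention policy from this incident are not recorded here):

# Hedged sketch: list the most recent coredumps on a Rocky 8 gateway.
# Assumption: dumps land in the systemd-coredump default directory, and the
# "10 most recent" retention is whatever the host's cleanup policy keeps.
from pathlib import Path

COREDUMP_DIR = Path("/var/lib/systemd/coredump")  # assumed default location

dumps = sorted(COREDUMP_DIR.glob("core.*"),
               key=lambda p: p.stat().st_mtime, reverse=True)
for dump in dumps[:10]:
    print(dump.stat().st_mtime, dump.name)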

09:41 - Jyothish - asked Alex R for help in debugging, pointed to ceph-svc14 for checking

09:50 - Jyothish - spike in write iops detected during problematic periods:

...

09:55 - Brij - checking on the ATLAS side whether a 'vector write' type operation was triggered

10:26 - Alex R - asks if there's any way to reproduce the crash; Jyothish replies that none has been found yet

12:09 - Jyothish - after seeing a few checksum comparison errors on the ATLAS monitoring (~8 within 6 hours), writes a test that writes a test file into prod and compares the checksum. This test reported a checksum mismatch ~20-30% of the time. Flagged in data services and caused mild panic.

12:10 - 12:50 - Jyothish, Alex R, James W - the checksum mismatch turned out to be a red herring. The high failure rate reported by the test was due to wrongly cached metadata checksums during the overwrite. adler32 is still unreliable (e.g. the 8 reported mismatches in 6 hours were due to adler32 inaccuracy), but the files themselves were safe
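
A write-and-verify probe of the kind described at 12:09 looks roughly like the following (a minimal sketch, not the actual script: the endpoint, path, and the use of xrdcp/xrdfs for the copy and the stored-checksum query are assumptions):

# Hedged sketch of the 12:09-style probe: overwrite a test file in prod and
# compare the locally computed adler32 with the checksum the storage reports.
# The endpoint, path, and tools below are placeholders/assumptions.
import subprocess
import zlib
from pathlib import Path

ENDPOINT = "root://ceph-gw.example.ac.uk"      # placeholder endpoint
REMOTE_PATH = "/prod/test/checksum-probe.dat"  # placeholder path
LOCAL_FILE = Path("checksum-probe.dat")

def local_adler32(path: Path) -> str:
    # Compute adler32 of the local file in 1 MiB chunks (adler32 starts at 1).
    value = 1
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            value = zlib.adler32(chunk, value)
    return f"{value:08x}"

def remote_adler32() -> str:
    # Ask the gateway for its stored checksum (assumed `xrdfs ... query checksum`).
    out = subprocess.run(["xrdfs", ENDPOINT, "query", "checksum", REMOTE_PATH],
                         capture_output=True, text=True, check=True)
    return out.stdout.split()[-1].lower()

subprocess.run(["xrdcp", "--force", str(LOCAL_FILE), f"{ENDPOINT}/{REMOTE_PATH}"],
               check=True)
print("match" if local_adler32(LOCAL_FILE) == remote_adler32() else "MISMATCH")

Looping a probe like this with a fresh payload each pass is how a mismatch rate like the ~20-30% above would show up when the stored checksum is served from stale metadata during overwrites, even though the file contents are intact.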

12:46 - Alastair - offers help with service status decisions, joins the XRootD meeting

2024-06-20 Meeting Notes

mitigation for stuck auth deemed sufficient for overnight

13:48 - Alex R - identified the IPv6 address contacted at the time of the crash. Tom Birkett confirms this is not a RAL worker node

3:00 PM - Brian D + Jyothish - enabled a check for gateways in stuck authentication, with automatic restart
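
The check enabled here was roughly of this shape (a minimal sketch, not the deployed script: the "stuck" criterion, probe, hostnames, service unit name, and timings are all assumptions):

# Hedged sketch of a stuck-auth watchdog with automatic restart.
# Assumptions: a gateway counts as "stuck" when an authenticated stat against
# it times out or fails; the hosts and the systemd unit name are placeholders.
import subprocess
import time

GATEWAYS = ["ceph-svc14.example.ac.uk", "ceph-svc18.example.ac.uk"]  # placeholders
PROBE_TIMEOUT_S = 30    # assumed: a healthy gateway answers well within this
CHECK_INTERVAL_S = 300  # assumed check cadence

def gateway_is_stuck(host: str) -> bool:
    # Treat a timed-out or failing stat as a sign of stuck authentication.
    try:
        subprocess.run(["xrdfs", f"root://{host}", "stat", "/"],
                       timeout=PROBE_TIMEOUT_S, check=True, capture_output=True)
        return False
    except (subprocess.TimeoutExpired, subprocess.CalledProcessError):
        return True

while True:
    for host in GATEWAYS:
        if gateway_is_stuck(host):
            # Restart the gateway service over ssh (assumed unit name).
            subprocess.run(["ssh", host, "systemctl", "restart", "xrootd@gateway"],
                           check=False)
    time.sleep(CHECK_INTERVAL_S)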

...

3:40 PM - Jyothish - attempted to build an RPM with a possible fix for the underlying auth problem; failed due to ongoing VPN issues. Postponed to the next day

5:56 PM - Alex - identified the IPs doing the transfers at crash times as XCaches

Overnight, frequent restarts happened, partly due to crashes and partly due to the restart script. A higher failure rate was seen in production

21/06/24 Friday

7:00 AM - Andy H - replies to the email; the issue was not seen elsewhere.

...

2:12 PM - Jyothish - notifies that the issue has been found, but would like a consultation before any changes are made. Asks for James W and both Toms (Tom Byrne and Tom Birkett) to attend if possible, and posts a Zoom link for all interested parties.

...

4:38 PM - Jyothish, Alex R - monitored the situation; no failures seen since this rollout

Takeaways

Good reporting and communication from everyone involved. The emergency meeting proved useful in informing the Tier1 production team and management about the nature of the issue and the attempted fix. Previous experience was useful in devising the patch, and the RPM creation and deployment pipeline was quick enough to deploy the change in time for sufficient observation before the end of day.

Switch testing (see Switch Testing) proved vital for rapid testing of attempted fixes throughout the week, and taking a host temporarily out of prod for the attempted fixes made checking and deploying each fix a lot quicker.