Week of 2024-06-21 - constant gateway crashing


<various theories and attempted mitigations during the week> - see 2024-06-12/13 Manager Upgrade failure

20/06/24 Thursday

08:26 - Alex R - failures seen this morning and previous afternoon

08:28 - Jyothish - svc18 identified as being in stuck auth

09:00 - Alex R - problems still happening

09:07 - Jyothish - svc14 also became stuck after the previous check; load patterns identified that resemble crashes:

[image attachment: image-20240701-120631.png - gateway load patterns resembling crashes]

09:13 - Jyothish - found that the coredump location changed on Rocky 8 and that only the 10 most recent cores are kept (hence why this was not detected sooner)

09:11 - Jyothish - found that the crash trigger appears to be XRootD trying to access a specific memory location
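
For context on how these cores were found and inspected: assuming the host uses systemd-coredump (the usual Rocky 8 default), cores are stored and rotated by coredumpctl rather than left in the working directory. The sketch below is illustrative only, not the commands actually used; the binary path, output location and retention behaviour are assumptions.

```python
#!/usr/bin/env python3
"""Sketch: list recent XRootD cores kept by systemd-coredump on a Rocky 8
host and print the crash backtrace of the newest one. The binary path and
output location are assumptions."""
import subprocess

XROOTD_BIN = "/usr/bin/xrootd"        # assumed path of the crashing binary
CORE_OUT = "/tmp/xrootd-latest.core"  # where to extract the newest core

# systemd-coredump keeps only a limited number of recent cores; list them.
print(subprocess.run(["coredumpctl", "list", "--no-pager", XROOTD_BIN],
                     capture_output=True, text=True, check=True).stdout)

# Extract the newest matching core and ask gdb for the backtrace and
# registers, which show the memory address the process tried to access.
subprocess.run(["coredumpctl", "dump", XROOTD_BIN, "-o", CORE_OUT], check=True)
subprocess.run(["gdb", "-batch", "-ex", "bt", "-ex", "info registers",
                XROOTD_BIN, CORE_OUT], check=True)
```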

09:41 - Jyothish - asked Alex R for help in debugging, pointed to ceph-svc14 for checking

09:50 - Jyothish - spike in write IOPS detected during problematic periods:

[image attachment: image-20240701-121401.png - write IOPS spike during problematic periods]

09:55 - Brij - checking on the ATLAS side whether a 'vector write' type operation was triggered

10:26 - Alex R - asks if there’s any way to reproduce the crash; Jyothish replies that none has been found yet

12:09 - Jyothish - after seeing a few checksum comparison errors on the ATLAS monitoring (~8 within 6 hours), writes a test that writes a file into prod and compares its checksum. The test reported a checksum mismatch ~20-30% of the time; this was flagged in data services and caused mild panic.

12:10 - 12:50 - Jyothish, Alex R, James W - the checksum mismatch turned out to be a red herring. The high failure rate reported by the test was due to wrongly cached metadata checksums during the overwrite. adler32 is still unreliable (e.g. the 8 reported mismatches in 6 hours were due to adler32 inaccuracy), but the files themselves were safe.
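
For reference, a round-trip test of this kind can be reconstructed as below (this is a sketch, not the actual script used): copy a file to a gateway with xrdcp, ask the server for its adler32 via xrdfs query checksum, and compare against a locally computed value. The gateway URL and remote path are placeholders; repeated overwrites of the same path are what exercised the stale cached-metadata checksum described above.

```python
#!/usr/bin/env python3
"""Reconstruction of a checksum round-trip test (not the actual script used).
Copies a local file to a gateway, then compares the server-reported adler32
with one computed locally. HOST and DEST are placeholders."""
import os
import subprocess
import zlib

HOST = "root://gateway.example.ac.uk"      # placeholder gateway endpoint
DEST = "/store/test/checksum-probe.dat"    # placeholder remote path
LOCAL = "/tmp/checksum-probe.dat"

# Create a small random test file and compute its adler32 locally.
with open(LOCAL, "wb") as f:
    f.write(os.urandom(4 * 1024 * 1024))
with open(LOCAL, "rb") as f:
    local_sum = format(zlib.adler32(f.read()) & 0xFFFFFFFF, "08x")

# Overwrite the remote copy (-f), since overwrites are what tripped the
# cached metadata checksum.
subprocess.run(["xrdcp", "-f", LOCAL, f"{HOST}/{DEST}"], check=True)

# Ask the server for its checksum; the reply is roughly "adler32 <hex>",
# so take the last token as the digest.
out = subprocess.run(["xrdfs", HOST, "query", "checksum", DEST],
                     capture_output=True, text=True, check=True).stdout
remote_sum = out.split()[-1].lower()

print("local :", local_sum)
print("remote:", remote_sum)
print("MATCH" if remote_sum == local_sum else "MISMATCH")
```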

12:46 - Alastair - offers help on making service status decisions, joins XrootD meeting

2024-06-20 Meeting Notes

mitigation for stuck auth deemed sufficient for overnight

13:48 - Alex R - identified the IPv6 address contacted at the time of the crash. Tom Birkett confirms it is not a RAL worker node

15:00 - Brian D + Jyothish - enabled a check for gateways stuck in authentication, with automatic restart
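
The detection logic itself isn't recorded in this log; the sketch below only illustrates the general idea (probe each gateway with a short-timeout client call and restart the xrootd unit if the probe hangs), with hostnames, port, timeout and unit name all assumed. A probe timeout set too low would also restart healthy-but-slow gateways, which is the "check sensitivity" concern noted the following morning.

```python
#!/usr/bin/env python3
"""Illustrative stuck-auth watchdog (the real check may differ). Probes each
gateway with a short-timeout xrdfs call; a gateway stuck in authentication
never answers, so a timed-out probe triggers a service restart. Hostnames,
port, timeout and unit name are assumptions."""
import subprocess

GATEWAYS = ["ceph-svc14", "ceph-svc18"]   # illustrative hostnames
PROBE_TIMEOUT = 30                        # seconds before declaring "stuck"
UNIT = "xrootd@clustered"                 # assumed systemd unit name


def gateway_responds(host: str) -> bool:
    """Lightweight query; treat a hang or error as a failed probe."""
    try:
        subprocess.run(
            ["xrdfs", f"root://{host}:1094", "query", "config", "version"],
            timeout=PROBE_TIMEOUT, check=True,
            stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        return True
    except (subprocess.TimeoutExpired, subprocess.CalledProcessError):
        return False


for gw in GATEWAYS:
    if not gateway_responds(gw):
        print(f"{gw}: probe failed, restarting {UNIT}")
        # In practice this would run on the gateway itself or via ssh.
        subprocess.run(["ssh", gw, "systemctl", "restart", UNIT], check=False)
```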

15:10 - Jyothish - sends an email to the xrootd mailing list asking if similar issues have been seen elsewhere

15:11 - Jyothish - informs the production team that the check has been enabled; no on-call procedure change required

15:40 - Jyothish - attempted to create an RPM with a possible fix for the underlying auth problem; failed due to ongoing VPN issues. Postponed to the next day

17:56 - Alex - identified the IPs doing the transfers at crash times as XCaches

Overnight, frequent restarts happen: partly due to crashes, partly due to the script. A higher failure rate is seen in production

21/06/24 Friday

07:00 - Andy H - replies to the email; the issue has not been seen elsewhere.

08:44 - Jyothish - stopped the restarts for the duration of working hours, in case the previous gateway restarts were caused by high check sensitivity

The failure rate increases over the day, indicating it was mainly caused by the crashes, not the script

10:00 - Jyothish - compiled and deployed an RPM with the possible auth fix patch

10:22 - Alex R - manages to reproduce the issue

10:22 - 14:10 - Alex R, Jyothish - ongoing debugging until the issue was found

11:10 - Jyothish - takes gw14-16 out of prod to reboot them and see if the pending kernel updates help

12:22 - Alastair - asks for the service status for the weekend. Jyothish replies that progress is being made with debugging, but the current state might persist over the weekend

12:23 - Thomas Birkett - asks to notify grid services if the service is going to be problematic over the weekend, so appropriate downtimes can be placed. Jyothish replies that a status update will be given by 14:30 if there is no progress

13:47 - Jyothish, Alex R - the issue is found. Jyothish sends an update to Andy on the xrootd mailing list with the findings. A solution is being discussed

14:12 - Jyothish - notifies that the issue has been found but asks for a consultation before any changes are made. Asks for James W, Tom Byrne and Tom Birkett to attend if possible and posts a Zoom link for all interested parties.

14:13 - 14:50 - Ongoing consultation. Jyothish explains the problem. The issue lies in the paged read section of the code (introduced in xrootd 5.5.3), and a proper fix carries a nontrivial risk of silent file corruption without extensive testing. That is the long-term solution needed; however, a simpler workaround, previously used by Jyothish for other pagedIO errors during the initial feature release, might be usable instead. It involves a code change that tricks xrootd into thinking it is talking to a server with a lower xrootd version, thereby avoiding pagedIO. This workaround was used in prod more than a year ago and is reasonably safe to deploy. Discussion ensues with Brij, James W, Brian D, Alex R and Alastair on the current state of production and whether it can be left as is over the weekend. Consensus is that the failure rate is high but can be tolerated. Jyothish proposes quickly testing whether the second (safer) solution stops the issue from occurring, using Alex's test, and rolling it out ASAP if it does. All present agree.

15:16 - Jyothish - new RPMs created and deployed on gw14 (which was still out of production following the morning reboot and subsequent debugging). The test succeeds and the change is rolled out to production

15:38 - Jyothish - no failures seen since the rollout. Updated the ATLAS GGUS ticket on the situation

16:38 - Jyothish, Alex R - monitored the situation; no failures seen since the rollout

Takeaways

Good reporting and communication from everyone involved. The emergency meeting proved useful in informing the Tier1 production team and management about the nature of the issue and the attempted fix. Previous experience was useful in devising the patch, and the RPM creation and deployment pipeline was quick enough to deploy the change in time for sufficient observation before the end of the day.

Switch testing (Switch Testing) proved vital for rapid testing of attempted fixes throughout the week, and temporarily taking a host out of prod made checking and deploying each attempted fix much quicker.