

<various theories and attempted mitigations during the week>

Thursday

3:00 PM - Brian D + Jyothish - enabled a check for gateways stuck in authentication, with automatic restart of affected gateways

3:10 PM - Jyothish - sends an email to the xrootd mailing list asking whether similar issues have been seen elsewhere

3:11 PM - Jyothish - informs the production team that the check has been enabled; no on-call procedure change required

3:40 PM - Jyothish - attempted to build an RPM with a possible fix for the underlying auth problem; this failed due to ongoing VPN issues and was postponed to the following day

Overnight, frequent restarts occur, partly due to crashes and partly due to the restart script. A higher failure rate is seen in production.

Friday

7:00 AM - Andy H - replies to the email; the issue has not been seen elsewhere.

8:44 AM - Jyothish - stopped the automatic restarts for the duration of working hours, in case the overnight gateway restarts were caused by the check being too sensitive

The failure rate increases over the day, showing that the failures were mainly caused by the crashes rather than the restart script.

10:00 AM - Jyothish - compiled and deployed an RPM with the possible auth fix patch

10:22 AM - Alex R - manages to reproduce the issue

10:22 AM - 2:10 PM - Alex R, Jyothish - ongoing debugging until the issue was found

11:10 AM - Jyothish - takes gw14-16 out of production to reboot them and check whether the pending kernel updates help

12:22 PM - Alastair - asks for the service status for the weekend. Jyothish replies that progress is being made with debugging, but the current state might persist over the weekend.

12:23 PM - Thomas Birkett - asks that grid services be notified if the service is going to be problematic over the weekend, so that appropriate downtimes can be placed. Jyothish replies that a status update will be given by 2:30 PM if there is no progress.

1:47 PM - Jyothish, Alex R - the issue is found. Jyothish sends an update with the findings to Andy on the xrootd mailing list. A solution is discussed.

2:12 PM - Jyothish - notifies the team that the issue has been found, but that he would like a consultation before any changes are made. Asks James W, Tom Byrne and Tom Birkett to attend if possible and posts a Zoom link for all interested parties.

2:13 - 2:50 PM - Ongoing consultation. Jyothish explains the problem. The issue lies in the paged read section of the code (introduced in xrootd 5.5.3), and a proper fix carries a nontrivial risk of silent file corruption without extensive testing; that fix remains the long-term solution. However, a simpler workaround, previously used by Jyothish for other pagedIO errors when the feature was first released, may be applicable: a code change that tricks xrootd into thinking it is talking to a server with a lower xrootd version, so that pagedIO is avoided (sketched below). This workaround was used in production more than a year ago and is reasonably safe to deploy. Discussion follows with Brij, James W, Brian D, Alex R and Alastair on the current state of production and whether it can be left as is over the weekend. The consensus is that the failure rate is high but can be tolerated. Jyothish proposes a quick test, using Alex's reproducer, of whether the second (safer) solution stops the issue from occurring, and to roll it out ASAP if it does. All present agree.
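
For context, the workaround acts at protocol negotiation rather than fixing the paged-read path itself. The sketch below is a minimal illustration of the idea only, not the actual patch: the names (AdvertisedProtocolVersion, kPgReadMinVersion) and the version constant are hypothetical stand-ins, since the real change lives inside the gateway's xrootd negotiation code.

```cpp
// Illustrative sketch only -- not the actual xrootd patch.
#include <algorithm>
#include <cstdint>

namespace {
// Hypothetical cutoff: peers advertising a protocol version below this
// value never negotiate paged reads. The real threshold lives in the
// xrootd protocol headers.
constexpr std::uint32_t kPgReadMinVersion = 0x00000500;  // assumed value
constexpr std::uint32_t kReportedVersion  = kPgReadMinVersion - 1;
}  // namespace

// The gateway normally advertises its real protocol version during the
// handshake. Capping the advertised value below the paged-read threshold
// makes the peer fall back to ordinary reads, sidestepping the buggy
// paged-read path while a proper fix is tested.
std::uint32_t AdvertisedProtocolVersion(std::uint32_t realVersion) {
    return std::min(realVersion, kReportedVersion);
}
```

The appeal of this as a stop-gap is that it only changes what the peer negotiates, leaving the rest of the data path untouched.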

3:16 PM - Jyothish - new RPMs created and deployed on gw14 (which had been out of production since the morning reboot and the subsequent debugging). The test succeeds and the change is rolled out to production.

3:38 PM - Jyothish - no failures seen since the rollout. Updated the ATLAS GGUS ticket with the situation.

4:38 PM - Jyothish, Alex R - monitored the situation; no failures seen since the rollout.

Takeaways

Good reporting and communication from everyone involved. The emergency meeting proved useful in informing the Tier1 production team and management about the nature of the issue and the attempted fix. Previous experience was useful in devising the patch, and the RPM creation and deployment pipeline was quick enough to deploy the change in time for sufficient observation before the end of the day.

Switch testing (see Switch Testing) proved vital for the rapid testing of attempted fixes throughout the week, and taking a host temporarily out of production for the attempted fixes made verifying and deploying each fix much quicker.
