2024-06-12/13 Manager Upgrade failure

12/06

10:00 - Jyothish - manager01 upgrade started
12:30 - Jyothish - manager01 finishes upgrade; memory allocation errors show up in xrootd, mainly around the host lookup section

12:49 - Jyothish - services stopped on manager01 (all traffic redirected to manager02). manager02 is showing the same errors. This is now thought to be a rocky8 issue with xrootd

<error information and debugging info is collected>

13:50 - Jyothish - decision taken to revert manager01 to sl7, reversion started

14:10 - Maksim - reports that he has been having issues with host upgrades due to a network change introduced by cloud

14:50 - Alexander Dibbo - posts in #data services confirming the network change accidentally broke the legacy tier-1 network (on which the echo-managers sit). A fix is available in prod aquilon

15:35 - Jyothish - manager01 came back online with the pre-upgrade configuration, but it is still throwing memory allocation errors. The log information was more useful this time and showed the memory allocation errors occurring within the load balancing section of the code

16:37 - Jyothish - xrootd version on manager01 rolled back to the pre-load-balancing-improvements version (5.5.4-4). This also rolled back the patch for el9 case-insensitive authz headers. The manager is still under heavy load (the network fix had not yet been deployed by aquilon).

17:07 - Jyothish - quattor finishes running on the host after significant manual prodding (killing yum and xrootd processes to ensure the quattor run had full capacity)

17:12 - Jyothish - host is back in a working state, still on the previous xrootd version

13/06

09:47 - Jyothish - patched and compiled xrootd-5.5.4-7 rpms addressing an identified memory inefficiency in the load balancing in 5.5.4-6 (the pre-upgrade version). Tested deployment on manager02. manager01 stopped but kept on sl7 for immediate reversion if the test fails. A switch-testing process (Switch Testing) was created and implemented for future attempted fixes

09:51 - Jyothish - no memory failures observed on manager02

10:07 - Jyothish - all gws start to fail functional tests. change reverted

10:16 - Jyothish - gws are still very slow and failing tests. service restart did not fix it

10:24 - Jyothish - initial nc response time on tests from lcgui is very slow (20s); a DNS issue is suspected

10:35 - Jyothish - attempted rebase and quattor reconfiguration to deploy further fixes from yesterday’s issues. quattor is stalling and timing out, even on rocky8 where the run would usually be very fast

10:35 - Tom Birkett - reported in the #network channel that he is also seeing DNS issues

10:58 - Maksim - also attempts to get quattor to run, with mixed results: failures to fetch from yum repos or very long response times

10:30-11:14 - DNS issues get worse, all services affected

11:14 - James Adams, Tom Birkett, Rob Harper - identified a problematic DNS server, first attempted fix put in place

12:50 - Jyothish - one of the gateways succeeded in running quattor and starts passing functional tests again

15:28 - James Adams, Tom Birkett - found the issue with DNS, reversion done on DNS servers

15:44 - Jyothish - managers and gws reconfigured and switch to rocky8 manager attempted

15:47 - Jyothish - problems still present. switch reverted

15:52 - James Adams - DNS issues resolved, hosts might need a quattor rerun

16:00 - Alexander Rogovskiy - problems still present on transfers

16:19 - Jyothish - reconfigured managers, asked Alexander Rogovskiy for confirmation, issue persists

16:54 - Jyothish - identified ceph-svc21 as being in a stuck authentication state. fixed and error rate lowered, but timeout errors still present. Load on manager01 is very high, thought to be due to a backlog of transfers; it should calm down later

17:10 - Jyothish - load on manager01 keeps increasing, rebase and reconfigure attempted, yum lock frozen so configuration runs fail

18:00 - Jyothish - attempted restarts and quattor runs, yum lock still frozen, forced runs failed

18:50 - Alexander Rogovskiy - failures increased and still present

20:00 - quattor finishes running, failures go down

14/06

04:00 - LHCb sees some WN failures due to timeouts

05:00-06:00 - high failure rate for ATLAS, but only against SARA

12:15 - Brian - high failure rate on transfers with Antares seen. Jyothish investigating

13:45 - Jyothish - the rollback of the latest load balancing algorithm on manager01 (SL7) also rolled back some patches that might affect el9 compatibility, which could be the cause of the SSL errors. Attempting to upgrade xrootd to the pre-OS-upgrade version.

13:55 - Jyothish - the pre-upgrade version is still throwing errors. Posted an update stating that changes are frozen for today, as further changes might destabilize the system more. The current state is the best it can be for the weekend.

14:10 - Jyothish - found a possible fix in the load balancing code and posted an update on planned actions for next week:

list of things to try out next week:

  1. latest RAL version of xrootd with the custom load balancing algorithm reverted, on rocky8 (and try to identify any other bugs that arise)

    1. if it works, upgrade manager01 to that version on rocky8 while the load balancing code gets reviewed

  2. try xrootd 5.6.9 with RAL patches

  3. if the above fix does not work by Monday afternoon, redirect checksums from manager01 to manager02 (the current sl7 version in prod uses suboptimal checksumming)

 


Main issues

quattor is a lot slower to run on sl7, even under low load (~5 minutes minimum). This significantly increased the time taken for manager01 to come back after the OS reversion was started (~2 hrs), and also slowed the deployment of the network fixes. Under heavy xrootd load this degraded further, with runs taking ~20 minutes to multiple hours on the evening of the 13th.

Opaque errors in xrootd caused the reversion to be started before the general network issues were identified. While this guaranteed getting back into a working state by the end of the incident, it is possible that recompiling rocky8 packages for the same xrootd version used on the current sl7 host would have achieved similar results while allowing significantly faster deployment of any aquilon changes.

 


Successful Change

17/06/24

08:40 - Jyothish - first patch (adding additional checks on the host status before adding it to the load balancing selection) is deployed on manager02.
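
For illustration only, a minimal sketch of the kind of pre-selection guard described above, assuming a simplified host record; the names (HostStatus, eligibleHosts, and the status fields) are invented and do not reflect the actual patch:

```cpp
// Hypothetical pre-selection guard: a host must pass explicit status checks
// before it is added to the load balancing candidate set. All names are
// illustrative and not taken from the deployed xrootd patch.
#include <cstddef>
#include <vector>

struct HostStatus {
    bool online            = false;  // responded to the last status poll
    bool suspect           = false;  // flagged as failing functional tests
    bool last_heartbeat_ok = false;  // heartbeat seen within the timeout window
};

std::vector<int> eligibleHosts(const std::vector<HostStatus>& hosts)
{
    std::vector<int> candidates;
    for (std::size_t i = 0; i < hosts.size(); ++i) {
        const HostStatus& h = hosts[i];
        // Only hosts passing every status check enter the selection set.
        if (h.online && !h.suspect && h.last_heartbeat_ok)
            candidates.push_back(static_cast<int>(i));
    }
    return candidates;
}
```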

09:00 - Jyothish, Katy, Alex - no failures since, but kept under observation for confirmation

10:48 - Alex - LHCb started seeing failures again, similar error

10:50 - Jyothish - revert to manager01 primary

11:00 - Jyothish - deployed xrootd 5.6.9 without the improved load balancing

13:15 - Alex - LHCb looks fine since the change

16:02 - Katy - CMS SAM tests are failing

16:03 - Jyothish - failing tests are not on the xrootd endpoint

16:05 - Jyothish, Katy - issue is on the cms-aaa

16:20 - Jyothish, Katy - issue identified as stuck auth gateways on cms-aaa, service restarts fixed it

16:44 - Jyothish - “manager02 is running stably for now. I'd like to keep it this way until tomorrow.
The issue seen on the version of xrootd with the custom load balancing took some time before appearing (onset of ~1 hour runtime).
The current version has been running fine since 11AM today. There is a small risk of that or similar issue reoccurring out of hours, but we'd need the server to run that long to know if it's fully gone. provided manager02 looks ok overnight, I'd like to upgrade manager01 to rocky8 with this setup tomorrow afternoon, and put it into production on wednesday morning”

18/06/24

08:44 - Jyothish - no errors overnight, but the memory/load has not come down, which is concerning

09:00-14:00 - Jyothish, Alex, James A - network debugging; identified occasionally slow ping responses and DNS resolution, attempted enabling DNS caching without much success

13:41 - Jyothish - reverted DNS caching and reverted to using manager01 as primary. A different error, still around the load balancing, was observed; posted on the xrootd GitHub PR for this feature, asking for help with debugging

14:31 - Jyothish - issue identified: explicit memory allocation is not handled well by the malloc wrapper.

15:23 - Jyothish - “There's two ways to go about it:

  • turn the weighted sum array into a class variable for the cluster

    • it's the most optimal solution in terms of memory

    • it's going to need some sort of locking protection on updating the weighted sums array being rewritten now that it's going to be static - i'm not quite sure how to go about it. the node load calculation has a separate module updating their load, but that poller (/src/XrdCms/XrdCmsMeter.cc) only has the context about individual nodes

    • the tricky bit is to avoid a deadlock on requests, as each transfer request would trigger and release this lock

  • remove the memory allocation from the array and turn into a fixed allocation standard array

    • it's going to duplicate the space required for each request, but there will be no explicit memory allocation involved

    • potential memory overhead (later calculation estimated this to be ~1.5MB at current production transfer volumes)

suggestions welcome, especially for option A”
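
For context, a minimal sketch of what option B could look like, assuming a simplified weighted selection routine; kMaxNodes, NodeInfo and selectNode are invented names and the real XrdCms code will differ:

```cpp
// Option B sketch: the cumulative-weight array is a fixed-size std::array on
// the stack, so each request pays a bounded, allocation-free cost instead of
// an explicit heap allocation. Names are illustrative only.
#include <algorithm>
#include <array>
#include <cstddef>
#include <random>
#include <vector>

constexpr std::size_t kMaxNodes = 64;   // assumed upper bound on cluster size

struct NodeInfo {
    int  load   = 0;      // load metric reported by the load poller
    bool online = true;
};

int selectNode(const std::vector<NodeInfo>& nodes)
{
    std::array<double, kMaxNodes> cumulative{};   // fixed allocation, zero-initialised
    double total = 0.0;
    const std::size_t n = std::min(nodes.size(), kMaxNodes);

    for (std::size_t i = 0; i < n; ++i) {
        // Lower load -> higher weight; offline nodes get zero weight.
        const double w = nodes[i].online ? 1.0 / (1.0 + nodes[i].load) : 0.0;
        total += w;
        cumulative[i] = total;
    }
    if (total == 0.0) return -1;                  // no eligible node

    // Standard weighted random selection: draw a point in [0, total) and pick
    // the first node whose cumulative weight exceeds it.
    static thread_local std::mt19937 rng{std::random_device{}()};
    std::uniform_real_distribution<double> dist(0.0, total);
    const double r = dist(rng);
    for (std::size_t i = 0; i < n; ++i)
        if (r < cumulative[i]) return static_cast<int>(i);
    return static_cast<int>(n - 1);
}
```

The per-request cost here is one array of kMaxNodes doubles, so memory grows with the number of concurrent requests rather than with per-request heap allocations, which is the trade-off described for option B above.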

16:08 - Jyothish - following a discussion of the above with James W, option B is to be tested for a short time (~5 min), then reverted; if it works it will be deployed again tomorrow.

16:13 - Jyothish - preliminary test was ok. change reverted

19:53 - Alex, Brij, Katy - large spike in failures seen, mostly around authentication

19/06/24

08:30 - Jyothish - 4 gateways got into the stuck auth state overnight. A restart fixed it, but this showed the problem has become a lot more frequent.
“A mitigation for the noauth issue would be to stick a service restart on the check that detects noauth failures. I'll make it configurable so that it can be changed as necessary”

09:07 - Jyothish - “xrootd 5.6.9 appears to now be working on the gateway with xrdceph. I've accidentally installed that version on gw14 (the if condition I set on aquilon to only use it on the managers didn't work as I expected). before rolling it back I just checked the logs to confirm whether it was segfaulting and got very surprised to see it working as normal. I'd like to keep gw14 on it for a while to see if that keeps working well, as that would mean we can finally go up to 5.6+ on the gateways”

09:21 - Jyothish - CMS tests are green again; the redirector change from the previous evening was rolled out again

10:11 - Alastair - asks to start the liaison meeting with an update on the current status of xrootd/echo and the impact on VOs

11:45 - Brij, Jyothish - ATLAS and CMS tests stayed green

15:36 - Jyothish - manager02 throwing weird errors in aquilon; the host thinks it is in the el9 inventory. This was tracked down to a human error on the fabric team; all hosts were rechecked to ensure they are on the correct personality

16:28 - Jyothish - reverted from 5.6.9 to the 5.5.4 base version due to frequent SSL errors. Evidence at the time seemed to indicate gw14 having trouble, but later evidence indicates this could have been due to the crash restart as well. current state kept overnight


AT THIS POINT THE MANAGER ISSUE WAS FIXED