12/06
10:00 - Jyothish - manager01 upgrade started
12:30 - Jyothish - manager01 finishes upgrade; memory allocation errors show up on xrootd, mainly around the host lookup section
12:49 - Jyothish - services stopped on manager01 (all traffic redirected to manager02). manager02 is showing the same errors. This is now thought to be a rocky8 issue with xrootd
<error information and debugging info are collected>
13:50 - Jyothish - decision taken to revert manager01 to sl7, reversion started
14:10 - Maksim - reports that he has been having issues with host upgrades due to a network change introduced by cloud
14:50 - Alexander Dibbo - posts in #data services confirming the network change accidentally broke the legacy tier-1 network (on which the echo-managers are). Fix is available in prod aquilon
15:35 - Jyothish - manager01 came back online with the pre-upgrade configuration, but it is still throwing memory allocation errors. The log information was more useful this time, flagging that the memory allocation errors take place within the load balancing section of the code
16:37 - Jyothish - xrootd version on manager01 rolled back to the pre-load-balancing-improvements version (5.5.4-4). This also rolled back the patch for el9 case-insensitive authz headers. Manager is still under heavy load (the fix for the network change had not yet been deployed via aquilon).
17:07 - Jyothish - quattor finishes running on the host after significant manual prodding (yum and xrootd process kills to ensure full capacity was available for the quattor run)
17:12 - Jyothish - host is back in a working state, still on the previous xrootd version
13/06
09:47 - Jyothish - patched and compiled xrootd-5.5.4-7 rpms addressing an identified memory inefficiency in the load balancing in 5.5.4-6 (pre-upgrade version). Tested deployment on manager02. manager01 stopped but kept on sl7 for immediate reversion if the test fails.
09:51 - Jyothish - no memory failures observed on manager02
10:07 - Jyothish - all gws start to fail functional tests. change reverted
10:16 - Jyothish - gws are still very slow and failing tests. service restart did not fix it
10:24 - Jyothish - initial nc response time is very slow on tests from lcgui (20s). potential DNS issue suspected
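As a hedged illustration of how slow name resolution can be separated from a slow service (the actual check used `nc` from lcgui; the helper and hostname below are placeholders, not the commands run during the incident), timing the resolver step alone makes a DNS problem obvious:

```python
import socket
import time

def resolve_time(hostname: str) -> float:
    """Return seconds spent purely in name resolution (no TCP connect)."""
    start = time.monotonic()
    socket.getaddrinfo(hostname, None)
    return time.monotonic() - start

if __name__ == "__main__":
    # 'localhost' stands in for the real gateway host; a healthy resolver
    # answers in well under a second, so a ~20s initial response points
    # at DNS rather than the service itself.
    print(f"resolution took {resolve_time('localhost'):.3f}s")
```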
10:35 - Jyothish - attempted rebase and quattor reconfiguration to deploy further fixes from yesterday’s issues. quattor is stalling and timing out, even on rocky8 where the run would usually be very fast
10:35 - Tom Birkett - raised in the #network channel about also seeing DNS issues
10:58 - Maksim - also attempts to get quattor to run with mixed results. failures to fetch from yum repos or very long response times
10:30-11:14 - DNS issues get worse, all services affected
11:14 - James Adams, Tom Birkett, Rob Harper - identified a problematic DNS server, first attempted fix put in place
12:50 - Jyothish - one of the gateways succeeded in running quattor and starts passing functional tests again
15:28 - James Adams, Tom Birkett - found the issue with DNS, reversion done on DNS servers
15:44 - Jyothish - managers and gws reconfigured and switch to rocky8 manager attempted
15:47 - Jyothish - problems still present. switch reverted
15:52 - James Adams - DNS issues resolved, hosts might need a quattor rerun
16:00 - Alexander Rogovskiy - problems still present on transfers
16:19 - Jyothish - reconfigured managers, asked Alexander Rogovskiy for confirmation, issue persists
16:54 - Jyothish - identified ceph-svc21 to be in a stuck authentication state. fixed and error rate lowered, but timeout errors still present. Load on manager01 is very high, thought to be due to backlog of transfers, should calm down later
17:10 - Jyothish - load on manager01 keeps increasing, rebase and reconfigure attempted, yum lock frozen so configuration runs fail
18:00 - Jyothish - attempted restarts and quattor runs, yum lock still frozen, force runs failed
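A frozen yum lock like the one above usually means the pid recorded in the lock file is either still running or has died without cleaning up. A minimal sketch of the triage step (the helper name and default path are assumptions, using yum's conventional /var/run/yum.pid location, not details from the incident log):

```python
import os
from pathlib import Path

def yum_lock_holder(pidfile: str = "/var/run/yum.pid") -> str:
    """Report whether the yum lock file points at a live or dead process."""
    path = Path(pidfile)
    if not path.exists():
        return "no lock"
    text = path.read_text().strip()
    if not text.isdigit():
        return f"unreadable lock file: {pidfile}"
    pid = int(text)
    try:
        os.kill(pid, 0)  # signal 0: existence check only, nothing is killed
        return f"held by pid {pid}"
    except ProcessLookupError:
        return f"stale lock (pid {pid} gone)"
    except PermissionError:
        return f"held by pid {pid} (other user)"
```

A stale lock can be removed safely; a live holder explains why forced configuration runs kept failing.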
18:50 - Alexander Rogovskiy - failures increased and still present
20:00 - quattor finishes running, failures go down
14/06
04:00 - LHCb sees some WN failures due to timeouts
05:00-06:00 - high failure rate for ATLAS, but only against SARA
Main issues
quattor is a lot slower to run on sl7, even under low load (~5 minutes minimum). This significantly slowed manager01's recovery after the OS reversion was started (~2 hrs), as well as the deployment of the network fixes. Under heavy xrootd load it slowed down further, taking from ~20 min to multiple hours on the evening of the 13th.
Opaque errors in xrootd meant the reversion was started before the general network issues were identified. While this guaranteed getting back into a working state by the end of the incident, recompiling rocky8 packages for the same xrootd version used on the current sl7 host might have achieved similar results while keeping a significantly faster deployment time for any aquilon changes.