12/06
10:00 - Jyothish - manager01 upgrade started
12:30 - Jyothish - manager01 finishes upgrade; memory allocation errors show up on xrootd, mainly around the host lookup section
12:49 - Jyothish - services stopped on manager01 (all traffic redirected to manager02). manager02 is showing the same errors. This is now thought to be a rocky8 issue with xrootd
<error information and debugging info are collected>
13:50 - Jyothish - decision taken to revert manager01 to sl7, reversion started
14:10 - Maksim - reports that he has been having issues with host upgrades due to a network change introduced by cloud
14:50 - Alexander Dibbo - posts in #data services confirming the network change accidentally broke the legacy tier-1 network (on which the echo-managers are). Fix is available in prod aquilon
15:35 - Jyothish - manager01 came back online with the pre-upgrade configuration, but it is still throwing memory allocation errors. The log information was more useful this time, flagging that the memory allocation errors take place within the load balancing section of the code
16:37 - Jyothish - xrootd version on manager01 rolled back to the pre-load-balancing-improvements version (5.5.4-4). This also rolled back the patch for el9 case-insensitive authz headers. Manager is still under heavy load (the fix for the network change had not yet been deployed via aquilon).
17:07 - Jyothish - quattor finishes running on the host after significant manual prodding (yum and xrootd process kills to ensure full capacity was available for the quattor run)
17:12 - Jyothish - host is back in a working state, still on the previous xrootd version
13/06
09:47 - Jyothish - patched and compiled xrootd-5.5.4-7 rpms addressing an identified memory inefficiency in the load balancing in 5.5.4-6 (pre-upgrade version). Tested deployment on manager02. manager01 stopped but kept on sl7 for immediate reversion if the test fails.
09:51 - Jyothish - no memory failures observed on manager02
10:07 - Jyothish - all gws start to fail functional tests. change reverted
10:16 - Jyothish - gws are still very slow and failing tests. service restart did not fix it
10:24 - Jyothish - initial nc response time is very slow on tests from lcgui (20s). potential DNS issue suspected
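As a hedged illustration of how slow name resolution can be separated from a slow service (the actual check used `nc` from lcgui; the helper and hostname below are placeholders, not the commands run during the incident), timing the resolver step alone makes a DNS problem obvious:

```python
import socket
import time

def resolve_time(hostname: str) -> float:
    """Return seconds spent purely in name resolution (no TCP connect)."""
    start = time.monotonic()
    socket.getaddrinfo(hostname, None)
    return time.monotonic() - start

if __name__ == "__main__":
    # 'localhost' stands in for the real gateway host; a healthy resolver
    # answers in well under a second, so a ~20s initial response points
    # at DNS rather than the service itself.
    print(f"resolution took {resolve_time('localhost'):.3f}s")
```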
10:35 - Jyothish - attempted rebase and quattor reconfiguration to deploy further fixes from yesterday’s issues. quattor is stalling and timing out, even on rocky8 where the run would usually be very fast
10:35 - Tom Birkett - raised in the #network channel about also seeing DNS issues
10:58 - Maksim - also attempts to get quattor to run with mixed results. failures to fetch from yum repos or very long response times
10:30-11:14 - DNS issues get worse, all services affected
11:14 - James Adams, Tom Birkett, Rob Harper - identified a problematic DNS server, first attempted fix put in place
12:50 - Jyothish - one of the gateways succeeded in running quattor and starts passing functional tests again
15:28 - James Adams, Tom Birkett - found the issue with DNS, reversion done on DNS servers
15:44 - Jyothish - managers and gws reconfigured and switch to rocky8 manager attempted
15:47 - Jyothish - problems still present. switch reverted
15:52 - James Adams - DNS issues resolved, hosts might need a quattor rerun
16:00 - Alexander Rogovskiy - problems still present on transfers
16:19 - Jyothish - reconfigured managers, asked Alexander Rogovskiy for confirmation, issue persists
16:54 - Jyothish - identified ceph-svc21 to be in a stuck authentication state. fixed and error rate lowered, but timeout errors still present. Load on manager01 is very high, thought to be due to backlog of transfers, should calm down later
17:10 - Jyothish - load on manager01 keeps increasing, rebase and reconfigure attempted, yum lock frozen so configuration runs fail
18:00 - Jyothish - attempted restarts and quattor runs, yum lock still frozen, force runs failed
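A frozen yum lock like the one above usually means the pid recorded in the lock file is either still running or has died without cleaning up. A minimal sketch of the triage step (the helper name and default path are assumptions, using yum's conventional /var/run/yum.pid location, not details from the incident log):

```python
import os
from pathlib import Path

def yum_lock_holder(pidfile: str = "/var/run/yum.pid") -> str:
    """Report whether the yum lock file points at a live or dead process."""
    path = Path(pidfile)
    if not path.exists():
        return "no lock"
    text = path.read_text().strip()
    if not text.isdigit():
        return f"unreadable lock file: {pidfile}"
    pid = int(text)
    try:
        os.kill(pid, 0)  # signal 0: existence check only, nothing is killed
        return f"held by pid {pid}"
    except ProcessLookupError:
        return f"stale lock (pid {pid} gone)"
    except PermissionError:
        return f"held by pid {pid} (other user)"
```

A stale lock can be removed safely; a live holder explains why forced configuration runs kept failing.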
18:50 - Alexander Rogovskiy - failures increased and still present
20:00 - quattor finishes running, failures go down
14/06
04:00 - LHCb sees some WN failures due to timeouts
05:00-06:00 - high failure rate for ATLAS, but only against SARA
Main issues
quattor is a lot slower to run on sl7, even under low load (~5 minutes minimum). This significantly slowed manager01's recovery after the OS reversion was started (~2 hrs), as well as the deployment of the network fixes. Under heavy xrootd load it slowed down further, taking from ~20 min to multiple hours on the evening of the 13th.
Opaque errors in xrootd meant the reversion was started before the general network issues were identified. While this guaranteed getting back into a working state by the end of the incident, recompiling rocky8 packages for the same xrootd version used on the current sl7 host might have achieved similar results while keeping a significantly faster deployment time for any aquilon changes.