timeline of events:
8:55 Alexander Rogovskiy points out overnight spikes in failed WN uploads as well as in failed antares ↔︎ ECHO transfers
8:58 Thomas, Jyothish (STFC,RAL,SC) and Alexander Rogovskiy rule out worker node restarts as a cause, as the failure period is too long
8:59 Thomas, Jyothish (STFC,RAL,SC) checks the gateway dashboard; everything looks normal, but there is a noticeable dip in traffic since ~6AM
9:07 Thomas, Jyothish (STFC,RAL,SC) reports nothing odd in the usual keyword searches of the xrootd logs on the gateways
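(For reference, a minimal sketch of this kind of keyword scan over a gateway xrootd log; the log path and keyword list are assumptions, not the exact ones used on the day.)
```python
#!/usr/bin/env python3
"""Count occurrences of common failure keywords in an xrootd log."""
import re
from collections import Counter
from pathlib import Path

LOG = Path("/var/log/xrootd/xrootd.log")            # assumed log location
KEYWORDS = ["error", "fatal", "timeout", "denied"]  # assumed "usual" keywords

pattern = re.compile("|".join(KEYWORDS), re.IGNORECASE)
counts = Counter()
with LOG.open(errors="replace") as fh:
    for line in fh:
        match = pattern.search(line)
        if match:
            counts[match.group(0).lower()] += 1

for keyword, n in counts.most_common():
    print(f"{keyword}: {n}")
```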
...
9:33 James Walder posts a system-metrics dashboard showing high CPU load https://vande.gridpp.rl.ac.uk/next/d/jdKEehP7k/fe528f99-a35e-57ba-b879-755c2fb4cd83?orgId=1&refresh=1m&var-datasource=ykH9GHGMk&var-domain=gridpp.rl.ac.uk&var-server=echo-manager01.gridpp.rl.ac.uk&var-inter=$__auto_interval_inter&var-prefix=&var-rp=autogen&from=now-12h&to=now
9:34 Thomas, Jyothish (STFC,RAL,SC) and James Walder theorise that latency caused a buildup of threads, which took CPU resources away from the managers. The root-cause window is narrowed down to ~1AM that morning
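(A quick way to sanity-check the thread-buildup theory on a manager node is to list processes by thread count alongside the load average. The sketch below reads /proc directly, so it assumes a Linux host; it is illustrative only and not the exact check performed.)
```python
#!/usr/bin/env python3
"""List the processes with the most threads, plus the load averages."""
import os

def thread_counts():
    counts = []
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open(f"/proc/{pid}/status") as fh:
                fields = dict(line.split(":", 1) for line in fh if ":" in line)
            counts.append((int(fields["Threads"]), fields["Name"].strip(), pid))
        except (FileNotFoundError, ProcessLookupError, KeyError):
            continue  # process exited while we were reading
    return sorted(counts, reverse=True)

load1, load5, load15 = os.getloadavg()
print(f"load averages: {load1:.1f} {load5:.1f} {load15:.1f}")
for threads, name, pid in thread_counts()[:10]:
    print(f"{threads:6d} threads  {name} (pid {pid})")
```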
...
1:30 meeting update - both changes were actioned and improved things, especially on the load metric that was being observed (later understood to reflect contention rather than CPU load).
The system was more responsive and ping times were improved. Decision taken by the production team not to declare a downtime, as tickets were deemed needed to monitor changes in the status. Actions identified: reduce VO occupancy on the batch farm to lower the load if needed, and possibly redirect LHCb jobs to use the managers directly instead of the alias
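(A minimal sketch of the kind of ping-time check used to follow the managers' responsiveness after the changes; only echo-manager01 appears in the dashboard link above, the second hostname is an assumption.)
```python
#!/usr/bin/env python3
"""Report reachability and round-trip time for the manager nodes."""
import subprocess

MANAGERS = [
    "echo-manager01.gridpp.rl.ac.uk",
    "echo-manager02.gridpp.rl.ac.uk",  # assumed name, substitute the real node
]

for host in MANAGERS:
    # one ping with a 2-second timeout, using the system ping binary
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "2", host],
        capture_output=True, text=True,
    )
    status = "ok" if result.returncode == 0 else "unreachable"
    rtt = next(
        (line.split("time=")[1] for line in result.stdout.splitlines() if "time=" in line),
        "n/a",
    )
    print(f"{host}: {status} rtt={rtt}")
```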
...
4:10 - confirmed significant improvement; things look OK for the weekend