timeline of events:
8:55 Alexander Rogovskiy points out overnight spikes in failed WN uploads, as well as in antares ↔︎ ECHO transfers
8:58 Thomas, Jyothish (STFC,RAL,SC) and Alexander Rogovskiy rule out workernode restarts as a cause, as the failure period is too long
8:59 Thomas, Jyothish (STFC,RAL,SC) checks the gateway dashboard and everything looks normal, but there’s a noticeable dip in traffic since ~6AM
9:07 Thomas, Jyothish (STFC,RAL,SC) reports that the usual keyword searches of the xrootd logs on the gateways show nothing odd
...
9:33 James Walder posts dashboard for system metrics showing high CPU load https://vande.gridpp.rl.ac.uk/next/d/jdKEehP7k/fe528f99-a35e-57ba-b879-755c2fb4cd83?orgId=1&refresh=1m&var-datasource=ykH9GHGMk&var-domain=gridpp.rl.ac.uk&var-server=echo-manager01.gridpp.rl.ac.uk&var-inter=$__auto_interval_inter&var-prefix=&var-rp=autogen&from=now-12h&to=now
9:34 Thomas, Jyothish (STFC,RAL,SC) and James Walder theorise that latency caused a buildup of threads, which took CPU resources away from the managers. Root cause timeline narrowed down to 1AM that morning
...
1:30 meeting update - both changes were actioned and improved things, especially the load metric that was being observed (later known to reflect contention, not CPU load).
The system was more responsive and ping times improved. The production team decided not to declare downtime, as tickets were deemed necessary to monitor changes in status. Follow-up actions: reduce VO occupancy in the batch farm to reduce load if needed, and possibly redirect LHCb jobs to use the managers directly instead of the alias
...
4:10 - confirmed significant improvement; things look OK for the weekend
Key Takeaways
...
group meeting was very helpful in resolution and prioritization during the incident.
quick turnaround for testing different ideas, thanks to the implementer having access to an external host (lxplus)
load issues can compound each other to make matters worse
system resources should be reviewed at appropriate intervals to ensure they’re fit for purpose.
resource access timelines should be clarified for proposed solutions - e.g. CPU could have been added to the VM earlier had it been confirmed that someone present at the meeting could do it; this was overlooked and not clearly worded.
This oversight did result in a more thorough investigation that improved the efficiency of the system, but the incident could have been cleared up earlier had this not been overlooked.
Good rapport and expertise awareness enabled a focused group to be called to assist.