Timeline of events:
8:55 Alexander Rogovskiy points out overnight spikes in failed WN uploads and in failed antares ↔︎ ECHO transfers
8:58 Thomas, Jyothish (STFC,RAL,SC) and Alexander Rogovskiy rule out worker node restarts as a cause, as the failure period is too long
8:59 Thomas, Jyothish (STFC,RAL,SC) checks the gateway dashboard and everything looks normal, but there’s a noticeable dip in traffic since ~6AM

image-20250206-141815.png

9:07 Thomas, Jyothish (STFC,RAL,SC) reports that the usual keyword searches of the xrootd logs on the gateways show nothing odd
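
For illustration only, a keyword search of this kind could be sketched as below. The keywords and sample log lines are hypothetical; the terms actually searched for during the incident are not recorded in this timeline, and real xrootd log lines will look different.

```python
from collections import Counter
from typing import Iterable, Sequence

# Hypothetical keywords; not the list used during the incident.
KEYWORDS = ("error", "timeout", "refused")


def count_keyword_hits(lines: Iterable[str],
                       keywords: Sequence[str] = KEYWORDS) -> Counter:
    """Count how many lines contain each keyword (case-insensitive)."""
    hits = Counter()
    for line in lines:
        lowered = line.lower()
        for keyword in keywords:
            if keyword in lowered:
                hits[keyword] += 1
    return hits


# Made-up log lines, only to show the call.
sample = [
    "ERROR: unable to open file",
    "transfer completed",
    "connection timeout error",
]
print(count_keyword_hits(sample))
```

An empty or near-empty result, as reported at 9:07, only shows that the chosen keywords did not appear; it does not rule out a problem that leaves no such trace in the logs.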

...

9:33 James Walder posts dashboard for system metrics showing high CPU load https://vande.gridpp.rl.ac.uk/next/d/jdKEehP7k/fe528f99-a35e-57ba-b879-755c2fb4cd83?orgId=1&refresh=1m&var-datasource=ykH9GHGMk&var-domain=gridpp.rl.ac.uk&var-server=echo-manager01.gridpp.rl.ac.uk&var-inter=$__auto_interval_inter&var-prefix=&var-rp=autogen&from=now-12h&to=now

image-20250206-142126.png

image-20250206-141943.png


9:34 Thomas, Jyothish (STFC,RAL,SC) and James Walder theorise that latency caused a buildup of threads, which took CPU resources away from the managers. Start of the root cause narrowed down to 1AM that morning
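
One way to see why higher latency would lead to a thread buildup is Little's law: the mean number of requests in flight equals the arrival rate multiplied by the mean time each request spends in the system. The sketch below assumes one thread per in-flight request, and the figures are invented for illustration; they are not measurements from this incident.

```python
def in_flight_requests(arrival_rate_per_s: float, latency_ms: float) -> float:
    """Little's law: mean requests in flight = arrival rate x mean latency."""
    return arrival_rate_per_s * latency_ms / 1000.0


# Hypothetical figures: the same request rate at two different latencies.
normal = in_flight_requests(arrival_rate_per_s=100, latency_ms=50)
degraded = in_flight_requests(arrival_rate_per_s=100, latency_ms=2000)

print(f"in flight at 50 ms latency:  {normal}")
print(f"in flight at 2 s latency:    {degraded}")
```

With the arrival rate unchanged, the number of concurrent requests scales linearly with latency, so a service that holds a thread per request would accumulate threads as latency grows.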

...

1:30 meeting update - both changes were actioned and improved things, especially the load metric that was being observed (later understood to reflect contention, not CPU load).

image-20250206-142715.png

The system was more responsive and ping times improved. The production team decided not to declare downtime, as tickets were deemed necessary to monitor changes in status. Further actions agreed, if needed: reduce VO occupancy in the batch farm to lower the load, and possibly redirect LHCb jobs to use the managers directly instead of the alias.

...

4:10 - confirmed significant improvement; things look OK for the weekend

image-20250206-142351.png

Key Takeaways

...

The group meeting was very helpful for resolution and prioritization during the incident.

Quick turnaround for testing different ideas, because the implementer had access to an external host (lxplus).

Load issues can compound each other and make matters worse.

System resources should be reviewed periodically to ensure they’re fit for purpose.

Resource access timelines should be clarified for each proposed solution - e.g. CPU could have been added to the VM earlier had it been confirmed that someone present at the meeting could do it; this was overlooked and not well worded.
This oversight did result in a more thorough investigation that improved the efficiency of the system, but the incident could have been cleared up earlier without it.

Good rapport and expertise awareness enabled a focused group to be called to assist.