Timeline of events:
8:55 Alexander Rogovskiy points out overnight spikes in failed WN (worker node) uploads, as well as in Antares ↔ ECHO transfers
8:58 Thomas, Jyothish (STFC,RAL,SC) and Alexander Rogovskiy rule out worker node restarts as a cause, since the failure period is too long
8:59 Thomas, Jyothish (STFC,RAL,SC) checks the gateway dashboard and everything looks normal, but there is a noticeable dip in traffic since ~6 AM

[Screenshot: image-20250206-141815.png]

9:07 Thomas, Jyothish (STFC,RAL,SC) reports nothing odd in the usual keyword searches of the xrootd logs on the gateways
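
For context, a keyword sweep of that kind can be scripted; the sketch below is only a minimal illustration, assuming the gateway logs sit under /var/log/xrootd/ and using example keywords (neither the path nor the keyword list is taken from this incident).

```python
# Example keyword sweep over gateway xrootd logs. The path and keyword list
# are illustrative assumptions, not the exact ones used during the incident.
import glob
from collections import Counter

KEYWORDS = ["error", "timeout", "refused", "unable"]   # assumed keywords
LOG_GLOB = "/var/log/xrootd/*/xrootd.log*"             # assumed log location

def scan_logs(pattern=LOG_GLOB):
    """Count keyword hits per (file, keyword); a quiet day stays flat and low."""
    hits = Counter()
    for path in glob.glob(pattern):
        try:
            with open(path, errors="replace") as fh:
                for line in fh:
                    lowered = line.lower()
                    for kw in KEYWORDS:
                        if kw in lowered:
                            hits[(path, kw)] += 1
        except OSError:
            continue  # rotated or unreadable file
    return hits

if __name__ == "__main__":
    for (path, kw), count in sorted(scan_logs().items(), key=lambda x: -x[1])[:20]:
        print(f"{count:6d}  {kw:10s}  {path}")
```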

...

9:33 James Walder posts a dashboard of system metrics showing high CPU load (see the load vs. CPU sketch after the screenshots): https://vande.gridpp.rl.ac.uk/next/d/jdKEehP7k/fe528f99-a35e-57ba-b879-755c2fb4cd83?orgId=1&refresh=1m&var-datasource=ykH9GHGMk&var-domain=gridpp.rl.ac.uk&var-server=echo-manager01.gridpp.rl.ac.uk&var-inter=$__auto_interval_inter&var-prefix=&var-rp=autogen&from=now-12h&to=now

[Screenshot: image-20250206-142126.png]

[Screenshot: image-20250206-141943.png]
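
If the load figure on that dashboard is the standard Linux load average, it counts runnable and uninterruptible (blocked) tasks rather than CPU time alone, which is consistent with the later note (1:30 update) that the metric turned out to reflect contention rather than CPU load. The sketch below, assuming nothing beyond standard Linux procfs on a manager node, shows one way to cross-check the two.

```python
# Cross-check of load average vs. actual CPU use on a node: /proc/loadavg
# counts runnable and uninterruptible (D-state) tasks, so blocked threads
# inflate it even when the CPUs are largely idle. Standard Linux procfs only.
import os
import time

def cpu_busy_fraction(interval: float = 1.0) -> float:
    """Fraction of CPU time spent non-idle over `interval` seconds."""
    def snapshot():
        with open("/proc/stat") as fh:
            fields = [int(x) for x in fh.readline().split()[1:]]
        idle = fields[3] + fields[4]  # idle + iowait
        return idle, sum(fields)
    idle0, total0 = snapshot()
    time.sleep(interval)
    idle1, total1 = snapshot()
    return 1.0 - (idle1 - idle0) / (total1 - total0)

if __name__ == "__main__":
    load1, _, _ = os.getloadavg()
    ncpu = os.cpu_count() or 1
    busy = cpu_busy_fraction()
    print(f"1-min load {load1:.1f} on {ncpu} CPUs, CPUs busy {busy:.0%}")
    if load1 > ncpu and busy < 0.5:
        print("high load with mostly idle CPUs: points at contention, not CPU")
```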


9:34 Thomas, Jyothish (STFC,RAL,SC) and James Walder theorise that latency caused a buildup of threads, which took CPU resources away from the managers. The root-cause timeline is narrowed down to ~1 AM that morning
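
One rough way to test the thread-buildup theory on a manager node is to count threads per daemon process over time; the sketch below assumes standard Linux procfs and that the relevant daemons are named xrootd and cmsd (the process names are an assumption, not confirmed from the incident).

```python
# Rough check of the thread-buildup theory on a manager node: count threads
# per watched daemon via /proc. The daemon names below are assumptions.
import os

WATCHED = ("xrootd", "cmsd")  # assumed daemon names on the Echo managers

def proc_threads():
    """Return {pid: (comm, thread_count)} for watched processes."""
    out = {}
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/comm") as fh:
                comm = fh.read().strip()
            if comm not in WATCHED:
                continue
            with open(f"/proc/{entry}/status") as fh:
                for line in fh:
                    if line.startswith("Threads:"):
                        out[int(entry)] = (comm, int(line.split()[1]))
                        break
        except OSError:
            continue  # process exited while we were scanning
    return out

if __name__ == "__main__":
    for pid, (comm, nthreads) in sorted(proc_threads().items()):
        print(f"pid {pid:7d}  {comm:6s}  {nthreads:5d} threads")
```

Running this periodically (e.g. from cron) and watching the per-process thread count climb would support the buildup theory.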

...

1:30 meeting update - both changes were actioned and improved matters, especially the load metric that was being observed (later understood to be contention rather than CPU load).

[Screenshot: image-20250206-142715.png]

The system was more responsive and ping times improved. The production team decided not to declare a downtime, as the open tickets were deemed necessary for monitoring changes in status. Possible actions: reduce VO occupancy in the batch farm to lower the load if needed, and possibly redirect LHCb jobs to use the managers directly instead of the alias.
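
As an illustration of the alias-versus-direct-manager idea, the sketch below resolves a round-robin alias and times a TCP connect to the xrootd port on each alias member and on a manager host directly. The alias hostname is a placeholder, echo-manager01.gridpp.rl.ac.uk is taken from the dashboard URL above, and 1094 is the default xrootd port; this is not the exact procedure used on the day.

```python
# Compare connection latency to the round-robin alias members and to a manager
# host directly. ALIAS is a placeholder; the manager hostname comes from the
# dashboard URL above; 1094 is the default xrootd port.
import socket
import time

ALIAS = "echo-alias.example.org"                # placeholder, not the real alias
MANAGERS = ["echo-manager01.gridpp.rl.ac.uk"]   # from the dashboard URL above
PORT = 1094

def connect_time(host, port=PORT, timeout=5.0):
    """Return the TCP connect time in seconds, or None if unreachable."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        return None

if __name__ == "__main__":
    targets = set(MANAGERS)
    try:
        targets.update(info[4][0] for info in socket.getaddrinfo(ALIAS, PORT))
    except OSError:
        pass  # placeholder alias will not resolve outside the site
    for target in sorted(targets):
        t = connect_time(target)
        label = "unreachable" if t is None else f"{t * 1000:.1f} ms"
        print(f"{target:45s} {label}")
```

Comparing connect times per address makes it easier to see whether one overloaded manager behind the alias is responsible for the slow responses.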

...

4:10 Confirmed significant improvement; things look OK for the weekend

[Screenshot: image-20250206-142351.png]