timeline of events:
8:55 Alexander Rogovskiy points out overnight spikes in failed WN uploads as well as in failed antares ↔︎ ECHO transfers
8:58 Thomas, Jyothish (STFC,RAL,SC) and Alexander Rogovskiy rule out worker node restarts as a cause, as the failure period is too long
8:59 Thomas, Jyothish (STFC,RAL,SC) checks the gateway dashboard; everything looks normal, but there is a noticeable dip in traffic since ~6AM
9:07 Thomas, Jyothish (STFC,RAL,SC) reports nothing odd in the usual keyword searches of the xrootd logs on the gateways
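(For reference, a minimal sketch of this kind of keyword scan over a gateway xrootd log; the log path and keyword list are assumptions, not the exact ones used on the day.)
```python
#!/usr/bin/env python3
"""Count occurrences of common failure keywords in an xrootd log."""
import re
from collections import Counter
from pathlib import Path

LOG = Path("/var/log/xrootd/xrootd.log")            # assumed log location
KEYWORDS = ["error", "fatal", "timeout", "denied"]  # assumed "usual" keywords

pattern = re.compile("|".join(KEYWORDS), re.IGNORECASE)
counts = Counter()
with LOG.open(errors="replace") as fh:
    for line in fh:
        match = pattern.search(line)
        if match:
            counts[match.group(0).lower()] += 1

for keyword, n in counts.most_common():
    print(f"{keyword}: {n}")
```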
...
9:33 James Walder posts a system-metrics dashboard showing high CPU load https://vande.gridpp.rl.ac.uk/next/d/jdKEehP7k/fe528f99-a35e-57ba-b879-755c2fb4cd83?orgId=1&refresh=1m&var-datasource=ykH9GHGMk&var-domain=gridpp.rl.ac.uk&var-server=echo-manager01.gridpp.rl.ac.uk&var-inter=$__auto_interval_inter&var-prefix=&var-rp=autogen&from=now-12h&to=now
9:34 Thomas, Jyothish (STFC,RAL,SC) and James Walder theorise that latency caused a buildup of threads, which took CPU resources away from the managers. The root-cause window is narrowed down to ~1AM that morning
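(A quick way to sanity-check the thread-buildup theory on a manager node is to list processes by thread count alongside the load average. The sketch below reads /proc directly, so it assumes a Linux host; it is illustrative only and not the exact check performed.)
```python
#!/usr/bin/env python3
"""List the processes with the most threads, plus the load averages."""
import os

def thread_counts():
    counts = []
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open(f"/proc/{pid}/status") as fh:
                fields = dict(line.split(":", 1) for line in fh if ":" in line)
            counts.append((int(fields["Threads"]), fields["Name"].strip(), pid))
        except (FileNotFoundError, ProcessLookupError, KeyError):
            continue  # process exited while we were reading
    return sorted(counts, reverse=True)

load1, load5, load15 = os.getloadavg()
print(f"load averages: {load1:.1f} {load5:.1f} {load15:.1f}")
for threads, name, pid in thread_counts()[:10]:
    print(f"{threads:6d} threads  {name} (pid {pid})")
```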
...
1:30 meeting update - both changes were actioned and improved things, especially on the load metric that was being observed (later understood to reflect contention rather than CPU load).
The system was more responsive and ping times were improved. Decision taken by the production team not to declare a downtime, as tickets were deemed needed to monitor changes in the status. Actions identified: reduce VO occupancy on the batch farm to lower the load if needed, and possibly redirect LHCb jobs to use the managers directly instead of the alias
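(A minimal sketch of the kind of ping-time check used to follow the managers' responsiveness after the changes; only echo-manager01 appears in the dashboard link above, the second hostname is an assumption.)
```python
#!/usr/bin/env python3
"""Report reachability and round-trip time for the manager nodes."""
import subprocess

MANAGERS = [
    "echo-manager01.gridpp.rl.ac.uk",
    "echo-manager02.gridpp.rl.ac.uk",  # assumed name, substitute the real node
]

for host in MANAGERS:
    # one ping with a 2-second timeout, using the system ping binary
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "2", host],
        capture_output=True, text=True,
    )
    status = "ok" if result.returncode == 0 else "unreachable"
    rtt = next(
        (line.split("time=")[1] for line in result.stdout.splitlines() if "time=" in line),
        "n/a",
    )
    print(f"{host}: {status} rtt={rtt}")
```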
...
4:10 - confirmed significant improvement; things look OK for the weekend