...
9:07 Thomas, Jyothish (STFC,RAL,SC) reports nothing odd in the usual keyword searches of the xrootd logs on the gateways
9:07 George Patargias Alexander Rogovskiy Thomas, Jyothish (STFC,RAL,SC) connect the Antares issue to the drop in throughput from Echo.
9:10 Alexander Rogovskiy Thomas, Jyothish (STFC,RAL,SC) attempt to replicate the issue manually. Alex finds that the managers time out on ping.
9:15 Thomas, Jyothish (STFC,RAL,SC) checks the managers: noticeable slowness in ssh connections, disk space is available, but cmsd status shows odd errors, e.g. /system/cmsd@.service:14: Unknown lvalue 'ProtectKernelLogs' in section 'Service'. Both managers are rebooted.
9:20 Thomas, Jyothish (STFC,RAL,SC) attempts ping tests. Service is better, but IPv4 port pings are noticeably slower than IPv6.
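As a rough illustration (not the exact commands run), this kind of comparison can be reproduced by timing TCP connections to a manager over IPv4 and IPv6 separately; the port below is an assumption (the standard xrootd port) and the host is taken from the dashboard link later in this log.

    # Sketch: compare IPv4 vs IPv6 TCP connect latency to a manager.
    # Port 1094 (standard xrootd) is an assumption, not a confirmed value.
    import socket, time

    HOST = "echo-manager01.gridpp.rl.ac.uk"
    PORT = 1094

    def connect_time(family):
        addr = socket.getaddrinfo(HOST, PORT, family, socket.SOCK_STREAM)[0][4]
        start = time.monotonic()
        with socket.socket(family, socket.SOCK_STREAM) as s:
            s.settimeout(10)
            s.connect(addr)
        return time.monotonic() - start

    for family, label in [(socket.AF_INET, "IPv4"), (socket.AF_INET6, "IPv6")]:
        try:
            print(f"{label}: {connect_time(family) * 1000:.1f} ms")
        except OSError as exc:
            print(f"{label}: failed ({exc})")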
9:30 Thomas, Jyothish (STFC,RAL,SC) attempts a Quattor network run; the fetch stalls. Suspects IPv4 network issues and asks James Adams and Thomas Birkett for help.
9:33 James Walder posts dashboard for system metrics showing high CPU load https://vande.gridpp.rl.ac.uk/next/d/jdKEehP7k/fe528f99-a35e-57ba-b879-755c2fb4cd83?orgId=1&refresh=1m&var-datasource=ykH9GHGMk&var-domain=gridpp.rl.ac.uk&var-server=echo-manager01.gridpp.rl.ac.uk&var-inter=$__auto_interval_inter&var-prefix=&var-rp=autogen&from=now-12h&to=now
9:34 Thomas, Jyothish (STFC,RAL,SC) James Walder theorise that latency caused a buildup of threads that took CPU resources away from the managers. The root-cause timeline is narrowed down to 1 AM that morning.
9:43 Thomas, Jyothish (STFC,RAL,SC) narrows the issue down to the Echo managers' IPv4 only (gateways and WNs are working fine), asks for help in the networking channel.
9:34 Thomas, Jyothish (STFC,RAL,SC) further narrows it down to the keepalived IPs only, not the host IP.
9:55 Thomas, Jyothish (STFC,RAL,SC) Thomas Birkett: further investigation of keepalived shows the script that checks the service status timing out. Hypothesis that hosts fighting for the VRIP are making the issue worse.
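For context, keepalived decides whether a host should hold the virtual IP by running a tracked check script at a fixed interval; if the script hangs past its timeout, the host's VRRP priority drops and the IP can fail over. A minimal sketch of such a timeout-bounded check is below; the unit name and timeout are illustrative, not the actual production check.

    # Sketch of a timeout-bounded service check of the kind keepalived's
    # vrrp_script runs periodically. Unit name and timeout are illustrative.
    import subprocess, sys

    CHECK_CMD = ["systemctl", "is-active", "--quiet", "xrootd@manager"]  # assumed unit name
    TIMEOUT_S = 5  # a check that hangs this long is treated as a failure

    try:
        rc = subprocess.run(CHECK_CMD, timeout=TIMEOUT_S).returncode
    except subprocess.TimeoutExpired:
        # To keepalived a hung check looks like a dead service, so the
        # VRRP priority drops and the virtual IP can move to the other host.
        rc = 1

    sys.exit(rc)

When the check times out on both hosts in turn, each keeps trying to claim the VIP back, which is what makes the "fighting for the VRIP" hypothesis plausible.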
10:14 James Walder points out the managers came back from the reboot with swap enabled; swap is turned off.
10:22 Rob requests ticket creation and asks to focus on understanding what might have been the root cause that started the failures at 6:00. Thomas, Jyothish (STFC,RAL,SC) agrees in principle, but the usual avenues for investigating the root cause (logs, elog, dashboards) didn't provide any lead to start from, so he decides to prioritise mitigating the issue first before more time can be taken to investigate in depth. Creates a ticket, initially in the Ceph Service Queue, but not all present could access it; it was moved to the XRootD queue later.
10:26 Thomas Birkett points out the Antares buffers and the high CPU usage on the managers - the consensus at the time is that this is primarily caused by thread buildup due to network latency, since the test managers, gateways and physical managers all respond fine to network checks.
10:29 Thomas, Jyothish (STFC,RAL,SC): manual nc checks from lxplus seemed to work fine. This later proved to be a red herring - the nc timeouts were set lower on lxplus. Additionally, it was later found that the slow response was not guaranteed but occurred on ~80% of attempts; the initial tests might have been lucky.
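A sketch of the kind of repeated check that would have caught the intermittency (assumed host, port and thresholds; not the commands actually run from lxplus):

    # Sketch: repeat a timed TCP check to estimate how often the response is
    # slow, since a single quick check can easily miss an intermittent problem.
    import socket, time

    HOST, PORT = "echo-manager01.gridpp.rl.ac.uk", 1094  # assumed target
    ATTEMPTS, SLOW_AFTER_S, TIMEOUT_S = 50, 2.0, 10.0

    slow = 0
    for _ in range(ATTEMPTS):
        start = time.monotonic()
        try:
            with socket.create_connection((HOST, PORT), timeout=TIMEOUT_S):
                pass
        except OSError:
            slow += 1
            continue
        if time.monotonic() - start > SLOW_AFTER_S:
            slow += 1

    print(f"slow or failed attempts: {slow}/{ATTEMPTS} ({100 * slow / ATTEMPTS:.0f}%)")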
10:08-10:44 Thomas, Jyothish (STFC,RAL,SC) convenes the first emergency meeting with the Production team (Darren Moore, Brian Davies), acting Tier-1 group lead (Alastair Dewhurst), network experts (James Adams, Thomas Birkett), storage team (Rob, Gian Carlo Garces) and VO liaisons (Alexander Rogovskiy, Brij Jashal, Katy Ellis). Relays the problem and sets out a possible solution. The initial plan was to bring forward the approved change control and switch to using the bare-metal hosts that were going to be the new managers (/wiki/spaces/GRIDPP/pages/872644647), however this was blocked because it needs both firewall and DNS advertisements, which cannot be done in a day. The VO liaisons point out new job mixes that started around that time and might have contributed to the problem. Proposal by James Adams to increase the keepalived timeouts and intervals. The plan is to try that as well as investigate further, reconvene at 11:45 and make further DNS changes if needed in the early afternoon.
10:44 Katy Ellis Thomas Birkett: hypothesis of a possible DI change is ruled out, as it only affects these managers.
11:05 Thomas, Jyothish (STFC,RAL,SC) keepalived changes deployed and tested; results look promising as ping times went down. Katy Ellis sees the last CMS check as green.
11:19 Alexander Rogovskiy points out the load is still very high. James Walder points out the lower load on one of the gateways that was recently restarted. Both gateways are restarted, but the load is still increasing.
11:42 Brian Davies links the general VO dashboard to monitor broader metrics as seen by the VOs. Overall efficiency is still low.
11:45 second group meeting:
DNS change discouraged by James Adams - if things go wrong it takes time to know, and we wouldn't have time to roll back today. He also points out resource contention between keepalived and xrootd.
actions from 11:45 meeting - stop keepalived on one host to see if fixing the IPs improves things
reduce the priority of the xrootd process relative to keepalived so that network handling takes priority
reconvene at 1:30
11:50 Katy Ellis confirms the observation; tests are still mostly failing
11:58-12:40 Thomas, Jyothish (STFC,RAL,SC): IPs fixed on one host. Some improvement in network latency, but CPU load is still climbing. James Adams creates a sandbox for tuning the xrootd systemd process niceness to give core system processes priority, and tests the deployment on the test cluster. Thomas, Jyothish (STFC,RAL,SC) adds the change on prod at 12:42.
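A small sketch of how the effect of such a priority change can be verified on the host (assumes the third-party psutil package; the process names are the obvious candidates, not a confirmed list):

    # Sketch: list nice values of xrootd/cmsd/keepalived processes to confirm
    # that xrootd now runs at a lower scheduling priority (higher nice value)
    # than keepalived. Requires the third-party psutil package.
    import psutil

    WATCH = ("xrootd", "cmsd", "keepalived")

    for proc in psutil.process_iter(attrs=["pid", "name", "nice"]):
        name = proc.info["name"] or ""
        if any(w in name for w in WATCH):
            print(f"{proc.info['pid']:>7} {name:<12} nice={proc.info['nice']}")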
12:11 Alexander Rogovskiy identifies transfer patterns: most active connections are from LHCb, followed by CMS. Most of the CPU load is from xrootd.
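A sketch of the sort of quick tally behind this kind of observation, counting established connections on the xrootd port by peer address (port assumed; run on the manager or gateway itself):

    # Sketch: tally established TCP connections on the xrootd port by peer IP,
    # to see which remote hosts/VOs dominate. Port 1094 is an assumption.
    import collections, subprocess

    PORT_SUFFIX = ":1094"

    out = subprocess.run(["ss", "-tn"], capture_output=True, text=True,
                         check=True).stdout
    counts = collections.Counter()
    for line in out.splitlines()[1:]:            # skip the header line
        cols = line.split()
        # ss -tn columns: State Recv-Q Send-Q Local:Port Peer:Port
        if len(cols) >= 5 and cols[3].endswith(PORT_SUFFIX):
            counts[cols[4].rsplit(":", 1)[0]] += 1

    for peer, n in counts.most_common(10):
        print(f"{n:>6}  {peer}")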
12:22 James Adams identifies a backup operation on the VMware cluster that happened at the same time the issues first started.
12:27 Alexander Rogovskiy suggests increasing the core count on the VMs; Thomas Birkett suggests increasing the niceness. Thomas, Jyothish (STFC,RAL,SC): the second idea is what will be tried in 10 minutes; increasing the core count might be unfeasible due to delays in communication and action with the relevant teams.
1:30 meeting update - both changes were actioned and improved things, especially on the load metric that was being observed (later understood to be contention, not the CPU load). The system was more responsive and ping times were improved. Decision taken by the production team not to declare downtime, as tickets were deemed needed to monitor changes in the status. Actions: reduce VO occupancy in the batch farm to reduce load if needed, and possibly redirect LHCb jobs to use the managers directly instead of the alias.
Meeting notes:
Status at 1:30 meeting: Load problem resolved, but network latency persists.
Actions: Try another reboot - done, not much improvement
limit batch farm traffic to reduce load - done
switch LHCb endpoint to be echo-manager01 (so it avoids the keepalived IP)
Theories: possible that rapid IP switching in keepalived caused the network switch to break
Observations: this only affects the keepalived IP, not the host IP
Possible mitigations:
add internal managers as main cluster managers and see if distributing the load reduces it - try Mon
replace DNS alias entries with manager host IPs - try Mon/Tue
ensure proper network advertisements are in place for the physical managers and use those - Tue if working
put gateways into RR - last resort
2:00 Thomas, Jyothish (STFC,RAL,SC) updates VO tickets: We've identified the issue as caused by a delay in IPv4 response from our HA setup on the XRootD redirectors. We've tracked it down to a combination of CPU load and network latency on keepalived. We've traced the CPU load issue to a process priority conflict and resolved it, however the response delay is still ongoing. We have further mitigation plans that will necessitate DNS changes which cannot be actioned until Monday. I'll post further updates on Monday after attempting those fixes. Regards, Jyothish
2:05 Alexander Rogovskiy Thomas Birkett Thomas, Jyothish (STFC,RAL,SC) discussion on changing the endpoint. decision to hold until more is known
2:05-2:30 Thomas, Jyothish (STFC,RAL,SC) James Adams investigate the network side and are left puzzled, as the slow ping response is very specific when it should not be - a machine with two IPs only shows delays when pinging the keepalived IP, on the same port.
2:20-3:30 Thomas, Jyothish (STFC,RAL,SC) Thomas Birkett: further debugging showed the CPUs were still overloaded. Decided to disable the script check for keepalived in order to fix the IPs in place while still maintaining capacity for the alias. Discovered that keepalived, when stopped but not disabled, would turn itself back on, causing the floating IPs to hop between the hosts. Tom confirms he has access to the VMware dashboard and can make the change to add CPU cores.
3:30 Thomas, Jyothish (STFC,RAL,SC) Thomas Birkett Darren Moore Brian Davies - discuss this option and agree to try it, as it will be transparent and straightforward.
3:46 - change is complete; both managers' core count increased from 4 to 8
4:10 - confirmed significant improvement; things look OK for the weekend