...
Kibana dashboard for WN/tranche IOPS monitoring
Echo storage node IOPS (per generation)
XrootD production changes
External gateways
9/05/23 - pgwrite bugfix rollout on external gateways | deemed irrelevant to the incident
Batch farm:
9th May (09:30): Drain first half of worker-nodes
11th May (09:30): Update drained worker-nodes
Bring back online updated tranches
12th May (16:00): Drain remaining half of worker-nodes
15th May (14:00): Update drained worker-nodes
Bring back online updated tranches
Health check entire farm
Plots and associated info
...
This was traced to a missed line change in the Dockerfile.
Based on Ceph monitoring, the hard limit for read IOPS before the crash seems to be 150k, with a desirable rate of <100k. The current rate (without readV) is 30k.
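As a rough illustration of how these figures could be turned into a check, the sketch below classifies a read IOPS reading against the two thresholds quoted above. The function names, the status labels ("ok", "elevated", "critical") and the idea of a headroom fraction are assumptions made for this example; they are not part of the Echo or Ceph monitoring setup.

```python
# Thresholds taken from the figures in this report; all names are hypothetical.
HARD_LIMIT_IOPS = 150_000      # apparent hard limit seen before the crash
DESIRABLE_MAX_IOPS = 100_000   # desirable rate is below this value


def classify_read_iops(rate: float) -> str:
    """Return a coarse status label for a read IOPS reading."""
    if rate < 0:
        raise ValueError("IOPS rate cannot be negative")
    if rate >= HARD_LIMIT_IOPS:
        return "critical"
    if rate >= DESIRABLE_MAX_IOPS:
        return "elevated"
    return "ok"


def headroom_fraction(rate: float) -> float:
    """Fraction of the hard limit still unused at the given rate."""
    return 1.0 - rate / HARD_LIMIT_IOPS


if __name__ == "__main__":
    current = 30_000  # current rate without readV, as quoted in this report
    print(classify_read_iops(current), f"{headroom_fraction(current):.0%}")
```

With the current rate of 30k the reading falls in the "ok" band, using a fifth of the apparent hard limit.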