Reboot Campaign
All external gateways must be rebooted every 90 days to install firmware updates and for general maintenance. The procedure is the same for all gateways except those behind the s3.echo.stfc.ac.uk and echo.stfc.ac.uk aliases, because those hosts are in a DNS round robin for RADOS Gateway access. Check the spreadsheet for an up-to-date list of those hosts.
Scripts
blacklist.sh
ssh -i .ssh/id_rsa root@echo-manager02.gridpp.rl.ac.uk "echo \"${1}.gridpp.rl.ac.uk\" >> /etc/xrootd/cms.blacklist"
ssh -i .ssh/id_rsa root@echo-manager01.gridpp.rl.ac.uk "echo \"${1}.gridpp.rl.ac.uk\" >> /etc/xrootd/cms.blacklist"
unblacklist.sh
ssh -i .ssh/id_rsa root@echo-manager02.gridpp.rl.ac.uk "sed -i \"/${1}/d\" /etc/xrootd/cms.blacklist"
ssh -i .ssh/id_rsa root@echo-manager01.gridpp.rl.ac.uk "sed -i \"/${1}/d\" /etc/xrootd/cms.blacklist"
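Note that the sed pattern in unblacklist.sh is unanchored, so a prefix such as ceph-svc01 would also delete the entry of any host whose name merely contains it (for example a hypothetical ceph-svc010). The following is a minimal sketch of a stricter delete, assuming GNU sed and assuming every blacklist entry is a bare FQDN on its own line, as blacklist.sh writes it. The function name remove_host and its file argument are illustrative and not part of the existing scripts.

```shell
# Sketch only: remove_host is a hypothetical helper, not an existing script.
# $1 = hostname prefix (e.g. ceph-svc01), $2 = path to the blacklist file.
remove_host() {
    # Escape the dots so they match literally instead of "any character".
    pattern=$(printf '%s' "${1}.gridpp.rl.ac.uk" | sed 's/\./\\./g')
    # Anchor with ^ and $ so only the exact FQDN line is deleted.
    sed -i "/^${pattern}\$/d" "$2"
}
```

On the managers the file argument would be /etc/xrootd/cms.blacklist. Entries with trailing whitespace or comments would not match the anchored pattern and would need to be removed by hand.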
Procedure
Check the current transfer load on the gateways through the Grafana dashboard.
If the average throughput is more than 22 Gb/s (>90% of maximum network capacity), do not proceed.
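The go/no-go decision above can be sketched as a small shell check. This assumes the average throughput has been read off the dashboard by hand and is passed in as a number in Gb/s; the function name safe_to_proceed and the variable THRESHOLD_GBPS are hypothetical and do not exist in the current scripts.

```shell
# Sketch only: safe_to_proceed is a hypothetical helper.
# The 22 Gb/s limit is the one stated in the procedure.
THRESHOLD_GBPS=22

safe_to_proceed() {
    # $1 = average throughput in Gb/s, typed in from the dashboard.
    # awk is used so that decimal values such as 21.5 compare correctly.
    # Exit status 0 means the load is at or below the limit.
    awk -v avg="$1" -v max="$THRESHOLD_GBPS" 'BEGIN { exit !(avg <= max) }'
}
```

A possible use is to gate the first blacklist call on it, for example: safe_to_proceed 18.4 && bash blacklist.sh ceph-svc01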
For each host or batch of hosts that is currently in production use:
Run the following command, where hostname_prefix is the part of the hostname before .gridpp.rl.ac.uk, for example ceph-svc01:
bash blacklist.sh <hostname_prefix>
Wait until the traffic drops (usually about 15 minutes).
SSH into the host and run "reboot".
Wait for the host to come back (10-20 minutes).
Check that the systemd services xrootd@{unified,tpc} and cmsd@unified are active and running.
run