Reboot Campaign

All external gateways must be rebooted every 90 days to install firmware updates and for general maintenance. The procedure is the same for all gateways except those behind the s3.echo.stfc.ac.uk and echo.stfc.ac.uk aliases, because those hosts are in a DNS round robin for RADOS Gateway access. Check the spreadsheet for an up-to-date list of those hosts.

Scripts

blacklist.sh

ssh -i .ssh/id_rsa root@echo-manager02.gridpp.rl.ac.uk "echo \"${1}.gridpp.rl.ac.uk\" >> /etc/xrootd/cms.blacklist"
ssh -i .ssh/id_rsa root@echo-manager01.gridpp.rl.ac.uk "echo \"${1}.gridpp.rl.ac.uk\" >> /etc/xrootd/cms.blacklist"

unblacklist.sh

ssh -i .ssh/id_rsa root@echo-manager02.gridpp.rl.ac.uk "sed -i \"/${1}/d\" /etc/xrootd/cms.blacklist"
ssh -i .ssh/id_rsa root@echo-manager01.gridpp.rl.ac.uk "sed -i \"/${1}/d\" /etc/xrootd/cms.blacklist"
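
One thing to be aware of with unblacklist.sh: the sed pattern /${1}/d deletes every line that contains the prefix, so a prefix such as ceph-svc01 would also remove an entry for a host named ceph-svc010 if one were blacklisted. The sketch below shows one possible stricter match. It is an illustration only: remove_exact is a hypothetical helper that is not part of the scripts above, ceph-svc010 is a made-up host name, and the demonstration runs against a temporary file rather than the real /etc/xrootd/cms.blacklist. It assumes GNU sed (for -i).

```shell
#!/usr/bin/env bash
# Sketch only: remove_exact is a hypothetical helper, not one of the campaign scripts.
# It deletes only the line that is exactly <prefix>.gridpp.rl.ac.uk.
remove_exact() {
  local prefix="$1" file="$2"
  # The dots are escaped and the pattern is anchored with ^ and $,
  # so the prefix ceph-svc01 does not match ceph-svc010.
  sed -i "/^${prefix}\.gridpp\.rl\.ac\.uk\$/d" "$file"
}

# Demonstration on a temporary file (ceph-svc010 is an invented host name).
demo_file="$(mktemp)"
printf '%s\n' ceph-svc01.gridpp.rl.ac.uk ceph-svc010.gridpp.rl.ac.uk > "$demo_file"
remove_exact ceph-svc01 "$demo_file"
cat "$demo_file"   # prints ceph-svc010.gridpp.rl.ac.uk
```

If this approach were adopted, the same sed expression could replace the one inside the quoted remote command, with the backslashes escaped once more for the extra level of quoting.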

Procedure

  1. Check the current transfer load on the gateways through the Grafana dashboard.

  2. If the average throughput is more than 22Gb/s (>90% of maximum network capacity), do not proceed.

  3. For each host or batch of hosts that is currently in production use:

    1. Run the following command, where hostname_prefix is the part of the hostname before .gridpp.rl.ac.uk, for example ceph-svc01:

      bash blacklist.sh <hostname_prefix>

    2. Wait until the traffic drops (usually about 15 minutes).

    3. SSH into the host and run "reboot".

    4. Wait for the host to come back up (10-20 minutes).

    5. Check that the systemd services xrootd@{unified,tpc} and cmsd@unified are active and running.

    6. run