Temporary alias for xrootd internal traffic

Proposed solution

1. Basic Information

Title	Temporary alias for xrootd internal traffic
Submitted by	Jyothish Thomas
Requested by	Alastair Dewhurst
Summary	Problem: The new gateways do not have external IPv6 access because pending network interventions have delayed their deployment for multiple weeks. The current external gateways are under heavy load from existing traffic, which is causing functional test and job failures. The new gateways are:
ceph-svc11.gridpp.rl.ac.uk
ceph-svc13.gridpp.rl.ac.uk
ceph-svc14.gridpp.rl.ac.uk
ceph-svc15.gridpp.rl.ac.uk
ceph-svc17.gridpp.rl.ac.uk
ceph-svc18.gridpp.rl.ac.uk
ceph-svc01.gridpp.rl.ac.uk
ceph-svc02.gridpp.rl.ac.uk
Proposed solution: Add an additional DNS round-robin alias (internal.echo.stfc.ac.uk) that maps to the gateways pending deployment. Add a routing rule on the batch farm to redirect xrootd.echo.stfc.ac.uk and webdav.echo.stfc.ac.uk traffic to this alias instead (similar to the current redirection to the worker node gateway container). This can be done by assigning each job container to a random gateway from the list above. Direct transfers would take place without issues, and if jobs perform TPC (third-party copy), the traffic should go over IPv4 because the job containers are IPv4-only.
Urgency	Urgent
Impact of successfully implementing the change	Worker node load currently affecting the external xrootd gateways will be diverted to a set of currently unused gateways, thereby reducing load-related issues in production.
Consultation
Type of Change
Link to Change Control master ticket (RT or JSM)	XRD-74
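
The random per-container assignment described in the summary could look roughly like the sketch below. It is illustrative only: the names `NEW_GATEWAYS` and `assign_gateways` and the container identifiers are hypothetical, and the actual redirection on the batch farm is done by a routing rule that this form does not specify in detail.

```python
import random

# Gateways pending deployment, as listed in the summary.
NEW_GATEWAYS = [
    "ceph-svc11.gridpp.rl.ac.uk",
    "ceph-svc13.gridpp.rl.ac.uk",
    "ceph-svc14.gridpp.rl.ac.uk",
    "ceph-svc15.gridpp.rl.ac.uk",
    "ceph-svc17.gridpp.rl.ac.uk",
    "ceph-svc18.gridpp.rl.ac.uk",
    "ceph-svc01.gridpp.rl.ac.uk",
    "ceph-svc02.gridpp.rl.ac.uk",
]


def assign_gateways(container_ids, seed=None):
    """Map each job container to one uniformly random gateway.

    Hypothetical helper: `container_ids` is any iterable of identifiers;
    `seed` makes the assignment reproducible for testing.
    """
    rng = random.Random(seed)
    return {cid: rng.choice(NEW_GATEWAYS) for cid in container_ids}


if __name__ == "__main__":
    # Example container names are placeholders, not real job containers.
    for cid, gateway in assign_gateways(["job-a", "job-b", "job-c"]).items():
        print(cid, "->", gateway)
```

Choosing the gateway once per container, rather than per connection, keeps all of a job's traffic on a single gateway, which matches the wording of the proposal.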

2. Likelihood of Problems Occurring

Details of testing carried out	After creating the alias and mapping the new gateways to it, functional tests will be run on the alias. VOs can also run their functional tests targeting the alias.
Further tests required prior to implementation
Deployed/tested at other WLCG/EGEE site?
Can be phased in stages?
Implementation plan
Post implementation testing
Reversion plan in case of problems	Delete the iptables rule that redirects batch farm traffic to the alias.
Has this been successfully reviewed with the production team against the new service ticklist? (This should be done for significant changes to services too.)
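
As a rough pre-check before running functional tests against the alias, the sketch below resolves the alias over IPv4 and probes each returned address for TCP reachability. It is a minimal illustration, not the actual functional test: port 1094 is assumed because it is the default xrootd port, and the function names are hypothetical. A successful TCP connection only shows that a gateway is listening, not that transfers work.

```python
import socket

ALIAS = "internal.echo.stfc.ac.uk"
XROOTD_PORT = 1094  # assumption: gateways listen on the default xrootd port


def resolve_ipv4(hostname, port=XROOTD_PORT):
    """Return the sorted, de-duplicated IPv4 addresses behind a DNS name."""
    infos = socket.getaddrinfo(hostname, port, socket.AF_INET, socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})


def port_open(address, port=XROOTD_PORT, timeout=5.0):
    """Return True if a TCP connection to address:port succeeds."""
    try:
        with socket.create_connection((address, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_alias(alias=ALIAS, resolve=resolve_ipv4, probe=port_open):
    """Map every address behind the alias to its reachability.

    `resolve` and `probe` are injectable so the check can be exercised
    without network access.
    """
    return {address: probe(address) for address in resolve(alias)}


if __name__ == "__main__":
    for address, ok in check_alias().items():
        print(address, "reachable" if ok else "UNREACHABLE")
```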

3. Residual risks

Residual risk 1	One gateway going down can affect a percentage of jobs (round-robin related risk).
Residual risk 2	The IPv6 network change might take longer than expected (or other problems may occur).
Residual risk 3
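
To put a rough number on residual risk 1: if job containers are spread uniformly at random across the eight new gateways, losing one gateway would affect about one eighth of jobs. The sketch below works this out; uniform assignment is an assumption, the helper name `share_affected` is hypothetical, and real traffic may be distributed less evenly.

```python
from fractions import Fraction


def share_affected(total_gateways, gateways_down=1):
    """Expected fraction of jobs pinned to a failed gateway.

    Assumes each job container is assigned to a gateway uniformly at random.
    """
    if total_gateways <= 0 or not 0 <= gateways_down <= total_gateways:
        raise ValueError("need 0 <= gateways_down <= total_gateways, total > 0")
    return Fraction(gateways_down, total_gateways)


if __name__ == "__main__":
    share = share_affected(8)  # eight gateways listed in the summary
    print(f"{float(share):.1%} of jobs affected by one gateway outage")
```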

4. Impact of problems if they occur

Taking into account the risks described above:

Affected components	batch farm, external gateways
VOs likely to be affected	all VOs running jobs
Impact on existing data	none
Impact on subsequent data	none