Proposed solution

1.     Basic Information

Title

 Temporary alias for xrootd internal traffic

Submitted by

 Jyothish Thomas

Requested by

 Alastair Dewhurst

Summary

Problem

The new gateways do not yet have external IPv6 access because of pending network interventions, which have delayed their deployment by multiple weeks. Meanwhile, the current external gateways are under heavy load from existing traffic, which is causing functional test and job failures.
The new gateways are:
ceph-svc11.gridpp.rl.ac.uk
ceph-svc13.gridpp.rl.ac.uk
ceph-svc14.gridpp.rl.ac.uk
ceph-svc15.gridpp.rl.ac.uk
ceph-svc17.gridpp.rl.ac.uk
ceph-svc18.gridpp.rl.ac.uk

ceph-svc01.gridpp.rl.ac.uk
ceph-svc02.gridpp.rl.ac.uk

Proposed solution

Add an additional DNS round-robin alias (internal.echo.stfc.ac.uk) that maps to the gateways pending deployment.
Add a routing rule on the batch farm that redirects xrootd.echo.stfc.ac.uk and webdav.echo.stfc.ac.uk traffic to this alias instead (similar to the current redirection to the workernode gateway container).

This can be done by assigning each job container to a random gateway in the above list.
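
As an illustration of the random assignment described above, here is a minimal Python sketch. It is not the batch farm's actual mechanism: the function names and the notion of a container ID are hypothetical, and the list assumes that all eight hostnames given under Problem are intended as targets.

```python
import hashlib
import random

# Hostnames copied from the Problem section. Including ceph-svc01 and
# ceph-svc02 is an assumption: they appear after a gap in the original list.
GATEWAYS = [
    "ceph-svc11.gridpp.rl.ac.uk",
    "ceph-svc13.gridpp.rl.ac.uk",
    "ceph-svc14.gridpp.rl.ac.uk",
    "ceph-svc15.gridpp.rl.ac.uk",
    "ceph-svc17.gridpp.rl.ac.uk",
    "ceph-svc18.gridpp.rl.ac.uk",
    "ceph-svc01.gridpp.rl.ac.uk",
    "ceph-svc02.gridpp.rl.ac.uk",
]


def pick_gateway(gateways=GATEWAYS, rng=random):
    """Return one gateway chosen uniformly at random."""
    return rng.choice(gateways)


def pick_gateway_for(container_id, gateways=GATEWAYS):
    """Hypothetical deterministic variant: hash a container ID to a gateway.

    The same container ID always maps to the same gateway, which can make
    failures easier to trace back to a specific host.
    """
    digest = hashlib.sha256(container_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(gateways)
    return gateways[index]


if __name__ == "__main__":
    print(pick_gateway())
    print(pick_gateway_for("example-job-container"))
```

Either variant spreads containers roughly evenly across the list. The hashed variant trades randomness for reproducibility, which may or may not be wanted here.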

Direct transfers would take place without issues. If jobs perform third-party copy (TPC), the traffic should go over IPv4, as the job containers are IPv4-only.

Urgency

 Urgent

Impact of successfully implementing the change

 Workernode load currently affecting the external xrootd gateways will be diverted to a set of currently unused gateways, thereby reducing load-related issues in production.

Consultation

 

Type of Change

 

Link to Change Control master ticket (RT or JSM)

 XRD-74

2.     Likelihood of Problems Occurring

Details of testing carried out

 After creating the alias and mapping the new gateways to it, functional tests will be run on the alias. VOs can also run their functional tests targeting the alias.
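
 Before running the functional tests, it may be worth confirming that the alias resolves to the intended gateways. The Python sketch below is one possible check, not part of the agreed test plan: `check_alias` and `resolve_addresses` are hypothetical helper names, and the comparison assumes the alias should resolve to exactly the addresses of the listed gateways.

```python
import socket


def resolve_addresses(hostname):
    """Return the set of IP addresses that a hostname resolves to."""
    infos = socket.getaddrinfo(hostname, None)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the address string for both IPv4 and IPv6.
    return {info[4][0] for info in infos}


def check_alias(alias, gateways, resolve=resolve_addresses):
    """Compare the alias's addresses with those of the expected gateways.

    Returns the addresses that are missing from the alias and the
    addresses the alias has that no listed gateway owns.
    """
    alias_addrs = resolve(alias)
    expected = set()
    for gateway in gateways:
        expected |= resolve(gateway)
    return {
        "missing": expected - alias_addrs,
        "unexpected": alias_addrs - expected,
    }


if __name__ == "__main__":
    # Requires DNS access to the site's records; shortened list for brevity.
    result = check_alias(
        "internal.echo.stfc.ac.uk",
        ["ceph-svc11.gridpp.rl.ac.uk", "ceph-svc13.gridpp.rl.ac.uk"],
    )
    print(result)
```

 Two empty sets would indicate that the alias and the gateway list agree. The `resolve` parameter is injectable so the comparison can be exercised without live DNS.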

Further tests required prior to implementation

 

Deployed/tested at other WLCG/EGEE site?

 

Can be phased in stages?

 

Implementation plan

 

Post implementation testing

 

Reversion plan in case of problems

 Delete the iptables rule on the batch farm that redirects traffic to the alias.

Has this been successfully reviewed with the production team against the new service ticklist?

(This should be done for significant changes to services too).

 

3.     Residual risks

Residual risk 1

 One gateway going down can affect a percentage of jobs (round-robin related risks).

Residual risk 2

 The IPv6 change might take longer than expected (or other problems may occur).

Residual risk 3

 

4.     Impact of problems if they occur

Taking into account the risks described above:

Affected components

 Batch farm, external gateways

VOs likely to be affected

 All VOs running jobs

Impact on existing data

 none

Impact on subsequent data

 none
