Proposed solution
1. Basic Information
Title | Temporary alias for xrootd internal traffic |
Submitted by | Jyothish Thomas |
Requested by | Alastair Dewhurst |
Summary | Problem: The new gateways do not have external IPv6 access because pending network interventions have delayed their deployment for multiple weeks. The current external gateways are under heavy load from the current traffic, which causes functional test and job failures. Proposed solution: Add an additional DNS round-robin alias (internal.echo.stfc.ac.uk) that maps to the gateways pending deployment. This can be done by assigning each job container to a random gateway in the above list. Direct transfers would take place without issues, and if the jobs perform third-party copy (TPC), the traffic should go over IPv4, as the job containers are IPv4-only. |
Urgency | Urgent |
Impact of successfully implementing the change | Worker node load affecting the external xrootd gateways will be diverted to a set of currently unused gateways, thereby reducing load-related issues in production |
Consultation |
|
Type of Change |
|
Link to Change Control master ticket (RT or JSM) |
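The gateway assignment described in the Summary can be sketched as follows. This is a minimal illustration, not the deployed mechanism: it resolves the alias to its IPv4 addresses only (the job containers are IPv4-only) and picks one at random. The port number 1094 (the usual XRootD default) and the helper names `resolve_ipv4` and `pick_gateway` are assumptions made for this example.

```python
import random
import socket

ALIAS = "internal.echo.stfc.ac.uk"  # alias name taken from the Summary
XROOTD_PORT = 1094  # assumption: default XRootD port, not stated in this form


def resolve_ipv4(alias, port=XROOTD_PORT, resolver=socket.getaddrinfo):
    """Return the sorted, de-duplicated IPv4 addresses behind a round-robin alias."""
    # AF_INET restricts the lookup to A records (IPv4).
    infos = resolver(alias, port, socket.AF_INET, socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, (address, port)).
    return sorted({info[4][0] for info in infos})


def pick_gateway(addresses, rng=random):
    """Pick one gateway address uniformly at random (hypothetical helper)."""
    if not addresses:
        raise ValueError("alias resolved to no IPv4 addresses")
    return rng.choice(addresses)
```

A job wrapper could call `pick_gateway(resolve_ipv4(ALIAS))` once at start-up and reuse the result for the lifetime of the container. The `resolver` parameter exists only so the lookup can be replaced in tests.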
2. Likelihood of Problems Occurring
Details of testing carried out | After creating the alias and mapping the new gateways to it, functional tests will be run on the alias. VOs can also run their functional tests targeting the alias. |
Further tests required prior to implementation |
|
Deployed/tested at other WLCG/EGEE site? |
|
Can be phased in stages? |
|
Implementation plan |
|
Post implementation testing |
|
Reversion plan in case of problems | Delete the iptables rule |
Has this been successfully reviewed with the production team against the new service ticklist? (This should be done for significant changes to services too.) |
|
3. Residual risks
Residual risk 1 | One gateway going down can affect a percentage of jobs (round-robin-related risk) |
Residual risk 2 | The IPv6 change might take longer than expected (or other problems may occur) |
Residual risk 3 |
|
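As a rough guide to the size of residual risk 1, the share of jobs affected by a gateway outage can be estimated as below. This is a sketch under stated assumptions: jobs are spread uniformly across the gateways behind the alias, and a job pinned to a failed gateway does not retry another address. The function name `affected_fraction` is hypothetical.

```python
def affected_fraction(total_gateways, gateways_down=1):
    """Estimate the fraction of jobs assigned to a failed gateway.

    Assumes a uniform spread of jobs over gateways and no client-side retry.
    """
    if total_gateways <= 0:
        raise ValueError("total_gateways must be positive")
    if not 0 <= gateways_down <= total_gateways:
        raise ValueError("gateways_down must be between 0 and total_gateways")
    return gateways_down / total_gateways
```

For example, with four gateways behind the alias and one down, the estimate is a quarter of jobs; adding gateways to the alias lowers the share proportionally.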
4. Impact of problems if they occur
Taking into account the risks described above:
Affected components | batch farm, external gateways |
VOs likely to be affected | all VOs running jobs |
Impact on existing data | none |
Impact on subsequent data | none |