Background

Current status of the gateways:
cephgwmachineoverview.xlsx

There are 17 gws in the xrootd rdr (redirector) cluster; 4 of them are shared with s3 and 7 are shared with gridftp.
4 gateways are temporarily assigned exclusively to WN traffic, following this emergency CC:

/wiki/spaces/GRIDPP/pages/252346459

"WN exclusive gws" refers to the external gws ceph-svc01, 02, 17 and 18.

The change plan

  1. Remove gw[4-7] from the cluster and add the 4 gateways that are currently WN exclusive (svc01,02,17,18)

    1. This keeps the number of gateways in the rdr cluster the same and lets FTS and other traffic use those gws alongside the batch farm.

    2. If necessary, this can be staggered by doing 2 gws in the morning and 2 at noon.

    3. An emergency revert only involves restoring the cmsd blacklist to its previous state (adding gw4-7 back to the cluster and removing the exclusive gws from it).

  2. (Optional) Remove gw4-7 from the gridftp.echo round robin

    1. This will keep svc97-99 as gridftp gws.

    2. There is no current need for 7 gws in gridftp, as each has <10 active connections at any time.

  3. Remove the DNS poisoning for batch farm uploads

    1. This will spread the WN load over all available gws and make batch farm uploads more resilient.

  4. Create an s3 personality without xrootd (and gridftp, if removed) services for gw4-7, and make those hosts with it

    1. This is a sanity change that formally separates s3 from the rest of the gws and finalizes the change.
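
The membership change in step 1, its staggered variant and the emergency revert can be sketched as set operations. This is only an illustration, not the actual cmsd blacklist procedure. The host labels `gw4`-`gw7` and `svc01`, `svc02`, `svc17`, `svc18` are shortened from the plan, the `otherNN` names are placeholders for the 13 cluster members the plan does not name, and pairing 2 removals with 2 additions per batch is an assumption about how the staggering would be done.

```python
# Sketch of step 1 as set operations. Host labels are illustrative only.
S3_GWS = frozenset({"gw4", "gw5", "gw6", "gw7"})                # named in the plan
WN_EXCLUSIVE = frozenset({"svc01", "svc02", "svc17", "svc18"})  # named in the plan
# Placeholders (hypothetical names) for the 13 other members of the 17-gw cluster.
OTHER_GWS = frozenset(f"other{i:02d}" for i in range(1, 14))


def swap_members(cluster, remove, add):
    """Return a new cluster with `remove` taken out and `add` put in."""
    missing = remove - cluster
    if missing:
        raise ValueError(f"not in cluster: {sorted(missing)}")
    already = add & cluster
    if already:
        raise ValueError(f"already in cluster: {sorted(already)}")
    return (cluster - remove) | add


def staggered_batches(remove, add, batch_size=2):
    """Pair removals with additions in batches of `batch_size` hosts each."""
    out_hosts, in_hosts = sorted(remove), sorted(add)
    if len(out_hosts) != len(in_hosts):
        raise ValueError("staggering assumes equal numbers of removals and additions")
    return [
        (frozenset(out_hosts[i:i + batch_size]), frozenset(in_hosts[i:i + batch_size]))
        for i in range(0, len(out_hosts), batch_size)
    ]


before = S3_GWS | OTHER_GWS

# Apply the change in two batches (for example morning and noon).
after = before
for batch_out, batch_in in staggered_batches(S3_GWS, WN_EXCLUSIVE):
    after = swap_members(after, batch_out, batch_in)

# Emergency revert: put gw4-7 back and take the exclusive gws out again.
reverted = swap_members(after, WN_EXCLUSIVE, S3_GWS)

print(len(before), len(after), reverted == before)
```

The cluster size is 17 before and after the swap, and the revert restores the original membership exactly, which is the property step 1 relies on.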

Reasons for change

  1. The exclusive WN gateways were assigned during a period when we had fewer and older gws (11) trying to handle both FTS and WN traffic. The redirectors can balance the load, but the overall load generated was too high and all gateways ended up overloaded.
    We now have more gateways, and the total number of gateways in use will not change: there will still be 17 gws handling both batch farm and FTS traffic.

  2. Currently if any one of the WN exclusive gws goes down, it takes a quarter of the farm with it. The setup has no redundancy or failover built in.

  3. This change gets us to a stable state while the gw architecture is being decided on.
    Any further major architecture change is unlikely to take place until January, and the current setup should not stay in place until then, as it relies heavily on manual intervention if anything goes wrong on the WN exclusive gws.

  4. Have dedicated gateways for S3 traffic. We've had to tell S3 users to point their processes at a single gateway that gets a bit less traffic in order to keep things going, because S3 was getting crowded out by XrootD. Cloud Team have been asking for this.

  5. No issue should arise from mixing the WN and FTS loads. If any do occur, they will affect decisions made for the new architecture and should be known beforehand.
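
The redundancy argument in reason 2 comes down to simple arithmetic, sketched below. It assumes load is spread evenly across gateways, which is an idealisation, and the function name `share_lost` is purely illustrative. In the current setup the lost share is a hard outage for the affected WNs, since there is no failover. In the planned setup it would be a reduction in capacity, provided the redirectors steer clients to the remaining gws.

```python
from fractions import Fraction


def share_lost(total_gateways, failed=1):
    """Fraction of gateway capacity lost when `failed` of `total_gateways` go down.

    Assumes an even spread of load over the gateways (an idealisation).
    """
    if total_gateways <= 0 or not 0 <= failed <= total_gateways:
        raise ValueError("need 0 <= failed <= total_gateways and total_gateways > 0")
    return Fraction(failed, total_gateways)


current = share_lost(4)    # 4 WN exclusive gws, one goes down
planned = share_lost(17)   # 17 gws in the rdr cluster, one goes down

print(f"current: {current} (~{float(current):.0%})")
print(f"planned: {planned} (~{float(planned):.0%})")
```

Under these assumptions a single failure costs a quarter of the farm today, against roughly one seventeenth of the shared capacity after the change.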
