...
Apologies:
James Walder
🥅 Goals
...
Discussion items (item, presenter, notes):
**XrootD gateway architecture review** (what should XrootD access to Echo look like in a year's time?)
- /wiki/spaces/GRIDPP/pages/255262851 Ideas on xrootd batch farm architecture
- Current state: ECHO aliases
- Key questions:
  - Should we segregate S3 and WN traffic from FTS?
  - Multiple clusters? With or without shared servers? (no shared servers if possible)
- Aims:
  - Resilience: every service instance needs to be as resilient as possible; each DNS endpoint should have keepalived for redundancy and redirectors for high availability.
  - Manageability: adding/removing gateways from a service should be simple, and the overall setup should not be too complex to understand or manage; aim for a simpler, understandable config.
  - Flexibility on gateway/use case: we should be able to swap gateways between service endpoints as smoothly as possible.
- Containerising everything (shared containers across all hardware) is the preferred end state. Prerequisites: every service behind an expandable high-availability setup (xrootd cmsd managers), plus an orchestrated setup to spin up more gateways when load increases; some system resource overhead should be reserved to keep the gateways running smoothly.
- WN gateways should be kept going forwards, as they give an additional gateway's worth of capacity for every worker node.
  - They currently only handle reads over root (job operations using the xrootd.echo endpoint), because Xcache is read-only.
  - Xcache is good at what it does and reduces the number of IOPS hitting Ceph from reads; when the Xcaches were removed during the vector-read deployment, the resulting IOPS slowed the Echo storage cluster enough for it to fail.
  - Xcache can be removed if xrdceph buffers provide similar functionality (allowing R/W over the local gateway), but xrdceph buffers do not work for out-of-order reads or separate read requests (as is the case with the ALICE gateways).
  - Some sort of xrootd manager tree setup might work for WN gateway containers, similar to CMS AAA with a hierarchy for access, but the first point of contact should be highly available.
  - A single gateway failing on a worker node should not cause all of its jobs to fail; currently there is no failover built in for WN reads, so if the gateway is down all jobs on that WN will fail.
  - A functional-test-equivalent healthcheck for the WN gateway would ensure the gateway is killed and restarted, and would let condor know if the gateway is still down; that stops new jobs being sent to a WN with a broken gateway, although jobs already running there will continue (a sketch of such a healthcheck follows these notes).
  - The solution should strongly prefer a WN's own gateway; ideally there would be a fallback mechanism where a transfer tries its own gateway first and fails over to a neighbouring WN's gateway if it is unavailable.
  - cmsd is not smart enough to handle read-only and read/write servers in the same cluster (this was attempted by Sam at Glasgow during early 5.x), so there is a strong preference for having the same endpoint for reads and writes (removing Xcache); that makes the configuration simpler and lets it be managed by a cmsd redirector without issues.
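A minimal sketch of what such a healthcheck could look like, assuming a local gateway on port 1094, a systemd unit named xrootd@gateway, a functional-test object at /echo/functional-test/ping, and HTCondor startd-cron style output; all of these names are illustrative assumptions, not the production setup.

```python
#!/usr/bin/env python3
"""Sketch of a WN gateway healthcheck (assumed names and ports, not the production config).

Probes the local gateway with a small functional test, restarts it if the probe
fails, and prints a ClassAd attribute so HTCondor can stop matching new jobs to
a worker node whose gateway is still down (startd-cron style output).
"""
import subprocess
import sys

GATEWAY = "root://localhost:1094"          # assumed local gateway endpoint
TEST_FILE = "/echo/functional-test/ping"   # hypothetical functional-test object
SERVICE = "xrootd@gateway"                 # assumed systemd unit name


def gateway_ok(timeout=30):
    """Functional-test-equivalent probe: stat a known object via the local gateway."""
    try:
        subprocess.run(
            ["xrdfs", GATEWAY, "stat", TEST_FILE],
            check=True, timeout=timeout,
            stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
        )
        return True
    except (subprocess.CalledProcessError, subprocess.TimeoutExpired, FileNotFoundError):
        return False


def restart_gateway():
    """Kill and restart the gateway service; failures are caught by the re-test below."""
    subprocess.run(["systemctl", "restart", SERVICE], check=False)


def main():
    healthy = gateway_ok()
    if not healthy:
        restart_gateway()
        healthy = gateway_ok()
    # startd-cron style output: the attribute is merged into the machine ad, so a
    # START expression can require WN_GatewayHealthy before accepting new jobs.
    print(f"WN_GatewayHealthy = {healthy}")
    print("-")
    return 0 if healthy else 1


if __name__ == "__main__":
    sys.exit(main())
```

The exact restart policy and the matching START expression would need agreeing separately; the point is only that the same functional test drives both the restart and the condor-visible status.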
- Actions:
  - A: evaluate whether Xcache can be removed with xrdceph buffers enabled (measure IOPS on a single WN)
  - A: design a better solution for the gateways on the WNs (see the fallback sketch below)
  - A: create redirector managers for ALICE and S3
  - A: develop cmsd redirector capability to preferentially redirect onto a WN's own gateway, and have Xcaches included in the redirector in a mixed-gateway setup
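For the prefer-own-gateway point in the notes above, a rough sketch of the client-side fallback, assuming reads go through xrdcp and that a list of neighbouring WN gateways is known; both are assumptions, and a cmsd redirector that prefers the local host would make a wrapper like this unnecessary.

```python
#!/usr/bin/env python3
"""Sketch of a read that prefers the WN's own gateway and falls over to neighbours.

Endpoints and paths are illustrative only.
"""
import subprocess

OWN_GATEWAY = "root://localhost:1094"      # this WN's gateway (assumed port)
NEIGHBOUR_GATEWAYS = [                     # hypothetical neighbour WN gateways
    "root://wn-neighbour-01:1094",
    "root://wn-neighbour-02:1094",
]


def read_file(lfn, dest, timeout=600):
    """Try the local gateway first, then each neighbour; return the endpoint that worked."""
    for endpoint in [OWN_GATEWAY] + NEIGHBOUR_GATEWAYS:
        try:
            subprocess.run(
                ["xrdcp", "--force", f"{endpoint}/{lfn}", dest],
                check=True, timeout=timeout,
            )
            return endpoint
        except (subprocess.CalledProcessError, subprocess.TimeoutExpired):
            continue
    raise RuntimeError(f"all gateways failed for {lfn}")


if __name__ == "__main__":
    used = read_file("/echo/atlas/some/file", "/tmp/some-file")  # illustrative paths
    print(f"read served by {used}")
```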
**XRootD releases**
- 5.6.23-3 1 is out; Glasgow and Lancs have been using it (el7 and rocky8) (no cmfst post centos7)
**Checksums fixes**
- planned for deployment
- checksum server service for the external gateways (see the adler32 sketch below)
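For reference, grid transfers typically use adler32 checksums; below is a minimal sketch of the calculation a checksum service might perform, streamed so large Echo objects do not have to fit in memory. The helper name and chunk size are illustrative, not part of the planned service.

```python
#!/usr/bin/env python3
"""Minimal adler32 sketch for a checksum helper (illustrative, not the planned service)."""
import sys
import zlib


def adler32_of(path, chunk_size=64 * 1024 * 1024):
    """Stream the file in chunks so large objects never need to fit in memory."""
    value = 1  # adler32 seed value
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            value = zlib.adler32(chunk, value)
    return f"{value & 0xFFFFFFFF:08x}"


if __name__ == "__main__":
    print(adler32_of(sys.argv[1]))
```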
**Prefetch studies and WN changes** (Presenter: Alex)
- planned to resume partial deployment over the farm in the week of the 20th
**Deletion studies through RDR** (Presenter: Ian)
**CMSD rollout**
- Jira: XRD-41
- New diagram required?
- svc01, 02, 17, 18 stay as internal WN gateways for now; the other svc hosts (03, 05, 11, 13-16) are to be added to the CMSD production cluster; svc19 is designated for the ALICE gateway
- cmsd sandbox has been deployed
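A small sketch that could be run before adding the hosts above to the production cluster, checking that each one answers on the xrootd and cmsd ports; 1094 and 1213 are the usual defaults, and the short host names are copied from the notes rather than from Aquilon, so treat both as assumptions.

```python
#!/usr/bin/env python3
"""Sketch: check candidate gateway hosts respond on the xrootd and cmsd ports.

Host names are the short forms from the notes (real FQDNs omitted); 1094 (xrootd)
and 1213 (cmsd) are the usual default ports, assumed here.
"""
import socket

CANDIDATES = ["svc03", "svc05", "svc11", "svc13", "svc14", "svc15", "svc16"]
PORTS = {"xrootd": 1094, "cmsd": 1213}


def port_open(host, port, timeout=5):
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    for host in CANDIDATES:
        status = {name: port_open(host, port) for name, port in PORTS.items()}
        print(host, status)
```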
**Gateways: observations**
- the cluster was up and active, but one sn was very slow in throughput, slowing the whole cluster down enough for gateways to fail functional tests
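A sketch of the kind of per-node probe that would make a single slow member visible before it drags the whole cluster below the functional-test threshold: time a small read against each member directly rather than via the redirector. The endpoints and test object are placeholders.

```python
#!/usr/bin/env python3
"""Sketch: time a small functional-test read against each cluster member individually.

Probing each endpoint directly (not via the redirector) makes a single slow
member visible. Endpoints and the test object below are placeholders.
"""
import subprocess
import time

ENDPOINTS = ["root://gw01:1094", "root://gw02:1094"]  # hypothetical member endpoints
TEST_FILE = "/echo/functional-test/ping"              # hypothetical test object


def timed_read(endpoint, timeout=60):
    """Return the elapsed seconds for a small read, or None if it failed or timed out."""
    start = time.monotonic()
    try:
        subprocess.run(
            ["xrdcp", "--force", f"{endpoint}/{TEST_FILE}", "/dev/null"],
            check=True, timeout=timeout,
            stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
        )
    except (subprocess.CalledProcessError, subprocess.TimeoutExpired):
        return None
    return time.monotonic() - start


if __name__ == "__main__":
    for node in ENDPOINTS:
        elapsed = timed_read(node)
        print(node, "FAILED" if elapsed is None else f"{elapsed:.1f}s")
```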
**CMSD outstanding items**
- sandbox deployed
**Tokens testing**
- to liaise with the TTT (Token Trust Traceability Taskforce), aka Matt Doidge
- no update; report by end of this month
- CMS GGUS ticket for enabling token auth; deployment planned for the week of the 20th
- CC for next week
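While testing, it can help to inspect what a client token actually carries (issuer, scopes, expiry). Below is a minimal sketch that base64-decodes the JWT payload without verifying the signature, so it is only for eyeballing tokens during testing, never for authorisation decisions.

```python
#!/usr/bin/env python3
"""Sketch: decode a WLCG token (JWT) payload to inspect issuer, scope and expiry.

No signature verification is done; this is only for looking at what a client
presents while testing token auth, not for making auth decisions.
"""
import base64
import json
import sys


def decode_payload(token):
    """JWTs are three base64url segments; the middle one holds the JSON claims."""
    payload = token.split(".")[1]
    payload += "=" * (-len(payload) % 4)  # restore stripped base64 padding
    return json.loads(base64.urlsafe_b64decode(payload))


if __name__ == "__main__":
    claims = decode_payload(sys.argv[1].strip())
    for key in ("iss", "sub", "scope", "exp", "aud"):
        print(key, "=", claims.get(key))
```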
**AAA Gateways**
- sandbox ready for review: http://aquilon.gridpp.rl.ac.uk/sandboxes/diff.php?sandbox=jw-xrootd-aaa-5.5.4-3
- needs a bigger discussion regarding tokens, and deployment to production hosts
- to be reviewed and deployed this week
**SKA Gateway box**
- /wiki/spaces/UK/pages/215941180
- ongoing network cleanup to access Deneb
**Containerised gateways (kubernetes cluster)**
- working, but still needs a few bugs ironing out and scaling up
on GGUS:
Site reports
Lancaster - Not much more; updating to the latest xrootd broke scitokens, as the scitokens package also needs updating (done manually).
Glasgow - 5.6.3 on rocky8; needs to do the redirector on internal gateways, and is also updating the xrootd and xrdceph versions. Internal gateways were using up all memory (64 GB RAM plus swap); plan to upgrade the RAM. Swap is 0 at the moment; need to switch swap off on reboot. Newer versions of Ceph are more determined to use resources.
✅ Action items
⤴ Decisions
...