2023-09-21 Meeting Notes

 Date

Sep 21, 2023

 Participants

  •  

  •  

Apologies:

 

 Goals

  • List of Epics

  • New tickets

  • Consider new functionality / items

  • Detailed discussion of important topics

  • Site report activity

 

 Discussion topics

Current status of Echo Gateways / WNs testing

Recent sandboxes for review / deployments:

 

Item

Presenter

Notes

 

 

XRootD Releases

 

5.6.2-2 is out
Testing status at RAL @Thomas, Jyothish (STFC,RAL,SC)

 

Prefetch studies

Alex

(temporarily to be rolled back, with the ongoing work in batch farm WNs)

 

Deletion studies through RDR

Ian

 

 

ATLAS concern over deletion rate for DC24

JW

DC24 ATLAS expected _average_ deletion rate from RAL storage will be ~ 40-60k files per hour. Considering history of Echo deletions performance issues [1] could you please make sure that everything works fine up to these rates.

Can we cope with this rate (assuming additional gateways) without fundamental changes?
Is a re-architecture of how deletions are performed needed (either for DC, or towards HL-LHC)?
Total throughput and per-file deletion times to be considered.

The nominal deletion rate assumed for ATLAS is therefore ~ 20 Hz.
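As a quick check of the arithmetic behind the nominal rate, the quoted 40-60k files per hour can be converted to deletions per second. This is only a sketch: the function name is illustrative, and the ~ 20 Hz figure is treated as a rounded-up working number rather than an exact conversion.

```python
SECONDS_PER_HOUR = 3600

def files_per_hour_to_hz(files_per_hour: float) -> float:
    """Convert an hourly file count into an average rate in Hz."""
    return files_per_hour / SECONDS_PER_HOUR

# Lower and upper ends of the expected ATLAS average deletion rate.
low_hz = files_per_hour_to_hz(40_000)
high_hz = files_per_hour_to_hz(60_000)

print(f"{low_hz:.1f} Hz to {high_hz:.1f} Hz")  # 11.1 Hz to 16.7 Hz
```

The exact average therefore sits below ~ 20 Hz, which leaves some headroom over the expected range.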

Production deletion times from recent logs, in seconds (only the time within Ceph is included, not the XRootD and client RTT):

count  167980
mean   2.951339
std    5.467953
min    0.015000
25%    0.282000
50%    0.570000
75%    3.486000
max    271.880000
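One rough way to relate these per-file times to the target rate is Little's law (mean number in flight = arrival rate x mean time in system). The sketch below assumes the ~ 20 Hz nominal rate quoted above and the logged mean time within Ceph both hold at once. It ignores the XRootD and client RTT and the long tail of the distribution, so it is an indicative figure, not a capacity plan. The function name is illustrative.

```python
MEAN_DELETE_SECONDS = 2.951339  # mean time within Ceph, from the logged statistics
NOMINAL_RATE_HZ = 20.0          # nominal ATLAS deletion rate from the notes

def deletions_in_flight(rate_hz: float, mean_seconds: float) -> float:
    """Little's law: average number of deletions being processed concurrently."""
    return rate_hz * mean_seconds

in_flight = deletions_in_flight(NOMINAL_RATE_HZ, MEAN_DELETE_SECONDS)

# Rate a single strictly serial deletion stream could sustain at the mean time.
serial_rate_hz = 1.0 / MEAN_DELETE_SECONDS

print(f"~{in_flight:.0f} deletions in flight on average")
print(f"~{serial_rate_hz:.2f} Hz per serial stream")
```

Under these assumptions, sustaining the nominal rate requires on the order of 60 deletions to be in progress concurrently across the gateways.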

 

 

 

CMSD rollout

 

https://stfc.atlassian.net/browse/XRD-41

 

Future Architecture of Data access on WNs

 

VOs asked to provide input on their requirements / use cases

 

Gateways: observations

 

Worker node write traffic has been temporarily redirected to gateways on the new network. Results look promising: initial testing of 3 generations to one gateway resulted in 40k uploads over 1.5 days, with only 1 failure, which was due to an expired certificate proxy. This change will be reverted once external IPv6 is available on the new network, but future separation of job and FTS traffic seems sensible.

 

CMSD outstanding items

 

Icinga / nags callout test changes: live and available

Improved load balancing / server failover triggering -

Better 'rolling server restart' script

Documentation; setup / configuration / operations / troubleshooting / testing

Review of Sandbox and deployment to prod:
- Initial review spotted requirement to split the feature to have a non-CMSD version.

  • New feature (a copy of the existing prod version) has been made, but it needs testing once ‘named variable substitution’ has been added to the xrootd config script

  • cms feature: Add in the ‘named variable substitution’ and finalise the review.

 

Tokens testing

 

NTR

 

AAA Gateways

 

Sandbox ready for review:

http://aquilon.gridpp.rl.ac.uk/sandboxes/diff.php?sandbox=jw-xrootd-aaa-5.5.4-3

 

SKA Gateway box

 

https://stfc.atlassian.net/wiki/spaces/UK/pages/215941180

now working using ska pool on ceph dev

Initial iperf3 tests (see table and plots below):

  • Actions

    • Ensure Xrootd01 is tuned correctly, according to the NVIDIA / Mellanox instructions

    • Repeat the iperf tests

  • Xrootd tests against:

    • dev-echo

    • cephfs (Deneb dev)

    • cephfs (openstack; permissions/routing issues)?

    • local disk / mem

  • Frontend routing is also being worked on

 

 

extra gateways deployment

 

… awaiting networking updates; 4 being repurposed for internal (mostly) writes …

Correlation between 'spikes' on the new internal gateways and additional jobs being run by particular VOs.

 

 

ALICE WN gateways

 

(Birmingham using eos, Oxford no storage)

Relationship to OSD issues ?

 

Best practice document for Ceph configuration?

 

e.g. autoscaling features ?

 

 

on GGUS:

Site reports

Lancaster - .

Glasgow

 Action items

  •  

 

 Decisions