
XRootD Releases - Thomas, Jyothish (STFC,RAL,SC)

5.6.2-2 is out.
Testing status at RAL.

Prefetch studies - Alex

To be temporarily rolled back, given the ongoing work on the batch farm WNs.

Deletion studies through RDR - Ian

ATLAS concern over deletion rate for DC24.

JW:

The DC24 ATLAS expected _average_ deletion rate from RAL storage will be ~40-60k files per hour. Considering the history of Echo deletion performance issues [1], please make sure that everything works well up to these rates.


Can we cope with this rate (assuming additional gateways) without fundamental changes?
Is a re-architecture of how deletions are performed needed (either for the DC, or towards HL-LHC)?
Both total throughput and per-file deletion times need to be considered.

The nominal rate for ATLAS therefore works out to ~20 Hz; a quick check of the conversion follows.
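
A minimal sanity check of that conversion (plain Python, nothing site-specific assumed):

    # Convert the quoted DC24 deletion rates from files/hour to deletions/s.
    for files_per_hour in (40_000, 60_000):
        print(f"{files_per_hour} files/hour = {files_per_hour / 3600:.1f} deletions/s")
    # 40000 files/hour = 11.1 deletions/s
    # 60000 files/hour = 16.7 deletions/s

So 40-60k files/hour is ~11-17 deletions/s, and ~20 Hz is a reasonable rounded-up planning figure.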

Production deletion times (from recent logs), including only the time within Ceph and not the xrootd and client RTT (values in seconds):

    count    167980
    mean          2.951339
    std           5.467953
    min           0.015000
    25%           0.282000
    50%           0.570000
    75%           3.486000
    max         271.880000
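
A minimal sketch of how such a summary can be produced, and what the measured mean implies for concurrency. The log-parsing step is omitted, and 'deletion_times_s' is a hypothetical stand-in for the per-file Ceph-side durations extracted from the gateway logs:

    import pandas as pd

    # Hypothetical input: per-file deletion durations in seconds,
    # as parsed from the gateway logs (Ceph time only, no client RTT).
    deletion_times_s = pd.Series([0.282, 0.570, 3.486, 0.015, 271.880])

    print(deletion_times_s.describe())  # count / mean / std / quartiles, as above

    # Little's law: sustaining R deletions/s at a mean service time of S seconds
    # needs roughly R * S deletions in flight across all gateways.
    target_rate_hz = 20     # nominal DC24 ATLAS rate from above
    mean_service_s = 2.95   # measured mean from the summary above
    print(f"~{target_rate_hz * mean_service_s:.0f} concurrent deletions in flight")

At the measured mean of ~2.95 s per deletion, sustaining 20 Hz implies roughly 60 deletions in flight at any one time, which bears directly on the gateway-count and re-architecture questions above.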


CMSD rollout

Jira: XRD-41

Future Architecture of Data access on WNs

VOs have been asked to provide input on their requirements / use cases.

Gateways: observations

Worker-node write traffic was temporarily redirected to the gateways on the new network. Results look promising: initial testing of three generations against one gateway produced 40k uploads over 1.5 days with only one failure, caused by an expired certificate proxy. This change will be reverted once external IPv6 is available on the new network, but a future separation of job and FTS traffic seems sensible.

CMSD outstanding items

  • Icinga / Nagios callout test changes: live and available.

  • Improved load balancing / server failover triggering.

  • Better 'rolling server restart' script.

  • Documentation: setup / configuration / operations / troubleshooting / testing.

Review of sandbox and deployment to prod:
- Initial review spotted a requirement to split the feature, to provide a non-CMSD version.

  • New feature (a copy of the existing prod version) made; needs testing after adding 'named variable substitution' into the xrootd config script.

  • cms feature: add the 'named variable substitution' and finalise the review.

Tokens testing

NTR

AAA Gateways

Sandbox ready for review:

http://aquilon.gridpp.rl.ac.uk/sandboxes/diff.php?sandbox=jw-xrootd-aaa-5.5.4-3

SKA Gateway box

/wiki/spaces/UK/pages/215941180

Now working, using the SKA pool on Ceph dev.

Initial iperf3 tests (see table and plots below; a minimal test sketch follows the actions list):

  • Actions

    • Ensure Xrootd01 is tuned correctly, according to the NVIDIA / Mellanox instructions

    • Repeat the iperf tests

  • Xrootd tests against:

    • dev-echo

    • cephfs (Deneb dev)

    • cephfs (openstack; permissions/routing issues)?

    • local disk / mem

  • Frontend routing is also being worked on
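
A minimal sketch of one such single-stream test, matching the 'naive iperf3' column in the table below. It assumes iperf3 is installed, that an 'iperf3 -s' server is already listening on the destination host, and that the hostname shown is a placeholder:

    import json
    import subprocess

    def single_stream_gbps(server: str, seconds: int = 10) -> float:
        """Run a single-stream iperf3 test against `server`; return received Gb/s."""
        result = subprocess.run(
            ["iperf3", "-c", server, "-P", "1", "-t", str(seconds), "-J"],
            check=True, capture_output=True, text=True,
        )
        end = json.loads(result.stdout)["end"]
        return end["sum_received"]["bits_per_second"] / 1e9

    print(single_stream_gbps("ceph-sn1053.example.org"))  # placeholder hostname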

extra gateways deployment

… awaiting networking updates; 4 being repurposed for internal (mostly) writes …

Correlation observed between 'spikes' on the new internal gateways and additional jobs run by particular VOs.

ALICE WN gateways

(Birmingham is using EOS; Oxford has no storage.)

Relationship to OSD issues?

Test                     | Src                            | Dest                           | Thr [Gb/s] (single stream, naive iperf3)
Gateway ↔ SN             | Xrootd01 [10.16.190.4]         | *Ceph-sn1053 [130.246.177.167] | 11.8
Gateway ↔ SN             | *Ceph-sn1053 [130.246.177.167] | Xrootd01 [10.16.190.4]         | 19.6
Gateway ↔ SN             | *Xrootd01 [10.16.190.4]        | Ceph-sn1053 [130.246.177.167]  | 11.3
Gateway ↔ SN             | Ceph-sn1053 [130.246.177.167]  | *Xrootd01 [10.16.190.4]        | 23.0
Gateway ↔ SN             | *Xrootd01 [10.16.190.4]        | Ceph-sn1128 [130.246.178.98]   | 14.1
Gateway ↔ SN             | Ceph-sn1128 [130.246.178.98]   | *Xrootd01 [10.16.190.4]        | 23.1
SN ↔ SN                  |                                |                                | ~12-14
Jasmin gpuhost ↔ gpuhost |                                |                                | 20-25 (perhaps untuned 100 Gb/s links)
gpuhost ↔ xrootd01       |                                |                                | 50 (gpu → xrd), 25 (xrd → gpu)

SN → SN window scaling:

...

Best practice document for Ceph configuration?

e.g. autoscaling features?

on GGUS:

Site reports

Lancaster - Nothing exciting going on.

Glasgow

✅ Action items

⤴ Decisions
