Notes from planning meeting 22-04-2024
Priority Actions:
James (50%): 100Gb/s Gateways
Brij: Deletions – can we scale or do we need async?
Alex: Writable WN gateways
Jyothish: Containerized XRootD
Ian: Improving buffer layer in XrdCeph
Development needed to ensure that we can cope with much higher data rates (i.e. 100Gb/s NIC)
Long term sustainability of code base (Libradosstriper, ALICE authz plugin)
Data Safety –
Correct Operation, verified by (multiple?) checksums to assure integrity
Deletions – satisfy VO requirements
Availability –
Cope with varying load – containerise?
Usability –
support for multiple authN methods
??
Sustainability –
utilisation of resources (human effort, energy usage, etc.)
Q: Development scope of knowledge
ALICE authN plugin availability?
Virtual placement improvements? hybrid cache/noncache cluster in cmsd
improving buffer layer in XrdCeph
On the fly checksumming/ deletion
writable WN gateways (separate WN and external GW traffic)
containerised XRootD (Kubernetes orchestration, autoscaling)
correct port access; moving caching to XRootD/Ceph?
host certs for containers
tuning vector reads
performance monitoring
testing framework (Mariam is working on this)
redirector level op summary
log file processing/scraping
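For the redirector-level op summary and log scraping items above, a minimal sketch of the idea in Python. The log-line pattern here is a placeholder, not a real XRootD format – real log layouts differ by component and configuration, so the regex would need adjusting to actual output:

```python
import re
from collections import Counter

# Hypothetical log-line pattern (date, time, then an operation keyword).
# Real XRootD logs differ; adjust the regex to the actual format.
LINE_RE = re.compile(r"\d{6} \d{2}:\d{2}:\d{2} .*? (open|close|read|write|unlink) ")

def op_summary(lines):
    """Count operations per type across an iterable of log lines."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.search(line)
        if m:
            counts[m.group(1)] += 1
    return counts

sample = [
    "240422 10:00:01 1234 xrootd: open /store/file1",
    "240422 10:00:02 1234 xrootd: read /store/file1",
    "240422 10:00:03 1234 xrootd: close /store/file1",
]
assert op_summary(sample) == {"open": 1, "read": 1, "close": 1}
```

The same counting approach would aggregate across gateways to give the per-redirector summary.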
=====================================================================
Notes from 16/01
Tier-1 100G testing with SKA; can be temporarily put into prod to check load patterns
writable WN gateways to test – writes go to the WN gateway; root-protocol writes to be tested, as they have historically not been reliable
containerized XRootD – possibly start with an IPv6-only cluster; IPv4 can be added later once eBGP is available or through an external LB.
checksumming – on-the-fly checksumming POC; speed not much different. Adler32 implementation – where to put this? Edge cases, validation; crc32c for SKA?
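The on-the-fly idea can be sketched as an incremental adler32 folded over the data chunks as they pass through, so the object never needs a second read just for the checksum. A minimal sketch using the stdlib; crc32c (Castagnoli, the variant mentioned for SKA) is not in the stdlib and would need a third-party library:

```python
import zlib

def streaming_adler32(chunks):
    """Fold an adler32 over data chunks as they stream through,
    avoiding a separate checksum pass over the stored object."""
    cksum = zlib.adler32(b"")  # adler32 of empty input is 1
    for chunk in chunks:
        cksum = zlib.adler32(chunk, cksum)
    return cksum & 0xFFFFFFFF

# Incremental result matches a one-shot checksum of the same bytes.
assert streaming_adler32([b"hello ", b"world"]) == zlib.adler32(b"hello world")
```

Where to hook this in (gateway buffer layer vs. XrdCeph) is the open question from the notes.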
deletion – extrapolate to higher rates; miniDC in late Feb tests deletion rates
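On the "can we scale or do we need async?" question for deletions: a minimal sketch of issuing deletions concurrently instead of one blocking call at a time. `delete_object` is a placeholder for the real backend call (e.g. a rados remove); only the concurrency pattern is the point:

```python
from concurrent.futures import ThreadPoolExecutor

deleted = []

def delete_object(name):
    # Placeholder for the real backend deletion call;
    # records the name so the sketch is runnable.
    deleted.append(name)
    return name

def async_delete(names, workers=8):
    """Issue deletions through a worker pool so throughput scales
    with concurrency rather than per-call latency."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(delete_object, names))

done = async_delete([f"obj-{i}" for i in range(100)])
assert len(done) == 100
```

Measuring this pattern at the miniDC deletion rates would show whether a pool is enough or a fully asynchronous queue is needed.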
future architecture ideas - pelican style SSD caches backed by slower tape/HDD pools?
buffer layer improvements – improved latency hiding; directIO now reads into a buffer, which then feeds back into the client. Worker-node buffer configuration needs to be a full number
ring buffers?
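The ring-buffer idea above, as a minimal sketch: a fixed pool of reusable slots where a prefetcher fills ahead of the client while the client drains completed slots, hiding backend latency. This is an illustration of the pattern, not the XrdCeph implementation (overflow handling and threading are omitted):

```python
class RingBuffer:
    """Fixed pool of reusable byte buffers. A prefetcher fills slots
    ahead of the client; the client drains them in order."""

    def __init__(self, slots=4, slot_size=4 * 1024 * 1024):
        self.bufs = [bytearray(slot_size) for _ in range(slots)]
        self.filled = [0] * slots  # bytes valid in each slot
        self.head = 0              # next slot the prefetcher fills
        self.tail = 0              # next slot the client consumes

    def fill(self, data):
        # No overflow check here (sketch): assumes head - tail < slots.
        slot = self.head % len(self.bufs)
        n = min(len(data), len(self.bufs[slot]))
        self.bufs[slot][:n] = data[:n]
        self.filled[slot] = n
        self.head += 1
        return n

    def drain(self):
        if self.tail == self.head:
            return b""  # nothing prefetched yet
        slot = self.tail % len(self.bufs)
        out = bytes(self.bufs[slot][: self.filled[slot]])
        self.tail += 1
        return out

rb = RingBuffer(slots=2, slot_size=8)
rb.fill(b"abcdefgh")
rb.fill(b"ij")
assert rb.drain() == b"abcdefgh"
assert rb.drain() == b"ij"
```

Reusing fixed slots avoids repeated allocation on the hot path, which is part of the latency-hiding appeal.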
hybrid cluster with xcache for reads
xrootd 6 - RAL xrdceph to be included
should we use libradosstriper in the future?
can we change the file structure of future files?
xrootd monitoring – fstream monitoring needs to be moved into OpenSearch
any big changes in architecture must still supply a single endpoint, as VO workflows rely on it.
SKA might implement a combined data/job scheduling system; more S3 storage use
cephFS/s3/fast storage.
Kubernetes – for containerization; Cambridge using it over a virtualized layer
SDNs?
alternate load balancing logic? read/write only subsets of the cluster
DPUs? - useful when preprocessing data, maybe for OSDs?
tokens at the job/storage level.