2025-01-16 Meeting Notes


 Date

Jan 16, 2025

 Participants

 

  • @Thomas Byrne

  • @Thomas, Jyothish (STFC,RAL,SC)

  • Lancs: Matt

  • Glasgow:

Apologies:

 

CC:

 

 

 Goals

  • List of Epics

  • New tickets

  • Consider new functionality / items

  • Detailed discussion of important topics

  • Site report activity

 

 Discussion topics

Current status of Echo Gateways / WNs testing

Recent sandboxes for review / deployments:

 


Operational Issues
Gateways and WNs:
- Current status and upcoming changes

 

 

Upgrades of GWs complete.

AAA federation issues (SVC20, monitoring running out of sessions); the federation remains spotty, possibly related to the 5.7.2 file descriptor bug?

Worker nodes: 5.6 clients have issues accessing the servers, possibly related to the page-read message size?

 

Checksums issue with an ATLAS file

 

[XrdCks] Checksum request during transfer locks partial file checksum into metadata for Ceph · Issue #2388 · xrootd/xrootd

GGUS /login

A checksum requested before the whole file is uploaded gets recorded; Ceph has no way to do a stale-checksum check, so the original partial checksum ‘sticks’ to the file.

Fix in place on the RAL side: checksums are cleared after a write completes.
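For reference, clearing a stuck checksum attribute on the underlying Ceph object could look like the sketch below, using the python-rados bindings. This is an illustration only, not the actual RAL fix: the pool name, object name, and the xattr key are all assumptions.

    # Sketch only: remove a stale checksum xattr from a Ceph object.
    # Pool, object name, and xattr key are assumed, not taken from the RAL fix.
    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    try:
        ioctx = cluster.open_ioctx('atlas')  # hypothetical pool
        try:
            # assuming checksums are stored as one xattr per algorithm
            ioctx.rm_xattr('datafile.0001', 'XrdCks.adler32')  # hypothetical names
        finally:
            ioctx.close()
    finally:
        cluster.shutdown()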

 

cms-aaa naming convention

 

cms-aaa is the only remaining personality to use proxy/ceph as the XRootD service names.

A separate naming convention (main/supporting) would be more appropriate (not so urgent).

CC created, and a sandbox is prepared.

 

 

XRootD Managers De-VMWareification

@Thomas, Jyothish (STFC,RAL,SC)

Option 2 preferred for efficiency, but Option 1 decided on: it is simpler to implement as a temporary fix, since the move will eventually be reversed.

Antares TPC nodes to be moved to an Echo leaf switch; to confirm IPv4 real estate with James.
lfsw30 (UPS room) decided on as the destination.

Hosts moved to rack.

 

Compilation and rollout status with XrdCeph and Rocky 8: 5.7.x

@Thomas, Jyothish (STFC,RAL,SC)

5.7.2 published.
Investigating xrootd.redirect for write operations.

5.7.2 skipped on the farm due to a pfc bug; a possible RAL release equivalent to 5.7.3, with a fix for that bug and for 5.6.0 client compatibility.

 

Shoveler

@Katy Ellis

Shoveler installation and monitoring

 

 

On the fly Checksums
https://stfc.atlassian.net/browse/XRD-98

@Ian Johnson

 

Simple PoC calculating Adler32 in the XrdCeph plugin is mostly working. Negligible reduction in write rate compared to not calculating Adler32 on the fly.
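To illustrate the idea (the real PoC is in the C++ XrdCeph plugin; this is a Python sketch using zlib): each write buffer is folded into a running Adler32, so no separate read pass is needed when the checksum is requested.

    # Sketch: fold each written buffer into a running Adler32 checksum.
    import zlib

    def write_with_adler32(chunks):
        running = 1  # zlib's defined Adler32 seed value
        for buf in chunks:
            # ... write buf to the backend here ...
            running = zlib.adler32(buf, running)
        return running & 0xFFFFFFFF

    # write_with_adler32([b"hello ", b"world"]) == zlib.adler32(b"hello world")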

 

 

Deletions

https://stfc.atlassian.net/browse/XRD-83

NTR

 

XRootD Writable Workernode Gateway Hackathon

 

@Thomas, Jyothish (STFC,RAL,SC)

XRootD Writable Workernode Gateway Hackathon (XWWGH)

Sandbox with fixes present, ready for testing.

https://stfc.atlassian.net/browse/GSTSM-284

 

Xrootd testing framework

 

XRootD Site Testing Framework

Discussion in the Storage Meeting on how to integrate the various testing structures within the UK.

 

100 GbE Gateway testing:
SKA / Tier-1

@James Walder @Thomas, Jyothish (STFC,RAL,SC)

UKSRC: XRootD used for SRCNet testing.

Tier-1 cabled, but awaiting some work to progress on the switch.

 

 

 

UKSRC Storage Architecture

 

 

 

Tokens Status

 

  • Operational

  • Technical

  • Accounting

 

 

 

 


Site reports

 

Lancaster: On this week’s Lancaster Rant: we had a period of storage sadness last night. ATLAS deleted ~20k files in the space of about 30 minutes, and whilst Ceph was recovering, LSST jobs came from behind and gave the storage a wedgie with high IOPS. CephFS got slow, xrootd servers got sad, some fell over, CephFS got more unhappy. It was a whole thing, and Gerard spent the morning restarting xrootd servers with his new scripts.

The point of my ranting is that it seems half our problems could be solved if we could get xrootd to rate/connection limit, so things didn’t get into such a bad state that we had to reboot them. We don’t think xrootd has this functionality in itself (James reminded me of the throttle plugin, but we don’t know if it works with HTTP). The preliminary thought would be something on the redirector that, if it detects problems or high load, rather than redirecting to the least-worst-off xrootd server, just returns a polite “try again later” (503?). A sketch of the throttle plugin configuration follows below.
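For reference, the throttle plugin is loaded as an OFS wrapper and configured roughly as below. This is a sketch from memory of the XrdThrottle directives, with placeholder limits; the spelling should be checked against the OFS reference, and whether it applies to HTTP traffic is exactly the open question above.

    # Sketch with placeholder values; verify directive spelling against the OFS reference
    xrootd.fslib throttle default
    throttle.throttle concurrency 200 iops 1000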

In other news, as discussed on Wednesday, we’ve been looking at ways we could remove TLS from internal transfers and how to plumb that together in xrootd, but Jyothish may have crushed our hopes there by pointing out that scitokens require TLS to be enabled, so such a move wouldn’t be future-proof, or would need extra plumbing. (These ponderings go along with the idea of replacing internal auth for at least some users with something faster; this again is LSST-driven, with their teeny-tiny files causing hassle.)


XRootD managers can send a wait response for requests if no server is available; this can be achieved by setting a lower maxload per server.
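As a sketch, on the manager this is a cms.sched tweak (the percentages are placeholders): once every server reports load above maxload, clients receive a wait/stall response instead of a redirect. Servers also need cms.perf configured so that load figures are reported at all.

    # Manager-side sketch: weight cpu/io in scheduling, treat servers above 80% load as unavailable
    cms.sched cpu 50 io 50 maxload 80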

Scrub errors/bad sectors on some disks are causing issues; recreating the OSDs and backfilling is what is usually done at RAL.

For slow ops, monitoring read IO time and the distribution of slow ops across PGs/OSDs can indicate whether a bad disk is the issue.
Dan has a script that periodically writes data and monitors the timings.
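Dan’s script isn’t reproduced here; as an illustration of the write-and-time canary idea (path, block size, interval, and threshold are all hypothetical):

    # Canary sketch: periodically time a small write+read; report slow ops.
    import os, time

    PATH = '/cephfs/canary/probe.dat'  # hypothetical probe location
    SLOW_SECONDS = 5.0                 # hypothetical slow-op threshold

    def probe():
        payload = os.urandom(4 * 1024 * 1024)  # 4 MiB test block
        t0 = time.monotonic()
        with open(PATH, 'wb') as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())
        write_s = time.monotonic() - t0
        t0 = time.monotonic()
        with open(PATH, 'rb') as f:
            f.read()
        read_s = time.monotonic() - t0
        if max(write_s, read_s) > SLOW_SECONDS:
            print(f'SLOW op: write={write_s:.2f}s read={read_s:.2f}s')

    while True:
        probe()
        time.sleep(60)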

 

Glasgow

 

 Action items

How to replace the original functionality of f-stream monitoring, now that OpenSearch has replaced the existing solutions.
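For context, the f-stream is the file-access record stream produced by the xrootd.monitor directive; something along these lines (the collector host:port is a placeholder, and the exact option set should be checked against the monitoring reference) would need pointing at whatever now feeds OpenSearch:

    # Sketch: emit f-stream (file access) records to an external collector (placeholder endpoint)
    xrootd.monitor all flush 30s window 5s fstat 60 lfn ops xfr 5 dest fstat info user collector.example.org:9993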

 


 

 Decisions