2025-02-06 Meeting Notes

 Date

Feb 6, 2025

 Participants

 

  • @Thomas, Jyothish (STFC,RAL,SC)

  • @Ian Johnson

  • @James Walder

  • @Alexander Rogovskiy

  • Lancs: Matt, Steven, Gerard

  • Glasgow:

Apologies:

 

CC:

 

 

 Goals

  • List of Epics

  • New tickets

  • Consider new functionality / items

  • Detailed discussion of important topics

  • Site report activity

 

 Discussion topics

Current status of Echo Gateways / WNs testing

Recent sandboxes for review / deployments:

 

Item

Presenter

Notes

 

 

Operational Issues
Gateways and WNs:
- Current status and upcoming changes

 

 

OPN broke

Friday manager overload

Setting a niceness on the xrootd processes seems like good standard practice to guard against overloads.

Keepalived stall on ECHO XRootD managers - 31/01/25

 

XRootD Managers De-VMWareification

(Moving to physical hosts)

@Thomas, Jyothish (STFC,RAL,SC)

https://stfc.atlassian.net/wiki/spaces/GRIDPP/pages/872644647

XRootD Cluster Shuffle
You can have up to 16 managers in a cluster (doesn’t mean you should).

New managers added to manage the cluster; DNS IP switch on Monday.

 

XRootD collaboration Meeting

 

1 - streamed checksum validation

We're planning to implement streamed checksums in XrdCeph (computing the checksum at the server during the write and storing it directly in the metadata, instead of reading the file back), then give it a few months of validation testing against the read-back checksum.

(Archival copies and sparse files probably shouldn’t use streamed checksums; how to set this up is still open. This should ONLY work for full-file writes. Does HTTPS allow sparse files? Standard WebDAV does not support partial writes.)

Tests have already shown a good improvement in performance when done this way.
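As a rough illustration of the streamed approach (not the XrdCeph implementation), the sketch below folds each chunk into a running checksum as it arrives and compares the result with a read-back checksum over the stored bytes. Adler-32 and the function names are assumptions made for the example; the notes do not name the algorithm.

```python
import zlib


def streamed_adler32(chunks):
    """Fold each chunk into a running Adler-32 as it is written, in order."""
    checksum = 1  # Adler-32 starts from 1
    for chunk in chunks:
        checksum = zlib.adler32(chunk, checksum)
    return checksum & 0xFFFFFFFF


def readback_adler32(data):
    """Checksum computed afterwards over the complete stored object."""
    return zlib.adler32(data) & 0xFFFFFFFF


# Illustrative data only: three sequential writes making up one full file.
chunks = [b"event-data-" * 1000, b"more-data" * 500, b"tail"]
stored = b"".join(chunks)

print(streamed_adler32(chunks) == readback_adler32(stored))
```

The running value is only meaningful when the chunks arrive in order and cover the whole file, which is why the full-file-write restriction above matters.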

 

2 - XrdCeph plans:

We're planning to move away from libradosstriper to using rados directly before HL-LHC, for future-proofing and performance improvements, and to merge our fork into core xrootd.

 

3 - backward compatibility with 'broken' clients

This followed an incident where ATLAS were trying to use 5.6.0 clients with their older analysis software, which had a TLS bug causing transfer failures against endpoints supporting tokens over IPv4. We mitigated it with the NOTLSOK environment variable, but it'd be good to have a consensus on how to deal with clients with known issues.

 

(Also to ensure is discussed: Shoveler.)

 

cms-aaa naming convention

@Thomas, Jyothish (STFC,RAL,SC)

cms-aaa is the only remaining personality to use proxy/ceph as the xrootd service names


A separate naming convention (main/supporting) would be more appropriate

(not so urgent).

CC created; sandbox prepared and tested on a test host.

 

 

cms-aaa jemalloc use

@Thomas, Jyothish (STFC,RAL,SC)

testing on svc20, some memory leak still present

 

Compilation and rollout status of RAL XRootD versions

@Thomas, Jyothish (STFC,RAL,SC)

5.7.2 published.
Investigating xrootd.redirect for write operations.

5.7.2 skipped on the farm due to a pfc bug.

5.7.3 released.

 

Shoveler

@Katy Ellis

Shoveler installation and monitoring

 

 

On the fly Checksums
https://stfc.atlassian.net/browse/XRD-98

@Ian Johnson

 

Changes to the first implementation of on-the-fly (OTF) checksumming:

  • Don't store the OTF checksum unless a configuration flag is set

  • Log the OTF checksum value, algorithm name, path and timestamp

Not started:

  • Log the same details for the readback checksum. Haven't worked out how to modify the xattr code to do this yet (separate file, as the readback may be on a different gateway)

  • Correlate file checksums by path and identify any mismatch

Additional CPU load during OTF checksum calculation
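The correlation step could be sketched along these lines. The record layout (kind, path, algorithm, checksum) and the kind labels "otf" and "readback" are hypothetical; the real log format has not been decided in these notes.

```python
from collections import defaultdict


def find_mismatches(records):
    """Pair OTF and readback checksums by (path, algorithm) and report differences.

    records: iterable of (kind, path, algorithm, checksum) tuples, where kind is
    'otf' or 'readback' (hypothetical labels). A later record for the same
    path, algorithm and kind replaces an earlier one.
    """
    by_key = defaultdict(dict)
    for kind, path, algorithm, checksum in records:
        by_key[(path, algorithm)][kind] = checksum.lower()

    mismatches = []
    for (path, algorithm), seen in sorted(by_key.items()):
        # Only compare when both sides were logged for this file.
        if "otf" in seen and "readback" in seen and seen["otf"] != seen["readback"]:
            mismatches.append((path, algorithm, seen["otf"], seen["readback"]))
    return mismatches


# Illustrative records only; paths and values are made up.
records = [
    ("otf", "/store/a.root", "adler32", "0A1B2C3D"),
    ("readback", "/store/a.root", "adler32", "0a1b2c3d"),
    ("otf", "/store/b.root", "adler32", "11111111"),
    ("readback", "/store/b.root", "adler32", "22222222"),
    ("otf", "/store/c.root", "adler32", "33333333"),
]
print(find_mismatches(records))
```

Because the two logs may come from different gateways, the records can be merged from several files before being passed in; files with only one side logged are skipped rather than flagged.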

 

 

Deletions

https://stfc.atlassian.net/browse/XRD-83

NTR

 

XRootD Writable Workernode Gateway Hackathon

 

@Thomas, Jyothish (STFC,RAL,SC)

 

The sandbox is deployed to the whole preprod farm. LHCb uploads look OK; ATLAS and CMS have not tested the new setup yet.


 

Plan: file query system to summarize XRootD Logs

 

Plan to create a system that stores information from across all gateways, so that a filename can be searched to get its creation time, last write time, last successful stat and deletion time, for cases of ‘lost’ files. Possible graduate side project.
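One possible shape for such a lookup, as a minimal sketch only: events parsed from the gateway logs are loaded into a single table and summarised per path. The event kinds ("create", "write", "stat_ok", "delete"), the table layout and the gateway names are assumptions for illustration, and SQLite stands in for whatever store is eventually chosen.

```python
import sqlite3


def build_index(events):
    """Load (path, kind, ts, gateway) tuples into an in-memory SQLite table."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE events (path TEXT, kind TEXT, ts TEXT, gateway TEXT)")
    db.executemany("INSERT INTO events VALUES (?, ?, ?, ?)", events)
    return db


def summarize(db, path):
    """Return creation, last write, last successful stat and deletion times for a path."""
    row = db.execute(
        """
        SELECT MIN(CASE WHEN kind = 'create'  THEN ts END),
               MAX(CASE WHEN kind = 'write'   THEN ts END),
               MAX(CASE WHEN kind = 'stat_ok' THEN ts END),
               MAX(CASE WHEN kind = 'delete'  THEN ts END)
        FROM events
        WHERE path = ?
        """,
        (path,),
    ).fetchone()
    return dict(zip(("created", "last_write", "last_stat_ok", "deleted"), row))


# Illustrative events only; ISO 8601 timestamps sort correctly as text.
events = [
    ("/store/a.root", "create", "2025-02-01T10:00:00", "gw01"),
    ("/store/a.root", "write", "2025-02-01T10:00:05", "gw01"),
    ("/store/a.root", "stat_ok", "2025-02-03T08:30:00", "gw07"),
    ("/store/a.root", "delete", "2025-02-04T12:00:00", "gw03"),
]
db = build_index(events)
print(summarize(db, "/store/a.root"))
```

A path with no recorded events comes back with every field set to None, which distinguishes "never seen" from "seen and deleted".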

 

100 GbE Gateway testing:
SKA / Tier-1

@James Walder @Thomas, Jyothish (STFC,RAL,SC)

UKSRC - Acting as source for SRCNet verification tests; not being stressed so far …

Tier-1.

 

 

 

UKSRC Storage Architecture

 

Superspine connection review today. To discuss with Tom B re. Ceph configuration

 

Tokens Status

 

  • Operational

  • Technical

  • Accounting

 

 

 

 

on GGUS:

Site reports

 

Lancaster:

All on 5.7.3 for the last week, no issues.

Following on from last week, Steven has been upping our number of shoveler instances; we’ll see how that goes.

As mentioned in storage, we’re shopping for some new gateways: getting quotes for single-socket Dell boxes with quad-port 25Gb network cards (and looking at 100Gb options). Our plan is to have the external port on a single 25Gb NIC and the internal on a pair of bonded 25Gb NICs. It would be nice to save money and not have to cram these with RAM.

Also mentioned was Gerard's “xrootd restarter” service, which will shepherd regular rolling restarts of our xroot services (as a means to deal with xroot’s poor connection handling). The aim is to cleanly restart the services every couple of days. Gerard’s been working on making it as “unhacky” as he possibly can.
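For discussion, a rolling restart of this kind might look roughly like the sketch below. This is not Gerard's service: the function, its parameters and the service names are hypothetical, and the restart and health-check actions are passed in as callables so nothing here assumes a particular init system.

```python
import time


def rolling_restart(services, restart, is_healthy, settle_seconds=0, sleep=time.sleep):
    """Restart services one at a time, stopping at the first unhealthy one.

    restart(name) and is_healthy(name) are supplied by the caller
    (hypothetical hooks, e.g. wrappers around the site's own tooling).
    """
    restarted = []
    for name in services:
        restart(name)
        sleep(settle_seconds)  # give the service time to come back up
        if not is_healthy(name):
            # Leave the remaining services untouched so capacity is preserved.
            raise RuntimeError(f"{name} unhealthy after restart; stopping rollout")
        restarted.append(name)
    return restarted


# Dry run with stand-in hooks; the service names are made up.
log = []
done = rolling_restart(
    ["gateway-a", "gateway-b"],
    restart=lambda name: log.append(name),
    is_healthy=lambda name: True,
)
print(done)
```

Restarting one service at a time and halting on the first failed health check keeps the rest of the pool serving while a problem is investigated.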

Side topic: in an email thread, Dan T mentioned TLS hardware offloading, which piqued my interest. Has anyone looked into this recently? AIUI it’s not a feature on all cards. I see there are also TCP checksum offload features in cards.

(have we discussed this here before? It rings a bell…)

Finally, we have a continued pattern of high load on Friday evenings and are trying to track down the culprit (looks like ATLAS Rucio transfers; who’s kicking them off just before the weekend?). Considering switching scrubbing off if we can’t fix it.

 

 

 


 

 

Glasgow -

 

 Action items

Work out how to replace the original functionality of fstream monitoring, now that OpenSearch has replaced the existing solutions.

 


 

 Decisions
