2024-01-18 Meeting Notes

 Date

Jan 18, 2024

 Participants

  • @Thomas, Jyothish (STFC,RAL,SC)

  • @Alexander Rogovskiy

  • @Thomas Byrne

  • Lancs: Steven, Gerard, Matt

  • Glasgow: Sam

Apologies:

CC:

 

 

 Goals

  • List of Epics

  • New tickets

  • Consider new functionality / items

  • Detailed discussion of important topics

  • Site report activity

 

 Discussion topics

Current status of Echo Gateways / WNs testing

Recent sandboxes for review / deployments:

 

Item

Presenter

Notes

 

 

Operational Issues

@Thomas, Jyothish (STFC,RAL,SC)

Packet loss on perfSONAR?

 

 

 

Gateways and WNs:
- Current status and upcoming changes

@Thomas, Jyothish (STFC,RAL,SC)

stable status currently

  • Tokens have been deployed for CMS/ATLAS (with an additional patch restricting scope matching, so that a scope of foo is rejected for foobar)

  • checksum library

  • prefetch off on WNs
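The scope restriction mentioned in the tokens item above (foo must not authorize foobar) can be illustrated with a minimal sketch. This is not the code of the deployed patch; the function name `path_within_scope` and the example paths are hypothetical, and it assumes the check is a path-prefix match that should only succeed on whole path segments.

```python
from posixpath import normpath


def path_within_scope(scope_path: str, request_path: str) -> bool:
    """Return True if request_path is scope_path itself or lies beneath it.

    Hypothetical helper: a plain startswith() would wrongly accept
    /foobar for a scope of /foo, so the match is anchored on a '/'.
    """
    # Normalise both paths so that trailing slashes and '..' are handled.
    scope = normpath("/" + scope_path.strip("/"))
    request = normpath("/" + request_path.strip("/"))
    if scope == "/":
        return True  # a root scope covers every path
    return request == scope or request.startswith(scope + "/")


# A scope of /foo covers /foo and /foo/bar, but not /foobar.
print(path_within_scope("/foo", "/foo/bar"))
print(path_within_scope("/foo", "/foobar"))
```

Anchoring the comparison on the separator is the whole point of the sketch: the naive prefix test is the case the notes describe as needing rejection.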

Revisit installing 5.6.4: before the break, one set of tests (TPC transfers) was failing against another site. The tests are to be repeated.

Rocky 8 for the Gateways (@Thomas, Jyothish (STFC,RAL,SC) working on an initial setup).

 

bugfix for calculating striper objects in direct reads

 

https://github.com/stfc/xrootd-ceph/pull/50

passed test on gw8 and code reviewed

 

ECHO File transfer / throughput studies

@Katy Ellis

Tests of per-file transfer writes into Echo.
A new Jira is set up to track these changes: https://stfc.atlassian.net/browse/XRD-80
Updates presented at Liaison meeting yesterday.
Preliminary results from iperf3 testing:

iperfcomp.png

tests ongoing on svc20

 

Checksums fixes

@Alexander Rogovskiy

Status and plans for improving Checksumming work …

https://github.com/alex-rg/xrd_ckslib/tree/main
https://stfc.atlassian.net/browse/XRD-56

(Sandbox prepared and applied to GW8)

 

Prefetch studies and WN changes

@Alexander Rogovskiy

Sandbox ready and applied to 1 WN, pending an environment variable for the timeout increase

 

Deletion studies through RDR

@Ian Johnson

Continuing with mixed results.

Previous set was 500 files.

Upload of 5000 files did not complete after 20+ hrs (seems to have been a bad time: last Tuesday).

100 × 1 GB deletion in 5 s.

Check with Alessandra on Rucio deletion concurrency (for DC24).

Ceph is performing better at the moment.
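For reference when discussing deletion concurrency, the deletion figure quoted above can be turned into rates with a small worked calculation. It assumes the 5 s is the wall-clock time for the whole batch of 100 files; the function name is illustrative only.

```python
def deletion_rates(n_files: int, file_size_gb: float, elapsed_s: float):
    """Return (files per second, logical GB freed per second) for a batch."""
    files_per_s = n_files / elapsed_s
    gb_per_s = n_files * file_size_gb / elapsed_s
    return files_per_s, gb_per_s


# Figures from the notes: 100 files of 1 GB each, deleted in 5 s.
files_per_s, gb_per_s = deletion_rates(100, 1.0, 5.0)
print(f"{files_per_s:.0f} files/s, {gb_per_s:.0f} GB/s (logical)")
```

This works out to 20 files/s, or 20 GB/s of logical data removed.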

 

Tokens testing

@Thomas, Jyothish (STFC,RAL,SC) @Katy Ellis

https://stfc.atlassian.net/browse/XRD-63
https://stfc.atlassian.net/browse/XRD-78
https://github.com/xrootd/xrootd/pull/2152

https://github.com/xrootd/xrootd/pull/2151/files

 

Understanding CMSD Load Balancing

@Thomas Byrne

Explore a different load-balancing scheme (weighted placement).

Testing in an internal cluster? How to measure improvements? More instrumentation of the current version to measure improvement.

Things look ~OK at the moment, so lower in priority.
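As a starting point for the weighted-placement discussion, here is a minimal sketch of one possible scheme: each server is chosen with probability proportional to its free capacity. It is not how CMSD selects servers; the host names, the load values and the `pick_server` helper are all hypothetical.

```python
import random
from collections import Counter


def pick_server(loads: dict[str, float], rng: random.Random) -> str:
    """Pick a server with probability proportional to (1 - load).

    loads maps a server name to a load fraction in [0, 1].
    """
    names = list(loads)
    weights = [max(1.0 - loads[name], 0.0) for name in names]
    if sum(weights) == 0:
        # Every server is saturated: fall back to a uniform choice.
        return rng.choice(names)
    return rng.choices(names, weights=weights, k=1)[0]


# Hypothetical gateways with made-up load fractions.
rng = random.Random(42)
loads = {"gw-a": 0.9, "gw-b": 0.5, "gw-c": 0.1}
counts = Counter(pick_server(loads, rng) for _ in range(10_000))
print(counts.most_common())
```

Counting placements per server, as in the last lines, is one simple way to compare a candidate scheme against the current behaviour once equivalent counters are available from an instrumented build.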

 

SKA Gateway box

@James Walder

 

Architectural review ‘hackathon’

All

Plan the process for the Architectural planning of XRootD across the External Gateways and WNs

 

2024 Planning

 

JW to prepare a summary of the plans for 2024

 

 

on GGUS:

Site reports

Lancaster - Unbalanced redirectors/load causing Ceph mounts to be dropped. Sam suggests network QoS to monitor traffic (the Ceph MDS kick-out timeout is 5 min, and the preference is not to increase it). Write locks from unresponsive clients can cause pile-ups on 'healthy' clients. Making the MDS less aggressive would mitigate this but not solve the underlying issue (not recommended). Local read from CephFS mount. Mostly coincident with slow OSD ops (PG slow > MDS slow ops on metadata > OSD slow ops > GW issues). The osd perf output might have more info. Long SMART health check?

Glasgow - relatively stable, few network issues. OS/Ceph version update to do.

 

 

 

 Action items

  • @James Walder to schedule a ‘hackathon’ within an F2F to have a session on architectural planning.

  • @James Walder to prepare an outline of the expected roadmap for XRootD developments in 2024.

  •  

 

 Decisions