2024-05-09 Meeting Notes

 Date

May 9, 2024

 Participants

  • @Thomas, Jyothish (STFC,RAL,SC)


Apologies:

@James Walder

CC:

 

 

 Goals

  • List of Epics

  • New tickets

  • Consider new functionality / items

  • Detailed discussion of important topics

  • Site report activity

 

 Discussion topics

Current status of Echo Gateways / WNs testing

Recent sandboxes for review / deployments:

 

Item

Presenter

Notes

 

Item

Presenter

Notes

 

Operational Issues
Gateways and WNs:
- Current status and upcoming changes

@Thomas, Jyothish (STFC,RAL,SC)

5.6.9 deployment failed;

deployed 5.5.4 with case-insensitive headers instead

Sam: SL7 doesn’t crash but is less stable; Rocky 8/9 crash with XrdCeph

glibc version?

https://ggus.eu/index.php?mode=ticket_info&ticket_id=166729

 

CHEP Abstract ideas

@Thomas, Jyothish (STFC,RAL,SC)

Deadline extended to the 17th.

One CHEP paper on load balancing; one from Matt; Gerard/Lancaster on Ceph monitoring

SKA high-throughput abstract

Abstracts

 

XrootD Workshop plan

@Alastair Dewhurst

Registrations are open (pending prettification)

 

Rocky 8 and 9 migration planning

 

CC passed; 4 gateways are planned to be deployed today

remaining upgrades to be done after next week

the batch farm will be updated simultaneously

preprod farm upgraded

GridFTP gateways will stay until June 3rd

 

Shoveler

@Katy Ellis

Shoveler installation and monitoring

 

Future development ideas / planning work

@Ian Johnson @Thomas, Jyothish (STFC,RAL,SC)

Notes from planning meeting 22-04-2024

 

Deletion studies through RDR

@Ian Johnson

 

Preliminary deletion rate plot from DC24 “dip-stick” sampling (ATLAS VO):

 

deletion-rate.png

Plot covers data taken during DC24

Steven is looking into CephFS deletion studies

Compare Lancaster timestamps/file sizes against the number of slow ops?
(Are deletions causing slow ops?)

 

Deletions

https://stfc.atlassian.net/browse/XRD-83

What’s the theoretical limit?
Are the bottlenecks in XRootD?
Are async deletions required?

 

Planning for ALICE CMSD redirection

@Thomas, Jyothish (STFC,RAL,SC)

keepalived setup is working

1 dev gateway is being added to this cluster

 

Checksums fixes

@Alexander Rogovskiy @Thomas, Jyothish (STFC,RAL,SC)

'21 generation is being rolled out

'21 nodes are being swapped with '22 job types, as the '22s have separate OS and disk drives; additional HW for the '21s expected in a month

On the WNs, checksums are forwarded to the prod cluster

the preprod farm contributed ~5% of the total checksums

 

Prefetch studies and WN changes

@Alexander Rogovskiy

Some more data from the overload event, namely efficiency and error rates of the 2021 generation (prefetch on):

pf_on_err.png

It would be interesting to compare this to the “prefetch off” configuration. So far, a meaningful comparison has not been possible, since the number of WGProduction jobs is nowhere near that seen during the overload event (23.04.2024).

stress test for WNs?

preprod CEs targeting internal xrootd cluster

 

Tokens Status

@Thomas, Jyothish (STFC,RAL,SC) @Katy Ellis

 

 

CMSD Load balancing

@Thomas Byrne @Thomas, Jyothish (STFC,RAL,SC)

PR:
https://github.com/stfc/xrootd/pull/8/files


 

 

SKA Gateway box

@James Walder

https://stfc.atlassian.net/wiki/spaces/UK/pages/215941180

Two new servers racked; awaiting NetBox configuration.
AQ configuration to be discussed with James A.

 

 

 

 

 

Xrootd testing framework

@Mariam Demir

 

 

 

on GGUS:

Site reports

Lancaster: a week of living on Reef hasn’t yielded many operational issues. The only one of note was that Reef wasn’t happy with the “trimming settings” for the MDS; the Pacific defaults needed to be cranked up. Otherwise, smooth sailing. Dealing with fallout from a lot of metrics being renamed and from changes to the logging, which have reduced our monitoring capabilities a bit, but that’s a niggle. Next task is to update all the clients.

 

Glasgow:

Redirector black-holing with 5.6.9; some crashes on the server

Do we want to talk about Durham?

Durham is having interesting Ceph problems (files with read locks): transfers hang and lock files after a while. Paul is switching to match the Lancaster and Glasgow CephFS setup to see if that fixes things.

 

 Action items

 


 

 Decisions