2024-02-15 Meeting Notes

 Date

Feb 15, 2024

 Participants

  • @James Walder

  • @Ian Johnson

  • @Alexander Rogovskiy

  • @Thomas, Jyothish (STFC,RAL,SC)

  • @Thomas Byrne

  • Lancs: Steven, Gerard, Matt

  • Glasgow: Sam

Apologies: @Katy Ellis

CC:

 

 

 Goals

  • List of Epics

  • New tickets

  • Consider new functionality / items

  • Detailed discussion of important topics

  • Site report activity

 

 Discussion topics

Current status of Echo Gateways / WNs testing

Recent sandboxes for review / deployments:

 

Item

Presenter

Notes

 


 

Operational Issues

@Thomas, Jyothish (STFC,RAL,SC)

 

 

Gateways and WNs:
- Current status and upcoming changes

@Thomas, Jyothish (STFC,RAL,SC)

CMSD load balancing changes: reporting frequency increased to 3 s (decision cycle of 6 s); fuzz reduced from 15 to 10.

The resulting load pattern converged more tightly, even under DC24-level loads.
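For reference, the changes above would correspond to cms directives along these lines (a sketch only; the cpu/io weights shown are assumptions, not the production values):

```
# Sketch of the implied cms settings (weights assumed, not the production config).
# Report load every 3 seconds; the manager's decision cycle is then ~2 intervals (6 s).
cms.perf int 3s

# fuzz reduced from 15 to 10: servers whose computed load differs by <=10%
# are treated as equally loaded when selecting a target.
cms.sched cpu 50 io 50 fuzz 10
```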

 

Data Challenge Status / observations

 

LHCb:
EOS->ECHO link:

echo.png

ECHO->Antares link:

antares.png

Individual gateway throughput, EOS->ECHO link:

Individual gateway performance (ECHO → Antares link):

New vs. old network gateways, throughput and number of transfers, EOS->ECHO link:

 

 

Rocky 8 migration planning

 

Status and issues with Rocky 8 migration

 

Summary from S&C week Discussions with Andy H and Brian B

@Thomas, Jyothish (STFC,RAL,SC)

XRootD workshop agenda to be confirmed

The high system load is unexpected; try monitoring running processes during these periods. One possibility is disk I/O locking. Suggested using the throttle plugin with the limits turned off, to get the monitoring printouts.
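The throttle-plugin suggestion could look roughly like this (a hedged sketch; the exact directive spelling and interval value should be checked against the XRootD ofs documentation before use):

```
# Sketch: load the throttle wrapper mainly for its load-monitoring printouts.
xrootd.fslib throttle default

# No data/iops/concurrency limits are set, so nothing is actually throttled;
# the plugin only recomputes and reports load at each interval (value assumed).
throttle.throttle interval 1000
```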

Officially supported OSes: EL7-8 and Alma9. OpenSSL 3 was causing some issues, but the package is available.

Load balancing: Brian uses static weighting with global sharing (each server gets a fixed share of total transfers). Another possibility is to use heartbeat skew as a metric.

Checksums: work on in-flight checksums is ongoing; Andy remarked these would not confirm the integrity of the file at the destination.
The reason default checksums didn’t work at RAL may have been the stale checksum check - confirmed with testing.

Tokens: token redaction ETA by spring (March). PRs with bug fixes sent to XRootD, to be included in 5.7.0:
https://github.com/xrootd/xrootd/pull/2184

https://github.com/xrootd/xrootd/pull/2152

Odd behaviour under investigation: HTTP stops being responsive under small bursts of high load.

Gateways randomly stop authenticating; the error seen was a malformed CA. Next time it happens, check vomsdata::check_from_file in gdb to debug; likely an issue with loading CA chains.

 

 

Planning for ALICE CMSD redirection

 

Redirection is needed for the three ALICE gateways, to move away from the current DNS round-robin.

Options:

  • Replicating the general setup (keepalived, two-host redundancy): we’d have 2 managers for 3 servers.

  • CMS AAA approach (1 manager only, no redundancy): CMS can get away with this due to redundancy through other sites’ managers.
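For the first option, the keepalived side would be a standard VRRP pair; a minimal sketch (instance name, interface, and VIP are all placeholders, not our actual addressing):

```
# Hypothetical keepalived sketch for the two-manager option.
# Names, interface, and VIP are placeholders.
vrrp_instance alice_cmsd {
    state MASTER            # BACKUP on the second manager
    interface eth0
    virtual_router_id 51
    priority 100            # lower (e.g. 90) on the backup
    advert_int 1
    virtual_ipaddress {
        192.0.2.10          # VIP that clients resolve instead of DNS RR
    }
}
```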

 

 

ECHO File transfer / throughput studies

@Katy Ellis

 

 

 

Checksums fixes

@Alexander Rogovskiy @Thomas, Jyothish (STFC,RAL,SC)

The patch ignoring the stale checksum check performs similarly to the checksum library plugin.

 

Prefetch studies and WN changes

@Alexander Rogovskiy

 

 

Deletion studies through RDR

@Ian Johnson

Compared to the last measurements (two weeks ago), concurrent test deletions are taking longer during DC24 (as expected). Quite a variation in timings was observed. Deletion rates decreased (all rates below are for 1000 files with 10 deletion threads):

ATLAS wish for a deletion rate of 4 Hz for 3 GiB files; the current test rate is still faster (c.f. 21 Hz before DC24).

Deletion times distributions for 1, 3, and 6 GiB files:

Timings using the same Y axis scale:

With the longest deletion times in the sample above being 42.9 s for 1 GiB files and 97.4 s for 6 GiB files, I expect to see many deletions timing out, depending on the ECHO load.
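As a rough sanity check on the rates above (illustrative only; the mean per-file time used here is an assumption, not the measured sample), the aggregate rate with N concurrent deletion threads is approximately N divided by the mean per-file deletion time:

```python
# Rough deletion-rate arithmetic (illustrative; the timing value is assumed).
def aggregate_rate_hz(mean_deletion_time_s: float, threads: int = 10) -> float:
    """Approximate aggregate deletion rate with `threads` concurrent deleters."""
    return threads / mean_deletion_time_s

# e.g. a 2.5 s mean per-file deletion time with 10 threads gives the
# 4 Hz rate ATLAS is asking for.
print(aggregate_rate_hz(2.5, 10))  # 4.0
```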

From 01/02 mtg:

Check timings for slow deletions. As above.

Measure deletion rate for 6 GiB files. As above.

Testing with 1 KiB files works up to 4000 files (deletion client on two VMs), but fails with 10000 files. Investigating. Not done yet.

 

Tokens Status

@Thomas, Jyothish (STFC,RAL,SC) @Katy Ellis

WLCG token create and modify scopes must include permissions to create and stat the parent folders of that path (XRootD fix included in the stat permissions patch).

‘Timeout’ errors in token auth (“permission denied - timeout was reached”) during DC24. Seems to be caused by overloading the IAM servers during token deserialization in scitokens-cpp (it fetches the public key too often):
https://github.com/scitokens/scitokens-cpp/issues/80 - it reattempts to fetch a failed public key on every attempt
https://github.com/scitokens/scitokens-cpp/issues/119 - the same problem for missing caches
https://github.com/scitokens/scitokens-cpp/issues/97 - short-lived caches being the reason nodes need connectivity
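The common pattern behind the three issues above is re-fetching the issuer’s public key on every validation, including after failures. A sketch of the caching behaviour that would avoid it (hypothetical illustration, not the scitokens-cpp implementation; `fetch` is a placeholder for the HTTP key lookup):

```python
import time

# Hypothetical issuer-key cache. Caches failures as well as successes, so a
# failing fetch backs off instead of hitting the IAM server on every token.
class KeyCache:
    def __init__(self, ttl_s=3600, failure_ttl_s=60):
        self.ttl_s = ttl_s
        self.failure_ttl_s = failure_ttl_s
        self._cache = {}  # issuer -> (key_or_None, expiry)

    def get(self, issuer, fetch):
        key, expiry = self._cache.get(issuer, (None, 0.0))
        if time.monotonic() < expiry:
            return key  # may be None: a cached failure, not retried yet
        try:
            key = fetch(issuer)
            self._cache[issuer] = (key, time.monotonic() + self.ttl_s)
        except Exception:
            # Negative cache: remember the failure for a short period.
            key = None
            self._cache[issuer] = (None, time.monotonic() + self.failure_ttl_s)
        return key
```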

 

WLCG IAM testbed

@Katy Ellis

The WLCG IAM testbed appears quite ‘limited’ in some of the VO-based testing.
Also, this runs only on dev-gw4.
Should we aim to move this to a production host, or keep testing on dev-gw4 for now?

 

Understanding CMSD Load Balancing

@Thomas Byrne

 

 

SKA Gateway box

@James Walder

https://stfc.atlassian.net/wiki/spaces/UK/pages/215941180

 

 

 

JW to prepare a summary of the plans for 2024

 

 

on GGUS:

Site reports

Lancaster - Moved to 5.6.7, leaving fireflies on. No issues (touch wood). Added more gateways into the “xroot cluster” - up to 7 now. The notes about cms.sched/perf settings from January were very useful to fall back on, so thanks!

 

Glasgow - On 5.6.3; will move to 5.6.7 post-DC.
Firefly activity, switched on for the DC (not specifically for Glasgow), caused segfaults; now disabled.

Ceph cluster work will wait until the end of DC24.

 

 

 

 Action items

  • @James Walder to schedule a ‘hackathon’ within a F2F to have a session on architectural planning.

  • @James Walder to prepare an outline of the expected roadmap for XRootD developments in 2024.


 

 Decisions

  1. @Thomas, Jyothish (STFC,RAL,SC) to look at implementing CMSD for the ALICE gateways and to document / identify any bottlenecks in the process.