2024-02-15 Meeting Notes
Date
Feb 15, 2024
Participants
@James Walder
@Ian Johnson
@Alexander Rogovskiy
@Thomas, Jyothish (STFC,RAL,SC)
@Thomas Byrne
Lancs: Steven, Gerard, Matt
Glasgow: Sam
Apologies: @Katy Ellis
CC:
Goals
List of Epics
New tickets
Consider new functionality / items
Detailed discussion of important topics
Site report activity
Discussion topics
Current status of Echo Gateways / WNs testing
Recent sandboxes for review / deployments:
Item | Presenter | Notes |
---|---|---|
Operational Issues | @Thomas, Jyothish (STFC,RAL,SC) |
|
|
Gateways and WNs: | @Thomas, Jyothish (STFC,RAL,SC) | CMSD load balancing changes: reporting frequency increased to once every 3 s (decision cycle of 6 s). The resulting load pattern converged more, even under DC24-level loads |
|
Data Challenge Status / observations |
| Plots shown (not reproduced here): LHCb, ECHO → Antares link; individual gateway throughput, EOS → ECHO link; individual gateway performance, ECHO → Antares link; new vs old network gateways and throughput / number of transfers, EOS → ECHO link
|
|
Rocky 8 migration planning |
| Status and issues with Rocky 8 migration |
|
Summary from S&C week Discussions with Andy H and Brian B | @Thomas, Jyothish (STFC,RAL,SC) | XRootD workshop agenda to be confirmed. High system load is unexpected; try monitoring running processes during this period. One possibility might be disk IO locking. Suggested using the throttle plugin with the limits turned off to get the printouts. Officially supported OSes: el7-8, alma9. openssl3 was causing some issues, but the package is available. Load balancing: Brian uses static weighting with global sharing (each server gets a fixed share of total transfers); another possibility is to use heartbeat skew as a metric. Checksums: in-flight checksums ongoing; Andy remarked that these would not confirm the integrity of the file at the destination. Tokens: token redaction ETA by spring (March). PR with bugfixes sent to xrootd, to be included in 5.7.0 (https://github.com/xrootd/xrootd/pull/2152). Weird behaviour under investigation: HTTP stops being responsive under small bursts of high load. Gateways randomly stop authenticating; the error seen was a malformed CA. Next time it happens, check vomsdata::check_from_file in gdb to debug; the likely cause is issues with loading CA chains.
|
|
Planning for ALICE CMSD redirection |
| Redirection is needed for the 3 ALICE gateways, to move away from the current DNS round-robin. Options: replicate the general setup (keepalived two-host redundancy), which would give 2 managers for 3 servers
|
|
ECHO File transfer / throughput studies | @Katy Ellis |
|
|
Checksums fixes | @Alexander Rogovskiy @Thomas, Jyothish (STFC,RAL,SC) | The patch that ignores the stale checksum check performs similarly to the checksum library plugin. |
|
Prefetch studies and WN changes | @Alexander Rogovskiy |
|
|
Deletion studies through RDR | @Ian Johnson | Compared with the last measurements (two weeks ago), concurrent test deletions are taking longer during DC24 (as expected), with quite a variation in the timings observed. Deletion rates have decreased (all rates are for 1000 files with 10 deletion threads). ATLAS wish to have a deletion rate of 4 Hz for 3 GiB files; the current test rate is still faster than this (cf. 21 Hz in the last measurement before DC24). Plots shown (not reproduced here): deletion time distributions for 1, 3, and 6 GiB files, and the same timings on a common Y-axis scale. With the longest deletion times in the sample being 42.9 s for 1 GiB files and 97.4 s for 6 GiB files, I expect to see many deletions timing out, depending on the ECHO loading. Follow-up from the 01/02 meeting: check timings for slow deletions (as above); measure deletion rate for 6 GiB files (as above); testing with 1 KiB files works up to 4000 files (deletion client on two VMs) but fails with 10000 files, which still needs investigating (not done yet). |
|
Tokens Status | @Thomas, Jyothish (STFC,RAL,SC) @Katy Ellis | WLCG token create and modify scopes must include permissions to create and stat the superfolders of that path (the xrootd fix is included in the stat permissions patch). 'Timeout' errors in token auth (permission denied, timeout was reached) were seen during DC24. They seem to be caused by overloading the IAM servers during token deserialization in scitokens-cpp, which fetches the public key too often. |
|
WLCG IAM testbed | @Katy Ellis | The WLCG IAM testbed appears quite ‘limited’ in some of the VO-based testing. |
|
Understanding CMSD Loadbalancing | @Thomas Byrne |
|
|
SKA Gateway box | @James Walder |
| |
|
| JW to prepare a summary of the plans for 2024 |
|
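The deletion-rate figures in the Deletion studies item can be related to wall-clock time with some simple arithmetic. The sketch below is illustrative only: the function names and the 60 s client timeout are assumptions, not taken from the test setup; only the 4 Hz ATLAS target, the 1000-file sample size, and the 42.9 s / 97.4 s maxima come from the notes.

```python
TARGET_HZ = 4.0  # ATLAS target deletion rate for 3 GiB files (from the notes)


def deletion_rate_hz(n_files: int, wall_time_s: float) -> float:
    """Aggregate deletion rate: files deleted per second of wall-clock time."""
    if wall_time_s <= 0:
        raise ValueError("wall_time_s must be positive")
    return n_files / wall_time_s


def max_wall_time_s(n_files: int, target_hz: float = TARGET_HZ) -> float:
    """Longest total wall-clock time that still meets the target rate."""
    if target_hz <= 0:
        raise ValueError("target_hz must be positive")
    return n_files / target_hz


def timeout_fraction(durations_s: list[float], timeout_s: float) -> float:
    """Fraction of individual deletions that exceed a client-side timeout."""
    if not durations_s:
        raise ValueError("durations_s must not be empty")
    return sum(d > timeout_s for d in durations_s) / len(durations_s)


if __name__ == "__main__":
    # 1000 files must finish within this many seconds to sustain 4 Hz.
    print(max_wall_time_s(1000))
    # Hypothetical per-file timings; 42.9 and 97.4 are the observed maxima,
    # the 60 s timeout is an assumed value for illustration.
    print(timeout_fraction([10.0, 42.9, 97.4], timeout_s=60.0))
```

For 1000 files, the 4 Hz target corresponds to a total of at most 250 s regardless of the number of deletion threads; the thread count only changes how many individual deletions overlap within that window.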
on GGUS:
Site reports
Lancaster - Moved to 5.6.7, leaving fireflies on. No issues (touch wood). Added more gateways into the “xroot cluster” - up to 7 now. The notes about cms.sched/perf settings from January were very useful to fall back on so thanks!
Glasgow - On 5.6.3; will move to 5.6.7 post-DC.
Firefly activity switched on for DC (not specifically Glasgow) caused segfaults; now disabled.
Ceph cluster work will wait until the end of DC24.
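On the load-balancing theme (the S&C week summary and the Lancaster cms.sched/perf note), "static weighting with global sharing" can be pictured with a small sketch. This is one possible reading of the idea, not Brian's configuration and not how CMSD implements selection: the class name, the server names, and the use of cumulative transfer counts are all assumptions made for illustration.

```python
class StaticShareBalancer:
    """Toy selector: each server is given a fixed share of all transfers.

    Hypothetical illustration only; it counts cumulative assignments and
    ignores live load, which a real redirector would also consider.
    """

    def __init__(self, weights: dict[str, float]):
        if not weights or any(w <= 0 for w in weights.values()):
            raise ValueError("weights must be non-empty and positive")
        total = sum(weights.values())
        # Normalise the static weights into fixed shares that sum to 1.
        self.share = {name: w / total for name, w in weights.items()}
        self.assigned = {name: 0 for name in weights}

    def pick(self) -> str:
        """Pick the server furthest below its share of the global total."""
        total_after = sum(self.assigned.values()) + 1
        best = max(
            self.share,
            key=lambda s: self.share[s] * total_after - self.assigned[s],
        )
        self.assigned[best] += 1
        return best


if __name__ == "__main__":
    # Assumed example: gw1 is weighted to take half of all transfers.
    balancer = StaticShareBalancer({"gw1": 2, "gw2": 1, "gw3": 1})
    for _ in range(400):
        balancer.pick()
    print(balancer.assigned)
```

The design choice here is that the share is defined against the global total rather than against each server's own load reports, so the distribution does not depend on how often servers report.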
Action items
@James Walder to schedule a ‘hackathon’ within a F2F to have a session on architectural planning.
@James Walder to prepare an outline of the expected roadmap for XRootD developments in 2024.
Decisions
- @Thomas, Jyothish (STFC,RAL,SC) to look at implementing CMSD for the ALICE Gateways and to document / identify any bottlenecks in the process