Item

Presenter

Notes

Operational Issues

Thomas, Jyothish (STFC,RAL,SC)

Gateways and WNs:
- Current status and upcoming changes

Thomas, Jyothish (STFC,RAL,SC)

CMSD load balancing changes: reporting frequency increased to every 3s (decision cycle of 6s);
fuzz reduced from 15 to 10

The resulting load pattern converged more, even under DC24-level loads
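
To illustrate why a smaller fuzz narrows the set of gateways considered interchangeable, here is a minimal sketch. It assumes fuzz is the load difference within which two servers are treated as equally loaded; the real cmsd selection logic is more involved, and the gateway names and load figures are made up for illustration.

```python
import random

def pick_server(loads, fuzz, rng=random):
    """Pick a server, treating loads within `fuzz` of the lowest load as equal.

    Simplified model only: `loads` maps a (hypothetical) server name to its
    reported load, and the choice among "equal" servers is random.
    """
    lowest = min(loads.values())
    candidates = sorted(name for name, load in loads.items()
                        if load - lowest <= fuzz)
    return rng.choice(candidates), candidates

# Hypothetical load reports from four gateways.
loads = {"gw1": 40, "gw2": 48, "gw3": 53, "gw4": 70}

_, wide = pick_server(loads, fuzz=15)
_, narrow = pick_server(loads, fuzz=10)
print(wide)    # candidates with fuzz 15
print(narrow)  # candidates with fuzz 10
```

With the lower fuzz, fewer moderately loaded servers are treated as equivalent to the least loaded one, so new transfers are steered more strongly towards the least loaded gateways.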

Data Challenge Status / observations

LHCB:
EOS->ECHO link:

echo.png

ECHO->Antares link:

antares.png

Individual gateway throughput, EOS->ECHO link:

eos_echo_gw_by_net_dis.png eos_echo_gw_old_net_dis.png

Individual gateway performance (ECHO → Antares link):

echo_antares_gw_by_net_dis.png echo_antares_gw_old_net_dis.png

New vs old network gateways and throughput/Number of Transfers, EOS-ECHO link:

eos_echo_thr_by_net.png eos_echo_not_by_net.png

Rocky 8 migration planning

Status and issues with Rocky 8 migration

Summary from S&C week Discussions with Andy H and Brian B

Thomas, Jyothish (STFC,RAL,SC)

XRootD workshop agenda to be confirmed

high system load is unexpected; try monitoring running processes during this period. One possibility might be disk IO locking. Suggestion: use the throttle plugin with the limits turned off to get the printouts

officially supported OSes: el7-8, alma9. openssl3 was causing some issues, but the package is available

load balancing: Brian uses static weighting with global sharing (each server gets a fixed share of total transfers). Another possibility is to use heartbeat skew as a metric
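
A minimal sketch of what static weighting with global sharing could look like, assuming each server has a fixed weight and each new transfer goes to the server furthest below its share of all transfers placed so far. The function, server names and weights are hypothetical; this is not taken from Brian's setup or from the cmsd code.

```python
def next_server(weights, served):
    """Return the server furthest below its fixed share of all transfers.

    `weights` maps server name -> static weight; `served` maps server name ->
    number of transfers already assigned. Both are illustrative structures.
    """
    total_weight = sum(weights.values())
    total_served = sum(served.values()) + 1  # include the transfer being placed

    def deficit(name):
        target = total_served * weights[name] / total_weight
        return target - served[name]

    # sorted() makes tie-breaking deterministic (alphabetical order).
    return max(sorted(weights), key=deficit)

# Hypothetical weights: gw1 should take half of all transfers.
weights = {"gw1": 2, "gw2": 1, "gw3": 1}
served = {name: 0 for name in weights}
for _ in range(400):
    served[next_server(weights, served)] += 1
print(served)
```

Because the shares are fixed, the distribution does not react to the measured load on each server, which is the main difference from load-based metrics such as heartbeat skew.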

Checksums - work on in-flight checksums is ongoing; Andy remarked that these would not confirm the integrity of the file at the destination.
The reason why default checksums didn’t work at RAL might have been due to the stale checksum check - confirmed with testing.

tokens - token redaction ETA by spring (March). PR with bugfixes sent to xrootd, to be included in 5.7.0
https://github.com/xrootd/xrootd/pull/2184

https://github.com/xrootd/xrootd/pull/2152

weird behaviour under investigation - http stops being responsive under small bursts of high load

gws randomly stop authenticating - the error seen was "malformed CA". Next time it happens, check vomsdata::check_from_file in gdb to debug. Likely an issue with loading CA chains

Planning for ALICE CMSD redirection

redirection needed for the 3 ALICE gws, to move away from the current DNS RR.

options:

replicating the general setup (keepalived 2 host redundancy) - we’d have 2 managers for 3 servers
CMS AAA approach - 1 manager only - no redundancy. CMS can get away with it due to redundancy through other sites’ managers

ECHO File transfer / throughput studies

Katy Ellis

Checksums fixes

Alexander Rogovskiy; Thomas, Jyothish (STFC,RAL,SC)

image-20240214-094143.png image-20240214-094135.png image-20240214-094154.png

A patch ignoring the stale checksum check performs similarly to the checksum library plugin.

Prefetch studies and WN changes

Alexander Rogovskiy

Deletion studies through RDR

Ian Johnson

Compared to last measurements (two weeks ago), concurrent test deletions are taking longer during DC24 (as expected). Quite a variation in timings observed. Deletion rates decreased (all rates below are for 1000 files with 10 deletion threads):

Rates.png

Deletion times distributions for 1, 3, and 6 GiB files:

1-3-6-merged.png

Timings using the same Y axis scale:

1-3-6-lim-100-merged.png

With the longest deletion times from the sample above being 42.9s for 1 GiB files and 97.4s for 6 GiB files, I expect to see many deletions timing out, depending on the load on ECHO.
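
For reference, a measurement of this kind (a fixed set of files deleted by 10 concurrent threads, recording per-file timings and the overall rate) could be sketched as below. This is not the script used for the numbers above: the `delete` callable, the paths and the returned field names are hypothetical stand-ins, and a real test would call the storage client instead.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def timed_delete(delete, path):
    """Run one deletion and return how long it took, in seconds."""
    start = time.perf_counter()
    delete(path)
    return time.perf_counter() - start

def measure_deletions(delete, paths, threads=10):
    """Delete `paths` with a pool of `threads` workers and summarise timings."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=threads) as pool:
        durations = list(pool.map(lambda p: timed_delete(delete, p), paths))
    elapsed = time.perf_counter() - start
    return {
        "files": len(durations),
        "rate_per_s": len(durations) / elapsed if elapsed > 0 else float("inf"),
        "slowest_s": max(durations),
    }

# Stand-in for a real deletion call: just record the path.
deleted = []
paths = [f"/test/file_{i}" for i in range(1000)]  # hypothetical paths
stats = measure_deletions(deleted.append, paths)
print(stats["files"], "files deleted")
```

Comparing `slowest_s` against the client timeout gives a quick indication of how many deletions are at risk of timing out under a given load.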

From 01/02 mtg:

Check timings for slow deletions. As above.

Measure deletion rate for 6 GiB files. As above.

Testing with 1 KiB files works up to 4000 files (deletion client on two VMs), but fails with 10000 files. Investigating. Not done yet.

Tokens Status

Thomas, Jyothish (STFC,RAL,SC) Katy Ellis

WLCG token create and modify scopes must include permissions to create and stat superfolders of that path. (xrootd fix included in the stat permissions patch)

'timeout' errors in token auth (permission denied-timeout was reached) during DC24. Seems to be caused by overloading the IAM servers during token deserialization in scitokens-cpp. (it fetches the public key too often)
https://github.com/scitokens/scitokens-cpp/issues/80   - it retries fetching a failed public key on every attempt
https://github.com/scitokens/scitokens-cpp/issues/119 - same problem on missing caches
https://github.com/scitokens/scitokens-cpp/issues/97 - on short lived caches being the reason nodes need connectivity
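
The general pattern behind these issues can be shown with a small sketch: a key cache that remembers failed fetches for a short period, so that a burst of token validations does not turn into a burst of requests to the issuer. This is an illustration only, not how scitokens-cpp is implemented; the class, its parameters and the issuer URL are hypothetical.

```python
import time

class KeyCache:
    """Cache issuer public keys, and also remember recent fetch failures."""

    def __init__(self, fetch, ttl=600.0, retry_after=30.0, clock=time.monotonic):
        self.fetch = fetch              # callable: issuer URL -> key (hypothetical)
        self.ttl = ttl                  # how long a fetched key stays valid
        self.retry_after = retry_after  # how long a failure is remembered
        self.clock = clock
        self.entries = {}               # issuer -> (key or None, expiry time)

    def get(self, issuer):
        now = self.clock()
        entry = self.entries.get(issuer)
        if entry is not None and now < entry[1]:
            if entry[0] is None:
                raise LookupError(f"recent key fetch for {issuer} failed; not retrying yet")
            return entry[0]
        try:
            key = self.fetch(issuer)
        except Exception as exc:
            # Negative caching: do not contact the issuer again until retry_after passes.
            self.entries[issuer] = (None, now + self.retry_after)
            raise LookupError(f"could not fetch key for {issuer}") from exc
        self.entries[issuer] = (key, now + self.ttl)
        return key

calls = []

def failing_fetch(issuer):
    """Stand-in for an unreachable or overloaded issuer."""
    calls.append(issuer)
    raise ConnectionError("issuer unreachable")

fake_now = [0.0]  # controllable clock for the example
cache = KeyCache(failing_fetch, clock=lambda: fake_now[0])
for _ in range(100):
    try:
        cache.get("https://iam.example.invalid/")
    except LookupError:
        pass
print(len(calls))  # number of requests that actually reached the issuer
```

In this sketch, 100 validation attempts within the retry window result in a single request to the issuer; without the negative entry every attempt would trigger a new fetch.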

WLCG IAM testbed

Katy Ellis

WLCG IAM testbed appears quite ‘limited’ in some of the VO based testing.
Also this runs only on dev-gw4.
Should we aim to move this to a production host, or keep testing on dev-gw4 for now?

Understanding CMSD Loadbalancing

Thomas Byrne

SKA Gateway box

James Walder

/wiki/spaces/UK/pages/215941180

JW to prepare a summary of the plans for 2024
