2022-08-17 Meeting notes

Date

Aug 17, 2022

Participants

@James Walder
@Alison Packer
@Thomas, Jyothish (STFC,RAL,SC)
@Alastair Dewhurst
Lancs: Matt, Gerard, Steven
Glasgow: Sam

Goals

List of Epics
New tickets
Consider new functionality / items
Detailed discussion of important topics
Site report activity

Discussion topics

https://stfc.atlassian.net/jira/software/c/projects/XRD/boards/26/roadmap

Item	Presenter	Notes

5.5.0-rcX status / problems in testing	@Thomas, Jyothish (STFC,RAL,SC) Sam	Sam: appears OK (localgroupdisk permissions issues at Glasgow). Jyothish: 5.5.0 appears OK so far. XrdCeph bug discovered while setting up unit tests (https://stfc.atlassian.net/wiki/spaces/CD/pages/30998699): errors in XrdCeph related to the cluster object do not cause a clean restart, leaving the service in an invisible crash state. The status still reports OK, but all requests sent fail and cause the socket to go into CLOSE_WAIT.
CentOS 8: outstanding items?		Continuing to progress. See the Lancs site report for more details on the vector size problems, which are resolved in 5.5.0-rc2.
Space info reporting: https://stfc.atlassian.net/browse/XRD-21		Alice testing on dev-gw2: confirmed from Alice running tests on dev-gw2. JW to review the PR, and will confirm from Atlas once on a non-Slice node. Reviewed the Aquilon config with Rob, with some updates. Alice tests OK; sandbox ready and deployment in preparation (aim: Monday).
https://stfc.atlassian.net/browse/XRD-26		(Not running at the moment due to dev-gw2 testing).
https://stfc.atlassian.net/browse/XRD-27	James	Scripts created to extract problematic files
Deletions / current understanding / necessary changes		Options: (1) horizontal scaling, i.e. spread deletes across all available gateways; (2) add CMSD (when ready) to ensure load balancing (at what extra cost?); (3) offload to external tooling to manage the deletes asynchronously. Ideas below (JW: currently ‘playing’ with a test implementation).
“Unified” xrootd status / problems?		Avoid TPC for xroot transfers on the webdav gateways
xrdtpc.sh script without a proxy (is it possible?) xrd.tpc redirect (to xrootd hosts, or some other proxy … )		Did not find a solution yet (see above).
What are we going to do with Vector Reads now?	All	Removal of locking did not appear to help significantly. Work ongoing with EL8 and recent 5.X releases (both recommendations from Rob’s document). Rob C’s script is currently the only metric for LHCb to compare / verify against. Options: (1) write a proper striper-vector-read method; (2) implement one of Andreas’ options; (3) fall back on buffering / range coalescing in the short term, until a correct fix is available. Also: collect data on what is really needed; the socket timeouts (etc.) should produce and propagate sensible error messages, i.e. concentrate on the cases that are ‘broken’; Jyothish’s work on ‘close-wait + connection resilience’; quantify the ‘slowness’ of the vector reads (and its impact).
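
As an illustration of the short-term "range coalesce" fallback mentioned for vector reads, here is a minimal sketch, not XRootD or XrdCeph code. The names Range, coalesce and max_gap are hypothetical, and the gap threshold is an assumed tuning knob. It only shows the idea of merging nearby byte ranges so that many small reads become fewer, larger ones.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// One element of a vector read: the byte range [offset, offset + length).
// (Hypothetical type for illustration only.)
struct Range {
    std::uint64_t offset;
    std::uint64_t length;
};

// Merge ranges that overlap, touch, or lie within max_gap bytes of each
// other. The result is sorted by offset and covers every requested byte,
// possibly plus some unrequested bytes in the bridged gaps.
std::vector<Range> coalesce(std::vector<Range> ranges, std::uint64_t max_gap) {
    std::sort(ranges.begin(), ranges.end(),
              [](const Range& a, const Range& b) { return a.offset < b.offset; });

    std::vector<Range> merged;
    for (const Range& r : ranges) {
        if (r.length == 0) continue;  // nothing to read
        if (!merged.empty()) {
            Range& last = merged.back();
            const std::uint64_t last_end = last.offset + last.length;
            if (r.offset <= last_end + max_gap) {
                // Close enough: extend the previous range to cover this one.
                const std::uint64_t end = std::max(last_end, r.offset + r.length);
                last.length = end - last.offset;
                continue;
            }
        }
        merged.push_back(r);
    }
    return merged;
}
```

For example, coalesce({{0, 100}, {150, 50}, {1000, 10}}, 64) yields two ranges, {0, 200} and {1000, 10}. The trade-off is that the bridged gap (bytes 100 to 149 here) is read and then discarded, so max_gap would need tuning against real access patterns.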

Site reports

Glasgow:

Lancaster

Concerning the vector::reserve crashes, we had submitted two pull requests:

Modify vector's size instead of capacity to avoid bounds-checking failures #1630 - This replaced several uses of std::vector::reserve with resize, avoiding the assertion failure. Three places fixed:
1. macaroons
2. TPC streams
3. throttle manager
Vector cleared after use so it can be shrunk. #1639 - This fixed a consequence of #1630, in that some buffers might not relinquish their allocations as soon as expected (so a non-functional failure).
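
To make the reserve/resize distinction concrete, here is a minimal sketch. It is not the code from #1630 or #1639, and the function names are hypothetical. It assumes the assertion comes from a standard library built with bounds checking (for example libstdc++ with _GLIBCXX_ASSERTIONS), where operator[] checks the index against size(), not capacity().

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// reserve() only allocates capacity; size() stays 0, so buf[0] would be
// out of bounds and trips the assertion in a bounds-checked build.
std::vector<char> reserved_only(std::size_t n) {
    std::vector<char> buf;
    buf.reserve(n);
    return buf;
}

// resize() changes size(), so indices 0..n-1 are valid (zero-initialised).
std::vector<char> sized(std::size_t n) {
    std::vector<char> buf;
    buf.resize(n);
    return buf;
}

// clear() alone keeps the allocation. Swapping with an empty temporary is
// one common way to hand the memory back; this is an illustrative choice,
// not necessarily what #1639 does.
void release(std::vector<char>& buf) {
    std::vector<char>().swap(buf);
}
```

With these helpers, reserved_only(16) has size 0 (so indexing it is invalid), sized(16) can be indexed safely, and release() leaves the vector empty without holding on to the old buffer.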

Order of application:

Tag v5.4.0 had neither of these.
v5.4.1 included someone else's fix just for macaroons. One byte is pushed back, so that an expression that accesses [0] doesn't fail the assertion.
v5.4.3 includes #1639, even though it's only needed after #1630. Probably harmless.
v5.5.0-rc2 includes #1630. The other fix in 5.4.1 has been removed.

As a result, 5.4.1 got us past the macaroon problem, but not the TPC one.

-Matt (now having sound issues)

-We do have a “new” rocky8 testbed node in place to test this out, but haven’t had a chance to try it with our own build.

-Redirector work almost done, but on hold until after the rucio settings change has settled down (see https://its.cern.ch/jira/browse/ADCINFR-239)

Currently the rucio Symlink change is looking perilous.

-errors like: Failed to stage-out file: mc16_13TeV:log.29979569._000296.job.log.tgz.1 to UKI-NORTHGRID-LANCS-HEP-CEPH_DATADISK, module 'rucio.rse.protocols.posix' has no attribute 'Symlink'")]:failed to transfer files using copytools=['rucio']

Action items

Reminder from Sam to update GGUS tickets

James to review space reporting PR
