2022-08-17 Meeting notes



Aug 17, 2022


  • @James Walder

  • @Alison Packer

  • @Thomas, Jyothish (STFC,RAL,SC)

  • @Alastair Dewhurst

  • Lancs: Matt Gerard Steven

  • Glasgow: Sam


  • List of Epics

  • New tickets

  • Consider new functionality / items

  • Detailed discussion of important topics

  • Site report activity

 Discussion topics







5.5.0-rcX status / problems in testing

@Thomas, Jyothish (STFC,RAL,SC) Sam

Sam - appears OK (localgroupdisk permissions issues at Glasgow).
Jyothish - 5.5.0 appears OK so far.
XrdCeph bug discovered while setting up unit tests (https://stfc.atlassian.net/wiki/spaces/CD/pages/30998699):
Errors in XrdCeph related to the cluster object do not cause a clean restart, leaving the service in an invisible crash: status still reports OK, but all requests fail and leave the socket in CLOSE_WAIT.
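Since the service status still reports OK in this failure mode, one possible external check is to count sockets stuck in CLOSE_WAIT. The sketch below is only an illustration, not part of XrdCeph or XRootD: it assumes a Linux host and input in the /proc/net/tcp layout, where the fourth whitespace-separated field is the socket state in hex and 08 means CLOSE_WAIT. The function name count_close_wait is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

// Count sockets in CLOSE_WAIT from text in the /proc/net/tcp layout.
// Assumption: fields per line are "sl local_address rem_address st ...",
// and state "08" is CLOSE_WAIT (Linux TCP state numbering).
std::size_t count_close_wait(std::istream& in) {
    std::string line;
    std::getline(in, line);  // skip the column header line
    std::size_t count = 0;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        std::string sl, local, remote, state;
        if (fields >> sl >> local >> remote >> state && state == "08") {
            ++count;
        }
    }
    return count;
}
```

In use, the stream would be an std::ifstream opened on /proc/net/tcp (and /proc/net/tcp6 for IPv6); a count that keeps growing while the service reports OK would hint at the state described above.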

Centos 8: outstanding items?


Continuing to progress.

(See Lancs Site report for more details on vector size problems, but resolved in 5.5.0-rc2).

space info reporting:



Alice testing on dev-gw2: confirmed by Alice running tests on dev-gw2.
JW to review the PR, and will confirm from Atlas once on a non-Slice node.
Reviewed the Aquilon config with Rob, with some updates.

Alice tests OK - sandbox ready and deployment in prep (aim Monday).



(Not running at the moment due to dev-gw2 testing).



Scripts created to extract problematic files

Deletions / current understanding / necessary changes



  • Horizontal scaling, i.e. spread deletes across all available gateways

  • Add (when ready) CMSD to ensure load balancing (at what extra cost?)

  • Offload to external tooling to manage the deletes asynchronously

    • Ideas below

    • (JW: currently ‘playing’ with a test implementation).
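As a rough illustration of the first option (spreading deletes across all available gateways), a round-robin split of a delete list could look like the sketch below. This is not the test implementation mentioned above; the function name spread_deletes and the idea of pre-batching paths per gateway are assumptions made for the example only.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Split a list of paths to delete into one batch per gateway, round-robin.
// batches[i] holds the paths that gateway i would be asked to delete.
std::vector<std::vector<std::string>> spread_deletes(
        const std::vector<std::string>& paths, std::size_t n_gateways) {
    assert(n_gateways > 0);  // at least one gateway is required
    std::vector<std::vector<std::string>> batches(n_gateways);
    for (std::size_t i = 0; i < paths.size(); ++i) {
        batches[i % n_gateways].push_back(paths[i]);
    }
    return batches;
}
```

Round-robin balances the number of requests but not their cost; balancing on actual load is what the CMSD option would add.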


“Unified” xrootd status / problems?


Avoid TPC for xroot transfers on the webdav gateways

xrdtpc.sh script without a proxy (is it possible?)
xrd.tpc redirect (to xrootd hosts, or some other proxy … )


Did not find a solution yet (see above).

What are we going to do with Vector Reads now?


Removal of locking did not appear to help significantly.

Work ongoing with EL8 and recent 5.X releases (both recommendations from Rob’s document).

Rob C’s script is currently the only metric for LHCb to compare / verify against.


  • Write a proper striper-vector-read method

  • Implement one of Andreas’ options

  • Fall back on buffering / range coalesce in the short term, until correct fix available.

    • Collect data on what really is needed
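For the short-term fallback, range coalescing means merging the (offset, length) chunks of a vector read into fewer, larger contiguous reads before they reach the storage layer. The following is a minimal sketch under assumptions: the Range struct, the coalesce function and the max_gap parameter are hypothetical names, not XRootD or XrdCeph API, and offset + length is assumed not to overflow.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// One requested chunk of a vector read (hypothetical type for this sketch).
struct Range {
    std::uint64_t offset;
    std::uint64_t length;
};

// Merge ranges that overlap or lie within max_gap bytes of each other.
// Returns the merged ranges sorted by offset; zero-length ranges are dropped.
std::vector<Range> coalesce(std::vector<Range> ranges, std::uint64_t max_gap) {
    std::sort(ranges.begin(), ranges.end(),
              [](const Range& a, const Range& b) { return a.offset < b.offset; });
    std::vector<Range> merged;
    for (const Range& r : ranges) {
        if (r.length == 0) continue;
        if (!merged.empty()) {
            Range& last = merged.back();
            assert(r.offset >= last.offset);  // holds because input is sorted
            const std::uint64_t last_end = last.offset + last.length;
            if (r.offset <= last_end + max_gap) {
                // Close enough: extend the previous range if r reaches further.
                const std::uint64_t r_end = r.offset + r.length;
                if (r_end > last_end) last.length = r_end - last.offset;
                continue;
            }
        }
        merged.push_back(r);
    }
    return merged;
}
```

A larger max_gap trades extra bytes read for fewer requests, which is one of the quantities the data collection noted above would need to inform.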


  • Socket timeouts (etc.) should produce sensible error messages and propagate them up

    • i.e. concentrate on the cases that are ‘broken’

    • Jyothish’s work on ‘close-wait + connection resilience’

  • Quantify the ‘slowness’ of the vector reads (and the impact).



Site reports



Concerning the vector::reserve crashes, we had submitted two pull requests:

  1. Modify vector's size instead of capacity to avoid bounds-checking failures #1630 - This replaced several uses of std::vector::reserve with resize, avoiding the assertion failure.  Three places fixed:

    1. macaroons

    2. TPC streams

    3. throttle manager

  2. Vector cleared after use so it can be shrunk. #1639 - This fixed a consequence of #1630, in that some buffers might not relinquish their allocations as soon as expected (so a non-functional failure).

Order of application:

  1. Tag v5.4.0 had neither of these.

  2. v5.4.1 included someone else's fix just for macaroons.  One byte is pushed back, so that an expression that accesses [0] doesn't fail the assertion.

  3. v5.4.3 includes #1639, even though it's only needed after #1630.  Probably harmless.

  4. v5.5.0-rc2 includes #1630.  The other fix in 5.4.1 has been removed.

As a result, 5.4.1 got us past the macaroon problem, but not the TPC one.
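The distinction behind #1630 can be sketched as follows. This is an illustration only, not the code from either pull request, and the function names are hypothetical. reserve changes only the capacity, so size() stays 0 and accessing [0] is out of bounds, which a bounds-checking build (for example libstdc++ with _GLIBCXX_ASSERTIONS, assumed here to be what triggers the assertion) will reject. resize changes the size, so the elements exist.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// reserve() grows capacity only: size() is still 0, so buf[0] would be
// out of bounds and fails the assertion in a bounds-checking build.
std::vector<char> make_buffer_reserve(std::size_t n) {
    std::vector<char> buf;
    buf.reserve(n);
    return buf;
}

// resize() grows the size: buf[0] .. buf[n-1] are valid, zero-initialised.
std::vector<char> make_buffer_resize(std::size_t n) {
    std::vector<char> buf;
    buf.resize(n);
    assert(buf.size() == n);
    return buf;
}

// After use, clear the buffer and ask for the allocation to be released
// (the kind of follow-up the notes describe for #1639).
// shrink_to_fit() is a non-binding request to the implementation.
void release_buffer(std::vector<char>& buf) {
    buf.clear();
    buf.shrink_to_fit();
}
```

The v5.4.1 workaround described above (pushing back one byte) makes [0] valid for the macaroon case only, which is consistent with TPC still failing there.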


-Matt (now having sound issues)

-We do have a “new” rocky8 testbed node in place to test this out, but haven’t had a chance to try out with our own build.

-Redirector work almost done, but on hold until after the rucio settings change has settled down (see https://its.cern.ch/jira/browse/ADCINFR-239)

  • Currently the rucio Symlink change is looking perilous.

-errors like: Failed to stage-out file: mc16_13TeV:log.29979569._000296.job.log.tgz.1 to UKI-NORTHGRID-LANCS-HEP-CEPH_DATADISK, module 'rucio.rse.protocols.posix' has no attribute 'Symlink'")]:failed to transfer files using copytools=['rucio'] 



 Action items

Reminder from Sam to update GGUS tickets
James to review space reporting PR

