2022-08-17 Meeting notes

 Date

Aug 17, 2022

 Participants

  • @James Walder

  • @Alison Packer

  • @Thomas, Jyothish (STFC,RAL,SC)

  • @Alastair Dewhurst

  • Lancs: Matt Gerard Steven

  • Glasgow: Sam

 Goals

  • List of Epics

  • New tickets

  • Consider new functionality / items

  • Detailed discussion of important topics

  • Site report activity

 Discussion topics

https://stfc.atlassian.net/jira/software/c/projects/XRD/boards/26/roadmap

Item

Presenter

Notes

5.5.0-rcX status / problems in testing

@Thomas, Jyothish (STFC,RAL,SC) Sam

Sam: appears OK (localgroupdisk permissions issues at Glasgow).
Jyothish: 5.5.0 appears OK so far.
XrdCeph bug discovered while setting up unit tests (https://stfc.atlassian.net/wiki/spaces/CD/pages/30998699):
errors in XrdCeph related to the cluster object do not cause a clean restart, leaving the service in an invisible crash state: status still reports OK, but all requests fail and leave the socket in CLOSE_WAIT.
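Since the service status still reports OK in this failure mode, an external check on socket states is one possible way to notice it. The following is a minimal sketch, not part of XrdCeph or XRootD: count_close_wait is a hypothetical helper, and it assumes a Linux host where the kernel exposes IPv4 sockets in the /proc/net/tcp text format (state code 08 is CLOSE_WAIT) and well-formed input rows.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper: count sockets in CLOSE_WAIT on a given local port,
// reading a table in the Linux /proc/net/tcp text format from `table`.
std::size_t count_close_wait(std::istream& table, unsigned port) {
    assert(port <= 65535u);
    std::size_t count = 0;
    std::string line;
    std::getline(table, line);  // skip the header row
    while (std::getline(table, line)) {
        std::istringstream fields(line);
        std::string slot, local, remote, state;
        if (!(fields >> slot >> local >> remote >> state)) continue;
        // local address is HEXIP:HEXPORT; take the part after the last colon
        std::size_t colon = local.rfind(':');
        if (colon == std::string::npos) continue;
        unsigned long local_port =
            std::stoul(local.substr(colon + 1), nullptr, 16);
        if (local_port == port && state == "08") ++count;  // 08 = CLOSE_WAIT
    }
    return count;
}
```

In use, the stream would be an std::ifstream opened on /proc/net/tcp and the port would be whatever the gateway listens on (1094 is the usual XRootD port, but that is an assumption about the local setup). A count that keeps growing while the status check reports OK would match the symptom described above.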

Centos 8: outstanding items?

 

Continuing to progress.

(See Lancs Site report for more details on vector size problems, but resolved in 5.5.0-rc2).

space info reporting:

https://stfc.atlassian.net/browse/XRD-21

 

Alice testing on dev-gw2: confirmed by Alice running tests on dev-gw2.
JW to review the PR, and will confirm from Atlas once on a non-Slice node.
Reviewed the Aquilon config with Rob, with some updates.

Alice tests OK - sandbox ready and deployment in prep (aim Monday).

https://stfc.atlassian.net/browse/XRD-26

 

(Not running at the moment due to dev-gw2 testing).

https://stfc.atlassian.net/browse/XRD-27

James

Scripts created to extract problematic files

Deletions / current understanding / necessary changes

 

Options:

  • Horizontal scaling, i.e. spread deletes across all available gateways

  • Add (when ready) CMSD to ensure load balancing (at what extra cost?)

  • Offload to external tooling to manage the deletes asynchronously

    • Ideas below

    • (JW: currently ‘playing’ with a test implementation).
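As a rough illustration of the horizontal-scaling option, the sketch below splits a list of paths to delete into one batch per gateway in round-robin order. It is not the test implementation mentioned above: spread_deletes and the notion of a numbered gateway list are hypothetical, and no load or health information is taken into account (which is what CMSD would add).

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: assign each path to one of n_gateways batches in
// round-robin order, so every gateway receives a similar number of deletes.
std::vector<std::vector<std::string>> spread_deletes(
    const std::vector<std::string>& paths, std::size_t n_gateways) {
    assert(n_gateways > 0);  // at least one gateway must be available
    std::vector<std::vector<std::string>> batches(n_gateways);
    for (std::size_t i = 0; i < paths.size(); ++i) {
        batches[i % n_gateways].push_back(paths[i]);
    }
    return batches;
}
```

Each batch could then be handed to a separate worker talking to one gateway. Round-robin keeps batch sizes within one of each other, but it does not account for file size or gateway load.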

 

“Unified” xrootd status / problems?

 

Avoid TPC for xroot transfers on the webdav gateways

xrdtpc.sh script without a proxy (is it possible?)
xrd.tpc redirect (to xrootd hosts, or some other proxy … )

 

Did not find a solution yet (see above).

What are we going to do with Vector Reads now?

All

Removal of locking did not appear to help significantly.

Work ongoing with EL8 and recent 5.X releases (both recommendations from Rob’s document).

Rob C’s script is currently the only metric for LHCb to compare / verify against.

Options:

  • Write a proper striper-vector-read method

  • Implement one of Andreas’ options

  • Fall back on buffering / range coalesce in the short term, until correct fix available.

    • Collect data on what really is needed

 

  • The socket timeouts (etc.) should produce and propagate up (sensible) error messages

    • i.e. concentrate on the cases that are ‘broken’

    • Jyothish’s work on ‘close-wait + connection resilience’

  • quantify the ‘slowness’ of the vector reads (and impact).
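The short-term "buffering / range coalesce" fallback listed among the options can be sketched as below. This is a minimal illustration under stated assumptions, not the XrdCeph implementation: the Range type, the coalesce function and the max_gap threshold are all hypothetical names, and the sketch only plans which larger reads to issue (the caller would still slice the returned buffers back into the originally requested chunks).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical type: one requested chunk of a vector read.
struct Range {
    std::uint64_t offset;
    std::uint64_t length;
};

// Merge chunks that overlap or lie within max_gap bytes of each other, so
// that fewer, larger reads are issued to the backend.
std::vector<Range> coalesce(std::vector<Range> ranges, std::uint64_t max_gap) {
    std::sort(ranges.begin(), ranges.end(),
              [](const Range& a, const Range& b) { return a.offset < b.offset; });
    std::vector<Range> merged;
    for (const Range& r : ranges) {
        if (r.length == 0) continue;  // nothing to read for empty chunks
        // Input is sorted, so merged ranges never start after the next chunk.
        assert(merged.empty() || merged.back().offset <= r.offset);
        if (!merged.empty()) {
            Range& last = merged.back();
            std::uint64_t last_end = last.offset + last.length;
            if (r.offset <= last_end || r.offset - last_end <= max_gap) {
                std::uint64_t r_end = r.offset + r.length;
                last.length = std::max(last_end, r_end) - last.offset;
                continue;
            }
        }
        merged.push_back(r);
    }
    return merged;
}
```

A larger max_gap trades extra bytes read for fewer backend requests. Choosing it sensibly depends on the data collection mentioned above.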

 

 

Site reports

Glasgow:

Lancaster:

Concerning the vector::reserve crashes, we submitted two pull requests:

  1. Modify vector's size instead of capacity to avoid bounds-checking failures #1630 - This replaced several uses of std::vector::reserve with resize, avoiding the assertion failure.  Three places fixed:

    1. macaroons

    2. TPC streams

    3. throttle manager

  2. Vector cleared after use so it can be shrunk. #1639 - This fixed a consequence of #1630, in that some buffers might not relinquish their allocations as soon as expected (so a non-functional failure).
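The distinction behind the two pull requests can be sketched as follows. This is a minimal illustration, not the code from #1630 or #1639: fill_buffer and release_buffer are hypothetical helpers, and the comment about assertions assumes a libstdc++ build with _GLIBCXX_ASSERTIONS defined.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical helper: copy len bytes from src into a vector used as a buffer.
std::vector<char> fill_buffer(const char* src, std::size_t len) {
    assert(src != nullptr || len == 0);
    std::vector<char> buf;
    // buf.reserve(len) would only raise capacity(); size() would stay 0, so
    // buf[0] is out of bounds and fails the check in operator[] when
    // assertions are enabled (assumed: _GLIBCXX_ASSERTIONS).
    buf.resize(len);  // size() == len, elements value-initialised
    if (len > 0) std::memcpy(buf.data(), src, len);
    return buf;
}

// Hypothetical helper: give up a buffer's storage once it is finished with.
void release_buffer(std::vector<char>& buf) {
    buf.clear();          // size() becomes 0; capacity() is typically unchanged
    buf.shrink_to_fit();  // non-binding request to drop the unused capacity
}
```

After resize the vector owns len live elements, so it holds on to that storage until it is cleared and shrunk or destroyed. That is the kind of delayed release #1639 is described as addressing.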

Order of application:

  1. Tag v5.4.0 had neither of these.

  2. v5.4.1 included someone else's fix just for macaroons.  One byte is pushed back, so that an expression that accesses [0] doesn't fail the assertion.

  3. v5.4.3 includes #1639, even though it's only needed after #1630.  Probably harmless.

  4. v5.5.0-rc2 includes #1630.  The other fix in 5.4.1 has been removed.

As a result, 5.4.1 got us past the macaroon problem, but not the TPC one.

 

-Matt (now having sound issues)

-We do have a “new” rocky8 testbed node in place to test this out, but haven’t had a chance to try it with our own build.

-Redirector work almost done, but on hold until after the rucio settings change has settled down (see https://its.cern.ch/jira/browse/ADCINFR-239)

  • currently the Symlink rucio change is looking perilous. 

-errors like: Failed to stage-out file: mc16_13TeV:log.29979569._000296.job.log.tgz.1 to UKI-NORTHGRID-LANCS-HEP-CEPH_DATADISK, module 'rucio.rse.protocols.posix' has no attribute 'Symlink'")]:failed to transfer files using copytools=['rucio'] 

 

 

 Action items

Reminder from Sam to update GGUS tickets
James to review space reporting PR

 Decisions