2022-08-10 Meeting notes

 Date

Aug 10, 2022

 Participants

  • @James Walder

  • @Alison Packer

  • @Ian Johnson

  • @Thomas, Jyothish (STFC,RAL,SC)

  • Lancs: Matt Gerard Steven

  • Glasgow: Sam

 Goals

  • List of Epics

  • New tickets

  • Consider new functionality / items

  • Detailed discussion of important topics

  • Site report activity

 Discussion topics

https://stfc.atlassian.net/jira/software/c/projects/XRD/boards/26/roadmap

Item

Presenter

Notes

5.4.3 releases in Centos 7

  • Problems observed: (pgRead / pgWrites)

@Thomas, Jyothish (STFC,RAL,SC)

5.5.0-rc1 is being tested on a dev VM and appears to be working OK.
Feedback to the xrootd devs?

Sam is also testing a build of 5.5.0; the build appears to find python2 first.
- Some updates to cmake are required to make it find the relevant python3.
- Sam will create an issue once the changes are finalised …

Discovered https://stfc.atlassian.net/wiki/spaces/CD/pages/30998699 while setting up unit tests:
Errors in xrdceph related to the cluster object do not cause a clean restart, which locks the service into an invisible crash: the status still reports OK, but all requests fail and leave the socket in CLOSE_WAIT.
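
Because the service status still reports OK in this state, an external check on socket states is one possible way to notice it. The following is a minimal sketch, not something discussed in the meeting: it assumes a Linux host, text in the format of /proc/net/tcp (where state code 08 is CLOSE_WAIT), and a gateway listening on port 1094. The sample table and the port are illustrative assumptions.

```python
def count_close_wait(proc_net_tcp_text, port):
    """Count sockets in CLOSE_WAIT whose local port equals `port`.

    Expects text in the layout of Linux /proc/net/tcp: one header row,
    then rows whose 2nd field is HEXIP:HEXPORT and whose 4th field is
    the hex state code.
    """
    close_wait = "08"  # CLOSE_WAIT state code in /proc/net/tcp
    count = 0
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4:
            continue  # ignore blank or truncated rows
        local_port = int(fields[1].rsplit(":", 1)[1], 16)
        if local_port == port and fields[3] == close_wait:
            count += 1
    return count


# Illustrative sample only (assumption): 0x0446 is port 1094, 0x0016 is port 22.
sample = """\
  sl  local_address rem_address   st
   0: 0100007F:0446 0100007F:C350 08
   1: 0100007F:0446 0100007F:C351 01
   2: 0100007F:0016 0100007F:C352 08
"""

print(count_close_wait(sample, 1094))
```

On a real host the text would come from reading /proc/net/tcp (and /proc/net/tcp6 for IPv6); a steadily growing count on the service port would be the signal to investigate.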

Centos 8: outstanding items?

 

Workernodes with EL8 being prepared for deployment.
Some items like cephsum to be placed into EL8 rpms.

space info reporting:

https://stfc.atlassian.net/browse/XRD-21

 

Functionality implemented in xrdceph; current usage info comes from Ceph.
Two additional configurable parameters: ceph.quotapath and ceph.poolnames.
Quota info comes from a local JSON file read from quotapath (in a format similar to s3.echo.stfc.ac.uk/srr/storagesummary.json).
The default is /etc/xrootd/storagesummary.json.
Pool names are defined as a comma-separated string with a trailing comma, read from ceph.poolnames.
Tested as working in the dev setup; PR and RPM generated. The plan is to roll this out as v5.3.8 of xrootdceph.

  • Perhaps less relevant for Glasgow.
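
To illustrate the two inputs described above, here is a minimal Python sketch. It is not the xrdceph implementation. The JSON layout (storageservice / storageshares / name / totalsize) is an assumption based on the description "similar format" to the storage summary file, and the pool names are hypothetical.

```python
import json


def parse_pool_names(value):
    """Split a comma-separated pool list, tolerating the trailing comma."""
    return [name.strip() for name in value.split(",") if name.strip()]


def load_quotas(text):
    """Return {share name: total size in bytes} from a storage-summary-like
    JSON document. The key names used here are assumptions."""
    doc = json.loads(text)
    shares = doc.get("storageservice", {}).get("storageshares", [])
    return {share["name"]: share["totalsize"] for share in shares}


# Hypothetical values, for illustration only.
pool_setting = "atlas,cms,"
summary_text = json.dumps({
    "storageservice": {
        "storageshares": [
            {"name": "atlas", "totalsize": 1000},
            {"name": "cms", "totalsize": 500},
        ]
    }
})

quotas = load_quotas(summary_text)
for pool in parse_pool_names(pool_setting):
    print(pool, quotas.get(pool))
```

Filtering out empty entries means the trailing comma is accepted without producing an empty pool name.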

https://stfc.atlassian.net/browse/XRD-26

 

5.5.0-rc1 is still not compliant with the WLCG testbed:
Missing HTTP error code mapping for storage.create · Issue #1752 · xrootd/xrootd

map kXR_ItExists to HTTP 409 by ffurano · Pull Request #1753 · xrootd/xrootd

Deletions update

 

Ian looking at 1-8 GiB deletions against gwX.
Single file writes and deletes

No deletion timeouts observed.
Previous tests (with more parallel deletions) do show longer deletion times

https://stfc.atlassian.net/browse/XRD-27

James

Occasional failed deletes for Atlas due to 0-byte files with partial striper metadata.
(Likely triggered by recent consistency check).

Future of Deletions?

 

Options:

  • Horizontal scaling, i.e. spread deletes across all available gateways

  • Add CMSD (when ready) to ensure load balancing (at what extra cost?)

  • Offload to external tooling to manage the deletes asynchronously

    • Ideas below

    • (JW: currently ‘playing’ with a test implementation.)
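
As a rough illustration of the first and third options (spreading deletes across gateways, driven by an external asynchronous tool), here is a minimal Python sketch. It is not JW's test implementation. The gateway names, the paths, and the delete_fn callable are hypothetical; a real tool would call an actual delete client where fake_delete is used here.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import cycle


def assign_round_robin(paths, gateways):
    """Pair each path with a gateway, cycling through the gateways in order."""
    return list(zip(paths, cycle(gateways)))


def _try_delete(delete_fn, gateway, path):
    """Run one delete and report success as a boolean."""
    try:
        delete_fn(gateway, path)
        return True
    except Exception:
        return False


def delete_all(paths, gateways, delete_fn, max_workers=4):
    """Run delete_fn(gateway, path) for every path in a thread pool.

    Returns the (path, gateway) pairs whose delete raised an exception,
    so that the caller can retry or report them.
    """
    jobs = assign_round_robin(paths, gateways)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        outcomes = list(
            pool.map(lambda job: _try_delete(delete_fn, job[1], job[0]), jobs)
        )
    return [job for job, ok in zip(jobs, outcomes) if not ok]


def fake_delete(gateway, path):
    """Stand-in for a real delete call (assumption); fails for one path."""
    if path.endswith("zero_byte"):
        raise OSError("simulated failed delete")


paths = ["/pool/a", "/pool/b", "/pool/zero_byte"]
failed = delete_all(paths, ["gw1", "gw2"], fake_delete)
print(failed)
```

Round-robin assignment is the simplest possible policy; it ignores per-gateway load, which is the gap that CMSD-based balancing would be expected to fill.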

 

What are we going to do with Vector Reads now?

All

Removal of locking did not appear to help significantly.

Rob C’s script is currently the only metric for LHCb to compare / verify against.

Options:

  • Write a proper striper-vector-read method

  • Implement one of Andreas’ options

  • Fall back on buffering / range coalescing in the short term, until a correct fix is available.

    • Collect data on what really is needed
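
For the short-term fallback, range coalescing can be sketched as follows. This is a generic illustration in Python, not the XrdCeph code and not one of Andreas’ options: it merges (offset, length) read requests that overlap or whose gap is at most max_gap bytes, so that fewer and larger reads are issued. The max_gap parameter and the sample requests are assumptions.

```python
def coalesce_ranges(requests, max_gap=0):
    """Merge (offset, length) reads that overlap or whose gap is at most
    max_gap bytes. Returns the merged reads sorted by offset."""
    merged = []  # list of [start, end) pairs, end exclusive
    for offset, length in sorted(requests):
        end = offset + length
        if merged and offset - merged[-1][1] <= max_gap:
            # Close enough to the previous range: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([offset, end])
    return [(start, end - start) for start, end in merged]


# Hypothetical vector read: three chunks, the first two 50 bytes apart.
chunks = [(0, 100), (150, 50), (1000, 10)]
print(coalesce_ranges(chunks, max_gap=64))
```

A larger max_gap reduces the number of reads at the cost of fetching bytes nobody asked for, which is one reason to collect data on real access patterns before choosing a value.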

 

Site reports

Glasgow:
With some testing, it appears that the namelib must be present on the proxy (via pss.namelib) for the smoke tests to work.

Lancaster

Redirector adventures are going quite well - except when I tried to put in our new rocky 8 server into our test xroot cluster. It crashes on the http smoke test…

It is actually a little different from the old libmacaroons problems: the crash occurs when it gets to the second batch of tests (the TPC tests). The stack trace in the syslog also doesn’t mention libmacaroons at all.

This is xroot 5.4.3 on rocky 8.6

 

 Action items

Reminder from Sam to update GGUS tickets
James to review space reporting PR

 Decisions