2022-12-01 Meeting Notes

 Date

Oct 27, 2022

 Participants

  • @James Walder

  • @Ian Johnson

  • Glasgow: Sam

  • Lancs: Steven, Matt, Gerard

 Goals

  • List of Epics

  • New tickets

  • Consider new functionality / items

  • Detailed discussion of important topics

  • Site report activity

 

 Discussion topics

https://stfc.atlassian.net/jira/software/c/projects/XRD/boards/26/roadmap

Item

Presenter

Notes

 

Combining xrootd and webdav gateway aliases

 

Change made on Monday; gw4,5 kept as single-use hosts
https://elog.gridpp.rl.ac.uk/Tier1/10686
'Coincided' with large transfer requests:

Next step is to slowly release the FTS limits for ATLAS.

 

 

Alice space token development updates

@Ian Johnson

Code review found the need for better error reporting when the figure for the amount of disk space allocated to a pool cannot be retrieved from the extended attribute on the object “pool:__spaceinfo__” in target pools. This is now in place.

The StatLS code also needs to convert from the “raw” figures for disk space allocation and usage to VO-relevant figures by accounting for the erasure coding overhead. Assuming this is approx 8/11 for the ECHO and DEV clusters for now, but will allow a config file setting for other cluster layouts.
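
A minimal sketch of the raw-to-VO conversion described above, not the actual StatLS code. The 8/11 factor is taken from the note; the names `EC_USABLE_FRACTION` and `raw_to_usable` are hypothetical, and in practice the fraction would come from the planned config file setting.

```python
from fractions import Fraction

# Assumed default from the note: usable/raw is approx 8/11 on ECHO and DEV.
# Hypothetical name; other cluster layouts would override this from config.
EC_USABLE_FRACTION = Fraction(8, 11)


def raw_to_usable(raw_bytes: int, usable_fraction: Fraction = EC_USABLE_FRACTION) -> int:
    """Convert a raw pool figure (bytes) into the VO-relevant figure (bytes).

    Fraction keeps the arithmetic exact; the result is truncated to whole bytes.
    """
    return int(raw_bytes * usable_fraction)


if __name__ == "__main__":
    raw_allocated = 11 * 10**15  # illustrative value only, not a real pool size
    print(raw_to_usable(raw_allocated))
```

The same function would be applied to both the allocation and the usage figures so that the two stay comparable.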

Costin from ALICE had a look at the output of ‘xrdfs query space’ two weeks ago and appears keen to assist with the trials, e.g. when the functionality is moved to ceph-dev-gw2 (this is an ALICE “testbed” for some purposes).

 

Slow stats

 

As previously mentioned, the ‘slow stats’ issue from LHCb appears to be more related to slow checksums.

Plot showing the time for 100 files to run through the ‘lhcb’ stat (+checksum) code, comparing SARA and RAL

James constructed a proof-of-concept client-server checksum (metadata) tool.

Server:
* Python, with a multi-threaded TCP server that keeps multiple rados cluster connections open in a pool
* Hopefully, can keep this as multithreaded only (and not move to a multi-processing model).
* Could use most of the existing codebase (but not yet applied)
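
A rough sketch of the server shape described above, using only the Python standard library; it is not the proof-of-concept code itself. The class names, the line-based `ping` / `cksum <name>` protocol, and the use of plain dicts in place of open rados cluster handles are all assumptions made for illustration.

```python
import queue
import socketserver


class HandlePool:
    """Thread-safe pool of pre-opened handles (stand-ins for rados clusters)."""

    def __init__(self, handles):
        self._free = queue.Queue()
        for handle in handles:
            self._free.put(handle)

    def acquire(self):
        return self._free.get()  # blocks until a handle is free

    def release(self, handle):
        self._free.put(handle)


class ChecksumHandler(socketserver.StreamRequestHandler):
    """Answers one text command per line; one thread per client connection."""

    def handle(self):
        for raw in self.rfile:
            command, _, arg = raw.decode().strip().partition(" ")
            if command == "ping":
                reply = "pong"
            elif command == "cksum":
                store = self.server.pool.acquire()
                try:
                    # Hypothetical lookup; the real tool would read object metadata.
                    reply = store.get(arg, "ERR no-such-object")
                finally:
                    self.server.pool.release(store)
            else:
                reply = "ERR unknown-command"
            self.wfile.write((reply + "\n").encode())


class ChecksumServer(socketserver.ThreadingTCPServer):
    allow_reuse_address = True
    daemon_threads = True

    def __init__(self, address, pool):
        super().__init__(address, ChecksumHandler)
        self.pool = pool
```

Calling `ChecksumServer(("127.0.0.1", 0), pool).serve_forever()` would start it on a free port. The pool bounds how many handles are in use at once, independent of the number of client threads.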

Client:
* Simple C++-based client (for speed reasons)

Functionality: currently supports only:
* HMAC-MD5 authentication with a shared secret key
* simple metadata checksum retrieval, ‘ping’, ‘health monitoring’ and ‘wait <delay>’ modes.
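
For reference, a minimal sketch of HMAC-MD5 signing and verification with a shared secret, using Python's standard `hmac` module. The message layout and the function names `sign` and `verify` are hypothetical; the notes do not say what the proof-of-concept actually signs.

```python
import hashlib
import hmac


def sign(secret: bytes, message: bytes) -> str:
    """Return the HMAC-MD5 tag of message as a hex string."""
    return hmac.new(secret, message, hashlib.md5).hexdigest()


def verify(secret: bytes, message: bytes, tag: str) -> bool:
    """Check a received tag; compare_digest avoids timing side channels."""
    return hmac.compare_digest(sign(secret, message), tag)


if __name__ == "__main__":
    secret = b"example-shared-secret"       # illustrative only
    request = b"cksum some/object/name"     # hypothetical request format
    tag = sign(secret, request)
    print(verify(secret, request, tag))
```

The client would send the request together with the tag, and the server would recompute the tag with its own copy of the secret before acting on the request.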

On the ceph-dev instance, the code runs in ~15 ms, and takes < 1 s for 100 checksum retrievals:
(reminder: the current tool runs in ~200 ms, but with additional stats / checks / namelib conversions).

Todo:

  • Test in-situ in xrootd (on dev-gateway); spot any other bottlenecks?

  • Confirm whether file-based checksumming is efficient with the multi-threaded approach (concurrent zlib.adler32 calculations can use multiple cores)
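
One way the second todo item could be checked is sketched below: several Adler-32 calculations run in a thread pool, each fed in chunks. The chunk size, worker count and function names are assumptions; whether this scales across cores in practice is exactly what the todo item is meant to confirm.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 4 * 1024 * 1024  # assumed read size, to be tuned


def iter_file_chunks(path, chunk_size=CHUNK_SIZE):
    """Yield the contents of a file in fixed-size chunks."""
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                return
            yield chunk


def adler32_of_chunks(chunks) -> str:
    """Fold chunks into a running Adler-32 and return it as 8 hex digits."""
    value = 1  # Adler-32 starting value
    for chunk in chunks:
        value = zlib.adler32(chunk, value)
    return f"{value & 0xFFFFFFFF:08x}"


def adler32_of_file(path) -> str:
    return adler32_of_chunks(iter_file_chunks(path))


def checksum_files(paths, workers=4):
    """Checksum several files concurrently; results are in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(adler32_of_file, paths))
```

Timing `checksum_files` with `workers=1` against a larger worker count on the same set of files would give a first indication of whether threads alone are enough.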

 

Deletes

https://stfc.atlassian.net/browse/XRD-52

 

 

Alex notes there are still some cases of long deletes (e.g. beyond the 20s timeout).

Added additional macaroon logging to the prod hosts: https://elog.gridpp.rl.ac.uk/Tier1/10679

Spotted one case where a macaroon was generated, but with no further evidence of a connection from the client, leading to a timeout on the client side …

 

https://stfc.atlassian.net/browse/XRD-53

https://stfc.atlassian.net/browse/XRD-50

 

No time yet to properly investigate, but should be considered urgent.

 

 

Vector reads

 

Alex’s suggestion of restricting (from the server side) the client’s ability to send large numbers of readv segments in a single request appears to work, but without some (X)caching, performance is slow.

Tested adding a buffer behind the readv requests (i.e. readv is served from a buffered read of data from ceph).
Performance is much better (still to be quantified / tuned), but this does imply wasted bandwidth.
Further work is needed to ensure it is production ready.
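
To illustrate the trade-off, here is a toy sketch of serving readv segments from block-sized buffered reads. It is not the xrootd code under test; the function name, the block size and the `read_backend(offset, length)` callable are hypothetical. It returns the bytes fetched from the backend so the overhead relative to the bytes requested can be seen.

```python
def serve_readv(segments, read_backend, block_size=4 * 1024 * 1024):
    """Serve (offset, length) segments from whole-block buffered reads.

    read_backend(offset, length) -> bytes is a stand-in for a read from ceph.
    Returns (list of bytes, one per segment; total bytes fetched from backend).
    """
    blocks = {}  # block index -> buffered bytes
    fetched = 0
    results = []
    for offset, length in segments:
        out = bytearray()
        pos, end = offset, offset + length
        while pos < end:
            index = pos // block_size
            if index not in blocks:
                blocks[index] = read_backend(index * block_size, block_size)
                fetched += len(blocks[index])
            start = pos - index * block_size
            take = min(end - pos, block_size - start)
            piece = blocks[index][start:start + take]
            if not piece:  # read past the end of the object
                break
            out += piece
            pos += len(piece)
        results.append(bytes(out))
    return results, fetched


if __name__ == "__main__":
    data = bytes(range(256)) * 64  # 16 KiB fake object
    segs = [(10, 5), (20, 5), (5000, 10)]
    out, fetched = serve_readv(segs, lambda o, n: data[o:o + n], block_size=4096)
    requested = sum(length for _, length in segs)
    print(requested, fetched)  # bytes asked for vs bytes read from the backend
```

Nearby segments share a buffered block, so the number of backend reads drops, while sparse segments pull in whole blocks that are mostly unused, which is the wasted bandwidth noted above.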

This is still ‘just’ a mitigation, not a substitute for any underlying code improvements that could be made.

 

Tokens testbed

 

Tokens testbed: updated tests mean that RAL no longer passes all tests.

Some are likely to need XRootD updates; others might be due to the tests doing directory-level operations, and some might need configuration updates.

Intend to try adding an SKA issuer for some functional tests.

 

 

GGUS:

 

Site reports

 

 Action items

@Thomas Byrne to be made aware of the big red warning on https://docs.ceph.com/en/quincy/cephadm/upgrade/ (but is hopefully fixed now).

 

 

 

 Decisions