2022-12-08 Meeting Notes

 Date

Dec 8, 2022

 Participants

  • @James Walder

  • Glasgow: Sam

  • Lancs: Steven, Matt, Gerard

 Goals

  • List of Epics

  • New tickets

  • Consider new functionality / items

  • Detailed discussion of important topics

  • Site report activity

 

 Discussion topics

https://stfc.atlassian.net/jira/software/c/projects/XRD/boards/26/roadmap

Item

Presenter

Notes

 

 

Combining xrootd and webdav gateway aliases

 

Change made on Mon Nov 28; gw4,5 kept as single-use hosts

 

 

 

Next step is to slowly release/optimise the FTS limits for ATLAS (back to nominal levels).

 

 

Alice space token development updates

In progress

 

Slow stats

 

‘slow checksums’

Test of functional cephsum client / server code:

Alex’s/LHCb code; stat + checksum; 100 files => ~ 100ms / file
Current implementation is O(80)s for 100 files

RTT from lxplus: ~20 ms

Timestamp, Execution time [s]
1670238676.49,14.4777889252,0
1670238709.95,12.2421181202,0
1670238722.25,11.8010079861,0
1670238734.12,11.6060519218,0
1670238745.79,11.357049942,0
1670238757.22,11.6894021034,0
1670238768.98,11.5300111771,0
1670238780.59,11.1914060116,0
1670238791.85,10.7543468475,0
1670238802.69,11.7126610279,0
1670238814.48,11.1290979385,0
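A minimal sketch for summarising the timings above. It assumes each row is one run over 100 files (the notes do not say so explicitly) and ignores the unlabelled third column; the variable names are illustrative only.

```python
import csv
import io
import statistics

# Rows copied from the notes: timestamp, execution time [s], unlabelled third column.
RAW = """\
1670238676.49,14.4777889252,0
1670238709.95,12.2421181202,0
1670238722.25,11.8010079861,0
1670238734.12,11.6060519218,0
1670238745.79,11.357049942,0
1670238757.22,11.6894021034,0
1670238768.98,11.5300111771,0
1670238780.59,11.1914060116,0
1670238791.85,10.7543468475,0
1670238802.69,11.7126610279,0
1670238814.48,11.1290979385,0
"""

FILES_PER_RUN = 100  # ASSUMPTION: each row is one run of stat + checksum on 100 files

rows = list(csv.reader(io.StringIO(RAW)))
times = [float(row[1]) for row in rows]  # second column is the execution time in seconds

mean_s = statistics.mean(times)
per_file_ms = 1000.0 * mean_s / FILES_PER_RUN

print(f"runs: {len(times)}")
print(f"mean execution time: {mean_s:.2f} s")
print(f"mean time per file:  {per_file_ms:.1f} ms")
```

Under that assumption the mean comes out at roughly 11.8 s per run, i.e. about 118 ms per file, in line with the ~100 ms / file figure quoted above.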

https://github.com/snafus/cephsum-client

https://github.com/snafus/cephsum-server

Plan to continue tests with this implementation.

 

Large checksums

 

CMS transferring a handful of > 50 GB files.
FTS failed the transfers due to timeouts on the checksum side: RAL checksums run at 10 s/GB.
FTS nominally should have an 1800 s timeout; however, the HTTP client library was applying a 5 min timeout.

FTS devs managed to find a way to override that timeout, and transfers now succeeding.
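The arithmetic behind the failures can be sketched as follows. The rate and the two timeouts are the figures quoted above; treating checksum time as scaling linearly with file size is an assumption, and the function names are illustrative.

```python
CHECKSUM_RATE_S_PER_GB = 10.0   # from the notes: RAL checksums at 10 s/GB
HTTP_CLIENT_TIMEOUT_S = 5 * 60  # 5 min timeout applied by the HTTP client library
FTS_NOMINAL_TIMEOUT_S = 1800    # nominal FTS timeout


def checksum_time_s(size_gb: float) -> float:
    """Estimated checksum duration, ASSUMING time scales linearly with size."""
    return size_gb * CHECKSUM_RATE_S_PER_GB


def max_size_gb(timeout_s: float) -> float:
    """Largest file whose checksum would finish within the given timeout."""
    return timeout_s / CHECKSUM_RATE_S_PER_GB


size_gb = 50.0
duration = checksum_time_s(size_gb)
print(f"{size_gb:.0f} GB file: ~{duration:.0f} s to checksum")
print(f"exceeds 5 min client timeout: {duration > HTTP_CLIENT_TIMEOUT_S}")
print(f"within 1800 s FTS timeout:    {duration <= FTS_NOMINAL_TIMEOUT_S}")
print(f"largest file under 5 min:  {max_size_gb(HTTP_CLIENT_TIMEOUT_S):.0f} GB")
print(f"largest file under 1800 s: {max_size_gb(FTS_NOMINAL_TIMEOUT_S):.0f} GB")
```

On these numbers a 50 GB file needs about 500 s, which fits within the nominal 1800 s but not within the 5 min client-side limit; anything above roughly 30 GB would hit the shorter timeout.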

James had observed a difference in his own tests between lxplus- and RAL-initiated checksum requests; would like to confirm whether the same or a different set of timeouts applies.

Also discussed with XRootD devs, as FTS devs brought up the (in)ability to use the 100-continue header.

 


Yes, XRootD does support "Expect: 100-continue" headers but this is used for a very limited purpose. When the http front-end is filling a buffer in the presence of read segmentation and the header was present, it will send a keepalive. Notice that this is not extended to checksum handling. However, it would be relatively easy to do this. However, we need to look at the best place for this to occur. It may be in the front end or it may be in the XRootD backend. In any case, could you cut a github ticket requesting that expect continue headers also apply to checksumming?

Had previously written ‘concurrent’ checksumming code, but it was never deployed (and still not sure if it’s ideal).

 

 

Deletes

https://stfc.atlassian.net/browse/XRD-52

 

 

Alex notes still some cases of long deletes (e.g. beyond the 20s timeout).

Also spotted a case where a macaroon was generated, but with no further evidence of a connection from the client, leading to a timeout on the client side …

https://indico.cern.ch/event/1217518/contributions/5121757/attachments/2562916/4417797/pres_liaisons.pdf

 

 

https://stfc.atlassian.net/browse/XRD-53

https://stfc.atlassian.net/browse/XRD-50

 

No time yet to properly investigate, but should be considered urgent.

 

 

Vector reads

 

https://indico.cern.ch/event/1217518/contributions/5121757/attachments/2562916/4417797/pres_liaisons.pdf

Alex looked at:

  • Reduced max number of segments per readv

  • 'Buffered' reads

  • Direct reads from ceph via librados

... (small snippet of code)
+  librados::AioCompletion* cmpl;
+  ceph::bufferlist* bl;
+  ReadOpData tup;
+
+  cmpl = librados::Rados::aio_create_completion();
+  if (0 == cmpl) {
+    logwrapper((char*)"Can not create completion for read (%lu, %lu)", offset, size);
+    return -1;
+  }
+
+  try {
+    bl = new ceph::bufferlist();
+  } catch (std::bad_alloc&) {
+    logwrapper((char*)"Can not allocate buffer for read (%lu, %lu)", offset, size);
+    cmpl->release();
+    return -1;
+  }
+
+  tup = std::make_tuple(cmpl, bl, out_buf);
+  operations.push_back(tup);
+
+  return context->aio_read(fname, cmpl, bl, size, offset);
+ };
...
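For the first option in the list above (a reduced maximum number of segments per readv), a minimal sketch of the idea is given here. This is not Alex's implementation and not XRootD code: the (offset, length) segment representation, the name split_readv, and the limit of 4 are all hypothetical.

```python
from typing import Iterator, List, Tuple

# HYPOTHETICAL representation of one readv segment: (offset, length) in bytes.
Segment = Tuple[int, int]


def split_readv(segments: List[Segment], max_segments: int) -> Iterator[List[Segment]]:
    """Yield consecutive batches of at most max_segments segments, preserving order."""
    if max_segments < 1:
        raise ValueError("max_segments must be at least 1")
    for start in range(0, len(segments), max_segments):
        yield segments[start:start + max_segments]


# Example: a 10-segment vector read split into batches of at most 4 segments.
request = [(i * 4096, 512) for i in range(10)]
batches = list(split_readv(request, max_segments=4))

for n, batch in enumerate(batches):
    print(f"batch {n}: {len(batch)} segments, first offset {batch[0][0]}")
```

Each batch would then be issued as its own, smaller read request; the trade-off is more round trips in exchange for less work per request.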

 

 

 

Tokens testbed

 

Lost dev-gw2 to Alice GW testing of space tokens.

 

 

Planning

 

Todo: Discuss / plan main roadmap for 2023

 

 

GGUS:

Deletion problem at RAL

Slow stat calls at RAL

Problem accessing some LHCb files at RAL

Site reports

 

 Action items

JW to add an issue to the XRootD GitHub requesting "Expect: 100-continue" functionality for XRootD checksumming.

 Decisions