
👥 Participants

...

Item

Presenter

Notes

Combining xrootd and webdav aliases gateways

Change made on Mon Nov 28; gw4,5 kept as single-use hosts
https://elog.gridpp.rl.ac.uk/Tier1/10686
'Coincided' with large transfer requests:

Next step: slowly relax/optimise the FTS limits for ATLAS (back to nominal levels).

Alice space token development updates

In progress

Slow stats

‘slow checksums’

Test of functional cephsum client / server code:

Alex’s/LHCb code; stat + checksum; 100 files => ~ 100ms / file
Current implementation is O(80)s for 100 files

RTT from lxplus: ~20 ms

Code Block
Timestamp,    Execution time [s]
1670238676.49,14.4777889252,0
1670238709.95,12.2421181202,0
1670238722.25,11.8010079861,0
1670238734.12,11.6060519218,0
1670238745.79,11.357049942,0
1670238757.22,11.6894021034,0
1670238768.98,11.5300111771,0
1670238780.59,11.1914060116,0
1670238791.85,10.7543468475,0
1670238802.69,11.7126610279,0
1670238814.48,11.1290979385,0
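
As a rough cross-check of the timings listed above, the sketch below parses those rows and computes the mean execution time. It assumes (the notes do not state this) that each row is one run over a batch of 100 files, which is where the per-file figure comes from. The third column is unlabelled in the source and is ignored.

```python
import csv
import io
import statistics

# Rows copied from the timing listing above: timestamp, execution time [s],
# and a third, unlabelled column that is ignored here.
RAW = """\
1670238676.49,14.4777889252,0
1670238709.95,12.2421181202,0
1670238722.25,11.8010079861,0
1670238734.12,11.6060519218,0
1670238745.79,11.357049942,0
1670238757.22,11.6894021034,0
1670238768.98,11.5300111771,0
1670238780.59,11.1914060116,0
1670238791.85,10.7543468475,0
1670238802.69,11.7126610279,0
1670238814.48,11.1290979385,0
"""

FILES_PER_RUN = 100  # assumption: each row times one 100-file batch


def execution_times(raw):
    """Return the execution-time column (second field) as floats."""
    return [float(row[1]) for row in csv.reader(io.StringIO(raw)) if row]


times = execution_times(RAW)
mean_s = statistics.mean(times)
per_file_ms = 1000.0 * mean_s / FILES_PER_RUN

print(f"runs: {len(times)}")
print(f"mean execution time: {mean_s:.2f} s")
print(f"per file (assuming {FILES_PER_RUN} files per run): {per_file_ms:.0f} ms")
```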

https://github.com/snafus/cephsum-client

https://github.com/snafus/cephsum-server

Plan to continue tests with this implementation.

Large checksums

CMS is transferring a handful of > 50 GB files.
FTS failed the transfers due to timeouts on the checksum side: RAL checksums run at ~10 s/GB.
FTS should nominally have an 1800 s timeout; however, the HTTP client library was applying a 5 min timeout.

FTS devs managed to find a way to override that timeout, and transfers now succeeding.
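
The arithmetic behind these failures can be sketched as follows, using only the figures quoted above (roughly 10 s/GB checksum rate, a 5 min client-library timeout, and the nominal 1800 s FTS timeout). The function names are illustrative, and linear scaling of checksum time with file size is an assumption.

```python
CHECKSUM_RATE_S_PER_GB = 10.0   # quoted RAL checksum rate
HTTP_CLIENT_TIMEOUT_S = 5 * 60  # timeout applied by the HTTP client library
FTS_NOMINAL_TIMEOUT_S = 1800    # nominal FTS timeout


def checksum_time_s(size_gb, rate_s_per_gb=CHECKSUM_RATE_S_PER_GB):
    """Estimated checksum duration, assuming time scales linearly with size."""
    return size_gb * rate_s_per_gb


def max_size_gb(timeout_s, rate_s_per_gb=CHECKSUM_RATE_S_PER_GB):
    """Largest file whose checksum would finish within the given timeout."""
    return timeout_s / rate_s_per_gb


for size_gb in (10, 30, 50, 100):
    t = checksum_time_s(size_gb)
    print(
        f"{size_gb:>4} GB: {t:6.0f} s  "
        f"within 5 min: {t <= HTTP_CLIENT_TIMEOUT_S}  "
        f"within 1800 s: {t <= FTS_NOMINAL_TIMEOUT_S}"
    )

print(f"size limit at 5 min timeout: {max_size_gb(HTTP_CLIENT_TIMEOUT_S):.0f} GB")
print(f"size limit at 1800 s timeout: {max_size_gb(FTS_NOMINAL_TIMEOUT_S):.0f} GB")
```

Under these assumptions a 50 GB file needs about 500 s, which exceeds the 300 s client-library timeout but fits within the nominal 1800 s.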


James had observed a difference in his own tests between checksum requests initiated from lxplus and from RAL; would like to confirm whether the same or a different set of timeouts applies.

Also discussed with the XRootD devs, as the FTS devs brought up the (in)ability to use the "Expect: 100-continue" header. Their reply:


Yes, XRootD does support "Expect: 100-continue" headers but this is used for a very limited purpose. When the http front-end is filling a buffer in the presence of read segmentation and the header was present, it will send a keepalive. Notice that this is not extended to checksum handling. However, it would be relatively easy to do this. However, we need to look at the best place for this to occur. It may be in the front end or it may be in the XRootD backend. In any case, could you cut a github ticket requesting that expect continue headers also apply to checksumming?

Had previously written ‘concurrent’ checksumming code, but it was never put into use (and still not sure if it’s ideal).

Deletes

Jira: XRD-52

Alex notes there are still some cases of long deletes (e.g. beyond the 20 s timeout).

Added additional macaroon logging to the prod hosts: https://elog.gridpp.rl.ac.uk/Tier1/10679

Also spotted one case where a macaroon was generated but there was no further evidence of a connection from the client, leading to a timeout on the client side …

https://indico.cern.ch/event/1217518/contributions/5121757/attachments/2562916/4417797/pres_liaisons.pdf


Jira: XRD-53

Jira: XRD-50

No time yet to properly investigate, but should be considered urgent.

Vector reads

https://indico.cern.ch/event/1217518/contributions/5121757/attachments/2562916/4417797/pres_liaisons.pdf

Alex looked at:

  • Reduced max number of segments per readv

  • 'Buffered' reads

  • Direct reads from ceph via librados

Code Block
... (small snippet of code)
+     librados::AioCompletion* cmpl;
+     ceph::bufferlist* bl; 
+     ReadOpData tup;
+
+     cmpl = librados::Rados::aio_create_completion();
+     if (0 == cmpl) {
+       logwrapper((char*)"Can not create completion for read (%lu, %lu)", offset, size);
+       return -1;
+     }
+
+     try {
+       bl = new ceph::bufferlist();
+     } catch (std::bad_alloc&) {
+       logwrapper((char*)"Can not allocate buffer for read (%lu, %lu)", offset, size);
+       cmpl->release();
+       return -1;
+     }
+
+     tup = std::make_tuple(cmpl, bl, out_buf);
+     operations.push_back(tup);
+
+     return context->aio_read(fname, cmpl, bl, size, offset);
+  };
...
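
Of the options listed above, reducing the maximum number of segments per readv amounts to splitting one vector-read request into several smaller ones. The following is a minimal sketch of that idea only, not the XRootD or Ceph plugin implementation; `split_readv`, the `(offset, length)` tuple layout, and the limit of 4 are all hypothetical.

```python
from typing import Iterator, List, Sequence, Tuple

Segment = Tuple[int, int]  # (offset, length) in bytes; hypothetical layout


def split_readv(
    segments: Sequence[Segment], max_segments: int
) -> Iterator[List[Segment]]:
    """Yield consecutive batches holding at most max_segments segments each."""
    if max_segments < 1:
        raise ValueError("max_segments must be at least 1")
    for start in range(0, len(segments), max_segments):
        yield list(segments[start:start + max_segments])


# Example: a 10-segment request split with a (hypothetical) limit of 4.
request = [(i * 4096, 512) for i in range(10)]
batches = list(split_readv(request, max_segments=4))

for n, batch in enumerate(batches):
    print(f"batch {n}: {len(batch)} segments, first offset {batch[0][0]}")
```

Segment order is preserved across batches, so the results can be concatenated in order to reassemble the original request.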

Tokens testbed

Lost dev-gw2 to Alice GW testing of space tokens.

Planning

Todo: Discuss / plan main roadmap for 2023

GGUS:

Deletion problem at RAL

Slow stat calls at RAL

Problem accessing some LHCb files at RAL

Site reports

✅ Action items

Thomas Byrne to be made aware of the big red warning on https://docs.ceph.com/en/quincy/cephadm/upgrade/ (but is hopefully fixed now)

JW to add issue to xrootd GitHub to request "Expect: 100-continue" functionality for XRootD checksumming.

⤴ Decisions