• Rough draft
  • 2022-11-24 Meeting notes

     Date

    Oct 27, 2022

     Participants

    • @James Walder

    • @Thomas Byrne

    • Glasgow: Sam

    • Lancs: Steven, Gerard, Matt

     Goals

    • List of Epics

    • New tickets

    • Consider new functionality / items

    • Detailed discussion of important topics

    • Site report activity

     

     Discussion topics

    https://stfc.atlassian.net/jira/software/c/projects/XRD/boards/26/roadmap

    Item

    Presenter

    Notes

     


     

    Combining the xrootd and webdav gateway aliases

     

    Likely to be done on Monday; gw4 and gw5 to be kept as single-use hosts

    https://helpdesk.gridpp.rl.ac.uk/Ticket/Display.html?id=477591

    SEGV workaround, and additional macaroon logging

    http://aquilon.gridpp.rl.ac.uk/sandboxes/diff.php?sandbox=jw-gateway-xrootd-nobind550-3

     

    Slow stats

     

    Alex investigated; the slow stat issue appears to be more of a slow checksum issue (ignoring cases of very long stats for ‘other reasons’).

    This is for checksum requests that retrieve the checksum from metadata; the typical time is ~200 ms using the cephsum Python script.
    Profiling the code shows that cluster_connect and shutdown are the most significant operations.

    James would like to construct a server-client tool with a pool of open cluster objects in the server.
    Tom suggests also reviewing whether ‘doing it properly’ in xrootd should be implemented now.
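
    A minimal sketch of the pooling idea, assuming the cost is dominated by connecting to and shutting down the cluster for every request. `HandlePool`, `fake_connect` and the pool size are hypothetical names and values chosen for illustration; in a real server the factory would open an actual cluster connection rather than return a placeholder.

```python
import queue
from contextlib import contextmanager


class HandlePool:
    """Keep a fixed number of already-connected handles open for reuse."""

    def __init__(self, connect, size=4):
        # `connect` is a hypothetical zero-argument factory returning an open
        # cluster handle; it is called once per slot, at start-up only.
        self._idle = queue.Queue()
        for _ in range(size):
            self._idle.put(connect())

    @contextmanager
    def handle(self, timeout=None):
        # Block until a handle is free, lend it out, and always return it.
        h = self._idle.get(timeout=timeout)
        try:
            yield h
        finally:
            self._idle.put(h)


# Stand-in for the expensive connect step, counting how often it runs.
connect_calls = 0


def fake_connect():
    global connect_calls
    connect_calls += 1
    return {"id": connect_calls}


pool = HandlePool(fake_connect, size=2)
for _ in range(10):
    with pool.handle() as h:
        pass  # a metadata checksum lookup would use `h` here
```

    With this shape, ten requests trigger only the two connects made at start-up, so the per-request connect and shutdown cost is paid once per pooled handle rather than once per checksum.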

     

     

    Checksumming

     

    A quick review of recent checksumming shows typical times, and times where significant operations are being undertaken in Ceph.

     

     

    Deletes

    https://stfc.atlassian.net/browse/XRD-52

     

     

    Alex notes there are still some cases of long deletes (e.g. beyond the 20 s timeout), and also failures for ‘other reasons’. Some might possibly be ‘user error’, but this needs to be clarified either way.

     

     

    https://stfc.atlassian.net/browse/XRD-53

    https://stfc.atlassian.net/browse/XRD-50

     

    No time yet to investigate, but should be considered urgent.

     

    https://stfc.atlassian.net/browse/XRD-51

     

    Now urgent …

     

    Vector reads

     

    Alex's suggestion is to restrict, from the server side, the client's ability to send large numbers of readv segments in a request:

    diff --git a/src/XProtocol/XProtocol.hh b/src/XProtocol/XProtocol.hh
    index eb9af2c..da75f8a 100644
    --- a/src/XProtocol/XProtocol.hh
    +++ b/src/XProtocol/XProtocol.hh
    @@ -683,7 +683,7 @@ struct read_list {
     };
     static const int rlItemLen = sizeof(read_list);
     static const int maxRvecln = 16384;
    -static const int maxRvecsz = maxRvecln/rlItemLen;
    +static const int maxRvecsz = 16;
     }
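
    To put the change in numbers, here is a small worked calculation. It assumes one read_list entry is 16 bytes on the wire (a 4-byte file handle, a 4-byte length and an 8-byte offset, with no padding); that layout is an assumption, not something recorded in these notes. `requests_needed` is a hypothetical helper, and it assumes the client splits its reads to fit the limit.

```python
import math
import struct

# Assumed wire layout of one read_list entry: 4-byte file handle,
# 4-byte length, 8-byte offset, with no padding.
RL_ITEM_LEN = struct.calcsize(">4siq")
MAX_RVEC_LN = 16384  # maxRvecln from the diff

default_max_segments = MAX_RVEC_LN // RL_ITEM_LEN  # maxRvecsz before the patch
patched_max_segments = 16                          # maxRvecsz after the patch


def requests_needed(n_segments, max_segments):
    """Number of readv requests needed to carry n_segments."""
    return math.ceil(n_segments / max_segments)
```

    Under these assumptions the unpatched limit is 1024 segments per request, so a read that previously fitted in one request would need 64 requests at the patched limit of 16.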

    Testing is on the non-production external gateways, with and without the fix.

    It has been tested with an LHCb job (from lxplus and more locally). Rob C.'s script is not applicable here, as it deliberately forces the number of readv segments per request.

    The current statement is that performance is terrible but, with the patch, the timeouts are currently avoided (Alex increased the timeouts on the client side, but they would have been avoided even without that increase).
    There is no ‘buffering’, XCache, or range coalescing on the gateway side.

    Need to test against a WN-type setup and try to add some ‘buffering’ for the readv requests, similar to the read request buffering but with a much smaller size (say 1–2 MiB) to avoid read amplification.
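
    One possible shape for such buffering is to coalesce nearby readv segments into a few larger backend reads, capped at a small size. This is a sketch only: `coalesce`, `max_gap` and `max_span` are hypothetical names, and the 64 KiB gap and 2 MiB cap are illustrative values, not measured or agreed settings.

```python
def coalesce(segments, max_gap=64 * 1024, max_span=2 * 1024 * 1024):
    """Merge (offset, length) segments into larger backend reads.

    Neighbouring segments are merged when the gap between them is at most
    max_gap and the merged read stays within max_span bytes.
    """
    merged = []  # list of (start, end) byte ranges, end exclusive
    for offset, length in sorted(segments):
        end = offset + length
        if merged:
            start, cur_end = merged[-1]
            close_enough = offset - cur_end <= max_gap
            fits = max(end, cur_end) - start <= max_span
            if close_enough and fits:
                merged[-1] = (start, max(end, cur_end))
                continue
        merged.append((offset, end))
    # Return (offset, length) pairs, matching the input form.
    return [(start, end - start) for start, end in merged]


reads = coalesce([(0, 100), (200, 100), (10 * 1024 * 1024, 100)])
```

    The cap on the merged span is what limits read amplification: bytes in the gaps are fetched but never requested by the client, so a smaller `max_gap` and `max_span` trade fewer wasted bytes for more backend reads.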

     

     

    GGUS:

     

     

     

     

     

     

    Site reports

     

     Action items

     

     

     Decisions