2022-11-24 Meeting notes
Date
Oct 27, 2022
Participants
@James Walder
@Thomas Byrne
Glasgow: Sam
Lancs: Steven, Gerard, Matt
Goals
List of Epics
New tickets
Consider new functionality / items
Detailed discussion of important topics
Site report activity
Discussion topics
Item | Presenter | Notes |
|
---|---|---|---|
Combining xrootd and webdav gateway aliases |
| Likely to be done on Monday; gw4 and gw5 to be kept as single-use hosts (https://helpdesk.gridpp.rl.ac.uk/Ticket/Display.html?id=477591). SEGV workaround and additional macaroon logging: http://aquilon.gridpp.rl.ac.uk/sandboxes/diff.php?sandbox=jw-gateway-xrootd-nobind550-3 |
|
Slow stats |
| Alex investigated; the slow stat issue appears to be more of a slow checksum issue (ignoring cases of very long stats for ‘other reasons’). This is for checksum requests that retrieve from metadata; typical time is ~200 ms. Using the cephsum Python script, James would like to construct a server-client tool with a pool of open cluster objects in the server.
|
|
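For the Slow stats item above, a minimal sketch of the "pool of open cluster objects" idea, assuming the per-request cost is dominated by opening the cluster connection rather than by the metadata lookup itself. `HandlePool`, `FakeCluster` and `stored_checksum` are hypothetical names for illustration only; they are not part of cephsum or of any Ceph binding, and the client-server transport is left out.

```python
import queue
from contextlib import contextmanager


class HandlePool:
    """Fixed-size pool of pre-opened handles shared by request workers."""

    def __init__(self, factory, size):
        self._idle = queue.Queue()
        for _ in range(size):
            # Pay the connection cost once, at server start-up.
            self._idle.put(factory())

    @contextmanager
    def borrow(self, timeout=None):
        # Blocks while every handle is in use; raises queue.Empty on timeout.
        handle = self._idle.get(timeout=timeout)
        try:
            yield handle
        finally:
            # Always hand the handle back, even if the request failed.
            self._idle.put(handle)


class FakeCluster:
    """Hypothetical stand-in for an open cluster object (not a real Ceph API)."""

    opened = 0

    def __init__(self):
        FakeCluster.opened += 1

    def stored_checksum(self, path):
        # Placeholder value; a real server would read the stored metadata here.
        return "deadbeef"


pool = HandlePool(FakeCluster, size=4)
results = []
for path in ["/a", "/b", "/c", "/d", "/e", "/f"]:
    with pool.borrow(timeout=1.0) as cluster:
        results.append(cluster.stored_checksum(path))
```

Six requests are served here by four long-lived handles, so no request pays the open cost. Whether that accounts for the ~200 ms would need to be measured.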
Checksumming |
| Quick review of recent checksumming shows typical times, and times where significant operations are being undertaken in Ceph
|
|
Deletes https://stfc.atlassian.net/browse/XRD-52
|
| Alex notes there are still some cases of long deletes (e.g. beyond the 20s timeout), and also failures for ‘other reasons’. Some might possibly be ‘user error’, but this needs to be clarified either way
|
|
| No time yet to investigate, but should be considered urgent. |
| |
| Now urgent … |
| |
Vector reads |
| Alex's suggestion of restricting the client’s ability (from the server side) to send large numbers of readv segments in a request:
diff --git a/src/XProtocol/XProtocol.hh b/src/XProtocol/XProtocol.hh
index eb9af2c..da75f8a 100644
--- a/src/XProtocol/XProtocol.hh
+++ b/src/XProtocol/XProtocol.hh
@@ -683,7 +683,7 @@ struct read_list {
 };
 static const int rlItemLen = sizeof(read_list);
 static const int maxRvecln = 16384;
-static const int maxRvecsz = maxRvecln/rlItemLen;
+static const int maxRvecsz = 16;
 }
Testing on (non-prod external) gateways with and without the fix. It has been tested with an LHCb job (from lxplus and more locally). Rob C.'s script is not applicable here, as it deliberately forces the number of readv segments per request. The current statement is that the performance is terrible but, with the patch, the timeouts are currently avoided (Alex increased the timeouts on the client side, but they would have been avoided even with the original timeouts in place). Need to test against a WN-type setup and try to add some ‘buffering’ for the readv requests, similar to the read request buffering but with a much smaller size (say 1–2 MiB) to avoid read amplification. |
|
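For the Vector reads item above, a rough sketch of the arithmetic behind the patch and the buffering idea. It assumes a `read_list` entry is a 4-byte file handle, a 32-bit length and a 64-bit offset (the struct body is not shown in the diff), and that a client splits its vector reads to the server-side limit. The segment patterns and the 64 MiB comparison block size are made up for illustration. `split_requests`, `blocks_to_fetch` and `amplification` are hypothetical helpers, not XRootD functions.

```python
import struct

MIB = 1024 * 1024

# Assumed read_list layout: char fhandle[4]; int32 rlen; int64 offset.
RL_ITEM_LEN = struct.calcsize(">4siq")        # packed size, no padding
MAX_RVECLN = 16384                            # from the diff context
STOCK_MAX_RVECSZ = MAX_RVECLN // RL_ITEM_LEN  # segments per request before the patch
PATCHED_MAX_RVECSZ = 16                       # segments per request after the patch


def split_requests(segments, max_segments):
    """Split one vector read into requests of at most max_segments each."""
    return [segments[i:i + max_segments]
            for i in range(0, len(segments), max_segments)]


def blocks_to_fetch(segments, block_size):
    """Block-aligned offsets a buffering layer would read for these segments."""
    blocks = set()
    for offset, length in segments:
        if length <= 0:
            continue
        first = offset // block_size
        last = (offset + length - 1) // block_size
        blocks.update(range(first, last + 1))
    return sorted(b * block_size for b in blocks)


def amplification(segments, block_size):
    """Bytes fetched from storage divided by bytes the client asked for."""
    requested = sum(length for _, length in segments)
    fetched = len(blocks_to_fetch(segments, block_size)) * block_size
    return fetched / requested


# Illustrative job: 1000 segments of 4 KiB, one every 64 KiB.
dense = [(i * 64 * 1024, 4096) for i in range(1000)]
stock_requests = split_requests(dense, STOCK_MAX_RVECSZ)
patched_requests = split_requests(dense, PATCHED_MAX_RVECSZ)

# Illustrative sparse pattern: 100 segments of 4 KiB, one every 8 MiB.
sparse = [(i * 8 * MIB, 4096) for i in range(100)]
amp_small = amplification(sparse, 1 * MIB)    # buffer size suggested in the notes
amp_large = amplification(sparse, 64 * MIB)   # hypothetical larger buffer

print(len(stock_requests), len(patched_requests), amp_small, amp_large)
```

Under these assumptions the patch turns one large request into many small ones, and a 1 MiB buffer fetches far fewer unneeded bytes than a large one on sparse reads. Real access patterns from the WN tests would be needed to pick the size.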
GGUS:
Site reports
Action items