2022-10-27 Meeting notes

 Date

Oct 27, 2022

 Participants

  • @James Walder

  • @Thomas Byrne

  • Glasgow: Sam

  • ECDF: Rob

  • Lancs: Gerard, Matt, Steven

 Goals

  • List of Epics

  • New tickets

  • Consider new functionality / items

  • Detailed discussion of important topics

  • Site report activity

 

 Discussion topics

https://stfc.atlassian.net/jira/software/c/projects/XRD/boards/26/roadmap

Item

Presenter

Notes

 


5.5.1 released

 

xrootd/docs/ReleaseNotes.txt at v5.5.1 · xrootd/xrootd

  • **Major bug fixes**

    • **[XrdFfs]** Fix a bug in xrootdfs reported by issue #1777

    • **[Server]** Avoid SEGV when client tries to access file after deferred closed.

    • **[XrdHttp]** The server certificate is renewed by the Refresh thread of the XrdTlsContext object.

    • **[XrdHttp]** Fix a segv happening when a client sends a first line starting with a space.

    • **[XrdTls]** Shutdown the socket if a SSL error happens when trying to accept a connection.

  • **Minor bug fixes**

    • **[Apps]** Avoid SEGV when asking for help.

    • **[XrdCl]** copy job: fix memory leak (buffers not queued on error).

    • **[Server]** Add O_RDWR open flag when creating file to avoid fs issue.

    • **[Server]** Properly handle opaque info for fsctl operations.

    • **[XrdHttp]** Allow VO names with spaces and other quoted chars.

    • **[XrdCl]** LocalFileHandler: fail gracefully on overloaded machines.

  • **Miscellaneous**

    • **[XrdCl]** Introduce new error code for handling local errors.

    • **[XrdCl]** local file handler: obtain error code with aio_error.

    • **[XrdCl]** xrdfs ls: sanitize ls entry.

    • **[CMake]** Add ENABLE_ switch for scitokens and macaroons, closes #1728.

    • **[XrdTls]** Start the CRLRefresh thread in XrdTlsContext constructor.

    • **[XrdTls]** Changed the bit set for the activation of the Refresh thread.

    • **[XrdTls]** The CRL refresh thread logic only starts when there is a need for it.

    • **[XrdTls]** Free current context when a new context is generated.

    • **[XrdHttpTpc]** Pass src size to OFS via occ.asize.

 

Xcache 5.5.X problems

 

Stuck xroot transfers with gfal2 and XCache · Issue #1808 · xrootd/xrootd

(relates to? Stuck xroot transfers with gfal2 and XCache · Issue #1808 · xrootd/xrootd )

 

Thoughts on combining the xrootd and webdav aliased hosts?

 

All the XRootD and WebDAV hosts are now configured with the ‘unified’ configuration.
It is therefore possible to place every machine under both the xrootd and webdav aliases. Should we do so?

  • Pros:

    • More machines, better sharing of (webdav) load

  • Cons:

    • If one host has a problem, the whole site is observed to have a problem

    • (Naive probability model.) P(problem) = 1 - (1-p)^n, where n = number of hosts and p = probability that a given host has a problem; i.e. P ≈ n × p for small p.

    • So, to stop the site-level problem rate growing, the probability of a given host exhibiting a problem needs to fall quickly (roughly as 1/n) with increasing number of hosts
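
As a rough illustration of the naive model above, a minimal Python sketch (assuming hosts fail independently; the function names and the value p = 0.01 are illustrative only, not measured):

```python
def site_problem_probability(p: float, n: int) -> float:
    """Probability that at least one of n independent hosts has a problem."""
    return 1.0 - (1.0 - p) ** n


def per_host_probability_for_target(target: float, n: int) -> float:
    """Per-host probability needed so that the site-level probability equals `target`."""
    return 1.0 - (1.0 - target) ** (1.0 / n)


if __name__ == "__main__":
    p = 0.01  # illustrative per-host problem probability (an assumption)
    for n in (1, 2, 4, 8, 16):
        exact = site_problem_probability(p, n)
        approx = n * p  # small-p approximation from the notes
        needed = per_host_probability_for_target(p, n)
        print(f"n={n:2d}  P={exact:.4f}  n*p={approx:.4f}  p needed for P={p}: {needed:.5f}")
```

The last column shows how small the per-host probability has to become for the combined alias to look no worse than a single host does today.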

 

CMSD

 

A ‘working’ CMSD setup now exists in the non-production framework. Starting various tests (see slides, if prepared in time).
How can we provide a ‘resilient’ frontend for the Redirector / Manager nodes?
(And, if we can do that, do we really still need CMSD?)

Matt notes that using a VM(ware?) setup might provide the necessary resilience.

Slow stats

 

Alex observes that gfal stats via the root:// protocol make OpenSSL key generation calls. This (because of entropy requirements?) takes a ‘random’ amount of time, O(100s of ms to a few seconds).
While this might explain the ‘slow’ root://-based stats, it is not clear that it explains the observed LHCb slowness (which is more likely just from a bad xrootd host).
Also: why doesn’t this appear in my Python API calls, i.e. is it in the context creation or in the stat itself?
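
One way to separate the two costs would be to time the context creation and the stat calls independently. A minimal sketch, assuming the client library is wrapped in two callables (`make_context` and `do_stat` are hypothetical names; with the gfal2 Python bindings they might wrap `gfal2.creat_context()` and `ctx.stat(url)`, but that has not been tried here):

```python
import time
from typing import Any, Callable, List, Tuple


def time_context_and_stats(
    make_context: Callable[[], Any],
    do_stat: Callable[[Any], Any],
    repeats: int = 5,
) -> Tuple[float, List[float]]:
    """Return (context creation time, list of per-stat times), all in seconds."""
    t0 = time.perf_counter()
    ctx = make_context()
    create_s = time.perf_counter() - t0

    stat_s = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        do_stat(ctx)  # reuse the same context for every stat
        stat_s.append(time.perf_counter() - t0)
    return create_s, stat_s


if __name__ == "__main__":
    # Stand-in callables so the sketch runs without any grid client installed.
    create_s, stat_s = time_context_and_stats(
        make_context=lambda: time.sleep(0.05) or object(),
        do_stat=lambda ctx: None,
    )
    print(f"context creation: {create_s * 1e3:.1f} ms")
    print("stats: " + ", ".join(f"{s * 1e3:.3f} ms" for s in stat_s))
```

If the first stat is consistently slower than the rest, that would point at per-connection setup rather than the stat itself.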

 

Vector Read requests on Echo Gateways

 

Observe a very low level of vector read (readV) requests on the External Gateways (possibly due to Virtual Placement, ‘rogue’ users, and …?).
Plan to add a small buffer in the readV calculations to avoid the pathological use cases.

Have a ‘simple’ range coalescence algorithm in Python, but it is probably too slow for sensible use.
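
For reference, a minimal sketch of what a simple range coalescence could look like (not the implementation mentioned above; `coalesce_ranges` and the `max_gap` buffer parameter are hypothetical names, and chunks are assumed to be (offset, length) pairs in bytes):

```python
from typing import Iterable, List, Tuple


def coalesce_ranges(
    chunks: Iterable[Tuple[int, int]], max_gap: int = 0
) -> List[Tuple[int, int]]:
    """Merge (offset, length) chunks that overlap or lie within max_gap bytes of each other."""
    merged: List[List[int]] = []  # each entry is [start, end), end exclusive
    for offset, length in sorted(chunks):
        if length <= 0:
            continue  # ignore empty or malformed chunks
        end = offset + length
        if merged and offset <= merged[-1][1] + max_gap:
            # Close enough to the previous range: extend it if needed.
            if end > merged[-1][1]:
                merged[-1][1] = end
        else:
            merged.append([offset, end])
    return [(start, end - start) for start, end in merged]


if __name__ == "__main__":
    requests = [(0, 10), (12, 5), (100, 10), (10, 2)]
    print(coalesce_ranges(requests))              # touching chunks are merged
    print(coalesce_ranges([(0, 10), (12, 5)], 4)) # gap of 2 bytes bridged by the buffer
```

The cost is dominated by the sort, O(n log n) in the number of chunks; a larger `max_gap` trades extra bytes read for fewer, larger reads.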


GGUS:

Slow Stats:

Average stat times in ‘good’ period (simple stat tests from lcgui machine):

Average stat times in ‘bad’ period:

Site reports

Glasgow:
Thinking about Ceph + Rocky 8

Lancaster:
When to go to 5.5?
Now have a dev cluster.

ECDF:
Xcache testing ongoing

Manchester:
Starting to revisit Ceph installation (with VMs, Rocky 8/9)

 Action items

 

 

 Decisions