2023-08-17 Meeting Notes

 Date

Aug 17, 2023

 Participants

  • @Alexander Rogovskiy

  • @Thomas, Jyothish (STFC,RAL,SC)

  • @Ian Johnson

  • @Matthias Mayer

  • @James Walder

  • Lancaster: Gerard, Matt

  • Glasgow:

Apologies:

Sam

 Goals

  • List of Epics

  • New tickets

  • Consider new functionality / items

  • Detailed discussion of important topics

  • Site report activity

 

 Discussion topics

Current status of Echo Gateways / WNs testing

Recent sandboxes for review / deployments:

 

Item

Presenter

Notes

 


 

XRootD Releases

 

No news on 5.6.2 yet

 

Prefetch studies

Alex

Status of tests / deployment without prefetch

 

Deletion studies through RDR

Ian

Results gathered over a three-week period, uploading and deleting 8 GiB files. Uploads and deletions were interleaved by alternating between the redirector and the DNS alias. Results are shown as 90th percentile deletion times:

(In the following, ‘redirector deletion’ means deletion via a gateway selected using the rdr.echo.stfc.ac.uk alias, ‘non-rdr deletion’ means deletion via a gateway from the echo.stfc.ac.uk alias.)

Out of the 15 sets of complete data, there are

  • Three occasions when the 90th %ile time for redirector deletion is lower than non-rdr deletion;

  • Three occasions when the 90th %ile time for redirector deletion is close to non-rdr deletion;

  • Nine occasions when the 90th %ile time for redirector deletion is greater than non-rdr deletion (in one case, ~10x greater).

Questions arising:

  • Is the measuring method appropriate?

  • Do intermittent but long-lasting cluster problems reflect badly on deletion times? (Note that nine out of the 30 deletions took longer than 20 seconds, which I haven’t seen happen this frequently.)

  • Was this a good period to attempt comparing deletion times?

Added after meeting:

Plotting the mean deletion times shows less variation between redirector and non-rdr deletions.
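For reference, a minimal Python sketch of how one data set could be classified by 90th percentile and mean. It is not the script used for the study: the nearest-rank percentile definition, the 10% tolerance for "close", the function names and the sample timings are all assumptions made for illustration.

```python
import math
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile (one of several common definitions)."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def compare_deletions(rdr_times, non_rdr_times, pct=90, tolerance=0.10):
    """Classify redirector vs non-rdr deletion times for one data set.

    'close' means the two percentiles differ by at most `tolerance`
    (a fraction of the non-rdr value); this threshold is an assumption.
    """
    p_rdr = percentile(rdr_times, pct)
    p_non = percentile(non_rdr_times, pct)
    if abs(p_rdr - p_non) <= tolerance * p_non:
        verdict = "close"
    elif p_rdr < p_non:
        verdict = "lower"
    else:
        verdict = "greater"
    return {
        "p_rdr": p_rdr,
        "p_non_rdr": p_non,
        "mean_rdr": statistics.mean(rdr_times),
        "mean_non_rdr": statistics.mean(non_rdr_times),
        "verdict": verdict,
    }


if __name__ == "__main__":
    # Made-up deletion times in seconds, NOT the measured results.
    rdr = [1.0, 1.2, 1.1, 25.0, 1.3, 1.0, 1.1, 1.2, 1.4, 30.0]
    non_rdr = [1.0, 1.1, 1.2, 1.3, 1.1, 1.0, 1.2, 1.5, 2.0, 2.5]
    print(compare_deletions(rdr, non_rdr))
```

With only a handful of deletions per data set, a single slow deletion can set the 90th percentile on its own, so reporting the mean or median alongside it may help answer the question about the measuring method.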

 

 

CMSD rollout

 

https://stfc.atlassian.net/browse/XRD-41

Status:

root protocol in testing from CMS; change finalization in early Sep

 

 

CMSD operations: observations

 

gateways appear highly ‘loaded’ recently (e.g. system load and number of connections)

  • might be correlated to ongoing echo issues

 

CMSD outstanding items

 

Icinga / nags callout tests changes.

Improved load balancing / server failover triggering

better 'rolling server restart script'

Documentation; setup / configuration / operations / troubleshooting / testing

Review of Sandbox and deployment to prod.

 

Tokens testing

 

NTR

 

AAA Gateways

 

Sandbox ready for review:

http://aquilon.gridpp.rl.ac.uk/sandboxes/diff.php?sandbox=jw-xrootd-aaa-5.5.4-3

 

Floating IPs for VM test cluster

 

Does not appear to be viable (explanation?)

 

SKA Gateway box

 

https://stfc.atlassian.net/wiki/spaces/UK/pages/215941180

numerous network changes needed / being done

  • Requires core router change for routes

  • sandbox to be deployed for ceph dev hosts routes

  • Aquilonisation of bonded vlans

 

‘high-performance ceph tests’

JW

What is the maximum sustainable rate at which reads / writes to ceph can be managed for single-file access:

  1. Simple python based testing

  2. Use C++/C api to try and use various completion objects.

  3. Consider how to implement into OSS/OFS plugin (for demo purposes)

  4. Test if XRootD can support such client connection speeds …

Test 1):

  • Multi-threaded client calls with simple synchronous read operations:

    • Variations of read chunk size, and number of threads

    • Test on dev and prod cluster

    • single 10GB file
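A rough sketch of the shape Test 1 could take, assuming a local file as a stand-in for the Ceph object so that it runs anywhere on Unix. The function names and the byte-range split across threads are illustrative choices, not the actual test code. For a real run against a cluster, the `os.pread` call would presumably be swapped for a synchronous librados read (for example `Ioctx.read(key, length, offset)` in the Python `rados` bindings); that substitution is untested here.

```python
import os
import time
from concurrent.futures import ThreadPoolExecutor


def read_range(fd, start, end, chunk_size):
    """Synchronously read bytes [start, end) in chunk_size pieces."""
    total = 0
    offset = start
    while offset < end:
        data = os.pread(fd, min(chunk_size, end - offset), offset)
        if not data:  # unexpected end of file
            break
        total += len(data)
        offset += len(data)
    return total


def run_read_test(path, chunk_size, n_threads):
    """Read one file with n_threads threads; return (bytes_read, MiB/s)."""
    size = os.path.getsize(path)
    stripe = -(-size // n_threads)  # ceiling division: bytes per thread
    fd = os.open(path, os.O_RDONLY)
    try:
        t0 = time.perf_counter()
        with ThreadPoolExecutor(max_workers=n_threads) as pool:
            futures = [
                pool.submit(read_range, fd, i * stripe,
                            min(size, (i + 1) * stripe), chunk_size)
                for i in range(n_threads)
            ]
            total = sum(f.result() for f in futures)
        elapsed = max(time.perf_counter() - t0, 1e-9)
    finally:
        os.close(fd)
    return total, total / elapsed / 2**20


if __name__ == "__main__":
    import sys
    path = sys.argv[1]  # file to read, supplied by the caller
    # Scan over read chunk size and number of threads.
    for chunk in (1 << 20, 4 << 20, 16 << 20):
        for threads in (1, 4, 16):
            nbytes, rate = run_read_test(path, chunk, threads)
            print(f"chunk={chunk >> 20} MiB threads={threads} "
                  f"read={nbytes} B rate={rate:.1f} MiB/s")
```

Reading a local file repeatedly mostly measures the page cache, so the numbers from this stand-in say nothing about Ceph; it only shows the structure of the chunk-size / thread-count scan.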

 

 

 

extra gateways deployment

 

certs pending, built in aquilon

 

on GGUS:

Site reports

 Action items


 

 Decisions