2023-10-19 Meeting Notes

Date

Oct 19, 2023

Participants

  • @Thomas, Jyothish (STFC,RAL,SC)

  • Glasgow:

  • Lancaster:

Apologies:

@James Walder

Goals

  • List of Epics

  • New tickets

  • Consider new functionality / items

  • Detailed discussion of important topics

  • Site report activity

Discussion topics

Current status of Echo Gateways / WNs testing

Recent sandboxes for review / deployments:

Item

Presenter

Notes

XRootD Releases

5.6.2-2 is out
- Plan for deployment

  • Is passing on pre-prod

  • upgrade of CMake version, exhibiting 'stack smashing' ?? (or different compile versions)

  • Features for XrdCeph that need to be included ?

Aim for prod testing next week. (aim for 1 week of testing, then deploy if ok).

Sam notes:
if you're building xrootd 5.6.2 from source, the tagged 5.6.2 in the git repo does not have the bugfix for authdb parsing, as I just discovered to my cost (that's 2 commits later...)

Lancs works (off-the-shelf 5.6.2-2)

Checksums fixes

Prefetch studies and WN changes

Alex

(temporarily rolled back, with the ongoing work in batch farm WNs)

  • the LD_PRELOAD for lockless reads was removed (compiled against an 'older' version of Ceph, and superseded by the recent readV work).

  • WN containers need updated Ceph rpms (for 14.2.22)

  • @Alexander Rogovskiy to present 'final' status of testing in next meeting on the prefetch work

  • @James Walder to add work items to the queue in Jira

  • failures on latest prefetch (timeout) might be due to ceph version

Deletion studies through RDR

Ian

CMSD rollout

https://stfc.atlassian.net/browse/XRD-41
New diagram required ?

svc01,02,17,18 stay as internal WN gateways for now.
The other svc hosts (03, 05, 11, 13-16) to be added to the CMSD production cluster.

svc19 (designated for Alice gateway)

Gateways on new network plan

IPv6 sorted; firewall rule changes in progress

LHCONE issue sorted

Fermilab can't be reached through IPv6, but tracepath gets as far as the LHCOPN CERN router

To consider the TPC instance port

Gateways: observations

WN gateways showed a spike in memory: the 2 gateways with swap enabled filled a few 100 GB of swap; the other 2 crashed at the poller

CMSD outstanding items

Icinga / Nagios callout test changes: live and available

  • ping test for the floating IPs and gateway hosts needs some more refinement

Improved load balancing / server failover triggering -

better 'rolling server restart script'

Documentation; setup / configuration / operations / troubleshooting / testing

Review of Sandbox and deployment to prod:
- Awaiting time from Tom for load balancer test

Sandbox has been reviewed; awaiting @Thomas Byrne for final confirmation

Tokens testing

To liaise with the TTT Taskforce (aka @Matt Doidge)

no update

AAA Gateways

Sandbox ready for review:

http://aquilon.gridpp.rl.ac.uk/sandboxes/diff.php?sandbox=jw-xrootd-aaa-5.5.4-3
Needs a bigger discussion regarding Tokens, and deployment in production hosts.

SKA Gateway box

https://stfc.atlassian.net/wiki/spaces/UK/pages/215941180

now working using ska pool on ceph dev

Initial Iperf3 tests: (see table and plots below).

  • Actions

    • Ensure Xrootd01 is tuned correctly, according to the NVIDIA / Mellanox instructions

    • Repeat the iperf tests

  • Xrootd tests against:

    • dev-echo

    • cephfs (Deneb dev)

    • cephfs (openstack; permissions/routing issues)?

    • local disk / mem

  • Frontend routing is also being worked on
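For repeating the iperf3 tests, a minimal sketch of how the results might be collected is given here. It assumes iperf3's JSON output (`-J`) for a TCP test, where the receiver-side rate is under `end.sum_received.bits_per_second`; the function names and the default duration are hypothetical and not taken from the actual test setup.

```python
import json
import subprocess


def received_gbps(result: dict) -> float:
    """Return the receiver-side throughput in Gbit/s from parsed iperf3 JSON."""
    return result["end"]["sum_received"]["bits_per_second"] / 1e9


def run_iperf3(server: str, seconds: int = 10) -> float:
    """Run a TCP iperf3 client test against `server` and return Gbit/s.

    `server` is whatever host runs `iperf3 -s`; nothing here is specific
    to the SKA gateway box.
    """
    proc = subprocess.run(
        ["iperf3", "-c", server, "-J", "-t", str(seconds)],
        capture_output=True,
        text=True,
        check=True,
    )
    return received_gbps(json.loads(proc.stdout))
```

Running the same function before and after the NIC tuning would give directly comparable numbers for the repeat tests.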

containerised gateways (kubernetes cluster)

Identified an issue on worker-node gateways where Ceph Nautilus 14.2.15 libraries were loaded (left over from a previous libradosstriper lockless-read implementation), overriding the Ceph version installed in the container.
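One way to check for this kind of library override is to look at which Ceph shared libraries a running process has actually mapped. The following is a rough sketch only: it assumes a Linux host with `/proc/<pid>/maps` readable, and the library-name pattern and function names are illustrative assumptions rather than anything from the gateway configuration.

```python
import re
from pathlib import Path

# Library name fragments of interest (an assumption for illustration).
CEPH_LIB_PATTERN = re.compile(r"lib(rados|radosstriper|ceph-common)[^/]*\.so")


def ceph_libs_in_maps(maps_text: str) -> list[str]:
    """Return sorted, de-duplicated Ceph library paths found in maps text."""
    paths = set()
    for line in maps_text.splitlines():
        # Fields: address, perms, offset, dev, inode, [pathname]
        fields = line.split(maxsplit=5)
        if len(fields) == 6 and CEPH_LIB_PATTERN.search(fields[5]):
            paths.add(fields[5].strip())
    return sorted(paths)


def ceph_libs_for_pid(pid: int) -> list[str]:
    """Return the Ceph libraries currently mapped by process `pid`."""
    return ceph_libs_in_maps(Path(f"/proc/{pid}/maps").read_text())
```

If the paths returned for the xrootd process point outside the container's own library directories, an older build is probably being picked up, for example via a leftover `LD_PRELOAD` setting.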

Working on ingress setup. Had a 'cannot allocate port' error when setting up the service (port forwarding). Google suggests an issue with the cluster; will try rebuilding from scratch to see if that fixes the issue.

on GGUS:

Site reports

Lancaster - moved to 5.6.2-2, all ok

Glasgow - gateways being set up. A Ceph disk node is using a lot of swap (one OSD using large virtual memory). 5.6.2-2 testing TBD. Later versions of Nautilus are more aggressive in caching; the recommendation is to turn swap off.
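To see which process (for example a single OSD) is responsible for swap usage on a node, something along these lines might help. It is a sketch under the assumption of a Linux host where `/proc/<pid>/status` exposes a `VmSwap` line; the function names are hypothetical.

```python
from pathlib import Path


def swap_kb_from_status(status_text: str) -> int:
    """Return the VmSwap value (kB, as reported by the kernel), or 0 if absent."""
    for line in status_text.splitlines():
        if line.startswith("VmSwap:"):
            # Line format: "VmSwap:      1234 kB"
            return int(line.split()[1])
    return 0


def top_swap_users(limit: int = 5) -> list[tuple[int, str, int]]:
    """Return up to `limit` (pid, name, swap_kb) tuples, largest swap first."""
    results = []
    for status in Path("/proc").glob("[0-9]*/status"):
        try:
            text = status.read_text()
        except OSError:
            continue  # process exited, or not readable
        name = next(
            (l.split(":", 1)[1].strip() for l in text.splitlines() if l.startswith("Name:")),
            "?",
        )
        results.append((swap_kb_from_status(text), int(status.parent.name), name))
    results.sort(reverse=True)
    return [(pid, name, kb) for kb, pid, name in results[:limit]]
```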

Action items

Decisions