2023-10-12 Meeting Notes

 Date

Oct 5, 2023

 Participants

  • @Alexander Rogovskiy

  • @Thomas, Jyothish (STFC,RAL,SC)

  • Glasgow: Sam

  • Lancaster: @Matt Doidge, Gerard

Apologies:

@James Walder

 

 Goals

  • List of Epics

  • New tickets

  • Consider new functionality / items

  • Detailed discussion of important topics

  • Site report activity

 

 Discussion topics

Current status of Echo Gateways / WNs testing

Recent sandboxes for review / deployments:

 

Item

Presenter

Notes

 

 

XRootD Releases

 

5.6.2-2 is out
- Plan for deployment

  • Is passing on pre-prod

  • Upgrade of CMake version is exhibiting ‘stack smashing’?? (or different compiler versions)

  • Features for XrdCeph that need to be included?

Aim for prod testing next week. (aim for 1 week of testing, then deploy if ok).

 

Sam notes:
If you're building xrootd 5.6.2 from source, the tagged 5.6.2 in the git repo does not have the bugfix for authdb parsing, as I just discovered to my cost (that's 2 commits later...)

Lancs works (off-the-shelf 5.6.2-2)

 

Checksums fixes

 

 

 

Prefetch studies and WN changes

Alex

(temporarily rolled back, with the ongoing work in batch farm WNs)

  • The LD_PRELOAD for lockless reads was removed (it was compiled against an ‘older’ version of Ceph and is superseded by the recent readV work).

  • WN containers need updated Ceph rpms (for 14.2.22)

  • @Alexander Rogovskiy to present ‘final’ status of testing in next meeting on the prefetch work

  • @James Walder to add work items to the queue in Jira

  • Failures on the latest prefetch (timeout) might be due to the Ceph version

 

Deletion studies through RDR

Ian

 

 

CMSD rollout

 

https://stfc.atlassian.net/browse/XRD-41
New diagram required?

svc01, 02, 17, 18 stay as internal WN gateways for now.
The other svc hosts (03, 05, 11, 13-16) are to be added to the CMSD production cluster.

svc19 (designated for Alice gateway)
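
As a quick cross-check of the allocation above, the split of svc hosts can be written down as a small lookup. This is only a sketch: the role labels and the `role_of` / `hosts_with_role` helpers are hypothetical names introduced here for illustration; only the host numbers come from the notes.

```python
# Sketch of the svc host allocation noted above.
# Role labels and function names are illustrative assumptions, not real config.

WN_GATEWAYS = {1, 2, 17, 18}                  # stay as internal WN gateways for now
CMSD_PRODUCTION = {3, 5, 11, *range(13, 17)}  # 03, 05, 11, 13-16
ALICE_GATEWAY = {19}                          # svc19


def role_of(host_number: int) -> str:
    """Return the planned role for svcNN, or 'unassigned' if it is not listed."""
    if host_number in WN_GATEWAYS:
        return "wn-gateway"
    if host_number in CMSD_PRODUCTION:
        return "cmsd-production"
    if host_number in ALICE_GATEWAY:
        return "alice-gateway"
    return "unassigned"


def hosts_with_role(role: str) -> list[str]:
    """List host names (svcNN) that carry the given role, in numeric order."""
    return [f"svc{n:02d}" for n in range(1, 20) if role_of(n) == role]
```

Hosts not mentioned in the notes (for example svc04) come back as "unassigned" rather than being guessed.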

 

Gateways on new network plan

 

IPv6 sorted; firewall rule changes in progress

LHCONE issue sorted

Fermilab can’t be reached through IPv6, but tracepath gets as far as the LHCOPN CERN router

To consider the TPC instance port

 

Gateways: observations

 

WN gateways showed a spike in memory usage: the 2 gateways with swap enabled filled a few 100 GB of swap; the other 2 crashed at the poller

 

 

CMSD outstanding items

 

Icinga / Nagios callout test changes: live and available

  • ping test for the floating IPs and gateway hosts needs some more refinement

Improved load balancing / server failover triggering

Better 'rolling server restart' script

Documentation; setup / configuration / operations / troubleshooting / testing

Review of Sandbox and deployment to prod:
- Awaiting time from Tom for load balancer test

Sandbox has been reviewed; awaiting final confirmation from @Thomas Byrne

 

Tokens testing

 

To liaise with the TTT Taskforce (aka @Matt Doidge)

No update

 

AAA Gateways

 

Sandbox ready for review:

http://aquilon.gridpp.rl.ac.uk/sandboxes/diff.php?sandbox=jw-xrootd-aaa-5.5.4-3
Needs a bigger discussion regarding Tokens, and deployment in production hosts.

 

SKA Gateway box

 

https://stfc.atlassian.net/wiki/spaces/UK/pages/215941180

Now working, using the SKA pool on Ceph dev

Initial iperf3 tests (see table and plots below).

  • Actions

    • Ensure Xrootd01 is tuned correctly, according to the NVIDIA / Mellanox instructions

    • Repeat the iperf tests

  • Xrootd tests against:

    • dev-echo

    • cephfs (Deneb dev)

    • cephfs (openstack; permissions/routing issues)?

    • local disk / mem

  • Frontend routing is also being worked on

 

 

Containerised gateways (Kubernetes cluster)

 

Identified an issue on worker node gateways where Ceph Nautilus 14.2.15 libraries were loaded (from a previous libradosstriper lockless read implementation), overriding the Ceph version installed in the container
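
One way to spot this kind of library override is to look at which Ceph shared objects a running process has actually mapped. The following is a minimal sketch, not the method used in the investigation: it parses text in the Linux `/proc/<pid>/maps` format, and the `ceph_libraries` helper and the library-name pattern are assumptions made here for illustration. It reports paths only, not versions; a path outside the container's normal library directory would be the hint that an older copy has been picked up.

```python
import re

# Assumed pattern for Ceph client libraries, e.g. librados.so.2.0.0,
# libradosstriper.so.1, libceph-common.so.0 (illustrative, not exhaustive).
CEPH_LIB = re.compile(r"lib(rados|radosstriper|ceph-common)[^/]*\.so[^/]*$")


def ceph_libraries(maps_text: str) -> set[str]:
    """Return the distinct Ceph library paths found in /proc/<pid>/maps text."""
    paths = set()
    for line in maps_text.splitlines():
        # maps format: address perms offset dev inode [pathname]
        fields = line.split(maxsplit=5)
        if len(fields) == 6:
            path = fields[5].strip()
            if CEPH_LIB.search(path):
                paths.add(path)
    return paths
```

In use, the text would come from something like `open(f"/proc/{pid}/maps").read()` for the gateway process of interest, run inside the container.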

Working on the ingress setup: hit a 'cannot allocate port' error when setting up the service (port forwarding). Google suggests an issue with the cluster; will try rebuilding from scratch to see if that fixes it

 

 

on GGUS:

Site reports

Lancaster - moved to 5.6.2-2, all ok

Glasgow - gateways being set up; a Ceph disk node is using a lot of swap (one OSD using a large amount of virtual memory). 5.6.2-2 testing TBD. Later versions of Nautilus are more aggressive in caching; the recommendation is to turn swap off

 

 

 Action items

  •  

 

 Decisions