2024-12-05 Meeting Notes



Dec 5, 2024



  • @Alexander Rogovskiy



  • Lancs: Gerard, Matt, Steven

  • Glasgow: Sam



  • @James Walder





  • List of Epics

  • New tickets

  • Consider new functionality / items

  • Detailed discussion of important topics

  • Site report activity


 Discussion topics

Current status of Echo Gateways / WNs testing

Recent sandboxes for review / deployments:










Operational Issues
Gateways and WNs:
- Current status and upcoming changes

(Gateway Auth failures)



Upgrades of GWs complete


AAA gateways with large numbers of connections:


37931 ESTAB

~3k ESTAB from remote hosts, 2.8k CLOSE_WAIT from remote hosts
xrd.timeout idle 60m read 10 (in current config)

throttle increase seems to have fixed it
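
A minimal sketch of how per-state connection counts like those above could be tallied on a gateway. It assumes the input is text in the style of `ss -tan` output (state in the first column); the function name and the sample addresses are illustrative only and are not taken from the gateway hosts. `ss` prints `CLOSE-WAIT` with a hyphen, so states are normalised to the underscore form used in these notes.

```python
from collections import Counter


def count_tcp_states(ss_output: str) -> Counter:
    """Tally TCP connection states from `ss -tan`-style text (assumed format)."""
    counts = Counter()
    for line in ss_output.splitlines():
        fields = line.split()
        # Skip blank lines and the header row, which starts with "State".
        if not fields or fields[0] == "State":
            continue
        # Normalise e.g. CLOSE-WAIT to CLOSE_WAIT to match the notes.
        counts[fields[0].replace("-", "_")] += 1
    return counts


# Illustrative sample only (documentation address ranges, not real hosts).
sample = """\
State      Recv-Q Send-Q Local Address:Port   Peer Address:Port
ESTAB      0      0      192.0.2.10:1094      198.51.100.7:51234
ESTAB      0      0      192.0.2.10:1094      198.51.100.8:40112
CLOSE-WAIT 1      0      192.0.2.10:1094      203.0.113.5:38822
LISTEN     0      128    0.0.0.0:1094         0.0.0.0:*
"""

if __name__ == "__main__":
    for state, n in count_tcp_states(sample).most_common():
        print(f"{n:>6} {state}")
```

Run periodically, a tally like this would show whether CLOSE_WAIT counts keep growing after the throttle change.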


cms-aaa naming convention


cms-aaa is the only remaining personality to use proxy/ceph as the xrootd service names

Separate naming convention would be more appropriate, to have main/supporting

(not so urgent).

CC created, but due to be reviewed in December



XRootD Managers De-VMWareification

@Thomas, Jyothish (STFC,RAL,SC)

Option 2 preferred for efficiency, but Option 1 decided on

Option 1 would be simpler to implement for a temporary fix, as the move would be reversed

Antares TPC nodes to be moved to an Echo leafsw; IPv4 real estate to be confirmed with James


Compilation and rollout status with XrdCeph and Rocky 8: 5.7.x

@Thomas, Jyothish (STFC,RAL,SC)




@Katy Ellis

Shoveler installation and monitoring



Deletion studies through RDR

@Ian Johnson








XRootD Writable Workernode Gateway Hackathon


@Thomas, Jyothish (STFC,RAL,SC)

XRootD Writable Workernode Gateway Hackathon (XWWGH)

Tues 12th Nov 1600
Hackathon writeable workernode



XRootD testing framework


XRootD Site Testing Framework



100 GbE Gateway testing:
SKA / Tier-1

@James Walder




UKSRC Storage Architecture




Tokens Status


  • Operational

  • Technical

  • Accounting






on GGUS:

Site reports


Lancaster: Mostly a lot of wailing and gnashing of teeth.

As discussed in many meetings this week, a short power outage last Friday night knocked out a small chunk of our cluster. We came out of that looking okay, with just a few degraded PGs, but keep tripping up as Ceph goes read-only due to falsely thinking an OSD is full (when it’s got 25% free space) until Gerard kicks it. Gerard tracked it to some existing (since ~Pacific) Ceph bugs.

(We’re not actually 100% sure that recovering from the power outage is the root cause of this issue or just an event that created a need for data shuffling around the OSDs, but it certainly didn’t help).




 Action items

Work out how to replace the original functionality of fstream monitoring, now that OpenSearch has replaced the existing solutions.




