2024-12-05 Meeting Notes

 Date

Dec 5, 2024

 Participants

 

  • @Alexander Rogovskiy

  • Lancs: Gerard, Matt, Steven

  • Glasgow: Sam

Apologies:

  • @James Walder

CC:

 

 

 Goals

  • List of Epics

  • New tickets

  • Consider new functionality / items

  • Detailed discussion of important topics

  • Site report activity

 

 Discussion topics

Current status of Echo Gateways / WNs testing

Recent sandboxes for review / deployments:

 

Item

Presenter

Notes

 

Item

Presenter

Notes

 

Operational Issues
Gateways and WNs:
- Current status and upcoming changes

(Gateway Auth failures)

 

 

Upgrades of GWs complete

 

AAA gateways with large numbers of connections:

[Screenshot: image-20241128-123546.png]

Connection states on gw10:
 3305 CLOSE-WAIT
37931 ESTAB

Of these, ~3k ESTAB and ~2.8k CLOSE-WAIT are from remote hosts.
Current config: xrd.timeout idle 60m read 10

Increasing the throttle seems to have fixed it.
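The gw10 figures are per-state tallies of TCP sockets, split into "all" and "from remote hosts". Here is a minimal Python sketch of how such counts could be reproduced from `ss -tan` output. It assumes the default `ss -tan` column layout with a header row (peer address in the fifth column). `tally_states`, `parse_peer_host` and the `local_nets` list, which defines what counts as a local network, are hypothetical names for illustration, not existing tooling.

```python
import ipaddress
from collections import Counter


def parse_peer_host(peer: str):
    """Return the IP address from a peer field like '10.0.0.2:54321' or '[::1]:1094'.

    Returns None for wildcard or unparsable entries such as '*:*'.
    """
    host = peer.rsplit(":", 1)[0].strip("[]")
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        return None
    # Treat IPv4-mapped IPv6 peers (::ffff:a.b.c.d) as plain IPv4 addresses.
    if addr.version == 6 and addr.ipv4_mapped is not None:
        addr = addr.ipv4_mapped
    return addr


def tally_states(ss_output: str, local_nets=()):
    """Count TCP states in `ss -tan` output: overall, and for peers outside local_nets.

    Assumes the first line is the ss header row (i.e. ss was not run with -H).
    """
    nets = [ipaddress.ip_network(n) for n in local_nets]
    total, remote = Counter(), Counter()
    for line in ss_output.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 5:
            continue
        state, peer = fields[0], fields[4]
        total[state] += 1
        addr = parse_peer_host(peer)
        # A peer is "remote" if it parses as an IP and lies in none of the local networks.
        if addr is not None and not any(addr in net for net in nets):
            remote[state] += 1
    return total, remote
```

On a gateway this would be fed with something like `subprocess.run(["ss", "-tan"], capture_output=True, text=True).stdout`, with `local_nets` set to the site's own ranges (e.g. the hypothetical `["10.0.0.0/8"]`).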

 

cms-aaa naming convention

 

cms-aaa is the only remaining personality that uses proxy/ceph as the XRootD service names.

A separate naming convention (main/supporting) would be more appropriate (not urgent).

CC created; due to be reviewed in December.

 

 

XRootD Managers De-VMwareification

@Thomas, Jyothish (STFC,RAL,SC)

Option 2 was preferred for efficiency, but Option 1 was chosen: it is simpler to implement as a temporary fix, since the move would later be reversed.

Antares TPC nodes to be moved to an Echo leaf switch; IPv4 address space to be confirmed with James.

 

Compilation and rollout status with XrdCeph and Rocky 8: 5.7.x

@Thomas, Jyothish (STFC,RAL,SC)

 

 

Shoveler

@Katy Ellis

Shoveler installation and monitoring

 

 

Deletion studies through RDR

@Ian Johnson

 

 

 

Deletions

https://stfc.atlassian.net/browse/XRD-83

 

 

XRootD Writable Workernode Gateway Hackathon

 

@Thomas, Jyothish (STFC,RAL,SC)

XRootD Writable Workernode Gateway Hackathon (XWWGH)

Tues 12th Nov, 16:00
Hackathon: writable workernode

Outcomes

 

XRootD testing framework

 

XRootD Site Testing Framework

 

 

100 GbE Gateway testing:
SKA / Tier-1

@James Walder

https://stfc.atlassian.net/wiki/spaces/UK/pages/215941180

 

UKSRC Storage Architecture

 

 

 

Tokens Status

 

  • Operational

  • Technical

  • Accounting

 

 

 

 

 

On GGUS:

Site reports

 

Lancaster: Mostly a lot of wailing and gnashing of teeth.

As discussed in many meetings this week, a short power outage last Friday night knocked out a small chunk of our cluster. We came out of that looking okay, with just a few degraded PGs, but we keep tripping up as Ceph goes read-only because it wrongly thinks an OSD is full (when it has 25% free space) until Gerard kicks it. Gerard tracked this to some existing Ceph bugs (present since ~Pacific).

(We're not 100% sure whether recovering from the power outage is the root cause of this issue or just an event that forced data to be shuffled around the OSDs, but it certainly didn't help.)
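A "full" flag can be sanity-checked against the OSDs' own usage figures. Here is a minimal Python sketch that classifies each OSD against Ceph's default thresholds (nearfull 0.85, backfillfull 0.90, full 0.95; a cluster's actual values are shown by `ceph osd dump`). It assumes the JSON layout of `ceph osd df -f json`, a `nodes` list whose entries carry `name`, `kb` and `kb_used`. `classify_osds` is a hypothetical helper, not part of Ceph. At 25% free (75% used), an OSD sits below all three default thresholds, so under default ratios a full flag on it does not reflect real capacity pressure.

```python
import json

# Ceph's default ratios; check `ceph osd dump` for the values actually in force.
DEFAULT_RATIOS = {"nearfull": 0.85, "backfillfull": 0.90, "full": 0.95}


def classify_osds(osd_df_json: str, ratios=DEFAULT_RATIOS):
    """Map each OSD name to (used_fraction, level) from `ceph osd df -f json` output.

    level is the highest threshold the OSD has reached, or "ok" if none.
    Usage is computed as kb_used / kb.
    """
    data = json.loads(osd_df_json)
    result = {}
    for node in data.get("nodes", []):
        kb = node.get("kb", 0)
        if kb <= 0:
            continue  # skip OSDs reporting no capacity (e.g. down/out)
        used = node["kb_used"] / kb
        level = "ok"
        # Walk thresholds from lowest to highest, keeping the last one reached.
        for name, ratio in sorted(ratios.items(), key=lambda kv: kv[1]):
            if used >= ratio:
                level = name
        result[node["name"]] = (used, level)
    return result
```

Fed with `subprocess.run(["ceph", "osd", "df", "-f", "json"], capture_output=True, text=True).stdout`, this gives a quick per-OSD view to compare against whatever the cluster is currently flagging.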

 

Glasgow

 

 Action items

How to replace the original functionality of fstream monitoring, now that OpenSearch has replaced the existing solutions.

 

 

 Decisions