2024-12-05 Meeting Notes
Date
Dec 5, 2024
Participants
@Alexander Rogovskiy
Lancs: Gerard, Matt, Steven
Glasgow: Sam
Apologies:
@James Walder
CC:
Goals
List of Epics
New tickets
Consider new functionality / items
Detailed discussion of important topics
Site report activity
Discussion topics
Current status of Echo Gateways / WNs testing
Recent sandboxes for review / deployments:
Item | Presenter | Notes |
---|---|---|
Operational Issues (Gateway Auth failures) | | Upgrades of GWs complete. AAA gateways with large numbers of connections (gw10): throttle increase seems to have fixed it. |
cms-aaa naming convention | | cms-aaa is the only remaining personality to use proxy/ceph as the xrootd service names. A separate naming convention (main/supporting) would be more appropriate, but this is not so urgent. CC created, but due to be reviewed in December. |
XRootD Managers De-VMWareification | @Thomas, Jyothish (STFC,RAL,SC) | Option 2 preferred for efficiency, but Option 1 decided on. Option 1 would be simpler to implement for a temporary fix, as the move would be reversed. Antares TPC nodes to be moved to an Echo leafsw; to confirm IPv4 real estate with James. |
Compilation and rollout status with XrdCeph and Rocky 8: 5.7.x | @Thomas, Jyothish (STFC,RAL,SC) | |
Shoveler | @Katy Ellis | Shoveler installation and monitoring |
Deletion studies through RDR | @Ian Johnson | |
Deletions | | |
XRootD Writable Workernode Gateway Hackathon | @Thomas, Jyothish (STFC,RAL,SC) | XRootD Writable Workernode Gateway Hackathon (XWWGH) Tues 12th Nov 1600 Outcomes |
XRootD testing framework | | |
100 GbE Gateway testing | @James Walder | https://stfc.atlassian.net/wiki/spaces/UK/pages/215941180 |
UKSRC Storage Architecture | | |
Tokens Status | | |
on GGUS:
Site reports
Lancaster: Mostly a lot of wailing and gnashing of teeth.
As discussed in many meetings this week, a short power outage last Friday night knocked out a small chunk of our cluster. We came out of that looking okay, with just a few degraded PGs, but we keep tripping up as Ceph goes read-only due to falsely thinking an OSD is full (when it’s got 25% free space) until Gerard kicks it. Gerard tracked it to some existing (since ~Pacific) Ceph bugs.
(We’re not actually 100% sure that recovering from the power outage is the root cause of this issue or just an event that created a need for data shuffling around the OSDs, but it certainly didn’t help).
Glasgow
Action items
Work out how to replace the original functionality of fstream monitoring, now that OpenSearch has replaced the existing solutions.