2024-11-28 Meeting Notes
Date
Nov 28, 2024
Participants
@Alexander Rogovskiy
@James Walder
@Thomas Byrne
Lancs: Gerard, Matt, Steven
Glasgow: Sam
Apologies:
@Thomas, Jyothish (STFC,RAL,SC)
CC:
Goals
List of Epics
New tickets
Consider new functionality / items
Detailed discussion of important topics
Site report activity
Discussion topics
Current status of Echo Gateways / WNs testing
Recent sandboxes for review / deployments:
| Item | Presenter | Notes |
|---|---|---|
| Operational Issues (Gateway Auth failures) | | Upgrades of GWs this week; main GWs complete; Alice in progress. |
| AAA gateways with large numbers of connections (gw10) | | |
| cms-aaa naming convention | | cms-aaa is the only remaining personality to use proxy/ceph as the xrootd service names. A separate naming convention, with main/supporting roles, would be more appropriate (not so urgent). CC created, but due to be reviewed in December. |
| XRootD Managers De-VMWareification | @Thomas, Jyothish (STFC,RAL,SC) | Option 2 preferred for efficiency, but Option 1 decided on: it is simpler to implement for a temporary fix, as the move will eventually be reversed. Antares TPC nodes to be moved to an Echo leaf switch; IPv4 real estate to be confirmed with James. |
| Compilation and rollout status with XrdCeph and Rocky 8: 5.7.x | @Thomas, Jyothish (STFC,RAL,SC) | |
| Shoveler | @Katy Ellis | Shoveler installation and monitoring |
| Deletion studies through RDR | @Ian Johnson | |
| Deletions | | |
| XRootD Writable Workernode Gateway Hackathon | @Thomas, Jyothish (STFC,RAL,SC) | XRootD Writable Workernode Gateway Hackathon (XWWGH), Tues 12th Nov 16:00. Outcomes |
| Xrootd testing framework | | |
| 100 GbE Gateway testing | @James Walder | https://stfc.atlassian.net/wiki/spaces/UK/pages/215941180 |
| UKSRC Storage Architecture | | |
| Tokens Status | | |
on GGUS:
Site reports
Lancaster: Day 2 of our mini-DC run was a bit spoilt by Ceph having a wobbly, presumably because of some sick OSDs. As Matt understands it, we hit the last little bit of the backfilling backlog we had to do, but rather than being a weight lifted off our cluster, things “cramped up”, focussing operations on a small number of PGs. (Gerard can correct me if I have the wrong end of the stick.) To top it off, one of our OSDs keeled over physically.
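A quick way to check whether operations really were cramped onto a handful of PGs would be to summarise the stuck PGs and the acting OSDs they map to. A rough sketch only, assuming a default cluster config and `ceph` on PATH (the exact JSON field names vary a little between Ceph releases):

```python
#!/usr/bin/env python3
"""Summarise stuck PGs and the acting OSDs they map to."""
import json
import subprocess
from collections import Counter

# 'unclean' covers PGs stuck backfilling, recovering, or degraded
out = subprocess.run(
    ["ceph", "pg", "dump_stuck", "unclean", "--format", "json"],
    check=True, capture_output=True, text=True,
).stdout

data = json.loads(out)
# Newer Ceph wraps the list in 'stuck_pg_stats'; older releases return it bare.
pgs = data.get("stuck_pg_stats", []) if isinstance(data, dict) else data

state_counts = Counter(pg["state"] for pg in pgs)
osd_counts = Counter(osd for pg in pgs for osd in pg.get("acting", []))

print("stuck PG states:", dict(state_counts))
print("busiest acting OSDs:", osd_counts.most_common(5))
```

If a few OSDs dominate the acting sets, that would back up the “cramped up” theory.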
We also had an unrelated issue (not caused by the mini-DC, but maybe by a large burst of LSST jobs): a bunch of WNs had their CephFS mounts enter a horrid state, the third time this has happened. It was discussed in storage yesterday and tracked to a likely bug in the kernel. From dmesg:
[Tue Nov 26 16:02:51 2024] libceph: wrong peer, want (1)10.41.12.56:6929/2269030683, got (1)10.41.12.56:6929/324345577
[Tue Nov 26 16:02:51 2024] libceph: osd435 (1)10.41.12.56:6929 wrong peer at address
Our bindfs tests were disappointing: our bonnie tests show read rates through the bind mount around 1/8th of what we see through the regular one. Writes seem less affected, and latency doesn’t appear to be impacted. Using the bindfs “multithread” option didn’t help much. The probable culprit is the tiny (and, AFAICS, unchangeable) default bindfs blocksize. But these were noddy bonnie tests, and maybe this is the wrong tool? Would noddy dd be better?
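For a noddy dd-style cross-check, a sketch like the one below could compare sequential read rates through the two mounts. Everything concrete in it is a placeholder: the mount points and test file are hypothetical, and the page cache would need dropping between runs (e.g. `echo 3 > /proc/sys/vm/drop_caches`) for the numbers to mean much.

```python
#!/usr/bin/env python3
"""Noddy sequential-read benchmark, a rough stand-in for dd.

Sketch only: the mount points and file name below are hypothetical
placeholders, not the actual Lancaster paths.
"""
import os
import time

BLOCK_SIZE = 4 * 1024 * 1024  # 4 MiB reads, analogous to dd bs=4M
TEST_FILE = "testfile.bin"    # a pre-created large file, identical on both paths
MOUNTS = {
    "regular": "/cephfs/test",      # hypothetical direct CephFS mount
    "bindfs": "/cephfs-bind/test",  # hypothetical bindfs view of the same tree
}

def read_throughput(path: str) -> float:
    """Sequentially read the whole file and return MB/s."""
    total = 0
    start = time.monotonic()
    with open(path, "rb", buffering=0) as f:  # unbuffered, so reads hit the mount
        while chunk := f.read(BLOCK_SIZE):
            total += len(chunk)
    return total / (time.monotonic() - start) / 1e6

for name, mount in MOUNTS.items():
    path = os.path.join(mount, TEST_FILE)
    print(f"{name:8s} {read_throughput(path):8.1f} MB/s")
```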
Glasgow
Action items
How to replace the original functionality of fstream monitoring, now that OpenSearch has replaced the existing solutions.
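One possible starting point, sketched below: pull the raw monitoring records straight out of OpenSearch and rebuild the fstream-style summaries from those. This assumes opensearch-py; the host, index pattern, and field names are placeholders for whatever the collector actually writes.

```python
"""Minimal sketch of pulling recent XRootD monitoring records from OpenSearch.

All concrete names here (host, index pattern, timestamp field) are
assumptions; they depend on how the collector indexes its documents.
"""
from opensearchpy import OpenSearch  # pip install opensearch-py

client = OpenSearch(
    hosts=[{"host": "opensearch.example.org", "port": 9200}],  # hypothetical host
    use_ssl=True,
)

resp = client.search(
    index="xrootd-transfers-*",  # hypothetical index pattern
    body={
        "size": 10,
        "query": {"range": {"@timestamp": {"gte": "now-1h"}}},
        "sort": [{"@timestamp": {"order": "desc"}}],
    },
)

for hit in resp["hits"]["hits"]:
    print(hit["_source"])  # one monitoring record per document
```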