2024-11-28 Meeting Notes
Date
Nov 28, 2024
Participants
@Alexander Rogovskiy
@James Walder
@Thomas Byrne
Lancs: Gerard, Matt, Steven
Glasgow: Sam
Apologies:
@Thomas, Jyothish (STFC,RAL,SC)
CC:
Goals
List of Epics
New tickets
Consider new functionality / items
Detailed discussion of important topics
Site report activity
Discussion topics
Current status of Echo Gateways / WNs testing
Recent sandboxes for review / deployments:
| Item | Presenter | Notes |
|---|---|---|
| Operational Issues (Gateway Auth failures) | | Upgrades of GWs this week; main GWs complete; Alice in progress. |
| AAA gateways with large numbers of connections (gw10) | | |
| cms-aaa naming convention | | cms-aaa is the only remaining personality to use proxy/ceph as the xrootd service names. A separate naming convention, with main/supporting roles, would be more appropriate (not so urgent). CC created, but due to be reviewed in December. |
| XRootD Managers De-VMWareification | @Thomas, Jyothish (STFC,RAL,SC) | Option 2 preferred for efficiency, but Option 1 decided on: it is simpler to implement for a temporary fix, as the move will eventually be reversed. Antares TPC nodes to be moved to an Echo leaf switch; IPv4 real estate to be confirmed with James. |
| Compilation and rollout status with XrdCeph and Rocky 8: 5.7.x | @Thomas, Jyothish (STFC,RAL,SC) | |
| Shoveler | @Katy Ellis | Shoveler installation and monitoring |
| Deletion studies through RDR | @Ian Johnson | |
| Deletions | | |
| XRootD Writable Workernode Gateway Hackathon | @Thomas, Jyothish (STFC,RAL,SC) | XRootD Writable Workernode Gateway Hackathon (XWWGH), Tues 12th Nov 16:00. Outcomes |
| Xrootd testing framework | | |
| 100 GbE Gateway testing | @James Walder | https://stfc.atlassian.net/wiki/spaces/UK/pages/215941180 |
| UKSRC Storage Architecture | | |
| Tokens Status | | |
on GGUS:
Site reports
Lancaster: Day 2 of our mini-DC run was a bit spoilt by Ceph having a wobbly, presumably because of some sick OSDs. As Matt understands it, we hit the last little bit of the backfilling backlog we had to do, but rather than being a weight lifted off our cluster, things “cramped up”, focussing operations on a small number of PGs. (Gerard can correct me if I have the wrong end of the stick.) To top it off, one of our OSDs keeled over physically.
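A quick way to check whether operations really were cramped onto a handful of PGs would be to summarise the stuck PGs and the acting OSDs they map to. A rough sketch only, assuming a default cluster config and `ceph` on PATH (the exact JSON field names vary a little between Ceph releases):

```python
#!/usr/bin/env python3
"""Summarise stuck PGs and the acting OSDs they map to."""
import json
import subprocess
from collections import Counter

# 'unclean' covers PGs stuck backfilling, recovering, or degraded
out = subprocess.run(
    ["ceph", "pg", "dump_stuck", "unclean", "--format", "json"],
    check=True, capture_output=True, text=True,
).stdout

data = json.loads(out)
# Newer Ceph wraps the list in 'stuck_pg_stats'; older releases return it bare.
pgs = data.get("stuck_pg_stats", []) if isinstance(data, dict) else data

state_counts = Counter(pg["state"] for pg in pgs)
osd_counts = Counter(osd for pg in pgs for osd in pg.get("acting", []))

print("stuck PG states:", dict(state_counts))
print("busiest acting OSDs:", osd_counts.most_common(5))
```

If a few OSDs dominate the acting sets, that would back up the “cramped up” theory.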
We also had an unrelated issue (not caused by the mini-DC, but maybe by a large burst of LSST jobs): a bunch of WNs had their CephFS mounts enter a horrid state, the third time this has happened. It was discussed in storage yesterday and tracked to a likely bug in the kernel. From dmesg:
[Tue Nov 26 16:02:51 2024] libceph: wrong peer, want (1)10.41.12.56:6929/2269030683, got (1)10.41.12.56:6929/324345577
[Tue Nov 26 16:02:51 2024] libceph: osd435 (1)10.41.12.56:6929 wrong peer at address
Our bindfs tests were disappointing: our bonnie tests show read rates through the bind mount around 1/8th of what we see through the regular one. Writes seem less affected, and latency doesn’t appear to be impacted. Using the bindfs “multithread” option didn’t help much. The probable culprit is the tiny (and, AFAICS, unchangeable) default bindfs blocksize. But these were noddy bonnie tests, and maybe this is the wrong tool? Would noddy dd be better?
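For a noddy dd-style cross-check, a sketch like the one below could compare sequential read rates through the two mounts. Everything concrete in it is a placeholder: the mount points and test file are hypothetical, and the page cache would need dropping between runs (e.g. `echo 3 > /proc/sys/vm/drop_caches`) for the numbers to mean much.

```python
#!/usr/bin/env python3
"""Noddy sequential-read benchmark, a rough stand-in for dd.

Sketch only: the mount points and file name below are hypothetical
placeholders, not the actual Lancaster paths.
"""
import os
import time

BLOCK_SIZE = 4 * 1024 * 1024  # 4 MiB reads, analogous to dd bs=4M
TEST_FILE = "testfile.bin"    # a pre-created large file, identical on both paths
MOUNTS = {
    "regular": "/cephfs/test",      # hypothetical direct CephFS mount
    "bindfs": "/cephfs-bind/test",  # hypothetical bindfs view of the same tree
}

def read_throughput(path: str) -> float:
    """Sequentially read the whole file and return MB/s."""
    total = 0
    start = time.monotonic()
    with open(path, "rb", buffering=0) as f:  # unbuffered, so reads hit the mount
        while chunk := f.read(BLOCK_SIZE):
            total += len(chunk)
    return total / (time.monotonic() - start) / 1e6

for name, mount in MOUNTS.items():
    path = os.path.join(mount, TEST_FILE)
    print(f"{name:8s} {read_throughput(path):8.1f} MB/s")
```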
Glasgow
Action items
How to replace the original functionality of fstream monitoring, now that OpenSearch has replaced the existing solutions.
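One possible starting point, sketched below: pull the raw monitoring records straight out of OpenSearch and rebuild the fstream-style summaries from those. This assumes opensearch-py; the host, index pattern, and field names are placeholders for whatever the collector actually writes.

```python
"""Minimal sketch of pulling recent XRootD monitoring records from OpenSearch.

All concrete names here (host, index pattern, timestamp field) are
assumptions; they depend on how the collector indexes its documents.
"""
from opensearchpy import OpenSearch  # pip install opensearch-py

client = OpenSearch(
    hosts=[{"host": "opensearch.example.org", "port": 9200}],  # hypothetical host
    use_ssl=True,
)

resp = client.search(
    index="xrootd-transfers-*",  # hypothetical index pattern
    body={
        "size": 10,
        "query": {"range": {"@timestamp": {"gte": "now-1h"}}},
        "sort": [{"@timestamp": {"order": "desc"}}],
    },
)

for hit in resp["hits"]["hits"]:
    print(hit["_source"])  # one monitoring record per document
```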