2022-08-17 Meeting notes
Date
Aug 17, 2022
Participants
@James Walder
@Alison Packer
@Thomas, Jyothish (STFC,RAL,SC)
@Alastair Dewhurst
Lancs: Matt, Gerard, Steven
Glasgow: Sam
Goals
List of Epics
New tickets
Consider new functionality / items
Detailed discussion of important topics
Site report activity
Discussion topics
Item | Presenter | Notes |
---|---|---|
5.5.0-rcX status / problems in testing | @Thomas, Jyothish (STFC,RAL,SC) Sam | Sam: appears OK (localgroupdisk permission issues at Glasgow). |
Centos 8: outstanding items? | | Continuing to progress. (See the Lancs site report for more details on the vector size problems, which are resolved in 5.5.0-rc2.) |
space info reporting: | | Alice testing on dev-gw2: confirmed that the Alice tests running on dev-gw2 are OK. Sandbox ready and deployment in preparation (aim: Monday). |
| (Not running at the moment due to dev-gw2 testing). | |
James | Scripts created to extract problematic files | |
Deletions / current understanding / necessary changes | | Options: |
“Unified” xrootd status / problems? | | Avoid TPC for xroot transfers on the webdav gateways |
xrdtpc.sh script without a proxy (is it possible?) | | Did not find a solution yet (see above). |
What are we going to do with Vector Reads now? | All | Removal of locking did not appear to help significantly. Work ongoing with EL8 and recent 5.X releases (both recommendations from Rob’s document). Rob C’s script is currently the only metric for LHCb to compare / verify against. Options: |
Site reports
Glasgow:
Lancaster:
Concerning the vector::reserve crashes, we had submitted two pull requests:
Modify vector's size instead of capacity to avoid bounds-checking failures #1630 - This replaced several uses of std::vector::reserve with resize, avoiding the assertion failure. Three places fixed:
macaroons
TPC streams
throttle manager
Vector cleared after use so it can be shrunk. #1639 - This fixed a consequence of #1630, in that some buffers might not relinquish their allocations as soon as expected (so a non-functional failure).
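For reference, a minimal sketch of the reserve-versus-resize distinction behind #1630 and the clean-up behind #1639. This is not the XRootD code: `copy_into_buffer` and `release_buffer` are hypothetical helpers written for these notes, and the mention of `_GLIBCXX_ASSERTIONS` is an assumption about how the failing builds were compiled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical helper (not from XRootD): copy n bytes from src into a new buffer.
std::vector<char> copy_into_buffer(const char* src, std::size_t n) {
    std::vector<char> buf;
    // buf.reserve(n) would only raise capacity(); size() would stay 0, so
    // buf[0] would be out of bounds. With libstdc++ assertions enabled
    // (assumed here: _GLIBCXX_ASSERTIONS), that access aborts the process.
    buf.resize(n);  // size() == n, so indices 0..n-1 are valid elements
    assert(buf.size() == n);
    if (n > 0) {
        std::memcpy(buf.data(), src, n);
    }
    return buf;
}

// Hypothetical helper: give the storage back once the buffer is no longer needed.
void release_buffer(std::vector<char>& buf) {
    buf.clear();          // size() becomes 0; capacity() is typically unchanged
    buf.shrink_to_fit();  // non-binding request to release the unused capacity
}
```

Calling `copy_into_buffer("abcd", 4)` yields a vector of size 4; `release_buffer` leaves it empty. Whether the capacity actually drops after `shrink_to_fit` is up to the implementation, so the sketch does not rely on it.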
Order of application:
Tag v5.4.0 had neither of these.
v5.4.1 included someone else's fix just for macaroons. One byte is pushed back, so that an expression that accesses [0] doesn't fail the assertion.
v5.4.3 includes #1639, even though it's only needed after #1630. Probably harmless.
v5.5.0-rc2 includes #1630. The other fix in 5.4.1 has been removed.
As a result, 5.4.1 got us past the macaroon problem, but not the TPC one.
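A rough sketch of the kind of workaround described for v5.4.1, where a single byte is pushed back so that `[0]` refers to a real element. The function name and buffer type are hypothetical and chosen only for illustration; the actual macaroons code is not reproduced here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical illustration of the "push back one byte" workaround.
std::vector<char> make_buffer_with_placeholder(std::size_t capacity) {
    std::vector<char> buf;
    buf.reserve(capacity);  // capacity only; size() is still 0 here
    buf.push_back('\0');    // size() becomes 1, so buf[0] is a valid element
    assert(buf.size() == 1);
    return buf;
}
```

Only index 0 is a valid element after this, so the approach satisfies an expression that touches `[0]` and nothing more, whereas `resize` makes the whole range valid.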
-Matt (now having sound issues)
-We do have a “new” Rocky 8 testbed node in place to test this out, but haven’t had a chance to try it with our own build.
-Redirector work almost done, but on hold until after the rucio settings change has settled down (see https://its.cern.ch/jira/browse/ADCINFR-239)
Currently the rucio Symlink change is looking perilous.
-errors like: Failed to stage-out file: mc16_13TeV:log.29979569._000296.job.log.tgz.1 to UKI-NORTHGRID-LANCS-HEP-CEPH_DATADISK, module 'rucio.rse.protocols.posix' has no attribute 'Symlink'")]:failed to transfer files using copytools=['rucio']