...
Lancs: Matt, Steven, Gerard
Glasgow: Sam
Apologies:
CC:
🥅 Goals
...
Item | Presenter | Notes | |||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
Operational Issues | Mitigations have been communicated to ATLAS for the jobs using 5.6.0 clients. Reboot campaign. Brief hiccup on icinga due to IPv6 issues. | ||||||||||
XRootD Managers De-VMWareification (Moving to physical hosts) | /wiki/spaces/GRIDPP/pages/872644647
CC due today, as a better but slightly more complicated Aquilon procedure is being used for the first time | ||||||||||
Release of 5.7.3 (may expect a 5.8.X prior to 6.X?) | https://github.com/xrootd/xrootd/releases/tag/v5.7.3
| ||||||||||
Checksums issue with an ATLAS file | https://github.com/xrootd/xrootd/issues/2388 https://ggus.eu/index.php?mode=ticket_info&ticket_id=169360 Checksum requested before whole file is updated. No ability to do stale checksum check in ceph, so original checksum ‘sticks’ to the file. Fix in place on the RAL side: checksums are cleared after a write is complete. | ||||||||||
cms-aaa naming convention | cms-aaa is the only remaining personality to use proxy/ceph as the xrootd service names. A separate naming convention (main/supporting) would be more appropriate, though this is not urgent. CC created; sandbox is prepared and has been tested on a test host. | ||||||||||
XRootD Managers De-VMWareification / cms-aaa jemalloc use |
Option 2 preferred for efficiency, but Option 1 decided on: Option 1 would be simpler to implement for a temporary fix, as the move would be reversed. Antares TPC nodes to be moved to an echo leafsw; to confirm IPv4 real estate with James. Hosts moved to rack, renamed and IP assigned; pending DI advertisement. | Compilation and rollout status with XrdCeph and Rocky 8: 5.7.x testing on svc20, some memory leak still present. | |||||||||
Compilation and rollout status of RAL XRootD versions | 5.7.2 published. 5.7.2 skipped on farm due to pfc bug; possible RAL release equivalent to 5.7.3, with a fix for that and 5.6.0 client compatibility, to be released soon. | ||||||||||
Shoveler | Shoveler installation and monitoring. Katy Ellis to feed back Lancaster (slow rate) observations to shoveler / CERN devs (possibly impacted by the infrastructure behind the Collector). To consider mitigations if unable to progress. | ||||||||||
On the fly Checksums
| Added configuration to PoC: option to turn on/off Adler32 on-the-fly calculation. Proved ability to set XrdCks.adler32 attribute from “standalone” code (running from the command line); will incorporate this into PoC code next. (Wasted time looking for attribute in wrong file…) Also to measure: throughput pattern (does this replicate the double throughput seen currently on first checksum request?). Discussed possible implementation as plugin/base xrootd. CRC32 also implemented here; noted that any new communities should use straight CRC32 variants. | Integrated checksum attribute storage into PoC. Measured time to transfer 10 x 3 GiB files in parallel to a dev gateway with xrdcp verifying the source checksum (lower bar is from the on-the-fly checksumming). Next steps: add in optional CRC32C calculation. Conduct larger-scale performance tests, ideally against a gateway machine which is more representative of production GWs. Sufficient testing will be critical. To be discussed. | |||||||||
Deletions |
| NTR | |||||||||
XRootD Writable Workernode Gateway Hackathon | XRootD Writable Workernode Gateway Hackathon (XWWGH) sandbox with fixes present, tested on LHCb workernode. Reading works fine as is; writes still need testing, to let jobs write only on that WN.
first write completed! | ||||||||||
XRootD testing framework | Discussion in Storage Meeting on how to integrate the various testing structures within the UK. Container with the testing framework TBD. | ||||||||||
Plan: file query system to summarize XRootD Logs | Plan to create a system that stores info from across all gateways, so that a filename can be searched to get its creation time, last write time, last successful stat and deletion time in the case of ‘lost’ files. Possible graduate side project. | ||||||||||
100 GbE Gateway testing: | UKSRC: XRootD used, acting as source for SRCNet testing/verification tests; not being stressed so far … Tier-1 cabled, but awaiting some work to progress on the switch. | ||||||||||
UKSRC Storage Architecture | Through discussions, need to change the DNS entries for the data and mgmt interfaces, update netbox and reconfigure in AQ. Data network will be (exclusively) for the DTN / data traffic. mgmt for ancillary needs (icinga, AQ). Host will be known via its mgmt dns name (the canonical name). | ||||||||||
Tokens Status |
|
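As a rough illustration of the log query plan in the table above, the sketch below keeps one record per filename and updates four timestamps as events arrive. The event tuple layout, the operation names (`create`, `write`, `stat`, `delete`), the gateway names and all function names are assumptions made for illustration; real XRootD gateway logs would need a parser in front of this.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional, Tuple

# Hypothetical, already-parsed log event: (unix_time, gateway, operation, path).
Event = Tuple[float, str, str, str]


@dataclass
class FileHistory:
    created: Optional[float] = None
    last_write: Optional[float] = None
    last_ok_stat: Optional[float] = None
    deleted: Optional[float] = None


def build_index(events: Iterable[Event]) -> Dict[str, FileHistory]:
    """Fold events from all gateways into one record per filename."""
    index: Dict[str, FileHistory] = {}
    # Sort by time so events from different gateways interleave correctly.
    for when, _gateway, op, path in sorted(events):
        rec = index.setdefault(path, FileHistory())
        if op == "create":
            # Keep only the first creation time seen.
            if rec.created is None:
                rec.created = when
        elif op == "write":
            rec.last_write = when
        elif op == "stat":
            rec.last_ok_stat = when
        elif op == "delete":
            rec.deleted = when
    return index


# Made-up sample events from three hypothetical gateways.
events = [
    (200.0, "gw03", "delete", "/atlas/file1"),
    (100.0, "gw01", "create", "/atlas/file1"),
    (110.0, "gw01", "stat", "/atlas/file1"),
    (105.0, "gw02", "write", "/atlas/file1"),
]
history = build_index(events)["/atlas/file1"]
print(history)
```

Looking up a ‘lost’ file is then a single dictionary access by filename; a persistent store would replace the in-memory dictionary in any real deployment.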
...
on GGUS:
Site reports
Lancaster: Following on from last week, we were looking at the load reported by the (default) cmsd load reporting scripts, and they didn’t seem to match up to any numbers we could pull from the servers. We got distracted by other things before we could dive deeper.
LSST planning to use small files for read/writes, and planning to remove TLS on pure xrootd. These seem to be intermediate files, but might need to be made available for quality control? Looking at object store route (s3 for internal use) maybe?
combination of uid/host based auth resulted in the following error on curl:
unknown.2:28@comp21-04.private.dns.zone Unable to open /cephfs/grid/dteam/curltest; permission denied
Thanks to Jyothish for sharing the load script; we’re looking at deploying it at Lancaster. To alleviate Matt’s worries about the cmsd load balancing not ignoring overloaded nodes, he replaced one host’s script with one that just returns “99” 5 times. Happily this node was ignored by the redirector, so that works (see green plot below between 10 and 10.30).
We have one host on 5.7.3, nothing exploded. Will roll out to the rest of our machines shortly.
We’ve had a bunch of issues with Shoveler not keeping up.
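The overload test in the Lancaster report can be reproduced with a stand-in load script along the following lines. This is a minimal sketch, assuming the script is run by cmsd as its load-reporting program and is expected to print five load percentages on one line at a regular interval; the 30-second interval, the `--loop` flag and the function names are assumptions, and this is not the script that was actually deployed.

```python
import sys
import time


def load_line(value: int = 99, fields: int = 5) -> str:
    """Build one report line: the same load value repeated `fields` times."""
    return " ".join(str(value) for _ in range(fields))


def report_forever(interval_s: float = 30.0) -> None:
    """Print a report line at a fixed interval (the interval is an assumption)."""
    while True:
        # Flush so the reading process sees each line immediately.
        print(load_line(), flush=True)
        time.sleep(interval_s)


if __name__ == "__main__":
    if "--loop" in sys.argv:
        # Hypothetical flag: keep reporting until the process is killed.
        report_forever()
    else:
        # Default: emit a single line, which is handy for checking the format.
        print(load_line())
```

A host reporting this constant high load should be avoided by the redirector, which matches what was observed.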
...
Glasgow - Brief failures to authenticate internally: some of the lsc files for ATLAS IAM were out of date despite using the RPM (possible issue with the cron job). Looking forward to the on-the-fly (streamed) checksums.
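The idea behind the on-the-fly checksums discussed in the table above can be shown in a few lines: the checksum is updated as each chunk is written, so no second read of the file is needed. This is a minimal sketch using Python's `zlib.adler32`, not the PoC code; the function name and chunk size are assumptions, and storing the result in the `XrdCks.adler32` attribute is not shown.

```python
import io
import zlib
from typing import BinaryIO


def stream_with_adler32(src: BinaryIO, dst: BinaryIO, chunk_size: int = 1 << 20) -> str:
    """Copy src to dst in chunks, updating the Adler32 as each chunk passes through."""
    checksum = 1  # Adler32 starting value
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        dst.write(chunk)
        # Feed the running value back in so the result covers all chunks so far.
        checksum = zlib.adler32(chunk, checksum)
    return f"{checksum & 0xFFFFFFFF:08x}"


payload = bytes(range(256)) * 4096  # 1 MiB of sample data
dst = io.BytesIO()
on_the_fly = stream_with_adler32(io.BytesIO(payload), dst, chunk_size=64 * 1024)
# Reference value: checksum computed in one pass over the complete data.
after_the_fact = f"{zlib.adler32(payload) & 0xFFFFFFFF:08x}"
print(on_the_fly, after_the_fact)
```

The two printed values agree, which is the property that lets a gateway skip the separate read pass on the first checksum request.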
...