on GGUS:
Site reports
Lancaster: On this week’s Lancaster Rant: We had a period of storage sadness last night. ATLAS deleted ~20k files in the space of about 30 minutes, and whilst Ceph was recovering, LSST jobs came from behind and gave the storage a wedgie with high IOPS. CephFS got slow, xrootd servers got sad, some fell over, and CephFS got more unhappy. It was a whole thing, and Gerard spent the morning restarting xroot servers with his new scripts.
The point of my ranting is that it seems half our problems could be solved if we could get xrootd to rate/connection limit, so things didn’t get into so bad a state that we were required to reboot things. We don’t think xroot has this functionality in itself. The preliminary thought would be something on the redirector that, if detecting problems or high load, rather than redirecting to the least-worst-off xroot server, just returned a polite “try again later” (503?).
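As a rough illustration of the redirector idea above, here is a minimal Python sketch of the decision logic only. It is not xrootd code and does not use any xrootd API: the names `Decision` and `choose_server`, the normalised load figures, the 0.8 threshold and the 30-second retry hint are all hypothetical, picked purely to show the "shed load instead of redirecting to the least-worst-off server" behaviour.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Decision:
    status: int                 # 307 = redirect the client, 503 = try again later
    target: Optional[str]       # server to redirect to, if any
    retry_after: Optional[int]  # seconds the client should wait, if shedding


def choose_server(loads: Dict[str, float],
                  max_load: float = 0.8,
                  retry_after_s: int = 30) -> Decision:
    """Pick a data server, or shed the request if every server is overloaded.

    `loads` maps a server name to a load figure in [0, 1] (hypothetical metric).
    """
    # Only servers below the threshold are eligible for new work.
    eligible = {host: load for host, load in loads.items() if load < max_load}
    if not eligible:
        # Everything is struggling (or nothing is registered): say so politely
        # rather than piling onto the least-worst-off server.
        return Decision(status=503, target=None, retry_after=retry_after_s)
    # Otherwise redirect to the least loaded eligible server.
    target = min(eligible, key=eligible.get)
    return Decision(status=307, target=target, retry_after=None)
```

The one design choice worth noting is that the threshold check happens before the usual least-loaded selection, so an overloaded cluster never receives new redirects. How the load figure would actually be gathered on a real redirector is left open here.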
In other news, as discussed on Wednesday, we’ve been looking at ways we could remove TLS from internal transfers and how to xroot-plumb that together. Jyothish may have crushed our hopes there by pointing out that SciTokens require TLS to be enabled, so such a move wouldn’t be future-proof, or would need extra plumbing. (These ponderings are accompanied by the idea of replacing internal auth for at least some users with something faster; this again is LSST-driven, with their teeny-tiny files causing hassle.)
✅ Action items
How to replace the original functionality of fstream monitoring, now that OpenSearch has replaced the existing solutions.