June 2023: Vector Read rollout (with XCache)
A document to checklist, record, and monitor the steps leading up to, during, and after the rollout campaign.
Concept
The following resource identifies the steps in the process of deploying a new XRootD configuration, architecture, or release that is considered significant enough to have a change control meeting associated with it.
The form should be specific enough to capture all relevant details, flexible enough to manage changes to schedules, etc., and sufficient to provide all salient details should a post-mortem be deemed necessary.
Roles:
Coordinator: responsible for ensuring this document is followed and updated (or amended); key to the communication between the different stakeholders.
Stakeholders:
Overview
Coordinator | @Thomas, Jyothish (STFC,RAL,SC) |
---|---|
Implementation start date | Jun 12, 2023. Agreed by: |
Status | Roll out XRootD 5.5.4 to the Worker Nodes with XCache enabled and the latest XrdCeph updates, including the updated readV code. |
Notes | |
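For orientation, the general shape of an XCache-enabled XRootD proxy configuration is sketched below. This is a minimal illustration only, not the RAL worker-node configuration (the actual config is held in the Aquilon sandbox linked in the project tracker); the origin host, cache path, and sizes are placeholder values.

```
# Minimal illustrative XCache (proxy file cache) fragment; all values are placeholders.
ofs.osslib    libXrdPss.so           # route data access through the proxy storage system
pss.origin    ceph-gw.example:1094   # placeholder upstream data source
pss.cachelib  libXrdPfc.so           # load the XCache / proxy file cache plugin
pfc.ram       4g                     # RAM available for cache buffers
pfc.diskusage 0.90 0.95              # low/high watermarks for cache disk usage
oss.localroot /var/xcache            # placeholder on-disk cache location
```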
Involved teams
Group / experts | Coordinator | Contributors | Notes |
---|---|---|---|
Storage | @Robert Appleyard | @Aidan McComb @Thomas, Jyothish (STFC,RAL,SC) | Ceph team; confirms the Ceph state, any external constraints or problems, or extraordinary operations. Reports on the state of Ceph before, during, and after the change. |
Batch farm | @Thomas Birkett | | Batch farm control. |
XRootD Development | @Thomas, Jyothish (STFC,RAL,SC) | @James Walder @Alexander Rogovskiy | Ensures the appropriate software is being deployed. |
VOs / users | @Alexander Rogovskiy @Katy Ellis @James Walder | | Feedback from the user communities / spotting of problems, etc. |
Success criteria
Challenges | Outcomes | Requirements |
---|---|---|
Indicators of Problems / Success
Which metrics should be evaluated to determine whether there is an issue?
Identifiers of problems | Quantifiable metrics | Actions required |
---|---|---|
 | Identified a critical failure, or a failure rate > 20% (??), across the affected nodes | |
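As a concrete reading of the provisional 20% threshold above, a check along the following lines could flag affected nodes automatically. This is a hedged sketch, not the production monitoring: the (host, succeeded) record format and the hostnames are made up for illustration.

```python
from collections import defaultdict

# Provisional critical-failure threshold from the table above
# (the "(??)" there marks it as not yet agreed).
FAILURE_THRESHOLD = 0.20

def flag_failing_nodes(results):
    """results: iterable of (host, succeeded) pairs, e.g. scraped from
    transfer monitoring. Returns {host: failure_rate} for every host
    whose failure rate exceeds FAILURE_THRESHOLD."""
    counts = defaultdict(lambda: [0, 0])  # host -> [failures, total]
    for host, succeeded in results:
        counts[host][1] += 1
        if not succeeded:
            counts[host][0] += 1
    return {host: fails / total
            for host, (fails, total) in counts.items()
            if fails / total > FAILURE_THRESHOLD}

# Illustrative usage with made-up records:
sample = [("wn-001", True), ("wn-001", False), ("wn-001", False),
          ("wn-002", True), ("wn-002", True)]
print(flag_failing_nodes(sample))  # {'wn-001': 0.666...}
```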
Project tracker
Milestone | Description | Notes (add detailed notes to the section at the end of the document, and link here) | Responsible | Participants | Expected / actual start date (add comment) | Status (IN PROGRESS / DONE / OTHER) | Action items | Completed (and signed off) + link to documentation |
---|---|---|---|---|---|---|---|---|
Development testing | Finalisation of any development work | | @Thomas, Jyothish (STFC,RAL,SC) | | Jun 2, 2023 (expected) | IN PROGRESS / DONE / OTHER | Development done; Testing complete | |
RPMs | Identify RPM versions (and src commits) for the deployed version (see the verification sketch below the tracker) | xrootd-5.5.4-2; xrootd-ceph-buffered-5.5.4-3 | @Thomas, Jyothish (STFC,RAL,SC) | @Alexander Rogovskiy | | | RPM built; RPM tested; RPM details added to documentation | |
XRootD configuration | Confirm the correct XRootD configuration has been checked and is correct | http://aquilon.gridpp.rl.ac.uk/sandboxes/diff.php?sandbox=workernode-xrootd | @Thomas, Jyothish (STFC,RAL,SC) | @Alexander Rogovskiy | | | Configuration checked; Configuration tested | |
Docker image configuration | Confirm that the Docker image(s) are correct and have been checked | harbor.stfc.ac.uk/ral-tier1_public/echo-xrootd-workernode:2023-06-12.1 | @Thomas Birkett | @Alexander Rogovskiy @Thomas, Jyothish (STFC,RAL,SC) | | | Configuration checked; Configuration tested | |
Current state capture | Ensure that any relevant records / state of the system have been captured (could be by plots, for example) | cached reads: > 242 MB/s | @Thomas, Jyothish (STFC,RAL,SC) | | | | Current state has been documented | |
Slack: create new channel | Advertise to GST and DS; add relevant people | Remove at end of rollout | @Thomas, Jyothish (STFC,RAL,SC) | | | DONE | | |
Farm draining | Date to start draining the farm (if needed) and which tranche (may be repeated for multiple tranches) | Is it needed? | | | | | | |
Synchronisation step | All stakeholders need to sign off here in order to proceed further (complete the action, and add your name) | Ask all stakeholders to raise concerns / allow to proceed | All | | | | | |
Deploy updated Docker image | Date, time, and tranche on which the above Docker image (etc.) has been deployed | Actual rollout on set of farm: Unpatched: | @Thomas Birkett | | | | | |
Monitoring period | How long to monitor / keep the intermediate state | | @Thomas, Jyothish (STFC,RAL,SC) | | | | | |
Synchronisation step | All stakeholders need to sign off here in order to proceed further | | All | | | | | |
Full rollout: draining | | | @Thomas Birkett | | | | | |
Full rollout: deploy updated Docker image | | | @Thomas Birkett | | | | | |
Rollout complete | Changes applied across the farm | | @Thomas Birkett | | | | | |
Monitoring period | Defined period of monitoring for issues | | @Thomas, Jyothish (STFC,RAL,SC) | | | | | |
Synchronisation step | All stakeholders need to sign off here in order to proceed further | | All | | | | | |
Completion of change | Finalisation date of the change | | @Thomas, Jyothish (STFC,RAL,SC) | | | | | |
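One low-effort way to confirm a node actually runs the RPM versions pinned in the RPMs milestone above is to compare `rpm -q` output against the expected strings. The sketch below is illustrative, not an agreed procedure: it assumes the two packages are named exactly as in the Notes cell, and tolerates a dist tag (e.g. `.el7`) on the release.

```python
import subprocess

# Expected versions from the "RPMs" milestone in the tracker above.
EXPECTED = {
    "xrootd": "5.5.4-2",
    "xrootd-ceph-buffered": "5.5.4-3",
}

def installed_version(pkg):
    """Return VERSION-RELEASE for an installed package, or None if absent."""
    result = subprocess.run(
        ["rpm", "-q", "--qf", "%{VERSION}-%{RELEASE}", pkg],
        capture_output=True, text=True,
    )
    return result.stdout.strip() if result.returncode == 0 else None

for pkg, expected in EXPECTED.items():
    found = installed_version(pkg)
    # A dist tag such as ".el7" appended to the release is tolerated here.
    ok = found is not None and found.startswith(expected)
    print(f"{pkg}: expected {expected}, found {found}: {'OK' if ok else 'MISMATCH'}")
```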
Back-out strategy
If required, what are the planned mitigations and the steps needed?
Plans may be needed, and may differ, depending on whether the rollout is still in progress or all changes have already been made; for example:
Phase of rollout | Threshold for rollback | Actions required / time | Rough time to deploy | Was it needed? |
---|---|---|---|---|
Post draining of tranche | High IOPS / job failures (sustained) | Roll back to the existing Docker image / configuration | < 1 hr to deploy; ~hours to roll out across the farm | |
Notes:
Additional notes relating to the above tables should be added here and linked appropriately.
Request time taken (see the percentile-check sketch below):
Atomic read requests take longer to process than normal reads, but contain more individual reads per request.
The 95th and 99th percentiles of op times increased after the rollout, but were in line with the previous week’s data.
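The week-over-week percentile comparison described above can be made mechanical. A rough sketch, assuming per-op latencies for the two weeks are available as plain arrays exported from the monitoring; the variable names and the 1.25x drift tolerance are illustrative, not agreed values.

```python
import numpy as np

def percentile_drift(this_week, last_week, percentiles=(95, 99), tolerance=1.25):
    """Compare op-time percentiles week over week.

    this_week / last_week: arrays of per-op latencies (same units).
    Returns {percentile: (before, after)} for each percentile that grew
    by more than `tolerance` times, i.e. is *not* in line with last week.
    """
    drifted = {}
    for p in percentiles:
        after = np.percentile(this_week, p)
        before = np.percentile(last_week, p)
        if before > 0 and after / before > tolerance:
            drifted[p] = (before, after)
    return drifted

# Illustrative usage with synthetic latencies (ms):
rng = np.random.default_rng(0)
print(percentile_drift(rng.gamma(2.0, 12.0, 10_000),   # post-rollout
                       rng.gamma(2.0, 10.0, 10_000)))  # previous week
```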
Reads per drive (11:00-13:00):
11:00: start of rollout.
13:10: A high volume of requests seems to have triggered callouts on gw6, gw7, gw14, gw15, and svc98. svc98 was restarted manually; gw7 and gw14 recovered on their own. The failure mode was 'functioning but slow': no unexpected errors were found and requests were being completed, but slowly. The high request volume seems to match Oxford transfers from ATLAS.
15:00: gw15 and gw6 were manually restarted. All checks are passing. n_connections on each gateway is < 500 and stable. IOPS are hovering around 50k and stable.