June 2023: Vector Read rollout (with XCache)

 

This document is used to checklist, record, and monitor the steps leading up to, during, and after the rollout campaign.

Concept

The following resource identifies as many steps as possible in the process of deploying a new XRootD configuration, architecture, or release that is considered significant enough to have a change-control meeting associated with it.

The form should be specific enough to capture all relevant details, flexible enough to manage changes to schedules, etc., and sufficient to provide all salient details should a post-mortem be deemed necessary.

 

Roles:

  • Coordinator: Responsible for ensuring this document is followed and updated (or amended)

    • Key to the communication between different Stakeholders

  • Stakeholders:

 Overview

Coordinator

@Thomas, Jyothish (STFC,RAL,SC)

Implementation start date

Jun 12, 2023 : Agreed by:

Status

Rolling out XRootD 5.5.4 to the Worker Nodes, with XCache enabled and the latest XrdCeph updates, including the updated readV code.

Notes

 

 Involved teams

Group / experts

Coordinator

Contributors

Notes

Storage

@Robert Appleyard

@Aidan McComb @Thomas, Jyothish (STFC,RAL,SC)

Ceph team; confirms Ceph state and any external constraints, problems, or extraordinary operations.

Reports on the state of Ceph before, during, and after the change.

Batch farm

@Thomas Birkett

 

Batch farm control.

XRootD Development

@Thomas, Jyothish (STFC,RAL,SC)

@James Walder @Alexander Rogovskiy

Ensures the appropriate software is being deployed

VOs / users

@Alexander Rogovskiy @Katy Ellis @James Walder

 

Feedback from the user communities / spotting of problems, etc.

 

 Success criteria

Challenges

Outcomes

Requirements

  • Challenge: Lack of support for readV in XrdCeph, causing job failures (primarily for LHCb)
    Outcome: readV-related job failures at a rate comparable with other T1s; all expected workflows can run
    Requirement: Deploy the patched readV code with atomic reads (a client-side readV sketch follows below)

  • Challenge: High IOPS are a problem for Ceph
    Outcome: The IOPS hitting Ceph stay below XXXk
    Requirement: Use XCache on the WNs to avoid hitting Ceph with too many read requests
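For reference, the sketch below illustrates what a readV (vector read) request looks like from the client side, as a possible smoke test once the patched code is deployed. It is a minimal sketch assuming the XRootD Python bindings are available; the endpoint, port, and file path are placeholders rather than the production values.

    # Minimal readV smoke-test sketch (assumptions: XRootD Python bindings installed,
    # placeholder endpoint and file path; not the production configuration).
    from XRootD import client
    from XRootD.client.flags import OpenFlags

    URL = "root://xcache-host.example:1094//some/test/file"  # placeholder endpoint

    f = client.File()
    status, _ = f.open(URL, OpenFlags.READ)
    assert status.ok, status.message

    # A single vector read: several (offset, length) chunks in one request,
    # which is the access pattern that previously failed against XrdCeph.
    chunks = [(0, 4096), (1 << 20, 4096), (8 << 20, 4096)]
    status, response = f.vector_read(chunks=chunks)
    assert status.ok, status.message

    for chunk in response.chunks:
        print(f"offset={chunk.offset} length={chunk.length}")

    f.close()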

 Indicators of Problems / Success

What metrics should be evaluated to determine whether there is an issue?

Identifiers of problems

Quantifiable metrics

Actions required

  • Problem: Significant IOPS on Ceph OSDs, causing slow ops / stalled transfers
    Metric: Sustained IOPS > XX on OSDs (or > YY for all IOPS) (TBD)
    Action: Meeting; prepare to roll back if this is a sustained effect (see the indicator-check sketch below)

  • Problem: Significant job failures on WNs
    Metric: A critical failure is identified, or the failure rate is > 20% (??) across affected nodes
    Action: Meeting; prepare to roll back if it can be associated with the change
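The sketch below shows one way the two indicators above could be checked automatically. It is only an illustration: the thresholds (the XX / YY / 20% values above are still TBD), the sampling window, and the data sources are placeholders rather than an agreed implementation; real values would come from the Ceph and batch-farm monitoring.

    # Hedged sketch of the indicator checks above; thresholds and data are placeholders.

    def sustained_breach(iops_samples, limit, min_consecutive=6):
        """True if IOPS samples (e.g. one reading per 5 minutes) stay above
        `limit` for at least `min_consecutive` consecutive readings."""
        run = 0
        for value in iops_samples:
            run = run + 1 if value > limit else 0
            if run >= min_consecutive:
                return True
        return False

    def failure_rate(job_failed_flags):
        """Fraction of failed jobs on the affected worker nodes."""
        if not job_failed_flags:
            return 0.0
        return sum(job_failed_flags) / len(job_failed_flags)

    # Illustrative numbers only (not real monitoring data):
    osd_iops = [42_000, 58_000, 61_000, 63_000, 62_000, 64_000, 65_000, 66_000]
    jobs = [False] * 75 + [True] * 25  # 25% failure rate

    if sustained_breach(osd_iops, limit=60_000) or failure_rate(jobs) > 0.20:
        print("Indicator breached: convene a meeting and prepare to roll back")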

 Project tracker

Milestone

Description

Notes

(Add any detailed notes to the section at the end of document, and link here)

Responsible

Participants

Expected / actual start date (add comment)

Status

IN PROGRESS / DONE / OTHER

Action items

Completed (and signed off) + link to documentation

Development testing

Finalisation of any development work

 

@Thomas, Jyothish (STFC,RAL,SC)

Jun 2, 2023 (expected)

IN PROGRESS / DONE / OTHER

Development done
Testing complete

 

RPMs

Identify RPM versions (and source commits) for the deployed version

xrootd-5.5.4-2

xrootd-ceph-buffered-5.5.4-3

@Thomas, Jyothish (STFC,RAL,SC)

@Alexander Rogovskiy

 

 

RPM built
RPM tested
RPM details added to Documentation

 

Xrootd configuration

Confirm that the correct XRootD configuration has been checked and is correct

https://github.com/stfc/grid-workernode/pull/38/files

http://aquilon.gridpp.rl.ac.uk/sandboxes/diff.php?sandbox=workernode-xrootd

@Thomas, Jyothish (STFC,RAL,SC)

@Alexander Rogovskiy

 

 

Configuration checked
Configuration tested

 

Docker image configuration

Confirm that the Docker image(s) are correct and have been checked

harbor.stfc.ac.uk/ral-tier1_public/echo-xrootd-workernode:2023-06-12.1

@Thomas Birkett

@Alexander Rogovskiy @Thomas, Jyothish (STFC,RAL,SC)

 

 

Configuration checked
Configuration tested

 

Current state capture

Ensure that any relevant records / state of the system have been captured (for example, by plots).

====================================== 8233 passed, 1 xfailed in 647.21s (0:10:47)

raw read speed: 80MB/s with/without patch

cached reads: > 242MB/s

@Thomas, Jyothish (STFC,RAL,SC)

 

 

 

Current state has been documented

 

Slack: Create a new channel;

advertise to GST and DS. Add relevant people.

Remove at the end of the rollout.

@Thomas, Jyothish (STFC,RAL,SC)

 

 

 

Done

 

Farm Draining

Date to start draining the farm (if needed) and which tranche (may be repeated for multiple tranches)

Is it needed?

 

 

 

 

 

Synchronisation Step

All stakeholders need to sign off here in order to proceed further

*(complete the action, and add name)

All stakeholders to raise concerns / allow to proceed.

All

 

 

 

 

Deploy updated Docker image

Date, time, and tranche on which the above Docker image (etc.) was deployed

Actual rollout on a subset of the farm:

  • 11:00 - wn-2022-lenovo

  • 11:30 - wn-2021-xma

  • 12:00 - wn-2018-xma

  • 13:30 - wn-2017-xma
    Patched:

  • wn-2022-lenovo

  • wn-2021-xma

  • wn-2018-xma

  • wn-2017-xma

Unpatched:

  • wn-2017-dell

  • wn-2019-dell

  • wn-2020-xma

@Thomas Birkett

 

 

 

 

Monitoring period

How long to monitor / keep the intermediate state

 

@Thomas, Jyothish (STFC,RAL,SC)

 

 

 

 

Synchronisation Step

All stakeholders need to sign off here in order to proceed further

 

All

 

 

 

 

 

Full rollout:

Draining

 

 

@Thomas Birkett

 

 

 

 

 

Full rollout:

Deploy updated Docker image

 

 

@Thomas Birkett

 

 

 

 

 

Rollout complete

Changes applied across the farm

 

@Thomas Birkett

 

 

 

 

 

Monitoring period

Defined period of monitoring for issues.

 

@Thomas, Jyothish (STFC,RAL,SC)

 

 

 

 

 

Synchronisation Step

All stakeholders need to sign off here in order to proceed further

 

All

 

 

 

 

 

Completion of change

Finalisation date of the change

 

@Thomas, Jyothish (STFC,RAL,SC)

 

 

 

 

 

 

 

Back-out strategy

If required, what are the planned mitigation and the steps that are needed:

Plans may need to differ depending on whether the rollout is still in progress or whether all changes have already been made, for example:

 

Phase of rollout

Threshold for rollback

Actions required / time

Rough time to deploy

Was needed?

Post draining of tranche

High IOPS / job failures (sustained)

Roll back to the existing Docker image / configuration

< 1 hr to deploy; ~ hours to roll out across the farm

 

 

 

 

 

 


Notes:

Additional notes relating to the above tables should be added here and linked appropriately.

Request time taken:

Atomic read requests take longer to process than normal reads, but contain more individual reads per request.

The 95th and 99th percentiles of op times increased after the rollout, but were in line with the previous week's data.
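A minimal sketch of the p95/p99 comparison mentioned above is shown below, assuming the op times for the previous week and the post-rollout period are available as arrays; the data here are synthetic placeholders, not the real monitoring values.

    # Hedged sketch of the p95/p99 op-time comparison; data are synthetic placeholders.
    import numpy as np

    rng = np.random.default_rng(0)
    previous_week = rng.gamma(shape=2.0, scale=5.0, size=10_000)  # op times in ms
    post_rollout = rng.gamma(shape=2.0, scale=5.5, size=10_000)   # op times in ms

    for label, data in (("previous week", previous_week), ("post-rollout", post_rollout)):
        p95, p99 = np.percentile(data, [95, 99])
        print(f"{label}: p95 = {p95:.1f} ms, p99 = {p99:.1f} ms")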

Reads per drive:

 

11:00-13:00

11:00 - start of rollout

13:10

A high volume of requests seems to have triggered callouts on gw6, gw7, gw14, gw15, and svc98.
svc98 was restarted manually; gw7 and gw14 recovered by themselves. The failure mode was 'functioning but slow': no unexpected errors were found and requests were being completed, but slowly.

 

The high request rate seems to match Oxford transfers from ATLAS.

 

15:00

gw15 and gw6 were manually restarted. All checks are passing. n_connections on each gateway is < 500 and stable. IOPS are hovering around 50k and stable.

 Resources