June 2023: Vector Read rollout (with XCache)

 

This document is used to checklist, record, and monitor the steps leading up to, during, and after the rollout campaign.

Concept

The following resource identifies as many steps as possible in the process of deploying a new XRootD configuration, architecture, or release that is considered significant enough to have a change-control meeting associated with it.

The form should be specific enough to capture all relevant details, flexible enough to manage changes to schedules, etc., and sufficient to provide all salient details should a post-mortem be deemed necessary.

 

Roles:

  • Coordinator: Responsible for ensuring this document is followed and updated (or amended)

    • Key to the communication between different Stakeholders

  • Stakeholders:

 Overview

Coordinator

@Thomas, Jyothish (STFC,RAL,SC)

Implementation start date

Jun 12, 2023 : Agreed by:

Status

Rolling out XRootD 5.5.4 to the Worker Nodes, with XCache enabled and the latest XrdCeph updates, including the updated readV code.

Notes

 

 Involved teams

Group / experts

Coordinator

Contributors

Notes

Storage

@Robert Appleyard

@Aidan McComb @Thomas, Jyothish (STFC,RAL,SC)

Ceph team; confirms Ceph state and any external constraints, problems, or extraordinary operations.

Reports on the state of Ceph before, during, and after the change.

Batch farm

@Thomas Birkett

 

Batch farm control.

XRootD Development

@Thomas, Jyothish (STFC,RAL,SC)

@James Walder @Alexander Rogovskiy

Ensures the appropriate software is being deployed

VOs / users

@Alexander Rogovskiy @Katy Ellis @James Walder

 

Feedback from the user communities / spotting of problems, etc.

 

 Success criteria

Challenges

Outcomes

Requirements

  • Challenge: Lack of support for readV in XrdCeph, causing job failures (primarily for LHCb)
    Outcome: readV-related job failures at a rate comparable with other T1s; all expected workflows can run
    Requirement: Deploy the patched readV code with atomic reads (a client-side readV sketch follows below)

  • Challenge: High IOPS are a problem for Ceph
    Outcome: The IOPS hitting Ceph stay below XXXk
    Requirement: Use XCache on the WNs to avoid hitting Ceph with too many read requests
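For reference, the sketch below illustrates what a readV (vector read) request looks like from the client side, as a possible smoke test once the patched code is deployed. It is a minimal sketch assuming the XRootD Python bindings are available; the endpoint, port, and file path are placeholders rather than the production values.

    # Minimal readV smoke-test sketch (assumptions: XRootD Python bindings installed,
    # placeholder endpoint and file path; not the production configuration).
    from XRootD import client
    from XRootD.client.flags import OpenFlags

    URL = "root://xcache-host.example:1094//some/test/file"  # placeholder endpoint

    f = client.File()
    status, _ = f.open(URL, OpenFlags.READ)
    assert status.ok, status.message

    # A single vector read: several (offset, length) chunks in one request,
    # which is the access pattern that previously failed against XrdCeph.
    chunks = [(0, 4096), (1 << 20, 4096), (8 << 20, 4096)]
    status, response = f.vector_read(chunks=chunks)
    assert status.ok, status.message

    for chunk in response.chunks:
        print(f"offset={chunk.offset} length={chunk.length}")

    f.close()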

 Indicators of Problems / Success

What metrics should be evaluated to determine whether there is an issue?

Identifiers of problems

Quantifiable metrics

Actions required

  • Problem: Significant IOPS on Ceph OSDs, causing slow ops / stalled transfers
    Metric: Sustained IOPS > XX on OSDs (or > YY for all IOPS) (TBD)
    Action: Meeting; prepare to roll back if this is a sustained effect (see the indicator-check sketch below)

  • Problem: Significant job failures on WNs
    Metric: A critical failure is identified, or the failure rate is > 20% (??) across affected nodes
    Action: Meeting; prepare to roll back if it can be associated with the change
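The sketch below shows one way the two indicators above could be checked automatically. It is only an illustration: the thresholds (the XX / YY / 20% values above are still TBD), the sampling window, and the data sources are placeholders rather than an agreed implementation; real values would come from the Ceph and batch-farm monitoring.

    # Hedged sketch of the indicator checks above; thresholds and data are placeholders.

    def sustained_breach(iops_samples, limit, min_consecutive=6):
        """True if IOPS samples (e.g. one reading per 5 minutes) stay above
        `limit` for at least `min_consecutive` consecutive readings."""
        run = 0
        for value in iops_samples:
            run = run + 1 if value > limit else 0
            if run >= min_consecutive:
                return True
        return False

    def failure_rate(job_failed_flags):
        """Fraction of failed jobs on the affected worker nodes."""
        if not job_failed_flags:
            return 0.0
        return sum(job_failed_flags) / len(job_failed_flags)

    # Illustrative numbers only (not real monitoring data):
    osd_iops = [42_000, 58_000, 61_000, 63_000, 62_000, 64_000, 65_000, 66_000]
    jobs = [False] * 75 + [True] * 25  # 25% failure rate

    if sustained_breach(osd_iops, limit=60_000) or failure_rate(jobs) > 0.20:
        print("Indicator breached: convene a meeting and prepare to roll back")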

 Project tracker

Milestone

Description

Notes

(Add any detailed notes to the section at the end of document, and link here)

Responsible

Participants

Expected / actual start date (add comment)

Status

IN PROGRESS / DONE / OTHER

Action items

Completed (and signed off) + link to documentation

Development testing

Finalisation of any development work

 

@Thomas, Jyothish (STFC,RAL,SC)

Jun 2, 2023 (expected)

IN PROGRESS / DONE / OTHER

Development done
Testing complete

 

RPMs

Identify RPM versions (and source commits) for the deployed version

xrootd-5.5.4-2

xrootd-ceph-buffered-5.5.4-3

@Thomas, Jyothish (STFC,RAL,SC)

@Alexander Rogovskiy

 

 

RPM built
RPM tested
RPM details added to Documentation

 

Xrootd configuration

Confirm that the correct XRootD configuration has been checked and is correct

https://github.com/stfc/grid-workernode/pull/38/files

http://aquilon.gridpp.rl.ac.uk/sandboxes/diff.php?sandbox=workernode-xrootd

@Thomas, Jyothish (STFC,RAL,SC)

@Alexander Rogovskiy

 

 

Configuration checked
Configuration tested

 

Docker image configuration

Confirm that the Docker image(s) are correct and have been checked

harbor.stfc.ac.uk/ral-tier1_public/echo-xrootd-workernode:2023-06-12.1

@Thomas Birkett

@Alexander Rogovskiy @Thomas, Jyothish (STFC,RAL,SC)

 

 

Configuration checked
Configuration tested

 

Current state capture

Ensure that any relevant records / state of the system have been captured (for example, by plots).

====================================== 8233 passed, 1 xfailed in 647.21s (0:10:47)

raw read speed: 80MB/s with/without patch

cached reads: > 242MB/s

@Thomas, Jyothish (STFC,RAL,SC)

 

 

 

Current state has been documented

 

Slack: Create a new channel;

advertise to GST and DS. Add relevant people.

Remove at the end of the rollout.

@Thomas, Jyothish (STFC,RAL,SC)

 

 

 

Done

 

Farm Draining

Date to start draining the farm (if needed) and which tranche (may be repeated for multiple tranches)

Is it needed?

 

 

 

 

 

Synchronisation Step

All stakeholders need to sign off here in order to proceed further

*(complete the action, and add name)

All stakeholders to raise concerns / allow to proceed.

All

 

 

 

 

Deploy updated Docker image

Date, time, and tranche on which the above Docker image (etc.) was deployed

Actual rollout on a subset of the farm:

  • 11:00 - wn-2022-lenovo

  • 11:30 - wn-2021-xma

  • 12:00 - wn-2018-xma

  • 13:30 - wn-2017-xma
    Patched:

  • wn-2022-lenovo

  • wn-2021-xma

  • wn-2018-xma

  • wn-2017-xma

Unpatched:

  • wn-2017-dell

  • wn-2019-dell

  • wn-2020-xma

@Thomas Birkett

 

 

 

 

Monitoring period

How long to monitor / keep the intermediate state

 

@Thomas, Jyothish (STFC,RAL,SC)

 

 

 

 

Synchronisation Step

All stakeholders need to sign off here in order to proceed further

 

All

 

 

 

 

 

Full rollout:

Draining

 

 

@Thomas Birkett

 

 

 

 

 

Full rollout:

Deploy updated Docker image

 

 

@Thomas Birkett

 

 

 

 

 

Rollout complete

Changes applied across the farm

 

@Thomas Birkett

 

 

 

 

 

Monitoring period

Defined period of monitoring for issues.

 

@Thomas, Jyothish (STFC,RAL,SC)

 

 

 

 

 

Synchronisation Step

All stakeholders need to sign off here in order to proceed further

 

All

 

 

 

 

 

Completion of change

Finalisation date of the change

 

@Thomas, Jyothish (STFC,RAL,SC)

 

 

 

 

 

 

 

Back-out strategy

If required, what are the planned mitigation and the steps that are needed:

Plans may need to differ depending on whether the rollout is still in progress or whether all changes have already been made, for example:

 

Phase of rollout

Threshold for rollback

Actions required / time

Rough time to deploy

Was needed?

Post draining of tranche

High IOPS / job failures (sustained)

Roll back to the existing Docker image / configuration

< 1 hr to deploy; ~ hours to roll out across the farm

 

 

 

 

 

 


Notes:

Additional notes relating to the above tables should be added here and linked appropriately.

Request time taken:

Atomic read requests take longer to process than normal reads, but contain more individual reads per request.

The 95th and 99th percentiles of op times increased after the rollout, but were in line with the previous week's data.
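A minimal sketch of the p95/p99 comparison mentioned above is shown below, assuming the op times for the previous week and the post-rollout period are available as arrays; the data here are synthetic placeholders, not the real monitoring values.

    # Hedged sketch of the p95/p99 op-time comparison; data are synthetic placeholders.
    import numpy as np

    rng = np.random.default_rng(0)
    previous_week = rng.gamma(shape=2.0, scale=5.0, size=10_000)  # op times in ms
    post_rollout = rng.gamma(shape=2.0, scale=5.5, size=10_000)   # op times in ms

    for label, data in (("previous week", previous_week), ("post-rollout", post_rollout)):
        p95, p99 = np.percentile(data, [95, 99])
        print(f"{label}: p95 = {p95:.1f} ms, p99 = {p99:.1f} ms")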

Reads per drive:

 

11:00-13:00

11:00 - start of rollout

13:10

A high volume of requests seems to have triggered callouts on gw6, gw7, gw14, gw15, and svc98.
svc98 was restarted manually; gw7 and gw14 recovered by themselves. The failure mode was 'functioning but slow': no unexpected errors were found and requests were being completed, but slowly.

 

The high request rate seems to match Oxford transfers from ATLAS.

 

15:00

gw15 and gw6 were manually restarted. All checks are passing. n_connections on each gateway is < 500 and stable. IOPS are hovering around 50k and stable.

 Resources