SWIFT-HEP QoS planning Doc
Project Outline (abstract)
Introduction
The LHC experiments have come out of the long shutdown before Run 3, which for ATLAS has roughly doubled the data rate. High-rate storage is therefore needed to enable the data to be ingested at sites. However, this high-performance storage is expensive and it is not feasible to use it exclusively. Therefore, after data has been ingested at a site, it should be possible to move it to less expensive and/or less performant storage alternatives, a capability referred to as Quality of Service (QoS).
With the increase in data size and rate planned for the High Luminosity LHC, SSD storage looks likely to play an increasingly critical role in meeting data storage requirements, allowing sites both the throughput required to ingest data and the access rates the batch farm requires to run jobs efficiently.
Tape archives already require SSD caches to provide the data throughput needed to match the write and read speeds of tape.
QoS has the potential to be a list of different properties and labels assigned to sites and their various storage endpoints to help users understand the properties of each endpoint. This can become extremely complex and difficult for users to understand as labels and tags accumulate. The QoS labels and tags should therefore be kept as simple but meaningful as possible, so that storage can be tagged concisely and users are given clear performance expectations.
QoS in terms of storage describes moving data between storage systems with different performance characteristics, in order to make best use of storage budgets, to scale data accessibility with access demand, and to keep read efficiency high for the jobs being run on the data, while minimising the cost of keeping data quickly accessible and archived for short- and long-term storage. An increasing range of QoS options is available to sites and is used to different degrees at each. These range from traditional spinning disk (HDD) clusters, through solid state drive (SSD) storage and caches built from either HDD or SSD, to tape. Each of these storage solutions has its own benefits and drawbacks; combined at a site, they provide different levels of QoS, allowing the benefits of each storage type to compensate for the downsides of another.
The WLCG paper describes the intention to look at QoS and categorise storage not by media type but by its characteristics instead, as there is a large variety of deployment options for disk with differing data durability, performance, and cost per GB.
QoS in general
Action: Define a list of storage properties and then define the most common types of storage in terms of these properties.
Capacity - usable capacity in TB
Durability - number of independent hardware failures tolerable without data loss
Total throughput (internal) - total available throughput to local compute resources
Total throughput (external) - total available throughput to other sites
Total IOPs - maximum random IOPs the endpoint can sustain
Single-threaded throughput - maximum transfer speed of an individual transfer
Cost / TB - cost per usable TB of storage
Volatile - does the space clean itself up automatically?
Aim to disaggregate storage media from expectations, as some disk can be much higher performance than others depending on the setup; a sketch of describing storage types in terms of the properties above is given below.
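As a first pass at the action above, a minimal Python sketch of describing storage types in terms of the listed properties; all numbers are illustrative assumptions, not measurements:

from dataclasses import dataclass

@dataclass
class StorageQoS:
    # Property-based description of a storage endpoint (see the list above).
    name: str
    capacity_tb: float               # usable capacity in TB
    durability: int                  # independent hardware failures tolerated without data loss
    throughput_internal_gbps: float  # total available throughput to local compute
    throughput_external_gbps: float  # total available throughput to other sites
    total_iops: int                  # maximum random IOPs the endpoint can sustain
    single_stream_mbps: float        # maximum speed of an individual transfer
    cost_per_tb: float               # cost per usable TB (relative units)
    volatile: bool                   # does the space clean itself up automatically?

# Illustrative values only; real numbers would come from benchmarking each site.
EXAMPLE_ENDPOINTS = [
    StorageQoS("ssd",    500, 1, 100, 40, 500_000, 1000, 10.0, False),
    StorageQoS("disk",  5000, 2,  40, 20,  20_000,  300,  1.0, False),
    StorageQoS("tape", 20000, 2,  10, 10,     100,  300,  0.2, False),
    StorageQoS("cache",  200, 0,  80, 10, 200_000, 1000, 10.0, True),
]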
Disk
Low latency, cost varies, lower data durability compared to tape
Erasure coding - provides some durability at an increased cost (since more disks are used), but at a lower cost than RAID.
Tape / Archive
Very high latency, low cost, low data loss risk
Cache
Deployed by the site, with some kind of storage placed in front of the batch system
For data that is requested / used more often, caching is useful and reduces the amount of data movement over the networks / between QoS classes needed for jobs to run.
For data that is requested only once for a job, caching is much less useful and would take up space in the cache, so such data should bypass the cache; one possible admission policy is sketched below.
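A minimal sketch of one possible cache admission policy that follows the idea above: only admit a file once it has been seen more than once, so single-use data bypasses the cache. The class and threshold are assumptions for illustration, not a description of any existing cache:

from collections import defaultdict

class CacheAdmission:
    # Admit a file into the cache only on a repeat access, so that data
    # requested once for a single job bypasses the cache entirely.
    def __init__(self, min_accesses: int = 2):
        self.min_accesses = min_accesses
        self.access_counts = defaultdict(int)

    def should_cache(self, filename: str) -> bool:
        self.access_counts[filename] += 1
        return self.access_counts[filename] >= self.min_accesses

admission = CacheAdmission()
print(admission.should_cache("data.root"))  # False: first access bypasses the cache
print(admission.should_cache("data.root"))  # True: repeat access, worth caching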
Rucio QoS
Information Rucio needs in order to be able to make sensible decisions; a sketch of how QoS information could be expressed to Rucio today is given below.
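One way this information could be expressed today is through Rucio RSE attributes and RSE expressions in replication rules. A minimal sketch using the Rucio Python client; the RSE names, the attribute key "qos", its values, and the dataset are assumptions for illustration:

from rucio.client import Client

client = Client()

# Tag each RSE with an assumed QoS attribute so rules can select on it.
client.add_rse_attribute(rse="SITE1_SSD", key="qos", value="FAST")
client.add_rse_attribute(rse="SITE1_DISK", key="qos", value="STANDARD")
client.add_rse_attribute(rse="SITE1_TAPE", key="qos", value="ARCHIVE")

# Ask Rucio to keep one replica of a test dataset on any endpoint tagged FAST,
# leaving the choice of site to Rucio.
client.add_replication_rule(
    dids=[{"scope": "tests", "name": "qos_test_dataset"}],
    copies=1,
    rse_expression="qos=FAST",
)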
Test Plan
Get a baseline for disk-to-disk performance initially, and get a testing process in place that can be iterated on.
Test types
To get a full idea of the effects of QoS on data movement there first needs to be a baseline set.
The initial tests to generate a baseline would involve running transfers of data from one disk storage endpoint to another.
Then, increasing the complexity of the tests, assess the time to complete dataset movements between sites from QoS to QoS in a matrix-wise fashion; a sketch enumerating this matrix follows the examples and questions below.
E.g.
Site 1 SSD to site 2 disk
Site 1 disk to site 2 disk
Site 2 SSD to site 1 SSD
Site 2 SSD to site 2 disk, etc.
Tape to SSD / HDD?
Does SSD help with this, or is it still network bound?
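A sketch of how the QoS-to-QoS transfer matrix above could be enumerated and timed. The site and QoS labels are assumed, and transfer() is a placeholder for whatever transfer mechanism is chosen (e.g. Rucio/FTS):

import itertools
import time

SITES = ["site1", "site2", "site3", "site4"]  # assumed four participating sites
QOS_CLASSES = ["ssd", "disk"]                 # assumed two QoS classes per site

def transfer(src: str, dst: str, dataset: str) -> None:
    # Placeholder for the actual transfer mechanism (Rucio/FTS, internal movement, ...).
    raise NotImplementedError

def run_transfer_matrix(dataset: str) -> dict:
    # Time every ordered source -> destination pair of storage endpoints,
    # including intra-site moves such as "site2_ssd -> site2_disk".
    endpoints = [f"{site}_{qos}" for site in SITES for qos in QOS_CLASSES]
    timings = {}
    for src, dst in itertools.permutations(endpoints, 2):
        start = time.monotonic()
        transfer(src, dst, dataset)
        timings[(src, dst)] = time.monotonic() - start
    return timings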
Time to complete jobs when the data is on different QoS endpoints.
Investigate different mechanisms for data transfers and their advantages / disadvantages:
What is the mechanism for movement?
Rucio/FTS?
Internal movement?
When should we use each different mechanism?
Is there a way to improve upon this?
End-to-end testing: simulate jobs that use large datasets, and measure how long after the start of data movement a job can be started without running into IO issues.
Remote reads
Running the job after the data transfer is complete gives data points towards the optimal time for the job to complete at a given site.
Starting the job at several points relative to the beginning of the data transfer, e.g. at the same time, and at ¼, ½, and ¾ of the expected transfer time; a sketch of this staggering follows the questions below.
Can jobs ‘know’ how much of the data is present at the site before starting?
Jobs should be designed in a way that either allows caching to assist, or does not.
At what access rates is caching at a site useful?
Oxford and Birmingham have some data on this
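A sketch of the staggered start points listed above; start_transfer and start_job are placeholders for the real transfer and job submission, and the fraction 1.0 corresponds to starting only once the transfer has finished:

import threading
import time

START_FRACTIONS = [0.0, 0.25, 0.5, 0.75, 1.0]  # same time, 1/4, 1/2, 3/4, and after completion

def run_staggered_tests(expected_transfer_s: float, start_transfer, start_job) -> dict:
    # For each fraction, kick off the data transfer, wait that fraction of the
    # expected transfer time, then start the job and record how long it takes.
    job_durations = {}
    for fraction in START_FRACTIONS:
        transfer_thread = threading.Thread(target=start_transfer)
        transfer_thread.start()
        time.sleep(fraction * expected_transfer_s)
        job_start = time.monotonic()
        start_job()                      # assumed to block until the job completes
        job_durations[fraction] = time.monotonic() - job_start
        transfer_thread.join()
    return job_durations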
What to measure
Data / Workflow types
Running these tests is all very well, but running on one sort of dataset/type will only help optimise for, and give data on, one experiment and experiments with similar output; therefore different data types should be considered.
Large files - does the storage deal well with a few large files?
LHC datasets
Workflows similar to, or mimicking, those of the LHC experiments give some information on how storage sites would behave under normal Grid circumstances; sites are typically designed around providing storage and compute for the LHC experiments.
Small files - does the storage deal well with many small files? - a larger number of connections.
LSST datasets - 20 MB files; how many do they need per job?
Looking at LSST datasets and their requirement for many smaller files means we can look at how jobs and data movement differ at sites compared to typical LHC workloads; a sketch for generating both dataset profiles is given below.
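A sketch for generating the two test dataset profiles: a few large files for LHC-like data and many ~20 MB files for LSST-like data. The file counts and the 5 GB file size are assumptions (the meeting notes below mention 10 GB as the largest file size):

import os

# Two illustrative dataset profiles; file sizes follow the notes in this document,
# file counts are assumptions chosen to give broadly comparable total volumes.
PROFILES = {
    "lhc_like":  {"file_size": 5 * 1024**3,  "n_files": 20},    # few large files
    "lsst_like": {"file_size": 20 * 1024**2, "n_files": 5000},  # many small files
}

def make_test_dataset(profile: str, directory: str) -> None:
    # Write pseudo-random files matching the requested profile, in chunks so that
    # large files do not need to fit in memory.
    spec = PROFILES[profile]
    os.makedirs(directory, exist_ok=True)
    chunk = 64 * 1024**2
    for i in range(spec["n_files"]):
        path = os.path.join(directory, f"{profile}_{i:05d}.dat")
        with open(path, "wb") as f:
            remaining = spec["file_size"]
            while remaining > 0:
                f.write(os.urandom(min(chunk, remaining)))
                remaining -= min(chunk, remaining)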
Times of tests
Running tests at different times through the day may lead to different results, due to the workload, networking, and demands on a site. Therefore, running identical tests at various times of the day and week would give a better overall understanding of both the transfer times for datasets, as well as job running times.
Therefore each test should be run on each day of the week.
On top of each day of the week, the tests should be run at several times - midnight, 3am, 6am, 9am, 12pm, 3pm, 6pm, 9pm - to give an even spread of tests throughout the day.
The tests should then also be run for several weeks to give more data points both over time and on week-to-week variance; a sketch of the resulting schedule is given below.
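A sketch of the resulting schedule of test start times: every 3 hours, every day of the week, repeated over several weeks (the 4-week campaign length is an assumption):

from datetime import datetime, timedelta

def test_schedule(start_day: datetime, weeks: int = 4):
    # Yield a test start time every 3 hours (midnight, 3am, ..., 9pm) on every day
    # for the given number of weeks.
    slot = start_day.replace(hour=0, minute=0, second=0, microsecond=0)
    end = slot + timedelta(weeks=weeks)
    while slot < end:
        yield slot
        slot += timedelta(hours=3)

# 4 weeks x 7 days x 8 slots per day = 224 start times per test configuration.
assert sum(1 for _ in test_schedule(datetime(2022, 1, 3))) == 224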
Total tests
Data transfer tests
Sites = 4
QoS per site = 2
QoS to QoS combinations = 2
Time points = 8
Dataset types = 2
Repeats = 4
Data points = 768
Job tests
Same conditions as above, but with the job started after a given percentage of the expected time to complete the transfer.
Time points = 5
Jobs can be run once the transfers are complete to look at efficiency, reusing the data already transferred for the tests above; therefore only 4 additional time points are needed.
Sites | 4
QoS per site | 2
QoS to QoS | 2
Time points | 8
Days of week | 7
Dataset types | 2
Repeats | 4
Dataset data points | 7168

Job time points | 4
Job test data points | 28672
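As a check, the totals in the table are the straight product of the factors listed; a small calculation reproducing them:

import math

factors = {
    "sites": 4,
    "qos_per_site": 2,
    "qos_to_qos": 2,
    "time_points": 8,
    "days_of_week": 7,
    "dataset_types": 2,
    "repeats": 4,
}

dataset_data_points = math.prod(factors.values())
assert dataset_data_points == 7168

# Each transfer test gains 4 additional job start time points.
job_test_data_points = dataset_data_points * 4
assert job_test_data_points == 28672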
Requirements for completion
Run tests
References
Notes from meeting with Katy and James
ATLAS HammerCloud, 350mb, CERN-run service; ATLAS is heavily reliant on it
Runs a suite of jobs emulating the different job types and looks for problems; it can turn a site off every 15-30 mins
A production job may access data differently to an analysis job
For ATLAS even 1 job failure can take a site offline; CMS requires days of failures
largest files 10GB
Functional test: 1MB at around 1Hz
Can ask to use hammercloud
CMS tests are complete and sensitive to various issues
filedumps for LSST
Rucio questions, what QoS for what Job, does Rucio failover to a
power considerations in QoS
ETF is the testing infrastructure for SAM tests
Make the tests in ETF, then CMS uses SAM to visualise the results
Instructions on how to run the SAM tests are in the CMS twiki
James figured out a docker container to run tests
Access to CMS or ATLAS needs to be sorted
WLCG DOMA QoS task force
James would like to set up an Echo cache for HOT files using CephFS.