XrdCeph & Libradosstriper review

Architecture review date

Jun 21, 2022

Project lead

@James Walder

 Overview

When the XRootD component for Ceph was originally written by Sebastien Ponce, it was designed in two parts. He created libradosstriper, which is part of the Ceph code base and breaks large files into smaller chunks that are written as separate objects in Ceph. The first object has metadata attached containing the checksum, the file size and other information needed to access the complete file. More details on how the striping works can be found in the Ceph documentation: https://tracker.ceph.com/projects/ceph/wiki/Object_striping_in_librados .
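The striping described in that wiki page is plain offset arithmetic: a logical file offset maps deterministically to one numbered object and an offset within it. The sketch below follows the parameter names in the Ceph documentation (stripe unit, stripe count, object size); the function and the example values are illustrative, not the actual libradosstriper code, which also stores the layout in metadata on the first object.

```cpp
#include <cstdint>
#include <utility>

// Illustrative striping layout parameters (the real values are
// per-file and recorded by libradosstriper in object metadata).
struct StripeLayout {
    uint64_t stripe_unit;   // bytes per stripe unit
    uint64_t stripe_count;  // how many objects a stripe row spans
    uint64_t object_size;   // max bytes per object; multiple of stripe_unit
};

// Map a logical file offset to (object number, offset within that object).
std::pair<uint64_t, uint64_t> file_to_object(const StripeLayout& l,
                                             uint64_t off) {
    const uint64_t stripes_per_object = l.object_size / l.stripe_unit;
    const uint64_t blockno   = off / l.stripe_unit;       // which stripe unit
    const uint64_t stripeno  = blockno / l.stripe_count;  // which stripe row
    const uint64_t stripepos = blockno % l.stripe_count;  // column in the row
    const uint64_t objectsetno = stripeno / stripes_per_object;
    const uint64_t objectno  = objectsetno * l.stripe_count + stripepos;
    const uint64_t obj_off   = (stripeno % stripes_per_object) * l.stripe_unit
                               + off % l.stripe_unit;
    return {objectno, obj_off};
}
```

With a stripe count of 1 and stripe unit equal to object size, this degenerates to the simple case of one object per fixed-size chunk of the file.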

The XrdCeph plugin is a storage-level plugin implementing an API in the spirit of the POSIX filesystem abstraction [ref]. XrdCeph relies on libradosstriper to split written data into RADOS objects, and to reassemble read data from RADOS objects.

When Echo was being built, it was noted that, as an object store, it is better suited to streaming entire objects than to performing many small reads (i.e. vector reads). It was therefore decided to place XCaches in front of all the storage endpoints.

 

 Architecture issues

Issue: libradosstriper is not maintained by Red Hat and there is no external development of the code.
Business impact: If there is a problem we will have to fix it ourselves, and we have no experience of committing code to the Ceph code base.
Priority: Medium

Issue: XCaches do not work as reliably as originally anticipated.
Business impact: Jobs that use vector reads suffer performance problems and higher failure rates. There is an extra level of complication, both for us in maintaining many XCaches and for users reading from and writing to different endpoints.
Priority: Medium

Issue: Deletes of large files are slow because the chunks are deleted in series.
Business impact: Large files can currently take several seconds, if not longer, to delete. This is not what VOs and software developers expect, and it causes problems.
Priority: Low

Issue: Checksumming is currently performed by a script that downloads the file after a transfer is complete.
Business impact: For large files this can take a long time and may cause a timeout. It slows down the perceived performance of the cluster.
Priority: Low

Issue: XrdCeph does not support vector reads, so any series of small reads is simply executed in series.
Business impact: This appears to cause significant issues for jobs that use vector reads, as the caching layer in front often does not work as expected.
Priority: High

Issue: Small reads and writes against Echo are not efficient.
Business impact: This results in very slow external transfers unless the data is buffered on the gateway before a larger transfer is made to Echo.
Priority: Medium

Issue: Unstable under high load.
Business impact: Disruptive for VOs; a VO can be impacted even when the load is caused by another user.
Priority: High
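The serial-delete issue above has a natural mitigation: dispatch the per-object removals concurrently and wait for all completions. A minimal sketch of the pattern, with a hypothetical remove_object() standing in for the real asynchronous RADOS remove:

```cpp
#include <atomic>
#include <future>
#include <string>
#include <vector>

// Hypothetical stand-in for removing one stripe object from RADOS;
// the counter only exists so the sketch is observable in a test.
std::atomic<int> g_deleted{0};
bool remove_object(const std::string& oid) {
    (void)oid;
    ++g_deleted;
    return true;
}

// Delete all stripe objects of a file concurrently rather than in series.
// Total latency becomes roughly one round trip instead of one per chunk.
bool parallel_remove(const std::vector<std::string>& oids) {
    std::vector<std::future<bool>> futs;
    futs.reserve(oids.size());
    for (const auto& oid : oids)
        futs.push_back(std::async(std::launch::async, remove_object, oid));
    bool ok = true;
    for (auto& f : futs)
        ok = f.get() && ok;  // wait for every removal before reporting
    return ok;
}
```

In a real implementation the concurrency would come from librados asynchronous operations rather than threads, but the completion-gathering shape is the same.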

 

 

 Stakeholders

@James Walder — Project lead

@Ian Johnson — Architect

@Thomas, Jyothish (STFC,RAL,SC)

 

 

 

 Software quality attributes

 

Maintainability
Definition: The time it takes from identifying a new feature or bug to deploying the fix.
Key success metric: Small fixes should be in production within one month.

Performance
Definition: XRootD should not be the bottleneck in the system; the limit should be the speed at which Ceph can transfer data to the gateways.
Key success metric: Transfers to and from Echo should be 100+ MB/s internally, 50+ MB/s to Tier-1s and 20+ MB/s to Tier-2s.

Reliability
Definition: The system's ability to operate under normal conditions and in unexpected situations.
Key success metric: The service should be stable under expected load and fail transfers gracefully under extreme load.

Scalability
Definition: The system's ability to handle multiple concurrent transfers.
Key success metric: Ceph, not XRootD, should be the bottleneck in scalability.

 

 Goals

The goal is to have a single code base combining libradosStriper and XrdCeph that can be easily maintained by RAL. We would also like to maximise throughput by parallelising operations and transfers where possible.

 

 Next steps

1. Fork libradosStriper
Description: Fork libradosStriper and compile it outside of Ceph, verifying that we can continue to use data already stored in Echo.
Estimated effort: 1 month

2. Investigate librados sparse reads
Description: Assess and implement sparse reading in the forked libradosStriper (ceph/src/include/rados/librados.hpp at main · ceph/ceph ) for efficient vector read operations.
Estimated effort: 2-3 months

3. Implement a many_aio_read in libradosstriper
Description: Implement in the forked libradosstriper a method that performs multiple aio_reads in one call (see A. Peters: 26 Mar 21) for more efficient vector read operations (either/or with the "Investigate librados sparse reads" project).

4. Understand how deletes work
Description: The CERN FTS team measured the performance of deletes (https://codimd.web.cern.ch/UDL3fgqWT1a2HJC7LkWXNg# ) and do not believe the current performance is sufficient for the expected data rates.
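The shape of the many_aio_read proposed in project 3 can be sketched as follows: dispatch every sub-read of a vector read at once, then wait for all completions before returning to the caller. Here serve_one() is a stand-in that reads from a flat in-memory buffer instead of issuing an aio_read against the underlying RADOS objects:

```cpp
#include <cstddef>
#include <functional>
#include <future>
#include <string>
#include <vector>

// One sub-read of a vector read: an (offset, length) pair plus its result.
struct ReadReq {
    size_t off;
    size_t len;
    std::string out;
};

// Stand-in for a single asynchronous read against the striped file.
void serve_one(const std::string& data, ReadReq& r) {
    r.out = data.substr(r.off, r.len);
}

// many_aio_read shape: launch all sub-reads concurrently, then gather
// every completion, so total latency is ~one round trip, not one per read.
void many_read(const std::string& data, std::vector<ReadReq>& reqs) {
    std::vector<std::future<void>> futs;
    futs.reserve(reqs.size());
    for (auto& r : reqs)
        futs.push_back(std::async(std::launch::async, serve_one,
                                  std::cref(data), std::ref(r)));
    for (auto& f : futs)
        f.get();  // block until every sub-read has completed
}
```

This is the pattern that would replace the current behaviour of executing each small read in series, which is the HIGH priority vector-read issue listed above.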