Investigating XRootD Slow Deletions
Update: Comparing deletion times between gateways.
These tests recreate the CERN (Joao Lopes) deletion tests from the Delete Test Final Document. They use WebDAV against the webdav.echo.stfc.ac.uk alias.
Test 1: 50 parallel clients / 1 MB files - Pending
Test 2.a: Parallel GFAL2 clients / 1 MB files, varying number of clients
I ran deletion_test.py with 10 to 100 parallel clients each deleting a 1 MB file. Plot of the results below:
Test 2.b: Parallel GFAL2 clients / 10 MB files, varying number of clients
I ran deletion_test.py with 10 to 100 parallel clients each deleting a 10 MB file. Plot of the results below:
Comparing the deletion times for 1 MB and 10 MB files, varying number of clients
(Circles indicate times for deleting 1 MB files, squares are for 10 MB files).
Observations:
Deletion times for 1 MB and 10 MB files are quite close (barring outliers) up to 60 clients, and diverge from 70 clients onwards.
Deletion times for the same number of clients vary widely between successive runs of the test program. This is likely to be due to varying load on ECHO.
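The parallel-client runs in Tests 2.a and 2.b could be driven by something like the sketch below. It is not the contents of deletion_test.py: all function names are hypothetical, and the delete callable is supplied by the caller (for instance the unlink method of a gfal2 Python context, which is an assumption about how the real script deletes files).

```python
import time
from concurrent.futures import ThreadPoolExecutor


def timed_delete(delete, url):
    """Return the wall-clock seconds taken by delete(url)."""
    start = time.perf_counter()
    delete(url)
    return time.perf_counter() - start


def run_parallel_deletes(delete, urls):
    """Delete every URL concurrently, one worker per URL; return per-file times."""
    urls = list(urls)
    if not urls:
        return []
    with ThreadPoolExecutor(max_workers=len(urls)) as pool:
        return list(pool.map(lambda u: timed_delete(delete, u), urls))


def sweep_client_counts(delete, make_urls, counts=range(10, 101, 10)):
    """Map each client count (10, 20, ..., 100 by default) to its deletion times.

    make_urls(n) is a hypothetical helper that returns n URLs of files
    already uploaded for this run.
    """
    return {n: run_parallel_deletes(delete, make_urls(n)) for n in counts}
```

Repeating the sweep several times and plotting every run, rather than a single one, would make the run-to-run variation noted above visible directly.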
Test 3: Parallel GFAL2 clients / 100 MB files
I ran deletion_test.py with 10 to 100 parallel clients each deleting a 100 MB file. Plot of the results below:
Observations:
The longest time taken to delete 100 MB files (~4 s with 90 clients) is lower than the longest time taken to delete 10 MB files (~5 s with 90 clients, on the previous plot). This is likely to be due to varying load on ECHO.
Test 4: Deleting files of varying size
I ran https://gitlab.cern.ch/fts/scripts/-/blob/master/deletion-test/filesize_test.py with file sizes between 10 MB and 1000 MB. Plot of the results is below:
Comparative deletion times for Ceph gateway nodes
I measured the time to delete 50 files (1 MiB, 10 MiB, and 100 MiB) from ceph-gw{1..9} using davs. I ran the deletion tests 10 times for each gateway, moving from one gateway to the next between passes. That is, the gateway node is the fastest-changing variable: the tests do not stay on one node for 10 consecutive passes of 50 deletions.
I ran the deletion tests between ~ 13:00 and 15:00 on Tuesday 12th July.
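The interleaved ordering described above (gateway changing fastest, pass number slowest) can be sketched as follows. The function name is hypothetical, and how the three file sizes were ordered within each pass is not recorded here, so the sketch leaves them out.

```python
from itertools import product

GATEWAYS = [f"ceph-gw{i}" for i in range(1, 10)]  # ceph-gw1 .. ceph-gw9
PASSES = 10


def interleaved_schedule(gateways, passes):
    """Yield (pass_number, gateway) pairs with the gateway changing fastest."""
    # product() varies its last argument fastest, so every gateway is
    # visited once before the pass number advances.
    for pass_number, gateway in product(range(1, passes + 1), gateways):
        yield pass_number, gateway


schedule = list(interleaved_schedule(GATEWAYS, PASSES))
```

Ordering the runs this way spreads each gateway's 10 passes across the whole test window, so a short-lived load spike on ECHO affects all gateways rather than just one.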
Conclusions from these tests
ceph-gw6 and ceph-gw7 showed the highest mean deletion times over all file sizes;
ceph-gw7 showed the highest deletion time of all when deleting files of 1 MiB;
The mean deletion times grow much more slowly than the file size, ranging from ~3-4 s with 1 MiB and 10 MiB files to ~5-7 s with 100 MiB files, that is, a ~2x increase in time for a 100x increase in file size.
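To put a number on how weakly deletion time depends on file size, the ratios quoted above can be turned into a scaling exponent. This is a rough sketch that assumes a power law, time ∝ size^k, which the data here do not establish; the function name is hypothetical.

```python
import math


def scaling_exponent(size_ratio, time_ratio):
    """Exponent k in time ∝ size**k implied by one pair of ratios."""
    return math.log(time_ratio) / math.log(size_ratio)


# ~2x increase in time for a 100x increase in file size (figures from the conclusions)
k = scaling_exponent(size_ratio=100, time_ratio=2)
print(f"k = {k:.2f}")  # prints "k = 0.15"
```

Linear scaling would give k = 1, so an exponent near 0.15 suggests that most of the deletion time is a per-file overhead rather than a per-byte cost.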
Questions
What is the statistical significance of these tests?
Is running tests over a two-hour period sufficient?
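One possible way to approach the significance question is a permutation test on the difference in mean deletion time between two gateways. The sketch below uses only the standard library; the function name is hypothetical and the sample values are placeholders, not measurements from these tests.

```python
import random
from statistics import mean


def permutation_p_value(sample_a, sample_b, n_resamples=10_000, seed=0):
    """Two-sided permutation test on the difference in means."""
    rng = random.Random(seed)
    observed = abs(mean(sample_a) - mean(sample_b))
    pooled = list(sample_a) + list(sample_b)
    n_a = len(sample_a)
    hits = 0
    for _ in range(n_resamples):
        # Randomly reassign the pooled timings to the two groups.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            hits += 1
    # The +1 correction avoids reporting an impossible p-value of exactly 0.
    return (hits + 1) / (n_resamples + 1)


# Placeholder timings in seconds, for illustration only.
placeholder_a = [3.0, 3.5, 4.0, 3.2, 3.8]
placeholder_b = [5.0, 6.5, 5.5, 7.0, 6.0]
p = permutation_p_value(placeholder_a, placeholder_b, n_resamples=2000)
```

A small p-value would indicate that a difference this large is unlikely to arise from random assignment of timings to gateways. It would not address whether a two-hour window is representative of the load on ECHO.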
For 1 MiB files, ceph-gw7 shows the highest deletion time, around 21 s. Its mean deletion time is also much higher than that of the other gateways, so load on this node may have been reducing performance.
Next, for 10 MiB files:
ceph-gw5 and ceph-gw7 are roughly level on their highest deletion times, although ceph-gw6 is the slowest.
For 100 MiB files:
Deletion times for 100 MiB files are lower than I expected.
Concurrent Deletions of Small Files with the Dual and Unified Daemon Configuration
I ran deletion tests on ceph-dev-gw2 with files of 1, 10, and 100 MB. The tests ran for 5 passes at each file size and used 50 concurrent threads. A plot of the results is below, showing the mean deletion times for each file size:
This sample plot shows that the Unified daemon has lower mean deletion times compared to the Dual (Proxy+Ceph) daemon. There is a wide variation between the shortest and longest deletion times for each file size (not shown here).
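Since the plot shows only means, the spread between the shortest and longest deletion times could be reported alongside them with a summary like the one sketched below. The record layout (configuration, file size in MB, seconds) and the function name are assumptions, and the example values are placeholders rather than measured results.

```python
from collections import defaultdict
from statistics import mean


def summarise(records):
    """Group (config, size_mb, seconds) records; return min/mean/max per group."""
    groups = defaultdict(list)
    for config, size_mb, seconds in records:
        groups[(config, size_mb)].append(seconds)
    return {
        key: {"min": min(v), "mean": mean(v), "max": max(v), "n": len(v)}
        for key, v in sorted(groups.items())
    }


# Placeholder records, for illustration only (not measurements).
placeholder_records = [
    ("unified", 1, 1.0),
    ("unified", 1, 3.0),
    ("dual", 1, 2.0),
    ("dual", 1, 6.0),
]
summary = summarise(placeholder_records)
```

Plotting the min and max as error bars around each mean would show whether the difference between the Unified and Dual configurations is large compared with the variation within each one.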