Questions

These questions are a starting point for the project to help shape requirements.

Questions for Diamond

  1. It's our understanding that Diamond users are already downloading data from the datastore and uploading it to Echo via a manual process.

    1. Who exactly is doing this?

      1. Which beamlines?

      2. Diamond staff, beamline scientists or facility users?

    2. How are they doing this?

    3. Why are they copying data to Echo? What project need does this serve? Is it for programmatic access?

      1. Do individual files need to be preserved, or is a single zip (the method used for direct download) preferable?

    4. Is there any way of transferring to Echo before the data exists only on tape, i.e. while it is still on disk at Diamond? This would be technically much easier and would avoid load on the IDS.

  2. In a world where there is a “copy cart to Echo” button for users:

    1. What access requirements does Diamond need? Who can copy data to Echo?

    2. Is the Echo storage organised per user or per investigation?

      1. If it's per investigation, should we add the Echo location to ICAT?

      2. If it's per investigation, then the data would not need to be transferred to Echo more than once.

    3. Presumably whoever can copy data to Echo would also need Echo access/credentials? Who manages these credentials?

      1. Currently, 14 sets of credentials have been issued.

  3. What space quotas are needed:

    1. What quotas are in place on Echo at the moment?

      1. Diamond have a 2PB allocation via IRIS, of which 200TB (10%) is currently used.

    2. How might this change going forward?

  4. I’m aware that Diamond wants a “spinning disk cache” for quicker access to data.

    1. How does this relate to this project?

    2. How do we prevent abuse of the system, where the whole catalogue gradually creeps over to Echo?

    3. Both terms have been used; it is unclear whether people care about the distinction between object storage and a POSIX-style file system approach, e.g. CephFS. For example, if they want to mount the storage on DAaaS then we should be talking about CephFS, not Ceph object storage. The likely answer is that Diamond already have the 2PB allocation on Echo, so they are using what they already have. (The S3 access sketch after this list illustrates the object-storage side of this distinction.)

  5. How long does the data need to stay on Echo?

  6. As far as I’m aware, there is no data monitoring going on in Echo. Would it make sense for users to be denied write access to Echo, so that only data from the catalogue ends up there?

    1. Currently, whoever has the credentials already has write access to Echo. However, if this were rolled out via e.g. DataGateway, the credentials would probably be functional (service) accounts and users would not be given direct access. (See the access sketch at the end of this list.)

  7. “Programmatic” API access for moving data has been mentioned. In principle the DataGateway-Download-API is public-facing, but it is not intended for direct use by users.

    1. Is this a firm requirement, and if so, what will it need that the existing solutions lack(ed)?
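
For context on questions 2.3 and 6 above, the following is a minimal sketch of what S3-style access to Echo could look like using boto3. The endpoint URL, bucket name, object keys and credentials are placeholder assumptions for illustration, not confirmed values.

```python
import boto3

# Echo exposes an S3-compatible interface; the endpoint and bucket
# below are placeholders, not confirmed values.
ECHO_ENDPOINT = "https://s3.echo.example.ac.uk"  # assumed endpoint
BUCKET = "diamond-restored-data"                 # hypothetical bucket

# One of the issued credential sets; with a "functional" account
# (as in 6.1) these would be held by the service, not by end users.
session = boto3.session.Session(
    aws_access_key_id="ACCESS_KEY",      # placeholder
    aws_secret_access_key="SECRET_KEY",  # placeholder
)
s3 = session.client("s3", endpoint_url=ECHO_ENDPOINT)

# Upload a file: this is object storage (whole-object PUT/GET over
# HTTP), not a POSIX mount like CephFS (cf. question 4.3).
s3.upload_file("restored/datafile.nxs", BUCKET, "visit-123/datafile.nxs")

# A user holding read-only credentials could still fetch objects:
s3.download_file(BUCKET, "visit-123/datafile.nxs", "datafile.nxs")
```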

Internal to SCD

  1. This will likely be another piece of software that will need to be maintained; are we able to support this?

    1. Over what time frame?

  2. What type of software is preferred: an API server or a Python cron job?

  3. Do any metrics need to be gathered? If so, which metrics, and for whom?

  4. Do the logs need to be in any particular format?

  5. How does this differ from Globus?

  6. Buy-in from Data Services. This has been mentioned to Chris and Tom Byrne, but as there are no formal requirements it’s hard to raise it formally with them. It is very likely that those responsible for running the underlying storage will raise objections to aspects of this proposal.

    1. Programmatic access: potential for additional load on the IDS.

    2. Volume of restores: it is not clear whether people will be looking to circumvent the current size limits on downloads (e.g. if movement to S3 is done by Diamond staff or beamline scientists, volumes could be much larger than individual user requests, which can already exceed the limits in place).

    3. Policy: presumably those running Echo are the ones who have to worry about quotas filling up and deletion policies being enforced, so their opinion on this matters. It is also worth considering IRIS, who allocate the storage to Diamond in the first place.

Technical approach

This is difficult due to the lack of clear requirements, but there are already definite similarities to the Facilities Data Pipeline (FDP) project for EPAC. However, Diamond's use of StorageD is a major complicating factor, so we propose re-using the DatastoreAPI, developed for FDP, in one (or both) of two ways, neither of which actually talks directly to tape:

Diamond → Echo

  • Provision a CephFS instance running XRootD and mount it at Diamond

  • The beamline writes the data it wants transferred to Echo to this mount before the data leaves disk

  • The beamline submits a transfer request to the DatastoreAPI to move the data from this mount to Echo (see the sketch after this section)

    • Transfers starting from disk are not yet implemented (NYI), but would not be complex to add

    • S3 support is a work in progress (WIP), but should be in place by the end of 2024

  • Data is accessible in Echo for whatever purposes Diamond want

It is safer to give programmatic access here and much more efficient, as the data does not go via tape, but it assumes that Diamond can do this while the data is still on disk (and have the foresight to know what they will want). It is not suitable for historic data that has already been collected and archived.
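
As a concrete illustration of the transfer request step above, here is a rough sketch of what a beamline-side call to the DatastoreAPI might look like. The route, payload fields and response shape are all assumptions made for illustration; as noted above, disk-to-S3 transfers are not yet implemented in the API.

```python
import requests

# Hypothetical DatastoreAPI base URL; the real routes and payload for
# disk -> S3 transfers do not exist yet (NYI above).
DATASTORE_API = "https://datastore-api.example.ac.uk"  # placeholder URL


def request_transfer_to_echo(source_path: str, bucket: str, token: str) -> str:
    """Ask the DatastoreAPI to copy data from the XRootD-backed CephFS
    mount to an Echo S3 bucket, returning a transfer id."""
    resp = requests.post(
        f"{DATASTORE_API}/transfers",         # assumed route
        headers={"Authorization": f"Bearer {token}"},
        json={
            "source": source_path,            # path on the CephFS mount
            "destination": f"s3://{bucket}",  # Echo bucket
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["transfer_id"]         # assumed response field


# e.g. after the beamline has written a visit's data to the mount:
# transfer_id = request_transfer_to_echo("/mnt/cephfs/visit-123", "diamond-echo", token)
```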

Tape → Echo

  • Provision a CephFS instance running XRootD and mount it on the IDS machine, or integrate it as a pollcat plugin (as appropriate)

  • Restore data to the IDS/pollcat as normal

  • Once restoration is complete, either automatically or manually trigger a transfer request to the DatastoreAPI to move the data from this mount to Echo (see the sketch after this section)

    • Transfers starting from disk are not yet implemented (NYI), but would not be complex to add

    • S3 support is a work in progress (WIP), but should be in place by the end of 2024

  • Data is accessible in Echo for whatever purposes Diamond want

We do not propose giving programmatic access here; instead this relies on existing IDS functionality, with the transfer to S3 handled as an additional step.
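
To make the “automatically trigger” step concrete, here is a sketch of a cron-style job (cf. the API server vs Python cron job question above) that watches the restore area and submits a transfer once restoration completes. The paths, the completion check and the transfer call are all assumptions; a real implementation would query IDS/pollcat state rather than watching the file system.

```python
import time
from pathlib import Path

import requests

DATASTORE_API = "https://datastore-api.example.ac.uk"  # placeholder, as above
RESTORE_ROOT = Path("/mnt/ids-restore")                # assumed restore area


def restore_complete(restore_dir: Path) -> bool:
    # Placeholder check: a real job would query IDS/pollcat status
    # rather than look for a completion marker file on disk.
    return (restore_dir / ".restore_complete").exists()


def submit_transfer(restore_dir: Path, token: str) -> None:
    # Same assumed route and payload as the Diamond -> Echo sketch.
    resp = requests.post(
        f"{DATASTORE_API}/transfers",
        headers={"Authorization": f"Bearer {token}"},
        json={"source": str(restore_dir), "destination": "s3://diamond-echo"},
        timeout=30,
    )
    resp.raise_for_status()


def main(token: str, poll_seconds: int = 300) -> None:
    submitted: set[Path] = set()
    while True:
        for restore_dir in RESTORE_ROOT.iterdir():
            if restore_dir in submitted or not restore_dir.is_dir():
                continue
            if restore_complete(restore_dir):
                submit_transfer(restore_dir, token)
                submitted.add(restore_dir)
        time.sleep(poll_seconds)
```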