Existing ICAT Architecture
This project builds on existing work by the DSEG team to provide the ICAT data catalogue. The figure below shows the current architecture and how it supports data ingestion and retrieval.
Once the user has run their experiment, the data is stored on the internal DLC data store for 40 days. The metadata is then pushed to ICAT’s Oracle database, and the raw data is sent to the data centre for processing.
Data Ingestion
The process starts with the XML client (depicted below) sending file metadata to the File Aggregator. The File Aggregator divides the incoming data into discrete chunks and then requests those chunks from the Diamond file client. The metadata records chunk attributes such as size and storage location, and is stored in the Storage D component, which is used to look up the location of the data at retrieval time. The chunks are then written to a tape archive provided by the CERN archival storage system. Tape is used because it preserves data over long periods, typically decades.
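The ingestion flow described above can be illustrated with a minimal Python sketch. It is not the File Aggregator's actual implementation: the names ChunkRecord, plan_chunks, ingest and locate, the "archive/..." location strings, and the use of plain dictionaries in place of Storage D and the tape archive are all assumptions made for illustration. The fetch argument stands in for a request to the Diamond file client.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class ChunkRecord:
    """Per-chunk metadata (hypothetical schema, not the real Storage D one)."""
    file_name: str
    offset: int    # byte offset of the chunk within the original file
    size: int      # chunk length in bytes
    location: str  # placeholder identifier for where the chunk is archived


def plan_chunks(file_name: str, total_size: int, chunk_size: int) -> List[ChunkRecord]:
    """Work out chunk boundaries from the file metadata alone."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    records = []
    for offset in range(0, total_size, chunk_size):
        size = min(chunk_size, total_size - offset)  # the last chunk may be shorter
        records.append(
            ChunkRecord(file_name, offset, size, f"archive/{file_name}/{offset}")
        )
    return records


def ingest(
    file_name: str,
    total_size: int,
    chunk_size: int,
    fetch: Callable[[str, int, int], bytes],  # stand-in for the file client
    index: Dict[str, List[ChunkRecord]],      # stand-in for Storage D
    archive: Dict[str, bytes],                # stand-in for the tape archive
) -> List[ChunkRecord]:
    """Request each planned chunk, archive it, and record where it went."""
    records = plan_chunks(file_name, total_size, chunk_size)
    for record in records:
        archive[record.location] = fetch(file_name, record.offset, record.size)
    index[file_name] = records
    return records


def locate(file_name: str, index: Dict[str, List[ChunkRecord]]) -> List[str]:
    """Look up the archive locations of a file's chunks, in order."""
    return [record.location for record in index[file_name]]
```

In this sketch, retrieval only needs the index: locate returns the chunk locations in file order, and the chunks can then be read back from the archive and concatenated to rebuild the file.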
The user metadata, which includes the dataset title, creators, and collaborators, is stored directly in the ICAT database. The figure below shows how the different systems fit together to allow data files to be ingested into STFC’s servers.
Another component of interest is the IDS, as it already interacts with the Storage D component. However, it has read-only access: it cannot write to disk or transfer data to tape storage. The IDS is also legacy, unsupported software. For these reasons, a new component will be developed. Any new system will need to go through the File Aggregator to preserve the configuration needed for data retrieval, and will therefore need to mimic the Diamond XML and File clients.
However, reaching internal agreement on the architecture has proven more complicated than anticipated. To allow the project to progress, the files will be stored in object storage in the interim, and the new, purpose-built application will be updated at a later date once the various departments have reached a consensus.
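One way to keep the interim object storage replaceable is to have the application depend only on a small storage interface. The Python sketch below illustrates that idea under stated assumptions: FileStore, InMemoryObjectStore and archive_file are hypothetical names, and the in-memory dictionary merely stands in for a real object store. It does not describe the design the project has adopted.

```python
from abc import ABC, abstractmethod
from typing import Dict


class FileStore(ABC):
    """Hypothetical storage interface the application would code against."""

    @abstractmethod
    def put(self, key: str, data: bytes) -> None:
        """Store data under the given key."""

    @abstractmethod
    def get(self, key: str) -> bytes:
        """Return the data stored under the given key."""


class InMemoryObjectStore(FileStore):
    """Stand-in for the interim object storage; keeps objects in a dict."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._objects[key] = bytes(data)

    def get(self, key: str) -> bytes:
        return self._objects[key]


def archive_file(store: FileStore, key: str, data: bytes) -> int:
    """Write a file through the interface and return the stored size in bytes."""
    store.put(key, data)
    return len(store.get(key))
```

Because archive_file only uses the FileStore interface, a later backend that routes through the File Aggregator could be substituted without changing the calling code.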