Concepts¶

Architecture¶

Koala consists of multiple separate processes (shown in red in the architecture diagram) and middleware components (shown in blue).

Middleware¶

Data Management (DB)¶

The primary datastore. It contains asset metadata, information about ingest and retrieval requests, known file types, schemas, and user information. MySQL is currently the supported database.

Message Queue (MQ)¶

The message queue asynchronously routes ingest and retrieval requests to the loader and retriever applications. RabbitMQ is currently the supported broker.

Archival Storage¶

Physically stores the assets. Multiple implementations of an Archival Storage client are available, for example TSM and filesystem.

Cache¶

Stores heartbeat information about the running applications, which is shown in the web application. It also caches statistics about the archived packages, which are retrieved from the main database. Redis is currently the supported cache.

Applications¶

Scheduler¶

Monitors the upload directory for new SIPs and publishes them to the message queue for further processing. Only one scheduler may be running at any time.
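
The scheduler's scan-and-publish cycle can be sketched roughly as follows. This is a minimal illustration, not Koala's implementation: the function name `scan_upload_dir`, the message layout, the in-memory `queue.Queue` standing in for the message queue, and the assumption that SIPs arrive as `.zip` files are all hypothetical.

```python
import queue
from pathlib import Path


def scan_upload_dir(upload_dir: Path, seen: set, mq: queue.Queue) -> int:
    """Publish one message per new SIP in upload_dir; return how many were published."""
    published = 0
    # Sort so that the publishing order is deterministic.
    for path in sorted(upload_dir.glob("*.zip")):  # assumption: SIPs are zip files
        if path.name in seen:
            continue  # already handed over to the queue in an earlier scan
        # Hypothetical message layout; a real broker message may look different.
        mq.put({"sip": path.name, "path": str(path)})
        seen.add(path.name)
        published += 1
    return published
```

Because the set of already published SIPs lives in a single process here, two schedulers scanning the same directory would each publish every SIP once, which illustrates why only one scheduler may run at a time.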

Loader¶

Multiple loaders can be started simultaneously; each connects to the message broker, and together they process the SIPs concurrently.
Processing includes, among others, the following steps:

unzipping
validating files and metadata
calculating checksums
storing metadata in DB and assets in Archival Storage.

The status of the ingest process is continuously updated and can be monitored through the Web UI or API.
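
The first three loader steps listed above can be sketched with the Python standard library. This is a simplified illustration under stated assumptions: `process_sip` is a hypothetical name, the validation rule (the package must not be empty) is a placeholder for real file and metadata validation, SHA-256 is an assumed checksum algorithm, and the final step of storing metadata and assets is left out.

```python
import hashlib
import zipfile
from pathlib import Path


def process_sip(sip_path: Path, workarea: Path) -> dict:
    """Unzip a SIP into workarea and return {relative file name: SHA-256 hex digest}."""
    workarea.mkdir(parents=True, exist_ok=True)

    # Step 1: unzipping
    with zipfile.ZipFile(sip_path) as archive:
        archive.extractall(workarea)

    files = sorted(p for p in workarea.rglob("*") if p.is_file())

    # Step 2: validating (placeholder rule: an empty package is rejected)
    if not files:
        raise ValueError(f"{sip_path.name}: SIP contains no files")

    # Step 3: calculating checksums (SHA-256 is an assumption)
    checksums = {}
    for f in files:
        name = f.relative_to(workarea).as_posix()
        checksums[name] = hashlib.sha256(f.read_bytes()).hexdigest()

    # Step 4 (storing metadata in the DB and assets in Archival Storage)
    # is intentionally omitted from this sketch.
    return checksums
```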

Proxy¶

A reverse proxy for SSL offloading and serving of static content.

Web¶

Provides an HTTP API for third-party applications and a web UI for administrative operations.

Retriever¶

Multiple retrievers can be started simultaneously; each connects to the message broker and processes retrieval requests from clients. AIPs are fetched from the Archival Storage system, then converted to DIPs and saved in the HTTP-accessible downloadarea. The status of the retrieval process is continuously updated and can be monitored through the web application.

Purger¶

Monitors the size of the downloadarea filesystem and removes old packages once the high-water mark is reached.

Statistics¶

Queries the main database, calculates statistics and saves the information in the cache.

Processes¶

Ingest¶

[Figure: ingest process diagram]

The ingest process is modeled as a state machine with the end states done, error, and fatal. Once the client receives a "done" ticket from the web application, the AIP has been saved on stable storage (Archival Storage) and every participating system, e.g. mdqi and Data Management, has the data it requires. Every ingest of a SIP is also a "transaction", which means that a failing step leads to a recovery/rollback operation. Recovery is performed by the loader in its error state or, if the failure was fatal, by the purger application. During the recovery procedure the purger locks the corresponding workarea and blocks the loader application from starting.
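
The relationship between the three end states and the rollback behaviour described above can be sketched as follows. Only the end states done, error, and fatal come from the description; the `RUNNING` state, the `FatalIngestError` exception, and the `run_ingest` function are hypothetical names introduced for illustration.

```python
from enum import Enum
from typing import Callable, Iterable


class IngestState(Enum):
    RUNNING = "running"  # hypothetical non-terminal state
    DONE = "done"
    ERROR = "error"
    FATAL = "fatal"


class FatalIngestError(Exception):
    """Hypothetical marker for failures the loader cannot recover from itself."""


def run_ingest(
    steps: Iterable[Callable[[], None]],
    rollback: Callable[[], None],
) -> IngestState:
    """Run the ingest steps as one transaction and return the end state."""
    try:
        for step in steps:
            step()
    except FatalIngestError:
        # Fatal: the loader does not clean up; recovery is left to the purger.
        return IngestState.FATAL
    except Exception:
        # Error: the loader itself performs the recovery/rollback.
        rollback()
        return IngestState.ERROR
    return IngestState.DONE
```

In this sketch a "done" result is only returned after every step has completed, which mirrors the guarantee that a "done" ticket implies the AIP is already on stable storage.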

Retrieve¶

[Figure: retrieve process diagram]

When the client requests a specific package, a retriever application saves the data to the HTTP-accessible downloadarea. The DIPs remain accessible until a configured threshold is reached, typically 70% downloadarea usage. The purger application then removes DIPs from oldest to newest.
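
The purge rule can be illustrated with a small selection function. This is a sketch under assumptions: the `Dip` record and `select_dips_to_purge` are hypothetical names, and the stop condition (purging ends as soon as usage drops below the threshold) is an assumption, since the description only states when purging starts and in which order DIPs are removed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dip:
    name: str
    created: float  # creation timestamp, seconds since the epoch
    size: int       # bytes occupied in the downloadarea


def select_dips_to_purge(dips, capacity: int, high_water_mark: float = 0.70):
    """Return the names of DIPs to delete, oldest first."""
    used = sum(d.size for d in dips)
    limit = capacity * high_water_mark
    to_remove = []
    for dip in sorted(dips, key=lambda d: d.created):
        if used < limit:
            break  # assumption: stop once usage is back below the threshold
        to_remove.append(dip.name)
        used -= dip.size
    return to_remove
```

For example, with a capacity of 100 units and three DIPs of sizes 30, 30, and 20 (80% usage), only the oldest DIP is selected, because removing it brings usage down to 50%.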