Mnemosyne -- a Cassandra-based Web Service for archived Panda Monitoring Data
Work in Progress
This page is work in progress subject to frequent updates. The Web service described here is in beta and is subject to changes and interruptions.
The goal of the project
There is a wide class of queries performed in the Panda Monitor that do not require RDBMS type of functionality that Oracle provides. On the other hand, ongoing optimization of Oracle queries has not resulted yet in demonstrable scalable performance and loosening of restraints on types and time windows of queries that are allowed to be run against the server without degradation of its performance. Using a noSQL solution to achieve very high performance and scalability, and to reduce the load on the main Panda Oracle server, is the topic of this R&D project. It includes an attached Web Service that will deliver data to the clients. The goal is to demonstrate production quality data service with performance significantly better than that of Oracle, for important classes of queries.
Overview of the technology platform and its components
Cassandra
We chose Cassandra as our noSQL platform due to existing experience with it, within ATLAS, and availability of a test platform at CERN in early stages of the project. Cassandra is a truly distributed, schema-less, key-value storage system that has been augmented with indexing capability in its latest versions. It can be configured to provide redundancy by keeping more than one copy of the data on the nodes that make up the cluster. Robustness of write operation is kept by keeping a commit log on each node and blocking until the write to the log succeeds. Data is the distributed across the cluster asynchronously. The nodes are automatically load-balanced and adding extra nodes can be done transparently, thus providing an extreme degree of horizontal scalability.
Cloud Storage
Currently the data is periodically extracted from the Oracle database (lagging by about 3 days since it was supposed to be final, to allow for delays in final updates) and deposited into Amazon S3 as csv-formatted files. This allows for transparent, universal and fast access to the data which is convenient for populating alternative storage such as an instance of Cassandra. There is also a local file cache where the files pulled of S3 can be stated for repetitive load testing, emergency reloads etc.
Apache and Django
Django provides a convenient way to create and efficiently maintain a Web service. We run a Django application under Apache Web Server to provide the foundation of the Web Service, which was tentatively name "Mnemosyne" for apparent reasons (i.e. in keeping with names from Greek mythology and as a reference to absolute memory). Currently there is no RDBMS involved in operating Mnemosyne, but in case a need arises (such as for light-weight tasks like administering user accounts with prepackaged Django modules), this can be easily achieved with sqlite etc.
Data Organization and Indexing
What do we do with Oracle column names in Cassandra?
Cassandra stores each "row" of data as a dictionary, which does include the keys. Due to convenient naming used in the Oracle schema, the names of columns (which would become keys in Cassandra) are quite long, in order to be self-documenting. The amount of data, in some columns, may be quite small compared to the key itself. It was decided therefore to use a hash table to encode the keys to short integers when writing to Cassandra, and decode them to a format identical to seen in Oracle. This results in a factor of 2 to 3 savings in disk space used by Cassandra, at the cost of moderate code complexity and extra CPU load on the Web server.
Composite Indexes
In order to create an efficient index for queries frequently done in the Panda Monitor, such index needs to be complex, i.e. contain more than one column. While there is a facility in Cassandra to make complex queries using several simple indexes, this is always sub-optimal as such queries will result in an inner loop and its requisite traversal, as opposed to a quick look-up and read, as is the case with a composite index.
The most straightforward way to create such composite index is to add columns to Cassandra rows, which contain data from a few columns of interest. This can be achieved in many ways, but concatenation of strings is the simplest and most transparent way to do that. The indexing solution was therefore to concatenate column names to form a new column key (for each individual set of columns meant to form an index), and to concatenate column values themselves, to form the new column value.
Decision for this approach was taken after lengthy experimentation with various data formats and indexes done by hand, i.e. by manipulating user-created tables (column families) in Cassandra. In the end, the additional complexity of the software and difficulty in maintaining referential integrity of the data upon deletes was not worth savings in disk space, however significant they were.
List of the Cassandra Indexes currently available
- PRODSOURCELABEL+DATE
- COMPUTINGSITE+PRODSOURCELABEL+DATE
- PRODUSERNAME+JOBDEFINITIONID+PRODSOURCELABEL
- PRODUSERNAME+DATE
- DATE
- TASKID
Prototype Web Service
A Web Service is run, on an experimental basis, using a web server and a Cassandra cluster located at BNL. This is an example of a URL that can be used to run queries against this service:
The query above demonstrates the following features of the API:
- date range feature -- multiple dates can be entered as interval separated by '-'
- mandatory use of the 'limit' parameter -- given the high volume of the data that can be requested, processed and transported to the browser by an inexperienced user or developer, there needs to be a safeguard to protect all elements of the system against such accidents, which have the potential to crash the browser, slow down the service etc. If the 'limit' parameter is omitted, the system will default to 10. Note that in queries involving multiple dates (date ranges) the limit applies to data for each day individually.
- selection of attributes -- in many cases, the end user and their application does not need ALL of the attributes of each object stored in the archive. Cutting down on the number of attributed being extracted results in savings in more than one place:
- network traffic inside the Cassandra cluster
- data load on the connection between the cluster and the Web server
- volume of data that needs to be transported to the browser
Other examples (some may not work at all times because of data rotation and obsoletion):
Appendix I: history of Cassandra stress testing
Testing was done with a total of 3 cluster configurations:
- a 4 VM cluster at CERN. This was a flaky installation subject to various instabilities, such as network lags, lack of adequate memory and crashes
- 3 hardware boxes at BNL with a total of 18 spindles of rotational media. Pretty nice rig but not capable of scaling up to Oracle level (Oracle instance has ~110 spindles). 48GB RAM and 24 cores (counting HT) on each node.
- Same boxes as above, but with 1TB SSDs each. Serious improvements in performance, handling loads comparable and potentially a few times higher than the Panda Oracle instance.
Appendix II: Design of data for the storage of filestable4
In Oracle, entries in the "files table" have a primary key that is simply a serial number. One of their attributes is Panda id. We are analyzing usage patterns to determine whether we can keep collection of file entries per Panda ID for speedy retrieval, or need to use some other way of packaging data.
The Oracle schema is as follows:
| Field |
Type |
Length |
| ROW_ID |
NUMBER |
11 |
| PANDAID |
NUMBER |
11 |
| MODIFICATIONTIME |
DATE |
7 |
| GUID |
VARCHAR2 |
64 |
| LFN |
VARCHAR2 |
256 |
| TYPE |
VARCHAR2 |
20 |
| DATASET |
VARCHAR2 |
255 |
| STATUS |
VARCHAR2 |
64 |
| PRODDBLOCK |
VARCHAR2 |
255 |
| PRODDBLOCKTOKEN |
VARCHAR2 |
250 |
| DISPATCHDBLOCK |
VARCHAR2 |
255 |
| DISPATCHDBLOCKTOKEN |
VARCHAR2 |
250 |
| DESTINATIONDBLOCK |
VARCHAR2 |
255 |
| DESTINATIONDBLOCKTOKEN |
VARCHAR2 |
250 |
| DESTINATIONSE |
VARCHAR2 |
250 |
| FSIZE |
NUMBER |
11 |
| MD5SUM |
VARCHAR2 |
36 |
| CHECKSUM |
VARCHAR2 |
36 |
Vast majority of the queries are by PandaID. Other examples based on actual Oracle data are:
- dataset=:dataset
- destinationDBlock=:destinationDBlock
- dispatchDBlock =:dispatchdblock
- dispatchDBlock=:dispatchDBlock
- lfn=:lfn AND type=:type
- dataset=:dataset AND type=:type
- destinationDBlock=:destinationDBlock AND status=:status
- PandaID=:PandaID AND status=:status
- dispatchDBlock=:dispatchDBlock AND type=:type AND (destinationDBlockToken IS NULL OR destinationDBlockToken<>:noshadow)
- PandaID=:PandaID AND type=:type
Major updates:
--
MaximPotekhin - 26 Oct 2011