r10 - 15 Nov 2012 - 16:41:34 - MaximPotekhin

Mnemosyne -- a Cassandra-based Web Service for archived Panda Monitoring Data

Work in Progress

This page is a work in progress and subject to frequent updates. The Web service described here is in beta and is subject to change and interruptions.

The goal of the project

There is a wide class of queries performed in the Panda Monitor that does not require the RDBMS functionality that Oracle provides. At the same time, ongoing optimization of Oracle queries has not yet resulted in demonstrably scalable performance, nor in looser restrictions on the types and time windows of queries that can be run against the server without degrading its performance. This R&D project explores using a noSQL solution to achieve very high performance and scalability, and to reduce the load on the main Panda Oracle server. It includes an attached Web Service that delivers data to clients. The goal is to demonstrate a production-quality data service with performance significantly better than that of Oracle for important classes of queries.

Overview of the technology platform and its components


We chose Cassandra as our noSQL platform due to existing experience with it within ATLAS, and the availability of a test platform at CERN in the early stages of the project. Cassandra is a truly distributed, schema-less, key-value storage system that has been augmented with indexing capability in its latest versions. It can be configured to provide redundancy by keeping more than one copy of the data on the nodes that make up the cluster. Durability of write operations is ensured by keeping a commit log on each node and blocking until the write to the log succeeds. Data is then distributed across the cluster asynchronously. The nodes are automatically load-balanced, and extra nodes can be added transparently, providing an extreme degree of horizontal scalability.
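The commit-log behavior described above can be illustrated with a minimal sketch: append the mutation to an on-disk log and block on the sync before applying it in memory, so that a crash can always be recovered by replaying the log. This is an illustrative model of the technique, not Cassandra code; all names are hypothetical.

```python
import json
import os

class CommitLogStore:
    """Toy model of a write-ahead commit log in front of an in-memory table."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.memtable = {}  # in-memory key -> columns dict

    def write(self, key, columns):
        # Block until the log record is durably on disk...
        with open(self.log_path, "a") as log:
            log.write(json.dumps({"key": key, "columns": columns}) + "\n")
            log.flush()
            os.fsync(log.fileno())
        # ...and only then apply the mutation in memory.
        self.memtable.setdefault(key, {}).update(columns)

    def replay(self):
        # After a crash, replaying the log rebuilds the in-memory state.
        self.memtable = {}
        with open(self.log_path) as log:
            for line in log:
                record = json.loads(line)
                self.memtable.setdefault(record["key"], {}).update(record["columns"])
```

Because the log write is synchronous, an acknowledged write survives a process crash; the asynchronous cross-cluster distribution happens after this point.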

Cloud Storage

Currently the data is periodically extracted from the Oracle database (with a lag of about 3 days behind the point at which it is considered final, to allow for delays in final updates) and deposited into Amazon S3 as csv-formatted files. This allows for transparent, universal and fast access to the data, which is convenient for populating alternative storage such as an instance of Cassandra. There is also a local file cache where the files pulled off S3 can be staged for repetitive load testing, emergency reloads, etc.
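The local cache can be sketched as a simple stage-on-miss wrapper: a CSV extract is downloaded only if it is not already present locally, so repeated load tests or emergency reloads do not hit S3 again. The `fetch_from_s3` callable and all file names below are illustrative stand-ins for whatever S3 client and naming scheme are actually in use.

```python
import os

def stage_extract(name, cache_dir, fetch_from_s3):
    """Return a local path for the named CSV extract, downloading on a cache miss."""
    local_path = os.path.join(cache_dir, name)
    if not os.path.exists(local_path):
        data = fetch_from_s3(name)  # bytes of the csv file
        os.makedirs(cache_dir, exist_ok=True)
        with open(local_path, "wb") as f:
            f.write(data)
    return local_path
```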

Apache and Django

Django provides a convenient way to create and efficiently maintain a Web service. We run a Django application under the Apache Web Server to provide the foundation of the Web Service, which was tentatively named "Mnemosyne" for apparent reasons (i.e. in keeping with names from Greek mythology and as a reference to absolute memory). Currently no RDBMS is involved in operating Mnemosyne, but should a need arise (such as for light-weight tasks like administering user accounts with prepackaged Django modules), one can easily be added with sqlite etc.
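The actual Django application is not reproduced here, but its overall shape can be sketched with a stdlib-only WSGI callable of the kind Apache (via mod_wsgi) would invoke: parse the query string, consult the backing store, and return JSON. The `backend` parameter and the 'limit' handling are illustrative; only the default limit of 10 comes from the API description below.

```python
import json
from urllib.parse import parse_qs

def mnemosyne_app(environ, start_response, backend=None):
    """Sketch of a WSGI entry point returning query results as JSON."""
    params = parse_qs(environ.get("QUERY_STRING", ""))
    # The service enforces a 'limit' on every query, defaulting to 10.
    limit = int(params.get("limit", ["10"])[0])
    rows = (backend or [])[:limit]  # stand-in for a Cassandra lookup
    body = json.dumps(rows).encode("utf-8")
    start_response("200 OK", [("Content-Type", "application/json")])
    return [body]
```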

Data Organization and Indexing

What do we do with Oracle column names in Cassandra?

Cassandra stores each "row" of data as a dictionary, which includes the keys. Because of the self-documenting naming convention used in the Oracle schema, the names of columns (which would become keys in Cassandra) are quite long. The amount of data in some columns may be quite small compared to the key itself. It was therefore decided to use a hash table to encode the keys as short integers when writing to Cassandra, and to decode them back to a format identical to that seen in Oracle when reading. This results in a factor of 2 to 3 savings in the disk space used by Cassandra, at the cost of moderate code complexity and extra CPU load on the Web server.
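A minimal sketch of this key-encoding scheme: long Oracle column names are mapped to short integer keys before rows are written, and mapped back on read so the client sees the original names. The mapping table below is illustrative, not the actual schema.

```python
# Hypothetical name -> short key table; in practice this covers the full schema.
ENCODE = {"PANDAID": 0, "JOBSTATUS": 1, "DESTINATIONDBLOCK": 2}
DECODE = {v: k for k, v in ENCODE.items()}

def encode_row(row):
    """Replace verbose column names with short integer keys for storage."""
    return {ENCODE[name]: value for name, value in row.items()}

def decode_row(stored):
    """Restore the Oracle-style column names on the way back out."""
    return {DECODE[key]: value for key, value in stored.items()}
```

The disk savings come from not repeating a long key string in every stored row; the cost is the encode/decode pass on the Web server.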

Composite Indexes

In order to create an efficient index for queries frequently run in the Panda Monitor, the index needs to be composite, i.e. contain more than one column. While Cassandra provides a facility for making complex queries using several simple indexes, this is always sub-optimal, as such queries result in an inner loop and its requisite traversal, as opposed to the quick look-up and read achievable with a composite index.

The most straightforward way to create such a composite index is to add columns to Cassandra rows which contain data from the few columns of interest. This can be achieved in many ways, but concatenation of strings is the simplest and most transparent. The indexing solution was therefore to concatenate the column names to form a new column key (for each individual set of columns meant to form an index), and to concatenate the column values themselves to form the new column value.
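The construction just described can be sketched in a few lines: the chosen column names are joined to form the new column key, and the corresponding values are joined to form its value. The separator character and the column choices are illustrative assumptions.

```python
SEP = ":"  # hypothetical separator between concatenated parts

def add_composite_column(row, columns):
    """Add a concatenated index column for the given ordered set of columns."""
    index_key = SEP.join(columns)
    index_value = SEP.join(str(row[c]) for c in columns)
    indexed = dict(row)  # leave the original row untouched
    indexed[index_key] = index_value
    return indexed
```

A secondary index on the concatenated column then makes the multi-column query a single look-up instead of an inner-loop traversal over simple indexes.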

The decision to take this approach was made after lengthy experimentation with various data formats and indexes built by hand, i.e. by manipulating user-created tables (column families) in Cassandra. In the end, the additional complexity of the software and the difficulty of maintaining referential integrity of the data upon deletes were not worth the savings in disk space, however significant those were.

List of the Cassandra Indexes currently available

  • DATE

Mnemosyne Web Service (decommissioned in November 2012)

The Mnemosyne Web Service was run continuously for more than a year using a web server and a Cassandra cluster located at BNL. The operation proved to be smooth and stable.

This is an example of a URL that can be used to run queries against this service:

The query above demonstrates the following features of the API:

  • date range feature -- multiple dates can be entered as interval separated by '-'
  • mandatory use of the 'limit' parameter -- given the high volume of data that can be requested, processed and transported to the browser by an inexperienced user or developer, a safeguard is needed to protect all elements of the system against such accidents, which have the potential to crash the browser, slow down the service, etc. If the 'limit' parameter is omitted, the system defaults to 10. Note that in queries involving multiple dates (date ranges), the limit applies to the data for each day individually.
  • selection of attributes -- in many cases, the end user and their application do not need ALL of the attributes of each object stored in the archive. Cutting down on the number of attributes being extracted results in savings in more than one place:
    • network traffic inside the Cassandra cluster
    • data load on the connection between the cluster and the Web server
    • volume of data that needs to be transported to the browser
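The three API features above might be handled on the server side roughly as follows: split a '-'-separated date range, apply the 'limit' safeguard with its default of 10, and treat an optional comma-separated attribute list as a projection. The parameter names `date` and `fields` are illustrative; the exact URL scheme is not reproduced here.

```python
from urllib.parse import parse_qs

def parse_query(query_string):
    """Sketch of parsing the date-range, limit and attribute-selection parameters."""
    params = parse_qs(query_string)
    # Date range: multiple dates entered as an interval separated by '-'.
    dates = params["date"][0].split("-") if "date" in params else []
    # Safeguard: the limit is mandatory in spirit, defaulting to 10 if omitted.
    limit = int(params.get("limit", ["10"])[0])
    # Optional attribute selection; None means "all attributes".
    fields = params["fields"][0].split(",") if "fields" in params else None
    return {"dates": dates, "limit": limit, "fields": fields}
```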

Other examples (some may not work at all times because of data rotation and obsolescence):

Appendix I: history of Cassandra stress testing

Testing was done with a total of 3 cluster configurations:
  • a 4-VM cluster at CERN. This was a flaky installation subject to various instabilities, such as network lags, lack of adequate memory, and crashes
  • 3 hardware boxes at BNL with a total of 18 spindles of rotational media. A pretty nice rig, but not capable of scaling up to the Oracle level (the Oracle instance has ~110 spindles). 48 GB RAM and 24 cores (counting HT) on each node.
  • the same boxes as above, but with a 1 TB SSD each. Serious improvement in performance, handling loads comparable to, and potentially a few times higher than, the Panda Oracle instance.

Appendix II: Design of data for the storage of filestable4

In Oracle, entries in the "files table" have a primary key that is simply a serial number. One of their attributes is the Panda ID. We are analyzing usage patterns to determine whether we can keep a collection of file entries per Panda ID for speedy retrieval, or need some other way of packaging the data.
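The packaging option under consideration can be sketched as grouping the file rows under their PandaID, so that all file entries for a job become reachable with a single key look-up. Field names below follow the Oracle query examples on this page; the row data is illustrative.

```python
from collections import defaultdict

def group_by_pandaid(file_rows):
    """Collect filestable4 entries per PandaID for single-key retrieval."""
    grouped = defaultdict(list)
    for row in file_rows:
        grouped[row["PandaID"]].append(row)
    return dict(grouped)
```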

The Oracle schema is as follows:

Field Type Length

The vast majority of queries are by PandaID. Other examples, based on actual Oracle data, are:

  • dataset=:dataset
  • destinationDBlock=:destinationDBlock
  • dispatchDBlock =:dispatchdblock
  • dispatchDBlock=:dispatchDBlock
  • lfn=:lfn AND type=:type
  • dataset=:dataset AND type=:type
  • destinationDBlock=:destinationDBlock AND status=:status
  • PandaID=:PandaID AND status=:status
  • dispatchDBlock=:dispatchDBlock AND type=:type AND (destinationDBlockToken IS NULL OR destinationDBlockToken<>:noshadow)
  • PandaID=:PandaID AND type=:type

Major updates:
-- MaximPotekhin - 15 Nov 2012
-- MaximPotekhin - 26 Oct 2011
