Panda Data Host for general use (non-Atlas VOs)
Introduction
Our previous approach to data transport for general-use VOs was based on a simplified set of solutions and ideas used in the Atlas data movement system, and included
- use of a relatively sophisticated and complex file catalog
- significant part of the VDT related to data movement
- built-in Panda pre-emptive data movement capabilities
While this configuration is scalable and robust, recent OSG Engagement activity has shown that small and medium-size VOs will initially be reluctant or unable to dedicate resources to installing and managing the software stack required to operate under this scheme. This high cost of entry may make it difficult to attract new research teams as potential OSG participants, and may hinder adoption of an OSG WMS (e.g. Panda) by such potential VOs.
Proposed Solution
Data hosting
We propose to leverage the existing OSG Workload Management Systems (WMSs) by utilizing a "hosted VO service" paradigm. In order to insulate the users from details of data storage configuration and particulars of job submission, we would provide a Web service with a uniform interface, accessible both from GUI and command-line tools. The data belonging to end-users can be stored in various ways, including OSG Storage Elements (assigned to a particular VO) or, in simpler cases, local or distributed disk accessible to the server. Both the Panda Pilot-based Framework and glideinWMS will be able to interface with such a data host easily (for example, it is trivial for the Panda Pilot to download data to the Worker Node from a web service, and simple wrappers can be created for glideinWMS). Likewise, by using the POST method over HTTPS, the Pilot would communicate with the data host and upload the output of the calculation. In the simplest version, the users would be responsible for maintenance of their data on the Data Host, which they would access solely via a Web interface, with no additional software being required. The required UI functionality must then include auth/auth and the ability to upload, list and delete files. The administrative capability would include user and group privilege management, establishment of quotas, etc.
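The storage side of the simplest case (a disk local to the server, with upload, list, delete and a quota) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names UserStore and QuotaExceeded are hypothetical and are not taken from the prototype, the quota is counted in bytes, and the user name is assumed to have been validated already by the auth/auth layer.

```python
import os
import tempfile


class QuotaExceeded(Exception):
    """Raised when an upload would push a user past the configured quota."""


class UserStore:
    """Hypothetical per-user file area on a disk local to the Data Host."""

    def __init__(self, root, user, quota_bytes):
        self.quota_bytes = quota_bytes
        self.area = os.path.join(root, user)
        os.makedirs(self.area, exist_ok=True)

    def _path(self, filename):
        # Keep only the last path component, so that a name such as
        # "../other/file" cannot escape the user's own area.
        name = os.path.basename(filename)
        if name in ("", ".", ".."):
            raise ValueError("invalid file name: %r" % filename)
        return os.path.join(self.area, name)

    def list(self):
        return sorted(os.listdir(self.area))

    def used_bytes(self):
        return sum(os.path.getsize(os.path.join(self.area, n))
                   for n in self.list())

    def put(self, filename, data):
        path = self._path(filename)
        # Replacing an existing file frees its old size first.
        old = os.path.getsize(path) if os.path.exists(path) else 0
        if self.used_bytes() - old + len(data) > self.quota_bytes:
            raise QuotaExceeded(filename)
        with open(path, "wb") as f:
            f.write(data)

    def delete(self, filename):
        os.remove(self._path(filename))


# Usage with a throw-away directory and a deliberately tiny quota.
store = UserStore(tempfile.mkdtemp(), "alice", quota_bytes=10)
store.put("results.dat", b"12345")
print(store.list())
```

A real deployment backed by an OSG Storage Element would replace the file system calls, but the same three operations and the quota check would remain the interface seen by the Web layer.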
A number of off-the-shelf solutions exist for fast data transfer from a Web service to the Worker Node via HTTP, such as TUX and lighttpd.
Optional - job submission
In addition to providing users with a Web service to handle the data transfer, we are looking at the possibility of submitting jobs to OSG-managed facilities through the same Web interface. The submission itself would still be done using WMS-specific tools (Panda job submitter or glideinWMS Condor submission) -- it will just happen on the server and not on the user's desktop. The method of job submission can easily be made configurable for each user.
The motivation for this option lies in the fact that many end-users would prefer being able to control the job submission without having to install and manage the OSG software stack.
Note that since the users will be required to obtain a proper X.509 certificate and install it in their browser, the security guidelines for job submission will still be followed.
System architecture (using Panda as an example)
The proposed architecture (using Panda as an example) is represented in the following diagram:
There are existing examples in industry (e.g. SunGrid) where an HTTPS connection was used for data transport in a cloud computing scenario.
Choice of Platform
Since the proposed Data Host is essentially a Web server with a layer of additional code that performs auth/auth, implements the necessary logic and creates a suitable UI, there is a large number of possible implementations. The most suitable ones are based on a comprehensive Web framework (as opposed to individual functional modules). Of these, we chose Django for prototype development. While a detailed feature comparison of Django with other frameworks is beyond the scope of this document, it is helpful to enumerate the features that will be immediately useful in the Panda Data Host implementation:
- ready-to-use auth/auth system with full session support, automated set-up and pre-packaged admin interface, which uses a database table as its back end
- possibility of group-based and similar fine-grain privilege assignment to users
- optimal integration of the Web server and the database back-end for serving query results, in a way that encapsulates the server API
- basic built-in web security such as defenses against cross-site scripting and injection of malicious SQL
- powerful yet simple to use template system, featuring include files and template inheritance which both help avoid errors and minimize the amount of HTML to be written
- in combination with the above, possibility of using stylesheets in HTML templates, which helps improve output pages appearance across site and facilitates
- messaging service (messages can be posted on Web pages for any or all of the users)
- provisions for using not one but a variety of caching options (including memcached server), with configurable granularity (from complete site down to a portion of a page) for enhancing performance while performing repeated database queries or serving other content repeatedly
- auto-generation of Python classes from database schemas facilitates migration of legacy web applications to Django
- customizable file storage classes provide additional convenience while hosting data
As mentioned above, efficient solutions exist for high-speed serving of data to the worker nodes, which do not (and should not) involve Django, and rely on an optimized server like TUX instead.
The Prototype
A functional Django-based prototype of the Data Host has been created, using a very small amount of Python code and template inheritance and inclusion. A few screenshots below:
- Datahost data management panel:
- DataHost Django admin screenshot (user management):
There has been no need to create any custom code for user auth/auth, as this functionality is entirely handled by Django. The admin section relies on its database for auth functionality, and the user passwords stored there are kept in hashed form, Unix-style. The same applies to the file upload pop-up panel and all of the admin section.
Command-line clients
Command line with password authentication (no SSL)
In addition to the Web-based access mode, we need to provide a suite of command-line clients for scripting by the user. Such utilities have been created (see the code in SVN), as described below:
- datahost-login host user password
- datahost-logout host user
- datahost-ls host user
- datahost-rm host user filename
- datahost-put host user filename
- datahost-get host user filename
The functionality of these commands is self-explanatory: after a login, the user can list the files (ls), remove a file by name (rm), put a file on the server (put), and get it back (get). Since these rely on curl for communication with the host, note that under certain conditions (such as operating on an intranet) it is necessary to configure curl to bypass the HTTP proxy (in case that proxy is too restrictive in routing requests). This is achieved by setting the environment variable DATAHOST_NOPROXY to '*'. Also, for convenience, the user can set the environment variables DATAHOST_ADDRESS, DATAHOST_USER and DATAHOST_PASSWORD, in which case the corresponding arguments should be dropped from the command lines given above. DATAHOST_ADDRESS should include the port number in addition to the host name or IP address, e.g. osgdev.racf.bnl.gov:1234.
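The rule that an argument is dropped from the command line when its environment variable is set can be illustrated with a short Python sketch. It is not the code of the actual clients, whose internals are not shown here: the function names resolve_arguments and curl_prefix are hypothetical, and passing the value of DATAHOST_NOPROXY to curl's --noproxy option is an assumption about how the proxy bypass could be implemented.

```python
import os

# Environment variable that may stand in for each positional argument.
ENV_FOR = {
    "host": "DATAHOST_ADDRESS",
    "user": "DATAHOST_USER",
    "password": "DATAHOST_PASSWORD",
}


def resolve_arguments(names, argv, environ=None):
    """Map positional names to values taken from the environment or argv.

    An argument whose environment variable is set is expected to be
    absent from argv; the others are consumed from argv in order.
    """
    environ = os.environ if environ is None else environ
    remaining = list(argv)
    resolved = {}
    for name in names:
        env_name = ENV_FOR.get(name)
        value = environ.get(env_name) if env_name else None
        if value:
            resolved[name] = value
        elif remaining:
            resolved[name] = remaining.pop(0)
        else:
            raise ValueError("missing argument: %s" % name)
    if remaining:
        raise ValueError("unexpected arguments: %s" % " ".join(remaining))
    return resolved


def curl_prefix(environ=None):
    """Start of a curl command line, bypassing the proxy if requested."""
    environ = os.environ if environ is None else environ
    command = ["curl"]
    no_proxy = environ.get("DATAHOST_NOPROXY")
    if no_proxy:
        # Assumption: the value ('*' in the text) goes to curl --noproxy.
        command += ["--noproxy", no_proxy]
    return command
```

With DATAHOST_ADDRESS and DATAHOST_USER set, a call such as resolve_arguments(("host", "user", "filename"), ["myfile.txt"]) would take only the file name from the command line, which mirrors the shortened form of datahost-put.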
HTTPS (X.509 and SSL)
Note that the above set of commands relies on the session support provided by Django, which makes use of cookies. Password authentication by agents such as payload jobs is not a secure option and must therefore be avoided in favor of a certificate/proxy-based approach. Here's how one may upload a data file (which may contain results of a calculation):
curl -k --cert valid_proxy_path --key valid_proxy_path -F "file2xmit=@myfile.txt" -F "username=me" https://osgdev.racf.bnl.gov:20006/datahost/transmit/
To make things simpler for the user, this functionality is wrapped in command line clients
- datahost-ssl-put filename
- datahost-ssl-get filename
The following environment variables should be set:
- DATAHOST_ADDRESS
- DATAHOST_USER
- DATAHOST_CERT (path to X.509 proxy)
- DATAHOST_KEY (path to X.509 proxy)
Alternatively, one can supply arguments:
- datahost-ssl-put host user cert key filename
- datahost-ssl-get host user cert key filename
For datahost-ssl-put, the host address should be https://osgdev.racf.bnl.gov:20006/datahost/transmit/, and for datahost-ssl-get, https://osgdev.racf.bnl.gov:20006/datahost/receive/
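For a pilot written in Python, the curl upload shown earlier could be approximated with the standard library alone. The sketch below is an assumption-laden illustration rather than the pilot's actual code: the helper names encode_multipart and upload are hypothetical, the form field names file2xmit and username are taken from the curl example, and whether the server accepts a proxy certificate presented this way depends on its SSL configuration. Note that curl's -k flag disables server certificate verification; here that behaviour has to be requested explicitly with verify=False.

```python
import io
import os
import ssl
import urllib.request
import uuid


def encode_multipart(fields, file_field, filename, payload):
    """Encode text fields plus one file as multipart/form-data."""
    boundary = uuid.uuid4().hex
    out = io.BytesIO()
    for name, value in fields.items():
        out.write(("--%s\r\n" % boundary).encode("ascii"))
        out.write(('Content-Disposition: form-data; name="%s"\r\n\r\n'
                   % name).encode("utf-8"))
        out.write(value.encode("utf-8") + b"\r\n")
    out.write(("--%s\r\n" % boundary).encode("ascii"))
    out.write(('Content-Disposition: form-data; name="%s"; filename="%s"\r\n'
               % (file_field, filename)).encode("utf-8"))
    out.write(b"Content-Type: application/octet-stream\r\n\r\n")
    out.write(payload + b"\r\n")
    out.write(("--%s--\r\n" % boundary).encode("ascii"))
    return out.getvalue(), "multipart/form-data; boundary=%s" % boundary


def upload(url, proxy_path, username, path, verify=True):
    """POST a file over HTTPS, authenticating with an X.509 proxy file."""
    with open(path, "rb") as f:
        payload = f.read()
    body, content_type = encode_multipart(
        {"username": username}, "file2xmit", os.path.basename(path), payload)
    context = ssl.create_default_context()
    if not verify:
        # Equivalent of curl -k: skip verification of the server certificate.
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    # The proxy file holds both the certificate and the key, as in the
    # curl example where --cert and --key point to the same path.
    context.load_cert_chain(certfile=proxy_path, keyfile=proxy_path)
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST")
    with urllib.request.urlopen(request, context=context) as response:
        return response.status
```

The whole file is read into memory before sending, which is acceptable for a sketch but not for outputs of several hundred MB; a production client would stream the body instead.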
Download the command-line clients
The clients are available from the SVN repository of USATLAS.
Deployment considerations
Auth/auth
Ultimately, the usefulness of the Data Host depends on the efficiency and security of data transfer between it and Grid jobs -- in particular, pilot jobs. Since including user IDs and passwords with the pilot submission is insecure, we are using SSL for data transport between the server and the pilot job. However, we don't necessarily want to force users (especially early adopters) to start using Grid Certificates while uploading their data to the server and getting back the results. We therefore employ the usual auth/auth mechanism with user ID and password for end-user communication with the server, while using SSL for pilot job interaction. This is achieved by appropriate configuration of the Apache server, such as having an SSL-enabled Virtual Host listening on one port, and a non-SSL instance on another. Since we prefer to use a Django handler for uploads from pilots, we must have two Django instances running on one Apache server. A general recipe for achieving this is given here:
<VirtualHost port1>
ServerName www.example.com
# ...
# SSL settings here...
<Location "/something">
SetEnv DJANGO_SETTINGS_MODULE mysite.settings
PythonInterpreter mysite
</Location>
</VirtualHost>
<VirtualHost port2>
<Location "/otherthing">
SetEnv DJANGO_SETTINGS_MODULE mysite.other_settings
PythonInterpreter othersite
</Location>
</VirtualHost>
Since the pre-packaged Django development server is not SSL-capable, and installing a secure tunnel is additional work, the SSL-enabled elements of the service are best tested with Apache itself.
Security Policy
We are in compliance with the JSPG Portal Policy, in the parts related to both Data Processing and Job Management portals, for the following reasons:
- The Strong Authentication (as defined in the above document) is built into the server by virtue of Grid User certificates being used
- The admin module of the service complies with requirements related to maintenance of user registration information
Serving data
Serving data through Django is not recommended because it is an unnecessary drain on resources. Instead, we rely on Apache itself by defining directories that it serves directly, bypassing Django for that area. This can be done by specifying a separate Location in the Apache configuration with the handler switched off: URLs under that Location are handled directly by Apache, while the rest are served by Django. Example:
<Location "/media">
SetHandler None
</Location>
Another example providing explicit streaming of media files:
<Location "/">
SetHandler python-program
PythonHandler django.core.handlers.modpython
SetEnv DJANGO_SETTINGS_MODULE mysite.settings
</Location>
<Location "/media">
SetHandler None
</Location>
<LocationMatch "\.(jpg|gif|png)$">
SetHandler None
</LocationMatch>
Testing
A series of test jobs was run at NERSC, using real data from the experiment. In the process, each payload job downloaded a few hundred MB of data from the Datahost, performed necessary calculations and uploaded the resulting files back to a user's account on Datahost.
Technical Notes
Testing the throughput
In testing the throughput with large files, it is often useful to generate files of a given length, which is typically done with a command like:
dd if=/dev/urandom of=yourfilename bs=1M count=100
See the man page for dd.
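Where dd is not available, or when the test is driven from a Python script, the same kind of file can be produced with a few lines of code. The helper names below are hypothetical; the second function simply converts a measured transfer time into MB/s, taking 1 MB as 2**20 bytes to match bs=1M.

```python
import os
import tempfile


def make_random_file(path, size_bytes, chunk=1024 * 1024):
    """Write size_bytes of random data to path, one chunk at a time."""
    remaining = size_bytes
    with open(path, "wb") as f:
        while remaining > 0:
            n = min(chunk, remaining)
            f.write(os.urandom(n))
            remaining -= n
    return path


def throughput_mb_per_s(size_bytes, seconds):
    """Convert a transfer size and its duration to MB/s (1 MB = 2**20 bytes)."""
    return size_bytes / (1024.0 * 1024.0) / seconds


# Small demonstration: a 1 MB file in a throw-away directory.
# The dd example above corresponds to size_bytes = 100 * 1024 * 1024.
demo_path = make_random_file(
    os.path.join(tempfile.mkdtemp(), "testfile"), 1024 * 1024)
print(os.path.getsize(demo_path))
```

Random content is preferable to zeros for this purpose, since it cannot be compressed anywhere along the transfer path.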
Database issues
The admin module needs a few database tables, which are created automatically by running the command
python manage.py syncdb
This applies not just to the admin module per se, but to any Models defined in Django by the developer.
Temporary File storage
By default, Django places temporary files in /tmp during upload (the location is configurable). Regardless of location, there is a likely bug in Django whereby the temporary files are not purged if there is more than one upload handler. Currently, the workaround is to leave only one handler (the one that uses temporary files explicitly) and give up on automatic spill-over from memory to disk.
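In terms of Django settings, the workaround amounts to something like the fragment below. FILE_UPLOAD_HANDLERS and FILE_UPLOAD_TEMP_DIR are standard Django settings and TemporaryFileUploadHandler is the stock handler that always writes uploads to a temporary file; the directory shown is a hypothetical example, and this is a sketch rather than a copy of the prototype's settings file.

```python
# settings.py fragment (sketch, not the prototype's actual settings)

# Keep a single upload handler. The default list also contains
# MemoryFileUploadHandler, which holds small uploads in memory and
# spills larger ones over to the temporary-file handler.
FILE_UPLOAD_HANDLERS = [
    "django.core.files.uploadhandler.TemporaryFileUploadHandler",
]

# Hypothetical location for temporary upload files; when this setting
# is left unset, the system default temporary directory is used.
FILE_UPLOAD_TEMP_DIR = "/var/tmp/datahost-uploads"
```

With this configuration every upload, however small, goes through a temporary file on disk, which is the price paid for giving up the spill-over.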
To-Do List
- Error messages in client utilities
- Fixed location of the directory used for storage of cookies (so that the session is maintained regardless of the directory from where scripts are being launched)
Major updates:
-- TWikiAdminGroup - 07 Nov 2009
Attachments
DH_login_1.PNG (32.3K) | MaximPotekhin, 05 May 2009 - 14:10 | DataHost login screenshot
DH_admin_1.PNG (52.9K) | MaximPotekhin, 05 May 2009 - 14:29 | DH admin screenshot
DH_diagram_1.PNG (60.1K) | MaximPotekhin, 05 May 2009 - 15:10 | DataHost Diagram
DH_list_2.png (39.6K) | MaximPotekhin, 26 May 2009 - 12:40 | Datahost data management panel
DH_jobs_1.PNG (56.7K) | MaximPotekhin, 23 Jun 2009 - 17:55 | DH jobs screen
DH_list_3.png (21.0K) | MaximPotekhin, 26 Jun 2009 - 11:01 | DH data management panel