
Panda Data Host for general use (non-Atlas VOs)


Introduction

Our previous approach to data transport for general-use VOs was based on a simplified set of solutions and ideas used in the Atlas data movement system, and included
  • use of a relatively sophisticated and complex file catalog
  • a significant part of the VDT related to data movement
  • built-in Panda pre-emptive data movement capabilities
While this configuration is scalable and robust, recent OSG Engagement activity has shown that small and medium size VOs will initially be reluctant or unable to dedicate resources to the installation and management of the software stack required to operate under this scheme. This high cost of entry may lead to difficulty in attracting new research teams as potential OSG participants and/or in the adoption of an OSG WMS (e.g. Panda) by such potential VOs.

Proposed Solution

Data hosting

We propose to leverage the existing OSG Workload Management Systems (WMS's) by utilizing a "hosted VO service" paradigm. In order to insulate the users from the details of data storage configuration and the particulars of job submission, we would provide a Web service with a uniform interface, accessible both from a GUI and from command-line tools. The data belonging to end-users can be stored in various ways, including OSG Storage Elements (assigned to a particular VO) or, in simpler cases, local or distributed disk accessible to the server. Both the Panda Pilot-based framework and glideinWMS will be able to interface with such a data host easily (for example, it is trivial for the Panda Pilot to download data to the Worker Node from a web service, and simple wrappers can be created for glideinWMS). Likewise, by using the POST method over HTTPS, the Pilot would communicate with the data host and upload the output of the calculation. In the simplest version, the users would be responsible for the maintenance of their data on the Data Host, which they would access solely via a Web interface, with no additional software required. The required UI functionality then must include auth/auth and the ability to upload, list and delete files. The administrative capability would include user and group privilege management, establishment of quotas, etc.
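To illustrate the kind of interaction this implies, below is a minimal sketch of how a pilot could fetch input from, and post output to, the data host over HTTPS. It assumes the Python requests library is available on the worker node; the /transmit/ endpoint and the file2xmit/username form fields mirror the curl example given later on this page, while the /receive/ parameters, the proxy path and the helper names are purely illustrative assumptions.

# Minimal sketch of pilot <-> Data Host interaction over HTTPS, assuming
# the requests library is available on the worker node.  The /transmit/
# endpoint and the file2xmit/username fields mirror the curl example
# later on this page; the /receive/ parameters are assumptions.
import requests

DATAHOST = "https://osgdev.racf.bnl.gov:20006/datahost"
PROXY = "/tmp/x509up_u12345"   # path to a valid grid proxy (placeholder)

def download_input(filename, username, dest):
    # Fetch an input file from the Data Host to the worker node.
    r = requests.get(DATAHOST + "/receive/",
                     params={"username": username, "filename": filename},
                     cert=(PROXY, PROXY), verify=False, stream=True)
    r.raise_for_status()
    with open(dest, "wb") as f:
        for chunk in r.iter_content(chunk_size=1 << 20):
            f.write(chunk)

def upload_output(path, username):
    # POST a result file back to the Data Host as a multipart form upload.
    with open(path, "rb") as f:
        r = requests.post(DATAHOST + "/transmit/",
                          files={"file2xmit": f},
                          data={"username": username},
                          cert=(PROXY, PROXY), verify=False)
    r.raise_for_status()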

A number of off-the-shelf solutions exist for fast data transfer from a Web service to the Worker Node via HTTP, such as TUX and lighttpd.

Optional - job submission

In addition to providing users with a Web service to handle the data transfer, we are looking at the possibility of submitting jobs to OSG managed facilities through the same Web interface. The submission itself would still be done using WMS-specific tools (the Panda job submitter or glideinWMS Condor submission) -- it would just happen on the server rather than on the user's desktop. The method of job submission can easily be made configurable for each user. The motivation for this option lies in the fact that many end-users would prefer being able to control job submission without having to install and manage the OSG software stack.

Note that since users will be required to obtain and install a proper X.509 certificate in their browser, the security guidelines for job submission will still be followed.
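As a rough illustration only (not part of the current prototype), a server-side submission handler could be as simple as a Django view that runs a configured, WMS-specific submission script on behalf of the authenticated portal user; the script path and view name below are hypothetical placeholders.

# Illustrative sketch only, not part of the current prototype: a Django
# view that performs the submission on the server, on behalf of an
# authenticated portal user.  The submit script path is a placeholder
# standing in for the Panda job submitter or a glideinWMS Condor submission.
import subprocess
from django.contrib.auth.decorators import login_required
from django.http import HttpResponse

@login_required
def submit_job(request):
    proc = subprocess.Popen(['/opt/datahost/submit.sh', request.user.username],
                            stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    out, _ = proc.communicate()
    return HttpResponse(out, content_type='text/plain')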

System architecture (using Panda as an example)

The proposed architecture (using Panda as an example) is represented in the following diagram:

DH_diagram_1.PNG

There are existing examples in industry (e.g. SunGrid) where an HTTPS connection was used for data transport in a cloud computing scenario.

Choice of Platform

Since the proposed Data Host is essentially a Web server with a layer of additional code performing auth/auth, implementing the necessary logic and providing a suitable UI, there is a large number of possible implementations. The most suitable ones are based on a comprehensive Web framework (as opposed to individual functional modules). Of these, we chose the Django framework for prototype development. While a detailed feature comparison of Django with other frameworks goes beyond the scope of this document, it is helpful to enumerate the features that will be immediately useful in the Panda Data Host implementation (a minimal code sketch is given after the list):
  • a ready-to-use auth/auth system with full session support, automated set-up and a pre-packaged admin interface, which uses a database table as its back end
  • the possibility of group-based and similar fine-grained privilege assignment to users
  • tight integration of the Web server and the database back end for serving query results, in a way that encapsulates the server API
  • basic built-in web security, such as defenses against cross-site scripting and injection of malicious SQL
  • a powerful yet simple-to-use template system, featuring include files and template inheritance, which both help avoid errors and minimize the amount of HTML to be written
  • in combination with the above, the possibility of using stylesheets in HTML templates, which helps keep the appearance of output pages consistent across the site
  • a messaging service (messages can be posted on Web pages for any or all of the users)
  • provisions for a variety of caching options (including a memcached server), with configurable granularity (from the complete site down to a portion of a page), to enhance performance when repeatedly running database queries or serving other content
  • auto-generation of Python classes from database schemas, which facilitates migration of legacy web applications to Django
  • customizable file storage classes, which provide additional convenience when hosting data
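
To make the list above more concrete, here is a minimal sketch of the kind of code these features enable in the Data Host context; the model, field and template names are hypothetical and chosen only for illustration.

# Minimal illustration of the Django features listed above: built-in
# auth/auth, a model with a file storage field, and a view rendered
# through the template system.  All names here are hypothetical.
from django.contrib.auth.decorators import login_required
from django.db import models
from django.shortcuts import render_to_response

class HostedFile(models.Model):
    # One file owned by a Data Host user; maps to a database table.
    owner = models.ForeignKey('auth.User')
    data = models.FileField(upload_to='datahost/%Y/%m/')   # customizable storage
    uploaded = models.DateTimeField(auto_now_add=True)

@login_required   # authentication and sessions come from Django itself
def list_files(request):
    files = HostedFile.objects.filter(owner=request.user)
    return render_to_response('datahost/list.html', {'files': files})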

As mentioned above, efficient solutions exist for high-speed serving of data to the worker nodes, which do not (and should not) involve Django, and rely on an optimized server like TUX instead.

The Prototype

A functional Django-based prototype of the Data Host has been created, using a very small amount of Python code together with template inheritance and inclusion. A few screenshots are shown below:

  • Datahost data management panel:
    DH_list_3.png


  • DataHost Django admin screenshot (user management):
    DH_admin_1.PNG

There has been no need to create any custom code for user auth/auth, since this functionality is handled entirely by Django. Authentication is backed by the Django database, with user passwords stored as salted hashes (Unix-style) rather than in plain text; the same mechanism protects the file upload pop-up panel and the entire admin section.

Command-line clients

Command line with password authentication (no SSL)

In addition to the Web-based access mode, we need to provide a suite of command-line clients for scripting by the user. Such utilities have been created (see the code in SVN), as described below:

  • datahost-login host user password
  • datahost-logout host user
  • datahost-ls host user
  • datahost-rm host user filename
  • datahost-put host user filename
  • datahost-get host user filename

The functionality of these commands is self-explanatory: after a login, the user can list the files (ls), remove a file by name (rm), put a file on the server (put), and get it back (get). Since these rely on curl for communication with the host, note that under certain conditions (such as operating on an intranet) it is necessary to configure curl to avoid using the HTTP proxy (in case that proxy is too restrictive in routing requests). This is achieved by setting the environment variable DATAHOST_NOPROXY to '*'. For convenience, the user can also set the environment variables DATAHOST_ADDRESS, DATAHOST_USER and DATAHOST_PASSWORD, in which case the corresponding arguments should be dropped from the command lines given above. DATAHOST_ADDRESS should include the port number in addition to the host name, e.g. osgdev.racf.bnl.gov:1234.
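For scripting, the same environment-variable convention can be driven from Python as well; the sketch below simply sets the variables and invokes the clients via subprocess (the host, user, password and file names are placeholders).

# Sketch: driving the password-mode clients from a Python script.  The
# datahost-* utilities are assumed to be on the PATH; the host, user,
# password and file names are placeholders.
import os
import subprocess

os.environ['DATAHOST_ADDRESS'] = 'osgdev.racf.bnl.gov:1234'
os.environ['DATAHOST_USER'] = 'me'
os.environ['DATAHOST_PASSWORD'] = 'secret'
os.environ['DATAHOST_NOPROXY'] = '*'   # bypass a restrictive intranet proxy

subprocess.check_call(['datahost-login'])            # arguments come from the environment
subprocess.check_call(['datahost-put', 'input.dat'])
subprocess.check_call(['datahost-ls'])
subprocess.check_call(['datahost-get', 'input.dat'])
subprocess.check_call(['datahost-logout'])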

HTTPS (X.509 and SSL)

Note that the above set of commands relies on the session support provided by Django, which makes use of cookies. Use of password authentication by agents such as payload jobs is not a secure option and therefore must be avoided in favor of a certificate/proxy based approach. Here is how one might upload a data file (which may contain the results of a calculation):

curl -k --cert valid_proxy_path --key valid_proxy_path -F "file2xmit=@myfile.txt" -F "username=me" \
https://osgdev.racf.bnl.gov:20006/datahost/transmit/

To make things simpler for the user, this functionality is wrapped in command-line clients:

  • datahost-ssl-put filename
  • datahost-ssl-get filename

The following environment variables should be set:

  • DATAHOST_ADDRESS
  • DATAHOST_USER
  • DATAHOST_CERT (path to X.509 proxy)
  • DATAHOST_KEY (path to X.509 proxy)

Alternatively, one can supply arguments:

  • datahost-ssl-put host user cert key filename
  • datahost-ssl-get host user cert key filename

For datahost-ssl-put, the host address should be https://osgdev.racf.bnl.gov:20006/datahost/transmit/, and for datahost-ssl-get, https://osgdev.racf.bnl.gov:20006/datahost/receive/.
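A payload job could follow the same pattern with the SSL clients, taking the proxy location from the X509_USER_PROXY variable conventionally set in grid environments. The sketch below assumes that DATAHOST_ADDRESS accepts the same full endpoint URL as the explicit host argument; the file names are placeholders.

# Sketch: using the SSL clients from a payload job.  The grid proxy is
# taken from X509_USER_PROXY (conventionally set in grid environments);
# the file names are placeholders.
import os
import subprocess

proxy = os.environ.get('X509_USER_PROXY', '/tmp/x509up_u%d' % os.getuid())
os.environ['DATAHOST_USER'] = 'me'
os.environ['DATAHOST_CERT'] = proxy
os.environ['DATAHOST_KEY'] = proxy

os.environ['DATAHOST_ADDRESS'] = 'https://osgdev.racf.bnl.gov:20006/datahost/receive/'
subprocess.check_call(['datahost-ssl-get', 'input.dat'])

os.environ['DATAHOST_ADDRESS'] = 'https://osgdev.racf.bnl.gov:20006/datahost/transmit/'
subprocess.check_call(['datahost-ssl-put', 'result.dat'])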

Download the command-line clients

The clients are available from the USATLAS SVN repository.

Deployment considerations

Auth/auth

Ultimately, the usefulness of the Data Host depends on the efficiency and security of data transfer between it and Grid jobs -- in particular, pilot jobs. Since including user IDs and passwords with the pilot submission is a bad idea, we use SSL for data transport between the server and the pilot job. However, we do not necessarily want to force users (especially early adopters) to start using Grid certificates when uploading their data to the server and retrieving the results. We therefore employ the usual auth/auth mechanism, with a user ID and password, for end-user communication with the server, while using SSL for pilot job interaction. This is achieved by appropriate configuration of the Apache server, such as having an SSL-enabled Virtual Host listening on one port and a non-SSL instance on another. Since we prefer to use a Django handler for uploads from pilots, we must have two Django instances running on one Apache server. A general recipe for achieving this is given here:

<VirtualHost port1>
    ServerName www.example.com
    # SSL settings (certificate, key, etc.) go here...
    <Location "/something">
        SetHandler python-program
        PythonHandler django.core.handlers.modpython
        SetEnv DJANGO_SETTINGS_MODULE mysite.settings
        PythonInterpreter mysite
    </Location>
</VirtualHost>

<VirtualHost port2>
    <Location "/otherthing">
        SetHandler python-program
        PythonHandler django.core.handlers.modpython
        SetEnv DJANGO_SETTINGS_MODULE mysite.other_settings
        PythonInterpreter othersite
    </Location>
</VirtualHost>

Since the pre-packaged Django development server is not SSL-capable, and installing a secure tunnel is additional work, the SSL-enabled elements of the service are best tested with Apache itself.

Security Policy

We are in compliance with the JSPG Portal Policy, in the parts related to both the Data Processing and Job Management portals, for the following reasons:

  • Strong Authentication (as defined in the above document) is built into the server by virtue of Grid user certificates being used
  • The admin module of the service complies with the requirements related to the maintenance of user registration information

Serving data

Serving data through Django itself is not recommended, due to an unnecessary drain on resources. Instead, we rely on Apache to do it, by defining directories that are served by Apache directly, bypassing Django for that area. This can be done by specifying a separate Location in the Apache configuration: one Location covers the range of URLs handled by Django, while the rest is served directly by Apache. Example:

<Location "/media">
    SetHandler None
</Location>

Another example, in which the site is handled by Django while media files are served (streamed) directly by Apache:

<Location "/">
    SetHandler python-program
    PythonHandler django.core.handlers.modpython
    SetEnv DJANGO_SETTINGS_MODULE mysite.settings
</Location>

<Location "/media">
    SetHandler None
</Location>

<LocationMatch "\.(jpg|gif|png)$">
    SetHandler None
</LocationMatch>

Testing

A series of test jobs was run at NERSC, using real data from the experiment. In the process, each payload job downloaded a few hundred MB of data from the Datahost, performed the necessary calculations, and uploaded the resulting files back to the user's account on the Datahost.

Technical Notes

Testing the throughput

When testing throughput with large files, it is often useful to generate a file of a given size, which is typically done with a command like:

dd if=/dev/urandom of=yourfilename bs=1M count=100
(see the dd man page for details)

Database issues

The admin module needs a few database tables, which are created automatically by running the command

python manage.py syncdb

This applies not just to the admin module per se, but to any Models defined in Django by the developer.
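For example, a hypothetical model such as the one below gets its table created by the same syncdb run that creates the admin and auth tables:

# Any model declared by the developer, such as the hypothetical one below,
# has its table created by the same syncdb run that creates the admin tables.
from django.db import models

class TransferLog(models.Model):
    filename = models.CharField(max_length=255)
    user = models.CharField(max_length=64)
    when = models.DateTimeField(auto_now_add=True)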

Temporary File storage

By default, Django places temporary files in /tmp during upload (this location is configurable). Regardless of the location, there is a likely bug in Django whereby, if more than one upload handler is configured, the temporary files are not purged. Currently, the workaround is to leave only one handler (the one which uses temporary files explicitly) and to give up on the automatic spill-over from memory to disk.
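In terms of the Django settings, this workaround can be expressed as a sketch like the following (the temporary directory path is a hypothetical example):

# settings.py fragment expressing the workaround: keep only the
# temporary-file upload handler (no in-memory handler), and optionally
# move the temporary directory away from /tmp.
FILE_UPLOAD_HANDLERS = (
    'django.core.files.uploadhandler.TemporaryFileUploadHandler',
)
FILE_UPLOAD_TEMP_DIR = '/data/datahost/tmp'   # hypothetical location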

To-Do List

  • Error messages in client utilities
  • Fixed location of the directory used for storing cookies (so that the session is maintained regardless of the directory from which the scripts are launched)


Attachments


png DH_login_1.PNG (32.3K) | MaximPotekhin, 05 May 2009 - 14:10 | DataHost login screenshot
png DH_admin_1.PNG (52.9K) | MaximPotekhin, 05 May 2009 - 14:29 | DH admin screenshot
png DH_diagram_1.PNG (60.1K) | MaximPotekhin, 05 May 2009 - 15:10 | DataHost Diagram
png DH_list_2.png (39.6K) | MaximPotekhin, 26 May 2009 - 12:40 | Datahost data management panel
png DH_jobs_1.PNG (56.7K) | MaximPotekhin, 23 Jun 2009 - 17:55 | DH jobs screen
png DH_list_3.png (21.0K) | MaximPotekhin, 26 Jun 2009 - 11:01 | DH data management panel
 