r11 - 15 Jul 2011 - 14:41:06 - XinZhao

Local Site Mover

Introduction

The Pilot moves files between the /scratch directory and the Storage Element (SE). This page describes how the Pilot can delegate those transfers to utilities provided by the site itself.

Specification

  • Required commands are: put, get, df
  • Optional (desired) additional commands are: size, checksum, mkdir, chmod
  • Command names start with lsm- (local site mover)

Open issues

This section is here temporarily and should disappear once the specification is complete.

List:

  • What are the return values of lsm-put? Other SiteMovers return the following from put_data: return 0, pilotErrorDiag, dst_gpfn, fsize, fchecksum, self.arch_type
  • Should lsm-put/get be idempotent? I.e. if the destination file is already there and has the same size and checksum, should the copy be reported as successful?

Common

Specification common to all the commands above:
  • commands are invoked as command lines (in the shell)
  • commands have to be in the path (at least once OSG_GRID has been set up)
  • commands are executed in a subshell
  • input parameters are provided as specified in the command syntax (options can be missing; the remaining ones are in the order given in the syntax specification)
  • output parameters are returned on stdout, one per line (\n is the separator), in the order defined in the specification below (skipping the exit code, which is the exit code of the subprocess)
  • error codes are not returned on stdout (the exit code of the process must reflect the result of the copy: 0 = successful, N = some error)
  • in the parameter list below NY means Not Yet, the parameter can be returned but it is not used or required yet
  • timeout is managed in the Pilot invoking the command
  • default permissions for directories are 0x0775, i.e. octal mode 775 (managed production and dq2 users will share directories)
  • default permissions for files are 0x0664, i.e. octal mode 664 (664 allows users to read their Pathena output and log files)
  • possibility to set directory permission [--perm_dir 0x755]
  • possibility to set file permission [--perm_file 0x644]
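The calling convention above can be sketched from the invoking side as follows. This is a minimal illustration, not the Pilot's actual code: the function name `run_lsm` and the default timeout are assumptions. It builds a shell command line, runs it in a subshell, enforces the timeout on the caller's side, and reads the output parameters from stdout, one per line.

```python
import shlex
import subprocess


def run_lsm(command, args, timeout=600):
    """Run an lsm-* command in a subshell.

    Returns (exit_code, output_lines). exit_code is None if the caller's
    timeout expired (the timeout is managed by the invoking program).
    The name run_lsm and the 600 s default are illustrative assumptions.
    """
    # Quote every part so the shell sees exactly the intended arguments.
    cmdline = " ".join(shlex.quote(part) for part in [command] + list(args))
    try:
        proc = subprocess.run(
            cmdline,
            shell=True,                 # invoked as a command line in a shell
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return None, []
    # Output parameters arrive on stdout, one per line, in specification order.
    return proc.returncode, proc.stdout.splitlines()
```

A caller would then interpret the lines by position, for example treating the first line of `run_lsm("lsm-put", [source, destination])` as the url when the exit code is 0.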

Some parameters:

  • checksum: checksum_type:value
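A small sketch of how the checksum_type:value format could be handled. It applies the rule given for the csum parameter of lsm-put and lsm-get (no prefix means MD5). The function names, the lowercase normalization, and the 8-digit zero-padded hex form for Adler-32 are assumptions, not part of the specification.

```python
import hashlib
import zlib


def parse_checksum(csum):
    """Split 'type:value' into (type, value); a bare value is taken as MD5."""
    if ":" in csum:
        ctype, value = csum.split(":", 1)
    else:
        ctype, value = "md5", csum
    # Lowercasing is an assumption made here to simplify comparisons.
    return ctype.lower(), value.lower()


def file_checksum(path, ctype="md5", chunk_size=1024 * 1024):
    """Compute the checksum of a file as a lowercase hex string."""
    if ctype == "md5":
        digest = hashlib.md5()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                digest.update(chunk)
        return digest.hexdigest()
    if ctype == "adler32":
        value = 1  # Adler-32 starting value
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                value = zlib.adler32(chunk, value)
        # Assumed textual form: 8 hex digits, zero padded.
        return "%08x" % (value & 0xFFFFFFFF)
    raise ValueError("unsupported checksum type: %s" % ctype)
```

An unsupported type raises ValueError here; an lsm-* command would more likely map that case to exit code 203 (unsupported option).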

Exit/Error codes

Exit codes must be compatible with the ones specified in: http://atlas-sw.cern.ch/cgi-bin/viewcvs-atlas.cgi/offline/Production/panda/pilot2/dmu/ErrorCodes.py?view=markup

To avoid conflicts, I am choosing a non-overlapping interval (codes 200 to 220, listed below).

There is a fairly broad selection of codes. The pilot is not guaranteed to take different actions depending on the error, but it is guaranteed to report the returned exit code upstream (in the log file). The lsm-* commands are therefore encouraged to return specific codes (and avoid the generic 200/220), since this helps debugging and troubleshooting. For the sake of simplicity they may return only 0 (OK) or 200 (Fail); either way, the exit code is the diagnostic information that will be available to resolve problems.

Exit Codes:

  • 200 - GENERIC failure
  • 201 - Copy command failed
  • 202 - Unsupported command
  • 203 - Unsupported option (e.g. Space token or checksum type)
  • 204 - Size comparison failed
  • 205 - Checksum comparison failed
  • 206 - Unable to write to destination (destination does not exist)
  • 207 - Unable to write to destination (permission problem)
  • 208 - Overload (copy failed for overload, retry later)
  • 209 - Size provided different from source file (this implies also 204)
  • 210 - Checksum provided different from source file (this implies also 205)
  • 211 - File already exists and is different (size/checksum)
  • 212 - File already exists and is the same as the source (same size/checksum)
  • 213 - Destination full (no space to write the output file)
  • 220 - GENERIC transient failure (a suggestion for the pilot to retry later)
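For implementers, the list above could be captured as named constants. The following is only a sketch: the enum and member names are invented here for illustration, while the numeric values are the ones listed above. The "retry later" set contains the two codes whose descriptions suggest a retry (208 and 220).

```python
from enum import IntEnum


class LsmExit(IntEnum):
    """Exit codes from the list above (member names are illustrative)."""
    OK = 0
    GENERIC_FAILURE = 200
    COPY_FAILED = 201
    UNSUPPORTED_COMMAND = 202
    UNSUPPORTED_OPTION = 203
    SIZE_MISMATCH = 204
    CHECKSUM_MISMATCH = 205
    DEST_MISSING = 206
    DEST_PERMISSION = 207
    OVERLOAD = 208
    SIZE_DIFFERS_FROM_SOURCE = 209
    CHECKSUM_DIFFERS_FROM_SOURCE = 210
    EXISTS_DIFFERENT = 211
    EXISTS_SAME = 212
    DEST_FULL = 213
    GENERIC_TRANSIENT = 220


# Codes whose description suggests retrying later.
RETRY_LATER = {LsmExit.OVERLOAD, LsmExit.GENERIC_TRANSIENT}


def describe(code):
    """Return a short log-friendly description of an exit code."""
    try:
        status = LsmExit(code)
    except ValueError:
        return "unknown exit code %d" % code
    hint = " (retry later suggested)" if status in RETRY_LATER else ""
    return "%d %s%s" % (status.value, status.name, hint)
```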

Exit code explanation (clarification on some codes):

  • 200/220 - These are both generic error codes (something went wrong). 220 additionally suggests that the pilot retry later, while 200 is more of a fatal failure. This does not imply that the pilot will follow the suggested behavior.
  • 206/207 - If the command is unable to distinguish the cause it should return 206
  • 208 - The system provides an overload code. The command may retry by itself; it will be interrupted by the timeout of the invoking program
  • 209/210 - If the size/checksum verification of the destination file fails and the size/checksum was provided as a command line parameter, it would be good if lsm-* also compared the provided value with the one evaluated from the source file and escalated the error if that check fails as well: 204 would become 209, and 205 would become 210.
  • 211 - This should not be confused with 204/205/209/210. Otherwise we risk masking cases where jobs overwrite the output of other jobs.
  • 212 - This may be ignored by the pilot (e.g. to allow job recovery)
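The 209/210 escalation described above can be sketched as a small decision function. This is an illustration under assumptions: the function name and its arguments are hypothetical, and the caller is assumed to have already detected a 204 or 205 on the destination.

```python
def escalate(code, provided, source_value):
    """Apply the optional source-side check after a failed verification.

    code         -- exit code so far (204 = size, 205 = checksum mismatch)
    provided     -- value given on the command line, or None if not given
    source_value -- value evaluated from the source file
    """
    if code not in (204, 205) or provided is None:
        return code  # nothing to escalate
    if provided != source_value:
        # The provided value disagrees with the source file itself.
        return 209 if code == 204 else 210
    return code
```

For example, `escalate(204, 100, 120)` yields 209, because the size given on the command line (100) does not match the size of the source file (120).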

LSM functions

lsm-put

Copy a file from /scratch to the storage element. The copy must be reliable: if the copy is successful, the file size and checksum (e.g. MD5) of the source and destination files must be the same. If a file size and/or checksum is provided, the copy is successful only if the file size and/or checksum of the destination matches the provided value(s) (lsm-put can use the value provided as a parameter if it is present). The copy may fail. There is no need to retry (but lsm-put is welcome to make multiple attempts). lsm-put must create all the directories in the path (if not already there). If the copy fails, lsm-put must remove files that were partially copied or have wrong attributes. It may leave the directories created for the copy.
Syntax:
lsm-put [-t token] [--size N] [--checksum csum] source destination
In:
  • source: POSIX path of the source file. Relative paths are relative to $PWD
  • destination: full URL (SURL/TURL) of the destination in the SE
    • If it exists and it is a file, the operation should end with an error
    • If it exists and it is a directory, it is the destination directory
    • If it does not exist, it is a directory name if it ends in '/', else it is a file name
  • token: space token to use
  • N: file size in bytes
  • csum: file checksum (string). The string may have a prefix specifying the checksum type followed by the value, e.g. "adler32:NNN", "md5:NNN". If there is no prefix it is assumed to be MD5
Out:
  • exit code: result of the transfer
  • url: full URL to retrieve the file
  • NY size: size in the SE
  • NY checksum: checksum in the SE

Currently no archival status support is requested. If the SE supports archival status, files copied using lsm-put should be transient.
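To make the lsm-put requirements concrete, here is a minimal sketch for the simplest possible case: a storage element that is mounted as a POSIX file system, so that the destination is a plain path rather than a SURL/TURL. That setup, the function names, and the restriction to MD5 are all assumptions made for illustration. The sketch creates missing directories, verifies size and checksum, removes the file if verification fails, and returns the exit codes defined above together with the resulting path.

```python
import hashlib
import os
import shutil

OK, COPY_FAILED, SIZE_MISMATCH, CHECKSUM_MISMATCH = 0, 201, 204, 205
EXISTS_DIFFERENT, EXISTS_SAME = 211, 212


def md5_of(path):
    """Return the MD5 hex digest of a file, read in 1 MB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def put_local(source, destination, size=None, md5=None):
    """Copy source to a POSIX destination; return (exit_code, path_or_None)."""
    # An existing directory, or a name ending in '/', is the target directory.
    if os.path.isdir(destination) or destination.endswith("/"):
        destination = os.path.join(destination, os.path.basename(source))
    # Use the values provided as parameters when present.
    expected_size = os.path.getsize(source) if size is None else size
    expected_md5 = md5_of(source) if md5 is None else md5.lower()
    if os.path.exists(destination):
        same = (os.path.isfile(destination)
                and os.path.getsize(destination) == expected_size
                and md5_of(destination) == expected_md5)
        return (EXISTS_SAME if same else EXISTS_DIFFERENT), None
    try:
        # Create all directories in the path (mode is subject to the umask).
        os.makedirs(os.path.dirname(destination) or ".", mode=0o775,
                    exist_ok=True)
        shutil.copyfile(source, destination)
    except OSError:
        if os.path.exists(destination):
            os.remove(destination)  # do not leave a partial copy behind
        return COPY_FAILED, None
    if os.path.getsize(destination) != expected_size:
        os.remove(destination)
        return SIZE_MISMATCH, None
    if md5_of(destination) != expected_md5:
        os.remove(destination)
        return CHECKSUM_MISMATCH, None
    os.chmod(destination, 0o664)  # default file permissions
    return OK, destination
```

A real lsm-put would print the url on stdout and pass the exit code to sys.exit. Directories created for a failed copy are left in place, which the specification allows.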

lsm-get

Copy a file from the storage element to the destination directory (run directory of a job, POSIX file system). The copy must be reliable: if the copy is successful, the file size and checksum (e.g. MD5) of the source and destination files must be the same. If a file size and/or checksum is provided, the copy is successful only if the file size and/or checksum of the destination matches the provided value(s) (lsm-get can use the value provided as a parameter if it is present). The copy may fail. There is no need to retry (but lsm-get is welcome to make multiple attempts). Partially copied or broken files may be left in the destination directory (the pilot can handle local files), but it is OK if lsm-get removes them.
Syntax:
lsm-get [-t token] [--size N] [--checksum csum] [--guid guid] source destination
In:
  • source: full URL (SURL/TURL) of the source file in the SE (e.g. method://[host[:port]]/full-dir-path/filename)
  • destination: POSIX path of the destination. Relative paths are relative to $PWD.
    • If it exists and it is a file, the operation should end with an error
    • If it exists and it is a directory, it is the destination directory
    • If it does not exist, it is a directory name if it ends in '/', else it is a file name
  • token: space token to use (is a space token useful when getting files?)
  • N: file size in bytes
  • csum: file checksum (string). The string may have a prefix specifying the checksum type followed by the value, e.g. "adler32:NNN", "md5:NNN". If there is no prefix it is assumed to be MD5
  • guid : guid of the file
Out:
  • exit code: result of the transfer
  • NY size: file size in the SE
  • NY checksum: file checksum from the SE

lsm-rm

Remove a file or directory from the SE (and release the space reservation, if any).
Syntax:
lsm-rm [-t token] url
In:
  • fully qualified URL (TURL/SURL) of the file/directory to remove
  • token - Do we need the token to remove a file? And to release the space?
Out:
  • exit code - exit code of the deletion

Do we need a delete command, or would an overwrite option in the put command ([--overwrite]) be better?

lsm-df

Report the space available in the specified storage.
Syntax:
lsm-df [-t token] [storage_endpoint]
In:
  • token - space token of the area to check for space
  • storage_endpoint - path (If no path or token are given, the default space in the SE is given)
Out:
  • exit code - exit code of the df operation
  • available space in MegaBytes (1024^2 bytes)
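As an illustration of the output convention, the sketch below reports free space for a storage area that is visible as a local POSIX path. That setup is an assumption (a real SE would typically be queried through its own interface), and the function name `lsm_df` is used here only for the example. The unit is the one given above, 1024^2 bytes.

```python
import shutil

MEGABYTE = 1024 ** 2  # unit defined in the specification


def lsm_df(path="."):
    """Return (exit_code, available_megabytes) for a POSIX path."""
    try:
        free_bytes = shutil.disk_usage(path).free
    except OSError:
        return 200, None  # generic failure, e.g. the path does not exist
    return 0, free_bytes // MEGABYTE
```

A command-line wrapper would print the second value on stdout and exit with the first.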

Configuration of queues

copytool = lsm
copytoolin = lsm
envsetup = source the/path/to/your/setup.sh
envsetupin = source the/path/to/your/setup.sh

Example:

copytool = lsm
copytoolin = lsm
envsetup = source /afs/usatlas.bnl.gov/i386_redhat72/opt/lsm/lsm/setup.sh
envsetupin = source /afs/usatlas.bnl.gov/i386_redhat72/opt/lsm/lsm/setup.sh

Test Local Site Mover against python 2.6

The new DQ2 Client (version 0.1.36 and above) requires python 2.5 or above, so at some point sites will need to test that their local site mover code works with newer python versions.

At BNL, we did a test on a worker node, using the following procedure:

  • install the python 2.6 rpm from the EPEL repository; it installs in parallel to the existing python 2.4 on the system.
  • log onto the node as a user
  • make sure python26 is the default python in your environment, e.g. symlink ~/bin/python to /usr/bin/python26
  • make sure a voms proxy is present on the local node
  • run some local site mover commands and make sure they still work with python 2.6. For example, at BNL we ran:
    • source /afs/usatlas.bnl.gov/i386_redhat72/opt/lsm/setup.test.sh
    • export X509_USER_PROXY=/tmp/x509up_uid
    • lsm-df
    • ... some other lsm commands

Once python 2.6 is installed at all sites, the new dq2 client install will have a "python" symlink pointing to /usr/bin/python26 in the dq2 install area, i.e. $OSG_APP/atlas_app/atlaswn/DQ2Clients/opt/dq2/bin/python -> /usr/bin/python26, so that python 2.6 is used once the dq2 client setup file is sourced.

-- MarcoMambelli - 05 Nov 2008
