
MinutesFedXrootdApr4

Coordinates

  • Attending:
  • Apologies:

Face-to-face meeting

April 11-12, 2012, Gleacher Center, University of Chicago (downtown). Recommended hotel: http://cwp.marriott.com/chifd/uchicago/ registration: http://mwt2.wufoo.com/forms/federated-xrootd-meeting-in-chicago/

Tentative Topics:

Enabling GSI security everywhere
	1) talk slides leading with instructions
	2) sites do it in real time
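
A minimal sketch of the kind of GSI setup the instructions would cover, assuming the stock xrootd security library location and standard grid certificate paths (the paths and file names are examples, not a verified recipe; check the xrootd GSI documentation for the deployed release):

# hedged example: enable GSI authentication on an xrootd server
xrootd.seclib /usr/lib64/libXrdSec.so
sec.protocol /usr/lib64 gsi -certdir:/etc/grid-security/certificates \
             -cert:/etc/grid-security/xrd/xrdcert.pem \
             -key:/etc/grid-security/xrd/xrdkey.pem -crl:1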

Enabling the monitoring
	1) talk slides leading with instructions
	2) sites do this in real time
	3) we need visibility into activity and performance.  I presume we continue
	with both Matevz's and Artem's monitoring - what specific requirements can
	we give them in advance?

Redirectors
	1) talk slides with instructions for regional redirection
	2) selected regional redirector (proposal MW regional)   Northeast? (BNL + BU + ...)
	3) demonstrate functional redirection modes, fall back to global
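
As an illustration of the fall-back-to-global mode (the redirector host names below are placeholders, not the agreed redirectors), a client could simply try the regional redirector first and the global one second:

# try the regional redirector, then fall back to the global one
xrdcp -f root://regional-redirector.example.org//atlas/dq2/some/file.root /tmp/file.root || \
xrdcp -f root://global-redirector.example.org//atlas/dq2/some/file.root /tmp/file.root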

Demonstrate production-scale analysis
	1) code & instructions available in advance
	2) focus on WAN direct access
	3) demonstrate performance at scale exceeding a threshold from 
	a Tier 3 to a region
		~ 300 jobs
		>= 50% of local efficiency
	

Regarding the inclusion of capabilities from Panda, I can see two very useful
advances that shouldn't be too hard:

	1) A modified local site mover that falls back to the federation
	(Hiro and Shawn have both discussed variants of this. What
	specifically do we need to move this forward?)

	2) Submitting a prun job with a dataset rather than a pfnlist, with
	the redirector passed as an argument.
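
A hedged sketch of both ideas follows; the command names, options, and the redirector host are assumptions for illustration only, not an agreed implementation:

# 1) local site mover with federation fallback: try the local copy tool,
#    then fall back to reading the GFN through the federation redirector
#    (lsm-get, $SURL, $GFN and the redirector host are placeholders)
lsm-get "$SURL" "$DEST" || xrdcp -f "root://fax-redirector.example.org/$GFN" "$DEST"

# 2) prun taking a dataset (--inDS) instead of a pfnlist, with the
#    redirector passed to the user's script as an argument
prun --inDS user.someone.dataset/ --outDS user.someone.test.out \
     --exec "analyze.sh %IN root://fax-redirector.example.org/"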

FAX Status Dashboard

Background

Meeting business

  • Twiki documentation locations
    • Some have difficulty accessing certain CERN twiki pages; the reason is unknown. Suggestion: put the documentation on the BNL twiki, with a link from the CERN twiki to BNL (http, not https).
    • not done yet
    • RG will follow up

Panda Movers

  • Talking to Paul and Alden about setting up SchedConfig and the Panda site movers to fill in missing files.
  • Local site movers
  • Panda needs to be able to schedule jobs to a site that doesn't have the input data locally.

Xrootd release 3.1.0 deployment

Summary of previous meetings:
  • Xrootd releases come out with some functional validation by stakeholders and large sites, but there is no formal release validation process.
  • CMS abandoned the dCap plug-in for the Xrootd OFS. They use the dCache xrootd door directly, or Xrootd overlaid on dCache.
  • Known issues:
    • RPM updates overwrite /etc/init.d/{xrootd,cmsd}, which carry the LFC environment setup. That setup should go into /etc/sysconfig/xrootd, which survives RPM updates (see the sketch after this list). Patrick will test it.
  • SWT2: the N2N crashing issue is understood (a conflict in signal usage between regular xrootd and Globus). The solution is either to use a proxy, or regular xrootd with "async off".
  • Xrootd 3.1.1 is ready for deployment at all sites.
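
A hedged sketch of the /etc/sysconfig/xrootd idea mentioned above, assuming the init scripts source that file; the variable names are only examples of the kind of LFC environment a site might need, not a verified list:

# /etc/sysconfig/xrootd - survives RPM updates of the init scripts
export LFC_HOST=lfc.example.org
export LD_LIBRARY_PATH=/opt/lcg/lib64:$LD_LIBRARY_PATH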
this meeting:

ANALY queue

this meeting:

X509

  • Andy: code is ready in 3.1.0. Wei (and Doug?) will test it?
  • An X509 module that checks VO attributes works in 3.1.1. Doug: this is probably good enough (and allows the cloud setup to avoid a grid infrastructure). Needed enhancement: the VO info is not yet validated using the VO public key.
this meeting
  • RSV dashboard is capable of using X509
  • What are the requirements for clients? An X509 proxy with a valid ATLAS VOMS attribute, and the X509 infrastructure (/etc/grid-security/certificates, etc.); an example follows below.
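
A hedged example of meeting the client-side requirements, using the standard VOMS client commands (the CA directory path matches the one above):

# create a proxy with the ATLAS VOMS attribute and inspect it
voms-proxy-init -voms atlas
voms-proxy-info -all
# point clients at the CA certificate directory if it is not the default
export X509_CERT_DIR=/etc/grid-security/certificates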

N2N

Summary of last week(s)
  • See further https://twiki.cern.ch/twiki/bin/viewauth/Atlas/AtlasXrootdSystems
  • Decided to continue improving the current N2N and leave GUID as a future option. Chicago can keep the source of N2N in CVS for now - send updates to Rob. Wei can compile.
  • Doug's use case: looking up files that exist at BNL but that N2N can't find. Hiro: need to change the code slightly - will do. Probably only happens at BNL; it has to do with the way Panda writes output to BNL.
  • Complaints about a possible memory leak in N2N. Provided Andy with a standalone package for debugging.
  • Hiro: update N2N for the BNL special cases. Doug will test whether this can improve the hit rate to near 100%.
  • The N2N crashing issue has a solution. See MinutesFedXrootdJan11#ANALY_queue
  • The Fermi Gamma-ray experiment at SLAC also sees the proxy's memory footprint grow. It releases memory when there is a period of no activity, and crashes otherwise. Wei will get more info.
  • Debugged N2N, found no memory leak.
  • Changed SLAC's configuration from an xrootd native proxy (cluster) to a cluster of regular xrootd with N2N on top of xrootdfs. This allows better observation of where the memory grows.
  • Memory still grows in regular xrootd + N2N. Was this due to caching in N2N?
this meeting:
  • An update to N2N is available: it supports queries by GUID. This may be useful for the LSM.
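
For reference, an N2N plugin is wired into a server through the oss.namelib directive; a hedged sketch, with the library path as a placeholder (the real plugin is the one kept in CVS at Chicago, per the notes above, and its options depend on that build):

# load the LFC-based name-to-name translation plugin (path is a placeholder)
oss.namelib /usr/lib64/XrdOucName2NameLFC.so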

cmsd + dCache/xrootd door

last meeting:

  • Sarah: the dCache xrootd door's performance is similar to the dCap door's.
  • An "authorization" plugin for the dCache/xrootd door which uses the cached GFN->LFN information to correctly respond to GFN requests (Hiro/Shawn/?). Hiro will work on a Java API for LFC first.
this meeting:

Sharing Configurations

last meeting:

The command should create an "xrootd" directory in your current directory. You need git client 1.7.2.2 or higher.
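
The clone command itself is not recorded in these minutes; purely as an illustration, with the repository URL as a placeholder:

git --version                        # should report 1.7.2.2 or newer
git clone <repository-url> xrootd    # creates the "xrootd" directory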

this meeting:

Detailed monitoring from UCSD /CMS

  • Discussions with Matevz Tadel (USCMS, UCSD) at Lyon
  • Considering deploying an instance at UC; if so, sites would be asked to publish information to it.
  • Matevz set up a system at SLAC. It is currently customized to provide real-time info: open files, source domain, destination domain, bytes read. More customization can be done. Need to decide what to do with the file-close info (a rich source of information that is currently dumped to a flat file). CMS feeds file-close info into Gratia.
  • See http://atl-prod05.slac.stanford.edu:4242 for real time info
  • Can all sites add the following line to the border data servers' (or proxy data servers') configuration file (/etc/xrootd/xrootd-clustered.cfg)? A way to verify the outgoing monitoring traffic is sketched at the end of this section.
xrootd.monitor all auth flush io 30s mbuff 1472 window 5s dest files io info user atl-prod05.slac.stanford.edu:9930
this meeting
  • Completed: atlas-swt2.org, hep.uiuc.edu, nhn.ou.edu, ochep.ou.edu, slac.stanford.edu, uchicago.edu
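
A hedged way to verify that the monitoring stream configured above is actually leaving a data server (the interface name is a placeholder; run as root):

tcpdump -i eth0 -n 'udp and dst host atl-prod05.slac.stanford.edu and dst port 9930'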

Ganglia monitoring information

last meeting:

  • Note from Artem: Hello Robert, we have made some progress since our previous talk. We have built RPMs; here is the link to the repo: http://t3mon-build.cern.ch/t3mon/, which contains rebuilt versions of ganglia and gweb. The Ganglia developers have released ganglia 3.2 and the new ganglia web (gweb); all our code was rechecked and works with this new software. It is better to install ganglia from our repo; instructions are here: https://svnweb.cern.ch/trac/t3mon/wiki/T3MONHome. About xrootd: we have created a daemonized version of the xrootd-summary-to-ganglia script, available at the moment at https://svnweb.cern.ch/trac/t3mon/wiki/xRootdAndGanglia; it sends xrootd summary metrics (http://xrootd.slac.stanford.edu/doc/prod/xrd_monitoring.htm#_Toc235610398) to the ganglia web interface. We also have an application that works with the xrootd summary stream, but at the moment we are not sure how best to present the fetched data; we collect user activity and accessed files, all within the site. Last week we installed one more xrd development cluster, and we are going to test whether it is possible to get, and then split, information about file transfers between sites and within one site. WBR Artem
  • Deployed at BNL, works.
  • Has anyone tried this out in the past week? It would be good to try it before software week to provide feedback.

this meeting:

Performance Studies

  • Network-latency-related performance tuning. There is a US ATLAS working group looking at ATLAS code for possible improvements. Doug is in the group.
  • Analysis IO performance Developer Summary Meeting Dec/15: https://indico.cern.ch/conferenceDisplay.py?confId=166930
  • Should send a request to the ROOT I/O group asking for a self-contained example to test on FAX; should find out what metrics the FAX group wants to see from the ROOT I/O group.
this meeting:

dq2-ls-global and dq2-list-files-global

last meeting:
  • Want dq2 client tools that can list the files in a dataset as GFNs (or against a local redirector), and check their existence in FAX or at the local site.
  • Hiro's poor man's version can be found at http://www.usatlas.bnl.gov/~hiroito/xrootd/dq2/; it works with containers.
  • RWG - I am using this for expanding tests across datasets - works great.
  • Hiro will find out who is in charge of the dq2-client
  • Will be available in the next dq2-client release.
  • Available in the latest dq2-client release
this meeting:

D3PD example

last meeting:
  • Get Shuwei's top D3PD example into HC (Doug?)
  • Doug will follow up in two weeks to see about getting this into HC, and the workbook updated. Need to drive this with real examples, with updated D3PDs, so the examples need to be updated for Release 17.
  • Doug: the goal is to get this into an HC test, with sites being able to replace the input datasets. It will be used by sites to compare the performance of reading from local and remote storage. Will follow up.
  • A non-HC example can be seen here: https://twiki.cern.ch/twiki/bin/view/AtlasProtected/SMWZd3pdExample
The data sets are:

[dbenjamin@atlas28 ~]$ dq2-ls -r user.bdouglas.physics_Egamma.SMWZd3pdExample.NTUP_SMWZ.f406_m991_p716

user.bdouglas.physics_Egamma.SMWZd3pdExample.NTUP_SMWZ.f406_m991_p716 INCOMPLETE: COMPLETE: BNL-OSG2_LOCALGROUPDISK ANL_LOCALGROUPDISK MWT2_UC_LOCALGROUPDISK SLACXRD_LOCALGROUPDISK NET2_LOCALGROUPDISK SWT2_CPB_LOCALGROUPDISK ANL-ATLAS-GRIDFTP1

[dbenjamin@atlas28 ~]$ dq2-ls -r user.bdouglas.physics_Muons.SMWZd3pdExample.NTUP_SMWZ.f406_m991_p716

user.bdouglas.physics_Muons.SMWZd3pdExample.NTUP_SMWZ.f406_m991_p716 INCOMPLETE: ANL-ATLAS-GRIDFTP1 COMPLETE: BNL-OSG2_LOCALGROUPDISK MWT2_UC_LOCALGROUPDISK SLACXRD_LOCALGROUPDISK NET2_LOCALGROUPDISK SWT2_CPB_LOCALGROUPDISK

  • Question for the next meeting: how will a site request to run this type of HC test? How can a site change the inputs? How to obtain performance metrics such as total time, etc.?
  • HC D3PD examples will be used as a standard performance benchmark
this meeting:

Checksumming

last meeting:
  • Wei: with 3.1, checksum is working for Xrootd proxy even when N2N is in use. Tested at SLAC at both T2 and T3. Should be straightforward for Posix sites.
  • Not sure about dCache sites. Probably need a plugin for dCache: a callout to obtain the checksum from a dCache system. Andy and Hiro will go through this at CERN.
  • Wei: direct reading and dq2-get (with whatever options) don't need checksums from remote sites.
  • On-hold
  • Rename this item to cover general checksumming issues instead of integrated checksumming.
  • Checksumming for native xrootd is basically solved
  • For posix - can adapt
  • For dCache - is there a plugin for checksum? It's there; we need to grab it.
  • Querying the remote site for checksumming
  • A wrapper script is needed (a hedged sketch follows after this list)
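
A hedged sketch of what such a wrapper could look like for a POSIX site, assuming adler32 and the xrdadler32 tool shipped with xrootd; the script path is a placeholder, and the directive syntax and the wrapper's output format should be checked against the xrootd manuals:

# in the xrootd configuration, hand checksum requests to an external program:
xrootd.chksum adler32 /usr/local/bin/xrd-chksum.sh

# and a minimal /usr/local/bin/xrd-chksum.sh:
#!/bin/sh
# print the adler32 checksum of the physical file passed as argument 1
xrdadler32 "$1" | awk '{print $1}'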
this meeting:

FRM script standardization

last meetings:
  • Standardize FRM scripts, including authorization, GUID passing, checksum validation and retries.
  • A few flavors possible.
  • Setup a twiki page just for this.

  • This brings up again the question of checking the completion of xprep commands. Failures do leave a .failed file. Are there tools to check the frm queues? Can we provide a tool for this?
  • Andy suggests setting up a webpage to monitor the frm queues, using the frm_admin command. Hiro will be looking into this.
  • a prototype of doing this:
# list pending FRM staging requests on every data server and sort them
# numerically on the second output column
for i in your_dataserver_1 your_dataserver_2; do
    ssh "$i" '
        export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:xrootd_lib_path
        export PATH=$PATH:xrootd_bin_path
        frm_admin -c your_xrootd_config_file -n your_xrootd_instance_name query xfrq stage lfn qwt
    '
done | sort -k2 -n -r
this meeting:

-- WeiYang - 04 Apr 2012
