MinutesFedXrootdFeb22



  • Attending: Rob, Shawn, Wei, Hiro, Ofer, Sarah, Thomas, David, Doug, Andy, Horst
  • Apologies:

Face-to-face meeting:

April 11-12, 2012, Gleacher Center, University of Chicago (downtown). Recommended hotel: http://cwp.marriott.com/chifd/uchicago/

FAX Status Dashboard


Meeting business

  • Twiki documentation locations
    • Some have difficulty accessing certain CERN twiki pages; the reason is unknown. Suggest putting the documentation on the BNL twiki, with a link from the CERN twiki to BNL (http, not https).
    • Not done yet.
    • RG will follow up.

Xrootd release 3.1.0 deployment

Summary of previous meetings:
  • Xrootd releases come out with some functional validation by stakeholders and large sites, but there is no formal release validation process.
  • 3.1.0 is the first "mature" release, so it is suggested that sites deploy it for the proxy function. WT2 has run its Solaris storage (regular xrootd) on 3.1.0 for a month and will migrate its Linux storage to 3.1.0 soon. WT2 has also run a 3.1.0 single proxy for a month. N2N works under 3.1.0.
  • CMS has abandoned the dcap plug-in for the Xrootd OFS; they use the dCache xrootd door directly or run Xrootd overlaid on dCache.
  • Known issues:
    • RPM updates overwrite /etc/init.d/{xrootd,cmsd}, which carry the LFC environment setup. That setup should go into /etc/sysconfig/xrootd, which survives RPM updates (a sketch follows this list). Patrick will test it.
    • A bug in 3.1.0 prevents setting up a proxy cluster; fixed in the git head.
    • xrdcp debugging continues between Doug and Andy.
    • A bug in the "sss" module causes permission-denied errors, probably when replacing an existing identity mapping. The workaround (in xrootdfs) is not to update existing identities.
last meeting:
  • Illinois - which runs over a dCache server - reporting.
  • UTA: crashing under certain conditions (trying to obtain a core file). Patrick will send configuration info to Andy. It seems to be related to a specific mode of operation in xrootd.

  • BNL is in wait-and-see mode for the proxy cluster. Andy: a patch release of 3.1.0 is coming, which should fix the proxy cluster problem.
  • UC Tier 3: adding more storage; will move to 3.1.0 after that.
  • SWT2: the N2N crashing issue is understood (a conflict in signal usage between regular xrootd and Globus). The solution is either to use a proxy, or to run regular xrootd with "async off" (see the sketch after this list).
  • Wei contacted Lukasz about a bug-fix release for 3.1.0. Lukasz is busy getting the first draft of the new Xroot client out in January.
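A hedged configuration sketch of the "async off" workaround mentioned above; only the single directive matters, and its placement in a site's configuration file is an assumption:

# In the data server's xrootd configuration: disable async I/O to avoid the
# signal-usage conflict between regular xrootd and the Globus/LFC libraries
# loaded by the N2N plugin (only relevant when not running behind a proxy).
xrootd.async off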
this meeting:
  • Defer deployment since a new 3.1.1 is coming
  • 3.1.1rc1 is in testing

dq2-ls-global and dq2-list-files-global

last meeting:
  • Want dq2 client tools that can list the files in a dataset by GFN (or via the local redirector) and check their existence in FAX or at the local site (a rough sketch follows this list).
  • Hiro's poor man's version can be found at http://www.usatlas.bnl.gov/~hiroito/xrootd/dq2/; it works with containers.
  • RWG: I am using this for expanding tests across datasets; it works great.
  • Hiro will find out who is in charge of the dq2-client
  • Will be available in the next dq2-client release.
  • Available in the latest dq2-client release
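Roughly what such a check amounts to, sketched below. This is not Hiro's tool nor the dq2-client implementation; the redirector host and the GFN list file are placeholders, and "existfile" is a command of the xrd admin client shipped with xrootd 3.x:

# Probe a list of global file names (one per line in gfn-list.txt) against a
# federation redirector; REDIRECTOR is a placeholder host:port.
REDIRECTOR=atlas-xrd.example.org:1094
while read gfn; do
    echo "== $gfn"
    xrd $REDIRECTOR existfile $gfn
done < gfn-list.txt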
this meeting:

ANALY queue

last meeting:
  • Subscribed to SLAC and UC; will do SWT2_CPB. The subscription to SLAC is jammed behind 2000+ datasets from other users.

this meeting:

  • Working on TTreeCache settings for jobs sent to SWT2, SLAC and UC
  • Working with a few datasets (1.1M events, 120 files)
  • Results:
    job at UC, data at xrd.mwt2.org (local):
    real	10m35.387s
    user	4m39.398s
    sys	0m8.639s
    job at UC, data at atl-prod09.slac.stanford.edu RTT=52.1 ms:
    real	130m52.543s
    user	4m23.163s
    sys	0m8.343s
  • Trying new settings suggested by Jack Cranshaw.

Performance studies

  • Network-latency-related performance tuning. There is a US ATLAS working group looking at ATLAS code for possible improvements. Doug is in the group.
  • Analysis IO performance Developer Summary Meeting Dec/15: https://indico.cern.ch/conferenceDisplay.py?confId=166930
  • Should send a request to the ROOT I/O group asking for a self-contained example to test on FAX; should find out what metrics the FAX group wants to see from the ROOT I/O group.
this meeting:


  • Andy: the code is ready in 3.1.0. Wei (and Doug?) will test it.

D3PD example

last meeting:
  • Get Shuwei's top D3PD example into HC (Doug?).
  • Doug will follow up in two weeks to see about getting this into HC and the workbook updated. Need to drive this with real examples and updated D3PDs, so the examples need to be updated for Rel 17.
  • Doug: the goal is to get this into an HC test, with sites able to replace the input datasets. It will be used by sites to compare the performance of reading from local and remote storage. Will follow up.
  • A non-HC example can be seen here: https://twiki.cern.ch/twiki/bin/view/AtlasProtected/SMWZd3pdExample
The data sets are:

[dbenjamin@atlas28 ~]$ dq2-ls -r user.bdouglas.physics_Egamma.SMWZd3pdExample.NTUP_SMWZ.f406_m991_p716


[dbenjamin@atlas28 ~]$ dq2-ls -r user.bdouglas.physics_Muons.SMWZd3pdExample.NTUP_SMWZ.f406_m991_p716


this meeting:

  • Question for next meeting: How will a site request to run this type of HC test? How can a site change the inputs? How can performance metrics such as total time be obtained?
  • HC D3PD examples will be used as a standard performance benchmark.


N2N

Summary of last week(s)
  • See further https://twiki.cern.ch/twiki/bin/viewauth/Atlas/AtlasXrootdSystems
  • Decided to continue improving the current N2N and leave GUID-based lookup as a future option. Chicago can keep the source of N2N in CVS for now; send updates to Rob. Wei can compile.
  • Doug's use case: look up files that exist at BNL but that N2N can't find. Hiro: needs to change the code slightly; will do. This probably only happens at BNL and has to do with the way panda writes output to BNL.
  • Complaints about a possible memory leak in N2N. A standalone package was provided to Andy for debugging.
  • Hiro: update N2N for the BNL special cases. Doug will test whether this improves the hit rate to near 100%.
  • The N2N crashing issue has a solution; see MinutesFedXrootdJan11#ANALY_queue
  • The Fermi Gamma-ray experiment at SLAC also sees the proxy memory footprint grow. It releases memory during periods of no activity and crashes otherwise. Wei will get more info.
this meeting:
  • Debugged N2N; found no memory leak.
  • Changed SLAC's configuration from a native xrootd proxy (cluster) to a cluster of regular xrootd with N2N on top of xrootdfs. This allows better observation of where memory grows.


Checksumming

last meeting:
  • Wei: with 3.1, checksumming works for the Xrootd proxy even when N2N is in use. Tested at SLAC at both the T2 and the T3. Should be straightforward for POSIX sites.
  • Not sure about dCache sites; they probably need a plugin, i.e. a callout to obtain the checksum from the dCache system. Andy and Hiro will go through this at CERN.
  • Wei: direct reading and dq2-get (and variants) don't need checksums from remote sites.
  • On-hold
  • Rename this item to discuss general issues with checksumming instead of integrated checksumming.
  • Checksumming for native xrootd is basically solved.
  • POSIX sites can adapt the same approach.
  • For dCache: is there a plugin for checksums? It exists; it needs to be grabbed.
  • Querying the remote site for checksums: a wrapper script is needed (see the sketch after this list).
  • MWT2 and AGLT2 will evaluate dCache's Xrootd door, and will look at a checksum solution if the door is found useful.
  • Sarah: the dCache xrootd door's performance is similar to the dCap door's.
  • site diagrams: FAX-site-diagram.pptx
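A hedged sketch of such a wrapper, assuming the xrdadler32 utility shipped with xrootd; the paths and redirector host are placeholders, and a real wrapper would query the remote site's stored checksum rather than re-reading the whole file over the WAN as this does:

# Compare a local file's adler32 with the same file read through the federation.
LOCAL=/data/atlas/some/file.root
REMOTE=root://fax-redirector.example.org:1094//atlas/some/file.root
local_sum=$(xrdadler32 $LOCAL | awk '{print $1}')
remote_sum=$(xrdadler32 $REMOTE | awk '{print $1}')
[ "$local_sum" = "$remote_sum" ] && echo OK || echo MISMATCH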
this meeting:

FRM script standardization

last meetings:
  • Standardize FRM scripts, including authorization, GUID passing, checksum validation and retries.
  • A few flavors possible.
  • Set up a twiki page just for this.

  • This brings up the question again of checking the completion of xprep commands. Failures do leave a .failed file (a quick check for these is sketched after the prototype below). Are there tools to check the frm queues? Can we provide a tool for this?
  • Andy suggests setting up a webpage to monitor the frm queues via the frm_admin command. Hiro will be looking into this.
  • A prototype for querying the queues on all data servers:
# Query the FRM staging queue on every data server, then sort by queue wait time.
for i in all_your_data_servers; do
    ssh $i '
        export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/path/to/xrootd/lib
        export PATH=$PATH:/path/to/xrootd/bin
        frm_admin -c your_xrootd_config_file -n your_xrootd_instance_name query xfrq stage lfn qwt
    '
done | sort -k2 -n -r
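A small companion sketch for the .failed markers mentioned above; the data-area path is a placeholder and the exact marker name and location depend on the frm configuration:

# List failure markers left behind in the last 24 hours under the data area.
find /xrootd/data -name '*.failed' -mmin -1440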
this meeting:

cmsd + dCache/xrootd door

last meeting:
  • An updated cmsd that will work with the native dCache xrootd door (Andy?).
  • A caching mechanism to allow the lookup done by the cmsd N2N plugin to be usable by the xrootd door (either the dCache or the Xrootd version) (Andy/Hiro/Wei/?).
  • Redirect to the dCache xrootd door, which will do the lookup and cache it in memcached. The cmsd will need the N2N plugin; N2N must write to something the dCache sites can read.
  • Hiro will look into this; not on the critical path.
  • On hold.
  • See Paul Millar's talk at Lyon: http://indico.in2p3.fr/contributionDisplay.py?contribId=12&confId=5527
  • See MinutesFedXrootdJan11#Checksumming
this meeting:

Authorization plugin (Hiro)

last meeting:

  • A "authorization" plugin for the dCache/xrootd door which uses the cached GFN->LFN information to correctly respond to GFN requests (Hiro/Shawn/?)
  • On-hold.
  • see MinutesFedXrootdJan11#Checksumming
this meeting:

Sharing Configurations

last meeting: this meeting:
  • Hiro will work on a Java API for LFC first.



Ganglia monitoring information

last meeting:

  • Note from Artem: Hello Robert, we've managed to make some progress since our previous talk. We build RPMs; here is the link to the repo: http://t3mon-build.cern.ch/t3mon/, which contains our rebuilt versions of ganglia and gweb. The Ganglia people have released ganglia 3.2 and the new ganglia web (gweb); all our stuff was rechecked and works with this new software. It is better to install ganglia from our repo; instructions are here: https://svnweb.cern.ch/trac/t3mon/wiki/T3MONHome. About xrootd: we have created a daemonized version of the xrootd-summary-to-ganglia script. It is available at the moment at https://svnweb.cern.ch/trac/t3mon/wiki/xRootdAndGanglia; it sends xrootd summary metrics (http://xrootd.slac.stanford.edu/doc/prod/xrd_monitoring.htm#_Toc235610398) to the ganglia web interface. We also have an application which works with the xrootd summary stream, but at the moment we are not sure how best to present the fetched data; we collect user activity and accessed files there, all within the site. Last week we installed one more xrd development cluster and we are going to test whether it is possible to get, and then split, information about file transfers between sites and within one site. WBR Artem
  • Deployed at BNL, works.
  • Has anyone tried this out in the past week? It would be good to try it before software week to provide feedback.

this meeting:
  • See https://svnweb.cern.ch/trac/t3mon/wiki/xRootdAndGangliaDetailed. Set up Ganglia-based (db-less) detailed monitoring at SLAC. Currently the only metric provided in a db-less setup is the last activity date and time; want to know what other metrics can be made available via Ganglia. The developer will update the documentation and provide the table schema for the PostgreSQL DB backend.
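For reference, a minimal sketch of how a summary value reaches Ganglia, using the standard gmetric command-line tool; the metric name and value are placeholders, and the t3mon daemon does this parsing and publishing automatically from the xrootd summary stream:

# Publish one xrootd summary value to the local ganglia gmond.
gmetric --name xrootd_connections --value 42 --type uint32 --units connections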

Monalisa monitoring

  • Discussions with Matevz Tadel (USCMS, UCSD) at Lyon
  • Considering deploying an instance at UC; if so would ask sites to publish information to it.
  • Matevz will visit SLAC on Feb 1-2; will ask for a demo. Interested in the detailed monitoring work.
  • Matevz is working on setting it up.
this meeting:
  • Matevz set up a system at SLAC. Currently it is customized to provide real-time info: open files, source domain, destination domain, bytes read. More customization can be done. Need to decide what to do with the file-close info (a rich source of information that is currently dumped to a flat file). CMS feeds file-close info to Gratia.

-- WeiYang - 22 Feb 2012
