
MinutesFedXrootdFeb8

Coordinates

  • Attending: Patrick, Rob, Shawn, Wei, Hiro, Ofer, Sarah, David, Doug, Andy
  • Apologies: Horst

FAX Status Dashboard

Background

Meeting business

  • Twiki documentation locations
    • Some people have difficulty accessing certain CERN twiki pages; the reason is unknown. Suggestion: put the documentation on the BNL twiki, with a link from the CERN twiki to BNL (http, not https).
    • Not done yet.
    • RG will follow up.

Xrootd release 3.1.0 deployment

Summary of previous meetings:
  • Xrootd releases come out with some functional validation by stakeholders and large sites, but there is no formal release validation process.
  • 3.1.0 is the first "mature" release, so sites are advised to deploy it for the proxy function. WT2 migrated its Solaris storage (regular xrootd) to 3.1.0 a month ago and will migrate its Linux storage to 3.1.0 soon. WT2 has also run a 3.1.0 single proxy for a month. N2N works under 3.1.0.
  • CMS abandoned the dcap plug-in for the Xrootd OFS. They use the dCache xrootd door directly, or Xrootd overlaid on dCache.
  • Known issues:
    • RPM updates overwrite /etc/init.d/{xrootd,cmsd}, which carry the LFC environment setup. That setup should go into /etc/sysconfig/xrootd, which survives RPM updates (see the sketch after this list). Patrick will test it.
    • A bug in 3.1.0 prevents setting up a proxy cluster. Fixed in git head.
    • xrdcp debugging continues between Doug and Andy.
    • An "sss" module bug causes permission-denied errors, probably when replacing an existing identity mapping. The workaround (in xrootdfs) is to not update existing identities.
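A minimal sketch of this approach (an assumption, not Patrick's tested setup): keep the LFC-related environment in /etc/sysconfig/xrootd, which the RPM init scripts source and which is not overwritten on update. The variable names and values below are examples only.

# append site-specific environment for the xrootd/cmsd init scripts
cat <<'EOF' >> /etc/sysconfig/xrootd
export LFC_HOST=lfc.example.org        # placeholder LFC host used by the N2N plugin
export LFC_HOME=/grid/atlas            # placeholder LFC base path
export X509_USER_PROXY=/tmp/x509up_xrootd   # placeholder proxy location
EOF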
last meeting:
  • Illinois: reporting on xrootd running over a dCache server.
  • UTA: xrootd is crashing under certain conditions (trying to generate a core file). Patrick will send configuration info to Andy. It seems to be related to a specific mode of operation in xrootd.

  • BNL is in wait-and-see mode on the proxy cluster. Andy: a patch release of 3.1.0 is coming, which should fix the proxy cluster problem.
  • UC Tier 3: adding more storage; will move to 3.1.0 after that.
  • SWT2: the N2N crashing issue is understood (a conflict in signal usage between regular xrootd and Globus). The solution is either to use a proxy, or to run regular xrootd with "async off" (see the sketch at the end of this section).
  • Wei contacted Lukasz about a bug-fix release for 3.1.0. Lukasz was busy getting the first draft of the new Xrootd client out in January.
this meeting:
  • Defer deployment since a new 3.1.1 is coming
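A hedged illustration of the "async off" workaround mentioned above (not a site's actual configuration): the xrootd.async directive is added to the regular, non-proxy server configuration and the service restarted. The config file path is the RPM default and may differ at your site.

# disable async I/O to avoid the signal-usage conflict with Globus/LFC (N2N)
echo "xrootd.async off" >> /etc/xrootd/xrootd-clustered.cfg
service xrootd restart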

dq2-ls-global and dq2-list-files-global

last meeting:
  • We want dq2 client tools that can list the files of a dataset as GFNs (or via the local redirector) and check for their existence in FAX or at the local site (a rough existence-check sketch appears at the end of this section).
  • Hiro's poor man's version can be found at http://www.usatlas.bnl.gov/~hiroito/xrootd/dq2/; it works with containers.
  • RWG: I am using this for expanding tests across datasets - works great.
  • Hiro will find out who is in charge of the dq2-client.
  • Will be available in the next dq2-client release.
this meeting:
  • Available in the latest dq2-client release
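A rough sketch of the kind of existence check described above (an assumption, not Hiro's actual script): read global file names (GFNs) from a text file and ask a redirector about each one. The redirector host and input file are placeholders, and the exact output of "xrd ... existfile" may vary between xrootd releases.

REDIRECTOR=glrd.example.org      # placeholder global redirector host
while read -r gfn; do
    echo "== $gfn"
    xrd "$REDIRECTOR" existfile "$gfn"
done < gfn_list.txt              # placeholder: one GFN per line, e.g. from dq2-list-files-global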

ANALY queue

last meeting:
  • Rob ran interactive test jobs against glrd, MWT2, Illinois, AGLT2 and BNL. The first try against glrd was slow (probably redirected to BNL); not surprisingly, BNL is slow. Subsequent tests against glrd were faster (probably redirected to other sites).
  • To run in a Panda queue, Dan van der Ster suggested using prun with --pfnList to supply a list of files for the jobs (the list coming from dq2-list-files-global), but there may still be a dependency on the site having the datasets, even though reading points to glrd. Doug: that may not be the case. Rob will try (a hedged prun sketch appears at the end of this section).
  • Rob's tests have a very high success rate, with an obvious long-distance effect. The input data sample is small; a test with a large data sample may be useful to reveal problems.
  • WT2 will deploy more proxy nodes to further reduce the bottleneck at the proxy. This may help isolate latency-related issues.
  • (Placeholder) Where to write output? A small write space in the federation, or other solutions. Doug is looking at the possibility of having a small xrootd space at BNL for job output.
  • A federated space for writing is out of the scope of this working group and should be discussed at the facility meeting.
  • Extending this work: not much progress since the Lyon presentation.
  • Have tried TTreeCache tuning; will need guidance since the first attempt made things worse.
  • Need to subscribe the following datasets to sites:
mc10_7TeV.116700.PowHegPythia_ggH110_WW2lep.merge.NTUP_SMWZ.e773_s933_s946_r2302_r2300_p591/
mc10_7TeV.116701.PowHegPythia_ggH115_WW2lep.merge.NTUP_SMWZ.e773_s933_s946_r2302_r2300_p591/
mc10_7TeV.116702.PowHegPythia_ggH120_WW2lep.merge.NTUP_SMWZ.e773_s933_s946_r2302_r2300_p591/
mc10_7TeV.116703.PowHegPythia_ggH125_WW2lep.merge.NTUP_SMWZ.e773_s933_s946_r2302_r2300_p591/
mc10_7TeV.116613.PowHegPythia_ggH130_WW2lep.merge.NTUP_SMWZ.e773_s933_s946_r2302_r2300_p591/
mc10_7TeV.116704.PowHegPythia_ggH135_WW2lep.merge.NTUP_SMWZ.e773_s933_s946_r2302_r2300_p591/
mc10_7TeV.116705.PowHegPythia_ggH140_WW2lep.merge.NTUP_SMWZ.e773_s933_s946_r2302_r2300_p591/
mc10_7TeV.116706.PowHegPythia_ggH145_WW2lep.merge.NTUP_SMWZ.e773_s933_s946_r2302_r2300_p591/
mc10_7TeV.116707.PowHegPythia_ggH150_WW2lep.merge.NTUP_SMWZ.e773_s933_s946_r2302_r2300_p591/
mc10_7TeV.116708.PowHegPythia_ggH155_WW2lep.merge.NTUP_SMWZ.e773_s933_s946_r2302_r2300_p591/
mc10_7TeV.116614.PowHegPythia_ggH160_WW2lep.merge.NTUP_SMWZ.e773_s933_s946_r2302_r2300_p591/
mc10_7TeV.116709.PowHegPythia_ggH165_WW2lep.merge.NTUP_SMWZ.e773_s933_s946_r2302_r2300_p591/
mc10_7TeV.116710.PowHegPythia_ggH170_WW2lep.merge.NTUP_SMWZ.e773_s933_s946_r2302_r2300_p591/
mc10_7TeV.116711.PowHegPythia_ggH175_WW2lep.merge.NTUP_SMWZ.e773_s933_s946_r2302_r2300_p591/
mc10_7TeV.116712.PowHegPythia_ggH180_WW2lep.merge.NTUP_SMWZ.e773_s933_s946_r2302_r2300_p591/
mc10_7TeV.116713.PowHegPythia_ggH185_WW2lep.merge.NTUP_SMWZ.e773_s933_s946_r2302_r2300_p591/
mc10_7TeV.116714.PowHegPythia_ggH190_WW2lep.merge.NTUP_SMWZ.e773_s933_s946_r2302_r2300_p591/
mc10_7TeV.116715.PowHegPythia_ggH195_WW2lep.merge.NTUP_SMWZ.e773_s933_s946_r2302_r2300_p591/
mc10_7TeV.116716.PowHegPythia_ggH200_WW2lep.merge.NTUP_SMWZ.e773_s933_s946_r2302_r2300_p591/
mc10_7TeV.107680.AlpgenJimmyWenuNp0_pt20.merge.NTUP_SMWZ.e600_s933_s946_r2302_r2300_p591/
mc10_7TeV.107681.AlpgenJimmyWenuNp1_pt20.merge.NTUP_SMWZ.e600_s933_s946_r2302_r2300_p591/
mc10_7TeV.107682.AlpgenJimmyWenuNp2_pt20.merge.NTUP_SMWZ.e760_s933_s946_r2302_r2300_p591/
mc10_7TeV.107683.AlpgenJimmyWenuNp3_pt20.merge.NTUP_SMWZ.e760_s933_s946_r2302_r2300_p591/
mc10_7TeV.107684.AlpgenJimmyWenuNp4_pt20.merge.NTUP_SMWZ.e760_s933_s946_r2302_r2300_p591/
mc10_7TeV.107685.AlpgenJimmyWenuNp5_pt20.merge.NTUP_SMWZ.e760_s933_s946_r2302_r2300_p591/
mc10_7TeV.107690.AlpgenJimmyWmunuNp0_pt20.merge.NTUP_SMWZ.e600_s933_s946_r2302_r2300_p591/
mc10_7TeV.107691.AlpgenJimmyWmunuNp1_pt20.merge.NTUP_SMWZ.e600_s933_s946_r2302_r2300_p591/
mc10_7TeV.107692.AlpgenJimmyWmunuNp2_pt20.merge.NTUP_SMWZ.e760_s933_s946_r2302_r2300_p591/
mc10_7TeV.107693.AlpgenJimmyWmunuNp3_pt20.merge.NTUP_SMWZ.e760_s933_s946_r2302_r2300_p591/
mc10_7TeV.107694.AlpgenJimmyWmunuNp4_pt20.merge.NTUP_SMWZ.e760_s933_s946_r2302_r2300_p591/
mc10_7TeV.107695.AlpgenJimmyWmunuNp5_pt20.merge.NTUP_SMWZ.e760_s933_s946_r2302_r2300_p591/
mc10_7TeV.107054.PythiaWtaunu_incl.merge.NTUP_SMWZ.e574_s934_s946_r2310_r2300_p591/
mc10_7TeV.107650.AlpgenJimmyZeeNp0_pt20.merge.NTUP_SMWZ.e737_s933_s946_r2302_r2300_p591/
mc10_7TeV.107651.AlpgenJimmyZeeNp1_pt20.merge.NTUP_SMWZ.e737_s933_s946_r2302_r2300_p591/
mc10_7TeV.107652.AlpgenJimmyZeeNp2_pt20.merge.NTUP_SMWZ.e737_s933_s946_r2302_r2300_p591/
mc10_7TeV.107653.AlpgenJimmyZeeNp3_pt20.merge.NTUP_SMWZ.e737_s933_s946_r2302_r2300_p591/
mc10_7TeV.107654.AlpgenJimmyZeeNp4_pt20.merge.NTUP_SMWZ.e737_s933_s946_r2302_r2300_p591/
mc10_7TeV.107655.AlpgenJimmyZeeNp5_pt20.merge.NTUP_SMWZ.e737_s933_s946_r2302_r2300_p591/
mc10_7TeV.107660.AlpgenJimmyZmumuNp0_pt20.merge.NTUP_SMWZ.e737_s933_s946_r2302_r2300_p591/
mc10_7TeV.107661.AlpgenJimmyZmumuNp1_pt20.merge.NTUP_SMWZ.e737_s933_s946_r2302_r2300_p591/
mc10_7TeV.107662.AlpgenJimmyZmumuNp2_pt20.merge.NTUP_SMWZ.e737_s933_s946_r2302_r2300_p591/
mc10_7TeV.107663.AlpgenJimmyZmumuNp3_pt20.merge.NTUP_SMWZ.e737_s933_s946_r2302_r2300_p591/
mc10_7TeV.107664.AlpgenJimmyZmumuNp4_pt20.merge.NTUP_SMWZ.e737_s933_s946_r2302_r2300_p591/
mc10_7TeV.107665.AlpgenJimmyZmumuNp5_pt20.merge.NTUP_SMWZ.e737_s933_s946_r2302_r2300_p591/
mc10_7TeV.116250.AlpgenJimmyZeeNp0_Mll10to40_pt20.merge.NTUP_SMWZ.e660_s933_s946_r2302_r2300_p591/
mc10_7TeV.116251.AlpgenJimmyZeeNp1_Mll10to40_pt20.merge.NTUP_SMWZ.e660_s933_s946_r2302_r2300_p591/
mc10_7TeV.116252.AlpgenJimmyZeeNp2_Mll10to40_pt20.merge.NTUP_SMWZ.e660_s933_s946_r2302_r2300_p591/
mc10_7TeV.116253.AlpgenJimmyZeeNp3_Mll10to40_pt20.merge.NTUP_SMWZ.e660_s933_s946_r2302_r2300_p591/
mc10_7TeV.116254.AlpgenJimmyZeeNp4_Mll10to40_pt20.merge.NTUP_SMWZ.e660_s933_s946_r2302_r2300_p591/
mc10_7TeV.116255.AlpgenJimmyZeeNp5_Mll10to40_pt20.merge.NTUP_SMWZ.e660_s933_s946_r2302_r2300_p591/
mc10_7TeV.116260.AlpgenJimmyZmumuNp0_Mll10to40_pt20.merge.NTUP_SMWZ.e660_s933_s946_r2302_r2300_p591/
mc10_7TeV.116261.AlpgenJimmyZmumuNp1_Mll10to40_pt20.merge.NTUP_SMWZ.e660_s933_s946_r2302_r2300_p591/

  • Rob will subscribe the above datasets to selected sites.
this meeting:
  • Subscribed to SLAC and UC; SWT2_CPB will be done next. The subscription to SLAC is jammed behind 2000+ datasets from other users.
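A hedged sketch of the prun + --pfnList approach mentioned above; the option names reflect the panda-client of the time, the output format of dq2-list-files-global is assumed to be one PFN per line, and readD3PD.C is a hypothetical user macro, so check prun --help before relying on this.

# build a PFN list for one of the datasets above, then submit with prun
dq2-list-files-global mc10_7TeV.116700.PowHegPythia_ggH110_WW2lep.merge.NTUP_SMWZ.e773_s933_s946_r2302_r2300_p591/ > pfns.txt
prun --exec 'root -l -b -q "readD3PD.C(\"%IN\")"' \
     --pfnList pfns.txt \
     --nFilesPerJob 5 \
     --outDS user.yourname.fax.pfnlist.test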

Performance Studying

  • Network-latency-related performance tuning: there is a US ATLAS working group looking at ATLAS code for possible improvements. Doug is in the group.
  • Analysis I/O performance Developer Summary Meeting, Dec 15: https://indico.cern.ch/conferenceDisplay.py?confId=166930
  • Should send a request to the ROOT I/O group asking for a self-contained example to test in FAX, and find out what metrics the FAX group wants to see from the ROOT I/O group.
this meeting:

X509:

  • Andy: the code is ready in 3.1.0. Wei (and Doug?) will test it; a client-side test sketch follows.
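A minimal client-side sketch of such a test (an assumption, not the agreed test plan): authenticate to a GSI-enabled 3.1.0 server with a VOMS proxy and copy a file. The host and path are placeholders.

voms-proxy-init -voms atlas                  # obtain a VOMS proxy
export XrdSecPROTOCOL=gsi                    # ask the xrootd client to use GSI/X509
xrdcp -d 1 root://xrootd.example.org:1094//atlas/path/to/testfile /tmp/x509-test-file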

D3PD example

last meeting:
  • Get Shuwei's top D3PD example into HC (Doug?).
  • Doug will follow up in two weeks about getting this into HC and updating the workbook. Need to drive this with real examples and updated D3PDs, so the examples need to be updated for Release 17.
  • Doug: the goal is to get this into an HC test, with sites being able to replace the input datasets. It will be used by sites to compare the performance of reading from local and remote storage. Will follow up.
  • A non-HC example can be seen here: https://twiki.cern.ch/twiki/bin/view/AtlasProtected/SMWZd3pdExample
The data sets are:

[dbenjamin@atlas28 ~]$ dq2-ls -r user.bdouglas.physics_Egamma.SMWZd3pdExample.NTUP_SMWZ.f406_m991_p716

user.bdouglas.physics_Egamma.SMWZd3pdExample.NTUP_SMWZ.f406_m991_p716 INCOMPLETE: COMPLETE: BNL-OSG2_LOCALGROUPDISK ANL_LOCALGROUPDISK MWT2_UC_LOCALGROUPDISK SLACXRD_LOCALGROUPDISK NET2_LOCALGROUPDISK SWT2_CPB_LOCALGROUPDISK ANL-ATLAS-GRIDFTP1

[dbenjamin@atlas28 ~]$ dq2-ls -r user.bdouglas.physics_Muons.SMWZd3pdExample.NTUP_SMWZ.f406_m991_p716

user.bdouglas.physics_Muons.SMWZd3pdExample.NTUP_SMWZ.f406_m991_p716 INCOMPLETE: ANL-ATLAS-GRIDFTP1 COMPLETE: BNL-OSG2_LOCALGROUPDISK MWT2_UC_LOCALGROUPDISK SLACXRD_LOCALGROUPDISK NET2_LOCALGROUPDISK SWT2_CPB_LOCALGROUPDISK

this meeting:

  • Question for next meeting: how will a site request to run this type of HC test? How can a site change the inputs? How to obtain performance metrics such as total time, etc.?
  • The HC D3PD examples will be used as a standard performance benchmark.

N2N

Summary of last week(s)
  • See further https://twiki.cern.ch/twiki/bin/viewauth/Atlas/AtlasXrootdSystems
  • Decided to continue improving the current N2N and leave GUID-based lookup as a future option. Chicago can keep the N2N source in CVS for now; send updates to Rob. Wei can compile it.
  • Doug's use case: looking up files that exist at BNL but that N2N can't find. Hiro: need to change the code slightly; will do. This probably only happens at BNL, and has to do with the way Panda writes output to BNL.
  • Complaints about a possible memory leak in N2N. A standalone package was provided to Andy for debugging.
  • Hiro: update N2N for the BNL special cases. Doug will test whether this improves the hit rate to near 100%.
  • The N2N crashing issue has a solution; see MinutesFedXrootdJan11#ANALY_queue.
  • The Fermi Gamma-ray experiment at SLAC also sees the proxy memory footprint grow. It releases memory when there is a period of no activity, and crashes otherwise. Wei will get more info.
this meeting:
  • Debugged N2N; found no memory leak.
  • Changed SLAC's configuration from an xrootd native proxy (cluster) to a cluster of regular xrootd with N2N on top of xrootdfs. This allows better observation of where the memory grows (a hedged monitoring sketch follows).
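A simple way one might watch for such memory growth (a hedged sketch, not necessarily what SLAC used): record the resident memory of the xrootd, cmsd and xrootdfs processes at intervals and look for the component that grows.

# log RSS/VSZ of the relevant daemons every 10 minutes
# (process names may differ at your site, e.g. for the xrootdfs daemon)
while true; do
    date
    ps -o pid,rss,vsz,comm -C xrootd,cmsd,xrootdfs
    sleep 600
done >> /var/tmp/xrootd-memory.log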

Checksumming

last meeting:
  • Wei: with 3.1, checksumming works for the Xrootd proxy even when N2N is in use. Tested at SLAC at both the T2 and T3. Should be straightforward for Posix sites.
  • Not sure about dCache sites; probably need a plugin for dCache, i.e. a callout to obtain the checksum from a dCache system. Andy and Hiro will go through this at CERN.
  • Wei: direct reading and dq2-get (-whatever) don't need checksums from remote sites.
  • On hold.
  • Rename this item to cover general issues with checksumming instead of integrated checksumming.
  • Checksumming for native xrootd is basically solved.
  • For Posix sites - can adapt.
  • For dCache - is there a plugin for checksums? It's there; we need to grab it.
  • Querying the remote site for the checksum.
  • A wrapper script is needed (a hedged sketch follows this list).
  • MWT2 and AGLT2 will evaluate dCache's Xrootd door, and will look at a checksum solution after it proves useful.
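A hedged sketch of the kind of wrapper script discussed above: compare the adler32 of a locally staged file with that of the source copy at the remote site, using the xrdadler32 utility shipped with xrootd. The host, paths and output parsing are placeholders/assumptions to adjust for your release.

SRC="root://remote-site.example.org:1094//atlas/path/to/source/file"   # placeholder source URL
DST="/local/path/to/staged/file"                                       # placeholder local copy
remote_sum=$(xrdadler32 "$SRC" | awk '{print $1}')    # first field assumed to be the checksum
local_sum=$(xrdadler32 "$DST" | awk '{print $1}')
if [ "$remote_sum" = "$local_sum" ]; then
    echo "checksum OK: $local_sum"
else
    echo "checksum MISMATCH: local=$local_sum remote=$remote_sum" >&2
    exit 1
fi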
this meeting:

FRM script standardization

last meetings:
  • Standardize the FRM scripts, including authorization, GUID passing, checksum validation and retries.
  • A few flavors are possible.
  • Set up a twiki page just for this.

  • This brings up again the question of checking the completion of xprep commands. Failures do leave a .failed file. Are there tools to check the frm queues? Can we provide a tool for this?
  • Andy suggests setting up a webpage to monitor the frm queues, based on the frm_admin command. Hiro will be looking into this.
  • A prototype of doing this (replace the capitalized placeholders with your own host list, paths, config file and instance name):
# query the FRM transfer queue on every data server and sort by queue wait time
for host in DATASERVER1 DATASERVER2 DATASERVER3; do
    ssh "$host" '
        export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/XROOTD_LIB_PATH
        export PATH=$PATH:/XROOTD_BIN_PATH
        frm_admin -c /XROOTD_CONFIG_FILE -n XROOTD_INSTANCE_NAME query xfrq stage lfn qwt
    '
done | sort -k2 -n -r
this meeting:

cmsd + dCache/xrootd door

last meeting:
  • An updated cmsd that will work with the native dCache/xrootd door (Andy?).
  • A caching mechanism to allow the lookup done by the cmsd N2N plugin to be usable by the xrootd door (either the dCache or the Xrootd version) (Andy/Hiro/Wei/?).
  • Redirect to the xrootd-dCache door; it will do the lookup and cache the result in memcached. The cmsd will need the N2N plugin, and N2N must write to something the dCache sites can read.
  • Hiro will look into this; not on the critical path.
  • On hold.
  • See Paul Millar's talk at Lyon: http://indico.in2p3.fr/contributionDisplay.py?contribId=12&confId=5527
  • See MinutesFedXrootdJan11#Checksumming
this meeting:

Authorization plugin (Hiro)

last meeting:

  • An "authorization" plugin for the dCache/xrootd door which uses the cached GFN->LFN information to correctly respond to GFN requests (Hiro/Shawn/?).
  • On hold.
  • See MinutesFedXrootdJan11#Checksumming
this meeting:

Sharing Configurations

last meeting:
this meeting:
  • Hiro will work on a Java API for LFC first.

Monitoring

last meeting:
this meeting:

Ganglia monitoring information

last meeting:

  • Note from Artem: "Hello Robert, we've managed to make some progress since our previous talk. We built RPMs; here is a link to the repo: http://t3mon-build.cern.ch/t3mon/. It contains rebuilt versions of ganglia and gweb. The Ganglia people have released ganglia 3.2 and the new ganglia web (gweb); all our stuff was rechecked and works with this new software. It is better to install ganglia from our repo; instructions are here: https://svnweb.cern.ch/trac/t3mon/wiki/T3MONHome.
    About xrootd: we have created a daemonized version of the xrootd-summary-to-ganglia script. It is available at the moment at https://svnweb.cern.ch/trac/t3mon/wiki/xRootdAndGanglia, and it sends the xrootd summary metrics (http://xrootd.slac.stanford.edu/doc/prod/xrd_monitoring.htm#_Toc235610398) to the ganglia web interface. We also have an application that works with the xrootd summary stream, but at the moment we are not sure how best to present the fetched data; we collect user activity and accessed files, all within the site. Last week we installed one more xrd development cluster, and we are going to test whether it is possible to get and then split information about file transfers between sites and within one site. WBR, Artem"
  • Deployed at BNL; it works (see the hedged xrd.report note after this list).
  • Has anyone tried this out in the past week? It would be good to try it before software week to provide feedback.
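For reference, the summary-metrics stream consumed by the t3mon script is the one produced by the xrd.report directive in the xrootd configuration. A hedged pointer (host, port, interval and option list are placeholders; check the xrd_monitoring document linked above for the exact syntax):

# send xrootd summary reports to the host running the t3mon/ganglia collector
echo "xrd.report collector.example.org:9931 every 60s all sync" >> /etc/xrootd/xrootd-clustered.cfg
service xrootd restart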

this meeting:

  • Wei will look at it again

Monalisa monitoring

  • Discussions with Matevz Tadel (USCMS, UCSD) at Lyon.
  • Considering deploying an instance at UC; if so, sites would be asked to publish information to it.
  • Matevz will visit SLAC on Feb 1-2; will ask for a demo. Interested in the detailed monitoring.
this meeting:
  • Matevz is working on setting it up.

-- WeiYang - 08 Feb 2012

Attachments


FAX-site-diagram.pptx (88.5K) | WeiYang, 15 Feb 2012 - 01:46 | site architectures
 