


Minutes of the Facilities Integration Program meeting, Nov 17, 2010
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • 866-740-1260, Access code: 7027475

Audio Details: Dial-in Number:
U.S. & Canada: 866.740.1260
U.S. Toll: 303.248.0285
Access Code: 7027475
Registration Link: https://cc.readytalk.com/r/bd2w3deu2kkg


  • Meeting attendees: Karthik, Dave, Charles, Aaron, Nate, Rob, Sarah, Patrick, Torre, Bob, Saul, Michael, Wei, John, Fred, Kaushik, Mark, Armen, Hiro, Xin, Nurcan, Alden, Doug
  • Apologies: Horst, Rik

Integration program update (Rob, Michael)

  • IntegrationPhase15 NEW
  • Special meetings
    • Tuesday (12 noon CDT) : Data management
    • Tuesday (2pm CDT): Throughput meetings
  • Upcoming related meetings:
  • Program notes:
    • last week(s)
      • CapacitySummary - complete for the last phase, thanks all.
      • There may be some issues with installed capacity as reported to WLCG - http://gstat-wlcg.cern.ch/apps/capacities/comparision/
      • Heads up regarding site status monitoring and auto-exclusion changes coming next week (from Nurcan):
        • See talks by Alessandro Di Girolamo and Dan van der Ster at this week's ADC weekly meeting, http://indico.cern.ch/conferenceDisplay.py?confId=112808
        • Sites should make sure that Athena release 15.6.9 is always available at their site (it is used by the HammerCloud analysis test jobs; a second test using release 16.0.2 will be added)
        • Mail from the ADC shifters will be sent to the US cloud support mailing list, atlas-support-cloud-us@cern.ch. Make sure we have the relevant people subscribed to this list. Currently Nurcan, Alden, Mark and racf-wlcg-announce-l@lists.bnl.gov (who is on this list? Asked J. Hover). Add Hiro, Wensheng, Xin? Subscription to the cloud support list is via this link.
        • Nurcan to give detailed report next week after the system has been tested
      • Reprocessing campaign is ongoing at the Tier 1s, and the MC reprocessing will be taking off soon.
      • It's almost a PB of data; the US has 380 TB to do at BNL. Well underway. All Tier 1's are participating.
      • Stress in usage of DATADISK - at BNL it was filling rapidly, at about 2 TB/hour, so management of this space is critical. 7 PB in production, but only a few hundred TB left, so it's an urgent matter now (see the rough estimate after this list).
      • Heavy ion collisions - it only took 4 days to convert from p-p running. The LHC will shut down on December 7.
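      A rough back-of-the-envelope estimate of why the DATADISK situation above is urgent, assuming ~300 TB free ("a few hundred TB") and the quoted ~2 TB/hour fill rate; both numbers are approximate and only illustrate the timescale.

          # Hypothetical estimate of how long the remaining DATADISK space lasts at the
          # quoted fill rate; both inputs are rough figures taken from the notes above.
          free_tb = 300.0              # "a few hundred TB left" (assumed ~300 TB)
          fill_rate_tb_per_hour = 2.0  # "2 TB/hour"

          hours_left = free_tb / fill_rate_tb_per_hour
          print("~%d hours (~%d days) until DATADISK fills" % (hours_left, hours_left // 24))
          # -> ~150 hours (~6 days)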
    • this week
      • See the updated site certification table
      • Reprocessing campaign - completed a couple of days ago
      • Last phase launched this morning - to reprocess the October data - should take roughly a week - then it will be available to users
      • LHC - short break, to resume on Friday; HI is a wild success

Analysis sites auto-exclusion update (Nurcan)

  • The auto-exclusion service was turned on for Panda ANALY* sites on Monday of this week.
  • See the details from Dan's talk today at the DA Dev meeting.
  • US sites will be contacted via the US cloud support mailing list, atlas-support-cloud-us@cern.ch. Make sure we have the relevant people subscribed to this list. Currently Nurcan, Alden, Mark and racf-wlcg-announce-l@lists.bnl.gov (who is on this list? Asked J. Hover). Add Hiro, Wensheng, Xin? Subscription to the cloud support list is via this link.

Global Xrootd: Tier 3 (Doug), Tier 2 (Charles)

last week(s):
  • Twiki page setup at CERN: https://twiki.cern.ch/twiki/bin/viewauth/Atlas/AtlasXrootdSystems
  • Meeting https://twiki.cern.ch/twiki/bin/viewauth/Atlas/XrdMeetingOct25
  • Doug: testing revised configuration files from Andy and Wei, and scripts for input copy. Will need a newer version of xrootd than what's in VDT. Several sites in Europe (especially the UK with DPM, and Spain with Lustre) are expressing interest in participating. Also needs to work with Graham on a schedule. Will put this into the Tier 3 part of the project.
  • Charles: the namespace convention was implemented easily with symlinks in the LFC. The module for LFC lookup is working at SLAC and UC. xrd-dcap debugging - will need to be repackaged. The UC-SLAC testbed is working. Can access a file using the ATLAS global namespace (see the sketch after this list).
  • Patrick, Shawn, Saul standing by ready to test.
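  A minimal sketch of the kind of name translation described above, assuming a hypothetical global-namespace root of /atlas and an LFC root of /grid/atlas; the real convention is whatever the xrd-lfc lookup module implements, and the paths here are illustrative only.

      # Hypothetical mapping from a global xrootd path to the LFC path that a
      # lookup module (or symlink convention) would resolve. Roots are assumptions.
      GLOBAL_ROOT = "/atlas"       # namespace seen by clients via the global redirector
      LFC_ROOT = "/grid/atlas"     # where the corresponding LFC entries/symlinks live

      def global_to_lfc(global_path):
          """Translate a global-namespace path into the LFC path to look up."""
          if not global_path.startswith(GLOBAL_ROOT + "/"):
              raise ValueError("not in the global namespace: %s" % global_path)
          return LFC_ROOT + global_path[len(GLOBAL_ROOT):]

      print(global_to_lfc("/atlas/dq2/data10_7TeV/AOD/example.dataset/example.file.root"))
      # -> /grid/atlas/dq2/data10_7TeV/AOD/example.dataset/example.file.root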
this week:
  • Charles: continuing to work on the T2 side. Working on thread-safe xrd-lfc module.
  • Will put some instructions in the twiki when something is stable.
  • Phone call tomorrow to work out some wrinkles with LFC names
  • Doug: discussions with CERN IT, ATLAS and CMS on a proxy service
  • Doug: at SLAC last week working on xrootd configuration in general. Trying to work out a simplified proxy scheme for T3; not working yet. Understanding xrootd with private networks. Monitoring of xrootd - system and ganglia from the EPO yum repo. XrootdFS to be used for data management. Meeting with VDT about rpms. Code merge between xrootd and XrootdFS.

Tier 3 Integration Program (Doug Benjamin, Rik Yoshida)

  • The links to the ATLAS T3 working group Twikis are here
  • Draft users' guide to T3g is here
last week(s):
  • Lining up examples for analysis - Nils working at ANL for three days. Amir's n-tuple example on Tier 3.
  • DESY has a large n-tuple benchmark package - adapting this as well.
  • Tier3-Panda - has an account at Argonne, working.
  • Doug at SLAC - working on the next T3 xrootd configuration; needs to be synchronized with the VDT rpm
  • Doug will work with Yu Shu to use Puppet.
  • All the scripts are in SVN at CERN, head node installation has been tested by Doug; worker-node installation has been tested.
  • Twiki security policy creating access problems
  • Yale is having problems with client tools - will look into gridftp-FTS
  • dq2-ls and dq2-get will go into the next release candidate, before December
  • CERN IT plus Dubna developers are starting a development effort for T3s.
this week:
  • Several more sites are making hardware purchases.
  • Alden got Panda working at Brandeis.
  • Working on configuration management, using Puppet.

Operations overview: Production and Analysis (Kaushik)

Data Management & Storage Validation (Kaushik)

  • Reference
  • last week(s):
    • Facing storage shortfalls world-wide
    • BNL is getting full as well (already using far more than the pledge) - about 1 PB free (down from 1.2 a few days ago)
    • Can we afford secondary copies? May need to delete older ESD copies. US physicists may need to have jobs scheduled elsewhere (other clouds) - it's an ATLAS problem
    • AGLT2 can add more to DATADISK if needed
    • Can the space-token auto-adjuster be used at BNL? Probably not - they are hard-tokens.
  • this week:
    • MinutesDataManageNov16
    • Recall that in May we were struggling with massive data transfers; this was solved by PD2P, but not for the T1
    • All T2's are fine
    • Massive amounts of data to BNL (> 1PB) - causing a disk crunch
    • Starting to hit hard physical limits in DATADISK - struggling to maintain 100% of ESDs at BNL; requires RAC discussion
    • No multi-hundred-TB targets left to delete
    • Two types of ESDs - original T0 and reprocessed
    • Note BNL is 2 PB over pledge! (7 PB)
    • MCDISK cleaning is difficult - depends on what's been placed by physics groups
    • Will discuss T1 space at CERN
    • All T1's are full
    • Perhaps distribute one copy of ESDs among T1s.

Shifters report (Mark)

  • Reference
  • last meeting: Operations summary:
    Yuri's summary from the weekly ADCoS meeting:
    1)  11/5: New pilot version from Paul (SULU 45a) - details here:
    2)  11/5: WT2 - short outage from ~2:30pm to 5:00pm PDT to replace a system disk on a storage box.
    Site back on-line as of ~4:45 PDT.
    3)  11/6 - 11/7: BNL dCache issues for ATLASDATADISK and ATLASMCDISK space tokens:
    "FTP Door: got response from '[>PoolManager@dCacheDomain:*@dCacheDomain:SrmSpaceManager@srm-dcsrm03Domain:*@srm-dcsrm03Domain:*@dCacheDomain]' with error Best pool  too high : NaN] ACTIVITY: Data Consolidation."  Resolved - from Iris / Michael: 
    It was a space issue which has been fixed (MCDISK filled up).  ggus 63996/99 closed, eLog 19105/125.
    4)  11/7 - 11/8: SWT2_CPB DDM errors - status from Patrick:
    Sunday there was an issue in the configuration of Bestman associated with the number of open file descriptors.  Restarting the SRM cleared the issue.  We had more problems today (Monday), but did not see an issue with the number of open files.  
    We are seeing a high load on one dataserver and may make some changes to the xrootd configuration on this node to see if it can improve things.  We have restarted Bestman and modified the number of the worker threads associated with the 
    XrootdFS component that bestman relies on.  Transfer errors have stopped.
    5)  11/9 early a.m.: SWT2_CPB file transfer errors.  Issue was a problematic network switch port.  Later additional transfer errors were observed, due to the fact that the xrootd storage server plugged into the bad switch port was inaccessible for a period of time.  
    All issues seem to now be resolved.  RT 18616 / ggus 64117 closed, eLog 19265. 
    6)  11/9: HU_ATLAS_Tier2 job failures with the error "Get error: lsm-get failed: time out after 5400 seconds ."  Issue resolved - ggus 64108 closed, eLog 19266.
    Follow-ups from earlier reports:
    (i)  10/21: NERSC_HOTDISK file transfer errors - authentication issue with NERSC accepting the ATLAS
    production voms proxy.  Hiro set the site off-line in DDM until the problem is resolved.  ggus 63319 in-progress, eLog 18494.
    Update: solved as of 11/5 - ggus ticket closed.
    (ii)  10/27: WISC_DATADISK - possibly a missing file.  ggus 63526 in-progress, eLog 18698.
    Update 11/6, from Wen at WISC: Now these 5 files are available. I think it's transferred from other sites by Function Test. So this ticket can be closed.  ggus 63526 was subsequently closed.
    (iii)  10/30: ANL - file transfer tests failing with the error "failed to contact on remote SRM [httpg://atlas11.hep.anl.gov:8443/srm/v2/server]. Givin' up after 3 tries]."  ggus 63633 in-progress, eLog 18807.
    Update, 11/1: Network device failures were solved by a reboot of the machine.  ggus 63633 closed.
    (iv)  10/31: Job failures at SLACXRD with the error "Required CMTCONFIG (i686-slc5-gcc43-opt) incompatible with that of local system."  From Xin: The installation at SLAC is corrupted, I am reinstalling there, will update the ticket after the re-install is done.  
    ggus 63639 in-progress, eLog 18845.
    Update, 11/5: ATLAS release 15.6.13 reinstalled, no additional errors of this type seen.  ggus 63639 closed, eLog 19066.
    (v)  11/2: AGLT2 - all jobs failing with the errors indicating a possible file system problem.  From Bob: We have determined that the problem is a corrupted NFS file system hosting OSG/DATA and OSG/APP. That is the bad news. 
    The good news is that this is a copy to a new host from yesterday, so the original will be used to re-create it.  
    ggus 63684 in-progress, eLog 18913.  Queues set off-line.
    Update, 11/4 from Bob at UM: NFS server for OSG directories was reloaded and resolved the issue. This server disk was originally built as XFS and mounted with inode64. It worked fine for all OS level commands, but failed in various python packages used in ATLAS kits. 
    The disk was emptied, and inode64 was removed before it was reloaded. 
    Test jobs successful, queues set back on-line.  ggus 63684 closed, eLog 18965.
    (vi)  11/2: OU_OSCER_ATLAS: jobs using release are failing with seg fault errors, while they finish successfully at other sites.  Alessandro checked the release installation, and this doesn't appear to be the issue.  
    May need to run a job "by-hand" to get more detailed debugging information.  In-progress. 
    • Another quiet week in the US cloud
    • Most carryover issues of the last week resolved
    • New pilot from Paul - significant
    • Alden: has removed queues

  • this meeting: Operations summary:
    Yuri's summary from the weekly ADCoS meeting:
    1)  11/10 p.m.: BNL SRM was down for ~30 minutes.  eLog 19310.
    2)  11/10: SWT2_CPB - storage server went off-line (no network connectivity).  Problem (we think) was a bad cooling fan on the NIC.  ggus 64153 / RT 18621 closed,  eLog 19288.
    3)  11/10 - 11/11: BNL - job failures with the error "Can't find [JetMetAnalysis_16_0_2_3_1_i686_slc5_gcc43_opt]."  Not a site issue - instead a problem with s/w installation system (package not available in any cache).  
    Solved - package now installed at BNL.  ggus 64127 closed, eLog 19311.
    4)  11/11 p.m.: network connectivity issue at HU_ATLAS_Tier2, resolved later that evening.  Created a group of panda "lost heartbeat" errors.  eLog 19363.
    5)  11/11 - 11/12: SLAC disk free space low - from Wei: "We run out of space in the front tier. I stopped the channels to let old data moving to back tier."  This led to job failures with stage-out errors.  Problem under investigation.  
    eLog 19324.
    6)  11/11 - present: BNL storage issues due to low free space in DATADISK.  Intermittent errors with DDM transfers and job stage-in/out problems.
    More details in ggus 64154 (open) and 64218 (closed), eLog 19388 / 433 / 488.
    7)  11/14: MWT2_UC_PRODDISK - DDM errors with "SOURCE error during TRANSFER_PREPARATION phase" message.  From Aaron: "This is due to a problem with the disks on one of our storage nodes. We are working to get this 
    node back online, but these errors will continue until this is complete."  ggus 64230 in-progress, eLog 19420.
    8)  11/15: MWT2_UC - job failures with the error "lfc_getreplicas failed with: 1018, Communication error."  Issue resolved - from Aaron: "This was due to an operation being run on our database server causing the LFC to stop responding. 
    Once this operation was killed we are back to running without problems."  ggus 64278 closed, eLog 19514.
    9)  11/16: AGLT2 - network outage caused a loss of running jobs (clean recovery on the gatekeeper not possible).  Issue resolved.  ggus 64323 was opened during this period for related DDM errors.  
    Additional errors like "gridftp_copy_wait: Connection timed out" seen overnight.  ggus ticket in-progress.
    10)  11/17 early a.m.: A panda monitor patch was put in place (https://savannah.cern.ch/bugs/?75351) which caused problems with the display of recent failed jobs pages.  Workaround in place (eLog 19544) while a permanent fix 
    is being developed.
    11)  11/17: HU_ATLAS_Tier2 - job failures with the error "Too little space left on local disk to run job."  ggus 64354, eLog 19564.
    Follow-ups from earlier reports:
    (i)  11/2: OU_OSCER_ATLAS: jobs using release are failing with seg fault errors, while they finish successfully at other sites.  Alessandro checked the release installation, and this doesn't appear to be the issue.  
    May need to run a job "by-hand" to get more detailed debugging information.  In-progress.
    Update 11/17: any news on this?
    • There have been problems accessing job records from the Panda monitoring database
    • Obsolete sites removed in Panda - if any further cleaning is needed contact Alden
    • Michael - what about the quality and performance of central deletion? There are long periods of inactivity. Where is this being discussed? Should shifters be monitoring it - not clear. Is this the P1 shifter's job? Will bring this up in the meeting next Tuesday.

DDM Operations (Hiro)

Throughput Initiative (Shawn)

  • NetworkMonitoring
  • https://www.usatlas.bnl.gov/dq2/throughput
  • Now there is FTS logging to the DQ2 log page at: http://www.usatlas.bnl.gov/dq2log/dq2log (type in 'fts' and 'id' in the box and search).
  • last week:
    • Meeting notes
    • See notes in email
    • Illinois asymmetry resolved - could have been an update to a switch.
    • Goal was to get all perfSONAR instances updated to 3.2; good progress - issues with SLAC (has an alternative version for local security)
    • NET2 had one box down for a while
    • All other sites are updated
    • Question about version at BNL
    • Want Nagios plugins extended to show version
  • this week:

Site news and issues (all sites)

  • T1:
    • last week(s): Reprocessing keeping us busy, especially due to the space crunch (dcache adjustments). Looking into purchasing more disk using FY11 funds.
    • this week: Extensive data replication to BNL has seen rates of > 2 TB/hour, filling up the site. This has been quite a stress on services and people; learning a lot. Regarding space, in the process of procuring another 2 PB of disk. (Estimates are coming up short.) This will put BNL at 10 PB of disk by the end of the year. Pedro, leader of the storage management group, has left. Looking for a capable and well-plugged-in group leader; Hiro will now lead the group. The group has been re-constituted with systems and database expertise (Carlos, Jason). Will be moving to Chimera on a timescale of a year.

  • AGLT2:
    • last week: Lustre issue - mounted on worker nodes; metadata server problem, Bob is working on it. New resources arriving - blade servers arriving. Tom: 64 R410s, all but 10 arrived and racked; setting them up. A little more than doubling the MSU HS.
    • this week: Discovered 1% data loss at UM; the local network was completely disrupted, spanning-tree topology lost. MSU has received all worker nodes (2 racks). Upgrading the LAN with larger 10G switches.

  • NET2:
    • last week(s): Running at full capacity, including HU Westmere nodes. Problems keeping ANALY_HU full, a few..
    • this week: Yesterday running short on DATADISK; have been doing deletions, 55 TB free. Electrical work completed, ready to plug in another rack. Networking incident at Harvard last week on Veterans Day, disconnecting a couple of Nehalem racks; they responded quickly. Came back online, only lost 1/3 of the jobs. Problems with Condor-G not updating quickly enough, causing a backlog. Happened before, thought to have been fixed by Jamie; now it looks like it's out of sync again. Scaled up the analysis queue at HU, with adjustments to lsm to use a different server (~700 slots).

  • MWT2:
    • last week(s): gridftp server problems - no route to host - server disabled, investigating.
    • this week: A couple of problems over the week - an LFC database index issue and a pnfs-manager load issue.

  • SWT2 (UTA):
    • last week: A couple of issues with the SRM over the weekend - the number of open files for Bestman. Getting started setting up the global xrootd system; it is in place, will need to turn things on. New 10G connection coming online, new switch in place, testing with a test cluster.
    • this week: SRM issue at UTA_SWT2 - needed more file descriptors (increased the ulimit); unsure as to the cause. Could it be due to the central deletion service? ANALY cluster at CPB - problem with storage, restarted. Working on joining the federated xrootd cluster. Wei: need to follow up with the Berkeley team. (A sketch of checking the descriptor limit is shown below.)
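    A minimal sketch of checking (and, within the hard limit, raising) the per-process open-file-descriptor ceiling, e.g. from a service startup wrapper. The target value is illustrative; the actual mechanism used for the Bestman SRM (for example a ulimit setting in its init script) is site-specific and not shown here.

        # Check and, if allowed, raise the open-file-descriptor limit for the
        # current process. The target of 65536 is an illustrative value only.
        import resource

        soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
        print("current open-file limits: soft=%d hard=%d" % (soft, hard))

        target = 65536
        if hard == resource.RLIM_INFINITY:
            new_soft = target
        else:
            new_soft = min(target, hard)   # the soft limit cannot exceed the hard limit
        if new_soft > soft:
            resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
            print("raised soft limit to %d" % new_soft)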

  • SWT2 (OU):
    • last week: Turned on HT on R410 - now running close to 800 jobs. All running smoothly. There is an "LFC problem" - investigating.
    • this week: all is well. Working with Dell on new quotes for R410s. (Not sure of the number.)

  • WT2:
    • last week(s): On Friday, a problem with a motherboard in an older Thumper. Large numbers of xrootd connections leading to Solaris problems (?). Quite a few SRM issues; investigating, could be related to the Solaris problem. Received 10 Dell R410 servers, awaiting the rest (38 total).
    • this week: Having problems with a data server going down over the past week for unknown reasons. Problems with an xrootd redirector that Andy will fix. Finding the redirector slowing down in certain situations - using a replacement from Andy.

Carryover issues ( any updates?)

Release installation, validation (Xin)

The issue of the validation process, completeness of releases at sites, etc.
  • last report
    • See slides attached below
    • goal is to unify methods with other clouds
    • https://atlas-install.roma1.infn.it/atlas_install/ - site admins can subscribe, and get notified
    • Documentation isn't ready
    • Migrate Tier 2's, but when?
    • SLAC - planning to move ATLAS releases to a new server.
    • Start with 16.3.0; will also require a change in the ToA.
    • Xin will communicate the transition with the site admins and with Alessandro
    • Will start with all the Tier 2's.
  • this meeting:
    • Has double-checked with site admins about the install queues; passed on to Alessandro
    • 16.3.0 will be a dry run - to validate
    • Waiting on WT2 until the file server move is done.
    • An old-releases list is being prepared for deletion. Is there a space crunch at sites? (A sketch for checking release-area free space follows this list.)
      • AGLT2 - ~TB free
      • BU - ~several hundred GB, HU several TB
      • MWT2 - 674 GB free
      • UTA - several TB free
      • OU - no issues
      • WT2 - no issue
      • Kaushik will forward a list of releases that are no longer in use
      • Candidate list of releases that could be cleaned up.
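      A minimal sketch of how a site could report the free space under its release area, so the numbers above are gathered the same way everywhere. The path is a hypothetical example; each site's actual releases directory differs.

          # Report free space under a site's ATLAS release area.
          # The path below is an assumption for illustration only.
          import os

          release_area = "/osg/app/atlas_app/atlas_rel"   # hypothetical location

          st = os.statvfs(release_area)
          free_gb = st.f_bavail * st.f_frsize / 1024.0**3
          print("%s: %.0f GB free" % (release_area, free_gb))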

HEPSpec 2006 (Bob)

last week(s): this week:

Conditions data access from Tier 2, Tier 3 (Fred, John DeStefano)

libdcap & direct access (Charles)

last report(s):
  • Rebuilt libdcap locally
  • Current HC performance, http://hammercloud.cern.ch/atlas/10000957/test/
  • Comparing to previous results with dcap++
  • Looking at messaging between dcap client and server
  • Most problems with prun jobs (pathena seems robust)
  • Trying to get dcap++ merged into the official dcap sources; libdcap has diverged.
  • Continuing to track down stalls in analysis jobs - leading to hangs in dcap (a sketch of the client-side access pattern being exercised follows this list).
  • newline causing buffer overflow
  • chipping away at this on the job-side.
  • dcap++ and newer stock libdcap could be merged at some point.
  • Maximum active movers needs to be high - 1000.
  • Submitted patch to dcache.org for review
  • No estimate on libdcap++ inclusion in to the release
  • No new failure modes observed
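  For context, a minimal sketch of the client-side direct-access read path being debugged above, using PyROOT to open a file through a dCache dcap door; the host, port and pnfs path are hypothetical, and real analysis jobs go through Athena/prun rather than this bare call.

      # Open a file directly through a dCache dcap door via ROOT (uses libdcap
      # underneath). Host/port/path are placeholders, not a real endpoint.
      import ROOT

      url = "dcap://dcache-door.example.org:22125//pnfs/example.org/data/atlas/sample.root"
      f = ROOT.TFile.Open(url)
      if f and not f.IsZombie():
          print("opened %s, size %d bytes" % (url, f.GetSize()))
          f.Close()
      else:
          print("failed to open %s" % url)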

this meeting:


  • last week
    • Reminder: all T2's to submit DYNES applications by end of the month. Template to become available.
  • this week

-- RobertGardner - 17 Nov 2010
