
MinutesNov23

Introduction

Minutes of the Facilities Integration Program meeting, Nov 24, 2010
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • 866-740-1260, Access code: 7027475

Audio Details: Dial-in Number:
U.S. & Canada: 866.740.1260
U.S. Toll: 303.248.0285
Access Code: 7027475
Registration Link: https://cc.readytalk.com/r/bd2w3deu2kkg

Attending

  • Meeting attendees: Wei, Tom, Nate, Charles, John, Michael, Doug, Saul, Horst, Dave, Wensheng, Xin
  • Apologies: Shawn, Bob, Hiro, Sarah, Aaron, Karthik

Integration program update (Rob, Michael)

Global Xrootd: Tier 3 (Doug), Tier 2 (Charles)

last week(s):
  • Twiki page setup at CERN: https://twiki.cern.ch/twiki/bin/viewauth/Atlas/AtlasXrootdSystems
  • Charles: continuing to work on the T2 side. Working on thread-safe xrd-lfc module.
  • Will put some instructions in the twiki when something is stable.
  • Phone call tomorrow to work out some wrinkles with LFC names
  • Doug: discussions with CERN IT, ATLAS and CMS on the proxy service
  • Doug: at SLAC last week working on xrootd configuration in general. Trying to work out a simplified proxy scheme for T3s; not working yet. Understanding xrootd with private networks. Monitoring of xrootd - system-level and Ganglia, from the EPO yum repo. XrootdFS to be used for data management. Meeting with VDT about rpms. Code merge between xrootd and XrootdFS.
this week:
  • Lots of discussion last week. xrd-lfc plugin code is finished, incorporating all suggestions to date. Testing right now.
  • Search logic added in the code, since the LFC naming conventions can't be changed
  • Angelos has written a wiki page for dq2-ls and dq2-get with the new functionality; the updated clients should be released shortly.
  • Testing - seems there are problems with the global redirector not finding files
  • Will ask Panda developers to store files according to DDM
  • Sites are installing equipment
  • Yushu is working on configuration management
  • Nils working on "back end" - test kit. Example jobs.
  • CVMFS work continues for conditions database files; Doug will work on getting AFS synchronized with CVMFS
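  • (Illustration) As a minimal sketch of the kind of client-side check the CVMFS work implies, the Python snippet below verifies that a CVMFS repository is mounted and that an expected conditions-data directory is visible. The repository name and sub-path are assumptions for illustration only, not the real ATLAS layout.

    #!/usr/bin/env python
    # Minimal check that a CVMFS repository is mounted and readable on a client.
    # The repository name and subdirectory are illustrative assumptions, not the
    # actual ATLAS conditions-data layout.
    import os
    import sys

    REPO = "/cvmfs/atlas-condb.example.org"   # assumed repository mount point
    SAMPLE_SUBDIR = "conditions"              # assumed path inside the repository

    def check_repo(repo, subdir):
        try:
            os.listdir(repo)                  # first access triggers the autofs mount
        except OSError as err:
            return "cannot access %s: %s" % (repo, err)
        path = os.path.join(repo, subdir)
        if not os.path.isdir(path):
            return "mounted, but expected directory %s is missing" % path
        return None

    if __name__ == "__main__":
        problem = check_repo(REPO, SAMPLE_SUBDIR)
        if problem:
            sys.exit("CVMFS check failed: " + problem)
        print("CVMFS repository %s looks usable" % REPO)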

Tier 3 Integration Program (Doug Benjamin, Rik Yoshida)

References
  • Links to the ATLAS T3 working group Twikis are here
  • T3g Setup guide is here
  • Users' guide to T3g is here
last week(s):
  • Several more sites are making hardware purchases.
  • Alden got Panda working at Brandeis.
  • Working on configuration management, using Puppet.
this week:
  • On Monday more than 50% of all Panda pilots failed at Duke, ANL and Brandeis
  • On Tuesday the number was smaller, but still large at ANL
  • Tier 3 support personnel must be registered in OIM (RT queue and contacts list)
  • Kaushik: there is a group attribute in Panda which allows only local users to run jobs; this has been done by Asoka in Canada

Operations overview: Production and Analysis (Kaushik)

Data Management & Storage Validation (Kaushik)

  • Reference
  • last week(s):
    • MinutesDataManageNov16
    • Recall in May we were struggling with massive data transfers; this was solved by PD2P, but not for the T1
    • All T2's are fine
    • Massive amounts of data to BNL (> 1PB) - causing a disk crunch
    • Starting to hit hard physical limits in DATADISK; struggling to maintain 100% of ESDs at BNL. Requires discussion with the RAC.
    • No multi-hundred-TB deletion targets left
    • Two types of ESDs - original T0 and reprocessed
    • Note BNL is 2 PB over pledge! (7 PB)
    • MCDISK cleaning is difficult - depends on what's been placed by physics groups
    • Will discuss T1 space at CERN
    • All T1's are full
    • Perhaps distribute one copy of ESDs among T1s.
  • this week:
    • Still working BNL storage issue
    • All T2's should be fine storage-wise.
    • SLAC is filling up, but Wei has plenty in reserve. Would like to delete more from DATADISK.

Shifters report (Mark)

  • Reference
  • last meeting: Operations summary:
    Yuri's summary from the weekly ADCoS meeting:
    http://indico.cern.ch/getFile.py/access?contribId=1&resId=0&materialId=0&confId=113693
    
    1)  11/10 p.m.: BNL SRM was down for ~30 minutes.  eLog 19310.
    2)  11/10: SWT2_CPB - storage server went off-line (no network connectivity).  Problem (we think) was a bad cooling fan on the NIC.  ggus 64153 / RT 18621 closed,  eLog 19288.
    3)  11/10 - 11/11: BNL - job failures with the error "Can't find [JetMetAnalysis_16_0_2_3_1_i686_slc5_gcc43_opt]."  Not a site issue - instead a problem with s/w installation system (package not available in any cache).  
    Solved - package now installed at BNL.  ggus 64127 closed, eLog 19311.
    4)  11/11 p.m.: network connectivity issue at HU_ATLAS_Tier2, resolved later that evening.  Created a group of panda "lost heartbeat" errors.  eLog 19363.
    5)  11/11 - 11/12: SLAC disk free space low - from Wei: "We run out of space in the front tier. I stopped the channels to let old data moving to back tier."  This led to job failures with stage-out errors.  Problem under investigation.  
    eLog 19324.
    6)  11/11 - present: BNL storage issues due to low free space in DATADISK.  Intermittent errors with DDM transfers and job stage-in/out problems.
    More details in ggus 64154 (open) and 64218 (closed), eLog 19388 / 433 / 488.
    7)  11/14: MWT2_UC_PRODDISK - DDM errors with "SOURCE error during TRANSFER_PREPARATION phase" message.  From Aaron: "This is due to a problem with the disks on one of our storage nodes. We are working to get this 
    node back online, but these errors will continue until this is complete."  ggus 64230 in-progress, eLog 19420.
    8)  11/15: MWT2_UC - job failures with the error "lfc_getreplicas failed with: 1018, Communication error."  Issue resolved - from Aaron: "This was due to an operation being run on our database server causing the LFC to stop responding. 
    Once this operation was killed we are back to running without problems."  ggus 64278 closed, eLog 19514.
    9)  11/16: AGLT2 - network outage caused a loss of running jobs (clean recovery on the gatekeeper not possible).  Issue resolved.  ggus 64323 was opened during this period for related DDM errors.  
    Additional errors like "gridftp_copy_wait: Connection timed out" seen overnight.  ggus ticket in-progress.
    10)  11/17 early a.m.: A panda monitor patch was put in place (https://savannah.cern.ch/bugs/?75351) which caused problems with the display of recent failed jobs pages.  Workaround in place (eLog 19544) while a permanent fix 
    is being developed.
    11)  11/17: HU_ATLAS_Tier2 - job failures with the error "Too little space left on local disk to run job."  ggus 64354, eLog 19564.
    
    Follow-ups from earlier reports:
    
    (i)  11/2: OU_OSCER_ATLAS: jobs using release 16.0.2.3 are failing with seg fault errors, while they finish successfully at other sites.  Alessandro checked the release installation, and this doesn't appear to be the issue.  
    May need to run a job "by-hand" to get more detailed debugging information.  In-progress.
    Update 11/17: any news on this?
    
    • There have been problems accessing job records from the Panda monitoring database
    • Obsolete sites removed in Panda - if any further cleaning is needed contact Alden
    • Michael: what about the quality and performance of central deletion, and the long periods of inactivity? Where is this being discussed? Should shifters be monitoring this? Not clear. Is this the P1 shifter? Will bring this up at the meeting next Tuesday.

  • this meeting: Operations summary:
    Yuri's summary from the weekly ADCoS meeting (this week presented by Kai Leffhalm):
    http://indico.cern.ch/getFile.py/access?contribId=0&resId=0&materialId=0&confId=114346
    
    1)  11/18: AGLT2, from Bob:
    We again had a network issue, and lost our load of jobs. Issue began around 4pm EST yesterday, was not discovered until this morning. We are now restarted afresh.  eLog 19629.
    2)  11/19:  File transfer errors between MWT2_IU_PRODDISK & BNL-OSG2_MCDISK - "FTS Retries [1] Reason [AGENT error during TRANSFER_SERVICE phase: [INTERNAL_ERROR] cannot create archive repository: No space left on device]."  Issue understood - from Hiro:
    The error message comes from FTS, not from the source or destination.  However, it is really not a real error since FTS will transfer a file without space for a log. (Otherwise, we will see errors in many of the transfers done by BNL FTS.) Anyway, the space is being cleaned up as I write. 
    So, the real error will show up in the dashboard if the transfers were still failing.  ggus 64427 closed, eLog 19655.
    3)  11/19: Problem generating proxies with the VOMS server at BNL.  Issue resolved by restarting the service (John H.)  eLog 19674.
    4)  11/19 - 11/21: Intermittent db issues caused periods of slow / non-access to panda monitor pages.  eLog 19647 / 651, 755.  Also led to some sites draining temporarily due to pilots being unable to contact the panda server.  In addition Xin did some clean-up 
    on the submit host gridui12 at BNL to resolve some pilot issues there as well.
    5)  11/20: SWT2_CPB - DDM errors like "failed to contact on remote SRM [httpg://gk03.atlas-swt2.org:8443/srm/v2/server]."  ggus 64457 / RT 18679, eLog 19714/811.  From Patrick: There is an issue with the size of the files and the timeouts enabled on the FTS channel for SWT2_CPB. 
    The incoming files appear to be large (12+GB) and the transfers are not completing within the current timeout. 
    We are playing with the timeouts available on the FTS channel to get these transfers to succeed.
    6)  11/21: BNL - normal DDM & functional tests transfers to other T1's failing.  From Michael: Though the data initial migration from overcommitted pools was completed a few days ago, we had to restart migration again given the massive amount of HI ESD data
    BNL started to receive yesterday. Pools had to be re-allocated because they were filling up quickly.  We are trying hard to make this process for ATLAS as transparent as possible, by that minimizing transfer failures, etc.  
    Please tolerate the few transfer failures for a few more hours, until the migration process is completed.  ggus 64667 closed, eLog 19744.
    7)  11/21: OU_OCHEP_SWT2 (and ANALY queue) set off-line due to missing release 16.2.0.  As of 11/22 s/w was installed, and the queues were set back on-line.  eLog 19800.  Also installed at OU_OSCER_ATLAS.  (ggus  64537 / RT 18691 were opened due to job failures 
    at OU with stage-in errors, but this seemed to be a transient problem - ticket closed.  eLog 19828.)
    
    Follow-ups from earlier reports:
    
    (i)  11/2: OU_OSCER_ATLAS: jobs using release 16.0.2.3 are failing with seg fault errors, while they finish successfully at other sites.  Alessandro checked the release installation, and this doesn't appear to be the issue.  May need to run a job "by-hand" to get more detailed debugging information.  In-progress.
    Update 11/17: any news on this?
    (ii)  11/11 - 11/12: SLAC disk free space low - from Wei: "We run out of space in the front tier. I stopped the channels to let old data moving to back tier."  This led to job failures with stage-out errors.  Problem under investigation.  eLog 19324.
    (iii)  11/11 - present: BNL storage issues due to low free space in DATADISK.  Intermittent errors with DDM transfers and job stage-in/out problems.
    More details in ggus 64154 (open) and 64218 (closed), eLog 19388 / 433 / 488.
    (iv)  11/14: MWT2_UC_PRODDISK - DDM errors with "SOURCE error during TRANSFER_PREPARATION phase" message.  From Aaron: "This is due to a problem with the disks on one of our storage nodes. We are working to get this node back online, but these errors will continue until this is complete."  ggus 64230 in-progress, eLog 19420.
    (v)  11/16: AGLT2 - network outage caused a loss of running jobs (clean recovery on the gatekeeper not possible).  Issue resolved.  ggus 64323 was opened during this period for related DDM errors.  Additional errors like "gridftp_copy_wait: Connection timed out" seen overnight.  ggus ticket in-progress.
    Update from Bob, 11/19: I see one error in the last 4 hours, 20 error in the last 24 hours (this latter out of 10700). We think there is some packet loss at a very low level in our network, and we are actively investigating to find the source. We also have one very busy server that we will deal with today.
    (vi)  11/17 early a.m.: A panda monitor patch was put in place (https://savannah.cern.ch/bugs/?75351) which caused problems with the display of recent failed jobs pages.  Workaround in place (eLog 19544) while a permanent fix is being developed.
    Update 11/18: Issue believed to be resolved - the Savannah ticket was closed.
    (vii)  11/17: HU_ATLAS_Tier2 - job failures with the error "Too little space left on local disk to run job."  ggus 64354, eLog 19564.
    Update 11/19: Issue resolved - from John at HU: ATLAS has been allowed to scale up and run on additional, non-owned hardware, but there were other user's files left over in /scratch space. I cleaned all of them out now.  ggus 64354 closed.
    
    

DDM Operations (Hiro)

  • Reference
  • last meeting(s):
    • Still seeing some failures with BDII. Xin: there have been changes in the network route, believes this has helped. Timeout on the CERN end extended to 120 seconds, and this has helped. Rob Quick reported in the daily WLCG ops meeting that an MTU mismatch problem was discovered and resolved; may go back to the original timeout. (A quick way to probe for MTU mismatches is sketched after this list.)
  • this meeting:
    • In terms of operation, DDM is running without any problems, and not many DDM errors related to T2/T3 sites were noticed. The user dataset deletion announcement should come this week (Hiro will send the size).
    • No BDII errors have appeared in the warning system, so the 120 s timeout adopted last week and the resolution of the network issue at OSG/Indiana appear to be working.
    • BNL dCache space is still tight, but we are coping. We have stopped dedicating specific storage to specific space tokens, so quotas can now be set more flexibly.
    • There were a few incidents when a large number of requests went to a very limited number of read pools, which caused some errors in Panda and DDM; migrating the popular data to different pools solved the problem.
    • Staging from HPSS works fine, reaching 1 Gb/s constantly for the big RAW files. BNL is CPU limited, not HPSS I/O limited, in terms of reprocessing.
    • Throughput/networking: the BNL-CNAF network problem in GEANT shows progress with the installation of new hardware, but switching the circuit back to the original route was not completed and will be tried again on Nov 29. This is approaching four months to fix a network problem between two T1s; how long should we expect problems between US T2s and foreign T2s to take to resolve (if anyone even notices them)? (Note: there is an initiative in WLCG to look into this issue.)
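  • (Illustration) Regarding the MTU mismatch noted under "last meeting(s)": a quick way to probe a path for MTU problems is a don't-fragment ping sweep, as in the Python sketch below, which wraps the standard Linux ping. The host name and MTU values are placeholders, not real site endpoints.

    #!/usr/bin/env python
    # Rough sketch: probe a path for MTU mismatches by sending don't-fragment
    # pings (Linux iputils "ping -M do") at candidate MTU sizes.
    # The host and MTU values below are placeholders.
    import subprocess

    HOST = "gridftp.example.org"    # placeholder endpoint
    MTU_CANDIDATES = [9000, 1500]   # jumbo frames, then standard Ethernet

    def df_ping_ok(host, mtu):
        # Payload = MTU minus 20 bytes of IP header and 8 bytes of ICMP header.
        payload = mtu - 28
        cmd = ["ping", "-M", "do", "-c", "3", "-s", str(payload), host]
        return subprocess.call(cmd, stdout=subprocess.DEVNULL,
                               stderr=subprocess.DEVNULL) == 0

    if __name__ == "__main__":
        for mtu in MTU_CANDIDATES:
            status = "passes" if df_ping_ok(HOST, mtu) else "FAILS"
            print("don't-fragment ping at MTU %d %s to %s" % (mtu, status, HOST))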

Throughput Initiative (Shawn)

  • NetworkMonitoring
  • https://www.usatlas.bnl.gov/dq2/throughput
  • Now there is FTS logging to the DQ2 log page at: http://www.usatlas.bnl.gov/dq2log/dq2log (type in 'fts' and 'id' in the box and search).
  • last week:
    • Meeting notes: see notes in email
    • Illinois asymmetry resolved - could have been an update to a switch.
    • Goal was to get all perfSONAR instances updated to 3.2; good progress. Issues at SLAC (which runs an alternative version for local security reasons).
    • NET2 had one box down for a while
    • All other sites are updated
    • Question about version at BNL
    • Want the Nagios plugins extended to show the perfSONAR version (a sketch of such a check follows this list)
  • this week:
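  • (Illustration) Re the request above to have the Nagios plugins report the perfSONAR version: a Nagios-style check is just a script that prints one status line and exits 0/1/2/3 for OK/WARNING/CRITICAL/UNKNOWN. The Python sketch below shows the shape of such a check; the page it scrapes and the host name are placeholder assumptions, since the real toolkit exposes its version differently depending on release.

    #!/usr/bin/env python
    # Sketch of a Nagios-style check reporting a perfSONAR host's toolkit version.
    # The URL pattern and default host are placeholders; adapt to the local setup.
    import re
    import sys
    import urllib.request

    OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

    def check_version(host, expected="3.2"):
        url = "http://%s/toolkit/" % host     # placeholder page to scrape
        try:
            page = urllib.request.urlopen(url, timeout=10).read().decode(errors="replace")
        except Exception as err:
            return UNKNOWN, "cannot fetch %s: %s" % (url, err)
        match = re.search(r"version\s*([\d.]+)", page, re.IGNORECASE)
        if not match:
            return UNKNOWN, "no version string found at %s" % url
        found = match.group(1)
        if found.startswith(expected):
            return OK, "perfSONAR toolkit version %s" % found
        return WARNING, "perfSONAR toolkit version %s (expected %s.x)" % (found, expected)

    if __name__ == "__main__":
        host = sys.argv[1] if len(sys.argv) > 1 else "ps.example.edu"  # placeholder
        code, message = check_version(host)
        print(message)
        sys.exit(code)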

Site news and issues (all sites)

  • T1:
    • last week(s): Extensive data replication to BNL has seen rates of > 2 TB/hour, filling up the site. This has been quite a stress on services and people; learning a lot. Regarding space, in the process of procuring another 2 PB of disk (estimates are coming up short). This will put BNL at 10 PB of disk by the end of the year. Pedro, leader of the storage management group, has left; while a capable and well-plugged-in replacement is sought, Hiro will lead the group. The group has been re-constituted with systems and database expertise (Carlos, Jason). Will be moving to Chimera on a timescale of a year.
    • this week: Repro winding down; did more than 30%. Discussing effects of full pools.

  • AGLT2:
    • last week: discovered 1% data loss at UM; the local network was completely disrupted (spanning-tree topology lost). MSU has received all worker nodes (2 racks). Upgrading the LAN with larger 10G switches.
    • this week: 64 new compute nodes at MSU - change in network configuration caused by firmware update. Hope to have nodes running next week. Juniper switches used for WAN connection, tested very well during SC.

  • NET2:
    • last week(s): Yesterday running short on DATADISK; have been doing deletions, now 55 TB free. Electrical work completed; ready to plug in another rack. A networking incident at Harvard last week on Veterans Day disconnected a couple of Nehalem racks; they responded quickly, the racks came back online, and only 1/3 of the jobs were lost. Problems with CondorG not updating quickly enough, causing a backlog; this happened before and was thought fixed by Jamie, but now it looks out of sync again. Scaled up the analysis queue at HU, with adjustments to lsm to use a different server (~700 slots).
    • this week: All is well. Finding about 1/3 of analysis jobs failing, some due to local scratch filling. What about HU analysis jobs? (Apparently brokered off; John is reconfiguring the network between the lsm box and the GPFS volume server.)

  • MWT2:
    • last week(s): A couple of problems over the week: an LFC database index issue and a pnfs-manager load issue.
    • this week: Finalized purchase orders for worker nodes at IU and UC (42+56 R410 servers). We are seeing high Java memory usage on dCache pool nodes; the cause is uncertain, but several investigations are underway. Keeping things stable in the meantime. Downtime scheduled for December 1.
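    • (Illustration) A minimal Python sketch of the kind of memory check the pool-node investigation involves: sample the resident set size of the dCache Java processes from /proc and watch it over time. Matching processes on "dcache" in the command line is an assumption; adjust to the local deployment.

      #!/usr/bin/env python
      # Minimal sketch: report resident memory (VmRSS) of Java processes whose
      # command line mentions dCache.  Matching on "dcache" is an assumption.
      import os

      def java_rss_mb(match="dcache"):
          results = {}
          for pid in filter(str.isdigit, os.listdir("/proc")):
              try:
                  with open("/proc/%s/cmdline" % pid, "rb") as f:
                      cmdline = f.read().replace(b"\0", b" ").decode(errors="replace")
                  if "java" not in cmdline or match not in cmdline.lower():
                      continue
                  with open("/proc/%s/status" % pid) as f:
                      for line in f:
                          if line.startswith("VmRSS:"):
                              results[pid] = int(line.split()[1]) // 1024  # kB -> MB
              except (IOError, OSError):
                  continue  # process exited while we were looking
          return results

      if __name__ == "__main__":
          for pid, rss in sorted(java_rss_mb().items()):
              print("pid %s: %d MB resident" % (pid, rss))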

  • SWT2 (UTA):
    • last week: SRM issue at UTA_SWT2 - needed more file descriptors (increased the ulimit). Unsure as to the cause; could it be due to the central deletion service? ANALY cluster at CPB: problem with storage, restarted. Working on joining the federated xrootd cluster; Wei: need to follow up with the Berkeley team.
    • this week: transfer failures due to timeouts; very large files (12+ GB) are taking too long. Patrick is working on the FTS channel timeout settings. The reprocessing tasks producing these files were not configured well.
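    • (Illustration) A back-of-the-envelope estimate, in Python, of the per-file transfer timeout such large files need. The assumed per-file throughput of 20 MB/s is a placeholder, not a measured SWT2 number; the point is simply that a 12 GB file at modest throughput easily exceeds a timeout tuned for few-GB files.

      #!/usr/bin/env python
      # Back-of-the-envelope: minimum sensible per-file transfer timeout.
      # The assumed per-file throughput is a placeholder, not a measured number.
      def required_timeout_s(file_size_gb, throughput_mb_s, safety_factor=2.0):
          """Seconds a transfer of file_size_gb needs at throughput_mb_s, padded."""
          seconds = (file_size_gb * 1024.0) / throughput_mb_s
          return seconds * safety_factor

      if __name__ == "__main__":
          for size_gb in (2, 5, 12):
              t = required_timeout_s(size_gb, throughput_mb_s=20.0)
              print("%4.0f GB file at 20 MB/s: allow at least %5.0f s (~%.0f min)"
                    % (size_gb, t, t / 60.0))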

  • SWT2 (OU):
    • last week: all is well. Working with Dell on new quotes for R410s (number not yet certain).
    • this week: all is fine. 19 new R410's.

  • WT2:
    • last week(s): Having problems with a data server going down over the past week for unknown reasons. Problems with an xrootd redirector that Andy will fix; finding the redirector slowing down in certain situations, so using a replacement from Andy.
    • this week: everything is running fine. Asking the systems folks to install the Dell nodes.

Carryover issues ( any updates?)

Release installation, validation (Xin)

The issue of the validation process, completeness of releases at sites, etc.
  • last report
    • See slides attached below
    • goal is to unify methods with other clouds
    • https://atlas-install.roma1.infn.it/atlas_install/ - site admins can subscribe, and get notified
    • Has double checked with Admins as to the install queues, passed onto Alessandro
    • 16.3.0 will be a dry run - to validate
    • Waiting on WT2 after fileserver is moved.
    • A list of old releases is being prepared for deletion. Is there a space crunch at sites?
      • AGLT2 - ~TB free
      • BU - ~several hundred GB, HU several TB
      • MWT2 - 674 GB free
      • UTA - several TB free
      • OU - no issues
      • WT2 - no issue
      • Kaushik will forward a list of releases that are no longer in use
      • Candidate list of releases that could be cleaned up.
  • this meeting:
    • Went through all the sites; all should be okay following the accidental deletion of gcc this morning
    • Going forward will do one site at a time
    • SLAC need

HEPSpec 2006 (Bob)

last week(s):
this week:
  • Table has been updated with measurements

Conditions data access from Tier 2, Tier 3 (Fred, John DeStefano)

libdcap & direct access (Charles)

last report(s):
  • Rebuilt libdcap locally
  • Current HC performance, http://hammercloud.cern.ch/atlas/10000957/test/
  • Comparing to previous results with dcap++
  • Looking at messaging between dcap client and server
  • Most problems with prun jobs (pathena seems robust)
  • Trying to get dcap++ merged into official dcap sources; libdcap has diverged.
  • Continuing to track down stalls in analysis jobs - leading to hangs in dcap.
  • A newline was causing a buffer overflow.
  • Chipping away at this on the job side.
  • dcap++ and newer stock libdcap could be merged at some point.
  • Maximum active movers needs to be high - 1000.
  • Submitted patch to dcache.org for review
  • No estimate on libdcap++ inclusion in to the release
  • No new failure modes observed

this meeting:

AOB

  • last week
    • Reminder: all T2's to submit DYNES applications by end of the month. Template to become available.
  • this week
    • May not have meeting next week - SW dinner.


-- RobertGardner - 24 Nov 2010
