
MinutesMay11

Introduction

Minutes of the Facilities Integration Program meeting, May 11, 2011
  • Previous meetings and background: IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • 866-740-1260, Access code: 7027475

Audio Details: Dial-in Number:
U.S. & Canada: 866.740.1260
U.S. Toll: 303.248.0285
Access Code: 7027475; Chair passcode: 8734
Registration Link: https://cc.readytalk.com/r/bd2w3deu2kkg

Attending

  • Meeting attendees: Rob, Michael, John, Dave, Booker, Jason, Sarah, Torre, Armen, Kaushik, Mark, Nate, Charles, Doug, Wei, Saul, Patrick, Horst, Bob, Joe, Shawn
  • Apologies: Fred

Integration program update (Rob, Michael)

OSG Opportunistic Access

last week
  • Reached out to Derick this week
  • HCC running at MWT2_IU and OSCER
  • NE: No progress
  • SW: Enabled to run on the production cluster; no HCC jobs yet, but the cluster is full with SAMGrid jobs
  • SLAC: Requires GlideIns, which means outbound connections need to be enabled; working this out with the networking folks
  • OSG Council meeting coming up, goal is to report HCC jobs working across the facility by May 17th
this week
  • HCC access somewhat delayed due to reduced workload.
  • Time to initiate another VO (Michael)

Operations overview: Production and Analysis (Kaushik)

Data Management and Storage Validation (Armen)

Shifters report (Mark)

  • Reference
  • last meeting: Operations summary:
    Yuri's summary from the weekly ADCoS meeting:
    http://www-hep.uta.edu/~sosebee/ADCoS/ADCoS-status-summary-5_2_11.txt
    
    1)  4/27: OU_OCHEP_SWT2 - file transfer errors like "failed to contact on remote SRM."  From Horst: Our Bestman got hung up for some reason, but our monitoring 
    caught it and auto-restarted it, and everything seems to be fine again now.  ggus 70053 / RT 19908 closed, eLog 24821.
    2)  4/27: BNL - file transfer failures with the error "The certificate has expired: Credential with subject: /DC=org/DC=doegrids/OU=Services/CN=dcsrm.usatlas.bnl.gov 
    has expired."  From Hiro: Host certificate of srm was renewed on April 15th, which is 12 days ago. So, we are not sure why it shows errors today if any. It seems that 
    it has been cached somewhere.  Anyhow, the restart of the service fixed the problem.  ggus 70067 closed, eLog 24831.
    3)  4/27: LFC errors at all US cloud sites.  Issue was due to a new release of the lcg-vomscerts package for the CERN VOMS server.  All US sites had installed the 
    update as of early afternoon.  The rpm is located at:
    http://etics-repository.cern.ch/repository/download/registered/org.glite/lcg-vomscerts/6.4.0/noarch/lcg-vomscerts-6.4.0-1.slc4.noarch.rpm. 
    4)  4/28: OU_OCHEP_SWT2 - large number of job failures with the error "an unknown exception occurred."  Many of the log files contained the entry "problem 
    running chappy!"  Since (i) some of the jobs from the same task had earlier finished successfully at the site, and (ii) there was a problem trying to install the 16.6.4.3 
    cache, it was decided that possibly the 16.6.4 atlas s/w installation was somehow corrupted.  Alessandro/Lorenzo re-installed it (including 16.6.4.3), so this may have 
    solved the problem.  Hard to say for certain since by now the jobs had cycled through the system.  ggus 70081 / RT 19923 closed, eLog 24855.
    5)  4/28: MWT2_UC  - problematic jobs with "Failed to get LFC replicas" errors.  Issue understood - from Aaron:  This was due to a single worker node with the wrong 
    system time. It had drifted many hours. Since the CRLs were just updated, it was refusing to secure a connection to our LFC. Once the time was updated, connections 
    are working again.  ggus 70112 closed, eLog 24896.
    6)  4/28-4/29: ILLINOISHEP_DATADISK - DDM transfer failures with the error "source file doesn't exist."  Possible that the dataset was deleted from the site.  Hiro 
    restored it, and the error went away.  ggus 70104, eLog 24898.
    7)  4/29: SLACXRD - file transfer failures with "failed to contact on remote SRM" errors.  ggus 69917 was re-opened.  Update 5/1 from Wei: I requested changes at 
    related FTS channels yesterday and I don't see this error again today.  eLog 24891.
    Update 5/4: Errors no longer appear, the failed transfers are done now.  ggus 69917 closed, eLog 25032.
    8)  5/2: UTD_HOTDISK file transfer errors ("failed to contact on remote SRM [httpg://fester.utdallas.edu:8446/srm/v2/server]").  From Joe: Hardware failure on our 
    system disk. Currently running with a spare having out of date certificates. Our sys-admin is working on the problem.  ggus 70196 in-progress, eLog 24971.
    9)  5/3: New pilot version (SULU 47b) from Paul.  Details here:
    http://www-hep.uta.edu/~sosebee/ADCoS/pilot-version_SULU_47b.html
    10)  5/4 early a.m.: SWT2_CPB - problem with the NIC (cooling fan) in a dataserver took the host off-line.  Problem should now be fixed.  ggus 70266 / RT 19949 will 
    be closed once we verify transfers are succeeding.  eLog 25046.
    11)  5/4: OU_OCHEP_SWT2_PRODDISK - file transfer failures due to checksum errors ("[INTERNAL_ERROR] Destination file/user checksum mismatch]").  Horst & Hiro 
    are investigating.  https://savannah.cern.ch/bugs/index.php?81834, eLog 25039.
    
    Follow-ups from earlier reports:
    (i)  4/8: NERSC - file transfer errors.  See ggus 69526 (in-progress), eLog 24176.
    Update 4/19: some progress has been made on understanding the issue(s) - will close this ticket once it appears everything is working correctly.
    (ii)  4/8: OU_OSCER_ATLAS - still see intermittent job failures with segfault errors.  Site was set off-line 4/11 due to a spike in the failure rate.  Discussed in: 
    https://savannah.cern.ch/support/?120307 (site exclusion), ggus 69558 / RT 19757, eLog 24133/92, https://savannah.cern.ch/bugs/index.php?79656.
    
    • OSG security pushed back on lcg-vomscerts vs permanent files, small enough of an issue that we can handle it ourselves
  • this meeting: Operations summary:
    Yuri's summary from the weekly ADCoS meeting:
    http://www-hep.uta.edu/~sosebee/ADCoS/ADCoS-status-summary-5_9_2011.html
    
    1)  5/4: File transfer failures at SMU with errors like "[GENERAL_FAILURE] Error:/bin/mkdir: cannot create directory ..... Permission denied".  Issue resolved by 
    Justin (site admin) - ggus 70279 closed, eLog 25052.
    2)  5/4: SE problem at BNL (file transfer failures with SRM timeouts) - issues resolved by rebooting a server.  ggus 70283 closed, eLog 25056/60.
    3)  5/4 - 5/5: AGLT2 - Initially file transfers failures with "PSQLException: ERROR: could not access status of transaction 0; Detail: Could not write to file 
    "pg_subtrans/1F9B" at offset 188416: No space left on device."  From Shawn: The postgresql partition hosting the dCache billingdb has filled. We are cleaning it.  
    Later in the day jobs were failing due to a local user filling the /tmp partition on some WNs.  User contacted - issue resolved.  ggus 70251 closed, eLog 25071.
    4)  5/6 p.m. - 5/9 a.m. - SLAC power outage - from Wei: This is a scheduled power outage to bring additional power to SLAC computer center. During the outage, 
    all ATLAS resources at SLAC, including those belonging to the SLAC ATLAS department, will be unavailable.  Work completed as of ~1:00 p.m. PST.  eLog 25125.
    https://savannah.cern.ch/support/index.php?120808.
    5)  5/8 - 5/9: AGLT2 - file transfer failures, due to networking issue at the MSU site.  Queues set off-line for a period of time
    until the problem was resolved.  Once networking was restored, test jobs were submitted and completed successfully - the site was whitelisted in DDM and queues set back on-line.  
    ggus 70361 / RT 19968 closed, eLog 25210.
    6)  5/10: BNL - network maintenance (8:00 a.m. EDT => 12:00 noon) completed successfully.  eLog 25196.
    7)  5/10 a.m. - from Saul at NET2: We had about 500 jobs fail at BU_ATLAS_Tier2o last night due to a bad node.   It's now off-line.  Later in the evening, from John: 
    We're swapping around some internal disk behind atlasproddisk, and we're draining the queues so that we can make the final switchover tomorrow morning after 
    the sites have quiesced.  Panda queues set to 'brokeroff'.
    8)  5/11: WISC DDM failures.  Blacklisted in DDM: https://savannah.cern.ch/support/index.php?120901.  ggus 70467.  Issue is a cooling system problem in a data center.  
    (Also, there seem to be some older, still open Savannah tickets related to DDM errors at the site?)
    
    Follow-ups from earlier reports:
    (i)  4/8: NERSC - file transfer errors.  See ggus 69526 (in-progress), eLog 24176.
    Update 4/19: some progress has been made on understanding the issue(s) - will close this ticket once it appears everything is working correctly.
    (ii)  4/8: OU_OSCER_ATLAS - still see intermittent job failures with segfault errors.  Site was set off-line 4/11 due to a spike in the failure rate.  Discussed in: 
    https://savannah.cern.ch/support/?120307 (site exclusion), ggus 69558 / RT 19757, eLog 24133/92, https://savannah.cern.ch/bugs/index.php?79656.
    (iii)  5/2: UTD_HOTDISK file transfer errors ("failed to contact on remote SRM [httpg://fester.utdallas.edu:8446/srm/v2/server]").  From Joe: Hardware failure on our 
    system disk. Currently running with a spare having out of date certificates. Our sys-admin is working on the problem.  ggus 70196 in-progress, eLog 24971.
    Update 5/10: Site admin reported that UTD was ready for new test jobs, but they failed with "Required CMTCONFIG (i686-slc5-gcc43-opt) incompatible with that of 
    local system (local cmtconfig not set)" (evgen jobs) and missing input dataset (G4sim).  Under investigation.   https://savannah.cern.ch/support/?120588, eLog 25250.
    (iv)  5/4 early a.m.: SWT2_CPB - problem with the NIC (cooling fan) in a dataserver took the host off-line.  Problem should now be fixed.  ggus 70266 / RT 19949 will 
    be closed once we verify transfers are succeeding.  eLog 25046.
    Update 5/4 p.m.: successful transfers for several hours after the hardware fix - ggus / RT tickets closed, eLog 25046.
    (v)  5/4: OU_OCHEP_SWT2_PRODDISK - file transfer failures due to checksum errors ("[INTERNAL_ERROR] Destination file/user checksum mismatch]").  Horst & Hiro 
    are investigating.  https://savannah.cern.ch/bugs/index.php?81834, eLog 25039.
    
    
    • No major issues this week
    • The opportunistic OSCER site at OU has been offline for quite a while; it will be turned back on.
    • ddm-ops mailing list

DDM Operations (Hiro)

Throughput and Networking (Shawn)

  • NetworkMonitoring
  • https://www.usatlas.bnl.gov/dq2/throughput
  • Now there is FTS logging to the DQ2 log page at: http://www.usatlas.bnl.gov/dq2log/dq2log (type in 'fts' and 'id' in the box and search).
  • last week:
    • No meeting this week
    • LHCONE meeting next week
  • this week:
    • Bi-weekly meeting yesterday. Discussed what we're going to do this quarter - see the site certification matrix.
    • Quarterly cleaning of the perfSONAR database.
    • The MWT2-IU and AGLT2 problem seems to have cleared: latency is symmetric and throughput is back up. Still an issue at OU - Jason is investigating in depth.
    • Site-related items under other business.
    • Work on a modular version of the perfSONAR dashboard is ongoing (Tomaz). It will probably create a hierarchy of pages, then go down to the cloud level.
    • ADC development monitoring presentation - goal is to have the Italian cloud instrumented by the next software week.
    • LHCONE meeting tomorrow in Washington, hosted by I2.

HTPC configuration for AthenaMP testing (Horst, Dave)

last week
  • When the pilot requests a whole node, efficiency is lost while all job slots on the node drain; the pilot then starts up, runs, and exits, freeing the node again - very inefficient (a rough estimate is sketched below)
  • More efficiency is possible if we keep multi-core jobs on dedicated multi-core nodes and do not allow single-core jobs to run on those nodes
  • A Panda queue for the OU HEP MP cluster has been set up; Horst has not had time to test it yet, but it is likely ready for testing
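The drain cost noted above can be put in rough numbers. The sketch below is illustrative only and not from the meeting: the node size, the exponential model for remaining job runtimes, and the mean job length are all assumptions chosen for the example.

    import random

    def expected_drain_waste(cores=8, mean_job_hours=6.0, trials=10000):
        """Estimate idle core-hours accumulated while draining a node so a
        whole-node multi-core pilot can start. Assumes each slot holds a
        single-core job whose remaining runtime is exponentially distributed
        with the given mean (an assumption, not a measured job profile)."""
        total_waste = 0.0
        for _ in range(trials):
            remaining = [random.expovariate(1.0 / mean_job_hours) for _ in range(cores)]
            drain_time = max(remaining)  # node is free only when the last job ends
            total_waste += sum(drain_time - r for r in remaining)  # idle core-hours
        return total_waste / trials

    if __name__ == "__main__":
        print("Average idle core-hours per drained 8-core node: %.1f" % expected_drain_waste())

With these assumed numbers the drain leaves on the order of 80 idle core-hours per node, which is the kind of loss avoided by keeping multi-core jobs on dedicated multi-core nodes.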
this week
  • Horst - trying to enable the OU ITB site
  • Dave - Doug sent jobs to Illinois - there seemed to be a case-sensitivity issue. Also waiting for a new release of AthenaMP.

Python + LFC bindings, clients (Charles)

last week(s):
this week:
  • wlcg-client-lite is being deprecated
  • Still waiting on VDT for wlcg-client and wn-client
  • Question: could CVMFS be used to distribute software more broadly?
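For reference on the Python LFC bindings discussed in this section, here is a minimal usage sketch. It is not from the meeting: the LFC host and LFN path are made-up placeholders, and it assumes the lfc module shipped with the LFC client tools (e.g. via wlcg-client) and its lfc_getreplica lookup.

    import os
    import lfc  # Python bindings distributed with the LFC client

    # Hypothetical LFC endpoint and LFN path, for illustration only.
    os.environ["LFC_HOST"] = "lfc.example.org"
    lfn = "/grid/atlas/users/example/example.file.root"

    # Look up the replicas registered for this logical file name.
    rc, replicas = lfc.lfc_getreplica(lfn, "", "")
    if rc == 0:
        for rep in replicas:
            print(rep.sfn)  # physical replica (SURL) on a storage element
    else:
        print("lfc_getreplica failed, serrno=%d" % lfc.cvar.serrno)

This is the kind of client functionality that wlcg-client currently provides and that any more broadly distributed (e.g. CVMFS-based) client installation would also need to cover.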

WLCG accounting (Karthik)

last week:
  • Following up on interoperability issues when changing the OSG config file; Doug asked Tony about this and is waiting to talk to a new employee who won't start until later this month
  • Requires input from WLCG folks; the next discussion will happen at the interoperability meeting on May 20th, to work out the implications of downstream changes
  • Question of whether USCMS has similar issues with CPU types/speeds as USATLAS does w.r.t. WLCG accounting stats

this week:

  • Sites reporting are within 5%.
  • No progress expected on the hyperthreading report
  • UTA - Gratia numbers are coming in low; tracking down a systemic reporting issue
  • NET2 - will look into this, also making a comprehensive comparison with WLCG.
  • Michael - WLCG has defined efficiency figures for T1 and T2 - it was set to 60% years ago. Discussion yesterday at WLCG MB meeting - proposal to increase to 70%.

CVMFS

See TestingCVMFS

last week:

  • All MWT2 queues are running CVMFS, no significant load on squid
  • Request: Update the TestingCVMFS page to reflect changes made in schedconfig panda SVN
  • The Puppet module as published will be updated this afternoon to add a modprobe of the fuse module, if not already loaded, before autofs runs
  • AGLT2 CVMFS: very close to rebuilding worker nodes with CVMFS, likely ready by Friday
  • SLAC: needs a few more hard drives for a CVMFS local cache
  • NE: planning to install, especially on the Tier3
  • SW: basic squid set up, not production-ready, test machine with CVMFS, new queue to test against it
  • Questions: what is the plan for the PFC files? Does the CVMFS_NFILES param in /etc/cvmfs/default.conf limit the size the cache will grow to?
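As a concrete companion to the cache question above, here is a minimal sketch (not from the meeting) that reports the cache-related settings from the CVMFS configuration files. The file paths and the CVMFS_QUOTA_LIMIT / CVMFS_CACHE_BASE names follow standard CVMFS configuration; whether CVMFS_NFILES actually caps cache growth is exactly the question left open above, so the script only reports what is configured.

    import os
    import re

    # Configuration files consulted in increasing precedence:
    # default.local overrides default.conf in a standard CVMFS layout.
    CONFIG_FILES = ["/etc/cvmfs/default.conf", "/etc/cvmfs/default.local"]

    # Parameters of interest: CVMFS_NFILES (raised in the question above),
    # plus the usual cache-size and cache-location settings.
    KEYS = ("CVMFS_NFILES", "CVMFS_QUOTA_LIMIT", "CVMFS_CACHE_BASE")

    def read_settings(files=CONFIG_FILES, keys=KEYS):
        """Return the effective value of each key, with later files winning."""
        pattern = re.compile(r'^\s*(%s)\s*=\s*"?([^"#\s]+)"?' % "|".join(keys))
        settings = {}
        for path in files:
            if not os.path.exists(path):
                continue
            for line in open(path):
                match = pattern.match(line)
                if match:
                    settings[match.group(1)] = match.group(2)
        return settings

    if __name__ == "__main__":
        for key, value in sorted(read_settings().items()):
            print("%s = %s" % (key, value))

This only shows what is configured on a node; it does not by itself answer whether CVMFS_NFILES limits cache growth.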
this week:
  • CVMFS Tests at MWT2
  • Site status: AGLT2 will deploy as worker nodes are re-installed; working at Illinois
  • Alden - what about BDII publishing? How should the brokerage respond? Will require changes by Alden and Tadashi to work out brokerage. Torre: we can deal with this at the adc-panda level.
  • Doug had a question for John: should reverse proxies be put at the BNL site?
  • Michael: FNAL switched to CVMFS for software distribution, experienced problems, and backed out. This happened during normal production. He suggests continuing to run stress tests with large numbers of jobs starting. We'll need to be careful.

Federated Xrootd at sites: Tier 3 (Doug), Tier 2 (Charles)

last week(s):
  • no update
this week:
  • See Charles' email from this morning. Sites need to update the plugin; once updated, a second round of testing will start.
  • Doug - working with Andy and Wei.

Tier 3 Integration Program (Doug Benjamin)

Tier 3 References:
  • Links to the ATLAS T3 working group TWikis are here
  • T3g Setup guide is here
  • Users' guide to T3g is here
  • US ATLAS Tier3 RT Tickets

last week(s):

  • Doug not here this week

this week:

  • Setup of CVMFS with ATLAS releases is about 70% complete. Conditions data has been put there.
  • Need a Tier 3 production site policy document - will take this up with management.
  • Looking at pre-release testing of xrootdfs.

Tier 3GS site reports (Doug Benjamin, Joe, AK, Taeksu)

last week:
  • Doug not here

this week:

  • UTD: Joe: problems on gatekeeper node, required disk swaps.

Site news and issues (all sites)

  • T1:
    • last week:
      • GLExec autopilot factory work going on, will go into production soon
      • Xrootd Global redirector to be moved to BNL, Hiro pursuing this
      • Chimera migration for dcache at BNL
      • Hiro testing direct-access method from dcache with new panda queue
      • Federated ID management with CERN using shibboleth with trust relationships between BNL and CERN
    • this week:

  • AGLT2:
    • last week(s):
      • LFC number of threads increased from 20 to 99
      • The BillingDB filled up a partition on dCache; cleanup is now automated so that this will no longer occur
      • Meeting with Dell next week with regard to SSDs, hope for better pricing
      • Met with MWT2 to discuss LSM and pcache
      • Revisiting testing of direct-access methods
      • Plan to deploy CVMFS with the new Rocks build, likely complete today; rolling re-builds will begin later in the week
      • Testing NexSAN, doing iozone testing; issues with 60-unit rack-mounting; improvement over Dell in density and performance
    • this week:

  • NET2:
    • last week(s):
      • Relatively smooth operations
      • Tier3 work: operational now
      • Focused on local IO ramp-up: joining GPFS volumes complete, rebalancing of files, good performance from GPFS
      • Harvard directly mounting GPFS with clustered NFS with local site mover
      • Getting another 10Gb link from campus folks
      • Adding GridFTP instances at BU
      • Upgrading OSG, upgrading to BestMan2, moving to CVMFS
      • Purchasing more storage, 2 racks' worth: ~300TB per rack, ending up at 1.5PB by July
    • this week:

  • MWT2:
    • last week:
      • UC: Completed our move, moved to CVMFS
      • IU:
      • Illinois:
    • this week:

  • SWT2 (UTA):
    • last week:
      • Looking into CVMFS
      • A storage server NIC went away, which caused some problems, but things are back to normal now
      • Partial outage on Sunday due to an 8-hour power outage; the generator should come on without issues, but 2 racks of SWT2_CPB workers will be affected
      • Rolled back the BestMan2 version; ran the new version on a second node with no problems. In the meantime a newer version was released; will test it soon and then take a downtime to move to the latest BestMan2 and a newer OSG stack, and also spin up another 200TB of disk
    • this week:

  • SWT2 (OU):
    • last week:
      • A glitch with a release getting corrupted; deleting and re-installing it fixed the problem
      • CVMFS testing ongoing, hope to move to CVMFS next week
      • Working on MP cluster/queue at OU
    • this week:

  • WT2:
    • last week(s):
      • Upgraded to the latest BestMan; it is working and fixes a number of issues
      • Alex will release the latest BestMan to OSG by the end of the week, including a plugin to dynamically add/remove GridFTP servers
      • 2-3 day power outage starting Friday afternoon through Sunday or Monday, to bring more power to the building
      • After the power outage, will start OS installation on the new compute nodes once power is delivered to them
    • this week:

Carryover issues (any updates?)

AOB

last week
this week


-- RobertGardner - 07 May 2011
