
MinutesApr11

Introduction

Minutes of the Facilities Integration Program meeting, April 11, 2012
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
  • Our Skype-capable conference line (press 6 to mute): announce yourself in a quiet moment after you connect
    • USA Toll-Free: (877)336-1839
    • USA Caller Paid/International Toll : (636)651-0008
    • ACCESS CODE: 3444755

Attending

  • Meeting attendees: Bob, Michael, John DeStefano, Horst, Tom, Shawn, Sarah, Dave, Rob, Lincoln, Nate, Ilija, Wei, Andy, Doug, Saul, Fred, Booker, Patrick, Alden, Armen, Kaushik, Mark
  • Apologies: Jason
  • Guests: Dan Bradley, U Wisconsin (US CMS) , Matevz (at FAX workshop)

Integration program update (Rob, Michael)

  • Special meetings
    • Tuesday (12 noon CDT, bi-weekly - convened by Kaushik) : Data management
    • Tuesday (2pm CDT, bi-weekly - convened by Shawn): Throughput meetings
    • Wednesday (1pm CDT, bi-weekly - convened by Wei or Rob): Federated Xrootd
  • Upcoming related meetings:
  • For reference:
  • Program notes:
    • last week(s)
      • GuideToJIRA - will use to coordinate perfSONAR and OSG 3.0 deployments this quarter.
      • IntegrationPhase20, SiteCertificationP20
      • Nearing end of quarter, therefore SiteCertificationP20 will need updating
      • Enabling multicore scheduling in the US Facility: more resources are needed to test out the environment for AthenaMP. Take this as an opportunity to step up and provide this service, dynamically adjusting resources according to need. We'll need to work up recipes for the various sites and collaborate on this. The farm group at BNL is starting on a recipe for Condor (a minimal sketch follows at the end of this section).
      • VmemSitesSurvey - nearly filled in. Don't want to provide high mem queues initially.
      • LFC consolidation, report is in: ConsolidatingLFCStudyGroupUS, report document
        • Hiro - will need to look at it this week. Patrick.
        • Michael - report is good, distributed leadership to help provide guidance to facility
      • Capacity summary:
        • MWT2 - will ramp to 2.4 PB within April. Presently 1.8 PB usable.
        • AGLT2 - 2160 TB online. 90 TB additional coming at MSU, for 2250 TB total.
        • NET2 - 1.2 PB; two additional racks will make 2.2 PB; under test. Will be ready well before end of April.
        • SWT2 - waiting for the UPS to be set up, plus additional power. Racks are in place to reach the 2.2 PB pledge; to be installed. Awaiting final UPS hook-up (April 2). End of April will be tight; not certain.
        • WT2 - Only 50 TB short (2150 TB usable is available now), don't have immediate plans to purchase storage. Building up area for low-density storage. Will be ordering 10 TB of SSDs.
        • BNL - both CPU and disk at pledge level
      • All hands meeting follow-on comments
        • Good meeting, US ATLAS-specific issues
        • Joint sessions with US CMS, and common all-hands meeting agenda - seemed to work well.
        • Redefinition of production operations (Kaushik and Michael)
        • Usage and accounting for resources beyond pledge - will need to follow this up, Kaushik will be leading this
        • Optimization of resources, Tier 1 / Tier 2 - as discussed at the last review. Raised at the ATLAS ICB - needs to be followed up, and will be brought up at CREM. There is a lack of accounting of the general use of resources, which will lead to new accounting requirements. A monitoring group within ADC is collecting requirements as well - it will be an ATLAS-wide activity, and it is underway.
        • Site performance metrics - we discussed this a few weeks ago; Michael and Rob are working out the details and will present a proposal soon.
        • Network monitoring services - should we widen our work beyond US ATLAS? Shawn, Jason, Lothar B, and Miron discussed this at a side meeting.
        • WLCG TEG groups are ramping up their activities - recommendations and draft reports are coming together. The data management and storage management TEG outlines a number of recommendations; it might be interesting to go over them. Michael will circulate the report for a short discussion at the next meeting.
        • LHCONE is making good progress - we have peering between LHCONE and ESnet at SLAC and BNL; initial measurements to Naples showed no improvement at first, but are now coming along. Would like to see at least two other sites, for example the AGLT2 sites, join this activity soon.
        • LHC is getting close to providing collisions
    • this week
      • FAX workshop in progress
      • The Frontier setup misconfiguration: can it be monitored from the client side? We need to investigate this (a rough client-side check is sketched at the end of this section).
      • We are back in data taking mode - the LHC is performing well - so our primary goal is stability and good performance. Make sure communication is prompt - tickets should be responded to within 12 hours.
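
Condor multicore recipe sketch (re the AthenaMP item above): a minimal illustration, assuming a pool configured with partitionable slots. The submit-file contents, wrapper script name, and 8-core request are illustrative placeholders, not the BNL recipe.

    #!/usr/bin/env python
    # Minimal sketch of submitting an 8-core (AthenaMP-style) Condor job.
    # Assumes the worker nodes advertise partitionable slots, e.g. in condor_config:
    #   NUM_SLOTS = 1
    #   NUM_SLOTS_TYPE_1 = 1
    #   SLOT_TYPE_1 = cpus=100%
    #   SLOT_TYPE_1_PARTITIONABLE = True
    # The wrapper script name and core count below are placeholders.
    import subprocess

    SUBMIT = """\
    universe     = vanilla
    executable   = run_athenamp.sh
    request_cpus = 8
    output       = mcore.$(Cluster).out
    error        = mcore.$(Cluster).err
    log          = mcore.log
    queue
    """

    def main():
        with open("mcore.sub", "w") as f:
            f.write(SUBMIT)
        # condor_submit returns non-zero on failure; surface that to the caller.
        return subprocess.call(["condor_submit", "mcore.sub"])

    if __name__ == "__main__":
        raise SystemExit(main())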
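
Client-side Frontier check (re the "this week" item above): a rough sketch that parses the FRONTIER_SERVER string a job sees and tests basic reachability of each listed server and squid proxy. The environment-variable format shown in the comment is the usual one, but treat the whole script as an assumption rather than an agreed monitoring interface.

    #!/usr/bin/env python
    # Rough client-side Frontier sanity check: pull serverurl/proxyurl entries
    # out of FRONTIER_SERVER and test TCP reachability of each endpoint.
    # Assumed format: (serverurl=http://...)(proxyurl=http://squid.example:3128)...
    import os
    import re
    import socket

    def endpoints(frontier_server):
        """Yield (kind, host, port) for every serverurl/proxyurl entry."""
        for kind, url in re.findall(r"\((serverurl|proxyurl)=([^)]+)\)", frontier_server):
            m = re.match(r"https?://([^:/]+)(?::(\d+))?", url)
            if m:
                yield kind, m.group(1), int(m.group(2) or 80)

    def reachable(host, port, timeout=5):
        try:
            socket.create_connection((host, port), timeout).close()
            return True
        except socket.error:
            return False

    if __name__ == "__main__":
        for kind, host, port in endpoints(os.environ.get("FRONTIER_SERVER", "")):
            status = "ok" if reachable(host, port) else "UNREACHABLE"
            print("%-9s %s:%d  %s" % (kind, host, port, status))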

CMS Opportunistic Access (Dan Bradley)

  • SupportingCMS
  • Squids would be good
  • CMS software is accessed via CVMFS, mounted through Parrot (which uses the ptrace mechanism) rather than FUSE; the performance hit is small, < 5%. A launch sketch follows this list.
  • Michael: Outbound connectivity - can you use a proxy rather than port 80? Test jobs at BNL have been shown to work.
  • Patrick: which accounts? Do you expect to use pool accounts? Dan: not required, but they wouldn't hurt. A small pool of DNs would come in; a group account would work fine.
  • SLAC might be a problem - given that outputs go to an arbitrary set of destination VOs
  • Reciprocal arrangements are possible with CMS resources - in particular UW. Michael notes that any virtualized environments would be of interest.
  • We need to figure out the corresponding set of accounts, and the gums configuration.
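
Parrot/CVMFS launch sketch (re the CVMFS item above): a minimal illustration of running a command with the CMS CVMFS repository visible through parrot_run (from cctools). The repository URL, public key path, squid proxy, and option string below are placeholders; the exact syntax should be taken from the cctools/CVMFS documentation.

    #!/usr/bin/env python
    # Sketch: make /cvmfs/cms.cern.ch visible to a command via Parrot's
    # ptrace-based interception (no FUSE mount needed on the worker node).
    # Repo URL, pubkey path and proxy are illustrative placeholders.
    import os
    import subprocess

    env = dict(os.environ)
    env["PARROT_CVMFS_REPO"] = (
        "cms.cern.ch:url=http://cvmfs.example.org/opt/cms,"
        "pubkey=/path/to/cms.cern.ch.pub,"
        "proxies=http://squid.example.org:3128"
    )
    env["HTTP_PROXY"] = "http://squid.example.org:3128"

    # Anything started under parrot_run sees /cvmfs/cms.cern.ch as a normal path.
    cmd = ["parrot_run", "ls", "/cvmfs/cms.cern.ch"]
    raise SystemExit(subprocess.call(cmd, env=env))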

Operations overview: Production and Analysis (Kaushik)

  • Production reference:
  • last meeting(s):
    • Completely full, mc12 keeping us busy
    • Alden has implemented MOU shares in schedconfig; the US is showing 20%, which is a little low.
  • this meeting:

Data Management and Storage Validation (Armen)

Shift Operations (Mark)

  • last meeting: Operations summary:
    Yuri's summary from the weekly ADCoS meeting:
    https://indico.cern.ch/getFile.py/access?contribId=1&resId=0&materialId=slides&confId=185177
    or
    http://www-hep.uta.edu/~sosebee/ADCoS/ADCoS-status-summary-4_2_2012.html
    
    1)  3/29: Saul reported that pcache at NET2 was giving a 'connection refused' error when attempting to contact the panda server.  From Tadashi: 
    It seems that someone rebooted the panda machines at ~3PM, and the pandaserver didn't automatically restart on some machines.  I've restarted them.
    2)  3/29: SMU file transfer errors ("Authorization denied: The name of the remote host (smuosgse.hpc.smu.edu), and the expected name for the remote 
    host (smuosgseint.hpc.smu.edu) do not match. This happens when the name in the host certificate does not match the information obtained from DNS 
    and is often a DNS configuration problem.").  DNS issue at SMU resolved as of 3/30.  ggus 80746 closed, https://savannah.cern.ch/support/?127514 
    (blacklisted in DDM - Savannah site exclusion), eLog 34768.
    3)  3/30: AGLT2 file transfer errors ("failed to contact on remote SRM [httpg://head01.aglt2.org:8443/srm/managerv2]").  From Bob: We had two related 
    issues. First, a dCache pool server crashed last night (6:30pm) and no one was available to see that until around 6:30am. Second, around 4:50am, 
    the partition hosting the dCache DB filled.  Both issues were resolved, and ggus 80755 closed.  Transfer errors reappeared on 3/31 - from Shawn: 
    One of our dCache headnodes was hitting multiple soft CPU lockups starting around 4 AM Eastern time. We have captured some screen shots and 
    power-cycled the system. dCache should have recovered by around 11:15 AM Eastern time.  ggus 80773 closed, eLog 34781.
    4)  4/1: NDGF-T1_DATADISK to MWT2_DATADISK: "Connection timed out" transfer errors.  Not clear what the issue was - the errors went away by 4/3.  
    ggus 80789 closed, eLog 34809.
    5)  4/2: BNL upgraded FTS to version 2.2.8.  eLog 34729/34826/27.  Note from Hiro: The new version of FTS checks the source checksum against 
    the provided checksum, and if it does not match it correctly rejects the transfer as a source error, unlike the old version, which checked the checksum 
    after the transfer and rejected it as a destination error.  (An illustrative comparison is sketched at the end of this summary.)
    6)  4/2: From Saul: We're having a problem with the gatekeeper at NET2 and have to reboot.  SRM is down.  Errors are expected.
    Later: We had a brief but unexpected problem with GPFS, now resolved.  SRM is back.  New storage is on-line.
    7)  4/3: From Stephane Jezequel: FTS overwrite implemented in US and DE.  This means that all DDM FT now use FTS overwrite.  FTS overwrite 
    outside DDM FT for US and DE will be implemented tomorrow (except TAPE and T0 export, as usual).  eLog 34860.
    8)  4/4 early a.m.: SWT2_CPB - file transfer errors, for example: "[INTERNAL_ERROR] Parsing document '/usr/etc/glite-data-srm-util-cpp.patterns.xml' failed. 
    Check that the file exists, is accessible and contains valid xml."  Storage server had to be rebooted (it had crashed) - problem under investigation.  
    ggus 80888 / RT 21875, eLog 34968.  Blacklisted in DDM: http://savannah.cern.ch/support/?127654.
    9)  4/4: BNL voms not accessible with voms command - hangs at the point "Contacting vo.racf.bnl.gov:15003 [/DC=org/DC=doegrids/OU=Services/CN=
    vo.racf.bnl.gov] "atlas" ."  From John Hover at BNL: A service restart fixed the problem.  I'm now checking into why my alert script didn't detect the 
    problem and  restart automatically.  ggus 80902 in-progress, eLog 34908.
    
    Follow-ups from earlier reports:
    (i)  2/29: UTD-HEP set off-line to replace a failed disk.  Savannah site exclusion: https://savannah.cern.ch/support/index.php?126767.  As of 3/7 site 
    reported the problem was fixed.  Test jobs are failing with the error "Put error: nt call last): File , line 10, in ? File /usr/lib/python2.4/site-packages/
    XrdPosix.py, line 5, in ? import _XrdPosix ImportError: /usr/lib/python2.4/site-packages/_XrdPosixmodule.so: undefined symbol: XrdPosix_Truncate."
    Under investigation.  eLog 34259.
    Update 3/12: ggus 80175 was opened for the site due to test jobs failing with the error shown above.  Closed on 3/13 since this issue is being tracked 
    in the Savannah ticket.
    (ii)  3/2: BU_ATLAS_Tier2 - DDM deletion errors (175 over a four-hour period).  ggus 79827 in-progress, eLog 34150.
    Update 3/13: ggus 79827 closed, as this issue is being followed in the new ticket 80214.  eLog 34336.
    (iii)  3/10: Duke - DDM errors (" failed to contact on remote SRM [httpg://atlfs02.phy.duke.edu:8443/srm/v2/server]").  Issue with a fileserver which 
    hosts gridftp & SRM services being investigated.  ggus 80126, eLog 34315.  ggus 80228 also opened on 3/13 for file transfer failures at 
    DUKE_LOCALGROUPDISK.  Tickets cross-referenced.  System marked as 'off-line' while the hardware problem is worked on.  eLog 34343.  
    DDM blacklist ticket: https://savannah.cern.ch/support/index.php?127055
    (iv)  3/19: UTD-HEP - DDM errors ("failed to contact on remote SRM [httpg://fester.utdallas.edu:8443/srm/v2/server]").  ggus 80364 in-progress, 
    eLog 34434, blacklisted in DDM: https://savannah.cern.ch/support/index.php?127205.  (This Savannah site exclusion ticket was unneeded, as the 
    site issues were already covered in 126767 - so the ticket was closed.)
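
    Checksum note (re item 5 above): a small illustration of the source-side comparison Hiro describes - the adler32 of the source file is checked against the provided checksum before the transfer, so a mismatch is classified as a source error rather than a post-transfer destination error. This is illustrative logic only, not the FTS 2.2.8 implementation.

        # Illustration only: verify a user-provided adler32 against the *source*
        # file before transferring, so a mismatch is reported as a SOURCE error.
        import zlib

        def adler32_of(path, blocksize=1024 * 1024):
            """Adler-32 of a file, as the usual 8-hex-digit string."""
            value = 1
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(blocksize), b""):
                    value = zlib.adler32(chunk, value)
            return "%08x" % (value & 0xffffffff)

        def classify(source_path, provided_checksum):
            if adler32_of(source_path) != provided_checksum.lower():
                return "SOURCE error: checksum mismatch before transfer"
            return "ok to transfer"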
    

  • this meeting: Operations summary:
    Yuri's summary from the weekly ADCoS meeting:
    https://indico.cern.ch/getFile.py/access?contribId=0&resId=0&materialId=0&confId=186044
    or
    http://www-hep.uta.edu/~sosebee/ADCoS/ADCoS-status-summary-4_9_2012.html
    
    1)  4/4: ggus ticket 80912 opened for FTS errors at BNL ("[INTERNAL_ERROR] The process serving the transfer (status = RUNNING) is no longer 
    active (could not open file /proc/20917/cmdline").  The ggus ticket was closed before it was discovered that two links were needed following the FTS upgrade.  
    They were added.  ggus 80985 was opened after 80912 was closed prematurely; it is now closed as well.  eLog 35099.
    2)  4/4: SLACXRD/WT2 DDM errors: reads/writes to the site were failing - large numbers of checksum errors across many sites with SLAC as the source.  
    Wei reported the problem was fixed.  ggus 80913 closed, eLog 34930.  Later the same day ggus 80916 was opened for a different error 
    ("[INTERNAL_ERROR] Parsing document '/usr/etc/glite-data-srm-util-cpp.patterns.xml' failed. Check that the file exists, is accessible and contains valid xml").   
    At the time the BNL FTS monitor was not available, so it was difficult to debug the problem.  eLog 34932.  Update 4/9: Not clear whether this was a SLAC issue. Also, 
    recent file transfers are successful - no longer seeing the error reported in the ticket.  ggus 80916 closed - eLog 35106. 
    3)  4/5: BNL file transfer errors ("[TRANSFER error during TRANSFER phase: [GRIDFTP_ERROR] globus_ftp_client: the server responded with an error 
    426 FTP proxy did not shut down]").  Comment from Stephane: Transfers are managed by the FTS server serving the destination. In this case, it is IN2P3-CC 
    and not BNL. There is a similar issue for TRIUMF->IRFU. In addition, it affects FT transfers, for which the DDM Ops expert should be contacted before GGUS.  
    ggus 80975 closed.
    4)  4/5: ggus ticket 80919 opened and assigned to BNL, reporting that a task was waiting on input data files in the US cloud.  Such issues are generally 
    not site issues, but instead should be reported to DDM ops.  Ticket closed, eLog 34937.
    5)  4/5: ggus ticket 80920 opened and assigned to BNL, in this case noting that task 751725 was failing at the site.  Again not a site problem, but rather 
    one with the task, as tracked in http://savannah.cern.ch/bugs/?93446.  Ticket closed, eLog 34955.
    6)  4/5: OUHEP_OSG - file transfer failures with checksum errors ("DESTINATION error during TRANSFER_FINALIZATION phase: [INTERNAL_ERROR] 
    Checksum mismatch").  Horst and others performed extensive testing to determine the cause of the errors.  Not obvious whether this was a problem on 
    the OCHEP end.  Transfers eventually succeeded, so ggus 80927/RT 21884 tickets were closed.  eLog 34942.
    7)  4/5 - 4/6: NET2 - Saul reported that some software related to the LSM created problems at the HU_ATLAS site.  Issue resolved early a.m. 4/6. 
    8)  4/6: SRM problems at BNL-OSG2 were affecting tier 0 exports (for example: "[GENERAL_FAILURE] All ls requests failed in some way or another"), 
    in addition to general file transfers.  From Iris at BNL: the SRM service was not contactable because one of the core components in dCache crashed.  
    ggus 80998 closed, eLog 34971.  (ggus 8100 also opened around this time for the same issue - ticket closed.  eLog 34977.)
    9)  4/7: From Alexei: Information related to 2011 data export from CERN, T1-T1, T1-T2 data11 transfers is archived and it is deleted from the current database 
    tables. Panda page http://panda.cern.ch/server/pandamon/query?mode=listCR and related plots like http://atladcops.cern.ch:8000/drmon/crmon.html have 
    info only from Jan 2012.  eLog 34998.
    10)  4/7: UTD-HEP - following its un-blacklisting in DDM (see (iv) below) the site requested testing by HC/panda.  However, pilots were not able to 
    find/access the atlas s/w release areas.  RT 21898 opened.
    11)  4/8: UTD-HEP - file transfer errors ("failed to contact on remote SRM [httpg://fester.utdallas.edu:8443/srm/v2/server]").  ggus 81035, eLog 35046.
    12)  4/9: NERSC - file transfer errors to SCRATCHDISK ("failed to contact on remote SRM [httpg://pdsfdtn1.nersc.gov:62443/srm/v2/server]").  ggus 81050 
    in-progress, eLog 81050. 
    13)  4/9: SWT2_CPB: file transfers failing at the site with a security/credentials error (" [SECURITY_ERROR] not mapped /DC=ch/DC=cern/OU= Organic Units/
    OU=Users/CN=ddmadmin/CN=531497/CN=Robot: ATLAS Data Management").  Two issues: (1) the host cert for the GUMS server expired; (2) the SRM host had to be 
    restarted - it ran out of memory during the GUMS host outage.  On 4/10 ggus 81052/RT 21907 closed, eLog 35056/76/77.  (Duplicate tickets ggus 81104/RT 21910 
    were closed.)  A simple host-certificate expiry check is sketched at the end of this summary.
    14)  4/10: New type of DDM error seen at NERSC, and also UPENN: "Invalid SRM version."  ggus tickets 81011, 81012 & 81110 all related to this issue.
    
    Follow-ups from earlier reports:
    
    (i)  2/29: UTD-HEP set off-line to replace a failed disk.  Savannah site exclusion: https://savannah.cern.ch/support/index.php?126767.  As of 3/7 site 
    reported the problem was fixed.  Test jobs are failing with the error "Put error: nt call last): File , line 10, in ? File /usr/lib/python2.4/site-packages/XrdPosix.py, 
    line 5, in ? import _XrdPosix ImportError: /usr/lib/python2.4/site-packages/_XrdPosixmodule.so: undefined symbol: XrdPosix_Truncate."
    Under investigation.  eLog 34259.
    Update 3/12: ggus 80175 was opened for the site due to test jobs failing with the error shown above.  Closed on 3/13 since this issue is being tracked 
    in the Savannah ticket.
    (ii)  3/2: BU_ATLAS_Tier2 - DDM deletion errors (175 over a four-hour period).  ggus 79827 in-progress, eLog 34150.
    Update 3/13: ggus 79827 closed, as this issue is being followed in the new ticket 80214.  eLog 34336.
    (iii)  3/10: Duke - DDM errors (" failed to contact on remote SRM [httpg://atlfs02.phy.duke.edu:8443/srm/v2/server]").  Issue with a fileserver which 
    hosts gridftp & SRM services being investigated.  ggus 80126, eLog 34315.  ggus 80228 also opened on 3/13 for file transfer failures at DUKE_LOCALGROUPDISK.  
    Tickets cross-referenced.  System marked as 'off-line' while the hardware problem is worked on.  eLog 34343.  DDM blacklist ticket:
    https://savannah.cern.ch/support/index.php?127055
    Update 4/5: Downtime extended until the end of April.
    (iv)  3/19: UTD-HEP - DDM errors ("failed to contact on remote SRM [httpg://fester.utdallas.edu:8443/srm/v2/server]").  ggus 80364 in-progress, eLog 34434, 
    blacklisted in DDM: https://savannah.cern.ch/support/index.php?127205.  (This Savannah site exclusion ticket was unneeded, as the site issues were already 
    covered in 126767 - so the ticket was closed.)
    Update 4/4: site reported that the OSG software had been updated, and hence was ready for un-blacklisting.  ggus 80364/RT 21811 closed, eLog 34950, 
    Savannah site exclusion ticket https://savannah.cern.ch/support/index.php?126767 closed.
    (v)  4/4 early a.m.: SWT2_CPB - file transfer errors, for example: "[INTERNAL_ERROR] Parsing document '/usr/etc/glite-data-srm-util-cpp.patterns.xml' failed. 
    Check that the file exists, is accessible and contains valid xml."  Storage server had to be rebooted (it had crashed) - problem under investigation.  
    ggus 80888 / RT 21875, eLog 34968.  Blacklisted in DDM: http://savannah.cern.ch/support/?127654.
    Update 4/5: Site removed from blacklisting.  Still investigating a disk failure in a RAID set.  eLog 34967.
    Update 4/9: File transfers working fine following the un-blacklisting.  Investigated checksums on a questionable partition - three out of ~68k files might have an 
    issue.  ggus 80888/RT 21875 closed, eLog 35052.
    (vi)  4/4: BNL voms not accessible with voms command - hangs at the point "Contacting vo.racf.bnl.gov:15003 [/DC=org/DC=doegrids/OU=Services/CN=vo.racf.bnl.gov] "atlas" ."  
    From John Hover at BNL: A service restart fixed the problem.  I'm now checking into why my alert script didn't detect the problem and  restart automatically.  
    ggus 80902 in-progress, eLog 34908.
    Update 4/4 p.m. from John:  The Python version available under the probe account was not high enough (2.4 vs 2.6). It appears I removed the one I had been using 
    in the course of solving the VOMS synchronization issue two weeks ago. I'll ask that my probe host be upgraded to RHEL6, which will provide 2.6 and remove the 
    need for me to maintain my own Python install.  ggus 80902 closed, eLog 34945.
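
    Certificate-expiry note (re item 13 above): a simple host-certificate expiry check of the sort that would have flagged the GUMS problem early. It shells out to the openssl CLI; the host name and port are placeholders, not a specific GUMS endpoint.

        # Print the notAfter date of a service host certificate via openssl.
        # Host/port are placeholders; run periodically and alert when the date is near.
        import subprocess

        def cert_end_date(host, port=8443):
            s_client = subprocess.Popen(
                ["openssl", "s_client", "-connect", "%s:%d" % (host, port)],
                stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
            pem, _ = s_client.communicate(b"")   # EOF on stdin ends the session
            x509 = subprocess.Popen(
                ["openssl", "x509", "-noout", "-enddate"],
                stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
            out, _ = x509.communicate(pem)
            return out.strip().decode()          # e.g. "notAfter=Apr  9 12:00:00 2013 GMT"

        if __name__ == "__main__":
            print(cert_end_date("gums.example.org"))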
    

DDM Operations (Hiro)

Throughput and Networking (Shawn)

Federated Xrootd deployment in the US (Wei)

last week(s) this week:

Site news and issues (all sites)

  • T1:
    • last meeting(s): Intervention planned for next Tuesday - a border router change that would affect production in the US, primarily affecting OPN connectivity. Moving to the Nexus generation.
    • this meeting:

  • AGLT2:
    • last meeting(s): Working on local network configuration - need to get switches talking correctly, may need a downtime. May need to re-stack switches.
    • this meeting: all storage is online and in production - 2160 TB. Equipment coming for MSU.

  • NET2:
    • last meeting(s): Had a downtime to upgrade NFS server and mount points on BU side. Storage upgrade in progress - 2 racks, 3 TB drives.
    • this meeting: Storage is online, and being used - under some load. 2.1 PB now.

  • MWT2:
    • last meeting(s): Illinois CC - hardware has arrived, 16 nodes, being racked. 4 loaner nodes in the CC are integrated with MWT2, running analysis and production. Perfsonar nodes are available. Reconfiguration of the CC to accommodate us; one more piece is required for the networking, for higher capacity and for a future upgrade to 100 Gbps.
    • this meeting:

  • SWT2 (UTA):
    • last meeting(s): CVMFS installed at CPB; back online Friday with no problems. A data server issue caused problems with XrootdFS. Bringing storage online. R310s received; working on optics (LR or SR).
    • this meeting: a couple of incidents - MD1000 rebuild needed; ran checksums; 3 files might have been affected. GUMS host cert expired. Everything else is running smoothly. UPS upgrade is progressing.

  • SWT2 (OU):
    • last meeting(s): Came out of downtime last week. All looks good; pleased with storage performance - 1100 MB/s for gridftp IO, 250-350 MB/s on average to BNL. Started filling up over the weekend. The Lustre servers work better as the cluster fills. Handling a large number of analysis jobs.
    • this meeting: all is well

  • WT2:
    • last meeting(s): Ordered 10 Gbps network monitoring machines; operations are smooth. One incident over the weekend - a USERDISK area exceeded the ~32K subdirectory limit and had to be purged (a quick check for this is sketched at the end of this section).
    • this meeting: all is well.
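
Subdirectory-count check (re the WT2 USERDISK item above): a quick way to spot directories approaching the ~32K subdirectory limit of ext3-style filesystems before writes start failing. The path and warning threshold are placeholders.

    # Warn when a directory approaches the ~32000-subdirectory ceiling that
    # bit the WT2 USERDISK area.  Path and threshold are illustrative.
    import os
    import sys

    LIMIT = 31000   # a little below ext3's ~32000-link ceiling

    def subdir_count(path):
        return sum(1 for name in os.listdir(path)
                   if os.path.isdir(os.path.join(path, name)))

    if __name__ == "__main__":
        top = sys.argv[1] if len(sys.argv) > 1 else "."
        n = subdir_count(top)
        if n > LIMIT:
            print("WARNING: %s has %d subdirectories (limit ~32000)" % (top, n))
        else:
            print("%s: %d subdirectories" % (top, n))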

Carryover issues (any updates?)

rpm-based OSG 3.0 CE install

last meeting(s)
  • In production at BNL.
  • Horst claims there are two issues: an RSV bug, and Condor not being in a standard location.
  • NET2: Saul: have a new gatekeeper - will bring up with new OSG.
  • AGLT2: March 7 is a possibility - will be doing upgrade.
  • MWT2: done.
  • SWT2: will take a downtime; have new hardware to bring online. Complicated by the install of the new UPS - expecting delivery, which will require a downtime.
  • WT2: has two gatekeepers. Will use one and attempt to transition without a downtime.

this meeting

  • Any updates?
  • AGLT2
  • MWT2 - in production DONE
  • SWT2
  • NET2
  • WT2

OSG Opportunistic Access

See AccessOSG for instructions supporting OSG VOs for opportunistic access.

last week(s)

  • SLAC: enabled HCC and Glow. Having problems with Engage, since they want gridftp from the worker nodes. Possible problem with the Glow VO cert.
  • GEANT4 campaign, SupportingGEANT4
this week
  • See the CMS Opportunistic Access discussion above

AOB

last week this week


-- RobertGardner - 10 Apr 2012
