
MinutesApr27

Introduction

Minutes of the Facilities Integration Program meeting, April 27, 2011
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • 866-740-1260, Access code: 7027475

Audio Details: Dial-in Number:
U.S. & Canada: 866.740.1260
U.S. Toll: 303.248.0285
Access Code: 7027475; Chair passcode: 8734
Registration Link: https://cc.readytalk.com/r/bd2w3deu2kkg

Attending

  • Meeting attendees: Rob, Charles, Jason, John DeStefano, Doug, Shawn, Pat, Torre, Karthik, Dave, Bob, Saul, Sarah, Armen, Mark, Wei, Hiro, Tom,
  • Guests:
  • Apologies: Michael, Horst, Fred, Taeksu (HU), Kaushik, AK

Integration program update (Rob, Michael)

OSG Opportunistic Access

last week
  • The HCC (Holland Computing Center at the University of Nebraska) VO
  • Contacts: Derek Weitzel <dweitzel@cse.unl.edu>, Adam Caprez <acaprez@cse.unl.edu>
  • Website, http://hcc.unl.edu/main/index.php
  • Presentation: HCC-opportunistic-Atlas.pdf
  • Don't use $APP, $DATA, or home directories.
  • Glideins will exit after 16 hours, or after 20 minutes if they're idle (a small sketch of this policy follows this list). Kaushik notes most of our sites do not do preemption.
  • Load on the gatekeeper caused by glideins? It was noted that glidein-cms is 'nice' since only one instance is used; the loads are believed to be small.
    • The issue is the load caused by sudden preemption.
  • Expanding usage:
    • Expand usage at BNL.
    • Kaushik - suggests putting a limit on the max number of slots, e.g. 10-20% of the site capacity at most.
    • Michael - we should try to get some statistics; want to be able to change this dynamically.
  • Suggest a max glidein time of 8 hours. All agreed.
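
A minimal sketch of the glidein lifetime policy noted above (exit after a wall-time cap, or after a shorter idle period), written as a hypothetical Python check. The 20-minute idle limit and the 8-hour cap agreed at the meeting come from the notes; the function and the way idleness is measured are illustrative assumptions, not HCC's actual glidein configuration.

    import time

    MAX_WALLTIME = 8 * 3600   # cap agreed at the meeting (HCC default had been 16 hours)
    MAX_IDLE     = 20 * 60    # glidein exits after 20 minutes with no payload job

    def glidein_should_exit(start_time, last_busy_time, now=None):
        """Return True if a glidein should shut itself down (hypothetical helper)."""
        now = now or time.time()
        if now - start_time >= MAX_WALLTIME:
            return True    # hit the wall-time cap, release the slot to the site
        if now - last_busy_time >= MAX_IDLE:
            return True    # idle too long, free the slot
        return False

    # Example: a glidein that started 9 hours ago and is still busy should exit.
    print(glidein_should_exit(time.time() - 9 * 3600, time.time()))
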
this week
  • LIGO
    • A LIGO user at UM is using SRM at BNL. 600 jobs were attempting to stage out to BNL, creating load; this happened because of a lot of preemption. The user has reconfigured to remove monitoring and to add timing on stage-out, and is limiting the number of simultaneous jobs to 32. They chose SRM as an output area. Hiro notes this could be a problem with BNL's namespace holding up. These are Robert Engel's jobs.
    • Xin - summarized this at the OSG production call.
    • Shawn: would like to see an OSG-recommended best practice for preemption.
    • Charles - can't SRM be configured to refuse connections? Hiro - also consider a DN blacklist (see the sketch after this list).
    • Request for a "how data flows" description. In this case, the user had storage at BNL but not at UM. Should we be providing SRM-based storage for opportunistic VOs?
  • HCC - no activity reported.
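
As a hedged illustration of the DN-blacklist idea Hiro raised (not a feature of SRM or BeStMan being claimed here), a simple subject-DN check in front of the storage service might look like the sketch below; the blacklist file path, its one-DN-per-line format, and the function names are all hypothetical.

    # Hypothetical DN blacklist check: refuse service to listed certificate subjects.
    def load_blacklist(path="/etc/srm/dn_blacklist.txt"):   # assumed file location/format
        dns = set()
        with open(path) as f:
            for line in f:
                line = line.strip()
                if line and not line.startswith("#"):
                    dns.add(line)
        return dns

    def is_blocked(client_dn, blacklist):
        return client_dn in blacklist

    if __name__ == "__main__":
        blacklist = load_blacklist()
        dn = "/DC=org/DC=doegrids/OU=People/CN=Example User 12345"   # placeholder DN
        print("%s %s" % ("blocked" if is_blocked(dn, blacklist) else "allowed", dn))
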

Operations overview: Production and Analysis (Kaushik)

Data Management and Storage Validation (Armen)

  • Reference
  • last week(s):
    • No meeting this week, no urgencies
  • this week:
    • Sites need to delete files in PRODDISK themselves (this is not done by central deletion; a hedged cleanup sketch follows this list).
    • Central deletion cleanup is okay - a small backlog at the Tier 2s, with most activity at the Tier 1 (investigating a change in parameters to increase the rate).
    • Clean up the old BNLPANDA and BNLTAPE space tokens (~0.5 PB to clean up).
    • New space token categories (unallocated, unpowered) are being implemented at the sites.
    • Question from Saul - will there be deletions of ESD datasets since they are becoming less common? Tied somewhat to reprocessing.
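
Since PRODDISK cleanup is left to the sites, a hedged sketch of an age-based local sweep is shown below. The path, the 30-day threshold, and the dry-run default are assumptions for illustration only; a real cleanup would also have to deregister the deleted files from the LFC/DDM catalogs, which this sketch does not do.

    import os, time

    PRODDISK_ROOT = "/xrootd/atlasproddisk"   # hypothetical local path to the PRODDISK area
    MAX_AGE_DAYS  = 30                        # assumed retention window
    DRY_RUN       = True                      # report candidates instead of deleting them

    def cleanup_proddisk(root=PRODDISK_ROOT, max_age_days=MAX_AGE_DAYS, dry_run=DRY_RUN):
        """Walk the PRODDISK area and remove (or list) files older than the cutoff."""
        cutoff = time.time() - max_age_days * 86400
        candidates = 0
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                try:
                    if os.path.getmtime(path) < cutoff:
                        if dry_run:
                            print("would remove %s" % path)
                        else:
                            os.remove(path)
                        candidates += 1
                except OSError:
                    pass    # file vanished or is unreadable; skip it
        return candidates

    if __name__ == "__main__":
        print("%d candidate files" % cleanup_proddisk())
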

Shifters report (Mark)

  • Reference
  • last meeting: Operations summary:
    Yuri's summary from the weekly ADCoS meeting:
    https://indico.cern.ch/getFile.py/access?contribId=1&resId=0&materialId=1&confId=135036
    
    1)  4/13: SWT2_CPB - user reported a problem downloading some files from the site.  The error was related to a glitch in updating the CA certificates/CRL's on 
    the cluster (hence errors like "The certificate has expired: Credential with subject: /C=PL/O=GRID/CN=Polish Grid CA has expired").  Problem should be fixed 
    now - waiting for a confirmation from the user.  ggus 69674 / RT 19779.
    2)  4/14: OU_OCHEP_SWT2 - job failures with the error " Unable to verify signature! Server certificate possibly not installed."  Eventually it was determined the 
    issue was with release 16.6.3.  Alessandro re-installed this version, including the various caches, and this appears to have solved the problem.  
    ggus 69690 / RT 19786 closed, eLog 24332.
    3)  4/14: New pilot version from Paul (SULU 47a).  Details here:
    http://www-hep.uta.edu/~sosebee/ADCoS/pilot-version-SULU-47a.html
    4)  4/14: UTD-HEP set off-line at the request of site admin (PRODDISK low on space).  eLog 24359.
    5)  4/14: IllinoisHEP file transfer errors ("[GENERAL_FAILURE] AsyncWait] Duration [0]").  From Dave: A restart of dCache on the pool node last night appears 
    to have fixed the problem. No new transfer problems have been seen since the restart.  ggus 69719 closed, eLog 24371.
    6)  4/19: UTA_SWT2 - maintenance outage to update s/w on the cluster (OSG, Bestman, xrootd, etc.)  Work completed as of early a.m. 4/20.  Test jobs 
    submitted to the site.
    7)  4/19: From Bob at AGLT2 - The UM Ann Arbor campus suffered a nearly complete power outage at 12:39 p.m. today.  Some 100 or so jobs that had been 
    running on worker nodes without UPS protection were lost.  Apparently, global network connectivity to AGLT2 was also impacted, as we have some reports 
    of job submission problems to ANALY_AGLT2.  The outage lasted for 3-5 minutes at the main UM AGLT2 sites.  It is unknown how long the network was down.
    8)  4/20 early a.m.: SWT2_CPB - initially file transfers were failing due to an expired host cert, which has been updated.  A bit later, transfer failures were 
    reported with the error "failed to contact on remote SRM [httpg://gk03.atlas-swt2.org:8443/srm/v2/server]" and ggus 69875 / RT 19808 were re-opened.  This 
    latter issue was possibly due to a couple of data servers being heavily loaded for several hours.  eLog 24558. 
    9)  4/20: Tadashi updated the panda server so that jobs go to 'waiting' instead of 'failed' when release/cache is missing in a cloud.
    
    Follow-ups from earlier reports:
    (i)  1/19: BNL - user reported a problem while attempting to download files from the site - for example: "httpg://dcsrm.usatlas.bnl.gov:8443/srm/managerv2: 
    CGSI-gSOAP running on t301.hep.tau.ac.il reports Error reading token data header: Connection closed."  ggus 66298.  From Hiro:
    There is a known issue for users with an Israel CA having problems accessing BNL and MWT2. This is actively being investigated right now. Until this gets completely 
    resolved, users are advised to submit a DaTRI request to transfer their datasets to some other site (LOCALGROUPDISK area) for downloading.
    Update 3/14 from Iris: The issue is still under investigation. Thank you for your patience.
    (ii)  4/8: NERSC - file transfer errors.  See ggus 69526 (in-progress), eLog 24176.
    Update 4/19: some progress has been made on understanding the issue(s) - will close this ticket once it appears everything is working correctly.
    (iii)  4/8: OU_OSCER_ATLAS - still see intermittent job failures with segfault errors.  Site was set off-line 4/11 due to a spike in the failure rate.  
    Discussed in: https://savannah.cern.ch/support/?120307 (site exclusion), ggus 69558 / RT 19757, eLog 24133/92, https://savannah.cern.ch/bugs/index.php?79656.
    (iv)  4/11: IllinoisHEP - job failures in task 296070 due to missing input files.  Dave at Illinois reported that it appears the files were never transferred to the site?  
    ggus 69601 in-progress, eLog 24234, http://savannah.cern.ch/bugs/?80830.
    Update 4/13: no more errors of this type after the initial group of errors.  Possibly a case where panda attempted to run the jobs before the input files had been 
    staged to the site.  ggus 69601 closed.
    (v)  4/12: UTD-HEP - job failures with errors like "Mover.py | !!FAILED!!3000!! Get error: Replica with guid 601B99EE-1E42-E011-BA72-001D0967D549 not found 
    at srm://fester.utdallas.edu."  Possibly due to concurrently running a disk clean-up script.  ggus 69641 in-progress, eLog 24284.
    Update 4/14 from Harisankar at UTD: We believe the error was caused by the dark-data clean-up process we performed. We are currently working on it (hence 
    closing this ticket).  ggus 69641 closed.  Test jobs successful, queue set back on-line.  eLog 24331.
    
  • this meeting: Operations summary:
    Yuri's summary from the weekly ADCoS meeting:
    http://www-hep.uta.edu/~sosebee/ADCoS/ADCoS-status-summary-4_25_11.html
    
    1)  4/20: SLAC maintenance outage (replace an uplink fiber in a switch) completed as of ~5:30 p.m. CST.
    2)  4/21: File transfer errors at SLAC ("failed to contact on remote SRM [httpg://osgserv04.slac.stanford.edu:8443/srm/v2/server]").  A restart of the SRM service 
    resolved the issue.  ggus 69917 closed, eLog 24599.
    3)  4/22: AGLT2 - file transfer errors - from Bob: One pool went off-line overnight. We updated all firmware and BIOS of the server, brought the machine down, 
    checked all pools, and have just now brought all 5 pools of that server back on-line.
    4)  4/22-4/23: Issue with the VObox hosted at BNL - problem with some subscription(s) to BNL_SCRATCHDISK.  Issue understood - from Hiro: There was a 
    typo in the newest entry to ToA. I fixed it and committed it to cvs.  eLog 24652/60.
    5)  4/23: MWT2_UC - job failures with the error "lsm-put failed: time out after 5400 seconds."  Site is in downtime / blacklisted in DDM.  From Aaron (4/25): This 
    seems to have been caused by the Chimera services in our dcache stopping. I've restarted it and we're back to functional. I will be doing some more tests this 
    morning before bringing this queue back into test and running some more test jobs.  ggus 69957 in-progress, eLog 24752.
    Update 4/26: MWT2_UC is back on-line following a move of the cluster hardware to a new machine room.  Issue reported in ggus 69957 resolved, ticket closed, 
    eLog 24809.  https://savannah.cern.ch/support/?120545 (site exclusion Savannah).
    6)  4/24: ~9:30 p.m. - UTD-HEP requested to be set off-line due to severe storms in the area (potential for power outages).  Next day test jobs were submitted, 
    they completed successfully, so set the site back on-line.  https://savannah.cern.ch/support/index.php?120550, eLog 24764.
    7)  4/25: UM muon calibration db status was showing 'ABORTED' errors.  Seems to have been only a transient issue - it resolved itself within a couple of hours.  
    ggus 69973 closed, eLog 24734.
    8)  4/25: BNL - 'voms-proxy-init' commands hitting the server at BNL were hanging up (but worked on the CERN server).  An auto-restart of the tomcat service 
    apparently resolved the issue.  eLog 24740.
    9)  4/25: SWT2_CPB - disk failure in one of the RAID arrays caused problems for one of the xrootd data servers.  Once the array began rebuilding with the spare, 
    access to the storage was restored.  FTS channels were set off-line for a few hours during the incident.
    10)  4/25-4/26: ANALY_NET2 set off-line (A/C problem at the BU site - necessary to turn off some worker nodes overnight).  Queue set back on-line the next 
    morning.  eLog 24810.
    
    Follow-ups from earlier reports:
    
    (i)  1/19: BNL - user reported a problem while attempting to download files from the site - for example: "httpg://dcsrm.usatlas.bnl.gov:8443/srm/managerv2: 
    CGSI-gSOAP running on t301.hep.tau.ac.il reports Error reading token data header: Connection closed."  ggus 66298.  From Hiro:
    There is a known issue for users with an Israel CA having problems accessing BNL and MWT2. This is actively being investigated right now. Until this gets completely resolved, 
    users are advised to submit a DaTRI request to transfer their datasets to some other site (LOCALGROUPDISK area) for downloading.
    Update 3/14 from Iris: The issue is still under investigation. Thank you for your patience.
    Update 4/25 from Rob Quick: This ticket has been stalled for over a month. Closing it as abandoned.  ggus 66298 closed.
    (ii)  4/8: NERSC - file transfer errors.  See ggus 69526 (in-progress), eLog 24176.
    Update 4/19: some progress has been made on understanding the issue(s) - will close this ticket once it appears everything is working correctly.
    (iii)  4/8: OU_OSCER_ATLAS - still see intermittent job failures with segfault errors.  Site was set off-line 4/11 due to a spike in the failure rate.  Discussed in: 
    https://savannah.cern.ch/support/?120307 (site exclusion), ggus 69558 / RT 19757, eLog 24133/92, https://savannah.cern.ch/bugs/index.php?79656.
    (iv)  4/13: SWT2_CPB - user reported a problem downloading some files from the site.  The error was related to a glitch in updating the CA certificates/CRL's on 
    the cluster (hence errors like "The certificate has expired: Credential with subject: /C=PL/O=GRID/CN=Polish Grid CA has expired").  Problem should be fixed 
    now - waiting for a confirmation from the user.  ggus 69674 / RT 19779.
    Update 4/27: User reported the files became accessible after the cert update issue was resolved.  ggus 69674 / RT 19779 closed.
    (v)  4/14: UTD-HEP set off-line at the request of site admin (PRODDISK low on space).  eLog 24359.
    Update 4/20: Disk clean-up completed - test jobs submitted to the site, but they were failing due to an expired host cert.  (Also seemed to be a potential issue with 
    release 16.2.1.2, but Alessandro confirmed that it was installed successfully back in March?)  Certificate updated, test jobs successful, site set back 'on-line' as of 
    4/22 p.m.  https://savannah.cern.ch/support/?120370, eLog 24688.
    (vi)  4/19: UTA_SWT2 - maintenance outage to update s/w on the cluster (OSG, Bestman, xrootd, etc.)  Work completed as of early a.m. 4/20.  Test jobs submitted 
    to the site.
    Update 4/22: test jobs (event generation & simulation) completed successfully - site set back to 'on-line'.  (Simulation jobs had to wait for the input dataset to be 
    subscribed at BNL - now done.)  eLog 24657.
    (vii)  4/20 early a.m.: SWT2_CPB - initially file transfers were failing due to an expired host cert, which has been updated.  A bit later, transfer failures were 
    reported with the error "failed to contact on remote SRM [httpg://gk03.atlas-swt2.org:8443/srm/v2/server]" and ggus 69875 / RT 19808 were re-opened.  This latter 
    issue was possibly due to a couple of data servers being heavily loaded for several hours.  eLog 24558. 
    Update 4/21: no additional errors of this type seen - ggus 69875 & RT 19808 closed.
    

DDM Operations (Hiro)

Throughput and Networking (Shawn)

  • NetworkMonitoring
  • https://www.usatlas.bnl.gov/dq2/throughput
  • Now there is FTS logging to the DQ2 log page at: http://www.usatlas.bnl.gov/dq2log/dq2log (type in 'fts' and 'id' in the box and search).
  • last week:
    • No meeting this week.
    • There are ongoing discussions with Alessandro De Salvo re: perfSONAR monitoring for the IT cloud.
    • Longer term, the goal is to get this ATLAS-wide.
  • this week:
    • Meeting yesterday - notes sent to the list.
    • R410 testing - moving toward 10G-enabled boxes in the future; would like to deploy only one box.
    • BWCTL testing discussion: 10G hosts can overdrive 1G hosts, which looks like a problem with the path even when there isn't one. No good solution is known. Will compare 10G-to-1G transfer tests with iperf (see the sketch after this list).
    • The Italian cloud is still interested in setting up perfSONAR infrastructure, possibly starting in the summer.
    • Doug: what about running the toolkit on a non-dedicated server, and with an SL-based kernel? Shawn notes this is possible, at the cost of losing some capabilities.
    • Jason - we have a yum repo and could work on setting this up for smaller sites. Shawn - what would be the use case for those types of systems? They could potentially show false positives. The clients are available in the OSG client, but perhaps the instructions aren't there; you would also want a service able to respond to remote tests. The fancy GUI won't be available.
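
To make the 10G-vs-1G comparison concrete, a small wrapper around classic iperf (v2), as sketched below, could record the achieved throughput for each host pair. It assumes an iperf server is already running on each target host; the host names are placeholders and the output parsing assumes iperf's default report format.

    import subprocess

    def iperf_mbits(server, seconds=10):
        """Run 'iperf -c <server>' and return the reported bandwidth in Mbits/sec."""
        out = subprocess.check_output(
            ["iperf", "-c", server, "-t", str(seconds), "-f", "m"])
        for line in out.decode().splitlines():
            if "Mbits/sec" in line:
                # e.g. "[  3]  0.0-10.0 sec  1125 MBytes   943 Mbits/sec"
                return float(line.split()[-2])
        return None

    if __name__ == "__main__":
        # Placeholder names for a 10G-connected and a 1G-connected perfSONAR host.
        for host in ["ps-10g.example.edu", "ps-1g.example.edu"]:
            print("%s: %s Mbits/sec" % (host, iperf_mbits(host)))
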

HTPC configuration for AthenaMP testing (Horst)

last week
  • OSG reference, https://twiki.grid.iu.edu/bin/view/Documentation/HighThroughputParallelComputing
  • Still waiting on others to make progress - Muonboy is causing segfaults (so reco jobs can't be run).
  • Douglas Smith had tried jobs, which required a POOL file catalog, but this doesn't work at OSCER.
  • Not using the Tier 2 cluster since it would require a Condor upgrade. Is this possible?
  • Saul - is this caused by a Fortran library? Are other sites using these RPMs?
  • Justin at SMU - volunteered to do this.
  • The only requirement is to configure the scheduler for whole-node scheduling.
  • Saul will look into setting this up at NET2 (PBS, LSF). Will send Paolo a message.

this week

  • HTPC and AthenaMP are available on the OU ITB cluster. Waiting for a queue - will ask Alden.
  • Dave at Illinois is ready to start testing - queue created. One node at the site is configured for whole-node jobs. Alden - scheduling updates to go hourly.

Python + LFC bindings, clients (Charles)

last week:
  • The new dq2 clients package requires at least Python 2.5; the recommendation is 2.6. The goal is not to distribute Python with the clients, but to make our install look like lxplus - /usr/bin/python26 installable from yum, plus setup files. This will be the platform prerequisite.
  • LFC Python bindings problem - mixed 32/64-bit environments. The goal is to make sure wlcg-client provides both 32-bit and 64-bit environments (/lib and /lib64 directories). _lfc.so also pulls in the whole Globus stack. Hopefully the next wlcg-client update will incorporate these changes. Charles will write this up and circulate an email (see the sketch after this list).
  • Working on it - in progress.
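
A hedged sketch of the 32/64-bit selection described above: pick the lib or lib64 directory that matches the running Python before importing the LFC bindings. The WLCG_LOCATION default and the lib*/python layout are assumptions for illustration, not the actual wlcg-client structure.

    import os, platform, sys

    WLCG_LOCATION = os.environ.get("WLCG_LOCATION", "/opt/wlcg-client")  # assumed install root

    # Match the bindings to the interpreter's word size (32-bit vs 64-bit).
    libdir = "lib64" if platform.architecture()[0] == "64bit" else "lib"
    bindings = os.path.join(WLCG_LOCATION, libdir, "python")

    if os.path.isdir(bindings):
        sys.path.insert(0, bindings)

    try:
        import lfc    # wraps _lfc.so, which in turn pulls in the Globus libraries
        print("LFC bindings loaded from %s" % bindings)
    except ImportError as err:
        print("LFC bindings not available: %s" % err)
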
this week:
  • Meeting today at 2 pm Central w/ VDT

WLCG accounting (Karthik)

last week:
  • Called into OSG production meeting, not much feedback during the meeting.
  • Dan Fraser suggested a separate meeting with Brian, Burt, Karthik, & others. Karthik will setup a meeting for Monday.
  • Michael - facts are on the table, goal for the meeting should be that the owners assume responsibility for implementing a solution.

this week:

  • I (Karthik), Brian, Burt & Horst had a discussion about the hyperthreading issue. We discussed whether we could somehow reuse the existing variables (in config.ini) and interpret them differently (cores_per_node vs. slots_per_node), or whether we need to introduce new values. This led to a discussion about the GIP GLUE schema and how it might affect interoperability for any consumers downstream. It was suggested that we need to find out more from the interoperability/WLCG team about this. Below is the email from Burt about this. The action items for now are: 1) Decide how we want to change the information in config.ini (re-use existing variables or add new variables). 2) Decide how to interpret the changes on the GIP side. 3) Once we have agreement on the above, we can work with Suchandra to implement the changes in config.ini, get them tested on the ITB, and make sure everything works as expected before rolling it into production. (A hedged config.ini illustration follows this list.)
  • Will track in the OSG production meeting
  • Timeframe? Believed to be one month at most.
  • April statistics will be done by hand (for all sites).
  • At NET2 they were off by 20% last time.
  • Saul believes we'll get to within about 10%.
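
To illustrate action item 1) above, the sketch below reads a hypothetical config.ini subcluster section carrying both a cores_per_node and a slots_per_node value and reports the implied totals. The section and option names are placeholders for whatever the GIP/interoperability discussion settles on, not an agreed schema.

    import configparser, textwrap

    # Hypothetical config.ini fragment (section and option names are illustrative only).
    SAMPLE = textwrap.dedent("""
        [Subcluster Example]
        cores_per_node = 8
        slots_per_node = 16
        node_count = 100
        """)

    parser = configparser.ConfigParser()
    parser.read_string(SAMPLE)

    for section in parser.sections():
        cores = parser.getint(section, "cores_per_node")
        slots = parser.getint(section, "slots_per_node")
        nodes = parser.getint(section, "node_count")
        print("%s: %d physical cores, %d job slots (slots/cores = %.1f)"
              % (section, cores * nodes, slots * nodes, slots / float(cores)))
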

CVMFS (John DeStefano)

last week:
  • Updates on site testing.
  • John is waiting on tests from the sites; he sees no activity.
  • MWT2 - down, no tests
  • AGLT2 - deployed on one VM. Not an insignificant amount of work to set this up.
  • SWT2 UTA - plan to roll out on UTA_SWT2 cluster
  • SWT2 OU - Horst has installed on a few nodes - latest version, seems to be working okay.
this week:
  • Doug: Jacob was looking at creating the initial directories; Alessandro will provide instructions on how to install the software. The plan is to install the releases.
  • The current directory structure is /cvmfs/atlas.cern.ch/
  • Can update the instructions at TestingCVMFS to use 0.2.68 and the US ATLAS squid (see the sketch after this list).
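
A small, hedged check that the repository above is actually visible on a node: it assumes only the standard /cvmfs/atlas.cern.ch mount point and the cvmfs_config client utility; everything else is illustrative.

    import os, subprocess

    REPO = "atlas.cern.ch"
    MOUNTPOINT = os.path.join("/cvmfs", REPO)

    def cvmfs_ok(repo=REPO, mountpoint=MOUNTPOINT):
        """Return True if the CVMFS repository appears mounted and reachable."""
        try:
            if not os.path.isdir(mountpoint) or not os.listdir(mountpoint):
                return False
        except OSError:
            return False    # autofs could not mount the repository
        # 'cvmfs_config probe' verifies the repository can be mounted and its
        # backend reached (through the locally configured squid proxy).
        return subprocess.call(["cvmfs_config", "probe", repo]) == 0

    if __name__ == "__main__":
        print("CVMFS %s OK: %s" % (REPO, cvmfs_ok()))
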

Federated Xrootd at sites: Tier 3 (Doug), Tier 2 (Charles)

last week(s):
  • Apologies - will send a note out to sites to check on the version they are running.
  • The global redirector is back up.
  • Michael - would like to set up a global redirector at BNL using the usatlas.org domain.

this week:

  • No updates; priority lowered due to the wlcg-client work.

Tier 3 Integration Program (Doug Benjamin)

Tier 3 References:
  • The links to the ATLAS T3 working group Twikis are here
  • T3g Setup guide is here
  • Users' guide to T3g is here

last week(s):

  • No report

this week:

  • wlcg-client issue / meeting
  • xrootd rpm testing

Tier 3GS site reports (Doug, Joe, AK, Taeksu)

last week:
  • UTD: not here
  • BELLARMINE-OU: still working on firewall issue. Could consult with OSG Security.
  • Hampton U: investigating getting a site up. Will report in a future meeting

this week:

  • Will follow-up with Horst re: using Jason's help.

Site news and issues (all sites)

  • T1:
    • last week(s): The Chimera migration is progressing; hardware specs are out, PO issued. Expect a fair number of SSD disks for the database. Will start learning about the migration needed to convert the 100M-file inventory. Also working on federated xrootd, CVMFS, and other things. Planning upgrades to the power infrastructure in the building addition (more panels and breakers); this will require a partial downtime. ESnet is working on getting additional circuits operational on the new fiber infrastructure; the light budget is not enough from BNL to Manhattan, requiring a light amplifier halfway.
    • this week:

  • AGLT2:
    • last week(s): A flaky 10G NIC on a storage node was resolved. Bob - getting worker nodes updated to SL 5.5 - jobs running successfully. Will start a rolling update of all the machines, including a security patch.
    • this week: Tracking a packet-loss network issue at MSU - it happens a few times a week. Using the perfSONAR box to track the packet loss.

  • NET2:
    • last week(s): Still working on the I/O upgrade; John away on vacation; getting ready to buy more storage.
    • this week: A/C outage at BU - had to shut down worker nodes overnight.

  • MWT2:
    • last week: Major downtime pushed back till April 18 for UC server room move. LOCALGROUPDISK cleanup in progress. Site reports:
      • UC: moving server room: major downtime this week and next.
      • IU: took a short downtime to migrate GUMS from UC to IU, back online.
      • Illinois: all is well.
    • this week:
      • UC: Server room move essentially complete - photos. May need a downtime next week for AC unit commissioning.
      • IU: down a short while
      • Illinois: Upgrades of kernels and firmware - Condor up to 7.6 for AthenaMP testing

  • SWT2 (UTA):
    • last week: OSG updated. Replaced xrootd and Bestman with the latest versions. Updated worker nodes. There might be an issue with RSV. Occasional problems with one analysis job type. Patrick: OSG is trailing too far behind the xrootd releases - would rather work directly from source.
    • this week: Working on an SRM issue (Bestman2 bug) - had to roll back to the previous version. Wei does not see the same problem in his version. Will be installing a new squid server.

  • SWT2 (OU):
    • last week: All is well. Pursuing segfault issue at OSCER - Ed Moyse is looking at it locally.
    • this week: CVMFS installed and working on ITB nodes. Working on AthenaMP tests.

  • WT2:
    • last week(s): Short downtime this afternoon for network changes. The problem with uplinks on the 8024F was due to bad fiber. Two 20G uplinks to the Cisco.
    • this week: Upgraded to the latest Bestman2 - was concerned about the number of open pipes (> 2000) in the previous version. Emailed LBL - they addressed the problem; reduced to 20-30.

Carryover issues (any updates?)

AOB

last week this week


-- RobertGardner - 25 Apr 2011

