Minutes of the Facilities Integration Program meeting, May 4, 2011
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • 866-740-1260, Access code: 7027475

Audio Details: Dial-in Number:
U.S. & Canada: 866.740.1260
U.S. Toll: 303.248.0285
Access Code: 7027475; Chair passcode: 8734
Registration Link: https://cc.readytalk.com/r/bd2w3deu2kkg


  • Meeting attendees: Aaron, Rob, Nate, Charles, Michael, Tom, AK, Fred, Saul, Sarah, Shawn, Patrick, Bob, Dave, Booker, Torre, Justin, JohnB, Karthik, Kaushik, Armen, Horst, Hiro, Wensheng, Wei
  • Apologies: Doug, Jason, John, Mark

Integration program update (Rob, Michael)

OSG Opportunistic Access

last week
  • LIGO
    • A user at UM (Robert Engel's jobs) using SRM at BNL: 600 jobs were attempting to stage out to BNL, creating load; this happened due to a lot of preemption. The user has reconfigured to remove monitoring and add timing on stage-out, and is limiting the number of simultaneous jobs to 32. They chose SRM as an output area. Hiro notes this could be a problem with BNL's namespace holding up.
    • Xin - summarized this at the OSG production call.
    • Shawn: would like to see an OSG recommended best practice for preemption.
    • Charles - can't SRM be configured to refuse connections? Hiro - also consider a DN blacklist.
    • Request for a description of "how the data flows". In this case the user had storage at BNL but not at UM. Should we be providing SRM-based storage for opportunistic VOs?
  • HCC - no activity reported.
this week
  • No issues this week
  • Reached out to Derrick this week
  • HCC running at MWT2_IU and OSCAR
  • NE: No progress
  • SW: Enabled to run on the production cluster; no HCC jobs yet, but the cluster is full with SAMGrid jobs
  • SLAC: Requires glideins, which means outbound connections need to be enabled; working this out with the networking folks
  • OSG Council meeting coming up; the goal is to report HCC jobs working across the facility by May 17th

Operations overview: Production and Analysis (Kaushik)

Data Management and Storage Validation (Armen)

  • Reference
  • last week(s):
    • Sites need to delete files in proddisk (not done by central deletion)
    • Central deletion cleanup is okay - a small backlog at Tier 2s, most activity at Tier1 (investigating change in parameters to increase rate)
    • Cleanup old BNLPANDA, BNLTAPE space tokens ( ~ 0.5 PB to clean up)
    • New space tokens (UNALLOCATED, UNPOWERED) are being implemented at the sites
    • Question from Saul - will there be deletions of ESD datasets since they are becoming less common? Tied somewhat to reprocessing.
  • this week:
    • No storage meeting this week
    • No urgent issues

Shifters report (Mark)

  • Reference
  • last meeting: Operations summary:
    Yuri's summary from the weekly ADCoS meeting:
    1)  4/20: SLAC maintenance outage (replace an uplink fiber in a switch) completed as of ~5:30 p.m. CST.
    2)  4/21: File transfer errors at SLAC ("failed to contact on remote SRM [httpg://osgserv04.slac.stanford.edu:8443/srm/v2/server]").  A restart of the SRM service 
    resolved the issue.  ggus 69917 closed, eLog 24599.
    3)  4/22: AGLT2 - file transfer errors - from Bob: One pool went off-line overnight. We updated all firmware and BIOS of the server, brought the machine down, 
    checked all pools, and have just now brought all 5 pools of that server back on-line.
    4)  4/22-4/23: Issue with the VObox hosted at BNL - problem with some subscription(s) to BNL_SCRATCHDISK.  Issue understood - from Hiro: There was a 
    typo in the newest entry to ToA. I fixed it and committed it to cvs.  eLog 24652/60.
    5)  4/23: MWT2_UC - job failures with the error "lsm-put failed: time out after 5400 seconds."  Site is in downtime / blacklisted in DDM.  From Aaron (4/25): This 
    seems to have been caused by the Chimera services in our dcache stopping. I've restarted it and we're back to functional. I will be doing some more tests this 
    morning before bringing this queue back into test and running some more test jobs.  ggus 69957 in-progress, eLog 24752.
    Update 4/26: MWT2_UC is back on-line following a move of the cluster hardware to a new machine room.  Issue reported in ggus 69957 resolved, ticket closed, 
    eLog 24809.  https://savannah.cern.ch/support/?120545 (site exclusion Savannah).
    6)  4/24: ~9:30 p.m. - UTD-HEP requested to be set off-line due to severe storms in the area (potential for power outages).  Next day test jobs were submitted, 
    they completed successfully, so set the site back on-line.  https://savannah.cern.ch/support/index.php?120550, eLog 24764.
    7)  4/25: UM muon calibration db status was showing 'ABORTED' errors.  Seems to have been only a transient issue - it resolved itself within a couple of hours.  
    ggus 69973 closed, eLog 24734.
    8)  4/25: BNL - 'voms-proxy-init' commands hitting the server at BNL were hanging up (but worked on the CERN server).  An auto-restart of the tomcat service 
    apparently resolved the issue.  eLog 24740.
    9)  4/25: SWT2_CPB - disk failure in one of the RAID arrays caused problems for one of the xrootd data servers.  Once the array began rebuilding with the spare, 
    access to the storage was restored.  FTS channels were set off-line for a few hours during the incident.
    10)  4/25-4/26: ANALY_NET2 set off-line (A/C problem at the BU site - necessary to turn off some worker nodes overnight).  Queue set back on-line the next 
    morning.  eLog 24810.
    Follow-ups from earlier reports:
    (i)  1/19: BNL - user reported a problem while attempting to download files from the site - for example: "httpg://dcsrm.usatlas.bnl.gov:8443/srm/managerv2: 
    CGSI-gSOAP running on t301.hep.tau.ac.il reports Error reading token data header: Connection closed."  ggus 66298.  From Hiro:
    There is a known issue for users with an Israel CA having problems accessing BNL and MWT2. This is actively being investigated right now. Until this gets completely resolved, 
    users are advised to submit a DaTRI request to transfer their datasets to some other site (LOCALGROUPDISK area) for downloading.
    Update 3/14 from Iris: The issue is still under investigation. Thank you for your patience.
    Update 4/25 from Rob Quick: This ticket has been stalled for over a month. Closing it as abandoned.  ggus 66298 closed.
    (ii)  4/8: NERSC - file transfer errors.  See ggus 69526 (in-progress), eLog 24176.
    Update 4/19: some progress has been made on understanding the issue(s) - will close this ticket once it appears everything is working correctly.
    (iii)  4/8: OU_OSCER_ATLAS - still see intermittent job failures with segfault errors.  Site was set off-line 4/11 due to a spike in the failure rate.  Discussed in: 
    https://savannah.cern.ch/support/?120307 (site exclusion), ggus 69558 / RT 19757, eLog 24133/92, https://savannah.cern.ch/bugs/index.php?79656.
    (iv)  4/13: SWT2_CPB - user reported a problem downloading some files from the site.  The error was related to a glitch in updating the CA certificates/CRL's on 
    the cluster (hence errors like "The certificate has expired: Credential with subject: /C=PL/O=GRID/CN=Polish Grid CA has expired").  Problem should be fixed 
    now - waiting for a confirmation from the user.  ggus 69674 / RT 19779.
    Update 4/27: User reported the files became accessible after the cert update issue was resolved.  ggus 69674 / RT 19779 closed.
    (v)  4/14: UTD-HEP set off-line at the request of site admin (PRODDISK low on space).  eLog 24359.
    Update 4/20: Disk clean-up completed - test jobs submitted to the site, but they were failing due to an expired host cert.  (Also seemed to be a potential issue with a 
    release, but Alessandro confirmed that it was installed successfully back in March?)  Certificate updated, test jobs successful, site set back 'on-line' as of 
    4/22 p.m.  https://savannah.cern.ch/support/?120370, eLog 24688.
    (vi)  4/19: UTA_SWT2 - maintenance outage to update s/w on the cluster (OSG, Bestman, xrootd, etc.)  Work completed as of early a.m. 4/20.  Test jobs submitted 
    to the site.
    Update 4/22: test jobs (event generation & simulation) completed successfully - site set back to 'on-line'.  (Simulation jobs had to wait for the input dataset to be 
    subscribed at BNL - now done.)  eLog 24657.
    (vii)  4/20 early a.m.: SWT2_CPB - initially file transfers were failing due to an expired host cert, which has been updated.  A bit later, transfer failures were 
    reported with the error "failed to contact on remote SRM [httpg://gk03.atlas-swt2.org:8443/srm/v2/server]" and ggus 69875 / RT 19808 were re-opened.  This latter 
    issue was possibly due to a couple of data servers being heavily loaded for several hours.  eLog 24558. 
    Update 4/21: no additional errors of this type seen - ggus 69875 & RT 19808 closed.
  • this meeting: Operations summary:
    Yuri's summary from the weekly ADCoS meeting:
    1)  4/27: OU_OCHEP_SWT2 - file transfer errors like "failed to contact on remote SRM."  From Horst: Our Bestman got hung up for some reason, but our monitoring 
    caught it and auto-restarted it, and everything seems to be fine again now.  ggus 70053 / RT 19908 closed, eLog 24821.
    2)  4/27: BNL - file transfer failures with the error "The certificate has expired: Credential with subject: /DC=org/DC=doegrids/OU=Services/CN=dcsrm.usatlas.bnl.gov 
    has expired."  From Hiro: Host certificate of srm was renewed on April 15th, which is 12 days ago. So, we are not sure why it shows errors today if any. It seems that 
    it has been cached somewhere.  Anyhow, the restart of the service fixed the problem.  ggus 70067 closed, eLog 24831.
    3)  4/27: LFC errors at all US cloud sites.  Issue was due to a new release of the lcg-vomscerts package for the CERN VOMS server.  All US sites had installed the 
    update as of early afternoon.  The rpm is located at:
    http://etics-repository.cern.ch/repository/download/registered/org.glite/lcg-vomscerts/6.4.0/noarch/lcg-vomscerts-6.4.0-1.slc4.noarch.rpm. 
    4)  4/28: OU_OCHEP_SWT2 - large number of job failures with the error "an unknown exception occurred."  Many of the log files contained the entry "problem 
    running chappy!"  Since (i) some of the jobs from the same task had earlier finished successfully at the site, and (ii) there was a problem trying to install the 
    cache, it was decided that possibly the 16.6.4 atlas s/w installation was somehow corrupted.  Alessandro/Lorenzo re-installed it (including the cache), so this may have 
    solved the problem.  Hard to say for certain since by now the jobs had cycled through the system.  ggus 70081 / RT 19923 closed, eLog 24855.
    5)  4/28: MWT2_UC  - problematic jobs with "Failed to get LFC replicas" errors.  Issue understood - from Aaron:  This was due to a single worker node with the wrong 
    system time. It had drifted by many hours. Since the CRLs were just updated, it was refusing to secure a connection to our LFC. Once the time was updated, connections 
    were working again.  ggus 70112 closed, eLog 24896.
    6)  4/28-4/29: ILLINOISHEP_DATADISK - DDM transfer failures with the error "source file doesn't exist."  Possible that the dataset was deleted from the site.  Hiro 
    restored it, and the error went away.  ggus 70104, eLog 24898.
    7)  4/29: SLACXRD - file transfer failures with "failed to contact on remote SRM" errors.  ggus 69917 was re-opened.  Update 5/1 from Wei: I requested changes at 
    related FTS channels yesterday and I don't see this error again today.  eLog 24891.
    Update 5/4: Errors no longer appear, the failed transfers are done now.  ggus 69917 closed, eLog 25032.
    8)  5/2: UTD_HOTDISK file transfer errors ("failed to contact on remote SRM [httpg://fester.utdallas.edu:8446/srm/v2/server]").  From Joe: Hardware failure on our 
    system disk. Currently running with a spare having out of date certificates. Our sys-admin is working on the problem.  ggus 70196 in-progress, eLog 24971.
    9)  5/3: New pilot version (SULU 47b) from Paul.  Details here:
    10)  5/4 early a.m.: SWT2_CPB - problem with the NIC (cooling fan) in a dataserver took the host off-line.  Problem should now be fixed.  ggus 70266 / RT 19949 will 
    be closed once we verify transfers are succeeding.  eLog 25046.
    11)  5/4: OU_OCHEP_SWT2_PRODDISK - file transfer failures due to checksum errors ("[INTERNAL_ERROR] Destination file/user checksum mismatch]").  Horst & Hiro 
    are investigating.  https://savannah.cern.ch/bugs/index.php?81834, eLog 25039.
    Follow-ups from earlier reports:
    (i)  4/8: NERSC - file transfer errors.  See ggus 69526 (in-progress), eLog 24176.
    Update 4/19: some progress has been made on understanding the issue(s) - will close this ticket once it appears everything is working correctly.
    (ii)  4/8: OU_OSCER_ATLAS - still see intermittent job failures with segfault errors.  Site was set off-line 4/11 due to a spike in the failure rate.  Discussed in: 
    https://savannah.cern.ch/support/?120307 (site exclusion), ggus 69558 / RT 19757, eLog 24133/92, https://savannah.cern.ch/bugs/index.php?79656.
    • OSG security pushed back on lcg-vomscerts vs. permanent files; it's a small enough issue that we can handle it ourselves

DDM Operations (Hiro)

Throughput and Networking (Shawn)

  • NetworkMonitoring
  • https://www.usatlas.bnl.gov/dq2/throughput
  • Now there is FTS logging to the DQ2 log page at: http://www.usatlas.bnl.gov/dq2log/dq2log (type in 'fts' and 'id' in the box and search).
  • last week:
    • Meeting yesterday - notes sent to the list.
    • R410 testing - the plan is to have 10G-enabled boxes in the future; would like to deploy only one box.
    • BWCTL testing discussion. 10G hosts can overdrive 1G hosts, which looks like a problem with the path. No good known solution; will compare 10G-to-1G transfer tests with iperf.
    • The Italian cloud is still interested in setting up perfSONAR infrastructure, possibly starting this summer.
    • Doug: what about running the toolkit on a non-dedicated server, and with an SL-based kernel? Shawn notes this is possible, though it loses some capabilities.
    • Jason - we have a yum repo and could work on setting this up for smaller sites. Shawn - what would be the use case for those types of systems, which could potentially show false positives? The clients are available in the OSG client, though perhaps the instructions aren't there; you'd also want a service able to respond to remote tests. The fancy GUI won't be available.
  • this week:
    • No meeting this week
    • LHCONE meeting next week

HTPC configuration for AthenaMP testing (Horst)

last week this week
  • When the pilot claims a whole node, efficiency suffers until all job slots on that node are free; the pilot then starts up, runs, exits, and frees up the node - very inefficient (see the toy estimate below)
  • More efficiency is possible if we keep multi-core jobs on multi-core nodes and do not allow single-core jobs to run on multi-core nodes
  • A Panda queue for the OU HEP MP cluster has been set up; Horst has not had time to test it yet, but it is likely ready for testing
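
To illustrate the cost, here is a minimal toy estimate (hypothetical slot counts and runtimes, not OU measurements) of the slot-hours wasted while a node drains so a whole-node multi-core pilot can start; each already-free slot sits idle until the slowest remaining single-core job on the node finishes:

    # Toy estimate of slot-hours wasted while draining a node for a whole-node pilot.
    # Hypothetical numbers; assumes single-core jobs with uniformly random remaining
    # runtimes and no backfilling of freed slots.
    import random

    SLOTS = 8            # job slots per node (assumed)
    MAX_REMAINING = 6.0  # max remaining runtime of a single-core job, in hours (assumed)
    TRIALS = 100000

    def wasted_slot_hours():
        remaining = [random.uniform(0.0, MAX_REMAINING) for _ in range(SLOTS)]
        drain_time = max(remaining)  # the pilot must wait for the slowest job
        # each slot is idle from the moment its job ends until the node is fully drained
        return sum(drain_time - r for r in remaining)

    avg = sum(wasted_slot_hours() for _ in range(TRIALS)) / TRIALS
    print("average idle slot-hours per drained node: %.1f" % avg)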

Python + LFC bindings, clients (Charles)

last week(s):
  • The new dq2 clients package requires at least Python 2.5; the recommendation is 2.6. The goal is not to distribute Python with the clients, but to make our install look like lxplus: /usr/bin/python26 installable from yum, plus setup files. This will be the platform prerequisite.
  • LFC Python bindings problem - a mixed 32/64-bit environment. The goal is to make sure wlcg-client has both 32-bit and 64-bit environments (/lib and /lib64 directories). _lfc.so also pulls in the whole Globus stack. Hopefully the next update of wlcg-client will incorporate these changes. Charles will write this up and circulate an email. (A quick environment check is sketched after this list.)
  • Working on it - in progress.
  • Meeting today at 2 pm Central w/ VDT
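
A minimal sketch of the kind of environment check this implies (the module and path names are the usual ones for the LFC Python bindings, but treat them as assumptions for any particular wlcg-client layout):

    # Check the dq2-client / LFC-bindings prerequisites described above.
    import platform
    import sys

    # dq2 clients need at least Python 2.5, with 2.6 recommended (e.g. /usr/bin/python26 from yum)
    if sys.version_info < (2, 6):
        print("warning: python %s found; 2.6 is recommended for the dq2 clients" % platform.python_version())

    # the bindings ship a compiled _lfc.so, so 'import lfc' only works when the
    # interpreter's word size (32- vs. 64-bit) matches the installed library
    print("interpreter architecture: %s" % platform.architecture()[0])
    try:
        import lfc  # provided by the LFC Python bindings in wlcg-client
        print("lfc bindings loaded from %s" % lfc.__file__)
    except ImportError:
        print("lfc bindings not importable in this environment (check PYTHONPATH / lib vs. lib64)")
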
this week:

WLCG accounting (Karthik)

last week:
  • Karthik, Brian, Burt & Horst had a discussion about the hyperthreading issue: could we reuse the existing variables in config.ini by interpreting them differently (cores_per_node vs. slots_per_node), or do we need to introduce new values? This led to a discussion of the GIP GLUE schema and how changes might affect interoperability for downstream consumers; it was suggested that we find out more from the interoperability/WLCG team. Burt sent an email about this. The action items for now are: 1) decide how we want to change the information in config.ini (re-use existing variables or add new ones); 2) decide how to interpret the changes on the GIP side; 3) once we agree on the above, work with Suchandra to implement the changes in config.ini, test them on the ITB, and make sure they work as expected before rolling them into production. (A sketch of why the slots-vs-cores distinction matters for reported capacity follows this list.)
  • Will track in the OSG production meeting
  • Timeframe? Believes maximum one month.
  • April statistics will be done by hand (for all sites).
  • At NET2 they were off by 20% last time.
  • Saul believes we'll get to about 10%.
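
As a rough illustration of why the cores_per_node vs. slots_per_node distinction matters for the reported numbers (all values below are hypothetical placeholders, not benchmarks for any US ATLAS site):

    # Toy illustration of reported capacity when hyperthreaded batch slots are
    # (incorrectly) treated as physical cores. Hypothetical numbers only.
    NODES = 100
    CORES_PER_NODE = 8     # physical cores per node (assumed)
    SLOTS_PER_NODE = 16    # batch slots per node with hyperthreading on (assumed)
    HS06_PER_CORE = 8.0    # benchmark score per physical core (assumed)

    capacity_from_cores = NODES * CORES_PER_NODE * HS06_PER_CORE
    capacity_from_slots = NODES * SLOTS_PER_NODE * HS06_PER_CORE  # naive reuse of the slot count

    inflation = 100.0 * (capacity_from_slots - capacity_from_cores) / capacity_from_cores
    print("HS06 from physical cores:         %.0f" % capacity_from_cores)
    print("HS06 if slots are taken as cores: %.0f (inflated by %.0f%%)" % (capacity_from_slots, inflation))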

this week:

  • Following up on interoperability issues around changing the OSG config file; Doug asked Tony about this, and we are waiting to talk to a new employee who won't start until later this month
  • Requires input from WLCG folks; the next discussion will happen at the interoperability meeting on May 20th, to understand the implications of downstream changes
  • Question of whether USCMS has similar issues with CPU types/speeds as USATLAS does w.r.t. WLCG accounting stats


CVMFS

  • See TestingCVMFS
last week:
  • Doug: Jacob was looking at creating the initial directories; Alessandro will be giving instructions as to how to install the software. Plan is to install the releases.
  • Current directory structure is /cvmfs/atlas.cern.ch/
  • Can update instructions at TestingCVMFS to use 0.2.68 and to use US ATLAS squid. Shawn will follow-up.
this week:
  • All MWT2 queues are running CVMFS, no significant load on squid
  • Request: Update the TestingCVMFS page to reflect changes made in schedconfig panda SVN
  • The Puppet module as published will be updated this afternoon to modprobe the fuse module if it is not loaded before autofs runs
  • AGLT2 CVMFS: very close to rebuilding worker nodes with CVMFS, likely ready by Friday
  • SLAC: needs a few more hard drives for a CVMFS local cache
  • NE: planning to install, especially on the Tier3
  • SW: basic squid set up (not production-ready); a test machine with CVMFS and a new queue to test against it
  • Questions: what is the plan for the PFC files? Does the CVMFS_NFILES param in /etc/cvmfs/default.conf limit the size the cache will grow to? (See the sketch below.)
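
For reference, a minimal sketch for inspecting the cache-related settings; it assumes the shell-style KEY=VALUE format of /etc/cvmfs/default.conf and the usual reading of the parameters (CVMFS_QUOTA_LIMIT as the cache-size cap in MB, CVMFS_NFILES as the open-file-descriptor limit), which should be confirmed against the 0.2.68 documentation:

    # Print the cache-related parameters from a CVMFS client configuration file.
    # Assumes shell-style KEY=VALUE lines; parameter meanings as noted above.
    def read_cvmfs_conf(path="/etc/cvmfs/default.conf"):
        params = {}
        with open(path) as conf:
            for line in conf:
                line = line.strip()
                if not line or line.startswith("#") or "=" not in line:
                    continue
                key, value = line.split("=", 1)
                params[key.strip()] = value.strip().strip('"')
        return params

    if __name__ == "__main__":
        params = read_cvmfs_conf()
        print("cache directory (CVMFS_CACHE_BASE):    %s" % params.get("CVMFS_CACHE_BASE", "(not set)"))
        print("cache quota in MB (CVMFS_QUOTA_LIMIT): %s" % params.get("CVMFS_QUOTA_LIMIT", "(not set)"))
        print("open-file limit (CVMFS_NFILES):        %s" % params.get("CVMFS_NFILES", "(not set)"))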

Federated Xrootd at sites: Tier 3 (Doug), Tier 2 (Charles)

last week(s):
  • no updates; priority lowered due to the wlcg-client work.
this week:
  • no big updates
  • meeting earlier today with ADC discussing global redirector move from SLAC to BNL or CERN
  • need to send a note out about upgrades prior to next round of testing
  • next week or two for next steps

Tier 3 Integration Program (Doug Benjamin)

Tier 3 References:
  • Links to the ATLAS T3 working group Twikis are here
  • T3g Setup guide is here
  • Users' guide to T3g is here

last week(s):

  • wlcg-client issue / meeting
  • xrootd rpm testing

this week:

  • Doug not here this week

Tier 3GS site reports (Doug, Joe, AK, Taeksu)

last week:
  • Doug is coordinating networking help for T3 sites with Jason

this week:

  • Doug not here

Site news and issues (all sites)

  • T1:
    • last week(s): The Chimera migration is progressing; hardware specs are out and the PO has been issued. Expect a fair number of SSDs for the database. Will start learning about the migration needed to convert the 100M-file inventory. Also working on federated xrootd, CVMFS, and other things. Planning upgrades to the power infrastructure in the building addition (more panels and breakers); this will require a partial downtime. ESnet is working on getting additional circuits operational on the new fiber infrastructure; the light budget is not enough from BNL to Manhattan, requiring an optical amplifier halfway.
    • this week:
      • GLExec autopilot factory work going on, will go into production soon
      • Xrootd Global redirector to be moved to BNL, Hiro pursuing this
      • Chimera migration for dcache at BNL
      • Hiro testing direct-access method from dcache with new panda queue
      • Federated ID management with CERN using shibboleth with trust relationships between BNL and CERN

  • AGLT2:
    • last week(s): Tracking a packet loss network issue at MSU - happens a few times a week. Using the perfsonar box to track the packet loss.
    • this week:
      • LFC number of threads increased from 20 to 99
      • The billing DB filled up a partition on dCache; cleanup is now automated so this will no longer occur
      • Meeting with Dell next week with regard to SSDs, hope for better pricing
      • Met with MWT2 to discuss LSM and pcache
      • Revisiting testing of direct-access methods
      • Plan to deploy CVMFS with the new Rocks build, likely complete today; rolling re-builds will begin later in the week
      • Testing NexSAN: doing iozone testing; some issues with rack-mounting the 60-unit enclosure; an improvement over Dell in density and performance

  • NET2:
    • last week(s): A/C outage at BU - had to shut down worker nodes overnight.
    • this week:
      • Relatively smooth operations
      • Tier3 work: operational now
      • Focused on local IO ramp-up: joining GPFS volumes complete, rebalancing of files, good performance from GPFS
      • Harvard directly mounting GPFS with clustered NFS with local site mover
      • Getting another 10Gb link from campus folks
      • Adding GridFTP instances at BU
      • Upgrading OSG, upgrading to BestMan2, moving to CVMFS
      • Purchasing more storage, 2 racks' worth (~300 TB per rack), ending up at 1.5 PB by July

  • MWT2:
    • last week: Major downtime pushed back till April 18 for UC server room move. LOCALGROUPDISK cleanup in progress. Site reports:
      • UC: Server room move essentially complete - photos. May need a downtime next week for AC unit commissioning.
      • IU: down a short while
      • Illinois: Upgrades of kernels and firmware - Condor up to 7.6 for AthenaMP testing
    • this week:
      • UC: Completed our move, moved to CVMFS
      • IU:
      • Illinois:

  • SWT2 (UTA):
    • last week: Working on SRM issue (Bestman2 bug) - had to roll back to previous version. Wei does not see the same problem in his version. Will be installing a new squid server.
    • this week:
      • Looking into CVMFS
      • A storage server NIC went away, which caused some problems, but it's back and fine now
      • Partial outage on Sunday due to an 8-hour power outage; the generator should come on without issues, but 2 racks of SWT2_CPB workers will be affected
      • Rolled back the BestMan2 version; ran the new version on a second node with no problems. In the meantime a newer version was released; will test it soon and then take a downtime to move to the latest BestMan2 and a newer OSG stack, and also spin up another 200 TB of disk

  • SWT2 (OU):
    • last week: CVMFS installed and working on ITB nodes. Working on AthenaMP tests.
    • this week:
      • Glitch with a release getting corrupted; deleting and re-installing it fixed the problem
      • CVMFS testing ongoing, hope to move to CVMFS next week
      • Working on MP cluster/queue at OU

  • WT2:
    • last week(s): Upgraded to latest bestman2 - concerned about the number of open pipes (> 2000) in previous version. Email to LBL - addressed the problem, reduced to 20-30.
    • this week:
      • Upgraded to the latest BestMan; it is working and fixes a number of issues
      • Alex will release the latest BestMan to OSG by the end of the week, including a plugin to dynamically add/remove GridFTP servers
      • 2-3 day power outage starting Friday afternoon through Sunday or Monday to bring more power to the building
      • After the power outage, OS installation of the new compute nodes will start once power is delivered to them

Carryover issues (any updates?)


last week this week

-- RobertGardner - 30 Apr 2011
