
MinutesJune22

Introduction

Minutes of the Facilities Integration Program meeting, June 22, 2011
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
  • Our Skype-capable conference line (6 to mute); announce yourself in a quiet moment after you connect
    • USA Toll-Free: (877)336-1839
    • USA Caller Paid/International Toll : (636)651-0008
    • ACCESS CODE: 3444755

Attending

  • Meeting attendees: Aaron, Charles, Rob, Hari Namsivayam (UTD), AK, Michael, Patrick, John D, Sarah, Booker, Nate, Saul, Wei, Tom, Hiro, Shawn, Xin, Wensheng, Armen, Mark, Alden
  • Apologies: Jason, Horst
  • Guests:

Integration program update (Rob, Michael)

  • Special meetings
    • Tuesday (12 noon CDT, weekly - convened by Kaushik) : Data management
    • Tuesday (2pm CDT, bi-weekly - convened by Shawn): Throughput meetings
  • Upcoming related meetings:
  • For reference:
  • Program notes:
    • last week(s)
      • Integration program from this quarter, FY11Q3
      • Discussion about upcoming meeting on virtual machines and config management, FacilitiesMeetingConfigVM
        • See agenda - format will be improvised as topics are discussed, and interactive
      • Note that the mode of operation has changed for multi-cloud production: input files will come from remote Tier 1s. There is a performance issue - transfers are fairly slow, causing sites to drain. Note that we have not optimized network links; this needs to be addressed with ADC. We do expect this to change with LHCONE, but that won't be for a while.
      • Hiro has attempted optimization of FTS settings, but this has not helped.
      • Sarah - notes MWT2 is draining anyway; there is a problem delivering pilots. Is the timefloor setting set?
      • Need to investigate any issues with the pilot rate, but also the transfer rates back to the destination cloud. Hiro notes the small file transfers are dominated by the setup overhead. The transfers to Lyon in particular are problematic.
      • Need some monitoring plots to show this back to ADC. Hiro has some of these.
    • this week
      • Reminder: one-page summaries for the FacilitiesMeetingConfigVM working groups are due Friday.
      • End of quarter approaching. From SiteCertificationP17, very easy updates to mark progress are:
        • FabricUpgradeP17 - very easy to fill out.
        • NetworkMonitoringP17 - as discussed in throughput calls, net diagrams, etc.
        • TestingCVMFS - this has been in an integration and testing phase; not all sites are participating, in which case use led-gray.
        • No FAX updates - debugging name translation plugin
        • UpdateOSG - green if updated to 1.2.19. May defer to the next phase pending release of the new wn-client (which carries the new LFC Python client bindings).
        • AccessOSG - green if the Tier 2 has enabled HCC
        • Please update by Friday, July 1.
        • Will also send out request for installed capacity updates.
      • 800 TB of raw data have been collected in 2011 so far; there will be ~200 TB of derived data (AOD), which are targets for Tier 2s.
      • See Alexei's talk during ATLAS week this week - categories of data usage. ESDs are no longer of interest. AODs are very popular; ntuples even more so (as seen from DaTRI, or from PD2P). Roughly 20 copies of the same datasets across Tier 2s. Good news, since these are used for analysis and the services are successfully moving data to where it is needed.
      • Management news - FY11 funding for Tier 2s granted, with no cuts. Still no word for FY12, which requires the new cooperative agreement grant; nothing yet from NSF.
      • June 29 would like to have a more comprehensive status report - going over the anticipated deliverables.
      • LHC had a number of hiccups - cryo lost. Perhaps beam tomorrow. Increasing luminosity; >1200 bunches expected.
      • Summer conferences are coming up; expect analysis activities to increase.

OSG Opportunistic Access (Rob)

last week(s)
  • Engage VO introduction - overview, requirements, questions.
  • John McGee, Mats Rynge, Steven Cox
  • Overview presentation: http://www.renci.org/~scox/engage_at_atlas/engage-atlas-v2.0.pdf
  • Support wiki: https://twiki.grid.iu.edu/bin/view/Engagement/EngageAtUSATLAS
  • Worker node outbound access - what are the restrictions at sites? E.g., at BNL, no direct access is permitted.
  • SL5 is okay - works for all applications with a couple exceptions.
  • What needs to be known about the applications? Varies a bit by user/application.
  • What will be the first application to run? Would start with a straightforward application.
  • What about storage? They don't use SRM access at the moment. Michael points out the advantages of SRM as a control mechanism.
  • Steve is the technical rep; engage-team@opensciencegrid.org.
this week
  • HCC update
    • May 1 - June 22: query
    • 50K CPU-hours since May. Represents ~ 7% of the total HCC output
    • Plot: facility_success_cumulative_smry.png (attached)
    • Screenshot: screenshot_03.jpg (attached)
    • Preparing to submit to SLAC: Wei sent an email asking for an update, since an agreement is in place.
    • Need to enable HCC at NET2 and email Derek. NET2 has been busy.
    • http://glidein.unl.edu/glidestats/
  • No updates from Engage.

Operations overview: Production and Analysis (Kaushik)

  • Production reference:
  • last meeting(s):
    • PD2P algorithm will be changed - Tier 1s had been favored; will change to MOU share. Tier 2s remain brokerage-only.
    • How does this coexist with GROUPDISK subscriptions?
    • Dubna workshop was full of talks - fewer discussions, in contrast with Napoli.
    • Doug - talks should be representative of the workshop
    • Transfer backlog - related to star-channel usage; concurrent transfers compete within the same star channel.
  • this week:
    • All is smooth. 17K running jobs. No site issues, failure rate very low.
    • Intermittent problem with task brokerage - seemed to be related to the BDII. Saul reports the BDII stopped reporting. This seems to have happened at several sites over the past ~week, not necessarily for related reasons.
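    • For reference, a quick way to check whether a site's CE is still publishing into the OSG BDII - the hostnames below are illustrative, not necessarily the exact query used:
      ldapsearch -x -LLL -H ldap://is.grid.iu.edu:2170 -b o=grid \
          '(&(objectClass=GlueCE)(GlueCEUniqueID=*atlas.bu.edu*))' GlueCEUniqueID GlueCEStateStatus
      If nothing comes back while the gatekeeper is up, the site's info-provider chain has indeed stopped reporting.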

Data Management and Storage Validation (Armen)

  • Reference
  • last week(s):
    • No major issues, no meeting
    • USERDISK cleanup campaign started yesterday
  • this week:
    • No meeting this week.
    • All looks good mostly.
    • USERDISK cleanup still ongoing; generally okay. NET2 - there were some issues last week (SRM timeouts), now solved. LFC ACL errors persist at NET2, but this is understood.
    • BNL deletion rate - much more data, so deletion is taking longer. The central deletion team is finally acknowledging the problem: only 1-2 Hz! (For scale, at 1-2 Hz a backlog of one million files takes roughly 6-12 days to clear.)
    • Local cleanup of the legacy space tokens is still going well.
    • atlas-cloud-support-us SSB email: annoying message about the CALIB threshold (20 TB). Armen is investigating.

Shifters report (Mark)

  • Reference
  • last meeting: Operations summary:
    Yuri's summary from the weekly ADCoS meeting:
    http://www-hep.uta.edu/~sosebee/ADCoS/ADCoS-status-summary-6_13_2011.txt
    
    1)  6/8: MWT2_UC - job failures with "lsm-get failed" errors.  From Aaron: There were two files causing these job failures. The first was deleted by 
    the central deletion service before jobs which required it had a chance to access it.  The second was erroneously being reported as missing to our 
    xrootd clients by our xrootd server. This stopped happening after an hour or so, and the jobs have succeeded ever since. We have investigated 
    this problem and determined that it was a caching issue on our xrootd redirector, and have hopefully addressed it with an update to our xrdlfc plugin.  
    ggus 71322 closed, eLog 26217.
    2)  6/9: shifter reported that BNL site services were degraded (via SLS monitoring).  Issue resolved - from Hiro: It was a problem with a backend mysql, 
    running out of space. The service has been restored.  ggus 71370 closed, eLog 26258.
    3)  6/9: AGLT2 - from Bob: We may have lost our full job load, or most of it, at AGLT2 due to a network problem in our UM server room.  Network is 
    back, but we may well have a lot of lost heartbeats showing up eventually.  ggus 71416 was opened during this period - now closed.  eLog 26298.
    4)  6/10: HU_ATLAS* queues set off-line in preparation for a weekend maintenance downtime.  Outage completed as of early a.m. 6/13.  However, jobs 
    are not running at the site (brokerage) due to missing information about atlas s/w releases (BDII forwarding to CERN?).  Issue being tracked here: 
    https://ticket.grid.iu.edu/goc/viewer?id=10566.
    5)  6/10-6/11 early a.m.: File transfer errors at NET2 (BU SRM errors to NET2_phys-top), but this was due to the site maintenance outage (4 above).  
    ggus 71461 closed, https://savannah.cern.ch/support/index.php?121506, eLog 26312/19.
    6)  6/12: Issue with sites being incorrectly associated with ggus tickets in the "SSB daily resume bugs" mailing resolved (thanks Carlos).
    7)  6/13: File transfer errors at BNL such as "dcdoor09.usatlas.bnl.gov:2811globus_xio: System error in connect: Connection refused."  From Iris: 
    Restart door fixed the problem. Automatically restart will be added to script monitoring door connection.  ggus 71749 closed, eLog 26391.
    8)  6/13: SLAC - production job failures with the error "pilot: Exception caught by pilot watchdog: [Errno 10] No child processes trans: Unspecified error, 
    consult log file."  Wei solved the problem by disabling the multi-job pilots.  Issue will be raised with panda / pilot developers.  ggus 71475 closed, 
    eLog 26382.
    9)  6/13: SWT2_CPB - DDM file transfer errors.  From Patrick: One layer in our SRM (xrootdfs) was not running and was restarted.  Investigating to 
    try and understand why this happened.  ggus 71505 / RT 20200 closed, eLog 26439.
    10) 6/14: New pilot software release from Paul (includes several changes related to CVMFS).  Details here:
    http://www-hep.uta.edu/~sosebee/ADCoS/pilot-version_SULU_47f-prime.html
    11)  6/14: Job failures at HU_ATLAS_Tier2 with the error "lsm-get failed: time out after 5400 seconds."  ggus 71539, eLog 26438.
    
    Follow-ups from earlier reports:
    
    (i)  6/2: MWT2_UC - job failures with the error "taskBuffer: transfer timeout for..."  Not a site issue, but rather related to the problem seen recently with 
    transfers between US tier-2's and European destinations (under investigation).  ggus 71177 closed, eLog 26032.
    Update 6/7: still see large numbers of these kinds of job failures.  ggus 71314, eLog 26202.
    See also discussion in DDM ops Savannah: https://savannah.cern.ch/bugs/?82974.
    Update 6/14: ggus 71314 is still 'in-progress', but no recent updates from FZK/DE cloud.
    

  • this meeting: Operations summary:
    Yuri's summary from the weekly ADCoS meeting:
    http://indico.cern.ch/getFile.py/access?contribId=0&resId=0&materialId=slides&confId=143330
    
    1)  6/15: IllinoisHEP, from Dave: For some reason, many jobs in the IllinoisHEP production queue are failing.  I am not sure why just yet, so I have put 
    this queue offline.
    2)  6/19:  DDM transfer errors to SLACXRD_PERF-JETS from multiple sources (" [DDM Site Services internal] Timelimit of 172800 seconds exceeded").  
    ggus 71675 in-progress, eLog 26572.
    3)  6/20: DDM transfer errors to NET2_* tokens (" failed to contact on remote SRM [httpg://atlas.bu.edu:8443/srm/v2/server").  From Saul & John at NET2: 
    Our srm was having a problem picking up new credentials this morning and was rejecting most requests and needed to be re-started. All seems to be 
    fine now.  ggus 71701 closed, eLog 26603.
    4)  6/21 early a.m.: SWT2_CPB - file transfer errors ("/bin/mkdir: cannot create directory...").  Issue was a failed disk in one of the RAID's, which triggered 
    a re-build, but the controller hung up, which necessitated a reboot of the storage server.  System back up as of early evening - test jobs successful, 
    prod & analy queues back to 'on-line'.  ggus 71758 / RT 20237 closed, eLog 26680.
    http://savannah.cern.ch/support/?121682.
    5)  6/21: From Shawn at AGLT2: We have lost the current Condor job load at AGLT2. We had a problem with the iSCSI server that hosts the OSGHOME 
    and ATLAS release areas and a quick reboot turned into a much longer repair than anticipated.  All running Condor jobs are lost and will show up (over 
    the next N hours ) as lost-heartbeats I assume.  (Shifters were requested to ignore any associated lhb errors.)
    6)  6/21: SLAC - job failures with errors like "Put error: lfc_creatg failed with (1015, Internal error)|Log put error: lfc_creatg failed with (1015, Internal error)."  
    Wei reported the issue was a failed disk, now fixed.  ggus 71774 closed, eLog 26658, https://savannah.cern.ch/support/index.php?121698.
    7)  6/21-22: NET2: jobs not being brokered to the site.  Saul reported that tomcat died on the BU gatekeeper, causing the system to stop reporting to the 
    OSG bdii.  Re-started, will monitor.
    
    Follow-ups from earlier reports:
    
    (i)  6/2: MWT2_UC - job failures with the error "taskBuffer: transfer timeout for..."  Not a site issue, but rather related to the problem seen recently with 
    transfers between US tier-2's and European destinations (under investigation).  ggus 71177 closed, eLog 26032.
    Update 6/7: still see large numbers of these kinds of job failures.  ggus 71314, eLog 26202.
    See also discussion in DDM ops Savannah: https://savannah.cern.ch/bugs/?82974.
    Update 6/14: ggus 71314 is still 'in-progress', but no recent updates from FZK/DE cloud.
    (ii)  6/10: HU_ATLAS* queues set off-line in preparation for a weekend maintenance downtime.  Outage completed as of early a.m. 6/13.  However, jobs 
    are not running at the site (brokerage) due to missing information about atlas s/w releases (BDII forwarding to CERN?).  Issue being tracked here: 
    https://ticket.grid.iu.edu/goc/viewer?id=10566.
    (iii)  6/13: SLAC - production job failures with the error "pilot: Exception caught by pilot watchdog: [Errno 10] No child processes trans: Unspecified error, 
    consult log file."  Wei solved the problem by disabling the multi-job pilots.  Issue will be raised with panda / pilot developers.  ggus 71475 closed, 
    eLog 26382.
    (iv)  6/14: Job failures at HU_ATLAS_Tier2 with the error "lsm-get failed: time out after 5400 seconds."  ggus 71539, eLog 26438.
    Update 6/17 from Saul & John at NET2: Problem resolved by improving our LSM so that it can handle the whole Harvard site starting at once.  
    ggus 71539 closed.
    

DDM Operations (Hiro)

Federated Xrootd deployment in the US (Charles, Doug, Hiro, Wei)

last week(s) this week:
  • Items on track.
  • Name-translation plug-in problem is being debugged (a quick way to exercise it once fixed is sketched below).
  • Waiting for the 3.0.4 release module.
  • xprep option to be ready by end
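  • For when the name-translation plug-in is fixed: a copy through the federation redirector exercises the translation path end to end. A minimal check - the redirector hostname and LFN below are placeholders, not the production endpoints:
    xrdcp -d 1 root://fax-redirector.example.org:1094//atlas/dq2/user/SomeUser/some_dataset/NTUP.example.root /tmp/fax-test.root
    A clean copy, with the redirector log showing a sensible local path, indicates the name-translation (xrdlfc) layer is healthy.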

Throughput and Networking (Shawn)

CVMFS

See TestingCVMFS

last week:

this week:

  • See email summary of actions from meeting 6/16/11:
    1.  Follow-up on failed production jobs overnight at Illinois to understand cause (Dave et al)
    
    2. Alessandro will modify the installation & validation code to check for the presence of local site overrides to setup files for either:
    	a) traditional: use of pool file catalog file exported out of NFS plus conditions data in HOTDISK
    	b) cvmfs: use PFC and conditions data from cvmfs
    
       This will provide the option to roll-back changes if there are problems with cvmfs, and to test performance and other issues associated with having conditions data served from cvmfs.
    
       Test both modes at Illinois.
    
       Note dbrelease files are still required in HOTDISK (even if unused) for Panda brokering purposes.
    
       The ATLAS worker node client will continue to be supported with the OSG worker node client for the time being; we discussed dependency issues and testing required in the case that dq2 clients may be drawn from CVMFS itself (involves worker node client, local site mover, pilot). 
    
    
    3. Prepare first pass of OSG-specific documentation in the ATLAS twiki, https://twiki.cern.ch/twiki/bin/view/Atlas/CernVMFS#Setup_Instructions_for_OSG_Grid
    
    4. Broaden tests to include the following sites:
    	MWT2 (new queue) - Sarah, starting next week
    	SWT2_CPB - Patrick, starting in two weeks
    	BNL_ITB - Xin, starting next week
    
    5. Clearing of grid3-locations and re-validation and tagging of releases at sites from cvmfs.  Note Panda brokering requires (sites)
    
    6. Running of validation jobs over these sites - analysis, production, and HC (eg. this test as a template, http://hammercloud.cern.ch/atlas/10004919/test/)  (sites)
    
    7. Finalize any deployment instructions based on these tests.
  • Updates: job failures at Illinois with the new setup are not reproducible; new jobs are completing successfully. (An illustrative worker-node client configuration for the sites joining the tests is sketched below.)
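  • For the sites joining the tests (item 4 above), the worker-node client configuration needed is roughly the following; this is an illustrative sketch (the proxy hostname is a placeholder), not an official template:
    # /etc/cvmfs/default.local (illustrative)
    CVMFS_REPOSITORIES=atlas.cern.ch,atlas-condb.cern.ch
    CVMFS_HTTP_PROXY="http://squid.mysite.edu:3128"
    CVMFS_CACHE_BASE=/var/cache/cvmfs2
    CVMFS_QUOTA_LIMIT=10000    # local cache limit in MB
    The atlas-condb.cern.ch repository is what would serve conditions data in the "cvmfs" mode of item 2b; sites testing only the traditional mode can omit it.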

Tier 3 Integration Program (Doug Benjamin)

Tier 3 References:
  • The link to ATLAS T3 working groups Twikis are here
  • T3g Setup guide is here
  • Users' guide to T3g is here
  • US ATLAS Tier3 RT Tickets

last week(s): this week:

Tier 3GS site reports (Doug Benjamin, Joe, AK, Taeksu)

last week:
  • AK - CIO is looking at a number of issues that came from Jason.

this week:

  • UTD - all well
  • AK - will be repeating tests w/ Jason without the firewall in place to check performance; max of 10 MB/s in one direction. For bi-directional (simultaneous) transfers, we get 10 MB/s max for outbound and 5 MB/s max for inbound. Hiro will start a load test using the SRM endpoint.
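  • A generic way to reproduce the bidirectional numbers above (the tool and hostname are assumptions, not necessarily what Jason uses): run an iperf server at the far end, then a simultaneous two-way test from the Tier 3, and repeat with the firewall rule relaxed:
    # far end (hostname illustrative)
    iperf -s
    # Tier 3 side: 60-second simultaneous bidirectional test, reporting every 10 s
    iperf -c netperf.example.bnl.gov -t 60 -i 10 -d
    Note that 10 MB/s is only ~80 Mb/s, so on a 1 Gb/s path the firewall (or TCP tuning) is clearly the limiting factor if the numbers do not improve.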

Site news and issues (all sites)

  • T1:
    • last week:
      • All is well. Pleased to see the available headroom in cooling capacity at BNL. There was an automatic shutdown yesterday at FNAL.
      • Autopyfactory progress.
      • Jose making good progress w/ glexec. 8 out of 10 Tier 1 centers in production.
    • this week:
      • smooth operations. Sharp increase in analysis jobs - 60K activated, so adjusted analysis share. LHCONE connectivity - dedicated fiber in service up to peering point in Manhattan, to be connected to MANLAN by next week. Once the distributed exchange point comes up
      • Chimera hardware arrived.

  • AGLT2:
    • last week(s):
      • Want to update dCache to 1.9.12-3 - it's now "golden". Downtime? Wait a couple of weeks (until the PLHC results go out).
    • this week:
      • iSCSI server became non-responsive, causing a drop of the job load (it hosts the usatlas home area). Recovery was not automatic - an automated script plus job removal from Condor were needed to restore the full load.
      • Waiting for dcache upgrade - will need a downtime within the next few weeks.

  • NET2:
    • last week(s):
      • Will do a major ramp-up of analysis jobs next week.
      • HU will be in downtime next week.
    • this week:
      • I/O upgrade progress: new software for dynamically spreading the adler32 checksum load; re-checksummed most of the inventory (a generic adler32 sketch follows this list); improvements to the LSM software for BU/HU/Tufts. Still to do: move to multiple gridftp endpoints; direct reading at BU, cluster-NFS direct reading at HU. Ready to pull the trigger on a second 10 Gbps link to NoX.
      • Lots of incoming data to the TOP-Physics group area (65TB so far)
      • Ramping up HU analysis to 500 jobs for the first time; 850 MB/s BU->HU to the worker nodes; ready to pull the trigger on the second 10 Gbps link to NoX.
      • Major HU shutdown last weekend. Some minor troubles coming back from this.
      • Tufts usage of NET2 DDM via the LSM is ramping up in earnest for the first time. Upgrading the Tufts LSM to use pcache.
      • Generally smooth T2/ BU T3/ HU T3 operations modulo the HU shutdown.
      • Note to self: enable HCC
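      • For reference on the re-checksumming work above: adler32 can be computed incrementally with zlib, so whole files never need to be held in memory. A generic Python sketch, not the actual NET2 LSM code:
        # Block-by-block adler32 of a file; the 8-hex-digit form matches what DDM records.
        import zlib
        def adler32_file(path, blocksize=4 * 1024 * 1024):
            value = 1                              # adler32 starting seed
            f = open(path, "rb")
            try:
                block = f.read(blocksize)
                while block:
                    value = zlib.adler32(block, value)
                    block = f.read(blocksize)
            finally:
                f.close()
            return "%08x" % (value & 0xffffffff)   # force unsigned, zero-padded hex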

  • MWT2:
    • last week:
      • UC: testing srvadmin 6.5.0 x86_64 RPMs from Dell as well as 6.3.0-0001 firmware on the PERC6E/I cards in our R710s to reduce sense errors
      • IU: MWT2, cvmfs
      • UIUC: cvmfs testing
    • this week:
      • UC: Finishing up rebuild of one of our storage servers (disk replacements, RAID firmware update, full testing). Working on new "bootstrap" machine integrating Cobbler+Puppet, DNS, DHCP services. Still tying up loose ends on server room (UPS readouts, Liebert monitoring in building controls system & load balancing).
      • IU: Working on CVMFS testing with new MWT2 Condor queue. Retiring Force10 switch, recabling to use 6248; considering new switch gear options with IU networking group. Preparing for rack rearrangement in server room, likely August downtime of ~three days.
      • UIUC, from Dave: I will be on vacation this week and might not be able to attend the integration meeting on Wednesday. Here is some information on CVMFS since the breakout last week. I restarted production on Thursday after the breakout session, when I emailed you about possible Illinois squid issues. I have seen a few additional production and also user analysis jobs fail where they could not find files in CVMFS. I suspected a squid problem, so I added the UC squid server as a backup to ours in cvmfs (but not as a load-balance server). I then rebooted the Illinois squid and cleared out its disk cache. During the reboot/clean-out time, cvmfs accesses did properly move to the UC squid. From what I could determine, all the nodes just started using the UC squid without any repercussions on the running jobs. "cvmfs-talk proxy info" will tell you which squid is being used by a node, and it showed that UC was being used. So far, after rebooting and cleaning up the Illinois squid, I have not seen any problems with jobs and missing files in cvmfs, but since the failure is rare it will take some time to know if that problem is now gone. There is a web site that I do not think has a link on the CVMFS twiki page that folks might like to know about: http://cernvm.cern.ch/portal. You can find the release notes for cvmfs, etc. on this site. Also, the writeup by Jakob Blomer is very useful: https://cernvm.cern.ch/project/trac/downloads/cernvm/cvmfstech-0.2.70-1.pdf. One piece of information in this document is that it claims the servers for the repositories should have access to the local squid; it does not explain why, though. I doubt my "missing files" problem is related to the fact that I did not have these servers in the allowed ACL, but just in case I have added them to my squid configuration file, which now looks like:
        acl our_networks src 192.17.18.32/28 192.168.207.0/24 128.174.118.0/24 127.0.0.1 cvmfs-stratum-one.cern.ch cernvmfs.gridpp.rl.ac.uk cvmfs.racf.bnl.gov
        http_access allow our_networks
        If this really is a requirement of a squid for CVMFS, then I think it might be good to put this on the twiki as well. Perhaps Doug can get some clarification on this point.
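      • Client-side note related to the above (an illustrative sketch, not Dave's exact settings): the "UC squid as backup, not load balance" behavior corresponds to separating the two proxies with ";" rather than "|" in CVMFS_HTTP_PROXY on the worker nodes, e.g.
        CVMFS_HTTP_PROXY="http://squid.illinois.example.edu:3128;http://squid.uchicago.example.edu:3128"
        Proxies separated by "|" within a group are load-balanced; a ";" starts the next failover group. Hostnames above are placeholders.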

  • SWT2 (UTA):
    • last week:
      • Working on CVMFS issues - main focus.
    • this week:
      • The April update of OSG at UTA_SWT2 wound up sending duplicate accounting data to WLCG. Working with the OSG accounting folks on what to do.
      • Drained over the weekend - missing releases, so no jobs were brokered. Can also initiate a request from Alessandro's database.
      • Outage yesterday at CPB due to some storage servers going offline.
      • Supporting CVMFS testing - using three nodes to act as test nodes.

  • SWT2 (OU):
    • last week:
      • Re-installing corrupted releases.
    • this week:
      • on vacation

  • WT2:
    • last week(s):
      • Completed the last round of power outages.
      • Prep w/ HCC VO - new sub-cluster. Agreement with SLAC security. HCC and Pilot factory firewall exceptions.
    • this week:
      • Problem with LFC hardware yesterday, replaced.
      • DDM transfer failures from Germany and France - all are log files; ROOT files are working fine. Is FTS failing to get these? Email sent to Hiro. NET2 is also seeing performance problems.
      • Hiro - notes many of these are never started, they're in the queue too long.
      • Suspects these are group production channels.
      • T2D channels.
      • FZK to SLAC seems to be failing all the time. Official Tier 1 service contact for FZK.

Carryover issues (any updates?)

Python + LFC bindings, clients (Charles)

last week(s):
  • We've had an update from Alain Roy/VDT - delays because of personnel availability, but progress on build is being made, expect more concrete news soon.
  • wlcg-client being tested by Marco Mambelli
  • wn-client being tested at UC
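  • A minimal smoke test of the new bindings once a wn-client build is in hand - a sketch only; the LFC host and path are illustrative, and the calls assume the standard SWIG-generated lfc module:
    import os, lfc
    os.environ.setdefault("LFC_HOST", "lfc.usatlas.bnl.gov")   # illustrative host
    st = lfc.lfc_filestatg()
    rc = lfc.lfc_statg("/grid/atlas", "", st)                  # 0 on success
    if rc == 0:
        print "OK: mode=%o guid=%s" % (st.filemode, st.guid)
    else:
        print "lfc_statg failed, serrno=%d" % lfc.cvar.serrno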
this week:

WLCG accounting (Karthik)

last week: this week:

HTPC configuration for AthenaMP testing (Horst, Dave)

last week
  • Queue is set up and working; Douglas has been running lots of jobs. Some failing, but others succeeding.
this week
  • Dave reports successful jobs submitted by Douglas last week

AOB

last week
  • No meeting next week (BNL virtual machines workshop).
  • Fred - there are some discrepancies in the RSV reliability and availability numbers being reported. This has to do with maintenance downtimes being scheduled across the UTC boundary. Tracking the issue with the GOC.
this week
  • None.


-- RobertGardner - 21 Jun 2011


Attachments


png facility_success_cumulative_smry.png (62.4K) | RobertGardner, 22 Jun 2011 - 11:52 |
png PastedGraphic-2.tiff (1564.1K) | RobertGardner, 22 Jun 2011 - 12:26 |
jpg screenshot_03.jpg (222.5K) | RobertGardner, 22 Jun 2011 - 12:30 |
 