r3 - 14 Nov 2007 - 14:47:13 - RobertGardner

MinutesNov14

Introduction

Minutes of the Facilities Integration Program meeting, November 14, 2007
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • Phone: (605) 475-6000, Access code: 735188; Dial 6 to mute/un-mute.

Attending

  • Meeting attendees: Michael, Fred, Charles, Rob, Alexei, Patrick, Wei, Nurcan, Kaushik, Bob, John, Jay, John, Karthik, Hiro, Wensheng

Integration program update (Rob, Michael)

Operations: Production (Kaushik)

  • Production summary (Kaushik)
    • Sites full again, yeah! Reprocessing task completed, back in simulation mode. Now seeing some bottlenecks in Eowyn again; Karthik and Mark are setting up a second server. MWT2_UC backlog of files - discussed below. AGLT2 still not filled, probably because of Eowyn: not getting enough pilots, Bob thinks. May need a discussion about scaling up when MSU comes online.
    • Recon jobs coming again tomorrow - may have problems again getting datasets off tape. Input data sets? "Everything" for CSC notes. Discuss.
  • Production shift report (Nurcan)
    • OU coming back into service, will be sending jobs soon.
    • Lyon - AFS loading issues.

Operations: DDM (Alexei)

  • Status of M5 processing and distribution of datasets to the facility
  • Some latecomers to M5 - late datasets. BNL has 85-96%.
  • Central services problems - Alexei will pause his cron-checking processes. Simple calls to DQ2 are failing - problems accessing the database. Pedro investigating.
  • Operations group will provide a new list of requirements for DQ2. Alexei compiling the list, will distribute before Friday.
  • For FDR, RDO datasets need to be transferred to CERN. 40% of all this data is at BNL, most of it on tape, requiring stage-in.
  • Are these RDOs overlapping with a request from Kaushik? There is probably some overlap.
  • State of M5 distribution to Tier2's.

DQ2 0.4 deployment (Hiro, Patrick, Shawn)

  • See further DQ2SiteServices to capture deployment experience, known issues.
  • Next site: MWT2 - done.
  • Follow-up on zombie processes:
    • Patrick reports a DQ2 host failed - not sure if it was a hardware failure. Also notices a large number of zombie processes wrapped around glite-transfer processes (possibly a status call). Wei reports this happening at SLAC as well (3 started since November 3, still there). Also happening at BU, AGLT2 (1500), UC. Submit as a DDM Savannah ticket. Could be an FTS issue.
  • Patrick - opened the ticket, has been taken by Miguel.
  • Charles notes they are taking resources - not really an operational problem.
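
To gauge how many zombies a site has accumulated and which parent process owns them, something like the following sketch could be used. It is only an illustration: it parses text in the format printed by `ps -eo pid,ppid,stat,comm`, and the sample listing (PIDs, command names) is made up for the example, not taken from any of the sites above.

```python
from collections import Counter


def count_zombies_by_parent(ps_output: str) -> Counter:
    """Count zombie (state Z) processes per parent PID.

    Expects text in the format printed by `ps -eo pid,ppid,stat,comm`:
    one header line, then one process per line.
    """
    counts = Counter()
    for line in ps_output.strip().splitlines()[1:]:  # skip the header line
        fields = line.split(None, 3)
        if len(fields) < 3:
            continue  # ignore malformed lines
        ppid, stat = fields[1], fields[2]
        if stat.startswith("Z"):
            counts[int(ppid)] += 1
    return counts


# Hypothetical sample listing, for illustration only.
SAMPLE = """\
  PID  PPID STAT COMMAND
    1     0 Ss   init
 2301     1 S    python
 2310  2301 Z    glite-transfer- <defunct>
 2311  2301 Z    glite-transfer- <defunct>
 2400     1 S    sshd
"""

if __name__ == "__main__":
    for ppid, n in count_zombies_by_parent(SAMPLE).most_common():
        print(f"parent {ppid}: {n} zombie(s)")
```

On a live host the input could come from `subprocess.run(["ps", "-eo", "pid,ppid,stat,comm"], capture_output=True, text=True).stdout`. Grouping by parent PID helps show whether the zombies all hang off the same site-services process.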

Analysis Queues (Bob, Mark)

  • See AnalysisQueues
  • Early next week OU will start setting up the queues.
  • Patrick and Mark have analysis on DPCC running - but not quite working as desired.
  • Pathena-evgen jobs as validation.
  • Email Bob, ball@umich.edu.

Accounting (Shawn, Rob)

Follow-up on issues (see Accounting).

Network Performance and Throughput initiative (Dantong)

  • See work in progress at NetworkPerformanceP2
  • Finished BNL and OU tuning (last week)
  • Need to revisit BU tuning - problem at NOX.
  • All Tier2 sites now visited:
    • 10G sites: IU, UC, Mich - simple tuning effective.
    • 1G sites: simple tuning not as effective. Only marginal improvements, requiring a larger number of parallel streams.
  • Next steps - push the 10G sites.
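
As background for the trade-off between TCP tuning and parallel streams, here is a small sketch of the standard bandwidth-delay product calculation. It does not reproduce the measurements above; the round-trip time and window sizes are assumed values chosen only to show the relationship, and packet loss and host limits are ignored.

```python
import math


def bdp_bytes(link_gbps: float, rtt_ms: float) -> float:
    """Bandwidth-delay product: bytes that must be in flight to fill the link."""
    return link_gbps * 1e9 / 8 * (rtt_ms / 1e3)


def streams_needed(link_gbps: float, rtt_ms: float, window_bytes: int) -> int:
    """Parallel TCP streams needed to fill the link when each stream is
    limited to window_bytes in flight (idealized: no loss, no host limits)."""
    return math.ceil(bdp_bytes(link_gbps, rtt_ms) / window_bytes)


if __name__ == "__main__":
    rtt_ms = 20.0  # assumed round-trip time, not a measured value
    for link_gbps in (1, 10):
        for window in (64 * 1024, 4 * 1024 * 1024):  # assumed per-stream windows
            n = streams_needed(link_gbps, rtt_ms, window)
            print(f"{link_gbps}G link, {window // 1024} KiB window: {n} stream(s)")
```

The point of the sketch is that a small per-stream window has to be compensated by more streams, while a larger (tuned) window lets a few streams fill the same path.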

Throughput initiative - overview (Shawn)

  • Current exercise:
    • Hiro preparing the 70 files and has initiated a transfer, 3.6 GB files in a test subscription, pinned.
    • Ramping up to a higher rate, using FTS controlled by Hiro.
  • No comment this week.
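
For a sense of scale, the size of the test subscription above (70 files of 3.6 GB) and its transfer time at a given sustained rate can be worked out as below. The rates are placeholders, not targets or results from this exercise, and decimal units (1 GB = 1000 MB) are assumed.

```python
def transfer_summary(n_files: int, file_gb: float, rate_mb_s: float):
    """Return (total size in GB, transfer time in minutes) for a sustained
    rate in MB/s, assuming decimal units (1 GB = 1000 MB)."""
    total_gb = n_files * file_gb
    minutes = total_gb * 1000 / rate_mb_s / 60
    return total_gb, minutes


if __name__ == "__main__":
    for rate in (50, 100, 200):  # assumed sustained rates in MB/s
        total_gb, minutes = transfer_summary(70, 3.6, rate)
        print(f"{total_gb:.0f} GB at {rate} MB/s: about {minutes:.0f} minutes")
```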

Load test displays, issues from the last week (Jay)

  • Follow-up on:
    • Making live graphs available on web page via MonALISA repository.
    • Looking into gridview plots via web service publisher
  • Have understood how to set up the ML repository for broad visibility.
  • Disk-to-disk - the results are very low, such that only the number of streams mattered.
  • *Action item

OSG

  • OSG 0.8 released, deployment instructions: OSGservices
    • General OSG summary is: Why upgrade?
    • Of these, the following are important for ATLAS:
      • RSV - resource validation service for WLCG-SAM service availability monitoring
      • Updates to Gratia probes for PBS, Condor and LSF schedulers
      • Site information services - BDII and Condor ClassAd mechanisms
      • Glexec for supporting user jobs within pilots - for Panda development
      • Syslog-ng through VDT: can be used for troubleshooting gatekeeper issues
      • Managing updates to OSG: updating will be much easier from an OSG 0.8 stack (via pacman -update); updates to OSG 0.6 via pacman -update are possible, though usually not preferred.
      • Updates for GIP: The list can be found at: https://twiki.grid.iu.edu/twiki/bin/view/ReleaseDocumentation/GenericInformationProviders#Major_Updates_for_OSG_0_8_0
        1. Updated GIP to include Glue Schema 1.3
        2. Changed default value from UNAVAILABLE to UNDEFINED
        3. Deprecated osg-info-dynamic-dcache
        4. Added support for VDT_OLD_LOCATION
        5. Fixed LSF bug (VDT Ticket #2647)
        6. Added squid service plugin
        7. Updated templates to conform to Glue 1.3
        8. Admins can add site defined constraints to customize configure condor status commands
        9. Admins can specify if they don't want to count VMs in "Owner" State in Condor Batch system
  • Security issues addressed in OSG 0.8:
    • We have two new methods to automatically update the CA certificates. One is the vdt-update-certs program. People can also use yum to get an RPM with the new CA certificates automatically. This will do a lot to help sites keep up to date.
    • glexec is in the VDT. It's a longer argument to convince people of its importance, but it is a critical piece of the security infrastructure for sites that care about it.
    • Tomcat and Java are at the latest version, and they contain security fixes.
  • OSG has a campaign to help sites install the software, see request form for help.
  • OSG instructions much improved over previous versions, see https://twiki.grid.iu.edu/twiki/bin/view/ReleaseDocumentation/WebHome
  • The installation is fairly simple. I did this yesterday over the course of six hours (total) on MWT2_UC without interrupting production. It would have been shorter without interruptions for meetings.
    • MWT2_UC updated yesterday
    • Took about 6 hours end to end, with interruptions for meetings and lunch; we also re-organized the install for wn-client and added our SRM-dCache as a registered OSG storage element.
    • Installed Managed fork, Syslog-ng, configured RSV
    • See validation in VORS
    • Note that RSV, being a new service, still has some kinks in terms of ease of configuration. Had to work harder than usual to get this set up correctly - not done yet on MWT2_UC, but on Monday we completed this for UC_ATLAS_MWT2, see: https://tier2-osg.uchicago.edu:8443/rsv/ UC_ATLAS_MWT2
    • osg-0.8-install.pdf: OSG install notes for MWT2_UC
    • However, the ClassAds are reporting correctly, see the ReSS validation service
  • OSG site administrators meeting at Fermilab: Dec 12-13

Panda release installation jobs (Wei, Tadashi)

  • Initial problems with perms on directories - probably solved.

RSV, Nagios, SAM (WLCG) site availability monitoring program (Tomasz)

  • Need information from Arvind about accessing the RSV database.
  • Progress on eliminating firewall problems. Some timeouts still happening - just close ticket.
  • RT tickets - MWT2_IU - software installation at IU; Fred believes the SW has been installed, but moving $APP at the moment.
  • Still preparing split Nagios - internal/external - writing documentation. Tutorial by SLAC.

Site news and issues (All Sites)

  • T1: Dantong: both gatekeepers have OSG 0.8 installed; glexec installed. Michael: Storage: several updates - robot control software upgraded. Expect drives to be installed for the staging activities discussed earlier.
  • AGLT2: broke the 500-job barrier. Marco notes there is a max submit per site at the submit host. Will need to increase the rate. Playing with 2950 disk servers; at least one 40 TB server for dCache; looking at optimizations. Notes there are problems with resilient dCache. Note: use osg-storage@opensciencegrid.org. Running smoothly. Will schedule a downtime, opportunistic, for network and gatekeeper work.
  • NET2: Things going well - 10G up and functional, things fixed on Friday. Note also gatekeeper needs to be tuned for network.
  • MWT2: OSG 0.8 upgraded. Backlog - caused by srm and gridftp on different hosts. We have a mix of srm and gridftp. Charles changed the LRC manually to have gsi endpoints. Backlog clearing now according to Hiro.
  • SWT2_UTA: busy getting next cluster up and running. Rocks and SL 4.5 not working correctly. Making progress w/ CentOS. Setting up xrootd. Gridftp host from OSG. Hope to get this done this week. OSG 0.8 upgrade will follow this. Opteron 2216.
  • SWT2_OU: Close to getting new cluster into production. Will get a few Panda test jobs running this week, going into production next week.
  • WT2: Tier2 meeting preparations. 320 cores (Intel) on order, will take a while - 4-5 weeks. January installation. Pushing Andy for new xrootd release.

NOTE

RT Queues and pending issues (Tomasz)

Carryover action items

Syslog-ng

  • Encryption for syslog-ng: still to do, carryover.

Site performance jobs and metrics

  • Carryover; some benchmarking work w/ quad-core Opterons.

New Action Items

  • See items in carry-overs and new in bold above.

AOB

  • none

-- RobertGardner - 13 Nov 2007

Attachments


pdf Tier-2_Accounting_Report_October2007.pdf (113.0K) | RobertGardner, 13 Nov 2007 - 09:57 | WLCG Tier2 accounting report
pdf osg-0.8-install.pdf (46.1K) | RobertGardner, 14 Nov 2007 - 12:09 | OSG install notes for MWT2_UC
xls ATLAS_Processor_benchmarks_-_20071031.xls (99.0K) | RobertGardner, 14 Nov 2007 - 14:46 | Processor benchmarks from Kitval
 