r5 - 10 Jul 2007 - 18:32:47 - MichaelErnst



Minutes of the Facilities Integration Program meeting, June 13, 2007


  • Meeting attendees: Jay, Tom, Yuri, Horst, Karthik, Bob, Nurcan, Patrick, Saul, Xin, Mark, Rich, Tom, Michael, Rob, Kristy, Shawn, Charles, Wensheng
  • Apologies: Kaushik, Alexei

Operations: Production

  • Status report from Kaushik
  • Shifters report (Mark)
    • MWT2 - Charles - continuing to work on the http interface to the LRC. DQ2 0.3 site services are finished, though. There is an open question about siteinfo.py.
    • UTA_SWT2 - going to investigate sharing of site services, with LRCs in front of the other clusters. Why not share the LRC as well? Patrick thinks it's best to keep things separate for the time being, in order not to complicate things.
    • AGLT2, SLAC_XROOTD running since last week.
    • Large backlog of jobs to be sent back to BNL. Transfer rates are low - Hiro investigating.
    • Another problem is the delivery of jobs from Eowyn. Yuri thinks there are plenty of jobs in the assigned state.
    • The overall failure rate, other than for file movement, is low.
    • Michael notes that dCache at BNL has been ruled out - the problem is somewhere in DQ2.
    • The present status is that input files are not being transferred to the Tier2s. Why? Needs follow-up.
    • For UTA, there is a backlog of 4GB files that Patrick thinks will not make it to UTA. A problem in FTS? Seeing timeouts in the transfers. (Perhaps open this up?) Seems capped at 100 Mbs. What is rate limiting?
  • Analysis queues: initial instructions, site guides, Panda integration issues: AnalysisQueueP1 - Patrick
    • Need to understand what handles are available for handling analysis vs production jobs. Will require some changes within Panda. Analysis will be represented as a separate site (two entries in siteinfo.py). Need to be able to distinguish between the two types in terms of RSL, e.g. by specifying a different queue, or by setting priorities and runtimes. At dppc, a separate queue had previously been set up. For Condor - need to figure out how to set the priorities. Perhaps set up a separate job manager for analysis.
    • In terms of policy - we should stick to the production quota of 23% - and leave room for analysis.
    • For Condor - this might be handled by setting the priorities on the submit host. Action item: Bob Ball and Mark.
  • Background
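
As a rough check on the 4 GB transfer timeouts noted above for UTA, the sketch below estimates the ideal time to move one file at the observed cap. Reading "100 Mbs" as either megabits or megabytes per second is an assumption, as are the helper name `transfer_seconds` and the use of decimal units; the actual FTS timeout value is not recorded in these minutes.

```python
def transfer_seconds(file_bytes: float, rate_bits_per_s: float) -> float:
    """Ideal time to move file_bytes at a sustained rate, ignoring all overhead."""
    return file_bytes * 8 / rate_bits_per_s


FILE_BYTES = 4e9  # 4 GB in decimal units (assumption)

# Two possible readings of "100 Mbs" (both are assumptions).
as_megabits = transfer_seconds(FILE_BYTES, 100e6)       # 100 megabits/s
as_megabytes = transfer_seconds(FILE_BYTES, 100e6 * 8)  # 100 megabytes/s

print(f"at 100 Mb/s: {as_megabits:.0f} s per file")
print(f"at 100 MB/s: {as_megabytes:.0f} s per file")
```

If the cap is shared among several concurrent transfers, the per-file time grows in proportion to the number of transfers, which may be worth comparing against whatever timeout is configured.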

Operations: DDM

  • See issues covered above.
  • Status report from Alexei (not present).
  • DQ2 0.3 deployment update and other DDM issues
  • See DQ2SiteServicesP1 for latest status and issues on the DQ2 0.3 upgrade.
  • AOD replication exercise - not discussed.
  • DQ2 site service at BNL. There are lots of bad requests that DQ2 is wasting time on. They are internally cancelled. These are caused by missing files at Tier2s. But, there are also problems - Wensheng notes that files are not cached off tape yet. Will take a day to clear.

Tier3/Tier2 Workshop recap

  • Last week's workshop, Tier2/Tier3 Workshop at IU.
  • Review of facility planning notes, FacilityNotesTier2Jun22
  • We need to get some of the site admins interested in setting up a Tier3 into the loop. Michael will set up some milestones and make some contacts.

Site certification review for Phase 1

  • The site certification table, SiteCertificationP1 (all sites please update prior to meeting)
  • AGLT2 has checked off most of these
  • Will set up a timeline and discuss

OSG 0.6 deployment update

  • See OSGservicesP1 for info, and SiteCertificationP1 for site status. Please add any "gotchas" and additional notes that come up during the installation here so that we can compare notes and experiences.
  • Site status:
    • AGLT2: was the job manager issue resolved? It was never resolved, but the site has gone to local pilot submission. Not an AFS issue. Can Yuri try another Condor-G submission?
    • BU: Saul - dealing with AC and new 60 TB storage. Switching to three gatekeepers. Within two weeks.
    • BNL_1: done ; BNL_2: before next week (finished?)
    • MWT2_IU: done
    • MWT2_UC: done
    • UC_ATLAS_MWT2: done
    • UC_Teraport: defer until new RHEL4 x86_64 validated. Two weeks estimate.
    • OU - upgraded OSCER already; still waiting on an additional 23 dual-quad nodes, and a move of the cluster (scheduled for July 11).
    • UTA_SWT2 cluster; have OSG 0.6 installed, but not the default gatekeeper.
    • UTA_dppc cluster: just waiting for a lull in production. DONE
    • SLACXRD: now running OSG 0.6

Logging: Syslog-ng upgrade

  • LoggingServicesP1 - describes VDT-based syslog-ng install.
  • AGLT2 - should be done today. MWT2_UC done. BU will do it today. BNL would like encryption.

Carryover action items from previous meetings

  • Tomasz will consult local experts for off-site Nagios console access: there have been some fixes: need to test. DONE
  • Tomasz will take first steps towards creating a Nagios plugins repository: BNL will set up an SVN repository accessible by grid certificate. DONE
  • All: sites to upgrade syslog-ng installation: *in progress * DONE
  • All sites: continue with DQ2 0.3 (DONE) and OSG 0.6 installations (still in progress)
  • All sites: check off actions taken in SiteCertificationP1. Will set up a timeline.
  • Evaluate initial OSG site availability scripts. DONE - Tomasz doesn't think these are immediately useful.
  • Setup of Load Tests scripts repository. DONE
  • Follow-up with Shawn and folks on first set of network I/O load tests. Need to install NDT on dedicated host at each site. Then tests on-demand can take place.
  • Follow-up with OSG troubleshooting on AGLT2 gatekeeper problem. Shawn and Yuri to follow up.
  • Develop DQ2 0.3 site services installation notes for the Facility. DONE
  • Test deployment of DQ2 0.3 and functionality on a number of sites. DONE

New Action Items

  • Encryption to syslog-ng
  • Install NDT at each site - put in site certification table.
  • RT tickets - Dantong notes we need to get the queues cleaned up. Dantong and team will draft an initial policy.

-- RobertGardner - 25 Jun 2007
