r5 - 20 Aug 2008 - 08:46:25 - RobertGardner



Minutes of the Facilities Integration Program meeting, April 2, 2008
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • Phone: (605) 475-6000, Access code: 735188; Dial *6 to mute/un-mute.


  • Meeting attendees: Kaushik, Armind, Michael, Rob, Justin, Charles, Nurcan, John B, Saul, John H, Shawn, Patrick, Torre, Alexei, Wei, Bob, Horst, Karthik, Fred, Tomasz
  • Apologies: none

Integration program update (Rob, Michael)

DQ2 0.6 upgrade status/plan (Alexei)

  • Stable release available by the end of this week.
  • There are some minor changes, but no new bugs have been found and the release is considered stable.
  • DQ2 1.0 - to be used for CCRC08-run2. Miguel will announce schedule, test instances available by April 15. Will need to work with Panda and Operations teams for testing with MC and functional testing. Believes site-services will be unchanged for OSG 1.0. There will be a request for a downtime, and a contingency.
    • Kaushik - will be tested with the Panda development server. Expect the actual migration of the catalogs to be short, less than a day.
    • Alexei - is requesting 3 days to 1 week downtime.
  • Release notes will be set up by the development team, and these will be distributed when patches are available.
  • There will be a dedicated mailing list set up for this.

Next procurements

Analysis Queue Update (Nurcan)

  • TWiki page for submitting pathena jobs on the FDR Data PathenaOnFDRData.
  • Any follow-ups from the past week?
    • Deleting user datasets made with pathena: the work already started, the implementation is in place, some tests have been done, final requirements/features to be added (Hiro and Charles).
      • Charles - there is a zeroth order version available, checked into CVS. It's a minor modification of the LRC code. Need to add a check based on the user's grid certificate.
      • Kaushik - thought the plan was to have Panda do this? Nurcan - started from the standpoint of users being able to delete datasets in LCG, why not on OSG sites.
      • Need to resolve authentication and which files a user should be allowed to delete.
      • What about glexec? End-user's certificate would be used.
    • Automatic redirection of analysis jobs within a cloud. Namely, no need to specify a site - pathena will choose the best site based on data availability and available CPUs (needs a couple of weeks to implement).
      • No obstacles, but has not yet been implemented. Auto-redirect is there, based on location.
    • A survey: analysis packages used to run pathena at the analysis queues are listed here.
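
As a rough illustration of the automatic redirection item above, the choice of site within a cloud could be sketched as follows. This is only a sketch under assumed inputs: the `Site` record, its fields, and `choose_site` are hypothetical names, not pathena or Panda code, and the actual ranking to be implemented was not specified in the meeting.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Site:
    """Hypothetical per-site summary; not a real Panda/pathena structure."""
    name: str
    files_available: int  # input files of the dataset already at the site
    free_cpus: int        # currently idle job slots


def choose_site(sites: Sequence[Site], files_needed: int) -> Optional[Site]:
    """Pick a site that holds the full dataset, preferring the most free CPUs.

    Assumption: a site is eligible only if it already has every input file.
    Returns None when no site in the cloud qualifies.
    """
    eligible = [s for s in sites if s.files_available >= files_needed]
    if not eligible:
        return None
    return max(eligible, key=lambda s: s.free_cpus)


# Usage with made-up numbers: site_b has all 10 files and the most idle slots.
cloud = [Site("site_a", 10, 5), Site("site_b", 10, 50), Site("site_c", 3, 500)]
best = choose_site(cloud, files_needed=10)
```

Requiring the full dataset first and only then ranking by free CPUs is one possible reading of "data availability and available CPUs"; a weighted score would be an equally plausible design.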

Operations: Production (Kaushik)

  • Production summary
    • Running quite well. Had an unexplained load on the Panda DB server last night; nothing in the error logs. A little worrisome.
    • Missing release DB files - jury still out.
      • At SLAC, the cleanse.py script deleted the file, but not the LRC entry.
      • At AGLT2, no match in LRC entry.
      • Charles - has circulated a proposal for change in cleanse.py code, with less stringent matching. Also has local script to "groom" LRC to change endpoints for example.
      • Kaushik - there are so many conventions in how the PFNs are entered. Why?
      • Patrick - the problem is coming from DQ2. dq2_cr adds // but no port number in a registration attempt for a gsiftp endpoint.
      • Wei - also sees this problem when doing direct reading from xrootd servers - requires additional code in pilot.
      • Need to report this in the DQ2 Savannah tracker. Patrick has already been in contact.
      • Torre - should use GUIDs for reliability.
      • John (BU) - has DB release files in the LRC and on disk, but they are not visible to DQ2 (dq2-list, or -list-replica). Follow-up on usatlas-ddm-l
      • Also, for the same db release file, there were 10 LRC entries (noticed at AGLT2). Not resolved. Follow-up on usatlas-ddm-l
  • Production shift report
    • No report - Wensheng.
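
The PFN mismatches discussed in the production summary above (an extra "/" and a present or missing port number) suggest comparing PFNs in a normalized form. The following is a minimal sketch of such less stringent matching. It is not Charles's proposed cleanse.py change; the function names and the example host `se.example.org` are hypothetical.

```python
import re
from urllib.parse import urlsplit


def normalize_pfn(pfn: str) -> tuple:
    """Reduce a PFN to (scheme, host, path), ignoring port and repeated slashes.

    Assumption: two PFNs that differ only in the port number or in doubled
    "/" characters in the path refer to the same physical file.
    """
    parts = urlsplit(pfn)
    path = re.sub(r"/{2,}", "/", parts.path)  # collapse "//" runs in the path
    return (parts.scheme.lower(), (parts.hostname or "").lower(), path)


def same_file(pfn_a: str, pfn_b: str) -> bool:
    """True when both PFNs normalize to the same (scheme, host, path)."""
    return normalize_pfn(pfn_a) == normalize_pfn(pfn_b)


# Usage: the two conventions seen in the discussion compare as equal.
match = same_file(
    "gsiftp://se.example.org:2811//data/file.root",
    "gsiftp://se.example.org/data/file.root",
)
```

Keying on GUIDs, as Torre suggested, would avoid this kind of string matching altogether.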

DDM operations within the BNL cloud (Hiro)

SRM v2.2 functionality for storage elements (ATLAS April 2 milestone)

  • Hiro: a space token is to be set up by April 25. There is a TWiki page describing this.
  • AGLT2 - has 4 FTS channels associated with the site. The site supports both SRM and gridftp endpoints. The main production storage element endpoint should be running v2.2.
    • Can we set up an RSV probe to monitor the SE endpoint?
    • May need to migrate data currently in the gridftp server into dCache, and update LRC.
  • MWT2
    • SRM is set up at IU, running in front of a single gridftp server.
    • SRM now working at UC - there were problems getting it to work with the private network.
    • Need to specify streams-num=1.
    • Also need to set up an RSV probe.
  • WT2
    • SRM running for several months, new version of bestman-xrootd is nearly available. Expect by next week to have everything ready.
    • Also RSV probe is needed.
  • NET2
    • Have been in contact with Wei and Alex about using bestman-xrootd with the POSIX interface. Installed, and in use with a separate gridftp server. Seems to be working fine. New gatekeeper available; will install bestman and OSG, and modify Tiers of ATLAS (8 cores, 32 GB of RAM). Not sure if a migration is necessary after the new service is available. Will set up two space tokens - mc/data. No functional tests yet.
  • SWT2
    • Waiting to hear how Wei's latest attempt goes.

RSV-SAM availability discussion

  • We had this above.

LFC integration (John/Mark/Hiro)

  • Follow-up
  • In discussions with Kaushik, it was suggested to provision the final hardware and do performance testing against it.
  • LFC hardware - 3 weeks from now. To be followed by performance testing.
  • Migration process - under discussion. 20M entries.
  • There will be some panda mover development needed.
  • Utilities? e.g. cleanse.py.
  • Alexei makes the point that we need to use standard tools.
  • Follow-up in two weeks.

Throughput initiative - status (Shawn)

  • Report from Monday's meeting - see email this week.
  • Meetings continue to check on progress with reconfigurations.
  • A number of sites will become ready for new tests. Next Monday will get updates from each site.
  • Updates at BU, OU, MWT2, SLAC...
  • Need to notify Jay/Hiro when ready

Panda release installation issues (Xin)

  • New format from Alessandro in the information system for attributes about architecture and compiler. Xin needs to update the script for this.
  • Needs to inform Tadashi of these updates.
  • Progress on using Pacman pacball to do installation. Saul, Fred, John B... are discussing with Alessandro and Xin having identical installs in the US and Europe: archived, MD5-checksummed releases, distributed like data. Installation is now simplified.
  • Installation pilots will be run by the autopilot using the "software" role. John will help.
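
For the archived, MD5-checksummed releases mentioned above, verifying a downloaded archive against its published checksum might look like the sketch below. The function names, and the assumption that the expected checksum arrives as a hex string, are illustrative only; the actual Pacman pacball tooling was not described in the meeting.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path: str, expected_md5: str) -> bool:
    """Compare a file's MD5 with the expected hex string (case-insensitive)."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

Reading in chunks keeps memory use flat for large release archives; a site could run such a check before unpacking an archive that was distributed like data.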

Nagios Alerts - Focus review (Dantong)

  • Will set up a meeting.

RSV, Nagios, SAM (WLCG) site availability monitoring program (Tomasz)

Tier3 issues

  • None.
  • We will define Tier3 tasks in a more project-like fashion.

Site news and issues (all sites)

  • Review SiteCertificationP4 table
  • T1: Installation of an OSG GIP caused a problem.
  • AGLT2: a problem developed within the last hour. 1000 jobs reached!
  • NET2: new gatekeeper coming online. 16 core local logging machine. 16 new blades. 84 TB raw GPFS storage. No operational problems.
  • MWT2: running close to capacity - had a problem with pnfs server, transient problem.
  • SWT2 (UTA): no issues.
  • SWT2 (OU): no issues. Part of the 10G equipment has been delivered. Still need Myricom and Fibre Channel cards, and Ibrix and DDN firmware.
  • WT2: some issues - cleanse.py problems, the extra "/" and/or port number. 4 thumpers to be online by May 12. Do we know the data volume for FDR-2? Will ask at ADC operations meeting tomorrow.

RT Queues and pending issues (Tomasz)

Carryover action items

  • Procurements
    • We need to come up with a good plan for the split between storage and CPU. There is some flexibility.
  • Accounting: US ATLAS Facility view (Rob) - status: John Gordon follow-up with APEL developers; expect something in about a month.
    • News from John Gordon:

New Action Items

  • See items in carry-overs and new in bold above.


  • There is a request from OSG to provide a bestman-xrootd as part of the ITB. This could be at either a Tier2 or a Tier3.

-- RobertGardner - 01 Apr 2008
