
MinutesAug27

Introduction

Minutes of the Facilities Integration Program meeting, Aug 27, 2008
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • Phone: (605) 475-6000, Access code: 735188; Dial *6 to mute/un-mute.

Attending

  • Meeting attendees: Michael, Kaushik, Armen, Charles, Rob, Horst, Karthik, Bob, Saul, Fred, Mark, Rich, Nurcan, Wei, Jim, Wen, Hiro, Xin
  • Apologies: none
  • Guests: none

Integration program update (Rob, Michael)

  • See IntegrationPhase6
  • Upcoming meetings
    • Analysis tutorial (CERN) - last week of August
    • 3-site Jamboree: 9-12 September
    • Next Tier2/Tier3 meeting - revised dates: 22-23 September at BNL; webpage coming soon. Tickets can be booked now.
      • Meeting website to be set up - requested.

3-site Jamboree

  • MC datasets mostly set at Tier 2s. Hong has posted a message in the FDR hypernews giving a list of datasets.
  • Question about high-luminosity FDR2c.
  • Containers of multiple copies for high-statistics tasks.
  • Akira has set up templates.

Next procurements

  • Dell, IBM, Sun pricing news
  • Horst needs Dell contact info
  • AGLT2 - waiting on final pricing on configurations, and then will push that through purchasing; will deploy one Thumper for testing. At MSU will buy from contract.
  • SWT2 - will go through local Dell rep for University immediately, then compare with Andy's plan. Focusing on Dell only. Setting up a separate cluster with new purchase (concern about concentrating everything on a single cluster), but will add storage to CPB.
  • MWT2 - awaiting final pricing from Dell and Sun. Considering x4540 + J4400 expansion.
  • NET2 - negotiating with IBM, and combining w/ another larger order at BU. Expect resolution in a couple of days. Interested in iDataPlex versus blades. Storage: DS3200 or the Dell equivalent, and DS4200 (fiber channel) - for GPFS expansion.
  • WT2 - already spent ATLAS money, so no new purchases. Several servers with x4500 and ZFS. Michael: a lot of automated tools have been developed to make the Thumper appear as an appliance. Four 8 TB filesystems known to dCache; 36 TB usable out of 48 TB raw.

Internet2 monitoring hosts

  • Follow-up:
  • Software available end of month, install as servers arrive.
    • Rich: code freeze by end of the week. Delivery date by Sep 4. On schedule.
  • Fully deployed infrastructure at all sites by Sep 30

Revised WLCG pledges

  • Need the planned pledge amounts. Rob to send info to Michael and Jim

Operations overview: Production (Mark)

  • Not much running in the US in the past week, except for two episodes at the 4-6K job level.
  • Getting some new jobs. Report to Xavi's shift meeting on Tuesday.
  • Job eviction - changes in the job scheduling algorithm helped. Jobs had started, but were evicted because the autopilot scheduler thought they had been sitting idle for 3 hours; Condor was providing stale information. Jamie from the Condor team is aware, though there has been no follow-up since last week.
  • Kaushik in contact with Borut this week about keeping the queues full. They are "almost ready". Lots of jobs failed in validation and scout jobs sent today. Hoping for a scale of 100M events.
  • PRODDISK - integration still not complete. Expect to have this within the next week. Decoupled from going into full-scale production. Bob reports that jobs complete but remain in the transferring state and are not getting copied back to BNL. Debugging underway. Paul is submitting test jobs. Kaushik wants to see ~ a couple thousand.
  • Autopilot adjuster written last week by Charles to dynamically adjust the queue depth. Reduces gatekeeper load when few jobs are queued.
    • Marco believes the missing update on the Condor status is caused by gatekeeper load, so these may be related.
  • Follow-up issue: Adler32 checksum (see the sketch below).
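  • A minimal sketch of an Adler32 file checksum using Python's standard zlib module (the file name and chunk size are illustrative only; this is not necessarily how the site movers compute it):

    import zlib

    def adler32_of_file(path, chunk_size=1024 * 1024):
        """Compute the Adler32 checksum of a file, reading it in chunks."""
        value = 1  # Adler32 starting value
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                value = zlib.adler32(chunk, value)
        return "%08x" % (value & 0xFFFFFFFF)

    # Example with a hypothetical file name:
    # print(adler32_of_file("AOD.pool.root"))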

Shift report

Analysis queues, FDR analysis (Nurcan)

  • Test jobs with Release 14.2.20 started. Sites successful, ANALY_AGLT2 still offline. TAG selection and ARA jobs have not been sent with this release yet. See the status at PathenaAnalysisQueuesUScloud.
  • I presented a pathena user support overview yesterday at the ATLAS Analysis Workshop; the talk can be found here.
  • Combined user support for both pathena and Ganga users is planned. See the near-term plans in the summary talk, slide 5.
  • Routine testing of pathena analysis queues: We plan to use Ganga Robot as part of this routine testing:
    • Ganga Robot is already submitting to pathena queues in OSG. See the user datasets with the name user.JohannesElmsheuser.ganga.pandatest.* at the sites.
    • However, it uses all SE names (or space tokens) that the developer finds in the ToA file; see, for instance, which sites the Robot is currently submitting to via this link. We need to match these SE/space-token names to the actual queues.
    • The Ganga Robot sends a job to a pathena site every 24 hours from the PhysicsAnalysis/!AnalysisCommon/!UserAnalysis package in the release. It produces a small output, on the order of kilobytes.
    • Adding more releases and different types of jobs to the Ganga Robot is under discussion with the Ganga team.
    • Monitoring of Ganga Robot jobs at LCG sites is done by SAM. pathena jobs can be monitored by the Panda monitor; whether they can also be monitored by SAM is under discussion.
    • One concern: these are Ganga jobs, so the setup may be different.
    • Production shift team members or analysis shifters will make contact with sites, but site administrators need a point of contact to report observed problems. An analysis support shift will be set up soon, under the ADC shift team.
  • User dataset deletion
    • Charles reminds sites that they need to modify the LRC Python code. The client script is delete_user_datast; the version that works is v1.4.

RSV and WLCG SAM (Fred)

  • See https://twiki.grid.iu.edu/twiki/bin/view/Operations/RsvSAMGridView for links to SAM and Gridview reporting consoles.
  • For scheduling downtimes, the OIM system: https://oim.grid.iu.edu/
  • Everything is basically working - there was an issue, reported by Dantong, with BNL's results displaying correctly at CERN.
  • Reports for the last few days looked good.
  • At UTA, a separate resource was needed in OIM; it is now working and the problem is solved. This morning's report was lower than 100%, probably due to a network glitch.
  • Set of recommendations from Horst.
  • Michael has sent around availability report published today.

Operations: DDM (Hiro)

  • Datasets sent for the Jamboree - all done. However, these transfers seemed to interfere with CCRC transfers.
  • UTA DQ2 not making a call-back. Investigating w/ Mark. Restarting site-services had no effect. Not reporting on the dashboard.
  • Michael: hanging site services - Miguel looking into the issue, perhaps a memory leak. Anticipate restarting the service frequently and automatically. This is being done manually at CERN on a frequent basis. Is this happening on US sites? Is this load related? No reports on this call.

Full SURL

  • Would like to be able to change the port and protocol later, so the catalog registration should use the short form, without port and protocol (see the sketch at the end of this list).
  • There are constraints from the client.
  • We need to converge.
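  • As a sketch of the distinction (illustrative only; the endpoint, path, and SFN convention below are assumptions, not the agreed format), registering only the path portion of a full SURL would let the port or protocol change later without invalidating catalog entries:

    from urllib.parse import urlparse, parse_qs

    def short_form(surl):
        """Strip protocol, host, and port from a full SURL, keeping only the file path.

        Assumes a hypothetical form srm://host:port/srm/managerv2?SFN=/path;
        falls back to the URL path if no SFN query parameter is present.
        """
        parsed = urlparse(surl)
        sfn = parse_qs(parsed.query).get("SFN")
        return sfn[0] if sfn else parsed.path

    # Example with a hypothetical endpoint and path:
    full = "srm://se.example.edu:8443/srm/managerv2?SFN=/pnfs/example.edu/data/file1"
    print(short_form(full))   # -> /pnfs/example.edu/data/file1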

LFC migration

Site news and issues (all sites)

  • T1: operationally smooth - though not under stress. A bit of cosmic data came in over the weekend.
  • AGLT2: reconfigured dCache token areas; not understanding why new autopilots (using PRODDISK) are not working correctly.
  • NET2: no problems at BU. Still working on the Harvard site: OSG installed, Panda pilots running, and firewall problems being worked through.
  • MWT2: autopilot adjuster implemented. PRODDISK setup at both sites.
  • SWT2 (UTA): CPB working fine; next week SWT2_UTA will be reconfigured.
  • SWT2 (OU): working with Dell on a purchase.
  • WT2: setting up access to the remote conditions database at BNL. Have set up a proxy, which works, but access takes much longer: 20 minutes (versus 100 seconds). No explanation known. Network latency? Don't think it's the proxy machine. The alternative would be to have data streamed from CERN for the conditions database.

Carryover issues (any updates?)

Pilot upgrade for space tokens (Kaushik (Paul))

  • Update from Shawn: lcg-cp working; still need to work on registration. Carry-over.
  • There are user analysis jobs too; the issue is authorization related.

Release installation via Pacballs + DMM (Xin)

  • Xin: testing the script from Alessandro - a couple of modifications are needed for OSG.
  • Fred is following up with Stan and Alexei about some of the DQ2 issues; hopefully a report next week. Fred is having difficulty registering pacballs properly in the CERN LFC.

Throughput initiative - status (Shawn)

SRM v2 and Space Tokens (Kaushik)

  • Follow-up issue of atlas versus usatlas role.
  • The issue for dCache space token controlled spaces supporting multiple roles is still open.
  • For the moment, the production role in use for US production remains usatlas, but this may change to get around this problem.

User LRC deletion (Charles)

  • Data deletion tool - completed. Marco: testers needed. Updates at sites are needed.

WLCG accounting

OSG 1.0

  • Following the development of Globus gatekeeper errors 17 and 43 at some sites running OSG 1.0.

Tier3

  • A separate subcommittee has been formed to redefine the whitepaper (Oct 1). Placeholder to follow developments.

AOB

  • Wen reports the AFS server at UW has been fixed; the site can now accept jobs.


-- RobertGardner - 26 Aug 2008
