r6 - 13 Aug 2008 - 15:18:29 - RobertGardner

MinutesAug13

Introduction

Minutes of the Facilities Integration Program meeting, Aug 13, 2008
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • Phone: (605) 475-6000, Access code: 735188; Dial *6 to mute/un-mute.

Attending

  • Meeting attendees: Rob, Charles, Michael, Tom, Saul, Bob, Alexei, Rich, John, Wen, Marco, Hiro, Xin, Patrick, Karthik, Wei, Jim, Nurcan, Kaushik, Mark, Shawn, Wensheng
  • Apologies: none
  • Guests: none

Integration program update (Rob)

  • Overarching near term goals:
    • LFC migration
    • Complete the benchmarks of 200 MB/s sustained disk-to-disk throughput to all Tier2s
    • Analysis benchmarks demonstrated at increasing scale (100/200/500/1000 simultaneous jobs) at all Tier 2 facilities.
    • Storage upgrade: provisioning of capacities according to pledges on track for September 15 2008 deployment.
    • WLCG - SAM/RSV, reliability and availability metrics for CE and SE reporting >80% for all sites.
    • Network performance monitoring infrastructure deployed.
  • Upcoming meetings
  • Availability monitoring - Michael notes results were very good - congrats to all site admins and Fred, and the OSG folks, Arvind and Brian.
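
For a sense of scale, the 200 MB/s disk-to-disk benchmark above can be converted to a daily volume. This is only a rough sketch: it assumes decimal units (1 TB = 10^6 MB) and a full 24 hours of sustained transfer, and the function name is illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def sustained_tb_per_day(rate_mb_per_s: float) -> float:
    """Convert a sustained rate in MB/s to TB/day (decimal units: 1 TB = 1e6 MB)."""
    return rate_mb_per_s * SECONDS_PER_DAY / 1_000_000


if __name__ == "__main__":
    # The 200 MB/s target from the near-term goals.
    print(f"{sustained_tb_per_day(200):.2f} TB/day")
```

At the 200 MB/s target this works out to roughly 17 TB per day per Tier2 under those assumptions.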

LFC migration

Next procurements

Internet2 monitoring host

  • Orders placed for each site?
  • Resistance to the agreed configuration from SLAC - Wei will check out why. Also a concern about Web100 kernels (or old kernels).
  • AGLT2 - ordering 2 sets of two.
  • MWT2_UC - ordered; _IU
  • NET2 - will order this week
  • SWT2 - will order this week.
  • Software available end of month.
  • Install as arrive.
  • Fully deployed at all sites - Sep 30

Follow-up issues

  • Storage capacity recommendations/guidance for the Facility (320 TB capacity, from Kaushik's model on MinutesJune11).
    • Latest spreadsheets from Roger today. Targets may have changed a bit.
  • Revised WLCG pledges - need info by July 15. Action item for Rob (not done!)
    • Numbers coming shortly

Operations overview: Production (Kaushik)

  • Borut back, tasks defined, filling up again. We'll probably drain in a few days.
    • 10 TeV jobs?
    • Release 14.2
  • What releases are available at sites - for US sites, Xin installs everything. There is a list of releases now associated with US sites, that has been used by Panda brokering. Xin investigating. How is the Panda db filled? Xin notes that in the future we will publish the installed releases, using Alessandro's script and web portal.
  • md5sum issue - file transfers from Tier2 to BNL were corrupted, with a higher incidence than in the past. A few tasks had very high error rates: more like 7-8%, compared with the usual <0.1%. Wensheng discovered the files were coming from AGLT2, which uses SRM v2.2 as the server. Is that the culprit, or is it a gridftp issue?
    • Adler32 not completely implemented in Panda - not used by default.
    • Need something that protects from these problems, apart from production. In DQ2 there is no checking. Is this too difficult to implement?
    • Alexei, Hiro, Wensheng, Rob, Michael - to develop plan. Urge Alexei to make this a standing agenda item on ADC operations, to be discussed ATLAS-wide. That would be preferred, but if not possible we need something for the US. Alexei - need working group in ATLAS.
    • Kaushik - what do we do in the short term? Need action plan for this particular problem - take AGLT2 offline for the moment. Localize problem.
  • Status of Panda integration with space tokens (PRODDISK), use of lcg-cp
    • Last week: Want to see AGLT2 exercised for a week. Follow-up this next week.
      • Paul's jobs are currently failing - Wensheng has list of problem files.
    • 20-40 TB needed for PRODDISK. Start with 20 TB. Follow-up each Tier2 next week.
    • wn-client from OSG 1.0 needed for lcg-cp.
    • Wei - reports there is a bug in lcg-cp with srm-bestman if the file does not exist. It's not yet packaged in glite. Wei requests lcg-ls be used first to check file existence.
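
The corruption discussed above (md5sum mismatches, Adler32 not yet used by default) is the kind of problem an end-to-end checksum comparison after each transfer would catch. A minimal sketch using only the Python standard library follows; the function names, and the idea of comparing against a value recorded at the source, are illustrative assumptions and not how Panda or DQ2 actually handle it.

```python
import hashlib
import zlib
from typing import Optional, Tuple

CHUNK_SIZE = 1024 * 1024  # read in 1 MiB chunks so large files are not loaded whole


def file_checksums(path: str) -> Tuple[str, str]:
    """Return (md5 hex digest, adler32 as 8 lowercase hex digits) for a file."""
    md5 = hashlib.md5()
    adler = 1  # Adler-32 starting value
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(CHUNK_SIZE)
            if not chunk:
                break
            md5.update(chunk)
            adler = zlib.adler32(chunk, adler)
    return md5.hexdigest(), f"{adler & 0xFFFFFFFF:08x}"


def transfer_is_intact(
    path: str,
    expected_md5: Optional[str] = None,
    expected_adler32: Optional[str] = None,
) -> bool:
    """Compare a local copy against checksums recorded at the source (hypothetical inputs)."""
    if expected_md5 is None and expected_adler32 is None:
        raise ValueError("need at least one expected checksum")
    md5, adler = file_checksums(path)
    if expected_md5 is not None and md5 != expected_md5.lower():
        return False
    if expected_adler32 is not None and adler != expected_adler32.lower():
        return False
    return True
```

Both checksums are computed in a single pass over the file, so adding Adler32 next to md5sum costs little extra I/O.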

Shift report

  • We've been idle.

Analysis queues, FDR analysis (Nurcan)

  • Last week:
    • A wiki page was set up to show the online/offline status of the US pathena analysis queues as well as their availability for various athena releases/packages, see the page at PathenaAnalysisQueues.
    • We have an analysis workshop at the end of August, with a user-support session in which Nurcan will present plans for the US. The plan is to provide combined support for pathena and ganga.
    • Preparing for 3-site Jamboree in September.
    • Usage has been light this month, though we expect more once users start again and update pathena on their hosts, which has brokering to other sites in the cloud.
    • Kaushik comments that the brokering seems to be working.
    • Collecting information about Panda releases to be used during workshop.
  • Testing ARA with pathena - expect this to be used for the 3-site jamboree; first attempt successful. DPDs at UWisc. Don't expect problems at other sites.
  • Sending test jobs at ANALY_MWT2. Previous job types were successful. TAG jobs now working fine.
  • Waiting to hear from the FDR-analysis users about which releases and datasets.
  • Analysis job functional testing.

RSV and WLCG SAM (Fred)

Operations: DDM (Alexei)

  • http://atladcops.cern.ch:8000/drmon/fdrmon_TiersInfo.html
  • In progress to BNL.
  • Will set up monitoring for Tier2s today, and will start subscriptions tomorrow.
  • Throughput rate tests to Tier 0 - Tier 1.
  • FT running continuously. Every week we'll run a throughput test, 100% of nominal.
  • Gridftp server upgrade at BNL - deployed and operational by the end of the month. Probably by mid-September. Machines ordered.

From Alexei:

Hello,

FDR-II reprocessed datasets are subscribed to ALL Tier-1s and CERN
dataset patterns 
  fdr08_run2.%.merge.AOD.%_r%_t%_tid%
  fdr08_run2.%.merge.TAG.%_r%_t%_tid%
  fdr08_run2.%.DPD%.%_r%_tid%
So only 'merged' AOD and TAG datasets are subscribed

The replication procedure was discussed and agreed with physics 
coordination and please refer to  
http://indico.cern.ch/getFile.py/access?subContId=1&contribId=6&resId=0&materialId=0&confId=39078
for more details.


There is no subscription within clouds for the moment. I want to have 
confirmation that shares are up-to-date 
(http://atladcops.cern.ch:8000/drmon/fdrmon_TiersInfo.html)



More details on Thursday's ADC Operations meeting (Sasha, Pavel and Rod 
talks)


some technicalities 

- TAGs are produced now 
- AOD, DPD and TAG datasets are subscribed to all Tier-1s and CERN
- datasets are subscribed to 'T1'_DATADISK
- datasets are subscribed ONLY if at least one Tier-1 is listed as 
  a complete replicas owner
- datasets are subscribed w/o '--source' and '--wait-for-sources' options

exceptions :
 dataset is not subscribed to BNL-OSG2_DATADISK if BNLPANDA is listed 
 as replicas owner
 dataset is not subscribed to NIKHEF-ELPROD_DATADISK if 'SARA' is listed 
 as replicas owner

Cheers, Alexei
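
The subscription rules in the message above can be sketched as a small filter. This is only an illustration under stated assumptions: '%' is treated as "any run of characters" (all other characters, including '_', are taken literally), the dataset names in the usage note are made up, and the set of Tier-1 site names is an input the caller must supply. It is not the actual DQ2 subscription code.

```python
import re
from typing import Set

# Patterns quoted from the message; '%' is assumed to match any run of characters.
FDR_PATTERNS = [
    "fdr08_run2.%.merge.AOD.%_r%_t%_tid%",
    "fdr08_run2.%.merge.TAG.%_r%_t%_tid%",
    "fdr08_run2.%.DPD%.%_r%_tid%",
]

# Destination -> replica owner that suppresses the subscription ("exceptions" in the message).
EXCLUDED_WHEN_OWNER = {
    "BNL-OSG2_DATADISK": "BNLPANDA",
    "NIKHEF-ELPROD_DATADISK": "SARA",
}


def pattern_to_regex(pattern: str) -> "re.Pattern[str]":
    """Turn a '%'-wildcard pattern into a compiled regex with literal parts escaped."""
    literal_parts = (re.escape(part) for part in pattern.split("%"))
    return re.compile(".*".join(literal_parts))


COMPILED_PATTERNS = [pattern_to_regex(p) for p in FDR_PATTERNS]


def matches_fdr_pattern(dataset: str) -> bool:
    """True if the whole dataset name matches any of the quoted patterns."""
    return any(rx.fullmatch(dataset) for rx in COMPILED_PATTERNS)


def should_subscribe(
    dataset: str,
    destination: str,
    complete_owners: Set[str],
    tier1_sites: Set[str],
) -> bool:
    """Apply the pattern, 'complete replica at a Tier-1', and exception rules."""
    if not matches_fdr_pattern(dataset):
        return False
    if not (complete_owners & tier1_sites):
        return False  # no Tier-1 holds a complete replica yet
    return EXCLUDED_WHEN_OWNER.get(destination) not in complete_owners
```

For example, with a made-up name such as `fdr08_run2.0001.streamX.merge.AOD.x_r1_t2_tid3`, the call `should_subscribe(name, "BNL-OSG2_DATADISK", {"BNLPANDA"}, tier1_sites)` returns False because of the BNLPANDA exception, while an unmerged AOD name is rejected by the pattern check alone.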

wlcg-client (Marco)

Testing of client tools for data access by US ATLAS physicists, and issues.

  • Report: USATLAS-SEreport.pdf
  • Suggestions to make ToA changes to include port number everywhere.
  • Charles has a script that can be used to normalize the LRC with consistent full URL.
  • Bestman requires latest glite release for clients - Marco to check with VDT.
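
The suggestion above to include the port number everywhere (and to normalize LRC entries to a consistent full URL) might look like the sketch below. It is not Charles's script: the function name is hypothetical, and the default port must be supplied by the caller because the correct value depends on the storage element. The 8443 and the host names in the usage note are placeholders.

```python
from urllib.parse import urlsplit, urlunsplit


def with_explicit_port(url: str, default_port: int) -> str:
    """Return the URL with an explicit port, adding default_port if none is present.

    Note: the host name is lower-cased and any user-info part is dropped,
    which is assumed to be acceptable for storage URLs.
    """
    parts = urlsplit(url)
    if parts.hostname is None:
        raise ValueError(f"no host in URL: {url!r}")
    if parts.port is not None:
        return url  # already carries a port; leave untouched
    netloc = f"{parts.hostname}:{default_port}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


if __name__ == "__main__":
    # Placeholder host and port, for illustration only.
    print(with_explicit_port("srm://se.example.org/pnfs/data/file1", 8443))
```

URLs that already carry a port are returned unchanged, so the function can be run repeatedly over a catalog without altering entries that are already normalized.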

Site news and issues (all sites)

  • T1: not too much to report; integration of new LAN switches.
  • AGLT2: nothing more.
  • NET2: bringing up Harvard component. Firewall issues. Will be a separate OSG gatekeeper, share DQ2. New machine for GK and DQ2 has already been deployed; interested in getting some concrete throughput figures between
  • MWT2: move of analysis queue to the production cluster. Problems with voms-proxy certificates with extended role; myproxy-init from OSG must be used. Throughput tests between UC and IU: using srmcp in push mode, 800 MB/s using direct pool transfers, with 50 files in flight at once. BNL to IU - using srmcp rather than g-u-l (200 MB/s).
  • SWT2 (UTA): working on plan for UTA_SWT2, otherwise all okay.
  • SWT2 (OU): OUHEP cluster being upgraded. Upgrade of storage on Tier 2. There is an iptables issue with the current kernel (turning it on puts the BW in the basement)
  • WT2: Working with BNL to get glexec pilot working - made some progress - needs to contact the myproxy server at BNL, setup proxy at SLAC. Looks promising. Close to deploying a local conditions database at SLAC. There is a new version of Bestman-xrootd. Need to get into VDT and supported through OSG-Storage.

Carryover issues (any updates?)

Pilot upgrade for space tokens (Kaushik (Paul))

  • Update from Shawn. lcg-cp working; still need to work on registration. Carry-over

Release installation via Pacballs (Xin)

  • Need to follow-up. Meeting this Friday.

Throughput initiative - status (Shawn)

SRM v2 and Space Tokens (Kaushik)

  • Follow-up issue of atlas versus usatlas role.
  • The issue for dCache space token controlled spaces supporting multiple roles is still open.
  • For the moment, the production role in use for US production remains usatlas, but this may change to get around this problem.

User LRC deletion (Charles)

  • Data deletion tool - completed. Testers needed. Updates at sites are needed.

WLCG accounting

OSG 1.0

  • Following development of Globus gatekeeper errors 17 and 43 at some sites in OSG 1.0

Tier3

  • There is a separate subcommittee formed to redefine the whitepaper. Placeholder to follow developments.

AOB

  • none


-- RobertGardner - 12 Aug 2008


Attachments


xls DS3200_EXP3000_Atlas.xls (95.5K) | RobertGardner, 13 Aug 2008 - 10:48 | DS3200 config
xls DS3400_EXP3000_Atlas.xls (96.0K) | RobertGardner, 13 Aug 2008 - 10:48 | DS3400 config
xls DS4200_EXP420_Atlas.xls (96.5K) | RobertGardner, 13 Aug 2008 - 10:49 | DS4200 config
xls ATLAS_DCS9550_.xls (13.5K) | RobertGardner, 13 Aug 2008 - 10:50 | DCS9550 config
pdf USATLAS-SEreport.pdf (212.1K) | MarcoMambelli, 13 Aug 2008 - 14:29 |