r5 - 20 Aug 2008 - 08:46:25 - RobertGardner

MinutesApr30

Introduction

Minutes of the Facilities Integration Program meeting, April 30, 2008
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • Phone: (605) 475-6000, Access code: 735188; Dial *6 to mute/un-mute.

Attending

  • Meeting attendees: Charles, Marco, Rob, Saul, Justin, Sarah, Karthik, Mark, Tom, Wei, Bob, John, Horst, Torre, Michael, Rich, Shawn
  • Apologies: Kaushik, Nurcan

Integration program update (Rob, Michael)

Next procurements

Tier 0/1/2 Jamboree Plans at WLCG workshop

  • Meeting held on April 24, http://indico.cern.ch/conferenceDisplay.py?confId=22138
  • See http://www.nikhef.nl/~bosk/documents/d2008/Jamboree-Apr24.pdf
  • Expectation that all sites have SRM v2-capable services installed. Most sites are very close, and we need to work hard to meet the milestone and participate in the exercises.
  • Discussion about using the space tokens and managing user disk space. Users will have quasi-permanent storage at resources in their region.
  • Reprocessing discussion - ccrc08 will have a slow start with data replication exercises, but will become more comprehensive.
  • Reprocessing of M5 data to happen at BNL - each week, Tuesday through Sunday.
  • Resources - Jim Shank - table of requirements by data type.
  • Data deletion tools - central tools operated by ADC team.
  • FDR-2: June 2 through 10. Bytestream mixing step at BNL (production), then transferred to CERN; then data acquisition, reconstruction at CERN, distribution to the Tier-1s, and distribution of AODs.
  • Data replication policy change at BNL - no longer will use a VO box @BNL, but will move this instance to CERN - to be managed by central ADC operations team.
  • Note: not everything has been defined - the exercise will start on Monday.

Analysis Queue Update (Nurcan)

  • Follow-up: Hiro/Charles - there is a script in the works, not yet released, hopefully available by the end of the week. Hiro needs to write the final instructions for the LRC update that is needed.
    • Next week (follow-up) will have the LRC update page and the first release of the user tool.
    • There is a script now available, adding new capabilities to the www interface. The code for the LRC is there; the only remaining piece is the installation instructions for sites.
    • See instructions to update sites in SiteCertificationP5.
    • User tool: lrc_delete_dataset_site DATASET_NAME SITE_ID
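
    The deletion tool above takes a dataset name and a site ID. A minimal sketch of how a site might batch such deletions follows; the wrapper function names, the runner hook, and the failure handling are illustrative assumptions, not part of the released tool - only the command name and argument order (DATASET_NAME SITE_ID) come from the minutes.

```python
import subprocess

# Hypothetical batch wrapper around the lrc_delete_dataset_site tool.
# Only the command name and argument order are from the minutes; the
# rest is an illustrative sketch.

def build_delete_command(dataset, site_id):
    """Build the argument vector for deleting one dataset at one site."""
    return ["lrc_delete_dataset_site", dataset, site_id]

def delete_datasets(datasets, site_id, runner=subprocess.call):
    """Run the tool for each dataset; return the datasets whose call failed."""
    failed = []
    for ds in datasets:
        if runner(build_delete_command(ds, site_id)) != 0:
            failed.append(ds)
    return failed
```

    The `runner` hook makes the loop testable without the grid tool installed; in production it defaults to `subprocess.call`.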

Operations: Production (Mark)

  • Updates to the eLog configuration - problem with logins for guests.
  • SLAC returned to production - all running well.
  • MWT2 - Marco updated the submit hosts to keep the new CPUs full.
  • libgfortran - issue sorted out and resolved thanks to Charles, Horst and others. Saul remembers that about a year ago there was a production issue caused by using the wrong version of Fortran 90. Mark thinks both issues that came up were resolved (the missing library, and the wrong-version case).
  • User analysis jobs at BNL - not enough space in /tmp areas.
  • HPSS downtime at BNL today. No big implications.
  • Mass dump of jobs at AGLT2 - under investigation. An authentication issue? Bob investigating.

Operations: DDM

DQ2 site service upgrade status/plan (Hiro)

  • Follow-up: 0.6.6 is the same as 1.0 - under test; will follow up with Miguel about whether it is stable.
  • Call-backs work better in the newer version; an update is desired.

SRM v2.2 functionality for storage elements (ATLAS April 2 milestone)

Sites are required to provide the ATLASDATADISK and ATLASMCDISK space tokens (ATLASUSERDISK is optional). April 25 is the (new) deadline. This has entered an emergency state.
  • AGLT2: follow-up on the two problems reported last week:
      • finding large backlog because dcache pools on compute nodes slowing things down.
      • Second problem, space tokens - two are set up (a storage unit and group, a pool group, and a link group). There are problems with lcg-cp working properly with this: when a user requests a temporary token, the link group manager ignores the token and sends the file to the group with the most space, and files get lost. Direct this question to Gabriele. Iris will look at the configuration, and Gabriele will follow up.
    • Space tokens are setup. Wenjing: has been working with Iris on resolving direction of
    • Separate requirement is that space token needs to be published in the GIP - required by some tools.
    • Storage element in schedconfig. dq2_cr is used by Panda pilots - will it work? Tadashi wants to use lcg-cp.
    • Is srmcp w/ Java 1.6 a problem?
    • Use glite-url-copy?
  • Another issue - compact or full format for file registration in the LRC. Need to decide.
  • Follow-up meeting this Friday, May 2, noon CDT (Chicago) - email reminder to come.
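
    As a rough illustration of the token requirement above, a site- or client-side lookup could route transfers to the right token. The category keys below are assumptions for illustration; only the token names (ATLASDATADISK, ATLASMCDISK, and the optional ATLASUSERDISK) come from the milestone.

```python
# Illustrative mapping from data category to the required space tokens.
# Category names are assumptions; token names are from the milestone.
SPACE_TOKENS = {
    "data": "ATLASDATADISK",
    "mc": "ATLASMCDISK",
    "user": "ATLASUSERDISK",   # optional at this stage of the milestone
}

def token_for(category):
    """Return the space token a transfer of this category should target."""
    try:
        return SPACE_TOKENS[category]
    except KeyError:
        raise ValueError("no space token defined for category %r" % category)
```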

RSV --> SAM (Fred)

Throughput initiative - status (Shawn)

  • Meeting on Monday, reviewed sites.
  • A few sites are still working on infrastructure changes. Sites can inform Hiro for load tests; Jay needs to be informed about infrastructure changes relevant to the tests.
  • BNL - end of May to add more doors (13-14 doors). Needed for all-out scalability tests to all Tier2s.
  • Jay - recommends looking at ML graphs for gaps in transfers.
  • Sites recommended to check these graphs before next Monday's meeting.
  • Jay will publish paths he is using for iperf.
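
    For sanity-checking load-test results such as OU's 5.2 Gbps host-to-host figure, a small parser for iperf summary lines might look like the following. It assumes the classic iperf text report format (e.g. "... 938 Mbits/sec"); the function and threshold handling are illustrative.

```python
import re

# Extract the reported rate from a classic iperf summary line, e.g.
#   [  3]  0.0-10.0 sec  6.05 GBytes  5.20 Gbits/sec
_RATE_RE = re.compile(r"([\d.]+)\s*([KMG])bits/sec")

def parse_rate_mbps(line):
    """Return the throughput in Mbit/s, or None if no rate is present."""
    m = _RATE_RE.search(line)
    if not m:
        return None
    value, unit = float(m.group(1)), m.group(2)
    return value * {"K": 1e-3, "M": 1.0, "G": 1e3}[unit]
```

    A check against a target (say, "did this path sustain at least 5 Gbps?") is then a one-line comparison on the parsed value.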

Nagios monitoring subcommittee (Dantong)

  • Each site will need to prepare a script to publish 5 numbers for space usage (watermarks, etc.). Shawn has a prototype script (based on the space.py script).
  • Another meeting on Monday.
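
    A sketch of the kind of per-site publisher discussed above. Which five numbers are published and which watermark levels apply are assumptions here - Shawn's space.py-based prototype is the actual reference; the Nagios-style 0/1/2 status codes are standard OK/WARNING/CRITICAL values.

```python
# Hypothetical space-usage publisher: five values plus a watermark status.
# The choice of numbers and the default watermarks are assumptions.

def space_report(total_gb, used_gb, warn_frac=0.85, crit_frac=0.95):
    """Return (total, used, free, %used, status) with Nagios-style status."""
    free_gb = total_gb - used_gb
    pct_used = 100.0 * used_gb / total_gb if total_gb else 0.0
    if pct_used >= 100.0 * crit_frac:
        status = 2        # CRITICAL: above the high watermark
    elif pct_used >= 100.0 * warn_frac:
        status = 1        # WARNING: above the low watermark
    else:
        status = 0        # OK
    return (total_gb, used_gb, free_gb, pct_used, status)
```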

Panda release installation issues (Xin)

  • Follow-up on the pacball-based method next week. No update.

OSG 1.0

  • ITB 0.9 deployment and validation in progress
  • Validation of Panda - need to get site info data into Panda queues database
  • RSV - dCache space probes. Wenjing at AGLT2 is working on testing them.
  • lcg-utils?

Site news and issues (all sites)

  • T1: splitting processing resources into various parts for production vs analysis, about 1K analysis jobs/day. 200 slots for trigger studies. Mixing jobs - 150 job slots for FDR-2 preparation. These jobs require opening 50 files per job, 95 GB/job - to be moved to the local worker node, so as not to overwhelm the SE. CCRC data preparation. Still awaiting FY08 disk resources - delayed to the end of May (1 PB of usable disk); the delay is because BNL requires metered racks for power. Dantong investigating a bottleneck in the 10G link to CERN.
  • AGLT2: Work continues on the storage elements. Jumbo frames created problems transferring files back to BNL. Better performance on the public interfaces. MTU discovery issue. (TSO - TCP segmentation offload.)
  • NET2: SRM 2.2 running; RSV reporting (OSG 0.8); hardware - waiting for a new networking card from IBM to bring up the new blades and gatekeeper.
  • MWT2: UC up to 996 cores. IU bringing up 320 cores today. Problems keeping queues full. Upgraded head nodes with more CPU and RAM.
  • SWT2 (UTA): working with SRM - now working at SWT2_UTA. Will get space tokens implemented today. Otherwise no major problems.
  • SWT2 (OU): 10G equipment is now in and under testing. Internal data transfer tests of the 10G NIC: 5.2 Gbps between two hosts. External data transfer tests planned. - Karthik.
  • WT2: Tried to subscribe datasets to the mcdatadisk area - went okay. Tadashi - lcg-cp with BeStMan-Xrootd is failing; will probably require a version upgrade. 5 of 7 Thumpers in place; 110 TB usable disk, plus 50 TB additional. 2 new GridFTP servers in use behind our SRM interface. Hope to get 200 MB/s throughput. Will have a meeting to adjust CPU fairshare.

RT Queues and pending issues (Tomasz)

Carryover action items

  • None

AOB

  • None.


-- RobertGardner - 28 Apr 2008
