r4 - 20 Aug 2008 - 08:46:25 - RobertGardner



Minutes of the Facilities Integration Program meeting, April 23, 2008
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • Phone: (605) 475-6000, Access code: 735188; Dial *6 to mute/un-mute.


  • Meeting attendees: Fred, Rob, Sarah, Saul, Charles, John@BU, Patrick, Nurcan, Karthik, Bob, Michael, Wei, Horst, Kaushik, Dantong, Torre
  • Apologies: none.

Integration program update (Rob, Michael)

  • Phase 4 now complete, see: SummaryReportP4 - Final comments welcome!
  • IntegrationProgram for Phase 5 (April 1 - June 30, 2008: FY08Q3)
    • Need to review milestones for next week!
    • AtlasReleases - Pacballs - naming scheme decided for normal datasets (Stan/Glasgow), for distribution with dq2, accepted. John B and Saul also working on this. Basic machinery in place. New release 3.25 of Pacman. Alessandro will emit a stream of pacballs that define the installation. Mechanism is up to Alessandro and Xin. Current system will remain; there will just be a new subscription. MS okay.
    • DQ2SiteServices - DQ2 1.0 migration to be complete this Friday. Will shut down central services tomorrow morning, down for 24 hours. Panda server will keep running, though clouds will be offline. What to do about site services - Miguel thinks nothing is needed. (For US - will keep them running.) Let's revisit the site services upgrade at a later date. (At some point, we may want to move to central services operated at the Tier1.) We have not had this discussion for a while; Michael would like to have it. LFC: work ongoing with Hiro. LFC developer help writing schema. Revisit results of the evaluation on May 15, then we'll decide how to proceed.
    • OSGservices - May 15 validation. Release 14 installed. Run jobs at three sites. Shift can send these jobs on the ITB. Need to review sched config entries. June 15, but with rolling upgrade for OSG 1.0 within a two week period.
    • StorageServices - see the page.
    • MonitoringServices - RSV probes for SEs - needs to be done in May, we have a firm commitment. Cannot let slip. Our understanding is that the development is finished.
    • FileCatalog
    • LoadTestsP5
  • Overarching near term goals for Phase 5:
    • Full and effective participation in FDR-2 exercises
    • Complete the benchmarks of 200 MB/s sustained disk-to-disk throughput to all Tier2s
    • SRM v2.2 functionality for all ATLAS sites
    • SAM availability reporting to WLCG (May)
  • Upcoming meetings:
  • Michael - serious milestones w/ WLCG
    • all sites: SRM v2.2 capable SE
    • SAM monitoring - these probes must be put in place, and the results properly propagated to WLCG.
    • This must be done within May.
    • CRRB meeting yesterday - CERN+WLCG+our funding agencies review CPU delivered to experiments. Our funding agencies are upset that some Tier2 sites did not deliver as expected.
  • From the workshop
    • there will be an aggressive program for CCRC and FDR. Expect more details in tomorrow's Jamboree.
  • CMS jobs - after FDR-2.
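
For scale, the 200 MB/s sustained disk-to-disk benchmark listed among the Phase 5 goals above can be converted into a link rate and a daily data volume. This is a minimal sketch, assuming decimal units (1 MB = 10^6 bytes) and a full 24-hour sustained run; the function name is illustrative and not part of any ATLAS or OSG tool.

```python
def throughput_summary(rate_mb_per_s, hours=24.0):
    """Convert a sustained rate in MB/s into Gb/s and total TB moved.

    Assumes decimal units: 1 MB = 1e6 bytes, 1 TB = 1e12 bytes.
    """
    gbit_per_s = rate_mb_per_s * 8 / 1000.0        # MB/s -> Gb/s
    total_tb = rate_mb_per_s * 3600 * hours / 1e6  # MB over the run -> TB
    return {"gbit_per_s": gbit_per_s, "total_tb": total_tb}


# The Phase 5 benchmark target from these minutes.
summary = throughput_summary(200)
print(summary)
```

At this rate a site moves roughly 17 TB per day, which needs about 1.6 Gb/s of network capacity before any protocol overhead.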

Next procurements

  • Standing agenda item, see CapacitySummary. Updated for status as of 4/1/08.

Analysis Queue Update (Nurcan)

  • Expect to see more activity soon - DPD and DPD2 production in analysis queues - once reprocessing starts. Perhaps next week?
  • Expect to start using release - Nurcan will test.
  • Release 14 started. Will it be used? It's in pretty bad shape.
  • Will need to validate analysis queues when users migrate to Rel 14 - expect another round of testing.
  • Follow-up: Hiro/Charles - there is a script in the works, not yet released, hopefully available by end of the week. Hiro - needs to write the final instructions for the LRC update that is needed.
    • Next week (follow-up) will have LRC update page and first release of the user tool.
    • There is a script now available - new capabilities to the www interface. Code for LRC is there; only piece is the instructions to install for sites.
  • Follow-up: When we have service downtimes at sites - we need to let pathena users know. Will add two more columns in eLog, to make them more clear for users. Follow-up with Mark.

Operations: Production (Kaushik)

  • Production summary
    • Started draining, but there are no new jobs for any of the clouds.
    • Migration underway, tests appear good, hope to finish within 24 hours.
    • Slowness of SRM v2.2 at Great Lakes - large backlog of transferring jobs.
    • Horst - Rel 14 does require Fortran 95, not installed by default. Will send instructions to usatlas-grid-l.
    • Follow-up on DBRelease inconsistencies and consistency check features. Still not sure what the convention is. Compact notation, turl, gsiftp in PFNs, etc. Not resolved. Kaushik not sure what the status is. Should be brought up with ADC operations.
    • Notes last two DB releases have not been sent to the US sites. Action item to Alexei. Kaushik: this will be handled by the shift team, and the Tier1's will receive this within 24 hours. (For some reason, the subscriptions to BNL did not show up within two weeks.) Alexei has proposed an operational procedure for the shift team, and they will follow-up. They will check only at the Tier 1.
  • Production shift report (Nurcan)
    • Will have some issues during site reports.
    • SWT2 upgrades, OSG related.
    • Problems at UTD - host cert expired, new one doesn't work.
    • Monday afternoon probs at BNL - stage-in slowness, resolved but not sure why.
    • MWT2_UC - Panda mover problems. Troubleshooting difficult due to log file truncation in Condor.

Operations: DDM

  • http://www.usatlas.bnl.gov/dq2/monitor
  • Kaushik notes - if there are any fixes or downtimes needed by sites - tomorrow would be a good day.
  • Wei - cleaning of storage elements - do we have agreement? Patrick: Charles' version will tend to not delete files, looking.

DQ2 0.6.5 upgrade status/plan (Hiro)

  • Follow-up:

SRM v2.2 functionality for storage elements (ATLAS April 2 milestone)

Sites are being required to provide ATLASDATADISK, ATLASMCDISK space tokens. (Optional ATLASUSERDISK). April 25 is the (new) deadline. This has entered into an emergency state.
  • Follow-up from last Friday's meeting: MinutesSRMApr18
  • AGLT2 - two problems. First, finding a large backlog because dcache pools on compute nodes are slowing things down. Second, space tokens - two set up, with a store unit and group, a pool group, and a link group. There are problems with lcg-cp working properly with this. When a user requests a temporary token, the link group manager ignores the token and sends it to the group with the most space, and files get lost. Direct this question to Gabriele. Iris will look at the configuration, and Gabriele will follow up.
  • MWT2 - troubleshooting SRM and some low-level network.
  • WT2 - cut off
  • NET2 - now have bestman and xrootd running on the gatekeeper. John: following SLAC's version of bestman-xrootd. srm commands work. Will change endpoints in ToA with srm and space tokens. Code coming from lbl.
  • SWT2 - swt2-cpb - running xrootd file system and fuse. Then installed bestman-xrootd on a test server with two gridftp doors. Functional, passing preliminary tests. SRM is working, need to update ToA machine. Looking at b-x on top of iris. Has two dedicated gridftp doors behind the srm door.
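
The space-token requirement stated at the top of this section can be expressed as a simple per-site check. This is a minimal sketch: the token names come from these minutes, while the function name and the example site contents are hypothetical placeholders and do not reflect the actual status of any site.

```python
# Token requirements as stated in these minutes.
REQUIRED_TOKENS = {"ATLASDATADISK", "ATLASMCDISK"}
OPTIONAL_TOKENS = {"ATLASUSERDISK"}


def check_space_tokens(configured):
    """Compare a site's space token descriptions against the requirement.

    Returns (missing_required, missing_optional, unexpected) as sorted lists.
    The strings are assumed to be space token descriptions, not token names.
    """
    have = {token.strip().upper() for token in configured}
    missing_required = sorted(REQUIRED_TOKENS - have)
    missing_optional = sorted(OPTIONAL_TOKENS - have)
    unexpected = sorted(have - REQUIRED_TOKENS - OPTIONAL_TOKENS)
    return missing_required, missing_optional, unexpected


# Hypothetical site with only one token configured so far.
example_site = ["ATLASDATADISK"]
result = check_space_tokens(example_site)
print(result)
```

A site is ready for the deadline when the first list in the result is empty; a missing optional token is informational only.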

RSV --> SAM (Fred)

Throughput initiative - status (Shawn)

  • There was no meeting last Monday

Nagios monitoring subcommittee (Dantong)

  • Report from Monday's meeting
  • Sarah, Bob, Tomasz, Mark. Reviewed what was needed for Nagios alerts.
  • Identified new Panda-related alarms.
  • Ping of regular Tier2 sites for network and storage.
  • Will instrument RT tickets
  • April 28 - Monday's
  • Each

Panda release installation issues (Xin)

  • Follow-up: installation of 14.0.0 on all sites. Done.
  • Follow-up: Pacball-based installation: Stan merging setup script of the pacball into the production cache.

OSG 1.0

  • ITB 0.9 deployment and validation in progress
  • Validation of Panda
  • Covered above.

Site news and issues (all sites)

  • T1: still working on reliable Panda services. Next week another dcache upgrade for CCRC - will send an announcement. There will be an emergency upgrade - relocating ATLAS release host. Will post.
  • AGLT2: trying to data transf
  • NET2: doing OSG upgrade to OSG 0.8, RSV probes, bestman-xrootd.
  • MWT2: new nodes being installed. SRM troubleshooting. 10G hardware at IU not functioning to spec. libg fortran problem addressed. Panda mover issues reported, problem fixed but not confident it won't come back.
  • SWT2 (UTA): srm working; RSV working; electrical work finished - start racking and stacking next set of compute nodes; otherwise in good shape.
  • SWT2 (OU): RSV working; still experiencing crashes on tier2-02. waiting on 10G equipment.
  • WT2: site wide power outage this weekend. Expect to come back up on Monday/Tuesday. deployed latest version of bestman-xrootd. lcg-cp working, but incorrect return code. We are actually using the space token description (not name) in ATLAS.

RT Queues and pending issues (Tomasz)

Carryover action items

  • None


  • None.

-- RobertGardner - 22 Apr 2008
