Minutes of the Facilities Integration Program meeting, April 16, 2008
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • Phone: (605) 475-6000, Access code: 735188; Dial *6 to mute/un-mute.


  • Meeting attendees: Rob, Charles, Justin, Shawn, Sarah, Michael, Fred, Saul, Karthik, Bob, Nurcan, Kaushik, Wensheng, John Hover
  • Apologies: Horst

Integration program update (Rob, Michael)

Next procurements

  • Standing agenda item; see CapacitySummary. NEW: updated with status as of 4/1/08.

Analysis Queue Update (Nurcan)

  • Continuing issue of deleting datasets with pathena (for user analysis datasets in SEs). No news from the DQ group.
  • Hiro/Charles - there is a script in the works, not yet released; hopefully available by the end of the week. Hiro needs to write the final instructions for the required LRC update.
    • Next week there will be an LRC update page and the first release of the user tool.
  • When we have service downtimes at sites, we need to let pathena users know. Will add two more columns in eLog to make them clearer for users.
  • Form a unit within the shift team to support user analysis jobs.
  • What about running digitization jobs on the grid as regular users? Note - this will require large numbers of input files. Do we have a policy, or do we need a quota request? Kaushik: this is becoming more and more of an issue. We have had many requests for reprocessing using pathena rather than Panda proper. There are issues w/ storage requirements, files on disk vs. tape, etc. for RDO files.
    • Note this was not in the computing model. Need to discuss in RAC.
  • A number of users in the system are doing analysis and require AODs from release 13: "We need them all."

Operations: Production (Kaushik)

  • Production summary
    • US is doing well - there are some lingering issues with SRM v2.2 (AGLT2 and SLAC).
    • Lots of FDR2 jobs to do.
    • Release 14 is out - Xin is installing it - but it only runs on SLC4. Will be differentiating SLC4 vs SLC3 in Panda's info system.
    • From last week: the biggest issue is a mysterious autopilot problem with Condor-G submission to AGLT2, also seen at SLAC and UTA. The submit host's state does not match the state of the queues at the site. The Condor team at Wisconsin is engaged and has identified two bug fixes to keep Condor from losing state information. John Hover has bug fixes that solve the problem manually. Condor will provide a new release - estimate? (Need to hear back from Jamie Frey.) Michael believes we're close to receiving the fixes, and John says these will go into 7.0.2. Excellent response from the Condor team.
    • Update on the planning stages of working out mixing jobs at BNL. Michael: decided to have the mixing jobs not read from dCache directly (each job reads 50 input files, and there are roughly 150 jobs); decided instead to copy to local disk, about 95 GB per job (see the arithmetic sketch after this list). Will need to reconfigure the farm. No input files can be shared between jobs, unfortunately. A queue has been set up. Illyana is setting up a testbed for this. Not sure when this will start (est. end of April, beginning of May). Kaushik: higher priority are the digitization jobs for FDR-2.
    • Follow-up on DBRelease inconsistencies and consistency-check features. Still not sure what the convention for PFNs is: compact notation, TURLs, gsiftp prefixes, etc.
    • Note: the last two DB releases have not been sent to the US sites. Action item for Alexei.
  • Production shift report
    • Mark: a couple of site issues - an upgrade of dCache at BNL. One of the EU sites had a large backlog of transferring jobs. The AGLT
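
A rough check of the local-disk requirement implied by the mixing-job plan above; the 50-files-per-job, ~150-job, and 95 GB-per-job figures come from the minutes, and the derived values are only implied averages:

    # Rough arithmetic behind the mixing-job plan described above.
    # Only the 50-files/job, ~150-job, and 95 GB/job figures come from the minutes;
    # everything derived here is an implied average, not a measured value.
    files_per_job = 50
    jobs = 150
    gb_per_job = 95.0

    gb_per_file = gb_per_job / files_per_job   # ~1.9 GB per input file
    total_gb = gb_per_job * jobs               # ~14 TB of scratch if all jobs run at once

    print("avg input file: %.1f GB; scratch if all %d jobs run concurrently: %.1f TB"
          % (gb_per_file, jobs, total_gb / 1024.0))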

Operations: DDM

  • Follow-up:
    • ATLAS functional tests and throughput latency
      • The US received a lot more data than other regions, at both the T1 and the T2s. Each US T2 received 275 datasets. Replication within the cloud went well, with low latency.
    • http://www.usatlas.bnl.gov/dq2/monitor
    • Stable now. Most of the datasets subscribed to BNLDISK are from release 12.
    • Quite nice!
    • Registering datasets to sites is not automatic; there are plans to add an automated process to do this for all sites.
    • Also, SRM monitoring.
    • Q: which DQ2 site to use for AOD replication at UTA (SWT2_CPB)?

DQ2 0.6.5 upgrade status/plan (Hiro)

  • Follow-up:
    • 0.6.6 will come out this week - Miguel claims it is stable.
    • 1.0 will come out in three weeks.
    • Question is whether we need to upgrade or not.

SRM v2.2 functionality for storage elements (ATLAS April 2 milestone)

Sites are required to provide ATLASDATADISK and ATLASMCDISK space tokens (ATLASUSERDISK is optional). April 25 is the (new) deadline. This has entered an emergency state.
  • AGLT2 - has been migrating a standalone gridftp server into SRM/dCache. A copy was made and moved over; the LRC was changed and the central DB updated with the new site name. The copy, LRC update, DQ2 site services, and Panda configuration data have taken more time than expected. Decided to keep production data going to AGLT2_SRM while AGLT2_MCDISK comes into production. Wensheng is testing and things are working; Shawn thinks both are working correctly now. The focus now is getting production back.
    • Clarification: ATLASDATADISK is for all AOD replication, DPDs, etc.
    • ATLASMCDISK is for production.
    • ATLASUSERDISK is for users.
    • Better to re-subscribe and migrate data for ATLASDATADISK. (A minimal token-routing sketch follows this list.)
    • Is there a PandaMover issue for space tokens? dccp ignores space tokens
    • Need a dedicated meeting to sort out these issues; one is scheduled for this Friday at 10am Central.
  • MWT2 - working on SRM functional tests at IU and UC.
  • WT2 - put several entries into ToA. Current production is going to the old site, which does not support space tokens. Questions: how to convert the old data into the new space-token area, how to change the LRC entries, and how Panda jobs will write to this area. Wei has tested with glite-url-copy, which works. No FTS or DDM system tests yet. We do have a problem with PandaMover - it doesn't understand the full SURL format. Discussions w/ Tadashi - PandaMover is using the gridftp door, not SRM, for now.
  • NET2 - new gatekeeper problems with the 10G NIC - waiting for IBM to respond. Have Bestman-xrootd installed on the gatekeeper. OSG upgrade on the new gatekeeper.
  • SWT2 - several issues - the old DPCC cluster is being upgraded, for limited production or Tier3 use. UTA_SWT2 is running steadily. UTA_CPB is down for electrical work and repositioning of the racks. For the SE, Bestman-xrootd a la SLAC. Some questions came up; in communication with Wei. No problem meeting the May deadline.
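
A minimal sketch of the routing these space tokens imply; the token names come from the discussion above, while the path layout and helper function are hypothetical (sites typically configure this in their SRM, e.g. dCache link groups or Bestman token definitions):

    # Sketch of routing outputs to space-token areas as described above.
    # Only the token names come from the minutes; paths and the helper are hypothetical.
    SPACE_TOKENS = {
        "data": "ATLASDATADISK",   # AOD replication, DPDs, etc.
        "mc":   "ATLASMCDISK",     # production output
        "user": "ATLASUSERDISK",   # user analysis (optional token)
    }

    # Hypothetical per-token base paths at a site.
    TOKEN_PATHS = {
        "ATLASDATADISK": "/pnfs/example.edu/atlasdatadisk",
        "ATLASMCDISK":   "/pnfs/example.edu/atlasmcdisk",
        "ATLASUSERDISK": "/pnfs/example.edu/atlasuserdisk",
    }

    def destination_for(activity, lfn):
        """Return (space_token, destination_path) for a file of a given activity."""
        token = SPACE_TOKENS[activity]
        return token, "%s/%s" % (TOKEN_PATHS[token], lfn)

    print(destination_for("mc", "EVNT.012345._00001.pool.root"))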

RSV --> SAM (Fred)

  • Please see this link: http://www.usatlas.bnl.gov/twiki/bin/view/Admins/MonitoringServices
  • There was an assumption that the gridftp door is on the gatekeeper.
  • Sarah Williams is working w/ the RSV team to provide a hack to the RSV gridftp probe (so a door separate from the gatekeeper can be tested). Instructions for the modified probe are at the link above. (A sketch of the probe's basic check follows this list.)
  • Fred notes that help is available to get the CEs reporting.
  • May 1 is the deadline for the CE availability.
  • The RSV probes for SE need to come very quickly after that.
  • Fred will send a status of sites to the list.
  • Michael notes there is a release available - Fred and Sarah will investigate.
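
A minimal sketch, in Python, of the kind of check the modified probe performs - copying a small test file to a gridftp door that is not on the gatekeeper. The door host, remote path, and output strings are placeholders, a valid grid proxy is assumed, and the real probe follows RSV's own conventions:

    #!/usr/bin/env python
    # Sketch of a gridftp check against a door separate from the gatekeeper,
    # as in the modified RSV probe discussed above. DOOR and REMOTE_DIR are
    # placeholders; a valid grid proxy is assumed.
    import os, subprocess, sys, tempfile

    DOOR = "gsiftp://door.example.edu:2811"          # hypothetical gridftp door
    REMOTE_DIR = "/pnfs/example.edu/data/rsv-test"   # hypothetical writable path

    def check_gridftp():
        local = tempfile.NamedTemporaryFile(delete=False)
        local.write(b"rsv gridftp probe test\n")
        local.close()
        remote = "%s%s/%s" % (DOOR, REMOTE_DIR, os.path.basename(local.name))
        try:
            rc = subprocess.call(["globus-url-copy", "file://" + local.name, remote])
        finally:
            os.unlink(local.name)
        return rc == 0

    if __name__ == "__main__":
        ok = check_gridftp()
        print("OK: gridftp door reachable" if ok else "CRITICAL: gridftp copy failed")
        sys.exit(0 if ok else 2)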

Throughput initiative - status (Shawn)

  • Report from Monday's meeting, see LoadTestsP5
  • A number of sites are making changes to edge servers; also waiting for new gridftp doors at BNL (for aggregate testing, the current limit is 700 MB/s).
  • Each week we are asking sites to report changes and whether they are ready for a new round of tests. (A simple transfer-timing sketch follows this list.)
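
A minimal sketch of how a single timed transfer for these load tests might be scripted; the source file, destination URL, stream count, and file size are hypothetical:

    # Sketch of timing one gridftp transfer and reporting MB/s for the load tests.
    # SRC, DST, and SIZE_MB are placeholders; a valid grid proxy is assumed.
    import subprocess, time

    SRC = "file:///data/testfiles/1GB.dat"
    DST = "gsiftp://dcdoor.example.gov:2811/pnfs/example.gov/throughput/1GB.dat"
    SIZE_MB = 1024.0

    start = time.time()
    rc = subprocess.call(["globus-url-copy", "-p", "8", SRC, DST])  # 8 parallel streams
    elapsed = time.time() - start
    if rc == 0:
        print("transfer rate: %.1f MB/s" % (SIZE_MB / elapsed))
    else:
        print("transfer failed (rc=%d)" % rc)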

Nagios monitoring subcommittee (Dantong)

  • Meeting yesterday to discuss a review of the Nagios probes for each Tier2 - these are Panda-related monitoring probes, e.g., for disk space.
  • A probe for successful pilots. Useful for the production team and for site admins.
  • Wish list: the SE's SRM should be testable; how can we merge with RSV? Either wrap the RSV probe or fetch its status.
  • Details of the triggering for each alarm, and a proposal for setting up watermarks, e.g., warning messages for available disk at each site. Initial proposal in hand; input to be collected from the sites. (A threshold check is sketched after this list.)
  • Nagios can provide customized time-outs for each probe. How to handle Tier3-related alerts, if they are associated with a Tier2.
  • Monitoring page - after the split, site admins will have more access to manage probes.
  • Another meeting early next week: Monday, 2pm EDT; IU, BU, UTA.
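
A minimal sketch of a Nagios-style disk-space check with warning/critical watermarks along the lines proposed above; the mount point and thresholds are placeholders, and the exit codes follow the standard Nagios convention (0 OK, 1 WARNING, 2 CRITICAL):

    #!/usr/bin/env python
    # Nagios-style disk-space watermark check; PATH and the thresholds are placeholders.
    import os, sys

    PATH = "/pnfs"           # hypothetical storage mount to watch
    WARN_FREE_PCT = 20.0     # warn below 20% free (placeholder watermark)
    CRIT_FREE_PCT = 10.0     # critical below 10% free (placeholder watermark)

    st = os.statvfs(PATH)
    free_pct = 100.0 * st.f_bavail / st.f_blocks
    msg = "%.1f%% free on %s" % (free_pct, PATH)

    if free_pct < CRIT_FREE_PCT:
        print("CRITICAL: " + msg); sys.exit(2)
    elif free_pct < WARN_FREE_PCT:
        print("WARNING: " + msg); sys.exit(1)
    else:
        print("OK: " + msg); sys.exit(0)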

Panda release installation issues (Xin)

  • The script is updated to use the new format from Alessandro to publish 14.0.0 on all sites. Tadashi advised.
  • Pacball-based installation - there have been discussions between Saul, Torre, Tadashi, Alessandro, Xin, and Stan. The advantage is a cryptographically controlled release, useful for validation. To be integrated with DQ2.
  • No explicit schedule - but Alessandro is quick with these things.
  • Xin is waiting on word from Stan, then will pass it to Tadashi to make a job transformation in the new system.
  • Will keep the current system running to meet existing needs.

Site news and issues (all sites)

  • T1: There were lots of Nagios alerts with regard to the Panda server (handled in email). John Hover was in contact with the Condor developers; we expect to receive a fix for the Condor-G submit host. Getting prepared for delivery of resource additions - an order for a processor supplement of 3M SI2K(?). 40 new servers, augmenting and replacing old servers. Also coming are 31 Thumpers (48 x 1 TB disks each), 1.5 PB raw capacity; deduct 20% for usable capacity, then dCache overhead, leaving about 1 PB (arithmetic sketched after this list). Local area backbone updates - switches. A Force10 switch will connect the storage farm at 10G. Will get away from channel bonding this way.
  • AGLT2: covered above. Lots of assigned jobs but not enough activated; this is being investigated.
  • NET2: no special updates - same new equipment, still stuck on the 10G problems.
  • MWT2: at the tail end of testing and validation of the SRM doors. Doing inter-site testing between IU and UC.
  • SWT2 (UTA): the biggest priority is getting UTA_CPB up, with the new hardware in the new server room: 500 CPUs and 240 TB of disk, plus the existing cluster.
  • SWT2 (OU): running smoothly; there are some pilot issues on OSCER - consulting Paul. OSG-ITB upgraded to 0.9; working on Bestman-xrootd.
  • WT2: working on installation of Thumpers (48 x 1 TB, 4 x 1G channels) - the first delivery is this week, with three others before May 12. For SRM - as above - we feel we are done. Power outage April 26-27; expect things to be messy coming back.
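
A quick check of the T1 storage numbers quoted above; the Thumper count, disk sizes, and the 20% deduction come from the minutes, while the dCache overhead factor is only an assumed placeholder used to show how ~1.5 PB raw ends up near 1 PB usable:

    # Worked check of the T1 storage arithmetic quoted above.
    thumpers = 31
    disks_per_thumper = 48
    tb_per_disk = 1.0

    raw_tb = thumpers * disks_per_thumper * tb_per_disk   # ~1488 TB ~ 1.5 PB raw
    after_deduction_tb = raw_tb * 0.80                     # minus 20%, as stated in the minutes
    dcache_overhead = 0.85                                  # assumed factor, not from the minutes
    usable_tb = after_deduction_tb * dcache_overhead        # ~1000 TB ~ 1 PB

    print("raw: %.0f TB, after 20%% deduction: %.0f TB, usable: ~%.0f TB"
          % (raw_tb, after_deduction_tb, usable_tb))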

RT Queues and pending issues (Tomasz)

Carryover action items

  • Procurements
    • We need to come up with a good plan for the split between storage and CPU. There is some flexibility.
  • Accounting: US ATLAS Facility view (Rob) - status: John Gordon follow-up with APEL developers; expect something in about a month.
    • Still no news from John Gordon or EGEE (I've given up)

New Action Items

  • See the carryover items above and any new items in bold.


  • None.

-- RobertGardner - 14 Apr 2008
