
MinutesMay13

Introduction

Minutes of the Facilities Integration Program meeting, May 13, 2009
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • (309) 946-5300, Access code: 735188; Dial *6 to mute/un-mute.

Attending

  • Meeting attendees: Michael, John, Saul, Rob, Charles, Fred, Wei, Horst, Patrick, Kaushik, Mark, Bob, Rich, Jim, Nurcan, Jim C, Doug, Hiro, Douglas
  • Apologies: none

Integration program update (Rob, Michael)

  • IntegrationPhase9 - FY09Q3
  • Special meetings
    • Tuesday (9:30am CDT): Facility working group on analysis queue performance: FacilityWGAP
    • Tuesday (12 noon CDT) : Data management
    • Tuesday (2pm CDT): Throughput meetings
    • Friday (1pm CDT): Frontier/Squid
  • Upcoming related meetings:
  • US ATLAS persistent chat room http://integrationcloud.campfirenow.com/ (requires account, email Rob), guest (open): http://integrationcloud.campfirenow.com/1391f
  • Planning issues
    • Discussion of the ATLAS schedule, in particular the STEP09 analysis challenge, May 25 - June 6; 300M events will be produced, plus two cosmic runs. June 25: reprocessing, cosmics, analysis.
    • SL5 migration preparations are underway at CERN. Need to schedule this within the facility. Execution of migration in July.
    • HEP-SPEC benchmark, see CapacitySummary.
  • Other remarks
    • ATLAS schedule - about 2 weeks away from STEP09
    • Need to understand performance issues in retrieving data from tape as they pertain to reconstruction performance; this is an important part of the activity, alongside analysis.
    • We're getting close to these important activities.

Operations overview: Production (Kaushik)

  • Reference:
  • last meeting(s):
    • MegaJam
      • JF17 sample, unbiased, with all SM processes turned on.
      • 88M events of evgen already produced; subscribed to the US.
      • ATLFAST2
      • Borut will start 200M evgen tomorrow.
      • These will be high-priority jobs.
      • Tier 3 request - have some fraction at every Tier 2 for testing access.
      • Tier 2s: reserve 15-20 TB for this on MCDISK.
  • this week:
    • Getting back into full production after several site upgrades
    • The US cloud is the last cloud to move to the CERN Oracle Panda server, so we expect a few days of issues related to monitoring; the archive DB is still in MySQL. The port will need to be changed manually if there are problems.
    • dCache upgrade at BNL yesterday - everything came back normally. Need to keep an eye on Pandamover.
    • Need to get IU and AGLT2 online asap.
    • Large number of production requests - including STEP09 production; expect a huge backlog.
      • 5 independent runs - 2 runs per Tier 2. (10 pb-1 per run, 100 M)
      • Same estimate as before: 15-20 TB
    • Concerns about central-cleanup operations. Expect 20-70 TB of data to be removed from DATADISK. Cleaning of obsolete and aborted datasets at MCDISK.
    • Will continue with Pandamover rather than DQ2.

STEP09 (Jim C)

  • Move of the June 11 date.
  • Would like to get dq2-get testing going, for D3PD fetching (pathena output); a minimal timing sketch follows this list.
  • The bulk of production will be ~10 days. There will be an announcement for a coordinated day of testing and pre-testing. Use the JF17 sample for testing; make a pre-test container.
  • Please try to attend the Tier 3 workshop, at least virtually.
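Since part of this exercise is simply checking that users can pull pathena output with dq2-get at a reasonable rate, a small timing wrapper along the following lines might help. This is a sketch only: the dataset names are placeholders, and any extra dq2-get options should be checked against the DQ2 client documentation.

#!/usr/bin/env python
# Rough sketch: time dq2-get retrieval of pathena output (D3PD) datasets.
# The dataset names are placeholders, not real containers.
import subprocess
import time

DATASETS = [
    "user.someone.sometask.D3PD.v1/",  # hypothetical pathena output container
]

for ds in DATASETS:
    start = time.time()
    # Plain "dq2-get <dataset>" is assumed to be available in the environment;
    # add site or file-selection options as appropriate for the test.
    rc = subprocess.call(["dq2-get", ds])
    print("%s: rc=%d, %.1f s" % (ds, rc, time.time() - start))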

Shifters report (Mark)

Analysis queues (Nurcan)

  • this meeting:
    • FacilityWGAPMinutesMay12
    • Status of test jobs:
      • The TAG selection job was submitted to SWT2 yesterday after the xrootd-related problems from last week were understood. There are still failures; they need to be looked at today via the stress-test link above.
      • I was trying to set up a job that runs on AOD/DPD and requires DB access. I contacted Sasha Vanyashin; he said such a job exists and referred me to Katharina Fiekas. I asked for instructions so it can be tested and put into HammerCloud.
    • Status of integrating job types into HammerCloud: SUSYValidation, D3PD-making, and TAG selection jobs are now integrated, see: https://twiki.cern.ch/twiki/bin/view/Atlas/StressTestJobs. I requested tests last week; due to changes in the HammerCloud submission mechanism and last week's dq2 catalog problems, the tests were submitted this week.
    • Sites, please look at the failures. Mostly input file problems: checksum failures (NET2), input files not found (AGLT2).
    • We will need to look at the metrics and compare site performances. Is any site admin interested in doing this comparison? Please help.
    • Saul will look into the checksum errors and will also help summarize failures; a minimal summary sketch follows this section.
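To help with summarizing failures by site and error type, something like the following could be a starting point. The input format (a CSV with "site" and "error" columns, one row per failed job) is an assumption, to be adapted to however the HammerCloud failure lists are actually exported.

#!/usr/bin/env python
# Rough sketch: count analysis-job failures by (site, error) pair.
# Assumes a CSV export with "site" and "error" columns, one row per job.
import collections
import csv
import sys

counts = collections.Counter()
with open(sys.argv[1]) as f:
    for row in csv.DictReader(f):
        counts[(row["site"], row["error"])] += 1

for (site, error), n in sorted(counts.items(), key=lambda kv: -kv[1]):
    print("%-12s %-40s %d" % (site, error, n))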

DDM Operations (Hiro)

Data Management & Storage Validation (Kaushik)

Throughput Initiative (Shawn)

  • Notes from meeting this week:
  
 
  • last week:
    • Notes from the call:
      Notes from Throughput Meeting
Attending: Sarah, Shawn, Horst, Rich, Neng, Jay
Apologies: Saul
  • perfSONAR: the next Knoppix release is still expected in (late) May. Jay's graphs now have the average values plotted.
  • No report on data transfer…
  • Site reports
    • BNL: No report
    • AGLT2: Working on dCache upgrades and other issues. Network has been stable.
    • MWT2: Not much on throughput. Working on dCache upgrade. Problems with storage node crashes; working to resolve them.
    • NET2: No report
    • SWT2: OK there. OU plans to involve a summer student and wants to run 10GE tests (with AGLT2 and ??); a minimal test-driver sketch follows this section.
    • WT2: No report
    • Wisconsin: One perfSONAR server up https://atlas-perfsonar2.chtc.wisc.edu/. Needs to be reconfigured as BWCTL. Other node seems to be broken and needs work. Request in place to fix it. (Sites should review the perfSONAR instructions at http://code.google.com/p/perfsonar-ps/wiki/NPToolkitQuickStart )
  • AOB. Hadoop discussion. Jay discussed perfSONAR configurations, data/gridftp tests and plans.

  • this week:
    • Sites are too busy to schedule throughput tests.
    • Next release - late June.
    • Next week the throughput timeslot will be open for other uses; no organized meeting.
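For the planned 10GE tests, a small driver that runs bwctl throughput measurements between site pairs could look like the sketch below. The host names are placeholders, and the bwctl options used (-s send host, -c receive host, -t duration in seconds) should be checked against the installed bwctl/NPToolkit documentation.

#!/usr/bin/env python
# Rough sketch: run bwctl throughput tests between perfSONAR hosts at two
# sites (e.g. OU and AGLT2). Host names below are placeholders.
import subprocess

PAIRS = [
    ("psnode.ou.example.edu", "psnode.aglt2.example.edu"),  # hypothetical hosts
]

for sender, receiver in PAIRS:
    cmd = ["bwctl", "-s", sender, "-c", receiver, "-t", "30"]
    print("Running: %s" % " ".join(cmd))
    subprocess.call(cmd)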

GGUS-GOC-RT Ticketing

  • last week
    • There was a ticket from GGUS to the GOC that didn't make it into RT and went neglected for a while (#12323). Jason is maintaining the RT system. There is a manual process that needs to be automated; a polling sketch follows this section.
    • Jason will discuss with Dantong and follow up with the GOC, keeping Fred in the loop.
  • this week
    • No report.
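Until the GGUS-to-RT handoff is automated, even a simple watchdog that flags new tickets would keep items from going neglected. The sketch below assumes RT's REST 1.0 search interface and uses a hypothetical host, queue name, and credentials; none of these details are confirmed by the minutes.

#!/usr/bin/env python
# Rough sketch: poll an RT instance for new tickets in a queue and print a
# reminder. Host, queue, and credentials are placeholders; the
# /REST/1.0/search/ticket endpoint is assumed from RT's REST 1.0 API.
import urllib.parse
import urllib.request

RT_BASE = "https://rt.example.org"            # hypothetical RT host
QUERY = "Queue = 'ggus' AND Status = 'new'"   # hypothetical queue name

params = urllib.parse.urlencode({
    "query": QUERY,
    "format": "s",      # short format: "id: subject" lines
    "user": "rt-bot",   # placeholder credentials
    "pass": "CHANGEME",
})
url = "%s/REST/1.0/search/ticket?%s" % (RT_BASE, params)
with urllib.request.urlopen(url) as resp:
    body = resp.read().decode("utf-8", "replace")

for line in body.splitlines():
    if ":" in line and line.split(":", 1)[0].strip().isdigit():
        print("Open RT ticket -> %s" % line.strip())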

Site news and issues (all sites)

  • T1:
    • last week: Lots of storage management issues over the past week due to the job profile: 120K requests in the tape queue through dCache clogged things up; had to clean up, then it ran fine over the weekend. Ordered 10 more tape drives - they have arrived and should be in production early next week, for 2x the data rate to/from tape.
    • this week: Condor-G job submission progress - new code from the Condor team to address scaling problems on the submit host. Want to stress test the new code (a timing sketch follows below); so far the submission rate decreased 1/3 from before, and the Condor developers are investigating. The Postgres PNFS database was moved to a 64-bit machine yesterday - all went well, and memory was increased to 48 GB. The BNL-AGLT2 circuit is now in place; SLAC is next.
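As a starting point for that stress test, the submission rate can be measured by timing repeated condor_submit calls of a trivial grid-universe job, as in the sketch below. The gatekeeper contact string and job count are placeholders, not BNL's actual configuration.

#!/usr/bin/env python
# Rough sketch: measure Condor-G submission rate by timing condor_submit
# of a trivial grid-universe job. The gatekeeper below is a placeholder.
import subprocess
import tempfile
import time

SUBMIT = """\
universe      = grid
grid_resource = gt2 gatekeeper.example.org/jobmanager-fork
executable    = /bin/hostname
output        = ratetest.$(Cluster).out
error         = ratetest.$(Cluster).err
log           = ratetest.log
queue
"""

N = 50  # number of submissions to time
with tempfile.NamedTemporaryFile("w", suffix=".sub", delete=False) as f:
    f.write(SUBMIT)
    subfile = f.name

start = time.time()
for _ in range(N):
    subprocess.check_call(["condor_submit", subfile])
elapsed = time.time() - start
print("%d submissions in %.1f s (%.2f jobs/s)" % (N, elapsed, N / elapsed))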

  • AGLT2:
    • last week: running well right now. Working on getting rid of dark data on MCDISK.
    • this week: Upgraded dCache and Chimera; seemed to be working over the weekend. A building power test sent an EPO signal to the storage servers, and one had its RAID mistakenly reconfigured. Also a PNFS mounting issue.

  • NET2:
    • last week(s): Saul: the inventory of corrupted files is being replaced; RSV problems are being looked into. John: in replacing data, wrote a script that uses lcg-cp; it sometimes hangs, probably because the data is on tape. Probably should delete the bad files and re-subscribe. Jobs are running at Harvard.
    • this week: HU ramped up to 500 jobs. Helping Tufts set up a Tier 3 as a production endpoint: 300 opportunistic cores; not too concerned about support issues. Still have a problem with the 10G NIC producing checksum errors on roughly 1 in 1K-10K files; a replacement Myricom NIC is on order.

  • MWT2:
    • last week(s): last week we reported the loss of a couple of pools; working on new network settings for WAN access. MCDISK cleanup. Also dCache adjustments for direct writes from the WAN to servers on public nodes.
    • this week: kernel and OS upgrades last week. Upgraded dCache to 1.9.2-5 - seeing gPlazma timeouts, but new changes are in hand. UC is back online. IU - difficulties changing the site status and getting test jobs running.

  • SWT2 (UTA):
    • last week:
      • covered above
      • FTS proxy delegation issue - happened twice. Hiro is planning a patch to FTS tomorrow.
    • this week: The new DQ2 site services interact more with SRM, causing BeStMan instability - changed the memory setting. An Ibrix issue on UTA_SWT2.

  • SWT2 (OU):
    • last week: All seems to be running fine, but are jobs slow? perfSONAR tests are going much better now - there may have been a fix on the OU network side, but not sure.
    • this week: all is well.

  • WT2:
    • last week: SRM problem persisting - possibly due to a bad client. Had to set up a firewall to block the traffic and kill the client; worked fine afterwards. Central deletion is running, but not very fast (2-10 Hz), and it doesn't run all the time. Power outage tomorrow.
      • And a new baby boy for Wei!
    • this week:
      • Upgraded SRM to the latest version. Running fine now, but there's not much load. Seeing some intermittent network driver messages (possibly a known problem with the driver and kernel).

Carryover issues (any updates?)

Squids and Frontier (John DeStefano)

  • Note: this activity involves a number of people from both Tier 1 and Tier 2 facilities. I've put John down to report regularly on updates, as he currently chairs the Friday Frontier-Squid meeting, though we expect contributions from Douglas, Shawn, John, and others at Tier 2s as developments are made. - rwg
  • last meeting(s):
    • AGLT2 - two servers setup, working, talking to a front-end at BNL
    • Presentation at ATLAS Computing Workshop 4/15: Slides
  • this week:
    • Fred is looking into the Squid install instructions - Doug will help.
    • Will try out the instructions at MWT2; Patrick in ~2 weeks; Saul will discuss with John.

Release installation, validation (Xin)

The issue of validating the presence and completeness of releases at sites.
  • last meeting
    • The new system is in production.
    • Discussion of adding pacball creation to the official release procedure; waiting for this for 15.0.0 - not ready yet. The issue is getting pacballs created quickly.
    • Trying to get the procedures standardized so they can be done by the production team. Fred will try to get Stan Thompson to do this.
    • Testing release installation publication against the development portal. Will move to the production portal next week.
    • Future: define a job that compares what's at a site with what is in the portal; a minimal comparison sketch follows this section.
    • Tier 3 sites - this is difficult for Panda, since the site needs to have a production queue. Probably need a new procedure.
    • Question: how are production caches installed in releases? Each cache is in its own pacball and can be installed in the directory of the release it patches. Should Xin be a member of the SIT? Fred will discuss next week.
    • Xin will develop a plan and present it in 3 weeks.
  • this meeting:
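A job that cross-checks what a site actually has against the installation portal could start from a simple set difference, as below. The inputs here (text files of release tags from the site and from the portal) are assumptions; the real job would query the site software area and the portal directly.

#!/usr/bin/env python
# Rough sketch: compare releases installed at a site with those listed in the
# installation portal and report discrepancies. Inputs are assumed to be two
# plain-text files with one release tag per line.
import sys

def load(path):
    with open(path) as f:
        return set(line.strip() for line in f if line.strip())

site_releases = load(sys.argv[1])    # e.g. tags found in the site software area
portal_releases = load(sys.argv[2])  # e.g. tags the portal says are installed

for rel in sorted(portal_releases - site_releases):
    print("MISSING at site:       %s" % rel)
for rel in sorted(site_releases - portal_releases):
    print("NOT in portal listing: %s" % rel)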

Tier 3 coordination plans (Doug, Jim C)

  • last report:
    • Upcoming meeting at ANL
    • Sent survey to computing-contacts
    • There is a Tier 3 support list being set up.
    • Need an RT queue for Tier 3
  • this report:
    • OSG is organizing a Tier 3 support group.
    • Needs a small data sample for the workshop ~ J17.

HTTP interface to LFC (Charles)

VDT Bestman, Bestman-Xrootd

  • See BestMan page for more instructions & references
  • last week
    • Have discussed adding Adler32 checksums to xrootd. Alex is developing something to calculate this on the fly and expects to release it very soon. Want to supply this to the GridFTP server; a minimal checksum sketch follows this section.
    • Need to communicate with CERN regarding how this will work with FTS.
  • this week
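For reference, Adler32 (the checksum compared by FTS/DQ2) is available directly from zlib, so a streaming calculation of the kind being added can be sketched in a few lines. This is only an illustration of the algorithm, not the xrootd/GridFTP implementation under discussion.

#!/usr/bin/env python
# Rough sketch: compute a file's Adler32 checksum in streaming fashion, in the
# zero-padded hex form usually stored in grid catalogs. Illustrative only.
import sys
import zlib

def adler32_of(path, blocksize=1024 * 1024):
    value = 1  # Adler32 starting value
    with open(path, "rb") as f:
        while True:
            block = f.read(blocksize)
            if not block:
                break
            value = zlib.adler32(block, value)
    return "%08x" % (value & 0xFFFFFFFF)

if __name__ == "__main__":
    for name in sys.argv[1:]:
        print("%s  %s" % (adler32_of(name), name))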

Tier3 networking (Rich)

  • last week
    • Reminder to advise campus infrastructure: Internet2 member meeting, April 27-29, in DC
    • http://events.internet2.edu/2009/spring-mm/index.html
    • Engage with the CIOs and program managers
    • Session 2:30-3:30 pm on Monday, April 27, to focus on Tier 3 issues
    • Another session added for Wednesday, 2-4 pm.
  • this week

Local Site Mover

AOB

  • last week
    • dCache service interruption tomorrow. Postgres vacuum seems to flush the write-ahead logs to disk frequently. Will increase the checkpoint segments to allow 1-2 GB of WAL, as well as the write-ahead logging buffers, to decrease the load while vacuuming (a settings-check sketch follows this section). May need another interruption at some point. Will publish the settings.
    • OSG 1.0.1 to be released shortly.
  • this week
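When making a tuning change like this, it is handy to record the relevant WAL settings straight from the database before and after. The sketch below uses psycopg2 with placeholder connection details, and checkpoint_segments assumes an 8.x-era Postgres (the parameter was removed in later versions).

#!/usr/bin/env python
# Rough sketch: print the WAL-related Postgres settings relevant to the
# vacuum/checkpoint tuning above. Connection parameters are placeholders.
import psycopg2

conn = psycopg2.connect(host="localhost", dbname="pnfs", user="postgres")
cur = conn.cursor()
for setting in ("checkpoint_segments", "wal_buffers", "checkpoint_timeout"):
    cur.execute("SHOW " + setting)  # SHOW does not take bind parameters
    print("%-20s = %s" % (setting, cur.fetchone()[0]))
conn.close()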


-- RobertGardner - 12 May 2009
