MinutesMay6
Introduction
Minutes of the Facilities Integration Program meeting, May 6, 2009
- Previous meetings and background : IntegrationProgram
- Coordinates: Wednesdays, 1:00pm Eastern
- (309) 946-5300, Access code: 735188; Dial *6 to mute/un-mute.
Attending
- Meeting attendees:
- Apologies:
Integration program update (Rob, Michael)
- IntegrationPhase9 - FY09Q3
- Special meetings
- Tuesday (9:30am CDT): Facility working group on analysis queue performance: FacilityWGAP
- Tuesday (12 noon CDT): Data management
- Tuesday (2pm CDT): Throughput meetings
- Friday (1pm CDT): Frontier/Squid
- Upcoming related meetings:
- US ATLAS persistent chat room http://integrationcloud.campfirenow.com/ (requires account, email Rob), guest (open): http://integrationcloud.campfirenow.com/1391f
- Planning issues
- Discussion of the ATLAS schedule, in particular the STEP09 analysis challenge, May 25 - June 6. 300M events will be produced, plus two cosmic runs. June 25: reprocessing, cosmics, analysis.
- SL5 migration preparations are underway at CERN. Need to schedule this within the facility. Execution of migration in July.
- HEP-SPEC benchmark, see CapacitySummary.
- Other remarks
Operations overview: Production (Kaushik)
- Reference:
- last meeting(s):
- Lots of simulation samples to do
- There was a throughput issue getting input files to sites at some point, but it improved about 12 hours ago.
- Not quite finished with reprocessing jobs - about one day's worth remains. Finishing these will then clear the tape backlog at BNL and SLAC.
- Plenty of activated jobs everywhere; 30K jobs
- Lots of site issues, mainly regarding
- MegaJam
- JF17 sample: unbiased, with all SM processes turned on.
- 88M events of evgen already produced; subscribed to the US.
- ATLFAST2
- Borut will start 200M evgen tomorrow
- These will be high-priority jobs.
- Tier 3 request: keep some fraction at every Tier 2 for testing access.
- Tier 2s: reserve 15-20 TB for this on MCDISK.
Shifters report (Mark)
- Reference
- last meeting:
- 134K production jobs completed world-wide per day @ 90%.
- Illinois issue: there were some missing RPMs on worker nodes; good success rate now. Need a correct page listing the missing RPMs (compat libs) for the 64-bit RHEL OS (see the illustrative command after this list).
- LFC issues at swt2-cpb resolved; back in production. UTA_SWT2: cleaning out files on the gatekeeper, probably leftover gridmanager/PBS files.
- Tasks 595223 and 59222 had large failure rates over the weekend.
- Three clouds have migrated to the Oracle backend database.
- New pilot version 36h.
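- On the Illinois item: 32-bit ATLAS releases on a 64-bit RHEL-family worker node generally need the 32-bit compatibility libraries installed alongside the 64-bit ones. The exact package set depends on the release, so the page above should carry the authoritative list; as an illustration only, the usual suspects on RHEL 5 are along the lines of "yum install compat-libstdc++-33.i386 libgcc.i386 glibc.i686".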
- this meeting:
Analysis queues (Nurcan)
- Reference:
- last meeting:
- See AnalysisQueueJobTests, from FacilityWGAP
- FacilityWGAPMinutesApr21
- Paul reported that all metrics used in HammerCloud are now available for Panda jobs. Changes have been made in the dev pilot; the new pilot will be released within a week, after the new pilot testing framework is finished.
- HammerCloud is running in the US now, http://gangarobot.cern.ch/st/, with pie charts showing the number of successes/failures. Alden provided code for putting the Panda metrics into the HammerCloud code; Dan has implemented this and is now doing some tests. I encourage site admins to start looking at their site's performance. Two new mailing lists have been set up to coordinate the stress-testing efforts and for site admins to share their experience in understanding how their clusters perform under load, whether any changes in local configurations are needed, etc. They will be announced after the ADC Oper meeting tomorrow.
- Stress testing:
- The first round of stress testing is almost done with the SUSYValidation job. Only AGLT2 needs to be tested one more time (to reach a success rate of 95%). See your site's certification status at AnalysisSiteCertification.
- The missing files in SUSY datasets at NET2, MWT2 and AGLT2 were recovered.
- The LFC errors (lfc_creatg failed, lfc_setfsizeg failed, lfc_creatg failed, lfc_statg failed) at BNL were understood. Wensheng reported that this was due to an expired VOMS proxy extension (a quick check for this is noted after this list).
- Truncated files on disk were found at AGLT2. Shawn deleted them; there still seems to be a problem with some input files (pilot: Get error: Copy command returned error code 256). Being investigated.
- I set up a second job, D3PD making with TopPhysTools; instructions are available at AnalysisQueueJobTests. I ran a stress test on all sites yesterday (4/21); results can be seen at: http://panda.cern.ch:25880/server/pandamon/query?dash=analysis&processingType=stresstest&reload=yes (broken currently). These are long jobs, 12 to 16 hours, filling up the queues nicely.
- Results to be looked at today. Will run on the container dataset once the missing tid datasets are replicated at MWT2 and NET2 (mc08.105200.T1_McAtNlo_Jimmy.recon.AOD.e357_s462_r579/).
- I asked Mark Slater to put the SUSYValidation and D3PD-making jobs into HammerCloud.
- Will continue setting up the other jobs listed at AnalysisQueueJobTests.
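- On the BNL LFC errors: a quick way to distinguish an expired VOMS extension from an expired proxy is "voms-proxy-info -all", which prints the remaining lifetime of both the proxy itself and the attribute certificate; the LFC calls start failing once the latter runs out, even if the proxy is still valid. (This is general gLite behavior, not a statement about the specific BNL configuration.)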
- this meeting:
- FacilityWGAPMinutesApr28
- D3PD-making jobs were stress tested at the sites. All sites passed except MWT2; Rob wanted to run another test after the dCache problems.
- TAG selection jobs now run at all sites. Issues at NET2 and SWT2 resolved. All sites passed except SWT2, which had a 22% failure rate due to looping jobs; this needs to be investigated.
- Next a job that uses DB access will be set up and tested.
- Progress on integrating stress test jobs into HammerCloud: the SUSYValidation job is already integrated; the D3PD-making and TAG selection jobs are in progress.
- Preparations for STEP09: weekly EVO meetings, Monday 5pm CERN time. First meeting this Monday, 5/4, minutes sent to atlas-dist-analysis-stress-testing-coord@cern.ch.
- More mailing lists have come up for discussing stress-testing activities: HN-StressTest@bnl.gov and hn-atlas-distAnalysisReadinessTests@cern.ch, besides the two already advertised: atlas-dist-analysis-stress-testing-coord@cern.ch and atlas-dist-analysis-stress-testing@cern.ch.
DDM Operations (Hiro)
- Reference
- last meeting(s):
- AGLT2 - DQ2 has been upgraded. There are some caveats, but it went relatively easily. Suggest using the Campfire chat.
- BNL_DQ2 - now checking for bad transfers. A few come in per day (about 1 in 45K). These don't get registered in the LFC, so the corresponding Panda job will fail.
- Saul: 1/10K corruption.
- this meeting:
Data Management & Storage Validation (Kaushik)
- Reference
- last week(s):
- this week:
Throughput Initiative (Shawn)
- Notes from meeting this week:
- last week(s):
- perfSONAR working well; deployment being simplified for easy maintenance. Michael suggests having a dedicated session to review how the probe information is being presented. Perhaps a tutorial at some point, when we're in a robust situation.
- Getting closer to doing throughput benchmark tests.
- Next week: NET2, AGLT2
- Hiro set the TCP buffer size to 8M (an FTS client setting); this crashed the machine. Rich suggests using autotuning on the host instead (see the sketch after this list).
- Internet2 meeting next week in DC. Chip will speak Monday. Tuesday - a 10 minute slot for Michael.
- Meeting for CIOs to understand LHC usage.
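- On the FTS buffer-size item: rather than pinning buffers at a fixed 8M, the host can be left to autotune, with only the kernel ceilings raised. A minimal sketch of the relevant Linux sysctl settings (values are illustrative, not a recommendation for any particular host):
      # /etc/sysctl.conf -- illustrative values only
      net.core.rmem_max = 16777216
      net.core.wmem_max = 16777216
      net.ipv4.tcp_rmem = 4096 87380 16777216
      net.ipv4.tcp_wmem = 4096 65536 16777216
      net.ipv4.tcp_moderate_rcvbuf = 1   # leave receive-buffer autotuning on (the default)
  With autotuning, connections grow their buffers on demand up to these ceilings instead of every socket being forced to the maximum.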
- this week:
Notes from Throughput Meeting
==========================
Attending: Sarah, Shawn, Horst, Rich, Neng, Jay
Apologies: Saul
- perfSONAR. Next Knoppix release still sometime in May (late). Jay’s graphs now have the average values plotted.
- No report on data transfer…
- Site reports
- BNL: No report
- AGLT2: Working on dCache upgrades and other issues. Network has been stable.
- MWT2: Not much on throughput. Working on the dCache upgrade. Problems with storage node crashes; working to resolve them.
- NET2: No report
- SWT2: OK there. OU plans to involve a summer student and wants to run 10GE tests (with AGLT2 and ??)
- WT2: No report
- Wisconsin: One perfSONAR server up https://atlas-perfsonar2.chtc.wisc.edu/. Needs to be reconfigured as BWCTL. Other node seems to be broken and needs work. Request in place to fix it. (Sites should review the perfSONAR instructions at http://code.google.com/p/perfsonar-ps/wiki/NPToolkitQuickStart )
- AOB. Hadoop discussion. Jay discussed perfSONAR configurations, data/gridftp tests and plans.
Meet next week at the usual time. Please send along edits/comments/suggestions via email.
Thanks,
Shawn
GGUS-GOC-RT Ticketing
- last week
- There was a ticket from GGUS->GOC that didn't make it into RT and went neglected for a while (#12323). Jason is maintaining the RT system. There is a manual process; it needs to be automated.
- Jason will discuss with Dantong and will follow up with the GOC, keeping Fred in the loop.
- this week
Site news and issues (all sites)
- T1:
- last week: Lots of storage management issues over the past week due to the job profile. 120K requests in the tape queue through dCache clogged things up; had to clean up, then it ran fine over the weekend. Ordered 10 more tape drives - they have arrived and will be in production early next week, doubling the data rate to/from tape.
- this week:
- AGLT2:
- last week: running well right now. Working on getting rid of dark data on MCDISK.
- this week:
- NET2:
- last week(s): Saul: The inventory of corrupted files is being replaced. RSV problems are being looked into. John: to replace the data, wrote a script that does lcg-cp; it sometimes hangs - the data is probably on tape (one way to bound such copies is sketched below). Probably should delete the bad files and re-subscribe. Jobs running at Harvard.
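- On John's replacement script: one way to keep tape-resident files from hanging the whole run is to put a hard wall-clock limit around each copy and log the stragglers for later deletion and re-subscription. A minimal sketch in Python (assumes lcg-cp is on the PATH; the input file name, --vo value, and timeout are hypothetical):
      #!/usr/bin/env python3
      # Sketch: run lcg-cp per file, but give up after a fixed timeout so
      # files that are still being staged from tape do not block the run.
      import subprocess

      COPY_TIMEOUT = 1800  # seconds; hypothetical value

      def copy_one(src, dst):
          """Return True on success, False on failure or timeout."""
          try:
              subprocess.run(["lcg-cp", "--vo", "atlas", src, dst],
                             check=True, timeout=COPY_TIMEOUT)
              return True
          except subprocess.TimeoutExpired:
              print("TIMEOUT (likely still on tape): %s" % src)
          except subprocess.CalledProcessError as err:
              print("FAILED (exit %s): %s" % (err.returncode, src))
          return False

      if __name__ == "__main__":
          # hypothetical input file: one "source destination" pair per line
          with open("files_to_replace.txt") as todo:
              for line in todo:
                  parts = line.split()
                  if len(parts) != 2:
                      continue
                  copy_one(parts[0], parts[1])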
- this week:
- MWT2:
- last week(s): last week we reported the loss of a couple of pools; working on new network settings for WAN access. MCDISK cleanup. Also dCache adjustments for direct writes from the WAN to servers on public nodes.
- this week:
- SWT2 (UTA):
- last week:
- covered above
- FTS proxy delegation issue - happened twice. Hiro is planning a patch to FTS tomorrow.
- this week:
- SWT2 (OU):
- last week: All seems to be running fine, but there are slow jobs(?). perfSONAR tests are going much better now - there was a fix on the OU network side, but it's not certain that was the cause.
- this week:
- WT2:
- last week: SRM problem persisting - possibly due to a bad client. Had to set up a firewall to block the traffic and kill the client; it worked fine afterwards. Central deletion is running, but not very fast (2-10 Hz), and it doesn't run all the time. Power outage tomorrow.
- And a new baby boy for Wei!
- this week:
Carryover issues (any updates?)
Squids and Frontier (John DeStefano)
- Note: this activity involves a number of people from both Tier 1 and Tier 2 facilities. I've put John down to regularly report on updates as he currently chairs the Friday Frontier-Squid meeting, though we expect contributions from Douglas, Shawn, John, and others at Tier 2s as developments are made. - rwg
- last meeting(s):
- AGLT2 - two servers set up and working, talking to a front-end at BNL
- Presentation at ATLAS Computing Workshop 4/15: Slides
- this week:
Release installation, validation (Xin)
The issue of validating the presence and completeness of releases at sites.
- last meeting
- The new system is in production.
- Discussion of adding pacball creation to the official release procedure; waiting for this for 15.0.0 - not ready yet. The issue is getting pacballs created quickly.
- Trying to get the procedures standardized so they can be done by the production team. Fred will try to get Stan Thompson to do this.
- Testing release installation publication against the development portal. Will move to the production portal next week.
- Future: define a job that compares what's at a site with what is in the portal (a toy sketch follows this list).
- Tier 3 sites - this is difficult for Panda - the site needs to have a production queue. Probably need a new procedure.
- Question: how are production caches installed in releases? Each is in its own pacball and can be installed in the directory of the release it is patching. Should Xin be a member of the SIT? Fred will discuss next week.
- Xin will develop a plan and present in 3 weeks.
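- On the comparison job: at its core it is a set difference between the releases found at the site and the releases the portal publishes for it. A toy sketch in Python (both input lists are hypothetical placeholders; the real job would scan the site software area and query the installation portal):
      # Sketch: compare releases present at a site against those published in the portal.
      def compare_releases(at_site, in_portal):
          at_site, in_portal = set(at_site), set(in_portal)
          return {
              "published_but_missing": sorted(in_portal - at_site),
              "present_but_unpublished": sorted(at_site - in_portal),
          }

      if __name__ == "__main__":
          site_releases = ["14.5.2", "15.0.0"]               # hypothetical
          portal_releases = ["14.5.2", "15.0.0", "15.1.0"]   # hypothetical
          print(compare_releases(site_releases, portal_releases))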
- this meeting:
Tier 3 coordination plans (Doug, Jim C)
- last report:
- Upcoming meeting at ANL
- Sent survey to computing-contacts
- There is a Tier 3 support list being set up.
- Need an RT queue for Tier 3.
- this report:
HTTP interface to LFC (Charles)
VDT Bestman, Bestman-Xrootd
- See BestMan page for more instructions & references
- last week
- Have discussed adding an Adler32 checksum to xrootd. Alex is developing something to calculate this on the fly and expects to release it very soon. Want to supply this to the gridftp server as well (a streaming-checksum sketch follows below).
- Need to communicate w/ CERN regarding how this will work with FTS.
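- Background on the Adler32 item (an illustration of the "on the fly" idea, not Alex's implementation): the checksum can be accumulated block by block as the data streams through, so the file never needs a second pass. A minimal sketch in Python:
      # Sketch: compute an Adler32 checksum incrementally while streaming a file.
      import zlib

      def adler32_of(path, blocksize=1024 * 1024):
          value = 1  # Adler32 starting value
          with open(path, "rb") as f:
              while True:
                  block = f.read(blocksize)
                  if not block:
                      break
                  value = zlib.adler32(block, value)
          return "%08x" % (value & 0xFFFFFFFF)

      # Example (hypothetical file name):
      # print(adler32_of("/data/AOD.pool.root"))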
- this week
Tier3 networking (Rich)
- last week
- Reminder to advise campus infrastructure: Internet2 member meeting, April 27-29, in DC
- http://events.internet2.edu/2009/spring-mm/index.html
- Engage with the CIOs and program managers
- Session 2:30-3:30 on Monday, April 27, to focus on Tier 3 issues
- Another session added for Wednesday, 2-4 pm.
- this week
Local Site Mover
AOB
- last week
- dCache service interruption tomorrow. The PostgreSQL vacuum seems to flush the write-ahead logs to disk frequently. Will increase the space between checkpoints (via checkpoint segments) to 1-2 GB, as well as the write-ahead log buffers, to decrease the load while vacuuming (an illustrative configuration is sketched after this list). May need to do another interruption at some point. Will publish the settings.
- OSG 1.0.1 to be released shortly.
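- On the PostgreSQL tuning above: the knobs in question live in postgresql.conf. Illustrative values only (not the settings that will actually be published), assuming a PostgreSQL 8.x server:
      # postgresql.conf -- illustrative, PostgreSQL 8.x era
      checkpoint_segments = 64            # 16 MB each, so ~1 GB of WAL between checkpoints (128 for ~2 GB)
      checkpoint_completion_target = 0.9  # spread checkpoint I/O out (8.3+)
      wal_buffers = 8MB                   # larger write-ahead log buffer
  Spacing checkpoints further apart and enlarging the WAL buffer is the usual way to cut down the flush traffic the vacuum generates.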
- this week
--
RobertGardner - 22 Apr 2009