
MinutesMay6

Introduction

Minutes of the Facilities Integration Program meeting, May 6, 2009
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • (309) 946-5300, Access code: 735188; Dial *6 to mute/un-mute.

Attending

  • Meeting attendees:
  • Apologies:

Integration program update (Rob, Michael)

  • IntegrationPhase9 - FY09Q3
  • Special meetings
    • Tuesday (9:30am CDT): Facility working group on analysis queue performance: FacilityWGAP
    • Tuesday (12 noon CDT) : Data management
    • Tuesday (2pm CDT): Throughput meetings
    • Friday (1pm CDT): Frontier/Squid
  • Upcoming related meetings:
  • US ATLAS persistent chat room http://integrationcloud.campfirenow.com/ (requires account, email Rob), guest (open): http://integrationcloud.campfirenow.com/1391f
  • Planning issues
    • Discussion of the ATLAS schedule: in particular the STEP09 analysis challenge, May 25 - June 6. 300M events are to be produced, plus two cosmic runs. June 25: reprocessing, cosmics, analysis.
    • SL5 migration preparations are underway at CERN. Need to schedule this within the facility; execution of the migration in July.
    • HEP-SPEC benchmark, see CapacitySummary.
  • Other remarks

Operations overview: Production (Kaushik)

  • Reference:
  • last meeting(s):
    • Lots of simulation samples to do
    • There was a throughput issue getting input files to sites at one point, but it improved about 12 hours ago.
    • Not quite finished with reprocessing jobs - about one day's worth remains. This will then clear the tape backlog for BNL and SLAC.
    • Plenty of activated jobs everywhere; 30K jobs
    • Lots of site issues, mainly regarding
    • MegaJam
      • JF17 sample, unbiased, with all SM processes turned on.
      • 88M event evgens already produced; subscribed to US
      • ATLFAST2
      • Borut will start 200M evgen tomorrow
      • These will be high-priority jobs
      • Tier 3 request - have some fraction at every Tier 2 for testing access.
      • Tier 2: reserve 15-20 TB for this in MCDISK.

  • this week:

Shifters report (Mark)

  • Reference
  • last meeting:
    • 134K production jobs completed world-wide per day @ 90%
    • Illinois issue - there were some missing RPMs on worker nodes; good success rate now. Need a correct page listing the missing RPMs (compat libs) for the 64-bit RHEL OS (a check sketch follows at the end of this section).
    • LFC issues at swt2-cpb resolved, back into production. UTA_SWT2 - cleaning out files on the gatekeeper, probably leftover gridmanager/PBS files.
    • Tasks 595223, 59222 had large failure rates over the weekend.
    • Three clouds have migrated to the Oracle backend database.
    • New pilot version 36h.
  • this meeting:
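
As a follow-up to the missing compat RPMs noted above, a minimal sketch of the kind of worker-node check that would catch this class of problem: it asks rpm whether a list of 32-bit compatibility packages is installed on a 64-bit RHEL host. The package names are illustrative assumptions, not the official requirements list.

    #!/usr/bin/env python
    # Check a worker node for 32-bit compat libraries on 64-bit RHEL.
    # The package names below are examples, not an authoritative list.
    import subprocess

    COMPAT_RPMS = [
        "compat-libstdc++-33",  # legacy C++ runtime often needed by releases
        "glibc.i686",           # 32-bit C library (exact name varies by RHEL version)
    ]

    def missing_rpms(packages):
        """Return the subset of packages that rpm reports as not installed."""
        missing = []
        for pkg in packages:
            # rpm -q exits non-zero when the package is absent
            rc = subprocess.call(["rpm", "-q", pkg],
                                 stdout=subprocess.DEVNULL,
                                 stderr=subprocess.DEVNULL)
            if rc != 0:
                missing.append(pkg)
        return missing

    if __name__ == "__main__":
        for pkg in missing_rpms(COMPAT_RPMS):
            print("MISSING: %s" % pkg)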

Analysis queues (Nurcan)

DDM Operations (Hiro)

Data Management & Storage Validation (Kaushik)

Throughput Initiative (Shawn)

  • last week(s):
    • perfSONAR working well; getting deployment simplified for easy maintenance. Michael suggests having a dedicated session to review how the probe information is being presented - perhaps a tutorial at some point, once we're in a robust situation.
    • Getting closer to doing throughput benchmark tests.
    • Next week: NET2, AGLT2
    • Hiro set the TCP buffer size to 8M (an FTS client setting). This crashed the machine. Rich suggests relying on TCP autotuning on the host instead (a sysctl sketch appears at the end of this section).
    • Internet2 meeting next week in DC. Chip will speak Monday. Tuesday - a 10 minute slot for Michael.
    • Meeting for CIOs to understand LHC usage.
  • this week:
    • Notes from the call:
                                Notes from Throughput Meeting
                                =============================

Attending: Sarah, Shawn, Horst, Rich, Neng, Jay
Apologies: Saul
  • perfSONAR. Next Knoppix release still sometime in May (late). Jay’s graphs now have the average values plotted.
  • No report on data transfer…
  • Site reports
    • BNL: No report
    • AGLT2: Working on dCache upgrades and other issues. Network has been stable.
    • MWT2: Not much on throughput. Working on the dCache upgrade. Problems with storage node crashes; working to resolve them.
    • NET2: No report
    • SWT2: OK there. OU plans to involve a summer student and wants to run 10GE tests (with AGLT2 and ??)
    • WT2: No report
    • Wisconsin: One perfSONAR server up https://atlas-perfsonar2.chtc.wisc.edu/. Needs to be reconfigured as BWCTL. Other node seems to be broken and needs work. Request in place to fix it. (Sites should review the perfSONAR instructions at http://code.google.com/p/perfsonar-ps/wiki/NPToolkitQuickStart )
  • AOB. Hadoop discussion. Jay discussed perfSONAR configurations, data/gridftp tests and plans.
Meet next week at the usual time.   Please send along edits/comments/suggestions via email.

Thanks,

Shawn
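
As a follow-up to the FTS buffer discussion in last week's notes above, a sketch of the host-level TCP autotuning Rich suggested: on Linux 2.6 kernels the tcp_rmem/tcp_wmem triplets let the kernel size buffers per connection, rather than fixing one large application buffer. The values below are illustrative assumptions, not a facility recommendation.

    # /etc/sysctl.conf fragment - Linux TCP buffer autotuning (example values)
    net.ipv4.tcp_moderate_rcvbuf = 1             # enable receive-buffer autotuning
    net.core.rmem_max = 16777216                 # 16 MB cap on receive buffers
    net.core.wmem_max = 16777216                 # 16 MB cap on send buffers
    net.ipv4.tcp_rmem = 4096 87380 16777216      # min / default / max receive buffer
    net.ipv4.tcp_wmem = 4096 65536 16777216      # min / default / max send buffer
    # apply with: sysctl -p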

GGUS-GOC-RT Ticketing

  • last week
    • There was a ticket passed from GGUS to the GOC that didn't make it into RT and went neglected for a while (#12323). Jason is maintaining the RT system. There is a manual process that needs to be automated.
    • Jason will discuss w/ Dantong, and will follow-up w/ the GOC. Keeping Fred in the loop.
  • this week

Site news and issues (all sites)

  • T1:
    • last week: Lots of storage management issues over the past week due to the job profile. 120K requests in the tape queue through dCache clogged things up; had to clean up, then it ran fine over the weekend. Ordered 10 more tape drives - they have arrived and should be in production early next week, doubling the data rate to/from tape.
    • this week:

  • AGLT2:
    • last week: running well right now. Working on getting rid of dark data on MCDISK.
    • this week:

  • NET2:
    • last week(s): Saul: the inventory of corrupted files is being replaced. RSV problems being looked into. John: to replace the data, wrote a script that does an lcg-cp per file; it sometimes hangs - the data is probably on tape. Probably should delete the bad files and re-subscribe (a timeout-wrapper sketch follows this site's entry). Jobs running at Harvard.
    • this week:
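
On the hanging lcg-cp script above: a minimal sketch of one way to guard against such hangs - run each copy under a timeout so a transfer stuck on a tape-resident file fails fast instead of blocking the whole replacement loop. The bare lcg-cp command line and the timeout value are assumptions for illustration, not John's actual script.

    #!/usr/bin/env python
    # Run lcg-cp under a timeout so tape-stuck transfers fail fast.
    import subprocess

    def copy_with_timeout(src, dest, timeout=600):
        """Attempt one lcg-cp; return True on success, False on failure/timeout."""
        try:
            # check=True raises if lcg-cp exits non-zero
            subprocess.run(["lcg-cp", src, dest], check=True, timeout=timeout)
            return True
        except subprocess.TimeoutExpired:
            print("TIMEOUT (file likely on tape): %s" % src)
        except subprocess.CalledProcessError as err:
            print("FAILED (rc=%d): %s" % (err.returncode, src))
        return False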

  • MWT2:
    • last week(s): Last week we reported the loss of a couple of pools; working on new network settings for WAN access. MCDISK cleanup. Also adjusting dCache for direct writes from the WAN to servers on public nodes.
    • this week:

  • SWT2 (UTA):
    • last week:
      • covered above
      • FTS proxy delegation issue - happened twice. Hiro is planning to patch FTS tomorrow.
    • this week:

  • SWT2 (OU):
    • last week: All seems to be running fine, apart from some slow jobs. perfSONAR tests are going much better now - there may have been a fix on the OU network side, but not sure.
    • this week:

  • WT2:
    • last week: The SRM problem persisted - possibly due to a bad client. Had to set up a firewall to block the traffic and kill the client; it worked fine afterwards. Central deletion is running, but not very fast (2-10 Hz), and it doesn't run all the time. Power outage tomorrow.
      • And a new baby boy for Wei!
    • this week:

Carryover issues (any updates?)

Squids and Frontier (John DeStefano)

  • Note: this activity involves a number of people from both the Tier 1 and Tier 2 facilities. I've put John down to regularly report on updates, as he currently chairs the Friday Frontier-Squid meeting, though we expect contributions from Douglas, Shawn, John, and others at Tier 2's as developments are made. - rwg
  • last meeting(s):
    • AGLT2 - two servers set up, working, talking to a front-end at BNL
    • Presentation at ATLAS Computing Workshop 4/15: Slides
  • this week:

Release installation, validation (Xin)

The issue of validating the presence and completeness of releases on sites.
  • last meeting
    • The new system is in production.
    • Discussion of adding pacball creation to the official release procedure; waiting on this for 15.0.0 - not ready yet. The issue is getting pacballs created quickly.
    • Trying to get the procedures standardized so it can be done by the production team. Fred will try to get Stan Thompson to do this.
    • Testing release installation publication against the development portal. Will move to the production portal next week.
    • Future: define a job that compares what's at a site with what is in the portal.
    • Tier 3 sites - this is difficult for Panda - the site needs to have a production queue. Probably need a new procedure.
    • Question: how are production caches installed in releases? Each is in its own pacball and can be installed in the directory of the release that it's patching. Should Xin be a member of the SIT? Fred will discuss next week.
    • Xin will develop a plan and present in 3 weeks.
  • this meeting:

Tier 3 coordination plans (Doug, Jim C)

  • last report:
    • Upcoming meeting at ANL
    • Sent survey to computing-contacts
    • A Tier3 support list is being set up.
    • Need an RT queue for Tier3
  • this report:

HTTP interface to LFC (Charles)

VDT Bestman, Bestman-Xrootd

  • See BestMan page for more instructions & references
  • last week
    • Have discussed adding Adler32 checksums to xrootd. Alex is developing something to calculate this on the fly and expects to release it very soon. Want to supply this to the gridftp server (a checksum sketch follows at the end of this section).
    • Need to communicate w/ CERN regarding how this will work with FTS.
  • this week
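
For context on the checksum itself: Adler32 is cheap enough to compute in a streaming fashion, which is what calculating it "on the fly" amounts to. A minimal Python sketch follows - an illustration of the checksum, not of Alex's xrootd code; grid tools usually report the value as 8 zero-padded hex digits.

    #!/usr/bin/env python
    # Streaming Adler32 of a file, reported as 8 zero-padded hex digits.
    import sys
    import zlib

    def adler32_of_file(path, chunk_size=1024 * 1024):
        """Compute Adler32 incrementally so large files never sit in memory."""
        value = 1  # the Adler32 starting value
        with open(path, "rb") as f:
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    break
                value = zlib.adler32(chunk, value)
        return "%08x" % (value & 0xffffffff)

    if __name__ == "__main__":
        print(adler32_of_file(sys.argv[1]))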

Tier3 networking (Rich)

  • last week
    • Reminder to advise campus infrastructure: Internet2 member meeting, April 27-29, in DC
    • http://events.internet2.edu/2009/spring-mm/index.html
    • Engage with the CIOs and program managers
    • Session 2:30-3:30 on Monday, April 27, to focus on Tier 3 issues
    • Another session added for Wednesday, 2-4 pm.
  • this week

Local Site Mover

AOB

  • last week
    • dCache service interruption tomorrow. The postgres vacuum seems to flush the write-ahead logs to disk frequently. Will increase the checkpoint segments to allow 1-2 GB of WAL, as well as the write-ahead log buffers, to decrease the load while vacuuming (an example postgresql.conf fragment follows below). May need to do another interruption at some point. Will publish the settings.
    • OSG 1.0.1 to be released shortly.
  • this week
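
A sketch of the kind of postgresql.conf change described above, sized so the write-ahead log can grow to roughly 1-2 GB between checkpoints (in PostgreSQL 8.x, each checkpoint segment is 16 MB). The values are illustrative assumptions, not the settings to be published.

    # postgresql.conf fragment (PostgreSQL 8.x) - example values only
    checkpoint_segments = 64     # 64 x 16 MB ~ 1 GB of WAL between checkpoints
    wal_buffers = 16MB           # enlarge the write-ahead log buffer
    # Fewer, larger checkpoints reduce the I/O spikes seen during vacuum.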


-- RobertGardner - 22 Apr 2009
