
MinutesSep16

Introduction

Minutes of the Facilities Integration Program meeting, Sep 16, 2009
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • (309) 946-5300, Access code: 735188; Dial *6 to mute/un-mute.

Attending

  • Meeting attendees: Fred, Saul, Tom, Rob, Charles, Aaron, John DeStefano, Patrick, Shawn, Wei, Rik, Doug, John B, Horst, Karthik, Xin, Bob, Wensheng
  • Apologies: Kaushik, Michael

Integration program update (Rob, Michael)

  • Introducing: SiteCertificationP10 - FY09Q04
  • Special meetings
    • Tuesday (9am CDT): Frontier/Squid
    • Tuesday (9:30am CDT): Facility working group on analysis queue performance: FacilityWGAP suspended for now
    • Tuesday (12 noon CDT) : Data management
    • Tuesday (2pm CDT): Throughput meetings
  • Upcoming related meetings:
  • US ATLAS persistent chat room http://integrationcloud.campfirenow.com/ (requires account, email Rob), guest (open): http://integrationcloud.campfirenow.com/1391f
  • Program notes:
    • last week(s)
      • Site certification table has a new column for the lcg-utils update, as well as curl. Will update as needed. Note: needed more critically by PandaMover host.
      • ATLAS software week: http://indico.cern.ch/conferenceDisplay.py?confId=50976
      • Important issue: DQ2 site services consolidation
        • Will run at BNL for all sites in the US cloud - predicated on FTS 2.2 for remote checksum; if this works, we'll consolidate. New FTS is being tested currently. Needs to work for all storage back ends, including Bestman. AI: need to follow-up w/ Simone, Hiro and Wei, to bring a test instance to BNL, test with Bestman sites.
      • Storage requirements: SpaceManagement
      • FabricUpgradeP10 - procurement discussion
      • Latest on lcg-utils and LFC:
         Begin forwarded message:
        For LFC: Just yesterday I got it building on one platform and hope to have it building on multiple platforms today. So it's in good shape.
        For lcg-utils: I upgraded (it was a painful set of changes in my build, but it's done) and Tanya did a test that showed everything working well. But about 30 minutes ago, Brian Bockelman told me that I might need to upgrade again to avoid a bug--I just contacted some folks in EGEE, and they confirmed that I should upgrade to a new version. *sigh* Hopefully I can get this done today as well.
        All of that said: I can almost certainly give you something for testing this week.
        -alain
      • Specifically: GFAL: 1.11.8-1 (or maybe 1.11.9-1), lcg-utils: 1.7.6-1. These were certified recently, and 1.7.6-1 fixes a bug that was critical for CMS: https://savannah.cern.ch/bugs/index.php?52485 -alain
    • this week:

Tier 3 (Rik, Doug)

  • A statement of support for T3GS sites from the facility is needed
  • T3 interviews in progress - half the institutions have responded
  • About 5 T3GS sites have been identified (LT, UTD, U of I, Wisc, Tufts, Hampton)
  • 25 interviews so far - half are starting from scratch but have some funds; the other half are sites w/ a T3G of some kind; others have departmental clusters to build on. Budgets are in the 30-60K range. About half have no existing resources at all.
  • T3GS sites will require a support structure that is not currently available - they will rely on community support. Will cooperate with OSG on putting together instructions for some T3 documentation.
  • pcache for T3's.
  • lsm for T3's.

Operations overview: Production (Kaushik)

Shifters report (Mark)

  • Reference
  • last meeting:
    Yuri is on vacation this week -- weekly summary will resume once he's back.
    
    [ ESD reprocessing -- still awaiting final s/w validation.  Unlikely to start before next week. ]
    [Production generally running very smoothly this past week -- most tasks ("queue fillers") have low error rates. ]
    
    1)  Over this past weekend dq2 maintenance was performed at NET2 to repair datasets affected by a permission problem related to files written by bestman.  See for example RT # 13955.
    2)  9/3: AGLT2_PRODDISK to BNL-OSG2_MCDISK transfer errors -- issue resolved.  From Shawn: 
    This was caused by a missing /pnfs mount on UMFS08.  It has been fixed now. In fact I changed the way we do the /pnfs mount back to the dCache standard method of doing this in the /etc/init.d/dcache service.  I am propagating it to all the storage nodes.
    3)  9/4: AGLT2_CALIBDISK transfer errors -- from Shawn:
    The /var partition on the dCache headnode filled up. Cleaning up the space now so postgresql has sufficient working area available.  Restarting postgresql and dCache services.
    4)  9/5: Jobs were failing at AGLT2 with stage-out errors like "Put error: Error in copying the file from job workdir to localSE."  From Shawn:
    I believe these errors were due to umfs13.aglt2.org.  The /pnfs mount point on this system was in a bad state.  I was unable to mount/remount it, so I rebooted.
    5)  9/7: Fileserver problem at UTD-HEP -- site set off-line while maintenance is being done on the system. RT # 13969.
    6)  9/8-9/9: Issue with lack of pilots running at HU_ATLAS_Tier2 -- Xin suggested cleaning out the area /opt/osg-1.0.0/globus/tmp/gram_job_state/ , since the symptoms on the submit host side at BNL indicated that the Condor grid manager was experiencing problems updating the status of jobs.  Seems to have done the trick -- pilots are flowing and production jobs are running.  (A cleanup sketch follows after this report.)
    7)  9/8-9/9: AGLT2 -- stage-out errors ("Put error: Error in copying the file from job workdir to localSE") - pilot log has errors like:
    0 bytes 0.00 KB/sec avg 0.00 KB/sec inst
    globus_xio_tcp_driver.c:globus_l_xio_tcp_system_connect_cb:1799:
    Unable to connect to 0.0.0.0:2811
    globus_xio_system_select.c:globus_l_xio_system_handle_write:1886:
    System error in connect: Connection refused
    globus_xio: A system call failed: Connection refused.
    Does not affect all jobs, at a level of ~20-25%.  See RT 13981.
    
    Follow-ups from earlier reports: 
    (i)  7/23-7/24 -- Ongoing work by Paul and Xin to debug issues with s/w installation jobs at OU_OSCER_ATLAS.  Significant progress, but still a few remaining issues.
    (ii)  SLC5 upgrades will most likely happen sometime during the month of September at most sites.
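    The GRAM job-state cleanup in item 6 above was a manual step; below is a minimal, hedged sketch of how such a pruning could be scripted. The age cutoff, the dry-run default, and the idea that only old, orphaned state files should be touched are illustrative assumptions, not part of Xin's procedure.

      #!/usr/bin/env python
      # Hedged sketch only -- not an official OSG/Globus tool.  Assumption:
      # files untouched for more than MAX_AGE_DAYS under the gram_job_state
      # area are stale, and the admin has confirmed the corresponding jobs
      # are gone (or the CE is drained) before pruning for real.
      import os
      import time

      STATE_DIR = "/opt/osg-1.0.0/globus/tmp/gram_job_state/"  # path from item 6 above
      MAX_AGE_DAYS = 7     # assumed cutoff; tune per site
      DRY_RUN = True       # flip to False only after reviewing the candidate list

      def prune(state_dir, max_age_days, dry_run=True):
          cutoff = time.time() - max_age_days * 86400
          for root, _dirs, files in os.walk(state_dir):
              for name in files:
                  path = os.path.join(root, name)
                  try:
                      if os.path.getmtime(path) < cutoff:
                          print("%s %s" % ("would remove" if dry_run else "removing", path))
                          if not dry_run:
                              os.unlink(path)
                  except OSError:
                      pass  # file vanished or is unreadable; skip it

      if __name__ == "__main__":
          prune(STATE_DIR, MAX_AGE_DAYS, DRY_RUN)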

  • this meeting:
    Yuri is on vacation this week -- weekly summary will resume once he's back.
    
    [ ESD reprocessing -- began over this past weekend, but large numbers of jobs were failing due to a s/w problem.  Will require a new cache -- expected to be ready by ~9/21.
    Until then shifters don't need to report reprocessing-related errors. ]  See: https://twiki.cern.ch/twiki/bin/view/Atlas/ADCReproFromESDAug2009
    [ Other production generally running very smoothly this past week -- most tasks ("queue fillers") have low error rates. ]
    
    1)  9/11: WISC_DATADISK and WISC_MCDISK transfer errors: "unable to create dir."  From the site admin: this problem is solved -- the ownership of the cns directory had been changed to another user (no idea why); it has been changed back, and transfers work now.
    2)  9/11: Failed jobs at MWT2_IU & IU_OSG with LFC replica errors -- due to some files getting deleted from the storage during a production disk cleanup.
    3)  9/13: Access problem with the LFC at AGLT2 resolved -- modified firewall settings to open the needed ports.
    4)  9/14: New pilot versions (see the illustrative sketch after these notes):
    (39a)
    * The internal pilot time-out command has been updated with some minor fixes by Charles.
    * In case a job definition contains a non-null/empty cmtconfig value, the pilot will use it instead of the static schedconfig value for cmtconfig validation.
    * "hotdisk" (highest priority) and "bnlt0d1" have been added to the LFC sorting algorithm.
    * Batch system identification now uses QSUB_REQNAME as the BQS identifier (used at Lyon) as an alternative to BQSCLUSTER.
    * Sites using dCacheLFCSiteMover (dccplfc copytool) can now use direct access mode in user analysis jobs.
    * In case of a get failure of an input file, the file info is now added to the metadata for later diagnostics.
    (39b)
    * The pilot has been patched to correct an issue seen with analysis jobs using the AtlasProduction cache ($SITEROOT not set, leading to problems with the release installation verification).
    5)  9/14: A/C issue at AGLT2 -- from Bob: The air-conditioning seems to have stabilized now. We will be turning machines back on shortly to pick up jobs. Note that we are taking this opportunity to apply patched kernels to the compute nodes, so we will generally be running somewhat lighter loads over the next few days.
    6)  9/14-9/15: Transfer errors at AGLT2 -- [FILE_EXISTS] and "[DQ2] Transfer validation failed" -- understood, issue resolved.  https://gus.fzk.de/ws/ticket_info.php?ticket=51506
    7)  9/15: Test jobs submitted to the UCITB_EDGE7 site quickly end with the pilot status "Output late or N/A."  Suchandra is investigating.
    
    Follow-ups from earlier reports:
    (i)  7/23-7/24 -- Ongoing work by Paul and Xin to debug issues with s/w installation jobs at OU_OSCER_ATLAS.  Significant progress, but still a few remaining issues.
    (ii)  SLC5 upgrades will most likely happen sometime during the month of September at most sites.
    (iii)  UTD-HEP is working on installing a new RAID controller in their fileserver.  Will use this opportunity to do a clean-up of old data in their storage.
    • Sunday reprocessing jobs were okay
    • Monday - initial tasks were subscribed only at BNL - 1000s of failed jobs
    • A new software cache is needed - expect it in the first part of next week at the earliest. The decision was to let some of the jobs continue to fail in order to learn what the problems are.
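    Item 4 in the report above notes that "hotdisk" (highest priority) and "bnlt0d1" were added to the pilot's LFC replica-sorting algorithm. The sketch below is illustrative only and is not the pilot's actual code: the extra token names and the substring matching are assumptions, but it shows the general idea of ranking replicas by space-token preference.

      # Hedged sketch (not the actual pilot code): order LFC replicas so that
      # copies in preferred space-token areas are tried first.  Token names
      # beyond "hotdisk"/"bnlt0d1" and the path-matching heuristic are
      # assumptions for illustration.
      TOKEN_PRIORITY = ["hotdisk", "bnlt0d1", "mcdisk", "datadisk"]

      def replica_rank(surl):
          """Lower rank = try earlier; unknown tokens sort last."""
          s = surl.lower()
          for rank, token in enumerate(TOKEN_PRIORITY):
              if token in s:
                  return rank
          return len(TOKEN_PRIORITY)

      def sort_replicas(surls):
          # sorted() is stable, so replicas of equal rank keep their LFC order
          return sorted(surls, key=replica_rank)

      if __name__ == "__main__":
          replicas = [
              "srm://se.example.edu/atlasmcdisk/mc09/EVNT.pool.root",
              "srm://se.example.edu/atlashotdisk/cond/DBRelease.tar.gz",
          ]
          print(sort_replicas(replicas))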

Analysis queues (Nurcan)

  • Reference:
  • last meeting:
    • We're making progress on getting TAG selection jobs into HC.
    • db access jobs - Fred and David Front are in the loop.
    • Two new analysis shifters! Waiting to hear from 3 more people
    • What about the stress test w/ "large container" analysis?
    • 135M events in the container have been distributed to all T2's; can we get users to start using them?
    • The rest still need to be merged, then replicated. This is about 50 TB. Will take about a week.
  • this meeting:
    • CosmicsAnalysis job using DB access has been successfully tested at FZK using Frontier and at DESY-HH using Squid/Frontier (by Johannes). The job has been put into HammerCloud and is now being tested at DE_PANDA; no submission to US sites yet.
    • TAG selection job has been put into HammerCloud and is now being tested (in DE cloud).
    • We now have 3 new analysis shifters confirmed, still waiting to hear from one person. I'm planning a training for them in October.
    • Jim C. contacted us about the status of the large containers for the stress test. Kaushik reported that we have a total of ~500M events produced. Only the first bunch was replicated to Tier2's, as I had validated them (step09.00000011.jetStream_medcut.recon.AOD.a84/ with 97.69M events and step09.00000011.jetStream_lowcut.recon.AOD.a84/ with 27.49M events). The others are at BNL, waiting to be merged and put into new containers. Depending on the time scale of the stress test this can be done in a few days, as Kaushik reported.

DDM Operations (Hiro)

Conditions data access from Tier 2, Tier 3 (Fred, John DeStefano)

  • last week
  • this week
    • See Dario's presentation from close-out plenary at http://indico.cern.ch/conferenceDisplay.py?confId=50976
    • ATLAS recommendation: All Tier 2 sites to install one Squid server per 500 user analysis batch slots (a small sizing sketch follows below)
    • Minutes from yesterday's meeting: https://lists.bnl.gov/pipermail/racf-frontier-l/2009-September/000485.html
    • HOTDISK space token areas need to be set up: needs to be done within a week. 2 TB.
    • PoolFileCatalog updates at sites: 2 weeks at all the sites - working with Alessandro, Xin, Rod Walker, Fred - update next week
    • Will need a solution for Tier 3's
    • client available in 15.5 and up
    • Question about running multiple squids on a single host, using different port numbers (yes this is possible). Only possible concern is running out of network bandwidth.
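    As a rough aid to the sizing rule above (one Squid per 500 user analysis batch slots), here is a small sketch that computes how many Squid instances a site would need; the site names and slot counts are made up, and extra instances sharing a host are simply assumed to take consecutive ports, as discussed above.

      import math

      # Hypothetical slot counts for illustration only -- not real site numbers.
      analysis_slots = {
          "SITE_A": 1200,
          "SITE_B": 480,
          "SITE_C": 2100,
      }

      SLOTS_PER_SQUID = 500   # ATLAS recommendation quoted above
      BASE_PORT = 3128        # conventional Squid port; extra instances bump it

      for site, slots in sorted(analysis_slots.items()):
          n_squids = max(1, int(math.ceil(slots / float(SLOTS_PER_SQUID))))
          ports = [BASE_PORT + i for i in range(n_squids)]
          print("%s: %d analysis slots -> %d Squid instance(s) on ports %s"
                % (site, slots, n_squids, ports))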

Data Management & Storage Validation (Kaushik)

Throughput Initiative (Shawn)

  • NetworkMonitoring
  • last week(s):
    Notes for US ATLAS Throughput Meeting - September 8, 2009
    =========================================================
    
    Attending: Dave, Doug, Sarah, Shawn, Horst, Karthik, Hiro, Rich
    
    1)	perfSONAR status – RC4 out ?  Some good feedback provided by current RC3 testers.    Need to verify status of target release date (September 21 ?).  Karthik reported RC3 worked pretty well but some services are disabled when they shouldn’t be.  Aaron will be following-up (in RC4).   Plan for the near future is to have all US ATLAS Tier-2’s deploy the “released” perfSONAR within 1 week of its release (by the end of September).   Then  we need to verify that it works as expected and gain enough experience with its configuration and use to be able to make a recommendation within a month (by the end of October).   The recommendation would be concerning whether or not Tier-3’s should deploy this version of perfSONAR (presumably 1 box devoted to bandwidth testing) OR await the next perfSONAR version (in case of too many issues still being present that prevent wider deployment).
    2)  Updates on “Tier-3” related testing/throughput - Lots of discussion which touched upon (US)ATLAS policy, DQ2 and related topics. Our group is focused on throughput and testing, and we need to determine what to do for Tier-3’s. The draft idea is that each Tier-3 should be assigned a Tier-2 as a testing partner. Data movement tests will be configured using Hiro’s data transfer testing. Tier-3’s will have only 7-8 test files sent from their associated Tier-2 once per day, compared to the Tier-1 to Tier-2 tests which send 20 files, twice per day, to two space-token areas (a minimal sketch of this test matrix appears after these notes). This testing, along with a properly configured perfSONAR instance at each Tier-3, should provide sufficient initial monitoring/testing. Also, Shawn is assembling pages on the Twiki Rob has set up which are the beginning of a “how-to/cookbook” for throughput debugging, tuning and testing. **Please send along any URLs or info that would be appropriate to include**. Also, Doug pointed out that Tier-3’s need examples of existing storage systems and configurations already in place to help guide their selection of hardware purchases. **We need to get feedback (also on the Twiki) from each of the Tier-2’s on their hardware and configurations which can be used by the Tier-3’s**.
    3)	Automated data-movement testing status – Logging of errors is to be added.   In addition Hiro will update the software to allow the “3rd party mode” which will enable tests from any site to any site in the US.  Rob had some requests for updates to the plots that Sarah and Hiro will follow-up on.
    4)	Virtual circuit testing status – Still waiting to hear that the UC networking folks have the path instrumented so we can repeat the tests.   **Shawn will send another email asking for a status update**.  If things are ready we will try to reschedule the test sometime in the next few days.
    5)	Site reports – Skipped till next meeting
    6)	AOB – Skipped till next meeting
    
    Please send any corrections or additions to the list.   We plan to meet again next week at the regular time. Shawn
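    A minimal sketch of the Tier-3 test matrix described in item 2 above: the site pairings, the exact file count, the token names, and the submit_transfer() helper are hypothetical placeholders, not Hiro's actual framework; the point is only to show how little configuration the once-per-day Tier-2 -> Tier-3 tests would need compared to the Tier-1 -> Tier-2 tests.

      # Hedged sketch of the test matrix described above, not Hiro's framework.
      TIER2_TO_TIER3 = {          # Tier-3 : its assigned Tier-2 testing partner
          "T3_EXAMPLE_A": "T2_EXAMPLE_X",
          "T3_EXAMPLE_B": "T2_EXAMPLE_Y",
      }

      T3_FILES_PER_DAY = 8        # "7-8 test files ... once per day"
      T1_T2_FILES = 20            # Tier-1 -> Tier-2 comparison: 20 files,
      T1_T2_RUNS_PER_DAY = 2      # twice per day, to two space-token areas
      T1_T2_TOKENS = ["DATADISK", "MCDISK"]   # assumed token names, for illustration

      def submit_transfer(src_site, dst_site, n_files, token=None):
          # Placeholder: a real implementation would hand this request to the
          # existing automated data-movement machinery, not copy files itself.
          print("queue %d files: %s -> %s (token=%s)" % (n_files, src_site, dst_site, token))

      def daily_t3_tests():
          for t3, t2 in sorted(TIER2_TO_TIER3.items()):
              submit_transfer(t2, t3, T3_FILES_PER_DAY)

      if __name__ == "__main__":
          daily_t3_tests()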
    
    • From Jason: Just to clarify, we still have many issues to address before the real release. We expect an RC4 release around 9/11 and the 'real' release the week of 9/21 if everything stays on schedule. Also, I would like to again thank those who are helping in the beta test process - the only reason we are able to find and fix the issues as fast as we are is their diligence in debugging with us.
    • Want to get this deployed at Tier 2's as soon as release 4 is available. There have been a number of problems discovered at the beta sites with RC3.
    • Throughput tests showing problems w/ dCache 1.9.4-2, considering downgrading.
    • Automated plots
    • NET2 to be added to plot
    • BNL-BNL plot
    • SRM not working at Duke
    • Plan to redo the load tests for the UC circuit.
    • What about transfers into Tier 3g's?
    • Action: gather information on network/host tunings as guide for Tier 3
  • this week:
     
    • perfSONAR release next week (RC4). Would like a quick deployment once it's available - within a week's time span.
    • The following week will focus on configuration.
    • October time frame - hope to have enough experience to make recommendations.
    • VC test BNL-UC - the problem went away, but the cause is unknown.

OSG 1.2 deployment (Rob, Xin)

  • last week:
    • Update to OSG 1.2.1 release this past week for pyopenssl patch
  • this week:

OIM issue (Xin)

  • Registration information change for bm-xroot in OIM - Wei will follow-up

Site news and issues (all sites)

  • T1:
    • last week(s): apart from a glitch w/ site services; an order for 120 worker nodes is out: 8-core nodes, 2.66 GHz Nehalem (Xeon X5550), 24 GB RAM, Dell R410s. Fast memory. Can put up to 4 drives in these 1U servers. 1000 analysis cores hitting DDN storage. SSD 160 GB drives for $400.
    • this week:

  • AGLT2:
    • last week: Looking at next purchase round - Dell and Sun. SSDs versus extra spindles. Looking at the R610. Need to test. DDN storage (~$320/usable TB). May need downtime for dCache.
    • this week: Next procurements - hope to order by next week; waiting on Dell and Sun. Conversion to SL5: a few issues with the SL5 build - problems compiling Athena jobs - looking at what's different. Hope to transition over the next few weeks; would like to deprecate SL4 support. Applied new kernels on all nodes.

  • NET2:
    • last week(s): running smoothly at BU - investigating a problem at HU - not getting pilots.
    • this week: Running smoothly except for 17 Athena crashes from this morning - investigating. Getting ready for procurements at BU and HU; all space and infrastructure defined, getting bids. Looking at blades.

  • MWT2:
    • last week(s): Aaron van Meerton coming on board at UC. Distributed xrootd layout studies - setting this up. Bestman on top of dCache (looking at performance issues). We did experience some partial drains, mainly because of very short jobs.
    • this week: Circuit tests last week caused dCache pools to crash on the receiving end - want to re-investigate after SL 5.3 using kdump. Power interruption.

  • SWT2 (UTA):
    • last week: all is well. Looking at purchase options for next round. Need to think about what to do w/ the old SWT2_UTA cluster. Blocking off a few days for the SL5 and OSG upgrade.
    • this week: OSG 1.2 and SL5 upgrade in the first/second week of October. Planning for a storage upgrade - looking at 1 PB of disk in the next purchase. Upgrading storage at SWT2_UTA - deprecating IBRIX.

  • SWT2 (OU):
    • last week: no issues
    • this week: waiting on storage purchase - Joel investigating. Upgrade after storage.

  • WT2:
    • last week(s): Hardware purchases are proceeding. The xrootd client is being improved by the developer; a new algorithm will be tested in a new HC test. Will likely have another power outage.
    • this week: SL5 migration (RHEL5) - set up a test machine and queue. Setting up additional Squid servers.

Tier 3 data transfers (Doug, Kaushik)

  • last week
    • There have been discussions this week at CERN regarding this topic. Expect a solution very soon, see Simone's. Next week.
  • this week

Carryover issues (any updates?)

Release installation, validation (Xin)

The issue of validating the presence and completeness of releases at sites.
  • last meeting
  • this meeting:

HTTP interface to LFC (Charles)

VDT Bestman, Bestman-Xrootd

  • See BestMan page for more instructions & references
  • last week
    • Have discussed adding Adler32 checksums to xrootd. Alex is developing something to calculate this on the fly and expects to release it very soon. Want to supply this to the gridftp server (a minimal checksum sketch follows below).
    • Need to communicate w/ CERN regarding how this will work with FTS.
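    For background on the "on the fly" calculation above: an Adler32 checksum can be updated incrementally as the data streams through, so no second pass over the file is needed. The sketch below uses Python's zlib purely as an illustration; it is not Alex's xrootd implementation, and the chunk size is an arbitrary choice.

      import zlib

      def adler32_of_file(path, chunk_size=1024 * 1024):
          """Compute an Adler32 checksum incrementally, one chunk at a time."""
          value = 1  # Adler32 seed value
          f = open(path, "rb")
          try:
              while True:
                  chunk = f.read(chunk_size)
                  if not chunk:
                      break
                  value = zlib.adler32(chunk, value)
          finally:
              f.close()
          # Mask to an unsigned 32-bit value (consistent across Python versions)
          # and format as the 8-hex-digit string grid tools usually compare.
          return "%08x" % (value & 0xffffffff)

      if __name__ == "__main__":
          import sys
          print(adler32_of_file(sys.argv[1]))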
  • this week

Local Site Mover

Gratia transfer probes @ Tier 2 sites

Hot topic: SL5 migration

  • last weeks:
    • ACD ops action items, http://indico.cern.ch/getFile.py/access?resId=0&materialId=2&confId=66075
    • Kaushik: we have the green light from ATLAS to do this; however, there are some validation jobs still going on and some problems to solve. If anyone wants to migrate, go ahead, but we're not pushing right now. Want to have plenty of time before data comes (meaning the next month or two at the latest). Wait until reprocessing is done - anywhere between 2-7 weeks from now, for both SL5 and OSG 1.2.
    • Consensus: start mid-September for both SL5 and OSG 1.2
    • Shawn: considering rolling part of the AGLT2 infrastructure to SL5 - should they not do this? Probably okay - Michael. Would get some good information. Sites: use this time to sort out migration issues.
    • Milestone: by mid-October all sites should be migrated.
  • this week
    • What to do about validation? Xin notes that compat libs are needed
    • Consult UpgradeSL5

Getting OIM registrations correct for WLCG installed pledged capacity

  • last week
    • Note - if you have more than one CE, the availability will take the "OR".
    • Make sure installed capacity is no greater than the pledge.
    • Storage capacity is given to the GIP by one of two information providers (one for dCache, one for POSIX-like filesystems) - requires OSG 1.0.4 or later. Note - not important for WLCG, it's not passed on. Karthik notes we have two ATLAS sites that are reporting zero. This is a bit tricky.
    • Have not yet seen a draft report.
    • Double check that the accounting name doesn't get erased. There was a bug in OIM - it should be fixed, but check.
  • this week

AOB

  • last week
  • this week


-- RobertGardner - 15 Sep 2009
