


Minutes of the Facilities Integration Program meeting, January 13, 2010
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • (605) 715-4900, Access code: 735188; Dial *6 to mute/un-mute.


  • Meeting attendees: Shawn, Michael, Rob, Fred, Jason (I2), Wei, Justin, John, Hiro, Charles, Booker, Torre, Saul, Tom, Rik, Doug, Patrick, Kaushik, Armen, Jim
  • Apologies: Mark

Integration program update (Rob, Michael)

  • SiteCertificationP12 - FY10Q2
  • Special meetings
    • Tuesday (9am CDT): Frontier/Squid
    • Tuesday (9:30am CDT): Facility working group on analysis queue performance: FacilityWGAP suspended for now
    • Tuesday (12 noon CDT) : Data management
    • Tuesday (2pm CDT): Throughput meetings
  • Upcoming related meetings:
  • US ATLAS persistent chat room http://integrationcloud.campfirenow.com/ (requires account, email Rob), guest (open): http://integrationcloud.campfirenow.com/1391f
  • Program notes:
    • last week(s)
      • Quarterly reports due!
      • The machine shutdown ends in mid-February
      • Our readiness - need site reports below; site upgrades (SL5 and OSG 1.2) must be completed by the end of the month
      • Another round of reprocessing coming up, analysis has been ramping up
    • this week
      • Planning for the OSG all-hands meeting is in full swing; there will be industry representatives showing products for compute and storage
      • Monday March 8 - Wednesday March 10, plus March 11 (Council) - at Fermilab
      • Possible joint session w/ US CMS - e.g., covering topics such as storage
      • WLCG capacity reporting was discussed at yesterday's meeting; it is high up on the list now. MUPJ (multi-user pilot jobs): the timeline for deployment is being determined by the WLCG.

Tier 3 Integration Program (Doug Benjamin & Rik Yoshida)

  • last week(s):
    • There is a VM cluster at ANL in a good configuration to use as a reference pattern for small sites. Now testing user functionality (dq2, pathena, condor submits); the instructions still need to be tested.
    • More hardware arrived at ANL - will be setting up a real Tier 3 model cluster
    • Progress on Panda; Pacballs for SL5
    • Panda - Torre working on it w/ Tadashi and others
    • Pacballs - new releases in native SLC5; old releases are still in SLC4, older version of gcc - will still need compat libs.
    • Tier3 data deletion - new program from Hiro
    • Still no definitive answer for gridftp alone versus srm+gridftp (Hiro will check)
    • Frontier+Squid - is there anything Tier 3 specific? Yes - SE and HOTDISK
  • this week:
    • Panda is running on the Duke Tier 3. Pilots are working fine. Communication with the Panda server uses secure HTTPS based on a grid certificate. Queue checking is based on worker-node names. Working on getting real jobs running. Pilots are submitted locally via Condor, not Condor-G.
    • ADC meeting - they want to mandate functional tests for all sites, including Tier 3s. Agreed to this, at a scale appropriate to Tier 3s and confined to a particular space token. Will need to deal with blacklisting issues.
      • Hiro is thinking of a proposal for getting data to Tier 3s, user-initiated, something less than the current Tier 2-like infrastructure. Needs further discussion.
      • Can Tier 3s also publish to the BDII, for functional-testing purposes?
      • Need a separate discussion on these topics: Friday at 11 EST.
    • Working on the reference Tier 3 site. Studying basic management using xrootd: directory creation, file removal, etc. Need XrootdFS (a minimal sketch of such POSIX-style operations through an XrootdFS mount follows this list).
    • Tufts requesting GPFS access to BU's Tier 2. Should discourage this close coupling between the Tiers.
    • Tier 3 meeting at CERN - draft agenda from Massimo.
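    For reference on the basic-management point above: once XrootdFS presents the Xrootd namespace as a FUSE mount, directory creation, file removal and space checks reduce to ordinary POSIX calls. A minimal Python sketch follows; the mount point /xrootdfs/atlas and the directory layout are hypothetical, not the actual reference-site configuration.
      #!/usr/bin/env python
      # Minimal sketch: basic space management through an XrootdFS (FUSE) mount.
      # The mount point and directory layout below are hypothetical.
      import os
      import shutil

      MOUNT = "/xrootdfs/atlas"   # hypothetical XrootdFS mount point

      def make_user_dir(user):
          """Create a per-user directory; XrootdFS maps this to Xrootd operations."""
          path = os.path.join(MOUNT, "users", user)
          os.makedirs(path)
          return path

      def remove_dataset(relpath):
          """Remove a dataset directory tree through the mount."""
          shutil.rmtree(os.path.join(MOUNT, relpath))

      def space(path=MOUNT):
          """Return (total, free) bytes as reported through the mount."""
          st = os.statvfs(path)
          return st.f_blocks * st.f_frsize, st.f_bavail * st.f_frsize

      if __name__ == "__main__":
          print("total/free bytes: %d / %d" % space())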

Operations overview: Production (Kaushik)

Data Management & Storage Validation (Kaushik)

Shifters report (Mark)

  • Reference
  • last meeting:
    Yuri's summary from the weekly ADCoS meeting:
    1)  12/31: Final summary of the reprocessing project (from Alexei):
    All steps of the December09 reprocessing campaign finished today.
    ESD production finished a few days ago; AOD and DESD production and
    histogram and ntuple merging ran successfully afterward.
    A corrected version provided by the software team yesterday allowed the
    TAG_COMM files to be merged in due time.
    ALL outputs are subscribed to the ATLAS Tiers according
    to the requested pattern. (More than 95% of the data are already replicated;
    it is close to 100% for Tier-1s.)
    We are happy to announce that reprocessing was fully done before the end
    of 2009, as requested by the Reprocessing Coordinator.
    2)  1/4: From Sarah at MWT2, in response to transfer error alerts:
    One of our pools was down, due to running low on memory. I've boosted the memory allocation and the pool is back up. We should see these files transfer successfully soon.
    Follow-up comment from Shawn:
    We have seen this "locality is UNAVAILABLE" at AGLT2 as well. Seems to be new in the sense that before running 1.9.5 I don't recall having these messages.
    3)  1/4:  NET2 -- jobs were failing with errors about missing input files.  This was tracked down to the fact that central deletions were removing files from the PRODDISK space token.  Hiro requested that this action be stopped.  Missing files were replaced, site back to 'online'.  See RT 14984, eLog 8422.
    4)  1/4:  schedconfigdb misconfiguration for the test site ANALY_MWT2_X was fixed by Alden.  (This resolved an issue where the site was not receiving pilots.)
    5)   1/5:   MWT2 -- maintenance outage -- from Sarah:
    We're performing network maintenance this morning on the switch that supports MWT2_IU.  The maintenance will occur between 8am and 12pm, and is expected to last 30 minutes.
    6)  1/6:  Maintenance outage at SLAC to inspect a failed fan on a storage box.  13:00-17:00 PST / 21:00-1:00 UTC.
    7)  1/6: BNL -- Conditions db maintenance completed:
    The BNL US ATLAS Conditions Database maintenance has been successfully done. OS and Database memory configuration in the cluster nodes have been adjusted to the new memory available.  No service interruption observed during this intervention.
    8)  1/6:  AGLT2 -- transfer errors like:
    From CERN-PROD_DATATAPE to AGLT2_CALIBDISK is failing at high rate: Fail(88.0)/Success(0.0)
    number of errors with following message: 88
    Error message from FTS: [FTS] FTS State [Failed] FTS Retries [1] Reason [DESTINATION error during TRANSFER_PREPARATION phase: [CONNECTION_ERROR] failed to contact on remote SRM [httpg://head01.aglt2.org:8443/srm/managerv2]. Givin' up after 3 tries]
    Resolved -- from Shawn:
    I found the dCache srm service stopped on head01.  There was a 'pg_dump' which had been running for 55 cpu minutes.  I restarted postgres and all dCache services on head01. SRM is again operational.
    Follow-ups from earlier reports:
    i)  12/29:  SLAC -- Failed jobs with the pilot error:
    !!FAILED!!1999!! Exception caught in pilot: [Errno 5] Input/output error: 'curl.config'.  Update from Paul (1/6):
    For some reason, the file system did not allow the curl config file to be created so the open command threw an exception. I will improve the error handling in a later pilot version (the error occurred in the _Curl class responsible for the getJob operation used by the pilot).
    ii)  12/18 (ongoing): IU_OSG -- site was originally set off-line due to pilot failures that were blocking jobs from other VO's -- nothing unusual seen on the pilot submit host -- tried some test jobs yesterday (12/22) -- 9 of 10 failed with the error " Pilot has decided to kill looping job."  
    Jobs were seemingly "stuck" on the WN's -- problem still under investigation.
    iii)  12/19: Discussions about the best way to submit / track change requests for schedconfigdb (Alden, others).  New e-mail address: schedconfig@gmail.com
    iv)  12/22: UTA_SWT2 -- Maintenance outage (SL5, many other s/w upgrades) is completed.  atlas s/w releases are being re-installed by Xin (this was necessary since the old storage was replaced).  Test jobs have finished successfully -- will resume production once the atlas releases are ready.  
    1/6:  Investigating one final issue with the s/w installation pilots.

  • this meeting:
    Yuri's summary from the weekly ADCoS meeting:
    1)  1/6:  SLAC -- Maintenance outage completed -- from Wei:
    The outage is over. SUN was not able to fix the problem but the ATLAS operation will go on.
    2)  1/6 - 1/7: Issue with missing input datasets for reprocessing tasks running at BNL resolved.  eLog 8502, Savannah 61123.
    3)  1/7 - 1/8: Maintenance outage at MWT2_UC to upgrade dCache headnodes to SL5.3 completed.
    4)  1/7: Pilot update from Paul (v41d):
    A new minor pilot version has been released. US LFC registrations now use the host name given by schedconfig.se, instead of $LFC_HOST. Requested by Hironori Ito.
    5)  1/7: From Hiro to site admins:
    Please check your DQ2 SS configuration (dq2.cfg) to make sure that all of the physics group areas at your T2 site are being served by the site
    service. To find which physics group areas are defined, just check the TiersOfATLASCache.py file located in your DQ2 SS
    (/var/tmp/.something/ToACache.py) or get it on the web at:
    6)  1/7: BNL -- storage maintenance completed -- description of the work:
    Approximately 1/3 of the back-end nodes providing storage for the dCache/SRM service will be intermittently offline for network connection changes and firmware upgrades. Each unit is expected to be offline for less than an hour. The SRM service overall will remain available, but particular files may be inaccessible during the maintenance, resulting in occasional file transfer failures and possibly associated job failures.
    eLog 8506.
    7)  1/8: UTA_SWT2: site has been set back to 'online' following the completion of a major upgrade to the cluster.  A final lingering issue with the NIC driver on the new nodes added during the outage was resolved.
    8)  1/8: Recovery tasks from reprocessing in December are done -- from Alexei:
    The reprocessing recovery step was finished yesterday.  The final number: 99.983% of all data are reprocessed; 23 jobs did not finish successfully, and the reasons have been reported to the experts.  The datasets produced during recovery (Jan 6, 7) are subscribed to the ATLAS Tiers and CERN.
    9)  1/8: Jobs were failing at the NET2 sites with "lsm-put failed" errors.  Resolved -- from John:
    We tracked this down to some backed-up du cron jobs (for the space availability reporting) that were degrading file system performance. Those jobs have been killed and everything looks back to normal.  eLog 8540, ggus 54537, RT 15008.
    Later, jobs with errors like "[SE][GetSpaceTokens][] httpg://atlas.bu.edu:8443/srm/v2/server: CGSI-gSOAP: Error reading token data header: Connection reset by peer" were noticed.  From John:
    Bestman on atlas.bu.edu got into a bad state this morning and was giving that error. Restarting it cleared up the problems. We restarted it just after
    setting the site back online (after being offline for different reasons -- see above).  So it should be fixed now.  eLog 8538, RT 15014.
    10)  1/11: SLAC -- job failures with the error "lfc_getreplicas failed with: 1004, Timed out."  Understood, from Wei:
    SLAC's LFC DB server was rebooted to pickup new Linux kernel. I think things are back to normal now.
    eLog 8576.
    11)  1/11: From Paul, discussing possible changes to the pilot to deal with the issue of orphaned processes left behind following athena crashes (seen by John at BU and Horst at OU):
    Since the parent pid is set to 1 (a result of athena core dumping?), the pilot doesn't identify it as belonging to its chain of processes. I will see if the pilot can identify it by e.g. also looking at the pgid. [Added to the todo list.]  (An illustrative sketch of such a pgid-based scan appears after this list.)
    12)  1/12: BNL -- ATLAS/OSG Grid NFS filesystems (/usatlas/grid, /usatlas/OSG) were successfully migrated to new hardware.  eLog 8599.
    13)  1/13: maintenance outage at SLAC -- from Wei:
    We will take an outage from 9am to 5pm to physically move storage boxes, and to try again fixing a failed fan.
    Follow-ups from earlier reports:
    i) Reminder: analysis jamboree at ANL 1/18 - 1/20.
    ii)  12/29:  SLAC -- Failed jobs with the pilot error:
    !!FAILED!!1999!! Exception caught in pilot: [Errno 5] Input/output error: 'curl.config'.  Update from Paul (1/6):
    For some reason, the file system did not allow the curl config file to be created so the open command threw an exception. I will improve the error handling in a later pilot version (the error occurred in the _Curl class responsible for the getJob operation used by the pilot).
    iii)  12/18 (ongoing): IU_OSG -- site was originally set off-line due to pilot failures that were blocking jobs from other VO's -- nothing unusual seen on the pilot submit host -- tried some test jobs yesterday (12/22) -- 9 of 10 failed with the error " Pilot has decided to kill looping job."  
    Jobs were seemingly "stuck" on the WN's -- problem still under investigation.
    iv)  12/19: Discussions about the best way to submit / track change requests for schedconfigdb (Alden, others).  New e-mail address: schedconfig@gmail.com
    v) Reminder: analysis jamboree at BNL 2/9 - 2/12. 
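    Regarding item 11) above: a minimal illustration (not the actual pilot code) of a pgid-based scan on Linux, looking for processes that were re-parented to init (ppid 1) but still belong to the calling process group:
      #!/usr/bin/env python
      # Illustrative only (not the actual pilot code): list processes that were
      # re-parented to init (ppid == 1) but still share our process group id,
      # i.e. likely orphans left behind by a crashed payload.
      import os

      def stat_fields(pid):
          """Return (ppid, pgid) for a pid by parsing /proc/<pid>/stat."""
          with open("/proc/%d/stat" % pid) as f:
              data = f.read()
          # the command name is wrapped in parentheses and may contain spaces,
          # so split on the last ')': the fields after it are state, ppid, pgrp, ...
          rest = data.rsplit(")", 1)[1].split()
          return int(rest[1]), int(rest[2])

      def find_orphans(pgid):
          orphans = []
          for entry in os.listdir("/proc"):
              if not entry.isdigit():
                  continue
              pid = int(entry)
              try:
                  ppid, grp = stat_fields(pid)
              except (IOError, OSError):
                  continue                  # process vanished while scanning
              if ppid == 1 and grp == pgid and pid != os.getpid():
                  orphans.append(pid)
          return orphans

      if __name__ == "__main__":
          my_pgid = os.getpgid(0)           # process group of this process
          print("possible orphans in pgid %d: %s" % (my_pgid, find_orphans(my_pgid)))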

Analysis queues (Nurcan)

DDM Operations (Hiro)

  • Reference
  • last meeting(s):
    • The new tool from Hiro can potentially delete data at T3s
    • We need to formulate a sensible dataset deletion policy for T3s with regards to obsolete datasets
    • Would this create a mess in the central catalogs?
    • Action item: Rik, Doug, Hiro, Rob, Charles, (Michael & Jim as observers) to formulate a position for US ATLAS
    • Central deletion from PRODDISK, as noted above, has been stopped; resolved.
    • GROUPDISK needs a tweak in each DQ2 SS. Hiro will send out an email.
  • this meeting:
    • All is well.
    • FTS 2.2 is still not available at the WLCG level and is still not considered stable. We need to wait for more information (~10 days) before making a decision on moving to FTS 2.2 and starting site consolidation. NET2 would be the first site to move - discussions have started.
    • Use Friday's meeting to address: Action item: Rik, Doug, Hiro, Rob, Charles, (Michael & Jim as observers) to formulate a position for US ATLAS
    • Doug will try out Hiro's new tool for data deletion
    • Hiro: will standalone gridftp work for Tier 3s? Checksums with gridftp and xrootd have still not been tested.
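    For reference on the checksum point above: the grid checksum in question is Adler32, computed incrementally ("on the fly") over the file contents. A minimal sketch, illustrative only and not the gridftp/xrootd implementation:
      #!/usr/bin/env python
      # Illustrative only: incremental ("on the fly") Adler32 of a file,
      # printed as the 8-hex-digit string commonly recorded for grid transfers.
      import sys
      import zlib

      def adler32_of(path, blocksize=1024 * 1024):
          value = 1                          # standard Adler32 seed
          with open(path, "rb") as f:
              while True:
                  block = f.read(blocksize)
                  if not block:
                      break
                  value = zlib.adler32(block, value)
          return "%08x" % (value & 0xffffffff)   # mask for consistency across platforms

      if __name__ == "__main__":
          print(adler32_of(sys.argv[1]))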

Conditions data access from Tier 2, Tier 3 (Fred, John DeStefano)

  • Reference
  • last week
    • no issues for this audience
    • there have been some frontier bug reports at the ATLAS level, working to close those
  • this week
    • Developments in frontier servlet
    • New squid rpm for t1, t2's
    • Lots of traffic from HU, sustained; local users most likely (Saul)
    • Slides for post-mortem next week - contributions welcomed; include Tier 1 stress test results
    • Discussion regarding Tier 3 squid install recommendation
    • Douglas Smith would like to run a squid test at the US Tier 2s via HammerCloud. Is there a problem routing pilot jobs from HC via CERN to OSG sites - has this been resolved? Are the CERN pilots using the most up-to-date code? Torre will investigate.

Throughput Initiative (Shawn)

  • NetworkMonitoring
  • https://www.usatlas.bnl.gov/dq2/throughput
  • last week(s):
    • Each T2 must test against all other T2s
    • Check spreadsheet for correctness
    • Asymmetries between certain pairs - need more data
    • Will start a transaction-type test (large number of small files; check summing needed)
  • this week:
    • Attending: Shawn, Jason, Aaron, Sarah, Charles, Andy, Horst, Zafar, Yee, Karthik, Hiro, Doug
    • 1) perfSONAR discussion and status. The firewall issue at IU is partially resolved; the URL for firewall ports is http://code.google.com/p/perfsonar-ps/wiki/NPToolkit (scroll to "Q: Can I Use a Firewall?"). Aaron reported a stopped throughput service at UC; Sarah reported on the software status. AGLT2 is running all services OK. MSU has issues with "runaway" processes, which are being worked on. MWT2 has some other issues with stopped services. The spreadsheet was used to verify and add in missing sites. SLAC has a kernel issue. OU has many services in "Not Running" mode for both throughput and latency nodes. The next release is scheduled for the end of this month: NTP security update, new kernel, cleanup of OWAMP files. Beta testers will be needed in about 2 weeks.
    • 2) Tests and milestones for this quarter. Throughput tests (from Hiro) to Tier-3s have recently been failing because some "source" datasets have been cleaned up. Action item: Doug will provide a list of Tier-3 affinities. Action item: Hiro to check source data files and restage if needed. The FTS version at BNL is still not "production" for the checksum version; maybe soon (see BNL site report below). Action item: Hiro is to create a transaction test targeting a rate of about 8K small files per hour; details to be determined by Hiro (an illustrative small-file test-set generator appears after this list). Next issue: revisit prior milestones and redo/reverify? Is it worthwhile to redo load-tests (measuring site "burst" capability) quarterly? The concern is the amount of effort required versus the value of the measurements; opinions on both sides. Such tests can help find otherwise undetected bottlenecks, but could be disruptive to our production systems (historically there have been minimal adverse incidents). Summary: try to do such testing quarterly, but realize circumstances may not permit it now that the LHC is running. A minimum of two milestone revalidations per year should be done. Eventually include Tier-3 milestones when appropriate.
    • 3) BNL - FTS version with checksum is not yet validated by WLCG but is released by GLite. BNL will decide soon on what it will do. Once checksum version is ready, want to verify throughput and transaction results in our infrastructure. SLAC/WT2 - Volunteered for perfSONAR beta release testing. Hopes to have perfSONAR running once kernel is OK. OU - Nothing else to report...another month for 10GE stuff after which will revisit load-tests. MWT2_UC/IU - The IU firewall has the software update in place to allow perfSONAR to work correctly. Next step is getting list of ports needed by perfSONAR in place. Also some interest in redoing load-tests after inter-site testing. AGLT2 - Interested in redoing load-testing now that changes last fall have improved throughput. Seeing some Myricom issues in 'dmesg' but may not be impacting transfers.
    • AOB? Next meeting is planned for January 26th (meetings are now every two weeks). Please send along any corrections or additions to the list.
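    Referenced from item 2) above: a minimal sketch of generating a transaction-style test set (many small files plus a manifest). The file count, size, and manifest layout are assumptions for illustration, not the actual test definition.
      #!/usr/bin/env python
      # Illustrative generator for a transaction-style test set: many small files
      # plus a manifest of names and sizes (a checksum column could be added).
      # Counts, sizes, and layout are assumptions, not the actual test definition.
      import os

      def make_test_set(outdir, nfiles=1000, size=64 * 1024):
          os.makedirs(outdir)
          with open(os.path.join(outdir, "manifest.txt"), "w") as manifest:
              for i in range(nfiles):
                  name = "smallfile_%05d.dat" % i
                  with open(os.path.join(outdir, name), "wb") as f:
                      f.write(os.urandom(size))   # random payload defeats compression
                  manifest.write("%s %d\n" % (name, size))

      if __name__ == "__main__":
          make_test_set("transaction_test", nfiles=100)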

Site news and issues (all sites)

  • T1:
    • last week(s): Maintenance: network upgrades and installation of new 700 TB NEXAN disk behind the Thor server (FC-connected); will have 4.5 PB (usable).
    • this week: A couple of scheduled interventions - updating storage servers and adding 700 TB - completed. New storage appliance for the job submission infrastructure - now have BlueArc. Millions of subdirectories possible. Wicked fast.

  • AGLT2:
    • last week: all worker nodes running SL5; have been running OSG 1.2 for a while. Transition to new network address space. Will install a new blade chassis at UM. dCache upgraded to 1.9.5-11. Running well.
    • this week: all systems up including blades. 2800 job slots. Downtime Feb 4 to complete remaining servers. Minor update to dcache. And a few other issues (bios, firmware). Other servers and systems will be updated not requiring downtime. SL4 completely eliminated. HS06 measurements - may need to update the capacity spreadsheet.

  • NET2:
    • last week(s): Recovered from the data deletion problem. Shutdown in the near future - in a week or so. LFC to be upgraded; all worker nodes already at RH5. GK and interactive node to be upgraded to RH5 and new OSG 1.2. Facility upgrade: getting two racks of Dell Nehalems at HU Odyssey, one rack of storage at BU.
    • this week: All is well.

  • MWT2:
    • last week(s): Downtime tomorrow for SL5 for dcache head nodes; next week there will be more downtimes.
    • this week: downtime tomorrow to complete the SL5 update for worker nodes.

  • SWT2 (UTA):
    • last week: still working on the topics above. Note: Dell R610 servers are sometimes dropping the NIC (Broadcom NIC, bnx2 driver); there is a kernel driver problem in the default SL 5.3. Justin has a solution. Next: will do the CPB upgrade. 200 TB of storage waiting to be installed. Ordered 400 TB of storage, hopefully arriving mid-Feb.
    • this week: installing the latest bm-xrootd; expect another downtime. CPB cluster: expecting the new delivery of 400 TB.

  • SWT2 (OU):
    • last week: Equipment delivered. 80 TB SATA drive order for the DDN. Final quote for additional nodes expected in the next week.
    • this week: New compute node order in preparation. Waiting for the storage to arrive.

  • WT2:
    • last week(s): RH5 done. OSG 1.2 finished. WLCG-client installation and validation. Two outages: this PM for a failing fan on a storage node, and probably next week as well for storage system rearrangements. Also will make scratch and localuser groups available. Will probably delete some old releases. Can we delete release 13? Will put in a new NFS server for these releases.
    • this week: Down today to move storage servers to a new rack. Extra long space tokens. Question again: can we delete release 13? We need sufficient and timely guidance from ADC. For now, no, we cannot delete release 13.

Carryover issues (any updates?)

OSG 1.2 deployment (Rob, Xin)

  • last week:
    • BNL updated
  • this week:
    • Any new updates?

OIM issue (Xin)

  • last week:
    • Registration information change for bm-xroot in OIM - Wei will follow-up
    • SRM V2 tag - Brian says nothing to do but watch for the change at the end of the month.
  • this week:

Release installation, validation (Xin)

The issue of validating presence, completeness of releases on sites.
  • last meeting
  • this meeting:

HTTP interface to LFC (Charles)

VDT Bestman, Bestman-Xrootd

  • See BestMan page for more instructions & references
  • last week(s)
    • Have discussed adding an Adler32 checksum to xrootd. Alex is developing something to calculate this on the fly and expects to release it very soon. Want to supply this to the gridftp server.
    • Need to communicate w/ CERN regarding how this will work with FTS.
  • this week
    • There was an ATLAS-OSG-Xrootd meeting this week
    • Support for more than 64 storage nodes will go into the next release.
    • current release in VDT is good enough for production use, supports long space tokens
    •  === Attendees ===
        * Hanushevsky, Andy 
        * Gardner, Rob
        * Kroeger, Wilko
        * Levshina, Tanya
        * Roy, Alain
        * Williams, Sarah
        * Yang, Wei
      ===  Update on OSG plans ===
      Tanya asked about including XRootd clients separately from the XRootd server. Alain is happy to do it but needs help dividing up packages (exactly what files should be in the XRootd client?). Andy confirmed that there is no existing documentation about the client. We concluded that the two commands probably needed in the client are xrd and xrdcp. Wei and Andy will meet today with Doug Benjamin, then let us know exactly what is needed. 
      === Resolution of >65 nodes in Xrootd ===
      Tanya raised a discussion about sites having problems when there are more than 65 nodes. Should we be worried? When will the fix be released? Etc.
      Andy says that the core cause was insufficient testing of the underlying feature in the CMS (it's hard to test with more than 65 nodes, and the largest site didn't move to the CMS). It was hard to debug, due to an unusual configuration. They found the bugs and have fixed them internally. 
      The current release in OSG 1.2.5 has the problem. It will be fixed in the next release, and they can provide patches to the existing release if we need them. Question: do we need the new release when it's ready, or do they need to push us patches more quickly? They will talk to Doug this afternoon to investigate the urgency. 
      Will these be a problem for the Tier-3s? It will affect sites if they install on all worker nodes, less likely to affect sites with a few central data servers. Again, they will talk to Doug about the urgency. 
      === Update on XRootd plans ===
      ALICE is the driving force for new version due to their own internal pressures. XRootd team would like a new version as well since it's been three months since their last release. It's currently in testing with ALICE. They hope to have a new release in the next week or so. 
      There are improvements to cnsd, and they are working with Doug to understand what is really needed. Tanya is looking at problems with cnsd log. 
      Wei will update XRootdFS to support more than 64 nodes. 
      === Update on USATLAS needs and expectations ===
      Rob said ATLAS is trying out current release from OSG 1.2.5. They are deploying XRootd across two sites then accessing it with Panda. After testing the functionality, they will try a scalability test. Andy offered to look at their configuration and offer advice. 
      === Open business ===
      It would be better for Andy if Tier-2 discussions at the OSG All-Hands meeting were not on Monday, because he arrives late on Monday. Alain will let Ruth know.
      Please correct any mistakes I've made. Thanks,

  • Tier3-xrootd meeting:
    Subject: Minutes of Tier 3 and Xrootd meeting
    Attendees: Rik, Doug, Andy, Wilko, Booker, Richard, Wei
    Xrootd client usage for Tier 3:
    Ordinary users will only need xrdcp to move data in and out of a Tier 3
    Xrootd system; more complicated data management can be done via XrootdFS by
    the people who manage the Xrootd space. (A minimal xrdcp wrapper sketch
    follows these minutes.)
    A site probably needs only one instance of XrootdFS/FUSE to do data
    management. Some versions of SL5 have Fuse 2.7.4 packaged and can be used
    without worrying about manual updates of the Fuse kernel module. The new
    XrootdFS can function without the CNS (the OSG default configuration still uses the CNS).
    ATLAS doesn't need separate packages for the xrootd client and xrootd server
    (clients like gridftp and XrootdFS are independent anyway).
    Keeping track of datasets in the Xrootd system:
    ATLAS Tier 3 sites would prefer to have a way to keep track of which datasets
    are in their xrootd system, and whether a dataset at a site is complete or
    incomplete.
    We agree that this function is independent of the Xrootd system and should
    be provided by a 3rd-party tool.
    Knowing which files are hosted by a data server:
    ATLAS Tier 3 sites (and perhaps T2s) would like a way to tell which files are
    in a given data server, for PROOF scheduling and disaster-recovery purposes.
    Xrootd's inventory function provides this info (though not in real time).
    Wei will provide instructions.
    Wei Yang  |  yangw@slac.stanford.edu  |  650-926-3338(O)
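    As referenced in the client-usage notes above: a minimal sketch of wrapping xrdcp for user-level data movement in and out of a Tier 3 Xrootd system. The redirector host and storage paths are placeholders, not a real site configuration.
      #!/usr/bin/env python
      # Minimal sketch of user-level data movement with xrdcp; the redirector
      # host and storage paths below are placeholders, not a real site setup.
      import subprocess

      REDIRECTOR = "xrootd.example.edu"      # placeholder redirector host

      def xrd_url(path):
          return "root://%s/%s" % (REDIRECTOR, path)

      def put(local_file, remote_path):
          """Copy a local file into the Xrootd system."""
          subprocess.check_call(["xrdcp", local_file, xrd_url(remote_path)])

      def get(remote_path, local_file):
          """Copy a file out of the Xrootd system."""
          subprocess.check_call(["xrdcp", xrd_url(remote_path), local_file])

      if __name__ == "__main__":
          put("ntuple.root", "/atlas/local/user/ntuple.root")
          get("/atlas/local/user/ntuple.root", "ntuple_copy.root")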

Local Site Mover

  • Specification: LocalSiteMover
  • code
  • this week if updates:
    • BNL has a lsm-get implemented and they're just finishing implementing test cases [Pedro]

Gratia transfer probes @ Tier 2 sites

Hot topic: SL5 migration

  • last weeks:
    • ADC ops action items, http://indico.cern.ch/getFile.py/access?resId=0&materialId=2&confId=66075
    • Kaushik: we have the green light to do this from ATLAS; however there are some validation jobs still going on and there are some problems to solve. If anyone wants to migrate, go ahead, but not pushing right now. Want to have plenty of time before data comes (means next month or two at the latest). Wait until reprocessing is done - anywhere between 2-7 weeks from now, for both SL5 and OSG 1.2.
    • Consensus: start mid-September for both SL5 and OSG 1.2
    • Shawn: considering rolling part of the AGLT2 infrastructure to SL5 - should they not do this? Probably okay - Michael. Would get some good information. Sites: use this time to sort out migration issues.
    • Milestone: my mid-October all sites should be migrated.
    • What to do about validation? Xin notes that compat libs are needed (an illustrative compat-lib check appears after this list).
    • Consult UpgradeSL5
  • this week
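    Referenced from the validation note above: an illustrative check for the 32-bit compatibility libraries on an SL5 node. The package names are examples only; consult UpgradeSL5 for the authoritative list required by the ATLAS releases.
      #!/usr/bin/env python
      # Illustrative check for 32-bit compatibility libraries on an SL5 node.
      # The package names are examples only; consult UpgradeSL5 for the
      # authoritative list required by the ATLAS releases.
      import os
      import subprocess

      CANDIDATE_PACKAGES = [
          "compat-libstdc++-33",     # example: old-ABI C++ runtime
          "compat-libstdc++-296",    # example: even older C++ runtime
      ]

      def installed(pkg):
          """True if an rpm of this name is installed."""
          with open(os.devnull, "w") as null:
              return subprocess.call(["rpm", "-q", pkg], stdout=null, stderr=null) == 0

      if __name__ == "__main__":
          for pkg in CANDIDATE_PACKAGES:
              print("%-24s %s" % (pkg, "installed" if installed(pkg) else "MISSING"))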

WLCG Capacity Reporting (Karthik)

  • last discussion(s):
    • Need to schedule call with Brian
  • this meeting
    • Notes from meeting with Brian:
       Meeting with Karthik and Rob G; Jan 8, 2010.
      Starter questions:
      - Q: What are the current requirements for the report and do we have any documents describing this in detail?
        - A: Yes, we have the installed capacity document.
        - https://osg-docdb.opensciencegrid.org:440/cgi-bin/ShowDocument?docid=826.
        - The most relevant one is the "requirements document".
        - We can probably update the operational document as needed.
      - Q: Are the upper and lower limits still relevant or do we go with a single number to compare against?
        - A: For future numbers - specifically HS06 - we'll just go with a single number as CMS seems to be less concerned with the 2-number usage case.
      - Q: What are the software requirements from the site's point of view and who should be involved?
        - A: The software requirements are to keep the kSI2K, HS06, and # of TB at their site in OIM consistent with their MoU agreements.
        - A: From the GIP point of view, an OSG CE 1.2.x is required.  Keeping correct GIP numbers is nice, but not necessary for WLCG reporting.
      - What is OSG's responsibility in this?
        - A: The OSG provides the technology so WLCG sites can publish installed capacity.
        - A: The OSG negotiates on behalf of the USLHC sites with the WLCG on how to transmit the information for WLCG reports.
        - A: The OSG provides an operational infrastructure so WLCG can utilize its solution.
      - Who are the stakeholders and how does this report impact their bottom line?
        - The stakeholders are USLHC VOs and their respective sites.
        - This affects the WLCG reports (as of yet, unseen) pertaining to their site; this report may make it back to US funding agencies
      - Does the WLCG Collaboration Board ever discuss these numbers?  B: not that I know of.
      - R: Can we get a summary report showing this data so we get this input correct before WLCG asks for it?  B: Yes, let's do this
      - R: Still a bit confused on what numbers are needed for WLCG?
        - B: Sum for each site: kSI2K, HS06, # of TB.
      - B: we'll need a concise table for the most important, management-level numbers
        - Karthik will be implementing this table and adding it to the top of his current report.
        - The current table will remain, but be of less interest to most folks
        - We should try to keep on sites to have consistent numbers, but this will be less important.
      - R: Will be sending off link for ATLAS HEPSPEC 06
        - http://www.usatlas.bnl.gov/twiki/bin/view/Admins/CapacitySummary.html
      Sample table format of installed capacity numbers going to WLCG (an illustrative aggregation sketch follows below):
      Site              kSI2K   HS06   TB Installed
      MWT2_UC            1000    400            300
      MWT2_IU            1500    500            200
      Total: US-MWT2     2500    900            500
      USATLAS Total:     1234   5678          91011


  • last week
  • this week
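    As referenced above, an illustrative aggregation of per-resource kSI2K/HS06/TB numbers into per-site and US ATLAS totals, mirroring the sample table (the numbers are the sample placeholders, not real capacities):
      #!/usr/bin/env python
      # Illustrative aggregation of per-resource installed-capacity numbers
      # (kSI2K, HS06, TB) into per-site and US ATLAS totals, mirroring the
      # sample table above.  Values are the sample placeholders, not real data.

      # (site group, resource name, kSI2K, HS06, TB installed)
      RESOURCES = [
          ("US-MWT2", "MWT2_UC", 1000, 400, 300),
          ("US-MWT2", "MWT2_IU", 1500, 500, 200),
      ]

      def summarize(resources):
          """Sum kSI2K/HS06/TB per site group."""
          per_site = {}
          for group, name, ksi2k, hs06, tb in resources:
              totals = per_site.setdefault(group, [0, 0, 0])
              totals[0] += ksi2k
              totals[1] += hs06
              totals[2] += tb
          return per_site

      if __name__ == "__main__":
          row = "%-16s %8d %8d %12d"
          print("%-16s %8s %8s %12s" % ("Site", "kSI2K", "HS06", "TB Installed"))
          for group, name, ksi2k, hs06, tb in RESOURCES:
              print(row % (name, ksi2k, hs06, tb))
          grand = [0, 0, 0]
          for group, totals in sorted(summarize(RESOURCES).items()):
              print(row % tuple(["Total: " + group] + totals))
              grand = [g + t for g, t in zip(grand, totals)]
          print(row % tuple(["USATLAS Total"] + grand))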

-- RobertGardner - 12 Jan 2010
