r3 - 28 Jan 2010 - 18:58:51 - MarkSosebee



Minutes of the Facilities Integration Program meeting, January 27, 2010
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • (605) 715-4900, Access code: 735188; Dial *6 to mute/un-mute.


  • Meeting attendees: Shawn, Tom, Wei, Rik, Jason, Saul, John, Nurcan, Patrick, Kaushik, Michael, John de Stefano, Rob
  • Apologies: Mark

Integration program update (Rob, Michael)

Tier 3 Integration Program (Doug Benjamin & Rik Yoshida)

  • last week(s):
    • there is a vm cluster at ANL which is in a good configuration to use as a reference pattern for small sites. Now testing user functionality - dq2, pathena, condor-submits; need to test instructions
    • More hardware arrived at ANL - will be setting up a real Tier 3 model cluster
    • Progress on Panda; Pacballs for SL5
    • Panda - Torre working on it w/ Tadashi and others
    • Pacballs - new releases in native SLC5; old releases are still in SLC4, older version of gcc - will still need compat libs.
    • Tier3 data deletion - new program from Hiro
    • Still no definitive answer for gridftp alone versus srm+gridftp (Hiro will check)
    • Frontier+Squid - is there anything Tier 3 specific? Yes - SE and HOTDISK
    • Panda running on Duke Tier 3. Pilots are working fine. Secure https based on a grid cert is used for communication with a panda server. Checking based on worker-node names for a queue. Working on getting real jobs working. Local submission via condor, not condor-G, to get the pilots there.
    • ADC meeting - want to mandate functional tests for all sites including Tier 3s. Agreed to this, appropriate to Tier 3 scale, and confined to a particular space token. Will need to deal with blacklisting issues.
      • Hiro is thinking of a proposal for getting data to Tier 3s, user-initiated, something less than the current Tier 2-like infrastructure. Needs further discussion.
      • Can Tier 3's also publish to BDII, for functional testing purposes?
      • Need a separate discussion on these topics. Friday 11 EST.
    • Working on reference Tier 3 site. Studying basic management using xrootd. Directory creation, file removal, etc. Need xrootdFS.
    • Tufts requesting GPFS access to BU's Tier 2. Should discourage this close coupling between the Tiers.
    • Tier 3 meeting at CERN - draft agenda from Massimo.
  • this week: (Rik)
    • Benchmarking I/O performance of virtual machines - could be fed into the joint US CMS / US ATLAS meeting at the OSG All-Hands
    • Report on ATLAS Tier 3 workshop
    • 70 registrants, IT auditorium completely full, > 30 EVO participants
    • Reports from all regions on status of Tier 3 efforts. Most are fully grid-compliant type T3, or co-located w/ T2,T1
    • Off-grid T3 are mostly US, Japan, Canada
    • Working groups and a program of work have been formed for 3 months - a prototype phase - covering distributed storage (e.g. Lustre, xrootd, multi-site technology)
    • T3 support - desire to use something like HC
    • PROOF - mentioned but few groups are using actively. What are the limitations, issues?
    • Data distribution for T3 - as we have discussed several times; dq2-like tools being contemplated for this
    • CVMFS - to distribute / cache ATLAS releases
    • SW installation from Canada (Asoka) - looks promising
    • Try to move in the direction of creating ATLAS-wide standards, working with ADC
    • Q: what about the US ATLAS position on data replication? Then dq2-get via FTS w/ queueing; no cataloging, thus not managed. What about policy?
    • There is also the question of sharing data across T3s.
    • Can FTS 2.2 be modified to support a gridftp-only endpoint? Discussed at the WLCG MB - expect an answer in ~2 weeks.

Operations overview: Production (Kaushik)

  • Reference:
  • last meeting(s):
    • Reprocessing completed 12/31/09 - see summary from Alexei from today's shift summary
    • NET2 job failures due to missing input files - they were being centrally removed from PRODDISK(!); missing files replaced, production resumed.
    • SLAC job failures - pilot failing a curl command - update from Paul: not sure why, will put in error handling.
    • Large number of jobs submitted by Pavel - stuck in waiting but not sure why. Are there dependencies - an input dataset missing?
  • this week:
    • Ran out of jobs over the weekend
    • Now considering backfilling w/ regional production; polling US physics groups
    • There was a Panda DB incident, cause unknown, affecting analysis users; Torre intervened... pilots seem to be flowing again now.

Data Management & Storage Validation (Kaushik)

Shifters report (Mark)

  • Reference
  • last meeting:
    Yuri's summary from the weekly ADCoS meeting:
    1)  1/13: AGLT2 -- jobs failing with the error "530 Authorization Service failed: gplazma.authz.AuthorizationException: authRequestID 520177631 Message to gPlazma timed out."  From Shawn:
    GUMS is OK.  The last log entry for gPlazma was 7:43 AM today (which may be OK).   In any case I restarted the gPlazma service on head01.
    2)  1/14 - 1/15: Maintenance outage at MWT2_UC completed.  After the site was set back to 'on-line' some LFC inconsistency issues were observed.  Problem resolved.
    3)  1/14: Maintenance outage at SLAC completed.
    4)  1/14: Status of the new (test) site Nebraska-Lincoln-red, from Xin:
    Nebraska doesn't have any atlas releases installed yet, all installation jobs failed (expired) there. I think Alden is checking the queue configuration to see why pilots don't pick up the real job.  Please keep this site in "test" state, as it's certainly not production ready.
    5)  1/15: AGLT2 -- failed jobs with errors like:
    15 Jan 2010 16:40:03| !!WARNING!!2999!! dccp failed:
    15 Jan 2010 16:40:03| !!WARNING!!2999!! dccp get was timed out after 18000 seconds
    From Bob:
    The issue is fixed now with dCache restarts on our admin nodes, and a few servers.
    Test jobs completed successfully, site set back to 'on-line'.  RT 15072, eLog 8658.
    6)  1/17: File transfer errors at BNL-OSG2_DATADISK & BNL-OSG2_DATATAPE -- from Michael:
    Following a failover of a network switch earlier today 2 (out of 7) gridftp doors lost their WAN connectivity and were no longer accessible from outside BNL. This has caused a few transfer failures. These 2 doors were taken out of operation at 9:40 am EST. This has resolved the issue.
    eLog 8707.
    7)  1/17: NET2 -- proxy used for dq2 site services expired (point 1 shifter noticed unprocessed subscriptions for the site).  Issue resolved.  eLog 8708.
    8)  1/18: File transfer errors with BNL as the source -- the files are on tape and hence fail the first attempt while they're being staged to disk.
    From Michael:  These files are part of the BNLPANDA instance. Transfers time out due to the time required to stage them from tape. Retry will succeed.  To circumvent these intermittent failures we will investigate/implement prestaging.
    From Hiro:  TOA for BNLPANDA entry will be changed today (shortly) to call srmbringonline, which should eliminate these errors.
    eLog 8742.
    9)  1/19: SLAC is offline due to a major power outage at the site.  Ongoing.
    Follow-ups from earlier reports:
    i)  12/29:  SLAC -- Failed jobs with the pilot error:
    !!FAILED!!1999!! Exception caught in pilot: [Errno 5] Input/output error: \'curl.config\'.  Update from Paul (1/6):
    For some reason, the file system did not allow the curl config file to be created so the open command threw an exception. I will improve the error handling in a later pilot version (the error occurred in the _Curl class responsible for the getJob operation used by the pilot).
    ii)  12/18 (ongoing): IU_OSG -- site was originally set off-line due to pilot failures that were blocking jobs from other VO's -- nothing unusual seen on the pilot submit host -- tried some test jobs yesterday (12/22) -- 9 of 10 failed with the error " Pilot has decided to kill looping job."  
    Jobs were seemingly "stuck" on the WN's -- problem still under investigation.  Was this issue resolved??
    iii)  1/11: From Paul, discussing possible changes to the pilot to deal with the issue of orphaned processes left behind following athena crashes (seen by John at BU and Horst at OU):
    Since the parent pid is set to 1 (a result of athena core dumping?), the pilot doesn't identify it as belonging to its chain of processes. I will see if the pilot can identify it by e.g. also looking at the pgid. [Added to the todo list].
    iv) Reminder: analysis jamboree at BNL 2/9 - 2/12.  
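The pgid approach in item iii) can be sketched as follows. This is a hypothetical Python illustration of the idea, not the pilot's actual code; `belongs_to_pilot` is a made-up helper name. A process orphaned by an athena crash gets ppid 1, but its process group id still ties it to the pilot's chain.

```python
import os
import subprocess

def belongs_to_pilot(pid, pilot_pgid):
    """True if pid is in the pilot's process group, even if reparented to init."""
    try:
        return os.getpgid(pid) == pilot_pgid
    except ProcessLookupError:
        return False  # the process no longer exists

pilot_pgid = os.getpgid(0)                 # this process stands in for the pilot
child = subprocess.Popen(["sleep", "5"])   # stands in for an athena subprocess
# Even if the child's ppid were reset to 1 after its parent crashed,
# its pgid would still match the pilot's, so it can be found and cleaned up.
print(belongs_to_pilot(child.pid, pilot_pgid))  # -> True
child.terminate()
child.wait()
```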

  • this meeting:
    Yuri's summary from the weekly ADCoS meeting:
    1)  1/20: File transfer errors at UTA_SWT2 related to an issue with one of the xrootd modules that was installed during an upgrade.  From Patrick:
    This is a problem related to one component in the newly installed version of Xrootd on the cluster.  We have reverted to an older version of the component and the cluster has been stable ever since.  We are awaiting an updated version of the problematic component from the Xrootd developers, 
    but can continue to operate in the current configuration.  RT 15099.
    2)  1/21: SLAC -- power restored following the outage on 1/19.
    3)  1/21 - 1/23: BU -- Jobs were failing due to a problem with atlas release 15.6.1 at the site.  From Xin:
    15.6.1 and caches are re-installed at BU, validation run finished o.k.  ggus 54825, RT 15107.
    4)  1/22: Failed jobs at AGLT2 with errors like "Failed to get LFC replicas: -1 (lfc_getreplicas failed with: 2702, Bad credentials)."  From Shawn:
    We had one of our 3 AFS servers drop 3 of four disks last night.   We fixed it this morning and AFS was functional by around 9 PM Eastern.
    This could be related (but I'm not sure exactly how yet).   eLog 8848, Savannah 61769.
    5)  1/22: John at BU noticed an issue with dq2 transfers failing with LFC errors:
    LFC exception [Cannot connect to LFC [lfc://lfc.usatlas.bnl.gov:/grid/atlas]]
    This was due to an expired vo.racf.bnl.gov.30105.pem certificate.  Update available:
    6)  1/23: Transfer errors at AGLT2_HOTDISK -- due to cert / proxy issue in 5) above.  Resolved.  ggus 54938, RT 15130.
    7)  1/23 - 1/27: U.S. cloud has been well below capacity with MC production.  Need new tasks to be assigned.
    8)  1/24: Failed jobs (missing input file errors) at BNL due to a problematic storage server.  System back on-line.   eLog 8907.
    9)  1/23 - 1/26: File transfer errors at BNL-OSG2_MCDISK due to very old (~2 years) missing files in the storage.  From Hiro:
    I have no idea why these old files were physically missing. I am guessing that there were some problem with (very old) DQ2. But, this dataset
    should be in LYON since that is the original site?  I removed the bad entries in the BNL LFC, which are very easy to find since the BNL LFC contains a PNFSID/storage unique-id entry for each file. All of these files are showing "-1", which means they were never there.  ggus 54937, RT 15129.
    Follow-ups from earlier reports:
    i)  12/29:  SLAC -- Failed jobs with the pilot error:
    !!FAILED!!1999!! Exception caught in pilot: [Errno 5] Input/output error: \'curl.config\'.  Update from Paul (1/6):
    For some reason, the file system did not allow the curl config file to be created so the open command threw an exception. I will improve the error handling in a later pilot version (the error occurred in the _Curl class responsible for the getJob operation used by the pilot).
    ii)  1/11: From Paul, discussing possible changes to the pilot to deal with the issue of orphaned processes left behind following athena crashes (seen by John at BU and Horst at OU):
    Since the parent pid is set to 1 (a result of athena core dumping?), the pilot doesn't identify it as belonging to its chain of processes. I will see if the pilot can identify it by e.g. also looking at the pgid. [Added to the todo list].
    iii) Reminder: analysis jamboree at BNL 2/9 - 2/12. 
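The LFC check Hiro describes in item 9) can be sketched roughly as below. The entry format and names here are hypothetical; the point is that a storage unique-id of "-1" marks a file that was registered in the catalog but never actually written to storage.

```python
# A minimal sketch of finding "phantom" catalog entries: files whose
# PNFSID/storage unique-id is "-1", i.e. registered but never on disk.
def find_phantom_entries(catalog_entries):
    """Return LFNs whose storage id is '-1' (registered, never written)."""
    return [lfn for lfn, pnfsid in catalog_entries if pnfsid == "-1"]

entries = [
    ("mc08/file1.root", "0000A1B2"),   # healthy: has a real PNFSID
    ("mc08/file2.root", "-1"),         # phantom: catalog entry, no file
]
print(find_phantom_entries(entries))   # -> ['mc08/file2.root']
```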

Analysis queues (Nurcan)

DDM Operations (Hiro)

  • Reference
  • last meeting(s):
    • The new tool from Hiro can potentially delete data at T3s
    • We need to formulate a sensible dataset deletion policy for T3s with regards to obsolete datasets
    • Would this create a mess in the central catalogs?
    • Action item: Rik, Doug, Hiro, Rob, Charles, (Michael & Jim as observers) to formulate a position for US ATLAS - done
    • Central PRODDISK deletion, as noted above, stopped; resolved.
    • GROUPDISK needs a tweak in each DQ2 SS. Hiro will send out an email.
    • All is well.
    • FTS 2.2 still not yet available at WLCG level. Still not considered stable. Still need to wait for more information, ~ 10 days, before making a decision on moving to FTS 2.2 and starting site consolidation. NET2 would be the first site to move - discussions started.
    • Use Friday's meeting to address: Action item: Rik, Doug, Hiro, Rob, Charles, (Michael & Jim as observers) to formulate a position for US ATLAS
    • Doug will try out Hiro's new tool for data deletion
    • Hiro: will gridftp stand-alone work for Tier 3's? Checksums w/ gridftp and xrootd still not tested.
  • this meeting:
    • Postpone discussion of proposal on renaming sites for consistency
      I am wondering if we can agree on a consistent site naming convention
      for the various services in the ATLAS production system used in the US.
      There seems to be confusion among people/shifters outside of the US in
      identifying the actual responsible site from the various names used in
      the US production services/queues.  In fact, some of them have openly
      commented on the frustration with this difficulty in the computing log.
      Hence, I am wondering if we can/should put in the effort to use
      consistent naming conventions for the site names used in the various
      systems.  Below, I have identified some of the systems where consistent
      naming would help users.
      1.  PANDA site name
      2.  DDM site name
      3.  BDII site name
      At least, since these three names appear at the front of the major ATLAS
      computing monitoring systems, good consistent naming for each site in
      these three separate systems should help ease the problems encountered
      by other people.  So, is it possible to change any of the names?  (I
      know some of them are a pain to change.  If needed, I can make a table
      of the names for each site used in these three systems.)
    • FTS 2.2 coming soon - an update will be required 2 weeks after certification, which is close to when we can consolidate site services
    • Be prepared: consolidation of DQ2 site services from the Tier 2s will follow in the week after the FTS upgrade
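Hiro's naming proposal could amount to something like the following lookup table, letting a shifter resolve a PanDA, DDM, or BDII name back to one canonical site. All name strings below are made-up placeholders, not the real mappings.

```python
# Hypothetical canonical-site table: one record per site carrying its
# PanDA, DDM, and BDII names (example values only).
SITES = {
    "AGLT2":   {"panda": "AGLT2",   "ddm": "AGLT2_DATADISK",   "bdii": "AGLT2"},
    "MWT2_UC": {"panda": "MWT2_UC", "ddm": "MWT2_UC_DATADISK", "bdii": "MWT2-UC"},
}

def resolve_site(name):
    """Map a PanDA, DDM, or BDII name back to the canonical site name."""
    for site, names in SITES.items():
        if name == site or name in names.values():
            return site
    return None

print(resolve_site("MWT2_UC_DATADISK"))  # -> MWT2_UC
```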

Conditions data access from Tier 2, Tier 3 (Fred, John DeStefano)

  • Reference
  • last week
    • Developments in frontier servlet
    • New squid rpm for t1, t2's
    • Lots of traffic from HU, sustained; local users most likely (Saul)
    • Slides for post-mortem next week - contributions welcomed; include Tier 1 stress test results
    • Discussion regarding Tier 3 squid install recommendation
    • Douglas Smith would like to run squid test at US Tier 2's via Hammer Cloud. Is there a problem routing pilot jobs from HC via CERN to OSG sites - has this been resolved? Are the CERN pilots using the most up to date code? Torre will investigate.
  • this week
    • New frontier client released, included in new LCG release - 2.7.12;
    • Now in WLCG Tier 1 coordination effort
    • Is there a prob w/ BU - Douglas ran into probs w/ HC test (needs follow-up offline)
    • Waiting for new squid packages for sites - has patch, but wait for rpms (been waiting for a while)

Throughput Initiative (Shawn)

  • NetworkMonitoring
  • https://www.usatlas.bnl.gov/dq2/throughput
  • last week(s):
    • Attending: Shawn, Jason, Aaron, Sarah, Charles, Andy, Horst, Zafar, Yee, Karthik, Hiro, Doug
    • 1) perfSONAR discussion and status:
      • Issue at IU about the firewall partially resolved. URL for firewall ports: http://code.google.com/p/perfsonar-ps/wiki/NPToolkit (scroll to "Q: Can I Use a Firewall?").
      • Aaron reported a stopped throughput service at UC. Sarah reports that the software.
      • AGLT2 running all services OK. MSU has issues with "runaway" processes, which are being worked on. MWT2 has some other issues with some stopped services.
      • Spreadsheet was used to verify and add in missing sites. SLAC has a kernel issue. OU has many services in "Not Running" mode for both throughput and latency nodes.
      • Next release scheduled for the end of this month: NTP security updated, new kernel, cleaning up OWAMP files. Need beta testers in about 2 weeks.
    • 2) Tests and milestones for this quarter:
      • Throughput tests (from Hiro) to Tier-3's have recently been failing because some "source" datasets have been cleaned up. Action item: Doug will provide a list of Tier-3 affinities. Action item: Hiro to check source data files and restage if needed.
      • FTS version at BNL still not "production" for the checksum version. Maybe soon (see BNL site report below). Action item: Hiro is to create a transaction test targeting a rate of about 8K/hour small files; details to be determined by Hiro.
      • Next issue: revisit prior milestones and redo/reverify? Is it worthwhile to redo load-tests (measuring site "burst" capability) quarterly? Concerns are the amount of effort required versus the value of the measurements; opinions on both sides. Such tests can help find otherwise undetected bottlenecks, but could be disruptive to our production systems (historically, though, with minimal adverse incidents).
      • Summary: try to do such testing quarterly, but realize circumstances may not permit doing that now that the LHC is running. A minimum of 2/year of milestone revalidations should be done. Eventually include Tier-3 milestones when appropriate.
    • 3) Site reports:
      • BNL - FTS version with checksum is not yet validated by WLCG but is released by gLite. BNL will decide soon on what it will do. Once the checksum version is ready, want to verify throughput and transaction results in our infrastructure.
      • SLAC/WT2 - Volunteered for perfSONAR beta release testing. Hopes to have perfSONAR running once the kernel is OK.
      • OU - Nothing else to report; another month for the 10GE work, after which will revisit load-tests.
      • MWT2_UC/IU - The IU firewall has the software update in place to allow perfSONAR to work correctly. Next step is getting the list of ports needed by perfSONAR in place. Also some interest in redoing load-tests after inter-site testing.
      • AGLT2 - Interested in redoing load-testing now that changes last fall have improved throughput. Seeing some Myricom issues in 'dmesg' but they may not be impacting transfers.
    • AOB? Next meeting is planned for January 26th (meetings are now every two weeks). Please send along any corrections or additions to the list.
  • this week:
    • Focus was on perfSONAR - new release to be out this Friday; Jason: no show-stoppers. All sites should upgrade next week.
    • Fixes bugs identified at our sites - hope this release is resilient enough for T3 recommendations
    • Next meeting Feb 9
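As a back-of-envelope check on the transaction test discussed in the notes above (~8K small files per hour), the target rate implies one submission roughly every 0.45 seconds. The rate is from the minutes; the pacing helper below is a hypothetical illustration, not the actual test harness.

```python
# Target rate from the minutes; everything else is an illustrative sketch.
TARGET_FILES_PER_HOUR = 8000
interval_s = 3600 / TARGET_FILES_PER_HOUR  # seconds between submissions

def files_submitted(duration_s, interval):
    """Files a steady pacer submits in duration_s seconds, one per interval."""
    return round(duration_s / interval)

print(interval_s)                         # -> 0.45
print(files_submitted(1800, interval_s))  # -> 4000
```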

Site news and issues (all sites)

  • T1:
    • last week(s): A couple of scheduled interventions - updating storage servers, adding 700 TB completed. New storage appliance for job submission infrastructure - now have Blue Arc. Millions of sub dirs possible. Wicked fast.
    • this week: working on LAN updates, another Force10 added for redundancy. Ordering another 200 worker nodes; R410s most likely. Note Intel has announced 32 nm based CPUs out in March/April, which will probably have a big effect on pricing. Working on a 2 PB procurement.

  • AGLT2:
    • last week: all wn's running SL5; have been running OSG 1.2 for a while. Transition to new network address space. Will install new blade chassis at UM. 1.9.5-11 dCache upgrade. Running well; all systems up including blades. 2800 job slots. Downtime Feb 4 to complete remaining servers, a minor update to dCache, and a few other issues (BIOS, firmware). Other servers and systems will be updated without requiring downtime. SL4 completely eliminated. HS06 measurements - may need to update the capacity spreadsheet.
    • this week: Working on preparing for downtime, Feb 4. Migrating to new hardware, much of it live.

  • NET2:
    • last week(s): Recovered from the data deletion problem. Shutdown in the near future - in a week or so. LFC to be upgraded; all wn's already at RH5. GK and interactive node to be upgraded to RH5 and new OSG 1.2. Facility upgrade: getting two racks of Dell Nehalems at HU Odyssey, one rack of storage at BU.
    • this week: Upgrading gatekeeper tomorrow; two racks of worker nodes arrived from Dell - operational soon. Finding some probs w/ HC tests failing, pilots failing quickly, cause not known.

  • MWT2:
    • last week(s): Downtime tomorrow to complete SL5 update for worker nodes;
    • this week: _IU and _UC in process of being upgraded to SL 5.3. Gatekeeper updated to OSG 1.2.5.

  • SWT2 (UTA):
    • last week: still working on the topics above. Note: Dell R610 servers sometimes dropping the NIC - a kernel driver problem (Broadcom NIC, BNX2 driver) in the default SL 5.3; Justin has a solution. Next: will do the CPB upgrade. 200 TB of storage waiting to be installed at the CPB cluster; ordered another 400 TB, delivery expected mid-Feb. Installing latest bm-xrootd; expect another downtime.
    • this week: OSG 1.2.5, all components to SL 5.4, 200 TB storage - probably a couple of days. xrootd very stable.

  • SWT2 (OU):
    • last week: Equipment delivered. 80 TB SATA drive order for the DDN. Final quote for additional nodes expected in the next week. New compute node order in preparation. Awaiting storage arrival.
    • this week: Started getting more equip from storage order; continue to wait for hardware.

  • WT2:
    • last week(s): RH5 done. OSG 1.2 finished. WLCG-client installation, validation. Two outages: this PM, failing fan on a storage node; probably next week as well, for storage system rearrangements. Also will make scratch and localuser groups available. Will probably delete some old releases - can we delete release 13? Need sufficient and timely information from ADC providing guidance; for now, no, we cannot delete release 13. Will put in a new NFS server for these releases. Down today to move storage servers to a new rack. Extra-long space tokens.
    • this week: A problem with a storage node - will vacate the data and replace it. In the latest xrootd, the namespace agent has problems under heavy load - losing events. At SLAC, using the old namespace agent.

Carryover issues (any updates?)

OSG 1.2 deployment (Rob, Xin)

  • last week:
    • BNL updated
  • this week:
    • Any new updates?

OIM issue (Xin)

  • last week:
    • Registration information change for bm-xroot in OIM - Wei will follow-up
    • SRM V2 tag - Brian says nothing to do but watch for the change at the end of the month.
  • this week:

Release installation, validation (Xin)

The issue of validating presence, completeness of releases on sites.
  • last meeting
  • this meeting:

HTTP interface to LFC (Charles)

VDT Bestman, Bestman-Xrootd

  • See BestMan page for more instructions & references
  • last week(s)
    • Have discussed adding Adler32 checksum to xrootd. Alex developing something to calculate this on the fly. Expects to release this very soon. Want to supply this to the gridftp server.
    • Need to communicate w/ CERN regarding how this will work with FTS.
    • There was an ATLAS-OSG-Xrootd meeting this week
    • Support for > 64 storage nodes will go into the next release.
    • current release in VDT is good enough for production use, supports long space tokens
    •  === Attendees ===
        * Hanushevsky, Andy 
        * Gardner, Rob
        * Kroeger, Wilko
        * Levshina, Tanya
        * Roy, Alain
        * Williams, Sarah
        * Yang, Wei
      ===  Update on OSG plans ===
      Tanya asked about including XRootd clients separately from the XRootd server. Alain is happy to do it but needs help dividing up packages (exactly what files should be in the XRootd client?). Andy confirmed that there is no existing documentation about the client. We concluded that probably the two commands needed in the client are xrd and xrdcp. Wei and Andy will meet today with Doug Benjamin, then let us know exactly what is needed. 
      === Resolution of >65 nodes in Xrootd ===
      From Tanya, there was a discussion about sites having problems when there are more than 65 nodes. Should we be worried? When will the fix be released? Etc.
      Andy says that the core cause was insufficient testing of the underlying feature in the CMS (xrootd's cluster management service; it's hard to test with more than 65 nodes, and the largest site didn't move to the CMS). It was hard to debug, due to an unusual configuration. They found the bugs and have fixed them internally. 
      Current release in OSG 1.2.5 has the problem. It will be fixed in the next release; they can provide patches to the existing release if we need them. Question: do we need the new release when it's ready, or do they need to push us patches more quickly? They will talk to Doug this afternoon to investigate the urgency. 
      Will these be a problem for the Tier-3s? It will affect sites if they install on all worker nodes, less likely to affect sites with a few central data servers. Again, they will talk to Doug about the urgency. 
      === Update on XRootd plans ===
      ALICE is the driving force for new version due to their own internal pressures. XRootd team would like a new version as well since it's been three months since their last release. It's currently in testing with ALICE. They hope to have a new release in the next week or so. 
      There are improvements to cnsd, and they are working with Doug to understand what is really needed. Tanya is looking at problems with cnsd log. 
      Wei will update XRootdFS to support more than 64 nodes. 
      === Update on USATLAS needs and expectations ====
      Rob said ATLAS is trying out current release from OSG 1.2.5. They are deploying XRootd across two sites then accessing it with Panda. After testing the functionality, they will try a scalability test. Andy offered to look at their configuration and offer advice. 
      === Open business ===
      It would be better for Andy if Tier-2 discussions at the OSG All-Hands meeting were not on Monday, because he arrives late on Monday. Alain will let Ruth know.
      Please correct any mistakes I've made. Thanks,
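The on-the-fly Adler32 computation discussed above can be done incrementally with zlib's running checksum, so a server (e.g. a gridftp door) could fold chunks in as data streams through rather than re-reading the file. This is a generic sketch of the standard chunked-update pattern, not Alex's actual implementation.

```python
import zlib

def streaming_adler32(chunks):
    """Fold an iterable of byte chunks into one Adler32 value."""
    value = 1  # zlib's documented Adler32 starting value
    for chunk in chunks:
        value = zlib.adler32(chunk, value)
    return value & 0xFFFFFFFF

data = b"some file contents"
whole = zlib.adler32(data) & 0xFFFFFFFF
piecewise = streaming_adler32([data[:7], data[7:]])
print(whole == piecewise)  # -> True
```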

  • Tier3-xrootd meeting:
    Subject: Minutes of Tier 3 and Xrootd meeting
    Attendees: Rik, Doug, Andy, Wilko, Booker, Richard, Wei
    Xrootd client usage for Tier 3:
    Ordinary users will only need xrdcp to move data in and out of a Tier 3
    Xrootd system. More complicated data management can be done via XrootdFS by
    people who manage the Xrootd space.
    A site probably only needs one instance of XrootdFS/Fuse to do data
    management. Some versions of SL5 have Fuse 2.7.4 packaged in and can be
    used without worrying about manually updating the Fuse kernel module. New
    XrootdFS can function without CNS (the OSG default configuration still uses CNS).
    ATLAS doesn't need separate packages for xrootd client and xrootd server.
    (clients like gridftp and xrootdfs are independent anyway).
    Keep track of datasets in Xrootd system:
    ATLAS Tier 3 sites prefer to have a way to keep track of what datasets are
    in their xrootd system, and whether a dataset at a site is a complete or
    incomplete set.
    We agree that this function is independent of the Xrootd system and should
    be provided by a 3rd-party tool.
    Know what files are hosted by a data server:
    ATLAS Tier 3 site (and T2 perhaps) would like a way to tell what files are
    in a given data server, for proof scheduling and disaster recovery purposes.
    Xrootd's inventory function provides this info (though not in real time).
    Wei will provide instructions.
    Wei Yang  |  yangw@slac.stanford.edu  |  650-926-3338(O)
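The XrootdFS point above - that space management becomes plain POSIX operations once the Fuse mount is up - can be illustrated as follows. A temporary directory stands in here for the (hypothetical) XrootdFS mount point; on a real Tier 3 the same calls would run against the mounted xrootd space.

```python
import os
import tempfile

def tidy_space(mount_point, victim):
    """Create a dataset directory and remove a stale file via plain POSIX calls."""
    os.makedirs(os.path.join(mount_point, "atlas", "datasets"), exist_ok=True)
    path = os.path.join(mount_point, victim)
    if os.path.exists(path):
        os.remove(path)
    return sorted(os.listdir(mount_point))

mnt = tempfile.mkdtemp()  # stand-in for the XrootdFS mount point
open(os.path.join(mnt, "stale.root"), "w").close()
print(tidy_space(mnt, "stale.root"))  # -> ['atlas']
```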

  • this week

Local Site Mover

  • Specification: LocalSiteMover
  • code
  • this week if updates:
    • BNL has a lsm-get implemented and they're just finishing implementing test cases [Pedro]

Gratia transfer probes @ Tier 2 sites

Hot topic: SL5 migration

  • last weeks:
    • ADC ops action items, http://indico.cern.ch/getFile.py/access?resId=0&materialId=2&confId=66075
    • Kaushik: we have the green light to do this from ATLAS; however there are some validation jobs still going on and there are some problems to solve. If anyone wants to migrate, go ahead, but not pushing right now. Want to have plenty of time before data comes (means next month or two at the latest). Wait until reprocessing is done - anywhere between 2-7 weeks from now, for both SL5 and OSG 1.2.
    • Consensus: start mid-September for both SL5 and OSG 1.2
    • Shawn: considering rolling part of AGLT2 infrastructure to SL5 - should they not do this? Probably okay - Michael. Would get some good information. Sites: use this time to sort out migration issues.
    • Milestone: by mid-October all sites should be migrated.
    • What to do about validation? Xin notes that compat libs are needed
    • Consult UpgradeSL5
  • this week

WLCG Capacity Reporting (Karthik)

  • last discussion(s):
    • Need to schedule call with Brian
    • Notes from meeting with Brian:
       Meeting with Karthik and Rob G; Jan 8, 2010.
      Starter questions:
        - Q: What are the current requirements for the report and do we have any documents describing this in detail?
        - A: Yes, we have the installed capacity document.
        - https://osg-docdb.opensciencegrid.org:440/cgi-bin/ShowDocument?docid=826.
        - The most relevant one is the "requirements document".
        - We can probably update the operational document as needed.
      - Q: Are the upper and lower limits still relevant or do we go with a single number to compare against?
        - A: For future numbers - specifically HS06 - we'll just go with a single number as CMS seems to be less concerned with the 2-number usage case.
      - Q: What are the software requirements from the site's point of view and who should be involved?
        - A: The software requirements are to keep the kSI2K, HS06, and # of TB at their site in OIM consistent with their MoU agreements.
        - A: From the GIP point of view, an OSG CE 1.2.x is required.  Keeping correct GIP numbers is nice, but not necessary for WLCG reporting.
      - What is OSG's responsibility in this?
        - A: The OSG provides the technology so WLCG sites can publish installed capacity.
        - A: The OSG negotiates on behalf of the USLHC sites with the WLCG on how to transmit the information for WLCG reports.
        - A: The OSG provides an operational infrastructure so WLCG can utilize its solution.
      - Who are the stakeholders and how does this report impact their bottom line?
        - The stakeholders are USLHC VOs and their respective sites.
        - This affects the WLCG reports (as of yet, unseen) pertaining to their site; this report may make it back to US funding agencies
      - Does the WLCG Collaboration Board ever discuss these numbers?  B: not that I know of.
      - R: Can we get a summary report showing this data so we get this input correct before WLCG asks for it?  B: Yes, let's do this
      - R: Still a bit confused on what numbers are needed for WLCG?
        - B: Sum for each site: kSI2K, HS06, # of TB.
      - B: we'll need a concise table for the most important, management-level numbers
        - Karthik will be implementing this table and adding it to the top of his current report.
        - The current table will remain, but be of less interest to most folks
        - We should try to keep on sites to have consistent numbers, but this will be less important.
      - R: Will be sending off link for ATLAS HEPSPEC 06
        - http://www.usatlas.bnl.gov/twiki/bin/view/Admins/CapacitySummary.html
      Sample table format of installed capacity numbers going to WLCG 
      Site            | kSI2K | HS06 | TB Installed
      MWT2_UC         |  1000 |  400 |          300
      MWT2_IU         |  1500 |  500 |          200
      Total: US-MWT2  |  2500 |  900 |          500
      USATLAS Total:  |  1234 | 5678 |        91011
  • this meeting
    • Preliminary capacity report is now working:
      This is a report of pledged installed computing and storage capacity at sites.
      Report date:  2010-01-25
       #       | Site                   |      KSI2K |       HS06 |         TB |
       1.      | AGLT2                  |      1,570 |     10,400 |          0 |
       2.      | AGLT2_CE_2             |        100 |        640 |          0 |
       3.      | AGLT2_SE               |          0 |          0 |      1,060 |
       Total:  | US-AGLT2               |      1,670 |     11,040 |      1,060 |
               |                        |            |            |            |
       4.      | BU_ATLAS_Tier2         |      1,910 |          0 |        400 |
       Total:  | US-NET2                |      1,910 |          0 |        400 |
               |                        |            |            |            |
       5.      | BNL_ATLAS_1            |          0 |          0 |          0 |
       6.      | BNL_ATLAS_2            |          0 |          0 |          1 |
       7.      | BNL_ATLAS_SE           |          0 |          0 |          0 |
       Total:  | US-T1-BNL              |          0 |          0 |          1 |
               |                        |            |            |            |
       8.      | MWT2_IU                |      3,276 |          0 |          0 |
       9.      | MWT2_IU_SE             |          0 |          0 |        179 |
       10.     | MWT2_UC                |      3,276 |          0 |          0 |
       11.     | MWT2_UC_SE             |          0 |          0 |        200 |
       Total:  | US-MWT2                |      6,552 |          0 |        379 |
               |                        |            |            |            |
       12.     | OU_OCHEP_SWT2          |        464 |          0 |         16 |
       13.     | SWT2_CPB               |      1,383 |          0 |        235 |
       14.     | UTA_SWT2               |        493 |          0 |         15 |
       Total:  | US-SWT2                |      2,340 |          0 |        266 |
       Total:  | All US ATLAS           |     12,472 |     11,040 |      2,106 |
    • Debugging underway
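For reference, the per-site totals in the preliminary report reduce to a simple aggregation over per-resource rows (the CE/SE registrations in OIM). The sketch below reproduces the AGLT2 and MWT2 lines from the table above; the row layout is an assumption, the figures are from the report.

```python
from collections import defaultdict

ROWS = [  # (wlcg_site, resource, ksi2k, hs06, tb) - figures from the report above
    ("US-AGLT2", "AGLT2",      1570, 10400,    0),
    ("US-AGLT2", "AGLT2_CE_2",  100,   640,    0),
    ("US-AGLT2", "AGLT2_SE",      0,     0, 1060),
    ("US-MWT2",  "MWT2_IU",    3276,     0,    0),
    ("US-MWT2",  "MWT2_IU_SE",    0,     0,  179),
    ("US-MWT2",  "MWT2_UC",    3276,     0,    0),
    ("US-MWT2",  "MWT2_UC_SE",    0,     0,  200),
]

def site_totals(rows):
    """Sum kSI2K / HS06 / TB per WLCG site name."""
    totals = defaultdict(lambda: [0, 0, 0])
    for site, _res, ksi2k, hs06, tb in rows:
        t = totals[site]
        t[0] += ksi2k
        t[1] += hs06
        t[2] += tb
    return dict(totals)

print(site_totals(ROWS)["US-AGLT2"])  # -> [1670, 11040, 1060]
print(site_totals(ROWS)["US-MWT2"])   # -> [6552, 0, 379]
```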



-- RobertGardner - 26 Jan 2010
