
MinutesJuly11

Introduction

Minutes of the Facilities Integration Program meeting, July 11, 2012
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
  • Our Skype-capable conference line (press 6 to mute); announce yourself in a quiet moment after you connect:
    • USA Toll-Free: (877)336-1839
    • USA Caller Paid/International Toll : (636)651-0008
    • ACCESS CODE: 3444755

Attending

  • Meeting attendees: Fred, Bob, Rob, Saul, Michael, Sarah, Hiro, Ilija, Wei, Mark, Alden, Armen, Horst, Tom, John B
  • Apologies: Jason, Patrick, Shawn
  • Guests:

Integration program update (Rob, Michael)

  • Special meetings
    • Tuesday (12 noon Central, bi-weekly - convened by Kaushik) : Data management
    • Tuesday (2pm Central, bi-weekly - convened by Shawn): North American Throughput meetings
    • Monday (10 am Central, bi-weekly - convened by Wei or Rob): Federated Xrootd
  • Upcoming related meetings:
  • For reference:
  • Program notes:
    • last week(s)
      • IntegrationPhase21, SiteCertificationP21
      • OmniPoP switching in Chicago; update from I2 today regarding LHCONE.
      • Tier 2 guidance - coming: the basis for discussion is that we have more CPU than pledged, so the cpu/disk ratio implies an adjustment is necessary. Michael has been working out a sensible basis; this will need discussion with ADC management. Discussion with Borut regarding projections - there is a comprehensive paper. Most of the Tier 2's would need to install 2-3 PB of disk to create balance; this would not be reasonable in the next fiscal year. Planning will be done on a site-by-site basis. Need to get to a resolution on resource requirements for the coming year and the fiscal years beyond. Expect to have guidance sometime next week. Might expect more resident data for analysis.
      • 2013 and 2014 pledge requests from the LCG office: new figures are due by September 30, for the RRB meeting in October.
      • Torre: plans for SSD-based storage, for hierarchical storage? We have been looking into this at BNL and SLAC - there is a prospect for this. AGLT2 - looking at LSI cascade solution (block level); potentially easy, but requires use of Dell SSDs which are expensive; Shawn working with Dell on this.
      • LFC consolidation - status?
        • Asking for the max thread count to the database to be increased - currently 100; a new LFC release will be installed.
        • Hiro is in discussion with Cedric and others about a merge script for Oracle.
        • Timeline: a few days to merge the UTA_SWT2 site to Oracle, and then run in production. Patrick notes he runs proddisk-cleanse often; this may complicate things.
      • Deploying federation infrastructure - widely. New bi-weekly meeting series.
      • Time/date for the next facilities face-to-face meeting (@ Santa Cruz): mid-November time frame.
      • Accounting and statistics data: aggregation across computing elements, appearing as individual sites. ICB message to address accounting issues appropriately. Discussing with Jarka as to the specific aggregations that are occurring.
      • At the last GDB meeting in mid-June there was a discussion about the migration from GLUE 1.3 to 2.0, a large change. There is a deadline of September 30 (!). If sites are not ready then they'll be dropped. OSG apparently was not contacted, but they've taken this as an action item.
      • Next generation computing element. Two alternatives - a thin Condor CE, and CREAM. Two teams, at BNL and Nebraska, are evaluating these; July 12 is the next blueprint meeting, in Madison.
      • Status of OSG 3.x deployment.
        • MWT2 - have OSG 3.x running, will update to latest
        • AGLT2 - will setup a test gatekeeper to try out the installation; will need to inform APF.
        • NET2 - installed on a new gatekeeper; will bring it in as a new Panda site. Working on BU - should be ready soon.
        • WT2 - we have two gatekeepers, will switch one of the two. Timeframe: 4-6 weeks. Not confident about stability of information system.
        • SWT2 - UTA: installed the rpm-based BeStMan at one of our clusters; that went fine. Using this as a basis for setting up builds for new devices. Is LFC available as an rpm? Yes. Will be taking a downtime anyway.
        • SWT2 - OU: working with OSCER admins to install OSG 3 on their new gatekeepers, to gain experience.
    • this week
      • IntegrationPhase21 ended June 30
      • Updates needed by next Friday, July 20:
      • 2013 pledge targets: https://twiki.cern.ch/twiki/pub/Atlas/ComputingModel/ATLAS_Resources_2012_2014.pdf
      • Regarding the ratio - we can't draw conclusions from the current usage pattern; more derived data from 2012 will fill our disks and will change the processing pattern. According to Borut, we need to work towards balancing cpu/disk.
      • Realize this cannot be achieved on the spot - it will be a gradual change over ~2 years.
      • Michael and Rob will discuss this with each team, to come up with a plan towards balancing and to set expectations on procurements for sites for this year and FY13.
      • Wei: can the targets use both this year's and next year's budgets? Answer: yes.
      • Need to keep in mind that CPUs are aging - retirements are happening. Would like to keep more than three generations of processing at Tier 2s.
      • We also have aging disk and networking
Site  | CPU 2013 pledge [kHS06] | CPU installed 6/12 [kHS06] | Disk 2013 pledge [TB] | Disk installed 6/12 [TB] | Balanced disk target [TB]
AGLT2 | 15.0 | 37.5 | 2132 | 2160 | 5332
MWT2  | 22.5 | 53.0 | 3200 | 2732 | 7538
NET2  | 15.0 | 37.0 | 2132 | 2100 | 5261
SWT2  | 15.0 | 30.0 | 2132 | 1610 | 4264
WT2   | 15.0 | 29.4 | 2132 | 2143 | 4181
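      • The "Balanced target" column appears to follow from the installed CPU and the pledged disk-to-CPU ratio (2132 TB per 15.0 kHS06, roughly 142 TB per kHS06). The sketch below is an inference from the numbers in the table, not a formula stated at the meeting; the small differences from the table values look like rounding.

        # Reproduce the "Balanced target" disk figures from the installed CPU,
        # assuming the targets scale with the pledged disk/CPU ratio.
        pledge_ratio = 2132 / 15.0                      # TB per kHS06 (2013 pledge)
        installed_cpu = {"AGLT2": 37.5, "MWT2": 53.0,   # kHS06, installed 6/12
                         "NET2": 37.0, "SWT2": 30.0, "WT2": 29.4}
        for site, khs06 in installed_cpu.items():
            print(site, round(khs06 * pledge_ratio), "TB")   # ~5330, 7533, 5259, 4264, 4179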
      • Deployment of multi-core queues: there is a queue at WT2, and work on it is in progress. Wei notes there are brief periods of large memory usage.
        • What are the requirements for a multi-core machine? Multiply the single-core requirements by the number of cores (a back-of-the-envelope sizing sketch appears at the end of this list).
        • 8 logical cores per job slot - focus on this.
        • Maximum of 50 multicore jobs.
        • At BNL, let Condor manage this - do not hard-partition.
        • If using group accounting, you'll need Condor 7.8.1. Documentation: AthenaMPFacilityConfiguration
        • At SLAC, it's hard-partitioned; there is no LSF configuration for this yet.
        • Patrick is working on a PBS configuration.
      • Next face-to-face facility meeting will be at UC Santa Cruz - will be co-located with an OSG Campus Infrastructures Community meeting
      • Extended run: p-p until Dec 15, then a technical stop for 4 weeks; mid-January, a heavy-ion run for 4 weeks.
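      • The sizing sketch referenced in the multi-core discussion above - a rough illustration of the "multiply by the number of cores" guidance and the 8-core / 50-job figures; the per-core memory value is an assumption for illustration only, not a number from the meeting.

        # Back-of-the-envelope sizing for a multicore (AthenaMP) queue, per the
        # "multiply by the number of cores" guidance above.
        CORES_PER_SLOT = 8        # 8 logical cores per job slot
        MAX_MULTICORE_JOBS = 50   # cap discussed at the meeting
        MEM_PER_CORE_GB = 2.0     # assumed single-core memory requirement (illustrative)

        mem_per_job_gb = CORES_PER_SLOT * MEM_PER_CORE_GB
        print("per-job request:", CORES_PER_SLOT, "cores,", mem_per_job_gb, "GB RAM")
        print("queue at the cap:", MAX_MULTICORE_JOBS * CORES_PER_SLOT, "cores,",
              MAX_MULTICORE_JOBS * mem_per_job_gb, "GB RAM")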

Update on LFC consolidation

  • From Patrick: Hiro provided me with a dump file of the LFC contents that pertain to my site. This is similar to what is done for the Tier 3's. I need to look at the dump and modify/adapt the PandaMover cleanse script to delete items from his LFC (a minimal sketch of reading the dump appears at the end of this section). I am hoping to look at that this week, since early next week will be lost to the maintenance.
  • Hiro's ddm-l message:
    The following shows the LFC migration steps and current status in detail. 
    
    1.  Continuous replication of T2 LFC's MySQL to BNL MySQL.
    a.  Need access from BNL.
    b.  The account needs "super" privilege (for triggers)
    c.  rubyrep (http://www.rubyrep.org/) will replicate all
    inserts/updates/deletes.  The delay is in seconds.
    d.  Modify a trigger in the BNL MySQL to record all deletes to a separate
    table (used at a later time).
    
    NOTE:  If a T2 can't provide access to its LFC from BNL, continuous
    replication is not possible.  This will increase the downtime of the site
    when the LFC is switched (maybe a day).  With continuous updates, the
    downtime is minimal since all records are already up to date.
    
    2.  Update BNL T2 LFC from BNL MySQL
    a.  a script has been created/tested.
    b.  a script will insert all existing entries (in BNL MySQL) to T2 LFC
    at the time of its execution. 
    c.  a separate script will insert any new entries to T2 LFC. It can be
    run many times.
    d.  a separate script will delete any new deletes from T2 LFC (the rows
    from 1.d is used)  It can be run many times.
    
    3.  Switch LFC
    a.  Change ToA.
    b.  Change schedconfig.
    c.  In the case of continuous migration, there is no need to drain or
    wait for active jobs to finish.
    d.  If not using continuous migration, we must wait for the site to
    drain, create/transfer the MySQL dump, and do step #2, which can take a day.
    
    4.  Run HC jobs.
    
    5.  Availability of dump.
    a.  A script has been created.  It can be run daily (or more often if
    required).
    b.  The script will make a dump, register it to DDM (BNL's SCRATCHDISK)
    and subscribe it to the T2 (datadisk or maybe scratch) area.  The name of
    the dataset is user.HironoriIto.T2Dump.SITE.DATE_TIME (the file has a .db
    extension for the sqlite file).
    c.  The dump includes the following information: guid, lfn, csumtype,
    csumvalue, ctime, fsize and sfn.
    d.  The format is sqlite.
    .schema
    CREATE TABLE files (id integer primary key, guid char(36), lfn
    varchar(255), csumtype varchar(2), csumvalue varchar(32), ctime integer,
    fsize integer, sfn text);
    CREATE INDEX ctime on files ('ctime');
    CREATE INDEX guid  on files ('guid');
    CREATE INDEX lfn  on files ('lfn');
    
    
    Status:
    Steps 1, 2 and 5 above have been tested using the UTA_SWT2 LFC
    (gk03.swt2.uta.edu).
    Steps 3 and 4 will be tested soon with UTA_SWT2.
    The cleanup script will need adjustment to use the new dump.
    
    LFC version:
    Waiting for the 1.8.4 release, which is in EMI certification right
    now.  That version increases the maximum allowed number of database
    threads from 100 to 1000.
    Meanwhile, the developer (Ricardo) is trying to port the change to the
    existing 1.8.3 for testing.  The BNL T2 LFC is currently version 1.8.3.1-1.
    
  • Need to test proddisk-cleanse against BNL
  • Turning off SWT2_UTA LFC, testing with HC
  • Okay - discussion led to WT2 as the next site; this will require a full downtime for the MySQL migration.
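  • For reference, a minimal sketch of reading one of the per-site dump files described in Hiro's message above. It assumes only the sqlite schema quoted there; it is not the production proddisk-cleanse script, and the file path and 30-day age cut are placeholders.

    #!/usr/bin/env python
    # Minimal reader for a per-site LFC dump in the sqlite format described above.
    import sqlite3
    import sys
    import time

    dump_file = sys.argv[1]                      # the .db file delivered via DDM
    max_age_days = 30                            # illustrative age cut
    cutoff = time.time() - max_age_days * 86400

    conn = sqlite3.connect(dump_file)
    rows = conn.execute(
        "SELECT guid, lfn, fsize, sfn FROM files WHERE ctime < ? ORDER BY ctime",
        (cutoff,))
    for guid, lfn, fsize, sfn in rows:
        print(guid, lfn, fsize, sfn)
    conn.close()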

Operations overview: Production and Analysis (Kaushik)

  • Production reference:
  • last meeting(s):
  • this meeting:
    • We did well providing computing for the Higgs crunch - but we now have a large backlog.

Data Management and Storage Validation (Armen)

Shift Operations (Mark)

  • last week: Operations summary:
    Yuri's summary from the weekly ADCoS meeting:
    https://indico.cern.ch/getFile.py/access?contribId=0&resId=0&materialId=0&confId=197445
    or
    http://www-hep.uta.edu/~sosebee/ADCoS/ADCoS-status-summary-7_2_2012.html
    
    1)  6/27: AGLT2 - failed file transfers with SRM errors ("failed to contact on remote SRM at head01.aglt2.org:8443/srm/managerv2").  Shawn reported that 
    an iptables rule was missing for port 8443 on the head01 host.  An additional issue with 2 of 6 gridftp doors not restarting iptables was also fixed.  The next day a new type 
    of error was posted to the ticket ("[GRIDFTP_ERROR] an end-of-file was reached globus_xio: An end of file occurred]").  ggus 83648 in-progress, eLog 37201.
    Update 6/29: Shawn reported all issues had been resolved - ggus ticket closed.  (See https://ggus.eu/ws/ticket_info.php?ticket=83648 for additional details.)
    2)  6/28: BNL-OSG2_SCRATCHDISK - file transfer failures (" File name is too long").  This is not a site issue (in this case the filenames are > 200 characters, 
    beyond the limit imposed by the DDM system).  Hiro suggested that the transfer(s) be canceled.  ggus 83687 in-progress, eLog 37200.
    Update 6/29: ggus ticket closed, information forwarded to the DDM team.
    3)  6/29: CERN - power cut at the site affected most computing services.  Power was restored after ~one hour - eventually all services back up. 
    https://atlas-logbook.cern.ch/elog/ATLAS+Computer+Operations+Logbook/37210.
    4)  7/2: "Leap Second" java bug affected BNL Frontier servers.  Issue resolved - see details from John DeStefano:
    http://www-hep.uta.edu/~sosebee/ADCoS/leap-second-java-bug.html - ggus 83732 closed.
    5)  7/2: SLAC - job failures with the error "Put error: lfc-mkdir failed: LFC_HOST atl-lfc.slac.stanford.edu cannot create..."  ggus 83772 in-progress, eLog 37261.
    
    Follow-ups from earlier reports:
    
    (i)  4/10: New type of DDM error seen at NERSC, and also UPENN: "Invalid SRM version."  ggus tickets 81011, 81012 & 81110 all related to this issue.
    Update: As of 4/19 this issue being tracked in ggus 81012 - ggus 81011, 81110 closed.
    Update 6/14: The issue with NERSC not being in the BDII system should get fixed during an upcoming maintenance outage.
    (ii)  6/13: Issue where some sites using CVMFS see the occasional error: "Error: cmtsite command was timed out" was raised in https://savannah.cern.ch/support/?129468.  
    See more details in the discussion therein.
    (iii)  6/16: BNL_ATLAS_RCF jobs failures with "lost heartbeat" errors.  From John Hover: This queue is running on non-dedicated resources, so ATLAS jobs are subject 
    to eviction without warning when the resource owner submits a lot of jobs. We'd expect "lost heartbeat" errors when that happens. We'll check to make sure that is what 
    happened, and look into mitigating the impact of evictions.  ggus 83321 in-progress, eLog 36938.
    Update 6/25: ggus 83552 was opened for the same issue.  eLog 37146.
    Update 6/27: This issue is being followed elsewhere, so the two ggus tickets were closed.
    (iv)  6/17: AGLT2_DATADISK - Functional test failures with the error "file exists, overwrite is not allowed."  https://savannah.cern.ch/support/index.php?129560 
    (ADC Ops Support Savannah), eLog 36963.
    (v)  6/26: UPENN_LOCALGROUPDISK - file transfer errors (SRM).  ggus 83602 in-progress, eLog 37168.
    Update 6/28: No further errors observed to UPENN_LOCALGROUPDISK - ggus 83602 closed.  eLog 37185.
    Update 6/29: SRM errors returned, new ggus ticket 83714 opened (with reference to ggus 83602).
    Update 7/2: Issue resolved, ggus 83714 closed.  (This was a different problem compared to the earlier ggus ticket.  An OSG s/w update broke the BeStMan installation.  
    Needed changes were applied to a BeStMan configuration - this fixed the problem.)  eLog 37260.
    

  • this meeting: Operations summary:
    Yuri's summary from the weekly ADCoS meeting:
    https://indico.cern.ch/getFile.py/access?contribId=1&resId=0&materialId=slides&confId=199838
    or
    http://www-hep.uta.edu/~sosebee/ADCoS/ADCoS-status-summary-7_9_2012.html
    
    1)  7/7: BNL - from Michael - At BNL we are currently experiencing a problem with the storage element. The problem is associated with the SRM
    Space Manager process.  Later: The problem was solved at 14:40 UTC by restarting the SRM and the dCache admin domains.  More details in: 
    https://atlas-logbook.cern.ch/elog/ATLAS+Computer+Operations+Logbook/37395.
    2)  7/7: From Rob at MWT2 - Evidently as a result of the extreme temperatures in the Midwest in the past few days we've had a CRAC unit fail last night around 2 AM local.   
    As a result a number of compute nodes have been automatically shut down.  These will result in job failures.  eLog 37396. 
    3)  7/9: BNL - file transfer errors ("Cannot get a connection, pool error Timeout waiting for idle object").  Issue resolved - from Michael: The problem is already resolved 
    by a restart of the SRM domain; transfers resumed.  ggus 84028 closed, eLog 37458.
    4)  7/10: SWT2_CPB - file transfer failures with SRM errors.  A storage server lost its network connection when a NIC fan failed.  Problem fixed, and files again 
    transferring successfully.  ggus 84112 / RT 22247 closed, eLog 37546.
    
    Follow-ups from earlier reports:
    
    (i)  4/10: New type of DDM error seen at NERSC, and also UPENN: "Invalid SRM version."  ggus tickets 81011, 81012 & 81110 all related to this issue.
    Update: As of 4/19 this issue being tracked in ggus 81012 - ggus 81011, 81110 closed.
    Update 6/14: The issue with NERSC not being in the BDII system should get fixed during an upcoming maintenance outage.
    Update 7/7: Both NERSC and UPENN now appear in BDII.  ggus 81012 closed.
    (ii)  6/13: Issue where some sites using CVMFS see the occasional error: "Error: cmtsite command was timed out" was raised in https://savannah.cern.ch/support/?129468.  
    See more details in the discussion therein.
    (iii)  6/17: AGLT2_DATADISK - Functional test failures with the error "file exists, overwrite is not allowed."  https://savannah.cern.ch/support/index.php?129560 
    (ADC Ops Support Savannah), eLog 36963.
    Update 7/5: FT's have been working for the past ~one week, so issue appears to be resolved.  Savannah ticket closed - eLog 37338.
    (iv)  7/2: SLAC - job failures with the error "Put error: lfc-mkdir failed: LFC_HOST atl-lfc.slac.stanford.edu cannot create..."  ggus 83772 in-progress, eLog 37261.
    
    • On shift - it's been busy.
    • Low number of US-specific issues in the past week.
    • Otherwise things are running smoothly.

DDM Operations (Hiro)

Throughput and Networking (Shawn)

  • NetworkMonitoring
  • https://www.usatlas.bnl.gov/dq2/throughput
  • Now there is FTS logging to the DQ2 log page at: http://www.usatlas.bnl.gov/dq2log/dq2log (type in 'fts' and 'id' in the box and search).
  • last meeting:
    • See notes sent yesterday.
    • There are some issues with the collector on the dashboard; Tom is looking into getting an additional collector. The collector issue is thought to be responsible for the empty measurements.
    • Throughput issues involving TRIUMF and sites in Germany. Hopefully getting sites onto LHCONE will avoid this.
    • 10G perfsonar sites?
      • NET2: up - to be configured.
      • WT2: have to check with networking people; equipment arrived.
      • SWT2: have hosts, need cables.
      • OU: have equipment.
      • BNL - equipment on hand
  • this meeting:
    • Michael: we are running behind with connecting AGLT2 and MWT2 to LHCONE
    • Should start with the other Tier 2's: WT2 should be easy. Saul will start the process at NET2.
    • Sites need to apply the pressure to get their site connected.

Federated Xrootd deployment in the US (Wei, Ilija)

last week(s) this week:
  • Wei: Global redirector at CERN - no progress. Can't get access to the machines; need to talk to the right person. DPM - they've moved to better hardware. You can access via the local path, but not the global path. Something is not working. In discussion with Fabrizio Furano about this part of the design; still testing, together with David Smith.
  • Ilija - the DE cloud has N2N at Wuppertal and LRZ. It's going to work.
  • Monitoring update - see update talk by Rob in ADC Operations yesterday.

US analysis queue performance (Ilija)

last two meetings
  • Twiki to document progress is here: AnalysisQueuePerformance
  • Organizational tools in place
  • Some tools to help monitor and inform about performance still in development
  • Base-line measurement is not yet established.
  • A lot of issues to investigate are cropping up.

this week:

  • Had a meeting last week - some issues were solved, and some new ones have appeared.
  • Considering direct access versus stage-in by site.
  • We observe that when switching modes, a ~2 minute stage-out time seems to be added. This will need to be investigated; it has been discussed with Paul.
  • Went through all the sites to consider performance. NET2 - Saul will be in touch with Ilija.

Site news and issues (all sites)

  • T1:
    • last meeting(s):
    • this meeting: not much to report. Over the weekend we had an SE incident - it looked like an SRM-to-database communication issue. Still puzzled as to the cause. Updating OpenStack, increasing the number of job slots. Adding accounts and auth info to support dynamic creation of Tier 3 resources.

  • AGLT2:
    • last meeting(s):
    • this meeting: Found the network getting hammered after switching to direct access. Updated MSU to SL5.8 + ROCKS 5.5 on all worker nodes. UM will be updated to the same set over the next couple of weeks. Found the local 20G link saturated with 2400 to 3100 analysis jobs; 4-5 GB/s from pools to worker nodes. Inter-site traffic worked as expected - a large pulse when jobs first started up.

  • NET2:
    • last meeting(s):
    • this meeting: Weekend incident - accumulated a large number of threads to sites with low bandwidth, eventually squeezing out slots for other clients. Failed to get a dbrelease file. Ramping up planning for the move to Holyoke (2013-Q2). Going to move from PBS to OpenGridEngine at BU (checked with Alain that it's supported). Direct reading tests from long ago showed GPFS.

  • MWT2:
    • last meeting(s):
    • this meeting: Site in test mode, requested pilots. Having problems with the submit host grid09 - it is not accepting transfers.

  • SWT2 (UTA):
    • last meeting(s):
    • this meeting: We have deployed our new worker nodes and are preparing our storage for rollout. We intend to take a downtime on Monday and Tuesday next week to implement a number of delayed maintenance items, including the OSG upgrade and swapping everything to the new UPS. Pressing issue - one of the storage servers went offline; the cooling fans on its 10G NIC failed.

  • SWT2 (OU):
    • last meeting(s):
    • this meeting:

  • WT2:
    • last meeting(s): LFC backend database issue; there was an upgrade, and a migration will be done. Will update BeStMan.
    • this meeting: Upgraded BeStMan to the rpm-based installation. SSD cache created, not sure of its performance yet; working to separate analysis and production. LFC - the database group is migrating the MySQL database from old to new hardware; expect timeouts to be reduced. But now there is an ACL problem - ACLs used to be updated hourly for Nurcan's DN; those updates were missed for the ADC DNs. LFC versus DDM database comparison.

Carryover issues (any updates?)

rpm-based OSG 3.0 CE install

last meeting(s)
  • In production at BNL.
  • Horst claims there are two issues: RSV bug, and Condor not in a standard location.
  • NET2: Saul: have a new gatekeeper - will bring up with new OSG.
  • AGLT2: March 7 is a possibility - will be doing upgrade.
  • MWT2: done.
  • SWT2: will take a downtime; have new hardware to bring online. Complicated by the installation of a new UPS - expecting delivery, which will require a downtime.
  • WT2: has two gatekeepers. Will use one and attempt to transition without a downtime.

this meeting

  • Any updates?
  • AGLT2
  • MWT2
  • SWT2
  • NET2
  • WT2

OSG Opportunistic Access

See AccessOSG for instructions supporting OSG VOs for opportunistic access.

last week(s)

this week

AOB

last week
  • Next-generation Intel CPUs: Sandy Bridge systems from Dell and IBM are under evaluation. The performance improvements do not seem to justify the current pricing. Will still be buying Westmere machines for RHIC. There is some indication the Sandy Bridge machines may not show up in the Dell matrix. Shawn: not for the near term; might be September.
this week


-- RobertGardner - 09 Jul 2012
