
MinutesSep192012

Introduction

Minutes of the Facilities Integration Program meeting, Sep 19, 2012
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • USA Toll-Free: 888-273-3658
    • USA Caller Paid/International Toll : 213-270-2124
    • ACCESS CODE: 3444755

Attending

  • Meeting attendees: Jason, Mark, Dave, Saul, Torre, Bob, Fred, Patrick, Wei, Maxim, Xin, Michael, Rob, John, Sarah, Horst, Ilija, Alden, Tom, Hiro
  • Apologies:
  • Guests:

Integration program update (Rob, Michael)

  • Special meetings
    • Tuesday (12 noon Central, bi-weekly - convened by Kaushik) : Data management
    • Tuesday (2pm Central, bi-weekly - convened by Shawn): North American Throughput meetings
    • Monday (10 am Central, bi-weekly - convened by Wei or Rob): Federated Xrootd
  • Upcoming related meetings:
  • For reference:
  • Program notes:
    • last week(s)
      • New Integration program for FY12Q4, IntegrationPhase22 and SiteCertificationP22
      • LFC consolidation is the highest priority.
      • Rob: review of high level milestones for the facility. Thanks to all for updating the site certification table.
      • Michael: Pledges for 2013, 2014 are being declared this month. Because of the run extension, additional resources are being required. There has been a concerted effort to keep the requests at a reasonable level. There are now some solid numbers for the US facility, according to the 23% MOU share, to be discussed tomorrow at the L2/L3 meeting.
      • Michael: multicore slots are now going to be used for a validation campaign. Jose is working on getting these jobs going - so we expect the MCORE queues to be utilized. Hopefully this will lead to getting AthenaMP into production.
      • Michael: accounting statistics are being analyzed at the ICB to get valuable information about how resources are being used.
      • Michael: future of PandaMover in the US. Historically it has been quite valuable (e.g. when DDM was not sufficiently available), e.g. for staging data from tape. Kaushik: useful as a backup, especially as we transition to Rucio. Last time we tried, we ran into DQ2 load issues on the SRM. Network load may go up to 30%, since files are re-used, which is not normal.
      • Michael: all should be re-visited. A factor of 2 or 3 should not be an argument. Note - PandaMover-related issues are hard to debug.
      • Kaushik: deletion service immediately deletes datasets after the jobs.
      • Rob: could you make the change for a single site? Kaushik: yes.
      • Hiro: does not like it.
    • this week
      • Please register for the UC Santa Cruz meeting - https://indico.cern.ch/conferenceDisplay.py?confId=201788
      • Reviewing the use of Pandamover in our environment. We need to organize an effort to review its status and the outlook for the future. Timeframe: arrive at a conclusion as to which way we should go. Action item.
      • Add LHCONE peering to next quarter's program; we are already late. SWT2 is a concern. NET2 should be straightforward. SLAC should be an internal issue. We should review the sites in Europe already peered.

Wrapping up GlideinWMS/PanDA testing in the US cloud (Maxim)

Executive summary

We investigated the feasibility of integrating the GlideinWMS system with the ATLAS PanDA Workload Management System. The focus of this integration effort was the generation and distribution of PanDA pilot jobs. This was done in the process of expanding the combined system in the US cloud, based on our prior experience with European sites. Additional components were deployed at BNL as required by the test.

From the technical standpoint, the configuration and deployment issues are well understood, as demonstrated by continuous and stable operation of a few designated US ATLAS sites in the integrated mode over a period of a month and a half, while carrying up to 20% of the total US ATLAS production workload.

Based on our observations and experience, we have identified potential throughput bottlenecks and ways to scale the combined system. We determined that, due to the very large number of ATLAS jobs being processed at any given time, an additional investment would be required in scaling up the GlideinWMS Factory components, since the current infrastructure does not have enough capacity to support both ATLAS and CMS (and other potential users). Increasing the capacity of this resource is likely to require more maintenance. This reduces the benefits we can hope to extract from the proposed integration (the expectation was that the schedd pool would be the hot spot in this setup, based on previous discussions with GlideinWMS personnel).

We also conclude that at least one other scalability requirement (increasing the Condor schedd node count as well as adding nodes for the GlideinWMS Front End) may effectively result in fragmentation of the Condor pool, which would make use of the Condor single-pool fair share mechanism in PanDA impractical. While using a Global Policy in Condor is possible, this would result in a further increase of complexity in the system. This negates one of the anticipated benefits of integration. Load-balancing individual chains consisting of APF, the schedd pool, the Front End and ultimately the Factory becomes an additional challenge.

Finally, even though using the system in glexec-capable mode was not possible at this time on the US production sites, we can still point to some issues pertaining to this, based on prior work done on glexec integration into PanDA. Since the identity switch in the current Condor version happens at precisely the moment when the job starts on the remote site, 100% of communications between the pilot job and the PanDA server must take place under the end-user identity. This will require significant changes in the server code and other elements of logic employed in PanDA.

Another look at costs and benefits

Benefits

From the experience derived from deployment and operation of the integrated system, we have come to the following conclusions regarding the benefits as formulated at the start of this project:
  • Fair Share: Using the Condor fair share and other resource allocation mechanisms will be problematic because of the need to scale up the Condor schedd pool (see the illustrative sketch below)
  • GLEXEC: Based on prior experience with glexec, starting a pilot on a WN under the end-user identity may present serious technical difficulties, because of the complexity of the PanDA pilot functionality and various instances of authentication and authorization required at various points in its lifecycle. This can be contrasted with switching from the production to user identity inside the pilot, which is more manageable
  • MULTICORE: From what we have seen during the testing, multi-core jobs can be handled in PanDA by the existing AutoPyFactory functionality, hence at present time there is no clear advantage in switching to GlideinWMS just for that reason
In summary, while theoretically possible, the benefits no longer look compelling.
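
To make the fair-share point concrete: the attraction of a single Condor pool is that one negotiator can divide all slots among accounting groups in proportion to configured shares, handing any unused share back to groups that still have demand. The following is a minimal, illustrative Python model of that proportional, surplus-redistributing allocation; it is not Condor's negotiator code, and the group names, shares and demands are invented for the example.

    # Illustrative sketch only: a toy model of single-pool fair share with
    # surplus redistribution, of the kind group quotas provide when all slots
    # live in one Condor pool. Group names, shares and demands are hypothetical.
    def fair_share(total_slots, shares, demand):
        """Allocate slots proportionally to 'shares', capped by 'demand',
        re-offering any unused surplus to groups that still want slots."""
        alloc = {g: 0 for g in shares}
        remaining = total_slots
        while remaining > 0:
            hungry = [g for g in shares if alloc[g] < demand[g]]
            if not hungry:
                break
            round_pool = remaining
            total_share = sum(shares[g] for g in hungry)
            for g in hungry:
                # proportional slice of this round's pool, capped by unmet demand
                grant = min(demand[g] - alloc[g],
                            max(1, round_pool * shares[g] // total_share),
                            remaining)
                alloc[g] += grant
                remaining -= grant
        return alloc

    # e.g. 1000 slots split between hypothetical production and analysis groups:
    print(fair_share(1000,
                     shares={"group_atlas.prod": 75, "group_atlas.analy": 25},
                     demand={"group_atlas.prod": 2000, "group_atlas.analy": 150}))
    # -> analysis takes only the 150 slots it can use; production absorbs the rest

Once schedds and pools are multiplied for scalability, no single negotiator sees the whole picture, so this kind of pool-wide allocation is no longer available without an additional global policy layer - the complexity referred to above.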

Costs

We believe that the following costs of deployment and running the combined GlideinWMS/PanDA system at scale will be incurred:
  • Additional hardware to host Condor schedd pools
  • A necessary commitment to deployment of additional hardware for more GlideinWMS Factory instances, and potential commitment to support personnel operating the Factory
  • Deployment and support of Front End instances at CERN or elsewhere
  • Extra effort to load balance the combined pilot submission chain, and the need to manage a few individual systems consisting of APF+Schedd+FrontEnd+Factory
  • More complex troubleshooting in case of possible failures of individual components or the network, as compared to “conventional” APF

Conclusion

There is enough confidence in our ability to deploy the integrated GlideinWMS/PanDA pilot distribution system at scale. However, based on our experience with running this system in the production environment, our cost/benefit analysis of such implementation is not favorable.

Status and known issues for OSG CE deployments (Xin)

  • Two things are happening. OSG Production is pressing all sites to upgrade old releases, mainly for security reasons.
  • Most US ATLAS sites have already upgraded.
  • OSG 3 rpm-based deployment. There had been an issue with the globus job manager at SWT2. A patch was provided, and appears to be working fine.
  • A new release is coming next Tuesday.
  • Suggest that US ATLAS sites upgrade to the newest release.
  • The Globus LSF job manager is being worked on at OU ("Boomer") - seems to be working well. The LSF Gratia probe has an issue. Expect to test on Friday.
  • OU_OSCER_ATLAS will be put into production.
  • SWT2 - one cluster has been upgraded. Will work on the second next week.
  • NET2 - installed at BU, but there are a couple of minor problems. HU - an LSF site - will follow Horst. BU is SGE; not yet in production - will be adding new Panda queues.
  • MWT2 and AGLT2 have already deployed rpm-based 3.x, all is okay.
  • John Hover has recommended that we upgrade the current OSG 3.x releases and run for a couple of weeks.
  • BNL - running an old OSG 3.x release. Will upgrade sometime afterward.

Multi-core deployment progress (Rob)

last meeting:
  • Will be a standing item until we have a MC queue at each site
  • BNL DONE
  • WT2 - SLACXRD_MP8 DONE
  • MWT2_MCORE available DONE
  • AGLT2 - in preparation - next week sometime.
  • NET2 - will be starting work on this today - next week. There are questions about controlling command-line options for the scheduler. Will consult Alden & Paul.
  • SWT2 - will do this by end of the week. Close.
this meeting, reviewing status:
  • AGLT2_MCORE DONE
  • NET2 - still working on it. HU - not sure how to set it up with schedconfig or AutoPyFactory (a batch-side sketch follows this list). Need PanDA guidance - would be great to add it here: AthenaMPFacilityConfiguration. Alden will send a note of clarification to the t2-l.
  • SWT2_OU - Horst will inquire with OSG about multicore scheduling with the LSF cluster.
  • SWT2_UTA - since the LFC migration is complete, hope to work on it this week.
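
For the Condor-based sites above, the batch-system half of an MCORE queue amounts to submitting pilots that request a whole multi-core slot; the schedconfig/AutoPyFactory half is handled separately (see AthenaMPFacilityConfiguration). The sketch below is only an illustration under assumed names: the wrapper path, queue name and memory figure are placeholders rather than a real site configuration, and request_cpus assumes a Condor of the 7.8 era with partitionable (or suitably sized static) slots.

    # Illustrative sketch: write and submit a Condor description for an 8-core
    # pilot. The wrapper path, queue name and memory request are placeholders.
    import subprocess
    import textwrap

    submit = textwrap.dedent("""\
        universe       = vanilla
        # hypothetical local pilot wrapper and PanDA queue name
        executable     = /usr/local/bin/pilot_wrapper.sh
        arguments      = --queue SITE_MCORE
        # ask the batch system for a whole 8-core slot
        request_cpus   = 8
        request_memory = 16000
        output = mcore.$(Cluster).out
        error  = mcore.$(Cluster).err
        log    = mcore.$(Cluster).log
        queue 1
    """)

    with open("mcore_pilot.sub", "w") as fh:
        fh.write(submit)

    # plain condor_submit here; in production AutoPyFactory generates and
    # submits the pilot jobs instead
    subprocess.check_call(["condor_submit", "mcore_pilot.sub"])

LSF and SGE sites need the equivalent whole-node request in their own scheduler syntax (for LSF, something like -n 8 with span[hosts=1]), which is the scheduler command-line-options question raised above.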

Update on LFC consolidation (Hiro, Patrick, Wei)

last week(s):
  • See Patrick's RFC for the clean up
  • Replica info is loaded into an SQLite database (see the sketch at the end of this section)
  • Can sites just use Patrick's pandamover-cleanup script?
  • Will put this into Hiro's git repo
  • Hiro can create something on-demand. AGLT2 - does this every 12 hours.
  • Please update with ConsolidateLFC known issues.
  • Production role is needed
  • Three sites UTA, SLAC, AGLT2 now converted. UTA seems to have problems, consulting w/ Patrick.
  • Mark notes that PandaMover needs fixing. Check of the replica succeeds, incorrectly. Patrick has contacted Tadashi, and Mark has created a ticket. Only became apparent after SLAC consolidated. Thinks this is simple.
  • Pause on new sites.
  • Hiro: need to revisit CCC. Shawn has not tried it at AGLT2 yet.
  • Shawn - is checking for production errors.
  • BU is scheduled next - perhaps next Monday. John in communication with Hiro.
this week:
  • T3 LFC has been migrated, waiting on an AGIS update.
  • Only sites left are OU and MWT2.
  • Patrick - the CCC script can run with the dump file. Pandamover will place files at a site, but DQ2 doesn't have them. Case of Tier2D. A foreign Tier 1 subscribes via DQ2. Hiro's solution is to use a different domain.
  • Modify domain of proddisk as it exists in DQ2 now, in TOA.
  • Sarah - using an older version of the code that will clean up files that might be used as input to jobs.
  • Code is in the git repo. Patrick will follow up with Sarah.
  • MWT2 - on Monday. Sarah working with Hiro.
  • OU after that.
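
As a concrete illustration of the SQLite-based cross-check mentioned under "last week(s)": load the LFC replica dump and a storage namespace dump into SQLite and diff them. This is a minimal sketch, not Patrick's or Hiro's actual script; the dump file names and the one-PFN-per-line format are assumptions.

    # Minimal illustrative sketch of an LFC-vs-storage consistency check.
    # Assumes two hypothetical dump files with one physical file name per line.
    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE lfc (pfn TEXT PRIMARY KEY)")
    db.execute("CREATE TABLE storage (pfn TEXT PRIMARY KEY)")

    def rows(path):
        # yield stripped, non-empty lines as 1-tuples for executemany()
        with open(path) as f:
            for line in f:
                line = line.strip()
                if line:
                    yield (line,)

    db.executemany("INSERT OR IGNORE INTO lfc VALUES (?)", rows("lfc_replica_dump.txt"))
    db.executemany("INSERT OR IGNORE INTO storage VALUES (?)", rows("storage_dump.txt"))

    # on storage but not registered in the LFC: "dark" data, cleanup candidates
    dark = db.execute("SELECT pfn FROM storage EXCEPT SELECT pfn FROM lfc").fetchall()
    # registered in the LFC but gone from storage: lost replicas to follow up on
    lost = db.execute("SELECT pfn FROM lfc EXCEPT SELECT pfn FROM storage").fetchall()

    print("dark files: %d, lost replicas: %d" % (len(dark), len(lost)))

Any deletion list produced this way would still need the safeguards discussed above, e.g. excluding recently written files and the Pandamover-placed files that DQ2 does not know about.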

Operations overview: Production and Analysis (Kaushik)

  • Production reference:
  • last meeting(s):
    • All is well.
  • this meeting:
    • 25-30% of MC12 had a bug - a parameter bug from the MC group.
    • Tomasz Schwindt's jobs -- massive production, high priority. Send Kaushik any comments, reports about these jobs.
    • Send any issues found to DAST help - "hn-atlas-dist-analysis-help (Distributed Analysis Help)" <hn-atlas-dist-analysis-help@cern.ch>
    • Jeff at MWT2 has a script that looks for low-efficiency jobs.

Data Management and Storage Validation (Armen)

  • Reference
  • last meeting(s):
    • Hiro - will send user disk cleanup reminder
    • Armen - localgroupdisk issues - what about policies? Do we need one? Generally no deletion there. Recent issue with SLAC - a 100 TB request. Wei added this space there, about 500 TB of LOCALGROUPDISK (reduces the pledge). How to get users to clean up? The situation is different in various places.
    • Hiro: why is the ToA number different from what Ueda quotes? Where does 1.2 PB come from? Michael: comes from the pledge.
    • Armen - expect more flow into localgroupdisk, since DATADISK is undergoing some deletion, or moving into GROUPDISK token areas.
    • Notes a spike in DQ2.
    • Will restart in two weeks.
    • Can we do something to improve DDM subscriptions to LOCALGROUPDISK?
    • Kaushik notes we will have accounting.
    • Alden: send any issues to DAST
  • this meeting:

Shift Operations (Mark)

  • last week: Operations summary:
    Yuri's summary from the weekly ADCoS meeting:
    https://indico.cern.ch/getFile.py/access?contribId=1&resId=0&materialId=paper&confId=208132
    or
    http://www-hep.uta.edu/~sosebee/ADCoS/ADCoS-status-summary-9_10_2012.html
    
    1)  9/5: NERSC - file transfer failures with SRM errors.  Site was in a scheduled downtime, but the outage ran over the endtime a bit, hence the errors.  
    No more errors once the site was back on-line.  ggus 85835 closed, eLog 39127.
    2)  9/6: UPENN file transfer errors ("[GRIDFTP_ERROR] an end-of-file was reached globus_xio: An end of file occurred (possibly the destination disk is full)").  
    Admin reported: I believe this is caused by the gridftp server terminating due to a timeout on the disk side and misinterpreting it as an EOF.  
    ggus 85916 in-progress, eLog 39145.
    Update 9/12: http://savannah.cern.ch/support/?132023 was also opened for this issue.  eLog 39288.
    3)  9/8: BNL - job failures with "lost heartbeat" errors.  Known issue where jobs running opportunistically at BNL_ATLAS_RCF are evicted by higher 
    priority jobs.  ggus 85943 closed, eLog 39188.
    4)  9/9: HU_ATLAS_Tier2 jobs failures with the error "Connection on "ATLASDD" cannot be established ( CORAL : "ConnectionPool::getSessionFromNewConnection" 
    from "CORAL/Services/ConnectionService."   Issue was due to problems with the FRONTIER_SERVER environment variable not getting set correctly at the two 
    NET2 sites.  Resolved (see ticket for details) - ggus 85957 closed, eLog 39214.
    5)  9/10: BNL - jobs failing with an error like "FATAL out of memory."  Site admin investigated some of the failed jobs - appears to be an issue with the software
    (i.e., ATLAS jobs) hitting the 4 GB maximum VMEM size for a 32-bit process.  So, not a site problem.  ggus 85959 closed, eLog 39252.  
    ( https://savannah.cern.ch/support/index.php?131978 was opened to track the 32-bit code issue.)
    6)  9/11: NET2 LFC migration => BNL.  eLog 39236/61.  Update 9/11 afternoon: migration completed, HC test jobs at the site successful, queues back on-line.
    7)  9/11 p.m. jobs failing heavily at US sites already migrated to BNL LFC.  Problem was due to the ToA => AGIS testing, which introduced a problem in PanDA.  
    Tadashi fixed the problem.  eLog 39268.
    8)  9/12 a.m. Most US cloud sites were auto-excluded by HC testing due to connection timeouts trying to access frontier01.racf.bnl.gov.  Possibly another 
    effect of the ToA => AGIS testing.  John DeStefano reported: After (drastically) increasing the number of available system and user processes on the servers, 
    we seem to be able to handle the increased load; things are normalizing, and US queues are being reinstated.
    
    Follow-ups from earlier reports:
    
    (i)  6/13: Issue where some sites using CVMFS see the occasional error: "Error: cmtsite command was timed out" was raised in 
    https://savannah.cern.ch/support/?129468.  See more details in the discussion therein.
    (ii)  7/12: NET2 DDM deletion errors - ggus 84189 marked 'solved' 7/13, but the errors reappeared on 7/14.  Ticket 'in-progress', eLog 37613/50.
    Update 9/9: new ggus ticket 85951 opened for this same issue. 
    (iii)  7/29: File transfers to UPENN_LOCALGROUPDISK failing with checksum mismatch.  https://savannah.cern.ch/bugs/index.php?96424 (Savannah DDM), 
    eLog 38037. 
    Update: site admin reported that this issue had been resolved.  Closed https://savannah.cern.ch/bugs/index.php?96424.
    (iv)  8/24: NERSC_SCRATCHDISK - file transfer failures with SRM errors.  Issue was due to the fact that the site admins had previously taken the token off-line
    to protect it from undesired auto-deletion of files.  ggus 85490 closed, eLog 38791.  https://savannah.cern.ch/support/index.php?131508 (Savannah site 
    exclusion ticket), eLog 38795.
    (v)  9/1: AGLT2: file transfer failures with "locality is unavailable" errors.  From Bob: Currently we do not route to TRIUMF or several other locations, due to the 
    shutdown of the UltraLight router. We are working to establish alternate routing.  ggus 85712 in-progress, eLog 39010.
    Update 9/10 from Shawn: We had some dCache "congestion" because of our site configuration. These were resolved by rebalancing our pool usage over the 
    last few days. We are not experiencing locality issues anymore so I am closing this ticket.
    (vi)  9/4: UTA_SWT2: Jobs failing with an error like "cp: cannot stat '/xrd/atlasproddisk/panda/dis/12/07/22/...': No such file or directory."  Problem is being 
    investigated.  ggus 85771 / RT 22442 in-progress, eLog 39114.
    (vii)  9/4: AGLT2 - LFC migration to BNL host underway.  http://savannah.cern.ch/support/?131669 (Savannah site exclusion).
    Update 9/9: migration completed - https://savannah.cern.ch/support/?131669 closed.
    

  • this week: Operations summary:
    Yuri's summary from the weekly ADCoS meeting:
    https://indico.cern.ch/getFile.py/access?contribId=0&resId=0&materialId=0&confId=209111 (presented this week by Helmut Wolters)
    or
    http://www-hep.uta.edu/~sosebee/ADCoS/ADCoS-status-summary-9_17_2012.html
    
    1)  9/12: BNL file transfer errors ("FIRST_MARKER_TIMEOUT] First non-zero marker not received within 180 seconds]").  Admins reported there was 
    a problem with a storage host, and it was rebooted.  Issue resolved.  ggus 86061 closed, eLog 39308.
    2)  9/12: Daily Atlas RSV vs WLCG report (currently sent out via e-mail) will be replaced on 31/Oct/2012 with a web only summary available here:
    http://rsv.opensciencegrid.org/daily-reports/2012/09/11/ATLAS_Replacement_Report-2012-09-11.html.  Please address any concerns or questions to 
    steige@iu.edu or open a ticket here: https://ticket.grid.iu.edu/goc/submit.
    3)  9/13: The site BNL_ATLAS_RCF had activated jobs, while the old site BNL_CVMFS_1 had no jobs in activated.  Issue understood - from Michael: 
    Pandamover jobs were stuck, presumably because of local area network problems yesterday afternoon.  We expect jobs will get to activated soon (and they were).
    4)  9/14: Testing of the AGIS configuration completed - rolled back to ToA in DDM operations.  eLog 39331.
    5)  9/14: Maxim announced that the glideinWMS/PanDA evaluation project has been successfully completed.  The system could have continued to be 
    run in this configuration, but since it's an additional maintenance item, it was decided to revert back to conventional APF - now done.  More details:
    http://www-hep.uta.edu/~sosebee/ADCoS/Winding-down-glideinWMS_PanDA%20evaluation-project.html
    6)  9/14: AGLT2 - from Bob: We are having VMWare issues.  I have set both our queues offline as we'll  need to stop all services for the necessary repairs.  
    eLog 39341.  Issue reported to be solved as of late evening the same day.
    7)  9/15: ggus 86121 was opened for file transfer errors between BNL & TRIUMF (and assigned to BNL).  Michael reported the issue was on the TRIUMF side 
    (The direct circuit between BNL and TRIUMF is not functioning properly and prevents the backup via CERN to take over), and hence not a problem at BNL.  
    ggus ticket was closed, and ggus 86120 for TRIUMF was updated.  eLog 39365/66 (latter describes partial resolution of the issue).
    8)  9/15: NET2 admins reported that the Harvard sites (HU_ATLAS_Tier2, ANALY_HU_ATLAS_Tier2) were recovering from a power outage at the site, and 
    in the process of ramping back up.
    9)  9/17: Comp@P1 shifter reported: There have been over the day about 20 T0 transfer failures to US DATATAPE per hour, representing some 10% failure rate.
    Issue understood and resolved - from Michael: The failures were caused by a faulty network switch module which was replaced. Since then we don't observe 
    such failures any more.  eLog 39415.
    10)  9/18: MWT2_UC PRODDISK (as source) file transfer errors ("[SOURCE error during TRANSFER_PREPARATION phase: [GENERAL_FAILURE] AsyncWait]").  
    From Sarah: We had a dCache storage node go offline last night. We are working on bringing it back up now.  Issue resolved - ggus 86173 closed, eLog 39441.
    11)  9/18: SWT2_CPB LFC migrated to BNL host.
    
    Follow-ups from earlier reports:
    
    (i)  6/13: Issue where some sites using CVMFS see the occasional error: "Error: cmtsite command was timed out" was raised in https://savannah.cern.ch/support/?129468.  
    See more details in the discussion therein.
    (ii)  7/12: NET2 DDM deletion errors - ggus 84189 marked 'solved' 7/13, but the errors reappeared on 7/14.  Ticket 'in-progress', eLog 37613/50.
    Update 9/9: new ggus ticket 85951 opened for this same issue. 
    Update 9/17: following the LFC migration to BNL it is expected that a permission problem should be fixed.  Closing this ticket - will continue to track the deletion 
    errors issue in ggus 84189.
    Update 9/16: ggus 86135 also opened for the deletion errors problem - and closed to allow tracking in 84189.  eLog 39401.
    (iii)  8/24: NERSC_SCRATCHDISK - file transfer failures with SRM errors.  Issue was due to the fact that the site admins had previously taken the token off-line
    to protect it from undesired auto-deletion of files.  ggus 85490 closed, eLog 38791.  https://savannah.cern.ch/support/index.php?131508 (Savannah site 
    exclusion ticket), eLog 38795.
    (iv)  9/4: UTA_SWT2: Jobs failing with an error like "cp: cannot stat '/xrd/atlasproddisk/panda/dis/12/07/22/...': No such file or directory."  Problem is being investigated.  
    ggus 85771 / RT 22442 in-progress, eLog 39114.
    Update 9/14: some site issues resolved - no recent errors of the type reported in the tickets - ggus 85771 / RT 22422 closed.  eLog 39334.
    (v)  9/6: UPENN file transfer errors ("[GRIDFTP_ERROR] an end-of-file was reached globus_xio: An end of file occurred (possibly the destination disk is full)").  
    Admin reported: I believe this is caused by the gridftp server terminating due to a timeout on the disk side and misinterpreting it as an EOF.  ggus 85916 in-progress, 
    eLog 39145.
    Update 9/12: http://savannah.cern.ch/support/?132023 was also opened for this issue.  eLog 39288.
    

DDM Operations (Hiro)

Throughput and Networking (Shawn)

  • NetworkMonitoring
  • https://www.usatlas.bnl.gov/dq2/throughput
  • Now there is FTS logging to the DQ2 log page at: http://www.usatlas.bnl.gov/dq2log/dq2log (type in 'fts' and 'id' in the box and search).
  • last meeting(s):
    • Please try to get new equipment into production.
      • NET2 - has systems, not yet in production; plan was to do this at BU, not clear about HU but maybe.
      • UTA - waiting on a 10G port; working internally on which optics - SR, LR; then will buy cable.
      • SLAC - has machines, trying to get them supported in a standard way, as appliances
  • this meeting:

Federated Xrootd deployment in the US (Wei, Ilija)

FAX status & reference

last week(s)

  • Preliminary format of new monitoring messages agreed upon. Estimated time to implement in xrootd 2-3 months, dCache and collector shortly afterwards. Have to investigate if we can get some additional information from current monitoring.
  • Got first real user(s). Need to advertise.
  • Wei: new xrootd release had problems, expect another today or tomorrow.
  • Work starting on joining LRZ into the federation.
this week
  • Need upgrades to Xrootd
  • Adding sites in Germany and UK
  • Ilija - presented FAX at the PAT meeting; will have a tutorial. 50 participants.

US analysis queue performance (Ilija)

last two meetings
  • Produced a site-by-site comparison for direct and copy2scratch modes. Results can be found here. To be able to advise sites on the optimal access mode for analysis jobs, one would need to do a stress test with a large number of test jobs, or preferably switch the ANALY queue from its current mode for one or two days.
  • Most sites show reasonable efficiency. A few strange results will need to be addressed.
  • It is obvious that the event-loop CPU time is still not the biggest part of the total execution time. Needs further investigation.
  • Two sites reported details on their investigations. No obvious weak spots found. Will continue investigating.
this week:
  • No meeting last week due to CERN and Lyon meetings.

Site news and issues (all sites)

  • T1:
    • last meeting(s): New version of FTS 3, ready for testing though far from complete; Hiro installing. 2400 MCORE slots. Not quite happy with the way Condor handles multi-core reservations. On Friday, Dan Fraser and Todd Tannenbaum will discuss shortcomings and the path going forward. WLCG has discussed taking up the federated ID management capability again. There will be a pilot program, and BNL will be involved. Looking at modern storage management solutions - e.g. Hadoop-based installations. There are other companies that are addressing shortcomings in the native version. One such company: http://www.mapr.com. NFS interface; distributed metadata; quotas.
    • this meeting: FTS3 deployed for testing.

  • AGLT2:
    • last meeting(s): Back online with both queues. About to purchase. Dell visit - new dense storage, available for ordering. MD3260 node - front-end RBOD node; MD3060e; 60 disks in 4U. Dynamic disk pool RAID - much faster rebuild, dynamically sized; price was an issue. R720 head node for one of these. More storage behind a single headnode than previously, so evaluating potential bottlenecks. 1/2 PB in 18U! Now have one domain per pool, rather than six per pool; more flexibility. Updated Condor 7.8.2 installed. CVMFS updated from 2.0.13 to 2.0.18; most recent wn-client.
    • this meeting: Seem to be getting a lot of inefficient jobs - might be related to a dCache node taken offline. R720 ordered, 3260 storage ordered with RBOD. A 10G NIC went offline; disabled flow-control on all 10G switches.

  • NET2:
    • last meeting(s): Preparing to purchase storage. Lots of other work on-going.
    • this meeting:

  • MWT2:
    • last meeting(s): LHCONE migration at UC. 800 slots in the MCORE queue. SRM failures on Friday - could have been an SRM thread issue. perfSONAR upgraded to 10G, and configured to test against LHCONE sites; configured with 10G/1G so we can test both properly. Converted Condor config to use IP addresses; will be upgrading soon.
    • this meeting: DDM issue - SRM problem. Attempting to get an SRM thread dump.

  • SWT2 (UTA):
    • last meeting(s): Multicore configuration nearly finished. Working on the Pandamover issue with LFC consolidation. Available disk at SWT2 ~ 1600 TB. Updated today - it will be about 2100 TB.
    • this meeting:

  • SWT2 (OU):
    • last meeting(s): Ordering storage - quote in hand, placing the order in the next two weeks; about 200 TB. Lustre issue was a metadata server deadlock, fixed with a reboot.
    • this meeting:

  • WT2:
    • last meeting(s): Migrated SRM and one of the CE's to virtual machines. Outage this weekend - Friday to Monday morning, extensive power work. PO for next storage went out, delivery: 1 PB usable. Also ordering 20 R510's for SLAC's local Tier 3 use, PROOF cluster expansion.
    • this meeting: 1 PB usable, R610 + MD1200 - starting.

AOB

last meeting
  • Shawn, Michael: not all ATLAS subnets are advertised from the PBR at BNL.
this meeting


-- RobertGardner - 18 Sep 2012
