
MinutesJune29

Introduction

Minutes of the Facilities Integration Program meeting, June 29, 2011
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
  • Our Skype-capable conference line (press 6 to mute) - announce yourself in a quiet moment after you connect
    • USA Toll-Free: (877)336-1839
    • USA Caller Paid/International Toll : (636)651-0008
    • ACCESS CODE: 3444755

Attending

  • Meeting attendees: Torre, Aaron, Nate, Charles, Shawn, Dave, John D, Sarah, Patrick, Fred, Saul, Kaushik, Armen, Mark, Wensheng, Booker, Hari, John B, Alden, Bob, Michael, Tom
  • Apologies: Horst, Jason, Wei

Integration program update (Rob, Michael)

  • Special meetings
    • Tuesday (12 noon CDT, weekly - convened by Kaushik) : Data management
    • Tuesday (2pm CDT, bi-weekly - convened by Shawn): Throughput meetings
  • Upcoming related meetings:
  • For reference:
  • Program notes:
    • last week(s)
      • Integration program from this quarter, FY11Q3
      • End of quarter approaching. From SiteCertificationP17, very easy updates to mark progress are:
        • FabricUpgradeP17 - very easy to fill out.
        • NetworkMonitoringP17 - as discussed in throughput calls, net diagrams, etc.
        • TestingCVMFS - this has been in an integration and testing phase, not all sites participating, in this case use led-gray
        • No FAX updates - debugging name translation plugin
        • UpdateOSG - green if updated to 1.2.19. May defer to next phase pending release of new wn-client (having new LFC python client bindings)
        • AccessOSG - green if the Tier 2 has enabled HCC
        • Please update by Friday, July 1.
        • Will also send out request for installed capacity updates.
      • 800 TB of raw data have been collected in 2011 so far; derived data of ~200 TB (AOD) will follow, which are targets for Tier 2s.
      • See Alexei's talk during ATLAS week this week on categories of data usage. ESDs are no longer of interest. AODs are very popular; even more so are ntuples (as seen from DaTRI or from PD2P). ~20 copies of the same datasets at Tier 2s. Good news, since these are used for analysis and the services are successfully moving data to where it is needed.
      • Management news - FY11 funding for Tier 2s granted, with no cuts. Still no word for FY12, which requires the new cooperative agreement grant; nothing yet from NSF.
      • On June 29 we would like to have a more comprehensive status report for the Federated Xrootd project, going over the anticipated deliverables.
      • LHC had a number of hiccups - cryo lost. Perhaps beam tomorrow. Luminosity increasing; >1200 bunches expected.
      • Summer conferences coming up, expect analysis activities to ramp up.
    • this week

Federated Xrootd deployment in the US (Charles, Doug, Hiro, Wei)

last week(s)
  • Items on track.
  • name translation plug-in problem.
  • waiting for 3.0.4 release
  • xprep option to be ready by end...
this week:
  • Update on progress towards milestones.
  • 3.0.4 rpm's available
  • Andy is working on integrating X509 code
  • CGW working on name translation module
  • Communication with Tier 3 sites - are they getting prepared and ready for deployment? Not sure.
  • More detail please next week?
  • At BNL there is significant progress using the federated namespace and FRM; got it working, next will look at performance. Hiro: dq2-get now has a plugin that can work with federated or native xrootd. Heard that the newer xrootd door in dCache is quite good.
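  • A minimal sketch (illustration only) of reading a file through the federated namespace from Python, assuming PyROOT is installed; the redirector hostname and logical path are placeholders, not the actual US federation endpoints:
      import ROOT

      def open_via_federation(path, redirector="xrootd-redirector.example.org"):
          """Open a ROOT file by its global (federated) name via a redirector."""
          url = "root://%s//%s" % (redirector, path.lstrip("/"))
          f = ROOT.TFile.Open(url)
          if not f or f.IsZombie():
              raise IOError("could not open %s" % url)
          return f

      # hypothetical usage:
      # f = open_via_federation("atlas/dq2/user/some_dataset/some_file.root")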

OSG Opportunistic Access (Rob)

last week(s)
  • HCC update
    • May 1 - June 22: query
    • 50K CPU-hours since May. Represents ~ 7% of the total HCC output
    • Preparing to submit to SLAC: Wei sent an email asking for an update, since an agreement is in place.
    • Need to enable HCC at NET2 and email Derek. NET2 has been busy.
    • http://glidein.unl.edu/glidestats/
  • No updates from Engage.
this week
  • cf. AccessOSG
  • Will need to track down unknown category
  • No HCC issues
  • Engage - would like to start - do we have sites ready to enable Engage?

Operations overview: Production and Analysis (Kaushik)

  • Production reference:
  • last meeting(s):
    • All is smooth. 17K running jobs. No site issues, failure rate very low.
    • Intermittent problem with task brokerage - seemed to be related to BDII. Saul reports BDII stopped reporting. Seems to have happened at several sites over the past ~week, not necessarily for related reasons.
  • this week:
    • All is well, 0% failure rate
    • UK was out of jobs for some reason
    • RAC meeting had okay'd SUSY production, so we're full.
    • Saw up-tick in analysis activity - perhaps as result of ATLAS week talks?
    • The pattern in user analysis seems to be: skim data on the grid, then download n-tuples to laptops
    • Will analysis at T2s go up with more replication of n-tuples? Note that at ~100 GB it is almost not worth doing on the grid.
    • Group analysis now being done as group production
    • Effects of the large amounts of data we have are still to be seen - going back to ...
    • PD2P - discussion
      • not getting any data at T2's; why? waiting time going down, no user complaints, so what's the problem?
      • Note - RAW and ESD are not allowed; only AOD and highly skimmed ntuple
      • No evidence of physics-backlog, only a 'technical' backlog
      • 2 copies on first use: one based on brokerage, the second based on MoU share (a sketch of this placement logic follows this list)
      • Increase pre-placed ntuples, "the old way"
      • Use closeness property of dq2
      • Grouping of Tier2's by performance, size and storage metrics
      • Will double amount of data to sites, at minimum
      • Q (Wensheng): what about datasets placed by users? PD2P will not touch them, unless they are in DATADISK. Will think about this.
      • Torre: rebrokerage - should we reduce threshold? Alden will repeat study
      • Will improve monitoring and logging - to improve knowledge of why copy was made; weights are in the logs.
      • Torre - Jarka's plots.
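    • For illustration only, a sketch of the two-replica policy described above (first copy by brokerage weight, second by MoU share); this is not the PanDA/PD2P implementation, and the site names, weights, and shares are hypothetical:
        import random

        # hypothetical MoU shares for the US Tier 2s
        MOU_SHARE = {"AGLT2": 0.20, "MWT2": 0.25, "NET2": 0.15, "SWT2": 0.20, "WT2": 0.20}

        def choose_destinations(brokerage_weights):
            """Return (first, second) Tier-2 destinations on a dataset's first use."""
            first = max(brokerage_weights, key=brokerage_weights.get)    # brokerage-driven copy
            others = {s: w for s, w in MOU_SHARE.items() if s != first}  # candidates for the MoU-share copy
            second = random.choices(list(others), weights=list(others.values()), k=1)[0]
            return first, second

        # e.g. choose_destinations({"AGLT2": 0.9, "MWT2": 2.1, "NET2": 0.4, "SWT2": 1.2, "WT2": 0.7})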

Data Management and Storage Validation (Armen)

  • Reference
  • last week(s):
    • No meeting this week.
    • All looks good mostly.
    • USERDISK cleanup still on-going; generally okay. NET2: there were some issues last week (SRM timeouts), now solved. LFC ACL errors persist at NET2, but this is understood.
    • BNL rate: much more data, so deletion is taking longer. The central deletion team is finally acknowledging the problem - only 1-2 Hz! (At 2 Hz, clearing a backlog of a million files would take nearly six days.)
    • Local cleanup for legacy space tokens still going well.
    • atlas-cloud-support-us SSB email, annoying message about CALIB threshold (20 TB). Armen investigating.
  • this week:
    • No meeting this week, no major issues, following up from last week.

Shifters report (Mark)

  • Reference
  • last meeting: Operations summary:
    Yuri's summary from the weekly ADCoS meeting:
    http://indico.cern.ch/getFile.py/access?contribId=0&resId=0&materialId=slides&confId=143330
    
    1)  6/15: IllinoisHEP, from Dave: For some reason, many jobs in the IllinoisHEP production queue are failing.  I am not sure why just yet, so I have put 
    this queue offline.
    2)  6/19:  DDM transfer errors to SLACXRD_PERF-JETS from multiple sources (" [DDM Site Services internal] Timelimit of 172800 seconds exceeded").  
    ggus 71675 in-progress, eLog 26572.
    3)  6/20: DDM transfer errors to NET2_* tokens (" failed to contact on remote SRM [httpg://atlas.bu.edu:8443/srm/v2/server").  From Saul & John at NET2: 
    Our srm was having a problem picking up new credentials this morning and was rejecting most requests and needed to be re-started. All seems to be 
    fine now.  ggus 71701 closed, eLog 26603.
    4)  6/21 early a.m.: SWT2_CPB - file transfer errors ("/bin/mkdir: cannot create directory...").  Issue was a failed disk in one of the RAID's, which triggered 
    a re-build, but the controller hung up, which necessitated a reboot of the storage server.  System back up as of early evening - test jobs successful, 
    prod & analy queues back to 'on-line'.  ggus 71758 / RT 20237 closed, eLog 26680.
    http://savannah.cern.ch/support/?121682.
    5)  6/21: From Shawn at AGLT2: We have lost the current Condor job load at AGLT2. We had a problem with the iSCSI server that hosts the OSGHOME 
    and ATLAS release areas and a quick reboot turned into a much longer repair than anticipated.  All running Condor jobs are lost and will show up (over 
    the next N hours ) as lost-heartbeats I assume.  (Shifters were requested to ignore any associated lhb errors.)
    6)  6/21: SLAC - job failures with errors like "Put error: lfc_creatg failed with (1015, Internal error)|Log put error: lfc_creatg failed with (1015, Internal error)."  
    Wei reported the issue was a failed disk, now fixed.  ggus 71774 closed, eLog 26658, https://savannah.cern.ch/support/index.php?121698.
    7)  6/21-22: NET2: jobs not being brokered to the site.  Saul reported that tomcat died on the BU gatekeeper, causing the system to stop reporting to the 
    OSG bdii.  Re-started, will monitor.
    
    Follow-ups from earlier reports:
    
    (i)  6/2: MWT2_UC - job failures with the error "taskBuffer: transfer timeout for..."  Not a site issue, but rather related to the problem seen recently with 
    transfers between US tier-2's and European destinations (under investigation).  ggus 71177 closed, eLog 26032.
    Update 6/7: still see large numbers of these kinds of job failures.  ggus 71314, eLog 26202.
    See also discussion in DDM ops Savannah: https://savannah.cern.ch/bugs/?82974.
    Update 6/14: ggus 71314 is still 'in-progress', but no recent updates from FZK/DE cloud.
    (ii)  6/10: HU_ATLAS* queues set off-line in preparation for a weekend maintenance downtime.  Outage completed as of early a.m. 6/13.  However, jobs 
    are not running at the site (brokerage) due to missing information about atlas s/w releases (BDII forwarding to CERN?).  Issue being tracked here: 
    https://ticket.grid.iu.edu/goc/viewer?id=10566.
    (iii)  6/13: SLAC - production job failures with the error "pilot: Exception caught by pilot watchdog: [Errno 10] No child processes trans: Unspecified error, 
    consult log file."  Wei solved the problem by disabling the multi-job pilots.  Issue will be raised with panda / pilot developers.  ggus 71475 closed, 
    eLog 26382.
    (iv)  6/14: Job failures at HU_ATLAS_Tier2 with the error "lsm-get failed: time out after 5400 seconds."  ggus 71539, eLog 26438.
    Update 6/17 from Saul & John at NET2: Problem resolved by improving our LSM so that it can handle the whole Harvard site starting at once.  
    ggus 71539 closed.
    
  • this meeting: Operations summary:
    Yuri's summary from the weekly ADCoS meeting:
    http://indico.cern.ch/getFile.py/access?contribId=0&resId=0&materialId=2&confId=143581
    
    1)  6/22: SLACXRD SRM errors (" failed to contact on remote SRM [httpg://osgserv04.slac.stanford.edu:8443/srm/v2/server]").  Later that day Wei 
    reported the problem had been fixed.  ggus 71834 closed, eLog 26697.
    2)  6/23 early a.m.: NET2 - low DDM transfer efficiency.  From Saul: we saw a big burst of adler32 checksumming of small USERDISK files overnight 
    (I suspect that this is part of an ATLAS-wide burst of user activity).  This caused our adler software to run out of I/O resources and eventually caused 
    bestman to stop.  We added more I/O resources and re-started bestman about 1.5 hours ago.  The adler backlog is down and we have been operating 
    normally since then.  ggus 71843 closed, eLog 26712.
    3)  6/23: (minor) pilot update from Paul (v47h): added debugging info in order to understand failures seen on dCache sites (TypeError: 'int' object is not 
    callable), related to Savannah ticket https://savannah.cern.ch/bugs/index.php?83380.
    4)  6/23: IllinoisHEP - job failures with the error "SyntaxError: invalid syntax."  ggus 71863, eLog 26723.  Production queue set off-line.
    Update 6/27-6/28: Dave reported that the issue was likely due to a problem with a squid server, which in turn impacted releases/cvmfs.  Machine was 
    taken off-line - test jobs completed successfully, site back => on-line.  (Following the re-start jobs were initially failing on one problematic WN, since removed.)  
    ggus 71863 closed, eLog 26886.
    5)  6/23: BNL - SE maintenance intervention.  Some file transfer / job errors, but went away once the work was completed.  eLog 26722.
    6)  6/24: Major issue with production across all clouds.  Issue was traced to an overloaded host (atlascomputing.web.cern.ch) which was being hit with large 
    numbers of 'wget' requests to download MC job options files.  (This system has been in place for several years, but over time the size of the job options .tgz 
    files has grown considerably.)
    Many tasks were either paused or aborted to relieve the load on the server.  Discussions underway about how to address this problem.  Some info in 
    eLog 26744, 52, 54-56, more in an e-mail thread.
    7)  6/25: ggus 71925 opened due to file transfer failures between IN2P3-CC & MWT2.  Incorrectly assigned to MWT2 - actually an issue in the IN2P3 side.  
    Awaiting a response from IN2P3 personnel.  ggus ticket closed, eLog 26781.  (Also see related ggus ticket 71933.)
    8)  6/25: BNL voms server was not accessible (a 'voms-proxy-init' against the server was hanging up).  From John at BNL: I checked the server and although 
    the process was running, voms-proxy-init was indeed failing. A service restart has restored the functionality.  ggus 71926 closed, eLog 26785.
    9)  6/25-6/26: NET2 - DDM errors ("failed to contact on remote SRM [httpg://atlas.bu.edu:8443/srm/v2/server]").  Issue was due to heavy SRM activity.  Saul 
    reported that changes were implemented to address the problem.  No additional errors as of early 6/26.  ggus 71923 closed, eLog 26778.
    10)  6/27: SWT2_CPB - a user reported that his jobs were failing with the error "No input file available - check availability of
    input dataset at site."  Issue understood and resolved - from Patrick: The problem was traced to how the input files were registered in our LFC.  The files 
    were registered in a compact form that causes problems for the run-athena transform because our system is configured to read ROOT files directly from 
    storage.  The problematic LFC registrations were isolated to a week-long period in May when BNL began to run a new DQ2 Site Services version.  
    ggus 71935 / RT 20296 closed.
    11)  6/28: Longstanding ggus ticket 69526 at NERSC closed (recent file transfer failures eventually succeeded on subsequent attempts).  eLog 26876.
    12)  6/28: AGLT2 - Bob reported that the site analysis queue was still set to 'brokeroff' after being auto-excluded by hammercloud testing on 6/25.  For some 
    reason the 'HC.Test.Me' comment wasn't set for the site.  This was corrected, but as of 6/29 a.m. ANALY_AGLT2 is still in the 'brokeroff' state? 
    
    Follow-ups from earlier reports:
    
    (i)  6/2: MWT2_UC - job failures with the error "taskBuffer: transfer timeout for..."  Not a site issue, but rather related to the problem seen recently with 
    transfers between US tier-2's and European destinations (under investigation).  ggus 71177 closed, eLog 26032.
    Update 6/7: still see large numbers of these kinds of job failures.  ggus 71314, eLog 26202.
    See also discussion in DDM ops Savannah: https://savannah.cern.ch/bugs/?82974.
    Update 6/14: ggus 71314 is still 'in-progress', but no recent updates from FZK/DE cloud.
    (ii)  6/10: HU_ATLAS* queues set off-line in preparation for a weekend maintenance downtime.  Outage completed as of early a.m. 6/13.  However, jobs 
    are not running at the site (brokerage) due to missing information about atlas s/w releases (BDII forwarding to CERN?).  Issue being tracked here: 
    https://ticket.grid.iu.edu/goc/viewer?id=10566.
    (iii)  6/13: SLAC - production job failures with the error "pilot: Exception caught by pilot watchdog: [Errno 10] No child processes trans: Unspecified error, 
    consult log file."  Wei solved the problem by disabling the multi-job pilots.  Issue will be raised with panda / pilot developers.  ggus 71475 closed, eLog 26382.
    (iv)  6/19:  DDM transfer errors to SLACXRD_PERF-JETS from multiple sources (" [DDM Site Services internal] Timelimit of 172800 seconds exceeded").  
    ggus 71675 in-progress, eLog 26572.
    Update 6/27 from Wei: I will trace this one via GGUS ticket system. It is not a bug anywhere, and I made agreement with US ATLAS computing management 
    that this looks like a long term small project.  ggus 71675 closed.
    

DDM Operations (Hiro)

Throughput and Networking (Shawn)

CVMFS

See TestingCVMFS

last week:

  • See email summary of actions from meeting 6/16/11:
    1. Follow-up on failed production jobs overnight at Illinois to understand cause (Dave et al)
    2. Alessandro will modify the installation & validation code to check for the presence of local site overrides to setup files for either:
    	a) traditional: use of pool file catalog file exported out of NFS plus conditions data in HOTDISK
    	b) cvmfs: use PFC and conditions data from cvmfs
       This will provide the option to roll-back changes if there are problems with cvmfs, and to test performance and other issues associated with having conditions data served from cvmfs.
       Test both modes at Illinois.
       Note dbrelease files are still required in HOTDISK (even if unused) for Panda brokering purposes.
       The ATLAS worker node client will continue to be supported with the OSG worker node client for the time being; we discussed dependency issues and testing required in the case that dq2 clients may be drawn from CVMFS itself (involves worker node client, local site mover, pilot). 
    3. Prepare first pass of OSG-specific documentation in the ATLAS twiki, https://twiki.cern.ch/twiki/bin/view/Atlas/CernVMFS#Setup_Instructions_for_OSG_Grid
    4. Broaden tests to include the following sites:
    	MWT2 (new queue) - Sarah, starting next week
    	SWT2_CPB - Patrick, starting in two weeks
    	BNL_ITB - Xin, starting next week
    5. Clearing of grid3-locations and re-validation and tagging of releases at sites from cvmfs.  Note Panda brokering requires (sites)
    6. Running of validation jobs over these sites - analysis, production, and HC (eg. this test as a template, http://hammercloud.cern.ch/atlas/10004919/test/)  (sites)
    7. Finalize any deployment instructions based on these tests.  (A minimal node-side CVMFS check is sketched after this list.)
  • Updates: job failures at Illinois with new setup not reproducible, new jobs completing successfully.
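  • A rough node-side sanity check in the spirit of the steps above (a sketch only, not Alessandro's validation code); the repository layout /cvmfs/atlas.cern.ch/repo/sw/software is the usual ATLAS one but should be treated as an assumption for your site:
      import os
      import sys

      REPO = "/cvmfs/atlas.cern.ch"

      def check_cvmfs():
          # listing the repository root triggers the autofs mount on first access
          os.listdir(REPO)
          releases = os.listdir(os.path.join(REPO, "repo", "sw", "software"))
          print("found %d entries under %s/repo/sw/software" % (len(releases), REPO))

      if __name__ == "__main__":
          try:
              check_cvmfs()
          except OSError as exc:
              sys.exit("CVMFS not usable on this node: %s" % exc)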

this week:

  • Illinois: there was a squid problem creating corruption, which caused jobs to fail; resolved by flushing the cache and restarting
  • MWT2 - passed all of Alessandro's validation tests, and test jobs
  • Switching back and forth - HOTDISK and CVMFS
  • Stratum 1 server mirrored at BNL - done last week

Tier 3 Integration Program (Doug Benjamin)

Tier 3 References:
  • The link to ATLAS T3 working groups Twikis are here
  • T3g Setup guide is here
  • Users' guide to T3g is here
  • US ATLAS Tier3 RT Tickets

last week(s): this week:

Tier 3GS site reports (Doug Benjamin, Joe, AK, Taeksu)

last week:
  • UTD - all well
  • AK - will be repeating tests w/ Jason without the firewall in place to check performance; max of 10 MB/s in one direction. For bi-directional (simultaneous) transfers, we get 10 MB/s max for outbound and 5 MB/s max for inbound. Hiro will start a load test using the SRM endpoint.

this week:

  • AK - waiting for firewall by-pass
  • Hari - jobs running smoothly

Site news and issues (all sites)

  • T1:
    • last week:
      • Smooth operations. Sharp increase in analysis jobs - 60K activated, so adjusted the analysis share. LHCONE connectivity: dedicated fiber in service up to the peering point in Manhattan, to be connected to MANLAN by next week. Once the distributed exchange point comes up ...
      • Chimera hardware arrived.
    • this week:
      • Hiro - working on dCache pnfs to chimera, testing-rehearsal. Target - August.

  • AGLT2:
    • last week(s):
      • Want to update dCache to 1.9.12-3, which is now golden; downtime? Wait a couple of weeks (for PLHC results to go out)
      • iSCSI server non-responsive, caused a drop of the job load (usatlas home). Not fully recovered - an automated script and job removal from Condor were necessary to restore the full load.
      • Waiting for dcache upgrade - will need a downtime within the next few weeks.
    • this week:
      • Tomorrow: downtime for upgrades; re-do WAN (new AS number); upgrade to the current golden dCache release; reconsolidate the OSG-NFS server onto an appropriate box; minor switch firmware updates.
      • Brokeroff issue: a network hiccup set the analysis queue to brokeroff. Put back online but still brokeroff; supposed to add the HC.Test.Me comment.
      • CVMFS - upgrading worker node rpms

  • NET2:
    • last week(s):
      • I/O upgrade progress: new software for dynamically spreading the adler checksum load; re-checksummed most of the inventory; improvements to LSM software for BU/HU/Tufts. Still to do: go to multiple gridftp endpoints; direct reading at BU, cluster-NFS direct reading at HU. Ready to pull the trigger on a second 10Gbps link to NoX.
      • Lots of incoming data to the TOP-Physics group area (65TB so far)
      • Ramping up HU analysis to 500 jobs for the first time; 850MB/s BU->HU workers; ready to pull the trigger on second 10Gbps link to NoX.
      • Major HU shutdown last weekend. Some minor troubles coming back from this.
      • Tufts using NET2 DDM via LSM ramping up in earnest for the first time. Upgrading Tufts LSM, use pcache.
      • Generally smooth T2/ BU T3/ HU T3 operations modulo the HU shutdown.
      • Note to self: enable HCC
    • this week:
      • Reached a stable plateau in the I/O upgrade project ~last weekend. Stable operations: can feed a steady 950MB/s to HU workers via LSM; Tufts LSM working; Adler-spreader smoothing out the spiky adler load (a chunked adler32 sketch follows below). Lots more to do.
        • About to place an order for ~500TB from Dell
        • Will wait for the new storage before getting the second 10Gbps NoX link
        • Admins setting up HCC
        • Tufts LSM upgraded
        • Smooth NET2 / BU Tier 3 / HU Tier 3 operations otherwise
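      • For reference, a minimal chunked adler32 calculation of the kind this checksumming load refers to (a sketch for illustration, not the NET2 adler-spreader itself):
          import zlib

          def file_adler32(path, chunk_size=1024 * 1024):
              """Compute the adler32 checksum of a file without reading it all at once."""
              value = 1  # adler32 seed value
              with open(path, "rb") as f:
                  while True:
                      chunk = f.read(chunk_size)
                      if not chunk:
                          break
                      value = zlib.adler32(chunk, value)
              return "%08x" % (value & 0xFFFFFFFF)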

  • MWT2:
    • last week:
      • UC: Finishing up rebuild of one of our storage servers (disk replacements, RAID firmware update, full testing). Working on new "bootstrap" machine integrating Cobbler+Puppet, DNS, DHCP services. Still tying up loose ends on server room (UPS readouts, Liebert monitoring in building controls system & load balancing).
      • IU: Working on CVMFS testing with new MWT2 Condor queue. Retiring Force10 switch, recabling to use 6248; considering new switch gear options with IU networking group. Preparing for rack rearrangement in server room, likely August downtime of ~three days.
      • UIUC, from Dave: I will be on vacation this week and might not be able to attend the integration meeting on Wednesday. Here is some information on CVMFS since the breakout last week. I restarted production on Thursday after the breakout session, when I emailed you on possible Illinois squid issues. I have seen a few additional production and also user analysis jobs fail where they could not find files in CVMFS. I suspected a squid problem, so I added the UC squid server as a backup to ours in cvmfs (but not as a load-balance server). I then rebooted the Illinois squid and cleared out its disk cache. During the reboot/cleanout time, cvmfs accesses did properly move to the UC squid. From what I could determine, all the nodes just started using the UC squid without any repercussions on the running jobs. "cvmfs_talk proxy info" will tell you which squid is being used by a node, and it showed that UC was being used. So far, after rebooting and cleaning up the Illinois squid, I have not seen any problems with jobs and missing files in cvmfs, but since it is rare it will take some time to know if that problem is now gone.
        There is a web site that I do not think has a link on the CVMFS twiki page that folks might like to know about: http://cernvm.cern.ch/portal. You can find the release notes for cvmfs, etc. on this site. Also, the writeup by Jakob Blomer is very useful: https://cernvm.cern.ch/project/trac/downloads/cernvm/cvmfstech-0.2.70-1.pdf. One piece of information in this document is that it claims the servers for the repositories should have access to the local squid. It does not explain why, though. I doubt my "missing files" problem is related to the fact that I did not have these servers in the allowed ACL, but just in case I have added them to my squid configuration file, which now looks like:
        acl our_networks src 192.17.18.32/28 192.168.207.0/24 128.174.118.0/24 127.0.0.1 cvmfs-stratum-one.cern.ch cernvmfs.gridpp.rl.ac.uk cvmfs.racf.bnl.gov
        http_access allow our_networks
        If this really is a requirement of a squid for CVMFS, then I think it might be good to put this on the twiki as well. Perhaps Doug can get some clarification on this point.
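      • A quick way (sketch only) to confirm that a given squid can proxy CVMFS traffic is to fetch the repository manifest through it; the proxy host below is a placeholder, and the manifest URL is an assumption about the stratum server layout:
          import urllib.request

          def check_squid(proxy="http://squid.example.org:3128",
                          url="http://cvmfs-stratum-one.cern.ch/cvmfs/atlas.cern.ch/.cvmfspublished"):
              """Fetch the CVMFS repository manifest through the given squid proxy."""
              opener = urllib.request.build_opener(urllib.request.ProxyHandler({"http": proxy}))
              data = opener.open(url, timeout=30).read()
              print("fetched %d bytes of manifest via %s" % (len(data), proxy))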
    • this week:
      • Progress with unified MWT2 queue (Condor scheduler, running jobs at both sites, used for CVMFS testing)
      • Progress with new cobbler+puppet system -
      • Downtime next week, July 7

  • SWT2 (UTA):
    • last week:
      • April update of OSG at UTA_SWT2 - wound up sending duplicate accounting data to WLCG. Working with the OSG accounting folks on what to do.
      • Drained over the weekend, missing releases and no jobs brokered. Can also initiate request from Alessandro's database.
      • Outage yesterday at CPB due to some storage servers going offline.
      • Supporting CVMFS testing - using three nodes to act as test nodes.
    • this week:
      • Creating a new OSG resource for CVMFS testing
      • A user job failed because of how data was registered in the LFC (short versus long form - see the sketch below) - how will this be managed?
      • Engage - already running for quite a while.
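      • Illustration only of the "short" (compact) vs. "long" (fully qualified) replica SURL forms at issue; the hostname, port, and path are placeholders, and the actual convention depends on the site's SRM endpoint:
          # the compact form omits the port and web-service endpoint that a
          # site mover configured for direct reads may expect
          compact_surl = "srm://gk.example.edu/xrd/atlasdatadisk/data11/file.root"
          full_surl = ("srm://gk.example.edu:8443/srm/v2/server"
                       "?SFN=/xrd/atlasdatadisk/data11/file.root")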

  • SWT2 (OU):
    • last week:
      • on vacation
    • this week:

  • WT2:
    • last week(s):
      • Problem with LFC hardware yesterday, replaced.
      • DDM transfer failures from Germany and France - all logfiles. ROOT files are working fine. Is FTS failing to get these? Email sent to Hiro. NET2 also seeing performance problems.
      • Hiro - notes many of these are never started, they're in the queue too long.
      • Suspects these are group production channels.
      • T2D channels.
      • FZK to SLAC seems to be failing all the time. Official Tier 1 service contact for FZK?
    • this week:

Carryover issues (any updates?)

Python + LFC bindings, clients (Charles)

last week(s): this week:
  • New package from VDT delivered, but still has some missing dependencies (testing provided by Marco)
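  • A minimal dependency smoke test (sketch only, not the VDT/Marco test suite), assuming the package ships the "lfc" Python binding:
      import os
      import sys

      try:
          import lfc  # C-extension binding shipped with the LFC client
      except ImportError as exc:
          sys.exit("LFC Python bindings not importable: %s" % exc)

      print("lfc module loaded from", getattr(lfc, "__file__", "<built-in>"))
      print("LFC_HOST =", os.environ.get("LFC_HOST", "<not set>"))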

WLCG accounting

last week: this week:

HTPC configuration for AthenaMP testing (Horst, Dave)

last week
  • Dave reports successful jobs submitted by Douglas last week
this week

AOB

last week
  • None.
this week


-- RobertGardner - 28 Jun 2011




Attachments


  • service-configuration-session.pdf (27.7K) - RobertGardner, 29 Jun 2011 - 12:11
  • USATLAS_Virtualization_Discussions.pdf (31.3K) - RobertGardner, 29 Jun 2011 - 12:11
  • screenshot_01.jpg (221.5K) - RobertGardner, 29 Jun 2011 - 12:17
 