r5 - 28 Apr 2010 - 14:19:53 - RobertGardner



Minutes of the Facilities Integration Program meeting, Apr 28, 2010
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • (605) 715-4900, Access code: 735188; Dial *6 to mute/un-mute.


  • Meeting attendees: Fred, Shawn, Booker, Rob, John DeStefano, Saul, Armen, Mark, Bob, Michael, Charles, Aaron, Nathan, Xin, Karthik, Sarah, Torre, Wensheng, Wei, Doug, Tom
  • Apologies: Patrick, Jason, Kaushik

Integration program update (Rob, Michael)

  • SiteCertificationP13 - FY10Q3
  • Special meetings
    • Tuesday (9am CDT): Frontier/Squid
    • Tuesday (9:30am CDT): Facility working group on analysis queue performance: FacilityWGAP suspended for now
    • Tuesday (12 noon CDT) : Data management
    • Tuesday (2pm CDT): Throughput meetings
  • Upcoming related meetings:
  • US ATLAS persistent chat room http://integrationcloud.campfirenow.com/ (requires account, email Rob), guest (open): http://integrationcloud.campfirenow.com/1391f
  • Program notes:
    • last week(s)
    • this week
      • As data taking ramps up we're seeing lots more analysis activity. The analy load is more evenly spread; now that user-cancelled jobs are no longer counted as failures, the US efficiency is quite high - about 2% failures.
      • We had a good store over the weekend - BNL was getting data ~GB/s for over 36 hours. As machine starts up again we'll see if this continues.
      • Reprocessing campaign has been completed quickly. All Tier 1's worked very well.
      • Data distribution to all Tier 1's and then to Tier 2's is going well. Not all Tier 2's have deployed their capacities, and PI's are urged to meet their pledges. There is a sharing plan under discussion in the RAC. Would be nice to fine tune the distribution formulas with usage patterns.

Tier 3 Integration Program (Doug Benjamin & Rik Yoshida)

  • last week(s):
    • The links to the ATLAS T3 working group TWikis are here
    • Draft users' guide to T3g is here
    • ANL analysis jamboree - used the model Tier 3g at ANL. Had four people doing analysis transfer their analysis there, as well as having users doing exercises.
    • Feedback was positive, user interface easy to use.
    • Some issues came up in the batch system - CVMFS reconfigured. Will be running more tests.
    • Next week there will be a Tier 3 session - all working groups will report at that meeting.
    • Close to coming up with a standard configuration for Tier 3g for ATLAS
    • Tier 3-Panda is working. There were some residual issues to sort out (to discuss w/ Torre). Working with Dan van der Ster to get HC tests running on Tier 3.
    • Need to update Tier 3g build instructions - after the software week. Target mid-May
  • this week:
    • Tier 3 plans are gelling. CVMFS - implies T3's will require squids - therefore will need monitoring.
    • Request for second FTS at BNL for gridftp-only endpoints.
    • Work proceeding on Xrootd; SRM checksumming required for T3; Doug working on that.

Operations overview: Production and Analysis (Kaushik)

Data Management & Storage Validation (Kaushik)

Shifters report (Mark)

  • Reference
  • last meeting:
    Yuri's summary from the weekly ADCoS meeting:
    1)  4/14: No pilots were flowing to multiple sites (SLAC, HU) - issue was a defective RAID controller on one of the submit hosts - from Xin:
    There was a RAID controller error on the submit host (gridui11) this morning, system ended up being rebooted.
    It is back now.
    2)  4/14: A new job state in panda, "cancelled", has been introduced.  From Tadashi:
    We have introduced 'cancelled' state in panda as discussed in the BNL workshop. Jobs go to this state when users kill jobs or tasks are aborted, for example. The idea is to improve error reporting by separating things that are not real failures. For production jobs, cancelled state is converted to FAILED/DOBEDONE in prodDB, so that the change should be transparent to the production dashboard.
    3)  4/15: BNL - jobs were failing with stage-out errors:
    Checksum type: None Destination SE type: SRMv2 [SE][StatusOfPutRequest][ETIMEDOUT] httpg://dcsrm.usatlas.bnl.gov:8443/srm/managerv2: User timeout over lcg_cp: Connection timed out
    15 Apr 18:22:46|lcgcp2SiteMo| !!WARNING!!2990!! put_data failed: Status=256 Output=Using grid catalog type: UNKNOWN. Using grid catalog : lfc.usatlas.bnl.gov
    Issue resolved - from Pedro:
    I had to restart SRM due to ca certificates problems.  This may have led to some transfer failures.
    4)  4/16: From Paul, new pilot version (43b) -
    After a request from Pavel Nevski et al, a new pilot version has been released. It contains an urgent fix needed for the re-processing validation (the alias for the xrdcp command interfered with the job parameters). It also contains a correction for the path to the Nordugrid job def file (code already tested by Andrej Filipcic), as well as a change in the schedconfig.timefloor handling which now is measured in minutes instead of seconds (requested by Michael Ernst et al).
    5)  4/16: From Michael at BNL:
    We observed that the CERN/WLCG BDII has dropped BNL from its information base. Though information about the site resources is complete and correct when querying the OSG BDII, it is not known by the BDII at CERN.  Resolution:
    Our theory was confirmed by Maarten. This is what he found in the BDII log:
    === 2010-04-16 22:11:07 START
    === 2010-04-16 22:12:23 END - elapsed: 76
    ERROR: OSG_BNL-ATLAS_BNL_ATLAS_1 is producing too much data!
    ERROR: OSG_BNL-ATLAS_BNL_ATLAS_2 is producing too much data!
    ERROR: OSG_BNL-ATLAS_BNL_ATLAS_5 is producing too much data!
    ERROR: OSG_BNL-ATLAS_BNL_ATLAS_SE is producing too much data!
    Further information from Maarten:
    After removing the additional CEs the size of BNL_ATLAS is ~3.5 MB, whereas the maximum is 5 MB.
    An increase to 10 MB per source is expected to be released soon.
    I will have a look on the BDII nodes that I have access to.
    6)  4/16: From Bob at AGLT2:
    I have temporarily stopped auto-pilots for AGLT2 and ANALY_AGLT2 queues as the gatekeeper got completely over-loaded.  Follow-up:
    OK, I think I got it.  condor_schedd was briefly responsive, I killed all the jobs that would not re-start, and now I am getting response and load again.  New pilots are trickling in and starting.  I will watch this for a while, the number of jobs is slowly rising with the low nqueue, and if everything seems clean for 30 minutes or so, then I will return everything to a more normal status.
    7)  4/16 - 4/17: Jobs were failing at HU_ATLAS_Tier2 with stage-in errors such as:
    16 Apr 17:12:17|LocalSiteMov| !!WARNING!!2995!! lsm-get failed (28169):
    16 Apr 17:12:18|SiteMover.py| Tracing report sent
    16 Apr 17:12:18|Mover.py    | !!FAILED!!2999!! Error in copying (attempt 1): 1099 - lsm-get failed (28169):
    16 Apr 17:12:18|Mover.py    | !!FAILED!!2999!! Failed to transfer HITS.118264._003170.pool.root.1: 1099 (Get error: Staging input file failed)
    From John at HU:
    As I just posted in the usatlas-prodsys-l:
    all signs point to this being a side effect of the SIGBUS crashing, filesystem hammering, 15.6.3 G4 simulation jobs that other sites have seen, too.  I don't see any site-specific issues.  The pilot is killing these looping jobs as I write this... I might help them along, too, by doing some of my own killing.  After the jobs clear out I will cautiously ramp the site back up.  eLog 11742.
    8)  4/17: BNL - large number (>1800) of failed jobs with stage-out errors like:
    17 Apr 03:07:25|futil.py | !!WARNING!!5000!! Abnormal termination: ecode=256, ec=1, sig=-, len(etext)=268
    17 Apr 03:07:25|futil.py | !!WARNING!!5000!! Error message: Using grid catalog type: UNKNOWN Using grid catalog
    : lfc.usatlas.bnl.gov VO name: atlas Checksum type: None Destination SE type: SRMv2
    [SE][StatusOfPutRequest][ETIMEDOUT] httpg://dcsrm.usatlas.bnl.gov:8443/srm/managerv2: User timeout over lcg_cp:
    Connection timed out. 
    Resolved (problem connecting to the SRM server).  ggus 57408, eLog 11639.
    9)  4/18 - 4/19: Large number of failed jobs in several clouds with the error: "oracle://INTR/ATLAS_COOLONL_TRIGGER" cannot be established ( CORAL : "ConnectionPool::getSessionFromNewConnection" from "CORAL/Services/ConnectionService" ).  Problem was a high load on the database servers.  More details here:
    10)  4/19: Transfer errors at AGLT2_Perf-Muons:
    [Failed] FTS Retries [1] Reason [AGENT error during ALLOCATION phase: [CONFIGURATION_ERROR] No Channel found, Channel closed for your VO or VO not authorized for transferring between AGLT2 and AGLT2]
    From Hiro:
    There was a configuration error for the AGLT2 site info created by the automated update script due to AGLT2's dropping out of BDII. This has been resolved.
    11)  4/19: Issue with host gridui11 at BNL - machine was recovered from a disk error, but Xin suggested migrating the panda pilot submitter to gridui12, which has now been done (Torre).
    12)  4/21: MWT2_UC, MWT2_IU, ANALY_MWT2 offline for kernel upgrades and network tests (in progress).  eLog 11724.
    Follow-ups from earlier reports:
    (i)  3/3: Consolidation of dq2 site services at BNL for the tier 2's by Hiro beginning.  Will take several days to complete all sites.  ==> Has this migration been completed?
    (ii) 3/24: Question regarding archival / custodial bit in dq2 deletions -- understood and/or resolved?
    (iii)  4/9 => present: SWT2_CPB has been experiencing storage problems.  The main issue is the fact that the majority of the storage servers in the cluster are very full, such that data IO ends up hitting just one or two of the less full servers.  The site is working on several longer-term solutions (more storage is about to come on-line, redistribute the data among the servers, xroot updates, etc.).  
    This issue is being tracked here: RT 15936, eLog 11301, ggus 57161.
    Update, 4/15:  SWT2_CPB was brought back on-line, but within 24 hours the issue with the storage resurfaced.  At this point a newer version of xrootd was deployed on all of the storage servers (prior to the restart this had been done on only one of them).  Test jobs were successful on 4/17, and the production queue was set back to on-line.  
    Analysis queue followed on 4/19.  Since that time the system has been stable.  Continuing to monitor the situation.  
    ggus 57161 & 57394, RT 15936 & 16021 closed.  Also closed RT 15945 - unrelated issue (non-atlas jobs wait due to the small number of available batch slots).
    (iv)  4/11 => present: Widespread issue with missing conditions/pool file catalog data at sites.  Some sites have been patched by hand, a permanent fix is under discussion.  Many tickets, mail threads - a sample:
    Q: Has this issue been resolved??
    (v)  4/13: Transfer errors at AGLT2, such as:
    ERROR MSG: [FTS] FTS State [Failed] FTS Retries [1] Reason [DESTINATION error during TRANSFER_PREPARATION phase: [HTTP_TIMEOUT] failed to contact on remote SRM [httpg://head01.aglt2.org:8443/srm/managerv2]. Givin' up after 3 tries].  From Shawn:
    Issue seems to be related to the Postgresql DB underlying our SRM on the headnode which started around 1:30 AM. Problem resolved itself around 4 AM local time and has been OK since then. I have restarted SRM services around 8 AM just to be sure that the SRM is in a good state.  ggus 57219, RT 15965 (these will need to be closed), eLog 11463.
    Update, 4/16:  No additional errors of this type seen, so the issue is apparently resolved.  ggus & RT tickets closed.
    (vi)  4/11: Failed jobs at AGLT2 with errors like:
    11 Apr 13:37:44| lcgcp2SiteMo| !!WARNING!!2999!! The pilot will fail the job since the remote file does not exist.
    Shawn verified that the file exists in /pnfs, and that the correct checksum info is available via dCache/Chimera.  Could there be some timing issue present? What does getdCacheChecksum() try to do? I note it mentioned "attempt 1"...are there follow-on attempts or is this site-db configured?  Paul added to the thread in case there is an issue on the pilot side.  ggus 57186, RT 15953, eLog 11406.  In progress.
    Update, 4/16: Still see this error at a low level, intermittently.  For example ~80 failed jobs on this date.  More discussion posted in the ggus ticket (#57186).
    (vii)  4/7: BNL MC data transfer errors were reported in eLog 11239/61, ggus 57121, RT 15922.  Tickets still open.
    Update, 4/21: No recent transfer errors related to this issue seen -- ggus & RT tickets closed.
  • this meeting:
     Yuri's summary from the weekly ADCoS meeting:
    1)  4/21: Filled system disk at HU_ATLAS_Tier2 -- details from John:
    I moved /tmp to the local scratch disk and fixed up the corrupted /etc/services file.  I will have to follow up about:
    a) gratia-lsf uses up 6 GB of /tmp every 10 minutes because it makes a line-reversed copy of our entire LSF history file using tac, which requires a lot of tmp space.
    b) the 75,000 files that were left at the top level of /tmp over the past 24 hours.
    I see pilots already finishing successfully (noting that the site is offline).  I'm going to turn the site back online, with a local cap of 100 jobs just to make sure everything is really working again, and then I'll remove the cap.
    2)  4/21: From Bob at AGLT2:
    I have set us off-line so that we can drain, and we will then fix a Condor problem we have here.  This has already caused a bunch of "lost heartbeat", and eventually a bunch more will report in.  The problem has a known cause.  Jobs that complete normally are unaffected.  We expect to be back online late this afternoon.
    Following the Condor fix test jobs were successful, site set back to 'on-line'.  eLog 11748.
    3)  4/22: From Hiro:
    BNL dCache has been updated to fix the gridftp adapter problem.  Therefore, I will change all FTS channels to use GRIDFTP2 this afternoon (2 PM US Eastern).  Sites that allow direct writing to storage disks/pools (e.g. dCache sites) should pay attention to their SEs.
    4)  4/22: MWT2_UC - jobs failing with the error "22 Apr 06:14:30|pilot.py | !!FAILED!!1999!! Too little space left on local disk to run job: 573571072 B (need > 2147483648 B)."  Issue resolved - from Charles at UC:
    Problem should be fixed now.  Background - we are using pcache [https://twiki.cern.ch/twiki/bin/view/Atlas/Pcache ] which uses a subset of scratch space for a file cache. The max size of this cache was set to 90%, which leaves ~40GB free. This job set filled up the available space quickly before a cache cleanup pass could free up space by deleting cached files. I've reduced the pcache max space limit from 90% to 80%, which should prevent recurrence of this problem.  ggus 57532 (closed), eLog 11810.
    5)  4/22:  Transfer failures at SWT2_CPB:
    SRC SURL: srm://gk03.atlas-swt2.org:8443/srm/v2/server?
    ERROR MSG: [FTS] FTS State [Failed] FTS Retries [1] Reason [SOURCE error during TRANSFER_PREPARATION phase: [USER_ERROR] source file doesn't exist]
    Problem understood, issue resolved.  From Patrick:
    The system disk on one of our dataservers got filled and we think that this is the root cause of a problem with the portion of Xrootd used by the SRM.  The SRM was not able to 'stat' files and reported that the files did not exist.
    We sync'ed the dataserver contents to the CNSD's copy of the contents. For some reason, the CNSD is still maintaining some older files that have been deleted, but these should not cause operational issues.  We also took the time to perform some minor maintenance on the dataservers.  The service is seemingly running fine now.  eLog 11806/07.
    6)  4/23: Job failures at MWT2_IU with errors like:
    23 Apr 05:55:23|LocalSiteMov| !!WARNING!!2995!! lsm-put failed (51456): 201 Copy command failed
    23 Apr 05:55:24|Mover.py | !!WARNING!!2999!! Error in copying (attempt 1): 1137 - lsm-put failed (51456): 201 Copy command failed
    23 Apr 05:55:24|Mover.py | !!WARNING!!2999!! Failed to transfer NTUP_MINBIAS.126581._014054.root.1: 1137 (Put error: Error in copying the file from job workdir to localSE)
    From Sarah at IU:
    Thank you for reporting the issue! We found that certain worker nodes in the cluster had an older version of the lsm-put script, which caused certain put operations to fail. We've updated those nodes and continue to monitor.  ggus 57584, RT 16080, eLog 11872.  4/26: still see job failures with stage-out errors.  From Sarah:
    Proddisk had reached 99% usage, causing writing job outputs to fail. I have allocated more space.  ggus ticket closed.
    7)  4/23: OU sites were set off-line in advance of major upgrades -- from Horst:
    We'll be taking OU_OCHEP_SWT2 down for a complete hardware and software upgrade on Monday morning.  So could you please drain all the queues -- both *OU_OCHEP_* and *OU_OSCER_* -- and turn pilot submission (and everything else that might be accessing them) off starting this afternoon, until we're ready to start back up, which will be at least a week?  
    I'll also schedule a maintenance in OSG OIM, which I will keep updated when we know better how long Dell and DDN will take for the upgrade.  eLog 11813.
    8)  4/23:  Spring reprocessing exercise is underway.  Useful links:
    9)  4/23: ANALY_SWT2_CPB: jobs were failing due to an issue with a specific transform (runGen) which does not default to the system python in the same way as other transforms.  To address this issue a 32-bit version of WN client was installed to ensure that the pilot picks up a workable python version and thus decouple the issue from the job transforms.
    10)  4/23: UTA_SWT2 - NFS server problems.  Evidence that the NIC in the machine was dropping some packets.  Not clear what was causing the problem.  A system reboot cleared up the problem for now.  4/24: One of the xrootd data servers crashed.  Used this opportunity to update xrootd on all the servers, along with some modifications to the XFS file system mounts.  
    System restarted with modified options to the NIC driver - this issue seemingly resolved.
    11)  4/23: HU_ATLAS_Tier2 - stage-in errors like:
    23 Apr 18:31:43|LocalSiteMov| !!WARNING!!2995!! lsm-get failed (28169):
    23 Apr 18:31:44|Mover.py | !!FAILED!!2999!! Error in copying (attempt 1): 1099 - lsm-get failed (28169):
    23 Apr 18:31:44|Mover.py | !!FAILED!!2999!! Failed to transfer DBRelease-10.3.1.tar.gz: 1099 (Get error: Staging input file failed)
    23 Apr 18:31:44|Mover.py | !!FAILED!!3000!! Get error: lsm-get failed (28169):
    From Saul:
    We're continuing to work on this issue. It's understood but not completely resolved. Let's close the ticket and open a new one if the errors reappear.  ggus 57615 (closed), eLog 11817.
    12)  4/24: FTS errors at SLAC -
    [FTS] FTS State [Failed] FTS Retries [1] Reason [SOURCE error during TRANSFER_PREPARATION phase: [CONNECTION_ERROR] failed to contact on remote SRM [httpg://osgserv04.slac.stanford.edu:8443/srm/v2/server]. Givin' up after 3 tries].  From Wei:
    One of our data servers was down. It is back online.  ggus 57622, eLog 11826.
    13)  4/24: IllinoisHEP, file transfer problems with SRM errors:
    ERROR MSG: [FTS] FTS State [Failed] FTS Retries [1] Reason [SOURCE error during TRANSFER_PREPARATION phase:
    [CONNECTION_ERROR] failed to contact on remote SRM [httpg://osgx1.hep.uiuc.edu:8443/srm/managerv2]. Givin' up after 3 tries].  From Dave at Illinois:
    Restarted parts of dCache (I believe a pool node was confused) and all seems well at this point.  ggus 57634, eLog 11854.
    14)  4/26 - 4/28: Jobs failing at most U.S. sites due to missing release 15.6.9.  There was a problem with the install pilots which was preventing the s/w installation jobs from running.  This issue has been resolved.  eLog 11996, ggus 57681.
    Follow-ups from earlier reports:
    (i)  4/11: Failed jobs at AGLT2 with errors like:
    11 Apr 13:37:44| lcgcp2SiteMo| !!WARNING!!2999!! The pilot will fail the job since the remote file does not exist.
    Shawn verified that the file exists in /pnfs, and that the correct checksum info is available via dCache/Chimera.  Could there be some timing issue present? What does getdCacheChecksum() try to do? I note it mentioned "attempt 1"...are there follow-on attempts or is this site-db configured?  Paul added to the thread in case there is an issue on the pilot side.  
    ggus 57186, RT 15953, eLog 11406.  In progress.
    Update, 4/16: Still see this error at a low level, intermittently.  For example ~80 failed jobs on this date.  More discussion posted in the ggus ticket (#57186).
    (ii)  4/21: MWT2_UC, MWT2_IU, ANALY_MWT2 offline for kernel upgrades and network tests (in progress).  eLog 11724.
    Update, 4/21 p.m.: Maintenance completed, sites back 'on-line'.
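
The free-space arithmetic behind item 4 of this meeting's summary (MWT2_UC) can be sketched as follows. This is only an illustration, not pcache or pilot code: the function names are hypothetical, the 400 GB scratch size is an assumption inferred from the statement that a 90% cache limit leaves ~40GB free, and the 2147483648 B threshold is the value quoted in the pilot error message.

```python
# Hypothetical sketch of the scratch-space headroom reasoning; not pcache code.

# Threshold quoted in the pilot error message ("need > 2147483648 B").
PILOT_MIN_FREE_BYTES = 2147483648

# Assumption: scratch size inferred from "90% ... leaves ~40GB free".
SCRATCH_BYTES = 400 * 10**9


def headroom_bytes(scratch_bytes, cache_limit_percent):
    """Space left for jobs when the cache has grown to its configured limit."""
    if not 0 <= cache_limit_percent <= 100:
        raise ValueError("cache_limit_percent must be between 0 and 100")
    # Integer arithmetic avoids floating-point rounding on large byte counts.
    return scratch_bytes * (100 - cache_limit_percent) // 100


def pilot_can_start(free_bytes, min_free=PILOT_MIN_FREE_BYTES):
    """Mirror the pilot's check: strictly more than min_free must be available."""
    return free_bytes > min_free
```

Under these assumptions, lowering the limit from 90% to 80% doubles the worst-case headroom from about 40 GB to about 80 GB, while the 573571072 B reported in the failed job is well below the pilot's threshold.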

DDM Operations (Hiro)

Release installation, validation (Xin, Kaushik)

The issue of validating process, completeness of releases on sites, etc.
  • last meeting
    • Michael: John Hover will open a thread with Alessandro to begin deploying releases using his method, which is WMS-based installation.
    • John's email will start the process today -
    • There will be questions - certificate to be used and account to be mapped to.
    • Charles: makes point that it would be good to have tools that admins could have to test releases. Will do this in the context of Alessandro's framework.
    • Xin is helping Alessandro test his system on the ITB site. WMS for the testbed at CERN is not working - i.e. ITB not reporting to the PPS BDII. Working this with GOC.
  • this meeting:
    • Need to publish the BNL ITB site into the production WLCG BDII in order to test. Reconfigured, and changed OIM.
    • Information has still not appeared on the WLCG side. Once available, then Alessandro can submit jobs.

Local Site Mover for Xrootd (Charles)

  • Specification: LocalSiteMover
  • this week if updates:
    • python wrappers for xrootd complete - not all functions are complete
    • working on the lsm script itself
    • have test xrootd instances setup for testing
    • hope to start testing at xrootd sites by this time next week.

Conditions data access from Tier 2, Tier 3 (Fred, John DeStefano)

  • Reference
  • last week(s)
    • There is a new Squid rpm - hesitant to recommend installation;
    • Structure is being changed - will need to de-install completely
    • If you're doing heavy squid customization - wait. If not, you can do the upgrade.
    • Fred has been testing the latest Frontier server; will be updating the launch pad.
    • Also testing new squid. All okay.
    • ATLAS does not have the resources to add new sites into the monitoring infrastructure. Relevant to Tier 3? CVMFS uses squid - already watched closely at the site level; therefore is there a need for central monitoring?
    • Follow-up on DNS user at AGLT2: - it was associated with one user's job, not resolving to a local host. ATLF variable - the string was used as a host name. No new information on reproducing it.
    • PFC corruption at HU - was affecting production jobs which it should never do. This file is not used, but it needs to exist and be in proper XML format. Hiro reported a problem with its content. Alessandro was working on a validity check in the software - thought this was done. Saul: it was not corrupted actually, but out of date. This is a general problem - the jobs which install this sometimes fail (dq2 failures), and this will affect running analysis jobs. Fred will discuss with Richard Hawking next week at CERN and report back. We need a consistency checker for this.
  • this week
    • New version of squid - recommended for deployment. See message from Dario.
    • AGLT2 updated - but got a slightly older version of the rpms. Needs to update.
    • Advice: make sure you stop running processes; uninstall old version before installing the new release.
    • Caution: Customizations will be overwritten. ACLs for example.
    • John will update US facility instructions - will test at MWT2.
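
The consistency checker asked for in the PFC discussion above could start from something like the sketch below. It relies only on what the minutes state: the file must exist and be well-formed XML, and an out-of-date copy is also a problem. The function name and the age threshold are hypothetical, and file modification time is used as a stand-in for "out of date", which may not be the right criterion in practice.

```python
# Hypothetical sketch of a pool file catalog (PFC) sanity check.
import os
import time
import xml.etree.ElementTree as ET


def check_pfc(path, max_age_days=None, now=None):
    """Return a list of problems found; an empty list means the file passed."""
    if not os.path.isfile(path):
        return ["missing: %s" % path]

    problems = []

    # The file is not read by jobs, but it must still parse as XML.
    try:
        ET.parse(path)
    except ET.ParseError as exc:
        problems.append("not well-formed XML: %s" % exc)

    # Optional staleness check, based on modification time (an assumption).
    if max_age_days is not None:
        current = time.time() if now is None else now
        age_days = (current - os.path.getmtime(path)) / 86400.0
        if age_days > max_age_days:
            problems.append("stale: last modified %.1f days ago" % age_days)

    return problems
```

A site could run such a check from cron or before turning a queue back on-line, and alert when the returned list is non-empty.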

Throughput Initiative (Shawn)

  • NetworkMonitoring
  • https://www.usatlas.bnl.gov/dq2/throughput
  • Now there is FTS logging to the DQ2 log page at: http://www.usatlas.bnl.gov/dq2log/dq2log (type in 'fts' and 'id' in the box and search).
  • last week(s):
    • Minutes:
    • From Jason: I am currently traveling and will be unable to make the meeting tomorrow. As an update, the first RCs of the next perfSONAR release were made available to USATLAS testers on Monday. Pending any serious issues, we expect a full release on 4/23.
    • A few sites have installed this - AGLT2 and MWT2_IU
    • No meeting this week - probably not next week either. 2 weeks from Tuesday.
  • this week:
    • From Jason:
      • The pSPT 3.1.3 was released on April 23rd (on time)
      • Tier1 and Tier2 should be upgrading as we speak
      • Report problems as always to the mailing list.
    • Notes from last meeting
      	USATLAS Throughput Meeting Notes --- April 20, 2010
      Attending:  Shawn, Sarah, Horst, Karthik, Jason, Philippe, Hiro, Aaron, Andy, Dave
      1) perfSONAR Issues
      	a) Next release candidate testing status (Jason):   Have RC3 out and available.   Latency node upgrade for RC3.  Release still on target for Friday (April 23)
      	b) Status of issues identified by perfSONAR at OU (Karthik/Jason):  Karthik working with Jason on asymmetry.  Outbound bandwidth much higher than input.  Karthik wrote some script following advice from Jason which verified the issue.  Jason recommended redoing manually to verify each test is valid.  Next step is to go deeper with more advanced tools.  Suggestions forthcoming from Jason once he has more time to analyze the data.   One issue is that if we need UDP testing it needs to be enabled by sites.   Recommendation is to setup UDP testing ONLY for Tier-2 performance nodes.  This means Tier-2 sites need to enable UDP testing from all the other Tier-2 performance nodes.  **ACTION ITEM**: Jason will send info on this configuration option.  **ACTION ITEM**: Shawn will provide the list of relevant IPs that Tier-2 sites need.   
           c) Status of issues identified by perfSONAR at UTA (Mark): No update but current monitoring shows the problem seems to still be there.  **ACTION ITEM**: Need to schedule some detailed exploration to try to isolate the problem.
           d) New issues, interesting results, problem reports?    Karthik reported on latency node missing "admin" info.   Philippe asked about latency testing (1-way)...since the DB is so big asking for the page frequently times out.  Plots also time out.   Known issue.  Jason: optimizing is underway but not for this release.  Issue at MWT2_UC with storage partition filling up.   Aaron will check the verbosity settings on the logs.   Also will clean out the /var/logs area.  Next release may help with this type of issue. 
      Ongoing item about perfSONAR.   Need an updated recommendation for "current" perfSONAR hardware for sites buying now (Tier-3s for example).    Would like to have a well defined Dell box (given ATLAS pricing) that sites can just order (fixed configuration).   **ACTION ITEM**: Shawn will try to customize a possibility and share with Jason.   Shawn/Jason can work with Dell on tuning the setup.
      2) Transaction testing for USATLAS.  Summary of status? (Hiro):   Hiro made 1000 files of 1MB each.  GridFTP2 bug is being worked on (Michael talked to Patrick about prioritizing resolving this). For now we could try testing to GridFTP1/dCache and/or BestMan sites.  **ACTION ITEM**: Hiro will try test to a site soon (AGLT2?)  **ACTION ITEM**: Longer term Hiro will setup histogram of "times" where 0.0 time is when the first of the 1000 files started transferring on the FTS channel.  Histogram for each "test" will have 1 entry per file (1000 files sent) showing the time that file completed transferring.   Have to get some experience with this new test to see how best to use it.  
      3) Site reports - (Round table...open for reporting, problems and questions)
           Illinois - Dave reported perfSONAR operating well.  Had sent results last time in email that look consistent with AGLT2's results.
      	MWT2_UC -  Aaron reported on network failures induced on Cisco 6509 from multiple clients hitting 1 server during testing/benchmarking. Good performance from new Dell nodes (part of the reason the Cisco switch was swamped!) The ~900 MB/sec Perc6/E with 1 MD1000 shelf (redundantly cabled). Aaron will provide link discussing setup and test results.  
      	AGLT2 - Bottleneck in NFS server used to host OSG home areas (especially usatlas1).   Same server hosts ATLAS software installs.   Looked at Lustre but not suitable to fix this problem.  Instead using Lustre to migrate away from NFS servers (Tier-3 storage).  Will be testing SSD(s) to replace home area storage for OSG.   Also exploring migrating ATLAS software into AFS.  SSD change will require site downtime to implement. 
      4) AOB - None
      Next meeting is in two weeks.  Look for email before the meeting.   Send along any agenda items you would like to see added.
      Corrections or additions can be sent to the email list.  Thanks,
    • Focus is on getting new perfsonar deployed.
    • Working on getting a new Dell perfSONAR platform defined
    • Will spend time during meetings to track down perfsonar issues
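
The transaction-test histogram described in item 2 of the meeting notes above (time 0.0 is when the first file starts transferring, one entry per file at its completion time) could be built along these lines. This is a sketch under assumptions: the function name, the input format of (start, end) pairs in seconds, and the 10-second bin width are all hypothetical, and it does not read real FTS logs.

```python
# Hypothetical sketch of the per-file completion-time histogram.

def completion_histogram(transfers, bin_width_s=10.0):
    """Bin file completion times relative to the first transfer start.

    transfers: iterable of (start_time, end_time) pairs in seconds.
    Returns {bin_index: count}, where bin i covers
    [i * bin_width_s, (i + 1) * bin_width_s) seconds after time zero.
    """
    if bin_width_s <= 0:
        raise ValueError("bin_width_s must be positive")

    transfers = list(transfers)
    if not transfers:
        return {}

    # Time zero is the moment the first file of the test started transferring.
    t0 = min(start for start, _ in transfers)

    counts = {}
    for _, end in transfers:
        index = int((end - t0) // bin_width_s)
        counts[index] = counts.get(index, 0) + 1
    return counts
```

For a 1000-file test the bin counts sum to 1000, and how the entries spread across bins shows how quickly the channel worked through the batch.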

Site news and issues (all sites)

  • T1:
    • last week(s):Testing of new storage - dCache testing by Pedro. Will purchase 2000 cores - R410s rather than high density units, ~ six weeks. Another Force10 coming online 100 Gbps interconnect. Requested another 10G link out of BNL - for the Tier 2s. Hope ESnet will manage the bw's to sites well. Fast track muon recon running for the last couple of days, majority at BNL (kudos); lsm by Pedro now supporting put operations - tested on ITB. CREAM CE discussion w/ OSG (Alain) - have encouraged him to go for this and make available to US ATLAS as soon as possible.
    • this week: had a good weekend - storage worked perfectly. reprocessing campaign went very well - completed our shares ahead of others. evaluating DDN storage system, FC to four front end servers, 1200 disk system, 4 GB/s writes, less for reads, all using dcache, 2PB useable disk behind the four servers; would make 6PB in total for the storage system; wn purchase underway, 2K cores; putting Pedro's lsm into production. Pedro's lsm uses gsidcap rather than srm to put data into the SE. Should we consider Pedro's lsm at the other dcache sites? Note he has added additional failure monitoring; will ask Pedro for a presentation at this meeting;

  • AGLT2:
    • last week: Lustre in VM going well (v1.8.2). Now have a Lustre deployment going here - looking to replace multiple NFS servers (home, releases, osg home, etc). Getting experience. Will start to transfer to use it and evaluate. A Dell switch stack is not talking properly to the central switch in the infrastructure. Have had incidents of dropped net connectivity; may require a reboot. Tom: tuned the swappiness parameter on an SL5 dCache machine (16 GB RAM, 4 pools). With the default kernel setting the machine wanted to swap, which seemed to affect performance; turning it down from 60 to 10 improved things and the machine stopped swapping. Charles notes there is an overcommit variable.
    • this week: Order for new storage at MSU; Tom: 50K order; 6 GB SAS; MD1200 shelve; 12 vs 3.5 drives; two servers, 8 shelves, 2 TB drives; nearline configuration (SATA disks with dual port SAS frontend; seagate). 27% more per useable TB in this configure. (MD1000's still best price $/TB, 15 drives vs 12). Dell will be updating portal with 6 core 5620s Westmere on Friday, 24 GB. Does this change any of the pricing for the previous generation 5500s? Switching Tier 3 storage over to Lustre. Sun NAS running ZFS for VO home dirs, much better performance.

  • NET2:
    • last week(s): Filesystem problem turned out to be a local networking problem. HU nodes added - working on ramping up jobs. Top priority is acquiring more storage - will be Dell. DQ2 SS moved to BNL. Shawn helped tune up perfsonar machines. Moving data around - ATLASDATADISK seems too large. Also want to start using pcache. Built new NFS filesystem to improve performance. Installed pcache at HU - big benefit. Addressed issues with Condor-G from Panda. Ramped HU all the way up; major milestone in that all systems running at capacity. Gatekeepers are holding up - even with 500 MB/s incoming from DDM; interactive users.
    • this week: space situation is top priority. About to do storage upgrade - purchase 360 TB raw per rack. Improving network between HU and BNL (there was a 1G limit).

  • MWT2:
    • last week(s): New storage online - 5 of 7 systems. Cisco backplane failures caused by intensive transfer testing; investigating w/ Cisco.
    • this week: 25 TB of dark data have appeared recently - most of this is in proddisk. These are datasets in dq2. Armen: this is happening in other clouds as well. However is this a US-only issue? (same space token?) Charles will follow-up with the list.

  • SWT2 (UTA):
    • last week: Problems with loading xrootd from analysis jobs - too few file descriptors. Upgraded xrootd on that data server. Now stable and useable. Limiting analysis jobs for the time being. DDM is working fine at a reduced scale. Progressing. Limit number of threads as per Wei's suggestions. May roll out xrootd onto other servers. Hot data server taking most of the new data. Also preparing another 4 data servers, 400 TB.
    • this week: probs with analysis transformations - that run event generation; came down to python version used, fixed by re-installing 32bit wn-client; NFS issue; new xrootd service. DDM transfers were failing - tracked down to an MD1000 being rebuilt. 200 TB, 52 8-core worker nodes ordered.

  • SWT2 (OU):
    • last week: Agreed w/ Dell for install date, April 26. Expect a two week downtime. Have upgraded perfsonar to the latest.
    • this week: Dell on-site installing nodes; by end of next week expect to be online again.

  • WT2:
    • last week(s): New HC tests throughput is much much better - not sure it's to be believed.
    • this week: All is well; found an issue where a lot of jobs were reading from a single data server; security team has approved deploying perfsonar, finally. Planning for next storage purchase. Considering a local vendor (based on supermicro mobo). Setting up a PROOF cluster.

Carryover issues (any updates?)

VDT Bestman, Bestman-Xrootd


  • last week
  • this week

-- RobertGardner - 27 Apr 2010
