
MinutesFeb18

Introduction

Minutes of the Facilities Integration Program meeting, Feb 18, 2009
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • new phone (309) 946-5300, Access code: 735188; Dial *6 to mute/un-mute.

Attending

  • Meeting attendees: Michael, Rob, Shawn, Nurcan, Wei, Fred, Patrick, Torre, John, Armen, Mark, Saul, ..., Kaushik, Horst, ...
  • Apologies: None
  • Guests: Rich

Integration program update (Rob, Michael)

At last week's meeting we discussed finding a timeslot for a regular meeting
to cover site-level data management issues in depth, since our Wednesday
meeting is already pretty long and doesn't leave time for them. I said I'd
get the ball rolling with an email polling for days/times.

Note we have three (US facility) weekly meetings to avoid:

- Bi-weekly Tuesday, 11 am EST (Facility analysis queue performance)
- Weekly Tuesdays, 3 pm EST (Throughput)
- Wednesdays, 1pm EST (Computing operations/Integration)


I presume this meeting would cover topics such as those below:

* Storage validation (filesystem, LFC, DQ2 catalog)
* Data placement policies, use of space tokens
* Data transfer problems
* Datasets required at sites for analysis, datasets to be deleted, etc
* User & group dataset policies
* Storage capacities required vs space tokens
* Storage capacities reporting (e.g. WLCG)

Questions:

- is this the right group of people (a mix of operations, sites, tool
developers)?
- are the topics above the right ones?
- any preference for a day/time?

Operations overview: Production (Kaushik)

  • last meeting(s):
    • Working on job submission to HU. Problems at BU - perhaps missing files. John will work the issues w/ Mark offline.
    • Pilot queue data misloaded when scheddb server not reachable; gass_cache abused. Mark will follow-up with Paul. (carryover)
    • Retries for transferring files & job recovery - pilot option. Kaushik will follow-up with Paul.
    • Pilot problems introducing the Adler32 changes at SWT2; checksums stored in the LFC are wrong
    • Backlog of transfers to BNL - across several Tier 2 sites - not understood.
    • End of month - reprocessing, and DDM stress test
  • this week:
    • Slowness in US ramp-up traced to a Pandamover queue problem - solved.
    • There was a (demonstration) Panda security incident last week requiring a large number of changes (which caused some pilot problems); the pilot now uses secure curl and https (a sketch follows this list). The change was made in a rush, and some notifications didn't go out. Plugging other holes to prevent malicious text insertions into the monitoring database.
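
A minimal sketch of the kind of call implied by the "secure curl and https" pilot change above, using a grid proxy as the client credential; the server URL, endpoint, and form fields here are illustrative assumptions, not the actual Panda server API.

# Hedged sketch: POST to a Panda-server-like https endpoint with curl, using a
# grid proxy for client authentication. URL and fields below are hypothetical.
import os
import subprocess

def https_post(url, fields, proxy=None, capath="/etc/grid-security/certificates"):
    """POST form fields to an https URL via curl with X.509 proxy credentials."""
    # Fall back to the conventional grid proxy location if none is given.
    proxy = proxy or os.environ.get("X509_USER_PROXY", "/tmp/x509up_u%d" % os.getuid())
    cmd = [
        "curl", "--silent", "--show-error", "--fail",
        "--cert", proxy,      # the proxy file serves as the client certificate...
        "--key", proxy,       # ...and also holds the matching private key
        "--capath", capath,   # directory of trusted CA certificates
        "--data", "&".join("%s=%s" % item for item in fields.items()),
        url,
    ]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

# Example (hypothetical endpoint and fields):
# https_post("https://pandaserver.example.org:25443/server/panda/updateJob",
#            {"jobId": "1234", "state": "running"})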

Shifters report (Mark)

  • Distributed Computing Operations Meetings
  • last meeting:
    • md5sums were overriding adler32 checksums, as discussed above (see the checksum sketch at the end of this section)
    • old temp directories not getting cleaned up correctly at BNL - Paul fixed
    • UTD - working on bringing this back up. Need to re-install transformations.
    • Late Saturday night/Sunday there was a lack of pilots - not sure what alleviated the problem.
    • all sites have been migrated to new condor submit hosts
    • Missing libraries - corrupt release at OU: 14.2.0; Xin will follow-up.
    • Large number of checksum errors at MWT2 - caused by a central subscription - trouble ticket submitted.
  • this meeting:
    • SWT2 seeing some I/O load problems with some tasks
    • UTD problems - look like LFC permissions issues
    • Pilot code updates
    • MWT2 dcache upgrade issues
    • Task 4380 - large number of failed jobs at SLAC; the transformation was missing, but the pilot then seems to have installed it successfully on the fly - though not in all cases.
    • FTS upgrade tomorrow at BNL

 
Sorry - did not post Yuri's email.
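
Since adler32 versus md5 mismatches come up in several items above, here is a minimal sketch of computing the zero-padded 8-character hex adler32 string in the form typically stored for replicas; the chunk size and the comparison at the end are illustrative assumptions.

# Minimal sketch: streaming adler32 of a file, formatted as the 8-hex-digit
# string conventionally recorded in file catalogs. Chunk size is arbitrary.
import zlib

def adler32_of_file(path, chunk_size=4 * 1024 * 1024):
    value = 1  # adler32 is defined to start from 1
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            value = zlib.adler32(chunk, value)
    return "%08x" % (value & 0xFFFFFFFF)  # mask keeps the value unsigned everywhere

# Illustrative use: compare against the catalog value before declaring a file bad.
# if adler32_of_file("/data/somefile.root") != checksum_from_catalog:
#     print("checksum mismatch")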

Analysis queues, FDR analysis (Nurcan)

  • Analysis shifters meeting on 1/26/09
  • last meeting:
    • This week there were lots of problems reported by users submitting pathena jobs - actually a catalog issue with container datasets.
    • BNL queues now equally divided.
    • Brokering changing to distribute load to more Tier 2s
    • With the http-LFC interface, a local Condor submitter can be used for analysis queues
  • this meeting:
    • Analy queue performance meeting next week.
    • TAG selection jobs that don't work at some sites
    • Discuss performance of Condor on the submit host - have noticed pilots waiting a long time at the BNL LONG and SHORT analysis queues (see the sketch at the end of this section). Need Xin.
    • Nurcan would like to run a stress test in advance of March software week. Reprocessed DPDs should be available.
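
A hedged sketch of one way the submit-host wait times mentioned above could be spot-checked: ask condor_q for the queue time of idle jobs and report the longest wait. QDate and JobStatus are standard Condor job attributes; treating every idle job as a waiting pilot is an assumption.

# Hedged sketch: report how long idle jobs (assumed here to be pilots) have
# been sitting in the local Condor queue. JobStatus == 1 means "idle".
import subprocess
import time

def idle_wait_times():
    out = subprocess.run(
        ["condor_q", "-constraint", "JobStatus == 1",
         "-format", "%d\n", "QDate"],          # QDate = submission time (epoch seconds)
        capture_output=True, text=True, check=True).stdout
    now = time.time()
    return [now - int(qdate) for qdate in out.split()]

waits = idle_wait_times()
if waits:
    print("idle jobs: %d, longest wait: %.0f minutes" % (len(waits), max(waits) / 60.0))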

Operations: DDM (Hiro)

  • last meeting(s):
    • New DDM monitor up and running (dq2ping); testing with a few sites. Test files can be cleaned up with srmrm (see the sketch at the end of this section). Plan to monitor all the disk areas, except proddisk.
    • Another 10M transfer jobs planned - mid-Feb. During this phase there will be real throughput tests combined with the stress tests. And planning to include the Tier 2's.
    • Proxy delegation problem with FTS - the patch has been developed and is in the process of being released. It requires FTS 2.1; a back-port was done, but it is only operational on SL4 machines. We would need to carefully plan migrating to this.
    • BNL_MCDISK has a problem - files are not being registered. New DQ2 version coming up the end of the week which will hopefully fix this.
    • BNL_PANDA - many datasets are still open. Is this an operations issue?
    • Pedro: there may be precision problems querying the DQ2 catalog. Will check creation date of the file.
    • Note - all clouds have jobs piling up in the red category.
  • this meeting:
    • Fixed problems with file check sums at two sites.
    • BNL_MCDISK still has slow registrations - still waiting for new DQ2 version.
    • IU and AGLT2 have problems.
    • Mostly back to normal.
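
A minimal sketch of the srmrm-based cleanup of dq2ping test files mentioned above; the SURL list file and the example SURL format are assumptions about how a site would feed it.

# Minimal sketch: remove test files (e.g. left by dq2ping) with srmrm, one SURL
# at a time so a single failure does not abort the whole cleanup.
import subprocess

def cleanup(surl_list="dq2ping_test_surls.txt"):
    with open(surl_list) as f:
        surls = [line.strip() for line in f if line.strip()]
    failed = [s for s in surls if subprocess.run(["srmrm", s]).returncode != 0]
    print("removed %d, failed %d" % (len(surls) - len(failed), len(failed)))
    return failed

# Example SURL format (site-specific, illustrative only):
# srm://dcsrm.example.org:8443/srm/managerv2?SFN=/pnfs/example.org/dq2ping/test_0001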

Storage validation

  • See new task StorageValidation
  • last week:
    • AGLT2 - what about MCDISK (now at 60 TB, 66 TB allocated)? These subscriptions are central subscriptions - should be AODs. Does the estimate need revision? Kaushik will follow-up.
    • Need a tool for examining token capacities and allocations. Hiro working on this.
    • Armen - a tool will be supplied to list obsolete datasets. Have been analyzing BNL - Hiro has a monitoring tool under development. Will delete obsolete datasets from Tier 2's too.
    • ADC operations does not delete data in the US cloud - only functional tests and temporary datasets. Should we revisit this? We don't know what the deletion policy is, but we'd like to off-load to central operations as appropriate.
    • proddisk-cleanse questions - may need a dedicated phone meeting to discuss space management; more tools becoming available.
  • this week:
    • Discussing data deletion procedures for users. - Armen.
    • Deletions not propagating to HPSS. Looks like there's a solution now.
    • Deletions at BNL were incorrectly marked "deleted" in the central DQ2 catalog.
    • Timeframe for a weekly meeting: Tuesday 3pm Central (see scope above)

VDT Bestman-Xrootd

  • See BestMan page for more instructions & references
  • last week
    • Doug: waiting on hardware at BNL. Will report next week.
    • Horst - has installed the newest version of bm-gateway, version i2.
    • Wei - there are new features for space querying, installed on the production service (version 2.2.1.2.i2)
    • Armen - note this is the version which monitors space tokens (usage, available space)
    • Doug - there are problems with the instructions; send feedback to osg-storage.
    • Horst having difficulty posting to osg-storage.
  • this week
    • Horst - sending feedback now to osg-storage.
    • Wei providing updates to documentation
    • Doug having lots of problems at Duke

Throughput Initiative (Shawn)

  • Notes from meeting this week:
  
                USATLAS Throughput Call Meeting Notes
                =====================================

February 17th, 2009,  3 PM Eastern

Attending:  Shawn, Neng, Sarah,  Rich,  Karthik, Jay, Hiro, Wei,  John

perfSONAR status:  An Excel spreadsheet has been sent out listing USATLAS perfSONAR installation status to specific site contacts.  This needs to be filled out and returned ASAP.  As per last week's call, Jason Zurawski (I2) sent some information about possible examples for the perfSONAR web interface.  See the following URLs:

Example matrix of bandwidth results: http://ndb1.internet2.edu/cgi-bin/bwctl.cgi?name=OFFICEMESHBWTCP4
Sites advertising LHC membership:  http://ndb1.internet2.edu/cgi-bin/perfAdmin/LHC.cgi 
Sites advertising USATLAS membership: http://ndb1.internet2.edu/cgi-bin/perfAdmin/USAtlas.cgi 

The last two URLs point up a problem.  We need a consistent way to identify USATLAS perfSONAR instances.  On the call (based upon input from Rich Carlson) we agreed that each USATLAS site should use “LHC USATLAS” as its Community Keyword (group identifier) while (re)configuring perfSONAR.  (In my email earlier I mentioned “LHC/USATLAS” but instead it should be “LHC USATLAS”.)  Please check your installation and update it if needed to use this string.

By default perfSONAR installations don’t schedule any automatic tests.   To be more useful for USATLAS we need sites to configure testing with their “Peers”. I am requesting a “volunteer” for trying to configure their site to test against USATLAS peers.   This involves following the URL (also useful as a guide when setting up the initial system): http://code.google.com/p/perfsonar-ps/wiki/NPToolkitQuickStart  and then setting up tests to specific USATLAS sites via: http://code.google.com/p/perfsonar-ps/wiki/NPToolkitPerfSONARBUOY   and  http://code.google.com/p/perfsonar-ps/wiki/NPToolkitPingER .  The volunteer would document their steps on a web page and make it available for the rest of the USATLAS sites to follow.    MWT2_UC and AGLT2_UM were discussed as possible “volunteers”.  

The longstanding network issue between AGLT2 and BNL was resolved last week with help from Mike  O’Connor (ESnet/BNL).   The problem was eventually tracked to a bad fiber jumper on the path from USLHCnet to ESnet at StarLight.   Roughly 10% of all packets were being (randomly?) dropped on the USLHCnet -> ESnet direction (on input to ESnet).   After the bad patch was replaced packet loss went to ~0 and network performance was significantly improved.

We had a discussion about the adequacy of our current monitoring and  our near term plans for improving our ability to track throughput and issues with throughput.   Here are the plans:

 1)	Implement a consistent perfSONAR infrastructure for  USATLAS
      a.	Include “mesh-like” testing between Tier-n sites
      b.	Improve access to perfSONAR data (new frontpage for perfSONAR, Jay’s additions for graphing/display/monitoring)
 2)	Document existing monitoring (BNL Netflow at http://netmon.usatlas.bnl.gov/netflow/tier2.html  , BNL  FTS info at: https://www.usatlas.bnl.gov/fts ,  ATLAS dashboard at   http://dashb-atlas-data.cern.ch/dashboard/request.py/site , example dCache monitoring at http://head01.aglt2.org/billing/xml/local_pool_rates, others?).  
 3)	Implement standard dataset for disk-to-disk tests between BNL and the Tier-n sites.   This will require help from Hiro to define the proper way to implement the data copying and extract the results.   One example is to create a dataset of 40 2GB files that can be moved on a regular schedule from BNL to the Tier-2s.   BNL->Tier-2A, then 4 hours later, BNL->Tier-2B, etc., and just loop over all destinations.   Jay can also help to create graphs for the results (and for Hiro's DQ2 ping tests when they are ready).   We are interested in the average file transfer rate, min, max and total time to transfer the dataset.   This can be tracked over time and would be an initial indicator of either improvements or problems with throughput between the sites.   A goal is to have a prototype of this data movement test in place for next week's call.  (A sketch of this copy-and-measure loop follows the list.)
 4)	Add/iterate to the above as determined by our experience using the infrastructure…
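
A sketch of the copy-and-measure loop described in item 3, staggering destinations by a few hours and recording min/max/average per-file rates and the total dataset time. The destination names are illustrative, and transfer_file() is only a placeholder for whatever DQ2/FTS copy mechanism Hiro defines.

# Sketch of item 3: move a fixed reference dataset (40 x 2 GB files assumed)
# to each Tier-2 in turn and record simple throughput statistics.
import time

DESTINATIONS = ["AGLT2", "MWT2_UC", "NET2", "SWT2_UTA", "WT2"]   # illustrative list
N_FILES, FILE_SIZE_GB, GAP_HOURS = 40, 2.0, 4

def transfer_file(dest, index):
    """Placeholder: copy reference file <index> from BNL to <dest>, return seconds taken."""
    raise NotImplementedError("replace with the agreed DQ2/FTS copy mechanism")

def test_destination(dest):
    rates, start = [], time.time()
    for i in range(N_FILES):
        seconds = transfer_file(dest, i)
        rates.append(FILE_SIZE_GB * 8.0 / seconds)        # per-file rate in Gb/s
    return {"dest": dest, "total_s": time.time() - start,
            "min": min(rates), "max": max(rates), "avg": sum(rates) / len(rates)}

for dest in DESTINATIONS:
    print(test_destination(dest))
    time.sleep(GAP_HOURS * 3600)                           # stagger destinations as above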

Hiro raised the point that having a bad data transfer test result doesn’t really isolate the problem.   What the infrastructure above does provide is a way to isolate network issues from end-site issues.  Having a specific time-window when a problem started also helps to determine the cause.  Those on the call agreed that the best way to proceed is step-wise (iteratively).   We will get this system in place and then determine if there is additional monitoring that could aid in isolating problems.

No time for site reports.  We plan to meet again next week at the usual time.

Please send along edits/additions to my notes.   Thanks,

Shawn

  • last week:
    • Meeting focused mainly on perfsonar
    • perfSONAR is not intuitively useful the way we have it deployed
    • The web interface is not immediately useful. Logging to site-level systems (Nagios, syslog-ng) could be set up.
    • Need to setup tests between sites.
    • New UC & BNL circuit now in place.
  • this week:
    • USATLAS_perfSONAR_Status.ppt: Presentation on USATLAS perfSONAR Status
    • Timeline
      • Existing perfSONAR sites should reconfigure their “Communities of Interest” this week (LHC USATLAS)
      • AGLT2_UM will document steps needed to setup regular USATLAS peer tests for perfSONAR by next throughput meeting.
      • Hiro (& Jay for graphics?) will create a prototype standardized dataset copy test to measure end-to-end throughput by the next meeting.
      • Missing perfSONAR sites need to update the spreadsheet to provide their timelines ASAP
    • Rich: Also discussing Nagios and syslog-ng extensions for monitoring these boxes
    • Patrick: note that "LHC USATLAS" is two communities of interest
    • Michael: Last week's reviewers felt perfSONAR is a good tool for monitoring our infrastructure.

Site news and issues (all sites)

  • T1:
    • last week: no report
    • this week: problems with analysis jobs over the weekend due to a large backlog of stage-out requests; Xin debugged this on Monday, and the queue moved to a more powerful machine so it is better served now. dCache upgrade yesterday (1.9.09). Backend Oracle work for FTS and LFC. We need a new version of DQ2 site services; this becomes a pressing issue because of the registration backlog. Deployment of 30 Thors in progress, 10G NIC driver matches; 5 units to be installed by week's end. FTS migration to version 2.1 this week. Network - making progress with dedicated circuits: the first, UC-BNL, is in place; now working on BNL-BU, discussion on Friday (need another meeting next week). Next would be AGLT2.

  • AGLT2:
    • last week: migrating dCache files off compute nodes to large servers; trouble bringing a pool online - possible migration side effect - solved (increased Java memory). Wenjing looking into database configurations. Large transfer backlog - probably not a local problem. dCache version 1.8-11-15. BNL is upgrading to 1.9.
    • this week: Poor WAN performance issue tracked down with help from Mike O'Connor: 10% packet loss, rate independent, caused by a bad fiber jumper between USLHCnet and ESnet at StarLight - now fixed, with packet loss down to ~0. Dzero transfers to FNAL down from hours to seconds. Throughput test back to BNL still needs work. dCache maintenance - removed pools on compute nodes, now using a Berkeley DB database for metadata. Dell switch stack at MSU again causing problems. Upgraded the Rocks install for nodes. Frontier tests - 700 running processes to reach saturation (caused a server process crash at BNL).

  • NET2:
    • last week: Still working on storage. John: HU functioning okay. Have a problem w/ high gatekeeper loads. Xin notes an install job has been running for over a day.
    • this week: BU (Saul): 224 new Harpertown cores have arrived, to be installed. New storage not yet online - hardware problems with the DS3000s (IBM working on it). HU (John): the gatekeeper load problems of last week were related to polling old jobs; fixed by stopping the server and removing old state files. Also looking into Frontier. Frontier evaluation meeting on Fridays at 1pm EST run by Dantong (new mailing list at BNL). Fred notes BDII needs to be configured at HU.

  • MWT2:
    • last week: One day downtime tomorrow for dCache upgrade. BNL-UC circuit established today.
    • this week: Upgraded dCache at both sites. Processes on old pools didn't shut down properly; these errors got mopped up. Pilot changes and problems with curl as distributed in workernode-client.

  • SWT2 (UTA):
    • last week: Adler32 issue discussed above.
    • this week: xrootd site mover change last week orphaned 5000 jobs - corrected with Hiro's help. Now running smoothly. Updated perfSONAR boxes to include the communities of interest.

  • SWT2 (OU):
    • last week: Still not much progress with getting OSCER into production - needs to consult with Paul.
    • this week: OSCER - still working w/ Paul (up to 500 cores). Need to talk about this in detail.

  • WT2:
    • last week: GUMS server HD failed. Installed backup GUMS.
    • this week: checksum issue resolved. Migrating the LFC database to a dedicated machine run by the database group, to improve reliability. Still working on the network monitoring machines, but this will be postponed until April.

Carryover issues (any updates?)

Pathena & Tier-3 (Doug B)

  • Last week(s):
    • Meeting this week to discuss options for a lightweight panda at tier 3 - Doug, Torre, Marco, Rob
    • Local pilot submission, no external data transfers
    • Needs http interface for LFC
    • Common output space at the site
    • Run locally - from pilots to the Panda server. A Tier 3 would need to be in Tiers of ATLAS (needs to be understood)
    • No OSG CE required
    • Need a working group of the Tier 3's to discuss these issues in detail.
    • http-LFC interface: Charles had developed a proof-of-concept setup. Pedro has indicated willingness to help - pass on knowledge of the Apache configuration and implement Oracle features.
  • this week
    • http-LFC interface work
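
A heavily hedged sketch of what a client of the http-LFC interface mentioned above might look like; the host name, URL path, query parameter, and JSON response shape are all hypothetical, since the real interface is whatever the Apache proof of concept settles on.

# Hypothetical sketch: ask an http front-end to the LFC for the replicas of a
# logical file name. Everything about the endpoint below is assumed.
import json
import urllib.parse
import urllib.request

HTTP_LFC = "http://lfc-http.example.org/lfc"        # hypothetical front-end

def replicas_for(lfn):
    url = "%s/replicas?%s" % (HTTP_LFC, urllib.parse.urlencode({"lfn": lfn}))
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)                      # assumed: JSON list of replica SURLs

# Illustrative use:
# for surl in replicas_for("/grid/atlas/user/somefile.root"):
#     print(surl)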

Release installation via Pacballs + DMM (Xin, Fred)

  • last week: Next week - full production. Discussing with Alessandro switching the portal from development to production. Also code not checked in. Also need to publish existing releases.
  • this week: Can run in production mode now, but there are two things to finish: moving the log files of install jobs to a permanent location, and publication to the EGEE portal. Installation pilots are hanging at OU.

Squids and Frontier (Douglas S)

  • last meeting(s):
    • Harvard examining use of Squid for muon calibrations (John B)
    • There is a twiki page, SquidTier2, to organize work at the Tier-2 level
    • Douglas requesting help with real applications for testing Squid/Frontier
    • Some related discussions this morning at the database deployment meeting here.
    • Fred in touch w/ John Stefano.
    • AGLT2 tests - 130 simultaneous (short) jobs; looks like a 6x speed-up. Also doing tests without squid.
    • Wei - what is the squid cache refreshing policy?
    • John - BNL, BU conference
  • this week:
    • Dantong will report on weekly Friday meeting

Local Site Mover

AOB

  • Direct notification of site issues from GGUS portal into RT, without manual intervention. Fred will follow-up. - next week.
    • Fred believes this is happening correctly now - tickets are being marked
    • Armen will think about what to do with mis-assigned tickets.
  • Wei: questions about the upcoming release 15 - which platforms (SL4, SL5) and gcc 4.3. Kaushik will develop a validation and migration plan for the production system and facility - will follow up.


-- RobertGardner - 16 Feb 2009

Attachments


USATLAS_perfSONAR_Status.ppt (625.5K) | ShawnMckee, 18 Feb 2009 - 11:15 | Presentation on USATLAS perfSONAR Status
 