r4 - 29 Nov 2007 - 21:28:58 - RobertGardner


Day 2


Minutes of the Tier2/Tier3 workshop at SLAC, November 28-30, 2007

Fabric: processing - Chris Hollowell

  • See slides
  • Vendor selection process, hardware evaluation
  • Running 32-bit Linux for a variety of reasons
  • SLC4 support is expected through 2009
  • PXEUtils - see HEPiX presentation

Fabric: storage - Robert Petkus

  • Storage requirements are outpacing computing requirements
  • Evaluating distributed storage options - prefer high density vertical disk arrays
  • See Sun Fire x4500
  • What RAID levels are supported? Recommend RAID6; RAID5 has very long rebuild times.
  • Want simple, plug-and-play setups; lights-out management and built-in monitoring
  • Purchased 28 Sun Fire thumpers in 4 APC racks w/ metered two-phase 30 A PDU
  • Need UFS file system for dCache data and OS; 16 TB usable space on a 24 TB system.
  • Performance profile of the x4500 - yet to tax the system. Easily > 400 MB/s R/W with RAID6. Room to grow with this system.
  • Controlled dccp tests.
  • Channel bonding works as advertised. ZFS is a high-performance filesystem that is easy to set up and stores each block's checksum in its parent block pointer.
  • Caution about increasing the number of inodes, due to the very large number of dCache control data files.
  • An fmadm/SNMP framework is needed to provide alerts for disk failures
  • In 2008 need to purchase 2 PB of disk. Worried that CPU and memory are not being addressed in the next generation.
  • DDN - fast hardware RAID6 and cheap, less than $1/GB. Must be configured as 6+2; eight 5U shelves.
  • See Tier3 recommendations
  • Questions
  • GFS? Free from Red Hat for distributed file systems.
  • A non-blocking GigE network is needed.
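
The quoted capacities follow from simple RAID6 arithmetic: each N-disk RAID6 group loses two disks to parity. The sketch below is illustrative only; the 4+2 group size for the Thumper is an assumption chosen to reproduce the quoted 16 TB usable on 24 TB raw, while the 6+2 grouping is stated for the DDN above.

```python
def raid6_usable_tb(raw_tb, group_size):
    """Usable RAID6 capacity: each group of `group_size` disks loses 2 to parity."""
    return raw_tb * (group_size - 2) / group_size

# x4500: 24 TB raw; 4+2 groups (an assumption) reproduce the quoted 16 TB usable
thumper_usable = raid6_usable_tb(24, 6)

# DDN: configured as 6+2 (8-disk groups), so 3/4 of raw capacity is usable
ddn_fraction = raid6_usable_tb(1, 8)
```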

Fabric Services: Nagios monitoring - Tomasz Wlodek

  • See slides
  • Plan to split Nagios services into two instances, separating internal RACF services from "external" Tier2/3, OSG, etc.
  • See https://web.racf.bnl.gov/nagios/
  • Service dependencies - so that a failure in one component does not generate a plethora of dependent alarms.
  • False alarms - many generated by network firewall issues - addressed with additional network cards.
  • Demo - going through Nagios pages.
  • See new features for extended information for a specific service. Has links into RT page.
  • Operations to know: schedule a downtime, schedule a test, acknowledge a problem.
  • Resolve problem in RT and then in Nagios, and force a final test.
  • Need to formalize Nagios operations - go to the tactical overview to acknowledge the problem, take ownership of the ticket, resolve, close, etc.
  • Integration of RSV probes into Nagios
  • Options - run as a Nagios plugin.
  • Need feedback from users. What/who/notification policy (email/pager). Use of event handlers - want/need?
  • Question about automatic assignment of tasks from RT that may not be "owned" by a Tier2 admin, e.g. Tier3 problems. Answer: Tier3 points of contact must receive the tickets.
  • Need to provide feedback from Tier2 admins and shift crew.
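
Running a probe as a Nagios plugin means following the standard plugin protocol: print one status line and exit with 0/1/2/3 for OK/WARNING/CRITICAL/UNKNOWN. A minimal sketch (the specific check and its thresholds are hypothetical, not from the talk):

```python
# Standard Nagios plugin exit codes
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

def check_pool_free(percent_free, warn=20, crit=10):
    """Nagios-style check: return (exit_code, status_line) for a pool's free space."""
    if percent_free < crit:
        return CRITICAL, "CRITICAL - only %d%% free" % percent_free
    if percent_free < warn:
        return WARNING, "WARNING - only %d%% free" % percent_free
    return OK, "OK - %d%% free" % percent_free

code, line = check_pool_free(15)
print(line)  # Nagios reads this status line together with the exit code
```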

Network optimization and load tests - Dantong Yu

  • See slides
  • Network configuration slide - which way does the traffic flow (T1-T1, T2-T1) - is there a bottleneck introduced by the firewalls?
  • Load testing goals - a ten-fold increase over current performance.
  • Looking at details between BNL and AGLT2.
  • Setting up Tier2 dynamic network links BNL to NET2 using Terapaths. What are the policy issues?
  • Rich Carlson - explaining policy from Internet2 perspective. There is an infrastructure for setting up and tearing down a reserved slice of the bandwidth.
    • Would like to work out what we can do with reserved bandwidth for AOD replication.
    • Nebraska was able to move data at 9 Gbps and isolate it from their campus traffic.
    • What about Esnet - Internet2 connections? There is on-going work in this area.

Fabric Services: Storage management - dCache - Ted Hesselroth (Fermilab)

  • Ted is the OSG Storage group leader - there are meetings and support for dCache related technical issues
  • See slides for dCache introduction
  • Installation demo
  • All nodes need host certs, even pool nodes
  • dry-run install option allows for checks of install conditions
  • Requires Java 1.5
  • Going over the installation; configuration is handled in one file.
  • Authorization is site-dependent and must be set up by hand post-install (dcache.kpwd and dcachesrm-gplazma.policy)
  • See monitoring - http://tier2-d2.uchicago.edu:2288/
  • Working through example using a vdt client to copy a file into the dcache
  • The replica manager allows resilient pools
  • SRM v2.2 access latency and retention policy - defined in the WLCG MOU. Mappings for TapeXDiskY. When making a reservation you can specify these attributes.
  • Link groups are used to capture attribute specifications for space reservations.
  • Question from Doug: dCache appropriate for a small Tier3? Concerns about equipment overhead for admin services and doors.
  • SRM on filesystems - Bestman from VDT is available as well.
  • Question from Wu: which protocol is used for pool-to-pool transfers? Answer: dcap.
  • Question about queues - both Tomcat and SRM have queues. Tomcat, for example, can open up to 100 threads. Queue depth in SRM depends on how busy the pools are.
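
The TapeXDiskY mappings mentioned above pair a WLCG storage class with the SRM v2.2 (RetentionPolicy, AccessLatency) attributes that a space reservation specifies and a link group matches. A sketch of the standard mapping:

```python
# WLCG storage classes expressed as SRM v2.2 (RetentionPolicy, AccessLatency) pairs
STORAGE_CLASSES = {
    "Tape1Disk0": ("CUSTODIAL", "NEARLINE"),  # custodial copy on tape, staged to disk on demand
    "Tape1Disk1": ("CUSTODIAL", "ONLINE"),    # custodial copy on tape, replica kept on disk
    "Tape0Disk1": ("REPLICA", "ONLINE"),      # disk-only replica
}

def srm_attributes(storage_class):
    """Return the (RetentionPolicy, AccessLatency) pair to request in a reservation."""
    return STORAGE_CLASSES[storage_class]
```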

Optimizing data transfers - Shawn McKee

  • See slides
  • Goal is to get 200 MB/s T1-T2 for all sites
  • Disk-to-disk tests are poor for a variety of reasons.
  • Start w/ the network stack... then move on to disk
  • Will need to examine the details at each site - mostly in terms of their storage services and gridftp doors
  • Timescale: should be able to do this now at AGLT2.
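
Tuning the network stack first makes sense because a 200 MB/s stream needs a TCP window of at least bandwidth times round-trip time. The sketch below uses a hypothetical 20 ms T1-T2 RTT; actual RTTs were not given in the talk.

```python
def min_tcp_window_bytes(rate_mb_s, rtt_ms):
    """Bandwidth-delay product: smallest TCP window sustaining `rate_mb_s` MB/s."""
    return int(rate_mb_s * 1e6 * (rtt_ms / 1e3))

# 200 MB/s over an assumed 20 ms round trip needs a ~4 MB window,
# well above typical default socket buffer sizes of the era
window = min_tcp_window_bytes(200, 20)
```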

OSG Grid Middleware - Robert Gardner

  • See slides
  • Questions for follow-up - problems with SE info-provider?
  • Scalability with the gatekeeper, and failover. There is some work on this at Fermilab.
  • Glide-in factory to schedule analysis jobs locally to reduce load on gatekeeper.

US ATLAS Services: panda-based release installation - Tadashi Maeno (BNL)

  • See slides
  • Someone needs to be identified to 'operate' the system
  • There may sometimes be problems w/ installs over NFS
  • Question whether we want to continue this process or use the global ATLAS system. We will still need to put operational manpower behind this system.

US ATLAS Services: Running entire physics analysis chain w/ Panda, proof & xrootd - Tadashi Maeno (BNL)

US ATLAS Services: distributed data management tools - Hironori Ito

  • See slides
  • Emphasizes using Panda job monitor for debugging DDM problems
  • FTS monitor now captures log files of failures for specific channels
  • Dashboard developers are providing APIs for getting info on DQ2 callbacks stored in the Oracle database.
  • Suggestion for an FAQ for troubleshooting

US ATLAS Services: data management at the site - Patrick McGuigan

  • See slides
  • LFC - questions - what extensions do we need. Worries about drifting from what LCG is doing.
  • Questions for discussion - consistency checks, OSG replication manager, User data lifetimes, Tier3 recommendations
  • What changes are needed to cleanse.py and checkse.py for the LFC migration?
  • Delivery of DPD's to Tier3's is an issue, as is publication of physics data tomorrow.

-- RobertGardner - 29 Nov 2007
