r3 - 08 Feb 2008 - 14:31:26 - RobertGardner



  • See slides
  • Large capacity ramp-up: will have >2 M SI2K?
  • MSU coming online
  • Some issues w/ Dell switches
  • Running Condor 6.9.4 job scheduler. Tier3 used when idle, vice-versa.
  • Analysis queues are set up
  • Monitoring - APC for power, temp, humidity
  • Using Cacti, syslog-ng, Nagios and Ganglia
  • Problems
    • Some gatekeeper crashes under heavy load; problem solved - related to exporting an XFS filesystem over NFS and the number of inodes
    • NFS server crashes - fewer of these
    • Accounting issues have been fixed
    • Large memory jobs


  • See slides


  • Tour of Egg - a generalized management framework that works well for Tier2 centers
  • Quickly add up SPECint values for all cores in the system
  • Examine PBS queues, gridmap file, etc.
  • Harvard starting up: 17 TB thumper; 10 servers
  • Tufts interested in Tier3
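
The SPECint tally mentioned in the list above amounts to a weighted sum over node types. A minimal sketch in Python, assuming a hypothetical inventory of (node count, cores per node, SI2K per core) entries; the names and figures are placeholders for illustration, not measurements from any site, and this is not a description of how Egg itself is implemented:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeType:
    """One class of worker node (hypothetical structure, for illustration)."""
    name: str
    nodes: int            # number of machines of this type
    cores_per_node: int   # cores in each machine
    si2k_per_core: float  # assumed SPECint2000 rating per core


def total_si2k(inventory):
    """Sum the per-core SI2K rating over every core in the inventory."""
    return sum(n.nodes * n.cores_per_node * n.si2k_per_core for n in inventory)


# Placeholder numbers only; replace with the site's real ratings.
inventory = [
    NodeType("dual-quad", nodes=23, cores_per_node=8, si2k_per_core=1500.0),
    NodeType("older-dual", nodes=10, cores_per_node=2, si2k_per_core=1000.0),
]

print(f"Total capacity: {total_si2k(inventory) / 1000:.1f} kSI2K")
```

Keeping the rating per node type rather than per machine keeps the inventory short, and a new purchase becomes a one-line addition.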


  • Added 23 dual quads 1950s; online and in production
  • Shared resources - sizeable Tier3 facility
    • Expand to use campus Condor pool - 750 cores
    • OSCER running, some problems w/ dq2_get, dq2_put.


  • New machine coming online - 200 cores
  • Negotiating for 400 cores of Opteron 2220, 210/180 TB MD1000 + 7 PE2970 to finish FY07
  • Networking is the biggest concern. SWT2 is off-campus, 1 Gb/s
  • LEARN peering w/ I2 at 1G
  • Need to improve 50 MB/s to 100 MB/s
  • NLR and I2 boards are not working together; Peering in Houston the problem
  • Would prefer not to support interactive users
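
To put the 50 MB/s and 100 MB/s figures above in context, the arithmetic can be sketched as follows. This assumes the off-campus link is 1 gigabit per second (consistent with the 1G peering noted above), uses decimal units, and ignores protocol overhead; the function names are made up for this illustration:

```python
def link_capacity_mb_per_s(gbit_per_s):
    """Convert a link speed in Gb/s to MB/s (1 Gb/s = 1000 Mb/s, 8 bits per byte)."""
    return gbit_per_s * 1000 / 8


def utilization(rate_mb_per_s, gbit_per_s):
    """Fraction of the raw link capacity used by a given transfer rate."""
    return rate_mb_per_s / link_capacity_mb_per_s(gbit_per_s)


LINK_GBIT = 1.0  # assumed link speed, see the note above
capacity = link_capacity_mb_per_s(LINK_GBIT)

for rate in (50, 100):
    share = utilization(rate, LINK_GBIT)
    print(f"{rate} MB/s is {share:.0%} of a {capacity:.0f} MB/s link")
```

Under these assumptions the 100 MB/s target sits close to the raw capacity of a 1 Gb/s path, so it leaves little headroom for competing traffic.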


  • See slides
  • Negotiating thumpers w/ TB-sized disks
  • Will try to run Bestman SRM on XrootdFS - we need to follow up w/ SRM experts about client access
  • Issue - how to implement analysis queues in fair share environment
  • Need SRM to do load balancing
  • New GUMS v1.2 - one-to-one mapping
  • Need to upgrade DQ2 to 0.4.1
  • Network tuning - up to 800 Mbps in both directions - but not stable (competing traffic)
  • Will upgrade to 10 G link - January
  • Plan to evaluate Terapaths and QoS
  • Performance - utilization is less than 200 on average - related to lack of input datasets.
  • Good news is less useless debugging after moving to PandaMover

Facility Planning (Michael)

  • Scope - next 6 months
  • Analysis at Tier2 centers - high priority
    • December 15 Site configuration by admins
    • AOD replication. Questions: how much space is needed, and which datasets? Must be complete by December 31. Who makes decisions on datasets: physics coordinators and usage patterns. Kaushik will consult Alexei. Jim will talk w/ physics coordinators and report back to Kaushik, Alexei, Michael, Facility.
  • Interactive analysis
    • BNL PROOF farm - for tests, completed by Jan 31 (Ofer Rind)
    • BNL PROOF farm into production, multi-user mode: March 31
    • Tier2 PROOF farms available?
      • Action item - plan for setting this up - as part of integration program. June 30. Plan to be delivered: end of January. Bruce, Ofer, Patrick, Sergey, Rob
  • Support setting up of Tier3's
    • Immediately, on-going. Doug/Duke, UTD/Justin, ... Need to contact Tier3's.
  • Evaluate pinning SRM v2.2
    • How important is space reservation? Gabriele: totally linked.
    • Must do this on a short timescale - Gabriele: plan by December 31
  • Develop and deploy software necessary to manage pinned files. To be integrated into DQ2
  • Disk space reconfiguration according to the computing model
    • Kaushik - we need disk-only areas. Proactively have our own plan.
  • Development and deployment of disk-only management tools: what are the needs?
    • Available space and usage. Kaushik will provide a bulleted list of requirements appropriate for Panda
  • LFC
    • Test system deployed by 31 December (John), production-ready by 31 January.
    • Migration by end February.
  • US ATLAS data management
    • Storage quota system US ATLAS wide - to be handled within DQ2 - to bring up w/ Massimo
    • Data deletion system - need to collect capabilities, report to DDM operations Alexei
    • Complete DQ2 lost file tagging - Kaushik will bring this to Operations
  • Jim: develop policy for Tier3 data, to be discussed at RAC
  • Jim: need a model for lifetime management of AOD, ESD, DPDs at site.
  • Incident tracking and communication - Elog deployed and operational (Mark), complete by December 15
  • Performance
    • Average 90% efficiency of 2007 WLCG pledge; important for funding agencies to review
  • Many other issues not covered here.

Next meeting

  • US ATLAS Tier2/Tier3, jointly w/ OSG all-hands at RENCI / North Carolina, March 3-5, 2008
    • Propose US ATLAS talks in the plenary session
    • Format - TBD
  • US ATLAS Tier2/Tier3, last week of May 2008 - location: Ann Arbor

-- RobertGardner - 30 Nov 2007
