
MinutesTPDec10

Introduction

Minutes of the Facilities Integration Program meeting focusing on Throughput, December 10, 2007.
  • Objectives: 200 MB/s sustained disk-to-disk throughput, Tier1 to Tier2
  • Other meetings and background: IntegrationProgram
  • Coordinates: Mondays, 2:00pm Eastern
    • Phone: (605) 475-6000, Access code: 735188; Dial 6 to mute/un-mute.

Attending

  • Meeting attendees: Saul, Dantong, (Jay), Shawn, Rob, Tom, Joe, John, Horst, Karthik, Bruce, Nam, Ning, Anabelle, Wei (please correct!)

Overall background and plan (Shawn)

  • We have been able to fully utilize Gbps networks memory-to-memory; the goal now is to routinely achieve hundreds of MB/s disk-to-disk.
  • 200 MB/s is the initial goal, T1 to T2. Eventually ~400 MB/s for 10G sites.
  • What are the problems?
  • We need to come up with some debugging steps.
  • Issue of the limited number of exposed doors at BNL - only 8. Not equivalent to what Fermilab has, with many public doors and pools. There may be longer-term implications.
  • Want to see what is possible w/ single, high performance servers. There is some work going on at UMich for local disk configurations - RAID options, etc.
  • Dantong:
    • What are the immediate steps to get to 200 MB/s? Shawn: tune up one of the doors and get it working to 120 MB/s; benchmark the disk system at each end.
  • Bruce: how do "Tier3"'s fit? They can participate in this just like Tier2s.
  • Shawn would like us to document each site and benchmark each filesystem/disk system; list storage endpoints and other information.
    • All agree this can be provided.
  • Wenjing will post a page with instructions for each site, starting with a timed dd from /dev/zero to a file on disk (a sketch is shown after this list).
  • Saul has a Bonnie++ Pacman package and will post instructions here.
  • There was a question about the rate of DQ2 managed transfers.
  • Another issue is the endpoint filesystem - dCache, Lustre, GPFS
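A minimal sketch of the timed dd write test Wenjing described above (the target path and size are placeholders; the file should be larger than the server's RAM to limit page-cache effects):

  time dd if=/dev/zero of=/path/to/storage/ddtest.dat bs=1M count=32768   # ~32 GB write test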

Tier1 testing datasets and endpoints (Hiro)

  • Dantong will ask Hiro to publish this information. Locations in pnfs.
  • Dedicated Thumper nodes w/ datasets will be used for the testing. Ready to go.
  • Is there any load balancing on the doors? SRM should not be used. DQ2 tools use glite-copy; Dantong will follow up with Tadashi.
  • Need the names of the doors (dc01, etc.). Questions remain about whether these are tuned optimally for the network (a sample set of TCP tuning parameters is sketched after this list).
  • Door nodes have low memory - only 4 GB on each door; they should be 16 GB.
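The note about door tuning refers to standard TCP stack settings. As a hedged illustration only (the values below are typical wide-area defaults, not a vetted prescription for the BNL doors):

  sysctl -w net.core.rmem_max=16777216                  # max receive socket buffer (16 MB)
  sysctl -w net.core.wmem_max=16777216                  # max send socket buffer (16 MB)
  sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"     # min/default/max TCP receive buffer
  sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"     # min/default/max TCP send buffer
  # add the same settings to /etc/sysctl.conf to make them persistent across reboots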

Load testing tools (Jay)

  • Runs from one machine, but drives third-party transfers (see the sketch after this list for an example third-party copy).
  • Would like to use the MonALISA control framework to schedule the tests, and to allow people to schedule them themselves.
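As a hedged illustration of the kind of third-party gridftp transfer such load tests drive (the BNL door name and source path are placeholders, the destination follows the AGLT2 endpoints listed below, and Jay's tool may use a different underlying client):

  globus-url-copy -vb -p 8 -tcp-bs 4194304 \
      gsiftp://<bnl_door>/<test_path>/testfile001 \
      gsiftp://umfs05.aglt2.org/atlas/data16/testfile001
  # -vb reports transfer performance, -p 8 uses 8 parallel streams, -tcp-bs sets a 4 MB TCP buffer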

Wisconsin

  • Going to change the gridftp server to have a 1 Gb connection to the campus backbone (10G).
  • xrootd backend - several servers.
  • No firewalls.

AGLT2 status (Shawn)

Interested in Lustre + BeStMan, and comparing to SRM-dCache; also FDT.

For Lustre+StoRM see info from LustreStoRM, an ATLAS Tier-1 setup in Spain.

BeStMan info is at: BeStMan web page.

Lustre information:

http://www.clusterfs.com/

http://wiki.lustre.org/index.php?title=Main_Page

http://www.sun.com/software/products/lustre/

Future Lustre testbed: 1 MDT server, 4 data servers, 4 clients.
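As a hedged sketch of how the testbed clients would attach to such a filesystem (Lustre 1.6-style mount syntax; the MGS node name and filesystem name are placeholders):

  # mount the Lustre filesystem "atlasfs" served by MGS node mgs01 on a client
  mount -t lustre mgs01@tcp0:/atlasfs /mnt/atlasfs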

See slides at tier2 meeting. 10G connections to the backbone, and on to Chicago.

Write at 600-700 MB/s, read at > 1 GB/s

IoTestInstructions

hardware:

  • Storage enclosures: 4 x MD1000 (15 disks each)
  • RAID controller: PERC 5/E adapter
  • Server: PowerEdge 2950

raid configurations:

  • 4r5: four RAID-5 devices, 15 disks each; two RAID-5 devices share one controller
  • 2r50: two RAID-50 devices, 30 disks each, each on a separate controller
  • sr02r50: software RAID-0 over two hardware RAID-50 devices; each RAID-50 has 30 disks on a separate controller
  • 2r5: two RAID-5 devices, 30 disks each, each on a separate controller
  • sr02r5: software RAID-0 over two hardware RAID-5 devices; each RAID-5 has 30 disks on a separate controller

test tool: iozone-3.279-1.el4.rf.x86_64

test mode: Iozone was run in throughput mode (-t), which lets the user specify how many threads or processes are active during the measurement; the -P option was also used to bind threads to processors. We tested 1 to 12 parallel threads.

In the tables below, the column header (1..12) is the number of parallel threads and all values are in MB/s.
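As a hedged example of the kind of iozone command used (paths, file size, record size, and thread count are placeholders; the actual test swept 1 to 12 threads):

  # throughput mode with 4 threads, write/rewrite (-i 0) and read/re-read (-i 1) tests,
  # 4 GB per thread, 1 MB records, threads bound to processors starting at CPU 0
  iozone -i 0 -i 1 -t 4 -s 4g -r 1024k -P 0 \
         -F /atlas/test/f1 /atlas/test/f2 /atlas/test/f3 /atlas/test/f4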

total write for all threads (MB/s)

  threads      1     2     3     4     5     6     7     8     9    10    11    12
  4r5        370   351   682   641   621   630   610   595   593   599   599   581
  2r5        382   695   707   657   649   623   618   605   603   588   587   583
  sr02r5     664   718   657   633   606   599   626   613   597   587   575   553
  2r50       366   715   688   653   646   623   619   604   603   592   588   584
  sr02r50    692   711   651   636   615   604   594   586   572   574   569   563

total read for all threads (MB/s)

  threads      1     2     3     4     5     6     7     8     9    10    11    12
  4r5        698   815  1530  1665  1650  1533  1396  1462  1362  1569  1324  1243
  2r5        755  1552  1515  1026   965   827   840   830   824   804   702   783
  sr02r5    1212  1440  1381   879   765   672   691   904   795   739   732   498
  2r50       757  1558  1456   936   899   851   935   887   879   838   710   763
  sr02r50   1222  1389   579  1547   796   766   744   683   505   692   694   598

average write per thread (MB/s)

  threads      1     2     3     4     5     6     7     8     9    10    11    12
  4r5        370   175   227   160   124   105    87    74    65    59    54    48
  2r5        382   347   235   164   129   103    88    75    67    58    53    48
  sr02r5     664   359   219   158   121    99    89    76    66    58    52    46
  2r50       366   357   229   163   129   103    88    75    67    59    53    48
  sr02r50    692   355   217   159   123   100    84    73    63    57    51    46

average read per thread (MB/s)

  threads      1     2     3     4     5     6     7     8     9    10    11    12
  4r5        698   407   510   416   330   255   199   182   151   156   120   103
  2r5        755   776   505   256   193   137   120   103    91    80    63    65
  sr02r5    1212   720   460   219   153   112    98   113    88    73    66    41
  2r50       757   779   485   234   179   141   133   110    97    83    64    63
  sr02r50   1222   694   193   386   159   127   106    85    56    69    63    49

AGLT2 storage system

1) Node: umfs05.aglt2.org
  • a. Config: dual dual-core (Intel X5355), 16GB ram, single Perc5/e, two MD1000 shelves, 15@750GB SATA-II drives / MD1000, RAID50 config via Perc5, 10GE Myricom NIC on PCI-e x8 slot.
  • b. Storage location gsiftp://umfs05.aglt2.org/atlas/data16
  • c. dd test results: write=289MB/s read=217MB/s
  • d. iozone test results (MB/s)
(columns give the number of threads; entries are the aggregate throughput of all threads, in MB/s)
         1   2   3   4
  Initial write    520   362   471   400
        Rewrite    654   662   654   630
           Read    828   814   962   832
        Re-read    828   810   788   660
2) Node: umfs02.aglt2.org
  • a. Config: Dual Dual Core AMD 280, 16GB ram, single ARC-1170, 24@750GB SATA-II disks, RAID5 config ARC-1170, 1GE Broadcom BCM5704 on PCI-e x8 slot.
  • b. Storage location gsiftp://umfs02.aglt2.org/atlas/data08
  • c. dd test results: write=186MB/s read=301MB/s
  • d. iozone test results (MB/s)
(columns give the number of threads; entries are the aggregate throughput of all threads, in MB/s)
         1   2   3   4   5   6   7   8
  Initial write    129   129   118   115   101   124   107   101
        Rewrite    200   200   196   195   174   201   171   173
           Read    250   252   251   252   237   250   235   234
        Re-read    254   258   244   253   254   250   224   239

3) Node: dq2.aglt2.org
  • a. Config: dual quad-core (Intel X5355), 16GB ram, single ARC-1260, 16@750GB SATA-II disks, RAID6 config via ARC-1260, 10GE Myricom NIC on PCI-e x8 slot.
  • b. Storage location gsiftp://dq2.aglt2.org/atlas/data15
  • c. dd test results: write=266MB/s read=330MB/s
  • d. iozone test results (MB/s)
(columns give the number of threads; entries are the aggregate throughput of all threads, in MB/s)
         1   2   3   4   5   6   7   8
  Initial write    320   309   213   309   293   308   323   302
        Rewrite    383   292   323   367   363   351   354   372
           Read    440   408   429   432   380   407   369   391
        Re-read    436   425   412   416   401   408   427   405

4) Node: umfs07.aglt2.org
  • a. Config: dual quad-core (Intel E5335), 16GB ram, two Perc5/e, four MD1000 shelves, 15@750GB SATA-II drives / MD1000, RAID50 config via Perc5, 10GE Myricom NIC on PCI-e x8 slot.
  • b. Storage location gsiftp://umfs07.aglt2.org/atlas/dcache gsiftp://umfs07.aglt2.org/atlas/dcache1
  • c. dd test results: write=330MB/s read=265MB/s
  • d. iozone test results (MB/s)
(columns give the number of threads; entries are the aggregate throughput of all threads, in MB/s)
         1   2   3   4   5   6   7   8
  Initial write    495   259   362   204   154   151   219   407
        Rewrite    661   666   657   653   639   632   616   645
           Read    1105   847   1005   1001   982   829   893   915
        Re-read    823   992   1115   882   1043   997   1007   1015

5) Node: c-4-15.aglt2.org (typical dcache node)
  • a. Config: dual quad-core ( Intel X5355), 16GB ram, 2@750GB SATA-II disks, 10GE Myricom NIC on PCI-e x8 slot.
  • b. Storage location gsiftp://c-4-15.aglt2.org/atlas/dcache gsiftp://c-4-15.aglt2.org/atlas/dcache1
  • c. dd test results: write=73MB/s read=75MB/s
  • d. iozone test results (MB/s)
(columns give the number of threads; entries are the aggregate throughput of all threads, in MB/s)
         1   2   3   4   5   6   7   8
  Initial write    73   124   116   106   106   103   103   100
        Rewrite    51   96   91   90   90   89   86   89
           Read    73   147   95   40   44   41   44   42
        Re-read    74   146   95   40   43   42   44   42

MWT2 status (Joe)

To date, most testing and tuning has taken place at UC. At UC we currently have 1 SRM door and 4 GridFTP doors front-ending 37 pools of between 1.4 TB and 3.4 TB. These pools run on a private IP network connected to the doors through a Cisco 6509 router with 1 Gb NICs. All doors at UC have 1 Gb connections to both the public and private networks, with the exception of uct2-dc1, a GridFTP door, which has a 10 Gb NIC on both networks. At IU, we have 43 pools of the same sizes, though they all have public IPs, connected with 1 Gb NICs to a Force10 router. The test results below were obtained by using IU compute nodes as clients connecting to UC's dCache. Iperf tests have shown 4.2 Gbps from IU to UC (a sample invocation is sketched after the hardware list below).

IN-UC_ATLAS_VLAN_Peering.png

Door node hardware specs:
  • 1 dual-core Opteron 275
  • 4 GB RAM
  • 1 Gb NICs (except uct2-dc1, which has a 10 Gb connection)
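As a hedged sketch of the iperf test mentioned above (the UC host name, window size, and stream count are placeholders):

  # on the receiving UC door (placeholder host name)
  iperf -s -w 4M
  # on an IU compute node: 4 parallel streams for 60 seconds, reporting every 10 s
  iperf -c <uc_door_host> -P 4 -t 60 -w 4M -i 10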

  • At UC, we can get up to ~3.6 Gbps (~450 MB/s) into our dCache at around 25 concurrent SRM connections. Beyond that we hit a CPU bottleneck on the SRM door (uct2-dc3), where performance degrades to around ~3.4 Gbps (~425 MB/s).
  • UC GridFTP 10G Stress Test: With a 10 Gb NIC on either side of uct2-dc1, we can get a maximum of 1.8 Gbps into our dCache through this door. The limiting factor is not yet clear, but it is certainly not CPU related; some further tuning of the NICs and the network stack is probably needed here.
uct2-dc1_gridftp_stresstest.png

  • UC SRM Stress Test: The sweet spot for greatest throughput is around 25 concurrent connections; beyond that, performance degrades a bit because the CPU on the SRM door (uct2-dc3) runs at 100%. The largest number of concurrent connections tried here was 41 (a sketch of how such a test can be driven is shown after the plots).
srm_test01.png srm_test02-41_concurrent_conns.png
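A hedged sketch of how such a concurrent-connection test can be driven with dCache's srmcp client (URLs, paths, and the connection count are placeholders, not the exact commands used here):

  # launch 25 concurrent SRM transfers into the UC dCache, then wait for completion
  N=25
  for i in $(seq 1 $N); do
    srmcp file:////local/testdata/file.$i \
          srm://<uc_srm_door>:8443//pnfs/<site_path>/loadtest/file.$i &
  done
  wait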

  • AGLT2 to UC?
  • Joe and Wenjing will work together to structure the tests.

SWT2_OU status (Horst)

  • Two Dell 2950 head nodes (dual quad-core Xeon 5345, 2.33 GHz, 16 GB RAM) with gigabit links through to a DDN S2A 3000 running Ibrix 3.0: a 48 x 500 GB SATA disk RAID5 array, 16 TB usable, 8 hot spares.
  • Max performance was 45 MB/s with this system. The question is how to improve this.
  • IOzone has the ability to specify threads.
  • 100+ MB/s on the LAN, disk-to-disk (OUHEP RAID to local SAS disk on the head nodes), so gridftp is not the bottleneck - we are getting almost line speed.

SWT2_UTA status (Patrick)

  • Will attend next time.

NET2 status (John/Saul)

  • Have a dedicated 10G network to our Cisco 6509 router.
  • Bottleneck is disk I/O.
  • 60 TB GPFS system. 500-600 MB/s writing, 800 MB/s reading using 9 nodes.
  • One gridftp door. Xeon 2.8 GHz, 4 GB ram.
  • The interesting thing will be the performance from the gridftp node.

WT2 status (Wei)

  • Running 1 Gbps external link. January upgrade to 10 Gbps.
  • 3 Thumpers, each has 4 Gig E nics, channel bonded.
  • 1 Gridftp server, Gig E.
  • 500 MB/s memory to disk on Thumper, using dd write.
  • xrdcp on the gridftp door, multithreaded: 104 MB/s over a gigabit link (a sample invocation is sketched after this list).
  • globus-url-copy on the gridftp door: 37 MB/s reading, 50 MB/s writing. Not sure what causes the difference, since the same path is taken; note that g-u-c is built on top of the POSIX xrootd library. Additional note: after talking to Andy and getting some undocumented xrootd environment variables, reading now also reaches 50 MB/s. At this point the gridftp server appears to saturate the CPU (2x AMD Opteron 244, 1.8 GHz, 2 GB).
  • Will use the SRM from BeStMan. Concerned about SRM functioning with FTS; no place to test.
  • Shawn notes there is an srm-tester site that can monitor your SRM endpoint, and suggests we look at this.
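A minimal sketch of the xrdcp read test mentioned above (the host and file path are placeholders):

  # copy a file out of the xrootd-backed storage to local scratch and time it
  time xrdcp root://<wt2_xrootd_host>//atlas/testdata/testfile /tmp/testfile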

SRM Testing Link

For sites interested in daily testing of their system, see LBNL SRM Testing; for results, see LBNL SRM Testing Results.

Info on Site Testing

Wenjing Wu will provide details about the testing methodology used at AGLT2. Saul also pointed out that there are some existing Pacman caches with testing tools ready for easy installation and use (a direct Bonnie++ invocation is also sketched after this list):

  • % pacman -get BU:Bonnie (installs and runs Bonnie)
  • % pacman -get BU:IO-benchmark (installs and runs Bonnie++ [much slower])
  • % pacman -get BU:Connectathon (installs and runs Connectathon)
  • % pacman -get JAB:FDT (FDT for fast file transfers)
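For sites that prefer to run Bonnie++ directly rather than via the Pacman caches, a minimal hedged invocation looks something like this (directory, size, and user are placeholders; the size should be well above the machine's RAM):

  bonnie++ -d /atlas/data16 -s 32768 -n 0 -u nobody
  # -d test directory, -s file size in MB, -n 0 skips the small-file tests, -u user to run as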

Need examples of a "standard" dd test that each site should run on each storage element they have.
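A hedged candidate for such a standard test (file path and size are placeholders; the file should be at least twice the server's RAM to limit page-cache effects):

  # write test: stream zeros into a new file on the storage element
  time dd if=/dev/zero of=/atlas/data16/ddtest.dat bs=1M count=32768
  # read test: stream the file back and discard it; flush it from memory first
  # (e.g. by remounting the filesystem or reading other large files in between)
  time dd if=/atlas/data16/ddtest.dat of=/dev/null bs=1M
  rm /atlas/data16/ddtest.dat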

-- RobertGardner - 05 Dec 2007


Attachments


  • uct2-dc1_gridftp_stresstest.png (37.1K) - JosephUrbanski, 10 Dec 2007 - 14:07
  • IN-UC_ATLAS_VLAN_Peering.png (48.8K) - JosephUrbanski, 10 Dec 2007 - 13:26
  • srm_test02-41_concurrent_conns.png (33.3K) - JosephUrbanski, 10 Dec 2007 - 12:20
  • srm_test01.png (32.7K) - JosephUrbanski, 10 Dec 2007 - 12:20