MinutesTPDec10
Introduction
Minutes of the Facilities Integration Program meeting focusing on
Throughput, December 10, 2007.
- Objectives: 200 MB/s throughput sustained disk-to-disk Tier1-Tier2
- Other meetings and background : IntegrationProgram
- Coordinates: Mondays, 2:00pm Eastern
- Phone: (605) 475-6000, Access code: 735188; Dial 6 to mute/un-mute.
Attending
- Meeting attendees: Saul, Dantong, (Jay), Shawn, Rob, Tom, Joe, John, Horst, Karthik, Bruce, Nam, Ning, Anabelle, Wei (please correct!)
Overall background and plan (Shawn)
- We have been able to fully utilize Gbps networks for mem-to-mem transfers, but the aim now is to routinely reach 100's of MB/s for disk-to-disk.
- 200 MB/s is the initial goal, T1 to T2. Eventually ~400 MB/s for 10G sites.
- What are the problems?
- We need to come up with some debugging steps.
- Issue of limited number of exposed doors at BNL - only 8. Not equiv to what Fermilab has with many public doors and pools. May be longer term implications.
- Want to see what is possible w/ single, high performance servers. There is some work going on at UMich for local disk configurations - RAID options, etc.
- Dantong:
- What are the immediate steps to get to 200 MB/s? Shawn: tune up one of the doors w/ one of the doors and get that working to 120 MB/s. Benchmark the disk system at each end.
- Bruce: how do "Tier3"'s fit? They can participate in this just like Tier2s.
- Shawn would like us to document each site, and benchmark each filesystem/disk systems. List storage endpoints and other information.
- All agree this can be provided.
- Wenjing - will post a page w/ instructions for each site. Starting w/ a timed dd from /dev/zero to a file on disk.
- Saul has a bonnie++ pacman package. Will post instructions here.
- There was a question about the rate of DQ2 managed transfers.
- Another issue is the endpoint filesystem - dCache, Lustre, GPFS
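The throughput goals above are quoted in MB/s while the links are quoted in Gbps, so a quick conversion helps when judging what a single door or NIC can carry. The following is a minimal sketch, assuming decimal units (1 Gbps = 1000 Mbit/s, 1 MB = 10^6 bytes) and ignoring protocol overhead; the function names are illustrative only.

```python
import math


def gbps_to_mb_per_s(gbps: float) -> float:
    """Raw line rate in Gbps expressed as MB/s (decimal units, no overhead)."""
    return gbps * 1000.0 / 8.0


def mb_per_s_to_gbps(mb_per_s: float) -> float:
    """Disk or transfer rate in MB/s expressed as Gbps (decimal units)."""
    return mb_per_s * 8.0 / 1000.0


def links_needed(target_mb_per_s: float, link_gbps: float) -> int:
    """Smallest number of links whose combined raw capacity covers the target."""
    return math.ceil(target_mb_per_s / gbps_to_mb_per_s(link_gbps))


if __name__ == "__main__":
    for target in (200, 400):
        print(
            f"{target} MB/s = {mb_per_s_to_gbps(target):.1f} Gbps; "
            f"needs {links_needed(target, 1)} x 1G or {links_needed(target, 10)} x 10G link(s)"
        )
```

Under these assumptions a single 1 Gbps path tops out at 125 MB/s of raw capacity, which is why the 200 MB/s goal needs either several 1G doors in parallel or a 10G path.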
Tier1 testing datasets and endpoints (Hiro)
- Dantong will ask Hiro to publish this information. Locations in pnfs.
- Dedicated Thumper nodes w/ datasets will be used for the testing. Ready to go.
- Is there any load balancing on the doors? SRM should not be used. DQ2 tools use glite-copy, Dantong will follow-up w/ Tadashi.
- Need names of doors: dc01, etc. Questions about whether these are tuned optimally for the network.
- Door nodes have low memory - only 4 GB on each door. They should be 16 GB.
Load testing tools (Jay)
- Runs from one machine, but makes 3rd party transfers.
- Would like to use the Monalisa control framework to schedule the tests, and to allow people to schedule these themselves.
Wisconsin
- Going to change gridftp server to have a 1 GB server to the campus backbone (10G).
- xrootd backend - several servers.
- No firewalls.
AGLT2 status (Shawn)
Interested in Lustre + BeStMan, and comparing to SRM-dCache; also FDT.
For Lustre+StoRM see info from LustreStoRM, an ATLAS Tier-1 setup in Spain.
BeStMan info is at: BeStMan web page.
Lustre information:
http://www.clusterfs.com/
http://wiki.lustre.org/index.php?title=Main_Page
http://www.sun.com/software/products/lustre/
Future Lustre testbed: 1 MDT server, 4 data servers, 4 clients
See slides at tier2 meeting. 10G connections to the backbone, and on to Chicago.
Write at 600-700 MB/s, read at > 1 GB/s
Hardware:
Storage enclosures (4 x 15 disks): MD1000
RAID controller: PERC 5/E Adapter
Server: PowerEdge 2950
RAID configurations:
| 4r5 | 4 r5 devices, each r5 has 15 disks, 2 r5 devices share one controller |
| 2r50 | 2 r50 devices, each r50 has 30 disks on a separate controller |
| sr02r50 | soft r0 over 2 hard r50, each r50 has 30 disks on a separate controller |
| 2r5 | 2 r5 devices, each r5 has 30 disks on a separate controller |
| sr02r5 | soft r0 over 2 hard r5, each r5 has 30 disks on a separate controller |
test tool: iozone-3.279-1.el4.rf.x86_64
Test mode: iozone was run in throughput mode, which lets the user specify how many threads or processes are active during the measurement. We also used -P to bind threads to processors, and tested with 1 to 12 threads.
In the tables below, the column number (1..12) is the number of parallel threads running; the unit is MB/s.
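As an illustration of the test mode described above, here is a minimal sketch that builds (but does not run) one iozone throughput-mode command line per thread count. The `-t` (thread count), `-P` (starting CPU for binding), `-s`, `-r`, `-i` and `-F` options are standard iozone flags; the file size, record size, test directory and helper names are assumptions, since the minutes do not record the exact values used at AGLT2.

```python
import shlex


def iozone_throughput_cmd(threads, start_cpu=0, file_size="8g",
                          record_size="1m", test_dir="/tmp/iozone"):
    """Build an iozone throughput-mode command for the given thread count.

    file_size, record_size and test_dir are placeholders (assumptions),
    not the settings used for the results in these minutes.
    """
    # Throughput mode expects one temporary file name per thread.
    files = [f"{test_dir}/iozone.{i}" for i in range(threads)]
    return [
        "iozone",
        "-t", str(threads),      # throughput mode with this many threads
        "-P", str(start_cpu),    # bind threads to CPUs starting at this one
        "-s", file_size,         # file size per thread
        "-r", record_size,       # record (I/O request) size
        "-i", "0",               # test 0: write / rewrite
        "-i", "1",               # test 1: read / re-read
        "-F", *files,
    ]


def sweep_commands(max_threads=12):
    """One command per thread count, 1..max_threads."""
    return [iozone_throughput_cmd(n) for n in range(1, max_threads + 1)]


if __name__ == "__main__":
    for cmd in sweep_commands():
        print(shlex.join(cmd))
```

Printing the commands rather than executing them keeps the sketch safe to run anywhere; on a real storage node the per-thread file size should be chosen large enough that the page cache does not dominate the result.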
Total write for all threads (MB/s)
| write | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
| 4r5 | 370 | 351 | 682 | 641 | 621 | 630 | 610 | 595 | 593 | 599 | 599 | 581 |
| 2r5 | 382 | 695 | 707 | 657 | 649 | 623 | 618 | 605 | 603 | 588 | 587 | 583 |
| sr02r5 | 664 | 718 | 657 | 633 | 606 | 599 | 626 | 613 | 597 | 587 | 575 | 553 |
| 2r50 | 366 | 715 | 688 | 653 | 646 | 623 | 619 | 604 | 603 | 592 | 588 | 584 |
| sr02r50 | 692 | 711 | 651 | 636 | 615 | 604 | 594 | 586 | 572 | 574 | 569 | 563 |
Total read for all threads (MB/s)
| read | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
| 4r5 | 698 | 815 | 1530 | 1665 | 1650 | 1533 | 1396 | 1462 | 1362 | 1569 | 1324 | 1243 |
| 2r5 | 755 | 1552 | 1515 | 1026 | 965 | 827 | 840 | 830 | 824 | 804 | 702 | 783 |
| sr02r5 | 1212 | 1440 | 1381 | 879 | 765 | 672 | 691 | 904 | 795 | 739 | 732 | 498 |
| 2r50 | 757 | 1558 | 1456 | 936 | 899 | 851 | 935 | 887 | 879 | 838 | 710 | 763 |
| sr02r50 | 1222 | 1389 | 579 | 1547 | 796 | 766 | 744 | 683 | 505 | 692 | 694 | 598 |
Average write for each thread (MB/s)
| write | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
| 4r5 | 370 | 175 | 227 | 160 | 124 | 105 | 87 | 74 | 65 | 59 | 54 | 48 |
| 2r5 | 382 | 347 | 235 | 164 | 129 | 103 | 88 | 75 | 67 | 58 | 53 | 48 |
| sr02r5 | 664 | 359 | 219 | 158 | 121 | 99 | 89 | 76 | 66 | 58 | 52 | 46 |
| 2r50 | 366 | 357 | 229 | 163 | 129 | 103 | 88 | 75 | 67 | 59 | 53 | 48 |
| sr02r50 | 692 | 355 | 217 | 159 | 123 | 100 | 84 | 73 | 63 | 57 | 51 | 46 |
Average read for each thread (MB/s)
| read | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
| 4r5 | 698 | 407 | 510 | 416 | 330 | 255 | 199 | 182 | 151 | 156 | 120 | 103 |
| 2r5 | 755 | 776 | 505 | 256 | 193 | 137 | 120 | 103 | 91 | 80 | 63 | 65 |
| sr02r5 | 1212 | 720 | 460 | 219 | 153 | 112 | 98 | 113 | 88 | 73 | 66 | 41 |
| 2r50 | 757 | 779 | 485 | 234 | 179 | 141 | 133 | 110 | 97 | 83 | 64 | 63 |
| sr02r50 | 1222 | 694 | 193 | 386 | 159 | 127 | 106 | 85 | 56 | 69 | 63 | 49 |
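The per-thread tables appear to be derived from the totals by dividing by the thread count and truncating. The sketch below checks that relationship for the 4r5 write row and picks out the thread count with the highest total; the data is copied from the tables above, and the helper names are illustrative.

```python
# Total write throughput (MB/s) for the 4r5 configuration, 1..12 threads,
# copied from the "total write" table above.
TOTAL_WRITE_4R5 = [370, 351, 682, 641, 621, 630, 610, 595, 593, 599, 599, 581]


def per_thread_average(totals):
    """Total throughput divided by thread count, truncated to whole MB/s."""
    return [total // threads for threads, total in enumerate(totals, start=1)]


def best_thread_count(totals):
    """Thread count (1-based) giving the highest total throughput."""
    return totals.index(max(totals)) + 1


if __name__ == "__main__":
    print("per-thread:", per_thread_average(TOTAL_WRITE_4R5))
    print("best thread count:", best_thread_count(TOTAL_WRITE_4R5))
```

For this row the peak total is at 3 threads; beyond that the aggregate stays roughly flat while the per-thread share keeps falling.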
AGLT2 storage system
1) Node: umfs05.aglt2.org
- a. Config: dual dual-core (Intel X5355), 16GB ram, single Perc5/e, two MD1000 shelves, 15@750GB SATA-II drives / MD1000, RAID50 config via Perc5, 10GE Myricom NIC on PCI-e x8 slot.
- b. Storage location gsiftp://umfs05.aglt2.org/atlas/data16
- c. dd test results: write=289MB/s read=217MB/s
- d. iozone test results (MB/s)
(columns give the number of threads; each value is the total throughput of all threads)
| Threads | 1 | 2 | 3 | 4 |
| Initial write | 520 | 362 | 471 | 400 |
| Rewrite | 654 | 662 | 654 | 630 |
| Read | 828 | 814 | 962 | 832 |
| Re-read | 828 | 810 | 788 | 660 |
2) Node: umfs02.aglt2.org
- a. Config: Dual Dual Core AMD 280, 16GB ram, single ARC-1170, 24@750GB SATA-II disks, RAID5 config ARC-1170, 1GE Broadcom BCM5704 on PCI-e x8 slot.
- b. Storage location gsiftp://umfs02.aglt2.org/atlas/data08
- c. dd test results: write=186MB/s read=301MB/s
- d. iozone test results (MB/s)
(columns give the number of threads; each value is the total throughput of all threads)
| Threads | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
| Initial write | 129 | 129 | 118 | 115 | 101 | 124 | 107 | 101 |
| Rewrite | 200 | 200 | 196 | 195 | 174 | 201 | 171 | 173 |
| Read | 250 | 252 | 251 | 252 | 237 | 250 | 235 | 234 |
| Re-read | 254 | 258 | 244 | 253 | 254 | 250 | 224 | 239 |
3) Node: dq2.aglt2.org
- a. Config: dual quad-core (Intel X5355), 16GB ram, single ARC-1260, 16@750GB SATA-II disks, RAID6 config via ARC-1260, 10GE Myricom NIC on PCI-e x8 slot.
- b. Storage location gsiftp://dq2.aglt2.org/atlas/data15
- c. dd test results: write=266MB/s read=330MB/s
- d. iozone test results (MB/s)
(columns give the number of threads; each value is the total throughput of all threads)
| Threads | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
| Initial write | 320 | 309 | 213 | 309 | 293 | 308 | 323 | 302 |
| Rewrite | 383 | 292 | 323 | 367 | 363 | 351 | 354 | 372 |
| Read | 440 | 408 | 429 | 432 | 380 | 407 | 369 | 391 |
| Re-read | 436 | 425 | 412 | 416 | 401 | 408 | 427 | 405 |
4) Node: umfs07.aglt2.org
- a. Config: dual quad-core (Intel E5335), 16GB ram, two Perc5/e, four MD1000 shelves, 15@750GB SATA-II drives / MD1000, RAID50 config via Perc5, 10GE Myricom NIC on PCI-e x8 slot.
- b. Storage location gsiftp://umfs07.aglt2.org/atlas/dcache gsiftp://umfs07.aglt2.org/atlas/dcache1
- c. dd test results: write=330MB/s read=265MB/s
- d. iozone test results (MB/s)
(columns give the number of threads; each value is the total throughput of all threads)
| Threads | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
| Initial write | 495 | 259 | 362 | 204 | 154 | 151 | 219 | 407 |
| Rewrite | 661 | 666 | 657 | 653 | 639 | 632 | 616 | 645 |
| Read | 1105 | 847 | 1005 | 1001 | 982 | 829 | 893 | 915 |
| Re-read | 823 | 992 | 1115 | 882 | 1043 | 997 | 1007 | 1015 |
5) Node: c-4-15.aglt2.org (typical dcache node)
- a. Config: dual quad-core ( Intel X5355), 16GB ram, 2@750GB SATA-II disks, 10GE Myricom NIC on PCI-e x8 slot.
- b. Storage location gsiftp://c-4-15.aglt2.org/atlas/dcache gsiftp://c-4-15.aglt2.org/atlas/dcache1
- c. dd test results: write=73MB/s read=75MB/s
- d. iozone test results (MB/s)
(columns give the number of threads; each value is the total throughput of all threads)
| Threads | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
| Initial write | 73 | 124 | 116 | 106 | 106 | 103 | 103 | 100 |
| Rewrite | 51 | 96 | 91 | 90 | 90 | 89 | 86 | 89 |
| Read | 73 | 147 | 95 | 40 | 44 | 41 | 44 | 42 |
| Re-read | 74 | 146 | 95 | 40 | 43 | 42 | 44 | 42 |
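To relate the dd numbers for the five AGLT2 nodes to the 200 MB/s objective, a short check like the following can be used. It is only a sketch: the figures are the dd results listed above, and the criterion (both write and read at or above the target) is an assumption about how a node would be judged, not an agreed definition from the meeting.

```python
TARGET_MB_PER_S = 200  # initial Tier1-Tier2 disk-to-disk goal from these minutes

# (write MB/s, read MB/s) from the dd test results listed above.
DD_RESULTS = {
    "umfs05.aglt2.org": (289, 217),
    "umfs02.aglt2.org": (186, 301),
    "dq2.aglt2.org": (266, 330),
    "umfs07.aglt2.org": (330, 265),
    "c-4-15.aglt2.org": (73, 75),
}


def meets_target(write, read, target=TARGET_MB_PER_S):
    """True if the slower of the two directions still reaches the target."""
    return min(write, read) >= target


def nodes_meeting_target(results, target=TARGET_MB_PER_S):
    """Names of nodes whose dd write and read rates both reach the target."""
    return [node for node, (w, r) in results.items() if meets_target(w, r, target)]


if __name__ == "__main__":
    for node, (w, r) in DD_RESULTS.items():
        status = "ok" if meets_target(w, r) else "below target"
        print(f"{node}: write={w} read={r} -> {status}")
```

By this measure the single-server RAID nodes are at or near the goal, while a typical dCache node with two local disks is not, so such nodes would have to be used in parallel.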
MWT2 status (Joe)
To date, most testing and tuning has taken place at UC. At UC, we currently have 1 SRM door and 4 GridFTP doors front-ending 37 pools of between 1.4 TB and 3.4 TB each. These pools run on a private IP network connected to the doors through a Cisco 6509 router with 1Gb NICs. All doors at UC have 1Gb connections to both the public and private networks, with the exception of uct2-dc1, a GridFTP door, which has a 10Gb NIC on both networks. At IU, we have 43 pools of the same sizes, though they all have public IPs, connected with 1Gb NICs to a Force10 router. The test results below were obtained by using IU compute nodes as clients, connecting to UC's dCache. Iperf tests have shown 4.2Gbps from IU to UC.

Door nodes hardware specs:
- 1 Dual core Opteron 275
- 4GB RAM
- 1Gb NICs (except for uct2-dc1, which has a 10Gbps connection)
- At UC, we can get up to ~3.6 Gbps (~450MBps) into our dCache at around 25 concurrent SRM connections. After that, we hit a CPU bottleneck on the SRM door (uct2-dc3), where performance degrades to around ~3.4Gbps (~425MBps). Also, with 10Gb NICs on either side of uct2-dc1 (primary GridFTP door), we get a maximum of 1.8Gbps on that gridftp door.
- UC GridFTP 10G Stress Test: With 10Gb NIC on either side of uct2-dc1, we can get a maximum of 1.8Gbps into our dCache on this door. Not sure what the limiting factor here is though, but certainly not CPU related. Some further tuning of the cards and the network stack probably needed here.
- UC SRM Stress Test: The sweet spot for greatest throughput is around 25 concurrent connections, after this, performance degrades a bit due to the CPU on the door (uct2-dc3) running at 100%. The largest number of concurrent connections I threw at it here was 41.
- AGLT2 to UC?
- Joe and Wenjing to work together to structure the tests.
SWT2_OU status (Horst)
- Two Dell 2950 head nodes (Dual Quad Core Xeon 5345, 2.33 GHz, 16 GB RAM) w/ gigabit links that go through to DDN (S2A 3000) Ibrix (3.0), 48 500 GB SATA disk RAID5 array; 16 TB usable, 8 hot spares
- Max performance was 45 MB/s with this system. Question is how to improve this?
- IOZone has the ability to specify threads.
- 100+ MB/s on LAN, disk-to-disk (OUHEP raid to local SAS disk on the headnodes); so gridftp not the bottleneck, getting almost line speed.
SWT2_UTA status (Patrick)
NET2 status (John/Saul)
- Have dedicated 10G network to our cisco 6509 router.
- Bottleneck is disk I/O.
- 60 TB GPFS system. 500-600 MB/s writing, 800 MB/s reading using 9 nodes.
- One gridftp door. Xeon 2.8 GHz, 4 GB ram.
- Interesting thing will be performance from the gridftp node.
WT2 status (Wei)
- Running 1 Gbps external link. January upgrade to 10 Gbps.
- 3 Thumpers, each has 4 Gig E nics, channel bonded.
- 1 Gridftp server, Gig E.
- 500 MB/s memory to disk on Thumper, using dd write.
- xrdcp on gridftp door, multithreaded, 104 MB/s over a gigabit link.
- g-u-c on gridftp door: 37 MB/s reading, 50 MB/s writing. Not sure what the difference is, since the same path is taken. Note g-u-c is built on top of the Posix xrootd library. Additional note: talked to Andy and got undocumented Xrootd environment variables. Now 50 MB/s reading as well. It seems at this point the gridftp server has saturated the CPU (2x AMD Opteron 244, 1.8 GHz, 2 GB).
- Will use SRM from Bestman. Concerned about SRM functioning w/ FTS. No place to test.
- Shawn notes there is an srm-testers site that can monitor your SRM endpoint. Suggests we look at this.
SRM Testing Link
For sites interested in daily testing of their system see: LBNL SRM Testing or for results: LBNL SRM Testing Results.
Info on Site Testing
Wenjing Wu will provide details about testing methodology used at AGLT2. Also Saul pointed out there are some existing PacMan caches which have testing tools ready for easy installation and use:
- % pacman -get BU:Bonnie (installs and runs Bonnie)
- % pacman -get BU:IO-benchmark (installs and runs Bonnie++ [much slower])
- % pacman -get BU:Connectathon (installs and runs Connectathon)
- % pacman -get JAB:FDT (FDT for fast file transfers)
Need examples of a "standard" dd test that each site should run on each storage element they have.
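Until a standard dd recipe is posted, the idea of a timed sequential write and read-back can be sketched as follows. This is not the AGLT2 methodology: the function names, the 1 MiB block size and the small demo file size are assumptions, and rates are reported in MiB/s. For a meaningful disk number the file should be larger than the node's RAM, otherwise the read-back is largely served from the page cache.

```python
import os
import tempfile
import time

MIB = 1024 * 1024


def timed_write(path, total_mib=1024, block_mib=1):
    """Write total_mib of zeros in block_mib chunks, fsync, return MiB/s."""
    block = bytes(block_mib * MIB)          # a block of zero bytes
    blocks = total_mib // block_mib
    start = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(blocks):
            f.write(block)
        f.flush()
        os.fsync(f.fileno())                # include time to reach the disk
    elapsed = max(time.perf_counter() - start, 1e-9)
    return blocks * block_mib / elapsed


def timed_read(path, block_mib=1):
    """Read the file sequentially in block_mib chunks, return MiB/s."""
    total = 0
    start = time.perf_counter()
    with open(path, "rb", buffering=0) as f:
        while True:
            chunk = f.read(block_mib * MIB)
            if not chunk:
                break
            total += len(chunk)
    elapsed = max(time.perf_counter() - start, 1e-9)
    return total / MIB / elapsed


if __name__ == "__main__":
    # Small demo size so the sketch runs quickly; far too small for a real test.
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "ddtest.bin")
        print(f"write: {timed_write(target, total_mib=64):.0f} MiB/s")
        print(f"read:  {timed_read(target):.0f} MiB/s")
```

To test a storage element, point the path at a file on that filesystem and raise `total_mib` well above the amount of installed memory.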
--
RobertGardner - 05 Dec 2007