
This is my scenario:

-Workstation1: CPU i7-3770 / 16GB RAM / Gigabyte Z77-D3H motherboard / Crucial CT256MX100SSD1 system disk / Intel X540-T1 network adapter / Windows 7 64-bit
-Workstation2: CPU i7-950 / 12GB RAM / ASUS P6X58D-E motherboard / Crucial CT256MX100SSD1 system disk / Intel X540-T1 network adapter / Windows 7 64-bit
-Switch: HP ProCurve 2920 with two dual-port 10 Gigabit Ethernet expansion cards.

Both UTP cables are Cat6, shorter than 15 metres, and run directly from each workstation to the HP switch. Both network adapters report a 10 Gbps link.

I'm testing network performance with iperf:

-workstation1: iperf -s
-workstation2: iperf -c <workstation1 ip>

I'm getting about 1 Gbit per second instead of 10 Gbit per second. Is there any step I'm doing wrong? Is there any information about Windows 7 network limitations? Thanks.
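As a side note, iperf defaults to a single TCP stream with a small window, which can by itself cap throughput well below 10 Gbps; a multi-stream, larger-window run (a sketch assuming iperf 2.x flags, with 8 parallel streams and a 512 KB window chosen arbitrarily) would look like:

iperf -c <workstation1 ip> -P 8 -w 512k -t 30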

UPDATED - NTttcp tests

C:\NTttcp-v5.28\x64>NTttcp.exe -s -m 8,*,192.168.1.20 -l 128k -a 2 -t 15

Copyright Version 5.28
Network activity progressing...

Thread  Time(s) Throughput(KB/s) Avg B / Compl    
======  ======= ================ =============    

0       15.001        38661.956    131072.000
1       14.999        38257.484    131072.000   
2       14.998        53989.065    131072.000   
3       14.998        38336.845    131072.000   
4       14.999        38086.806    131072.000   
5       15.000        37563.733    131072.000   
6       14.997        56408.082    131072.000   
7       15.000        52292.267    131072.000   


#####  Totals:  #####


Bytes(MEG)   realtime(s)  Avg Frame Size  Throughput(MB/s)
===========  ===========  ==============  ================
5179.250000  15.000       1459.696        345.283


Throughput(Buffers/s)  Cycles/Byte  Buffers
=====================  ===========  ==========
2762.267               6.912        41434.000


DPCs(count/s)  Pkts(num/DPC)  Intr(count/s)  Pkts(num/intr)
=============  =============  =============  ==============
13668.933      1.633          22030.933      1.013


Packets Sent  Packets Received  Retransmits  Errors  Avg. CPU %
============  ================  ===========  ======  ==========
3720525       334723            4364         0       10.179
Abraham
  • Try to verify the results with a different tool like nttcp - https://gallery.technet.microsoft.com/NTttcp-Version-528-Now-f8b12769 – pauska Jun 14 '15 at 21:35
  • Thanks. I tried the tool from the link you sent me and I'm getting 345 MB/s, which is about a third of the nominal network performance. – Abraham Jun 14 '15 at 21:48
  • Try to crossover connect the workstations and run the test again - it will show you if the switch is the limiting factor. – pauska Jun 14 '15 at 21:59
  • I'll try, but switch CPU is about 5% during tests. – Abraham Jun 14 '15 at 22:52
  • What do the host and switch interfaces say about errors, drops, late collisions, giants, runts, buffer overruns/underruns, failures, lost carrier, etc.? – Ron Maupin Oct 28 '19 at 01:05

1 Answer


Try the suggestions shown here

  1. Make sure RSS (Receive Side Scaling), LSO (Large Send/Segment Offload), TCP window scaling (auto-tuning) and TCP Chimney (on Windows), and optionally RSC (Receive Segment Coalescing), are set up and configured properly.

Even modern processors cannot handle 10Gb worth of reads on a single processor core, so RSS needs to be set up with a minimum of 4 physical processor cores (RSS doesn't work on Hyper-Threaded logical cores), possibly 8 depending on the processor, to distribute the receive load across multiple cores. You can do this via PowerShell (Windows) with the Set-NetAdapterRss cmdlet.

Example command for a 4-physical-core CPU with Hyper-Threading (0, 2, 4, 6 are physical, 1, 3, 5, 7 are logical... pretty much a rule of thumb):

Set-NetAdapterRss -Name "<adapter name>" -NumberOfReceiveQueues 4 -BaseProcessorNumber 0 -MaxProcessorNumber 6 -MaxProcessors 4 -Enabled $true
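To double-check what the driver actually applied (this assumes the NetAdapter cmdlets are available, i.e. Windows 8/Server 2012 or later; on Windows 7 the same options usually live in the NIC driver's Advanced tab, and "<adapter name>" is a placeholder for the NIC name):

# Find the NIC name, then dump its current RSS configuration
Get-NetAdapter
Get-NetAdapterRss -Name "<adapter name>"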

LSO is set in the NIC drivers and/or PowerShell. This allows Windows/Linux/whatever to create a large packet (say 64KB-1MB) and let the NIC hardware handle segmenting the data to the MSS value. This lowers processor usage on the host and makes the transfer faster since segmenting is faster in hardware and the OS has to do less work.
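A minimal PowerShell sketch for checking and enabling LSO (same assumption that the NetAdapter cmdlets are available; "<adapter name>" is a placeholder for the NIC name from Get-NetAdapter):

# Show the current Large Send Offload state, then enable it for IPv4 and IPv6
Get-NetAdapterLso -Name "<adapter name>"
Set-NetAdapterLso -Name "<adapter name>" -IPv4Enabled $true -IPv6Enabled $true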

RSC is set in Windows or Linux and on the NIC. It does the opposite of LSO: small chunks are received by the NIC and coalesced into one large packet that is handed to the OS. This lowers processor overhead on the receive side.
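The same idea for RSC, as a sketch under the same assumptions:

# Show the current Receive Segment Coalescing state, then enable it
Get-NetAdapterRsc -Name "<adapter name>"
Enable-NetAdapterRsc -Name "<adapter name>"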

While TCP Chimney gets a bad rap in the 1Gb world, it shines in the 10Gb world. Set it to Automatic in Windows 8+/2012+ and it will only enable on 10Gb networks under certain circumstances.
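One way to set and verify this from an elevated prompt (standard netsh syntax; whether "automatic" actually engages depends on the OS version and NIC support):

netsh int tcp set global chimney=automatic
netsh int tcp show global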

TCP window scaling (auto-tuning in the Windows world) is an absolute must. Without it the TCP windows will never grow large enough to sustain high throughput on a 10Gb connection.
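To check that auto-tuning has not been disabled, and to restore the default if it has (netsh syntax, valid on Windows Vista and later):

netsh int tcp show global
netsh int tcp set global autotuninglevel=normal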

  2. Enable 9K jumbo frames (some people say no, some say yes... it really depends on hardware, so test both ways).

On my hardware, enabling jumbo frames was the critical thing. Also pay special attention to the IRQ coalescing (interrupt moderation) setting.
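As a sketch, both settings can often be changed through the driver's advanced properties, assuming the driver exposes the standardized *JumboPacket and *InterruptModeration keywords (keyword names and accepted values vary by driver, so list them first):

# List the advanced properties this driver exposes
Get-NetAdapterAdvancedProperty -Name "<adapter name>"
# 9014 bytes is a common "9K" jumbo frame value for Intel drivers
Set-NetAdapterAdvancedProperty -Name "<adapter name>" -RegistryKeyword "*JumboPacket" -RegistryValue 9014
# Interrupt moderation (IRQ coalescing); accepted display values are driver-dependent
Set-NetAdapterAdvancedProperty -Name "<adapter name>" -RegistryKeyword "*InterruptModeration" -DisplayValue "Enabled"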

shodanshok
  • Can you please elaborate on the "here" within your answer? Links can die, so providing a single link as an answer is not ideal. – krisFR Jun 14 '15 at 22:14
  • I've edited my answer – shodanshok Jun 15 '15 at 05:26
  • Hi. It's very interesting, but does it make such a difference from the standard config that I'm only getting a third of the maximum rate? From 10 Gbps to 3 Gbps. – Abraham Jun 16 '15 at 21:05
  • Hi, jumbo frames can really make a big difference, and the same can be said for IRQ coalesce tuning. Simply try it ;) – shodanshok Jun 17 '15 at 06:47
  • @shodanshok Is it possible to do this on windows 7 (particularly rss) ? – bumble_bee_tuna Sep 23 '19 at 17:28
  • @bumble_bee_tuna It depends on the network driver in use by Windows. Some "basic" network drivers do not provide many options, while others expose many more tunables. – shodanshok Sep 23 '19 at 20:58
  • @shodanshok, "_...jumbo frames can really make a big difference..._" Not really. That has been way overplayed by marketers. I have seen the real math put to running full Ethernet frames (1500 bytes) vs. jumbo frames (9000+ bytes), and you can possibly save five sets of Ethernet, IP and TCP headers (5 * 58 bytes) per 9000-byte frame. That is really a pretty small percentage. Also, when a frame, packet, or segment is lost (it happens all the time), there is just that much more to resend. The manufacturer numbers are 4.4%, but I think that is too much, too, for a problematic non-standard. – Ron Maupin Oct 28 '19 at 01:25
  • @RonMaupin in my case, the critical advantage provided by jumbo frames was the much decreased interrupt rate. While a similar decrease can be obtained with IRQ coalescing, not all drivers have good support for this feature. Enabling jumbo frames is a simple method to lower headers *and* IRQ overheads. – shodanshok Oct 28 '19 at 06:45