
I'm encountering a strange performance problem on Windows Server 2008 R2 Enterprise SP1.

Here is the setup:

  • Many processes listening to distinct multicast UDP streams (5 multicast groups per process), all bound to a single NIC
  • Across processes, all multicasts use the same port range but different multicast IPs (an important detail, since every receiver for a given port shares a reused server socket; a minimal listener sketch follows this list)
  • Each process receives about 10 Mbit/s of multicast traffic
  • RSS enabled on the NIC, maximum offload settings enabled on NIC & OS, MSI activated
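For reference, a minimal sketch of what one listener socket does, assuming plain Winsock with SO_REUSEADDR and IP_ADD_MEMBERSHIP; the group, port, and interface addresses are placeholders, and the real processes join five groups each:

```c
/* Minimal sketch of one listener socket as described above (Winsock2).
   Group, port and interface values are placeholders; error handling omitted. */
#include <winsock2.h>
#include <ws2tcpip.h>
#include <stdio.h>
#pragma comment(lib, "ws2_32.lib")

int main(void)
{
    WSADATA wsa;
    WSAStartup(MAKEWORD(2, 2), &wsa);

    SOCKET s = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP);

    /* All listeners on a given port share it, hence the reused server socket. */
    int reuse = 1;
    setsockopt(s, SOL_SOCKET, SO_REUSEADDR, (const char*)&reuse, sizeof(reuse));

    struct sockaddr_in local = {0};
    local.sin_family = AF_INET;
    local.sin_port = htons(30001);                 /* placeholder port */
    local.sin_addr.s_addr = htonl(INADDR_ANY);
    bind(s, (struct sockaddr*)&local, sizeof(local));

    /* Join one multicast group; the real processes join five each. */
    struct ip_mreq mreq = {0};
    mreq.imr_multiaddr.s_addr = inet_addr("239.1.1.1");     /* placeholder group    */
    mreq.imr_interface.s_addr = inet_addr("192.168.0.10");  /* placeholder NIC addr */
    setsockopt(s, IPPROTO_IP, IP_ADD_MEMBERSHIP, (const char*)&mreq, sizeof(mreq));

    char buf[2048];
    for (;;) {
        /* The listeners do nothing but read; user CPU stays near zero. */
        recv(s, buf, sizeof(buf), 0);
    }
}
```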

Behaviour:

  • Below 17 listening processes (about 85 joined UDP multicast groups), the kernel CPU impact is negligible.
  • Between 17 & 22 listeners (about 110 joined UDP multicast groups), kernel CPU usage begins to grow slowly but remains acceptable.
  • Above 25, each additional joined multicast has a huge impact on kernel CPU time, across all RSS-bound CPUs.
  • User CPU time per listening process is near zero (expected, since the processes do nothing but read the multicast), so the real problem lies in an OS component.

What we found:

  • Changing NIC hardware has no impact on the behaviour (tested on an HP NC382i, Broadcom-based NIC, and an HP NC365T, quad-gigabit, Intel-based).
  • Total receive bandwidth is not the limiting factor (a single 500 Mbit/s stream does not trigger the CPU load).
  • Reading from the multicast sockets does not seem to be the limiting factor (we reproduced the CPU load with dumb processes that only JOIN the multicast streams and never read).
  • Splitting the multicast traffic across two NICs seems to limit the CPU load and spread it better; however, that is not a usable configuration for us.

Problem:

  • We need to be able to listen to at least about 500 multicast streams, and possibly up to 750.
  • The same hardware running Windows XP does not show this kernel CPU behaviour.

Suspected Component:

  • NDIS.sys seems to be a good candidate for explaining the CPU usage increase.

Have any of you encountered such a problem and could you give some direction to investigate? I've read all I could about Windows Server 2008 network performance enhancements, but everything seems to be tied to TCP traffic. I've also tested every optimization that can be applied via the registry or netsh.

dweeves
  • Edit: Using different ports to listen to the multicast streams lowers the CPU load, but not entirely. So part of the problem seems to be tied to server-socket reuse. However, even with different ports, the cost still seems a bit expensive for just consuming 250 multicast streams. – dweeves Aug 22 '11 at 13:01
  • You must use separate ports for each stream, otherwise Winsock is going to multiplex every packet to every subscriber socket and then filter it out. This is different from Unix & Linux, which perform no additional filtering, so each socket will see all the packets. – Steve-o Aug 24 '11 at 09:18
  • Also don't forget the [Multicast MAC Address Mapping](http://www.cisco.com/en/US/tech/tk828/technologies_white_paper09186a00802d4643.shtml) limitations: 238.1.1.1 and 239.1.1.1 are effectively the same (see the short illustration after these comments). – Steve-o Aug 24 '11 at 09:36
  • @dweeves How are the server and the clients connected? Are they connected through a layer-2 switch? Does that switch support IGMP snooping? – Ricardo Polo Jaramillo Aug 27 '11 at 20:50
  • @dweeves, I can confirm that having all unique destination ports solves the problem. Not that this is a wishful situation; I hope someone else has a better solution. I have tested with >100 UDP streams, trying unique source ports and unique multicast addresses as well. The only thing that seems to work is unique destination ports. – GizMoCuz Feb 10 '14 at 13:51
  • @user208780 This is a very old question, you're unlikely to get a response from anyone here. Also, Server 2008 is 4 versions and 6 years old. You should consider using 2008 R2, 2012, or 2012 R2. – Chris S Feb 10 '14 at 14:24
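To illustrate the MAC-mapping comment above, a small sketch assuming the standard IANA rule that only the low 23 bits of an IPv4 group address are copied into the 01:00:5e Ethernet prefix (the two addresses are the ones from the comment):

```c
/* Sketch: compute the Ethernet MAC an IPv4 multicast group maps to.
   Only the low 23 bits of the address survive, so 238.1.1.1 and 239.1.1.1
   both become 01:00:5e:01:01:01 and occupy one NIC hardware-filter entry. */
#include <stdio.h>
#include <stdint.h>

static void print_mcast_mac(uint32_t ip /* host byte order, e.g. 0xEE010101 */)
{
    uint32_t low23 = ip & 0x7FFFFF;          /* keep the low 23 bits only */
    printf("01:00:5e:%02x:%02x:%02x\n",
           (low23 >> 16) & 0x7F, (low23 >> 8) & 0xFF, low23 & 0xFF);
}

int main(void)
{
    print_mcast_mac(0xEE010101);   /* 238.1.1.1 -> 01:00:5e:01:01:01 */
    print_mcast_mac(0xEF010101);   /* 239.1.1.1 -> 01:00:5e:01:01:01 */
    return 0;
}
```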

1 Answer


That's a lot of multicast streams. NICs typically have a low limit on hardware filter entries, and when you exceed it they either drop everything (a poor implementation found on cheap NICs) or forward everything to the operating system for it to filter instead. When the operating system is performing the filtering, your processor usage is going to skyrocket.

Aside from investigating different hardware (you list some; you could extend to 10GigE-based cards too), the only real option is to use proxy servers.

By experimentation, find the number of multicast streams that can be managed reliably, then forward the streams on via TCP to a central server or set of servers. That central server can then use TCP segmentation offload or a full TOE to render the incoming network load insignificant to the processor.
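A rough sketch of that relay idea, not a tested design: each edge box joins a manageable subset of groups and pushes the datagrams over one TCP connection to the central server. Group, port, and upstream addresses below are placeholders.

```c
/* Sketch of a multicast-to-TCP relay as suggested above (Winsock2).
   Group/port/upstream values are placeholders; error handling omitted. */
#include <winsock2.h>
#include <ws2tcpip.h>
#pragma comment(lib, "ws2_32.lib")

int main(void)
{
    WSADATA wsa;
    WSAStartup(MAKEWORD(2, 2), &wsa);

    /* 1. Join one multicast group (a real relay would join its whole subset). */
    SOCKET mc = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP);
    struct sockaddr_in local = {0};
    local.sin_family = AF_INET;
    local.sin_port = htons(30001);                        /* placeholder */
    local.sin_addr.s_addr = htonl(INADDR_ANY);
    bind(mc, (struct sockaddr*)&local, sizeof(local));

    struct ip_mreq mreq = {0};
    mreq.imr_multiaddr.s_addr = inet_addr("239.1.1.1");   /* placeholder */
    mreq.imr_interface.s_addr = htonl(INADDR_ANY);
    setsockopt(mc, IPPROTO_IP, IP_ADD_MEMBERSHIP, (const char*)&mreq, sizeof(mreq));

    /* 2. One TCP connection to the central server. */
    SOCKET up = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
    struct sockaddr_in srv = {0};
    srv.sin_family = AF_INET;
    srv.sin_port = htons(40000);                          /* placeholder */
    srv.sin_addr.s_addr = inet_addr("10.0.0.1");          /* placeholder */
    connect(up, (struct sockaddr*)&srv, sizeof(srv));

    /* 3. Relay: length-prefix each datagram so the server can re-frame it. */
    char buf[2048];
    for (;;) {
        int n = recv(mc, buf, sizeof(buf), 0);
        if (n <= 0) continue;
        unsigned short len = htons((unsigned short)n);
        send(up, (const char*)&len, sizeof(len), 0);
        send(up, buf, n, 0);
    }
}
```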

I cannot get decent multicast rates with Broadcom hardware at all due to very poor Windows drivers. It would be interesting to see how Linux performs on the same hardware; that should give you a good indication of the hardware and IP-stack quality.

You list Windows XP as working fine; the major difference between Windows Server and Windows XP here is the thread quantum. Windows Server uses longer quanta, so it might be worth investigating forcing a shorter quantum (if you can even set it).

Steve-o
  • @dweeves, so wait, what is the distribution of ports and addresses in the question? Addresses are filtered by the NIC; ports are filtered by the OS. – Steve-o Aug 24 '11 at 09:14
  • The latest investigation I made managed to achieve 250 streams at 0% CPU, but using a single multicast address & 250 different ports. So it seems that too many multicast addresses cause problems for the OS (probably due to some filtering limitation). There is no way to use several servers & proxies; the core interest of the development I'm doing is to achieve a mass ne… – dweeves Aug 24 '11 at 09:16