
I'm using a PC Engines APU device running FreeBSD as a NAT router. The setup is very common: one WAN connection and one LAN connection.

On paper, the WAN connection is 800/40 Mbit/s and the LAN is 1/1 Gbit/s. In practice, the router is connected via gigabit Ethernet both to a modem (WAN) and to a Netgear switch (LAN).

If I plug a fast PC straight into the WAN connection (the modem), I can reach actual download speeds of about 700 Mbit/s. But with the router in between, there is a severe performance hit and download speeds never get above 350 Mbit/s.

That could easily be explained by the router not being powerful enough.

The thing is, I tried to see what was going on, and while maxing out the connection (actual measured bandwidth: 350 Mbit/s), the router's CPUs were both idle about 30% of the time.

I understand this to mean that the CPU isn't the bottleneck. But then, what is? Is there a way to diagnose more accurately what the router is actually doing, and why it's only running at half capacity?

In order to make my question clearer, here are some additional details.

First, a visual representation of the issue:

[image: visual representation of the issue]

Then, for reference, the output of `top -S -C -H -P -s1 -ocpu`.

When there is very little traffic on the router:

last pid: 14077;  load averages:  0.00,  0.00,  0.00    up 0+18:13:58  12:02:53
118 processes: 3 running, 98 sleeping, 17 waiting
CPU 0:  0.0% user,  0.0% nice,  0.8% system,  0.0% interrupt, 99.2% idle
CPU 1:  0.0% user,  0.0% nice,  0.8% system,  0.0% interrupt, 99.2% idle
Mem: 16M Active, 89M Inact, 130M Wired, 497M Buf, 3678M Free
Swap: 8192M Total, 8192M Free

  PID USERNAME PRI NICE   SIZE    RES STATE   C   TIME     CPU COMMAND
   11 root     155 ki31     0K    32K CPU1    1  18.0H 100.00% idle{idle: cpu1}
   11 root     155 ki31     0K    32K RUN     0  18.0H 100.00% idle{idle: cpu0}
14077 root      20    0 21996K  3120K CPU0    0   0:00   0.10% top
   12 root     -92    -     0K   272K WAIT    1   5:22   0.00% intr{irq259: re0
   12 root     -92    -     0K   272K WAIT    0   4:21   0.00% intr{irq260: re1
    9 root     -16 ki-1     0K    16K pollid  0   1:51   0.00% idlepoll
   12 root     -60    -     0K   272K WAIT    0   1:40   0.00% intr{swi4: clock
    0 root     -16    0     0K   160K swapin  1   0:37   0.00% kernel{swapper}
    5 root     -16    -     0K    16K pftm    0   0:31   0.00% pf purge
24147 root      20    0 12464K  2176K select  0   0:25   0.00% apinger
11846 root      52   20 17144K  2692K wait    1   0:12   0.00% sh
52774 root      20    0 28172K 18060K select  1   0:10   0.00% ntpd{ntpd}
   15 root     -16    -     0K    16K -       0   0:09   0.00% rand_harvestq
87531 dhcpd     20    0 24820K 13576K select  1   0:08   0.00% dhcpd
44974 unbound   20    0 47020K 19840K kqread  0   0:08   0.00% unbound{unbound}
   20 root      16    -     0K    16K syncer  0   0:05   0.00% syncer

And when I try to max out the WAN connection (only reaching 318 Mbit/s in that case):

last pid: 41402;  load averages:  0.02,  0.01,  0.00    up 0+18:15:40  12:04:35
118 processes: 4 running, 98 sleeping, 16 waiting
CPU 0:  0.0% user,  0.0% nice,  0.7% system, 34.3% interrupt, 64.9% idle
CPU 1:  0.0% user,  0.0% nice,  0.0% system, 68.7% interrupt, 31.3% idle
Mem: 16M Active, 89M Inact, 130M Wired, 497M Buf, 3678M Free
Swap: 8192M Total, 8192M Free

  PID USERNAME PRI NICE   SIZE    RES STATE   C   TIME     CPU COMMAND
   11 root     155 ki31     0K    32K CPU0    0  18.0H  82.86% idle{idle: cpu0}
   11 root     155 ki31     0K    32K RUN     1  18.1H  69.87% idle{idle: cpu1}
   12 root     -92    -     0K   272K WAIT    1   5:27  32.86% intr{irq259: re0
   12 root     -92    -     0K   272K CPU0    0   4:23  17.19% intr{irq260: re1
14077 root      20    0 21996K  3232K CPU0    0   0:01   0.10% top
    9 root     -16 ki-1     0K    16K pollid  0   1:51   0.00% idlepoll
   12 root     -60    -     0K   272K WAIT    0   1:40   0.00% intr{swi4: clock
    0 root     -16    0     0K   160K swapin  0   0:37   0.00% kernel{swapper}
    5 root     -16    -     0K    16K pftm    1   0:31   0.00% pf purge
24147 root      20    0 12464K  2176K select  0   0:25   0.00% apinger
11846 root      52   20 17144K  2692K wait    0   0:12   0.00% sh
52774 root      20    0 28172K 18060K select  1   0:10   0.00% ntpd{ntpd}
   15 root     -16    -     0K    16K -       0   0:09   0.00% rand_harvestq
87531 dhcpd     20    0 24820K 13576K select  1   0:08   0.00% dhcpd
44974 unbound   20    0 47020K 19840K kqread  1   0:08   0.00% unbound{unbound}
   20 root      16    -     0K    16K syncer  0   0:05   0.00% syncer
Ecco
  • What happens when you try to max out the router without the MacBook doing anything? It could just be the router settings (being at half speed, maybe a duplex setting issue?). `ethtool eth0` works on Linux; not sure about BSD. – exussum Feb 04 '15 at 12:02
  • To test directly from the router, `wget cachefly.cachefly.net/100mb.test` should work, as should speedtest. – exussum Feb 04 '15 at 12:08
  • Using wget you have to add `-O /dev/null`, otherwise you may also be limited by storage. – Ecco Feb 04 '15 at 12:19
  • Also check if the NAT rules apply only to forwarded packets, so that the `wget` test won't get NATed. NAT can cause enough CPU load on embedded devices to be a bottleneck on its own. If wget maxes out your gateway, then your bottleneck is probably the NAT. I haven't used FreeBSD so I don't have a reference for it, but on Linux, a 680 MHz MIPS CPU doing NAT will max out at 120-150 Mbit/s, while without NAT (and with fast-path) it can reach up to gigabit speeds (depending on the rest of the configuration). It may be worth a try :) – Cha0s Feb 04 '15 at 13:19
  • Have you got any end-to-end traces of the traffic with and without the router? Are those traces simultaneous? In those traces, is the MTU the same in both cases? Regarding TCP traffic, is the TCP window behaving the same in both? – Pedro Perez Feb 09 '15 at 17:07

6 Answers


I have developed a board using the Realtek RTL8211E PHY chip and I can assure you that it is able to operate at gigabit speed (actually 10/100/1000). The only problem with this PHY would be if it were not connected to the CPU through a gigabit interface (RGMII, for example). I couldn't find the PCB layout of your router on the internet to check.

However, as I wrote before, it sounds more like a duplex mismatch.
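
The duplex theory is easy to check on FreeBSD without ethtool; a minimal sketch, assuming the interfaces are re0/re1 as in the top output:

    # A healthy gigabit link should report something like:
    #   media: Ethernet autoselect (1000baseT <full-duplex>)
    ifconfig re0 | grep media
    ifconfig re1 | grep media

If either side shows 100baseTX or half-duplex, the negotiation is the problem rather than the routing.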

Mariusz S

It could be something related to the network cards and the path between them and the kernel/CPU (including interrupt processing). You should verify the various "offload" settings (sorry, I'm not familiar enough with FreeBSD to suggest the right tool). Also look for any other network-card driver-specific settings that can be tweaked, and experiment with them.
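
On FreeBSD those settings are exposed through ifconfig rather than ethtool; a minimal sketch, assuming the Realtek interfaces are re0/re1 as shown in the question's top output:

    # List the capabilities the driver advertises
    ifconfig -m re0

    # Turn the common offloads off, re-test, then turn them back on
    ifconfig re0 -rxcsum -txcsum -tso -lro
    ifconfig re0 rxcsum txcsum tso lro

    # FreeBSD's rough equivalent of Linux's /proc/interrupts,
    # for watching whether the NIC interrupt counts climb
    vmstat -i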

Craig Miskell
  • Actually, I already tried tweaking all the offload mechanisms (hardware checksum offload, TCP segmentation offload, large receive offload). Enabling or disabling them didn't noticeably change performance. While valuable, your input doesn't quite answer my question, since I would like to *measure* what's going on in order to try to improve things. And I don't know what to measure, since the CPU seems to be idling… – Ecco Jan 31 '15 at 20:54
  • Does FreeBSD expose some way of showing interrupt counts? In Linux it's /proc/interrupts – Craig Miskell Feb 01 '15 at 04:25
  • Also, most "offloading" settings don't really make sense when forwarding traffic. AFAIK, all the router sees are IP packets; it doesn't really have to deal with TCP directly. – Ecco Feb 04 '15 at 10:44

The CPU is not idling at all: one core is 68.7% busy processing interrupts and the other 34.3%, which is not idle. It's userspace that is idle, not the kernel.

I'm not familiar with FreeBSD, but can you set CPU affinity so that one core processes irq259 and the other irq260? Then see how busy each core is.
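
On FreeBSD, cpuset(1) can bind an interrupt to a CPU; a sketch using the IRQ numbers visible in the top output (259 for re0, 260 for re1):

    # Pin re0's interrupt to core 0 and re1's to core 1
    cpuset -l 0 -x 259
    cpuset -l 1 -x 260

Then re-run the transfer and watch whether one of the cores saturates.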

Dan
  • Well, the CPU sure is doing something, but it's not at 100% load. So I don't get why it's not routing things faster. – Ecco Feb 01 '15 at 18:50
  • I'd set top to refresh several times a second and watch it during a transfer. It's enough for one core to hit 100% now and then, and TCP will decrease its speed. – Dan Feb 02 '15 at 20:12
  • You want to be careful not to artificially increase the load purely through the extra monitoring in that suggestion though, @Dan – BE77Y Feb 05 '15 at 16:31

What does top's "load average" show after the speed test has been running for a while? Does it ever reach 1?

If it is not the CPU, maybe something is wrong at a lower layer? I suggest checking whether ethtool or mii-tool shows 1000FD in both cases (with and without the router in the middle). Maybe your router board forces some link settings, and you have a duplex mismatch issue?

Could you run `iperf -s` on your router to check the connection between your client and the router?
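
Something along these lines, assuming iperf is installed on both machines and 192.168.1.1 stands in for the router's LAN address (hypothetical):

    # On the router
    iperf -s

    # On a LAN client; this measures the client-router leg only,
    # taking the WAN link and NAT out of the picture
    iperf -c 192.168.1.1 -t 30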

Regards

Mariusz S

This is a rather old topic, but I thought I would contribute anyway. The bottleneck in your case is the CPU. The APU's CPU is multi-core, but you are probably maxing out a single core, and FreeBSD is presumably using a single thread for forwarding.

I have performed throughput testing on the APU platform with several operating systems. The results differ between BSD and Linux.

BSD-based operating systems (OpenBSD, pfSense, etc.) max out at 622 Mbit/s on the APU, while Linux-based systems (IPFire, DD-WRT, etc.) handle 1 Gbit/s with ease.

Here is more detailed information on the benchmark performed: https://teklager.se/en/knowledge-base/apu2c0-ipfire-throughput-test-much-faster-pfsense/

And here's throughput test for BSD: https://teklager.se/en/knowledge-base/apu2c0-pfsense-network-throughput-test/

If you are not committed to FreeBSD, try IPFire. It will give you full gigabit throughput.
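
If you would rather stay on FreeBSD, one thing that may spread forwarding work across cores is the netisr subsystem. A sketch of tunables to experiment with, assuming a two-core APU like the one in the question; these are assumptions to test, not a verified fix:

    # /boot/loader.conf (boot-time tunables)
    net.isr.maxthreads=2   # one netisr thread per core
    net.isr.bindthreads=1  # pin each thread to its own core

At runtime, `sysctl net.isr.dispatch=deferred` queues packets to those threads instead of processing them directly in the interrupt context.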

Sniku

Given that the CPU isn't at 100% utilization, the question becomes: what else in the system is limiting performance?

My bet is that the Ethernet chips just don't have the juice. Per the link in the question, your board uses the Realtek RTL8111E chip. I don't know anything in particular about this chip, but I do know that not all Ethernet cards/chips are created equal. Some brief googling suggests that Realtek isn't a particularly respected brand.

In my own testing several years ago, I found that Intel "server" PCIe cards could easily run at line rate, even with all offloading features disabled, but Intel "client" PCIe cards couldn't. The server card was $120, the client card $30. Go figure.

One thing that might help throughput, but might hurt latency, is to see whether interrupt coalescing is enabled (that's the Linux term; I'm not sure how to configure it on FreeBSD).
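
On FreeBSD, any interrupt-moderation knobs the re(4) driver exposes would show up as sysctls; a way to look, with no guarantee such a knob exists for this particular chip:

    # List everything the re(4) driver exposes for the first NIC
    sysctl dev.re.0

    # And identify the exact chip revision while you're at it
    pciconf -lv | grep -A3 re0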

Dan Pritts