
I have a web server with 4 CPUs that intermittently suffered packet loss. We eventually moved all the applications and data to another system with 8 CPUs. We did this because we found nothing wrong except one thing: the average CPU utilization consistently went up to 80%.

During troubleshooting, I checked the /proc/interrupts file and saw that the interrupts were fixed to CPU 0. I also ran "mpstat -P ALL" to see the utilization of each CPU, and none of them was at 100% at the time we checked.
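As a rough sketch of the /proc/interrupts check described above, the per-CPU counters for a NIC can be pulled out with a few lines of Python. The interface name `eth0` and the sample text are assumptions for illustration, not data from the affected server.

```python
# Illustrative sample in the layout of /proc/interrupts (values are made up).
SAMPLE_INTERRUPTS = """\
           CPU0       CPU1       CPU2       CPU3
  0:         46          0          0          0   IO-APIC   2-edge      timer
 24:     981234          0          0          0   PCI-MSI 524288-edge      eth0
NMI:          0          0          0          0   Non-maskable interrupts
"""


def irq_counts(interrupts_text, device):
    """Return {irq_label: [count_cpu0, count_cpu1, ...]} for lines naming `device`."""
    lines = interrupts_text.strip().splitlines()
    ncpu = len(lines[0].split())  # header row: CPU0 CPU1 ...
    result = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        fields = rest.split()
        # Everything after the per-CPU counters is the description.
        if not any(device in field for field in fields[ncpu:]):
            continue
        result[label.strip()] = [int(value) for value in fields[:ncpu]]
    return result


print(irq_counts(SAMPLE_INTERRUPTS, "eth0"))
```

On a live Linux system the same function could be fed the contents of /proc/interrupts. If every count except the CPU0 column stays at zero, the interrupt is effectively being serviced by CPU 0 only.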

That said, we only know that the average CPU utilization was high. Any single CPU might have spiked to 100%, because we didn't use a monitoring system to gather CPU utilization on a regular basis; it was only checked by hand with the command. After we moved to the new system with more CPUs, the packet loss stopped. The following questions occur to me:

  1. If one CPU in the quad-core system reaches 100% utilization and the NIC interrupt is fixed to that CPU, will the kernel schedule another, less busy CPU to handle the NIC interrupt in its place?
  2. The packet loss was resolved after adding more CPUs to the system. Is this because the more CPUs a system has, the lower the probability that the CPU handling the NIC interrupt reaches 100% utilization?
  3. Does adding more CPUs to the system result in fewer context switches and therefore less system overhead?
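The concern that a single CPU may have spiked while the average looked moderate can be checked by sampling per-CPU counters at regular intervals. The sketch below computes per-CPU utilization from two snapshots in the format of Linux's /proc/stat. The snapshot values are invented for illustration, and counting iowait as idle time is a simplifying assumption.

```python
# Two illustrative /proc/stat snapshots (made-up tick counts).
BEFORE = """\
cpu  200 0 100 700 0 0 0 0 0 0
cpu0 100 0 50 350 0 0 0 0 0 0
cpu1 100 0 50 350 0 0 0 0 0 0
"""
AFTER = """\
cpu  260 0 110 790 0 0 40 0 0 0
cpu0 150 0 60 350 0 0 40 0 0 0
cpu1 110 0 50 440 0 0 0 0 0 0
"""


def parse_cpu_times(stat_text):
    """Return {cpu_name: (busy_ticks, total_ticks)} for each per-CPU line."""
    times = {}
    for line in stat_text.splitlines():
        fields = line.split()
        # Per-CPU lines start with "cpu0", "cpu1", ...; bare "cpu" is the aggregate.
        if not fields or not fields[0].startswith("cpu") or fields[0] == "cpu":
            continue
        values = [int(v) for v in fields[1:9]]  # user .. steal
        idle = values[3] + (values[4] if len(values) > 4 else 0)  # idle + iowait
        total = sum(values)
        times[fields[0]] = (total - idle, total)
    return times


def utilization(before, after):
    """Percent busy per CPU between two parse_cpu_times() snapshots."""
    result = {}
    for cpu, (busy_after, total_after) in after.items():
        busy_before, total_before = before[cpu]
        delta = total_after - total_before
        result[cpu] = 100.0 * (busy_after - busy_before) / delta if delta else 0.0
    return result


print(utilization(parse_cpu_times(BEFORE), parse_cpu_times(AFTER)))
```

In this made-up sample one CPU is fully busy while the other is nearly idle, so an average over both would hide the saturated one. Running such a sampler every few seconds and logging the result is one way to catch spikes that a one-off command would miss.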

Sven
Jepsenwan

1 Answer


If one CPU in the quad-core system reaches 100% utilization and the NIC interrupt is fixed to that CPU, will the kernel schedule another, less busy CPU to handle the NIC interrupt in its place?

Typically, no. The interrupt gets priority anyway, so there's no need to move the NIC interrupt.
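On Linux, the set of CPUs allowed to service a given IRQ is exposed as a hexadecimal bitmask in /proc/irq/<n>/smp_affinity. As a small sketch (the mask strings below are examples, not values from the system in the question), such a mask can be decoded into CPU numbers like this:

```python
def cpus_from_affinity_mask(mask_text):
    """Decode a hex CPU bitmask such as 'f' or '00000000,00000005' into CPU numbers."""
    # Large systems print the mask as comma-separated 32-bit groups.
    mask = int(mask_text.strip().replace(",", ""), 16)
    return [cpu for cpu in range(mask.bit_length()) if (mask >> cpu) & 1]


print(cpus_from_affinity_mask("1"))   # only bit 0 set
print(cpus_from_affinity_mask("f"))   # bits 0 through 3 set
```

A mask that decodes to a single CPU means the interrupt stays on that CPU regardless of how busy it is.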

The packet loss was resolved after adding more CPUs to the system. Is this because the more CPUs a system has, the lower the probability that the CPU handling the NIC interrupt reaches 100% utilization?

No. Why would that matter? The interrupt, as its name implies, interrupts the CPU and makes it service the interrupt.

Does adding more CPUs to the system result in fewer context switches and therefore less system overhead?

It could, but that's unlikely to make any difference. More CPUs will only reduce unforced context switches (ones that the system decides to take even though it doesn't have to) and nobody designs a system so badly that unforced context switches have a significant impact on performance.
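To get a feel for how many context switches a system actually performs, the system-wide counter on the `ctxt` line of Linux's /proc/stat can be sampled twice. This is only a sketch with invented numbers, and the counter covers all context switches; it does not separate the unforced ones discussed above from the rest.

```python
def context_switches(stat_text):
    """Return the cumulative context-switch count from /proc/stat-style text."""
    for line in stat_text.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[0] == "ctxt":
            return int(fields[1])
    raise ValueError("no ctxt line found")


def switches_per_second(before_text, after_text, interval_seconds):
    """Average context-switch rate between two snapshots taken interval_seconds apart."""
    delta = context_switches(after_text) - context_switches(before_text)
    return delta / interval_seconds


# Made-up snapshots taken 5 seconds apart.
print(switches_per_second("ctxt 1000\n", "ctxt 6000\n", 5))
```

Comparing this rate on the old and new systems under similar load would show whether the context-switch rate changed at all.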

Speculating just from what you've said, I'd suspect that under some conditions where the system was under high load, packets were lost because the network card wasn't serviced quickly enough. Likely this is not due to the interrupt not being serviced fast enough, but to the other work associated with network traffic not getting completed quickly enough to keep up with the packet rate. This includes, for example, all the operations required by the TCP protocol. If this backs up, packets will be dropped somewhere.
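One place where this kind of backlog shows up on Linux is /proc/net/softnet_stat, which has one row of hexadecimal counters per CPU. In the sketch below the first three columns are read as packets processed, packets dropped because the per-CPU backlog queue was full, and "time squeeze" events where the softirq ran out of budget. The sample rows are invented, and rows are assumed to be in CPU order.

```python
# Illustrative sample in the layout of /proc/net/softnet_stat (values are made up).
SAMPLE_SOFTNET = """\
0012d687 0000002a 00000005 00000000 00000000 00000000 00000000 00000000 00000000
00001f40 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
"""


def softnet_drops(softnet_text):
    """Return a list of (row_index, processed, dropped, time_squeeze) tuples."""
    rows = []
    for index, line in enumerate(softnet_text.strip().splitlines()):
        fields = [int(field, 16) for field in line.split()]
        rows.append((index, fields[0], fields[1], fields[2]))
    return rows


for row, processed, dropped, squeezed in softnet_drops(SAMPLE_SOFTNET):
    print(f"row {row}: processed={processed} dropped={dropped} time_squeeze={squeezed}")
```

If the dropped or time-squeeze counters grow on only one row while packet loss is occurring, that would be consistent with a single CPU falling behind on network processing.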

David Schwartz
  • Thanks David. In my case, even a ping test to the web server showed packet loss. The ICMP echo on the machine wasn't associated with any of the other work your answer mentioned. – Jepsenwan Jun 24 '17 at 09:38
  • @David If multiple IRQ handlers run on core 0 and core 0 is pegged at 100% it'd make sense to move handlers to other cores. – XTF Jun 24 '17 at 09:52
  • @Jepsenwan It doesn't matter. If the system is overwhelmed handling TCP packets, then it may wind up dropping packets before it has a chance to even tell what type of packets they are. Think of it like a line of customers at a grocery store checkout. If it's taking too long to handle all the customers who need to get cigarettes from a case, then even some customers who don't want cigarettes will leave because the line is too long. – David Schwartz Jun 24 '17 at 10:50
  • @DavidSchwartz Thanks. Are there specific steps or benchmark tools that would reveal this? – Jepsenwan Jun 24 '17 at 10:58