I have a web server with 4 CPUs that was intermittently losing packets. In the end we moved all the application and data to another system with 8 CPUs. We did this because we found nothing wrong except for one thing: the average CPU utilization consistently went up to 80%.
During troubleshooting, I checked the /proc/interrupts file and saw that the NIC interrupts were all pinned to CPU 0. I also ran "mpstat -P ALL" to see the utilization of each CPU, and none of them was at 100% at the time we checked.
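For reference, this is roughly how the /proc/interrupts check could be scripted instead of read by eye. It is only a minimal sketch: the function names `parse_interrupts` and `busiest_cpu` are my own, and it assumes the usual layout of a header row of CPU names followed by one row per interrupt source.

```python
def parse_interrupts(text):
    """Parse /proc/interrupts text into {label: (per_cpu_counts, description)}."""
    lines = text.splitlines()
    if not lines:
        return {}
    ncpu = len(lines[0].split())  # header row: CPU0 CPU1 ...
    table = {}
    for line in lines[1:]:
        label, sep, rest = line.partition(":")
        if not sep:
            continue  # not an interrupt row
        fields = rest.split()
        counts = []
        for field in fields[:ncpu]:
            if not field.isdigit():
                break  # reached the description columns
            counts.append(int(field))
        description = " ".join(fields[len(counts):])
        table[label.strip()] = (counts, description)
    return table


def busiest_cpu(counts):
    """Return (cpu_index, share_of_total) for the CPU with the most interrupts."""
    total = sum(counts)
    if total == 0:
        return None, 0.0
    index = max(range(len(counts)), key=counts.__getitem__)
    return index, counts[index] / total


def report(path="/proc/interrupts", needle="eth0"):
    """Print the interrupt distribution for rows whose description contains needle.

    "eth0" is only a placeholder interface name, not taken from my system.
    """
    with open(path) as handle:
        table = parse_interrupts(handle.read())
    for label, (counts, description) in table.items():
        if needle in description:
            cpu, share = busiest_cpu(counts)
            print(f"IRQ {label} ({description}): busiest CPU {cpu}, share {share:.0%}")
```

Calling `report()` on the server would print one line per matching IRQ. A share close to 100% on a single CPU would match what I saw, with everything landing on CPU 0.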
That said, although the average CPU utilization only went up to 80%, any single CPU might have spiked to 100%, because we did not use a monitoring system to collect CPU utilization on a regular basis. It was only checked by running the command by hand. Since moving to the new system with more CPUs, the packet loss has not happened again. The following questions occur to me:
- If one CPU in the quad-core system reaches 100% utilization and the NIC interrupt is pinned to that CPU, will the kernel pick another, less busy CPU to handle the NIC interrupt in its place?
- The packet loss was resolved by adding more CPUs to the system. Is this because the more CPUs a system has, the lower the probability that the CPU handling the NIC interrupt reaches 100% utilization?
- Does adding more CPUs to the system result in fewer context switches and therefore less system overhead?
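
Since I only spot-checked with mpstat, a short spike on a single CPU could easily have been missed. Below is a minimal sketch of how per-CPU utilization could be sampled continuously from /proc/stat. The function names, the 1-second interval and the 95% threshold are my own assumptions, and it relies on the standard field order of the per-CPU lines (user, nice, system, idle, iowait, irq, softirq, steal, ...).

```python
import time


def read_cpu_times(text):
    """Map 'cpuN' -> list of tick counters from /proc/stat text."""
    times = {}
    for line in text.splitlines():
        fields = line.split()
        # Keep per-CPU rows (cpu0, cpu1, ...) and skip the aggregate "cpu" row.
        if fields and fields[0].startswith("cpu") and fields[0] != "cpu":
            times[fields[0]] = [int(value) for value in fields[1:]]
    return times


def utilization(before, after):
    """Return {cpu: (busy_percent, softirq_percent)} between two snapshots."""
    result = {}
    for cpu, new in after.items():
        old = before.get(cpu)
        if old is None:
            continue
        delta = [n - o for n, o in zip(new, old)]
        # Sum the first 8 fields only: guest time is already counted in user/nice.
        total = sum(delta[:8])
        if total <= 0:
            continue
        idle = delta[3] + (delta[4] if len(delta) > 4 else 0)  # idle + iowait
        softirq = delta[6] if len(delta) > 6 else 0
        result[cpu] = (100.0 * (total - idle) / total, 100.0 * softirq / total)
    return result


def watch(samples=60, interval=1.0, threshold=95.0, path="/proc/stat"):
    """Print a line whenever a single CPU exceeds threshold percent busy."""
    with open(path) as handle:
        previous = read_cpu_times(handle.read())
    for _ in range(samples):
        time.sleep(interval)
        with open(path) as handle:
            current = read_cpu_times(handle.read())
        for cpu, (busy, softirq) in sorted(utilization(previous, current).items()):
            if busy >= threshold:
                print(f"{time.strftime('%H:%M:%S')} {cpu} busy={busy:.1f}% softirq={softirq:.1f}%")
        previous = current
```

Running `watch()` during peak traffic would show whether one CPU, for example the one handling the NIC interrupt, saturates while the average stays near 80%. A high softirq share on that CPU would point at network receive processing.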