
CentOS 6.9 (64 GB RAM)

Running nginx, MariaDB, PHP-FPM, iptables, Java

The server is having random but frequent bursts of 100% system CPU load on a single core, which cripples network connections to the server.

I found that the problem persists even with nginx, MariaDB, PHP-FPM, iptables and Java stopped.

What I have tried so far:

- Installing irqbalance: no change.
- Rebooting several times: no change.
- Running yum update: no change.
- Moving the SSD to another server with identical hardware: no change.
- SMART-checking the SSD: no errors reported.
- Checking whether swappiness is involved: nothing is being swapped.

According to "/proc/interrupts", the interrupt feeding ksoftirqd belongs to eth0. I don't know what troubleshooting steps to take to find the cause. I need help: the services hosted on this server are hurting badly from the downtime during the bursts, which can last 10-15 minutes, stop, and then reappear at random.
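To confirm which softirq class is actually burning the core, you can sample /proc/softirqs twice and compare the per-CPU counters; if NET_RX on the busy CPU grows by a large amount per second, it is network receive processing, which would match the eth0 line in /proc/interrupts. A minimal sketch:

```shell
# Print the NET_RX/NET_TX/TIMER rows of /proc/softirqs, wait a second,
# and print them again; the CPU column whose NET_RX count jumps the most
# is the core that ksoftirqd is saturating.
grep -E 'NET_RX|NET_TX|TIMER' /proc/softirqs
sleep 1
grep -E 'NET_RX|NET_TX|TIMER' /proc/softirqs
```

If TIMER or another class dominates instead, the problem is not packet processing and eth0 is a red herring.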

Neither top nor htop shows anything worrying running or taking that much CPU, just ksoftirqd and events.
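top and htop fold softirq time into kernel threads, so the imbalance is easier to see from the raw per-CPU softirq jiffies in /proc/stat (field 8 after the cpu label). A small sketch using only /proc, no extra packages assumed:

```shell
# Field 8 of each cpuN line in /proc/stat is time spent in softirq
# context (in jiffies). Sample twice, one second apart, and diff:
# the core with a large delta is where all the softirq work lands.
awk '/^cpu[0-9]/ {print $1, $8}' /proc/stat > /tmp/softirq.1
sleep 1
awk '/^cpu[0-9]/ {print $1, $8}' /proc/stat > /tmp/softirq.2
paste /tmp/softirq.1 /tmp/softirq.2 | awk '{print $1, $4 - $2}'
```

Run it during and outside a burst; a single CPU jumping by roughly a full second of jiffies per second confirms one core is pegged in softirq context.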

The problem started just a few days ago, no changes were made to the kernel/OS that I am aware of that could have caused this problem.

"iostat" during the 100% load

Linux 2.6.32-696.30.1.el6.x86_64 (CentOS-69-64-minimal) _x86_64_ (16 CPU)
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           8.01    0.00    3.03    0.20    0.00   88.76
Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sdb              83.52        18.46      1341.05    2874477  208769462
sda              94.26       435.50      1341.05   67797010  208769462
md1               0.00         0.01         0.00       2106         12
md0               0.26         0.25         1.82      38640     283096
md2             176.32       453.67      1322.56   70625762  205890864

"/proc/interrupts" during the 100% load

            CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7       CPU8       CPU9       CPU10      CPU11      CPU12      CPU13      CPU14      CPU15
   0:        681          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0   IO-APIC-edge      timer
   1:          2          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0   IO-APIC-edge      i8042
   8:          1          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0   IO-APIC-edge      rtc0
   9:          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0   IO-APIC-fasteoi   acpi
  12:          4          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0   IO-APIC-edge      i8042
  56:          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0   PCI-MSI-edge      aerdrv
  57:          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0   PCI-MSI-edge      aerdrv
  58:          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0   PCI-MSI-edge      aerdrv
  65:          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0   PCI-MSI-edge      xhci_hcd
  66:          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0   PCI-MSI-edge      xhci_hcd
  67:          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0   PCI-MSI-edge      xhci_hcd
  68:   16149263          0          0          0          0          0          0          0          0          0          0   19021454          0          0          0          0   PCI-MSI-edge      ahci
  69:          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0   PCI-MSI-edge      ahci
  70:  158827141          0          0          0   82558205          0          0          0          0          0    2755343          0          0          0          0          0   PCI-MSI-edge      eth0
 NMI:          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0   Non-maskable interrupts
 LOC:  123773684  105894389  123476055  142376826  111487788  122494116  118841739  134480148  113422196  121203288  114414525  114218214  114794017  119322938  115083581  119549111   Local timer interrupts
 SPU:          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0   Spurious interrupts
 PMI:          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0   Performance monitoring interrupts
 IWI:          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0   IRQ work interrupts
 RES:   54086898   67527262   46597734   44323475   25356657   32869325   18540932   20137227   13606660   13955101   14826738   12242106   10962617   11082631   10466998   10574150   Rescheduling interrupts
 CAL:       1258       1407       1440       1446       1474       1442       1448       1436       1436       1435       1435       1431       1438       1449       1449       1430   Function call interrupts
 TLB:    8082115    6419817    4992332    3914962    5927373    4081295    4056598    2953591    4134873    3207107    3852793    5106863    3780341    3298234    3875200    3270066   TLB shootdowns
 TRM:          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0   Thermal event interrupts
 THR:          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0   Threshold APIC interrupts
 MCE:          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0   Machine check exceptions
 MCP:        520        520        520        520        520        520        520        520        520        520        520        520        520        520        520        520   Machine check polls
 ERR:          0
 MIS:          0
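The eth0 counts above land on only a few CPUs, so one possible mitigation (a workaround, not a root-cause fix) is to spread eth0's receive processing across more cores. A sketch, run as root, assuming IRQ 70 is eth0 as shown in the output above; the masks are example values for a 16-CPU box, and irqbalance must be stopped or it may overwrite the manual affinity:

```shell
# Pin the eth0 hardware interrupt to CPUs 2-3 (hex bitmask 0x000c):
echo 000c > /proc/irq/70/smp_affinity

# Enable Receive Packet Steering so the softirq half of packet handling
# is distributed across CPUs 0-15 (mask ffff) instead of one core.
# RHEL/CentOS 6 kernels backport RPS, but verify the file exists first:
echo ffff > /sys/class/net/eth0/queues/rx-0/rps_cpus
```

This will not stop whatever is generating the load, but it can keep a single saturated core from stalling all networking while you investigate.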

One strange thing I've seen in dmesg, which otherwise prints nothing problematic: this line, repeated 50 times since boot (IP replaced with X for privacy):

TCP: Peer X.XX.XXX.XXX:56847/44567 unexpectedly shrunk window 2670303830:2670305282 (repaired)
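The "shrunk window" messages hint at receive-buffer pressure from a peer. The TcpExt counters in /proc/net/netstat record how often the kernel had to prune or collapse receive queues; if they climb sharply during a burst, a flooding or misbehaving peer is a plausible cause. A sketch that pairs the header and value rows of that file:

```shell
# /proc/net/netstat stores TcpExt as two lines: one of counter names,
# one of values. Match them up and print the prune/collapse counters
# (PruneCalled, RcvPruned, OfoPruned, TCPRcvCollapsed, ...).
awk '/^TcpExt:/ {
    if (!have_header) { split($0, names); have_header = 1 }
    else for (i = 2; i <= NF; i++)
        if (names[i] ~ /Prune|Collaps/) print names[i], $i
}' /proc/net/netstat
```

Run it before and during a burst and compare; if the counters barely move, the dmesg lines are probably noise from one badly behaved client rather than the cause of the CPU load.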

htop

https://i.imgur.com/2vlcsN8.png

Any kind of help is appreciated, I'm really desperate to solve this right now.

SensitiveGuy

1 Answer


This ksoftirqd behavior is not a hardware fault with the server; the main issue is the kernel version, so please check it. On `Linux localhost 2.6.32-573.6.3.el6.x86_64` there is no issue. If you upgrade the kernel to the -754 series, some Perl and Asterisk modules crash, which is why CPU utilization goes high. On CentOS 6.10 servers you can use kernel versions below -600; that is best. Thank you.