ESXi 6.0 U1 + Solaris 11.0 VM = host CPU high (slowly climbing) but VM idle


I have ESXi 6.0 Update 1 (Build 3029758) installed on brand new hardware (Atom C2750, 16GB RAM).

The VM is Solaris 11.0 with 1 vCPU and 4GB RAM. Solaris was installed from sol-11-1111-text-x86.iso with no further configuration apart from installing VMware Tools.

After starting the VM, the Host CPU shows idle (12MHz), but it slowly climbs by about 10% per day (240MHz per day) until it reaches around 95%, at which point the VM itself becomes unresponsive; ESXi, however, still reports the VM as running fine.

At every stage the VM itself reports that it is idle. Rebooting the VM returns the Host CPU to idle (12MHz), and the slow climb begins again.

VM uptime:

# uptime
  00:06am  up 4 days  0:12,  1 user,  load average: 0.00, 0.00, 0.00

Output from esxtop:

 1:10:11pm up 4 days 47 min, 504 worlds, 2 VMs, 2 vCPUs; CPU load average: 0.06, 0.06, 0.06
PCPU USED(%): 5.3 0.5 0.1 0.1 0.1  45 0.1 0.2 AVG: 6.5
PCPU UTIL(%): 5.0 0.5 0.2 0.5 0.0  42 0.1 0.4 AVG: 6.2

      ID      GID NAME             NWLD   %USED    %RUN    %SYS   %WAIT %VMWAIT    %RDY   %IDLE  %OVRLP   %CSTP  %MLMTD  %SWPWT
   36739    27266 vmx                 1    0.03    0.02    0.01   99.98       -    0.00    0.00    0.00    0.00    0.00    0.00
   36741    27266 vmast.36740         1    0.14    0.13    0.00   99.87       -    0.01    0.00    0.00    0.00    0.00    0.00
   36742    27266 vmx-vthread-5       1    0.00    0.00    0.00  100.00       -    0.00    0.00    0.00    0.00    0.00    0.00
   36743    27266 vmx-vthread-6:n     1    0.00    0.00    0.00  100.00       -    0.00    0.00    0.00    0.00    0.00    0.00
   36744    27266 vmx-vthread-7:n     1    0.00    0.00    0.00  100.00       -    0.00    0.00    0.00    0.00    0.00    0.00
   36745    27266 vmx-vthread-8:n     1    0.00    0.00    0.00  100.00       -    0.00    0.00    0.00    0.00    0.00    0.00
   36746    27266 vmx-mks:nas         1    0.01    0.01    0.00   99.99       -    0.00    0.00    0.00    0.00    0.00    0.00
   36747    27266 vmx-svga:nas        1    0.00    0.00    0.00  100.00       -    0.00    0.00    0.00    0.00    0.00    0.00
   36748    27266 vmx-vcpu-0:nas      1   45.15   42.35    0.00   57.66    0.02    0.00   57.63    0.03    0.00    0.00    0.00

In case anyone is wondering, I am using Solaris 11.0 because the intention for this VM is to use ZFS with physical RDMs to create a NAS. Unfortunately, Solaris 11.1 (and above) + physical RDM = ESXi purple screen (see Bug report: Solaris 11.1 + RDM = ESXi 5.1 purple screen). This problem was supposed to have been fixed in ESXi 5.5, but it still exists in ESXi 6.0U1. I have tested (ESXi 6.0U1 + Solaris 11.1 + physical RDM) and (ESXi 6.0U1 + Solaris 11.3 + physical RDM); both combinations result in a purple screen.

Interestingly, though, Solaris 11.1 (and above) without physical RDMs does NOT suffer from the slow Host CPU usage climb described above. So I've either got to deal with a weird Host CPU climb or purple screens :-(


Additional info requested by Andrew Henle (2016-01-11)

Output from esxtop is now:

11:12:46am up 5 days 22:49, 518 worlds, 3 VMs, 3 vCPUs; CPU load average: 0.09, 0.08, 0.08
PCPU USED(%): 1.3 5.9 0.6 0.6 0.4  63 0.3 0.2 AVG: 9.1
PCPU UTIL(%): 1.3 5.6 0.5 0.8 0.7  59 0.4 0.1 AVG: 8.6

      ID      GID NAME             NWLD   %USED    %RUN    %SYS   %WAIT %VMWAIT    %RDY   %IDLE  %OVRLP   %CSTP  %MLMTD  %SWPWT
   36739    27266 vmx                 1    0.03    0.02    0.01  100.00       -    0.00    0.00    0.00    0.00    0.00    0.00
   36741    27266 vmast.36740         1    0.14    0.13    0.00  100.00       -    0.01    0.00    0.00    0.00    0.00    0.00
   36742    27266 vmx-vthread-5       1    0.00    0.00    0.00  100.00       -    0.00    0.00    0.00    0.00    0.00    0.00
   36743    27266 vmx-vthread-6:n     1    0.00    0.00    0.00  100.00       -    0.00    0.00    0.00    0.00    0.00    0.00
   36744    27266 vmx-vthread-7:n     1    0.00    0.00    0.00  100.00       -    0.00    0.00    0.00    0.00    0.00    0.00
   36745    27266 vmx-vthread-8:n     1    0.00    0.00    0.00  100.00       -    0.00    0.00    0.00    0.00    0.00    0.00
   36746    27266 vmx-mks:nas         1    0.00    0.00    0.00  100.00       -    0.00    0.00    0.00    0.00    0.00    0.00
   36747    27266 vmx-svga:nas        1    0.00    0.00    0.00  100.00       -    0.00    0.00    0.00    0.00    0.00    0.00
   36748    27266 vmx-vcpu-0:nas      1   64.28   59.40    0.00   40.69    0.05    0.04   40.64    0.03    0.00    0.00    0.00

VM uptime is now:

# uptime
 22:14pm  up 5 days 22:21,  1 user,  load average: 0.00, 0.00, 0.00

VM prstat output:

# prstat -a -n 20
   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
   702 root       19M 9904K sleep   59    0   0:15:33 0.3% vmtoolsd/2
  3272 root       11M 3508K cpu0    59    0   0:00:00 0.1% prstat/1
     5 root        0K    0K sleep   99  -20   0:00:05 0.1% zpool-rpool/136
  3237 root       18M 5172K sleep   59    0   0:00:00 0.1% sshd/1
   427 root     7668K 5768K sleep   59    0   0:00:17 0.0% hald/4
   369 root       12M 3552K sleep   59    0   0:00:14 0.0% nscd/35
   602 root       35M   16M sleep   59    0   0:00:26 0.0% fmd/28
  3238 root       11M 3344K sleep   59    0   0:00:00 0.0% bash/1
   292 root     3220K 1376K sleep   59    0   0:00:00 0.0% vbiosd/3
   247 root     2648K 1316K sleep   60  -20   0:00:00 0.0% zonestatd/5
   267 root       11M 1272K sleep   59    0   0:00:00 0.0% net-physical/1
    81 daemon     14M 4612K sleep   59    0   0:00:00 0.0% kcfd/3
   975 root     3680K 2240K sleep   59    0   0:00:00 0.0% nmz/1
   108 root       13M 2552K sleep   59    0   0:00:00 0.0% syseventd/18
    48 root     3880K 2256K sleep   59    0   0:00:04 0.0% dlmgmtd/7
   162 netadm   4100K 2476K sleep   59    0   0:00:00 0.0% ipmgmtd/5
    39 netcfg   3784K 2424K sleep   59    0   0:00:04 0.0% netcfgd/4
   335 root       11M 2848K sleep   59    0   0:00:00 0.0% picld/4
    13 root       19M   18M sleep   59    0   0:00:29 0.0% svc.configd/22
    11 root       20M   11M sleep   59    0   0:00:07 0.0% svc.startd/12
 NPROC USERNAME  SWAP   RSS MEMORY      TIME  CPU
    53 root      444M  173M   4.2%   0:18:11 0.6%
     1 smmsp    6288K 1584K   0.0%   0:00:00 0.0%
     2 daemon     17M 5664K   0.1%   0:00:00 0.0%
     1 netadm   4100K 2476K   0.1%   0:00:00 0.0%
     1 netcfg   3784K 2424K   0.1%   0:00:04 0.0%
Total: 60 processes, 637 lwps, load averages: 0.01, 0.00, 0.00

VM memory usage:

# echo ::memstat | mdb -k
Page Summary                Pages                MB  %Tot
------------     ----------------  ----------------  ----
Kernel                     101923               398   10%
ZFS File Data               60118               234    6%
Anon                        30785               120    3%
Exec and libs                2168                 8    0%
Page cache                   5295                20    1%
Free (cachelist)             8599                33    1%
Free (freelist)            837526              3271   80%

Total                     1046414              4087
Physical                  1046413              4087

Additional info requested by Andrew Henle (2016-01-13)

vmtoolsd is the VMware Tools daemon for Solaris. That process is not the cause. While investigating, I stopped that process on the VM and found it made no difference.

Yes, the data was all taken at the same time. And yes, it does show very low VM CPU usage but very high Host CPU usage.

Your suggestion of using mpstat is (I think) on the right track. It revealed a massively high number of interrupts on the VM.

# mpstat 2
CPU minf mjf xcal  intr  ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    0   0    0  60208  107  147    1    0    1    0   100    0  14   0  86
  0    0   0    0 130658  103  181    4    0    1    0   413    1  28   0  72
  0    0   0    0 130157  103  134    1    0    0    0    39    0  26   0  74
  0    0   0    0 129690  104  134    1    0    1    0    46    0  27   0  73
  0    0   0    0 129722  102  142    1    0    1    0    42    0  27   0  73

I then used intrstat on the VM to see which interrupts were being raised; however, it didn't reveal anything useful.

# intrstat

      device |      cpu0 %tim
-------------+---------------
       ata#1 |         0  0.0
    e1000g#0 |         0  0.0

      device |      cpu0 %tim
-------------+---------------
       ata#1 |         0  0.0
    e1000g#0 |         9  0.0

      device |      cpu0 %tim
-------------+---------------
       ata#1 |         0  0.0
    e1000g#0 |         2  0.0

I then used an old dtrace script I've had for many years to look a little more closely at interrupts on the VM. It appears to suggest that the interrupts are being generated by cbe_fire; however, I don't really know what that is and couldn't find any useful information on it.

# ./intrstat.d
dtrace: script './intrstat.d' matched 4 probes
dtrace: 24610 dynamic variable drops with non-empty dirty list
CPU     ID                    FUNCTION:NAME
  0  62167                       :tick-1sec
driver     instance         intr-count
---------------------------------------
ata        0x1                     0
mpt        0x0                     0
mpt        0x2                     0
e1000g     0x0                     3
<unknown>  unix`cbe_fire       71405

dtrace: 24005 dynamic variable drops with non-empty dirty list
  0  62167                       :tick-1sec
driver     instance         intr-count
---------------------------------------
ata        0x1                     0
mpt        0x0                     0
mpt        0x2                     0
e1000g     0x0                     3
<unknown>  unix`cbe_fire       70285

dtrace: 23759 dynamic variable drops with non-empty dirty list
  0  62167                       :tick-1sec
driver     instance         intr-count
---------------------------------------
mpt        0x0                     0
mpt        0x2                     0
ata        0x1                     1
e1000g     0x0                     3
<unknown>  unix`cbe_fire       71097
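
To double-check that rate, one could count calls to cbe_fire directly with fbt. This is just a minimal sketch (not something I've run yet), assuming cbe_fire is visible to fbt in the unix module, which the stack output above suggests:

#!/usr/sbin/dtrace -s
/* Hypothetical one-off: count how many times cbe_fire runs per second. */
#pragma D option quiet

fbt:unix:cbe_fire:entry
{
    @fires = count();
}

profile:::tick-1sec
{
    printa("cbe_fire calls/sec: %@d\n", @fires);
    clear(@fires);
}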

Full interrupt table below.

# echo '::interrupts -d' | mdb -k
IRQ  Vect IPL Bus    Trg Type   CPU Share APIC/INT# Driver Name(s)
1    0x41 5   ISA    Edg Fixed  0   1     0x0/0x1   i8042#0
9    0x80 9   PCI    Lvl Fixed  0   1     0x0/0x9   acpi_wrapper_isr
12   0x42 5   ISA    Edg Fixed  0   1     0x0/0xc   i8042#0
15   0x44 5   ISA    Edg Fixed  0   1     0x0/0xf   ata#1
16   0x43 5   PCI    Lvl Fixed  0   1     0x0/0x10  mpt#1
17   0x40 5   PCI    Lvl Fixed  0   2     0x0/0x11  mpt#2, mpt#0
18   0x60 6   PCI    Lvl Fixed  0   1     0x0/0x12  e1000g#0
24   0x81 7   PCI    Edg MSI    0   1     -         pcieb#0
25   0x30 4   PCI    Edg MSI    0   1     -         pcieb#1
26   0x82 7   PCI    Edg MSI    0   1     -         pcieb#2
27   0x31 4   PCI    Edg MSI    0   1     -         pcieb#3
28   0x83 7   PCI    Edg MSI    0   1     -         pcieb#4
29   0x32 4   PCI    Edg MSI    0   1     -         pcieb#5
30   0x84 7   PCI    Edg MSI    0   1     -         pcieb#6
31   0x33 4   PCI    Edg MSI    0   1     -         pcieb#7
32   0x85 7   PCI    Edg MSI    0   1     -         pcieb#8
33   0x34 4   PCI    Edg MSI    0   1     -         pcieb#9
34   0x86 7   PCI    Edg MSI    0   1     -         pcieb#10
35   0x35 4   PCI    Edg MSI    0   1     -         pcieb#11
36   0x87 7   PCI    Edg MSI    0   1     -         pcieb#12
37   0x36 4   PCI    Edg MSI    0   1     -         pcieb#13
38   0x88 7   PCI    Edg MSI    0   1     -         pcieb#14
39   0x37 4   PCI    Edg MSI    0   1     -         pcieb#15
40   0x89 7   PCI    Edg MSI    0   1     -         pcieb#16
41   0x38 4   PCI    Edg MSI    0   1     -         pcieb#17
42   0x8a 7   PCI    Edg MSI    0   1     -         pcieb#18
43   0x39 4   PCI    Edg MSI    0   1     -         pcieb#19
44   0x8b 7   PCI    Edg MSI    0   1     -         pcieb#20
45   0x3a 4   PCI    Edg MSI    0   1     -         pcieb#21
46   0x8c 7   PCI    Edg MSI    0   1     -         pcieb#22
47   0x3b 4   PCI    Edg MSI    0   1     -         pcieb#23
48   0x8d 7   PCI    Edg MSI    0   1     -         pcieb#24
49   0x3c 4   PCI    Edg MSI    0   1     -         pcieb#25
50   0x8e 7   PCI    Edg MSI    0   1     -         pcieb#26
51   0x3d 4   PCI    Edg MSI    0   1     -         pcieb#27
52   0x8f 7   PCI    Edg MSI    0   1     -         pcieb#28
53   0x3e 4   PCI    Edg MSI    0   1     -         pcieb#29
208  0xd0 14         Edg IPI    all 1     -         kcpc_hw_overflow_intr
209  0xd1 14         Edg IPI    all 1     -         cbe_fire
240  0xe0 15         Edg IPI    all 1     -         apic_error_intr

Again, I'm stuck :(


Additional info requested by Andrew Henle (2016-01-15)

End of output from your dtrace script:

          unix`do_splx+0x97
          genunix`disp_lock_exit+0x55
          genunix`post_syscall+0x61a
          unix`0xfffffffffb800cb7
           18

          unix`todpc_rtcget+0xe5
          unix`todpc_get+0x1a
          unix`tod_get+0x1a
          genunix`clock+0xa53
          genunix`cyclic_softint+0xd6
          unix`cbe_softclock+0x1a
          unix`av_dispatch_softvect+0x62
          unix`dispatch_softint+0x33
          unix`switch_sp_and_call+0x13
           38

          unix`sys_syscall+0xff
           39

          genunix`fsflush_do_pages+0x142
          genunix`fsflush+0x3d4
          unix`thread_start+0x8
           58

          unix`dispatch_softint+0x27
          unix`switch_sp_and_call+0x13
        26573

          unix`mach_cpu_idle+0x6
          unix`cpu_idle+0x150
          unix`idle+0x58
          unix`thread_start+0x8
       135471

The last stack trace is the idle loop. The second-to-last stack trace, with a count of 26573, appears to be the culprit, but there's not much information to go on there.

protogen


Post the output from prstat -a on the VM. That will show the process that's taking up all your CPU time. Also, run echo ::memstat | mdb -k on the VM as root and post that output. That will show memory usage. – Andrew Henle – 2016-01-11T11:05:20.590

@Andrew - requested info added to original post above. Appreciate your input and assistance. – protogen – 2016-01-12T11:17:02.990

What is vmtoolsd? Does it continue to accumulate CPU time and possibly a higher and higher CPU percentage over time? And is this data all taken at the same time such that from inside the VM you see very low CPU percentages in use, but outside the VM you see CPU utilization at 64%? If so, post the output from mpstat 2 2, run from inside the VM. That will cause mpstat to output data twice with a two-second delay. The first set of output will be averages since boot. The second set of output will be for the two-second period between outputting the sets. – Andrew Henle – 2016-01-13T12:35:21.480

@Andrew - requested info again added to original post above (it's getting quite long!). Thanks again. – protogen – 2016-01-14T15:13:28.353

This looks real similar: https://www.illumos.org/issues/1333

– Andrew Henle – 2016-01-16T01:12:55.540

@Andrew - I've actually spent the morning reading through that very same issue :-) I agree, there are many similarities. I'm going to try the recommendation from Aubrey Li of adding set apix:apic_timer_preferred_mode = 0x0 to /etc/system however in my case I think it should be set pcplusmp:apic_timer_preferred_mode = 0x0 because apix is not active on my Solaris VM. Do you agree with my thinking on that? Edit: Actually, I've just realised my modified version is shown in the original docs.oracle.com link you gave me (below), so I think I'm correct. – protogen – 2016-01-16T04:05:03.673

@Andrew - Also, I used Aubrey's deadline_timer_reprogram:entry dtrace script and can see exactly the same issue, namely, that the interrupt is occurring before the target is due. – protogen – 2016-01-16T04:10:21.283

@Andrew - It's looking like you've solved my problem. My Solaris VM has been up for 11 hours now (with the /etc/system entry set pcplusmp:apic_timer_preferred_mode = 0x0) and everything appears normal. VM mpstat 2 reports interrupts < 350 and ESXi Host CPU is < 20MHz. I'm going to watch it for another 24 hours before confirming. – protogen – 2016-01-17T12:34:47.493
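
For reference, the entry as applied on this VM is a single line in /etc/system, which takes effect after a reboot (the apix form from the illumos issue would be used instead on a system where apix is active):

* Workaround discussed in the comments above (see illumos issue 1333)
set pcplusmp:apic_timer_preferred_mode = 0x0

* Equivalent form for systems where the apix module is active instead of pcplusmp:
* set apix:apic_timer_preferred_mode = 0x0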

@Andrew - Confirmed. The problem is definitely solved. Thank you so much. What's the best way to credit you with the solution? Should I just accept your answer below, or would you like to add a new answer referencing the 2 links you provided and the /etc/system entry? – protogen – 2016-01-18T12:51:56.470

Answers


(posting this as an answer so the code can be formatted)

You can run this dtrace script as root to see where the kernel is spending its time:

#!/usr/sbin/dtrace -s
/* Sample roughly 1001 times per second; arg0 is the kernel PC, so the */
/* predicate restricts sampling to time spent in the kernel.           */
profile:::profile-1001hz
/arg0/
{
    @[ stack() ] = count();
}
  • Put that into a file, call it something descriptive like hotspot.d
  • chmod it to be executable: chmod 755 hotspot.d
  • Run it: ./hotspot.d
  • A few seconds after it prints a message about matching N probes, stop it by hitting CTRL-C.

You will see a lot of data - every kernel stack trace it sampled while it was running, ordered by increasing sample count for that stack trace. Ignore the ones that have idle in the stack trace somewhere - that's the idle loop - and it's quite likely that the last few non-idle stack traces will be where your system is spending all its time.
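
If the output is too long to read comfortably in a terminal, one way to handle it (hypothetical path, any file will do) is to redirect it to a file, let the script sample for 20-30 seconds before hitting CTRL-C, and then look at the tail, since the hottest stacks print last:

# ./hotspot.d > /tmp/hotspot.out
^C
# tail -50 /tmp/hotspot.out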

Andrew Henle


Using dtrace -n I'm getting the error dtrace: invalid probe specifier. Should it be dtrace -s? Edit: Yes, I've confirmed it should be dtrace -s; the output looks like what you've described. – protogen – 2016-01-15T07:28:12.420

If you correct your post to dtrace -s, note also that the bang is missing from #!/usr/sbin/dtrace -s – protogen – 2016-01-15T07:39:11.913