
While investigating some call quality issues (dead spots of 0.5 to 1 second in calls), I took a packet capture of a phone call between two extensions on the same PBX. Since I was capturing on the PBX itself, I was rather surprised to see Wireshark reporting a huge spike in jitter that lined up with a dead spot in the call:

screenshot of jitter graph

My understanding was that jitter is caused by packet loss and/or latency in transit, and that the RTP stream leaving the PBX should be relatively pristine. But this spike showed up in all four RTP streams (office 1 to PBX, office 2 to PBX, PBX to office 1, PBX to office 2) so it seems like the packets are already in poor shape by the time they leave the server.
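For reference, the jitter figure that RTP tools report is typically the running interarrival jitter estimate from RFC 3550, which reacts to any change in packet spacing, including a stall on the sending side. Below is a minimal sketch of that estimator, assuming G.711-style audio (8000 Hz RTP clock, one packet every 20 ms); the function name and the sample capture are made up for illustration and are not taken from the trace above.

```python
def interarrival_jitter(arrivals, rtp_timestamps, clock_rate=8000):
    """Return the running RFC 3550 jitter estimate (in ms) after each packet.

    arrivals        capture times in seconds, one per packet
    rtp_timestamps  RTP timestamp field of each packet, in clock ticks
    clock_rate      RTP clock rate in Hz (8000 is an assumption for G.711)
    """
    jitter = 0.0
    history = []
    for i in range(1, len(arrivals)):
        # How much the spacing seen on the wire differs from the spacing
        # the sender stamped into the packets, in seconds.
        wire_gap = arrivals[i] - arrivals[i - 1]
        media_gap = (rtp_timestamps[i] - rtp_timestamps[i - 1]) / clock_rate
        d = wire_gap - media_gap
        # Exponential smoothing with gain 1/16, as in RFC 3550 section 6.4.1.
        jitter += (abs(d) - jitter) / 16
        history.append(jitter * 1000)
    return history


# Hypothetical capture: steady 20 ms pacing, then one packet held back 500 ms.
arrivals = [0.00, 0.02, 0.04, 0.56, 0.58]
rtp_timestamps = [0, 160, 320, 480, 640]
print(interarrival_jitter(arrivals, rtp_timestamps))
```

In this made-up capture a single 500 ms pause lifts the estimate from 0 to roughly 31 ms in one step, and it then decays slowly, so a spike like this does not by itself prove the delay happened on the network.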

The PBX is Asterisk 13 on Scientific Linux (RHEL) 6.9, running as a VMware ESXi 5.5 guest with newly updated tools and VMXNET3 adapters. The CPU sits pretty steadily around 5-15% usage, and network traffic is minimal. Where can I look to troubleshoot this issue? Are there any common causes for this sort of problem? Since the problems are already present on the server, I assume I can rule out the external network?

miken32
  • Please provide the data from lower layers. How many retransmits could be seen on IP level? – Nils Nov 27 '18 at 12:24
  • @Nils what kind of data are you looking for? It's a UDP stream so no retransmits; no lost or out-of-sequence packets either. – miken32 Nov 27 '18 at 22:55
  • If nothing like that can be seen, the problem is further away, probably well beyond your first router/switch device. Can you analyze the traffic on the physical ESXi interface as well (e.g. by port mirroring through the network team)? – Nils Nov 29 '18 at 21:51
  • I'm seeing this in traces from the PBX, of outgoing traffic, so jitter seems to be already in place before it even hits the network. The graph above was from tcpdump running on the PBX, of audio going from PBX to a client. – miken32 Nov 29 '18 at 21:53

2 Answers


Finally figured this out! TLDR: disable power management on the host.

Despite the low CPU usage, we still figured this had something to do with CPU load. So we experimented with loading down the CPU, expecting the dead spots in the calls to get worse. Instead, they went away completely. So, after looking at the CPU usage stats in vCenter many, many times, I finally looked into the other line on that graph.

CPU usage graph showing high ready time

This is probably not news to many, but I found out that CPU ready time is the amount of time that a VM is ready to run but is waiting for the host to schedule it on a physical CPU. Most sources I found say that anything less than 5% isn't a problem, but it certainly seemed to be having an impact on our voice streams. We were seeing the cutouts every minute, and the graph also showed a spike in ready time every minute.
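To see why a ready time under 5% can still hurt audio, it helps to convert between the millisecond summation that the vCenter chart plots and the percentage the rule of thumb uses. The sketch below assumes the real-time chart's 20-second sampling interval; the function name, the per-vCPU normalization, and the 1000 ms sample value are illustrative assumptions, not readings from the graphs here.

```python
def ready_percent(ready_ms, interval_s=20, vcpus=1):
    """Convert a CPU ready summation value (ms) into a percentage.

    ready_ms    ready time accumulated during one chart sample, in ms
    interval_s  chart sampling interval in seconds (20 s assumed, as used
                by the vCenter real-time chart)
    vcpus       number of vCPUs the value covers (normalization assumed)
    """
    return ready_ms / (interval_s * 1000 * vcpus) * 100


# Hypothetical sample: 1000 ms of ready time within one 20 s interval.
print(ready_percent(1000))
```

A sample of 1000 ms works out to 5%, which sits right at the usual threshold. If that whole second lands in a single burst rather than being spread across the interval, it is the same length as the dead spots in the calls, so the averaged percentage can look harmless while the stream still stalls.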

So I got to wondering why this would go away under high CPU load, and figured it must be some kind of power management: when the host sees the increased usage, it makes CPU resources consistently available to the VM. So I disabled power management in the BIOS of the host, et voilà:

CPU usage graph showing low ready time

The slight increase in ready time near the end of the graph corresponds to a half-dozen other VMs migrating back to this host.

Call traces now show negligible amounts of jitter, and the cutouts have disappeared from calls. Further research showed this is a somewhat common issue with workloads that are both latency-sensitive and CPU non-intensive. The power management sees the very low CPU usage and assumes it can throttle the processor, even though it should not!

miken32
  • Interesting one. Can you test whether it is enough to just disable the (deep sleep) C-states in the BIOS? Apart from that, yes, "power saving" has weird effects in VMs, and single VMs do not trigger an up-throttle. It is best practice to disable power management for production systems from within VMware. – Nils Dec 16 '18 at 21:05
  • After realizing power management might be the problem, I did try disabling it on the host from within vSphere web client (set power management to "high performance" instead of "balanced") and it didn't stop the spikes in the ready time. – miken32 Dec 18 '18 at 21:36

I had a similar issue, but much worse, with many spikes in the Wireshark RTP graph, hisses and choppy audio.

In the course of many experiments, I dumped the CDR database, which had grown to 1.5 GB. I had noticed the size, but was putting off the pruning until I had fixed the audio problems. B-)

This apparently improved audio quality immediately, including the transcoding of the IVR messages to G729.

The delays were also visible from a SmokePing trace to the VPS.

simonpa71