On about 5% of our customer's calls, we see large jitter spikes and high delta values that cause a noticeable audible impact on call quality (stuttering, breakups, robotic audio). We know this from call quality statistics we pull from our Homer server as well as PCAPs taken on both the LAN and WAN sides of the network. See https://imgur.com/a/IoVe8Zr for more detailed RTP stats. The issue is incredibly sporadic, but the reports we've received tell us it happens on multiple calls at the same time.
Screenshots:
Very high jitter numbers (likely not real) that are being introduced somewhere
PCAP from mirror port on customer switch (Mirroring switchport to Polycom VVX handset)
RTP Stats from VMWare Router
Another RTPStats example from our VMWare Router
Background:
PBX: Asterisk 11 running on CentOS 6.5 in VMWare (ESXi 6.5, virtual hardware v13, managed through vCloud Director as a dedicated host), hosted in our data center. 8 cores, 32 GB RAM. Load is very low (load average around 0.07), but we have a fair amount of call volume (~2000 calls per day). It is one of many similar systems in this infrastructure (many of which also run VoIP/Asterisk); the rest are running flawlessly, some with much higher volume.
Network: Traffic is delivered to the customer's Cisco ASA via a direct 1G DIA (AT&T) Ethernet circuit to our site. All of the internal routes that the traffic traverses are over 1G links, and traffic is properly prioritized.
Endpoints: Polycom VVXs as well as some Bria Softphones
Our initial thought was that this was being introduced on the LAN, but PingPlotter/MTR and various other tests back to our infrastructure came back completely clean. We then mirrored a port on our router at the ingress to VMWare. The jitter was not there when traffic entered VMWare, but it was present on all legs coming back out of our VMWare infrastructure. This has us thinking that either VMWare or our Asterisk configuration is the culprit, and the fact that we have over 50 other customers hosted in the same infrastructure without this problem has me pointing the finger at our Asterisk system. Could it be some type of CPU wait issue that keeps packets from being put on the network in a timely fashion?
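To test the CPU wait theory from inside the guest, something like the following could be left running during a jitter event. It is only a minimal sketch: it samples the aggregate `cpu` line of `/proc/stat` twice and reports what share of the interval went to each state. The function names are my own, and I am assuming the standard Linux field order (user, nice, system, idle, iowait, irq, softirq, steal). The steal counter may not be populated by every hypervisor, so a zero there would not rule out host-side contention; `%RDY` in esxtop on the host is the more direct measure.

```python
import time

# Field order of the aggregate "cpu" line in /proc/stat (assumed: Linux 2.6.24+).
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line):
    """Return a dict of jiffy counters parsed from the aggregate 'cpu' line."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(v) for v in parts[1:1 + len(FIELDS)]]
    # Older kernels expose fewer columns; treat the missing ones as zero.
    values += [0] * (len(FIELDS) - len(values))
    return dict(zip(FIELDS, values))


def cpu_percentages(before, after):
    """Percentage of elapsed CPU time spent in each state between two samples."""
    deltas = {k: after[k] - before[k] for k in FIELDS}
    total = sum(deltas.values())
    if total <= 0:
        return {k: 0.0 for k in FIELDS}
    return {k: 100.0 * d / total for k, d in deltas.items()}


def read_cpu_line(path="/proc/stat"):
    """Read the first line of /proc/stat, which is the aggregate 'cpu' line."""
    with open(path) as f:
        return f.readline()


def sample(interval=1.0, path="/proc/stat"):
    """Take two samples `interval` seconds apart and return the percentages."""
    before = parse_cpu_line(read_cpu_line(path))
    time.sleep(interval)
    after = parse_cpu_line(read_cpu_line(path))
    return cpu_percentages(before, after)
```

Calling `sample(1.0)` in a loop and logging the `iowait` and `steal` values with a timestamp would let us line them up against the jitter spikes in Homer. A one-second interval may still be too coarse to catch a stall of a few tens of milliseconds, so this could at best confirm the theory, not disprove it.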
Also, we've noticed that these jitter spikes generally happen when a ringall group with a high number of agents is dialed (about 25 agents rung all at once). Our call center manager refuses to budge from this configuration. We have other groups with similar setups, but none quite that large. I'm also seeing what I believe are skewed jitter numbers on some calls (jitter in the millions of milliseconds; examples are included in the screenshots above). I'm not sure where that is being introduced or whether it is relevant to our issue.
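For context on the jitter figures, here is a minimal sketch of the interarrival jitter estimator from RFC 3550 (J = J + (|D| - J)/16). It assumes an 8000 Hz RTP clock (G.711) and takes hypothetical (arrival time in seconds, RTP timestamp) pairs. It deliberately ignores 32-bit timestamp wrap-around. It shows that a genuinely late packet moves the figure by only a few milliseconds, while a single discontinuity in the RTP timestamps yields a number in the millions. Whether that is what is actually happening in our captures is an assumption on my part, not something I have confirmed.

```python
def rfc3550_jitter(packets, clock_rate=8000):
    """Return the running RFC 3550 jitter estimate, in ms, after each packet.

    packets: iterable of (arrival_seconds, rtp_timestamp) tuples.
    Simplification: RTP timestamp wrap-around is not handled.
    """
    jitter = 0.0          # running estimate, in RTP timestamp units
    prev_transit = None   # relative transit time of the previous packet
    history = []
    for arrival, rtp_ts in packets:
        # Relative transit time: arrival expressed in clock units minus RTP timestamp.
        transit = arrival * clock_rate - rtp_ts
        if prev_transit is not None:
            d = abs(transit - prev_transit)
            jitter += (d - jitter) / 16.0
        prev_transit = transit
        history.append(jitter * 1000.0 / clock_rate)
    return history


# A steady 20 ms stream: 160 timestamp units per packet at 8000 Hz.
steady = [(i * 0.02, 1000 + i * 160) for i in range(50)]

# Case 1: the next packet arrives 100 ms late.
late = steady + [(50 * 0.02 + 0.1, 1000 + 50 * 160)]

# Case 2: the next packet is on time, but its RTP timestamp jumps by 2**31.
jumped = steady + [(50 * 0.02, 1000 + 50 * 160 + 2**31)]

print("steady:", rfc3550_jitter(steady)[-1])
print("late:  ", rfc3550_jitter(late)[-1])
print("jumped:", rfc3550_jitter(jumped)[-1])
```

In this sketch the late packet raises the estimate to roughly 6 ms, while the timestamp jump produces a value above 16 million ms. If the multi-million figures in the screenshots coincide with a timestamp or SSRC change (for example when the ringall call is answered and the audio source changes), they are probably a reporting artifact and separate from the audible problem.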
Things we've tried:
Full implementation of QoS through the entire network layer
Setting Asterisk to run as high priority
Modifying UDP and Asterisk Jitterbuffers (which has seemed to have some marginal benefit)
Installation of VMWare Tools as well as setting the VM's Latency Sensitivity to "High"
Modified system power settings to performance (I thought this was it for sure, as it is very similar to the problem described in "Causes of RTP jitter at the server"; however, no luck.)
Replaced a number of switches in the environment
Disabled SIP ALG
Implementation of the G.729 codec (vs. our standard G.711)
vMotion'd to a new host
We'd also like to segment voice and data within their network into separate VLANs, but have not gotten appropriate buy-in from the network vendor for that yet. At this point we are at a bit of a dead end.
If you were in my shoes, what would be your next steps? Are there any additional angles of this problem that I should be looking into? Or an obvious test that I've missed?
Any help is much appreciated!