
On about 5% of our customer's calls, we see large jitter spikes and high delta values that have a noticeable audible impact on call quality (stuttering, breakups, robotic audio). We know this from call quality statistics we are pulling via our Homer server as well as PCAPs taken on both the LAN and WAN sides of the network. See https://imgur.com/a/IoVe8Zr for more detailed RTP stats. The issue is incredibly sporadic, but the reports we've received tell us this is happening on multiple calls at the same time.

Screenshots:

Very high jitter numbers (likely not real) that are being introduced somewhere


PCAP from mirror port on customer switch (Mirroring switchport to Polycom VVX handset)


RTP Stats from VMWare Router

https://i.imgur.com/zz27mDY.png

Another RTPStats example from our VMWare Router


Background:

PBX: Asterisk 11 system running on CentOS 6.5 in VMWare (ESXi 6.5, virtual hardware v13, managed through vCloud Director as a dedicated host), hosted in our data center. 8 cores, 32G RAM. Very low load (load average 0.07), but we have a fair amount of call volume (~2000 calls per day). It is one of many similar systems in this infrastructure (many of which also run VoIP/Asterisk). The rest are running flawlessly, some with much higher volume.

Network: Traffic is delivered to the customer's Cisco ASA via a direct 1G DIA (AT&T) Ethernet circuit to our site. All of our internal routes that the traffic traverses are over 1G links and traffic is properly prioritized.

Endpoints: Polycom VVXs as well as some Bria Softphones

Our initial thought was that this was being introduced on the LAN, but PingPlotter/MTR and various other tests back to our infrastructure came back completely in the clear. What we ended up doing is mirroring a port on our router ingress to VMWare. We found that the jitter was not there when traffic entered VMWare, but the jitter was present on all legs back out of our VMWare infrastructure. This has us currently thinking that either VMWare or our Asterisk configuration is the culprit, but the fact that we have over 50 other customers hosted in the same infrastructure has me pointing the finger at our Asterisk system. Maybe some type of CPU wait issue that is causing packets to not be put onto the network in a timely fashion?
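One rough way to test the "packets are not sent on time" theory from inside the guest is to measure how late a process wakes up when it asks for a 20 ms timer, which is the usual RTP packetization interval. The sketch below is only an illustration, not something we have run on this system: it assumes Python 3 is available on the PBX, and the function name and defaults are made up for the example.

```python
import time


def measure_wakeup_overrun(interval_s=0.02, samples=500):
    """Sleep on a fixed schedule and record how late each wake-up is (ms).

    interval_s=0.02 mimics a 20 ms RTP packetization interval (assumption).
    Large overruns suggest the guest is not being scheduled on time.
    """
    overruns_ms = []
    deadline = time.monotonic() + interval_s
    for _ in range(samples):
        delay = deadline - time.monotonic()
        if delay > 0:
            time.sleep(delay)
        # How far past the intended deadline did we actually wake up?
        overruns_ms.append((time.monotonic() - deadline) * 1000.0)
        deadline += interval_s
    return overruns_ms


if __name__ == "__main__":
    results = measure_wakeup_overrun()
    print("max overrun: %.3f ms" % max(results))
    print("mean overrun: %.3f ms" % (sum(results) / len(results)))
```

Running this while the large ring group is being dialed, and again while the system is idle, would show whether wake-ups are delayed by several milliseconds at the same moments the jitter spikes appear. It only measures user-space timer latency, so a clean result does not rule out delays in the virtual NIC or vSwitch.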

Also, we've noticed that these jitter spikes generally happen when a ringall group is dialed that has a high number of agents (about 25 agents rung all at once). Our call center manager refuses to budge from this configuration. We have other groups with similar setups, but not quite that large. I'm also seeing what I believe are skewed jitter numbers on some calls (jitter in the millions of milliseconds; examples are included in the screenshots above). I'm not sure where that is being introduced or whether it is relevant to our issue.
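For context on the implausible jitter values: RTP interarrival jitter is normally computed with the running estimator from RFC 3550, J = J + (|D| - J)/16, where D is the difference between the arrival spacing and the RTP timestamp spacing of consecutive packets. A minimal sketch of that estimator follows, assuming an 8000 Hz clock (G.711) and ignoring 32-bit timestamp wraparound; the function name and the sample packets are made up for illustration. It shows one possible way a jump in the RTP timestamp, rather than real network delay, could produce jitter in the millions of milliseconds.

```python
def interarrival_jitter_ms(packets, clock_rate=8000):
    """RFC 3550 style running jitter estimate, returned in milliseconds.

    packets: iterable of (arrival_time_seconds, rtp_timestamp) tuples.
    clock_rate: RTP clock in Hz (8000 assumed here, as for G.711).
    Simplification: no handling of 32-bit RTP timestamp wraparound.
    """
    jitter_units = 0.0  # jitter in RTP timestamp units
    previous = None
    history = []
    for arrival, rtp_ts in packets:
        if previous is not None:
            prev_arrival, prev_ts = previous
            # D = (arrival spacing) - (timestamp spacing), in timestamp units
            d = (arrival - prev_arrival) * clock_rate - (rtp_ts - prev_ts)
            jitter_units += (abs(d) - jitter_units) / 16.0
        previous = (arrival, rtp_ts)
        history.append(jitter_units * 1000.0 / clock_rate)
    return history


if __name__ == "__main__":
    # Third packet arrives 10 ms late: a small, real jitter value.
    late = [(0.00, 0), (0.02, 160), (0.05, 320)]
    # Third packet arrives on time but its RTP timestamp jumps by 2**31.
    jump = [(0.00, 0), (0.02, 160), (0.04, 160 + 2**31)]
    print(interarrival_jitter_ms(late)[-1])
    print(interarrival_jitter_ms(jump)[-1])
```

With these inputs, the late packet yields 0.625 ms of jitter, while the timestamp jump yields roughly 16.8 million ms even though the packet arrived on schedule. If the skewed numbers in the screenshots line up with timestamp or stream changes in the PCAPs, they may be an artifact of the calculation rather than evidence of the audio problem.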

Things we've tried:

  • Full implementation of QoS through the entire network layer

  • Setting Asterisk to run as high priority

  • Modifying UDP buffers and Asterisk jitterbuffers (which seems to have had some marginal benefit)

  • Installation of VMWare Tools as well as setting the VM's latency sensitivity to "High"

  • Modified system power settings to performance (I thought this was it for sure, as it is very similar to the problem described here: Causes of RTP jitter at the server. However, no luck.)

  • Replaced a number of switches in the environment

  • Disabled SIP ALG

  • Implementation of G729 codec (vs our standard G711)

  • vMotioned to a new host

We'd also like to segment voice and data within their network into separate VLANs, but have not gotten appropriate buy-in from the network vendor for that yet. At this point we are at a bit of a dead end.

If you were in my shoes, what would be your next steps? Are there any additional angles of this problem that I should be looking into? Or an obvious test that I've missed?

Any help is much appreciated!

Stuggi
AVoIPm8
  • " in VMWare" - VMware is a company; can you tell us which of their many products and which version you're using, please? – Chopper3 May 22 '20 at 13:55
  • Yep, good point. I've edited the details above. We're using vCloud Director for management. The hypervisor is ESXi 6.5 (virtual hardware version 13). All virtual hosts have dedicated resources. The NIC adapter is VMXNET3. – AVoIPm8 May 22 '20 at 14:11
  • What's your vswitch setup - vSS/vDS/NSX? Why such old SW? – Chopper3 May 22 '20 at 16:23
  • What virtual network adaptor are you using for the VMs? – Stuggi May 25 '20 at 10:48

1 Answer

Sounds to me like you have put in the time. I have been through this myself, and the best I can say is this:

I have played with hypervisors from Microsoft, VMware, and KVM, and still have many running. In the end, running on bare metal seems to make all these problems go away.

Note: I would also try the GSM codec.

I run small and big offices. The virtual machines I run phone systems on now are for small offices of 2-3 people. For anything bigger, especially in call volume, I have just stopped running it on a hypervisor virtual machine. I have had good experience running Asterisk as a container on OpenVZ; it seems to share resources a bit better.

It's a hard option to take, but after fighting with good hardware and eventually running on some old hardware as a test (at more than one company where we had jumped into the VM world), running it on its own real hardware seems to fix the issues. So I would agree with anyone who says to contact your hypervisor vendor (VMware, it looks like), but in the end: bare metal!