Internal Site to Site VPN over MPLS speed issues?

Question

How do I troubleshoot slow performance of a site to site VPN tunnel over an MPLS circuit? What are the relevant reports/stats I should be looking at?

Background: I support one end of a Site to Site VPN that is used to connect two ends of a Process Control Network (PCN). The PCN is separated from the Business/Corporate network by Juniper SRX/SSG firewalls that also provide the VPN endpoints.

Originally the business network between the sites was connected with an AT&T GigaMAN connection, which as I understand it is a brand name for Metro Ethernet Service. One site was a sub-site of the main site (mine), and any traffic from the sub-site bound for a company site other than mine passed through the main site before routing to the other sites in the company.

Due in part to cost and in part for additional reliability, the Metro Ethernet was replaced by a T3 circuit tied into the company MPLS at the sub-site. The main site was already on MPLS with the rest of the company. One of the uses for the VPN is scheduled file transfers between sites, and since the switch to MPLS at the sub-site we have had intermittent timeouts for the transfers.

I don’t control the company LAN or WAN, just the PCN, so I have to work through another group to find the root cause but don’t know the right questions to ask.

Are you asking us what questions to ask the IT staff responsible for the LAN/WAN? Or are you wanting to know what to look at yourself? What do you have access to? Just the endpoints on each side? Juniper gear? MPLS circuit info/support? — TheCleaner, Dec 11 '13 at 19:21
A little bit of both I guess. I have access to the Juniper firewalls. The LAN/WAN group will pull reports, but I'm not sure what reports to ask for. We're getting pushback saying everything is fine, no problem found, so I'm having to investigate it more myself to reinforce my case that it needs more investigation. — Randy K, Dec 11 '13 at 21:56

score 1 · Accepted Answer · answered Dec 11 '13 at 22:40

Things I would recommend looking at:

Pull event logs from the Juniper boxes, especially looking for drops in the tunnel.
Run debug logs on the Juniper boxes, especially if the issues are consistent enough that you can do so without worrying about log rollover or performance issues while debugging.
Get any MPLS reports that will show loss of connectivity, bandwidth utilization, etc., with as fine a time granularity as possible.
Do some normal tests. Test various endpoints, file sizes, MTU sizes, QCheck tests, etc. at various times of the day. If you can run these during the intermittent issues, even better.
If it can be reproduced, even on a daily basis, try running Wireshark captures on the endpoints and then analyze those captures.
Try different file transfer protocols. See if the issue is the protocol itself. SMB is pretty poor over a VPN tunnel. Try FTP instead. Test and gather results.

Really overall, the more data points and logs you can gather from various angles, the easier it will be to put the puzzle together.
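
To compare the timed transfer tests suggested above across times of day and protocols, a minimal Python sketch along these lines might help. The sample values, the labels, and the 50% threshold are made up for illustration; only the 45 Mbps link rate comes from this thread.

```python
from dataclasses import dataclass


@dataclass
class TransferSample:
    label: str        # free-form tag, e.g. time of day and protocol (hypothetical)
    bytes_sent: int   # size of the test file in bytes
    seconds: float    # wall-clock duration of the transfer


def throughput_mbps(sample: TransferSample) -> float:
    """Return the achieved throughput in megabits per second."""
    return sample.bytes_sent * 8 / sample.seconds / 1_000_000


def slow_samples(samples, link_mbps, min_fraction=0.5):
    """Return the samples that ran below min_fraction of the link rate."""
    threshold = link_mbps * min_fraction
    return [s for s in samples if throughput_mbps(s) < threshold]


if __name__ == "__main__":
    # Illustrative values only, not measurements from the real network.
    samples = [
        TransferSample("09:00 FTP", 100_000_000, 25.0),
        TransferSample("14:00 FTP", 100_000_000, 160.0),
    ]
    for s in samples:
        print(f"{s.label}: {throughput_mbps(s):.1f} Mbps")
    print("Below threshold:", [s.label for s in slow_samples(samples, link_mbps=45)])
```

Lining up the slow samples against the bandwidth utilization reports for the same time windows is one way to show whether the slowdowns coincide with a saturated circuit.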

Bandwidth utilization turned out to be the key to my issue. There was a user at the sub-site who would periodically upload large amounts of data (200 GB+) to a server located at my site. When he did this he maxed out the available bandwidth. I am now waiting to hear why one user at the sub-site could monopolize the 45 Mbps connection, effectively shutting out the rest of the users from any useful connectivity. We're a very large corporation paying another very large corporation to manage our LAN/WAN systems, so I would expect some sort of QoS to be in place to stop this from occurring. — Randy K, Dec 17 '13 at 17:06
Yes, work with the 3rd party on this. QoS and some policies will need to be in place on the MPLS routers/circuits as well as all the way through to at least the Juniper gear on each endpoint. — TheCleaner, Dec 17 '13 at 17:08
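
For a sense of scale, here is a rough back-of-the-envelope calculation in Python for the figures mentioned above. It assumes 200 GB means 200 × 10^9 bytes, that the full 45 Mbps is available to the upload, and that protocol and VPN overhead are ignored, so the real transfer would take longer.

```python
def transfer_hours(size_bytes: float, link_bps: float) -> float:
    """Return the ideal transfer time in hours at full line rate."""
    return size_bytes * 8 / link_bps / 3600


# 200 GB upload over a 45 Mbps circuit, ignoring all overhead.
hours = transfer_hours(200e9, 45e6)
print(f"{hours:.1f} hours")
```

That works out to just under ten hours of a fully saturated link per upload, which is consistent with scheduled transfers timing out for long stretches.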

score 0 · Answer 2 · answered Dec 11 '13 at 18:48

Look at MTU. Are you using a value large enough to cause fragmentation, which in turn could be causing delays? Imagine that your packet size is 5 bytes too big. Two fragments are sent, and the small one, only 5 bytes, gets there first. It then has to wait until the bigger one catches up before reassembly. And if buffers overflow, you have the classic ATM situation where a small number of lost fragments causes many packet retransmissions.

Do some ping tests with specific packet sizes; in particular, use the MTU size and something like MTU + 5. Also calculate the encapsulation overhead (OVH) and use MTU - OVH and MTU - OVH + 5.
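
The four test sizes can be worked out with a short Python sketch like the one below. It assumes IPv4 with a 20-byte IP header and an 8-byte ICMP header, so the ping payload is the packet size minus 28. The MTU of 1500 and the overhead of 73 bytes are placeholder values, not measurements of this tunnel; the real overhead depends on the VPN's encapsulation and cipher settings. The printed commands use the Linux `ping` options `-M do` (don't fragment) and `-s` (payload size); on Windows the equivalents are `-f` and `-l`.

```python
IP_HEADER = 20    # IPv4 header without options, in bytes
ICMP_HEADER = 8   # ICMP echo header, in bytes


def ping_payload(packet_size: int) -> int:
    """Return the ICMP payload size that yields an IP packet of packet_size bytes."""
    return packet_size - IP_HEADER - ICMP_HEADER


def probe_sizes(mtu: int, overhead: int, margin: int = 5) -> dict:
    """Return the four IP packet sizes suggested above, keyed by description."""
    return {
        "MTU": mtu,
        "MTU + margin": mtu + margin,
        "MTU - OVH": mtu - overhead,
        "MTU - OVH + margin": mtu - overhead + margin,
    }


# Placeholder values: substitute the real link MTU and tunnel overhead.
for name, size in probe_sizes(mtu=1500, overhead=73).items():
    print(f"{name}: {size}-byte packet -> ping -M do -s {ping_payload(size)} <host>")
```

With the don't-fragment bit set, a size that passes at MTU - OVH but fails at MTU - OVH + 5 would suggest that the overhead estimate is about right and that larger packets are being fragmented or dropped.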

Get a report of packet size distribution to get an idea of your mean packet size and what shape of distribution you have.
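
If no such report is available, a rough distribution can be built from packet lengths exported from a capture. The sketch below is one way to do that in Python; the list of lengths is invented for illustration, and the 256-byte bucket width is an arbitrary choice.

```python
from collections import Counter
from statistics import mean


def size_histogram(lengths, bucket=256):
    """Count packets per size bucket, keyed by the bucket's lower bound in bytes."""
    counts = Counter((n // bucket) * bucket for n in lengths)
    return dict(sorted(counts.items()))


# Invented packet lengths in bytes; replace with values exported from a capture.
lengths = [64, 64, 120, 576, 1400, 1500, 1500, 1500]
print("mean size:", mean(lengths))
for lower, count in size_histogram(lengths).items():
    print(f"{lower}-{lower + 255} bytes: {count}")
```

A distribution with a heavy cluster at or near the MTU would make fragmentation inside the tunnel more likely to matter.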

Should I run the MTU ping tests inside the VPN tunnel, outside, or both? If this were the case, wouldn't it be a constant issue? This issue is intermittent. — Randy K, Dec 11 '13 at 21:59
