Our monitoring system is reporting transmit errors on the IPsec VTIs on our Vyatta Core routers when they are under high load. The errors appear only occasionally and don't seem to seriously impact performance (we're getting pretty close to 100 Mbps on a 100 Mbps link), but there seems to be very little information out there about what constitutes a transmit error on a VTI. I'm sure the information exists in the kernel sources, but since I have no kernel development experience, it could take me days or weeks to understand them well enough to answer the question. Where can I find more information about this?

Paul Gear

2 Answers

The transmit errors on VTI interfaces (and other tunneling interfaces) have special meanings. Unfortunately, this is poorly documented, so I looked into the kernel source code to investigate it (see the net/ipv4/ip_vti.c file).

To list the categories of TX errors, use the ip -s -s -d link show [ dev <vti-iface> ] command.

TX carrier errors and troubleshooting:

  • No suitable route was found - check it with the ip route get <dst> command
  • No suitable policy was found - check the policies with the ip xfrm policy get ... command
  • No suitable SA was found - check the SA status with the ip xfrm state get ... command
  • The SA isn't in the tunnel mode - check the SA mode with the ip xfrm state show or the ip xfrm state get ... commands

TX collision errors:

  • Routing loop found - after transformation, the packet would be sent back through the same VTI interface - check the SA configuration and the routing configuration.
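
The split between carrier and collision errors can also be read from the colls and carrier columns of /proc/net/dev. The following is a minimal Python sketch of that idea, not a definitive tool: the function names, the SAMPLE text and its counter values are all made up for illustration, and the hints simply restate the troubleshooting lists above.

```python
# Column order of /proc/net/dev after the "iface:" prefix:
# eight receive counters followed by eight transmit counters.
RX_FIELDS = ["bytes", "packets", "errs", "drop",
             "fifo", "frame", "compressed", "multicast"]
TX_FIELDS = ["bytes", "packets", "errs", "drop",
             "fifo", "colls", "carrier", "compressed"]

# Hypothetical sample in the /proc/net/dev layout (values are invented).
SAMPLE = """\
Inter-|   Receive                                                |  Transmit
 face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
    lo: 500 5 0 0 0 0 0 0 500 5 0 0 0 0 0 0
  vti2: 1000 10 0 0 0 0 0 0 2000 20 12 0 0 2 10 0
"""


def parse_proc_net_dev(text):
    """Return {iface: {"rx": {...}, "tx": {...}}} from /proc/net/dev text."""
    stats = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # the two header lines contain no colon
        name, _, rest = line.partition(":")
        values = [int(v) for v in rest.split()]
        if len(values) != len(RX_FIELDS) + len(TX_FIELDS):
            continue  # unexpected layout, skip rather than guess
        stats[name.strip()] = {
            "rx": dict(zip(RX_FIELDS, values[:8])),
            "tx": dict(zip(TX_FIELDS, values[8:])),
        }
    return stats


def explain_tx_errors(tx):
    """Map non-zero TX counters to the checks listed in this answer."""
    hints = []
    if tx["carrier"] > 0:
        hints.append("carrier: check route, xfrm policy, SA and SA mode")
    if tx["colls"] > 0:
        hints.append("collisions: check for a routing loop through the VTI")
    return hints


if __name__ == "__main__":
    # On a real router, replace SAMPLE with open("/proc/net/dev").read().
    for iface, counters in parse_proc_net_dev(SAMPLE).items():
        for hint in explain_tx_errors(counters["tx"]):
            print(iface, counters["tx"]["errs"], hint)
```

Sampling these counters periodically and comparing the deltas should show which of the two categories grows while the link is under load.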
Anton Danilov

The errors that you're seeing can happen for a number of reasons. My suggestion would be to dig through your logs for a message that looks like:

Nov 25 21:18:00.000 UTC: ISAKMP (0:1): deleting node ######## error TRUE reason "[the answer you seek is likely in this string]"

I'd take a look at this link for troubleshooting IPsec VPNs. Normally I'd summarize it, since links can go dead for any reason, but without knowing more specifics, the general advice is to look for troubleshooting guides that aren't about initial configuration (you have a working setup with only occasional errors). In other words, the answer to your question likely lives as a string in your log files.

More generally, transmit errors can occur for any number of reasons: mangled checksums, mangled authentication headers, or retransmissions due to congestion or dropped packets. Really, an error in any layer of the IPsec-affected network stack can bubble up.

  • I can understand how mangled checksums & authentication headers could happen on receive (due to corruption in transmission), but the sender calculates the checksums/MACs and inserts them into the packets - how could it go wrong (except perhaps for local memory corruption, which is very unlikely to be detected by the sender)? Retransmissions due to congestion or bad links make more sense, but why would they be showing up at the link layer rather than the transport layer? A VTI shows up as an ordinary point-to-point interface on Linux. – Paul Gear Nov 26 '13 at 22:21
  • I guess the main likelihood here is that the lower level Ethernet device is detecting that the medium is busy (through CSMA/CD), and pushing that back to the VTI. – Paul Gear Nov 26 '13 at 22:24
  • When you say that you get these errors under high load, do you mean high network load or is it correlated with other constraints (CPU/mem)? – 89c3b1b8-b1ae-11e6-b842-48d705 Nov 27 '13 at 13:21
  • Can you get more detailed information from your monitoring system? – 89c3b1b8-b1ae-11e6-b842-48d705 Nov 27 '13 at 13:36
  • Sanity check: are you on the most recent stable version? If not, is there anything in the changelogs for intermediate versions that would suggest a now-handled bug? – 89c3b1b8-b1ae-11e6-b842-48d705 Nov 27 '13 at 14:35
  • High network load is what I was referring to. When we push 100 Mbps through the system it ties up 1 CPU core pretty well, but the other 3 cores are idle. Our monitoring system only reports interface errors that net-snmp detects. So it's really the kernel that is determining all of this. – Paul Gear Nov 29 '13 at 03:29
  • It's near impossible to solve this without polling for more information or checking out the logs. Is that out of the question? – 89c3b1b8-b1ae-11e6-b842-48d705 Dec 02 '13 at 15:19
  • There are no messages from the IPsec daemon except for the usual ones on key expiry: – Paul Gear Dec 03 '13 at 20:44
  • Sorry - didn't edit quick enough. Here's the message on key expiry: Dec 4 06:29:35 host pluto[6195]: "peer-1.2.3.4-tunnel-vti" #1234: ignoring Delete SA payload: PROTO_IPSEC_ESP SA(0xa0b1c2d3) not found (maybe expired) There's also nothing relevant in dmesg - it's just firewall logs. – Paul Gear Dec 03 '13 at 20:51
  • Here's netstat -i (taken directly from /proc/net/dev):
    Kernel Interface table
    Iface   MTU  Met       RX-OK RX-ERR RX-DRP RX-OVR       TX-OK TX-ERR TX-DRP TX-OVR Flg
    ...
    vti2   1500    0  1915573169      0      0      0  5001462216   5554      0      0 OPRU
    vti3   1500    0  4972367985      0      0      0  1904624979    933      0      0 OPRU
    Note the TX-ERR entries - I believe this is what SNMP is reporting. I don't think this is something we can solve by looking at logs. It needs a detailed understanding of what causes TX errors in the kernel. – Paul Gear Dec 03 '13 at 20:57