
We have a client with 6 sites using IPsec. Every now and again, possibly once a week, sometimes once a month, data just stops flowing from the remote Fortigate VPN server to the local MikroTik IPsec VPN client.

To demonstrate the symptoms of the problem I have attached a diagram. On the Installed SAs tab of the diagram you will notice source IP address x.x.186.50 trying to communicate with x.x.7.3, but with 0 current bytes. x.x.186.50 is the client's remote Fortigate IPsec server, and x.x.7.73 is a MikroTik-based IPsec endpoint. It appears data from the remote side to us is not always flowing.

Phase 1 and 2 are always established, but when the problem occurs traffic refuses to flow from the remote side to us.

We tried various things over time, such as rebooting, setting clocks, dabbling with the configuration, and rechecking it again and again, but the problem appears to be entirely random, and sometimes random things fix it. At one stage I had a theory that it works if the tunnel is initiated from their side, but fiddling with "Send Initial Contact" has not made any difference.

We've had many chats with the client about this, but they have many more international IPsec VPNs and only our MikroTik configuration is failing.

Fortigate log:

enter image description here http://kb.fortinet.com/kb/microsites/microsite.do?cmd=displayKC&externalId=11654

Looking at Fortigate's knowledge base, it appears the SPIs don't agree and that DPD would make a difference. But I have tried every single combination of DPD on this side to no avail. I would like to enable DPD on the other side, but I cannot because of change control and because the client says it is working on all the other sites with exactly the same configuration. EDIT: DPD was enabled.

Local VPN client diagram showing no traffic flow:

enter image description here

I have included a MikroTik log file showing continuous loops of "received a valid R-U-THERE, ACK sent":

echo: ipsec,debug,packet 84 bytes from x.x.7.183[500] to x.x.186.50[500]

echo: ipsec,debug,packet sockname x.x.7.183[500]

echo: ipsec,debug,packet send packet from x.x.7.183[500]

echo: ipsec,debug,packet send packet to x.x.186.50[500]

echo: ipsec,debug,packet src4 x.x.7.183[500]

echo: ipsec,debug,packet dst4 x.x.186.50[500]

echo: ipsec,debug,packet 1 times of 84 bytes message will be sent to x.x.186.50[500]

echo: ipsec,debug,packet 62dcfc38 78ca950b 119e7a34 83711b25 08100501 bc29fe11 00000054 fa115faf

echo: ipsec,debug,packet cd5023fe f8e261f5 ef8c0231 038144a1 b859c80b 456c8e1a 075f6be3 53ec3979

echo: ipsec,debug,packet 6526e5a0 7bdb1c58 e5714988 471da760 2e644cf8

echo: ipsec,debug,packet sendto Information notify.

echo: ipsec,debug,packet received a valid R-U-THERE, ACK sent
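For anyone reading the hex dump above: the first 28 bytes of an IKEv1 packet are the fixed ISAKMP header described in RFC 2408. The following is a minimal Python sketch, assuming the first hex line in the log is the start of the packet that was sent. The function name and dictionary keys are my own and not part of any MikroTik or Fortigate tool.

```python
import struct

# First seven 32-bit words of the hex dump in the log above (28 bytes).
HEX_DUMP = "62dcfc38 78ca950b 119e7a34 83711b25 08100501 bc29fe11 00000054"

# Subset of IKEv1 exchange types (RFC 2408 and RFC 2409).
EXCHANGE_TYPES = {
    2: "Identity Protection (main mode)",
    4: "Aggressive",
    5: "Informational",
    32: "Quick Mode",
}


def parse_isakmp_header(hex_words):
    """Unpack the fixed 28-byte ISAKMP header from space-separated hex words."""
    raw = bytes.fromhex(hex_words.replace(" ", ""))
    (init_cookie, resp_cookie, next_payload, version,
     exchange, flags, message_id, length) = struct.unpack("!8s8sBBBBII", raw[:28])
    return {
        "initiator_cookie": init_cookie.hex(),
        "responder_cookie": resp_cookie.hex(),
        "next_payload": next_payload,
        "version": f"{version >> 4}.{version & 0x0F}",
        "exchange_type": EXCHANGE_TYPES.get(exchange, f"unknown ({exchange})"),
        "encrypted": bool(flags & 0x01),  # bit 0 of the flags byte
        "message_id": f"{message_id:08x}",
        "length": length,
    }


header = parse_isakmp_header(HEX_DUMP)
for key, value in header.items():
    print(f"{key}: {value}")
```

The length field decodes to 84, matching the "84 bytes" in the log, and the exchange type is Informational with the encryption flag set. That is consistent with a DPD acknowledgement travelling inside the established phase 1 SA; it says nothing about whether ESP traffic is flowing.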

I've received various suggestions from IPsec experts and from MikroTik themselves implying that the problem is at the remote side. However, the situation is greatly compounded by the fact that 5 other sites are working and that the client's firewall is under change control. The setup also worked for many years, so they claim it cannot be a configuration error on their side. The following suggestion seems plausible, but because of change control I may only apply it on the client side:

Make sure the IPSec responder has both passive=yes and send-initial-contact=no set.

This did not work.

EDIT 9 Dec 2013

I am pasting additional screenshots with the Fortigate configuration and what we believe are the Quick Mode selectors on the Mikrotik side.

Phase 1 Fortigate screenshot

Phase 2 Fortigate screenshot

Quick Mode Selectors?

Let me reiterate that I don't think it's a configuration problem. I speculate it's a timing problem whereby side A or side B tries to send information too aggressively, which puts the negotiated information (e.g. the SPI) out of sync.

EDIT 11 Dec 2013

Sadly I have to give up on this issue. Happily everything is working. Why it's working is still a mystery, but to further illustrate what we did I post another image inline.

We fixed it by:

  1. Turning off PPPoE at client.
  2. Installing a completely new router (Router B) and testing it at the border. It worked at the border.
  3. Switching off new router B at border. AND THEN, WITHOUT MAKING A SINGLE CHANGE, the client's end-point Router A started working. So just adding a duplicate router at the border and taking this router offline again made the original router work.

So add this fix to the list of things we've done:

  1. Reboot. That worked once.
  2. Create a new tunnel with a new IP. That worked once, but only once. After changing the IP back, the client endpoint came live again.
  3. Change time servers.
  4. Fiddle with every possible setting.
  5. Wait. Once, after a day, it just came right. This time, even after days, nothing came right.

So I postulate that there is an incompatibility on either the Fortigate or the MikroTik side which only shows up in seemingly random situations. The only thing we haven't been able to try is upgrading the firmware on the Fortigate. Maybe there is a hidden corrupt configuration value or a timing issue that is invisible to whoever configures the device.

I further speculate that the issue is caused by timing issues leading to an SPI mismatch. My guess is that the Fortigate doesn't want to "forget" about the old SPI, as if DPD is not working. It just happens randomly and, from what I can tell, only when endpoint A is a Fortigate and endpoint B is a MikroTik. The constant aggressive attempts at re-establishing the connection "hold" on to old SPI values.
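To make this speculation concrete, here is a toy Python model of a stale SPI. It is only a sketch of the idea and not how either vendor implements ESP. The class names, the SPI values, and the assignment of roles to `fortigate` and `mikrotik` are all assumptions for illustration.

```python
class EspReceiver:
    """Toy receiver that only accepts ESP packets whose SPI it has installed."""

    def __init__(self):
        self.inbound_spis = set()

    def install_sa(self, spi):
        self.inbound_spis.add(spi)

    def delete_sa(self, spi):
        self.inbound_spis.discard(spi)

    def receive_esp(self, spi):
        # In this model a packet with an unknown SPI is silently dropped.
        return spi in self.inbound_spis


class EspSender:
    """Toy sender that keeps using whatever outbound SPI it currently holds."""

    def __init__(self, outbound_spi):
        self.outbound_spi = outbound_spi

    def send_to(self, receiver):
        return receiver.receive_esp(self.outbound_spi)


# Hypothetical SPI values, chosen only for illustration.
OLD_SPI, NEW_SPI = 0x0A1B2C3D, 0x0E4F5A6B

mikrotik = EspReceiver()
mikrotik.install_sa(OLD_SPI)
fortigate = EspSender(OLD_SPI)

before_rekey = fortigate.send_to(mikrotik)  # both sides agree on OLD_SPI

# The receiver re-keys and forgets OLD_SPI, but the sender never switches.
mikrotik.delete_sa(OLD_SPI)
mikrotik.install_sa(NEW_SPI)
stale = fortigate.send_to(mikrotik)

# Once the sender finally adopts the new SPI, traffic is accepted again.
fortigate.outbound_spi = NEW_SPI
after_fix = fortigate.send_to(mikrotik)

print(before_rekey, stale, after_fix)
```

In this model the phase 1 session is never involved, which would match what we see: DPD keeps answering while data from the remote side shows 0 bytes.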

I'll add to this post when it happens again.

enter image description here

EDIT 12 Dec 2013

As expected it happened again. As you may recall we have 6 MikroTik client IPsec end-point routers configured exactly the same connecting to one Fortigate server. The latest incident was again to a random router, not the one I posted here about originally. Considering the last fix where we installed this duplicate router, I took this shortcut:

  1. Disable Router A, the router that does not want to receive packets from Fortigate any more.
  2. Copy Router A's IPsec configuration to a temporary router closer to the border of our network.
  3. Immediately disable the newly created configuration.
  4. Re-enable Router A.
  5. Automagically it just starts working.

Looking at @mbrownnyc's comment, I believe we are having an issue with the Fortigate not forgetting stale SPIs even though DPD is on. I will investigate our client's firmware and post it.

Here is a new diagram, much like the last, but just showing my "fix":

enter image description here

  • Here's some quick advice, but it isn't an answer: Make sure everything matches. Everything (DPD, PFS). Enable `autokey keep alive`. Also, set up a ping from the remote site to a host at the destination site. What about your `quick mode selectors` (and whatever MikroTik calls them)? What about your [fortigate debug logs](http://mbrownnyc.wordpress.com/2009/03/11/fortigate-debugging-on-a-fortigate/) (`diag debug app ike -1`)? – mbrownnyc Dec 07 '13 at 14:43
  • Thanks for the reply. As it turns out we have no access to the Fortigate and the client's argument is it works across all other 5 sites. In addition their Fortigate is under change control so they don't want to do anything on their side. But they said they'll try to help us again on Monday. I will mention all these settings to them. – Eugene van der Merwe Dec 07 '13 at 19:55
  • I would make sure that everything matches. Since `R-U-THERE` is a function of `DPD` (which operates on phase 1), it seems phase 1 is establishing (so Aggressive versus Main mode is okay), but phase 2 might be failing. I'd ask about PFS, but I already said to verify that each setting is exactly the same, particularly what Fortinet calls Quick Mode Selectors. It doesn't seem you have confirmed that you have verified every single setting. Can you post what they gave you (less IPs, shared key, etc.), appending it to your original post? – mbrownnyc Dec 08 '13 at 21:47
  • `the Fortigate doesn't want to "forget" about the old SPI,` YES YES YES! I have had this happen to me. I'm on v4 MR3 patch 11. – mbrownnyc Dec 11 '13 at 22:13
  • My client is on 620B v4 MR3 Patch 8. I'm going to try Fortigate official channels next as I am so sure this is going to happen again. – Eugene van der Merwe Dec 12 '13 at 08:10
  • I was on my way out when I responded. The cause: Adding an additional `Firewall\Group` to the `Quick Mode Selectors` that I had configured to allow a group name. Restarted the tunnel, and there were two tunnels up. There are bugs. Those guys should [upgrade to at least the latest v4, if not v5](http://mbrownnyc.wordpress.com/2013/01/30/upgrading-the-firmware-on-a-fortigate-unit/). I had to restart the ipsec daemon, if I remember, with `debug app test` in `global` scope. – mbrownnyc Dec 12 '13 at 11:30

2 Answers


This may not be the cause of your problem, but it may be useful information for other users. We had a slightly similar problem with a VPN between a MikroTik and a Sonicwall. Traffic would randomly stop, requiring the SAs to be flushed.

In the end we realised that the Sonicwall was creating a separate SA for each network policy (by the look of your screenshot it looks like you have 2 policies/subnets going over the VPN). I don't know if this 'SA-per-policy' setting is hard coded or configurable as I didn't have access to the Sonicwall.

Our Mikrotik was using the 'require' level for the policies (the default, and seen in your screenshot). This causes the router to create a single SA with the remote peer. When sending traffic for any of the policies for that peer, it will use this same SA, regardless of the src/dest subnet.

This basically meant that it worked as long as we only used one subnet. As soon as our Mikrotik tried to send traffic for the second subnet, it would send it over the existing SA (which, as far as the Sonicwall is concerned, is for a specific subnet pair), the Sonicwall would complain, SA sequence numbers would go out of whack, and the whole lot stopped. (In our case the customer got 'replay' errors on their end.)

In the end it was as simple as changing the policy Level to 'unique', so both ends used a unique SA for each unique subnet pair. The tunnels were perfectly happy after that.
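The difference between the two levels, as I have described it above, can be sketched with a toy Python model. This is not RouterOS code. The function name, the peer address, and the subnets are hypothetical and only illustrate how many SAs each level ends up with.

```python
def build_sas(policies, level):
    """Map each policy to the SA it would use in this toy model.

    level="require": every policy to the same peer shares one SA.
    level="unique":  every (src, dst) subnet pair gets its own SA.
    """
    sas = {}
    for peer, src, dst in policies:
        if level == "require":
            sa_key = (peer,)
        elif level == "unique":
            sa_key = (peer, src, dst)
        else:
            raise ValueError(f"unsupported level: {level}")
        sas[(peer, src, dst)] = sa_key
    return sas


# Hypothetical peer and subnets, for illustration only.
POLICIES = [
    ("203.0.113.1", "10.1.0.0/24", "10.9.0.0/24"),
    ("203.0.113.1", "10.2.0.0/24", "10.9.0.0/24"),
]

for level in ("require", "unique"):
    distinct_sas = set(build_sas(POLICIES, level).values())
    print(level, len(distinct_sas))
```

With two policies to the same peer, the model yields one shared SA for 'require' and two SAs for 'unique'. A remote end that expects one SA per subnet pair would only line up with the second case.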

USD Matt

I know that you have checked this (just like I did when I had a similar, but completely different, intermittent problem), but make sure that router A is not sharing its IP address with another device. A duplicate would give you the intermittent problem when your high side router does an ARP lookup for router A and gets confused. You would think that duplicate IPs on routers would give a consistent error, but they don't.
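One way to check for this is to collect IP/MAC pairs over time and look for an IP that shows up with more than one MAC. Here is a minimal Python sketch, assuming you already have such pairs (for example from repeated ARP table dumps). The function name and the sample addresses are hypothetical.

```python
from collections import defaultdict


def find_duplicate_ips(observations):
    """Return {ip: sorted MACs} for every IP seen with more than one MAC."""
    macs_by_ip = defaultdict(set)
    for ip, mac in observations:
        macs_by_ip[ip].add(mac.lower())  # normalise case before comparing
    return {ip: sorted(macs) for ip, macs in macs_by_ip.items() if len(macs) > 1}


# Hypothetical (ip, mac) pairs, for illustration only.
OBSERVATIONS = [
    ("192.0.2.10", "00:11:22:33:44:55"),
    ("192.0.2.11", "00:11:22:33:44:66"),
    ("192.0.2.10", "00:AA:BB:CC:DD:EE"),
]

duplicates = find_duplicate_ips(OBSERVATIONS)
print(duplicates)
```

An IP that appears in the result is worth a closer look, although a legitimate hardware swap or a failover pair can also produce two MACs for one address.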

mike
  • Thanks for the answer. This problem has mostly died down, so we ask: what has changed? 1. We had a 24 hour RADIUS session-timeout. We removed this and got more stability. 2. The client had a primary and a backup firewall. They went to and from the primary a couple of times. They can't / won't give me the info, but I pray or suspect they also updated some firmware. 3. The last time it happened we used a disable/enable everything for IPsec technique. This solved it. I really like your answer. I will remember this when it happens again. I need an ARP debug tool though. – Eugene van der Merwe Aug 21 '14 at 09:56