44

I am evaluating a system for a client where many OpenVPN clients connect to an OpenVPN server. "Many" means 50,000 to 1,000,000.

Why do I do that? The clients are distributed embedded systems, each sitting behind the system owner's DSL router. The server needs to be able to send commands to the clients. My first naive approach is to make the clients connect to the server via an OpenVPN network. This way, the secure communication tunnel can be used in both directions.

This means that all clients are always connected to the server, and the number of clients will keep adding up over the years.

The question is: does the OpenVPN server explode when reaching a certain number of clients? I am already aware of a maximum TCP connection number limit, therefore (and for other reasons) the VPN would have to use UDP transport.

OpenVPN gurus, what is your opinion?

Steffen Müller
  • Could you share your final conclusions about this with us? Have you been able to do tests with > 5'000 users? – Philipp May 13 '13 at 16:21
  • Hello Philipp, we ditched the OpenVPN plan as it was clear that we would be breaking ground nobody had touched before. We opted for an SSL-based plain TCP socket connection to a Node.js connection management server. – Steffen Müller May 14 '13 at 16:48

4 Answers

28

I doubt that a setup that large has ever been attempted before, so you likely will be pushing limits when trying. I could find an article on a VPN deployment for 400 clients but judging from the text, the author just relied on rough estimates about how many clients could be run per CPU and lacked some understanding about how his setup would perform.

You would mainly need to consider these two points:

  1. The bandwidth your data transfers are going to use would need encryption / decryption at the VPN server side, consuming CPU resources

  2. OpenVPN client connections consume both memory and CPU resources on the server, even when no data is transferred

Any decent PC hardware available today should easily saturate a Gigabit link with Blowfish or AES-128; even $100 embedded devices are capable of rates near 100 Mbps, so CPU bottlenecks due to bandwidth intensity should not be of any concern.

Given the default rekeying interval of 3600 seconds, 1,000,000 clients would mean that the server needs to complete about 278 key exchanges per second on average. While a key exchange is a rather CPU-intensive task, you could offload it to dedicated hardware if needed - available cryptographic accelerator cards easily meet and exceed this number of TLS handshakes. Memory restrictions should not be much of a concern either - a 64-bit binary should take care of any virtual memory restrictions you would be likely to hit otherwise.
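The average in the previous paragraph is simply the client count divided by the rekeying interval. A minimal sketch of that arithmetic follows; the function name is made up for illustration, and it assumes renegotiations are spread evenly over the interval, which is probably optimistic right after a server restart when many clients reconnect at once.

```python
def key_exchanges_per_second(clients: int, reneg_interval_s: int = 3600) -> float:
    """Average key exchanges per second, assuming rekeys are evenly spread.

    reneg_interval_s defaults to 3600 s, the rekeying interval quoted above.
    """
    return clients / reneg_interval_s


# The two ends of the range mentioned in the question.
for clients in (50_000, 1_000_000):
    rate = key_exchanges_per_second(clients)
    print(f"{clients:>9,} clients -> {rate:6.1f} key exchanges/s")
```

Doubling the interval halves the average rate, at the cost of keeping each session key in use for longer.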

But the real beauty with OpenVPN is that you can scale it out quite easily - simply set up an arbitrary number of OpenVPN servers and make sure your clients are using them (e.g. through DNS round-robin), configure a dynamic routing protocol of your choice (typically this would be RIP due to its simplicity) and your infrastructure would be capable of supporting an arbitrary number of clients as long as you've got enough hardware.

the-wabbit
  • Thanks for the concise answer. Do you see alternatives to using openvpn? The main goal is just having the bidirectional communication going through the router. – Steffen Müller Oct 19 '12 at 09:13
  • 2
    @SteffenMüller If you do not need a complete stack but only a control channel, why not use something similar to [botnets](http://en.wikipedia.org/wiki/Botnet)? Implementations are available and the SANS [conveniently offers a paper](http://www.sans.org/reading_room/whitepapers/covert/byob-build-botnet_33729) on how to set them up – the-wabbit Oct 19 '12 at 10:47
  • Thanks for the interesting link. Unfortunately, the bot uses simple polling to query whether the server has information. Although this might be the way to go, I am searching for a way to establish and keep a bidirectional connection. Constant polling causes either delays in command execution or a high data volume for useless polling requests. Maybe a permanent TCP connection is the way to go? – Steffen Müller Oct 20 '12 at 11:52
  • 1
    @SteffenMüller Botnets are *proven* to handle thousands of clients well - thus my suggestion to look into it. You do not have to go with the specific implementation hinted at by SANS - there really are plenty of others. Other than that, without knowing your exact requirements it is really difficult to tell. A TCP connection sending keepalives surely would be able to make sure the state relation at the NAT gateway does not age out. But you would need to take care of everything else (authentication, encryption, error handling) all by yourself. – the-wabbit Oct 21 '12 at 09:38
  • 2
    BTW, there isn't any reason you can't lower the rekeying frequency, i.e. lengthen the interval (there is a security tradeoff, in that a key being compromised will reveal plaintext back to the last rekeying). Also, I'd be much more worried about the *routing* or other connection look-up failing first. I mean, if OpenVPN is intended to have <100 connections active, what's the chance that there is an O(n) lookup of a connection somewhere? – derobert Oct 22 '12 at 22:30
  • Just came across this old question while doing some research. For reference OpenVPN 2.x isn't multi-threaded, it's planned for 3.x. See http://community.openvpn.net/openvpn/wiki/RoadMap and http://comments.gmane.org/gmane.network.openvpn.user/30272 – Mike Oct 31 '13 at 08:51
  • @Mike thanks for the correction - I've removed the passage stating that OpenVPN is multithreaded from my answer. – the-wabbit Nov 01 '13 at 23:32
27

I have actually done this, albeit with "only" a few hundred remote connections similarly behind DSL routers. I can't comment too much about the rekeying issues, but a few practical things I learned along the way:

1) When deploying clients, make sure you specify multiple VPN servers in the client conf, vpn1.example.com, vpn2.example.com, vpn3..... Even if you only provide one or two of these now, you give yourself headroom. Configured properly, the clients will keep retrying them at random till they find one that works.

2) We use a custom AWS VPN server image, and can spin up additional capacity on demand, and Amazon DNS (R53) handles the DNS side of things. It is completely detached from the rest of our infrastructure.

3) At the server(s) end, make careful use of the netmask to restrict the number of potential clients. That should force clients onto an alternative server, mitigating the CPU issues. I think we limit our servers to 300 or so clients. This choice was somewhat arbitrary on our part - "gut feel" if you like.

4) Also at the server end, you should make careful use of firewalls. In simple terms, we have ours configured such that the clients can VPN connect in, but the servers strictly disallow all ssh connections inbound except from a known IP address. We can SSH to the clients if we occasionally need to, they can't SSH to us.

5) Don't rely on OpenVPN doing the reconnect for you at the client end. 9 times out of 10 it will, but sometimes it gets stuck. Have a separate process to reset/restart OpenVPN at the client end regularly.

6) You need a way of generating unique keys for the clients so you can disavow individual clients if necessary. We generate these internally with our server build (PXEboot) process. We have never had to disavow one, but we know we can do it.

7) You will need some management tools and scripts to monitor your VPN server connections effectively.

There is not much material out there about how to do this unfortunately, but it is possible, with careful configuration.
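For point 5, a watchdog along the following lines is one way to do it. This is only a sketch under assumptions, not the setup described in this answer: it assumes a Linux client with iputils `ping` and systemd, the service name `openvpn` and the peer address `10.8.0.1` are placeholders, and the `Watchdog` class is invented for illustration.

```python
import subprocess


def tunnel_is_up(peer_ip: str, timeout_s: int = 5) -> bool:
    """Return True if a single ping through the tunnel succeeds (Linux ping flags)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), peer_ip],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def restart_openvpn(service: str = "openvpn") -> None:
    """Restart the client service; the unit name is an assumption, adjust per distro."""
    subprocess.run(["systemctl", "restart", service], check=False)


class Watchdog:
    """Restart the VPN client after max_failures consecutive failed checks."""

    def __init__(self, check, restart, max_failures: int = 3):
        self.check = check
        self.restart = restart
        self.max_failures = max_failures
        self.failures = 0

    def tick(self) -> bool:
        """Run one check; return True if a restart was triggered."""
        if self.check():
            self.failures = 0
            return False
        self.failures += 1
        if self.failures >= self.max_failures:
            self.restart()
            self.failures = 0
            return True
        return False


# Example wiring (not executed here); call tick() from cron or a sleep loop:
# watchdog = Watchdog(lambda: tunnel_is_up("10.8.0.1"), restart_openvpn)
# watchdog.tick()
```

Requiring several consecutive failures before restarting avoids tearing down a tunnel that is merely slow to answer a single ping.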

Aitch
  • Thank you very much for the insights. I am surprised that the rekeying problems already hit you with 300 clients... – Steffen Müller Oct 24 '12 at 08:53
  • To clarify - they have not, but I have not been tracking it either.... :-/ The "300" number just seemed reasonable. If we have problems we would just bump up the AWS image to a larger instance. I've never had close to that many connections on a server before, probably only about 100 max, but we run several servers and they approximately balance in line with openvpn randomly choosing a destination from a known list. – Aitch Oct 24 '12 at 11:15
  • Can you share any more details on how you do this: "5) Don't rely on OpenVPN doing the reconnect for you at the client end. 9 times out of 10 it will, but sometimes it gets stuck. Have a separate process to reset/restart openVPN at the client end regularly." – Doug May 18 '17 at 17:12
  • Sorry left that job 4.5 years ago(!), can't remember, but almost certainly some sort of process list, kill then service restart. – Aitch May 31 '17 at 17:54
  • (I run a similar setup with currently about 400 devices on one VPN server.) You need to decide what to do when the VPN cannot be reached, times out, or rejects the connection. The random retry interval won't help you forever and will only generate traffic. Depending on the issue, you either have to do something on the client, on the firewall/DSL side (which you usually cannot, so you send the system into a sleep phase: "meh, transfer data later"), or on the VPN server itself if that is where the issue lies. You can estimate which case applies from the logs and decide based on that. Rekeying is not (yet) a problem for us. – Dennis Nolte Jul 05 '18 at 11:04
4

Update 2018

Not sure what has changed since 2012; I just wanted to give an update on my experience in 2018. We have deployed an OpenVPN network very similar to the OP's setup. Our endpoints are full-blown Linux PCs instead of embedded devices. Each endpoint has a monitor used to display information and alarms for that site, and our server gives us a single point from which to remote into all endpoints. The network isn't overly active but sometimes has 5-10 remote sessions simultaneously.

Using a current build of OpenVPN with around 100 clients on an Azure image with a single core and 2 GB of RAM, we use around 0.7% of memory on average, and CPU usage is almost always around 0%. Based on what I found in this smaller test, I figure a single server with decent specs would easily handle 50,000 concurrent clients if it had the RAM to support it. If RAM usage scaled linearly, then 16 GB would be able to handle 50,000 users with enough to spare on a dedicated OpenVPN machine.

We are not at a large enough scale to say that with significant confidence, but I wanted to give a recent update, since I found this question when originally deploying our network and was expecting much more resource usage at this scale. I do believe the CPU that runs this has hardware encryption, and I'm not sure at what point that would be overloaded traffic-wise, but for endpoints that do not communicate a lot this shouldn't be an issue.

At 1,000,000 clients you would need 200 GB of RAM on a single machine (if scaled linearly, with extra). While this is possible, I would think at that point you would want 5 machines, each with 64 GB of RAM, so you don't have a single point of failure. This should allow maintenance, restarts and replacements of 1 or even 2 machines without significant issues.

My RAM estimates are likely way overkill, since I'm dividing the entire OpenVPN memory usage by the number of clients, whereas only a portion of that RAM is actually due to clients.
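For anyone who wants to redo the extrapolation with their own numbers, here is a rough sketch of the linear scaling used above. The function names are made up for illustration, and the inputs are just the figures from this answer (0.7% of 2 GB at about 100 clients). It comes out at roughly 7 GB for 50,000 clients and 140 GB for 1,000,000 before any headroom, which is where the 16 GB and 200 GB figures come from once extra is added.

```python
def ram_per_client_gb(total_ram_gb: float, used_fraction: float, clients: int) -> float:
    """Attribute all of OpenVPN's observed memory use to clients (an overestimate)."""
    return total_ram_gb * used_fraction / clients


def projected_ram_gb(target_clients: int, per_client_gb: float) -> float:
    """Linear extrapolation; real usage may not scale linearly."""
    return target_clients * per_client_gb


# Observed in this answer: ~0.7% of 2 GB with ~100 clients connected.
per_client = ram_per_client_gb(total_ram_gb=2, used_fraction=0.007, clients=100)

for target in (50_000, 1_000_000):
    print(f"{target:>9,} clients -> ~{projected_ram_gb(target, per_client):.0f} GB")
```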

We have added 74 endpoints in a year since initially deployed. I hope to continue to grow that number significantly and will make a further update if we get to a decent scale.

CraigZ
  • It won't let me comment on the thread above, but I wanted to answer this: "Can you share any more details on how you do this: '5) Don't rely on OpenVPN doing the reconnect for you at the client end. 9 times out of 10 it will, but sometimes it gets stuck. Have a separate process to reset/restart openVPN at the client end regularly.'" – Doug May 18 '17 at 17:12 – CraigZ Mar 23 '18 at 12:54
  • Hit a character limit. Use supervisord to do this. Make it restart automatically every 6-12h – CraigZ Mar 23 '18 at 13:01
  • Any further updates on this? I found your answer very interesting, considering so many other forums out there seem to indicate that the low to mid hundreds is what a single instance can support. – Peadar Doyle Aug 12 '21 at 08:25
1

I'm looking into a similar problem, although the number of clients would be in the hundreds, maybe a couple of thousand.

I figured that I can't keep all the clients connected all the time.

I'm thinking of starting the OpenVPN daemon on the clients at randomized time intervals so they can check whether they were polled. If they were, they are to send an email or something to say that they are online, and then send keepalive packets for a period of time so I can connect to them.

If there is no traffic for some time, the daemon would be stopped.

The problem I'm facing right now is that it seems impossible to get a list of currently connected VPN clients ...

  • 2
    You can get a current list of connected clients through openvpn status log. There you see all connected ips to the current server. – Fa11enAngel Jul 02 '14 at 17:26
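As a rough illustration of reading that status log, the sketch below pulls the connected clients out of the client-list section. It assumes the default (version 1) status file layout with a `Common Name,Real Address,...` header followed by a `ROUTING TABLE` section; other `status-version` settings use a different layout. The sample data and the function name are made up for illustration.

```python
SAMPLE_STATUS = """\
OpenVPN CLIENT LIST
Updated,Wed Jul  2 17:20:00 2014
Common Name,Real Address,Bytes Received,Bytes Sent,Connected Since
device-0001,203.0.113.5:51234,12345,67890,Wed Jul  2 16:00:00 2014
device-0002,198.51.100.7:40022,2222,3333,Wed Jul  2 16:30:00 2014
ROUTING TABLE
Virtual Address,Common Name,Real Address,Last Ref
10.8.0.6,device-0001,203.0.113.5:51234,Wed Jul  2 17:19:00 2014
10.8.0.10,device-0002,198.51.100.7:40022,Wed Jul  2 17:19:30 2014
GLOBAL STATS
Max bcast/mcast queue length,0
END
"""


def parse_client_list(status_text: str) -> list:
    """Return one dict per connected client from a version 1 status file."""
    clients = []
    in_client_section = False
    for raw_line in status_text.splitlines():
        line = raw_line.strip()
        if line.startswith("Common Name,"):
            in_client_section = True  # header row of the client list
            continue
        if line == "ROUTING TABLE":
            break  # client list ends where the routing table starts
        if in_client_section and line:
            name, real_addr, rx, tx, since = line.split(",", 4)
            clients.append({
                "common_name": name,
                "real_address": real_addr,
                "bytes_received": int(rx),
                "bytes_sent": int(tx),
                "connected_since": since,
            })
    return clients


for client in parse_client_list(SAMPLE_STATUS):
    print(client["common_name"], client["real_address"])
```

In practice you would read the text from whatever path the server's `status` directive points to instead of using the sample string.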