Cannot create nested network namespace

Question

It seems that one cannot create a network namespace from inside another network namespace. The attempt results in "Error: Peer netns reference is invalid.".

Is this a bug or is there some kind of limitation that I am not aware of?

Below is my cmd trace of the error.

# ip netns add foo1
# ip netns exec foo1 ip netns add foo2
# ip netns
Error: Peer netns reference is invalid.
Error: Peer netns reference is invalid.
foo2
foo1
# ip netns exec foo2 /bin/bash
setting the network namespace "foo2" failed: Invalid argument

A.B · Accepted Answer · 2019-04-05T14:55:10.130

TL;DR: As weird as it seems, this is actually not a network namespace issue, but a mount namespace issue and is to be expected.

You should create all new "ip netns namespaces" (see later for the meaning) from the initial (host) "ip netns namespace", i.e. run every ip netns add ... command there and not from inside an "ip netns namespace" entered with ip netns exec .... Once they have been created this way, you're free to switch between them at will with ip netns exec ..., including nesting commands from one to another.

A detailed explanation with step-by-step examples follows.

ip netns specializes in network namespaces, but to support all its features it also has to deal with mount namespaces, for (at least) two reasons that I know of:

bind mounting /etc/netns/FOO/SOMESERVICE to /etc/SOMESERVICE to manage alternate service/daemon configurations

A feature which is handy for easily running some (network-related) daemons in another network namespace while they otherwise remain part of the "host". You can check my answer on UL to a question about it: Namespace management with ip netns (iproute2). Its use requires the same treatment as the following feature, so I won't discuss it further.

remounting /sys to expose new network namespace's network devices in its hierarchy

This one is a mandatory feature. Example exposing the problem:

From "initial host":
```
# ip link add dev dummy9 type dummy
# ip -br link show dummy9 
dummy9           DOWN           f6:f6:48:9c:12:b9 <BROADCAST,NOARP> 
# ls -l /sys/class/net/dummy9
lrwxrwxrwx. 1 root root 0 Apr  4 22:09 /sys/class/net/dummy9 -> ../../devices/virtual/net/dummy9
```
Using a lower-level tool to change to another (ephemeral) network namespace:
```
# unshare --net ip -br link show dummy9 
Device "dummy9" does not exist.
# unshare --net ls -l /sys/class/net/dummy9
lrwxrwxrwx. 1 root root 0 Apr  4 22:13 /sys/class/net/dummy9 -> ../../devices/virtual/net/dummy9
```
And that's the issue: /sys still exposes the initial host's interfaces instead of the new network namespace's interfaces. This is where network namespaces interact with mounting /sys: if /sys is mounted from within the new network namespace, it exposes that namespace's network interfaces in select directory hierarchies (e.g. /sys/class/net and /sys/devices/virtual/net). This is decided at mount time only, not dynamically. Some advanced network settings are easily available by just reading or writing there, so they have to be provided, and the reverse also holds: isolated processes running in the new network environment shouldn't be able to see or alter the initial host's interfaces.

So ip netns exec FOO ... (but not ip netns add FOO) solves this by also unsharing the mount namespace and remounting /sys inside it, so as not to disrupt the initial host's network namespace. What matters is that this mount namespace is itself ephemeral: when you run two separate ip netns exec FOO ... commands, they don't end up in the same mount namespace. Each has its own, with /sys remounted there and pointing to the same network namespace.

So far, no problem. I'll call the result an "ip netns namespace", since two types of namespaces are now involved. We have so far:

term1:

# ip netns add FOO
# ls -l /proc/$$/ns/{mnt,net}
lrwxrwxrwx. 1 root root 0 Apr  4 22:28 /proc/1712/ns/mnt -> mnt:[4026531840]
lrwxrwxrwx. 1 root root 0 Apr  4 22:28 /proc/1712/ns/net -> net:[4026531992]
# ip netns exec FOO bash
# ls -l /proc/$$/ns/{mnt,net}
lrwxrwxrwx. 1 root root 0 Apr  4 22:33 /proc/1864/ns/mnt -> mnt:[4026532618]
lrwxrwxrwx. 1 root root 0 Apr  4 22:33 /proc/1864/ns/net -> net:[4026532520]

term2:

# ls -l /proc/$$/ns/{mnt,net}
lrwxrwxrwx. 1 root root 0 Apr  4 22:32 /proc/1761/ns/mnt -> mnt:[4026531840]
lrwxrwxrwx. 1 root root 0 Apr  4 22:32 /proc/1761/ns/net -> net:[4026531992]
# ip netns exec FOO bash
# ls -l /proc/$$/ns/{mnt,net}
lrwxrwxrwx. 1 root root 0 Apr  4 22:33 /proc/1866/ns/mnt -> mnt:[4026532821]
lrwxrwxrwx. 1 root root 0 Apr  4 22:33 /proc/1866/ns/net -> net:[4026532520]

Note how, after changing ip netns namespaces, the new network namespace is the same for term1 and term2, while the new mount namespaces differ from each other (and from the initial host's).
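
The comparison above can be scripted. The following is a minimal sketch, not something iproute2 provides: same_ns is a hypothetical helper name, and it assumes a Linux /proc exposing the ns/ links shown in the transcripts. Reading the links of another user's process generally requires root.

```shell
#!/bin/sh
# same_ns TYPE PID_A PID_B
# Returns 0 if both processes are in the same namespace of the given
# TYPE (mnt, net, ...), 1 if they differ, 2 if a link cannot be read
# (no such process, or insufficient privileges).
same_ns() {
    _ns_a=$(readlink "/proc/$2/ns/$1") || return 2
    _ns_b=$(readlink "/proc/$3/ns/$1") || return 2
    [ "$_ns_a" = "$_ns_b" ]
}
```

With the two shells from the transcripts (PIDs 1864 and 1866), same_ns net 1864 1866 would be expected to succeed and same_ns mnt 1864 1866 to return 1.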

Now what happens when in term1 you create a new ip netns namespace? Let's see:

term1:

# ip netns add BAR
# ip netns ls
BAR
FOO

term2:

# ip netns ls
Error: Peer netns reference is invalid.
Error: Peer netns reference is invalid.
BAR
FOO

That's because the new namespace BAR, in order to persist without a process, is, like the others, mounted on (the newly created empty file) /var/run/netns/BAR (again, see the previous link for examples). While the mount namespaces are different, they share the same root directory: the initial host's root. So of course the newly created empty file /var/run/netns/BAR was visible everywhere (initial, term1's mount ns, term2's mount ns) as soon as it was created.

Alas, the mount over it, being done in term1's FOO mount namespace, can only be seen in term1, not in term2 nor anywhere else, because those are different mount namespaces. So in term1 (in its FOO ip netns namespace) /var/run/netns/BAR is a pseudo-file belonging to the nsfs pseudo-filesystem:

term1:

# stat -f -c %T /var/run/netns/BAR
nsfs

It's an empty file on tmpfs (from the actual /run mount) anywhere else:

term2:

# stat -f -c %T /var/run/netns/BAR
tmpfs

Any other terminal:

$ stat -f -c %T /var/run/netns/BAR
tmpfs

It can still be seen in term1 as long as one doesn't exit the current "ip netns namespace". If one switches ip netns namespaces again from term1, it will still be fine, because the newly unshared ephemeral mount namespace is a copy of the previous one, including all its mounts.

If exited, that mount point is lost (and that means if there are no processes or file descriptors using it anymore, BAR's corresponding network namespace will disappear because it was held only by this mount point). After this any ip netns ls command will complain, anywhere. You can just remove the stale and now useless file /run/netns/BAR to fix it.
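
To find such stale files, one can reuse the stat -f -c %T check shown earlier. This is a minimal sketch, assuming a stat that supports -f -c %T (as GNU stat does) and the default /var/run/netns directory; list_stale_netns is a hypothetical helper name. The result only reflects the mount namespace it is run from: an entry reported here may still be a live nsfs mount in another mount namespace, so check before deleting anything.

```shell
#!/bin/sh
# list_stale_netns [DIR]
# Prints the name of every entry in DIR (default /var/run/netns) that is
# NOT an nsfs mount as seen from the current mount namespace.
list_stale_netns() {
    _dir=${1:-/var/run/netns}
    for _f in "$_dir"/*; do
        [ -e "$_f" ] || continue                  # empty directory: glob did not match
        _fstype=$(stat -f -c %T "$_f") || continue
        [ "$_fstype" = nsfs ] || printf '%s\n' "${_f##*/}"
    done
}
```

Run from the initial host after the FOO shell has exited, this would be expected to print BAR.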

After this step-by-step explanation, what to remember is that you shouldn't create new namespaces with ip netns add inside a namespace currently entered with ip netns exec. You should create them all from the initial (host) namespace, then you can switch at will between them from any ip netns namespace.

Of course, if /var/run/netns/ (i.e. the mount point /run) is distinct between (staying fuzzy) namespaces, then there is no interaction, and each ip netns invocation will be isolated from the others, neither seeing nor interacting with them. Where does this usually happen? In full containers, where both the mount and the network namespaces are separate and point to distinct resources from the start.

UPDATE: as asked in comments, I checked how to "repair" this problem, but couldn't find any easy solution.

First there's a prerequisite: as explained above, once the new "ip netns" namespace BAR is created inside FOO and FOO is left, the only reference to BAR disappears, making BAR disappear too. Something more is needed.

Actually there are three ways to keep a reference to a namespace:

process: that's the main method, and most of the time that's how the namespace is used at all
mount point (that's the method used by ip netns): allows keeping a namespace without any process, which is fine for a namespace with only network settings inside (interfaces, bridges, tc rules, firewall rules, ...)
open file descriptor: rare, used when creating the namespaces, but seldom kept, except for applications dealing with multiple namespaces at the same time and switching some of their threads using the file descriptor for easy reference.
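
As an illustration of the third method, here is a minimal sketch that holds a namespace open through a file descriptor. It uses the shell's own network namespace (/proc/self/ns/net) only so that it runs without privileges; the descriptor number 9 and the helper name fd_ns_ref are arbitrary choices made for this example.

```shell
#!/bin/sh
# Hold a reference to the current network namespace on file descriptor 9.
# While this shell (or a child that inherited fd 9) is alive, the
# namespace stays referenced even if no mount point pins it.
exec 9< /proc/self/ns/net

# fd_ns_ref FD
# Prints what the descriptor refers to, in the net:[inode] form
# also shown by ls -l /proc/PID/ns/net.
fd_ns_ref() {
    readlink "/proc/self/fd/$1"
}
```

The same redirection pointed at /var/run/netns/BAR, from a shell that can see the nsfs mount, would be expected to keep BAR alive for as long as that shell runs.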

We can use the 1st or 3rd method. Here are various failed attempts before finding something that works...

As explained before, this won't work:

# ip netns add FOO
# ip netns exec FOO ip netns add BAR

Just leave a process running temporarily in the first "ip netns" namespace, for its ephemeral mount namespace part, to keep the needed reference to the new "ip netns" namespace's network namespace and reuse it later from outside (from the initial namespace).

Won't work either:

# ip netns add FOO
# ip netns exec FOO sh -c 'ip netns add BAR; sleep 999 < /var/run/netns/BAR & echo $!'
28344
# strace -e trace=readlink,mount mount --bind /proc/6295/fd/0 /var/run/netns/BAR
readlink("/proc/6295/fd/0", "/run/netns/BAR", 4095) = 14
readlink("/var/run", "/run", 4095)      = 4
mount("/run/netns/BAR", "/run/netns/BAR", 0x55c88c9cccb0, MS_BIND, NULL) = 0
+++ exited with 0 +++
# stat -f -c %T /run/netns/BAR
tmpfs

As seen with strace the mount command followed the symlink when it shouldn't have for this use case (note: the mount is still linked to the sleep process somehow which has to be killed to unmount it).

This (entering sleep's mount namespace, to access the BAR's mounted network namespace hidden there) works but relies on the continued existence of sleep or any process for continued use:

# ip netns add FOO
# ip netns exec FOO sh -c 'ip netns add BAR; ip -n BAR link add dummy8 type dummy; sleep 999 & echo $!'
12916
# nsenter --target=12916 --mount ip -n BAR -brief link show
lo               DOWN           00:00:00:00:00:00 <LOOPBACK> 
dummy8           DOWN           8e:ce:b3:d1:9c:bb <BROADCAST,NOARP>

Strangely, this (using the mount namespace shortcut /proc/pid/root/) doesn't work (I don't really know why):

# stat -f -c %T /proc/12916/root/var/run/netns/BAR 
tmpfs

Finally what will work:

# ip netns add FOO
# ip netns exec FOO sh -c 'ip netns add BAR; ip -n BAR link add dummy8 type dummy; ip netns exec BAR sh -c '\''sleep 999 & echo $!'\'
14124
# mount --bind /proc/14124/ns/net /var/run/netns/BAR
# ip -n BAR -brief link show
lo               DOWN           00:00:00:00:00:00 <LOOPBACK> 
dummy8           DOWN           3a:48:65:20:68:c1 <BROADCAST,NOARP>

So something like this could be used in the end. There might be race conditions if you attempt to delete them right after, before the sleep command ends.

# ip netns add FOO
# mount --bind /proc/$(ip netns exec FOO sh -c 'ip netns add BAR; ip netns exec BAR bash -c '\''sleep 5 </dev/null >/dev/null 2>&1 & echo $!; disown'\')/ns/net /var/run/netns/BAR
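
The bind mount used above can be wrapped in a small function. This is only a sketch under assumptions: pin_netns_of_pid is a hypothetical name, the target directory defaults to /var/run/netns as used by ip netns in this answer, and it has to run as root from the initial (host) mount namespace for the mount to be visible everywhere.

```shell
#!/bin/sh
# pin_netns_of_pid PID NAME [DIR]
# Bind-mounts the network namespace of a running process onto DIR/NAME
# (DIR defaults to /var/run/netns) so that "ip netns" can find it by name.
pin_netns_of_pid() {
    _pid=$1 _name=$2 _dir=${3:-/var/run/netns}
    if [ ! -d "/proc/$_pid/ns" ]; then
        echo "pin_netns_of_pid: no such process: $_pid" >&2
        return 1
    fi
    mkdir -p "$_dir" || return 1
    # The mount target must exist; ip netns uses a plain empty file.
    [ -e "$_dir/$_name" ] || : > "$_dir/$_name" || return 1
    mount --bind "/proc/$_pid/ns/net" "$_dir/$_name"
}
```

For example, pin_netns_of_pid 14124 BAR would redo the mount --bind step from the working transcript, provided the process is still alive when it runs.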

How could such a construct be used? I have no idea because the original problem before encountering the nested "ip netns" problem was not given. Maybe easier solutions are available without ever trying to create "a nested network namespace".

Great answer, thanks. Is there a way to create a new netns safely while inside a netns, i.e. ip netns exec foo1 /bin/bash ... ip netns add foo2? - user98651, Apr 05 '19 at 10:08
It appears much more difficult than it seemed, and I don't see how to use the result in an actual use case. Perhaps you should ask another question, about the *original* problem which forced you to try creating "nested network namespaces". Anyway I'm updating the answer. - A.B, Apr 05 '19 at 14:48
The original problem was to make openvpn move the tun interface into a new namespace. This way one could make a namespace with just two interfaces (lo and tun0) that is VPN'd without altering the global network stack. Creating nested namespaces would be convenient to create a system for: localhost -> vpn1 -> vpn2 -> public internet. It does in fact work this way; the only inconvenience is the "Peer netns reference is invalid" error messages. See https://github.com/conallprendergast/openvpn-netns-systemd for the implementation - user98651, Apr 10 '19 at 14:01
As long as you have a reference to those namespaces, it's easy to "repair" this issue later in the initial host. So here I'm assuming openvpn is running in a net namespace. Something like `mount --bind /proc/$(pidof openvpn)/ns/net /var/run/netns/somenamespace`. The question didn't show any process launched. - A.B, Apr 10 '19 at 14:47
