
I am attempting to direct client traffic to a Kubernetes cluster NodePort listening on 192.168.1.100:30000.

Clients need to make requests to 192.168.1.100:8000, so I added the following REDIRECT rule in iptables:

iptables -t nat -I PREROUTING -p tcp --dst 192.168.1.100 --dport 8000 -j REDIRECT --to-port 30000

I then issue a curl to 192.168.1.100:8000; however, in tcpdump I see a different port:

# tcpdump -i lo -nnvvv host 192.168.1.100 and port 8000
tcpdump: listening on lo, link-type EN10MB (Ethernet), capture size 262144 bytes
[Interface: lo] 20:39:22.685968 IP (tos 0x0, ttl 64, id 20590, offset 0, flags [DF], proto TCP (6), length 40)
[Interface: lo]     192.168.1.100.8000 > 192.168.1.100.49816: Flags [R.], cksum 0xacda (correct), seq 0, ack 3840205844, win 0, length 0
[Interface: lo] 20:39:37.519256 IP (tos 0x0, ttl 64, id 34221, offset 0, flags [DF], proto TCP (6), length 40)

I would expect the tcpdump to show something like

192.168.1.100.8000 > 192.168.1.100.30000

However, it shows the following instead, and I get a connection refused error, since no process is listening on 192.168.1.100.49816:

192.168.1.100.8000 > 192.168.1.100.49816

I am using a test environment, so I don't have access to remote devices; that is why I am using curl to test the iptables REDIRECT path.

Is there a reason why adding a REDIRECT rule causes tcpdump to show the traffic going to a different port than the one specified?

Edit:

Following @A.B's suggestion, I added the following OUTPUT rule:

iptables -t nat -I OUTPUT -d 192.168.1.100 -p tcp --dport 8000 -j REDIRECT --to-port 30000

Now curl proceeds further, and the packet count for the OUTPUT chain rule increases (the packet count for the PREROUTING REDIRECT rule did not increase, though):

2       10   600 REDIRECT   tcp  --  *      *       0.0.0.0/0            192.168.1.100         tcp dpt:8000 redir ports 30000

However, I now get the following error:

# curl -vk https://192.168.1.100:8000/v1/api
* About to connect() to 192.168.1.100 port 8000 (#0)
*   Trying 192.168.1.100...
* Connected to 192.168.1.100 (192.168.1.100) port 8000 (#0)
* Initializing NSS with certpath: sql:/etc/pki/nssdb
* NSS error -12263 (SSL_ERROR_RX_RECORD_TOO_LONG)
* SSL received a record that exceeded the maximum permissible length.
* Closing connection 0
curl: (35) SSL received a record that exceeded the maximum permissible length.

I also tried adding a remotesystem network namespace. This time the packet count for the PREROUTING REDIRECT rule increases after executing remotesystem curl ... (but the count for the OUTPUT chain rule doesn't):

2       34  2040 REDIRECT   tcp  --  *      *       0.0.0.0/0            172.16.128.1         tcp dpt:8000 redir ports 30000

Error:

# ip netns exec remotesystem curl -vk https://192.168.1.100:8000/v1/api
* About to connect() to 192.168.1.100 port 8000 (#0)
*   Trying 192.168.1.100...
* Connection timed out
* Failed connect to 192.168.1.100:8000; Connection timed out
* Closing connection 0
curl: (7) Failed connect to 192.168.1.100:8000; Connection timed out
tiger_groove
  • Your rule won't work with a test from the host. Test again from a remote system, not from the system to itself. – A.B Mar 30 '22 at 21:36
  • Why wouldn't it work, could you explain? Is there a way to make it work from the host with a loopback interface? – tiger_groove Mar 30 '22 at 21:39
  • Please first add context to the question: you tell what you are doing, but I would like to know why you are doing it first (to solve what practical problem that made you use this?) – A.B Mar 30 '22 at 21:41
  • Added in the post – tiger_groove Mar 30 '22 at 21:55
  • I can't tell whether the reason for testing from the local system (rather than a remote one) is explained, but the answer won't need it in the end. – A.B Mar 30 '22 at 21:57
  • I am using a test environment so i don't have access to remote devices that is why I am using `curl` to test the iptables REDIRECT path. What do you mean the answer won't need it anyway? – tiger_groove Mar 30 '22 at 21:59
  • Note that `192.168.1.100.8000 > 192.168.1.100.49816` doesn't mean "redirecting from port 8000 to port 49816", it means "a packet was sent from port 8000 to port 49816", which is simply the port used by your (local) client, and the packet is the TCP RST ("connection refused"). You should have a prior packet from 49816 to 8000 before that (the connection request, TCP SYN). And the connection refused is not because there isn't anything listening on 49816, but rather nothing listening on 8000. – jcaron Mar 31 '22 at 08:25

1 Answer


To be clear: OP's test is done from the system 192.168.1.100 to itself, not from a remote system, and that's the cause of the problem. The port wasn't changed in this case because no NAT rule matched, while it would have matched if done from a remote system.

The schematic below shows the order in which operations are performed on a packet:

(Image: Packet flow in Netfilter and General Networking)

The reason is how NAT works on Linux: iptables sees a packet in the nat table only for the first packet of a new conntrack flow (which is thus in NEW state).

This rule works fine when the connection comes from a remote system. In this case the first packet seen will be an incoming packet:

to port 8000 --> AF_PACKET (tcpdump) --> conntrack --> nat/PREROUTING (iptables REDIRECT): to port 30000
--> routing decision --> ... --> local process receiving on port 30000

For all following packets in the same flow, conntrack directly handles the port change (or its reversal for replies), and the packets skip every iptables rule in the nat table (as noted in the schematic, the nat table is only consulted for NEW connections). So, leaving the reply packets aside, the next incoming packet will undergo this instead:

to port 8000 --> AF_PACKET (tcpdump) --> conntrack: to port 30000
--> routing decision --> ... --> local process receiving on port 30000
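The mapping that conntrack keeps for such a flow can be inspected directly. This is a minimal sketch, assuming the `conntrack` command from conntrack-tools is installed (it is not mentioned elsewhere in this answer) and that it is run as root while a connection from a remote client to port 8000 is open or only recently closed:

```shell
# List tracked TCP flows whose original destination port is 8000 (needs root).
# If the REDIRECT was applied, the original tuple of the entry should still show
# dport=8000, while its reply tuple should show sport=30000.
conntrack -L -p tcp --orig-port-dst 8000
```

An entry whose reply tuple still shows sport=8000 would suggest that no NAT rule matched the first packet of that flow.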

For a test on the system to itself, the first packet isn't an incoming packet but an outgoing packet. This happens instead, using the outgoing lo interface:

local process client curl --> routing decision --> conntrack --> nat/OUTPUT (no rule here)
--> reroute check --> AF_PACKET (tcpdump) --> to port 8000

This packet is then looped back on the lo interface and reappears as an incoming packet that is no longer the first packet of a connection, so it follows the second case above: conntrack alone takes care of the NAT and doesn't call nat/PREROUTING. Except that conntrack wasn't instructed in the previous step to do any NAT:

to port 8000 --> AF_PACKET (tcpdump) --> conntrack
--> routing decision --> ... --> no local process receiving on port 8000

As there's nothing listening on port 8000, the OS sends back a TCP RST.
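This is also how the capture in the question should be read: tcpdump prints each endpoint as address.port and the arrow as source > destination, so the captured line is the RST travelling from port 8000 back to the client's ephemeral port 49816. A small sketch in POSIX shell; `describe_flow` is a hypothetical helper written only for this illustration and is not part of tcpdump:

```shell
# Hypothetical helper: split a tcpdump "src > dst:" pair into address and port.
# tcpdump prints IPv4 endpoints as a.b.c.d.port, so the port is the last
# dot-separated field and the address is everything before it.
describe_flow() {
    src=$1
    dst=${2%:}    # drop the trailing ':' that tcpdump prints after the destination
    printf 'from %s port %s to %s port %s\n' \
        "${src%.*}" "${src##*.}" "${dst%.*}" "${dst##*.}"
}

# The endpoints from the question's capture:
describe_flow 192.168.1.100.8000 192.168.1.100.49816:
# prints: from 192.168.1.100 port 8000 to 192.168.1.100 port 49816
```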

For this to work on the local system, a REDIRECT rule must also be put in the nat/OUTPUT chain:

iptables -t nat -I OUTPUT -d 192.168.1.100 -p tcp --dport 8000 -j REDIRECT --to-port 30000
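One way to check which chain actually handled a test connection is to compare the rule counters before and after a single request. This is a sketch, assuming root access and that nothing else hits these rules during the test:

```shell
# Reset the packet/byte counters of every chain in the nat table (needs root).
iptables -t nat -Z

# One test connection from the host to itself; only its first packet
# (conntrack state NEW) traverses the nat table.
curl -s -o /dev/null http://192.168.1.100:8000/

# For a connection from the host to itself, only the counter of the OUTPUT rule
# should have increased; the PREROUTING rule should still be at zero.
iptables -t nat -L OUTPUT -n -v --line-numbers
iptables -t nat -L PREROUTING -n -v --line-numbers
```

Repeating the same check with the request sent from a remote system should show the opposite: the PREROUTING counter increases and the OUTPUT counter does not.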

Additional notes

  • If the setup is intended for remote use, don't test from the local system: the rules traversed by the test aren't the same, so the test doesn't reflect reality.

    Just use a network namespace to create a pocket remote system if no other system is available. Here is an example that should work on a system having only OP's nat/PREROUTING rule, using curl http://192.168.1.100:8000/ (which doesn't require DNS):

    ip netns add remotesystem
    ip link add name vethremote up type veth peer netns remotesystem name eth0
    ip address add 192.0.2.1/24 dev vethremote
    ip -n remotesystem address add 192.0.2.2/24 dev eth0
    ip -n remotesystem link set eth0 up
    ip -n remotesystem route add 192.168.1.100 via 192.0.2.1
    ip netns exec remotesystem curl http://192.168.1.100:8000/
    
  • tcpdump and NAT

    tcpdump captures at the AF_PACKET steps in the schematic above: very early for ingress and very late for egress. That means that in the remote system case it will never capture port 30000, even when the redirect is working. In the local system case, once the nat/OUTPUT rule is added, it will capture port 30000.

    Just don't blindly trust the address/port displayed by tcpdump when doing NAT: it depends on the case and on where the capture happens.
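To observe this difference, both ports can be captured at once. This is a sketch, assuming root access and the interface names used in this answer (`lo` for the local test, `vethremote` for the network namespace test); each tcpdump runs until interrupted, so start each one in its own terminal before launching the matching curl:

```shell
# Local test (host to itself): with the nat/OUTPUT rule in place, the capture on lo
# should show port 30000, because the rewrite happens before the egress AF_PACKET tap.
tcpdump -i lo -nn 'tcp and host 192.168.1.100 and (port 8000 or port 30000)'

# Remote test (from the namespace): the capture on vethremote should only show
# port 8000, because the ingress AF_PACKET tap sits before nat/PREROUTING.
tcpdump -i vethremote -nn 'tcp and host 192.168.1.100 and (port 8000 or port 30000)'
```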

A.B
  • Thank you so much for the detailed explanation. I am still having some issues and have put more information in my post. It gets further than before, but it seems like something is still blocking. – tiger_groove Mar 31 '22 at 00:22
  • I suspect that 1/ the local process isn't a local process but a container/pod (since it's about Kubernetes), so is also routed or further filtered (=> no connectivity with the new net namespace). I based my answer on the single iptables rule present in the question and nothing not available. Moreover I tried successfully what I presented before I made the answer 2/ as explained counter in the nat table increases only for the first packet (in state NEW): it will either increase in nat/OUTPUT or nat/PREROUTING not both. filter table will see all packets. 3/ initial question wasn't about https – A.B Mar 31 '22 at 06:28
  • I won't change further this answer. You'd have to create a new question, with ALL context given in advance. And preferably reproduce a problem that won't depend on this current Q/A (keep the use with the OUTPUT rule or with a remote system that is known to connect) – A.B Mar 31 '22 at 06:28
  • That is fine, I will create a new question. Really appreciate your help! – tiger_groove Mar 31 '22 at 18:28
  • I have created a new question, please let me know if this makes sense to you https://serverfault.com/questions/1097511/iptables-redirect-to-kubernetes-nodeport-causes-request-to-hang – tiger_groove Mar 31 '22 at 19:48
  • Your new problem is about doing a curl request to your API, not about a timeout because I suggested a method to replace a remote system that didn't happen to work in your specific setup. iptables shouldn't be involved at all in the new question. I'm sorry I didn't explain correctly what the new question should have been. – A.B Mar 31 '22 at 20:36
  • It's weird because when i perform the curl like this `ip netns exec remotesystem curl -vk https://192.168.1.100:30000/v1/flight` it works fine and i get a response back, only when I change it to `192.168.1.100:8000` it hangs, i'm not entirely sure why, but it seems like something doesn't like the REDIRECT iptables rule. – tiger_groove Mar 31 '22 at 21:09
  • I was talking about the case with SSL_ERROR_RX_RECORD_TOO_LONG which didn't hang. – A.B Apr 01 '22 at 06:21