
I have very busy web servers and wanted to add some analysis to see what kind of traffic is present: namely, the total number of connections, the number of TIME_WAIT and ESTABLISHED connections, and the number of UDP and TCP connections.

First, I made a simple graph displaying only the total number of connections, by reading /proc/sys/net/netfilter/nf_conntrack_count:

$ cat /proc/sys/net/netfilter/nf_conntrack_count
1994

Everything was nicely presented in the graph, so I added more detail. I now process /proc/net/nf_conntrack with similar commands and feed the results to the appropriate monitoring items:

$ grep -c tcp /proc/net/nf_conntrack
1273
$ grep -c udp /proc/net/nf_conntrack
49

I set this analysis of nf_conntrack to run every minute. Initially everything was displayed properly, so I left it running for a day.
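Each grep above reads the whole table separately. As a minimal sketch, assuming the usual column layout of /proc/net/nf_conntrack (protocol name in the third field and, for TCP entries, the connection state in the sixth), all of the metrics from the question can be collected in a single read. The function name `ct_summary` is hypothetical. This only reduces how often the table is walked per interval; it does not by itself prevent the drop discussed in this question.

```shell
#!/bin/sh
# Hypothetical helper: summarise the conntrack table in a single read.
# Assumed /proc/net/nf_conntrack layout (one entry per line):
#   $1 l3 protocol name, $2 l3 number, $3 l4 protocol name,
#   $4 l4 number, $5 timeout in seconds, $6 TCP state (tcp lines only)
ct_summary() {
    awk '
        { total++; proto[$3]++ }          # every line is one entry
        $3 == "tcp" { state[$6]++ }       # count TCP entries per state
        END {
            printf "total %d\n", total + 0
            printf "tcp %d\n", proto["tcp"] + 0
            printf "udp %d\n", proto["udp"] + 0
            printf "established %d\n", state["ESTABLISHED"] + 0
            printf "time_wait %d\n", state["TIME_WAIT"] + 0
        }' "${1:-/proc/net/nf_conntrack}"
}
```

The per-minute job can then run `ct_summary` once (or `ct_summary /path/to/saved/copy` on a saved snapshot) and feed each output line to its monitoring item, instead of running one grep per metric.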

The next day I noticed huge drops and rebounds in the total connection count (/proc/sys/net/netfilter/nf_conntrack_count) every couple of minutes, which is not normal for a web server. After a lot of testing and troubleshooting I finally pinpointed the reason behind the mystery.

In one terminal I ran watch -n0 "cat /proc/sys/net/netfilter/nf_conntrack_count" (to check the number of connections in near real time), and in a second terminal I ran only cat /proc/net/nf_conntrack. As soon as I pressed Enter, nf_conntrack_count dropped sharply from 1993 to 1411 and then recovered to its "normal" value within 2-3 s. I tried cp, grep, conntrack -L -p tcp, etc., and every time I ran one of these commands the same drop occurred.
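watch -n0 only shows whatever value happens to be on screen. A small reproduction sketch, assuming a `sleep` that accepts fractional seconds (GNU coreutils or BusyBox), samples the counter roughly every 0.1 s while the table is read once in the background, and reports the lowest and highest value seen. The function `sample_during_read` and the `COUNT_FILE`/`TABLE_FILE` overrides are hypothetical names for illustration.

```shell
#!/bin/sh
# Hypothetical helper: record the lowest and highest counter value
# while the conntrack table is read once in the background.
# Defaults are the files from the question; override them for testing.
COUNT_FILE=${COUNT_FILE:-/proc/sys/net/netfilter/nf_conntrack_count}
TABLE_FILE=${TABLE_FILE:-/proc/net/nf_conntrack}

sample_during_read() {
    samples=${1:-30}                  # number of samples, ~0.1 s apart
    min=""; max=""; i=0
    cat "$TABLE_FILE" > /dev/null &   # one full read of the table
    reader=$!
    while [ "$i" -lt "$samples" ]; do
        v=$(cat "$COUNT_FILE")
        if [ -z "$min" ] || [ "$v" -lt "$min" ]; then min=$v; fi
        if [ -z "$max" ] || [ "$v" -gt "$max" ]; then max=$v; fi
        i=$((i + 1))
        sleep 0.1                     # fractional sleep: GNU/BusyBox
    done
    wait "$reader"                    # let the background read finish
    echo "min $min max $max"
}
```

Run as root on the affected server, e.g. `sample_during_read 50`; if the drop reproduces, `min` should come out noticeably lower than `max`.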

Basically, every time /proc/net/nf_conntrack was read, a huge, temporary drop in /proc/sys/net/netfilter/nf_conntrack_count happened, and the monitoring sometimes picked up the low value(s) and plotted them in the graph.

Further, I noticed a huge difference between the output of cat /proc/net/nf_conntrack and conntrack -L. Also, the number of lines in /proc/net/nf_conntrack differs from nf_conntrack_count. The kernel is v4.19.5. It is clearly visible with these two commands, run three seconds apart:

[07:30:14] root@web1(~)$ wc -l /proc/net/nf_conntrack; \
                         cat /proc/sys/net/netfilter/nf_conntrack_count
1236 /proc/net/nf_conntrack
1575

[07:30:18] root@web1(~)$ cat /proc/sys/net/netfilter/nf_conntrack_count;\
                             wc -l /proc/net/nf_conntrack
2009
1191 /proc/net/nf_conntrack

My question is: what exactly is going on here, why is this drop happening, why is there a difference between the listed files, and how can I prevent the drop?

Victoria Javi

2 Answers


I tried the same test, running grep -c tcp /proc/net/nf_conntrack and watch -n0 "cat /proc/sys/net/netfilter/nf_conntrack_count", on one of our production servers running CentOS 7.4.1708 with kernel 3.10.0-693.21.1.el7.x86_64, and I cannot reproduce the problem you're facing.

Please post at least your kernel version; maybe someone running a server with the same version could test it as well.

One idea of what could be going on is that you are hitting some system memory or CPU limit while running grep, and that this affects nf_conntrack. You could try running e.g. nice -n19 grep -c tcp /proc/net/nf_conntrack, and use ulimit or cgroups to throttle RAM. Another idea is to search for your kernel version or nf_conntrack version together with a description of your problem. It could be a bug, but that is not very likely.

patok
  • The kernel version was specified in the original post: it is v4.19.5. I tried `nice -n19 grep -c tcp /proc/net/nf_conntrack` with the same result: a huge drop. – Victoria Javi Dec 11 '18 at 17:43

Overall, I think it depends on your kernel version and the number of connections you are tracking. IIRC, the kernel needs to acquire some locks in order to generate the contents of /proc/net/nf_conntrack, which is probably why you are seeing the drop.

A better way is to use the conntrack utility which gets the information using netlink and doesn't suffer from the same set of problems.
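A minimal sketch of the netlink route, assuming conntrack-tools is installed and the commands run as root: `conntrack -C` prints the table counter, and a single `conntrack -L` dump is summarised with awk, assuming the tool's default output format (protocol name in the first field and, for TCP entries, the state in the fourth). The function name `ct_listing_summary` is hypothetical. Per the asker's comment below, the drop was still observed with conntrack, so this changes the interface rather than guaranteeing the drop disappears.

```shell
#!/bin/sh
# Hypothetical helper: summarise a conntrack -L listing read from stdin.
# Assumed default conntrack(8) output layout (one entry per line):
#   $1 l4 protocol name, $2 l4 number, $3 timeout, $4 TCP state (tcp only)
ct_listing_summary() {
    awk '
        { total++; proto[$1]++ }          # every line is one entry
        $1 == "tcp" { state[$4]++ }       # count TCP entries per state
        END {
            printf "listed %d\n", total + 0
            printf "tcp %d\n", proto["tcp"] + 0
            printf "udp %d\n", proto["udp"] + 0
            printf "established %d\n", state["ESTABLISHED"] + 0
            printf "time_wait %d\n", state["TIME_WAIT"] + 0
        }'
}

# Usage on the server (needs root and conntrack-tools):
#   conntrack -C                                   # table counter
#   conntrack -L 2>/dev/null | ct_listing_summary  # one netlink dump;
#   the "flow entries have been shown" summary goes to stderr
```

Using one dump per interval, instead of one `conntrack -L -p ...` call per metric, keeps the number of full-table walks at one.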

V13
  • Yes, this indeed seems to be due to locking plus a huge number of connections. As noted in the OP, the drop also happens when doing the analysis with conntrack. – Victoria Javi Dec 18 '18 at 18:11