This problem was interesting, as I've often wondered about it myself. I ran a couple of tests and found some interesting results. If I opened one connection to a server and waited 60 seconds, it was invariably cleaned up (it never got to 0.00/0/0). If I opened 100 connections, they too were cleaned up after 60 seconds. But if I opened 101 connections, I would start to see connections in the state you mentioned (which I've also seen before). They appear to last roughly 120s, or 2x MSL (which is 60s), regardless of what tcp_fin_timeout is set to.

I did some digging in the kernel source code and found what I believe is the 'reason'. There is code that limits the amount of socket reaping that happens per 'cycle'. The cycle frequency itself is set on a scale based on HZ:
linux-source-2.6.38/include/net/inet_timewait_sock.h:
#define INET_TWDR_RECYCLE_SLOTS_LOG 5
#define INET_TWDR_RECYCLE_SLOTS     (1 << INET_TWDR_RECYCLE_SLOTS_LOG)

/*
 * If time > 4sec, it is "slow" path, no recycling is required,
 * so that we select tick to get range about 4 seconds.
 */
#if HZ <= 16 || HZ > 4096
# error Unsupported: HZ <= 16 or HZ > 4096
#elif HZ <= 32
# define INET_TWDR_RECYCLE_TICK (5 + 2 - INET_TWDR_RECYCLE_SLOTS_LOG)
#elif HZ <= 64
# define INET_TWDR_RECYCLE_TICK (6 + 2 - INET_TWDR_RECYCLE_SLOTS_LOG)
#elif HZ <= 128
# define INET_TWDR_RECYCLE_TICK (7 + 2 - INET_TWDR_RECYCLE_SLOTS_LOG)
#elif HZ <= 256
# define INET_TWDR_RECYCLE_TICK (8 + 2 - INET_TWDR_RECYCLE_SLOTS_LOG)
#elif HZ <= 512
# define INET_TWDR_RECYCLE_TICK (9 + 2 - INET_TWDR_RECYCLE_SLOTS_LOG)
#elif HZ <= 1024
# define INET_TWDR_RECYCLE_TICK (10 + 2 - INET_TWDR_RECYCLE_SLOTS_LOG)
#elif HZ <= 2048
# define INET_TWDR_RECYCLE_TICK (11 + 2 - INET_TWDR_RECYCLE_SLOTS_LOG)
#else
# define INET_TWDR_RECYCLE_TICK (12 + 2 - INET_TWDR_RECYCLE_SLOTS_LOG)
#endif

/* TIME_WAIT reaping mechanism. */
#define INET_TWDR_TWKILL_SLOTS 8 /* Please keep this a power of 2. */
The reaping quota per timer run is also set here:
#define INET_TWDR_TWKILL_QUOTA 100
In the actual timewait code you can see where it uses the quota to stop killing off TIME_WAIT connections if it has already reaped too many:
linux-source-2.6.38/net/ipv4/inet_timewait_sock.c:
static int inet_twdr_do_twkill_work(struct inet_timewait_death_row *twdr,
                                    const int slot)
{
...
        if (killed > INET_TWDR_TWKILL_QUOTA) {
                ret = 1;
                break;
        }
There's more information here on why HZ is set to what it is:
http://kerneltrap.org/node/5411

It isn't uncommon to increase HZ, but I think it's more common to enable tcp_tw_reuse/tcp_tw_recycle to get around this bucket/quota mechanism (which seems confusing to me now that I've read about it; increasing HZ would be a much safer and cleaner solution). I posted this as an answer, but I think there could be more discussion here about what the 'right way' to fix it is. Thanks for the interesting question!
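For reference, these are the sysctls I mean. Be careful with tcp_tw_recycle in particular: it is known to cause problems for clients behind NAT, since it relies on per-host TCP timestamps.

```shell
# Check the current values (both default to 0)
sysctl net.ipv4.tcp_tw_reuse net.ipv4.tcp_tw_recycle

# Allow reusing TIME_WAIT sockets for new outgoing connections
sysctl -w net.ipv4.tcp_tw_reuse=1
```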