
I have recently replaced a Fedora Core 3 server with one running CentOS 5 for the purposes of a mail relay in a DMZ.

My problem is that when the server is sending messages to a particular remote organization, messages larger than 500KB don't send properly. They seem to get hung up partway through the transmission, time out, then after the predictable retries, expire in the queue.

Sendmail describes the problem as: MDeferred: Connection timed out with [site].

Large (or larger) messages send successfully to every other remote system we have tested with. It is just this single organization which we are having trouble with. Similarly, we can send large or larger messages to this organization as long as our CentOS 5 relay is not involved.

We spent a lot of time with packet traces that were not very helpful. It appeared that after a certain depth of transmission, the other side started requesting packet retransmission, which we did, but the retransmitted packets never seemed to reach their side.

Messing with iptables (i.e., turning it off completely) didn't help either.

Today we put an XP system in the DMZ as a peer to the relay, and it can send to the remote organization just fine while at the same time the relay could not. This, to my mind, rules out all the firewalls and network paths between us and the remote organization and points the finger directly at the mail relay.

Given that I am revisiting this sendmail configuration after first setting it up on Fedora Core 3, is there something I might have done wrong while setting up this system that would manifest in this way?

David Mackintosh

1 Answer

Most of the time when I've come across this problem, disabling TCP window scaling has bypassed it. Add the following lines at the bottom of your /etc/sysctl.conf:

net.ipv4.tcp_rmem = 4096 87380 174760
net.ipv4.tcp_wmem = 4096 16384 131072
net.ipv4.tcp_window_scaling = 0

Then, as root, run sysctl -p and see what happens. Note that this is a bypass, not a solution to the problem. Things I've seen trigger this behavior include the switches your machines are connected to, the actual cables, the software version of some device in between, and various combinations of the tg3 ethernet driver and chipsets. You may even find that the problem vanishes if you install another operating system (say, OpenBSD) on the same machine.
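Before running sysctl -p, it can help to confirm what the file will actually set, since the lines are applied in order and a later duplicate overrides an earlier one. What follows is only a minimal sketch, assuming a POSIX shell and awk; `conf_value` is a hypothetical helper name, not part of sysctl.

```shell
#!/bin/sh
# Hypothetical helper (not part of sysctl): print the last value assigned
# to a key in a sysctl.conf-style file. Exits non-zero if the key is absent.
conf_value() {
    key=$1
    file=$2
    awk -F= -v key="$key" '
        /^[ \t]*[#;]/ { next }                 # skip comment lines
        {
            k = $1
            gsub(/[ \t]/, "", k)               # strip whitespace around the key
            if (k == key) {
                v = $0
                sub(/^[^=]*=[ \t]*/, "", v)    # drop everything up to "="
                sub(/[ \t]+$/, "", v)          # drop trailing whitespace
                last = v                       # later lines win
                found = 1
            }
        }
        END { if (found) print last; else exit 1 }
    ' "$file"
}
```

For example, `conf_value net.ipv4.tcp_window_scaling /etc/sysctl.conf` shows what the file will set, and `sysctl -n net.ipv4.tcp_window_scaling` shows the value the kernel is using right now, so the two can be compared before and after sysctl -p.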

I've also seen others setting the MTU to 500 to make this go away.

But like I said, I offered a possible bypass, not a solution to your problem.

adamo
  • I think an explanation of *why* the OP should do this is in order. Random sysctl value tweaks are not the way one should maintain their servers. – adaptr Nov 05 '12 at 15:29
  • These sysctl.conf changes worked. Now monitoring to ensure we have not broken anything else by accident. The implied explanation of "do this to turn off TCP Window Scaling" is adequate for me. – David Mackintosh Nov 05 '12 at 15:46
  • Why does one disable TCP window scaling? Because somewhere between the two talking parties a piece of equipment has problems with fragmentation and reassembly of the TCP packets (or [corrupts them](http://mina.naguib.ca/blog/2012/10/22/the-little-ssh-that-sometimes-couldnt.html)). Depending on the case and how much of the circuit you control, a tcpdump explains the "why". The decision is far from random. And it is not a solution, just a bypass. Why? Because smaller packets pass unharmed. – adamo Nov 05 '12 at 16:42