How to diagnose severe network problems in a small network?

Question

We have a fairly small network with managed and unmanaged switches (Netgear GS748T, Linksys SLM2024, DGS-1008D, DES-1008D, DES-1026G, SRW224G4), about 8-10 Hyper-V hosts with multiple virtual machines, a few VMware hosts, about 100 local users, and another 100 VPN users (not connected all the time). We recently introduced Forefront TMG (making it the central point of our network) and made big changes to VLANs (from one 192.168.1.X network to 5-10 VLANs splitting the network into test machines, critical servers, iSCSI, the Hyper-V cluster heartbeat, trusted users, untrusted users, etc.). Most if not all network cards use teaming, aggregation and trunking.

For the last weeks or months the network has been unstable, with iSCSI problems at night when backups run. Yesterday our network went down during the day and was unavailable for 2 hours. During that time switches hanged 2 times and required hard resets, and overall the network was not working correctly. After 2 hours everything went back to fairly normal, but it seems likely to happen again at any time.

The switches don't offer much in the way of monitoring, and neither do the backup iSCSI drives. Some errors from TMG:

Forefront TMG disconnected a non-TCP connection from 172.16.10.5 because the connection limit for this IP address was exceeded. Larger custom connection limits should be configured for the IP addresses of chained proxy servers and back-to-back Forefront TMG computers with a NAT relationship.

Forefront TMG disconnected a non-TCP connection from 172.16.10.12 because the connection limit for this IP address was exceeded. Larger custom connection limits should be configured for the IP addresses of chained proxy servers and back-to-back Forefront TMG computers with a NAT relationship.

The number of concurrent TCP connections from the source IP address 178.215.xxx.xxx exceeded the configured limit. As a result, Forefront TMG will not allow the creation of new TCP connections from this source IP. This IP address probably belongs to an attacker or an infected host. See product documentation for more info about Forefront TMG flood mitigation.

The number of denied connections from the source IP address 77.1xxx.xxx exceeded the configured limit. This may indicate that the host is infected or is attempting an attack on the Forefront TMG computer.

Forefront TMG disconnected a non-TCP connection from 172.16.10.10 because the connection limit for this IP address was exceeded. Larger custom connection limits should be configured for the IP addresses of chained proxy servers and back-to-back Forefront TMG computers with a NAT relationship.

Forefront TMG disconnected a non-TCP connection from 172.16.10.16 because the connection limit for this IP address was exceeded. Larger custom connection limits should be configured for the IP addresses of chained proxy servers and back-to-back Forefront TMG computers with a NAT relationship.

The number of denied connections from the source IP address 195.ZZZ exceeded the configured limit. This may indicate that the host is infected or is attempting an attack on the Forefront TMG computer.

The number of denied connections from the source IP address 85.ZZZ exceeded the configured limit. This may indicate that the host is infected or is attempting an attack on the Forefront TMG computer.

Forefront TMG disconnected a non-TCP connection from 172.16.231.12 because the connection limit for this IP address was exceeded. Larger custom connection limits should be configured for the IP addresses of chained proxy servers and back-to-back Forefront TMG computers with a NAT relationship.

Forefront TMG was unable to decompress a response body from stooq.pl because the response was compressed by a method which is not supported by Forefront TMG. This happens when a Web server is configured to supply responses compressed by a method that is not supported by Forefront TMG regardless of the type of compression requested.

If you want Forefront TMG to block such responses, configure the policy rule's HTTP policy to block the Content-Encoding header in responses. Otherwise, such responses will be forwarded without decompression to the client and can be cached. You can cancel or reduce the frequency of the alert generated by this event in Forefront TMG Management.

The connectivity verifier "Farm: Sharepoint.xxx.pl - Farm" reported an error when trying to connect to 14cms.xxx.xx. Reason: The request has timed out.

The connectivity verifier "DHCP1" reported an error when trying to connect to DHCP1.xxx.xx. Reason: The request has timed out.

We have already adjusted TMG and set higher limits for our AD/DNS servers, as we had seen these messages before, but it seems to be happening all over.

score 3 · Accepted Answer · answered Nov 17 '11 at 11:48

"During that time switches hanged 2 times and required hard resets"

I'm not trying to be elitist here, but Linksys/D-Link/Netgear isn't even mid-size grade hardware. iSCSI and virtualization require a very stable and fast network to perform properly.

I strongly suggest you buy better networking gear (Cisco, HP etc).

pauska

Managed switches will really help you here. I'm not sure how you are doing VLANs without managed switches, actually, because that is defined on the switch... Changing things on the network card like MTU (for iSCSI), or NIC clustering with some cards and unmanaged switches, can cause real issues because you end up flooding the port(s) with traffic they cannot handle until everything falls back to the lowest possible setting to work. – Top__Hat Nov 17 '11 at 12:55
We have managed switches, and some unmanaged ones (in rooms, to split a connection for a couple of computers) – MadBoy Nov 17 '11 at 13:07
Without managed switches how do you know that you don't have a switch loop somewhere? I'm with pauska, replace the unmanaged switches with enterprise grade managed switches so that you have insight and control of every switch in the environment. – joeqwerty Nov 17 '11 at 13:30
@joeqwerty we do have managed switches, but also a couple of unmanaged ones. So far the managed switches don't really give us much insight into what is going on, so I guess we need to replace all of them too. – MadBoy Nov 17 '11 at 15:29
Also, it was all working fine until we started using TMG at full power and switched to multiple VLANs, dropping our old Draytek. The same equipment under 192.168.1.x with no VLANs was working fine... at least it feels like the problems came after the changes (I can't really tell the timeline, as it's been having its ups and downs for a long while now). Or I should say physical VLANs were there. Now we use virtual VLANs, so basically every port is capable of everything. – MadBoy Nov 17 '11 at 15:30
@MadBoy: Gotcha. What I'm saying is that without being able to manage the unmanaged switches you have no way of knowing whether or not a switch loop exists, no way of knowing how the VLAN traffic is being transited, etc., etc. – joeqwerty Nov 17 '11 at 15:33
Could a switch loop cause havoc inside the network, bringing everything to its knees? – MadBoy Nov 17 '11 at 15:45

It certainly could and would, as packets would be infinitely forwarded around the network, consuming all of the CPU and memory resources on the switches and saturating all of the links with the forwarded traffic. SwitchA thinks HostA is via SwitchB, SwitchB thinks HostA is via SwitchA, and so on and so on... for every packet sent into the network. Catastrophe and calamity would ensue. – joeqwerty Nov 17 '11 at 18:37
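
To make the loop scenario in the comments a bit more concrete, here is a minimal Python sketch that checks a hand-written cabling list for links that close a loop. The switch names and the link list are hypothetical, not taken from the question; the idea is only that any link joining two switches that are already connected by another path is redundant, and without spanning tree (or proper link aggregation) such a link lets broadcast frames circulate forever.

```python
def find_redundant_links(links):
    """Return the links that close a loop in an undirected switch topology.

    `links` is a list of (switch_a, switch_b) pairs, one per physical cable
    (or one per aggregated group, if the ports are bundled correctly).
    Uses a simple union-find: a link whose two ends are already in the
    same connected component creates a cycle.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    redundant = []
    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            redundant.append((a, b))  # this cable closes a loop
        else:
            parent[root_a] = root_b
    return redundant


if __name__ == "__main__":
    # Hypothetical cabling: two room switches uplinked to a core switch,
    # plus one extra patch cable between the two room switches.
    cabling = [
        ("core", "room-a"),
        ("core", "room-b"),
        ("room-a", "room-b"),
    ]
    for a, b in find_redundant_links(cabling):
        print(f"loop closed by link {a} <-> {b}")
```

This only works if the cabling list is accurate, which is exactly what is hard to confirm when some switches are unmanaged and cannot report their neighbours.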

score 0 · Answer 2 · answered Nov 17 '11 at 12:26

Look at the error messages in TMG relating to internal traffic (172.16.x.x ones are a good place to start). Figure out what hosts those relate to, and whether or not these are appropriate actions for the firewall to take for the traffic on those hosts.
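
As a starting point for that, the connection-limit alerts can be tallied per source address. The following is a rough Python sketch, assuming the alerts have been copied out of TMG as plain text, one alert per string; the sample strings are shortened versions of the alerts quoted in the question, and the function names and the `172.16.` prefix filter are illustrative choices, not anything TMG provides.

```python
import re
from collections import Counter

# Shortened copies of alerts quoted in the question; a real run would
# load whatever text was exported or pasted from the TMG alerts view.
ALERTS = [
    "Forefront TMG disconnected a non-TCP connection from 172.16.10.5 because the connection limit for this IP address was exceeded.",
    "Forefront TMG disconnected a non-TCP connection from 172.16.10.12 because the connection limit for this IP address was exceeded.",
    "The number of denied connections from the source IP address 195.ZZZ exceeded the configured limit.",
    "Forefront TMG disconnected a non-TCP connection from 172.16.10.10 because the connection limit for this IP address was exceeded.",
    "Forefront TMG disconnected a non-TCP connection from 172.16.10.16 because the connection limit for this IP address was exceeded.",
    "Forefront TMG disconnected a non-TCP connection from 172.16.231.12 because the connection limit for this IP address was exceeded.",
]

# Matches only the "connection limit exceeded" alert and captures the IPv4 source.
LIMIT_RE = re.compile(
    r"non-TCP connection from (\d{1,3}(?:\.\d{1,3}){3}) because the connection limit"
)


def count_limit_hits(alerts):
    """Count connection-limit alerts per source IP address."""
    hits = Counter()
    for text in alerts:
        match = LIMIT_RE.search(text)
        if match:
            hits[match.group(1)] += 1
    return hits


def internal_only(hits, prefix="172.16."):
    """Keep only addresses on the internal range (the prefix is an assumption)."""
    return {ip: n for ip, n in hits.items() if ip.startswith(prefix)}


if __name__ == "__main__":
    for ip, n in sorted(internal_only(count_limit_hits(ALERTS)).items()):
        print(f"{ip}: {n} connection-limit alert(s)")
```

Each address that shows up can then be matched to a host (DNS server, Hyper-V node, backup target, and so on) to decide whether it deserves a higher custom limit or is misbehaving.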

Never assume that a firewall comes with an appropriate configuration out of the box, especially if that firewall is to be deployed internally.

I'd also suggest using separate switches for your iSCSI Storage Network, rather than trying to segregate the traffic with VLANs. A lot easier to get your head around, and you really need to get your iSCSI traffic right if you are using it for VM hard drives!
