2

We are seeing strange behaviour in our ESX cluster:

The Infrastructure:

We have two ESXi 5.5.0 (build 2718055) hosts in a cluster, managed by vCenter. We are using an Essentials licence, so we do not have distributed switches. Our company network has multiple VLANs, of which about 10 are needed by VMs. The hardware is HP DL380 Gen8 with 8 1 Gb Ethernet ports. The switch ports (Cisco 2960E and 3850E) connected to the servers are configured as Cisco VLAN trunks, so all packets arrive with their VLAN tag. The physical networking is completely redundant: one of the two switches AND one of the two network cards in a server can fail without crashing the VMs.

All switchports are configured the same,

I am using 2 virtual switches (on each host), each switch has assigned

The Problem

When I reboot a VM that is placed on esx1 and uses automatic IP address configuration, the machine does not get a DHCP lease. The network connection itself is available: if I set a manual IP address, everything works fine. But ipconfig /renew hangs, and DHCPExplorer does not find a valid DHCP server (which I can ping if I assign a manual IP address).

Now I have to migrate the machine to esx2 and wait for some time (or run ipconfig /renew, or disable and re-enable the NIC) before the machine gets a DHCP address. After that I can move the machine back to esx1 and it works perfectly fine. At that point I even get positive results from DHCPExplorer.

I then tested whether the behaviour was connected to the physical part of the network: I removed all physical NICs but one from the port group with the affected VLAN, did some reboots with a DHCP machine, and then repeated the test with another NIC. In short, I forced all traffic from this port group to go through one physical port of the NIC and the switch.

The result was: the problem only occurs on two different ports on two different NICs, but they are both connected to the same switch.
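
For reference, the single-uplink test described above could be scripted roughly as follows. This is only a sketch, assuming PowerCLI with an existing Connect-VIServer session; the host ID, port group name and vmnic names are taken from the export further down in this question, and the choice of vmnic4 as the uplink under test is just an example.

```powershell
# Sketch only: assumes PowerCLI is loaded and Connect-VIServer has already been run.
$vmhost = Get-VMHost -Id 'HostSystem-host-1002'
$pg     = Get-VirtualPortGroup -VMHost $vmhost -Name 'PORTGROUP35'

# vSwitch2 uplinks on this host according to the export: vmnic8, vmnic4, vmnic6, vmnic2.
$active = Get-VMHostNetworkAdapter -VMHost $vmhost -Physical -Name 'vmnic4'
$unused = Get-VMHostNetworkAdapter -VMHost $vmhost -Physical -Name 'vmnic2', 'vmnic6', 'vmnic8'

# Override the failover order on this port group only, so that all of its
# traffic leaves through the one uplink under test.
$pg | Get-NicTeamingPolicy |
    Set-NicTeamingPolicy -MakeNicActive $active -MakeNicUnused $unused

# When the test is finished, the port group can inherit the vSwitch order again:
# $pg | Get-NicTeamingPolicy | Set-NicTeamingPolicy -InheritFailoverOrder $true
```

Changing the policy at port group level leaves the other VLANs on the same vSwitch with their full set of uplinks during the test.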

It seems to me as if this switch is somehow blocking access to the DHCP service. Has anyone seen behaviour like this? I am running out of options. We want to upgrade to ESX 6 soon, but since we also have VMware View desktop virtualisation, the upgrade process will involve a lot of work and testing and can't be done quickly...

EDIT:

Since the visual config of our switches is too large for the screen, I exported the virtual switches and port groups via PowerShell.

The problematic host is host-1002, the problematic NICs I identified are vmnic4 and vmnic8, and the port groups where the problem was observed are PortGroup35 and PortGroup41.

 Get-Virtualswitch|select Name, ID, NumPorts, NumPortsAvailable, Nic, MTU, VMHostID

RESULT:

Name              : vSwitch0
Id                : key-vim.host.VirtualSwitch-vSwitch0
NumPorts          : 4352
NumPortsAvailable : 4309
Nic               : {vmnic7, vmnic0, vmnic2, vmnic9}
Mtu               : 1500
VMHostId          : HostSystem-host-1001

Name              : vSwitch2
Id                : key-vim.host.VirtualSwitch-vSwitch2
NumPorts          : 4352
NumPortsAvailable : 4309
Nic               : {vmnic3, vmnic1, vmnic6, vmnic8}
Mtu               : 1500
VMHostId          : HostSystem-host-1001

Name              : vSwitch5
Id                : key-vim.host.VirtualSwitch-vSwitch5
NumPorts          : 4352
NumPortsAvailable : 4309
Nic               : {vmnic4}
Mtu               : 1500
VMHostId          : HostSystem-host-1001

Name              : vSwitch0
Id                : key-vim.host.VirtualSwitch-vSwitch0
NumPorts          : 4352
NumPortsAvailable : 4304
Nic               : {vmnic7, vmnic3, vmnic5, vmnic9}
Mtu               : 1500
VMHostId          : HostSystem-host-1002

Name              : vSwitch2
Id                : key-vim.host.VirtualSwitch-vSwitch2
NumPorts          : 4352
NumPortsAvailable : 4304
Nic               : {vmnic8, vmnic4, vmnic6, vmnic2}
Mtu               : 1500
VMHostId          : HostSystem-host-1002

Name              : vSwitch5
Id                : key-vim.host.VirtualSwitch-vSwitch5
NumPorts          : 4352
NumPortsAvailable : 4304
Nic               : {vmnic1}
Mtu               : 1500
VMHostId          : HostSystem-host-1002


Get-Virtualportgroup|select Name, VirtualSwitchId, Key, VLANId, VMHostID

RESULT:

Name            : PORTGROUP82
VirtualSwitchId : key-vim.host.VirtualSwitch-vSwitch0
Key             : key-vim.host.PortGroup-PORTGROUP82
VLanId          : 82
VMHostId        : HostSystem-host-1001

Name            : PORTGROUP90
VirtualSwitchId : key-vim.host.VirtualSwitch-vSwitch0
Key             : key-vim.host.PortGroup-PORTGROUP90
VLanId          : 90
VMHostId        : HostSystem-host-1001

Name            : PORTGROUP83
VirtualSwitchId : key-vim.host.VirtualSwitch-vSwitch0
Key             : key-vim.host.PortGroup-PORTGROUP83
VLanId          : 83
VMHostId        : HostSystem-host-1001

Name            : PORTGROUP16
VirtualSwitchId : key-vim.host.VirtualSwitch-vSwitch0
Key             : key-vim.host.PortGroup-PORTGROUP16
VLanId          : 16
VMHostId        : HostSystem-host-1001

Name            : Management Network
VirtualSwitchId : key-vim.host.VirtualSwitch-vSwitch0
Key             : key-vim.host.PortGroup-Management Network
VLanId          : 41
VMHostId        : HostSystem-host-1001

Name            : PORTGROUP80
VirtualSwitchId : key-vim.host.VirtualSwitch-vSwitch2
Key             : key-vim.host.PortGroup-PORTGROUP80
VLanId          : 80
VMHostId        : HostSystem-host-1001

Name            : PORTGROUP41
VirtualSwitchId : key-vim.host.VirtualSwitch-vSwitch2
Key             : key-vim.host.PortGroup-PORTGROUP41
VLanId          : 41
VMHostId        : HostSystem-host-1001

Name            : PORTGROUP35
VirtualSwitchId : key-vim.host.VirtualSwitch-vSwitch2
Key             : key-vim.host.PortGroup-PORTGROUP35
VLanId          : 35
VMHostId        : HostSystem-host-1001

Name            : VMkernel
VirtualSwitchId : key-vim.host.VirtualSwitch-vSwitch5
Key             : key-vim.host.PortGroup-VMkernel
VLanId          : 0
VMHostId        : HostSystem-host-1001

Name            : PORTGROUP43
VirtualSwitchId : key-vim.host.VirtualSwitch-vSwitch0
Key             : key-vim.host.PortGroup-PORTGROUP43
VLanId          : 43
VMHostId        : HostSystem-host-1001

Name            : PORTGROUP82
VirtualSwitchId : key-vim.host.VirtualSwitch-vSwitch0
Key             : key-vim.host.PortGroup-PORTGROUP82
VLanId          : 82
VMHostId        : HostSystem-host-1002

Name            : PORTGROUP83
VirtualSwitchId : key-vim.host.VirtualSwitch-vSwitch0
Key             : key-vim.host.PortGroup-PORTGROUP83
VLanId          : 83
VMHostId        : HostSystem-host-1002

Name            : PORTGROUP90
VirtualSwitchId : key-vim.host.VirtualSwitch-vSwitch0
Key             : key-vim.host.PortGroup-PORTGROUP90
VLanId          : 90
VMHostId        : HostSystem-host-1002

Name            : PORTGROUP16
VirtualSwitchId : key-vim.host.VirtualSwitch-vSwitch0
Key             : key-vim.host.PortGroup-PORTGROUP16
VLanId          : 16
VMHostId        : HostSystem-host-1002

Name            : Management Network
VirtualSwitchId : key-vim.host.VirtualSwitch-vSwitch0
Key             : key-vim.host.PortGroup-Management Network
VLanId          : 41
VMHostId        : HostSystem-host-1002

Name            : PORTGROUP80
VirtualSwitchId : key-vim.host.VirtualSwitch-vSwitch2
Key             : key-vim.host.PortGroup-PORTGROUP80
VLanId          : 80
VMHostId        : HostSystem-host-1002

Name            : PORTGROUP41
VirtualSwitchId : key-vim.host.VirtualSwitch-vSwitch2
Key             : key-vim.host.PortGroup-PORTGROUP41
VLanId          : 41
VMHostId        : HostSystem-host-1002

Name            : PORTGROUP35
VirtualSwitchId : key-vim.host.VirtualSwitch-vSwitch2
Key             : key-vim.host.PortGroup-PORTGROUP35
VLanId          : 35
VMHostId        : HostSystem-host-1002

Name            : VMkernel
VirtualSwitchId : key-vim.host.VirtualSwitch-vSwitch5
Key             : key-vim.host.PortGroup-VMkernel
VLanId          : 0
VMHostId        : HostSystem-host-1002

Name            : PORTGROUP43
VirtualSwitchId : key-vim.host.VirtualSwitch-vSwitch0
Key             : key-vim.host.PortGroup-PORTGROUP43
VLanId          : 43
VMHostId        : HostSystem-host-1002
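
Since the export is long, a quick way to check it for differences between the two hosts might look like the sketch below. It assumes PowerCLI with an existing Connect-VIServer session and uses the two host IDs from the output above; it compares only the name, VLAN ID and vSwitch assignment of each port group, plus the load balancing policy of each vSwitch.

```powershell
# Sketch only: assumes PowerCLI is loaded and Connect-VIServer has already been run.
$host1 = Get-VMHost -Id 'HostSystem-host-1001'
$host2 = Get-VMHost -Id 'HostSystem-host-1002'

$props = 'Name', 'VLanId', 'VirtualSwitchId'
$pg1 = Get-VirtualPortGroup -VMHost $host1 | Select-Object $props
$pg2 = Get-VirtualPortGroup -VMHost $host2 | Select-Object $props

# No output means the port groups match on the compared properties.
Compare-Object -ReferenceObject $pg1 -DifferenceObject $pg2 -Property $props

# Teaming settings per vSwitch, listed for both hosts for a side-by-side check.
Get-VirtualSwitch -VMHost $host1, $host2 -Standard |
    Get-NicTeamingPolicy |
    Select-Object VirtualSwitch, LoadBalancingPolicy, ActiveNic, StandbyNic
```

The vmnic names are not compared on purpose, because the export shows that the two hosts number their uplinks differently.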

EDIT: NEW INFORMATION

Now I have realised why the problem only happens on esx1: the DHCP server for these machines is a VM placed on esx2, so DHCP requests from machines on esx2 do not even have to leave the virtual switch. If I move the DHCP server to esx1, the problem is solved there and starts on esx2. Still only one switch is affected; the other one works fine. So in my opinion the problem definitely lies in the physical switch, not the virtual one.

Tobias
  • More details please - ideally show us as much of your VSS configs as possible. It's a pity you only have that licence as Host Profiles would be ideal for this situation, to ensure both hosts are configured identically - I suspect a very small configuration difference on either host 1's VSS or on the switch ports. – Chopper3 Oct 07 '15 at 12:20
  • I'm not in the office right now, but I will post the VSS config soon. But since the network itself is working (when I assign a manual address I get a connection immediately), I can't believe it's the virtual switch. – Tobias Oct 07 '15 at 12:41
  • I'm thinking more along the lines of 'DHCP Helper' config on the switch rather than just the L2-specific config. – Chopper3 Oct 07 '15 at 12:52
  • On the VSS there is no IP helper config. And since the DHCP server is in the same subnet (private /24) as the machines, why should there be an IP helper configured? – Tobias Oct 07 '15 at 12:56
  • It's a Cisco thing, not a VMW thing - and physical L3 switches need to have this setting correctly configured to allow for DHCP to work, just being in the same VLAN usually isn't enough. – Chopper3 Oct 07 '15 at 13:06
  • Okay, did not know that! I will ask my network colleague... – Tobias Oct 07 '15 at 13:09
  • @Chopper3 I narrowed the problem down to the physical switch (see the edit of my question). What exactly do I have to set on the Cisco switch to allow DHCP traffic through the same VLAN? – Tobias Oct 08 '15 at 07:46
  • What balancing mode is configured on your vSwitches? – Alexander Tolkachev May 15 '17 at 22:06

2 Answers

0

Your switch may have inconsistent spanning tree settings on the different switch ports.

How long are you waiting before you consider this "failed"? Do you have access to the Cisco switch configuration?


Outside of that, it would be good to see your Virtual Switch configuration like this example.

ewwhite
  • I don't have access to the switch config, but I share an office with the guy who has... We checked it and can't see any difference in the port config. Spanning tree was our first clue, and we saw that the STP settings were different in the beginning. But changing this did not solve the problem. And how long do I wait? Sometimes it happens in the evening, and when I arrive in the morning, the machine is still showing the 169. address. So I would say: hours sometimes. – Tobias Oct 07 '15 at 12:25
  • @Tobias See my edit and post a screenshot of your switch configuration. – ewwhite Oct 07 '15 at 12:28
0

Thanks for updating your question and comments. Basically, you need to set a 'DHCP Helper' on the specific switch for that port/VLAN.

On the switch, do:

enable
conf t
int {whatever port}
ip helper-address {DHCP server IP or cluster VIP}

Then test and, if successful, write your config back to startup.

Chopper3
  • We will test this. But even if it works: I just cannot believe that I have to manually set the DHCP server address on every switch port, although I just want to reach the server from within the same subnet! If this is the case, 1) I don't understand why it is working on all other switches (without configuring the DHCP helper), but not on this one, 2) in my opinion (as a Windows admin with little knowledge about networking and Cisco) this behaviour makes Cisco the worst network hardware supplier in the world. – Tobias Oct 08 '15 at 08:08