At one site, we have vSphere hosts with two 10G NICs.
- Both NICs are in one vSwitch
- load balancing policy is Route based on IP hash, with a static LAG on the switch side (this balances and fails over VM traffic)
- Both NICs (HPE 534FLR-SFP+/QLogic 57810) have hardware iSCSI initiators enabled
- Each initiator is bound to a vmkernel port one-to-one.
- One iSCSI subnet
- Both HBAs can access all SAN targets across the switch stack (roughly verified as sketched below)
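For reference, this is roughly how we verify the binding and target access on the ESXi side (the vmhba name below is from our hosts and is only an example; yours will differ):

```
# List iSCSI adapters; the 57810 offload initiators show up as vmhbaNN
esxcli iscsi adapter list

# Show which vmkernel port each HBA is bound to (one vmk per HBA in our setup)
esxcli iscsi networkportal list --adapter=vmhba33

# Static/discovered targets as seen by this HBA
esxcli iscsi adapter target list --adapter=vmhba33
```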
The switches are Cisco Catalyst 3850s, and MPIO works like a charm; we tested failover at the path level and so on.
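The path-level failover testing was essentially this (pull a link or shut a switch port, then confirm path counts and states):

```
# Every LUN should keep at least one active path per surviving HBA
esxcli storage core path list

# Active iSCSI sessions per offload initiator
esxcli iscsi session list --adapter=vmhba33
```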
We deployed a similar configuration at another site a few days ago: same vSphere configuration, same NICs. This time, however, the configuration does not work properly.
Initiators can only access targets on ports of the same switch the initiator is connected to, not targets on ports of the other switch.
With tcpdump I can see the vmkernel ports doing discovery to all targets (vSphere handles this on the vmkernel side according to the documentation), and the static discovery targets do appear (the SAN sees it has been poked). However, paths are never created, and esxcli shows error 0x0004 (a transport error?) for targets on the other switch. This is hard to investigate because we can't directly see the HBA traffic. Software iSCSI, bound to the same vmkernel ports, works like a charm.
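For what it's worth, the failing state looks roughly like this. One caveat: vmkping only exercises the vmkernel network stack, which software iSCSI already proves is healthy; the offload HBA runs its own TCP/IP stack, so a passing vmkping doesn't clear the HBA path (the vmk name and target IP below are just examples):

```
# Discovered portals per adapter; targets behind the other switch error out and never get a session
esxcli iscsi adapter target portal list

# Reachability from the bound vmkernel port (vmkernel stack only, not the HBA);
# -d -s 8972 tests full-size frames in case jumbo frames (MTU 9000) are in play
vmkping -I vmk2 -d -s 8972 192.168.50.20
```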
The switches are Cisco Nexus this time (I'll update the model when I know it), and the "stacking" is vPC (virtual PortChannel, if I understand correctly) instead of the 3850's native StackWise; some NX-OS checks I plan to walk through are sketched after the list below. Otherwise the sites are mostly the same, and IMHO the differences are too minor to matter. Just to point out some:
- HPE ProLiant Gen9 vs Gen10
- vSphere 6.5 vs 6.7 (our backup software now supports 6.7, so we'll update the old site shortly)
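And these are the NX-OS-side checks I plan to walk through with our networking partner, assuming the Nexus pair really is in a vPC domain (I haven't confirmed the exact topology yet):

```
! Peer-link and vPC state; consistency problems can break traffic that crosses the peer link
show vpc
show vpc consistency-parameters global

! Port-channel membership and mode (static "on" vs LACP) toward the ESXi hosts
show port-channel summary
```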
I've searched the VMware documentation and found nothing saying this kind of converged networking shouldn't work. We consulted our networking partner, but they didn't understand how the current setup works at all and thought it shouldn't work in the first place.
Is this configuration normal, or are we depending on some implementation quirk of the Catalyst 3850 that doesn't carry over to other switches? Or is there something obviously wrong with the switches?