
I have a fairly simple OpenStack setup for a PoC: 2 nodes, both running Nova, with everything else on node 1. It is running CentOS 6 and was set up using RDO. Importantly, I am using Neutron for the networking, with GRE tenant networks set up following the RDO docs for an existing network.

Periodically (every few days, I reckon) I lose all communication with Open vSwitch (and thus with my instances). I know it is OVS, because I can SSH into node 2, then connect to node 1 via their private network. The most telling thing I see in the logs is this:

unix:/var/run/openvswitch/db.sock: database connection failed (Protocol error)

In addition, OVS is using HUGE amounts of CPU (800% on my 16-core boxes), and when I try to do a clean shutdown, it never completes because it cannot kill ovsdb-server.

I have done some Googling and found some old suggestions, based on older OpenStack releases, where people had OVS/kernel version mismatches. As I am running the versions from RDO, I reckon I can discount that (unless Red Hat have made a massive screw-up).

Has anyone else seen this? Any suggestions?

PS: Do not tell me to recompile Open vSwitch; for various reasons that is not happening in the immediate future.

masegaloeh
chriscowley
  • Can I ask what sort of hardware you're on? We had severe issues with Open vSwitch. For us it used to cause kernel panics, so we decided to replace OVS with Linux bridge. Everything has been quite smooth ever since. We're on Cisco UCS hardware and OVS is a pain!!! It is not ready for production yet! – Nikolas Sakic Mar 19 '15 at 02:40

1 Answer


Which version of OpenStack and which version of the RDO repo are you using? I'm merely guessing with so little detail, but what you describe looks like some kind of issue between Open vSwitch and your kernel, with a runaway OVS process. It could also be database or messaging agent related.

Check your qpid logs in /var/log/messages for anything that gives a reason for a disconnect at the time your instances lost communication. This could reveal why there are messaging disconnects, and which way round the cause runs: either a messaging connection failure (an external or tertiary cause) is taking OVS down, or an OVS disconnect (likely an OVS/kernel build issue) is breaking messaging.
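As a starting point for that log check, here is a minimal sketch. The function name `find_disconnects` and the search pattern are my own assumptions, not anything RDO ships; adjust the pattern to whatever your services actually log.

```shell
# Hypothetical helper: print log lines that mention qpid or the OVS database.
# Usage: find_disconnects [logfile]   (defaults to /var/log/messages)
find_disconnects() {
    local logfile="${1:-/var/log/messages}"
    # -i: case-insensitive, -E: extended regex; db\.sock matches the socket
    # path from the "database connection failed" error in the question.
    grep -iE 'qpid|ovsdb|openvswitch|db\.sock' "$logfile"
}
```

Run `find_disconnects` on both nodes and compare timestamps around the moment the instances dropped off the network. Whichever service complains first is the more likely cause.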

Since RDO is "...tested on a RHEL 6.4", I would use CentOS 6.4 at a minimum, rather than plain 6 as you state. Better still, use 6.5, as its kernel includes a number of components that otherwise have to be patched in by RDO.

Additional troubleshooting on your behalf is difficult without logs and details of your config. Once you have assessed the above, be aware that there are known Neutron configuration challenges to overcome with GRE and MTU settings.

In any case, for a successful OpenStack build (no matter how basic, it is complicated), you need to start with a supported and up-to-date build of the OS, kernel and OVS. How can you be sure that you can discount an "OVS/kernel version mismatch"? What versions are you using?

I'd suggest you rebuild with the latest CentOS 6.5 and RDO. If the issue persists, re-post (with updated details, logfiles, etc.) and also post on the RDO forum: http://openstack.redhat.com/forum/ as there you will get the distro-specific details that you may need.

EDIT: Check your dhcp_agent.ini and dnsmasq config for MTU settings using these articles. Apparently 1454 should be about right for guest instances when running GRE: http://bderzhavets.blogspot.com.au/2014/01/setting-up-two-physical-node-openstack.html https://ask.openstack.org/en/question/12499/forcing-mtu-to-1400-via-etcneutrondnsmasq-neutronconf-per-daniels/

Don't forget there could still be issues with MTU and GRE depending on your kernel and OVS versions, so please advise what versions you have and update your post; that will also help others with similar issues. On both nodes, show the results for:
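To test whether a given MTU really passes through the tunnel unfragmented, you can ping from inside a guest with the "don't fragment" bit set. A minimal sketch: the helper name `ping_payload_for_mtu` is hypothetical, and it assumes IPv4 with no IP options (20-byte IP header plus 8-byte ICMP header).

```shell
# Hypothetical helper: ICMP payload size that yields a packet of exactly
# the given MTU over IPv4 (20 bytes IP header + 8 bytes ICMP header).
ping_payload_for_mtu() {
    local mtu="$1"
    echo $(( mtu - 28 ))
}

# Example (not run here): from inside a guest, with 10.0.0.1 standing in
# for an address on the far side of the GRE tunnel:
#   ping -M do -c 3 -s "$(ping_payload_for_mtu 1454)" 10.0.0.1
```

`-M do` is the iputils option that forbids fragmentation; minimal guest images with a BusyBox ping may not support it. If the ping fails at the size for 1454 but works at a smaller size, the MTU the guest is using is too large for the path.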

uname -a

rpm -qa | grep openvswitch
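If it is easier, the version details can be gathered in one go. This is only a sketch: the function name `collect_versions` is made up, and it skips any tool that is not installed rather than failing.

```shell
# Hypothetical helper: print kernel, package and OVS version details.
collect_versions() {
    echo "== kernel =="
    uname -a

    echo "== openvswitch packages =="
    if command -v rpm >/dev/null 2>&1; then
        rpm -qa | grep -i openvswitch || echo "(no openvswitch packages found)"
    else
        echo "(rpm not available)"
    fi

    echo "== ovs userspace =="
    if command -v ovs-vsctl >/dev/null 2>&1; then
        ovs-vsctl --version
    else
        echo "(ovs-vsctl not available)"
    fi

    echo "== ovs kernel module =="
    if command -v modinfo >/dev/null 2>&1; then
        # version/vermagic show which kernel the module was built for
        modinfo openvswitch 2>/dev/null | grep -E '^(version|vermagic)' \
            || echo "(openvswitch module info not found)"
    else
        echo "(modinfo not available)"
    fi
}
```

Run `collect_versions` on both nodes and paste the output into your question. A `vermagic` that does not match the `uname -r` kernel would point to the mismatch you are hoping to rule out.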

Also take a look at your OVS GRE flows and run some tcpdumps in the relevant qrouter namespace while you are making your large 20G transfer. This guide from RDO will help: take a look at Joe Talerico's great explanation of GRE debugging on two nodes, from 60 minutes onwards: http://www.youtube.com/watch?v=wEa_8ESxPAY&feature=share&t=1h20s
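A rough sketch of how to find the namespace to capture in. The helper name `qrouter_namespaces` is hypothetical, and `br-tun` in the comments is assumed to be your tunnel bridge (it is the usual default for the Neutron OVS agent, but check your own config).

```shell
# Hypothetical helper: pick the qrouter namespaces out of `ip netns list`
# output supplied on stdin.
qrouter_namespaces() {
    # Print only the first field, since newer iproute2 appends "(id: N)".
    awk '/^qrouter-/ { print $1 }'
}

# Examples (not run here), as root on the network node:
#   ip netns list | qrouter_namespaces
#   ip netns exec <one-of-the-names-printed> tcpdump -n -i any
#   ovs-ofctl dump-flows br-tun
```

Leave the tcpdump running during the transfer and watch for ICMP "fragmentation needed" messages or retransmissions. Those would point back at the MTU settings.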

And finally you also need to check you aren't being affected by Generic Receive Offload config as per post #24: https://bugs.launchpad.net/neutron/+bug/1252900
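To see whether GRO is currently enabled on an interface, something like the following may help. The helper name `gro_state` is hypothetical, and `eth0` in the comments is a placeholder for whichever NIC carries your GRE traffic.

```shell
# Hypothetical helper: read `ethtool -k` output on stdin and print the
# state ("on" or "off") of generic-receive-offload.
gro_state() {
    awk -F': *' '/^[[:space:]]*generic-receive-offload:/ {
        split($2, words, " ")   # drop any trailing "[fixed]" marker
        print words[1]
        exit
    }'
}

# Examples (not run here), as root:
#   ethtool -k eth0 | gro_state     # show current state
#   ethtool -K eth0 gro off         # disable GRO until the next reboot
```

Note that lowercase `-k` shows the offload settings and uppercase `-K` changes them. The change does not survive a reboot unless you also add it to the interface configuration.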