Connection timeouts from Nova_compute to Keystone, RabbitMQ etc

Question

I've been working on (and off) a deployment of OpenStack over the past few months (nearly a year), and I've come across a number of issues during the deployment, most of which were caused by either a bad switch configuration or a bad configuration in the heat templates.

I've been able to complete a fresh deployment of OpenStack successfully multiple times. However, as I was preparing the Overcloud with projects, I was unable to create an instance. From the output of "compute service list":

openstack compute service list
+----+----------------+----------------------+----------+---------+-------+----------------------------+
| ID | Binary         | Host                 | Zone     | Status  | State | Updated At                 |
+----+----------------+----------------------+----------+---------+-------+----------------------------+
|  1 | nova-conductor | controller-0.host.cp | internal | enabled | up    | 2021-04-20T20:43:03.000000 |
|  2 | nova-scheduler | controller-0.host.cp | internal | enabled | up    | 2021-04-20T20:43:01.000000 |
| 12 | nova-compute   | compute-0.host.cp    | nova     | enabled | down  | 2021-04-20T09:47:52.000000 |
+----+----------------+----------------------+----------+---------+-------+----------------------------+

I also attempted a scale-out with one additional node, and I've noticed that it's not present in the list above or in the "hypervisor list", but it is visible in a "server list" from the undercloud node:

openstack server list
+--------------------------------------+--------------+--------+-----------------------+----------------+-----------+
| ID                                   | Name         | Status | Networks              | Image          | Flavor    |
+--------------------------------------+--------------+--------+-----------------------+----------------+-----------+
| 5cb29129-7ce8-439a-b00b-3868d5a9aa74 | compute-1    | ACTIVE | ctlplane=10.128.0.136 | overcloud-full | baremetal |
| 58c3d587-d2a8-4601-87a7-3fd3d32a78b6 | controller-0 | ACTIVE | ctlplane=10.128.0.5   | overcloud-full | baremetal |
| 288dde8f-5664-42b2-b9f4-333992964dde | compute-0    | ACTIVE | ctlplane=10.128.0.75  | overcloud-full | baremetal |
+--------------------------------------+--------------+--------+-----------------------+----------------+-----------+

I've carried out 2 fresh installs, and I'm now faced with the following issue for all compute services that are intended to connect to the Controller node:

2021-04-23 22:28:37.891 7 ERROR nova keystoneauth1.exceptions.connection.ConnectFailure: Unable to establish connection to http://10.127.2.8:5000/v3/auth/tokens: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response',))

A manual curl from the compute node to the keystone endpoint yields the following (expected) output:

curl http://10.127.2.8:5000/v3/auth/tokens
{"error":{"code":401,"message":"The request you have made requires authentication.","title":"Unauthorized"}}

I don't believe that the network stack is causing this issue; I suspect it's something else. I'd appreciate any assistance with this.

Deployment Information:

Controller nodes = 1
Compute nodes = 2 deployed, 4 introspected
OS = CentOS Stream 8 (both undercloud and overcloud)
Networking:

4 Interfaces: 1 primary, 2 port bond (OVS + LACP), 1 storage port
2 Juniper EX3400's clustered (LACP configured on bonded ports)

Let me know if any further information is required.

EDIT:

Here is a TCP dump from both Compute and Controller, outlining the transaction of the call to keystone: https://pastebin.com/ADT4RCun

Check the Keystone and web server logs. Some tests I would conduct (in no particular order): Double-check that the compute node has the right credentials; `tcpdump -xX port 5000` to see the exact communication with Keystone; `openstack token issue` on the controller, then run `curl -H "x-auth-token THE_TOKEN_YOU_ISSUED" ...` on the compute node; install the openstack client on the compute node and use `--debug`. — berndbausch, Apr 24 '21 at 01:04
So for testing keystone, this is the output that I get: `{"error":{"code":404,"message":"No token in the request","title":"Not Found"}}`. As for the tcpdump, I had to specify vlan111 as the interface, otherwise it defaults to br-ex. And I can see that from a TCP standpoint the 2 ends are communicating, but I never see the keystone auth response from the controller on the compute's tcpdump output. I'll add the tcpdump to the main question once I've pulled it all from the hosts — humroben, Apr 24 '21 at 11:01
I also took the POST request data from the TCP dump, and I'm able to replicate without any of the nova services. The output I ended up getting was this `curl: (56) Recv failure: Connection reset by peer`. — humroben, Apr 24 '21 at 11:11
The TCPdump ended up being all I needed. It was a mis-configuration from my side regarding the switches. Thanks for that — humroben, Apr 24 '21 at 11:37
Perfect. My curl command missed a colon, which is probably why it failed. Correction: `curl -H "x-auth-token: THE_TOKEN_YOU_ISSUED" ... ` — berndbausch, Apr 24 '21 at 11:38

score 0 · Answer 1 · answered Apr 24 '21 at 11:37

After looking over the TCP dump that I had added to the question, I noticed that all requests with a length below 1500 were successful, while anything above that was dropped.

As part of the nic configuration I had the MTU set to 9000, so all the interfaces on the servers were using this value. The matching MTU was never configured on the switches.

From the nic configs:
    - type: ovs_bridge
      name: bridge_name
      dns_servers:
        get_param: DnsServers
      members:
      - type: ovs_bond
        name: bond0
        mtu: 9000

From the switch side:
> show interfaces ae5    
Physical interface: ae5, Enabled, Physical link is Up
  Interface index: 231, SNMP ifIndex: 713
  Link-level type: Ethernet, MTU: 1514, Speed: 2Gbps, BPDU Error: None, MAC-REWRITE Error: None, Loopback: Disabled, Source filtering: Disabled, Flow control: Disabled, Minimum links needed: 1, Minimum bandwidth needed: 1bps
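
For context on the 1514 shown above: as far as I understand, the MTU that Junos reports on a physical interface counts the 14-byte Ethernet header, so 1514 corresponds to a standard 1500-byte IP MTU. The sketch below only does that arithmetic. The helper name `frame_size` is made up for illustration, and whether a given switch model also needs the 4 bytes of VLAN tag added by hand is an assumption to check against the vendor documentation.

```shell
# Minimum Layer 2 frame size (excluding the 4-byte FCS) needed to carry an
# IP packet of the given size: add 14 bytes of Ethernet header, plus
# 4 more bytes when the frame carries an 802.1Q VLAN tag.
# frame_size is a hypothetical helper, not part of any tool.
frame_size() {
    ip_mtu=$1
    tagged=${2:-no}     # pass "tagged" as the second argument for VLAN traffic
    size=$(( ip_mtu + 14 ))
    if [ "$tagged" = "tagged" ]; then
        size=$(( size + 4 ))
    fi
    echo "$size"
}

# Example:
#   frame_size 1500          # untagged frame carrying a 1500-byte IP packet
#   frame_size 9000 tagged   # VLAN-tagged frame carrying a 9000-byte IP packet
```

So a server-side `mtu: 9000` needs the switch ports to accept frames somewhat larger than 9000 bytes, not exactly 9000.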

After increasing the MTU on the aggregates, the response from the keystone service was successfully received. Now I'll need to make sure that this is the case for all non-aggregate ports on the switch as well.
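
One way to check the remaining paths without reading packet captures is to ping with the "don't fragment" bit set and a payload sized to fill the MTU exactly. This is a minimal sketch assuming the Linux iputils `ping` (which provides `-M do`); the function names `icmp_payload` and `probe_mtu` are made up for illustration, and the address in the example is the Keystone endpoint from the question.

```shell
# Largest ICMP echo payload that fits in one IPv4 packet of the given MTU:
# MTU minus 20 bytes (IPv4 header without options) minus 8 bytes (ICMP header).
icmp_payload() {
    echo $(( $1 - 28 ))
}

# Send three pings of exactly that size with fragmentation prohibited
# (-M do is the iputils option; other ping implementations differ).
# If a device on the path cannot carry the packet, the pings fail or time out.
probe_mtu() {
    host=$1
    mtu=$2
    ping -M do -c 3 -s "$(icmp_payload "$mtu")" "$host"
}

# Example:
#   probe_mtu 10.127.2.8 1500   # should pass on a standard path
#   probe_mtu 10.127.2.8 9000   # fails if any port on the path lacks jumbo frames
```

A switch that silently drops oversized frames shows up here as 100% packet loss at the larger size while the smaller size succeeds, which matches the pattern seen in the TCP dump.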

Hi, I am currently stuck on the same error while deploying OpenStack using kolla-ansible, at [heat : Restart heat-api-cfn container]. It keeps saying Connection Timeout while accessing some URLs which are easily accessible from a browser, so it is not an internet connection issue. I have posted a question: https://unix.stackexchange.com/questions/694730/installing-openstack-using-kolla-ansible-internal-server-error — aneela, Mar 17 '22 at 07:01
