Expired DHCP lease on restored VM snapshot interfering with active gRPC connection?

Question

I am using gRPC to communicate between Java (running on the host) and Python (running on my Guest VMs). My software sets up some VMs at startup with libvirt. I specify my network with a DHCP range like this:

<network>
  ...
  <ip address='192.168.122.1' netmask='255.255.255.0'>
    <dhcp>
      <range start='192.168.122.128' end='192.168.122.254'/>
      <host mac="00:16:3e:77:e2:ed" ip="192.168.122.128"/>
      <host mac="00:16:3e:3e:a9:1a" ip="192.168.122.129"/>
      ...
    </dhcp>
  </ip>
  ...
</network>
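For reference, the MAC-to-IP reservations in a definition like the one above can be read back programmatically, so the host side knows which address to dial for each VM. This is only a minimal sketch: `NETWORK_XML` is a trimmed copy of my definition with the elided parts left out, and the function name `reservations` is made up for illustration.

```python
import ipaddress
import xml.etree.ElementTree as ET

# Trimmed copy of the network definition (elided parts omitted).
NETWORK_XML = """
<network>
  <ip address='192.168.122.1' netmask='255.255.255.0'>
    <dhcp>
      <range start='192.168.122.128' end='192.168.122.254'/>
      <host mac="00:16:3e:77:e2:ed" ip="192.168.122.128"/>
      <host mac="00:16:3e:3e:a9:1a" ip="192.168.122.129"/>
    </dhcp>
  </ip>
</network>
"""


def reservations(xml_text):
    """Return {mac: ip} for every <host> entry, checking each IP lies in the subnet."""
    root = ET.fromstring(xml_text)
    result = {}
    for ip_el in root.findall("ip"):
        # strict=False because 'address' is the gateway, not the network address
        subnet = ipaddress.ip_network(
            f"{ip_el.get('address')}/{ip_el.get('netmask')}", strict=False
        )
        for host in ip_el.findall("dhcp/host"):
            addr = ipaddress.ip_address(host.get("ip"))
            if addr not in subnet:
                raise ValueError(f"{addr} is outside {subnet}")
            result[host.get("mac").lower()] = str(addr)
    return result


if __name__ == "__main__":
    for mac, ip in reservations(NETWORK_XML).items():
        print(mac, "->", ip)
```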

Once everything is set up and my VMs are running, I create a snapshot of each VM. The workflow of the software is then as follows: open a gRPC connection to a guest VM at the IP reserved above for its MAC address, set the correct time (the guest clock is stale after a snapshot restore and the VM has no internet access), do some work, then restore the snapshot and repeat.

In the Python code on the VM I send a ping message to the host every 5s (a kind of heartbeat). Most of the time this works just fine and I receive the message on the host in the Java code. However, from time to time the host stops receiving messages for 3 minutes, after which I throw a custom timeout exception. There is no error or endless loop in the Python code, and no exception is raised on the Java side either. So the problem must be elsewhere.

While examining the cause of this strange behavior I came across DHCP lease times. My VMs get their IP from the DHCP range configured via libvirt (see above). The DHCP lease is set to expire after 1 hour. Interestingly, when the heartbeat timeout occurs (it does not always occur), it is often around 1h after I took the initial snapshots from which I restore. Could my problem be related to the DHCP lease renewal procedure, where a VM suddenly ends up in a state where it no longer has its initial IP? For example, I open my gRPC connection to 192.168.122.128 and receive the heartbeat messages, then the DHCP lease expires, the IP changes, and gRPC fails to deliver any further messages. If that could be the problem, what could I do about it?
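To make the timing concrete, here is a small sketch of when a DHCP client would normally try to renew a 1-hour lease. It assumes the RFC 2131 default timers (renew at 50% of the lease, rebind at 87.5%); a server or client can override these, and the acquisition time below is hypothetical. If the lease state is frozen in the snapshot and the guest clock is then set forward, the client may consider these deadlines already passed right after a restore, but that part is my guess, not something I have verified.

```python
from datetime import datetime, timedelta


def lease_timers(lease_seconds):
    """Return (T1, T2, expiry) in seconds, using the RFC 2131 default fractions."""
    t1 = lease_seconds * 0.5    # renewing: client asks the original server
    t2 = lease_seconds * 0.875  # rebinding: client asks any server
    return t1, t2, float(lease_seconds)


def lease_schedule(acquired_at, lease_seconds):
    """Map each lease phase to the wall-clock time at which it starts."""
    t1, t2, expiry = lease_timers(lease_seconds)
    return {
        "renew": acquired_at + timedelta(seconds=t1),
        "rebind": acquired_at + timedelta(seconds=t2),
        "expire": acquired_at + timedelta(seconds=expiry),
    }


if __name__ == "__main__":
    acquired = datetime(2019, 2, 21, 12, 0, 0)  # hypothetical acquisition time
    for phase, when in lease_schedule(acquired, 3600).items():
        print(phase, when.time())
```

For a 3600s lease this gives a renew attempt after 30 minutes, rebinding after 52.5 minutes, and expiry after 60 minutes, all measured on the guest's own clock.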

I also ran a test where I open a gRPC connection to a VM, manually change the IP address on the VM, and then change it back to the original IP. I observed exactly the same behavior: no heartbeat messages arrived after the manual change, and I got a timeout exception after 3 minutes. So I strongly suspect that the IP on a VM changes at some point, possibly because of an expiring DHCP lease, and that this interferes with the gRPC connection.
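One way to check whether the address really changes during a normal run would be to log it from inside the guest next to the heartbeat. The following is a minimal sketch using only the standard library; the function names, the probe port, and the polling parameters are made up for illustration, and the probe host is the gateway address from my network definition.

```python
import socket
import time


def current_ipv4(probe_host="192.168.122.1", probe_port=9):
    """Return the local IPv4 address the OS would use to reach probe_host, or None."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # connect() on a UDP socket sends no packet; it only picks a route
        sock.connect((probe_host, probe_port))
        return sock.getsockname()[0]
    except OSError:
        # e.g. no route because the interface currently has no address
        return None
    finally:
        sock.close()


def watch_address(expected, get_address=current_ipv4, interval=1.0,
                  checks=5, sleep=time.sleep):
    """Poll the address and return (check_index, observed) for every mismatch."""
    mismatches = []
    for i in range(checks):
        observed = get_address()
        if observed != expected:
            mismatches.append((i, observed))
        sleep(interval)
    return mismatches
```

Calling `watch_address("192.168.122.128", checks=3600)` in a background thread on the guest and writing the result to a log would show whether the address was ever missing or different while the heartbeat was silent.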

Could that be the problem, or could something completely different be causing trouble with the gRPC connection? What could I do about it? Any help or further elaboration is appreciated.

score 0 · Accepted Answer · answered Feb 21 '19 at 15:52

I am pretty sure that most of the occurrences where gRPC was stuck can be traced back to DHCP lease renewals. I also suspect that, in general, the gRPC connection gets stuck whenever the DHCP server cannot be reached at some point. I tested this by manually setting the system time and calling ipconfig /renew from the command line: I first got a DHCP timeout, and a later attempt succeeded, but gRPC was stuck afterwards even though I got the same IP again. Since I still do not fully understand DHCP lease renewal, or how the DHCP server is queried and when those queries time out, I have decided not to fiddle with it any further. Instead I went with static IPs for my VM snapshots (DHCP disabled). If anybody runs into the same problem, you might want to try this. Hope it helps.
