I am using gRPC to communicate between Java (running on the host) and Python (running on my guest VMs). My software sets up some VMs at startup with libvirt. I define the network with a DHCP range like this:
<network>
...
<ip address='192.168.122.1' netmask='255.255.255.0'>
<dhcp>
<range start='192.168.122.128' end='192.168.122.254'/>
<host mac="00:16:3e:77:e2:ed" ip="192.168.122.128"/>
<host mac="00:16:3e:3e:a9:1a" ip="192.168.122.129"/>
...
</dhcp>
</ip>
...
</network>
Once everything is set up and my VMs are running, I create a snapshot of each VM. The workflow of the software is then as follows: open a gRPC connection to a guest VM at the IP assigned above to its MAC address, set the correct time (the clock restored from the snapshot is stale and the VM has no internet access), do some work, then restore the snapshot and repeat.
In the Python code on the VM I send a ping message to the host every 5 s (a kind of heartbeat). Most of the time this works just fine and I receive the message on the host in the Java code. However, from time to time the host stops receiving messages, and after 3 minutes without one I throw a custom timeout exception. There is no error or endless loop in the Python code, and no exception is raised on the Java side either. So the problem must be elsewhere.
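To make the timeout logic concrete, here is a minimal sketch of the host-side check described above, written in Python for brevity (the real host code is Java). It is a simplified illustration, not my production code: the class name `HeartbeatWatchdog` and its methods are hypothetical, and the 180 s default only mirrors the 3-minute timeout mentioned above.

```python
import time


class HeartbeatWatchdog:
    """Tracks the most recent heartbeat and reports when the timeout elapsed.

    Hypothetical helper for illustration only.
    """

    def __init__(self, timeout_s=180.0, clock=time.monotonic):
        # A monotonic clock is used so that wall-clock adjustments
        # do not distort the measured silence.
        self._timeout_s = timeout_s
        self._clock = clock
        self._last_seen = clock()

    def beat(self):
        """Record that a heartbeat message has just arrived."""
        self._last_seen = self._clock()

    def silent_for(self):
        """Seconds elapsed since the last recorded heartbeat."""
        return self._clock() - self._last_seen

    def timed_out(self):
        """True once no heartbeat has arrived for timeout_s seconds."""
        return self.silent_for() >= self._timeout_s
```

The handler that receives a ping would call `beat()`, and a periodic check would raise the timeout exception once `timed_out()` returns true. Logging `silent_for()` together with a timestamp makes it easier to see exactly when the heartbeats stopped.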
While examining the cause of this strange behavior I came across DHCP lease times. My VMs get their IP from the DHCP range configured via libvirt (see above), and the lease is set to expire after 1 hour. Interestingly, when the heartbeat timeout occurs (it does not always occur), it often does so around 1 hour after I made the initial snapshots from which I restore. Could my problem be related to the DHCP lease renewal procedure, with a VM suddenly ending up in a state where it no longer has its initial IP? For example, I open my gRPC connection to 192.168.122.128 and receive the heartbeat messages, then the DHCP lease expires, the IP changes, and gRPC fails to deliver any further messages. If that could be the problem, what can I do about it?
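To check whether the failures line up with the lease, it helps to compute when a DHCP client would be expected to act. The sketch below assumes the default timers from RFC 2131 (renewal T1 at 50% of the lease, rebinding T2 at 87.5%); a server can override these, so treat the result as an estimate. The function name `lease_timeline` and the example timestamp are hypothetical.

```python
from datetime import datetime, timedelta


def lease_timeline(lease_obtained, lease_seconds=3600):
    """Return the expected DHCP renew/rebind/expiry times for one lease.

    Assumes the RFC 2131 default timers (T1 = 0.5, T2 = 0.875 of the lease).
    """
    lease = timedelta(seconds=lease_seconds)
    return {
        "renew_t1": lease_obtained + lease * 0.5,     # client starts renewing
        "rebind_t2": lease_obtained + lease * 0.875,  # client starts rebinding
        "expires": lease_obtained + lease,            # address must be released
    }


if __name__ == "__main__":
    # Placeholder timestamp standing in for the moment the lease was granted.
    obtained = datetime(2024, 1, 1, 12, 0, 0)
    for name, when in lease_timeline(obtained).items():
        print(f"{name}: {when.isoformat()}")
```

Note that the lease state inside a restored snapshot dates from when the snapshot was taken, so the guest's view of these times and the DHCP server's view on the host may not agree after a restore.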
I have additionally run a test where I open a gRPC connection to a VM, manually change the IP address on the VM, and then change it back to the original IP. I observed exactly the same behavior: no heartbeat messages arrived after my manual change, and I got the timeout exception 3 minutes later. So I strongly suspect that the IP on a VM is somehow being changed, possibly because of an expiring DHCP lease, and that this is interfering with the gRPC connection.
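One way to confirm or rule out this suspicion is to record the guest's address over time and look for changes. The following is a rough sketch under stated assumptions: `current_ip` and `ip_transitions` are hypothetical helper names, the probe target 192.168.122.1 is the libvirt gateway from the network definition above, and the UDP `connect` only selects a local address without sending any packet.

```python
import socket


def current_ip(probe_host="192.168.122.1", probe_port=9):
    """Return the local IPv4 address used to reach probe_host, or None.

    Connecting a UDP socket sends no data; it only makes the kernel
    pick the outgoing interface and address.
    """
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
            s.connect((probe_host, probe_port))
            return s.getsockname()[0]
    except OSError:
        # No route or no address configured at this moment.
        return None


def ip_transitions(samples):
    """Return (index, old, new) for every change between adjacent samples."""
    changes = []
    for i in range(1, len(samples)):
        if samples[i] != samples[i - 1]:
            changes.append((i, samples[i - 1], samples[i]))
    return changes
```

Sampling `current_ip()` alongside each heartbeat on the guest and passing the collected values to `ip_transitions` would show whether the address ever drops or changes around the time the host stops receiving messages.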
Could that be the problem, or could something completely different be causing the trouble with the gRPC connection? What can I do about it? Any help or further explanation is appreciated.