
I'm running long batch jobs on a Kubernetes cluster that operate on an NFS directory to save artifacts and checkpoints. Every job has its own directory that it cds into before running a script that restores its state and continues the computation. No two jobs ever operate on the same directory!
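For illustration, the restore step in each job's script is roughly equivalent to the following (the names are simplified, not my real code): it lists the job's directory and resumes from the checkpoint with the highest step number.

```python
import os
import re

CHECKPOINT_DIR = "."  # the job has already cd'ed into its own NFS directory


def find_latest_checkpoint():
    """Pick the checkpoint with the highest step number from the directory listing."""
    pattern = re.compile(r"checkpoint-(\d+)\.ckpt$")
    candidates = []
    for name in os.listdir(CHECKPOINT_DIR):
        match = pattern.match(name)
        if match:
            candidates.append((int(match.group(1)), name))
    if not candidates:
        return None  # no checkpoint yet, start from scratch
    return max(candidates)[1]


latest = find_latest_checkpoint()
# restore state from `latest` (or start fresh) and continue the computation
```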

Whenever the container running the task fails, Kubernetes spawns a new one to resume the work until it's finished. However, since the cluster consists of several machines, the new container may be scheduled on a different node. The nodes mount the shared storage independently to access the same data, but there is a delay before the changes made by the failed pod become visible to the new one. The restarted job may then fail to see the checkpoint created by its previous pod and accidentally continue from an older one!

How can I avoid this issue? Can I force NFS to synchronize its cache of the directory? Should I use some other synchronization mechanism to ensure that the new pod is not launched until the directory is up to date? Is there a better way to share my storage?
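One idea I'm considering, but haven't verified, is to stop trusting the directory listing and instead maintain a small pointer file that is atomically replaced after every checkpoint, in the hope that NFS close-to-open consistency makes an explicit open of that file return fresh data. A rough sketch with made-up names (`LATEST`, `publish_checkpoint`, `read_latest_checkpoint` are illustrative only):

```python
import os

POINTER = "LATEST"  # small file holding the name of the newest checkpoint


def publish_checkpoint(checkpoint_name: str) -> None:
    """Write the pointer to a temp file, fsync it, then atomically rename it into place."""
    tmp = POINTER + ".tmp"
    with open(tmp, "w") as f:
        f.write(checkpoint_name + "\n")
        f.flush()
        os.fsync(f.fileno())
    os.rename(tmp, POINTER)  # rename within one directory is atomic, replaces the old pointer


def read_latest_checkpoint():
    """Open the pointer file explicitly; with close-to-open consistency the open
    should revalidate the file against the server rather than trust the cache."""
    try:
        with open(POINTER) as f:
            return f.read().strip() or None
    except FileNotFoundError:
        return None  # no checkpoint has been published yet
```

I'm not sure this is actually safe, though: the new pod's lookup cache might still resolve `LATEST` to the old file handle until the directory attributes expire, so I'd appreciate confirmation or a better mechanism.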
