2

I have a system that needs to deploy hundreds of thousands of short-lived jobs per day. Each job runs anywhere from a few seconds to a couple of hours. Each job makes HTTP requests to external web servers, writes data to disk (anywhere from a few megabytes to hundreds of gigabytes), and makes a series of connections to databases.

Every job is the same Docker container, running the same single Java process. Each job has a different configuration, passed as an environment variable.
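Conceptually, each job's entry point does something like the following at startup (the JOB_CONFIG variable name and the handling are placeholders for illustration, not our actual setup):

```java
// Illustrative only: read the per-job configuration from an environment
// variable at startup. JOB_CONFIG is a placeholder name.
public final class JobEntryPoint {
    public static void main(String[] args) {
        String rawConfig = System.getenv("JOB_CONFIG");
        if (rawConfig == null || rawConfig.isBlank()) {
            System.err.println("JOB_CONFIG is not set; exiting");
            System.exit(1);
        }
        // Parse the configuration and run the job's work here:
        // HTTP requests, disk writes, database connections.
        System.out.println("Starting job with config: " + rawConfig);
    }
}
```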

We currently deploy these jobs on a Kubernetes cluster using the "Job" spec. However, the cluster is not immediately able to accept jobs when a large influx of jobs needs to be run. We also have to constantly query the Kubernetes cluster to determine whether a Job has finished or was killed (e.g. out of memory).

I'd like to find a solution that would allow us to deploy these jobs as quickly as possible, with the least amount of concern about whether resources are available and without requiring us to query a system to determine whether a job has completed.

AWS Lambda comes to mind, but I have little experience with it.

As an architectural note, we have a process that serves as a scheduler: it calculates which job should be run, and when. That process currently submits the job to the Kubernetes cluster.

Given the above description, what architectures should I be evaluating to minimize the amount of concern this system has around 1) whether resources are available to handle the job, and 2) whether the job failed for any "non-application" reason?

This system currently runs on GCP and AWS. We're open to any solution, even if it means selecting a single (and potentially different) platform.

Brett

2 Answers

1

If the jobs are short-lived, your purpose might be better served by implementing a job queue, and a set of longer-lived workers which consume jobs from the queue. Is there a reason you need to run the jobs in k8s itself?
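To sketch the idea (using an in-process queue purely as a stand-in for whatever broker you would actually use, e.g. SQS, Pub/Sub, or RabbitMQ):

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;

// Sketch of long-lived workers: a fixed pool pulls job configurations from a
// queue and processes them one at a time, instead of launching a new
// container per configuration. The in-memory queue stands in for a real broker.
public class QueueWorkerSketch {

    static void runJob(String jobConfig) {
        // Existing job logic would go here: HTTP requests, disk writes, DB calls.
        System.out.println(Thread.currentThread().getName() + " processing " + jobConfig);
    }

    public static void main(String[] args) {
        BlockingQueue<String> jobQueue = new LinkedBlockingQueue<>();
        List.of("config-1", "config-2", "config-3").forEach(jobQueue::add);

        int workerCount = 4; // sized to the nodes you already keep running
        ExecutorService workers = Executors.newFixedThreadPool(workerCount);
        for (int i = 0; i < workerCount; i++) {
            workers.submit(() -> {
                while (true) {
                    try {
                        String jobConfig = jobQueue.take(); // blocks until work arrives
                        runJob(jobConfig);
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        return;
                    }
                }
            });
        }
        // The workers keep running and waiting for more jobs; this sketch never exits.
    }
}
```

The same shape carries over across processes: the queue moves out of memory into a managed broker, and each of your existing containers becomes one long-lived consumer rather than one job.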

Matt Zimmerman
  • I think that's effectively what we are doing right now: we have a job queue, and each job is run as a Docker container on a Kubernetes cluster, where each node represents a "worker." If we were to execute these outside of Kubernetes, we'd still need the nodes, and we'd still have those nodes running when there were no jobs to do. I also don't know of another production-grade way to run the Docker images. – Brett Aug 07 '19 at 16:38
  • What you are doing now is starting a new worker process for each job. I'm proposing the idea of having longer-lived workers which process multiple jobs during their life cycle. – Matt Zimmerman Aug 07 '19 at 19:19
  • Every task that is run is a container inside a node (a long-lived instance). – Brett Aug 08 '19 at 22:09
  • Instead of starting and stopping hundreds of thousands of containers running the same code, but different configuration, start a set of these containers and have them read the hundreds of thousands of configurations from a queue and process them one at a time. – Matt Zimmerman Sep 04 '21 at 17:05
0

Presumably your cluster is resource limited. Achieving a higher job volume, if that is a requirement, must involve a more efficient application or more resources.

Large providers like the ones you use will rent you as many instances as your budget allows. Scale up your cluster, possibly automatically. Some spare capacity may be needed if you schedule jobs on short notice.

An alternative to polling the Kubernetes Job is to have your code pass a message: at the end of a job, make some kind of callback to your scheduler indicating it has finished.
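For example, the last thing the job does could be an HTTP callback to your scheduler; the endpoint and payload below are made up for illustration:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Sketch of a completion callback: the job reports its own outcome to the
// scheduler as its final step. The URL and JSON payload are hypothetical.
public class CompletionCallback {

    public static void reportFinished(String jobId, boolean succeeded) throws Exception {
        String body = String.format(
                "{\"jobId\":\"%s\",\"status\":\"%s\"}",
                jobId, succeeded ? "SUCCEEDED" : "FAILED");

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://scheduler.example.internal/jobs/" + jobId + "/complete"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
    }
}
```

Calling this from a finally block (or a JVM shutdown hook) covers normal completion and most application-level failures.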

Of course, it may have died and will never report back. Eventually this needs to become a failure state. Consider polling the job at intervals after your typical shortest job time, and giving up on it after a hard limit like activeDeadlineSeconds.
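On the scheduler side, that can be a simple periodic sweep; the interval and deadline below are placeholders to be tuned to your actual job durations:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Sketch of the "give up eventually" side: jobs that have not called back
// within a hard deadline are marked failed. Interval and deadline are
// placeholders, playing the same role as activeDeadlineSeconds does in a Job spec.
public class JobDeadlineSweeper {

    private final Map<String, Instant> startedAt = new ConcurrentHashMap<>();
    private final Duration hardDeadline = Duration.ofHours(3);

    public void jobStarted(String jobId) {
        startedAt.put(jobId, Instant.now());
    }

    public void jobFinished(String jobId) {
        startedAt.remove(jobId); // the callback from the job lands here
    }

    public void start() {
        ScheduledExecutorService sweeper = Executors.newSingleThreadScheduledExecutor();
        sweeper.scheduleAtFixedRate(() -> {
            Instant cutoff = Instant.now().minus(hardDeadline);
            startedAt.forEach((jobId, started) -> {
                if (started.isBefore(cutoff)) {
                    startedAt.remove(jobId);
                    System.out.println("Job " + jobId + " exceeded deadline; marking failed");
                }
            });
        }, 1, 1, TimeUnit.MINUTES);
    }
}
```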

John Mahowald
  • Thanks. All jobs actually do make several requests to an API to update their status and progress. But when the job completes, we still have to "delete" the Job, as there is a limit to the number of completed Jobs Kubernetes can store. There's a new API in the next release of Kubernetes that allows Jobs to be automatically cleaned up. However, we still have to check on the Job in case it was terminated for things like OOM. We do scale the cluster, but scaling it can take several minutes, during which jobs cannot be reliably submitted to the cluster. – Brett Jul 12 '19 at 11:23
  • Then you have to stay ahead of jobs by maintaining idle capacity 5 minutes before it is needed. At least enough for high priority jobs that need to start now. No way to avoid that if you always want jobs to have resources. – John Mahowald Jul 12 '19 at 13:11
  • Of course, we do that. So the question in the post is what options are available to avoid doing this. AWS Lambda seems like an option, but the amount of data we write to disk might be a limiting factor. We're looking to explore other options. – Brett Jul 12 '19 at 17:09