I have a system that needs to deploy hundreds of thousands of short-lived jobs per day. Each job runs anywhere from a few seconds to a couple of hours. Each job makes HTTP requests to external web servers, writes data to disk (anywhere from a few megabytes to hundreds of gigabytes), and makes a series of connections to databases.
Every job is the same Docker container, running the same single Java process. Each job has a different configuration, passed as an environment variable.
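To make that concrete, every job's entrypoint looks roughly like this (a minimal sketch; `JOB_CONFIG` is a stand-in for our actual variable name):

```java
public final class JobEntrypoint {

    public static void main(String[] args) {
        // Every job runs this same entrypoint; only the env-var payload differs.
        String rawConfig = System.getenv("JOB_CONFIG"); // stand-in name
        if (rawConfig == null || rawConfig.isBlank()) {
            System.err.println("JOB_CONFIG is not set; refusing to start.");
            System.exit(1);
        }
        // Parse the config, then do the actual work:
        // HTTP requests, disk writes, database connections.
    }
}
```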
We currently deploy these jobs on a Kubernetes cluster using the "Job" spec. However, the cluster does not always have capacity immediately available when a large influx of jobs needs to be run. We also have to constantly poll the Kubernetes cluster to determine whether each Job has finished or was killed (e.g., out of memory).
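For illustration, the status check we keep polling with looks something like the sketch below (this assumes the fabric8 Kubernetes Java client; the namespace and job name are placeholders):

```java
import io.fabric8.kubernetes.api.model.batch.v1.Job;
import io.fabric8.kubernetes.api.model.batch.v1.JobStatus;
import io.fabric8.kubernetes.client.KubernetesClient;

public final class JobPoller {

    /** Returns true once the Job has terminated, successfully or not. */
    static boolean isFinished(KubernetesClient client, String namespace, String jobName) {
        Job job = client.batch().v1().jobs()
                .inNamespace(namespace)
                .withName(jobName)
                .get();
        if (job == null || job.getStatus() == null) {
            return false; // Not scheduled yet, or status not reported yet.
        }
        JobStatus status = job.getStatus();
        int succeeded = status.getSucceeded() == null ? 0 : status.getSucceeded();
        int failed = status.getFailed() == null ? 0 : status.getFailed();
        // An OOM-killed pod surfaces here as a failure, like any other crash.
        return succeeded > 0 || failed > 0;
    }
}
```

Even if we switched from polling to the client's watch API, the completion bookkeeping would still live in our code, which is the part I'd like to get rid of.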
I'd like to find a solution that lets us deploy these jobs as quickly as possible, with minimal concern about whether resources are available, and without requiring us to poll a system to determine whether a job has completed.
AWS Lambda comes to mind, but I have little experience with it.
As an architectural note, we have a process that acts as a scheduler: it calculates which job should be run, and when. That process currently submits the job to the Kubernetes cluster.
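Concretely, the submission path looks something like this sketch (again the fabric8 client, 6.x style; the image, namespace, and env-var name are illustrative):

```java
import io.fabric8.kubernetes.api.model.batch.v1.Job;
import io.fabric8.kubernetes.api.model.batch.v1.JobBuilder;
import io.fabric8.kubernetes.client.KubernetesClient;

public final class Scheduler {

    /** Submits one run of the shared worker image with its per-job config. */
    static void submit(KubernetesClient client, String jobName, String configJson) {
        Job job = new JobBuilder()
                .withNewMetadata()
                    .withName(jobName)
                    .withNamespace("jobs")   // illustrative namespace
                .endMetadata()
                .withNewSpec()
                    .withBackoffLimit(0)     // one attempt; the scheduler decides on retries
                    .withNewTemplate()
                        .withNewSpec()
                            .withRestartPolicy("Never")
                            .addNewContainer()
                                .withName("worker")
                                .withImage("registry.example.com/worker:latest") // illustrative image
                                .addNewEnv()
                                    .withName("JOB_CONFIG") // matches the entrypoint sketch above
                                    .withValue(configJson)
                                .endEnv()
                            .endContainer()
                        .endSpec()
                    .endTemplate()
                .endSpec()
                .build();

        client.batch().v1().jobs().inNamespace("jobs").resource(job).create();
    }
}
```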
Given the above description, what architectures should I be evaluating to minimize the amount of concern this system has around 1) whether resources are available to handle the job, and 2) whether a job fails for any "non-application" reason?
This system currently runs on GCP and AWS. We're open to any solution, even if it means selecting a single (and potentially different) platform.