
I have an enormous corpus of text data stored in millions of files on S3. It's very common that I want to perform some operation on every one of those files, where each operation reads only that one file and produces a new file from it. Usually I use my company's Databricks for this, but it's so locked down that it's hard to deploy complex code there.
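Concretely, every task has the same shape: read one object, transform it, write one object. A minimal sketch of what I mean (the bucket names and the transform are placeholders, not my real code):

```python
import boto3

s3 = boto3.client("s3")

def transform(data: bytes) -> bytes:
    # Stand-in for the real per-file operation.
    return data.upper()

def process_one(key: str) -> None:
    """Read one input object, transform it, and write one output object."""
    body = s3.get_object(Bucket="my-input-bucket", Key=key)["Body"].read()
    s3.put_object(Bucket="my-output-bucket", Key=key, Body=transform(body))
```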

I've been considering AWS Batch with Spot Instances as an alternative to Databricks for some of these jobs. I'd certainly want multiple nodes, because even the largest single node would be incapable of finishing the work in a reasonable time frame. There are, of course, technologies like Apache Spark that are designed for distributed computing, but I'm (a) not confident in my ability to set up my own Spark cluster and (b) not convinced that Spark is necessary for such a simple distributed computing job. Fundamentally, all I need is for the nodes to communicate which files they are planning to work on, which files they have finished, and when they shut down. It would be straightforward, if tedious, to maintain all of that information in a database, and I have no need to translate my data into another distributed filesystem.
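For concreteness, here's the kind of bookkeeping I'd otherwise hand-roll: each node atomically claims a file before working on it and marks it done afterward. A sketch using DynamoDB conditional writes (the table name and schema are made up for illustration):

```python
import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.client("dynamodb")
TABLE = "file-work-claims"  # hypothetical table keyed on "s3_key"

def try_claim(s3_key: str, worker_id: str) -> bool:
    """Atomically claim a file; returns False if another node already has it."""
    try:
        dynamodb.put_item(
            TableName=TABLE,
            Item={
                "s3_key": {"S": s3_key},
                "worker": {"S": worker_id},
                "status": {"S": "in_progress"},
            },
            # The write succeeds only if no row for this key exists yet.
            ConditionExpression="attribute_not_exists(s3_key)",
        )
        return True
    except ClientError as e:
        if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False  # someone else claimed it first
        raise

def mark_done(s3_key: str) -> None:
    """Record that this file's output has been written."""
    dynamodb.update_item(
        TableName=TABLE,
        Key={"s3_key": {"S": s3_key}},
        UpdateExpression="SET #s = :done",
        ExpressionAttributeNames={"#s": "status"},
        ExpressionAttributeValues={":done": {"S": "done"}},
    )
```

A plain SQS queue of keys would do much the same job with less code, but either way it's coordination plumbing I'd rather get from an existing tool than maintain myself.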

Is there a good existing technology for this kind of use case?

Zorgoth
  • You mentioned AWS Batch. What did your research tell you about whether it was suitable for your use case? – Tim May 23 '22 at 19:57
  • Oh, good point. I just realized after looking that up that multi-node jobs aren't supported with Spot Instances. It seems like I would be forced to submit multiple single-node jobs if I were going to use it, which is somewhat less appealing. – Zorgoth May 23 '22 at 20:05

0 Answers