
I have an infrastructure question about Apache Spark, which I'm looking at rolling out in a greenfield project with at most approximately 4 TB of data used for modelling at any given time. The application domain is analytics, and training of the models will probably be done as an overnight batch rather than in real time.

Traditional three-tiered applications separated the database and application sides of the workload, meaning that two different servers could be optimised for storage and computing tasks, respectively. This makes it easy to architect a system, because various providers (Dell, for example) have offerings optimised for each role.

New frameworks like Spark seem to combine both aspects to avoid moving data between nodes - and the network load that this causes - but I'm wondering how this works at the infrastructure level.

Are people combining large amounts of storage and computing power in a single machine? What might a standard system topology look like for my application and what factors would I consider in planning it? Finally, are there any blade servers offering high storage density as well as good computing power?

I'd ideally like to work with no more than 5 nodes, but I don't know of any resources or guidance to help with planning an implementation like this. Any suggestions in that respect are appreciated.

analystic
  • 141
  • 2

1 Answer


I'm going to answer my own question as I've found some resources; however, I'll also accept any quality answers that come in, so feel free to contribute. Comments on my thoughts here are also more than welcome.

This link has some info on provisioning hardware for Spark, and from what I can understand, you can basically treat Spark as the application layer in a three-tier stack. So you might run (for example) Cassandra or HBase on your storage nodes and keep Spark on "application" nodes with stronger CPUs and more memory but less storage. 10 Gbps Ethernet between the nodes sounds like it will be important in these use cases.
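To make the "application node" sizing concrete, here is a rough back-of-envelope sketch in plain Python. The node specs (16 cores, 128 GB RAM per node) are purely hypothetical, and the rules of thumb (reserve a core and some memory for the OS and daemons, cap executors at around 5 cores, leave roughly 10% of executor memory as off-heap overhead) are common community heuristics, not vendor guidance — adjust them against your own hardware and the official Spark tuning docs.

```python
# Back-of-envelope sizing for Spark "application" nodes.
# All node specs and rules of thumb below are assumptions, not measurements:
#  - 5 worker nodes, 16 cores and 128 GB RAM each (hypothetical hardware)
#  - reserve 1 core and 8 GB per node for the OS and cluster daemons
#  - cap executors at 5 cores each (a common throughput rule of thumb)
#  - leave ~10% of each executor's memory as off-heap overhead

def size_executors(nodes, cores_per_node, ram_gb_per_node,
                   reserved_cores=1, reserved_ram_gb=8,
                   cores_per_executor=5, overhead_fraction=0.10):
    usable_cores = cores_per_node - reserved_cores
    usable_ram_gb = ram_gb_per_node - reserved_ram_gb
    executors_per_node = usable_cores // cores_per_executor
    ram_per_executor = usable_ram_gb / executors_per_node
    heap_gb = int(ram_per_executor * (1 - overhead_fraction))
    return {
        "executors_per_node": executors_per_node,
        "total_executors": executors_per_node * nodes,
        "executor_cores": cores_per_executor,          # spark.executor.cores
        "executor_memory_gb": heap_gb,                 # spark.executor.memory
    }

plan = size_executors(nodes=5, cores_per_node=16, ram_gb_per_node=128)
print(plan)
```

For the hypothetical 5-node cluster above this yields 3 executors per node (15 in total), each with 5 cores and a 36 GB heap — numbers you would then feed into `spark.executor.cores` and `spark.executor.memory` for your actual hardware.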

I suppose this raises the question of how one does processing on a very large dataset, considering that you might ultimately still be streaming data out of an HBase database to do the processing, but I think this boils down to application architecture, so it falls outside the scope of this site.
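That said, the core of the answer is an old streaming pattern, which a plain-Python sketch can illustrate: scan rows out of the storage tier one batch at a time and fold them into a small running aggregate, so the full dataset never has to fit in application-node memory at once. This is essentially what Spark does per partition. `fetch_rows` here is a hypothetical stand-in for a real HBase or Cassandra scan, not an actual client API.

```python
# Plain-Python illustration of streaming aggregation: rows flow out of the
# storage tier and only a small aggregate is held in memory on the
# application side. `fetch_rows` is a hypothetical stand-in for a real
# HBase/Cassandra scan.

def fetch_rows():
    # Generator standing in for a storage-layer scan; yields one row at a
    # time instead of materialising the whole table in memory.
    for i in range(1_000_000):
        yield {"user": i % 10, "amount": i % 7}

def aggregate_by_key(rows):
    # Fold the stream into a per-user running total; memory use is
    # proportional to the number of distinct keys, not the number of rows.
    totals = {}
    for row in rows:
        totals[row["user"]] = totals.get(row["user"], 0) + row["amount"]
    return totals

totals = aggregate_by_key(fetch_rows())
print(len(totals))  # 10 distinct users, however many rows were scanned
```

In a real deployment the framework shards this work, so each Spark executor runs the same fold over its own slice of the table and the partial totals are merged afterwards.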
