I have an infrastructure question about Apache Spark, which I'm looking at rolling out in a greenfield project with at most around 4 TB of data used for modelling at any given time. The application domain is analytics, and model training will probably be done as an overnight batch job rather than in real time.
Traditional three-tier applications separated the database and application sides of the workload, meaning that two different servers could each be optimised for their task: storage and compute, respectively. This makes it easy to architect a system, because various vendors (Dell, for example) have offerings optimised for each role.
New frameworks like Spark seem to combine compute and storage on the same nodes to avoid moving data across the cluster - and the network load that this causes - but I'm wondering how this works out at the infrastructure level.
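To make my understanding concrete, here's a minimal sketch of the kind of overnight job I have in mind, assuming the executors run on the same machines as the HDFS DataNodes so tasks can be scheduled next to the blocks they read (the paths and the feature step are just placeholders):

```python
from pyspark.sql import SparkSession

# Hypothetical nightly batch job. With Spark executors co-located with the
# HDFS DataNodes, the scheduler prefers NODE_LOCAL placement, so the ~4 TB
# working set doesn't have to cross the network just to be read.
spark = (SparkSession.builder
         .appName("nightly-model-training")
         .getOrCreate())

# Placeholder input path and feature-engineering step.
events = spark.read.parquet("hdfs:///data/events")
features = events.groupBy("user_id").count()
features.write.mode("overwrite").parquet("hdfs:///data/features")

spark.stop()
```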
Are people combining large amounts of storage and computing power in a single machine? What might a standard system topology look like for my application, and what factors should I consider in planning it? Finally, are there any blade servers offering high storage density as well as good computing power?
I'd ideally like to work with no more than 5 nodes, but I don't know of any resources or guidance to help with planning an implementation like this. Any suggestions would be appreciated in that respect.
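For reference, this is the back-of-envelope storage sizing I've been working from, assuming HDFS with its default 3x replication; the 2x headroom factor for shuffle/temp space and growth is just a guess:

```python
# Rough sizing sketch for a 5-node cluster holding a 4 TB working set.
working_set_tb = 4      # data used for modelling at any one time
replication = 3         # HDFS default replication factor
nodes = 5

raw_storage_tb = working_set_tb * replication   # 12 TB replicated across the cluster
per_node_tb = raw_storage_tb / nodes            # ~2.4 TB per node, data only
with_headroom_tb = per_node_tb * 2              # allow for shuffle/temp space and growth

print(f"Raw replicated storage: {raw_storage_tb} TB")
print(f"Per node (data only):   {per_node_tb:.1f} TB")
print(f"Per node (2x headroom): {with_headroom_tb:.1f} TB")
```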