I am a VERY new sysadmin (Class of '16) and I've been asked to create a big data cluster with 3 bare-metal PowerEdge servers. I've been asked to put the following on the cluster:

* Hadoop2
* YARN
* Java 7 & 8
* Spark
* SBT
* Maven
* Scala
* P7zip
* Pig
* Hive
* R (libraries for Spark and Hadoop)
* Zeppelin
* Cassandra

I would like to know whether these can all 'play well together', since I know very little about big data and my searches turn up a lot of "x VS y" pages rather than "x AND y". Is there also a preferred industry standard?

Thank you in advance for your advice!

Beth L
    Have you been tasked with designing a new solution without real requirements or end goal, or is this explicitly supposed to be a learning experience? – mfinni Mar 16 '18 at 14:42

1 Answer

Certainly they can co-exist on those servers, though typically you'd use one kind of server to hold the actual data and another to do the compute-heavy work. It's also slightly non-standard to run a Cassandra DB on the same servers, but again, you can do all of this and it'll work; it's just not exactly how I'd do it.

In case the servers haven't been ordered yet and you can influence their specification, one thing I would try to do is have a bank of big, slow disks for data (typically multi-TB 7.2krpm 3.5" disks) and then some SSDs or 10krpm disks for DB and compute work. Running the whole thing off one type of disk doesn't often make sense. This will also be quite memory intensive, so don't skimp on RAM. You'll also need a sensible number of CPU cores; I'd say at least 12 per server for all this work.
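If the servers already exist, a quick way to compare them against the core-count rule of thumb above is a short script like the one below. This is only a sketch: the function name `meets_core_baseline` and the constant `MIN_CORES` are made up for illustration, and `os.cpu_count()` reports logical cores (hyper-threads included), which may be higher than the physical core count.

```python
import os

# Rule-of-thumb figure taken from the answer above (at least 12 cores per server).
MIN_CORES = 12


def meets_core_baseline(core_count, minimum=MIN_CORES):
    """Return True when a server has at least `minimum` CPU cores.

    `core_count` may be None, because os.cpu_count() returns None when
    the number of cores cannot be determined.
    """
    return core_count is not None and core_count >= minimum


if __name__ == "__main__":
    cores = os.cpu_count()  # logical cores, not necessarily physical ones
    print(f"logical cores: {cores}, meets baseline: {meets_core_baseline(cores)}")
```

Run it on each of the three servers; memory and disk layout still need checking separately.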

Anyway, I hope this helps. Look at both Cloudera and Ambari for managing a Hadoop environment. Apache Ambari is open source, while Cloudera's full offering is commercial, but either can take a lot of the headache away from you.

Chopper3