CPU or Disk bottleneck?

0

Say I have machines A and B, where machine B has a moderately faster disk but a comparable processor to machine A; everything else is the same. I run a large Spark job locally on both machines, where the input dataset is too large to fit into memory, forcing disk usage. While the job runs, I collect system metrics using sysstat/sar. The point of this is to compare the processors.

Machine B finishes the job roughly 10% faster. Using sar, I see that machine B achieves superior sector reads/writes per second (30% more), with lower average I/O request response times (up to 250% better). I jumped to the conclusion that machine B has an unfair advantage over machine A because of its faster disk.

My question is: how can I determine whether machine B's processor is simply more effective at utilizing disk I/O than machine A's? More specifically, how can I make sure that the difference in disk speeds doesn't create an unfair advantage, so that I can make a fair comparison between the processors? Are there any system metrics that would give more information about this?
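One signal worth looking at is how much CPU time is spent idle waiting on I/O: sar reports this as %iowait in `sar -u`, alongside %user and %system, and `sar -d` adds per-device await and tps. As a minimal sketch (assuming Linux; the script and its field names are illustrative, reading the same counters directly from /proc/stat rather than through sysstat):

```shell
#!/bin/bash
# Sample the kernel's aggregate CPU tick counters twice, one second apart,
# and report how the interval split between user work and I/O wait.
# First line of /proc/stat: cpu user nice system idle iowait ...
read -r _ u1 n1 s1 id1 w1 _ < /proc/stat
sleep 1
read -r _ u2 n2 s2 id2 w2 _ < /proc/stat

total=$(( (u2 + n2 + s2 + id2 + w2) - (u1 + n1 + s1 + id1 + w1) ))
user=$(( u2 - u1 ))
iowait=$(( w2 - w1 ))

echo "user ticks:   $user / $total"
echo "iowait ticks: $iowait / $total"
```

A high iowait share while the job runs suggests the disk is the limiting factor on that machine; if both machines show near-zero iowait, the disk difference likely didn't decide the result.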

cbass

Posted 2017-06-27T14:27:38.743

Reputation: 1

Set up a "LiveCD" style install and use a single temporary disk for any reading/writing in each machine? e.g. for the tests, remove the hard drives from each and use a single separate special hard drive for both tests. – Yorik – 2017-06-27T14:47:53.510

Would swapping the HDD be out of the question, and then running the same processes? Then you could see if machine A finishes faster than machine B. – TiO – 2017-06-27T15:00:08.603

JOC, what exactly are you trying to accomplish? If you are just trying to compare CPUs, there are other ways to do that that don't introduce the disk as a variable factor. Most benchmarking utilities would fit the bill better. – Frank Thomas – 2017-06-27T15:20:31.983

Answers

1

If you think the disk I/O bottleneck is unfair, then you should take it out of the equation. An easy way of doing so is to do all the work on RAM disks (of course you will need enough RAM, and space will be limited). But then, if the RAM technology of the two machines is not the same, you will have another unfair scenario.
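As a sketch of the RAM-disk route (assuming Linux, where /dev/shm is a tmpfs mount usable without root, and using Spark's spark.local.dir setting, which controls where shuffle and spill files go; the path and job name are placeholders):

```shell
#!/bin/bash
# Create a scratch directory on tmpfs so Spark's spill/shuffle I/O
# stays in RAM instead of hitting the physical disk.
SCRATCH=/dev/shm/spark-scratch
mkdir -p "$SCRATCH"

# Illustrative invocation; substitute your actual job for your_job.py.
echo spark-submit --conf "spark.local.dir=$SCRATCH" your_job.py
```

The caveat above still applies: the scratch data must fit in RAM, and different RAM speeds reintroduce an unfairness of their own.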

Likewise, you could use a central NFS server, but then the bottleneck would be the network.

So if your baseline is that Spark job, and the whole idea is to compare rather than to find the faster configuration, I might advise leveling the playing field by putting the whole dataset on a USB storage device; then disk I/O should match (as long as you use the same type of connector, both USB 2 or both USB 3).

Jorge Gutierrez

Posted 2017-06-27T14:27:38.743

Reputation: 11