How to pick the fastest machine out of a pool to execute a compute task

I have a compute task programmed in Matlab, which consists of processing large amounts of data that are read into memory at startup. The runtime is in the range of hours to days. The task is single-threaded.

In order to run this task, I have a number of different Linux machines at my disposal. The machines are under different amounts of system load due to tasks already running, which will continue running for the foreseeable future. All the machines have an amd64 architecture, but they differ in the number of physical cores, the number of virtual cores, and the CPU model, including clock speed and other factors.

My question: Is there a principled way to choose one of the machines for execution of the task, with the goal of having the task finished as soon as possible?


The following part was updated based on Doktoro Reichard's answer.

My rough idea for getting an approximate indication of which machine will be fastest is to combine two aspects:

(1) Estimate how large a share of its processing resources a machine would allocate to my task if it were run there, which I call "relative speed": if the machine is idle, my task executes at relative speed 1 by definition. If the current load is L and the number of cores is C, then after I submit my task the load will be L+1, and the relative execution speed will be R = min(C / (L+1), 1), with the min operator because the process cannot utilize more than one core. – Is this calculation correct as an indicator of achievable processing speed, relative to the optimum possible on the machine given its current load? And is the relevant number C the number of physical cores or the number of virtual cores, the latter being twice as large because of hyperthreading?

(2) Estimate how fast the task would execute on the different machines, relative to each other, if they were all idle, based on an indicator of machine performance. This should not be raw clock speed, but an appropriately chosen benchmark.

The machine is then picked based on which one has the largest product of the numbers (1) and (2).
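
For concreteness, here is a minimal sketch of this selection scheme in Python. All host names, load figures, and benchmark scores below are invented; in practice, the load and core count would be queried from each machine (e.g. via uptime) and the benchmark score measured once per machine.

    # Hypothetical per-machine data: 1-minute load average, core count,
    # and an idle-machine benchmark score (e.g. GFLOPS). All invented.
    machines = {
        "hostA": {"load": 3.5, "cores": 4,  "bench": 25.0},
        "hostB": {"load": 9.0, "cores": 16, "bench": 40.0},
        "hostC": {"load": 0.2, "cores": 2,  "bench": 15.0},
    }

    def expected_speed(m):
        # (1) Relative speed R = min(C / (L+1), 1): the +1 accounts for the
        # submitted task, the min() because the task is single-threaded.
        r = min(m["cores"] / (m["load"] + 1), 1.0)
        # (2) Idle-machine performance indicator (here a benchmark score).
        return r * m["bench"]

    # Pick the machine with the largest product of (1) and (2).
    best = max(machines, key=lambda name: expected_speed(machines[name]))
    print(best)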

A. Donda

Posted 2013-10-04T15:31:00.493

Reputation: 660

Deleting and re-posting the question is doing more harm than good. You should have edited the original and waited a bit, since it automatically goes into a reopen queue. That said, since it's now deleted already and this question looks fine to me, I'd say let it stay. – slhck – 2013-10-04T15:38:13.577

Mainframes used to manage tasks like this. They could place a job into a priority status if needed. Your question points exactly to handling these tasks the way a mainframe would. – Ramhound – 2013-10-04T15:39:15.083

@slhck, I understand, sorry. But I also have to say that closing a question without a single comment as to what's wrong is also quite harmful. It discourages participation by non-experts, and it does not help to improve questions. – A. Donda – 2013-10-04T15:41:19.683

@Ramhound, yes I know, but in this case I have to do the management myself. Do you possibly have any pointers on the kind of heuristics such management systems use? – A. Donda – 2013-10-04T15:42:15.030

@A.Donda - The reason your previous question was closed was given. You even quoted the exact reason. Your original question was going to lead to opinions which are useless since everyone has one. – Ramhound – 2013-10-04T15:48:26.793

But what exactly was "opinion-based" about this question? To me it appears that "which machine will execute fastest" is a matter of fact. Possibly unknown facts, but still facts. I can't see anything "opinion-based" there. Apparently I'm wrong, but simple courtesy would suggest to give me a hint. Btw. I happen to answer Matlab-related questions on SO, and even though a lot of them are not well posed, I still do my best to help the poster out. I don't see why people are not willing to extend the same niceness to me. – A. Donda – 2013-10-04T15:51:45.183

@A.Donda - This question is 10x better than your previous one... I read your previous one... it needed some work. – Ramhound – 2013-10-04T16:00:09.777

Answers

This isn't going to have a definite answer, mainly due to the way computers work, but I'll try to give some guidelines for figuring out which machine will be fastest.

I'll analyze your statements in order to tell you what you can and cannot figure out.

1. Data

In your first paragraph you state that all data is read into memory at startup. This is good for speed, as in terms of bandwidth, memory is second to none. If your program used a disk (whether an HDD, an SSD, or a pen drive), that disk could be a bottleneck for your program's running speed.

This is due to transfer speed. RAM has almost direct access to the processor. Disks have to pass through a connection with much lower bandwidth than RAM, and in the case of HDDs, there is also the time needed to fetch and store the data to consider.
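
To get a feel for the difference on a concrete machine, here is a rough Python/NumPy sketch, not a rigorous benchmark; /tmp/data.npy is an arbitrary scratch path, and note that the OS page cache can make the disk pass look faster than the disk really is.

    import time
    import numpy as np

    data = np.random.rand(10_000_000)   # ~80 MB of doubles, held in RAM

    t0 = time.perf_counter()
    total = data.sum()                  # one RAM-bound pass over the data
    t_ram = time.perf_counter() - t0

    np.save("/tmp/data.npy", data)      # arbitrary scratch path
    t0 = time.perf_counter()
    total2 = np.load("/tmp/data.npy").sum()  # disk read plus the same pass
    t_disk = time.perf_counter() - t0

    # Caveat: right after writing, the file likely sits in the page cache;
    # a fair disk measurement requires dropping caches first (root only).
    print(f"RAM pass: {t_ram:.3f} s, disk + pass: {t_disk:.3f} s")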

2. OS

The OS used has some influence on speed, but it can be considered residual.

3. Processor Architecture (or Instruction Sets)

This is a relevant aspect. Although you stated the machines use amd64 processors, there might be some differences in the instruction sets they support.

Consider, for instance, the Opteron and the Sempron series of processors. A key difference between them is that the latter has SSE3 support.

SSE3 provides newer and more efficient ways to deal with data (specifically array operations), operations that with earlier instruction sets had to be done using less efficient sequences of instructions.

So, in this aspect, newer processors are faster by design, as they support more efficient instruction sets.

4. System load

This is the final nail in the coffin, as you might put it. You can't calculate the load on a computer in a linear way unless you know how each program works. That said, you might have 10,000 processes running that produce only a residual load, or a single multi-threaded process consuming all processor time.

However... let's analyze this further. Adding a process implies adding some bookkeeping data to the underlying system, so that it knows the process exists and how much of the processor's time should be made available to it. In this respect, fewer processes are better for speed, as the kernel/processor can better decide how much time to make available to each.

Another thing to consider is the priority the kernel gives to the processes. The processes with the highest priority will occupy most of the processor's time.

You can ultimately conceive of a system that gives your process maximum priority and leaves all other processes waiting indefinitely until your process halts. In that case, your process's speed is determined by the CPU alone.
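
To make the load and priority notions concrete, here is a small Python sketch (Unix only). Nothing in it is specific to the question's machines; the niceness call is left commented out because it changes the calling process's own priority.

    import os

    # Kernel-reported 1-, 5- and 15-minute load averages.
    load1, load5, load15 = os.getloadavg()

    # Logical core count; with hyperthreading this is twice the number
    # of physical cores.
    cores = os.cpu_count()
    print(f"1-min load: {load1:.2f} on {cores} logical cores")

    # A process may lower its own priority by raising its niceness;
    # a negative increment (i.e. raising priority) requires root.
    # os.nice(10)  # would make this process yield to the others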

5. CPU

Let's consider your points: cores and CPU clock speed.

It is conceivable that the kernel shifts some processes to other cores. In the limiting case, your process gets a full core dedicated to it. In this respect, more cores enable more processes to run simultaneously (and, by consequence, faster).

I don't know much about multi-threading, so I'll leave that for someone who does.

Clock speed is not a clear indicator of processor efficiency. To ground this, I present the "battle" between the use of Intel and PowerPC processors in Apple hardware. Apple argued that the main reason for preferring PowerPC processors over Intel, despite the increasing clock speeds of Intel processors, was that PowerPC performed better, because it could handle more operations per second than its rival. In the end, Apple opted for Intel due to power concerns and other economic factors.

FLOPS (for FLoating-point Operations Per Second) is a measure of computer performance, especially in fields of scientific computing that make heavy use of floating-point calculations; it is similar to the older, simpler measure of instructions per second. This could be a better measure than clock speed if your work relies heavily on these kinds of operations, which, since you are using Matlab, is a real possibility.

It isn't, however, a widely published quantity (as it depends on what kind of operations you are executing). I found some benchmarks on Overclock.net. I point these two out:

  • AMD Phenom I x4 9850 @ 2.83 GHz, RAM 754 MHz 5-5-5-15: 27.5 GFLOPS
  • AMD Phenom II x2 555 @ 4.12 GHz, RAM 1000 MHz 5-5-5-18: 26 GFLOPS

As you can see, even a processor with a considerably higher clock speed can perform worse than one clocked much lower.
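
If no published figures exist for your exact CPUs, a crude micro-benchmark can stand in. Here is a sketch in Python/NumPy, not a calibrated benchmark; NumPy's BLAS is typically multi-threaded, so to approximate a single-threaded Matlab task you may want to pin it to one thread, e.g. by setting OMP_NUM_THREADS=1 before starting Python.

    import time
    import numpy as np

    # Rough GFLOPS estimate via dense matrix multiplication: multiplying
    # two n-by-n matrices costs about 2 * n**3 floating-point operations.
    n = 2048
    a = np.random.rand(n, n)
    b = np.random.rand(n, n)

    t0 = time.perf_counter()
    _ = a @ b
    elapsed = time.perf_counter() - t0

    print(f"~{2 * n**3 / elapsed / 1e9:.1f} GFLOPS")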

Bottom line

There isn't a clear formula you can use to estimate running time, due to the endless number of factors involved in the execution of a program. There are some rules of thumb, which I sum up below (I've tried to sort them by importance):

  • It is faster to have all (or the most-used) data in RAM (data storage is the main bottleneck in any process).
  • The fewer processes running on the same machine, the better.
  • More cores are better.
  • Newer processors are faster by design.
  • CPU clock speed is only a rough indicator of speed (RAM latency and other factors intervene).
  • You can ask the machine to give your process the highest priority. The machine will try to allocate more time to your process (I say try because the machine has no obligation to give your process more time; it tries to balance everything out, which is why the 2nd item exists).

I'm still a little hesitant to give an all-out formula, but I believe I can give you a very, very approximate one, based on your comments. I can't speak about multi-threading, so I will consider all cores to be independent. For this demonstration the following is assumed:

  • All processes have the same load.
  • Processes aren't waiting for input/output.
  • Memory speed is assumed not to be a factor.
  • All processes are single-threaded. They have the same priority.

With this, the bottleneck will be the CPU's capacity. So, for any CPU, the relative speed available to any one process would be the following:

R = min(C / N, 1)

with R being the relative speed, C the number of cores in the CPU, and N the number of active processes (including yours, matching the L+1 in your question). This assumes, however, that the system can distribute all the load evenly and perfectly across the cores, which might not always be true.

Take this relative speed and multiply it by the unit of measurement (UM) of your choice (e.g., GHz or GFLOPS), and you get a measure of how "fast" the process might run:

Speed = R * UM

So, with this, your formula is correct. But please, pretty please, note the number of assumptions I had to make. This is far from a real case; it will not give you an exact quantity, but rather an educated guess.
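
As a toy illustration of Speed = R * UM (all figures invented):

    # Speed = R * UM with R = min(C / N, 1); all figures are invented.
    def speed(cores, active_procs, benchmark_gflops):
        r = min(cores / active_procs, 1.0)   # relative speed
        return r * benchmark_gflops

    # 8 cores, 12 active processes, 30 GFLOPS when idle:
    print(speed(8, 12, 30.0))  # ~20 GFLOPS: about 2/3 of idle speed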

Your second point is essentially the same as the first (in fact, you answer it in your question). In short, it is the machine's performance indicator (the case R = 1), so the question here is how to choose one. This is something you should analyze yourself: you could use GHz times the number of cores, or GFLOPS, or some combination of the two.

There are programs (which I should have remembered earlier, when writing the first version of this answer) that can run benchmarks on CPUs, and from those you can get values that might help your decision. I have SiSoftware Sandra (which, in my rather old version, has a Processor Arithmetic benchmark), but I suppose there are others.

In this case, multiplying (1) and (2) wouldn't make any difference unless you used different units of measurement.

Doktoro Reichard

Posted 2013-10-04T15:31:00.493

Reputation: 4 896

Hello Doktoro Reichard, thanks a lot for your very detailed answer! I realize now that a simple measure like clock speed is not adequate to quantify the computing power of a machine, but a lot of other factors complicate matters, even if all the machines have processors from the same "family", as your example points out. – A. Donda – 2013-10-05T10:07:19.863

But it still appears to me that it should be possible to answer my question at least approximately by combining (1) a measure based on the utilization state of the machine (system load relative to number of cores) and (2) a measure characterizing the computing power of the machine (but rather a well-chosen benchmark instead of clock speed). Do you agree? I edited my question to reflect this. – A. Donda – 2013-10-05T10:08:56.557

Like I said, there isn't a clear formula to calculate this, which is why I gave those rules. The only way you could have such a formula would be to either 1. assume that all programs use the same quantity of resources or 2. know how much load each program generates. But the 1st relies on an assumption that might be incorrect. The 2nd can be more reliable, as you can know how much processing each program will need, but it fails on the fact that each process might have a different priority, and as such, execution times will depend on the priorities given. – Doktoro Reichard – 2013-10-05T18:47:19.787

Modern processors also aren't limited to a single process at a time, as they can put a process on hold while another is being processed; this complicates matters. I would, however, wait for more answers, as this is an interesting problem. – Doktoro Reichard – 2013-10-05T18:48:44.807

The computers are mainly used for Matlab compute tasks, so one can assume that the observed load is generated by such long-running, CPU-intensive processes. Process priorities are not (and should not) be manipulated, so one can assume that the processes all run with the same priority. Matlab has some multi-threaded implementations of built-in functions, but there are only a few of them that are active only a small part of the time, so one could assume that the processes are effectively single-threaded. Would this help to answer my question? – A. Donda – 2013-10-07T12:57:24.240

I understand the caveats, but still the problem is that these multiple "the more the better"-type rules in parallel do not allow me to pick one machine – which is what I have to do. – A. Donda – 2013-10-07T12:58:41.730

I have tried to write a formula (it ended up being the same as yours), but note the caveats I had to assume. If you could benchmark the machines (see enclosed link), you would then get several performance values that you could weigh to form a single unit of performance measurement. – Doktoro Reichard – 2013-10-07T15:18:43.187

Again, thanks a lot! I feel a bit guilty for having coerced you to give an answer you don't consider too meaningful. ;-) The benefit for me is the acknowledgement that my way of thinking hasn't been completely wrong, just very very idealized. I'm planning on running some experiments to see whether there's at least some rough agreement between such an idealized approach and the actual runtime of my tasks. – If you don't mind, I'll leave the question open for a little while longer to see whether someone else chimes in, though I don't expect it. – A. Donda – 2013-10-07T18:37:16.350