Why does my python script run much faster when using 5 CPUs rather than 12?

0

I am running a TensorFlow/Keras machine-learning Python script on a computer which has 12 CPUs.

When I execute taskset -c cpu_list main.py in my Ubuntu terminal, I find that the optimal number of CPUs for the script is 5.
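For reference, the same pinning can be done from inside the script itself; a minimal sketch on Linux (the CPU indices 0-4 are just an example choice of five logical CPUs):

    import os

    # Restrict the current process (pid 0) to logical CPUs 0-4,
    # roughly equivalent to launching with: taskset -c 0-4 python main.py
    os.sched_setaffinity(0, {0, 1, 2, 3, 4})

    print("Allowed CPUs:", sorted(os.sched_getaffinity(0)))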

The difference is fairly significant: going from 12 CPUs down to 5 cuts the runtime to roughly a third.

Moreover, running on 1 CPU has a runtime similar to running on all 12.

I am confused as to why this is the case. Why doesn't using all 12 give the fastest runtime, since that would leave more CPUs available for computation?

Matthew

Posted 2019-08-02T09:48:15.900

Reputation: 103

Could be hitting a thermal limit and throttling. What CPU? – Mokubai – 2019-08-02T11:19:53.080

@Mokubai 12 of Intel Xeon CPU E5-2620 at 2.4GHz – Matthew – 2019-08-02T11:45:35.690

Answers

6

You have 12 threads, not 12 full cores.

https://ark.intel.com/content/www/us/en/ark/products/64594/intel-xeon-processor-e5-2620-15m-cache-2-00-ghz-7-20-gt-s-intel-qpi.html

[Intel ARK specification table for the Xeon E5-2620, showing 6 cores and 12 threads]

So beyond 6 CPUs allocated, you will be sharing physical cores between threads. If a task only exercises a particular subset of a core's execution units, you may see no performance benefit, because the threads will effectively be waiting on each other to do their work. This is Hyper-Threading, and it works best when tasks have a diverse mixture of floating-point and integer calculations.
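TensorFlow also maintains its own thread pools and, by default, sizes them to all visible logical CPUs. A minimal sketch of capping them to the physical core count instead, using the TensorFlow 2.x tf.config.threading API (the exact numbers that work best are workload-dependent and something you would have to measure):

    import tensorflow as tf

    # Cap TensorFlow's thread pools so the 6 physical cores are not
    # oversubscribed with 12 hyper-threads. Must be called before
    # TensorFlow executes any ops.
    tf.config.threading.set_intra_op_parallelism_threads(6)  # threads inside a single op, e.g. a matmul
    tf.config.threading.set_inter_op_parallelism_threads(1)  # independent ops run concurrently

    print(tf.config.threading.get_intra_op_parallelism_threads())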

A fully loaded CPU will also be at its thermal limits, so it will not Turbo Boost as high, and you may lose performance that way.

By fully loading all 6 cores/12 threads you could also be starving your system of time to do other tasks, such as loading or saving data from disk; you could be hitting a bandwidth limit copying data to or from memory; and so on.

If you do not fully understand the task you are performing and how it stresses the limits of your system, then just throwing more CPUs at the problem is not necessarily going to help. The bandwidth of memory or CPU-GPU links can be enormous for "normal" computer tasks, but completely inadequate for others such as neural-network training.

One reason we shifted these tasks to GPUs is their massive number of small, fast cores, but their much higher memory bandwidth compared to a general-purpose CPU is also a key factor that is often underestimated. A modern GPU has a memory bandwidth of around 256 GB/s; your CPU has about 42 GB/s, which shared across 12 threads is only some 3.5 GB/s each.

So there are many limits you could be hitting beyond 5 cores.
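If you want to find that knee empirically rather than guess, a rough sweep like the sketch below will show you where adding CPUs stops helping (it assumes your script is main.py, that a short but representative run is possible, and that taskset is on the PATH):

    import subprocess
    import time

    # Time the same script pinned to an increasing number of logical CPUs.
    for n in (1, 2, 4, 5, 6, 8, 12):
        cpus = ",".join(str(i) for i in range(n))
        start = time.perf_counter()
        subprocess.run(["taskset", "-c", cpus, "python", "main.py"], check=True)
        print(f"{n:2d} CPUs -> {time.perf_counter() - start:.1f} s")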

Quite often you see people throwing money at an "insane beast" of a machine when they don't know where their bottlenecks are; in the process they end up with a machine that has some very specific and worrying limitations they could have avoided, while also saving a lot of money. Xeon processors might help in a lot of cases, but in others their large core counts and resulting lower clock speeds can actually be worse.

Mokubai

Posted 2019-08-02T09:48:15.900

Reputation: 64 434