I'm managing a compute node in an HTC cluster. The node is a dual-Xeon machine with 56 cores / 112 threads, and the typical workload consists of many instances of single-threaded Monte Carlo simulation jobs. Benchmarks show that throughput scales nicely with the number of jobs up to about 56, with some non-linearity due to turbo boost frequencies that are not sustained for large numbers of active jobs. All of this makes sense to me, and I'd say it's the expected behavior.

The thing I don't fully understand is that scaling is almost completely lost for higher job counts. Going to 64 jobs and beyond, up to 112, the throughput remains constant: the benefit of running more jobs in parallel is completely offset by the longer duration of each single job. I know that scaling is far from linear with hyperthreading, but null scaling surprised me a bit.

Based on my extremely limited knowledge of the working principle of hyperthreading, my guess is that it might be effective for running two threads of the same process but not for running two separate processes. I'd like some confirmation of this, to definitively rule out the hypothesis of a malfunction and possibly disable hyperthreading.
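
For concreteness, here is a minimal sketch of the kind of scaling benchmark I ran (not my actual harness; the CPU-bound `dummy_job` below just stands in for one single-threaded simulation):

```python
# Minimal sketch of a throughput-vs-job-count benchmark. dummy_job is a
# CPU-bound placeholder for one single-threaded Monte Carlo simulation.
import multiprocessing as mp
import time

def dummy_job(_):
    x = 0.0
    for i in range(10_000_000):  # fixed amount of pure-compute work
        x += i * 1e-9
    return x

def throughput(n_jobs):
    # Run n_jobs copies in parallel; report jobs completed per wall-clock second.
    start = time.perf_counter()
    with mp.Pool(processes=n_jobs) as pool:
        pool.map(dummy_job, range(n_jobs))
    return n_jobs / (time.perf_counter() - start)

if __name__ == "__main__":
    for n in (14, 28, 56, 64, 84, 112):
        print(f"{n:4d} jobs: {throughput(n):6.2f} jobs/s")
```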

Nicola Mori

2 Answers

Simplified: hyper-threading takes advantage of the fact that in many process threads there is idle time when the core would otherwise sit waiting for other tasks to complete. By switching between two threads, the processor core isn't left idle while one thread waits; it can do something useful in the other thread. See https://www.intel.com/content/www/us/en/gaming/resources/hyper-threading.html

In certain workloads there is very little such wait time, so a single thread already fully loads a single core. Sharing that core with another thread then brings no overall benefit.
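
Here's a rough sketch of that difference (illustrative only, not a rigorous benchmark: the loop sizes are arbitrary, and the stall-heavy loop's setup is included in the timing). A pure-compute loop barely gains from doubling the process count past the physical cores, while a cache-miss-heavy loop has stall time the second hardware thread can fill:

```python
# Illustrative sketch: compare how a pure compute loop and a cache-miss-heavy
# loop scale from N to 2N processes, where N is the number of physical cores.
import multiprocessing as mp
import random
import time

def compute_bound(_):
    # Arithmetic keeps the core's execution units busy; almost no stalls.
    x = 1.0
    for _ in range(5_000_000):
        x = x * 1.0000001 + 1e-9
    return x

def stall_bound(_):
    # Chasing random links through a big list misses cache constantly,
    # leaving the core idle while it waits on DRAM.
    n = 2_000_000
    nxt = list(range(n))
    random.shuffle(nxt)
    i = 0
    for _ in range(5_000_000):
        i = nxt[i]
    return i

def wall_time(fn, n_procs):
    start = time.perf_counter()
    with mp.Pool(n_procs) as pool:
        pool.map(fn, range(n_procs))
    return time.perf_counter() - start

if __name__ == "__main__":
    phys = mp.cpu_count() // 2   # assumes SMT2, i.e. 2 threads per core
    for fn in (compute_bound, stall_bound):
        t1 = wall_time(fn, phys)
        t2 = wall_time(fn, 2 * phys)
        # Throughput ratio: 2.0 = perfect scaling, 1.0 = no SMT benefit.
        print(f"{fn.__name__}: SMT speedup {2 * t1 / t2:.2f}x")
```

On the compute-bound loop the ratio tends toward 1.0x; the stall-heavy chase usually lands noticeably above it.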

Bob

Bad analogy: Hyper-Threading, or SMT in general, is like a time-share vacation property. Scheduling all 52 weeks of the year is fine; everyone gets the place to themselves. Get a couple more people in on the scheme and it might even still work, making use of the weeks the facilities would otherwise sit idle due to cancellations. But more double booking will not magically turn one house into two.

CPU cores have several of each of the various types of execution units: integer, floating point, and others. But only so many. (Block diagrams of the core design show these; see for example Cascade Lake.) Superscalar architectures already try to wring multiple instructions per clock out of a single thread. So while another hardware thread might borrow an unused integer unit for a cycle, it has to share. And memory is always too slow; very likely DRAM latency and bus bandwidth are the real limiting factors.
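
You can see that sharing directly on Linux by pinning a pair of compute-bound processes either to the two hyperthreads of one physical core or to two separate cores (a rough sketch; it assumes the usual sysfs topology files and SMT2):

```python
# Rough Linux-only sketch: time two compute-bound processes pinned to the
# two hyperthreads of one physical core vs. pinned to two separate cores.
import multiprocessing as mp
import os
import time

def busy(cpu):
    os.sched_setaffinity(0, {cpu})  # pin this worker to one logical CPU
    x = 1.0
    for _ in range(20_000_000):     # pure arithmetic, essentially no stalls
        x = x * 1.0000001 + 1e-9
    return x

def timed_pair(cpus):
    start = time.perf_counter()
    procs = [mp.Process(target=busy, args=(c,)) for c in cpus]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return time.perf_counter() - start

def siblings_of(cpu):
    # Logical CPUs sharing one physical core, e.g. "0,56" or "0-1".
    path = f"/sys/devices/system/cpu/cpu{cpu}/topology/thread_siblings_list"
    with open(path) as f:
        return [int(t) for t in f.read().replace("-", ",").split(",")]

if __name__ == "__main__":
    sib = siblings_of(0)
    other = next(c for c in range(os.cpu_count()) if c not in sib)
    print(f"two siblings of one core {sib[:2]}: {timed_pair(sib[:2]):.2f}s")
    print(f"two separate cores {[sib[0], other]}: {timed_pair([sib[0], other]):.2f}s")
```

With a loop like this, the same-core pair typically takes noticeably longer than the two-core pair, because the two hyperthreads are competing for the same execution units.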

Testing gradually increasing job counts is an excellent way to see the diminishing returns of SMT, especially with your compute-heavy HTC workload, which is probably quite stable and predictable. Flat throughput once you go past the physical core count (64 jobs is roughly 115% of your 56 cores) is about what I would expect. No point in going higher.

John Mahowald
  • Thanks for the explanation; if I understand correctly, it confirms my naive intuition. I'll keep hyperthreading enabled and limit the maximum number of simultaneous jobs to 56. – Nicola Mori Mar 02 '22 at 07:09