What is better for a Java Web Application: more CPU cores or a higher clock speed?

Question

I'm not sure whether serverfault is the right place to ask this, but I wonder what choice you would make if you had to select a new CPU type for your Java Web Application:

a) a CPU with 32 cores and clock speed 2.5 Ghz

or

b) a CPU with 8 cores but clock speed of 3.8 Ghz

Given the fact that each of the web application's incoming HTTP requests is served by a free Java thread, it might make sense to choose a), because you can process four times more HTTP requests at the same time. However, on the other hand, CPU b) can finish the processing of a single HTTP request much faster...

What do you think?

Sidenotes:

it has to be a physical machine, VMs or cloud solutions are not an option in this case
RAM is not important, the server will have 512GB of RAM in the end
Caching: the Java web application features an extensive caching framework, so the choice is really on the CPUs.

Why not simply try both? (e.g., are you actually looking at _purchasing hardware_, or is this a VM?) You also omit RAM (as noted below) _and cache_, which can be critical for this type of application. — chrylis -cautiouslyoptimistic-, Aug 28 '20 at 00:58
Hi, my client needs a physical machine, VMs are not an option. RAM is not important, both machines will have 512GB of RAM and the software features an extensive caching framework, so the choice is really on the CPUs. — bzero, Aug 28 '20 at 07:09
Clock speed is not a very useful measure (e.g. think about the difference between a modern 2.5 Ghz CPU and a 10 year old 2.5 Ghz CPU). A better measure is to use a generic benchmark (both single threaded and multi threaded) for each CPU you are comparing (e.g. [PassMark](https://www.cpubenchmark.net/)). An even better measure is to benchmark it using the actual code you will be running. — Jon Bentley, Aug 28 '20 at 08:03
What is it actually doing? Many (most?) web apps do a lot of database queries, and most of the time is spent waiting for those queries to execute rather than on actual CPU work. If that is your case, is the database server running on the same box or a separate one? What kind of database is it? And when it comes to databases, the most common bottleneck is usually I/O rather than CPU. RAM helps, of course, but you said you would have plenty. And of course, the best option could be to have many servers rather than a single one. It scales better in the long run if you expect lots of traffic... — jcaron, Aug 28 '20 at 10:06
@jcaron: in this case, only a few small database queries are performed; they are super quick and negligible compared to the overall transaction time. — bzero, Aug 28 '20 at 12:21
@JonBentley: I have official CPU benchmarks from https://www.cpubenchmark.net/ , the CPU with the higher clock speed has 18'000 , the CPU with 32 cores has 48'000 . But as I said, those 32 cores are only useful, if there are really 32 requests in parallel at most of the time, I think... — bzero, Aug 28 '20 at 12:23
Hyperthreading might be very relevant. Sun developed the T1 CPU back in 2005 with such web apps in mind. It ran 4 threads per core, acknowledging that Java code spends a lot of CPU time waiting on cache and RAM. x86 generally sticks to 2 threads/core. — MSalters, Aug 28 '20 at 15:20
@bzero That's why I said you need to look at both the single threaded and the multi threaded benchmarks. The point is to get a somewhat accurate comparison between the two CPUs; this is a separate issue to whether or not you go with more cores vs less cores. If you look only at clock speed, you are comparing apples with oranges. — Jon Bentley, Aug 31 '20 at 09:34

score 30 · Answer 1 · answered Aug 27 '20 at 19:45

tldr; The real answer is probably "more RAM", but as you've asked your question the answer is, of course, it depends. Then again, 32 cores @2.5Ghz will almost certainly beat 8 cores @3.8Ghz - it's 4 times more cores vs. 1.5 times faster clock. Not a very fair fight.

A few factors you should consider are transaction response time, concurrent users and application architecture.

Transaction response time If your Java application responds to most requests in a few milliseconds then having more cores to handle more concurrent requests is probably the way to go. But if your application mostly handles longer running, more complex transactions it might benefit from faster cores. (or it might not - see below)

Concurrent users and requests If your Java application receives a large number of concurrent requests then more cores will probably help. If you don't have that many concurrent requests then you might just be paying for a bunch of extra idle cores.

Application architecture Those long running requests I mentioned won't benefit much from faster cores if the app server spends most of the transaction time waiting for responses from web services, databases, Kafka/MQ/etc. I've seen plenty of applications with 20-30 second transactions that only spend a small portion of their response time processing in the application itself, and the rest of the time waiting for responses from databases and web services.

You also have to make sure the different parts of your application fit together well. It doesn't do you much good to have 32 or 64 threads, each handling a request, all queuing up waiting for one of 10 connections in a JDBC pool, aka the pig in a python problem. A bit of planning and design now will save you a lot of performance troubleshooting later.
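To make the pool mismatch concrete, here is a minimal sketch. It does not use a real JDBC pool: a `Semaphore` with 10 permits stands in for the connection pool, and the class name, the 5 ms "query" and the request count are all made up for illustration. It records how many of the 32 request threads ever hold a connection at the same time.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demo: many request threads share a small "connection pool".
final class PoolBottleneckDemo {

    /** Runs `requests` tasks on `workerThreads` threads and returns the peak
     *  number of tasks that held a connection permit at the same time. */
    static int peakConcurrentHolders(int workerThreads, int poolSize, int requests) {
        Semaphore connections = new Semaphore(poolSize); // stand-in for the JDBC pool (assumption)
        AtomicInteger holding = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        ExecutorService workers = Executors.newFixedThreadPool(workerThreads);

        for (int i = 0; i < requests; i++) {
            workers.execute(() -> {
                try {
                    connections.acquire();               // blocks while all permits are taken
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                try {
                    peak.accumulateAndGet(holding.incrementAndGet(), Math::max);
                    Thread.sleep(5);                     // simulated query time (made up)
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    holding.decrementAndGet();
                    connections.release();
                }
            });
        }

        workers.shutdown();
        try {
            if (!workers.awaitTermination(30, TimeUnit.SECONDS)) {
                throw new IllegalStateException("workers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return peak.get();
    }

    public static void main(String[] args) {
        int peak = peakConcurrentHolders(32, 10, 200);
        System.out.println("peak concurrent connection holders: " + peak);
    }
}
```

However many worker threads (or cores) you add, the peak can never exceed the pool size, so the remaining threads just wait in line.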

One last thing - what CPUs could you possibly be comparing? The cheapest 32 core 2.5 GHz CPU I can find costs at least 3 or 4 times more than any 8 core 3.8 Ghz CPU.

Good answer, thank you! The application has no external interfaces, so it really processes an incoming request and can process it 100%, there is never a waiting time involved. So the "transaction" usually takes about 100 to 200ms per thread and request. — bzero, Aug 28 '20 at 07:11
Great answer - ultimately it depends if you want to optimize for total throughput or individual user experience. If your server is rarely so busy that all cores are in use, then fewer faster cores will lead to a more responsive user experience (assuming single threaded processing of a request). Alternatively, if you expect the server to often spike to the point of queuing requests, more cores will deliver more throughput overall. If your bottleneck is memory, IO, or something else (e.g. back end database), then it probably won't matter either way. — ptyx, Aug 28 '20 at 17:01

score 9 · Answer 2 · answered Aug 27 '20 at 14:09

Assuming your Java web server is appropriately configured, you should go for more cores.

There are still dependencies, such as semaphores and concurrent accesses, that will leave some threads waiting, whatever the number of cores or their speed. But it's better when it's managed by the CPU (cores) than by the OS (multi-threading).

And anyway, 32 cores @2.5Ghz will handle more threads and better than 8 cores @3.8Ghz.

Also, the heat produced by the CPU depends on the frequency (among other things) and this is not linear. Meaning, 3.8Ghz will generate more than 3.8/2.5 (roughly 1.5) times the heat of 2.5Ghz (this has to be confirmed based on your exact CPU types/brands... many sites offer detailed information).

score 7 · Answer 3 · answered Aug 28 '20 at 13:10

You tell us that a request takes about 100-200 ms to execute, and that it's mostly processing time (though it's difficult to separate what is actual CPU execution from what is in reality memory access), with very little I/O, waits for databases, etc.

You would have to benchmark how long it actually takes on each of the two CPUs, but let's suppose it takes 150 ms on the slower CPU (with 32 cores) and 100 ms on the faster one (with only 8 cores).

Then the first CPU would be able to handle up to 32/0.15 = 213 requests per second.

The second CPU would be able to handle up to 8/0.1 = 80 requests per second.

So the big question is: how many requests per second do you expect? If you are nowhere near dozens of requests per second, then you don't need the first CPU, and the second one will give you faster execution time on each request. If you do need over 100 requests per second, then the first one makes sense (or it probably makes even more sense to have more than one server).
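The arithmetic above can be written down as a tiny sketch. The class and method names are made up for illustration, and the 150 ms and 100 ms service times are the assumed figures from this answer, not measurements.

```java
// Back-of-the-envelope capacity: cores divided by CPU-seconds per request.
final class ThroughputEstimate {

    /** Upper bound on requests per second for purely CPU-bound,
     *  single-threaded requests (ignores queuing and contention). */
    static double maxRequestsPerSecond(int cores, double secondsPerRequest) {
        return cores / secondsPerRequest;
    }

    public static void main(String[] args) {
        // Assumed service times from the answer, not benchmark results.
        System.out.printf("32 cores @ 150 ms: %.0f req/s%n", maxRequestsPerSecond(32, 0.150));
        System.out.printf(" 8 cores @ 100 ms: %.0f req/s%n", maxRequestsPerSecond(8, 0.100));
    }
}
```

Plugging in your own measured service times is the quickest way to see on which side of the trade-off your expected load falls.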

Note that these are very rough, back-of-the-envelope estimates. The only way to know for sure is to benchmark each of the servers with a real-life load. As stated above, fast CPUs or CPUs with lots of cores can quickly become starved for memory access. The size of the various CPU caches is very important here, as well as the "working set" of each request. And that's considering truly CPU-bound work, with no system calls, no shared resources, no I/O...

score 3 · Answer 4 · answered Aug 28 '20 at 17:00

Faster cores are generally better than more cores. I.e., if two processors have the same price, memory bandwidth, and multi-threaded benchmark scores, prefer the one with fewer, faster cores.

More cores only help if you have enough concurrent requests.

Faster cores improve both total throughput and the response time for each request.
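A rough way to see where "enough concurrent requests" begins is a toy model: a burst of requests arrives at the same instant, each one is CPU-bound and single-threaded, and the cores work through them in waves. The sketch below is an illustration under those assumptions; it borrows the assumed 150 ms (32 cores) and 100 ms (8 cores) service times from Answer 3, and all names are hypothetical.

```java
// Toy model: when does the last request of a simultaneous burst finish?
final class BurstLatencyModel {

    /** Milliseconds until the last of `requests` simultaneous, CPU-bound,
     *  single-threaded requests has finished on `cores` cores. */
    static long lastCompletionMillis(int requests, int cores, long millisPerRequest) {
        long waves = (requests + cores - 1) / cores; // ceil(requests / cores)
        return waves * millisPerRequest;
    }

    public static void main(String[] args) {
        int[] bursts = {4, 8, 16, 64};
        for (int n : bursts) {
            long manySlow = lastCompletionMillis(n, 32, 150); // 32 cores, assumed 150 ms/request
            long fewFast = lastCompletionMillis(n, 8, 100);   // 8 cores, assumed 100 ms/request
            System.out.println(n + " requests: 32 cores -> " + manySlow
                    + " ms, 8 cores -> " + fewFast + " ms");
        }
    }
}
```

With these assumed numbers, the 8 faster cores win for bursts of up to 8 requests, and the 32 slower cores win once a burst no longer fits on 8 cores.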

Answer 5 · Johannes Pille · answered Aug 30 '20 at 07:28

Preliminary note
I'd like to second @PossiblyUsefulProbablyNot's definitely useful answer.

tldr; The real answer is probably "more RAM"

Especially this point.

Caveat
Not so much of an admin per se.
More of a software engineering perspective, maybe.

No alternative to measurement

What we know
So, the machine is

going to run an (Enterprise?) Java-based backend-application of sorts
publicly (within some sizeable context, anyway) expose an HTTP API handling client requests
presumably with some form of Database attached
is otherwise described as not very much I/O-bound
does not rely on the availability, latency or throughput of 3rd party services

Not all that vague a picture, the OP is painting. But at the same time far from enough data to give an answer pertaining to the OP's individual situation.
Sure, 32 cores at 2/3 the clock speed are likely to perform better than 1/4 of the cores at a comparatively small speed advantage. Sure, heat generated doesn't scale well with clock speeds above the 4GHz threshold. And sure, if I had to blindly put my eggs in one basket, I'd pick the 32 cores any day of the week.

What we don't know
Way too much, still.

However, beyond these simple truths, I'd be very skeptical of a hypothetical attempt at a more concrete and objective answer. If it is at all possible (and you have ample reason to remain convinced about ops per unit time being a valid concern), get your hands on the hardware you intend to run the system on, measure and test it, end-to-end.
An informed decision involves relevant and believable data.

OP wrote: RAM is not important

In the vast majority of cases, memory is the bottleneck.

Granted, the OP is primarily asking about CPU cores vs. clock speed and thus memory appears on the fringes of being off-topic.

I don't think it is, though. To me, it appears much more likely the question is based on a false premise. Now, don't get me wrong, @OP, your question is on-topic, well phrased and your concern obviously real. I am simply not convinced that the answer to which CPU would perform "better" in your use-case is at all relevant (to you).

Why memory matters (to the CPU)

Main memory is excruciatingly slow.
Historically, as compared to the hard drive, we tend to think of RAM as "the fast type of storage". In the context of that comparison, it still holds true. However, over the course of the recent decades, processor speeds have consistently grown at significantly more rapid a rate than has the performance of DRAM. This development over time has led to what is commonly known as the "Processor-Memory-Gap".

The Gap between Processor and Memory Speeds (source: Carlos Carvalho, Departamento de Informática, Universidade do Minho)

Fetching a cache line from main memory into a CPU register takes roughly 100 clock cycles. During this time, your operating system will report one of the two hardware threads in one of the 4 (?) cores of your x86 architecture as busy.
As far as the availability of this hardware thread is concerned, your OS ain't lying, it is busy waiting. However, the processing unit itself, disregarding the cache line that is crawling towards it, is de facto idle.
No instructions / operations / calculations performed during this time.

+----------+---------------+---------------------------------------------------------------------------------------------------+
|  Type of |    size of    |                                Latency due to fetching a cache line                               |
| mem / op |     cache     +--------+--------+------------+--------------------------------------------------------------------+
|          |   (register)  |  clock |  real  | normalized |                            now I feel it                           |
|          |               | cycles |  time  |            |                                                                    |
+----------+---------------+--------+--------+------------+--------------------------------------------------------------------+
|   tick   |      16KB     |    1   | 0.25ns |     1s     |             Dinner is already served. Sit down, enjoy.             |
|          | *the* 64 Bits |        |        |            |                                                                    |
+----------+---------------+--------+--------+------------+--------------------------------------------------------------------+
|    L1    |      64KB     |    4   |   1ns  |     4s     |               Preparations are done, food's cooking.               |
|          |               |        |        |            |                 Want a cold one to bridge the gap?                 |
+----------+---------------+--------+--------+------------+--------------------------------------------------------------------+
|    L2    |     2048KB    |   11   |  ~3ns  |     12s    |        Would you be so kind as to help me dice the broccoli?       |
|          |               |        |        |            |    If you want a beer, you will have to go to the corner store.    |
+----------+---------------+--------+--------+------------+--------------------------------------------------------------------+
|    L3    |     8192KB    |   39   |  ~10ns |     40s    |    The car is in the shop, you'll have to get groceries by bike.   |
|          |               |        |        |            |             Also, food ain't gonna cook itself, buddy.             |
+----------+---------------+--------+--------+------------+--------------------------------------------------------------------+
|   DRAM   |     ~20GB     |   107  |  ~30ns |    2min    |      First year of college. First day of the holiday weekend.      |
|          |               |        |        |            |         Snow storm. The roommates are with their families.         |
|          |               |        |        |            | You have a piece of toast, two cigarettes and 3 days ahead of you. |
+----------+---------------+--------+--------+------------+--------------------------------------------------------------------+

Latency figures of the Core-i7-9XX series chips (source: Scott Meyers, 2010)
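The 0.25ns tick in the table corresponds to a 4 GHz clock. To relate the figures to the two CPUs in question, here is a small sketch that converts a stall measured in nanoseconds into lost core cycles. It assumes, as a simplification, that a DRAM access costs about the same ~30ns on both machines regardless of core clock; the names are made up for illustration.

```java
// How many core clock cycles pass while a core waits on memory?
final class StallCycles {

    /** Core clock cycles that elapse during a stall of the given length. */
    static double cyclesDuring(double latencyNanos, double clockGHz) {
        return latencyNanos * clockGHz; // ns * (cycles per ns)
    }

    public static void main(String[] args) {
        double dramNanos = 30.0; // ~30 ns from the table; assumed equal for both CPUs
        System.out.printf("2.5 GHz core: ~%.0f cycles per DRAM access%n",
                cyclesDuring(dramNanos, 2.5));
        System.out.printf("3.8 GHz core: ~%.0f cycles per DRAM access%n",
                cyclesDuring(dramNanos, 3.8));
    }
}
```

Under that assumption the faster clock does not shorten the wait; it merely idles through more cycles per cache miss, which is why clock speed alone says little for memory-bound code.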

Bottom line If proper measurement is not an option, rather than debating cores vs. clock speed, the safest investment for excess hardware budget is in CPU cache size.

So, if memory is regularly keeping individual hardware threads idle, surely more ~cow bell~ cores is the solution?

In theory, if software was ready, multi/hyper-threading could be fast

Suppose you are looking at, e.g., your tax returns of the last few years, say 8 years of data in total. You are holding 12 monthly values (columns) per year (row).

Now, a byte can hold 256 individual values (as its 8 individual binary digits may assume 2 states each, which results in 2^8 = 256 distinct combinations of state). Regardless of the currency, 256 feels a little on the low end to be able to represent the upper boundary of salary figures. Further, for the sake of argument, let's assume the smallest denomination ("cents") to not matter (everybody earns whole integer values of the main denomination). Lastly, suppose the employer is aware of the salary gap between upper management and the regular workforce and hence keeps those selected few in an entirely different accounting system altogether.

So, in this simplified scenario, let's assume that twice the aforementioned amount of memory space, i.e. 2 bytes (or a "halfword"), when used in unsigned form, i.e. representing the range [0, 2^16 = 65536), suffices to express all employees' monthly salary values.

So in the language / RDBMS / OS of your choice, you are now holding a matrix (some 2-dimensional data structure, a "list of lists") with values of uniform data size (2-byte / 16 Bit).
In, say C++, that would be a std::vector<std::vector<uint16_t>>. I am guessing you'd use a vector of vector of short in Java as well.
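On the Java guess: Java's `short` is 16 bits wide but signed, so it tops out at 32767, and `char` is the language's only unsigned 16-bit primitive. The sketch below checks the value ranges mentioned above and shows one way to read a `short` as unsigned; the class and method names are made up for illustration.

```java
// Value ranges of 8-bit and 16-bit storage, and Java's 16-bit primitives.
final class HalfwordRange {

    /** Number of distinct values representable in `bits` bits (valid for bits < 31). */
    static int distinctValues(int bits) {
        return 1 << bits;
    }

    public static void main(String[] args) {
        System.out.println(distinctValues(8));            // 256
        System.out.println(distinctValues(16));           // 65536
        System.out.println(Short.MAX_VALUE);              // 32767 (short is signed)
        System.out.println((int) Character.MAX_VALUE);    // 65535 (char is unsigned)

        short raw = (short) 40000;                        // same 16 bits, negative when read as short
        System.out.println(Short.toUnsignedInt(raw));     // 40000
    }
}
```

For salary figures below 32768 a plain `short[][]` is enough; above that you would either store the bits in a `short` and read them with `Short.toUnsignedInt`, or simply use `int`.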

Now, here's the prize question:
Say you want to adjust the values for those 8 years for inflation (or some other arbitrary reason to write to the address space). We are looking at a uniform distribution of 16 Bit values. You will need to visit every value in the matrix once, read it, modify it, and then write it to the address space.
Does it matter how you go about traversing the data?

The answer is: yes, very much so. If you iterate over the rows first (the inner data structure), you will get near perfect scalability in a concurrent execution environment. Here, an extra thread, with half the data in one thread and the other half in the other, will run your job twice as fast. 4 threads? 4 times the performance gain.
If however you choose to do the columns first, two threads will run your task significantly slower. You will need approx 10 parallel threads of execution only to mitigate (!) the negative effect that the choice of major traversal direction just had. And as long as your code ran in a single thread of execution, you couldn't have measured a difference.

+------+------+------+------+------+------+------+
| Year |  Jan |  Feb | Mar  | Apr  | ...  | Dec  |
+------+------+------+------+------+------+------+
| 2019 | 8500 | 9000 | 9000 | 9000 | 9000 | 9000 | <--- contiguous in memory
+------+------+------+------+------+------+------+
| 2018 | 8500 | 8500 | 8500 | 8500 | 8500 | 8500 | <--- 12 * 16Bit (2Byte)
+------+------+------+------+------+------+------+
| 2017 | 8500 | 8500 | 8500 | 8500 | 8500 | 8500 | <--- = 24Byte per row, which fits
+------+------+------+------+------+------+------+
| ...  | 8500 | 7500 | 7500 | 7500 | 7500 | 7500 | <--- within a single 64Byte cache line
+------+------+------+------+------+------+------+
| 2011 | 7500 | 7200 | 7200 | 7200 | 7200 | 7200 | <--- rows, likely from the same
+------+------+------+------+------+------+------+      virtual memory page, described by 
                                                        the same page block.
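The two traversal orders, and a row-wise split across threads, might look like this in Java. This is a sketch only: the class and method names, the flat 8500 sample values and the 2 % adjustment are made up, and on a matrix this small none of the variants will differ measurably. Note also that in a Java `short[][]` each row is its own array object, so a row is contiguous in memory but neighbouring rows need not be.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

// Same update, three traversal strategies over an 8 x 12 matrix of 16-bit values.
final class SalaryTraversal {

    /** Applies a percentage adjustment using integer arithmetic. */
    static short adjust(short value, int percent) {
        return (short) (value * (100 + percent) / 100);
    }

    /** Rows first: the inner loop walks one contiguous row. */
    static void adjustRowMajor(short[][] m, int percent) {
        for (int r = 0; r < m.length; r++) {
            for (int c = 0; c < m[r].length; c++) {
                m[r][c] = adjust(m[r][c], percent);
            }
        }
    }

    /** Columns first: the inner loop jumps to a different row on every step. */
    static void adjustColumnMajor(short[][] m, int percent) {
        int cols = m[0].length;
        for (int c = 0; c < cols; c++) {
            for (int r = 0; r < m.length; r++) {
                m[r][c] = adjust(m[r][c], percent);
            }
        }
    }

    /** Rows first, split by row: each task only ever writes to its own row. */
    static void adjustRowsInParallel(short[][] m, int percent) {
        IntStream.range(0, m.length).parallel().forEach(r -> {
            short[] row = m[r];
            for (int c = 0; c < row.length; c++) {
                row[c] = adjust(row[c], percent);
            }
        });
    }

    /** Hypothetical sample data: 8 years x 12 months, all 8500. */
    static short[][] sample() {
        short[][] m = new short[8][12];
        for (short[] row : m) {
            Arrays.fill(row, (short) 8500);
        }
        return m;
    }

    public static void main(String[] args) {
        short[][] a = sample(), b = sample(), c = sample();
        adjustRowMajor(a, 2);
        adjustColumnMajor(b, 2);
        adjustRowsInParallel(c, 2);
        System.out.println("same result: " + (Arrays.deepEquals(a, b) && Arrays.deepEquals(a, c)));
    }
}
```

All three produce the same values; what changes is the memory access pattern. If the work were split across threads by column instead, several threads would keep writing into the same rows, and therefore most likely into the same cache lines, which is the effect described above.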

The OP wrote: a) a CPU with 32 cores and clock speed 2.5 Ghz
or
b) a CPU with 8 cores but clock speed of 3.8 Ghz

All else being equal:

--> Consider cache size, memory size, the hardware's speculative pre-fetching capabilities and running software that can actually leverage parallelisation all more important than clock speed.

--> Even without reliance on 3rd party distributed systems, make sure you truly aren't I/O bound under production conditions. If you must have the hardware in-house and can't let AWS / GCloud / Azure / Heroku / Whatever-XaaS-IsHipNow deal with that pain, spend on the SSDs you put your DB on. While you do not want to have the database live on the same physical machine as does your application, make sure the network distance (measure latency here too) is as short as possible.

--> The choice of a renowned, vetted, top-of-the-line, "Enterprise-level" HTTP Server Library that is beyond the shadow of any doubt built for concurrency, does not alone suffice. Make sure any 3rd party libraries you run in your routes are. Make sure your in-house code is as well.

VMs or cloud solutions are not an option in this case

This I get.
Various valid reasons exist.

it has to be a physical machine [...]
[...] CPU with 32 cores and clock speed 2.5 Ghz

But this not so much.
Neither AWS nor Azure invented distributed systems, micro-clustering or load balancing. It's more painful to set up on bare metal hardware and without MegaCorp-style resources, but you can run a distributed mesh of K8s clusters right in your own living room. And tooling for recurring health checks and automatic provisioning on peak load exists for self-hosted projects too.

OP wrote: RAM is not important

Here's a ~hypothetical~ reproducible scenario: Enable zram as your swapspace, because RAM is cheap and not important and all that. Now run a steady, memory-intensive task that doesn't result in frequent paging exactly. When you have reached the point of serious LRU inversion, your fan will get loud and your CPU cores hot, because they are busy dealing with memory management (moving crap in and out of swap).

OP wrote: RAM is not important

In case I haven't expressed myself clearly enough: I think you should reconsider this opinion.

TL;DR?
32 cores.
More is better.

Many thanks for this interesting answer! Of course I consider "more RAM", however I think 512GB memory is enough for this setup: 1 Java Web Application gets 32GB of heap space (Xmx, Xms). The rest of the memory is for Ubuntu's file caching. As the memory setup is given (can not choose any other memory) I wrote that "memory is not important", which means "not important for my decision in choosing a CPU". — bzero, Aug 31 '20 at 08:31
