I'm in the process of porting a complex, performance-oriented application to a new dual-socket machine. I ran into some performance anomalies along the way and, after much experimentation, discovered that memory bandwidth on the new machine appears to be substantially lower than I would have expected.
The machine uses a Supermicro X11DGQ motherboard with 2x Intel Xeon Gold 6148 processors and 6x 32 GB of DDR4-2133 RAM (192 GB total). The system is running Ubuntu 16.04.4 with a 4.13 kernel.
I wrote a simple memory test utility that repeatedly runs and times a memcpy
to determine an average duration and rate:
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <iomanip>
#include <iostream>
#include <unistd.h>

const uint64_t MB_SCALER = 1024 * 1024L;

// g++ -std=c++11 -O3 -march=native -o mem_test mem_test.cc
int main(int argc, char** argv)
{
    uint64_t buffer_size = 64 * MB_SCALER;
    uint32_t num_loops = 100;

    std::cout << "Memory Tester\n" << std::endl;

    if (argc < 2)
    {
        std::cout << "Using default values.\n" << std::endl;
    }

    // Parse buffer size
    if (argc >= 2)
    {
        buffer_size = std::strtoul(argv[1], nullptr, 10) * MB_SCALER;
    }

    // Parse num loops
    if (argc >= 3)
    {
        num_loops = std::strtoul(argv[2], nullptr, 10);
    }

    std::cout << "  Num loops:   " << num_loops << std::endl;
    std::cout << "  Buffer size: " << (buffer_size / MB_SCALER) << " MB"
              << std::endl;

    // Allocate buffers
    char* buffer1 = nullptr;
    posix_memalign((void**)&buffer1, getpagesize(), buffer_size);
    std::memset(buffer1, 0x5A, buffer_size);

    char* buffer2 = nullptr;
    posix_memalign((void**)&buffer2, getpagesize(), buffer_size);
    std::memset(buffer2, 0xC3, buffer_size);

    // Loop and copy memory, measuring duration each time
    double average_duration = 0;
    for (uint32_t loop_idx = 0; loop_idx < num_loops; ++loop_idx)
    {
        auto iter_start = std::chrono::system_clock::now();

        std::memcpy(buffer2, buffer1, buffer_size);

        auto iter_end = std::chrono::system_clock::now();

        // Calculate and accumulate duration
        auto diff = iter_end - iter_start;
        auto duration = std::chrono::duration<double, std::milli>(diff).count();
        average_duration += duration;
    }

    // Calculate and display average duration
    average_duration /= num_loops;
    std::cout << "  Duration:    " << std::setprecision(4) << std::fixed
              << average_duration << " ms" << std::endl;

    // Calculate and display rate
    double rate = (buffer_size / MB_SCALER) / (average_duration / 1000);
    std::cout << "  Rate:        " << std::setprecision(2) << std::fixed
              << rate << " MB/s" << std::endl;

    std::free(buffer1);
    std::free(buffer2);
}
I then compiled and ran this utility using a 64 MB buffer size (significantly larger than the L3 cache size) over 10,000 loops.
Dual socket configuration:
$ ./mem_test 64 10000
Memory Tester
Num loops: 10000
Buffer size: 64 MB
Duration: 17.9141 ms
Rate: 3572.61 MB/s
Single socket configuration:
(same hardware with one processor physically removed)
# ./mem_test 64 10000
Memory Tester
Num loops: 10000
Buffer size: 64 MB
Duration: 11.2055 ms
Rate: 5711.46 MB/s
Dual socket using numactl:
At the behest of a colleague, I tried running the same utility with numactl to localize memory access to only the first NUMA node (an in-process variant of the same idea using libnuma is sketched after the output below).
$ numactl -m 0 -N 0 ./mem_test 64 10000
Memory Tester
Num loops: 10000
Buffer size: 64 MB
Duration: 18.3539 ms
Rate: 3486.99 MB/s
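For completeness, the same placement can also be done from inside the process with libnuma instead of via numactl. Below is a minimal sketch of that idea (separate from the test above; it assumes libnuma is installed and the binary is linked with -lnuma), allocating both buffers on node 0 and keeping the copying thread on that node:

// local_alloc_test.cc -- sketch only: allocate both buffers on node 0 via
// libnuma so the copy never has to pull data from the other socket.
// Build: g++ -std=c++11 -O3 -o local_alloc_test local_alloc_test.cc -lnuma
#include <chrono>
#include <cstring>
#include <iostream>
#include <numa.h>

int main()
{
    if (numa_available() < 0)
    {
        std::cerr << "libnuma not available on this system" << std::endl;
        return 1;
    }

    const size_t buffer_size = 64UL * 1024 * 1024;  // 64 MB
    const int node = 0;

    numa_run_on_node(node);  // keep the copying thread on the same socket

    char* src = static_cast<char*>(numa_alloc_onnode(buffer_size, node));
    char* dst = static_cast<char*>(numa_alloc_onnode(buffer_size, node));
    if (!src || !dst)
    {
        std::cerr << "Allocation failed" << std::endl;
        return 1;
    }
    std::memset(src, 0x5A, buffer_size);
    std::memset(dst, 0xC3, buffer_size);

    auto start = std::chrono::steady_clock::now();
    std::memcpy(dst, src, buffer_size);
    auto end = std::chrono::steady_clock::now();

    std::cout << "Local copy: "
              << std::chrono::duration<double, std::milli>(end - start).count()
              << " ms" << std::endl;

    numa_free(src, buffer_size);
    numa_free(dst, buffer_size);
}

This should behave like the numactl -m 0 -N 0 run above, just without relying on external tooling.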
Results
5711.46 / 3572.61 = 1.5987
The exact same test on both configurations shows that the single-socket configuration is ~60% faster.
I found this question, which is somewhat similar but much more detailed. From one of the comments: "Populating a 2nd socket forces even local L3 misses to snoop the remote CPU...".
I understand the concept of L3 snooping, but the overhead compared to the single-socket case still seems incredibly high to me. Is the behavior I'm seeing expected? Could someone shed more light on what's happening and what, if anything, I can do about it?
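In case it helps frame an answer: a follow-up experiment I could run is to compare the local case against a purely remote copy, which should separate the remote-access penalty from the snoop overhead on local misses. A rough sketch with libnuma (node numbers are just illustrative; again assumes linking with -lnuma), keeping the copying thread on node 0 while forcing both buffers onto node 1:

// remote_copy_test.cc -- sketch only: time a copy whose pages live on the
// other socket while the copying thread stays on node 0.
// Build: g++ -std=c++11 -O3 -o remote_copy_test remote_copy_test.cc -lnuma
#include <chrono>
#include <cstring>
#include <iostream>
#include <numa.h>

int main()
{
    if (numa_available() < 0 || numa_max_node() < 1)
    {
        std::cerr << "Need libnuma and at least two NUMA nodes" << std::endl;
        return 1;
    }

    const size_t buffer_size = 64UL * 1024 * 1024;  // 64 MB

    numa_run_on_node(0);  // copying thread stays on socket 0

    // Place both buffers on the remote socket (node 1).
    char* src = static_cast<char*>(numa_alloc_onnode(buffer_size, 1));
    char* dst = static_cast<char*>(numa_alloc_onnode(buffer_size, 1));
    if (!src || !dst)
    {
        std::cerr << "Allocation failed" << std::endl;
        return 1;
    }
    std::memset(src, 0x5A, buffer_size);
    std::memset(dst, 0xC3, buffer_size);

    auto start = std::chrono::steady_clock::now();
    std::memcpy(dst, src, buffer_size);
    auto end = std::chrono::steady_clock::now();

    std::cout << "Remote copy: "
              << std::chrono::duration<double, std::milli>(end - start).count()
              << " ms" << std::endl;

    numa_free(src, buffer_size);
    numa_free(dst, buffer_size);
}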