
I'm in the process of porting a complex, performance-oriented application to a new dual-socket machine. While doing so I ran into some performance anomalies and, after much experimentation, discovered that the memory bandwidth of the new machine appears to be substantially lower than I would have expected.

The machine uses a Supermicro X11DGQ motherboard with 2 x Intel Xeon Gold 6148 processors and 6 x 32 GB of DDR4-2133 RAM (192 GB total). The system runs Ubuntu 16.04.4 with a 4.13 kernel.

I wrote a simple memory test utility that repeatedly runs and times a memcpy to determine an average duration and rate:

#include <algorithm>
#include <chrono>
#include <cstdint>   // uint32_t, uint64_t
#include <cstdlib>   // std::strtoul, std::free, posix_memalign
#include <cstring>
#include <iomanip>
#include <iostream>

#include <unistd.h>

const uint64_t MB_SCALER = 1024 * 1024L;

// g++ -std=c++11 -O3 -march=native -o mem_test mem_test.cc
int main(int argc, char** argv)
{
    uint64_t buffer_size = 64 * MB_SCALER;
    uint32_t num_loops = 100;

    std::cout << "Memory Tester\n" << std::endl;

    if (argc < 2)
    {
        std::cout << "Using default values.\n" << std::endl;
    }

    // Parse buffer size
    if (argc >= 2)
    {
        buffer_size = std::strtoul(argv[1], nullptr, 10) * MB_SCALER;
    }

    // Parse num loops 
    if (argc >= 3)
    {
        num_loops = std::strtoul(argv[2], nullptr, 10);
    }

    std::cout << "    Num loops:   " << num_loops << std::endl;
    std::cout << "    Buffer size: " << (buffer_size / MB_SCALER) << " MB" 
              << std::endl;

    // Allocate buffers
    char* buffer1 = nullptr;
    posix_memalign((void**)&buffer1, getpagesize(), buffer_size);
    std::memset(buffer1, 0x5A, buffer_size);

    char* buffer2 = nullptr;
    posix_memalign((void**)&buffer2, getpagesize(), buffer_size);
    std::memset(buffer2, 0xC3, buffer_size);

    // Loop and copy memory, measuring duration each time
    double average_duration = 0;    
    for (uint32_t loop_idx = 0; loop_idx < num_loops; ++loop_idx)
    {    
        auto iter_start = std::chrono::system_clock::now();

        std::memcpy(buffer2, buffer1, buffer_size);

        auto iter_end = std::chrono::system_clock::now();

        // Calculate and accumulate duration
        auto diff = iter_end - iter_start;
        auto duration = std::chrono::duration<double, std::milli>(diff).count();
        average_duration += duration;
    }

    // Calculate and display average duration
    average_duration /= num_loops;
    std::cout << "    Duration:    " << std::setprecision(4) << std::fixed 
              << average_duration << " ms" << std::endl;

    // Calculate and display rate
    double rate = (buffer_size /  MB_SCALER) / (average_duration / 1000);
    std::cout << "    Rate:        " << std::setprecision(2) << std::fixed 
              << rate << " MB/s" << std::endl;

    std::free(buffer1);
    std::free(buffer2);
}

I then compiled and ran this utility using a 64 MB buffer size (significantly larger than the L3 cache size) over 10,000 loops.

Dual socket configuration:

$ ./mem_test 64 10000
Memory Tester

    Num loops:   10000
    Buffer size: 64 MB
    Duration:    17.9141 ms
    Rate:        3572.61 MB/s

Single socket configuration:

(same hardware with one processor physically removed)

$ ./mem_test 64 10000
Memory Tester

    Num loops:   10000
    Buffer size: 64 MB
    Duration:    11.2055 ms
    Rate:        5711.46 MB/s

Dual socket using numactl:

At the behest of a colleague, I tried running the same utility under numactl to restrict both execution and memory allocation to the first NUMA node.

$ numactl -m 0 -N 0 ./mem_test 64 10000
Memory Tester

    Num loops:   10000
    Buffer size: 64 MB
    Duration:    18.3539 ms
    Rate:        3486.99 MB/s
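
For reference, the same binding that numactl applies can also be done programmatically with libnuma. The following is only a minimal sketch (not the test I actually ran); it assumes the libnuma development headers are installed and the program is linked with -lnuma. It pins execution to node 0, allocates both buffers from node 0's memory, and times a single copy:

// g++ -std=c++11 -O3 -march=native -o numa_sketch numa_sketch.cc -lnuma
#include <chrono>
#include <cstring>
#include <iostream>

#include <numa.h>

int main()
{
    if (numa_available() < 0)
    {
        std::cerr << "libnuma reports NUMA is not available" << std::endl;
        return 1;
    }

    const std::size_t buffer_size = 64UL * 1024 * 1024;

    // Equivalent of `numactl -N 0 -m 0`: run on node 0's CPUs and
    // allocate both buffers from node 0's local memory.
    numa_run_on_node(0);
    char* src = static_cast<char*>(numa_alloc_onnode(buffer_size, 0));
    char* dst = static_cast<char*>(numa_alloc_onnode(buffer_size, 0));
    std::memset(src, 0x5A, buffer_size);
    std::memset(dst, 0xC3, buffer_size);

    auto start = std::chrono::steady_clock::now();
    std::memcpy(dst, src, buffer_size);
    auto end = std::chrono::steady_clock::now();

    double ms = std::chrono::duration<double, std::milli>(end - start).count();
    std::cout << "Single copy: " << ms << " ms" << std::endl;

    numa_free(src, buffer_size);
    numa_free(dst, buffer_size);
}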

Results

5711.46 / 3572.61 = 1.5987

The exact same test on both configurations shows that the single-socket configuration is roughly 60% faster.

I found this question, which is somewhat similar but much more detailed. From one of the comments: "Populating a 2nd socket forces even local L3 misses to snoop the remote CPU...".

I understand the concept of L3 snooping, but the overhead relative to the single-socket case still seems remarkably high to me. Is the behavior I'm seeing expected? Could someone shed more light on what's happening and what, if anything, I can do about it?
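
For what it's worth, the node layout and the firmware-reported node distances can also be dumped programmatically with libnuma (numactl --hardware shows the same information). A minimal sketch, again assuming libnuma is installed and the binary is linked with -lnuma:

// g++ -std=c++11 -O2 -o numa_topology numa_topology.cc -lnuma
#include <iostream>

#include <numa.h>

int main()
{
    if (numa_available() < 0)
    {
        std::cerr << "libnuma reports NUMA is not available" << std::endl;
        return 1;
    }

    int max_node = numa_max_node();
    std::cout << "Configured nodes: 0.." << max_node << std::endl;

    // numa_distance() returns the ACPI SLIT value: 10 means local,
    // larger values mean proportionally more expensive access.
    for (int from = 0; from <= max_node; ++from)
    {
        for (int to = 0; to <= max_node; ++to)
        {
            std::cout << "distance(" << from << ", " << to << ") = "
                      << numa_distance(from, to) << std::endl;
        }
    }
}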

Dave
  • Check also https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring/topic/509237 – Alessandro Carini Jun 04 '18 at 14:00
  • So I guess the answer is "yes, this is expected"? It seems crazy to me that the relative overhead of L3 snooping traffic is that high for such a trivial test. Any additional insight would be appreciated. – Dave Jun 04 '18 at 19:12

0 Answers