Unexpected and unexplained slow (and unusual) memory performance with Xeon Skylake SMP

Question

We've been testing a server with 2x Xeon Gold 6154 CPUs, a Supermicro X11DPH-I motherboard, and 96GB RAM, and found some very strange memory performance issues compared with the same server running only 1 CPU (one socket empty), a similar dual-CPU Haswell Xeon E5-2687Wv3 system (used for this series of tests, though Broadwell Xeons perform similarly), Broadwell-E i7s, and Skylake-X i9s (for comparison).

We expected the Skylake Xeons, with their faster memory, to outperform the Haswell Xeons in the various memcpy functions and even in memory allocation (not covered in the tests below, as we found a workaround). Instead, with both CPUs installed, the Skylake Xeons perform at almost half the speed of the Haswell Xeons, and even worse when compared to an i7-6800k. What's even stranger is that when using Windows VirtualAllocExNuma to assign the NUMA node for memory allocation, plain memory copy functions perform worse on the remote node than on the local node, as expected, but memory copy functions utilizing the SSE, MMX, and AVX registers perform much faster on the remote NUMA node than on the local node (what?). As noted above, with Skylake Xeons, if we pull 1 CPU it performs more or less as expected (still a bit slower than Haswell, but not by a dramatic amount).

I'm not sure if this is a bug on the motherboard or CPU, or with UPI vs QPI, or none of the above, but no combination of BIOS settings seems to help. Disabling NUMA in the BIOS (not included in the test results) does improve the performance of all copy functions using the SSE, MMX, and AVX registers, but all the plain memory copy functions suffer large losses.

Our test program uses both inline assembly functions and _mm intrinsics. We used Windows 10 with Visual Studio 2017 for everything except the assembly functions; since msvc++ won't compile inline asm for x64, we compiled those with gcc from mingw/msys into an obj file using the -c -O2 flags and passed that file to the msvc++ linker.

If the system is using NUMA nodes, we test both operator new and VirtualAllocExNuma (once for each NUMA node) for memory allocation. For each memory copy function we compute a cumulative average over 100 memory buffer copies of 16MB each, and we rotate which memory allocation we are on between each set of tests.
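As a rough illustration of the node-targeted allocation described above, here is a minimal sketch. The helper names `alloc_on_node` and `free_on_node` are hypothetical and are not taken from the test program; the non-Windows branch is only a fallback so the sketch compiles elsewhere, and it ignores the node argument.

```cpp
#include <cassert>
#include <cstddef>
#ifdef _WIN32
#include <windows.h>
#else
#include <xmmintrin.h>  // _mm_malloc / _mm_free for the fallback path
#endif

// Hypothetical helper: allocate `bytes` with a preferred NUMA node.
// On Windows this uses VirtualAllocExNuma, which returns page-aligned memory
// and takes the preferred node as its last argument.
// Elsewhere (an assumption made only so the sketch builds) it falls back to
// a 64-byte-aligned allocation and ignores `node`.
void* alloc_on_node(std::size_t bytes, unsigned node) {
    assert(bytes > 0);
#ifdef _WIN32
    return VirtualAllocExNuma(GetCurrentProcess(), nullptr, bytes,
                              MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE, node);
#else
    (void)node;
    return _mm_malloc(bytes, 64);
#endif
}

// Release memory obtained from alloc_on_node.
void free_on_node(void* p) {
#ifdef _WIN32
    if (p) VirtualFree(p, 0, MEM_RELEASE);
#else
    _mm_free(p);
#endif
}
```

Note that the node passed to VirtualAllocExNuma is a preference; physical pages are typically assigned when the memory is first touched, so a benchmark would normally initialize each buffer before timing it.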

All 100 source and 100 destination buffers are 64-byte aligned (for compatibility up to AVX512 using streaming functions) and initialized once: the source buffers to incremental data, and the destination buffers to 0xff.

The number of copies being averaged on each machine with each configuration varied, as it was much faster on some, and much slower on others.
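The measurement procedure described above can be sketched roughly as follows. This is not the code from the repository; `average_copy_us`, `plain_memcpy`, and `CopyFn` are hypothetical names, and the buffer count and size are parameters so that a small run is possible.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstring>
#include <vector>
#include <xmmintrin.h>  // _mm_malloc / _mm_free

using CopyFn = void (*)(void* dst, const void* src, std::size_t bytes);

// Wrapper so std::memcpy matches CopyFn (memcpy itself returns void*).
void plain_memcpy(void* dst, const void* src, std::size_t bytes) {
    std::memcpy(dst, src, bytes);
}

// Returns the average time per copy in microseconds, rotating through
// `pairs` source/destination pairs so that consecutive copies do not
// touch the same buffers.
double average_copy_us(CopyFn fn, std::size_t pairs, std::size_t bytes,
                       std::size_t copies) {
    assert(pairs > 0 && copies > 0 && bytes % 64 == 0);
    std::vector<unsigned char*> src(pairs), dst(pairs);
    for (std::size_t i = 0; i < pairs; ++i) {
        src[i] = static_cast<unsigned char*>(_mm_malloc(bytes, 64));
        dst[i] = static_cast<unsigned char*>(_mm_malloc(bytes, 64));
        assert(src[i] && dst[i]);
        for (std::size_t j = 0; j < bytes; ++j)
            src[i][j] = static_cast<unsigned char>(j);  // incremental data
        std::memset(dst[i], 0xff, bytes);               // 0xff destination
    }

    const auto start = std::chrono::steady_clock::now();
    for (std::size_t c = 0; c < copies; ++c)
        fn(dst[c % pairs], src[c % pairs], bytes);
    const auto stop = std::chrono::steady_clock::now();

    // Read one destination back so the copies are observably used.
    assert(std::memcmp(dst[0], src[0], bytes) == 0);

    for (std::size_t i = 0; i < pairs; ++i) {
        _mm_free(src[i]);
        _mm_free(dst[i]);
    }
    return std::chrono::duration<double, std::micro>(stop - start).count() /
           static_cast<double>(copies);
}
```

A call such as `average_copy_us(plain_memcpy, 100, 16u << 20, 1000)` would mirror the 100 pairs of 16MB buffers used here; that configuration allocates about 3.2GB in total.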

Results were as follows:

Haswell Xeon E5-2687Wv3 1 CPU (1 empty socket) on Supermicro X10DAi with 32GB DDR4-2400 (10c/20t, 25 MB of L3 cache). But remember, the benchmark rotates through 100 pairs of 16MB buffers, so we probably aren't getting L3 cache hits.

---------------------------------------------------------------------------
Averaging 7000 copies of 16MB of data per function for operator new
---------------------------------------------------------------------------
std::memcpy                      averaging 2264.48 microseconds
asm_memcpy (asm)                 averaging 2322.71 microseconds
sse_memcpy (intrinsic)           averaging 1569.67 microseconds
sse_memcpy (asm)                 averaging 1589.31 microseconds
sse2_memcpy (intrinsic)          averaging 1561.19 microseconds
sse2_memcpy (asm)                averaging 1664.18 microseconds
mmx_memcpy (asm)                 averaging 2497.73 microseconds
mmx2_memcpy (asm)                averaging 1626.68 microseconds
avx_memcpy (intrinsic)           averaging 1625.12 microseconds
avx_memcpy (asm)                 averaging 1592.58 microseconds
avx512_memcpy (intrinsic)        unsupported on this CPU
rep movsb (asm)                  averaging 2260.6 microseconds

Haswell Dual Xeon E5-2687Wv3 2 CPU on Supermicro X10DAi with 64GB RAM

---------------------------------------------------------------------------
Averaging 6900 copies of 16MB of data per function for VirtualAllocExNuma to NUMA node 0(local)
---------------------------------------------------------------------------
std::memcpy                      averaging 3179.8 microseconds
asm_memcpy (asm)                 averaging 3177.15 microseconds
sse_memcpy (intrinsic)           averaging 1633.87 microseconds
sse_memcpy (asm)                 averaging 1663.8 microseconds
sse2_memcpy (intrinsic)          averaging 1620.86 microseconds
sse2_memcpy (asm)                averaging 1727.36 microseconds
mmx_memcpy (asm)                 averaging 2623.07 microseconds
mmx2_memcpy (asm)                averaging 1691.1 microseconds
avx_memcpy (intrinsic)           averaging 1704.33 microseconds
avx_memcpy (asm)                 averaging 1692.69 microseconds
avx512_memcpy (intrinsic)        unsupported on this CPU
rep movsb (asm)                  averaging 3185.84 microseconds
---------------------------------------------------------------------------
Averaging 6900 copies of 16MB of data per function for VirtualAllocExNuma to NUMA node 1
---------------------------------------------------------------------------
std::memcpy                      averaging 3992.46 microseconds
asm_memcpy (asm)                 averaging 4039.11 microseconds
sse_memcpy (intrinsic)           averaging 3174.69 microseconds
sse_memcpy (asm)                 averaging 3129.18 microseconds
sse2_memcpy (intrinsic)          averaging 3161.9 microseconds
sse2_memcpy (asm)                averaging 3141.33 microseconds
mmx_memcpy (asm)                 averaging 4010.17 microseconds
mmx2_memcpy (asm)                averaging 3211.75 microseconds
avx_memcpy (intrinsic)           averaging 3003.14 microseconds
avx_memcpy (asm)                 averaging 2980.97 microseconds
avx512_memcpy (intrinsic)        unsupported on this CPU
rep movsb (asm)                  averaging 3987.91 microseconds
---------------------------------------------------------------------------
Averaging 6900 copies of 16MB of data per function for operator new
---------------------------------------------------------------------------
std::memcpy                      averaging 3172.95 microseconds
asm_memcpy (asm)                 averaging 3173.5 microseconds
sse_memcpy (intrinsic)           averaging 1623.84 microseconds
sse_memcpy (asm)                 averaging 1657.07 microseconds
sse2_memcpy (intrinsic)          averaging 1616.95 microseconds
sse2_memcpy (asm)                averaging 1739.05 microseconds
mmx_memcpy (asm)                 averaging 2623.71 microseconds
mmx2_memcpy (asm)                averaging 1699.33 microseconds
avx_memcpy (intrinsic)           averaging 1710.09 microseconds
avx_memcpy (asm)                 averaging 1688.34 microseconds
avx512_memcpy (intrinsic)        unsupported on this CPU
rep movsb (asm)                  averaging 3175.14 microseconds

Skylake Xeon Gold 6154 1 CPU (1 empty socket) on Supermicro X11DPH-I with 48GB DDR4-2666 (18c/36t, 24.75 MB of L3 cache)

---------------------------------------------------------------------------
Averaging 5000 copies of 16MB of data per function for operator new
---------------------------------------------------------------------------
std::memcpy                      averaging 1832.42 microseconds
asm_memcpy (asm)                 averaging 1837.62 microseconds
sse_memcpy (intrinsic)           averaging 1647.84 microseconds
sse_memcpy (asm)                 averaging 1710.53 microseconds
sse2_memcpy (intrinsic)          averaging 1645.54 microseconds
sse2_memcpy (asm)                averaging 1794.36 microseconds
mmx_memcpy (asm)                 averaging 2030.51 microseconds
mmx2_memcpy (asm)                averaging 1816.82 microseconds
avx_memcpy (intrinsic)           averaging 1686.49 microseconds
avx_memcpy (asm)                 averaging 1716.15 microseconds
avx512_memcpy (intrinsic)        averaging 1761.6 microseconds
rep movsb (asm)                  averaging 1977.6 microseconds

Skylake Xeon Gold 6154 2 CPU on Supermicro X11DPH-I with 96GB DDR4-2666

---------------------------------------------------------------------------
Averaging 4100 copies of 16MB of data per function for VirtualAllocExNuma to NUMA node 0(local)
---------------------------------------------------------------------------
std::memcpy                      averaging 3131.6 microseconds
asm_memcpy (asm)                 averaging 3070.57 microseconds
sse_memcpy (intrinsic)           averaging 3297.72 microseconds
sse_memcpy (asm)                 averaging 3423.38 microseconds
sse2_memcpy (intrinsic)          averaging 3274.31 microseconds
sse2_memcpy (asm)                averaging 3413.48 microseconds
mmx_memcpy (asm)                 averaging 2069.53 microseconds
mmx2_memcpy (asm)                averaging 3694.91 microseconds
avx_memcpy (intrinsic)           averaging 3118.75 microseconds
avx_memcpy (asm)                 averaging 3224.36 microseconds
avx512_memcpy (intrinsic)        averaging 3156.56 microseconds
rep movsb (asm)                  averaging 3155.36 microseconds
---------------------------------------------------------------------------
Averaging 4100 copies of 16MB of data per function for VirtualAllocExNuma to NUMA node 1
---------------------------------------------------------------------------
std::memcpy                      averaging 5309.77 microseconds
asm_memcpy (asm)                 averaging 5330.78 microseconds
sse_memcpy (intrinsic)           averaging 2350.61 microseconds
sse_memcpy (asm)                 averaging 2402.57 microseconds
sse2_memcpy (intrinsic)          averaging 2338.61 microseconds
sse2_memcpy (asm)                averaging 2475.51 microseconds
mmx_memcpy (asm)                 averaging 2883.97 microseconds
mmx2_memcpy (asm)                averaging 2517.69 microseconds
avx_memcpy (intrinsic)           averaging 2356.07 microseconds
avx_memcpy (asm)                 averaging 2415.22 microseconds
avx512_memcpy (intrinsic)        averaging 2487.01 microseconds
rep movsb (asm)                  averaging 5372.98 microseconds
---------------------------------------------------------------------------
Averaging 4100 copies of 16MB of data per function for operator new
---------------------------------------------------------------------------
std::memcpy                      averaging 3075.1 microseconds
asm_memcpy (asm)                 averaging 3061.97 microseconds
sse_memcpy (intrinsic)           averaging 3281.17 microseconds
sse_memcpy (asm)                 averaging 3421.38 microseconds
sse2_memcpy (intrinsic)          averaging 3268.79 microseconds
sse2_memcpy (asm)                averaging 3435.76 microseconds
mmx_memcpy (asm)                 averaging 2061.27 microseconds
mmx2_memcpy (asm)                averaging 3694.48 microseconds
avx_memcpy (intrinsic)           averaging 3111.16 microseconds
avx_memcpy (asm)                 averaging 3227.45 microseconds
avx512_memcpy (intrinsic)        averaging 3148.65 microseconds
rep movsb (asm)                  averaging 2967.45 microseconds

Skylake-X i9-7940X on ASUS ROG Rampage VI Extreme with 32GB DDR4-4266 (14c/28t, 19.25 MB of L3 cache) (overclocked to 3.8GHz/4.4GHz turbo, DDR at 4040MHz, Target AVX Frequency 3737MHz, Target AVX-512 Frequency 3535MHz, target cache frequency 2424MHz)

---------------------------------------------------------------------------
Averaging 6500 copies of 16MB of data per function for operator new
---------------------------------------------------------------------------
std::memcpy                      averaging 1750.87 microseconds
asm_memcpy (asm)                 averaging 1748.22 microseconds
sse_memcpy (intrinsic)           averaging 1743.39 microseconds
sse_memcpy (asm)                 averaging 3120.18 microseconds
sse2_memcpy (intrinsic)          averaging 1743.37 microseconds
sse2_memcpy (asm)                averaging 2868.52 microseconds
mmx_memcpy (asm)                 averaging 2255.17 microseconds
mmx2_memcpy (asm)                averaging 3434.58 microseconds
avx_memcpy (intrinsic)           averaging 1698.49 microseconds
avx_memcpy (asm)                 averaging 2840.65 microseconds
avx512_memcpy (intrinsic)        averaging 1670.05 microseconds
rep movsb (asm)                  averaging 1718.77 microseconds

Broadwell i7-6800k on ASUS X99 with 24GB DDR4-2400 (6c/12t, 15 MB of L3 cache)

---------------------------------------------------------------------------
Averaging 64900 copies of 16MB of data per function for operator new
---------------------------------------------------------------------------
std::memcpy                      averaging 2522.1 microseconds
asm_memcpy (asm)                 averaging 2615.92 microseconds
sse_memcpy (intrinsic)           averaging 1621.81 microseconds
sse_memcpy (asm)                 averaging 1669.39 microseconds
sse2_memcpy (intrinsic)          averaging 1617.04 microseconds
sse2_memcpy (asm)                averaging 1719.06 microseconds
mmx_memcpy (asm)                 averaging 3021.02 microseconds
mmx2_memcpy (asm)                averaging 1691.68 microseconds
avx_memcpy (intrinsic)           averaging 1654.41 microseconds
avx_memcpy (asm)                 averaging 1666.84 microseconds
avx512_memcpy (intrinsic)        unsupported on this CPU
rep movsb (asm)                  averaging 2520.13 microseconds

The assembly functions are derived from fast_memcpy in xine-libs, mostly used just to compare with msvc++'s optimizer.

Source Code for the test is available at https://github.com/marcmicalizzi/memcpy_test (it's a bit long to put in the post)

Has anyone else run into this or does anyone have any insight on why this might be happening?

Update 2018-05-15 13:40EST

As suggested by Peter Cordes, I've updated the test to compare prefetched vs. not prefetched and NT stores vs. regular stores, and tuned the prefetching done in each function. (I don't have any meaningful experience with writing prefetching, so if I'm making any mistakes with this, please let me know and I'll adjust the tests accordingly. The prefetching does have an impact, so at the very least it's doing something.) These changes are reflected in the latest revision at the GitHub link above for anyone looking for the source code.
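To make "tuning the prefetching" concrete, here is a minimal sketch of a copy loop with a software prefetch at an adjustable distance. The function name `sse2_copy_prefetchnta` and the `prefetch_distance` parameter are hypothetical, and the loop uses regular stores so that only the prefetch varies; it is not the loop from the repository.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <emmintrin.h>  // SSE2 loads/stores; also provides _mm_prefetch

// Copies `bytes` (a multiple of 64) between 16-byte-aligned buffers, one
// 64-byte cache line per iteration. If prefetch_distance is non-zero, a
// prefetchnta is issued for the source line that many bytes ahead, as long
// as that address is still inside the source buffer.
void sse2_copy_prefetchnta(void* dst, const void* src, std::size_t bytes,
                           std::size_t prefetch_distance) {
    assert(bytes % 64 == 0);
    assert(reinterpret_cast<std::uintptr_t>(dst) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(src) % 16 == 0);
    char* d = static_cast<char*>(dst);
    const char* s = static_cast<const char*>(src);

    for (std::size_t i = 0; i < bytes; i += 64) {
        if (prefetch_distance != 0 && i + prefetch_distance < bytes)
            _mm_prefetch(s + i + prefetch_distance, _MM_HINT_NTA);

        // Load the whole cache line first, then store it.
        const __m128i r0 = _mm_load_si128(reinterpret_cast<const __m128i*>(s + i));
        const __m128i r1 = _mm_load_si128(reinterpret_cast<const __m128i*>(s + i + 16));
        const __m128i r2 = _mm_load_si128(reinterpret_cast<const __m128i*>(s + i + 32));
        const __m128i r3 = _mm_load_si128(reinterpret_cast<const __m128i*>(s + i + 48));
        _mm_store_si128(reinterpret_cast<__m128i*>(d + i), r0);
        _mm_store_si128(reinterpret_cast<__m128i*>(d + i + 16), r1);
        _mm_store_si128(reinterpret_cast<__m128i*>(d + i + 32), r2);
        _mm_store_si128(reinterpret_cast<__m128i*>(d + i + 48), r3);
    }
}
```

Timing the same loop with several distances (for example 0, 512, and a few KB) is one way to see how sensitive a given machine is to this parameter.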

I've also added an SSE4.1 memcpy, since prior to SSE4.1 I can't find any _mm_stream_load SSE functions (I specifically used _mm_stream_load_si128), so sse_memcpy and sse2_memcpy can't be fully non-temporal (they have NT stores but no NT loads). Likewise, the avx_memcpy function uses AVX2 functions for stream loading.

I opted not to do a test for pure store and pure load access patterns yet, as I'm not sure if the pure store could be meaningful, as without a load to the registers it's accessing, the data would be meaningless and unverifiable.
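For reference, one way a pure-store pass could be made verifiable is to write a known pattern and read it back afterwards, outside the timed region. The sketch below assumes that approach; `store_only` and `load_only_sum` are hypothetical names and are not part of the test program.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <emmintrin.h>  // SSE2

// Pure store: fill a 16-byte-aligned buffer with a repeated 32-bit pattern.
void store_only(void* dst, std::size_t bytes, std::uint32_t pattern) {
    assert(bytes % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(dst) % 16 == 0);
    const __m128i v = _mm_set1_epi32(static_cast<int>(pattern));
    char* d = static_cast<char*>(dst);
    for (std::size_t i = 0; i < bytes; i += 16)
        _mm_store_si128(reinterpret_cast<__m128i*>(d + i), v);
}

// Pure load: sum every 32-bit lane (wrapping modulo 2^32). Returning the sum
// keeps the loads from being optimized away and doubles as a check on what
// an earlier store pass wrote.
std::uint32_t load_only_sum(const void* src, std::size_t bytes) {
    assert(bytes % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(src) % 16 == 0);
    const char* s = static_cast<const char*>(src);
    __m128i acc = _mm_setzero_si128();
    for (std::size_t i = 0; i < bytes; i += 16)
        acc = _mm_add_epi32(acc, _mm_load_si128(reinterpret_cast<const __m128i*>(s + i)));
    alignas(16) std::uint32_t lanes[4];
    _mm_store_si128(reinterpret_cast<__m128i*>(lanes), acc);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```

After `store_only(buf, bytes, p)`, the value of `load_only_sum(buf, bytes)` should equal `p * (bytes / 4)` modulo 2^32, so each pass can be timed on its own and the store checked afterwards.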

The interesting result of the new test was that on the dual-socket Skylake Xeon setup, and only on that setup, the regular store functions were actually significantly faster than the NT streaming functions for 16MB memory copies. Also only on that setup (and only with LLC prefetch enabled in the BIOS), prefetchnta in some tests (SSE, SSE4.1) outperforms both prefetcht0 and no prefetch.
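For readers unfamiliar with the distinction, the two store flavours being compared look roughly like this in SSE2 intrinsics. This is a minimal sketch with hypothetical function names (`sse2_copy_regular`, `sse2_copy_nt`), not the functions from the repository.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <emmintrin.h>  // SSE2: _mm_load_si128, _mm_store_si128, _mm_stream_si128

// Regular stores: the destination lines go through the cache hierarchy.
void sse2_copy_regular(void* dst, const void* src, std::size_t bytes) {
    assert(bytes % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(dst) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(src) % 16 == 0);
    char* d = static_cast<char*>(dst);
    const char* s = static_cast<const char*>(src);
    for (std::size_t i = 0; i < bytes; i += 16) {
        const __m128i v = _mm_load_si128(reinterpret_cast<const __m128i*>(s + i));
        _mm_store_si128(reinterpret_cast<__m128i*>(d + i), v);
    }
}

// NT (streaming) stores: movntdq writes around the cache. The sfence at the
// end orders the weakly-ordered NT stores before any later stores.
void sse2_copy_nt(void* dst, const void* src, std::size_t bytes) {
    assert(bytes % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(dst) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(src) % 16 == 0);
    char* d = static_cast<char*>(dst);
    const char* s = static_cast<const char*>(src);
    for (std::size_t i = 0; i < bytes; i += 16) {
        const __m128i v = _mm_load_si128(reinterpret_cast<const __m128i*>(s + i));
        _mm_stream_si128(reinterpret_cast<__m128i*>(d + i), v);
    }
    _mm_sfence();
}
```

Both functions produce the same destination contents; only the path the stores take through the memory hierarchy differs, which is what the timings above are comparing.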

The raw results of this new test are too long to add to the post, so they are posted on the same git repository as the source code under results-2018-05-15

I still don't understand why, for streaming NT stores, the remote NUMA node is faster under the Skylake SMP setup, although using regular stores on the local NUMA node is still faster than that.

Haven't had a chance to digest your data yet, but see also [Why is Skylake so much better than Broadwell-E for single-threaded memory throughput?](https://stackoverflow.com/q/39260020) (comparing a quad-core Skylake against a many-core Broadwell, and seeing the downside of higher memory/L3 latency in many-core systems where single-core bandwidth is limited by max memory concurrency in one core, not by DRAM controllers.) SKX has high latency / low bandwidth per core to L3 / memory in general, according to Mysticial's testing and other results. You're probably seeing that. — Peter Cordes, May 14 '18 at 22:37
Ah, my fault, I was confusing the v3 with the v4 since we've worked with both, but the server it was tested on was a v3, so Haswell it is — , May 14 '18 at 22:40
The slower L3 is fine and visible when operating with one CPU, on the order of 200 microseconds for 16MB for SSE for example, but the biggest problem is that just having the second CPU installed and active, even for single-threaded memory copy, results in almost double the time to copy the same memory. That's not even mentioning the weird behaviour of speeding up SSE, AVX, etc. register-based copies on the remote NUMA node — , May 14 '18 at 22:53
Are any of your copies using NT stores? I just checked, and all of your copies except MMX are using `prefetchnta` and NT stores! That's a huge important fact you left out of your question! See [Enhanced REP MOVSB for memcpy](https://stackoverflow.com/q/43343231) for more discussion of ERMSB `rep movsb` vs. NT vector stores vs. regular vector stores. Messing around with that would be more useful than MMX vs. SSE. Probably just use AVX and/or AVX512 and try NT vs. regular, and / or leaving out the SW prefetch. — Peter Cordes, May 14 '18 at 23:08
Did you tune the prefetch distance for your SKX machines? SKX `prefetchnta` bypasses L3 as well as L2 (because L3 is non-inclusive), so it's more sensitive to prefetch distance (too late and data has to come all the way from DRAM again, not just L3), so it's more "brittle" (sensitive to tuning the right distance). Your prefetch distances look fairly low, though, under 500 bytes if I'm reading the asm correctly. @Mysticial's testing on SKX has found that [`prefetchnta` can be a big slowdown on that uarch](https://stackoverflow.com/posts/comments/82130637), and he doesn't recommend it. — Peter Cordes, May 14 '18 at 23:16
The NT stores seem to always have better performance through the tests we did, the mmx and mmx2 are in there mostly just because they were already there so it provides more functions to look at. But still in comparing the SMP Skylake to SMP Haswell in more or less the same configuration, would just having the NT stores account for a near 50% slowdown for using sse, sse2 and avx stores for memory copying? I'll try a few more tests with NT vs regular stores, and without prefetch, but when I was writing the intrinsic functions for the test originally, non-NT sse/avx was much slower. — , May 14 '18 at 23:18
We've tried every prefetch related option in the bios, they all seem to make little difference, except for LLC prefetch did initially, but under load it actually seems to make little difference. We also tried DCU streamer prefetch, adjacent cache prefetch, enabling and disabling SNC, 1 way and 2 way IMC interleaving... we've gone through as many bios options as seem like they could be related to no avail. Also all power management is disabled, so it wouldn't be related to that. — , May 14 '18 at 23:23
**You definitely have some interesting results here, but we need to untangle them from various effects**. Having numbers both with and without NT stores may tell us something useful about NUMA behaviour. Populating a 2nd socket forces even local L3 misses to snoop the remote CPU, at least on Broadwell/Haswell. Dual-socket E5 Xeons don't have a snoop filter. I think Gold Xeons *do* have snoop filters, because they're capable of operating in more than dual-socket systems. But I'm not sure how big it is, or what that really means :P I haven't done memory perf tuning on multi-socket. — Peter Cordes, May 14 '18 at 23:24
Alright, I'll add many more functions using different options for the stores and prefetch to the test. I'm probably not going to be able to test it on the server until Wednesday to get the results under an SMP environment, but I'll try to get them tomorrow afternoon if I can. I definitely appreciate the input and insight! — , May 14 '18 at 23:30
It's still strange to me that there's such a massive performance hit for adding the second CPU in Skylake when there wasn't in Haswell (and I don't think Broadwell did either, but I don't have access to a server to test that on at the moment) — , May 14 '18 at 23:32
SKX is a fundamentally different interconnect; a mesh instead of a ring. It's an interesting result, but not unbelievable and may not be a sign of a misconfiguration. IDK, hopefully someone else with more experience with the hardware can shed more light. — Peter Cordes, May 14 '18 at 23:35
When you get a chance to test again, testing pure-store and pure-load access patterns would be interesting, too, not just copy. e.g. maybe we'll find that only loads are slowed down by having the other socket populated, but not stores, or the other way around. (Or a different slowdown factor). You said there was nothing interesting for data sets that fit in L3 cache? I guess that's normal; L3 hits should avoid snooping the other socket. — Peter Cordes, May 14 '18 at 23:53
`_mm_stream_load` doesn't do anything special on WB memory, only on WC memory (e.g. copying from video RAM). It runs as a `movdqa` load + an ALU uop. `prefetchnta` is currently the only way to minimize cache pollution for streaming loads, but it's brittle and can make things worse instead of better. — Peter Cordes, May 15 '18 at 18:37
Still applies to `_mm_stream_load_si128`? That's the specific SSE4.1 function I was referring to, which results in `movntdqa`; I was specifically looking for NT functions for moving from memory to registers. I guess I should have clarified in the updates regardless, I've edited it in now. — , May 15 '18 at 18:44
Yes, `movntdqa` is only special on WC memory. The *only* way to do anything other than normal loads on current hardware is `prefetchnta`. https://web.archive.org/web/20120918010837/http://software.intel.com/en-us/articles/increasing-memory-throughput-with-intel-streaming-simd-extensions-4-intel-sse4-streaming-load/ and https://software.intel.com/en-us/articles/copying-accelerated-video-decode-frame-buffers. (Current CPUs ignore the NT hint on `movntdqa` on memory types other than WC, I think because they don't have hardware NT-prefetchers to keep track of whether to do NT prefetch or regular) — Peter Cordes, May 15 '18 at 18:49
*as without a load to the registers it's accessing, the data would be meaningless and unverifiable.* Huh? If the store instructions run, the stores actually happen. Your copy tests aren't reading back the copy destinations, are they? Out-of-order execution only happens over a window of 224 uops on Skylake, so there's no way for the CPU to realize the data is unused and optimize away the stores. You can use perf counters to check, e.g. `ocperf.py stat l2_lines_out.non_silent ./my_prog` counts dirty lines evicted from L2, written back to L3. — Peter Cordes, May 15 '18 at 18:51
`DDR4-4266` for your i9 machine is a typo, right? I assume you actually have the same DDR4-2666 as your SKX Xeons, not *highly*-overclocked memory. — Peter Cordes, May 15 '18 at 18:58
The i9 machine is my home computer, with overclocked cpu and ram, water cooled, etc., (See motherboard being ROG Rampage VI Extreme) so not a typo :) It's the only other Skylake cpu machine I had to test against at this time, so I figured even being overclocked it could be useful metric. Also it's worth noting that the test will, about 50% of the time, cause a windows WHEA_UNCORRECTABLE_ERROR bluescreen on my overclocked i9, which is probably related to the overclocking. — , May 15 '18 at 19:00
Did you use the same binary on all machines, or did you compile separately on the SKX Xeons vs. your i9? There's a large difference between asm and intrinsic `avx_memcpy` on the i9 (with intrinsics being much slower), but only a small difference on single-socket SKX Xeon (times much closer to `asm`). Uncore clock speeds might be relevant here; faster handoff of cache lines to the memory controller might be helping keep more loads/stores in flight. I should have included stock / turbo clocks when I edited in cores / cache size, but maybe you could do that with actual clocks for your machines? — Peter Cordes, May 15 '18 at 19:06
It's the same binaries on all the machines tested. I left the asm functions (save for asm_memcpy) out of the most recent tests, and just stuck with intrinsics. And I think you either have a typo in the last comment, or misread the results, from those tests the asm functions were almost always slower than the intrinsics in all cases, with only a couple close exceptions (at a quick glance it was almost always on the remote numa node). Might have been the optimizer, or related to the inconsistent prefetching on the original tests. Also the i9 is the *only* overclocked CPU/RAM in the tests — , May 15 '18 at 19:23
Oh yup, I had it backwards. `asm` was slower than `intrinsic` in the cases I was commenting on, not faster. I'm curious what difference in compiler-generated asm vs. inline asm makes so much difference; maybe hand-written asm is hurting itself by only loading one cache line before storing it, while maybe the compiler loads 2 or 4 cache lines before storing. It's odd that it makes so much more difference on the i9 than on the Xeon; maybe prefetch distance is slightly different and more memory controllers on the Xeon can absorb extra demand misses? Or both PF distances work on the Xeon... — Peter Cordes, May 15 '18 at 19:33
A BIOS update was published on 15/06/2018, here are the release [notes](https://www.supermicro.com/Bios/softfiles/5924/P-X11DPH-I-T-TQ_BIOS_2_1_release_notes.pdf) and the download [link](https://www.supermicro.com/support/resources/results.aspx) — pmarkoulidakis, Oct 29 '18 at 11:23

score 0 · Answer 1 · answered Apr 25 '19 at 19:26

Is your memory the incorrect rank? Perhaps your board does something unusual with memory ranking when you add that second CPU. I know that quad-CPU machines do all kinds of weird things to make the memory work properly, and if you have incorrectly ranked memory it will sometimes work but clock back to 1/4 or 1/2 of the speed. Perhaps Supermicro did something in that board to turn the DDR4 and dual CPUs into quad channel, and it is using similar math. Incorrect rank == 1/2 speed.

answered Apr 25 '19 at 19:26 by thelanranger

Doesn't appear to be the case, all the memory is 1R8, and matches rank from the Supermicro QVL for the motherboard. Was worth a check though! – Marc Micalizzi May 01 '19 at 23:11

I know this is a different system entirely, but this is what I was referring to. https://qrl.dell.com/Files/en-us/Html/Manuals/R920/System%20Memory=GUID-A94C7F4A-512E-44CB-B142-CE638A9304FF=1=en-us=.html You'll note that the rank requirements change when you increase the amount of sticks/CPUs. – thelanranger May 02 '19 at 18:05
