Memory ordering

Memory ordering describes the order of accesses to computer memory by a CPU. The term can refer either to the memory ordering generated by the compiler during compile time, or to the memory ordering generated by a CPU during runtime.

In modern microprocessors, memory ordering characterizes the CPU's ability to reorder memory operations; it is a type of out-of-order execution. Memory reordering can be used to fully utilize the bus bandwidth of different types of memory such as caches and memory banks.

On most modern uniprocessors, memory operations are not executed in the order specified by the program code. In single-threaded programs, all operations appear to have been executed in the order specified, with all out-of-order execution hidden from the programmer. In multi-threaded environments, however (or when interfacing with other hardware via memory buses), this reordering can lead to problems. Memory barriers can be used to avoid them.

Compile-time memory ordering

The compiler has some freedom to reorder operations during compile time. However, this can lead to problems if the order of memory accesses matters.

Compile-time memory barrier implementation

These barriers prevent a compiler from reordering instructions during compile time; they do not prevent reordering by the CPU during runtime.

  • The GNU inline assembler statement
asm volatile("" ::: "memory");

or, equivalently,

__asm__ __volatile__ ("" ::: "memory");

forbids the GCC compiler from reordering read and write commands around it.[1]

  • The C11/C++11 function
atomic_signal_fence(memory_order_acq_rel);

forbids the compiler from reordering read and write commands around it.[2]

  • The Intel C++ compiler provides the
__memory_barrier()

intrinsic.[3][4]

  • The Microsoft Visual C++ compiler provides
_ReadWriteBarrier()
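
The barriers listed above can be illustrated with a minimal C11 sketch. It assumes a value is handed from normal code to a signal handler running on the same thread, which is the case atomic_signal_fence is specified for. The names payload, ready, publish_to_handler and read_in_handler are hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical names, not taken from any real code base. */
static atomic_int payload;   /* value handed to the handler */
static atomic_int ready;     /* 0 until the payload has been written */

/* Normal (non-handler) code: write the payload, then raise the flag.
   Negative values are reserved as the "nothing yet" result below. */
void publish_to_handler(int value)
{
    assert(value >= 0);
    atomic_store_explicit(&payload, value, memory_order_relaxed);
    /* Compile-time barrier: the compiler may not move the payload
       store below this point or the flag store above it. */
    atomic_signal_fence(memory_order_release);
    atomic_store_explicit(&ready, 1, memory_order_relaxed);
}

/* Intended to be called from a signal handler on the same thread.
   Returns the payload, or -1 if nothing has been published yet. */
int read_in_handler(void)
{
    if (!atomic_load_explicit(&ready, memory_order_relaxed))
        return -1;
    atomic_signal_fence(memory_order_acquire);
    return atomic_load_explicit(&payload, memory_order_relaxed);
}
```

Because the fence only constrains the compiler, this sketch is not sufficient for communication between threads running on different CPUs; that case needs a runtime barrier.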

Combined barriers

In many programming languages, different types of barriers can be combined with other operations (such as load, store, atomic increment, or atomic compare-and-swap), so that no extra memory barrier is needed before or after the operation (or both). Depending on the targeted CPU architecture and its hardware memory ordering guarantees, these language constructs translate to special instructions, to multiple instructions (e.g. a barrier and a load), or to a normal instruction.
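
As a sketch of such a combined barrier in C11, the store and load below carry their ordering requirement as an argument instead of using a separate barrier statement. The names message, has_message, send_message and try_receive are hypothetical. How the operations are lowered depends on the target: on x86 a release store is typically an ordinary store instruction, while on ARMv7 a dmb barrier is typically emitted alongside it.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names for illustration only. */
static int message;               /* ordinary, non-atomic data */
static atomic_bool has_message;   /* false until a message is published */

/* Writer side: the release store orders the earlier write to
   `message` before the flag becomes visible. */
void send_message(int value)
{
    message = value;
    atomic_store_explicit(&has_message, true, memory_order_release);
}

/* Reader side: the acquire load orders the later read of `message`
   after the flag has been observed as true. */
bool try_receive(int *out)
{
    assert(out != NULL);
    if (!atomic_load_explicit(&has_message, memory_order_acquire))
        return false;
    *out = message;
    return true;
}
```

The two functions are meant to run on different threads; they can also be called one after the other on a single thread, which is enough to check the logic.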

Runtime memory ordering

In symmetric multiprocessing (SMP) microprocessor systems

There are several memory-consistency models for SMP systems:

  • Sequential consistency (all reads and all writes are in-order)
  • Relaxed consistency (some types of reordering are allowed)
    • Loads can be reordered after loads (which helps cache coherency work efficiently and improves scaling)
    • Loads can be reordered after stores
    • Stores can be reordered after stores
    • Stores can be reordered after loads
  • Weak consistency (reads and writes are arbitrarily reordered, limited only by explicit memory barriers)
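
The difference between these models can be sketched with the classic "store buffering" test in C11. Each function is intended to run on its own thread, storing to one variable and then loading the other. The names x, y, thread_a and thread_b are hypothetical. With memory_order_seq_cst, as written here, the outcome in which both functions return 0 is forbidden. If every operation used memory_order_relaxed instead, that outcome would be permitted, since it corresponds to a store being reordered after a later load, a relaxation that even x86 allows.

```c
#include <stdatomic.h>

/* Hypothetical shared variables, both initially 0. */
static atomic_int x;
static atomic_int y;

/* Intended for thread A: store to x, then load y. */
int thread_a(void)
{
    atomic_store_explicit(&x, 1, memory_order_seq_cst);
    return atomic_load_explicit(&y, memory_order_seq_cst);
}

/* Intended for thread B: store to y, then load x. */
int thread_b(void)
{
    atomic_store_explicit(&y, 1, memory_order_seq_cst);
    return atomic_load_explicit(&x, memory_order_seq_cst);
}
```

Launching the two functions concurrently requires a threading library, which is left out of this sketch.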

On some CPUs

  • Atomic operations can be reordered with loads and stores.[6]
  • The instruction cache or pipeline can be incoherent, which prevents self-modifying code from being executed without special instruction cache flush/reload instructions.
  • Dependent loads can be reordered (this is unique to Alpha). If the processor fetches a pointer to some data after this reordering, it might not fetch the data itself but instead use stale data that it has already cached and not yet invalidated. Allowing this relaxation makes cache hardware simpler and faster but requires memory barriers for both readers and writers.[7] On Alpha hardware (such as multiprocessor Alpha 21264 systems), cache line invalidations sent to other processors are processed lazily by default, unless processing is explicitly requested between dependent loads. The Alpha architecture specification also allows other forms of dependent-load reordering, for example speculative data reads performed before the real pointer to be dereferenced is known.
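
The dependent-load case can be sketched in C11 as publishing a pointer and then dereferencing it. The names config, storage, current, publish_config and read_config are hypothetical. The sketch uses an acquire load of the pointer, which is sufficient on all architectures, including those that reorder dependent loads. C11 also defines memory_order_consume for this pattern, but compilers commonly treat it as acquire.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical data structure for illustration only. */
struct config {
    int value;
};

static struct config storage;
static _Atomic(struct config *) current;   /* NULL until published */

/* Writer: fill in the object, then publish a pointer to it.
   The release store orders the write to storage.value first. */
void publish_config(int value)
{
    storage.value = value;
    atomic_store_explicit(&current, &storage, memory_order_release);
}

/* Reader: load the pointer, then perform the dependent load through it.
   The acquire load keeps the dereference from observing stale data. */
int read_config(int fallback)
{
    struct config *p = atomic_load_explicit(&current, memory_order_acquire);
    return p != NULL ? p->value : fallback;
}
```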
Memory ordering in some architectures[8][9]
Type Alpha ARMv7 MIPS RISC-V (WMO, TSO) PA-RISC POWER SPARC (RMO, PSO, TSO) x86 [lower-alpha 1] AMD64 IA-64 z/Architecture
Loads can be reordered after loads YY depend on implementation YYYYY
Loads can be reordered after stores YYYYYYY
Stores can be reordered after stores YYYYYYYY
Stores can be reordered after loads YYYYYYYYYYYYY
Atomic can be reordered with loads YYYYYY
Atomic can be reordered with stores YYYYYYY
Dependent loads can be reordered Y
Incoherent instruction cache pipeline YYYYYYYYY
  1. This column indicates the behaviour of the vast majority of x86 processors. Some rare specialised x86 processors (IDT WinChip manufactured around 1998) may have weaker 'oostore' memory ordering.[10]

RISC-V memory ordering models:

  • WMO: weak memory order (default)
  • TSO: total store order (only supported with the Ztso extension)

SPARC memory ordering modes:

  • TSO: total store order (default)
  • RMO: relaxed-memory order (not supported on recent CPUs)
  • PSO: partial store order (not supported on recent CPUs)

Hardware memory barrier implementation

Many architectures with SMP support have special hardware instructions for flushing reads and writes during runtime.

  • x86, x86-64
lfence (asm), void _mm_lfence(void)
sfence (asm), void _mm_sfence(void)[11]
mfence (asm), void _mm_mfence(void)[12]
  • PowerPC
sync (asm)
  • MIPS
sync (asm)
  • IA-64 (Itanium)
mf (asm)
  • POWER
dcs (asm)
  • ARMv7
dmb (asm)
dsb (asm)
isb (asm)

Compiler support for hardware memory barriers

Some compilers support builtins that emit hardware memory barrier instructions:

  • GCC,[14] version 4.4.0 and later,[15] has __sync_synchronize.
  • C11 and C++11 added the atomic_thread_fence() function.
  • The Microsoft Visual C++ compiler[16] has MemoryBarrier().
  • Sun Studio Compiler Suite[17] has __machine_r_barrier, __machine_w_barrier and __machine_rw_barrier.
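
A minimal sketch of the C11 option from the list above follows. It assumes one writer and one reader and uses hypothetical names (data_word, flag, producer_store, consumer_load). Unlike a compile-time barrier, atomic_thread_fence also constrains the CPU, so the compiler emits a hardware barrier instruction where the target architecture needs one.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names for illustration only. */
static int data_word;      /* ordinary, non-atomic data */
static atomic_bool flag;   /* false until data_word is valid */

/* Writer: the standalone release fence orders the data write
   before the relaxed store that raises the flag. */
void producer_store(int value)
{
    data_word = value;
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&flag, true, memory_order_relaxed);
}

/* Reader: the standalone acquire fence orders the relaxed flag load
   before the data read that follows it. */
bool consumer_load(int *out)
{
    assert(out != NULL);
    if (!atomic_load_explicit(&flag, memory_order_relaxed))
        return false;
    atomic_thread_fence(memory_order_acquire);
    *out = data_word;
    return true;
}
```

Passing memory_order_seq_cst to atomic_thread_fence requests a full barrier, which is roughly the portable counterpart of a builtin such as __sync_synchronize.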

References

  1. GCC compiler-gcc.h Archived 2011-07-24 at the Wayback Machine
  2. ECC compiler-intel.h Archived 2011-07-24 at the Wayback Machine
  3. Intel(R) C++ Compiler Intrinsics Reference
    Creates a barrier across which the compiler will not schedule any data access instruction. The compiler may allocate local data in registers across a memory barrier, but not global data.
  4. Visual C++ Language Reference _ReadWriteBarrier
  5. Victor Alessandrini, 2015. Shared Memory Application Programming: Concepts and Strategies in Multicore Application Programming. Elsevier Science. p. 176. ISBN 978-0-12-803820-8.
  6. Reordering on an Alpha processor by Kourosh Gharachorloo
  7. Memory Ordering in Modern Microprocessors by Paul McKenney
  8. Memory Barriers: a Hardware View for Software Hackers, Figure 5 on Page 16
  9. Table 1. Summary of Memory Ordering, from "Memory Ordering in Modern Microprocessors, Part I"
  10. SFENCE Store Fence
  11. MFENCE Memory Fence
  12. Data Memory Barrier, Data Synchronization Barrier, and Instruction Synchronization Barrier.
  13. Atomic Builtins
  14. "36793 – x86-64 does not get __sync_synchronize right".
  15. MemoryBarrier macro
  16. Handling Memory Ordering in Multithreaded Applications with Oracle Solaris Studio 12 Update 2: Part 2, Memory Barriers and Memory Fence

This article is issued from Wikipedia. The text is licensed under Creative Commons - Attribution - Sharealike. Additional terms may apply for the media files.