Write combining batches writes to the same cache line so that they can be transferred together in a single bus transaction.

The data is collected in a write-combining buffer and then written out in burst mode.

By combining small individual memory transfers into one large, contiguous transfer, this technique allows systems to nearly saturate the AGP/PCI bus and to transfer twice as much data, or more, than systems without write combining.
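To make the saving concrete, here is a rough counting model rather than a hardware measurement. It assumes a 64-byte write-combining buffer and assumes that, without combining, every store goes out as its own bus transaction. The names `kBufferSize`, `transactions_uncombined`, and `transactions_combined` are made up for this sketch.

```cpp
#include <cstddef>

// Assumption: one write-combining buffer covers one 64-byte cache line.
constexpr std::size_t kBufferSize = 64;

// Bus transactions when every store of `store_size` bytes is sent on its own.
constexpr std::size_t transactions_uncombined(std::size_t total_bytes,
                                              std::size_t store_size) {
    return (total_bytes + store_size - 1) / store_size;  // round up
}

// Bus transactions when stores are merged into full-buffer bursts.
constexpr std::size_t transactions_combined(std::size_t total_bytes) {
    return (total_bytes + kBufferSize - 1) / kBufferSize;  // round up
}
```

Under these assumptions, writing 4096 bytes as 16-byte stores takes 256 transactions without combining and 64 bursts with it. Real hardware differs in many details, so treat the ratio as an illustration only.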

The normal way:

With write combining:

The Weak Ordering problem

Write-combining memory is weakly ordered, which makes it unsuitable for general memory access. The most common use of write combining is the frame buffer in video memory, which does not need strong ordering.

Using _mm_stream_si128 intrinsics

In C++ code, several intrinsic functions can help make use of the write-combining buffers.

 
void _mm_stream_si128(__m128i *p, __m128i a)
 
 

Stores the data in a to the address p without polluting the caches. If the cache line containing address p is already in the cache, the cache will be updated. Address p must be 16-byte aligned.
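A minimal sketch of how this intrinsic might be used to fill a buffer. The helper name `stream_fill` and its preconditions are assumptions made for this example. It needs an x86 compiler with SSE2 enabled. The `_mm_sfence()` at the end is there because streaming stores are weakly ordered: it makes them globally visible before any later stores.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <emmintrin.h>  // SSE2: _mm_stream_si128, _mm_set1_epi32

// Hypothetical helper: fill `count` 32-bit values at `dst` with `value`
// using non-temporal (streaming) stores.
// Assumptions: `dst` is 16-byte aligned and `count` is a multiple of 4,
// because each 128-bit store writes four 32-bit values.
void stream_fill(std::uint32_t* dst, std::size_t count, std::uint32_t value) {
    assert(reinterpret_cast<std::uintptr_t>(dst) % 16 == 0);
    assert(count % 4 == 0);

    // Broadcast `value` into all four 32-bit lanes of one XMM register.
    const __m128i v = _mm_set1_epi32(static_cast<int>(value));

    for (std::size_t i = 0; i < count; i += 4) {
        // Four consecutive 16-byte stores cover one 64-byte line, which
        // gives the hardware the chance to combine them into one burst.
        _mm_stream_si128(reinterpret_cast<__m128i*>(dst + i), v);
    }

    // Streaming stores are weakly ordered; fence before anyone reads the data.
    _mm_sfence();
}
```

A caller would typically allocate the destination with `alignas(64)` or an aligned allocator so that whole cache lines are written from start to end.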

The cache line is the unit of transfer between the CPU cache and memory. A CPU cache is organized somewhat like a hash map in which each entry holds a fixed-size block of memory, typically 64 bytes. This block is called a cache line.
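As a small illustration, the sketch below shows how to check whether two addresses fall in the same cache line. It assumes a 64-byte line, which is typical but not guaranteed, and the names `kCacheLineSize`, `cache_line_base`, and `same_cache_line` are made up for this example.

```cpp
#include <cstddef>
#include <cstdint>

// Assumption: 64-byte cache lines (typical on current x86 CPUs).
constexpr std::size_t kCacheLineSize = 64;

// Address of the first byte of the cache line that contains `addr`.
// Works because the line size is a power of two: clear the low 6 bits.
constexpr std::uintptr_t cache_line_base(std::uintptr_t addr) {
    return addr & ~static_cast<std::uintptr_t>(kCacheLineSize - 1);
}

// True if both addresses live in the same cache line, so writes to them
// are candidates for being combined.
constexpr bool same_cache_line(std::uintptr_t a, std::uintptr_t b) {
    return cache_line_base(a) == cache_line_base(b);
}
```

For example, addresses `0x1200` and `0x123F` share a line under this assumption, while `0x1240` starts the next one.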

A variable of type __m128i maps to one of the XMM registers (XMM0-XMM7 in 32-bit mode, XMM0-XMM15 in 64-bit mode). Each register is 128 bits (16 bytes) wide.
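To show what the 16-byte width means in practice, here is a small sketch that treats one __m128i as four 32-bit integers. The helper name `add4` is made up for this example; the intrinsics are standard SSE2 ones, and the unaligned load/store variants are used so the arrays need no special alignment.

```cpp
#include <cstdint>
#include <emmintrin.h>  // SSE2: __m128i, _mm_loadu_si128, _mm_add_epi32

// One __m128i occupies exactly one 128-bit XMM register.
static_assert(sizeof(__m128i) == 16, "__m128i is 16 bytes wide");

// Hypothetical helper: out[i] = a[i] + b[i] for i in 0..3, done with a
// single 128-bit add instead of four scalar adds.
void add4(const std::int32_t* a, const std::int32_t* b, std::int32_t* out) {
    const __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a));
    const __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b));
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), _mm_add_epi32(va, vb));
}
```

Four such 16-byte values fill one 64-byte cache line, which is why streaming stores of __m128i values pair naturally with write combining.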

Store Intrinsics

Reference

Write-combining

Why SSDs Care About What You Write: Fragmentation & Write Combining

The basic idea of write combining

Copying Accelerated Video Decode Frame Buffers

Write Combining by Martin Thompson