While profiling Shadesmar a couple of weeks ago, I noticed that for large binary unserialized messages (>512kB), most of the execution time is spent copying the message (using memcpy) between process memory and shared memory, and back.
I had a few hours to kill last weekend, and I tried to implement a faster way to do memory copies.
Here's the dump of perf when running pub-sub for messages of sizes between 512kB and 2MB.
Children Self Shared Object Symbol
+ 99.86% 0.00% libc-2.27.so [.] __libc_start_main
+ 99.86% 0.00% [unknown] [k] 0x4426258d4c544155
+ 99.84% 0.02% raw_benchmark [.] main
+ 98.13% 97.12% libc-2.27.so [.] __memmove_avx_unaligned_erms
+ 51.99% 0.00% raw_benchmark [.] shm::PublisherBin<16u>::publish
+ 51.98% 0.01% raw_benchmark [.] shm::Topic<16u>::write
+ 47.64% 0.01% raw_benchmark [.] shm::Topic<16u>::read
__memmove_avx_unaligned_erms is an implementation of memcpy for unaligned memory blocks that uses AVX to copy 32 bytes at a time. Digging into the glibc source code, I found this:
#if IS_IN (libc)
# define VEC_SIZE 32
# define VEC(i) ymm##i
# define VMOVNT vmovntdq
# define VMOVU vmovdqu
# define VMOVA vmovdqa
# define SECTION(p) p##.avx
# define MEMMOVE_SYMBOL(p,s) p##_avx_##s
# include "memmove-vec-unaligned-erms.S"
#endif
Breaking down this function:

memmove: glibc implements memcpy as a memmove instead; here's the relevant source code:
# define SYMBOL_NAME memcpy
# include "ifunc-memmove.h"
libc_ifunc_redirected (__redirect_memcpy, __new_memcpy,
IFUNC_SELECTOR ());
Here's the difference between the two: with memcpy, the destination cannot overlap the source at all; with memmove, it can. Initially, I wasn't sure why it was implemented as memmove. The reason for this will become clearer as the post proceeds.
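To make the overlap rule concrete, here's a tiny standalone example (not from the library): shifting a buffer left by one byte has overlapping source and destination, so memmove is required.

#include <cstring>

// Tiny standalone example: shift a buffer left by one byte, in place.
// Source [buf + 1, buf + n) and destination [buf, buf + n - 1) overlap,
// so memmove is the right call here; memcpy with these arguments is undefined.
void shift_left_by_one(char *buf, size_t n) {
  std::memmove(buf, buf + 1, n - 1);
  // std::memcpy(buf, buf + 1, n - 1);  // undefined behaviour: regions overlap
}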
erms: Enhanced REP MOVSB is a hardware optimization for a loop that does a simple copy. In simple pseudo-code, this is what the loop looks like for copying a single byte at a time (REP MOVSB):
void rep_movsb(void *dest, const void *src, size_t len) {
  const uint8_t *s = (const uint8_t *)src;
  uint8_t *d = (uint8_t *)dest;
  while (len--)
    *d++ = *s++;
}
Since the loop copies the data byte by byte, it can handle the case of overlapping source and destination.
vec: Rather than copying single bytes in the above loop, it uses x86 vectorized instructions to copy multiple bytes in a single loop iteration (technically a single instruction). vmov* are assembly instructions for AVX, which is the latest vector instruction set that the CPU on my laptop supports. With VEC_SIZE = 32, it copies 32 bytes at a time.
unaligned: This is a generic version of memmove that can copy between any pointer locations irrespective of their alignment. Unaligned pointers increase the complexity of the copy loop when using vectorized instructions: the unaligned preceding and trailing memory locations must be copied separately before hitting the optimized loop.
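To make that concrete, here's a rough sketch of the idea (my own code, not glibc's): copy the unaligned head byte by byte until the destination reaches a 32-byte boundary, run an aligned vector loop over the bulk, and copy the unaligned tail separately.

#include <cstddef>
#include <cstdint>
#include <immintrin.h>

// Rough sketch of glibc's general idea, not its actual code.
void copy_peel_head_tail(void *d, const void *s, size_t n) {
  auto *dst = static_cast<uint8_t *>(d);
  const auto *src = static_cast<const uint8_t *>(s);
  // Head: copy bytes one at a time until dst is 32-byte aligned.
  size_t head = (32 - (reinterpret_cast<uintptr_t>(dst) & 31)) & 31;
  if (head > n) head = n;
  for (size_t i = 0; i < head; ++i) *dst++ = *src++;
  n -= head;
  // Bulk: aligned 32-byte stores; the source may still be unaligned,
  // so use an unaligned load.
  for (; n >= 32; n -= 32, src += 32, dst += 32) {
    __m256i v = _mm256_loadu_si256(reinterpret_cast<const __m256i *>(src));
    _mm256_store_si256(reinterpret_cast<__m256i *>(dst), v);
  }
  // Tail: the remaining (< 32) bytes.
  for (size_t i = 0; i < n; ++i) dst[i] = src[i];
}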
memmove-vec-unaligned-erms.S holds the actual implementation in assembly. A few things that the implementation does: it uses REP MOVS only if the data is larger than 4kB; for smaller sizes it uses the SSE2 optimization. For unaligned pointers, it uses the following moves: vmovdqu, movq, movl, movzwl, and movw.
VMOVNT, defined above, is for doing non-temporal (NT) moves. NT instructions are used when there is an overlap between destination and source, since the destination may be in cache when the source is loaded. It uses prefetcht0 to load data into cache (all levels: t0). In the current iteration, we prefetch the data for 2 iterations later. The data is copied (via the cache) into registers, and then copied (via NT stores) from the registers into the destination.
L(loop_large_forward):
; Copy 4 * VEC a time forward with non-temporal stores.
PREFETCH_ONE_SET (1, (%rsi), PREFETCHED_LOAD_SIZE * 2)
PREFETCH_ONE_SET (1, (%rsi), PREFETCHED_LOAD_SIZE * 3)
; PREFETCH 256b from rsi+256 to rsi+511
VMOVU (%rsi), %VEC(0)
VMOVU VEC_SIZE(%rsi), %VEC(1)
VMOVU (VEC_SIZE * 2)(%rsi), %VEC(2)
VMOVU (VEC_SIZE * 3)(%rsi), %VEC(3)
; mov 128b from rsi to rsi+127 -> 4 ymm registers (cache)
; 2 loops later, we hit the prefetched values
addq $PREFETCHED_LOAD_SIZE, %rsi ; advance to rsi+128 in next loop
subq $PREFETCHED_LOAD_SIZE, %rdx
VMOVNT %VEC(0), (%rdi)
VMOVNT %VEC(1), VEC_SIZE(%rdi)
VMOVNT %VEC(2), (VEC_SIZE * 2)(%rdi)
VMOVNT %VEC(3), (VEC_SIZE * 3)(%rdi)
; mov 128b from 4 ymm register -> rdi to rdi+127 (no cache)
addq $PREFETCHED_LOAD_SIZE, %rdi ; advance to rdi+128 in next loop
cmpq $PREFETCHED_LOAD_SIZE, %rdx
ja L(loop_large_forward)
Before getting into more exotic implementations, I wanted to first implement a super simple version of ERMS to see how well it would perform. I used inline assembly to write out the loop.
void _rep_movsb(void *d, const void *s, size_t n) {
asm volatile("rep movsb"
: "=D"(d), "=S"(s), "=c"(n)
: "0"(d), "1"(s), "2"(n)
: "memory");
}
This does the same as the pseudo-code attached above, but I wrote it in assembly to prevent any compiler optimization, and rely only on the hardware ERMS optimization.
One of the complexities in glibc's implementation is getting it to work for unaligned pointers. Since I control the memory allocation, I figured I could recreate the implementation focused solely on aligned pointers and sizes. I'm using AVX intrinsics for 32-byte vectors:
void _avx_cpy(void *d, const void *s, size_t n) {
// d, s -> 32 byte aligned
// n -> multiple of 32
auto *dVec = reinterpret_cast<__m256i *>(d);
const auto *sVec = reinterpret_cast<const __m256i *>(s);
size_t nVec = n / sizeof(__m256i);
for (; nVec > 0; nVec--, sVec++, dVec++) {
const __m256i temp = _mm256_load_si256(sVec);
_mm256_store_si256(dVec, temp);
}
}
The logic is identical to the previous REP MOVSB loop, except that it operates on 32 bytes at a time.
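A hypothetical call site (not part of the library) has to uphold those preconditions, for example by allocating with aligned_alloc:

#include <cstdlib>
#include <cstring>

int main() {
  // Hypothetical call site: both buffers are 32-byte aligned and the size is
  // a multiple of 32, as required by _avx_cpy above.
  const size_t n = 1 << 20;  // 1MB
  void *src = std::aligned_alloc(32, n);
  void *dst = std::aligned_alloc(32, n);
  std::memset(src, 42, n);
  _avx_cpy(dst, src, n);
  std::free(dst);
  std::free(src);
}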
_mm256_load_si256 and _mm256_store_si256 go through the cache, which incurs additional overhead. The AVX instruction set has stream load and store instructions that skip the cache. The performance of this copy is dependent on the copy size: non-temporal moves may bog down the performance for smaller copies (that can fit into the L2 cache) compared to regular moves.
void _avx_async_cpy(void *d, const void *s, size_t n) {
// d, s -> 32 byte aligned
// n -> multiple of 32
auto *dVec = reinterpret_cast<__m256i *>(d);
const auto *sVec = reinterpret_cast<const __m256i *>(s);
size_t nVec = n / sizeof(__m256i);
for (; nVec > 0; nVec--, sVec++, dVec++) {
const __m256i temp = _mm256_stream_load_si256(sVec);
_mm256_stream_si256(dVec, temp);
}
_mm_sfence();
}
This is the same code as before, but using non-temporal moves instead. There's an extra _mm_sfence, which guarantees that all stores in the preceding loop are globally visible.
In the previous method, we skipped the cache entirely. We can squeeze a bit more performance by prefetching the source data for the next iteration into the cache during the current iteration. Since all prefetches work on cache lines (64 bytes), each loop iteration copies 64 bytes from source to destination.
void _avx_async_pf_cpy(void *d, const void *s, size_t n) {
// d, s -> 64 byte aligned
// n -> multiple of 64
auto *dVec = reinterpret_cast<__m256i *>(d);
const auto *sVec = reinterpret_cast<const __m256i *>(s);
size_t nVec = n / sizeof(__m256i);
for (; nVec > 2; nVec -= 2, sVec += 2, dVec += 2) {
// prefetch the next iteration's data
// by default _mm_prefetch moves the entire cache-line (64b)
_mm_prefetch(sVec + 2, _MM_HINT_T0);
_mm256_stream_si256(dVec, _mm256_load_si256(sVec));
_mm256_stream_si256(dVec + 1, _mm256_load_si256(sVec + 1));
}
_mm256_stream_si256(dVec, _mm256_load_si256(sVec));
_mm256_stream_si256(dVec + 1, _mm256_load_si256(sVec + 1));
_mm_sfence();
}
The load from the source pointer into a register should not skip the cache, since that data has been explicitly prefetched into the cache; the non-stream _mm256_load_si256 is used instead.
This also unrolls the loop to do 2 copies at a time instead of a single copy, which guarantees that each loop iteration's prefetch coincides with the copy: prefetch the next 64 bytes and copy the current 64 bytes.
In the previous section, most of the changes were in the actual underlying load/store instructions used. Another avenue of exploration is to unroll the loop by a certain number of iterations. This reduces the number of branch instructions by the unrolling factor.
In the glibc implementation the unrolling factor is 4, which is what I'll use as well. A very simple way to implement this is to increase the required alignment by 4x and treat each loop iteration as 4 copy instructions that move 4x the data.
A more complicated version would be trying to implement an unrolled loop without increasing alignment size. We’ll need to copy using a regular fully rolled loop till we hit a pointer location that is aligned to the size expected by our unrolled loop.
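I didn't implement that variant, but here's a sketch of what it could look like (the function name is mine, and the inputs are assumed to be 32-byte aligned with n a multiple of 32, though not necessarily of 128):

#include <cstddef>
#include <cstdint>
#include <immintrin.h>

// Sketch only: roll until the destination reaches a 128-byte boundary,
// unroll by 4 for the bulk, then roll again for whatever is left.
void _avx_cpy_unroll_peel(void *d, const void *s, size_t n) {
  // d, s -> 32 byte aligned
  // n -> multiple of 32
  auto *dVec = reinterpret_cast<__m256i *>(d);
  const auto *sVec = reinterpret_cast<const __m256i *>(s);
  size_t nVec = n / sizeof(__m256i);
  // Rolled head: one 32-byte copy at a time until dVec is 128-byte aligned.
  while (nVec > 0 && (reinterpret_cast<uintptr_t>(dVec) & 127) != 0) {
    _mm256_store_si256(dVec++, _mm256_load_si256(sVec++));
    --nVec;
  }
  // Unrolled bulk: 4 x 32 bytes per iteration.
  for (; nVec >= 4; nVec -= 4, sVec += 4, dVec += 4) {
    _mm256_store_si256(dVec, _mm256_load_si256(sVec));
    _mm256_store_si256(dVec + 1, _mm256_load_si256(sVec + 1));
    _mm256_store_si256(dVec + 2, _mm256_load_si256(sVec + 2));
    _mm256_store_si256(dVec + 3, _mm256_load_si256(sVec + 3));
  }
  // Rolled tail: the remaining (< 4) vectors.
  for (; nVec > 0; --nVec) {
    _mm256_store_si256(dVec++, _mm256_load_si256(sVec++));
  }
}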
Unrolling the aligned AVX copy:
void _avx_cpy_unroll(void *d, const void *s, size_t n) {
// d, s -> 128 byte aligned
// n -> multiple of 128
auto *dVec = reinterpret_cast<__m256i *>(d);
const auto *sVec = reinterpret_cast<const __m256i *>(s);
size_t nVec = n / sizeof(__m256i);
for (; nVec > 0; nVec -= 4, sVec += 4, dVec += 4) {
_mm256_store_si256(dVec, _mm256_load_si256(sVec));
_mm256_store_si256(dVec + 1, _mm256_load_si256(sVec + 1));
_mm256_store_si256(dVec + 2, _mm256_load_si256(sVec + 2));
_mm256_store_si256(dVec + 3, _mm256_load_si256(sVec + 3));
}
}
The operation of copying data is super easy to parallelize across multiple threads. The total data to be transferred can be segmented into (almost) equal chunks, and then copied over using one of the above methods. This will make the copy super-fast especially if the CPU has a large core count.
To make it easy to integrate custom memory copying logic into the library, I introduced the concept of a Copier in this commit. For a new copying algorithm, the abstract class Copier must be implemented. Here's the definition of Copier:
class Copier {
public:
virtual void *alloc(size_t) = 0;
virtual void dealloc(void *) = 0;
virtual void shm_to_user(void *, void *, size_t) = 0;
virtual void user_to_shm(void *, void *, size_t) = 0;
};
The original reason for introducing this construct was to allow cross-device usage, where a custom copier would be implemented to transfer between CPU and GPU, e.g. using cudaMemcpy for Nvidia GPUs.
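For instance, a GPU copier might look roughly like this, a sketch only: a hypothetical CudaCopier (not part of the library) that keeps the user buffer in GPU memory and uses the CUDA runtime API for the transfers.

#include <cuda_runtime.h>

// Sketch only: a hypothetical CudaCopier, not part of the library. The "user"
// buffer lives in GPU memory; shared memory stays on the host.
class CudaCopier : public Copier {
 public:
  void *alloc(size_t size) override {
    void *ptr = nullptr;
    cudaMalloc(&ptr, size);  // allocate the user buffer on the device
    return ptr;
  }
  void dealloc(void *ptr) override { cudaFree(ptr); }
  void shm_to_user(void *dst, void *src, size_t size) override {
    cudaMemcpy(dst, src, size, cudaMemcpyHostToDevice);  // shm (host) -> GPU
  }
  void user_to_shm(void *dst, void *src, size_t size) override {
    cudaMemcpy(dst, src, size, cudaMemcpyDeviceToHost);  // GPU -> shm (host)
  }
};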
For a single-device use case, the implementations of shm_to_user and user_to_shm are identical. Here's the implementation of a copier that uses std::memcpy:
class DefaultCopier : public Copier {
public:
void *alloc(size_t size) override { return malloc(size); }
void dealloc(void *ptr) override { free(ptr); }
void shm_to_user(void *dst, void *src, size_t size) override {
std::memcpy(dst, src, size);
}
void user_to_shm(void *dst, void *src, size_t size) override {
std::memcpy(dst, src, size);
}
};
I also created an adapter, MTCopier, that adds multithreading support to other copiers:
template <class BaseCopierT>
class MTCopier : public Copier {
public:
explicit MTCopier(uint32_t threads = std::thread::hardware_concurrency())
: nthreads(threads) {}
void *alloc(size_t size) override { return base_copier.alloc(size); }
void dealloc(void *ptr) override { base_copier.dealloc(ptr); }
void _copy(void *d, void *s, size_t n, bool shm_to_user) {
std::vector<std::thread> threads;
threads.reserve(nthreads);
lldiv_t per_worker = lldiv(static_cast<int64_t>(n), nthreads);
size_t next_start = 0;
for (uint32_t thread_idx = 0; thread_idx < nthreads; ++thread_idx) {
const size_t curr_start = next_start;
next_start += per_worker.quot;
if (thread_idx < per_worker.rem) {
++next_start;
}
uint8_t *d_thread = reinterpret_cast<uint8_t *>(d) + curr_start;
uint8_t *s_thread = reinterpret_cast<uint8_t *>(s) + curr_start;
if (shm_to_user) {
threads.emplace_back(&Copier::shm_to_user, &base_copier, d_thread,
s_thread, next_start - curr_start);
} else {
threads.emplace_back(&Copier::user_to_shm, &base_copier, d_thread,
s_thread, next_start - curr_start);
}
}
for (auto &thread : threads) {
thread.join();
}
threads.clear();
}
void shm_to_user(void *dst, void *src, size_t size) override {
_copy(dst, src, size, true);
}
void user_to_shm(void *dst, void *src, size_t size) override {
_copy(dst, src, size, false);
}
private:
BaseCopierT base_copier;
uint32_t nthreads;
};
Currently, this only works with memcpy and _rep_movsb, since the implementation expects the memory copy to work for unaligned memory.
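As a usage sketch (hypothetical code, not from the library), wrapping the default copier looks like this:

#include <cstddef>

int main() {
  // Hypothetical usage: split a 64MB copy across 4 threads, each thread
  // running std::memcpy on its own chunk.
  MTCopier<DefaultCopier> copier(4);
  const size_t n = 64 * 1024 * 1024;
  void *src = copier.alloc(n);
  void *dst = copier.alloc(n);
  copier.user_to_shm(dst, src, n);  // identical to shm_to_user on one device
  copier.dealloc(src);
  copier.dealloc(dst);
}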
I used Google Benchmark for timing the performance of copying data ranging in size from 32kB to 64MB. All the benchmarks were run on my PC with the following specifications:
Stick to std::memcpy. It delivers great performance while also adapting to the hardware architecture, and it makes no assumptions about the memory alignment. If performance truly matters, you might want to consider using a more specific, non-generic implementation with alignment requirements. The streaming prefetching copy works the best for larger copies (>1MB), but its performance for small sizes is abysmal, and memcpy matches its performance. For small to medium sizes, Unrolled AVX absolutely dominates, but for larger messages it is slower than the streaming alternatives. The regular RepMovsb is by far the worst overall performer, as expected.
Unrolling definitely improves performance in most cases, by about 5-10%. The only case where the unrolled version is slower than the rolled version is AvxCopier with a data size of 32B, where the unrolled version is 25% slower. The rolled version will do a single AVX-256 load/store and a conditional check; the unrolled version will do 4 AVX-256 load/stores and a conditional check.
Code for all the methods is included in the library, conforming to the above-mentioned API. To actively warn about the danger of using these custom copiers, I have named this file dragons.h, with an apt message: Here be dragons.