Suite for benchmarking malloc implementations, originally
developed for benchmarking mimalloc.
Collection of various benchmarks from the academic literature, together with
automated scripts to pull specific versions of benchmark programs and
allocators from GitHub and build them.
Due to the large variance in programs and allocators, the suite is currently
only developed for Unix-like systems, and specifically Ubuntu with apt-get, Fedora with dnf,
and macOS (for a limited set of allocators and benchmarks).
The only system-installed allocator used is glibc's implementation that ships as part of Linux's libc.
All other allocators are downloaded and built as part of build-bench-env.sh --
if you are looking to run these benchmarks on a different Linux distribution, look at
the setup_packages function to see the packages required to build the full set of
allocators.
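One way to read that package list without opening the whole script is to print just the function body. The sketch below is a hypothetical helper (show_function is not part of mimalloc-bench). It assumes the function is defined as `name() {` or `function name {` and that its closing brace sits at the start of a line.

```shell
# Hypothetical helper, not part of mimalloc-bench: print a shell function
# from a script, from its definition line up to the first line that
# starts with a closing brace.
show_function() {
  # $1: function name, $2: path to the script
  awk -v name="$1" '
    # Start printing at "function name ..." or "name() ..."
    $0 ~ ("^(function[ \t]+" name "|" name "[ \t]*\\(\\))") { printing = 1 }
    printing { print }
    # Stop after the first line that begins with "}"
    printing && /^}/ { exit }
  ' "$2"
}
```

For example, `show_function setup_packages build-bench-env.sh` run from the repository root would print the function if it follows the layout assumed above. If nothing is printed, a plain `grep -n setup_packages build-bench-env.sh` will still locate it.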
It is quite easy to add new benchmarks and allocator implementations --
please do so!
Enjoy,
Daan
Note that none of the code in the bench directory is part of
mimalloc-bench as such, and all programs in the bench directory are
governed by their own specific licenses and copyrights as detailed in
their README.md (or license.txt) files. They are just included here for convenience.
The build-bench-env.sh script with the all argument will automatically pull
all needed benchmarks and allocators and build them in the extern directory:
~/dev/mimalloc-bench> ./build-bench-env.sh all
It starts by installing packages, for which you will need to enter the sudo password.
All other programs are built in the mimalloc-bench/extern directory.
Use ./build-bench-env.sh -h to see all options.
If everything succeeded, you can run the full benchmark suite (from out/bench) as:
~/dev/mimalloc-bench> cd out/bench
~/dev/mimalloc-bench/out/bench> ../../bench.sh alla allt
Or just test mimalloc and tcmalloc on cfrac and larson with 16 threads:
~/dev/mimalloc-bench/out/bench> ../../bench.sh --procs=16 mi tc cfrac larson
Generally, you can specify the allocators (mi, je,
tc, hd, sys (system allocator), etc.) and the benchmarks
(cfrac, espresso, barnes, lean, larson, alloc-test, cscratch, etc.),
or all allocators (alla) and all tests (allt).
Use --procs=<n> to set the concurrency, and use --help to see all supported
allocators and benchmarks.
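To compare how allocators scale, the same selection can be run at several thread counts. The following is a minimal sketch, not part of the suite: the sweep_procs function and the BENCH and RUNNER variables are hypothetical names introduced here. Only the --procs=<n> option and the allocator and benchmark names come from the text above.

```shell
# Hypothetical wrapper around bench.sh (not part of mimalloc-bench).
# BENCH is the path to bench.sh as seen from out/bench.
BENCH="${BENCH:-../../bench.sh}"

sweep_procs() {
  # $1: space-separated thread counts
  # remaining arguments: allocators and benchmarks, passed through unchanged
  counts="$1"
  shift
  for n in $counts; do
    # With RUNNER=echo each command is printed instead of executed (dry run);
    # with RUNNER unset the command runs directly.
    ${RUNNER:-} "$BENCH" "--procs=$n" "$@"
  done
}
```

For instance, from out/bench, `RUNNER=echo sweep_procs "1 4 16" mi tc cfrac larson` prints the three bench.sh invocations it would run. Dropping `RUNNER=echo` executes them one after another.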
Current Allocators
Supported allocators are as follows; see
build-bench-env.sh
for the versions:
dieharder: The DieHarder
allocator is an error-resistant memory allocator for Windows, Linux, and Mac
OS X.
lt: The ltalloc allocator,
a multi-threaded memory allocator based on free lists best suited for many small allocations.
mesh: The mesh allocator, a
memory allocator that automatically reduces the memory footprint of C/C++
applications. Also tested as nomesh with the meshing feature disabled.
mi: The mimalloc allocator.
We can also test the debug version as dmi (this can be used to check for
any bugs in the benchmarks), and the secure version as smi.
rp: The rpmalloc allocator uses
16-byte aligned allocations and is developed by Mattias
Jansson at Epic Games, used for example
in Haiku.
sc: The scalloc allocator,
a fast, multicore-scalable, low-fragmentation memory allocator.
scudo: The
scudo allocator
used by Fuchsia and Android.
sg: The slimguard allocator,
designed to be secure and memory-efficient.
sm: The Supermalloc
allocator by Bradley Kuszmaul uses hardware transactional memory to speed up
parallel operations.
sn: The snmalloc allocator
is a recent concurrent message passing
allocator by Liétar et al. [8].
tbb: The Intel TBB allocator that comes
with the Threading Building Blocks (TBB) library [7].
tc: The tcmalloc
allocator which comes as part of the Google performance tools,
now maintained by the community.
tcg: The tcmalloc
allocator, maintained and used
by Google.
yal: The yalloc allocator ("yet another allocator") aims to balance safety and compactness.
sys: The system allocator. Here we usually use the glibc allocator
(which is originally based on Ptmalloc2).
Current Benchmarks
The first set of benchmarks consists of real-world programs, or programs
that try to mimic them:
barnes: a hierarchical n-body particle solver [4], simulating the
gravitational forces between 163840 particles. It uses relatively few
allocations compared to cfrac and espresso but is multithreaded.
cfrac: by Dave Barrett, implementation of continued fraction
factorization, using many small short-lived allocations.
espresso: a programmable logic array analyzer, described by
Grunwald, Zorn, and Henderson [3] in the context of cache-aware memory allocation.
gs: have ghostscript process the entire
Intel Software Developer’s Manual PDF, which is around 5000 pages.
leanN: The Lean compiler by
de Moura et al, version 3.4.1,
compiling its own standard library concurrently using N threads
(./lean --make -j N). Big real-world workload with intensive
allocations.
redis: runs redis-benchmark
with 1 million requests that push 10 new list elements and then request the
head 10 elements, and measures the requests handled per second. Simulates a
real-world workload.
larsonN: by Larson and Krishnan [2]. Simulates a server workload using 100 separate
threads which each allocate and free many objects but leave some
objects to be freed by other threads. Larson and Krishnan observe this
behavior (which they call bleeding) in actual server applications,
and the benchmark simulates this.
larsonN-sized: same as larsonN, except that it uses sized deallocation calls, which
have a fast path in some allocators.
The second set of benchmarks are stress tests and consist of:
alloc-test: a modern allocator test developed by
OLogN Technologies AG (ITHare.com).
Simulates intensive allocation workloads with a Pareto size
distribution. The alloc-testN benchmark runs on N cores doing
100·10⁶ allocations per thread with objects up to 1KiB
in size. Using commit 94f6cb
(master, 2018-07-04).
cache-scratch: by Emery Berger [1]. Introduced with the
Hoard allocator to test for
passive false sharing of cache lines: first some small objects are
allocated and given to each thread; each thread then frees its object, immediately
allocates another one, and accesses that repeatedly. If an allocator
places objects from different threads close to each other, this leads
to cache-line contention.
cache_trash: part of the Hoard
benchmarking suite, designed to exercise heap cache locality.
glibc-simple and glibc-thread: benchmarks for the glibc.
malloc-large: part of the mimalloc benchmarking suite, designed
to exercise large (several MiB) allocations.
mleak: checks that terminated threads don't "leak" memory.
mstress: simulates real-world server-like allocation patterns, using N threads with allocations in powers of 2,
where objects can migrate between threads and some have long lifetimes. Not all threads have equal workloads, and
after each phase all threads are destroyed and new threads are created, with some objects surviving between phases.
rbstress: modified version of allocator_bench,
allocates chunks in memory via ruby shenanigans.
sh6bench: by MicroQuill as part of
SmartHeap. Stress test
where some of the objects are freed in a usual last-allocated, first-freed
(LIFO) order, but others are freed in reverse order. Using the public
source (retrieved 2019-01-02).
sh8benchN: by MicroQuill as part of
SmartHeap. Stress test
for multi-threaded allocation (with N threads) where, just as in larson,
some objects are freed by other threads, and some objects freed in reverse
(as in sh6bench). Using the public
source (retrieved 2019-01-02).
xmalloc-testN: by Lever and Boreham [5] and Christian Eder. We use the
updated version from the
SuperMalloc repository. This is a
more extreme version of the larson benchmark with 100 purely allocating
threads, and 100 purely deallocating threads with objects of various sizes
migrating between them. This asymmetric producer/consumer pattern is usually
difficult for allocators with thread-local caches to handle.
Finally, there is a
security benchmark
aiming at checking basic security properties of allocators.
Example
Below is an example (Apr 2019) of the benchmark results on an HP
Z4-G4 workstation with a 4-core Intel® Xeon® W2123 at 3.6 GHz with 16GB
ECC memory, running Ubuntu 18.04.1 with LibC 2.27 and GCC 7.3.0.
Memory usage:
(note: the xmalloc-testN memory usage should be disregarded, as it
allocates more the faster the program runs. Unfortunately,
there are no entries for SuperMalloc in the leanN and xmalloc-testN
benchmarks, as it faulted on those.)
[1] Emery D. Berger, Kathryn S. McKinley, Robert D. Blumofe, and Paul R. Wilson.
Hoard: A Scalable Memory Allocator for Multithreaded Applications.
In the Ninth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-IX). Cambridge, MA, November 2000.
[2] P. Larson and M. Krishnan. Memory allocation for long-running server applications. In ISMM, Vancouver, B.C., Canada, 1998.
[3] D. Grunwald, B. Zorn, and R. Henderson.
Improving the cache locality of memory allocation. In R. Cartwright, editor,
Proceedings of the Conference on Programming Language Design and Implementation, pages 177–186, New York, NY, USA, June 1993.
[4] J. Barnes and P. Hut. A hierarchical O(n*log(n)) force-calculation algorithm. Nature, 324:446-449, 1986.
[7] Alexey Kukanov and Michael J. Voss.
The Foundations for Scalable Multi-Core Software in Intel Threading Building Blocks.
Intel Technology Journal 11 (4). 2007.
[8] Paul Liétar, Theodore Butler, Sylvan Clebsch, Sophia Drossopoulou, Juliana Franco, Matthew J Parkinson,
Alex Shamis, Christoph M Wintersteiger, and David Chisnall.
Snmalloc: A Message Passing Allocator.
In Proceedings of the 2019 ACM SIGPLAN International Symposium on Memory Management, 122–135. ACM. 2019.