Benchmarking OS primitives

Lately I have taken some interest in the hardware and software of C++ build servers. One of the things that I have noticed is that there is a significant performance difference between Windows and Linux machines for common build tasks, such as cloning a git repository, running CMake and caching build results.

Some of these differences can obviously be explained by the fact that the software was originally designed and optimized for Linux (which is especially true for git), but surely there must be underlying differences in the operating systems that contribute to this too.

To get a clearer picture, I set out to benchmark some typical core primitives of operating systems. This includes:

  • Process and thread creation.
  • File creation.
  • Memory allocation.

These primitives are used heavily by software build toolchains, as well as by many other kinds of software (e.g. VCS clients/servers, web servers, etc.).

The benchmark suite

I wrote a set of simple micro benchmarks in plain C. You can find the source code on GitHub. I have built and run the benchmark programs on Linux, Windows and macOS, and they should be very portable and run on most Unix-like operating systems.

Pre-built binaries for Windows (64-bit, compiled with GCC): osbench-win64-20170529.zip

The test systems

Name          OS                CPU                            Disk
Linux-i7x4    Ubuntu 16.04      i7-6820HQ, 4-core, 2.7GHz      256GB SSD (SATA)
Linux-i7x8    Ubuntu 16.10      i7-6900K, 8-core, 3.2GHz       1TB SSD (NVMe)
Linux-AMDx8   Fedora 25         Ryzen 1800X, 8-core, 3.6GHz    250GB SSD (NVMe)
RaspberryPi   Raspbian Jessie   ARMv7, 4-core, 1.2GHz          32GB MicroSD
MacBookPro    macOS 10.12.4     i5-6360U, 2-core, 2GHz         250GB SSD (NVMe)
MacMini       macOS 10.12.5     i7-3615QM, 4-core, 2.3GHz      1TB HDD (SATA)
Win-i7x4      Win 10 Pro        i7-6820HQ, 4-core, 2.7GHz      256GB SSD (SATA)
Win-AMDx8     Win 10 Pro        Ryzen 1800X, 8-core, 3.6GHz    250GB SSD (NVMe)

Except for the Raspberry Pi3, most of these systems are fairly high end. Note that Linux-AMDx8 and Win-AMDx8 run on identical hardware, as do Linux-i7x4 and Win-i7x4.

The results

Creating threads

In this benchmark 100 threads are created. Each thread terminates immediately without doing any work, and the main thread waits for all child threads to terminate. The time it takes for a single thread to start and terminate is measured.

[Figure: Create thread]

Apparently macOS is about twice as fast as Windows at creating threads, whereas Linux is about three times faster than Windows.
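To make the measurement concrete, here is a minimal sketch of this kind of thread benchmark using POSIX threads. It is not the code from the benchmark suite: the names `time_thread_creation`, `noop_thread` and `now_seconds`, the `MAX_THREADS` limit, and the use of `clock_gettime()` for timing are all assumptions made for illustration, and a Windows version would need a different thread API.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <time.h>

/* Thread body (hypothetical name): return immediately without doing any work. */
static void *noop_thread(void *arg) {
    return arg;
}

/* Monotonic wall-clock time in seconds (assumed timing method). */
static double now_seconds(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Create `count` threads, wait for all of them to terminate, and return the
 * average time per thread in seconds, or -1.0 if a thread could not be
 * created. */
double time_thread_creation(int count) {
    enum { MAX_THREADS = 1024 };   /* arbitrary limit for this sketch */
    pthread_t threads[MAX_THREADS];
    assert(count > 0 && count <= MAX_THREADS);

    double start = now_seconds();
    int created = 0;
    for (; created < count; ++created) {
        if (pthread_create(&threads[created], NULL, noop_thread, NULL) != 0)
            break;                 /* stop at the first failure */
    }
    /* The main thread waits for every child that was actually started. */
    for (int i = 0; i < created; ++i)
        pthread_join(threads[i], NULL);
    double elapsed = now_seconds() - start;

    return created == count ? elapsed / count : -1.0;
}
```

Calling `time_thread_creation(100)` mirrors the 100 threads used in the benchmark. A single run is noisy, so repeating the call and keeping the best time is a sensible refinement.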

Creating processes

This benchmark is almost identical to the previous benchmark. However, here 100 child processes are created and terminated (using fork() and waitpid()). Unfortunately Windows does not have any corresponding functionality, so only Linux and macOS were benchmarked.

[Figure: Create process]

Again, Linux comes out on top. It is actually quite impressive that creating a process is only about 2-3x as expensive as creating a thread under Linux (the corresponding figure for macOS is about 7-8x).
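The fork() and waitpid() pattern can be sketched roughly as follows. This is an illustration rather than the benchmark's actual source: the function names, the `MAX_CHILDREN` limit and the timing method are assumptions, and so is the choice to start all children first and wait for them afterwards (by analogy with the thread benchmark).

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Monotonic wall-clock time in seconds (assumed timing method). */
static double now_seconds(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Fork `count` children that exit immediately, reap them all, and return
 * the average time per process in seconds, or -1.0 on failure. */
double time_process_creation(int count) {
    enum { MAX_CHILDREN = 1024 };  /* arbitrary limit for this sketch */
    pid_t children[MAX_CHILDREN];
    assert(count > 0 && count <= MAX_CHILDREN);

    double start = now_seconds();
    int created = 0;
    for (; created < count; ++created) {
        pid_t pid = fork();
        if (pid < 0)
            break;                 /* fork failed: stop creating children */
        if (pid == 0)
            _exit(0);              /* child: terminate without doing work */
        children[created] = pid;   /* parent: remember the child */
    }

    /* Reap every child that was started, even if a later fork failed. */
    int ok = (created == count);
    for (int i = 0; i < created; ++i) {
        int status;
        if (waitpid(children[i], &status, 0) != children[i])
            ok = 0;
    }
    double elapsed = now_seconds() - start;

    return ok ? elapsed / count : -1.0;
}
```

The child calls `_exit()` rather than `exit()` so that it does not flush stdio buffers inherited from the parent.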

Launching programs

Launching a program is essentially an extension to process creation: in addition to creating a new process, a program is loaded and executed (the program consists of an empty main() function and exits immediately). On Linux and macOS this is done using fork() + exec(), and on Windows it is done using CreateProcess().

[Figure: Launch program]

Here Linux is notably faster than both macOS (~10x faster) and Windows (>20x faster). In fact, even a Raspberry Pi3 is faster than a stock Windows 10 Pro installation on an octa-core AMD Ryzen 1800X system!

Worth noting is that on Windows, this benchmark is very sensitive to background services such as Windows Defender and other antivirus software.

The best results on Windows were achieved by Win-AMDx8*, which is the same system as Win-AMDx8 but with most performance-hogging services completely disabled (including Windows Defender and search indexing). However, this is not a practical solution, as it leaves your system completely unprotected and makes things like file search close to unusable.

The very poor result for Win-i7x4 is probably due to third party antivirus software.
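For the Unix side, the fork() + exec() launch can be sketched as below. This covers only the POSIX path, not CreateProcess(), and it is not taken from the benchmark suite: `time_program_launch` is a hypothetical name, the program to launch is passed in by the caller, `execvp()` is just one member of the exec family chosen for convenience, and waiting for each child before starting the next is an assumption.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Monotonic wall-clock time in seconds (assumed timing method). */
static double now_seconds(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Launch `program` `count` times, one after another, and return the average
 * time per launch in seconds. Returns -1.0 if a launch fails or the program
 * exits with a non-zero status. */
double time_program_launch(const char *program, int count) {
    assert(program != NULL && count > 0);
    char *const argv[] = { (char *)program, NULL };

    double start = now_seconds();
    for (int i = 0; i < count; ++i) {
        pid_t pid = fork();
        if (pid < 0)
            return -1.0;
        if (pid == 0) {
            execvp(program, argv); /* replaces the child on success */
            _exit(127);            /* only reached if exec failed */
        }
        int status;
        if (waitpid(pid, &status, 0) != pid ||
            !WIFEXITED(status) || WEXITSTATUS(status) != 0)
            return -1.0;
    }
    return (now_seconds() - start) / count;
}
```

For example, `time_program_launch("true", 100)` times the standard `true` utility, which is a reasonable stand-in for a program with an empty main() on most Unix-like systems.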

Creating files

In this benchmark, >65000 files are created in a single folder, filled with 32 bytes of data each, and then deleted. The time to create and delete a single file is measured.

Here is where things get silly…

[Figure: Create file]

Again, Win-AMDx8* has Windows Defender and search indexing etc. disabled.

Two tests were performed for the Raspberry Pi3: with a slow MicroSD card (RaspberryPi) and with a RAM disk (RaspberryPi-RAM). For the other Linux and macOS systems, a RAM disk did not have a significant performance impact (and I did not try a RAM disk for Windows).

Here are some interesting observations:

  • The best performing system (Linux-AMDx8) is over one thousand times faster than the worst performing system (Win-i7x4)!
  • Only with Windows Defender etc. disabled can an octa-core Windows system with a 3GB/s NVMe disk compete with a Raspberry Pi3 with a slow MicroSD memory card (the Pi wins easily when using a RAM disk though)!
  • Something is absolutely killing the file creation performance on Win-i7x4 (probably third party antivirus software).
  • Creating a file on Linux is really fast (over 100,000 files/s)!
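For reference, the create/fill/delete cycle measured in this section can be sketched in portable C as follows. This is only an illustration under stated assumptions: the function names and the file name pattern are made up, the target directory is supplied by the caller, and standard stdio calls are used here, which may differ from what the benchmark suite actually does.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

/* Monotonic wall-clock time in seconds (assumed timing method). */
static double now_seconds(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Build the path of the index-th scratch file (hypothetical naming scheme). */
static void make_path(char *buf, size_t size, const char *dir, int index) {
    snprintf(buf, size, "%s/osbench_sketch_%d.tmp", dir, index);
}

/* Create `count` files in `dir`, write 32 bytes to each, then delete them.
 * Returns the average time per file in seconds, or -1.0 on any error. */
double time_file_creation(const char *dir, int count) {
    static const char payload[32] = "0123456789abcdef0123456789abcde";
    char path[512];
    assert(dir != NULL && count > 0);
    assert(strlen(dir) < sizeof(path) - 64); /* leave room for the file name */

    double start = now_seconds();
    int ok = 1;
    int created = 0;
    for (; created < count; ++created) {
        make_path(path, sizeof(path), dir, created);
        FILE *f = fopen(path, "wb");
        if (f == NULL) {
            ok = 0;
            break;                 /* this file was not created */
        }
        if (fwrite(payload, 1, sizeof(payload), f) != sizeof(payload))
            ok = 0;
        if (fclose(f) != 0)
            ok = 0;
    }
    /* Delete every file that was created, even after an error. */
    for (int i = 0; i < created; ++i) {
        make_path(path, sizeof(path), dir, i);
        if (remove(path) != 0)
            ok = 0;
    }
    double elapsed = now_seconds() - start;

    return ok ? elapsed / count : -1.0;
}
```

Pointing `dir` at a RAM disk instead of a regular disk is all it takes to reproduce the kind of comparison made for the Raspberry Pi3.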

Allocating memory

The memory allocation performance was measured by allocating 1,000,000 small memory blocks (4-128 bytes in size) and then freeing them again.

[Figure: Memory allocation]

This is the one benchmark where raw hardware performance seems to be the dominating factor. Even so, Linux is slightly faster than both Windows and macOS (even for equivalent hardware).
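An allocation benchmark of this shape can be sketched as below. The details are assumptions rather than the benchmark's real code: `time_allocation` is a hypothetical name, the block sizes simply cycle through 4 to 128 bytes, and each block is touched once so that the allocation is actually used.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <time.h>

/* Monotonic wall-clock time in seconds (assumed timing method). */
static double now_seconds(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Allocate `count` small blocks (4-128 bytes), then free them all.
 * Returns the average time per block in seconds, or -1.0 if memory ran out. */
double time_allocation(size_t count) {
    assert(count > 0);

    /* The pointer table is set up outside the timed region. */
    void **blocks = malloc(count * sizeof *blocks);
    if (blocks == NULL)
        return -1.0;

    double start = now_seconds();
    size_t allocated = 0;
    for (; allocated < count; ++allocated) {
        size_t size = 4 + allocated % 125;   /* cycles through 4..128 bytes */
        unsigned char *p = malloc(size);
        if (p == NULL)
            break;
        p[0] = (unsigned char)allocated;     /* touch the block */
        blocks[allocated] = p;
    }
    for (size_t i = 0; i < allocated; ++i)
        free(blocks[i]);
    double elapsed = now_seconds() - start;

    free(blocks);
    return allocated == count ? elapsed / (double)count : -1.0;
}
```

`time_allocation(1000000)` matches the block count used in the benchmark. Results depend heavily on the C library's allocator, so they say as much about the libc as about the kernel.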

Conclusions

Some of the differences between the operating systems are staggering! I suspect that the poor process and file creation performance on Windows is to blame for the painfully slow git and CMake performance, for instance.

Obviously each operating system has its merits, but in general it seems that Linux > macOS > Windows when it comes to raw kernel and file system performance.

As a side note, I was quite surprised to find that Windows does not even offer anything similar to the standard Unix fork() functionality. This makes certain multiprocessing patterns unnecessarily cumbersome and expensive on Windows.
