OpenMP
OpenMP (Open Multi-Processing) is an application programming interface that supports shared memory multi-threaded programming on most platforms, including AIX, Solaris, Linux, OS X, and Windows. This wiki article will explain how to build the library with OpenMP support and provide an example program.
OpenMP pragmas appear in several source files, including nbtheory.cpp and scrypt.cpp. Crypto++ supports OpenMP where it is profitable to do so, such as in modern Key Derivation Functions that have built-in parallelism. Other classes, like Rabin-Williams signatures, also have OpenMP pragmas, but they are not profitable because they do not speed up calculations.
If you are using OpenMP on Linux then see GCC Issue 58378, Protect libgomp against child process hanging after a Unix fork().
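One conservative way to stay clear of the fork() problem is to keep OpenMP constructs out of the child process entirely. The sketch below assumes a POSIX system and assumes the hang is triggered when a child uses OpenMP after the parent has already run a parallel region; the function names are hypothetical and are not part of Crypto++.

```cpp
#include <cassert>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Serial helper for the child. It contains no OpenMP constructs.
static long SerialSum(int n) {
    long s = 0;
    for (int i = 1; i <= n; ++i) s += i;
    return s;
}

// Parallel helper, called by the parent only. Without an OpenMP
// compiler option the pragma is ignored and the loop runs serially.
static long ParallelSum(int n) {
    long s = 0;
    #pragma omp parallel for reduction(+:s)
    for (int i = 1; i <= n; ++i) s += i;
    return s;
}

// Returns true if the child ran to completion and agreed with the parent.
bool ForkWithoutOpenMPInChild(int n) {
    assert(n >= 0);
    const long expected = ParallelSum(n);  // parent may start its thread pool here

    const pid_t pid = fork();
    if (pid < 0) return false;             // fork failed

    if (pid == 0) {
        // Child: serial work only (assumed workaround), then exit at once.
        _exit(SerialSum(n) == expected ? 0 : 1);
    }

    int status = 0;
    if (waitpid(pid, &status, 0) != pid) return false;
    return WIFEXITED(status) && WEXITSTATUS(status) == 0;
}
```

Calling `ForkWithoutOpenMPInChild(1000)` from the parent should return true. If the child really needs parallel work, another option is to have it exec a fresh program so that it starts with its own OpenMP runtime state.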
Compiler Options
Here is a list of compiler options for the compilers the project regularly tests. Crypto++ uses the compiler to drive link, so the compiler will add the correct OpenMP libraries, too.
SunCC requires -O3 or above. SunCC will change the optimization level automatically if it is below -O3.
Compiler | Option |
---|---|
MSVC | /openmp |
GCC | -fopenmp |
Clang | -fopenmp=libomp |
ICC | -fopenmp |
SunCC | -xopenmp=parallel |
XLC | -qsmp=omp |
PGI | -mp |
Tru64 | -omp |
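When none of the options above is passed, a conforming compiler does not define the _OPENMP macro and ignores the OpenMP pragmas. A minimal sketch of code that builds either way is shown below; the function name AvailableThreads is hypothetical and is not part of Crypto++.

```cpp
#include <cassert>

#ifdef _OPENMP
# include <omp.h>   // only available when an OpenMP option is in effect
#endif

// Upper bound on the threads a parallel region would use,
// or 1 when the translation unit is built without OpenMP.
int AvailableThreads() {
#ifdef _OPENMP
    const int n = omp_get_max_threads();
#else
    const int n = 1;
#endif
    assert(n >= 1);
    return n;
}
```

The same guard keeps `#include <omp.h>` from breaking a build on a toolchain that was not given an OpenMP option.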
Building the Library
You must use -fopenmp on Linux to build the library with OpenMP support using GCC and Clang. On Windows, use the /openmp option. Below is an example of building with OpenMP on Fedora. The makefile detects -fopenmp on Linux platforms and adds -lgomp if it is missing.

    $ CXXFLAGS="-DNDEBUG -g2 -O3 -fopenmp" make -j 4
    g++ -DNDEBUG -g2 -O3 -fopenmp -fPIC -pthread -pipe -c cryptlib.cpp
    g++ -DNDEBUG -g2 -O3 -fopenmp -fPIC -pthread -pipe -c cpu.cpp
    g++ -DNDEBUG -g2 -O3 -fopenmp -fPIC -pthread -pipe -c integer.cpp
    ...
    g++ -o cryptest.exe -DNDEBUG -g2 -O3 -fopenmp -fPIC -pthread -pipe adhoc.o test.o bench1.o bench2.o validat0.o validat1.o validat2.o validat3.o validat4.o datatest.o regtest1.o regtest2.o regtest3.o dlltest.o fipsalgt.o ./libcryptopp.a -lgomp
After building the library you can test it with cryptest.exe v. Notice the message "OpenMP version 201511, 4 threads" when OpenMP is in effect. The version number comes directly from the _OPENMP macro.
    $ ./cryptest.exe v
    Using seed: 1548150602
    OpenMP version 201511, 4 threads
    Testing Settings...
    passed: Your machine is little endian.
    passed: sizeof(byte) == 1
    passed: sizeof(word16) == 2
    passed: sizeof(word32) == 4
    passed: sizeof(word64) == 8
    passed: sizeof(word128) == 16
    passed: sizeof(hword) == 4, sizeof(word) == 8, sizeof(dword) == 16
    passed: cacheLineSize == 64
    hasSSE2 == 1, hasSSSE3 == 1, hasSSE4.1 == 1, hasSSE4.2 == 1, hasAESNI == 1, hasCLMUL == 1, hasRDRAND == 1, hasRDSEED == 1, hasSHA == 0, isP4 == 0
    ...
The "OpenMP version..." message is due to the following code in test.cpp:
    void PrintSeedAndThreads(const std::string& seed)
    {
        std::cout << "Using seed: " << seed << std::endl;

    #ifdef _OPENMP
        int tc = 0;
        #pragma omp parallel
        {
            tc = omp_get_num_threads();
        }

        std::cout << "OpenMP version " << (int)_OPENMP << ", ";
        std::cout << tc << (tc == 1 ? " thread" : " threads") << std::endl;
    #endif
    }
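The _OPENMP value is a date in yyyymm form that identifies the OpenMP specification the compiler implements, so 201511 corresponds to OpenMP 4.5. The sketch below maps the commonly seen values to specification versions. The function names are hypothetical and are not part of test.cpp, and the table only covers releases up to 5.2.

```cpp
#include <cassert>
#include <string>

// Map an _OPENMP date (yyyymm) to the specification version it denotes.
std::string OpenMPSpecVersion(long yyyymm) {
    assert(yyyymm > 0);
    switch (yyyymm) {
        case 200203: return "2.0";
        case 200505: return "2.5";
        case 200805: return "3.0";
        case 201107: return "3.1";
        case 201307: return "4.0";
        case 201511: return "4.5";
        case 201811: return "5.0";
        case 202011: return "5.1";
        case 202111: return "5.2";
        default:     return "unknown";
    }
}

// Version for the current build, or "none" when OpenMP is not enabled.
std::string CompiledOpenMPSpecVersion() {
#ifdef _OPENMP
    return OpenMPSpecVersion(_OPENMP);
#else
    return "none";
#endif
}
```

For the cryptest.exe output shown earlier, `OpenMPSpecVersion(201511)` yields "4.5".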
Sample Program
The following is a sample program from the Scrypt wiki page. Scrypt benefits from OpenMP in its Smix or ROmix function. In the example below, the security parameters cost=(1<<20), blockSize=12, parallelization=4 were selected to demonstrate OpenMP profitability. 1<<20 is a cost factor of approximately 1 million.

    $ cat test.cxx
    #include "cryptlib.h"
    #include "secblock.h"
    #include "scrypt.h"
    #include "osrng.h"
    #include "files.h"
    #include "hex.h"

    #include <iostream>
    #include <omp.h>

    int main()
    {
        int threads = 1;
        #pragma omp parallel
        {
            threads = omp_get_num_threads();
        }

        using namespace CryptoPP;
        AutoSeededRandomPool prng;

        SecByteBlock key(64), salt(16*1024);
        prng.GenerateBlock(key, key.size());
        prng.GenerateBlock(salt, salt.size());

        Scrypt scrypt;
        scrypt.DeriveKey(key, key.size(), key, key.size(), salt, salt.size(), 1<<20, 12, 4);

        std::cout << "Threads: " << threads << std::endl;
        std::cout << "Key: ";
        StringSource(key, 16, true, new HexEncoder(new FileSink(std::cout)));
        std::cout << "..." << std::endl;

        return 0;
    }
The program should be compiled with the same options used to build the library (or, the library should be built with the same options that will be used for the program). The two important options, -O3 and -fopenmp, are used below for both the library and the program.
$ g++ -I. -O3 -fopenmp test.cxx ./libcryptopp.a -o test.exe
The multithreaded OpenMP version of the program is about 4x faster than the single threaded version:
    $ time OMP_NUM_THREADS=1 ./test.exe
    Threads: 1
    Key: B771BD6E4E26AEF83166C5F9061F8BB8...

    real 0m18.057s
    user 0m16.144s
    sys 0m1.881s

    $ time OMP_NUM_THREADS=4 ./test.exe
    Threads: 4
    Key: 4B7511052754BC61B2356234BEF1FC9F...

    real 0m4.498s
    user 0m15.535s
    sys 0m1.954s
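The near 4x speedup with parallelization=4 follows from the structure of scrypt: the mixing step runs once per lane, and the lanes are independent of each other. The sketch below illustrates that pattern with a plain OpenMP loop. ToyMix is a hypothetical stand-in for the expensive per-lane work and is not scrypt's ROMix, and MixLanes is not the library's implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical CPU-bound per-lane function (xorshift-multiply rounds).
// It only stands in for the real mixing step to show the loop shape.
static std::uint64_t ToyMix(std::uint64_t lane, std::uint32_t cost) {
    std::uint64_t x = lane + 0x9E3779B97F4A7C15ULL;
    for (std::uint32_t i = 0; i < cost; ++i) {
        x ^= x >> 12;
        x ^= x << 25;
        x ^= x >> 27;
        x *= 0x2545F4914F6CDD1DULL;
    }
    return x;
}

// Each lane writes only its own slot, so iterations can run on
// separate threads without locks. Without an OpenMP compiler option
// the pragma is ignored and the loop runs serially.
std::vector<std::uint64_t> MixLanes(int parallelization, std::uint32_t cost) {
    assert(parallelization >= 0);
    std::vector<std::uint64_t> out(static_cast<std::size_t>(parallelization));

    #pragma omp parallel for
    for (int i = 0; i < parallelization; ++i)
        out[static_cast<std::size_t>(i)] = ToyMix(static_cast<std::uint64_t>(i), cost);

    return out;
}
```

Because the lanes share no state, the result is identical for any thread count, and the wall-clock time scales with the number of lanes each thread must process. That matches the timings above, where user time stays roughly constant while real time drops.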
Performance
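The Scrypt example showed that OpenMP is profitable for programs that are naturally parallelizable. However, there are costs associated with OpenMP that algorithms which do not use OpenMP will also pay.
The two tables below show GMAC and VMAC performance in both configurations on a Skylake Core-i5 6400 @ 2.7 GHz. The first table appears to be the OpenMP build and the second the non-OpenMP build, which matches the ordering and direction of the overall results at the end of this section. Below, bigger GiB/second is better; and smaller Cycle/byte is better.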
Algorithm | GiB/second | Cycle/byte |
---|---|---|
GMAC(AES) | 8.009 GiB/s | 0.32 cpb |
VMAC(AES)-64 | 9.676 GiB/s | 0.27 cpb |
VMAC(AES)-128 | 5.110 GiB/s | 0.50 cpb |
Algorithm | GiB/second | Cycle/byte |
---|---|---|
GMAC(AES) | 8.445 GiB/s | 0.30 cpb |
VMAC(AES)-64 | 10.964 GiB/s | 0.23 cpb |
VMAC(AES)-128 | 6.237 GiB/s | 0.41 cpb |
Running the full benchmark suite on the same Skylake machine shows an overall drop in performance of about 14% with OpenMP enabled. Below, bigger Throughput is better.
Configuration | Throughput |
---|---|
With OpenMP | 1100.504107 |
Without OpenMP | 1286.652288 |
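The size of the penalty can be read off the tables by comparing the slower figure to the faster one in each pair. The helper below is a small sketch of that arithmetic; the name PercentSlower is hypothetical.

```cpp
#include <cassert>

// Relative slowdown of `slower` against `faster`, in percent.
double PercentSlower(double slower, double faster) {
    assert(faster > 0.0 && slower <= faster);
    return 100.0 * (faster - slower) / faster;
}
```

Applied to the figures in this section, `PercentSlower(8.009, 8.445)` is roughly 5.2 for GMAC(AES), `PercentSlower(9.676, 10.964)` roughly 11.7 for VMAC(AES)-64, `PercentSlower(5.110, 6.237)` roughly 18.1 for VMAC(AES)-128, and `PercentSlower(1100.504107, 1286.652288)` roughly 14.5 for the full benchmark suite.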