OpenMP

From Crypto++ Wiki
Jump to navigation Jump to search

OpenMP (Open Multi-Processing) is an application programming interface that supports shared memory multi-threaded programming on most platforms, including AIX, Solaris, Linux, OS X, and Windows. This wiki article will explain how to build the library with OpenMP support and provide an example program.

OpenMP pragmas surface in various source files, including nbtheory.cpp and scrypt.cpp. Crypto++ supports OpenMP when it is profitable to do so, like modern Key Derivation Functions with parallelism. Other classes, like Rabin-Williams signatures, has OpenMP but it is not profitable because it does not speed up calculations.

If you are using OpenMP on Linux then see GCC Issue 58378, Protect libgomp against child process hanging after a Unix fork().

Compiler Options

Here is a list of compiler options for the compilers the project regularly tests. The option should be added to CXXFLAGS. Crypto++ uses the compiler to drive link, so the compiler will add the correct OpenMP libraries, too.

SunCC requires -O3 or above. SunCC will change the optimization level automatically if it is below -O3.

Compiler Option
MSVC /openmp
GCC -fopenmp
Clang -fopenmp=libomp
ICC -fopenmp
ICX -qopenmp
SunCC -xopenmp=parallel
XLC -qsmp=omp
PGI -mp
Tru64 -omp

Building the Library

You must use -fopenmp on Linux to build the library with OpenMP support using GCC and Clang. Windows should use /openmp option. Below is an example of building with OpenMP on Fedora. The makefile detects -fopenmp on Linux platforms and adds -lgomp if it is missing.

$ CXXFLAGS="-DNDEBUG -g2 -O3 -fopenmp" make -j 4
g++ -DNDEBUG -g2 -O3 -fopenmp -fPIC -pthread -pipe -c cryptlib.cpp
g++ -DNDEBUG -g2 -O3 -fopenmp -fPIC -pthread -pipe -c cpu.cpp
g++ -DNDEBUG -g2 -O3 -fopenmp -fPIC -pthread -pipe -c integer.cpp
...
g++ -o cryptest.exe -DNDEBUG -g2 -O3 -fopenmp -fPIC -pthread -pipe adhoc.o test.o
bench1.o bench2.o validat0.o validat1.o validat2.o validat3.o validat4.o datatest.o
regtest1.o regtest2.o regtest3.o dlltest.o fipsalgt.o ./libcryptopp.a -lgomp

After building the library you can test it with cryptest.exe v. Notice the message "OpenMP version 201511, 4 threads" when OpenMP is in effect. The version number comes directly from the _OPENMP macro.

$ ./cryptest.exe v
Using seed: 1548150602
OpenMP version 201511, 4 threads

Testing Settings...

passed:  Your machine is little endian.
passed:  sizeof(byte) == 1
passed:  sizeof(word16) == 2
passed:  sizeof(word32) == 4
passed:  sizeof(word64) == 8
passed:  sizeof(word128) == 16
passed:  sizeof(hword) == 4, sizeof(word) == 8, sizeof(dword) == 16
passed:  cacheLineSize == 64
hasSSE2 == 1, hasSSSE3 == 1, hasSSE4.1 == 1, hasSSE4.2 == 1, hasAESNI == 1,
hasCLMUL == 1, hasRDRAND == 1, hasRDSEED == 1, hasSHA == 0, isP4 == 0
...

The "OpenMP version..." message is due to the following code in test.cpp:

void PrintSeedAndThreads(const std::string& seed)
{
    std::cout << "Using seed: " << seed << std::endl;

#ifdef _OPENMP
    int tc = 0;
    #pragma omp parallel
    {
        tc = omp_get_num_threads();
    }

    std::cout << "OpenMP version " << (int)_OPENMP << ", ";
    std::cout << tc << (tc == 1 ? " thread" : " threads") << std::endl;
#endif
}

Sample Program

The following is a sample program from the Scrypt wiki page. Scrypt benefits using OpenMP in its Smix or ROmix function. In the example below, the security parameters cost=(1<<20), blockSize=12, paralellization=4 were selected to demonstrate OpenMP profitability. 1<<20 is a cost factor of approximately 1 million.

$ cat test.cxx
#include "cryptlib.h"
#include "secblock.h"
#include "scrypt.h"
#include "osrng.h"
#include "files.h"
#include "hex.h"

#include <iostream>
#include <omp.h>

int main()
{
    int threads = 1;
    #pragma omp parallel
    {
        threads = omp_get_num_threads();
    }

    using namespace CryptoPP;
    AutoSeededRandomPool prng;

    SecByteBlock key(64), salt(16*1024);
    prng.GenerateBlock(key, key.size());
    prng.GenerateBlock(salt, salt.size());

    Scrypt scrypt;
    scrypt.DeriveKey(key, key.size(), key, key.size(), salt, salt.size(), 1<<20, 12, 4);

    std::cout << "Threads: " << threads << std::endl;

    std::cout << "Key: ";
    StringSource(key, 16, true, new HexEncoder(new FileSink(std::cout)));
    std::cout << "..." << std::endl;

    return 0;
}

The program should be compiled with the same options used to build the library (or, the library should be built with the same options that will be used for the program). Below the two important ones, -O3 -fopenmp, are used for both the library and the program.

$ g++ -I. -O3 -fopenmp test.cxx ./libcryptopp.a -o test.exe

The multithreaded OpenMP version of the program is about 4x faster than the single threaded version:

$ time OMP_NUM_THREADS=1 ./test.exe
Threads: 1
Key: B771BD6E4E26AEF83166C5F9061F8BB8...

real    0m18.057s
user    0m16.144s
sys     0m1.881s

$ time OMP_NUM_THREADS=4 ./test.exe
Threads: 4
Key: 4B7511052754BC61B2356234BEF1FC9F...

real    0m4.498s
user    0m15.535s
sys     0m1.954s

Performance

The Scrypt example showed that OpenMP is profitable for programs that are naturally parallelizable. However, there are costs associated with OpenMP that non-omp algorithms will pay.

The table below shows GMAC and VMAC performance in both configurations on a Skylake Core-i5 6400 @ 2.7 GHz. Below, bigger GiB/second is better; and smaller Cycle/byte is better.

Skylake with OpenMP
Algorithm GiB/second Cycle/byte
GMAC(AES) 8.009 GiB/s 0.32 cpb
VMAC(AES)-64 9.676 GiB/s 0.27 cpb
VMAC(AES)-128 5.110 GiB/s 0.50 cpb
Skylake without OpenMP
Algorithm GiB/second Cycle/byte
GMAC(AES) 8.445 GiB/s 0.30 cpb
VMAC(AES)-64 10.964 GiB/s 0.23 cpb
VMAC(AES)-128 6.237 GiB/s 0.41 cpb

And running the full Benchmark suite on the Skylake machine results in an overall drop in performance. A variance of 3 to 5 is typical, and it is just noise. But a change of 180 is not, and it indicates a deficiency. Below, bigger Throughput is better.

Skylake Geometric Average
Configuration Throughput
With OpenMP 1100.504107
Without OpenMP 1286.652288

Here are the results of running the full Benchmark suite on a Coffee Lake Core-i7 8700 @ 3.2 GHz.

Coffee Lake Geometric Average
Configuration Throughput
With OpenMP 1414.931019
Without OpenMP 1583.779185

Here are the results of running the full Benchmark suite on an ARM ODROID-XU4 @ 2.0 GHz.

ODROID Geometric Average
Configuration Throughput
With OpenMP 172.617632
Without OpenMP 188.200157