Benchmarks

From Crypto++ Wiki

Benchmarking is a topic that arises on occasion on the mailing list. Benchmarking allows you to measure performance and compare Crypto++ to other libraries like Botan and OpenSSL. The benchmark framework also allows you to gauge the performance of algorithms you add to the library. This wiki article discusses the library's benchmarking source code, and benchmarking the library on Linux and Windows.

To achieve the most accurate results you should build the library in a release configuration and move the machine from a power save state to a performance state. Both topics are discussed below.

The benchmark program is spread across several source files. The first set of source files is regtestN.cpp, and the second set is benchN.cpp. The third source file is test.cpp, but it is boring because it only handles the command line. The real work is performed in regtestN.cpp and benchN.cpp. See Name Registration and Benchmark Execution below for more details.

A sample program is provided at the end of the article. It benchmarks a block cipher using a ThreadUserTimer but it can be adapted to just about any Crypto++ object.

Benchmark Command

Below is a typical command to run the benchmark program. The first argument, b, means run the benchmarks. The second argument, 2, means run each test for about 2 seconds. The third argument, 3.1, means the processor frequency is 3.1 GHz.

$ ./cryptest.exe b 2 3.1
...

You can select a subset to run according to the following table. Instead of running cryptest.exe b ..., you can use b1, b2, b3 or b4:

Command Benchmarks
cryptest.exe b ... Benchmarks all algorithms. The list includes unkeyed ciphers, keyed ciphers and public key algorithms.
cryptest.exe b1 ... Benchmarks unkeyed algorithms. The algorithms are random number generators, CRC codes and hashes.
cryptest.exe b2 ... Benchmarks keyed algorithms. The algorithms are MACs, stream ciphers and block ciphers. Block ciphers are usually run in CTR mode, but AES benchmarks all modes.
cryptest.exe b3 ... Benchmarks integer-based public key algorithms. The schemes include encryption, signing, key exchange and integrated encryption schemes over the integers.
cryptest.exe b4 ... Benchmarks elliptic curve public key algorithms. The schemes include encryption, signing, key exchange and integrated encryption schemes over elliptic curves.

b3 was split and b4 was added at Crypto++ 8.3. Prior to the split, both integer and elliptic curve schemes were tested using b3. After the split, integer fields are benchmarked with b3, and elliptic curve fields are benchmarked with b4.

In the future we will probably add additional control over the subset of algorithms. For example, a future revision may assign one subcommand to random number generators, another to hashes, and so on. b1 will still run all the unkeyed algorithms, but the additional subcommands will offer more convenience.

Benchmark Output

The result of running the benchmark command is a table of results. The columns include Algorithm, MiB/Second and Cycles Per Byte. A real example can be found at Crypto++ 5.6.5 Benchmarks hanging off the website. A wikified example of the table is shown below for a 6th gen Core i5-6400 (Skylake) running at 3.1 GHz using the command cryptest.exe b 2 3.1.

Algorithm MiB/Second Cycles Per Byte
NonblockingRng 163 18.2
AutoSeededRandomPool 516 5.7
AutoSeededX917RNG(AES) 56 52.7
MT19937 825 3.6
RDRAND 62 47.7
RDSEED 25 119.1
AES/OFB RNG 903 3.3
Hash_DRBG(SHA1) 118 25.0
Hash_DRBG(SHA256) 104 28.3
HMAC_DRBG(SHA1) 30 97.3
HMAC_DRBG(SHA256) 27 110.2

The most important measurement to the library is Cycles Per Byte (cpb). It abstracts away most of the effects of CPU frequency, leaving the Instruction Set Architecture (ISA). The ISA is important and it affects the results. For example, SPECK provides an SSSE3 implementation. On an older Core2 Duo, SPECK-128 runs around 6.3 cpb. On a modern Skylake, SPECK-128 runs around 3.5 cpb. For more on benchmark variations see Benchmark Differences.

MiB/s, or Mebibytes per second, is an important measure of throughput but it is sensitive to CPU frequency. A Mebibyte is 2^30 bytes, or 1024 Kibibytes (with a KiB equal to 1024 bytes). MiB/s is calculated just like MB/s but with a different scaling factor: count the number of bytes processed, divide by the number of seconds, and then divide by 1024*1024. Also see Benchmark Metrics below and Mebibyte on Wikipedia.

Benchmark Differences

The Crypto++ library often lags behind benchmarks of algorithms provided by the author of the algorithm, but not by much. For example, the authors of the block ciphers SIMON and SPECK state SPECK runs around 2.9 cycles per byte (cpb). The Crypto++ benchmarks for SPECK run around 3.5 cpb. There are several reasons for the difference, including the usual suspects like machine differences and test data sets. After factoring out the usual differences, the two remaining differences are usually (1) the Crypto++ library and benchmark program are generalized, while the author's build of the algorithm is often specialized; and (2) the Crypto++ library builds for a minimal machine and switches to a faster implementation at runtime, while the author's implementation is built for a native [and fast] machine.

An example of difference (1): the authors of SPECK took an optimization that Crypto++ could not take because the library has to interop with other implementations. The interop requirement meant the Crypto++ library had to accept data as a big-endian byte string and then perform a word-based little-endian swap to process the data. An example of difference (2): the Crypto++ library has a base implementation that targets x86_64 or i686, and then switches to SSSE3 if the CPU feature is available. The authors of SPECK were able to build for and then run on a machine that provided AVX2 and BMI2.

On the upside, the Crypto++ library implementation is usually faster than most other implementations, including scripting languages like Bash and Python, and most other C or C++ implementations. As far as other cryptographic libraries are concerned, the Crypto++ library usually lags behind OpenSSL because Andy Polyakov's hand-tuned ASM is a work of art. The library is usually on-par with Botan because Botan and Crypto++ often collaborate and share tricks and optimizations. The library is usually on-par with mbedTLS, which we use to gauge our ARM-based implementations. Also see Cross Validation below.

CPU Frequency Scaling

Before you benchmark you should note the speed of your processor, and change the CPU governor or frequency scaling to 100%. The benchmarks take the CPU frequency in GHz. If you don't change the governor or frequency scaling then the first several results could be skewed. On a modern Intel processor in a "power save" state, it can take several seconds to leave a deep C-level power state.

On Linux you can run governor.sh, which is a script to adjust the governor. The script is available in the TestScripts directory. On Linux you need super user rights to change the frequency scaling.

$ cp TestScripts/governor.sh .

$ sudo ./governor.sh perf
Current CPU governor scaling settings:
  CPU 0: powersave
  CPU 1: powersave
  CPU 2: powersave
  CPU 3: powersave
New CPU governor scaling settings:
  CPU 0: performance
  CPU 1: performance
  CPU 2: performance
  CPU 3: performance

To return to a power save state, run the script again with the power option.

$ sudo ./governor.sh power
Current CPU governor scaling settings:
  CPU 0: performance
  CPU 1: performance
  CPU 2: performance
  CPU 3: performance
New CPU governor scaling settings:
  CPU 0: powersave
  CPU 1: powersave
  CPU 2: powersave
  CPU 3: powersave

Release Optimizations

Be sure you are using optimizations when benchmarking. You should perform a release build of the library, which is the default configuration used in the makefile. That is, the makefile adds -DNDEBUG -g2 -O3 for you if you don't do anything special. The library tests up to -O5 on Linux, so -O2 or -O3 should not be a problem. On Windows you should use /Oi /Oy /O2 or similar.

Here's what a typical command line looks like under Linux on an x86_64 machine.

$ make
g++ -DNDEBUG -g2 -O3 -fPIC -pthread -pipe -c cryptlib.cpp
g++ -DNDEBUG -g2 -O3 -fPIC -pthread -pipe -c cpu.cpp
g++ -DNDEBUG -g2 -O3 -fPIC -pthread -pipe -c integer.cpp
g++ -DNDEBUG -g2 -O3 -fPIC -pthread -pipe -c 3way.cpp
...

Name Registration

The benchmarks are run by name and the name is registered in regtestN.cpp, where N is a number like 1 or 2. The algorithm's name comes from the static member function StaticAlgorithmName. The name is usually a standard cryptographic algorithm name, like SHA-256 or AES/CBC.

regtest1.cpp usually registers unkeyed algorithms like random number generators and hashes. regtest2.cpp registers symmetric ciphers, like MACs, stream ciphers and block ciphers. regtest3.cpp registers public key algorithms, like RSA, DSA and ECIES.

Below is an excerpt from regtest2.cpp. It shows how a block cipher registers itself for both self tests and benchmarks. As can be seen below, SIMON registers SIMON-64/CTR and SIMON-128/CTR for benchmarking, and registers SIMON-64/ECB, SIMON-128/ECB, SIMON-64/CBC, and SIMON-128/CBC for self tests.

RegisterSymmetricCipherDefaultFactories<ECB_Mode<SIMON64> >();  // Test Vectors
RegisterSymmetricCipherDefaultFactories<CBC_Mode<SIMON64> >();  // Test Vectors
RegisterSymmetricCipherDefaultFactories<CTR_Mode<SIMON64> >();  // Benchmarks
RegisterSymmetricCipherDefaultFactories<ECB_Mode<SIMON128> >();  // Test Vectors
RegisterSymmetricCipherDefaultFactories<CBC_Mode<SIMON128> >();  // Test Vectors
RegisterSymmetricCipherDefaultFactories<CTR_Mode<SIMON128> >();  // Benchmarks

There are several different RegisterXXXDefaultFactories interfaces, including RegisterSymmetricCipherDefaultFactories and RegisterAuthenticatedSymmetricCipherDefaultFactories. However, the function used most often is RegisterDefaultFactoryFor, which takes an interface as a template parameter. Below is a sample across the regtestN.cpp files.

RegisterDefaultFactoryFor<RandomNumberGenerator, NonblockingRng>();
RegisterDefaultFactoryFor<NIST_DRBG, Hash_DRBG<SHA1> >("Hash_DRBG(SHA1)");
RegisterDefaultFactoryFor<HashTransformation, SHA1>();
RegisterDefaultFactoryFor<MessageAuthenticationCode, HMAC<SHA1> >();
RegisterDefaultFactoryFor<KeyDerivationFunction, HKDF<SHA1> >();

RegisterDefaultFactoryFor's template parameter, like RandomNumberGenerator and HashTransformation, is important. It is used to select the specialization of the Benchmark function that performs the actual benchmark. See Benchmark Execution below.

If you fail to register an algorithm then you will get an exception similar to the one below.

$ ./cryptest.exe b 2 3.1
...
Exception caught: ObjectFactoryRegistry: could not find factory for algorithm SHA-1

Benchmark Execution

The source code to run the benchmarks is in benchN.cpp, where N is a number like 1 or 2. The benchmarks are registered by name and run according to the interface associated with the name. The name registration was explained in Name Registration.

If you type cryptest.exe b ... then all the benchmarks are run as explained in Benchmark Command. The handling of the subcommand b is in bench1.cpp and shown below.

void BenchmarkWithCommand(int argc, const char* const argv[])
{
    std::string command(argv[1]);
    float runningTime(argc >= 3 ? Test::StringToValue<float, true>(argv[2]) : 1.0f);
    float cpuFreq(argc >= 4 ? Test::StringToValue<float, true>(argv[3])*float(1e9) : 0.0f);
    std::string algoName(argc >= 5 ? argv[4] : "");

    if (command == "b")  // All benchmarks
        Benchmark(Test::All, runningTime, cpuFreq);
    else if (command == "b3")  // Public key algorithms
        Test::Benchmark(Test::PublicKey, runningTime, cpuFreq);
    else if (command == "b2")  // Shared key algorithms
        Test::Benchmark(Test::SharedKey, runningTime, cpuFreq);
    else if (command == "b1")  // Unkeyed algorithms
        Test::Benchmark(Test::Unkeyed, runningTime, cpuFreq);
}

The function Benchmark is the message cracker, and it routes the request into Benchmark1, Benchmark2 or Benchmark3 depending on the subcommand.

void Benchmark(Test::TestClass suites, double t, double hertz)
{
    ...
    if (suites & Test::Unkeyed)
    {
        Benchmark1(t, hertz);
    }
    if (suites & Test::SharedKey)
    {
        Benchmark2(t, hertz);
    }
    if (suites & Test::PublicKey)
    {
        Benchmark3(t, hertz);
    }
    ...
}

Benchmark1, Benchmark2 or Benchmark3 handle the running of the test. For example, Benchmark1 performs:

void Benchmark1(double t, double hertz)
{
    ...

#ifdef NONBLOCKING_RNG_AVAILABLE
    BenchMarkByNameKeyLess<RandomNumberGenerator>("NonblockingRng");
#endif
#ifdef OS_RNG_AVAILABLE
    BenchMarkByNameKeyLess<RandomNumberGenerator>("AutoSeededRandomPool");
    BenchMarkByNameKeyLess<RandomNumberGenerator>("AutoSeededX917RNG(AES)");
#endif
    BenchMarkByNameKeyLess<RandomNumberGenerator>("MT19937");
#if (CRYPTOPP_BOOL_X86)
    if (HasPadlockRNG())
        BenchMarkByNameKeyLess<RandomNumberGenerator>("PadlockRNG");
#endif
#if (CRYPTOPP_BOOL_X86 || CRYPTOPP_BOOL_X32 || CRYPTOPP_BOOL_X64)
    if (HasRDRAND())
        BenchMarkByNameKeyLess<RandomNumberGenerator>("RDRAND");
    if (HasRDSEED())
        BenchMarkByNameKeyLess<RandomNumberGenerator>("RDSEED");
#endif
    BenchMarkByNameKeyLess<RandomNumberGenerator>("AES/OFB RNG");
    BenchMarkByNameKeyLess<NIST_DRBG>("Hash_DRBG(SHA1)");
    BenchMarkByNameKeyLess<NIST_DRBG>("Hash_DRBG(SHA256)");
    BenchMarkByNameKeyLess<NIST_DRBG>("HMAC_DRBG(SHA1)");
    BenchMarkByNameKeyLess<NIST_DRBG>("HMAC_DRBG(SHA256)");
    ...
}

And below is BenchMarkByNameKeyLess. The "by name" refers to the registered name, which is discussed under Name Registration. The "key less" means the algorithm being benchmarked is unkeyed.

template <class T>
void BenchMarkByNameKeyLess(const char *factoryName, const char *displayName = NULLPTR,
                            const NameValuePairs &params = g_nullNameValuePairs)
{
    CRYPTOPP_UNUSED(params);
    std::string name = factoryName;
    if (displayName)
        name = displayName;

    member_ptr<T> obj(ObjectFactoryRegistry<T>::Registry().CreateObject(factoryName));
    BenchMark(name.c_str(), *obj, g_allocatedTime);
}

The trace is almost complete. The final piece is BenchMark with the class object *obj. In the case of a random number generator:

void BenchMark(const char *name, RandomNumberGenerator &rng, double timeTotal)
{
    const int BUF_SIZE = 2048U;
    AlignedSecByteBlock buf(BUF_SIZE);
    Test::GlobalRNG().GenerateBlock(buf, BUF_SIZE);

    ...
    unsigned long long blocks = 1;
    double timeTaken;

    clock_t start = ::clock();
    do
    {
        rng.GenerateBlock(buf, buf.size());
        blocks++;
        timeTaken = double(::clock() - start) / CLOCK_TICKS_PER_SECOND;
    } while (timeTaken < timeTotal);

    OutputResultBytes(name, double(blocks) * BUF_SIZE, timeTaken);
}

Not surprisingly, there are several BenchMark overloads available. The one used depends on the interface being tested, and they include:

  • BenchMark(const char *, RandomNumberGenerator&, ...)
  • BenchMark(const char *, NIST_DRBG&, ...)
  • BenchMark(const char *, BlockTransformation&, ...)
  • BenchMark(const char *, StreamTransformation&, ...)
  • BenchMark(const char *, AuthenticatedSymmetricCipher&, ...)
  • BenchMark(const char *, HashTransformation&, ...)
  • BenchMark(const char *, BufferedTransformation&, ...)

At the end of the BenchMark function there is a call to OutputResultBytes. It is the function that creates a row in the HTML table output by the test program.

OutputResult is an anchor you can use to locate the functions that perform the benchmarks because OutputResult prints the result of the work performed. From the output below, you know there are about 15 different BenchMark functions that perform actual work. The functions are overloaded to handle a particular interface, which explains why there are so many of them.

$ grep -n OutputResult *.h *.cpp
bench.h:51:void OutputResultBytes(const char *name, double length, double timeTaken);
bench1.cpp:52:void OutputResultBytes(const char *name, double length, double timeTaken)
bench1.cpp:71:void OutputResultKeying(double iterations, double timeTaken)
bench1.cpp:128: OutputResultBytes(name, double(blocks) * BUF_SIZE, timeTaken);
bench1.cpp:152: OutputResultBytes(name, double(blocks) * BUF_SIZE, timeTaken);
bench1.cpp:183: OutputResultBytes(name, double(blocks) * BUF_SIZE, timeTaken);
bench1.cpp:206: OutputResultBytes(name, double(blocks) * BUF_SIZE, timeTaken);
bench1.cpp:237: OutputResultBytes(name, double(blocks) * BUF_SIZE, timeTaken);
bench1.cpp:263: OutputResultBytes(name, double(blocks) * BUF_SIZE, timeTaken);
bench1.cpp:281: OutputResultKeying(iterations, timeTaken);
bench2.cpp:55:  OutputResultOperations(name, "Encryption", pc, i, timeTaken);
bench2.cpp:79:  OutputResultOperations(name, "Decryption", false, i, timeTaken);
bench2.cpp:95:  OutputResultOperations(name, "Signature", pc, i, timeTaken);
bench2.cpp:118: OutputResultOperations(name, "Verification", pc, i, timeTaken);
bench2.cpp:138: OutputResultOperations(name, "Key-Pair Generation", pc, i, timeTaken);
bench2.cpp:158: OutputResultOperations(name, "Key-Pair Generation", pc, i, timeTaken);
bench2.cpp:185: OutputResultOperations(name, "Key Agreement", pc, i, timeTaken);
bench2.cpp:210: OutputResultOperations(name, "Key Agreement", pc, i, timeTaken);

Timing Loop

The section Benchmark Execution showed BenchMark(const char *name, <interface>, ...) and that is where the actual timing of the algorithm occurs. The code is shown again below, but this time a StreamTransformation is used.

void BenchMark(const char *name, StreamTransformation &cipher, double timeTotal)
{
    const int BUF_SIZE=RoundUpToMultipleOf(2048U, cipher.OptimalBlockSize());
    AlignedSecByteBlock buf(BUF_SIZE);
    Test::GlobalRNG().GenerateBlock(buf, BUF_SIZE);

    unsigned long i=0, blocks=1;
    double timeTaken;
    clock_t start = ::clock();

    do
    {
        blocks *= 2;
        for (; i<blocks; i++)
            cipher.ProcessString(buf, BUF_SIZE);
        timeTaken = double(::clock() - start) / CLOCK_TICKS_PER_SECOND;
    }
    while (timeTaken < 2.0/3*timeTotal);

    OutputResultBytes(name, double(blocks) * BUF_SIZE, timeTaken);
}

There are three things to note. First, a high-resolution timer is not used because it is not needed. When running a benchmark on large blocks over a period of seconds, the system clock is fine.

The second point is the geometric progression in the amount of data processed. Notice the buffer size is 2048 bytes, and each iteration of the loop doubles the number of times it is fed into the cipher.

The third point is the test stops as the time approaches the target time because the loop is controlled by the predicate while (timeTaken < 2.0/3*timeTotal). The early exit is used because of the geometric progression: the final doubling of the number of 2K blocks pushes the total running time to roughly the target.

Benchmark Metrics

The section Benchmark Output explained how two metrics were provided for most tests. The first is Mebibytes per second (MiB/s) and the second is cycles per byte (cpb). Here is how they are calculated in the code using OutputResultBytes as an example.

void OutputResultBytes(const char *name, double length, double timeTaken)
{
    double mbs = length / timeTaken / (1024*1024);
    std::cout << "\n<TR><TD>" << name;
    std::cout << std::setiosflags(std::ios::fixed);
    std::cout << "<TD>" << std::setprecision(0) << std::setiosflags(std::ios::fixed) << mbs;
    if (g_hertz > 1.0f)
    {
        double cpb  = timeTaken * g_hertz / length;
        if (cpb < 24.0f)
            std::cout << "<TD>" << std::setprecision(2) << std::setiosflags(std::ios::fixed) << cpb;
        else
            std::cout << "<TD>" << std::setprecision(1) << std::setiosflags(std::ios::fixed) << cpb;
    }
    g_logTotal += ::log(mbs);
    g_logCount++;
}

length is the total bytes processed, and timeTaken is the time period in seconds. The variable mbs is Mebibytes per second, and the variable cpb is cycles per byte. The variable g_logTotal accumulates the logarithm of each throughput result so the geometric mean throughput can be reported.

Sample Program

The code below benchmarks large-block AES/CTR encryption using the same pattern as the benchmarks, but it uses a ThreadUserTimer instead of a call to clock. The variable cpuFreq should be changed to match the machine's CPU frequency.

On Linux you should run ./governor.sh perf to move the CPU from an idle or C-level power-save state. There is no script for Windows. The Linux script can be found in the TestScripts/ folder.

Note: Because of the geometric progression of block processing (blocks *= 2 followed by the loop based on blocks), the test will usually run longer than runTimeInSeconds.

#include "cryptlib.h"
#include "secblock.h"
#include "hrtimer.h"
#include "osrng.h"
#include "modes.h"
#include "aes.h"
#include <iostream>

const double runTimeInSeconds = 3.0;
const double cpuFreq = 2.7 * 1000 * 1000 * 1000;

int main(int argc, char* argv[])
{
    using namespace CryptoPP;
    AutoSeededRandomPool prng;

    SecByteBlock key(16);
    prng.GenerateBlock(key, key.size());
    
    CTR_Mode<AES>::Encryption cipher;
    cipher.SetKeyWithIV(key, key.size(), key);  // demo only: key bytes reused as the IV

    const int BUF_SIZE = RoundUpToMultipleOf(2048U,
        dynamic_cast<StreamTransformation&>(cipher).OptimalBlockSize());

    AlignedSecByteBlock buf(BUF_SIZE);
    prng.GenerateBlock(buf, buf.size());

    double elapsedTimeInSeconds;
    unsigned long i=0, blocks=1;

    ThreadUserTimer timer;
    timer.StartTimer();

    do
    {
        blocks *= 2;
        for (; i<blocks; i++)
            cipher.ProcessString(buf, BUF_SIZE);
        elapsedTimeInSeconds = timer.ElapsedTimeAsDouble();
    }
    while (elapsedTimeInSeconds < runTimeInSeconds);

    const double bytes = static_cast<double>(BUF_SIZE) * blocks;
    const double ghz = cpuFreq / 1000 / 1000 / 1000;
    const double mbs = bytes / elapsedTimeInSeconds / 1024 / 1024;
    const double cpb = elapsedTimeInSeconds * cpuFreq / bytes;

    std::cout << cipher.AlgorithmName() << " benchmarks..." << std::endl;
    std::cout << "  " << ghz << " GHz cpu frequency"  << std::endl;
    std::cout << "  " << cpb << " cycles per byte (cpb)" << std::endl;
    std::cout << "  " << mbs << " MiB per second (MiB)" << std::endl;

    // std::cout << "  " << elapsedTimeInSeconds << " seconds passed" << std::endl;
    // std::cout << "  " << (word64) bytes << " bytes processed" << std::endl;

    return 0;
}

Running the program on a Core i5-6400 (Skylake) at 2.7 GHz results in output similar to the following:

$ ./bench.exe
AES/CTR benchmarks...
  2.7 GHz cpu frequency
  0.583066 cycles per byte (cpb)
  4416.17 MiB per second (MiB)

And switching to AES/CCM mode:

$ ./bench.exe
AES/CCM benchmarks...
  2.7 GHz cpu frequency
  3.00491 cycles per byte (cpb)
  856.904 MiB per second (MiB)

You can benchmark small-block or short strings by changing the code:

const int BUF_SIZE = 64;
unsigned int blocks = 0;
...

do
{
    // Favor integer domain compares
    for (unsigned int i=0; i<128; ++i)
        cipher.ProcessString(buf, BUF_SIZE);
    blocks += 128;

    elapsedTimeInSeconds = timer.ElapsedTimeAsDouble();
}
while (elapsedTimeInSeconds < runTimeInSeconds);

Running the modified program on a Core i5-6400 (Skylake) at 2.7 GHz results in output similar to the following. Notice performance dropped from about 0.58 cpb to about 2.05 cpb.

$ ./bench.exe
AES/CTR benchmarks...
  2.7 GHz cpu frequency
  2.05393 cycles per byte (cpb)
  1253.66 MiB per second (MiB)

Cross Validation

The Crypto++ project uses other libraries to ensure its benchmark results are consistent. You can use libraries like Botan and OpenSSL to cross validate the results if you have questions about the Crypto++ numbers.

As an example, according to the Crypto++ 5.6.5 Benchmarks, AES/CTR using a 128-bit key performs at about 0.6 cpb on a 6th gen Core i5-6400 (Skylake) running at 3.1 GHz. On the same machine Botan benchmarks at 0.73 cpb for AES-128:

$ ./botan speed --msec=2000 AES-128 AES-192 AES-256
AES-128 encrypt buffer size 1024 bytes: 3528.408 MiB/sec 0.73 cycles/byte (7056.82 MiB in 2000.00 ms)
AES-128 decrypt buffer size 1024 bytes: 3552.524 MiB/sec 0.73 cycles/byte (7105.05 MiB in 2000.00 ms)
AES-192 encrypt buffer size 1024 bytes: 3057.142 MiB/sec 0.85 cycles/byte (6114.29 MiB in 2000.00 ms)
AES-192 decrypt buffer size 1024 bytes: 3099.002 MiB/sec 0.84 cycles/byte (6198.00 MiB in 2000.00 ms)
AES-256 encrypt buffer size 1024 bytes: 2719.769 MiB/sec 0.95 cycles/byte (5439.54 MiB in 2000.00 ms)
AES-256 decrypt buffer size 1024 bytes: 2733.509 MiB/sec 0.95 cycles/byte (5467.02 MiB in 2000.00 ms)

The difference between Botan and Crypto++ is mostly bookkeeping. Crypto++ scales CPU frequency by 1000 * 1000 (GHz), while Botan scales CPU frequency by 1024 * 1024 (GiHz). Once the multiplier is normalized, Crypto++ and Botan perform about the same.

And to cross validate with OpenSSL:

$ openssl speed aes-128-cbc aes-192-cbc aes-256-cbc
Doing aes-128 cbc for 3s on 16 size blocks: 20922084 aes-128 cbc's in 3.00s
Doing aes-128 cbc for 3s on 64 size blocks: 5816299 aes-128 cbc's in 2.99s
Doing aes-128 cbc for 3s on 256 size blocks: 1536193 aes-128 cbc's in 3.00s
Doing aes-128 cbc for 3s on 1024 size blocks: 388082 aes-128 cbc's in 3.00s
Doing aes-128 cbc for 3s on 8192 size blocks: 48639 aes-128 cbc's in 2.99s
Doing aes-192 cbc for 3s on 16 size blocks: 18344345 aes-192 cbc's in 3.00s
Doing aes-192 cbc for 3s on 64 size blocks: 5014544 aes-192 cbc's in 3.00s
Doing aes-192 cbc for 3s on 256 size blocks: 1277957 aes-192 cbc's in 3.00s
Doing aes-192 cbc for 3s on 1024 size blocks: 322072 aes-192 cbc's in 2.99s
Doing aes-192 cbc for 3s on 8192 size blocks: 40382 aes-192 cbc's in 3.00s
Doing aes-256 cbc for 3s on 16 size blocks: 15997127 aes-256 cbc's in 3.00s
Doing aes-256 cbc for 3s on 64 size blocks: 4306376 aes-256 cbc's in 2.99s
Doing aes-256 cbc for 3s on 256 size blocks: 1095799 aes-256 cbc's in 3.00s
Doing aes-256 cbc for 3s on 1024 size blocks: 275620 aes-256 cbc's in 3.00s
Doing aes-256 cbc for 3s on 8192 size blocks: 34543 aes-256 cbc's in 3.00s
...

The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-128 cbc     111584.45k   124496.03k   131088.47k   132465.32k   133261.10k
aes-192 cbc      97836.51k   106976.94k   109052.33k   110301.58k   110269.78k
aes-256 cbc      85318.01k    92176.61k    93508.18k    94078.29k    94325.42k

The difference between OpenSSL and Crypto++ is more pronounced. OpenSSL uses a "stitched" AES/CBC implementation, and the tight integration of the mode with the block cipher simply outperforms Crypto++. It is also worth mentioning that Andy Polyakov is world renowned for the performance of his assembly language implementations. Also see Intel's Improving OpenSSL Performance.

OpenMP

Generally speaking, OpenMP causes the library to slow down. Some algorithms, like Scrypt, speed up, but overall the library performs worse. Also see OpenMP on the Crypto++ wiki.

LTO

Generally speaking, Link Time Optimization (LTO) causes the library to slow down. Overall the library performs worse. Also see LTO on the Crypto++ wiki.