For a project involving 3D triangulation of points and optimization, I needed to find the fastest way to compute the projection of a 3D point onto a 2D view. This operation boils down to a 4-by-4 matrix-vector multiplication, which can be accelerated with SIMD instructions. Initially, to save time, I reused an implementation that I had written for a class at university. However, it turned out to be far from optimal in terms of performance. Here I present the benchmark I wrote in C++ to compare OpenCV, GLM, and Eigen.
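For context, here is a minimal sketch of the operation in question, using glm. The names are illustrative, not the project's actual code: the point is lifted to homogeneous coordinates, multiplied by a 4-by-4 view-projection matrix, and divided by w.

#include <glm/glm.hpp>

// Project a 3D point onto a 2D view. P is a 4x4 view-projection matrix
// (illustrative; in the real project it comes from the camera calibration).
glm::dvec2 project(const glm::dmat4& P, const glm::dvec3& X)
{
    const glm::dvec4 h = P * glm::dvec4(X, 1.0); // the 4x4 matrix-vector product
    return glm::dvec2(h.x / h.w, h.y / h.w);     // perspective division
}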
Key takeaways
Here are the important highlights of the experiment. See the conclusion for more details.
- 4-by-4 matrix-vector multiplication can be faster than 3-by-4 matrix-vector multiplication
- double precision does not make the computations significantly slower than single precision
- Results depend on the compiler and the library; in our case, gcc performs better than MSVC
- Never use OpenCV’s `cv::Mat` for 3D geometry computations: it is extremely slow!
- The AVX-512 instruction set is not useful for 4-by-4 matrix-vector multiplications
Method
I used a C++ micro-benchmarking library named nanobench and implemented a 4-by-4 matrix-vector multiplication using OpenCV, GLM, and Eigen in turn. To make the comparison more informative, I tested both single (float) and double precision. I also tried a 3-by-4 matrix-vector multiplication to check whether it is faster than the 4-by-4 one.
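Each measurement follows the same nanobench pattern; the full source is in the Code section at the end, but here is the skeleton of one case:

#define ANKERL_NANOBENCH_IMPLEMENT
#include <nanobench.h>
#include <glm/glm.hpp>

int main()
{
    const glm::mat4 A(1.f);                // identity matrix
    const glm::vec4 b(1.f, 1.f, 1.f, 1.f);
    ankerl::nanobench::Bench()
        .warmup(1000)                      // warm up caches and the branch predictor
        .minEpochIterations(30000000)      // enough iterations per epoch for stable numbers
        .run("GLM: Matrix4x4 times Vector4 (float)", [&] {
            glm::vec4 result = A * b;
            // keep the compiler from optimizing the product away
            ankerl::nanobench::doNotOptimizeAway(result);
        });
}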
Results
I ran the benchmark on two computers: my Windows laptop and a Windows workstation from the university. Both are equipped with an Intel CPU that supports the AVX2 instruction set; the workstation also supports AVX-512.
Dell Precision T5820 with an Intel Xeon W-2145
Windows 10 with Visual Studio 2022 version 17.3.6
| ns/op | op/s | err% | total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
| 814.35 | 1,227,973.74 | 0.3% | 2.98 | `OpenCV: Mat1f(4x4) times Vector4 (float)`
| 822.10 | 1,216,398.71 | 0.7% | 3.00 | `OpenCV: Mat1d(4x4) times Vector4 (double)`
| 7.51 | 133,084,031.35 | 0.6% | 2.47 | `OpenCV: Matx44f times Vector4 (float)`
| 8.67 | 115,304,500.68 | 0.1% | 2.85 | `OpenCV: Matx44d times Vector4 (double)`
| 4.08 | 245,041,633.15 | 0.4% | 1.49 | `GLM: Matrix4x4 times Vector4 (float)`
| 5.69 | 175,662,182.15 | 0.5% | 2.07 | `GLM: Matrix4x4 times Vector4 (double)`
| 3.05 | 327,899,204.92 | 0.4% | 1.11 | `GLM: Matrix3x4 times Vector4 (float)`
| 3.08 | 324,549,169.09 | 1.3% | 1.12 | `GLM: Matrix3x4 times Vector4 (double)`
| 1.61 | 622,895,521.72 | 1.8% | 0.58 | `Eigen: Matrix4x4 times Vector4 (float)`
| 1.68 | 594,626,598.16 | 0.3% | 0.62 | `Eigen: Matrix4x4 times Vector4 (double)`
| 3.98 | 251,082,244.98 | 0.4% | 1.45 | `Eigen: Matrix3x4 times Vector4 (float)`
| 2.89 | 346,103,412.46 | 0.3% | 1.05 | `Eigen: Matrix3x4 times Vector4 (double)`
Dell Precision T5820 with an Intel Xeon W-2145
Windows 10 WSL with g++ version 11.2.0
| ns/op | op/s | err% | total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
| 574.54 | 1,740,536.07 | 0.5% | 2.09 | `OpenCV: Mat1f(4x4) times Vector4 (float)`
| 574.61 | 1,740,316.20 | 0.5% | 1.89 | `OpenCV: Mat1d(4x4) times Vector4 (double)`
| 3.78 | 264,604,361.40 | 1.0% | 1.37 | `OpenCV: Matx44f times Vector4 (float)`
| 3.75 | 266,524,926.63 | 0.5% | 1.37 | `OpenCV: Matx44d times Vector4 (double)`
| 0.98 | 1,015,373,609.76 | 0.4% | 0.36 | `GLM: Matrix4x4 times Vector4 (float)`
| 1.13 | 887,509,294.54 | 1.1% | 0.41 | `GLM: Matrix4x4 times Vector4 (double)`
| 1.79 | 559,536,142.55 | 1.2% | 0.65 | `GLM: Matrix3x4 times Vector4 (float)`
| 1.87 | 533,794,861.38 | 0.3% | 0.69 | `GLM: Matrix3x4 times Vector4 (double)`
| 1.39 | 721,918,390.71 | 0.5% | 0.51 | `Eigen: Matrix4x4 times Vector4 (float)`
| 1.55 | 645,222,153.03 | 0.7% | 0.56 | `Eigen: Matrix4x4 times Vector4 (double)`
| 1.79 | 558,964,815.22 | 1.3% | 0.65 | `Eigen: Matrix3x4 times Vector4 (float)`
| 2.23 | 448,510,911.71 | 0.5% | 0.82 | `Eigen: Matrix3x4 times Vector4 (double)`
Dell XPS 17 9700 with an Intel Core i7-10875H
Windows 10 with Visual Studio 2022 version 17.3.6
| ns/op | op/s | err% | total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
| 842.56 | 1,186,864.16 | 0.7% | 3.10 | `OpenCV: Mat1f(4x4) times Vector4 (float)`
| 823.40 | 1,214,480.65 | 0.8% | 3.00 | `OpenCV: Mat1d(4x4) times Vector4 (double)`
| 7.14 | 139,958,475.16 | 0.4% | 2.35 | `OpenCV: Matx44f times Vector4 (float)`
| 7.90 | 126,551,768.52 | 0.4% | 2.59 | `OpenCV: Matx44d times Vector4 (double)`
| 4.04 | 247,224,394.58 | 0.6% | 1.47 | `GLM: Matrix4x4 times Vector4 (float)`
| 5.10 | 195,890,937.54 | 0.1% | 1.86 | `GLM: Matrix4x4 times Vector4 (double)`
| 3.10 | 322,251,119.32 | 1.1% | 1.13 | `GLM: Matrix3x4 times Vector4 (float)`
| 3.13 | 319,531,239.59 | 0.5% | 1.14 | `GLM: Matrix3x4 times Vector4 (double)`
| 1.54 | 648,734,166.30 | 1.4% | 0.56 | `Eigen: Matrix4x4 times Vector4 (float)`
| 1.53 | 651,892,822.93 | 0.5% | 0.56 | `Eigen: Matrix4x4 times Vector4 (double)`
| 3.93 | 254,772,280.05 | 0.7% | 1.42 | `Eigen: Matrix3x4 times Vector4 (float)`
| 2.70 | 370,644,876.89 | 2.0% | 0.99 | `Eigen: Matrix3x4 times Vector4 (double)`
Dell XPS 17 9700 with an Intel Core i7-10875H
Windows 10 WSL with g++ version 11.2.0
| ns/op | op/s | err% | total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
| 574.13 | 1,741,751.58 | 1.3% | 2.08 | `OpenCV: Mat1f(4x4) times Vector4 (float)`
| 561.67 | 1,780,402.66 | 1.3% | 2.05 | `OpenCV: Mat1d(4x4) times Vector4 (double)`
| 3.66 | 273,285,071.60 | 0.4% | 1.20 | `OpenCV: Matx44f times Vector4 (float)`
| 3.69 | 270,684,689.01 | 0.7% | 1.21 | `OpenCV: Matx44d times Vector4 (double)`
| 0.96 | 1,041,278,471.09 | 1.4% | 0.35 | `GLM: Matrix4x4 times Vector4 (float)`
| 1.03 | 972,124,429.18 | 1.3% | 0.37 | `GLM: Matrix4x4 times Vector4 (double)`
| 1.75 | 570,792,360.05 | 1.2% | 0.64 | `GLM: Matrix3x4 times Vector4 (float)`
| 1.86 | 537,884,955.04 | 1.5% | 0.68 | `GLM: Matrix3x4 times Vector4 (double)`
| 1.36 | 735,029,481.30 | 0.9% | 0.49 | `Eigen: Matrix4x4 times Vector4 (float)`
| 1.38 | 723,380,789.97 | 1.6% | 0.50 | `Eigen: Matrix4x4 times Vector4 (double)`
| 1.76 | 568,813,038.55 | 2.1% | 0.63 | `Eigen: Matrix3x4 times Vector4 (float)`
| 1.92 | 520,583,076.87 | 0.4% | 0.70 | `Eigen: Matrix3x4 times Vector4 (double)`
Discussion
To better understand the results, let’s look at the generated assembly. For this, I used the excellent Compiler Explorer.
4-by-4 versus 3-by-4 multiplication
First, why is the 4-by-4 matrix-vector multiplication faster than the 3-by-4 one, even though it actually involves more computations? Below is the answer with glm:
void glm_mul4(glm::dvec4& C, const glm::dmat4& A, const glm::dvec4& B)
{
C = A * B;
}
void glm_mul3x4(glm::dvec3& C, const glm::dmat3x4& A, const glm::dvec4& B)
{
C = A * B;
}
glm_mul4(glm::vec<4, double, (glm::qualifier)0>&, glm::mat<4, 4, double, (glm::qualifier)0> const&, glm::vec<4, double, (glm::qualifier)0> const&):
vmovupd ymm0, YMMWORD PTR [rdx]
vxorps xmm1, xmm1, xmm1
vpermpd ymm1, ymm0, 85
vmulpd ymm1, ymm1, YMMWORD PTR [rsi+32]
vbroadcastsd ymm2, xmm0
vfmadd132pd ymm2, ymm1, YMMWORD PTR [rsi]
vxorps xmm1, xmm1, xmm1
vpermpd ymm1, ymm0, 170
vpermpd ymm0, ymm0, 255
vmulpd ymm0, ymm0, YMMWORD PTR [rsi+96]
vfmadd231pd ymm0, ymm1, YMMWORD PTR [rsi+64]
vaddpd ymm0, ymm0, ymm2
vmovupd YMMWORD PTR [rdi], ymm0
vzeroupper
ret
glm_mul3x4(glm::vec<3, double, (glm::qualifier)0>&, glm::mat<3, 4, double, (glm::qualifier)0> const&, glm::vec<4, double, (glm::qualifier)0> const&):
vmovsd xmm1, QWORD PTR [rdx+16]
vmovupd xmm0, XMMWORD PTR [rdx]
vmovddup xmm4, xmm1
vmulsd xmm1, xmm1, QWORD PTR [rsi+80]
vpermilpd xmm3, xmm0, 0
vpermilpd xmm2, xmm0, 3
vfmadd231sd xmm1, xmm0, QWORD PTR [rsi+16]
vunpckhpd xmm0, xmm0, xmm0
vfmadd132sd xmm0, xmm1, QWORD PTR [rsi+48]
vmulpd xmm1, xmm4, XMMWORD PTR [rsi+64]
vfmadd231pd xmm1, xmm3, XMMWORD PTR [rsi]
vfmadd231pd xmm1, xmm2, XMMWORD PTR [rsi+32]
vmovsd QWORD PTR [rdi+16], xmm0
vmovupd XMMWORD PTR [rdi], xmm1
ret
We can see that the function `glm_mul3x4` uses SSE instructions (with xmm registers), whereas the function `glm_mul4` uses AVX2 instructions (with ymm registers). Even though both implementations have the same number of instructions, the AVX2 version executes slightly fewer arithmetic instructions (multiplications and additions) than the SSE version. This might explain the slight difference in performance between the two approaches.
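To make the pattern concrete, here is a hand-written sketch of the 4-by-4 double multiplication with AVX2 intrinsics, assuming column-major storage as in glm and Eigen. It is not what either library does internally, just the broadcast-and-FMA scheme the compiler reaches for when it can use the full ymm width:

#include <immintrin.h>

// y = A * x for a 4x4 double matrix, column-major: y = sum_j x[j] * column_j.
// Requires AVX2 and FMA.
void matvec4_avx2(double y[4], const double A[16], const double x[4])
{
    __m256d acc = _mm256_mul_pd(_mm256_broadcast_sd(&x[0]),
                                _mm256_loadu_pd(&A[0]));      // x0 * column 0
    acc = _mm256_fmadd_pd(_mm256_broadcast_sd(&x[1]),
                          _mm256_loadu_pd(&A[4]), acc);       // + x1 * column 1
    acc = _mm256_fmadd_pd(_mm256_broadcast_sd(&x[2]),
                          _mm256_loadu_pd(&A[8]), acc);       // + x2 * column 2
    acc = _mm256_fmadd_pd(_mm256_broadcast_sd(&x[3]),
                          _mm256_loadu_pd(&A[12]), acc);      // + x3 * column 3
    _mm256_storeu_pd(y, acc);                                 // write the 4 results
}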
Eigen versus glm
In our tests with gcc, glm appears to be slightly faster than Eigen. Below is a comparison of the two implementations:
void eigen_mul4(Vector4d &C, const Matrix4d& A, const Vector4d& B)
{
C = A * B;
}
void glm_mul4(glm::dvec4& C, const glm::dmat4& A, const glm::dvec4& B)
{
C = A * B;
}
eigen_mul4(Eigen::Matrix<double, 4, 1, 0, 4, 1>&, Eigen::Matrix<double, 4, 4, 0, 4, 4> const&, Eigen::Matrix<double, 4, 1, 0, 4, 1> const&):
vbroadcastsd ymm0, QWORD PTR [rdx]
vbroadcastsd ymm1, QWORD PTR [rdx+8]
vmulpd ymm0, ymm0, YMMWORD PTR [rsi]
vbroadcastsd ymm2, QWORD PTR [rdx+16]
vbroadcastsd ymm3, QWORD PTR [rdx+24]
vfmadd231pd ymm0, ymm1, YMMWORD PTR [rsi+32]
vfmadd231pd ymm0, ymm2, YMMWORD PTR [rsi+64]
vfmadd231pd ymm0, ymm3, YMMWORD PTR [rsi+96]
vmovapd YMMWORD PTR [rdi], ymm0
vzeroupper
ret
glm_mul4(glm::vec<4, double, (glm::qualifier)0>&, glm::mat<4, 4, double, (glm::qualifier)0> const&, glm::vec<4, double, (glm::qualifier)0> const&):
vmovupd ymm0, YMMWORD PTR [rdx]
vxorps xmm1, xmm1, xmm1
vpermpd ymm1, ymm0, 85
vmulpd ymm1, ymm1, YMMWORD PTR [rsi+32]
vbroadcastsd ymm2, xmm0
vfmadd132pd ymm2, ymm1, YMMWORD PTR [rsi]
vxorps xmm1, xmm1, xmm1
vpermpd ymm1, ymm0, 170
vpermpd ymm0, ymm0, 255
vmulpd ymm0, ymm0, YMMWORD PTR [rsi+96]
vfmadd231pd ymm0, ymm1, YMMWORD PTR [rsi+64]
vaddpd ymm0, ymm0, ymm2
vmovupd YMMWORD PTR [rdi], ymm0
vzeroupper
ret
The two implementations give different assembly after compilation: Eigen broadcasts each component of the vector directly from memory, whereas glm loads the whole vector once and permutes it in registers. Unfortunately, I did not go deep enough to understand how much impact this difference has.
Conclusion
As shown in the results, the `cv::Mat` matrix from OpenCV is terribly slow and should not be used for intensive 3D mathematics. This might be because `cv::Mat` has a dynamic size rather than a fixed size defined at compile time, as in glm and Eigen. The alternative is `cv::Matx44d`, which has a fixed size, but it is still quite slow compared to glm or Eigen. It is understandable, however, as OpenCV is designed to process high-resolution images rather than tiny matrices.
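If a code base already stores its matrices in `cv::Mat`, a cheap mitigation is to convert to a fixed-size type once, outside the hot loop. A minimal sketch, assuming a 4-by-4 double matrix; the conversion relies on OpenCV's `Mat` to `Matx` conversion operator:

#include <opencv2/core.hpp>

// Pay the cv::Mat overhead once, then do the per-point work with cv::Matx.
cv::Vec4d transform_point(const cv::Mat1d& M /* 4x4 */, const cv::Vec4d& p)
{
    const cv::Matx44d M44 = M; // one-time copy into a fixed-size matrix
    return M44 * p;            // fast fixed-size product (hoist M44 out of loops)
}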
Depending on the compiler and the operating system, either glm or Eigen performs best. Interestingly, double precision is not necessarily significantly less performant than single precision in this benchmark. Single precision usually outperforms double precision in memory-bound scenarios, but that is not the case here since the matrices fit in the CPU cache. Another interesting discovery is that using 4-by-4 matrices instead of 3-by-4 can actually be more performant, even though it involves more computations: with a 3-by-4 matrix the compiler cannot make the best use of the AVX2 instructions.
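In practice this means a 3-by-4 pose matrix can simply be stored padded to 4-by-4. A sketch with glm, whose conversion constructor fills the missing entries from the identity, so the bottom row becomes (0, 0, 0, 1):

#include <glm/glm.hpp>

// glm::dmat4x3 has 4 columns of 3 rows, i.e. a mathematical 3-by-4 matrix.
glm::dvec4 transform(const glm::dmat4x3& pose, const glm::dvec4& x)
{
    const glm::dmat4 pose4(pose); // pad once, ideally outside the hot loop
    return pose4 * x;             // takes the full-width AVX2 path
}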
As seen in the compiler outputs, the matrix-vector multiplication does not benefit from AVX-512. As future work, it would be interesting to run the same benchmark with matrix-matrix multiplications and see whether AVX-512 provides a significant boost compared to AVX2. It would also be interesting to investigate whether the use of SIMD instructions can make the computation memory-bound, and how much performance is lost because of this.
Code
Here is the CMake file that automatically downloads the dependencies and compiles them together with the benchmark.
cmake_minimum_required(VERSION 3.16)
project(BenchmarkProject VERSION 1.0 LANGUAGES CXX)
set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CXX_STANDARD_REQUIRED ON)
include(FetchContent)
# Activate OpenMP
find_package(OpenMP REQUIRED)
# Retrieve and activate the nanobench library
FetchContent_Declare(
nanobench
GIT_REPOSITORY https://github.com/martinus/nanobench.git
GIT_TAG v4.3.6
GIT_SHALLOW TRUE)
FetchContent_MakeAvailable(nanobench)
# Include OpenCV 4.5.5
option(WITH_OPENMP "Build OpenCV with OpenMP support" ON)
FetchContent_Declare(opencv
GIT_REPOSITORY https://github.com/opencv/opencv.git
GIT_TAG 4.5.5
GIT_SHALLOW TRUE
)
FetchContent_MakeAvailable(opencv)
# Manual include dirs for OpenCV (could not find better than this hack)
set(OpenCV_INCLUDE_DIRS "")
list(APPEND OpenCV_INCLUDE_DIRS ${OPENCV_CONFIG_FILE_INCLUDE_DIR})
list(APPEND OpenCV_INCLUDE_DIRS ${OPENCV_MODULE_opencv_core_LOCATION}/include)
set(OpenCV_LIBS "")
list(APPEND OpenCV_LIBS opencv_core)
# Retrieve and activate glm
option(BUILD_SHARED_LIBS "Build shared library" OFF)
option(BUILD_STATIC_LIBS "Build static library" OFF)
option(GLM_TEST_ENABLE "Build unit tests" OFF)
FetchContent_Declare(
glm
GIT_REPOSITORY https://github.com/g-truc/glm.git
GIT_TAG 0.9.9.8
GIT_SHALLOW TRUE
)
FetchContent_MakeAvailable(glm)
# Retrieve and activate Eigen3
FetchContent_Declare(
eigen
GIT_REPOSITORY https://gitlab.com/libeigen/eigen.git
GIT_TAG 3.4.0
GIT_SHALLOW TRUE
)
FetchContent_GetProperties(eigen)
if(NOT eigen_POPULATED)
FetchContent_Populate(eigen)
endif()
set(EIGEN_INCLUDE_DIRS ${eigen_SOURCE_DIR})
add_executable(Benchmark main.cpp)
if (MSVC)
# warning level 4
target_compile_options(Benchmark PRIVATE /W4)
target_compile_options(Benchmark PRIVATE -openmp:experimental)
target_compile_options(Benchmark PRIVATE /arch:AVX2)
else()
# lots of warnings
target_compile_options(Benchmark PRIVATE -Wall -Wextra -pedantic)
target_compile_options(Benchmark PRIVATE -march=native)
endif()
target_include_directories(Benchmark
PRIVATE
${OpenCV_INCLUDE_DIRS}
$<BUILD_INTERFACE:${EIGEN_INCLUDE_DIRS}>
$<INSTALL_INTERFACE:${CMAKE_INSTALL_INCLUDEDIR}>
)
target_link_libraries(Benchmark
PRIVATE
nanobench
${OpenCV_LIBS}
glm::glm
OpenMP::OpenMP_CXX
)
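Assuming this file is saved as CMakeLists.txt next to main.cpp, a release build (essential for meaningful numbers) can be configured, built, and run like this on Linux or WSL:

cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build
./build/Benchmark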
The source code of the benchmark:
#include <array>
#include <atomic>
#include <iostream>
#include <omp.h>
#include <opencv2/core.hpp>
#include <glm/glm.hpp>
#include <Eigen/Dense>
#define ANKERL_NANOBENCH_IMPLEMENT
#include <nanobench.h>
constexpr int WARMUP_ITERATIONS = 1000;
constexpr int MIN_EPOCHS = 30000000;
template <typename T, uint64_t N>
T array_sum(const std::array<T, N>& arr)
{
T sum = 0;
for (unsigned int i = 0; i < arr.size(); i++)
{
sum += arr[i];
}
return sum;
}
template <typename T, uint64_t N>
T array_sum_simd(const std::array<T, N>& arr)
{
T sum = 0;
#pragma omp simd reduction(+:sum)
for (unsigned int i = 0; i < arr.size(); i++)
{
sum += arr[i];
}
return sum;
}
void benchmark_array_sum()
{
constexpr unsigned int N = 16;
std::array<float, N> arr_float{};
std::array<double, N> arr_double{};
for (unsigned int i = 0; i < N; i++)
{
arr_float[i] = 1.f;
arr_double[i] = 1.0;
}
ankerl::nanobench::Bench().warmup(WARMUP_ITERATIONS).minEpochIterations(MIN_EPOCHS)
.run("Sum of 16 values (float)", [&] {
auto sum = array_sum(arr_float);
ankerl::nanobench::doNotOptimizeAway(sum);
});
ankerl::nanobench::Bench().warmup(WARMUP_ITERATIONS).minEpochIterations(MIN_EPOCHS)
.run("OpenMP: Sum of 16 values (float)", [&] {
auto sum = array_sum_simd(arr_float);
ankerl::nanobench::doNotOptimizeAway(sum);
});
ankerl::nanobench::Bench().warmup(WARMUP_ITERATIONS).minEpochIterations(MIN_EPOCHS)
.run("Sum of 16 values (double)", [&] {
auto sum = array_sum(arr_double);
ankerl::nanobench::doNotOptimizeAway(sum);
});
ankerl::nanobench::Bench().warmup(WARMUP_ITERATIONS).minEpochIterations(MIN_EPOCHS)
.run("OpenMP: Sum of 16 values (double)", [&] {
auto sum = array_sum_simd(arr_double);
ankerl::nanobench::doNotOptimizeAway(sum);
});
}
void benchmark_opencv_float()
{
const cv::Mat1f mat4 = (cv::Mat1f(4, 4) <<
1.f, 0.f, 0.f, 0.f,
0.f, 1.f, 0.f, 0.f,
0.f, 0.f, 1.f, 0.f,
0.f, 0.f, 0.f, 1.f
);
const cv::Vec4f vec4(1.f, 1.f, 1.f, 1.f);
ankerl::nanobench::Bench().warmup(WARMUP_ITERATIONS / 100).minEpochIterations(MIN_EPOCHS / 100)
.run("OpenCV: Mat1d(4x4) times Vector4 (float)", [&] {
cv::Mat1f result = mat4 * vec4;
ankerl::nanobench::doNotOptimizeAway(result);
});
}
void benchmark_opencv_double()
{
const cv::Mat1d mat4 = (cv::Mat1d(4, 4) <<
1.0, 0.0, 0.0, 0.0,
0.0, 1.0, 0.0, 0.0,
0.0, 0.0, 1.0, 0.0,
0.0, 0.0, 0.0, 1.0
);
const cv::Vec4d vec4(1.0, 1.0, 1.0, 1.0);
ankerl::nanobench::Bench().warmup(WARMUP_ITERATIONS / 100).minEpochIterations(MIN_EPOCHS / 100)
.run("OpenCV: Mat1d(4x4) times Vector4 (double)", [&] {
cv::Mat1d result = mat4 * vec4;
ankerl::nanobench::doNotOptimizeAway(result);
});
}
void benchmark_opencv_fixed_float()
{
const cv::Matx44f mat4(
1.0, 0.0, 0.0, 0.0,
0.0, 1.0, 0.0, 0.0,
0.0, 0.0, 1.0, 0.0,
0.0, 0.0, 0.0, 1.0
);
const cv::Vec4f vec4(1.0, 1.0, 1.0, 1.0);
ankerl::nanobench::Bench().warmup(WARMUP_ITERATIONS).minEpochIterations(MIN_EPOCHS)
.run("OpenCV: Matx44f times Vector4 (float)", [&] {
cv::Vec4f result = mat4 * vec4;
ankerl::nanobench::doNotOptimizeAway(result);
});
}
void benchmark_opencv_fixed_double()
{
const cv::Matx44d mat4(
1.0, 0.0, 0.0, 0.0,
0.0, 1.0, 0.0, 0.0,
0.0, 0.0, 1.0, 0.0,
0.0, 0.0, 0.0, 1.0
);
const cv::Vec4d vec4(1.0, 1.0, 1.0, 1.0);
ankerl::nanobench::Bench().warmup(WARMUP_ITERATIONS).minEpochIterations(MIN_EPOCHS)
.run("OpenCV: Matx44d times Vector4 (double)", [&] {
cv::Vec4d result = mat4 * vec4;
ankerl::nanobench::doNotOptimizeAway(result);
});
}
void benchmark_glm()
{
const glm::mat4 mat4f(1.f);
const glm::vec4 vec4f(1.f, 1.f, 1.f, 1.f);
ankerl::nanobench::Bench().warmup(WARMUP_ITERATIONS).minEpochIterations(MIN_EPOCHS)
.run("GLM: Matrix4x4 times Vector4 (float)", [&] {
glm::vec4 result = mat4f * vec4f;
ankerl::nanobench::doNotOptimizeAway(result);
});
const glm::dmat4 mat4d(1.0);
const glm::dvec4 vec4d(1.0, 1.0, 1.0, 1.0);
ankerl::nanobench::Bench().warmup(WARMUP_ITERATIONS).minEpochIterations(MIN_EPOCHS)
.run("GLM: Matrix4x4 times Vector4 (double)", [&] {
glm::dvec4 result = mat4d * vec4d;
ankerl::nanobench::doNotOptimizeAway(result);
});
const glm::mat4x3 mat34f(1.f);
ankerl::nanobench::Bench().warmup(WARMUP_ITERATIONS).minEpochIterations(MIN_EPOCHS)
.run("GLM: Matrix3x4 times Vector4 (float)", [&] {
glm::vec3 result = mat34f * vec4f;
ankerl::nanobench::doNotOptimizeAway(result);
});
const glm::dmat4x3 mat34d(1.0);
ankerl::nanobench::Bench().warmup(WARMUP_ITERATIONS).minEpochIterations(MIN_EPOCHS)
.run("GLM: Matrix3x4 times Vector4 (double)", [&] {
glm::dvec3 result = mat34d * vec4d;
ankerl::nanobench::doNotOptimizeAway(result);
});
}
void benchmark_eigen()
{
const Eigen::Matrix4f mat4f = Eigen::Matrix4f::Random(4, 4);
const Eigen::Vector4f vec4f(1.f, 1.f, 1.f, 1.f);
ankerl::nanobench::Bench().warmup(WARMUP_ITERATIONS).minEpochIterations(MIN_EPOCHS)
.run("Eigen: Matrix4x4 times Vector4 (float)", [&] {
Eigen::Vector4f result = mat4f * vec4f;
ankerl::nanobench::doNotOptimizeAway(result);
});
const Eigen::Matrix4d mat4d = Eigen::Matrix4d::Random(4, 4);
const Eigen::Vector4d vec4d(1.0, 1.0, 1.0, 1.0);
ankerl::nanobench::Bench().warmup(WARMUP_ITERATIONS).minEpochIterations(MIN_EPOCHS)
.run("Eigen: Matrix4x4 times Vector4 (double)", [&] {
Eigen::Vector4d result = mat4d * vec4d;
ankerl::nanobench::doNotOptimizeAway(result);
});
const Eigen::Matrix<float, 3, 4> mat34f = Eigen::Matrix<float, 3, 4>::Random(3, 4);
ankerl::nanobench::Bench().warmup(WARMUP_ITERATIONS).minEpochIterations(MIN_EPOCHS)
.run("Eigen: Matrix3x4 times Vector4 (float)", [&] {
Eigen::Vector3f result = mat34f * vec4f;
ankerl::nanobench::doNotOptimizeAway(result);
});
const Eigen::Matrix<double, 3, 4> mat34d = Eigen::Matrix<double, 3, 4>::Random(3, 4);
ankerl::nanobench::Bench().warmup(WARMUP_ITERATIONS).minEpochIterations(MIN_EPOCHS)
.run("Eigen: Matrix3x4 times Vector4 (double)", [&] {
Eigen::Vector3d result = mat34d * vec4d;
ankerl::nanobench::doNotOptimizeAway(result);
});
}
int main()
{
// benchmark_array_sum();
benchmark_opencv_float();
benchmark_opencv_double();
benchmark_opencv_fixed_float();
benchmark_opencv_fixed_double();
benchmark_glm();
benchmark_eigen();
return 0;
}