An Improvement of the Matrix-Matrix Multiplication Speed using 2D-Tiling and AVX512 Intrinsics for Multi-Core Architectures

Nwe Zin Oo
Panyayot Chaikan

Abstract

Matrix-matrix multiplication is a time-consuming operation in scientific and engineering applications. For large matrices, the computation time grows to the point where the resulting software is too slow for real-time use. In this paper, 2D-tiling, loop unrolling, data padding, OpenMP directives, and AVX512 intrinsics are used to speed up matrix-matrix multiplication on multi-core architectures. Tested on a Core i9-7900X machine, our algorithm is more than twice as fast as the corresponding operations in the OpenBLAS and Eigen libraries for both single- and double-precision floating-point matrices. We also propose a parameter-tuning equation that allows the algorithm to be adapted to matrices of any size on CPUs with different cache organizations.
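
To make the idea of 2D-tiling concrete, the following C sketch partitions the output matrix into rectangular tiles and computes each tile independently. It is an illustration only and not the algorithm evaluated in the paper: the function name `matmul_tiled` and the tile sizes `TILE_ROWS` and `TILE_COLS` are hypothetical, edge tiles are clamped rather than padded, and the innermost loop is left to the compiler's auto-vectorizer where a hand-tuned version would presumably use AVX512 intrinsics and unrolling. The OpenMP directive is applied only when the compiler enables OpenMP.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tile edge lengths, chosen for illustration only. */
#define TILE_ROWS 32
#define TILE_COLS 32

static size_t min_size(size_t a, size_t b) { return a < b ? a : b; }

/* Computes C = A * B for row-major matrices:
 * A is n x k, B is k x m, C is n x m. C is overwritten. */
void matmul_tiled(const float *A, const float *B, float *C,
                  size_t n, size_t k, size_t m)
{
    assert(A != NULL && B != NULL && C != NULL);

    /* Each iteration of the i0 loop owns a distinct band of rows of C,
     * so threads never write to the same element. */
#ifdef _OPENMP
#pragma omp parallel for schedule(static)
#endif
    for (size_t i0 = 0; i0 < n; i0 += TILE_ROWS) {
        for (size_t j0 = 0; j0 < m; j0 += TILE_COLS) {
            /* Clamp the tile at the matrix edges. */
            const size_t i1 = min_size(i0 + TILE_ROWS, n);
            const size_t j1 = min_size(j0 + TILE_COLS, m);

            for (size_t i = i0; i < i1; ++i) {
                float *c_row = C + i * m;

                /* Clear this row segment of the tile. */
                for (size_t j = j0; j < j1; ++j)
                    c_row[j] = 0.0f;

                /* Accumulate one scaled row segment of B per step;
                 * the j loop walks contiguous memory. */
                for (size_t p = 0; p < k; ++p) {
                    const float a = A[i * k + p];
                    const float *b_row = B + p * m;
                    for (size_t j = j0; j < j1; ++j)
                        c_row[j] += a * b_row[j];
                }
            }
        }
    }
}
```

Calling `matmul_tiled(A, B, C, n, k, m)` with row-major arrays fills `C` with the product. Tiling the output in two dimensions keeps the active part of `C` and the reused row segments of `B` small enough to stay in cache; which tile sizes achieve that depends on the cache organization of the target CPU.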

Section: Research Articles
