An Improvement of the Matrix-Matrix Multiplication Speed using 2D-Tiling and AVX512 Intrinsics for Multi-Core Architectures

Nwe Zin Oo; Panyayot Chaikan

doi:10.55164/ajstr.v24i2.242021


Published: Aug 22, 2021


Keywords:

AVX512, OpenMP, 2D-tiling, Multicore Processing, Eigen, OpenBLAS

Nwe Zin Oo

Faculty of Engineering, Prince of Songkla University, Songkhla, 90112, Thailand

Panyayot Chaikan

Faculty of Engineering, Prince of Songkla University, Songkhla, 90112, Thailand

Abstract

Matrix-matrix multiplication is a time-consuming operation in scientific and engineering applications. For large matrices the computation time becomes long, producing software that is too slow for real-time applications. In this paper, 2D-tiling, loop unrolling, data padding, OpenMP directives, and AVX512 intrinsics are used to speed up matrix-matrix multiplication on multi-core architectures. Our algorithm, tested on a Core i9-7900X machine, is more than twice as fast as the corresponding operations in the OpenBLAS and Eigen libraries for single- and double-precision floating-point matrices. We also propose a parameter-tuning equation that allows the algorithm to be adapted to matrices of any size on CPUs with different cache organizations.
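
The combination of techniques named in the abstract can be illustrated with a small portable sketch. The C code below is not the authors' kernel: it uses plain loops in place of AVX512 intrinsics, and the tile sizes `TILE_I` and `TILE_J`, the helper names `round_up` and `pad_matrix`, and the function `tiled_sgemm` are all hypothetical choices made for illustration. It shows zero-padding so every tile is full, a 2D tiling of the output matrix, fixed-trip inner loops that a compiler can unroll and vectorize, and an OpenMP directive that distributes tiles across cores.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical tile sizes (assumptions, not the paper's tuned values).
   16 floats fill one 512-bit register. */
#define TILE_I 4
#define TILE_J 16

/* Round n up to the next multiple of m. */
static size_t round_up(size_t n, size_t m) { return (n + m - 1) / m * m; }

/* Copy a rows x cols row-major matrix into a zero-filled prows x pcols buffer. */
static float *pad_matrix(const float *src, size_t rows, size_t cols,
                         size_t prows, size_t pcols)
{
    float *dst = calloc(prows * pcols, sizeof *dst);
    if (!dst) return NULL;
    for (size_t r = 0; r < rows; ++r)
        memcpy(dst + r * pcols, src + r * cols, cols * sizeof *dst);
    return dst;
}

/* C (n x m) = A (n x k) * B (k x m), all row-major.
   Returns 0 on success, -1 if an allocation fails. */
int tiled_sgemm(const float *A, const float *B, float *C,
                size_t n, size_t k, size_t m)
{
    assert(n > 0 && k > 0 && m > 0);

    /* Data padding: make the output dimensions multiples of the tile sizes. */
    size_t pn = round_up(n, TILE_I), pm = round_up(m, TILE_J);
    float *Ap = pad_matrix(A, n, k, pn, k);
    float *Bp = pad_matrix(B, k, m, k, pm);
    float *Cp = calloc(pn * pm, sizeof *Cp);
    if (!Ap || !Bp || !Cp) { free(Ap); free(Bp); free(Cp); return -1; }

    /* 2D tiling: each iteration owns one TILE_I x TILE_J block of C, so
       threads never write to the same element. The pragma is ignored when
       the code is compiled without OpenMP support. */
    #pragma omp parallel for collapse(2) schedule(static)
    for (size_t i0 = 0; i0 < pn; i0 += TILE_I)
        for (size_t j0 = 0; j0 < pm; j0 += TILE_J) {
            float acc[TILE_I][TILE_J] = {{0}};
            for (size_t p = 0; p < k; ++p)
                /* Fixed trip counts let the compiler unroll and vectorize. */
                for (size_t i = 0; i < TILE_I; ++i) {
                    float a = Ap[(i0 + i) * k + p];
                    for (size_t j = 0; j < TILE_J; ++j)
                        acc[i][j] += a * Bp[p * pm + j0 + j];
                }
            for (size_t i = 0; i < TILE_I; ++i)
                memcpy(Cp + (i0 + i) * pm + j0, acc[i], sizeof acc[i]);
        }

    /* Copy the unpadded n x m result back to the caller's buffer. */
    for (size_t r = 0; r < n; ++r)
        memcpy(C + r * m, Cp + r * pm, m * sizeof *C);

    free(Ap); free(Bp); free(Cp);
    return 0;
}
```

Calling `tiled_sgemm(A, B, C, n, k, m)` fills `C` with the product of an `n` x `k` matrix `A` and a `k` x `m` matrix `B`. In a tuned implementation, the accumulator block would be held in vector registers and updated with fused multiply-add intrinsics, and the tile sizes would be chosen from the cache organization rather than fixed as they are here.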

Issue

Vol. 24 No. 2 (2021): May - August

Section

Research Articles

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

