An Improvement of the Matrix-Matrix Multiplication Speed using 2D-Tiling and AVX512 Intrinsics for Multi-Core Architectures

หนู่-ซิน อู; ปัญญยศ ไชยกาฬ

doi:10.55164/ajstr.v24i2.242021


Published: Aug 22, 2021


Keywords:

AVX512, OpenMP, 2D tiling, multi-core processing, Eigen, OpenBLAS

หนู่-ซิน อู

Faculty of Engineering, Prince of Songkla University, Songkhla 90112

ปัญญยศ ไชยกาฬ

Faculty of Engineering, Prince of Songkla University, Songkhla 90112

Abstract

Matrix-matrix multiplication is a mathematical operation that consumes substantial processing time in scientific and engineering work. When the matrices are large, the computation takes a long time and slows the software down, which is unacceptable in real-time applications. This article applies 2D tiling, loop unrolling, data padding, OpenMP directives, and the AVX512 instruction set to speed up matrix multiplication on multi-core architectures. When tested on a Core i9-7900X machine, the proposed algorithm was more than 2 times faster than implementations using the OpenBLAS and Eigen libraries, for both single-precision and double-precision floating-point matrices. In addition, equations for tuning the parameters are presented so that the proposed algorithm can be applied to matrices of any size on other processors with different cache memory configurations.
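To make the combination of techniques named in the abstract more concrete, the following is a minimal sketch in C. It is not the authors' implementation. The function name `matmul_tiled`, the tile sizes `TILE_I` and `TILE_J`, the row-major layout, and the restriction to square matrices whose dimension is a multiple of the tile sizes are all assumptions made for illustration. Loop unrolling and data padding are left out for brevity. The AVX512 path is compiled only when the compiler defines `__AVX512F__`; otherwise a scalar loop is used.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__AVX512F__)
#include <immintrin.h>
#endif

/* Hypothetical tile sizes, chosen only for illustration. */
#define TILE_I 32
#define TILE_J 32 /* multiple of 16: one 512-bit register holds 16 floats */

/* Computes C += A * B for row-major n x n single-precision matrices.
   Assumes n is a multiple of both tile sizes (pad the matrices otherwise). */
void matmul_tiled(const float *A, const float *B, float *C, size_t n)
{
    assert(n % TILE_I == 0 && n % TILE_J == 0);

    /* Each iteration of the outer loop owns a distinct band of rows of C,
       so threads never write to the same element. */
#ifdef _OPENMP
#pragma omp parallel for schedule(static)
#endif
    for (size_t ii = 0; ii < n; ii += TILE_I) {
        for (size_t jj = 0; jj < n; jj += TILE_J) {
            /* Work on one TILE_I x TILE_J tile of C. */
            for (size_t i = ii; i < ii + TILE_I; ++i) {
                for (size_t k = 0; k < n; ++k) {
                    const float a = A[i * n + k];
                    const float *b = &B[k * n + jj];
                    float *c = &C[i * n + jj];
#if defined(__AVX512F__)
                    /* Broadcast a, then update 16 floats of C per FMA. */
                    const __m512 va = _mm512_set1_ps(a);
                    for (size_t j = 0; j < TILE_J; j += 16) {
                        __m512 vc = _mm512_loadu_ps(c + j);
                        vc = _mm512_fmadd_ps(va, _mm512_loadu_ps(b + j), vc);
                        _mm512_storeu_ps(c + j, vc);
                    }
#else
                    /* Scalar fallback with the same access pattern. */
                    for (size_t j = 0; j < TILE_J; ++j)
                        c[j] += a * b[j];
#endif
                }
            }
        }
    }
}
```

With GCC, flags such as `-O2 -fopenmp -mavx512f` would enable both the OpenMP directive and the AVX512 path on a processor that supports it. The caller is expected to zero `C` first if a plain product rather than an accumulation is wanted.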

Issue

Vol. 24 No. 2 (2021): May - August

Section

Research Article

Licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

