GEMM
General Matrix Multiply (GEMM) is a core operation in numerical linear algebra that computes the product of two matrices, typically with optional scaling and addition to an existing result matrix, serving as a foundational routine in the level 3 Basic Linear Algebra Subprograms (BLAS). Level 3 BLAS, including GEMM, was introduced in 1990 to support efficient matrix-matrix operations.[1][2] Defined mathematically as C = \alpha AB + \beta C, where A and B are input matrices of compatible dimensions, \alpha and \beta are scalar multipliers, and C is the output matrix, GEMM enables efficient handling of dense matrix multiplications essential for various computational tasks.[3] This operation is standardized across BLAS implementations to ensure portability and performance optimization on diverse hardware architectures.[4]
GEMM's significance stems from its ubiquity in scientific computing, engineering simulations, and data-intensive applications, where it forms the computational backbone for algorithms involving matrix operations.[5] In machine learning and deep learning, GEMM underpins forward and backward passes in neural networks, particularly for operations like convolutions that can be reformulated as matrix multiplications, driving the performance of training and inference on accelerators such as GPUs.[6] High-performance libraries like Intel's MKL, NVIDIA's cuBLAS, and OpenBLAS optimize GEMM to achieve peak hardware utilization.[5][3]
Beyond standard dense matrices, extensions like batched GEMM handle multiple independent multiplications simultaneously, crucial for parallel processing in modern workloads such as ensemble simulations and batched inference in AI models.[5] Research continues to advance GEMM implementations, incorporating low-precision formats for energy efficiency and specialized hardware like tensor cores for accelerated throughput in AI hardware.[7][3] As a benchmark for linear algebra performance, GEMM remains central to evaluating system capabilities in high-performance computing environments.[8]
Fundamentals
Definition
GEMM, or General Matrix Multiply, is a core operation in numerical linear algebra that computes the matrix product of two input matrices scaled by a factor, added to a scaled version of an output matrix. Specifically, it performs the computation C = \alpha \mathrm{op}(A) \mathrm{op}(B) + \beta C, where A is an m \times k matrix, B is a k \times n matrix, C is an m \times n matrix, \alpha and \beta are scalar multipliers, and \mathrm{op}(X) denotes either X or its transpose X^T (or Hermitian transpose for complex matrices).[2][9]
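Written element-wise, this updates each entry of C as

c_{ij} \leftarrow \alpha \sum_{p=1}^{k} \mathrm{op}(A)_{ip} \, \mathrm{op}(B)_{pj} + \beta \, c_{ij}, \quad 1 \le i \le m, \; 1 \le j \le n.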
This formulation generalizes the standard matrix multiplication by incorporating optional transpositions of the input matrices A and/or B via dedicated parameters (transA and transB), allowing flexibility in handling row-major or column-major storage without explicit preprocessing. The scaling factors \alpha and \beta further extend its utility, enabling fused multiply-add operations that are common in iterative algorithms and avoiding separate scaling steps.[2][4]
GEMM forms part of the Basic Linear Algebra Subprograms (BLAS) standard, which originated with Level 1 routines in 1979 to promote portable, efficient linear algebra software. The Level 3 BLAS, including GEMM, were formally defined in 1990 to address matrix-matrix operations on modern vector processors.[10][2]
Unlike specialized Level 3 BLAS routines such as SYMM (symmetric matrix multiply, which assumes one input is symmetric to exploit structure for efficiency) or TRMM (triangular matrix multiply, restricted to a triangular input matrix), GEMM operates on arbitrary dense matrices without assuming any symmetry or triangular form, providing a versatile primitive for general computations.[2][4]
Notation and Parameters
The General Matrix Multiply (GEMM) operation is implemented in the Basic Linear Algebra Subprograms (BLAS) as a Level 3 routine, with variants tailored to different data types: SGEMM for single-precision real numbers, DGEMM for double-precision real numbers, CGEMM for single-precision complex numbers, and ZGEMM for double-precision complex numbers. These routines follow a standardized Fortran interface, enabling portable and efficient matrix multiplication across implementations.[1]
The canonical subroutine signature, exemplified by DGEMM, is as follows:
SUBROUTINE DGEMM(TRANSA, TRANSB, M, N, K, ALPHA, A, LDA, B, LDB, BETA, C, LDC)
Here, TRANSA and TRANSB are character flags specifying whether matrices A and B are to be transposed or, for complex variants like CGEMM, conjugate transposed during the computation; valid options are 'N' for no operation, 'T' for transpose, and 'C' for conjugate transpose (the latter exclusive to complex routines).[11][12] The integers M, N, and K denote the number of rows in the output matrix C (M ≥ 0), the number of columns in C (N ≥ 0), and the shared inner dimension for compatibility between A and B (K ≥ 0), respectively. ALPHA and BETA are scalar multipliers of the appropriate precision, with ALPHA scaling the product op(A) * op(B) and BETA scaling the existing matrix C.[11] The matrices A, B, and C are passed as two-dimensional arrays, with A and B as inputs of dimensions determined by the transposition flags (e.g., A is LDA by K if TRANSA = 'N'), and C as both input and output of dimensions LDC by N. The leading dimensions LDA, LDB, and LDC specify the first dimension of the respective arrays in storage, allowing for flexible handling of submatrices or non-square layouts, and must satisfy minimum values such as LDA ≥ max(1, M) when TRANSA = 'N'.[11]
BLAS GEMM routines adhere to Fortran's column-major storage convention, where matrix elements are stored column-wise in contiguous memory, facilitating efficient access patterns in column-oriented algorithms. The leading dimension parameters provide flexibility for storing matrices in non-contiguous regions, such as subarrays within larger allocated spaces, without requiring data copying.[1]
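As a concrete illustration of these parameters, the following C sketch calls DGEMM through the CBLAS interface (cblas_dgemm from cblas.h, as provided by implementations such as OpenBLAS); the small matrix sizes and the choice of contiguous column-major storage, with leading dimensions equal to the row counts, are assumptions made here for brevity.

#include <cblas.h>   /* CBLAS interface to DGEMM (e.g., from OpenBLAS) */

int main(void) {
    /* C (2x2) := alpha * A (2x3) * B (3x2) + beta * C, column-major storage */
    int m = 2, n = 2, k = 3;
    double alpha = 1.0, beta = 0.0;

    /* Column-major: element (i,j) is stored at index i + j*ld */
    double A[2*3] = {1, 4,   2, 5,   3, 6};   /* 2x3, LDA = 2 */
    double B[3*2] = {7, 9, 11,   8, 10, 12};  /* 3x2, LDB = 3 */
    double C[2*2] = {0, 0, 0, 0};             /* 2x2, LDC = 2 */

    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                m, n, k, alpha, A, 2, B, 3, beta, C, 2);

    /* C now holds [[58, 64], [139, 154]] in column-major order */
    return 0;
}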
For error handling, BLAS implementations assume non-negative dimensions (M, N, K ≥ 0) and compatible matrix sizes (e.g., the inner dimensions match via K), with leading dimensions meeting the specified minima; invalid parameters trigger a call to the standard error subroutine XERBLA, which reports the offending argument index and typically halts execution or returns control to the caller.[11]
Algorithms and Implementations
Naive Algorithm
The naive algorithm for the general matrix multiplication (GEMM) computes C \leftarrow \alpha \mathrm{op}(A) \mathrm{op}(B) + \beta C, where A is an m \times k matrix, B is a k \times n matrix, C is an m \times n matrix, \alpha and \beta are scalars, and \mathrm{op}(X) denotes X, its transpose X^T, or conjugate transpose X^H as specified by BLAS parameters.[13] This approach uses a triple nested loop to accumulate each element of C directly from corresponding rows of \mathrm{op}(A) and columns of \mathrm{op}(B).[14]
For the non-transposed case (\op(A) = A, \op(B) = B), the pseudocode initializes an accumulator for each C(i,j) and performs the summation before scaling and adding the existing C:
for i = 0 to m-1
    for j = 0 to n-1
        temp = 0.0
        for p = 0 to k-1
            temp += A[i][p] * B[p][j]
        C[i][j] = α * temp + β * C[i][j]
If \beta = 0, the addition term is omitted and C is simply overwritten with the scaled product, so the existing contents of C (which may be uninitialized) are never read.[15] For transposed cases, the inner-loop accesses are adjusted (e.g., reading A[p][i] and B[j][p] when both operands are transposed) so that the indexing matches the effective dimensions of \mathrm{op}(A) and \mathrm{op}(B).[13]
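For concreteness, a minimal C rendering of the non-transposed case, following the column-major BLAS convention and honoring the \beta = 0 rule above, might look as follows; the function name naive_dgemm and the omission of the transpose options are simplifications for illustration rather than a conforming BLAS routine.

#include <stddef.h>

/* Naive DGEMM for the 'N','N' case: C := alpha*A*B + beta*C,
 * with column-major storage and leading dimensions lda, ldb, ldc. */
void naive_dgemm(int m, int n, int k, double alpha,
                 const double *A, int lda,
                 const double *B, int ldb,
                 double beta, double *C, int ldc)
{
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i) {
            double temp = 0.0;
            for (int p = 0; p < k; ++p)
                temp += A[i + (size_t)p * lda] * B[p + (size_t)j * ldb];
            /* When beta == 0, C is overwritten and never read. */
            C[i + (size_t)j * ldc] = (beta == 0.0)
                ? alpha * temp
                : alpha * temp + beta * C[i + (size_t)j * ldc];
        }
    }
}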
This implementation requires exactly 2mnk floating-point operations (FLOPs): mnk multiplications and mnk additions, yielding O(mnk) time complexity.[14] Space usage is O(mk + kn + mn) to store the dense input and output matrices A, B, and C.[14]
Optimized Algorithms
Optimized algorithms for general matrix multiplication (GEMM) enhance performance by addressing memory hierarchy constraints and computational inefficiencies in the naive triple-loop approach. A primary technique is loop blocking, or tiling, which partitions the matrices into smaller submatrices that fit within cache levels, thereby increasing data reuse and reducing memory access latency. For instance, the matrices are divided into blocks of sizes m_c \times k_c for the left operand and k_c \times n_c for the right operand, with block sizes tuned to align with L1 or L2 cache capacities, such as filling half the L2 cache to maximize temporal locality.[16] Different loop orders, such as ijk, ikj, or jik, further optimize data access patterns; the ikj order, for example, promotes column-major storage compatibility in Fortran-style implementations, minimizing cache misses by ensuring sequential access to submatrices of the accumulating result matrix.[16]
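The blocking idea can be sketched as follows; the block sizes MC, KC, and NC and the function name blocked_gemm are illustrative placeholders rather than values from any particular library, and the routine computes C := C + AB for contiguous column-major arrays.

#include <stddef.h>

#define MC 64   /* illustrative block sizes; tuned to the cache hierarchy in practice */
#define KC 64
#define NC 64

static inline int min_int(int a, int b) { return a < b ? a : b; }

/* Blocked C := C + A*B, column-major, contiguous storage (lda = m, ldb = k, ldc = m). */
void blocked_gemm(int m, int n, int k,
                  const double *A, const double *B, double *C)
{
    for (int jc = 0; jc < n; jc += NC)            /* block of columns of B and C */
        for (int pc = 0; pc < k; pc += KC)        /* block of the inner dimension */
            for (int ic = 0; ic < m; ic += MC) {  /* block of rows of A and C */
                int nb = min_int(NC, n - jc);
                int kb = min_int(KC, k - pc);
                int mb = min_int(MC, m - ic);
                /* Multiply the mb x kb block of A by the kb x nb block of B,
                 * accumulating into the mb x nb block of C. */
                for (int j = 0; j < nb; ++j)
                    for (int p = 0; p < kb; ++p) {
                        double b = B[(pc + p) + (size_t)(jc + j) * k];
                        for (int i = 0; i < mb; ++i)
                            C[(ic + i) + (size_t)(jc + j) * m] +=
                                A[(ic + i) + (size_t)(pc + p) * m] * b;
                    }
            }
}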
Loop unrolling and fusion complement blocking by reducing overhead from loop control instructions and enabling tighter integration of operations. Inner loops are fully unrolled to eliminate branch instructions and allow the compiler to schedule fused multiply-add (FMA) operations more efficiently, with unrolling factors often set to match register counts, such as processing 4 to 8 elements per iteration on modern CPUs.[16] Fusion techniques combine adjacent loop iterations or auxiliary operations (e.g., scaling or addition in the GEMM formula) into a single kernel, preventing intermediate data writes to memory and improving instruction-level parallelism.[17]
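A minimal sketch of such unrolling, assuming for brevity that the inner dimension k is a multiple of 4, keeps several independent partial sums so that the compiler can schedule back-to-back FMAs without a loop-carried dependence:

/* Dot product with 4-way unrolling; assumes k % 4 == 0 for brevity.
 * In a GEMM context, a and b would be a packed row of op(A) and column of op(B). */
double dot_unrolled4(int k, const double *a, const double *b)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    for (int p = 0; p < k; p += 4) {
        s0 += a[p]     * b[p];      /* four independent accumulators */
        s1 += a[p + 1] * b[p + 1];  /* expose instruction-level parallelism */
        s2 += a[p + 2] * b[p + 2];
        s3 += a[p + 3] * b[p + 3];
    }
    return (s0 + s1) + (s2 + s3);
}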
For large square matrices, Strassen's algorithm offers asymptotic improvements over the standard O(n^3) complexity by recursively subdividing matrices into 2x2 blocks and performing only seven multiplications instead of eight, yielding a complexity of O(n^{\log_2 7}) \approx O(n^{2.807}). This recursive approach is practical for matrices exceeding certain thresholds (e.g., n > 1000) where the reduced multiplications outweigh the increased additions, though it requires careful implementation to handle non-power-of-two dimensions and maintain numerical stability in GEMM variants.[14]
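For reference, partitioning each matrix into 2 \times 2 blocks, the seven Strassen products and the recombination of C are:

M_1 = (A_{11} + A_{22})(B_{11} + B_{22})
M_2 = (A_{21} + A_{22}) B_{11}
M_3 = A_{11}(B_{12} - B_{22})
M_4 = A_{22}(B_{21} - B_{11})
M_5 = (A_{11} + A_{12}) B_{22}
M_6 = (A_{21} - A_{11})(B_{11} + B_{12})
M_7 = (A_{12} - A_{22})(B_{21} + B_{22})

C_{11} = M_1 + M_4 - M_5 + M_7
C_{12} = M_3 + M_5
C_{21} = M_2 + M_4
C_{22} = M_1 - M_2 + M_3 + M_6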
Single instruction, multiple data (SIMD) instructions accelerate the inner dot-product computations by vectorizing accumulations across multiple elements simultaneously. On x86 architectures, AVX-512 instructions process 8 double-precision or 16 single-precision elements per operation, while AVX2 processes 4 doubles or 8 singles; ARM's NEON handles 2 doubles or 4 singles, enabling up to 4x-8x speedup in the microkernel for aligned data.[18][19] These vectorized dot products are integrated into blocked loops, with data packing to ensure alignment and avoid penalties from misaligned loads.[20]
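As an illustration, a vectorized inner update using AVX2/FMA intrinsics (compiled with, for example, -mavx2 -mfma) can process four double-precision elements of a column of C per instruction; the sketch below assumes m is a multiple of 4 and contiguous column-major storage, and is not the packed microkernel of any particular library.

#include <immintrin.h>  /* AVX2/FMA intrinsics */

/* C(:,j) += A(:,p) * B(p,j) for one (p, j) pair, vectorized over rows.
 * Column-major, contiguous storage; assumes m % 4 == 0 for brevity. */
void axpy_column_avx2(int m, const double *A_col, double b_pj, double *C_col)
{
    __m256d b = _mm256_set1_pd(b_pj);           /* broadcast the scalar B(p,j) */
    for (int i = 0; i < m; i += 4) {
        __m256d a = _mm256_loadu_pd(&A_col[i]); /* 4 doubles from column p of A */
        __m256d c = _mm256_loadu_pd(&C_col[i]); /* 4 doubles from column j of C */
        c = _mm256_fmadd_pd(a, b, c);           /* fused multiply-add: c = a*b + c */
        _mm256_storeu_pd(&C_col[i], c);
    }
}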
High-performance libraries like GotoBLAS exemplify these optimizations through hierarchical blocking strategies, including register-level blocking where submatrices (e.g., 4x4) are held entirely in SIMD registers to minimize load/store operations.[16] This approach, combined with empirical tuning of block sizes for specific hardware, achieves near-peak floating-point performance on CPUs by balancing register usage with cache efficiency.[16]
Computational Complexity
The computational complexity of GEMM is dominated by its arithmetic operations, requiring Θ(m n k) time in the general case for matrices of dimensions m × k, k × n, and m × n. Specifically, the operation performs 2 m n k floating-point operations (FLOPs) for real-valued entries, accounting for the multiply-accumulate steps across all elements.[21] The auxiliary space complexity is O(m n + m k + k n), reflecting the storage needs for the input matrices A and B, and the output matrix C, with no additional temporary storage beyond constant factors in optimized implementations.
In the roofline performance model, GEMM's attainable performance is constrained by its arithmetic intensity, defined as the ratio of FLOPs to bytes transferred from memory, which for balanced dimensions (m = n = k = N, assuming 64-bit floating-point elements) approximates N/12 FLOPs per byte.[3] This intensity highlights GEMM's potential to transition from memory-bound to compute-bound regimes as matrix sizes increase, provided data reuse is effectively managed through blocking; low intensities lead to bandwidth limitations, capping performance well below peak FLOPs rates on multicore systems.
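The N/12 figure follows from a simple operation and traffic count, assuming each of the three N \times N operands crosses the memory interface exactly once at 8 bytes per element:

\text{arithmetic intensity} = \frac{2N^3 \ \text{FLOPs}}{3 \cdot 8N^2 \ \text{bytes}} = \frac{N}{12} \ \text{FLOPs per byte}.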
For distributed-memory environments with P processors, each holding M words of local memory, communication complexity imposes fundamental limits: the Irony-Toledo-Tiskin lower bound shows that at least one processor must communicate \Omega(mnk / (P \sqrt{M})) words.[22] This bound underscores the trade-off between local computation and inter-processor data movement, often necessitating data replication or 3D partitioning to approach optimality.
Practical peak performance for DGEMM on modern CPUs, such as Intel Sapphire Rapids processors (as of 2023), reaches up to ~3.2 TFLOPS when utilizing all cores with optimized libraries like Intel oneMKL, though it remains constrained by memory bandwidth, achieving only a fraction of theoretical peaks for large problems.[23] These metrics illustrate GEMM's scalability challenges, where bandwidth bottlenecks prevent full utilization of compute resources even in highly tuned libraries like those in BLAS.
Hardware Considerations
The performance of General Matrix Multiply (GEMM) operations is profoundly shaped by the underlying hardware architecture, particularly the memory hierarchy and compute units. Cache exploitation is crucial, as modern CPU cores typically feature a 32 KB L1 data cache, which constrains the size of inner loop blocks to minimize data movement and cache misses.[24][25] For double-precision floating-point elements (8 bytes each), this often limits inner blocks to approximately 4x4 elements, fitting within 128 bytes to allow multiple sub-blocks to reside simultaneously in the cache for efficient reuse during the accumulation phase.[25] Larger L2 (256-512 KB) and L3 (several MB shared) caches further guide outer blocking strategies, ensuring that intermediate results stay in faster memory levels to reduce latency from main DRAM access, which can be 100-200 cycles slower.[26]
On multi-core processors, GEMM implementations leverage parallelism through directives like OpenMP to distribute loop iterations across cores, enabling scaling to tens of cores while maintaining load balance via dynamic scheduling.[27] For instance, optimized DGEMM routines can achieve sustained performance of around 250 GFLOPS with eight OpenMP threads on mid-range multi-core systems, though efficiency diminishes beyond 16-32 cores due to increased thread overhead and contention for shared L3 cache.[27] This loop-level parallelism, combined with cache-aware blocking, ensures that each core processes independent sub-matrices, minimizing synchronization costs in shared-memory environments.[28]
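A minimal sketch of this loop-level parallelism with OpenMP, assuming contiguous column-major arrays and an illustrative block width NB, assigns disjoint column blocks of C to threads so no synchronization is needed inside the loop:

#include <omp.h>
#include <stddef.h>

#define NB 256  /* illustrative block width; dynamic scheduling balances uneven tails */

/* Parallel C := C + A*B over column blocks of C (column-major, contiguous). */
void parallel_gemm(int m, int n, int k,
                   const double *A, const double *B, double *C)
{
    #pragma omp parallel for schedule(dynamic)
    for (int jc = 0; jc < n; jc += NB) {
        int nb = (n - jc < NB) ? (n - jc) : NB;
        /* Each thread owns a disjoint block of columns of C: no races. */
        for (int j = 0; j < nb; ++j)
            for (int p = 0; p < k; ++p) {
                double b = B[p + (size_t)(jc + j) * k];
                for (int i = 0; i < m; ++i)
                    C[i + (size_t)(jc + j) * m] += A[i + (size_t)p * m] * b;
            }
    }
}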
Graphics Processing Units (GPUs) accelerate GEMM through specialized libraries such as NVIDIA's cuBLAS and AMD's rocBLAS, which exploit tensor cores for high-throughput matrix operations. On the NVIDIA A100 GPU, cuBLAS achieves up to 298 TFLOPS for FP16 GEMM using tensor cores, approaching the hardware's peak of 312 TFLOPS by performing multiple low-precision multiplications and accumulations per clock cycle.[29] On the more recent NVIDIA H100 GPU, cuBLAS achieves around 660 TFLOPS for FP16 GEMM (as of 2024 benchmarks), approaching the peak of 989 TFLOPS, with Blackwell architectures (2024) reaching over 6000 TFLOPS in optimized low-precision formats.[30][31] Similarly, rocBLAS on AMD GPUs utilizes matrix cores to deliver comparable accelerations, with mixed-precision GEMM reaching hundreds of TFLOPS on Instinct accelerators by optimizing tile sizes to match the core's systolic array architecture.[32] These implementations align matrix dimensions to 16-byte multiples to maximize tensor core utilization and bandwidth from high-bandwidth memory (HBM).[3]
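As an illustration, a host-side request for an FP16 GEMM with FP32 accumulation via cublasGemmEx might look like the sketch below; it assumes cuBLAS 11 or later, previously allocated and populated device pointers dA, dB, and dC, and the column-major layout that cuBLAS expects.

#include <cublas_v2.h>
#include <cuda_fp16.h>

/* FP16 inputs, FP32 accumulation on tensor cores (sketch; error checks omitted).
 * dA, dB, dC are device pointers to contiguous column-major __half matrices. */
void gemm_fp16_tensorcore(cublasHandle_t handle, int m, int n, int k,
                          const __half *dA, const __half *dB, __half *dC)
{
    float alpha = 1.0f, beta = 0.0f;  /* scaling factors match the compute type */
    cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                 &alpha,
                 dA, CUDA_R_16F, m,   /* lda = m for a contiguous m x k matrix */
                 dB, CUDA_R_16F, k,   /* ldb = k for a contiguous k x n matrix */
                 &beta,
                 dC, CUDA_R_16F, m,   /* ldc = m */
                 CUBLAS_COMPUTE_32F, CUBLAS_GEMM_DEFAULT);
}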
Vector processing units further enhance CPU-based GEMM by enabling SIMD instructions to handle multiple elements simultaneously. Intel's AVX-512 instruction set, with 512-bit registers, allows packing up to eight double-precision accumulations per instruction, boosting throughput by 2x over AVX2 on compatible processors while requiring 64-byte memory alignment to avoid penalties from unaligned loads.[33][34] This alignment ensures efficient data packing into vector registers, reducing the cycles per floating-point operation in the inner kernel.[35]
Post-2020 hardware advancements, such as Intel's Advanced Matrix Extensions (AMX) announced in 2020 and deployed in Sapphire Rapids processors from 2023, address limitations in integer GEMM for quantized workloads. AMX provides dedicated tile-based matrix multiply units supporting 8-bit and 16-bit integers, achieving up to 2.74x higher peak performance than FP32 equivalents in mixed-precision scenarios by offloading accumulations to 1 KB tile registers.[36][37] Unlike pre-2020 analyses that overlooked specialized accelerators such as tensor cores, these extensions enable efficient handling of low-precision GEMM in AI inference, with tiled operations scaling throughput up to batch sizes of 32 before register pressure impacts performance.[38][39]
Applications
In Linear Algebra
The general matrix multiplication (GEMM) operation serves as a foundational primitive in numerical linear algebra, enabling the efficient implementation of higher-level algorithms through blocked decompositions that leverage level-3 BLAS routines for improved data locality and parallelism. In particular, GEMM forms the core of update steps in matrix factorizations, where submatrices are multiplied to refine the decomposition while minimizing memory accesses. This integration allows GEMM to underpin solvers for systems of linear equations, eigenvalue computations, and related tasks by composing complex operations from repeated matrix-matrix products.[40]
In LU and QR decompositions, GEMM is central to the blocked algorithms that enhance performance on modern architectures. For LU factorization, the right-looking blocked variant partitions the matrix into blocks and uses GEMM to compute the Schur complement update after triangular solves, ensuring that the bulk of the computation occurs at level-3 BLAS granularity. Similarly, in QR decomposition via blocked Householder reflections, GEMM applies the accumulated transformations to the trailing submatrix, with each panel factorization followed by a GEMM update that propagates the orthogonalization across the entire matrix; this approach, as implemented in libraries like LAPACK, requires multiple GEMM calls per block column to achieve numerical stability and efficiency. These compositions highlight GEMM's role in scaling decompositions to large matrices by exploiting cache hierarchies.[41][42]
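For example, in the right-looking blocked LU factorization, once a panel has been factored and the block row U_{12} obtained by a triangular solve, the trailing submatrix is updated by a single GEMM call with \alpha = -1 and \beta = 1:

A_{22} \leftarrow A_{22} - L_{21} U_{12},

which matches the general form C = \alpha \, \mathrm{op}(A) \, \mathrm{op}(B) + \beta C with A = L_{21}, B = U_{12}, and C = A_{22}.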
For eigenvalue problems, GEMM facilitates reductions and iterative methods by handling dense matrix operations within sparse or structured contexts. In Hessenberg reduction, the blocked algorithm employs GEMM to chase bulges generated by Householder reflectors, applying similarity transformations to the trailing matrix while maintaining the upper Hessenberg form essential for subsequent QR iterations. In iterative solvers like the Lanczos algorithm for symmetric eigenproblems, GEMM extends the basic matrix-vector products to block versions, enabling subspace iterations where multiple starting vectors are orthogonalized and multiplied against the matrix simultaneously, thus accelerating convergence for extremal eigenvalues in dense settings.[43][44]
Cholesky factorization for symmetric positive definite matrices further exemplifies GEMM's utility in specialized decompositions. The blocked right-looking Cholesky algorithm uses GEMM in the update phase to compute the Schur complement of the lower triangular factor, following the factorization of diagonal blocks and solves with the off-diagonal blocks; this SYRK-GEMM-TRSM sequence, as optimized in high-performance implementations, dominates the flop count and ensures the decomposition's stability for applications like least-squares problems. GEMM's generality also supports mixed-precision strategies in these solvers, such as performing multiplications in lower precision (e.g., FP16) while accumulating in higher precision (e.g., FP64) to balance computational speed with numerical accuracy, particularly in fault-tolerant or resource-constrained environments.[40]
In Machine Learning and HPC
In machine learning, particularly deep learning, the general matrix multiplication (GEMM) operation forms the computational backbone for numerous neural network layers. Fully connected layers compute the output as Y = XW + b, where this linear transformation is efficiently realized through GEMM, with the weight matrix W and input X as operands.[3] Convolutional layers are similarly transformed into GEMM via the im2col technique, which rearranges input feature maps into a matrix of patches to enable direct matrix-matrix multiplication with the kernel weights, avoiding explicit loops over spatial dimensions.[45]
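A minimal CPU-side sketch of the fully connected case Y = XW + b as a single GEMM call is shown below; it uses the CBLAS interface in row-major mode, broadcasts the bias into the output beforehand so that the \beta C term absorbs it, and the dimension names batch, in_dim, and out_dim are illustrative.

#include <cblas.h>
#include <string.h>

/* Forward pass Y = X*W + b as a single SGEMM.
 * X: batch x in_dim, W: in_dim x out_dim, b: out_dim, Y: batch x out_dim,
 * all row-major and contiguous. */
void fc_forward(int batch, int in_dim, int out_dim,
                const float *X, const float *W, const float *b, float *Y)
{
    /* Broadcast the bias into each row of Y, then let beta = 1 add it back in. */
    for (int i = 0; i < batch; ++i)
        memcpy(&Y[(size_t)i * out_dim], b, out_dim * sizeof(float));

    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                batch, out_dim, in_dim,
                1.0f, X, in_dim,   /* lda = in_dim  (row-major X) */
                      W, out_dim,  /* ldb = out_dim (row-major W) */
                1.0f, Y, out_dim); /* ldc = out_dim, beta = 1 keeps the bias */
}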
Transformer models, a cornerstone of modern natural language processing and beyond since their introduction in 2017, reduce core components like scaled dot-product attention to batched GEMM operations. The attention function is defined as \text{Attention}(Q, K, V) = \text{softmax}\left( \frac{QK^T}{\sqrt{d_k}} \right) V, where Q, K, and V are linear projections of the input sequence, involving multiple GEMM calls for the dot products and weighted sums that scale quadratically with sequence length.[46]
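Concretely, for a single attention head with sequence length L and head dimension d_k, the two matrix products are GEMMs of the following shapes:

S = \frac{Q K^T}{\sqrt{d_k}} \in \mathbb{R}^{L \times L} \quad (\text{an } L \times d_k \text{ by } d_k \times L \text{ product}), \qquad O = \mathrm{softmax}(S) \, V \in \mathbb{R}^{L \times d_k} \quad (\text{an } L \times L \text{ by } L \times d_k \text{ product}),

repeated independently for every head and every sequence in the batch, which is precisely the batched GEMM pattern described next.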
Batched GEMM addresses the demands of mini-batch training in deep learning by computing multiple independent matrix multiplications in parallel, such as C_p = \alpha \cdot \text{op}(A_p) \cdot \text{op}(B_p) + \beta \cdot C_p across batch index p. GPU libraries like cuBLAS implement strided batched GEMM to handle these efficiently, minimizing kernel launch overhead for small matrices typical in recurrent or convolutional networks, and achieving throughput near that of a single large GEMM.[47]
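A host-side sketch of such a strided batched call, assuming batchCount column-major matrices stored back to back in device memory, might look as follows; the strides are simply the sizes of the individual matrices.

#include <cublas_v2.h>

/* Batched C_p := alpha * A_p * B_p + beta * C_p for p = 0..batchCount-1.
 * Each A_p is m x k, B_p is k x n, C_p is m x n, column-major, stored
 * back to back in device memory (sketch; error checks omitted). */
void batched_sgemm(cublasHandle_t handle, int m, int n, int k, int batchCount,
                   const float *dA, const float *dB, float *dC)
{
    float alpha = 1.0f, beta = 0.0f;
    cublasSgemmStridedBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                              &alpha,
                              dA, m, (long long)m * k,  /* stride to next A_p */
                              dB, k, (long long)k * n,  /* stride to next B_p */
                              &beta,
                              dC, m, (long long)m * n,  /* stride to next C_p */
                              batchCount);
}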
In high-performance computing (HPC), GEMM underpins simulations in fields like computational fluid dynamics (CFD) and climate modeling, where it facilitates the assembly and solution of large linear systems arising from finite element discretizations or FFT-based spectral methods. Batched GEMM kernels often dominate these workflows, comprising up to 80% of runtime in climate and weather prediction tasks.[48] Optimized, hardware-aware GEMM variants, including mixed-precision approaches, enable exascale scalability on systems like the AMD GPU-powered Frontier supercomputer, balancing accuracy and performance for such resource-intensive applications.[49]