
OpenBLAS

OpenBLAS is an open-source software library that provides an optimized implementation of the Basic Linear Algebra Subprograms (BLAS) and Linear Algebra Package (LAPACK) application programming interfaces (APIs), designed for efficient numerical computations in scientific and engineering applications. Based on the GotoBLAS2 1.13 BSD version, it incorporates hand-tuned kernels for vector and matrix operations to achieve high performance on modern processors. Originally developed by Zhang Xianyi, Wang Qian, and others starting in March 2011, OpenBLAS has evolved into a widely adopted tool in fields such as machine learning, high-performance computing, and data analysis.

The library supports a broad range of hardware architectures, including x86/x86-64, ARM/ARM64, PowerPC, RISC-V, MIPS, and LoongArch, with runtime CPU detection enabling automatic selection of optimal code paths. Key features include multithreading support for up to 256 cores (extendable to 1024 with NUMA optimizations), integration of LAPACK routines, and binary packages for platforms like Windows and Linux. Released under the permissive 3-clause BSD license, OpenBLAS is actively maintained by the OpenMathLib community, with the latest version 0.3.30 (as of June 2025) incorporating further performance enhancements across several targets.

Its development history reflects ongoing adaptation to new hardware and standards, beginning with initial alpha releases in 2011 that focused on Loongson processors, followed by major updates such as AVX support in 2012 and ReLAPACK integration in 2018. OpenBLAS's performance optimizations, derived from GotoBLAS techniques, make it a preferred choice over reference BLAS implementations, often delivering near-peak hardware utilization in dense linear algebra tasks. The project is hosted on GitHub, where contributors worldwide enhance its portability and efficiency for diverse computational workloads.

Introduction

Overview

OpenBLAS is an open-source, optimized implementation of the Basic Linear Algebra Subprograms (BLAS) standard, designed to deliver high-performance numerical computations across various hardware architectures. It encompasses level 1 routines for vector operations, level 2 routines for matrix-vector operations, and level 3 routines for matrix-matrix operations, enabling efficient handling of fundamental linear algebra tasks in computational applications. The BLAS specification serves as a portable interface for these operations, promoting portability and performance optimization without prescribing specific implementations, thus allowing libraries like OpenBLAS to focus on architecture-specific tuning for speed and efficiency. In addition to BLAS, OpenBLAS incorporates LAPACK routines, which address higher-level linear algebra problems such as solving systems of linear equations and computing eigenvalues. Originating from the GotoBLAS2 project, OpenBLAS has evolved into a widely adopted library that supports scientific computing, machine learning frameworks, and numerical simulations by providing robust, threaded implementations of these routines. As of November 2025, the latest stable release is version 0.3.30, issued on June 19, 2025.

Purpose and Standards Compliance

OpenBLAS serves as an open-source library designed to deliver high-performance implementations of the Basic Linear Algebra Subprograms (BLAS) and LAPACK routines, surpassing the speed of reference implementations through optimizations tailored for multi-core processors. Its primary goal is to support demanding computational workloads in high-performance computing (HPC), machine learning, and engineering simulations by providing efficient vector and matrix operations that exploit modern CPU architectures. The library adheres strictly to established standards, implementing BLAS levels 1, 2, and 3 for operations on real and complex data in single and double precision, as defined by the BLAS specification. It also complies with the LAPACK 3.x interface, with recent releases incorporating updates from Reference-LAPACK 3.10.1 and later, which enables seamless integration as a drop-in replacement for standard reference libraries without requiring modifications to existing codebases. Beyond core standards compliance, OpenBLAS extends functionality with architecture-specific kernel tuning, all while preserving full API compatibility to ensure interoperability. This makes it a versatile backend for numerical software ecosystems, such as NumPy and SciPy in Python environments or R and Julia, where it accelerates linear algebra tasks across diverse applications. Furthermore, OpenBLAS maintains portability across major operating systems, including Linux, Windows, and macOS, facilitating broad deployment in scientific computing and data analysis workflows.

History

Origins and Development

OpenBLAS originated in 2011 as a fork of GotoBLAS2 version 1.13, which had been released under the BSD license. The fork was initiated following Kazushige Goto's cessation of active maintenance on GotoBLAS2 after his tenure at the Texas Advanced Computing Center, where the library had been developed to provide high-performance linear algebra operations. The decision to fork addressed the need for continued maintenance of an open-source BLAS implementation amid an evolving hardware landscape. Initial development was led by Xianyi Zhang, then at the Institute of Software, Chinese Academy of Sciences, along with collaborators such as Wang Qian. It focused first on optimizations for the Loongson 3A processor and later on multi-core architectures like Intel's Nehalem series, aiming to close performance gaps in open-source alternatives without relying on proprietary vendor libraries. Early efforts emphasized hand-tuned assembly code and threading support to exploit parallelism in emerging multi-core systems, ensuring compatibility and efficiency for scientific computing applications. The project was hosted on GitHub from its inception in 2011. Xianyi Zhang served as the primary maintainer through the 2010s, until the project transitioned to community-driven development under the OpenMathLib organization on GitHub in 2018, facilitating broader collaboration and incorporating input from developers worldwide to sustain its relevance.

Key Milestones

OpenBLAS achieved its first stable releases in the 0.2.x series in 2012, which included multi-threading support to enhance performance on multi-core systems. In 2013, support for ARM architectures was introduced, signifying a pivotal expansion in hardware compatibility beyond x86 platforms. By 2015, OpenBLAS had gained significant adoption, with integration into major Linux distributions, enabling seamless use in scientific computing environments. In 2018, the project was transferred to the OpenMathLib organization to enable broader community involvement and governance. The release of version 0.3.0 on May 23, 2018, marked a major advancement by introducing dynamic architecture detection for platforms including ARM64, ARMv7, ppc64le, and x86_64, alongside further optimizations in subsequent minor updates within the 0.3.x series. In December 2020, version 0.3.13 further broadened OpenBLAS's scope with additional architecture support and enhanced integration of LAPACK routines, addressing the escalating requirements of machine learning applications. OpenBLAS also saw adoption in machine learning frameworks, serving as a backend to accelerate linear algebra operations on diverse hardware. Recent developments include version 0.3.26, released on January 2, 2024, which delivered enhancements including a speedup for the ?GESV solver on small matrix sizes, and version 0.3.28 in August 2024, featuring optimizations for POWER10 processors. Version 0.3.30, released on June 19, 2025, includes fixes for performance regressions in multithreaded GEMM on POWER targets and enhanced parallel workload partitioning.

Technical Architecture

Core Implementation

OpenBLAS employs a modular software architecture centered around hand-written assembly kernels for performance-critical routines, such as double-precision general matrix multiplication (DGEMM), which are implemented in architecture-specific assembly files within the kernel directory. These low-level kernels, for example dgemm_kernel_4x8_haswell.S for Intel Haswell processors, handle the core computational loops to maximize instruction-level parallelism and vectorization. To ensure portability across diverse systems, these kernels are encapsulated by higher-level C and Fortran wrappers in the interface directory, which provide the standard BLAS and LAPACK APIs while abstracting hardware dependencies. The build process uses a flexible system supporting both GNU Make and CMake to compile architecture-specific binaries tailored to the target CPU's instruction set extensions, such as AVX2 or AVX-512. During compilation, users specify the target via variables like TARGET, enabling the generation of optimized code paths; for unsupported or generic CPUs, the build falls back to portable C implementations in the kernel/generic/ subdirectory, such as gemmkernel_2x2.c, ensuring functionality without specialized optimizations. This compile-time selection produces a single binary with embedded kernel variants, avoiding the need for multiple installations. Internally, OpenBLAS organizes its components into per-routine kernel libraries in the kernel/ directory, driver routines in driver/, and an interface layer that orchestrates calls to the appropriate implementations. A key element is the dispatcher mechanism, implemented in files like driver/others/dynamic.c, which performs CPU detection to select the optimal variant based on detected features, such as cache sizes or SIMD capabilities, or falls back to compile-time choices if dynamic detection is disabled. This organization allows seamless integration of new kernels without altering the public API.
OpenBLAS supports operations across single-precision, double-precision, complex, and double-complex datatypes as defined in the BLAS and LAPACK specifications, with routines like SGEMM for single precision and ZGEMM for double complex ensuring consistent behavior across precisions. Error handling adheres to BLAS conventions by validating input parameters where specified and propagating special values through IEEE 754-compliant floating-point arithmetic, though many BLAS routines assume valid inputs and rely on caller-side checks for robustness. Comprehensive testing suites in the test/ and ctest/ directories verify compliance and numerical accuracy for these operations.

Optimizations and Algorithms

OpenBLAS employs cache-optimized blocking strategies to minimize memory traffic and maximize data reuse, particularly in the general matrix multiplication (GEMM) routine. For instance, the kernel divides matrices into blocks sized to fit within the L1 and L2 caches, with parameters such as m_c, k_c, and n_c tuned to ensure that submatrices of A and B remain resident in cache during computations, thereby reducing cache misses and improving bandwidth utilization. Vectorization in OpenBLAS leverages SIMD instructions to accelerate floating-point operations through parallel loads and arithmetic on multiple data elements. On x86 architectures, kernels incorporate AVX and AVX2 instructions for 256-bit vector operations, while fused multiply-add (FMA) units enable efficient computation of expressions like a \times b + c in a single instruction, boosting throughput in the inner loops of routines such as DGEMM. Assembly-optimized micro-kernels, such as those for Haswell processors, explicitly use these SIMD extensions to achieve higher throughput without relying solely on auto-vectorization. During the build process, OpenBLAS performs auto-tuning to empirically optimize parameters like register blocking sizes for specific hardware, ensuring kernels are tailored to the target CPU's cache hierarchy and instruction set. The getarch utility detects the processor architecture and adjusts values such as register block dimensions (e.g., 2x2 or 4x4 for GEMM micro-kernels) and loop unrolling factors based on performance measurements, allowing for matrix-size-dependent efficiency without runtime overhead. This build-time tuning extends to empirical selection of block sizes that balance register pressure and cache utilization across varying matrix dimensions. OpenBLAS adopts algorithmic choices rooted in Goto's register-blocked approach for core routines like DGEMM, where matrices are partitioned into small register-held submatrices to maximize floating-point work per memory access and approach peak hardware throughput.
This method involves packing input matrices into contiguous buffers to eliminate indirect addressing overhead, followed by blocked multiplication that reuses data across the L1/L2 cache levels, enabling near-optimal performance on modern processors. While Strassen-like variants, which reduce the asymptotic complexity of matrix multiplication through recursive subdivision, are explored in related frameworks for very large matrices, OpenBLAS primarily relies on conventional blocked algorithms for its standard BLAS implementations to ensure numerical stability and broad applicability.

Supported Hardware Architectures

OpenBLAS provides optimized implementations for a wide range of CPU architectures, enabling high-performance linear algebra operations across diverse hardware platforms. Its support spans from widely used x86 processors to emerging architectures like RISC-V and LoongArch, with kernel-specific adaptations that leverage instruction set extensions for vectorization and parallelism. This broad compatibility is achieved through architecture-specific assembly code and compiler optimizations, allowing users to build tailored binaries for their target systems. The x86 and x86-64 instruction sets form the core of OpenBLAS's hardware support, with extensive optimizations for Intel and AMD processors. For Intel, it targets models ranging from older Core-era chips through Granite Rapids, incorporating advanced vector instructions such as AVX-512, which enable up to 16-wide single-precision vector operations for improved throughput. AMD support covers the Zen architecture family, from Zen 1 onward, utilizing corresponding SIMD extensions like AVX2 and FMA for efficient floating-point performance. These adaptations include dedicated kernels for microarchitectures like Skylake-X and Zen, ensuring register-level tuning for cache hierarchies and prefetch behaviors. OpenBLAS also delivers robust support for ARM and ARM64 architectures, catering to mobile, server, and embedded environments. It optimizes for the Cortex-A series, including models like the A57, A72, A76, and A510, as well as newer cores like the A710 and X2, and custom implementations such as Qualcomm's Falkor and Cavium's ThunderX variants up to ThunderX3. For Apple silicon, compatibility is provided for M-series chips from the M1 through the M4, with vector operations accelerated via NEON instructions; additionally, Scalable Vector Extension (SVE) and Scalable Matrix Extension (SME) support enables wider vector lengths on compatible hardware like Fujitsu's A64FX. This allows OpenBLAS to exploit ARM's energy-efficient designs while maintaining competitive performance in dense linear algebra tasks.
Beyond x86 and ARM, OpenBLAS extends to several other architectures: PowerPC and POWER up to POWER10 for IBM's high-end servers, which benefit from VSX vector units; RISC-V RV64 with vector extensions (e.g., ZVL128B and ZVL256B for scalable vector widths, with RVV 1.0 compliance); MIPS32 and MIPS64 variants like the 24K and 1004K, along with Loongson 3A/3B; IBM Z (s390x) from the z13 onward, optimized for its vector facilities; and LoongArch64, supporting the LA264 and LA464 cores in Chinese domestic processors. These ports include hand-tuned kernels for architecture-specific features, such as POWER10's matrix-multiply assist (MMA) interfaces. Build-time configuration is facilitated through TARGET flags in the Makefile, allowing static binaries tuned to specific microarchitectures—for instance, TARGET=SKYLAKEX for Skylake-X or TARGET=ZEN for the Zen series. For broader compatibility on systems with varying CPU generations, the DYNAMIC_ARCH=1 option enables runtime detection and selection of the optimal kernel path, supporting multiple sub-architectures within a single library build. This flexibility ensures portability without sacrificing performance in heterogeneous environments.

Features and Capabilities

BLAS and LAPACK Routines

OpenBLAS implements the full suite of Basic Linear Algebra Subprograms (BLAS) routines, with optimizations for performance-critical operations across various hardware architectures. These routines are available in single precision (S prefix), double precision (D prefix), complex (C prefix), and double complex (Z prefix), along with select extended-precision variants, ensuring compatibility with the BLAS specification for dense linear algebra computations. BLAS Level 1 routines focus on vector operations, including scaled vector additions like DAXPY, which updates a vector as y \leftarrow \alpha x + y, and inner products such as SDOT or CDOTC for computing x^T y or x^H y. Other examples encompass vector scaling (DSCAL), absolute-value sums (DASUM), and index-finding operations (IDAMAX), all designed for efficient handling of one-dimensional arrays. BLAS Level 2 routines handle matrix-vector multiplications and related operations, such as DGEMV for general matrix-vector products y \leftarrow \alpha A x + \beta y and DTRSV for triangular solves. These include support for symmetric (DSYMV), Hermitian (CHEMV), and banded matrices, enabling operations on structured data without full dense storage. BLAS Level 3 routines address matrix-matrix operations, exemplified by DGEMM for general dense multiplications C \leftarrow \alpha A B + \beta C, and DSYMM for symmetric matrix products. Additional capabilities cover triangular solves (DTRSM), rank updates (DSYR2K), and banded variants, facilitating high-throughput computations on larger matrices. OpenBLAS optimizes these for multi-core execution where beneficial. OpenBLAS incorporates a complete LAPACK library by default, bundling the reference implementation from Netlib alongside hand-tuned versions of core routines for enhanced efficiency. Key categories include solvers like DGESV for general dense systems A x = b, leveraging LU factorization. Eigenvalue problems are supported via routines such as DSYEV for symmetric matrices, computing eigenvalues and, optionally, eigenvectors.
Singular value decomposition is handled by DGESVD, yielding A = U \Sigma V^T, and least-squares fitting by DGELS, which solves over- or under-determined systems using QR or LQ factorizations. This extensive coverage supports a wide range of numerical applications without requiring separate installations.

Threading and Parallelism

OpenBLAS uses multi-threading to parallelize compute-intensive operations, building on pthreads by default or OpenMP when compiled with USE_OPENMP=1, enabling efficient execution on multi-core processors. This approach is primarily applied to BLAS Level 3 routines, such as general matrix multiplication (GEMM), and certain LAPACK routines like LU factorization and eigenvalue solvers that exhibit good scalability under parallel execution. By default, the library is configured to scale across up to 256 cores, balancing performance and resource utilization on typical systems. The number of threads employed by OpenBLAS can be adjusted at runtime using environment variables. The primary variable, OPENBLAS_NUM_THREADS, allows users to specify the maximum thread count for parallel regions, overriding the default detection based on available cores. For compatibility with legacy applications derived from GotoBLAS, the GOTO_NUM_THREADS variable serves a similar purpose, ensuring seamless migration without modifications. For deployments on large NUMA architectures, OpenBLAS offers an experimental BIGNUMA mode, activated during compilation with the BIGNUMA=1 flag, which extends support to systems with up to 1024 CPUs and 128 NUMA nodes. In this mode, the library implements thread affinity binding—configurable via the NO_AFFINITY option in the build rules—to pin threads to specific cores or NUMA domains, thereby reducing costly inter-socket memory transfers and improving overall efficiency. Parallelism in core routines like GEMM follows a task-partitioning strategy inspired by the GotoBLAS framework, where the output matrix is divided into row or column blocks assigned to individual threads for concurrent updates. This blocking scheme optimizes cache locality and minimizes synchronization overhead. For routines with irregular computational loads, such as certain decompositions, OpenBLAS incorporates dynamic load balancing to distribute work adaptively across threads, preventing idle time and enhancing throughput on heterogeneous workloads.

Dynamic Architecture Detection

OpenBLAS incorporates dynamic architecture detection to adapt to varying CPU capabilities, primarily through the DYNAMIC_ARCH=1 build option during compilation. This option compiles the library with multiple optimized code paths tailored to different processor features, allowing a single binary to function across a range of compatible CPUs without requiring separate builds for each target architecture. Recent releases, such as version 0.3.30 (June 2025), have added support for new processors like AmpereOne and the Apple M4, enhancing compatibility. The mechanism relies on runtime CPU feature detection, which on x86 platforms utilizes the CPUID instruction to query the processor's supported instruction sets, such as SSE, AVX, AVX2, or FMA. Equivalent detection methods are employed on other supported architectures, such as reading auxiliary vectors or feature registers on ARM. Once detected, a dispatcher—implemented in core files such as dynamic.c—evaluates these capabilities either at library load time or upon the first invocation of a routine, then routes calls to the corresponding optimized kernels. For example, if AVX2 support is confirmed, the dispatcher selects AVX2-accelerated implementations for operations like GEMM; otherwise, it falls back to older SIMD paths or a generic scalar version to ensure compatibility. This selection is performed only once to avoid repeated overhead. The key advantages of this approach include streamlined distribution, as a single library file can serve multiple CPU generations, thereby minimizing package size and maintenance effort for users and distributors. The initialization overhead is negligible in practice and amortized over subsequent computations. This runtime flexibility enhances portability while preserving performance close to statically optimized builds on supported hardware. Despite these benefits, dynamic architecture detection has limitations and is not universally supported.
It requires a build environment capable of generating assembly for all targeted kernels, which can complicate compilation on older or restricted systems. Furthermore, it is unavailable or suboptimal on certain architectures, such as embedded ARM devices, where static targeting is preferred to avoid detection overhead and ensure deterministic behavior in resource-constrained environments.

Installation and Usage

Building from Source

To build OpenBLAS from source, a system with Make, a C compiler such as GCC or Clang, and a Fortran compiler like gfortran are required as prerequisites; OpenMP support is optional for enabling that threading model. The source code can be obtained by cloning the official repository with Git: git clone https://github.com/OpenMathLib/OpenBLAS.git. For the latest development features, navigate to the cloned directory and switch to the develop branch with git checkout develop. Configuration is handled via Make variables passed to the make command, allowing customization for static or dynamic libraries, architecture targeting, and compiler selection. For a dynamic shared library build with generic architecture support and Fortran enabled, use make FC=gfortran NO_SHARED=0 DYNAMIC_ARCH=1 TARGET=GENERIC. To produce a static library instead, set NO_SHARED=1. For builds without Fortran support (C-only, excluding LAPACK routines), specify NO_FORTRAN=1. OpenMP threading can be enabled with USE_OPENMP=1 if the compiler supports it. The TARGET option specifies the CPU architecture (e.g., TARGET=NEHALEM for Intel Nehalem; see the repository's TargetList.txt for options), while DYNAMIC_ARCH=1 allows runtime detection of multiple architectures within a single binary. Once configured, run make (optionally with -j for parallel jobs based on CPU cores) to compile the library, which generates the necessary object files and archives. Installation follows with make PREFIX=/usr/local install, where PREFIX sets the target directory (this requires appropriate permissions, such as root, for system locations); the step copies the libraries, headers, and binaries to the specified location. Common troubleshooting issues include architecture mismatches, resolved by explicitly setting TARGET or verifying CPU detection in the build output; missing dependencies like a Fortran compiler, addressed by installing gfortran or using NO_FORTRAN=1; and build failures due to incompatible flags, such as conflicting threading settings, which can be diagnosed from the build log.
LAPACK routines are included by default, as OpenBLAS bundles its own implementation; they can be excluded with NO_LAPACK=1 when only BLAS is needed.

Binary Distributions

OpenBLAS provides official pre-compiled binary distributions for major platforms, available through GitHub Releases. These binaries are updated with each major release, such as version 0.3.30 released on June 19, 2025, and include Windows builds for architectures like x86 and x86_64. On macOS, OpenBLAS can be installed via Homebrew with brew install openblas or via MacPorts with sudo port install OpenBLAS. For Linux distributions, OpenBLAS can be installed via system package managers. On Debian and Ubuntu, users can install the development package using sudo apt install libopenblas-dev, which provides both BLAS and LAPACK routines. On RHEL and derivatives like CentOS, the command sudo dnf install openblas-devel (or yum as an alias) installs the library from the EPEL repository. Additionally, the Conda package manager supports OpenBLAS through the conda-forge channel with conda install openblas, offering cross-platform compatibility for scientific computing environments. Binary variants include static libraries (.a files) for static linking and shared libraries (.so on Linux, .dylib on macOS, .dll on Windows) for dynamic linking, with builds compiled with or without OpenMP support to suit different application needs. To ensure integrity, official binaries on GitHub Releases include SHA256 checksums for verification, allowing users to validate downloads against the provided hashes. Compatibility notes recommend checking glibc versions for Linux binaries, as pre-compiled packages typically target glibc 2.17 or later to maintain broad system support.

Integration with Software

OpenBLAS is typically linked to C, C++, and Fortran programs using the linker flag -lopenblas. For programs compiled with GCC or similar compilers, this can be specified during the linking step, such as gcc -o program program.c -lopenblas or gfortran -o program program.f -lopenblas. If OpenBLAS is installed outside the standard library paths, the -L flag is used to specify the directory containing the library, for example, g++ -o program program.cpp -L/opt/openblas/lib -lopenblas. In Python environments, OpenBLAS integrates with NumPy by ensuring the library is discoverable at runtime; this is achieved by adding the directory containing libopenblas.so to the LD_LIBRARY_PATH environment variable, allowing NumPy's linear algebra operations to utilize OpenBLAS automatically if it is the detected BLAS provider. For Python integrations, OpenBLAS is commonly used through packages like NumPy and SciPy, which can be installed via conda from the conda-forge channel: conda install numpy scipy (these include OpenBLAS as a dependency). In Julia, the LinearAlgebra standard library defaults to OpenBLAS as the underlying BLAS and LAPACK implementation, requiring no additional configuration for basic usage. Threading in OpenBLAS is configured via the OPENBLAS_NUM_THREADS environment variable, which should be exported before launching the application to set the maximum number of threads for parallel operations, such as export OPENBLAS_NUM_THREADS=4. To prevent thread oversubscription when combining OpenBLAS with other parallelized libraries in the same process, set OPENBLAS_NUM_THREADS=1 to disable its internal multithreading. The following C code snippet demonstrates basic integration by calling the cblas_dgemm routine for double-precision matrix multiplication, including memory allocation checks for robustness:
```c
#include <stdio.h>
#include <stdlib.h>
#include <cblas.h>

int main() {
    const int m = 3, n = 3, k = 2;
    double *A = malloc(m * k * sizeof(double));
    double *B = malloc(k * n * sizeof(double));
    double *C = malloc(m * n * sizeof(double));

    if (!A || !B || !C) {
        fprintf(stderr, "Memory allocation failed\n");
        free(A); free(B); free(C);
        return 1;
    }

    // Initialize matrices (example values)
    A[0] = 1.0; A[1] = 2.0; A[2] = 1.0;
    A[3] = 4.0; A[4] = 3.0; A[5] = 1.0;
    B[0] = 1.0; B[1] = 2.0; B[2] = 5.0;
    B[3] = 1.0; B[4] = 0.0; B[5] = 3.0;

    // Perform C = A * B^T (column-major, no transpose for A, transpose for B)
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasTrans, m, n, k, 1.0, A, m, B, n, 0.0, C, m);

    // Output result
    for (int i = 0; i < m * n; i++) {
        printf("%f ", C[i]);
    }
    printf("\n");

    free(A); free(B); free(C);
    return 0;
}
```
This example assumes column-major storage and computes a 3x3 result from a 3x2 matrix A and the transpose of a 3x2 matrix B; compile with gcc -o example example.c -lopenblas -lm. OpenBLAS supports the full range of BLAS and LAPACK routines, including cblas_dgemm for general matrix multiplication.

Performance

Benchmark Results

OpenBLAS performance in the High-Performance Linpack (HPL) benchmark, which relies on optimized dense linear algebra routines to solve systems of equations, demonstrates high efficiency on multi-core x86 systems. OpenBLAS typically achieves a significant portion of theoretical peak FLOPS, benefiting from a tuned DGEMM that maximizes cache and SIMD utilization. On AMD EPYC processors, for instance, HPL configurations with OpenBLAS deliver near-peak performance in multi-socket setups, with DGEMM throughput scaling effectively across cores to support these efficiencies. Standard Netlib BLAS timing tests underscore OpenBLAS's substantial speedups over the unoptimized reference implementation. In particular, the DGEMM routine for double-precision general matrix multiplication shows improvements of 10x or more on 16-core x86 CPUs for large matrices, driven by assembly-optimized loops and multi-threaded parallelism that exploit SIMD instructions and cache blocking. These gains are evident in representative tests where OpenBLAS outperforms the Netlib baseline by leveraging hardware-specific tuning without altering the API. As of version 0.3.26 (January 2024), OpenBLAS introduced targeted enhancements for small matrix operations (n < 128), significantly reducing computational overhead in scenarios like iterative solvers that involve repeated low-dimensional linear algebra steps. The updates include faster ?GESV performance for small problem sizes by incorporating fixes from reference LAPACK and refining kernel selection to avoid unnecessary blocking, leading to measurable reductions in runtime for applications dominated by such workloads. Later versions, such as 0.3.28 (August 2024), introduced POWER10 optimizations and improved handling of special floating-point values, further enhancing performance on newer architectures. OpenBLAS exhibits strong scalability up to 256 cores on x86 systems, maintaining high efficiency through dynamic load balancing in its threading runtime.
On larger NUMA configurations, however, scalability can drop beyond 64 cores due to remote memory access latencies; the experimental BIGNUMA build option mitigates this by supporting up to 1024 cores and 128 NUMA nodes, enabling better thread affinity and reduced inter-node overhead.

Comparisons with Other Libraries

OpenBLAS often outperforms the Intel oneAPI Math Kernel Library (oneMKL) on non-Intel hardware, such as AMD processors, for key operations like general matrix multiplication (GEMM). On AMD systems, OpenBLAS has been observed to outperform oneMKL in certain matrix computations, with execution times as low as 40% of oneMKL's in benchmarks involving large datasets. However, on hardware supporting AVX-512 instructions, oneMKL has shown a significant performance advantage over older versions of OpenBLAS due to its kernels and threading tailored for Intel architectures; recent OpenBLAS versions have narrowed this gap. Compared to ATLAS, OpenBLAS offers superior multi-threading capabilities and broader architecture support, including more efficient handling of modern CPU features across x86 and ARM platforms. While ATLAS employs auto-tuning during compilation to adapt to specific hardware, it generally lags behind OpenBLAS on modern processors, where OpenBLAS provides better performance in linear algebra routines thanks to pre-optimized kernels. OpenBLAS significantly outperforms the reference BLAS implementation from Netlib across various routines, providing 20-100 times faster execution, particularly in Level 3 operations like GEMM, thanks to its advanced blocking and cache optimizations. In terms of trade-offs, OpenBLAS excels in portability and is freely available under a permissive license, making it ideal for diverse hardware environments without licensing costs. In contrast, oneMKL provides superior integration for GPU offloading through oneAPI and OpenMP directives, a feature not natively supported in OpenBLAS, which remains primarily CPU-oriented.

Licensing and Community

License Terms

OpenBLAS is licensed under the BSD 3-Clause License, a permissive license that permits commercial use, modification, and redistribution in source or binary forms, subject to the preservation of the original copyright notice, license conditions, and disclaimer. This licensing approach is directly inherited from the BSD version of GotoBLAS2 1.13, on which OpenBLAS is based, and imposes no copyleft obligations, allowing integration into proprietary software without requiring the release of derivative works as open source, in contrast to GPL-licensed alternatives. Key clauses include requirements for attribution to original authors such as Kazushige Goto and Xianyi Zhang by retaining the copyright notice in source code redistributions and reproducing it in accompanying documentation for binary distributions. The license also features a comprehensive warranty disclaimer, providing the software "as is" without any express or implied warranties, including those of merchantability or fitness for a particular purpose, and absolves contributors from liability for damages arising from its use, which encompasses potential issues with numerical accuracy in computations. For distributions, both source code and binaries must include the full LICENSE file or an equivalent notice to ensure compliance, while the endorsement clause prohibits using the names of the OpenBLAS project or its contributors to promote derived products without prior written permission.

Development Community

OpenBLAS is maintained under the OpenMathLib organization on GitHub, where the project's governance is coordinated through collaborative development practices. Key developers include Zhang Xianyi, Wang Qian, and Werner Saar, with contributions from a diverse group ensuring optimizations across hardware platforms. Contributions to OpenBLAS follow a standard open-source workflow: developers fork the repository, implement changes with accompanying tests, and submit pull requests targeting the develop branch for integration. Automated testing is enforced via Cirrus CI for cross-platform builds and Azure Pipelines for validation on pull requests, helping maintain code quality and compatibility. Issues and bug reports are tracked directly on the project's GitHub issue tracker, facilitating transparent community-driven issue resolution. The community engages actively through dedicated mailing lists, including the users list at [email protected] for discussions on usage and troubleshooting, and the developers list at [email protected] for technical contributions and coordination. Releases are volunteer-driven, reflecting ongoing community maintenance since the project's start in 2011, with over 100 contributors participating in enhancements and bug fixes. Funding supports this work through public donations encouraged via the project's website, along with institutional backing such as hosting and CI resources from the Oregon State University Open Source Lab (OSUOSL). Additional grants, including one from the Chan Zuckerberg Initiative in 2019 and another from the Sovereign Tech Agency (€263,000 in 2023-2024), have bolstered maintenance efforts.
