
SHA instruction set

The SHA instruction set refers to a collection of hardware extensions integrated into modern processor architectures, such as x86 and ARM, designed to accelerate the execution of Secure Hash Algorithm (SHA) cryptographic functions through dedicated instructions. These extensions primarily target the SHA-1 and SHA-2 family algorithms, including SHA-224 and SHA-256, with recent additions supporting SHA-512. By offloading complex hashing operations from general-purpose computation to specialized processing units, they improve throughput, reduce latency, and lower power consumption in applications like data integrity verification, digital signatures, and secure communications. In the x86 architecture, Intel introduced the SHA Extensions in July 2013 as part of the Streaming SIMD Extensions (SSE) framework, comprising seven instructions: four for SHA-1 (SHA1MSG1, SHA1MSG2, SHA1RNDS4, and SHA1NEXTE) that handle message scheduling and round computations, and three for SHA-256 (SHA256MSG1, SHA256MSG2, and SHA256RNDS2) focused on similar acceleration steps. These instructions enable hashing several times faster than software implementations using scalar operations, making them essential for high-volume cryptographic workloads in servers and endpoints. AMD later adopted the same extensions in its Zen processors starting in 2017, ensuring broad compatibility across x86 ecosystems. For ARM architectures, the Cryptography Extensions, introduced with the ARMv8-A instruction set in 2011 and optionally implemented in cores like the Cortex-A53 and later models, add Advanced SIMD instructions supporting SHA-1, SHA-224, and SHA-256 operations, such as SHA1C, SHA1M, SHA1P, SHA1SU0, SHA256H, and SHA256SU1, with SHA-512 support added in ARMv8.2-A. These enable efficient execution of hash rounds and message expansions, particularly beneficial in mobile and embedded devices where power efficiency is critical, and are widely used in smartphones and embedded hardware to bolster security protocols.

Introduction

Purpose and Benefits

The SHA instruction set refers to a collection of hardware extensions integrated into modern processor architectures such as x86 and ARM, designed to accelerate computations for the Secure Hash Algorithm (SHA) family, including SHA-1 and SHA-256, by minimizing the overhead associated with software-based hash calculations. These extensions enable processors to perform hashing operations more efficiently, offloading repetitive and computationally intensive tasks from software routines to specialized hardware. Key benefits include substantial performance gains, with up to 4x speedup for SHA-1 and improvements in a similar 4-5x range for SHA-256 compared to general-purpose software implementations on x86 processors. Additionally, the extensions enhance energy efficiency, particularly in embedded systems and high-volume server environments, by reducing the CPU cycles and power required for hashing tasks. This results in faster processing for critical operations, such as expediting TLS/SSL handshakes and improving throughput for integrity verification in large-scale applications. ARM implementations provide similar accelerations for SHA-1, SHA-224, and SHA-256, optimizing for power-sensitive devices like smartphones and embedded hardware. Conceptually, these instructions streamline SHA processing by dedicating hardware paths to complex elements like message scheduling and multiple rounds of transformation, bypassing the limitations of general-purpose arithmetic logic units (ALUs). For instance, they process four rounds of SHA-1 or two rounds of SHA-256 in a single operation using vector registers, which reduces instruction count and latency while preserving the exact outputs of the underlying algorithms. Target use cases span digital signatures for software authenticity relying on SHA-256, and secure communications protocols that demand rapid hash computations to protect data in transit.

Supported Algorithms

The instruction set provides hardware acceleration for variants of the Secure Hash Algorithm (SHA) family. In the x86 architecture, it supports SHA-1, SHA-256, and SHA-512. SHA-1 processes 512-bit message blocks to produce a 160-bit output, serving as a foundational but now legacy hashing mechanism. SHA-256 operates on 512-bit blocks to generate a 256-bit digest, forming the core of the SHA-2 family and enabling robust integrity checks. SHA-512 handles 1024-bit blocks for a 512-bit output, addressing demands for enhanced security in environments requiring longer hash lengths. In the ARM architecture, the Cryptography Extensions support SHA-1, SHA-224 (a truncated 224-bit variant of SHA-256), and SHA-256. These algorithms are accelerated to optimize performance in cryptographic applications such as Transport Layer Security (TLS), virtual private networks (VPNs), and blockchains, where efficient hashing is critical for data integrity and authentication. SHA-1 support maintains backward compatibility with existing protocols and systems, despite its deprecation for new digital signature generation due to demonstrated collision vulnerabilities. In contrast, SHA-256 and SHA-512 are prioritized for their resistance to collision attacks, providing stronger security assurances in modern protocols and high-stakes environments. The instructions map to key phases of these algorithms, streamlining message expansion and round computations without altering the core mathematical operations. For SHA-1 and SHA-256, dedicated instructions handle message scheduling (expanding input blocks into working arrays) and perform multiple rounds of state updates per invocation, reducing the computational overhead of sequential software loops. Similarly, SHA-512 instructions target its expanded message schedule and 80-round compression function, leveraging wider vector registers for efficient processing of its larger 64-bit words. Support for SHA-512 in x86 was introduced in 2024 as part of the VEX-encoded SHA512 extension, first available on Intel Core Ultra 200-series processors (Arrow Lake and Lunar Lake), to meet growing needs for high-throughput hashing in secure computing workloads.
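The block and digest sizes above can be checked with Python's hashlib, which defers to the platform's OpenSSL build and thus transparently uses whatever accelerated code path the CPU supports:

```python
import hashlib

# Block size (input chunk) and digest size for each SHA variant the
# extensions cover, as defined in FIPS 180-4.
for name, block_bits, digest_bits in [
    ("sha1", 512, 160),
    ("sha224", 512, 224),
    ("sha256", 512, 256),
    ("sha512", 1024, 512),
]:
    h = hashlib.new(name)
    assert h.block_size * 8 == block_bits
    assert h.digest_size * 8 == digest_bits

# The accelerated and software paths produce identical digests.
print(hashlib.sha256(b"abc").hexdigest())
# → ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad
```

Because the hardware instructions implement the standard algorithms exactly, the digest is the same regardless of which path the library selects.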

Historical Development

Origins in Cryptographic Standards

The SHA family originated from efforts by the National Institute of Standards and Technology (NIST) to establish standardized cryptographic hash functions for federal use. SHA-1 was first specified in Federal Information Processing Standard (FIPS) 180-1, published on April 17, 1995, as a revision of the initial SHA outlined in FIPS 180 from 1993, producing a 160-bit hash value designed for data integrity and digital signatures. The algorithm became a cornerstone of secure communications, but emerging cryptanalytic concerns, including theoretical collision vulnerabilities identified in the late 1990s and practical attacks demonstrated in 2005, prompted NIST to develop more robust variants. In response, NIST introduced the SHA-2 family in FIPS 180-2, published on August 1, 2002, which included algorithms like SHA-256 (256-bit output) and SHA-512 (512-bit output) to address SHA-1's limitations by employing larger digest sizes and enhanced round functions for greater resistance to attacks. These updates were driven by the need for stronger primitives amid growing cryptographic demands, preserving compatibility with existing protocols while improving security margins against brute-force and collision-based threats. The SHA-2 suite quickly gained adoption in protocols requiring interoperability, such as TLS for secure web transactions. By the 2000s, software implementations of SHA algorithms increasingly became performance bottlenecks as demand surged, particularly with the exponential growth of encrypted web traffic and TLS connections requiring frequent hashing for session keys and certificates. General-purpose CPUs struggled to handle the computational intensity of these operations at scale, leading to latency issues in high-throughput environments. This challenge was exemplified by early hardware accelerations like Intel's Advanced Encryption Standard New Instructions (AES-NI), proposed in 2008 and released in 2010, which demonstrated the viability of dedicated instructions for symmetric cryptography and set the stage for similar optimizations in hashing.
Intel's development of the SHA Extensions (SHA-NI), introduced in 2013, was closely aligned with NIST's FIPS 180 guidelines to ensure precise implementation of SHA-1 and SHA-256 without altering the algorithmic outputs, thereby maintaining interoperability across standards-compliant systems. This alignment with NIST's cryptographic standards facilitated widespread adoption in hardware, allowing vendors to accelerate SHA computations while adhering to federal requirements for validated primitives.

Introduction and Evolution Timeline

The SHA instruction set, formally known as the Intel Secure Hash Algorithm Extensions (Intel SHA Extensions), was first specified by Intel in July 2013 as a collection of seven Streaming SIMD Extensions (SSE)-based instructions designed to accelerate the computation of the SHA-1 and SHA-256 hash functions on x86 processors. These instructions address the performance bottlenecks of software implementations of these algorithms, which are widely used in cryptographic protocols, digital signatures, and data integrity verification. By offloading key operations like message scheduling and round computations to dedicated hardware, the extensions provide significant performance improvements over scalar implementations on contemporary processors. Parallel to x86 developments, ARM introduced Cryptography Extensions supporting SHA-1, SHA-224, and SHA-256 as part of the ARMv8-A architecture in 2011, enabling early hardware acceleration in mobile and embedded systems. The initial hardware implementation of the SHA-1 and SHA-256 instructions appeared in Intel's Goldmont microarchitecture in 2016, targeting low-power Atom-derived processors such as the Pentium and Celeron lines for embedded and mobile applications. AMD followed with support in its Zen microarchitecture, introduced in 2017 with the Ryzen processor family, providing comparable acceleration for server and desktop workloads. Between 2020 and 2023, adoption expanded significantly across mainstream platforms, including AMD's Zen 3 (2020, used in the Ryzen 5000 series) and Zen 4 (2022, Ryzen 7000 series) architectures, as well as Intel's Alder Lake (2021, 12th Gen Core) hybrid design, which integrated the extensions into both performance and efficiency cores. This period marked a shift toward ubiquitous availability in consumer and enterprise CPUs, driven by growing demands for secure computing in cloud, IoT, and blockchain applications.
In 2024, Intel expanded the instruction set to include SHA-512 support through three new VEX-encoded instructions (VSHA512MSG1, VSHA512MSG2, and VSHA512RNDS2), enabling efficient 64-bit message processing and round updates for the longer-hash variant used in applications requiring higher security margins. These were first implemented in the Arrow Lake (Core Ultra 200S series) and Lunar Lake (Core Ultra 200V series) processors, with detection via the dedicated SHA512 feature bit (CPUID EAX=07H, ECX=1: EAX[bit 0]=1). As of November 2025, AMD has not introduced equivalent SHA-512 instructions in its architectures, leaving Intel as the sole provider of this extension. Ongoing discussions within the x86 ecosystem suggest potential future standardization of SHA-512 instructions in broader instruction set extensions to enhance cross-vendor compatibility.

x86 Extensions

Core Instruction Set

The SHA instruction set comprises seven base instructions (four dedicated to SHA-1 processing and three to SHA-256) that leverage 128-bit XMM registers to enable parallel computation across hash state elements, facilitating efficient execution of cryptographic hashing within the x86 architecture. These instructions integrate with the Streaming SIMD Extensions (SSE) framework, utilizing the full 128-bit width of XMM registers to handle dword-sized lanes of data, thereby supporting vectorized operations without requiring full SIMD parallelism across independent data streams. This design allows for the concurrent manipulation of multiple components of the hash state, optimizing throughput in compute-intensive hashing tasks. Instruction encoding follows a consistent two-operand format, where the first operand serves as both destination and source (typically an XMM register), and the second is a source operand that can be an XMM register or a 16-byte aligned memory location (denoted as xmm2/m128). An optional 8-bit immediate operand provides control over round-specific parameters, such as logic functions and constants, enabling flexible execution without additional register loads. The base SHA-1 and SHA-256 instructions use legacy SSE encodings only; no VEX-encoded forms are defined for them, so they coexist with AVX code as ordinary 128-bit SSE operations. The execution model emphasizes efficiency by processing multiple rounds of the hashing algorithm in a single instruction, for example executing four SHA-1 rounds per operation to substantially reduce iteration overhead in software loops. This batched approach minimizes branch instructions and register dependencies, aligning with the pipeline optimizations of modern x86 processors. Support for these instructions is determined through CPUID, specifically the SHA feature bit (bit 29) in EBX when EAX=7 and ECX=0, ensuring runtime detection of hardware availability. Additionally, the instructions operate fully in 64-bit mode, providing broad compatibility across Intel's IA-32e architecture.

SHA-1 Specific Instructions

The SHA-1 specific instructions in the Intel SHA Extensions consist of four dedicated operations that accelerate the compression function of the algorithm, which processes 512-bit message blocks through 80 rounds to produce a 160-bit hash value. These instructions operate on 128-bit XMM registers and leverage SIMD parallelism to handle multiple words simultaneously, focusing on message scheduling and round computations while integrating with the core framework for state management. The SHA1MSG1 instruction performs the first stage of message scheduling for SHA-1, computing intermediate values for four consecutive message schedule words (Wt) by XORing pairs of previous words from the source operands. It takes two 128-bit operands: the destination/source XMM register (xmm1) holding four words (W0 to W3), and another XMM register or memory location (xmm2/m128) providing the next two words (W4 and W5). The instruction extracts these words and computes DEST[127:96] = W2 XOR W0, DEST[95:64] = W3 XOR W1, DEST[63:32] = W4 XOR W2, and DEST[31:0] = W5 XOR W3, preparing partial terms for the full expansion formula Wt = (Wt-16 XOR Wt-14 XOR Wt-8 XOR Wt-3) ROL 1 used in rounds 16 to 79. Following SHA1MSG1, the SHA1MSG2 instruction completes the message scheduling by finalizing four message dwords, incorporating additional XORs and a 1-bit left rotation. It uses the intermediate results in xmm1 (from SHA1MSG1) and previous words from xmm2/m128 (providing W13 through W15). The computation XORs the intermediate values with these words, then rotates left by 1 bit: for example, W16 = (SRC1[127:96] XOR W13) ROL 1, with similar steps for W17 to W19, storing the results in xmm1. This produces the fully expanded schedule words needed for the compression rounds, enabling efficient preparation of the 512-bit block's 80 words. The SHA1RNDS4 instruction executes four sequential rounds of the compression function, updating the working variables A, B, C, and D based on the current state and precomputed message inputs. It operates on xmm1 (holding the initial A, B, C, D as 32-bit words) and xmm2/m128 (holding W0 + E, W1, W2, W3), with an 8-bit immediate (imm8[1:0]) selecting the round function f (e.g., f0 = (B AND C) OR (NOT B AND D)) and round constant K (e.g., K0 = 0x5A827999 for rounds 0-19). The pseudocode-like flow is: A1 = f(B0, C0, D0) + (A0 ROL 5) + W0 + E0 + K; B1 = A0; C1 = B0 ROL 30; D1 = C0; E1 = D0; followed by three more iterations using updated states and subsequent Wi + Ei, storing the final A4, B4, C4, D4 in xmm1. This covers groups of four rounds across the 80 total, with imm8 cycling through the four f/K phases. The SHA1NEXTE instruction computes the updated state variable E for the next set of four rounds, rotating the current A left by 30 bits and adding it to the first scheduled message word. It processes xmm1 (current A in bits [127:96], with the lower bits unused) and xmm2/m128 (holding four scheduled words W4 to W7), producing DEST[127:96] = W4 + (A ROL 30), while copying W5 to W7 unchanged into the lower 96 bits of the destination. This chains the state across SHA1RNDS4 invocations, ensuring E is prepared as the sum of the rotated previous A and the next Wt for input to the following rounds. These instructions are interdependent in processing a full SHA-1 block: SHA1MSG1 and SHA1MSG2 first expand the initial 16 words into the full 80-word schedule in a pipelined manner (applying MSG1 to overlapping windows, XORing in the Wt-8 terms, then applying MSG2 to finalize), followed by a loop of 20 iterations in which SHA1RNDS4 performs the rounds and SHA1NEXTE updates E for each group, ultimately yielding the updated hash state after 80 rounds.
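The expansion formula and its split across SHA1MSG1, an intervening XOR, and SHA1MSG2 can be modeled in Python. This is a scalar sketch of the dataflow (not the exact four-lane register layout), showing that the two-step decomposition produces the same schedule as the direct FIPS 180-4 recurrence:

```python
def rol32(x, n):
    """Rotate a 32-bit value left by n bits."""
    return ((x << n) | (x >> (32 - n))) & 0xFFFFFFFF

def expand_direct(w):
    # Direct recurrence: Wt = (Wt-3 ^ Wt-8 ^ Wt-14 ^ Wt-16) ROL 1
    w = list(w)
    for t in range(16, 80):
        w.append(rol32(w[t - 3] ^ w[t - 8] ^ w[t - 14] ^ w[t - 16], 1))
    return w

def expand_split(w):
    # Hardware-style split, four words per group:
    #   SHA1MSG1 yields Wt-16 ^ Wt-14; a plain XOR folds in Wt-8;
    #   SHA1MSG2 XORs in Wt-3 and rotates, resolving the in-group
    #   dependency of the fourth word on the group's first result.
    w = list(w)
    for t in range(16, 80, 4):
        partial = [w[t - 16 + i] ^ w[t - 14 + i] for i in range(4)]   # SHA1MSG1
        partial = [p ^ w[t - 8 + i] for i, p in enumerate(partial)]   # PXOR step
        group = []
        for i in range(4):                                            # SHA1MSG2
            wt3 = w[t + i - 3] if i < 3 else group[0]
            group.append(rol32(partial[i] ^ wt3, 1))
        w.extend(group)
    return w

block = [(0x01234567 * (i + 3)) & 0xFFFFFFFF for i in range(16)]
assert expand_direct(block) == expand_split(block)
```

The equality holds for any input block, since each modeled word expands to the same four-term XOR as the direct formula.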

SHA-256 Specific Instructions

The SHA-256 specific instructions in the Intel SHA Extensions consist of three dedicated operations: SHA256RNDS2 for performing compression rounds, and SHA256MSG1 and SHA256MSG2 for efficient message schedule expansion. These instructions operate on 128-bit XMM registers to process the SHA-256 algorithm's 512-bit input blocks, leveraging the Streaming SIMD Extensions (SSE) framework to parallelize computations that would otherwise be sequential in software implementations. By handling multiple words simultaneously, they significantly reduce the instruction count required for hash computation, achieving roughly 2-3x gains over optimized scalar code on supported processors. The SHA256MSG1 instruction computes partial terms for the message schedule by adding four consecutive message words to the results of the sigma0 function applied to the words that follow them. It takes two XMM operands, where the first holds the current group of message words and the second provides the inputs for the sigma computations, storing the four 32-bit partial sums (W_{t-16} + σ0(W_{t-15}), and so on) in the destination register. This begins the expansion of the 16 initial words of the 512-bit block toward the full 64-word schedule required for all rounds. The instruction's design exploits the repetitive nature of SHA-256's message expansion, processing four dwords in parallel to minimize loop overhead. Complementing SHA256MSG1, the SHA256MSG2 instruction finalizes the message schedule by merging the partial sums from previous steps with the remaining sigma1 terms, completing the expanded words W_t = W_{t-16} + σ0(W_{t-15}) + W_{t-7} + σ1(W_{t-2}). It operates on two XMM registers, adding σ1(W_{t-2}) values to the accumulated partial results (the SHA256MSG1 output combined with the W_{t-7} terms via ordinary vector additions). This pairwise approach, alternating SHA256MSG1 and SHA256MSG2 across the schedule, generates the entire 64-word message schedule efficiently before round computations begin. The SHA256RNDS2 instruction executes two consecutive SHA-256 compression rounds simultaneously, updating the eight 32-bit working variables (A through H), which are split across two XMM registers: one holding A/B/E/F and the other C/D/G/H. It takes two source XMM registers (containing the current state pairs) and an implicit third operand in XMM0 (holding the words W_t plus round constants K_t for the current and next round), computing the additions, rotations, XORs, and majority/choose functions defined by the algorithm to produce the updated state. For instance, it calculates T1 = H + Σ1(E) + Ch(E,F,G) + K_t + W_t and T2 = Σ0(A) + Maj(A,B,C), then shifts the variables accordingly for both rounds in one operation. This double-round processing is invoked 32 times per 512-bit block to cover all 64 rounds, interleaved with message scheduling to maintain data flow, resulting in a compact loop that ping-pongs between the two state register pairs. The instruction's efficiency stems from executing two dependent rounds in a single operation, reducing the per-round instruction footprint by approximately 50% compared to emulating rounds with general-purpose operations. In typical usage, processing a 512-bit block begins by loading the initial 16 message words into XMM registers, then alternating SHA256MSG1 and SHA256MSG2 (with intervening additions) to expand to 64 words stored across multiple registers. Concurrently, SHA256RNDS2 is executed 32 times, loading the scheduled words and constants into XMM0 for each pair of rounds and updating the hash state (initialized from the previous block's digest) in a ping-pong fashion between the two XMM state pairs to limit data dependencies. After 64 rounds, the final state is added to the initial values to produce the output digest. This interleaved flow keeps pipeline stalls to a minimum, with the extensions enabling single-instruction handling of the complex operations that dominate SHA-256's computational cost.
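The schedule recurrence and the two-rounds-per-instruction structure can be modeled in pure Python. This scalar sketch uses the same double-round loop shape as SHA256RNDS2 (each outer iteration corresponds to one instruction fed W+K through XMM0) and is checked against hashlib for a single-block message:

```python
import hashlib
import struct

# SHA-256 round constants and initial hash values from FIPS 180-4.
K = [
    0x428a2f98, 0x71374491, 0xb5c0fbcf, 0xe9b5dba5, 0x3956c25b, 0x59f111f1,
    0x923f82a4, 0xab1c5ed5, 0xd807aa98, 0x12835b01, 0x243185be, 0x550c7dc3,
    0x72be5d74, 0x80deb1fe, 0x9bdc06a7, 0xc19bf174, 0xe49b69c1, 0xefbe4786,
    0x0fc19dc6, 0x240ca1cc, 0x2de92c6f, 0x4a7484aa, 0x5cb0a9dc, 0x76f988da,
    0x983e5152, 0xa831c66d, 0xb00327c8, 0xbf597fc7, 0xc6e00bf3, 0xd5a79147,
    0x06ca6351, 0x14292967, 0x27b70a85, 0x2e1b2138, 0x4d2c6dfc, 0x53380d13,
    0x650a7354, 0x766a0abb, 0x81c2c92e, 0x92722c85, 0xa2bfe8a1, 0xa81a664b,
    0xc24b8b70, 0xc76c51a3, 0xd192e819, 0xd6990624, 0xf40e3585, 0x106aa070,
    0x19a4c116, 0x1e376c08, 0x2748774c, 0x34b0bcb5, 0x391c0cb3, 0x4ed8aa4a,
    0x5b9cca4f, 0x682e6ff3, 0x748f82ee, 0x78a5636f, 0x84c87814, 0x8cc70208,
    0x90befffa, 0xa4506ceb, 0xbef9a3f7, 0xc67178f2,
]
H0 = [0x6a09e667, 0xbb67ae85, 0x3c6ef372, 0xa54ff53a,
      0x510e527f, 0x9b05688c, 0x1f83d9ab, 0x5be0cd19]
MASK = 0xFFFFFFFF

def rotr(x, n):
    return ((x >> n) | (x << (32 - n))) & MASK

def sha256_block(state, block):
    # Message schedule: Wt = Wt-16 + s0(Wt-15) + Wt-7 + s1(Wt-2).
    w = list(struct.unpack(">16I", block))
    for t in range(16, 64):
        s0 = rotr(w[t - 15], 7) ^ rotr(w[t - 15], 18) ^ (w[t - 15] >> 3)
        s1 = rotr(w[t - 2], 17) ^ rotr(w[t - 2], 19) ^ (w[t - 2] >> 10)
        w.append((w[t - 16] + s0 + w[t - 7] + s1) & MASK)
    a, b, c, d, e, f, g, h = state
    # Two rounds per outer iteration, mirroring one SHA256RNDS2 call.
    for t in range(0, 64, 2):
        for i in (t, t + 1):
            t1 = (h + (rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25))
                  + ((e & f) ^ (~e & g)) + K[i] + w[i]) & MASK
            t2 = ((rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22))
                  + ((a & b) ^ (a & c) ^ (b & c))) & MASK
            h, g, f, e, d, c, b, a = (g, f, e, (d + t1) & MASK,
                                      c, b, a, (t1 + t2) & MASK)
    return [(s + v) & MASK for s, v in zip(state, [a, b, c, d, e, f, g, h])]

# "abc" padded to a single 512-bit block per FIPS 180-4 (0x80, zeros,
# then the 64-bit big-endian bit length, 24).
padded = b"abc" + b"\x80" + b"\x00" * 52 + struct.pack(">Q", 24)
digest = b"".join(struct.pack(">I", v) for v in sha256_block(H0, padded))
assert digest == hashlib.sha256(b"abc").digest()
```

The hardware version replaces the inner arithmetic with one SHA256RNDS2 per pair of rounds, but the state transitions and the final digest are identical.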

SHA-512 Specific Instructions

The SHA-512 specific instructions, introduced by Intel in 2024 as part of the SHA512 extension, provide hardware acceleration for the SHA-512 algorithm, which processes 1024-bit message blocks over 80 rounds using 64-bit words. These instructions build on the vectorized approach of the earlier SHA extensions but scale to SHA-512's larger word and state sizes, enabling high-throughput hashing suitable for demanding applications such as cryptographic operations in post-quantum contexts. Unlike the 32-bit focused SHA-256 instructions, they are VEX-encoded and operate on 256-bit YMM registers to handle 64-bit arithmetic, with each register holding four 64-bit qwords. The core computation instruction, VSHA512RNDS2, performs two rounds of the SHA-512 compression function per execution, updating the hash state using two consecutive message schedule words combined with their round constants. It operates on YMM registers holding the two halves of the eight-qword working state, with an additional XMM operand supplying the current pair of schedule words, effectively advancing the compression function by two rounds per invocation. This design allows efficient traversal of SHA-512's 80 rounds while leveraging AVX's 256-bit registers for the algorithm's 64-bit state elements. The instructions require AVX support and are available starting with processors such as Arrow Lake-S. Message scheduling is handled by VSHA512MSG1 and VSHA512MSG2, which accelerate the expansion of the 1024-bit input block into the 80 required schedule words using SHA-512's sigma functions. VSHA512MSG1 computes partial terms for four consecutive 64-bit words, applying the σ0 function to inputs held in YMM registers; VSHA512MSG2 completes the expansion by folding in the σ1 terms and remaining earlier words, producing the finished schedule words W_t = W_{t-16} + σ0(W_{t-15}) + W_{t-7} + σ1(W_{t-2}). These instructions likewise operate on 256-bit YMM vectors, enabling four-way parallel scheduling. Together, the SHA-512 instructions differ from prior SHA accelerations by addressing the algorithm's doubled block size, word width, and round count, relying on AVX's wider registers for viable performance gains in 64-bit operations. They facilitate optimized implementations in software libraries, reducing cycles per byte for SHA-512 computations on supported hardware.
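The SHA-512 schedule recurrence and its MSG1/MSG2-style split can be sketched in Python. This is a scalar model of the dataflow only (the hardware processes four qwords per YMM register, and the exact lane layout differs); the point is that the partial-sum-then-complete decomposition equals the direct recurrence:

```python
M64 = (1 << 64) - 1

def rotr64(x, n):
    return ((x >> n) | (x << (64 - n))) & M64

# SHA-512 small sigma functions from FIPS 180-4.
def sigma0(x):
    return rotr64(x, 1) ^ rotr64(x, 8) ^ (x >> 7)

def sigma1(x):
    return rotr64(x, 19) ^ rotr64(x, 61) ^ (x >> 6)

def expand_direct(w):
    # Wt = Wt-16 + sigma0(Wt-15) + Wt-7 + sigma1(Wt-2) mod 2^64
    w = list(w)
    for t in range(16, 80):
        w.append((w[t - 16] + sigma0(w[t - 15])
                  + w[t - 7] + sigma1(w[t - 2])) & M64)
    return w

def expand_split(w):
    # Two-step model: a VSHA512MSG1-style step forms the partial sum
    # Wt-16 + sigma0(Wt-15); a VSHA512MSG2-style step folds in Wt-7
    # and sigma1(Wt-2) to finish each schedule word.
    w = list(w)
    for t in range(16, 80):
        partial = (w[t - 16] + sigma0(w[t - 15])) & M64
        w.append((partial + w[t - 7] + sigma1(w[t - 2])) & M64)
    return w

block = [(0x0123456789abcdef * (i + 1)) & M64 for i in range(16)]
assert expand_direct(block) == expand_split(block)
```

Because 64-bit modular addition is associative, splitting the sum across two instructions cannot change the resulting schedule.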

Processor Implementations

Intel Support

Intel introduced the SHA Extensions, a set of instructions for accelerating SHA-1 and SHA-256 hashing, in a specification published in July 2013. The first implementation appeared in the Goldmont microarchitecture for low-power processors, such as those in the Apollo Lake family, released in 2016. Support expanded to mainstream client processors starting with Cannon Lake in 2018, followed by broader adoption in subsequent generations including Ice Lake (2019), Tiger Lake (2020), Alder Lake (2021), and Raptor Lake (2022). In 2024, Intel added dedicated instructions for SHA-512 acceleration, including VSHA512MSG1, VSHA512MSG2, and VSHA512RNDS2, in the Arrow Lake microarchitecture for desktop processors and the Lunar Lake microarchitecture for mobile processors. These extensions build on prior SHA support by providing vectorized operations on 64-bit words, enabling efficient processing of the longer hash outputs. The instructions are decoded into micro-operations and executed on the processor's vector execution units, typically ports 0, 1, or 5 depending on the microarchitecture. For example, on Ice Lake cores, the SHA256RNDS2 instruction exhibits a latency of a few cycles (roughly 4-6) and can be pipelined so that a new instruction issues every few cycles; Goldmont shows a latency of about 6 cycles with similar pipelined behavior. These characteristics allow for balanced performance in hash computation loops without significant pipeline stalls. Software can detect SHA Extensions support via the CPUID instruction: with EAX=7 and ECX=0, check whether EBX bit 29 is set. The SHA-512 instructions in Arrow Lake and Lunar Lake are enumerated separately, via CPUID leaf 7, subleaf 1, with EAX bit 0 set.

AMD Support

AMD's support for the SHA instruction set commenced with the Zen microarchitecture, introduced in 2017 through the Ryzen consumer processors and EPYC server processors. This marked the first full implementation of the SHA-1 and SHA-256 extensions in AMD's x86 lineup, providing hardware acceleration for these cryptographic hash algorithms. Earlier architectures, such as Bulldozer (2011) and Excavator (2015 APUs), did not support SHA instructions. As of 2025, SHA instructions are universally available across all Zen-based processors, including Zen 2 (2019) and subsequent generations such as Zen 3, Zen 4, and Zen 5. This encompasses both client-side processors and server-oriented lines, ensuring broad deployment in desktops, laptops, and data centers. Notably, AMD has not implemented the SHA-512 extensions, unlike Intel's most recent architectures, limiting AMD's hardware acceleration to the original SHA-1 and SHA-256 instruction set. The SHA instructions in AMD processors use the standard x86 encoding defined in the Intel SHA Extensions, promoting interoperability. Performance characteristics align closely with Intel's implementations, with representative latencies such as 6 cycles for the SHA1RNDS4 instruction on Zen cores. Software detection relies on the same mechanism as on Intel, specifically checking bit 29 in EBX from CPUID function 7, subleaf 0 (exposed to operating systems as the "sha_ni" feature flag), which enables runtime identification of support. This shared detection and encoding facilitates portable cryptographic software in heterogeneous environments with mixed Intel and AMD systems.

Usage and Applications

Integration in Software

Developers integrate SHA instructions into software primarily through compiler intrinsics, which provide high-level access to the hardware operations without requiring manual assembly coding. Intel provides a set of intrinsics in the <immintrin.h> header, such as _mm_sha1rnds4_epu32 for performing four rounds of SHA-1 operations on a 128-bit vector representing the SHA-1 state (A, B, C, D). Similar intrinsics exist for SHA-256, like _mm_sha256rnds2_epu32, enabling vectorized hashing in C/C++ code. These functions map directly to the underlying SHA-NI opcodes, allowing programmers to accelerate hash computations in performance-critical sections, such as message digest loops. Compilers like GCC and Clang support generation of SHA-NI code through architecture-specific flags. Enabling the -msha flag in GCC or Clang permits the use of SHA instructions in generated assembly via the corresponding intrinsics in library functions or user code. For instance, when compiling with -msse4.1 -msha, the compiler can emit the SHA instructions wherever the intrinsics appear, improving throughput on supported hardware without hand-written assembly. Major cryptographic libraries detect and utilize SHA-NI transparently to enhance performance. OpenSSL version 1.1.0 and later includes assembly-optimized paths for SHA-1 and SHA-256 that leverage SHA-NI when available, invoked via high-level APIs like EVP_sha256(). The library queries CPU capabilities at initialization and selects the hardware-accelerated implementation for the EVP interface, falling back to portable software routines otherwise. Similarly, Crypto++ supports the Intel SHA extensions in its SHA-1 and SHA-256 classes for accelerated hashing in applications like secure communication protocols. BoringSSL, a fork of OpenSSL used in projects like Chromium, incorporates SHA-NI support in its SHA assembly implementations, enabling hardware acceleration for hash functions in TLS handshakes and certificate verification.
For custom optimizations beyond library abstractions, developers can employ inline assembly to craft tight loops using the SHA instructions directly. This approach is useful in specialized scenarios, such as high-volume hashing or integrity validation, where manual control over register usage yields further gains; for example, chaining SHA256RNDS2 instructions in a loop while managing message padding and converting the big-endian message words via BSWAP or shuffle operations. Such implementations must account for hash context initialization, including the round constants and intermediate states, to ensure compliance with the SHA algorithms. To ensure cross-platform compatibility, software routinely performs runtime detection of SHA-NI availability using the CPUID instruction. Specifically, executing CPUID with EAX=7 and ECX=0 sets EBX bit 29 to indicate SHA-NI support, allowing applications to dispatch to hardware-accelerated paths or revert to pure software implementations on older or non-SHA-NI processors. This detection mechanism, often wrapped in library initialization routines, prevents illegal-instruction faults and maintains portability across diverse hardware environments.
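A minimal runtime check can also be done without issuing CPUID directly. This Linux-specific sketch reads the kernel's "sha_ni" flag from /proc/cpuinfo (the kernel derives it from CPUID.(EAX=7,ECX=0):EBX[29]) and conservatively reports False elsewhere, so callers fall back to software hashing:

```python
from pathlib import Path

def has_sha_ni() -> bool:
    """Best-effort detection of the x86 SHA extensions on Linux.

    Reads the "flags" line of /proc/cpuinfo, where the kernel exposes
    the SHA-NI feature bit as "sha_ni". On non-Linux platforms, or if
    the file is unreadable, returns False so the caller can dispatch
    to a portable software implementation.
    """
    try:
        cpuinfo = Path("/proc/cpuinfo").read_text()
    except OSError:
        return False
    for line in cpuinfo.splitlines():
        if line.startswith("flags"):
            return "sha_ni" in line.split()
    return False

print("SHA-NI available:", has_sha_ni())
```

Native code would typically query CPUID itself (e.g., via __get_cpuid_count in GCC/Clang) during library initialization and cache the result for dispatch.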

Performance and Optimization

The SHA instruction set provides significant throughput improvements over pure software implementations of the corresponding hash algorithms. On modern x86 processors supporting the SHA extensions, single-core SHA-256 performance can reach approximately 3 GB/s, compared to around 500 MB/s for optimized software implementations without hardware acceleration. ARM processors with the Cryptography Extensions achieve about 2 GB/s for SHA-256, representing roughly a 4x speedup over native software routines. For SHA-1, throughputs approach 2 GB/s on platforms using the extensions, while SHA-512 performance on recent Intel cores (such as Arrow Lake) is around 1 GB/s per core with hardware support, outperforming software baselines by 2-3x. Performance is influenced by several architectural factors, including pipeline stalls arising from data dependencies between message scheduling and round computations in the SHA algorithms. To mitigate this, optimal interleaving of message preparation instructions (e.g., SHA256MSG1/2) with round execution instructions (e.g., SHA256RNDS2) allows latency hiding, enabling concurrent processing within the CPU pipeline. On processors with AVX-512 support, masking techniques can efficiently handle partial message blocks, reducing overhead for inputs that are not multiples of the block size. Key optimizations include prefetching message data into cache to minimize memory stalls during bulk hashing operations. Combining the SHA instructions with AES-NI enables accelerated processing of full cryptographic suites, such as TLS handshakes, where hashing and encryption occur sequentially. In mobile or low-power environments, such as Intel Atom-based systems, power and thermal constraints must be considered, as sustained high-throughput hashing can increase power draw by up to 20-30% without careful throttling. Despite these gains, limitations exist, particularly for short messages, where fixed setup overhead dominates and returns diminish below roughly 1 KB input sizes. Additionally, the deprecation of SHA-1 due to collision vulnerabilities has shifted development focus toward SHA-256 and SHA-512 implementations, with SHA-1 acceleration retained mainly for legacy compatibility.
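A rough way to observe these throughput characteristics on a given machine is a hashlib micro-benchmark; hashlib defers to OpenSSL, which dispatches to SHA-NI (or ARM Cryptography Extension) code at runtime when available, so this measures whatever path the platform selects. The figures it prints are machine-dependent and only indicative:

```python
import hashlib
import time

def throughput_mb_s(algo: str, total_mb: int = 64, chunk: int = 1 << 20) -> float:
    """Approximate single-thread hashing throughput in MB/s.

    Hashes total_mb megabytes in chunk-sized updates and divides by
    wall-clock time; small totals keep the run short at the cost of
    measurement noise.
    """
    data = b"\xab" * chunk
    h = hashlib.new(algo)
    start = time.perf_counter()
    for _ in range(total_mb):
        h.update(data)
    h.digest()
    elapsed = time.perf_counter() - start
    return total_mb / elapsed

for algo in ("sha1", "sha256", "sha512"):
    print(f"{algo}: {throughput_mb_s(algo):.0f} MB/s")
```

Comparing the output on a SHA-NI-capable CPU against one without the extensions (or against a pure-Python implementation) makes the hardware speedup directly visible.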

References

  1. [1]
    [PDF] Intel® SHA Extensions
    Intel SHA extensions are new instructions for performance acceleration of SHA, a cryptographic hashing algorithm, used for data integrity, authentication, and ...
  2. [2]
    About the Cryptographic Extension - Arm Developer
    It also adds instructions to implement the Secure Hash Algorithm (SHA) functions SHA-1, SHA-224, and SHA-256. Note: The optional Cryptographic Extension is not ...
  3. [3]
    SHA Instructions - x86 Assembly Language Reference Manual
    3.22 SHA Instructions ; sha1nexte. SHA1NEXTE. Calculate SHA1 State Variable E after Four Rounds ; sha1rnds4. SHA1RNDS4. Perform Four Rounds of SHA1 Operation.
  4. [4]
    Cortex-A53 - Cryptography Extension - Arm Developer
    The Secure Hash Algorithm (SHA) functions SHA-1, SHA-224, and SHA-256. Finite field arithmetic used in algorithms such as Galois/Counter Mode and Elliptic Curve ...
  5. [5]
    FIPS PUB 180-4, Secure Hash Standard (SHS)
    Defines the SHA-1, SHA-256, and SHA-512 hash algorithms.
  6. [6]
    [PDF] Intel® Architecture Instruction Set Extensions and Future Features ...
    Added table listing recent instruction set extensions introduction in Intel. 64 and IA-32 Processors. • Updated CPUID instruction with additional details. • ...
  7. [7]
    NIST Transitioning Away from SHA-1 for All Applications | CSRC
    As a result, NIST will transition away from the use of SHA-1 for applying cryptographic protection to all applications by December 31, 2030. Note that after ...
  8. [8]
    Intel® Cryptography Primitives Library Release Notes
    Oct 22, 2025 · Optimized SHA-512 hash family algorithms with Intel® Secure Hash Algorithm 512 (Intel® SHA512) instructions for Intel® Core™ Ultra ...
  9. [9]
    FIPS 180-1, Secure Hash Standard | CSRC
    Date Published: April 17, 1995. Supersedes: FIPS 180 (05/11/1993). Author(s): National Institute of Standards and ...
  10. [10]
    FIPS 180-2, Secure Hash Standard (SHS) | CSRC
    FIPS 180-2. Withdrawn on February 25, 2004.
  11. [11]
    Cryptographic Standards and Guidelines | CSRC
    Learn about NIST's process for developing crypto standards and guidelines in NISTIR 7977 and on the project homepage. NIST now also has a Crypto Publication ...
  12. [12]
    Zen - Microarchitectures - AMD - WikiChip
    Zen also supports SHA, secure hash implementation instructions that are currently only found in Intel's ultra-low power microarchitectures (e.g. Goldmont) but ...
  13. [13]
    Will SME, SEV, and hw SHA be CPU game-changers? | AnandTech ...
    wccftech said: Zen will also contain hardware SHA – which means it's going to offer significant performance improvement over previous iterations of AMD ...
  14. [14]
    Intel confirms Arrow Lake-S & Lunar Lake CPUs will support ...
    Jul 3, 2023 · The guide confirms that both series will support a range of instructions, including AVX-VNNI-INT16, SHA512, SM3, and SM4 focusing on AI workloads and hashing ...
  15. [15]
    Manuals for Intel® 64 and IA-32 Architectures
    Intel® 64 and IA-32 Architectures Software Developer's Manual.
  18. [18]
    SHA1RNDS4 — Perform Four Rounds of SHA1 Operation
    The SHA1RNDS4 instruction performs four rounds of SHA1 operation using an initial SHA1 state (A,B,C,D) from the first operand (which is a source operand and the ...
  20. [20]
    [PDF] architecture-instruction-set-extensions-programming-reference.pdf
    This document covers instruction set extensions and future features, including new instructions like AADD, AAND, and TDPFP16PS, and a Next Generation PMU.
  21. [21]
    [PDF] 12th Generation Intel® Core™ Processors
    May 2, 2025 · processors. The Intel® SHA Extensions are a family of seven instructions based on the Intel®. Streaming SIMD Extensions (Intel® SSE) that are ...
  22. [22]
    [PDF] 4. Instruction tables - Agner Fog
    Sep 20, 2025 · The present manual contains tables of instruction latencies, throughputs and micro-operation breakdown and other tables for x86 family ...
  23. [23]
    [PDF] Exploring SHA Instructions and Its Application to AES-based Schemes
    Figure 14: Pipeline executions for 4 rounds of Areion-256 and 6 invocations of sha256rnds2 on Ice/Tiger/Alder Lake processors. ... latency of sha256rnds2 is 4, ...
  24. [24]
    Why modern software is slow - Hacker News
    Sep 29, 2022 · Because SHA256RNDS2 is 4 clock cycles on Zen3 CPUs. 6 clock cycles on Intel Alder Lake (12th gen). Source is the measurements here: https ...
  25. [25]
    [PDF] 3. The microarchitecture of Intel, AMD, and VIA CPUs - Agner Fog
    Sep 20, 2025 · The present manual describes the details of the microarchitectures of x86 microprocessors from Intel, AMD, and VIA. The Itanium processor is ...
  26. [26]
    [PDF] EPYC Offers x86 Compatibility | AMD
    SHA1/SHA256 assist with the calculation of the message and digest of these secure hash algorithms. SMEP prevents supervisor mode execution from user pages and ...
  27. [27]
    Zen 2 - Microarchitectures - AMD - WikiChip
    Sep 1, 2025 · For performance desktop and mobile computing, Zen is branded as Athlon, Ryzen 3, Ryzen 5, Ryzen 7, Ryzen 9, and Ryzen Threadripper processors.
  28. [28]
    CPUID — CPU Identification
    Bit 29: SHA. supports Intel® Secure Hash Algorithm Extensions (Intel® SHA Extensions) if 1. Bit 30: AVX512BW. Bit 31: AVX512VL. Table 3 ...
  29. [29]
    Intel® Intrinsics Guide
    This intrinsic generates a sequence of instructions, which may perform worse than a native instruction. Consider the performance impact of this intrinsic.
  30. [30]
    x86 Options (Using the GNU Compiler Collection (GCC))
    AMD Family 17h core based CPUs with x86-64 instruction set support. (This supersets BMI, BMI2, CLWB, F16C, FMA, FSGSBASE, AVX, AVX2, ADCX, RDSEED, MWAITX, SHA, ...