Intermediate representation
An intermediate representation (IR) is a data structure or abstract programming language used internally by a compiler to represent source code in a form that is independent of both the original high-level source language and the specific target machine architecture, while preserving the program's semantics without loss of information.[1] IRs play a central role in the compiler pipeline by bridging the front-end, responsible for lexical analysis, parsing, and semantic checking of source code, with the back-end, which handles machine-specific code generation and optimization.[2] This separation promotes modularity, allowing a single IR to support multiple source languages through reusable front-ends and multiple target platforms via interchangeable back-ends, thereby reducing the overall complexity of building compilers for diverse environments.[2] By providing a standardized intermediate form, IRs facilitate platform-independent optimizations, such as constant propagation, dead code elimination, and instruction scheduling, which can be applied uniformly regardless of the input language or output hardware.[3]
Common types of IRs vary in level of abstraction and structure to balance expressiveness with ease of analysis and transformation. High-level IRs (HIRs) closely resemble the abstract syntax tree (AST) of the source code, supporting advanced optimizations like inlining and high-level data flow analysis while retaining language-specific constructs.[4] Mid-level IRs (MIRs) offer a balance, often using forms like three-address code (e.g., statements of the form x ← y OP z) or static single assignment (SSA) form to simplify control flow reorganization and enable precise optimizations.[4] Low-level IRs (LIRs), in contrast, approximate assembly language with pseudo-instructions and virtual registers, aiding in register allocation and final code emission for specific architectures.[4] Structural variants include tree-based representations for hierarchical expressions (e.g., binary operations and calls), flat sequential forms for linear control flow, and stack-machine IRs like Java bytecode or Common Intermediate Language (CIL) in .NET, which model operand stacks for virtual machine execution.[3][1]
The design of an IR significantly affects compiler performance, maintainability, and memory usage, with choices often tailored to the target domain, such as just-in-time (JIT) compilation in virtual machines or ahead-of-time (AOT) optimization in static compilers.[1] Notable implementations include LLVM's intermediate representation for cross-platform code generation, GCC's expression trees for optimization passes, and JVM bytecode as a portable stack-based IR that supports dynamic loading and verification.[4] Overall, IRs are foundational to modern compiler technology, enabling efficient, retargetable, and optimizable software across heterogeneous computing ecosystems.[1]
Definition and Purpose
Core Concept
An intermediate representation (IR) is a programming language or data structure that serves as an intermediary form between a program's source code and its target machine code, enabling compilers to perform analysis, optimization, and transformation in a structured manner.[5] This representation captures the essential semantics of the source program without loss of information, allowing accurate execution while abstracting away source-language specifics and target-platform details.[1] By acting as a bridge, IR facilitates modular compiler design, where the front-end parser generates the IR independently of the back-end code generator.[6]
Key characteristics of an IR include platform independence, which ensures that optimizations and analyses can be applied without tying them to a specific hardware architecture, thereby enhancing compiler portability across diverse targets.[7] It also emphasizes ease of optimization, as the IR's structured form simplifies the implementation of passes that detect and rewrite code patterns for improved performance or correctness. IRs commonly appear in forms such as tree-based structures derived from abstract syntax trees (ASTs), which preserve hierarchical relationships in the code, or linear sequences that facilitate sequential processing and instruction-level manipulations.[5]
In terms of basic structure, an IR typically consists of simplified tokens, operations, and variables that distill complex source expressions into more manageable units; for instance, a source statement like result = (a + b) * c might be represented as a sequence such as t1 = a + b; result = t1 * c, where temporary variables like t1 hold intermediate results.[8] This contrasts with direct translation from source to machine code, as IR enables a separation between parsing the source language and emitting target-specific instructions, promoting reusability and reducing the complexity of supporting multiple languages or architectures in a single compiler framework.[5]
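Viewed as a data structure, such a lowered statement is little more than a small list of operations; the tuple layout and names in the Python sketch below are invented purely for illustration and do not correspond to any particular compiler.

    # The statement result = (a + b) * c distilled into a flat list of
    # simple operations over temporaries (an invented, illustrative layout).
    ir = [
        ("add", "t1", "a", "b"),       # t1 = a + b
        ("mul", "result", "t1", "c"),  # result = t1 * c
    ]

    # Each tuple names one operation, one destination, and two operands,
    # a form that is easy to inspect, transform, and translate further.
    for op, dst, lhs, rhs in ir:
        print(f"{dst} = {lhs} {'+' if op == 'add' else '*'} {rhs}")
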
Role in Compilers
In the compilation process, intermediate representation (IR) occupies a pivotal position following the front-end phases of lexical analysis, parsing, and semantic analysis, where source code is transformed from its high-level form into a structured abstract syntax tree (AST). The front-end then lowers this AST into IR, which becomes the primary data structure for subsequent stages, serving as input to the middle-end for optimizations and to the back-end for code generation. This placement decouples language-specific analysis from machine-specific code emission, allowing the compiler to process the program in a more abstract and manipulable form.[9]
The use of IR offers significant benefits for compiler design, particularly in enabling retargeting to multiple target architectures without rewriting optimization logic, as the same IR can feed into different back-ends tailored to specific hardware. This modularity promotes code reuse and simplifies maintenance in multi-platform environments. Additionally, IR supports just-in-time (JIT) compilation in virtual machines, such as the Java Virtual Machine or .NET Common Language Runtime, by providing a portable, optimizable form that can be dynamically translated and tuned at runtime for performance gains.[5][10]
A standard workflow in compilers illustrates IR's centrality: source code is first parsed to generate IR, which undergoes optimization passes to improve efficiency, and is then converted by the code generator into target machine code, as depicted in the sequence Source code → Parser → IR → Optimizer → Code Generator → Machine code. This pipeline allows iterative refinement, where IR may be progressively lowered through multiple levels to facilitate targeted analyses. IR can manifest in high-level or low-level forms depending on the optimization needs, but its uniform structure addresses key challenges by normalizing language-specific features, such as varying syntax or semantics, into a consistent representation, thereby reducing complexity in designing polyglot compilers that support multiple source languages.[9][5]
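This pipeline, and the retargeting it enables, can be pictured with a toy Python sketch in which one invented three-address IR feeds two interchangeable back-end stand-ins; every name, tuple layout, and pseudo-instruction below is fabricated for the example and does not model any real compiler's API.

    # Toy pipeline: Source -> Parser -> IR -> Optimizer -> Code Generator.
    def parse_and_lower(source):
        # Front-end stand-in: accepts only "dst = lhs op rhs" and emits one
        # three-address IR tuple.
        dst, expr = [part.strip() for part in source.split("=")]
        lhs, op, rhs = expr.split()
        return [("binop", op, dst, lhs, rhs)]

    def optimize(ir):
        # Middle-end stand-in: an identity pass, shown only to mark the stage.
        return ir

    def emit_register_style(ir):
        # One back-end: register-oriented pseudo-assembly.
        lines = []
        for _, op, dst, lhs, rhs in ir:
            mnemonic = "add" if op == "+" else "mul"
            lines += [f"mov r0, {lhs}", f"{mnemonic} r0, {rhs}", f"mov {dst}, r0"]
        return lines

    def emit_stack_style(ir):
        # Another back-end: stack-oriented pseudo-bytecode, fed the same IR.
        lines = []
        for _, op, dst, lhs, rhs in ir:
            lines += [f"push {lhs}", f"push {rhs}",
                      "add" if op == "+" else "mul", f"store {dst}"]
        return lines

    ir = optimize(parse_and_lower("total = price + tax"))
    print("\n".join(emit_register_style(ir)))   # one IR, register-style target
    print("\n".join(emit_stack_style(ir)))      # same IR, stack-style target
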
Historical Development
Early Innovations
The concept of intermediate representation (IR) first emerged in the mid-1950s amid efforts to automate the translation of high-level languages into machine code for resource-constrained computers. The seminal FORTRAN compiler, developed by John Backus and his team at IBM and released in 1957 for the IBM 704, introduced early forms of IR through its multi-pass compilation process. Source statements were analyzed and stored in internal tables, with the compiler generating an abstract representation of the program divided into basic blocks (linear sequences of instructions without branches) for optimization purposes, such as common subexpression elimination and loop invariant code motion. This approach used simple linear codes to model subscript expressions and control flow, enabling efficient transformations before final assembly into 704 machine instructions.[11]
Key advancements in the late 1950s and early 1960s were influenced by the ALGOL 60 design, which emphasized formal syntax definitions via Backus-Naur Form (BNF) and promoted syntax-directed translation methods. These techniques treated the parse tree as an implicit IR, allowing translators to attach semantic actions to syntactic constructs for incremental code generation. Edgar T. Irons' 1961 implementation of a syntax-directed compiler for ALGOL 60 exemplified this by using linked lists to represent syntactic diagrams, iteratively building an intermediate string of symbolic instructions that resembled assembler macros before expanding to target code. Concurrently, the Burroughs B5000, launched in 1961, integrated stack-based IR directly into its hardware architecture, with instructions in reverse Polish notation facilitating straightforward compilation of ALGOL expressions onto an operand stack, thus blurring the line between machine code and language-specific abstraction.[12][13]
The primary motivations for these early IRs stemmed from the limitations of 1950s hardware, including small memory capacities (e.g., 4K to 32K words on machines like the IBM 704) and scarce registers, which made direct assembly translation inefficient and error-prone. By introducing abstract, machine-independent forms, compilers like FORTRAN reduced the complexity of generating optimal code, enhanced portability across similar architectures, and eased debugging by isolating syntax analysis from machine-specific details. This shift allowed programmers to focus on algorithmic logic rather than hardware idiosyncrasies, marking a pivotal move toward retargetable software.[11]
By the mid-1960s, syntax-directed compilers had become a milestone in establishing IR as a standard layer in compiler pipelines, with table-driven parsers automating the construction of parse trees or directed graphs as central representations. Works building on Irons' framework, such as those for the GIER ALGOL system, refined these methods to handle context-sensitive semantics while maintaining modularity, solidifying IR's role in scalable compiler design.[12][14]
Evolution in Modern Systems
In the 1980s and 1990s, the development of intermediate representations advanced significantly with the emergence of optimizing compilers designed for portability across hardware platforms. The GNU Compiler Collection (GCC), first released in 1987 by Richard Stallman, introduced a portable intermediate representation based on Register Transfer Language (RTL) to support compilation of C code for multiple architectures, enabling widespread adoption in open-source software ecosystems.[15][16] As support for object-oriented languages expanded (such as the addition of C++ in GCC version 1.15.3 later that year), these IRs evolved to handle richer structures like classes, inheritance, and polymorphism, necessitating more expressive tree-based representations to capture complex language semantics without tying optimizations to specific source languages.
The 2000s marked a shift toward modular, reusable compiler infrastructures, exemplified by the LLVM project initiated in 2000 by Chris Lattner at the University of Illinois at Urbana-Champaign. LLVM's intermediate representation, a static single assignment (SSA)-based form, was designed for lifelong program analysis and transformation, allowing frontends for multiple languages to share a common backend for optimizations and code generation, thus promoting scalability in diverse environments.[17] This approach gained traction in just-in-time (JIT) compilation systems, where dynamic optimization required efficient IR handling; for instance, systems like the Java HotSpot virtual machine, released in 1999, utilized graph-based IRs such as Ideal to enable runtime adaptations for performance-critical applications.
Post-2010 developments further emphasized IRs tailored to emerging domains like the web and machine learning. WebAssembly, announced in 2015 by the World Wide Web Consortium (W3C) and major browser vendors, emerged as a stack-based, low-level binary IR optimized for safe, portable execution in browsers, addressing the limitations of JavaScript for high-performance computing tasks.[18] Concurrently, machine learning frameworks adopted IRs for graph-based computations; TensorFlow, released in 2015 by Google, employs dataflow graphs as an intermediate representation to model neural network operations, facilitating distributed training and hardware-specific optimizations across GPUs and TPUs.[19] More recently, the LLVM project's Multi-Level Intermediate Representation (MLIR), introduced in 2019, provides a flexible framework for multi-level IRs supporting domain-specific languages and optimizations in areas like machine learning and hardware acceleration.[20]
These advancements have profoundly impacted modern compiler ecosystems by enabling seamless cross-compilation and multi-target support. For example, Rust's rustc compiler leverages LLVM IR to generate efficient code for diverse platforms, from embedded devices to web targets via WebAssembly.[21] Similarly, Apple's Swift compiler uses LLVM IR in its backend to achieve high-performance code generation across iOS, macOS, and Linux, demonstrating how shared IR infrastructures reduce development overhead while maintaining optimization portability.[22]
Types and Forms
High-Level Representations
High-level intermediate representations (HIRs) in compilers are abstract structures that closely mirror the source code's semantics and organization, facilitating early-stage analysis and transformation without losing essential high-level details. These representations typically preserve constructs such as loops, conditionals, and data structures like arrays or classes, often manifesting as annotated abstract syntax trees (ASTs) or tree-based forms that encode control flow and variable usage directly on program-level identifiers rather than machine registers.[23] By retaining this source-like fidelity, HIRs enable compilers to perform semantic checks and initial optimizations while maintaining traceability to the original program structure.[24]
Common forms of HIRs include tree-structured data that bridge the front-end parser and middle-end optimizer, such as language-independent trees or extended ASTs with additional annotations for type information and scoping. These are particularly suited for the transition from source parsing to broader analysis, allowing the representation to handle multiple source languages through a unified interface. For instance, in GCC, the GENERIC form serves as a core HIR, generated by front-ends to represent programs in a tree format that captures high-level expressions and statements before lowering to more restricted forms.[24] Similarly, Clang employs AST-based representations, with emerging extensions like ClangIR providing a structured, high-level dialect built on MLIR that attaches provenance to AST nodes for precise analysis.[25]
The primary advantages of HIRs lie in their support for intuitive semantic analysis and error handling, as the preserved high-level constructs simplify debugging and recovery during compilation. They also enable language-specific optimizations, such as inline expansions of high-level calls, by keeping the representation abstract and close to the source semantics, which reduces the complexity of early compiler passes compared to lower-level forms that prioritize machine efficiency.[23] In contrast to low-level representations, HIRs emphasize conceptual fidelity over hardware-oriented details, making them ideal for front-to-middle-end workflows.[7]
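The flavor of such an annotated, tree-shaped HIR can be sketched as plain data; the node classes, fields, and annotations in the Python sketch below are hypothetical and are not the actual GENERIC or Clang AST node types.

    from dataclasses import dataclass, field

    # Invented high-level IR nodes: the shape stays close to the source program,
    # and semantic analysis attaches annotations such as types and source lines.
    @dataclass
    class Var:
        name: str
        type: str = "unknown"      # filled in by type checking

    @dataclass
    class BinaryOp:
        op: str
        left: object
        right: object
        type: str = "unknown"

    @dataclass
    class ForLoop:                 # the loop survives as a loop, not as jumps and labels
        index: Var
        lower: object
        upper: object
        body: list = field(default_factory=list)
        line: int = 0              # provenance back to the source file

    # for i in 0 .. n: s = s + a[i]   (array indexing kept as a high-level operation)
    loop = ForLoop(
        index=Var("i", "int"),
        lower=0,
        upper=Var("n", "int"),
        body=[BinaryOp("+=", Var("s", "float"),
                       BinaryOp("index", Var("a", "float[]"), Var("i", "int"),
                                type="float"),
                       type="float")],
        line=12,
    )
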
Mid-Level Representations
Mid-level intermediate representations (MIRs) provide a balance between the source-like abstraction of high-level IRs and the hardware proximity of low-level IRs, using simplified forms that facilitate control flow analysis and optimizations like data flow tracking. These representations often employ three-address code (TAC), consisting of statements like x ← y OP z where operands are variables or constants, organized into basic blocks within a control flow graph (CFG). TAC avoids complex subexpressions, making it suitable for transformations such as common subexpression elimination.[4]
A key variant is static single assignment (SSA) form, where each variable is assigned exactly once, renaming variables to expose precise data dependencies and simplify analyses like reaching definitions or liveness. SSA is commonly built on TAC or similar structures and is widely used in optimizers, as it enables efficient implementations of algorithms for dead code elimination and constant propagation without tracking mutable state.[4] MIRs like these bridge parsing and code generation, supporting platform-independent optimizations while being easier to lower to machine-specific forms than HIRs.
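A minimal sketch of SSA renaming over straight-line three-address code is given below; branches, and therefore phi functions, are deliberately omitted, and the tuple format is invented for the example.

    # SSA renaming for straight-line three-address code: each assignment creates
    # a new version of its destination, and later uses refer to the latest version.
    def to_ssa(code):
        version = {}                                     # variable -> latest version
        def use(name):
            return f"{name}{version[name]}" if name in version else name
        out = []
        for dst, op, a, b in code:
            a, b = use(a), use(b)
            version[dst] = version.get(dst, 0) + 1
            out.append((f"{dst}{version[dst]}", op, a, b))
        return out

    tac = [("x", "+", "a", "b"),    # x <- a + b
           ("x", "*", "x", "c"),    # x <- x * c   (x is reassigned)
           ("y", "+", "x", "a")]    # y <- x + a
    for dst, op, a, b in to_ssa(tac):
        print(f"{dst} = {a} {op} {b}")
    # x1 = a + b
    # x2 = x1 * c
    # y1 = x2 + a
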
Low-Level Representations
Low-level intermediate representations (LIRs) in compilers are data structures or code forms that approximate machine instructions, enabling late-stage optimizations and code generation by providing a neutral abstraction for hardware mapping. These can be machine-dependent, such as register-transfer languages (RTL) that describe operations between registers and memory using the target architecture's conventions, or more abstracted forms like stack-based operations that simulate virtual machine execution. LIRs maintain sufficient detail for transformations like peephole optimizations while facilitating register allocation and instruction selection.[26]
Common abstracted forms of LIRs include stack-oriented instructions that model pushes, pops, and computations on an operand stack, reducing explicit register management. For example, Java Virtual Machine (JVM) bytecode is a stack-based LIR that represents platform-independent instructions for Java code, executed via just-in-time compilation on the JVM to ensure portability across architectures.[27] Similarly, the .NET Common Intermediate Language (CIL) uses stack operations to abstract object-oriented bytecode, supporting type-safe optimizations and cross-platform deployment through the Common Language Runtime.[28]
LIRs offer key advantages in backend processes, such as streamlining register allocation with virtual registers that map to physical hardware, and instruction selection via pattern matching against target opcodes. Machine-dependent LIRs like GCC's RTL allow precise modeling of addressing modes and register sets for efficient code emission, while abstracted forms like bytecodes enable verification and dynamic loading in virtual environments. The focus on execution behavior supports hardware simulations, such as for embedded systems.
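The character of such stack-oriented code can be illustrated with a toy interpreter; the opcodes in the Python sketch below are invented stand-ins rather than actual JVM or CIL mnemonics.

    # Toy stack-machine IR in the spirit of JVM bytecode or CIL: operands are
    # pushed on an implicit stack instead of being named in each instruction.
    def run(program, env):
        stack = []
        for op, *args in program:
            if op == "load":                 # push a variable's value
                stack.append(env[args[0]])
            elif op == "add":
                b, a = stack.pop(), stack.pop()
                stack.append(a + b)
            elif op == "mul":
                b, a = stack.pop(), stack.pop()
                stack.append(a * b)
            elif op == "store":              # pop into a variable
                env[args[0]] = stack.pop()
        return env

    # result = (a + b) * c
    program = [("load", "a"), ("load", "b"), ("add",),
               ("load", "c"), ("mul",), ("store", "result")]
    print(run(program, {"a": 2, "b": 3, "c": 4})["result"])   # prints 20
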
Key Examples
Three-Address Code
Three-address code is a linear intermediate representation commonly used in compilers, where each instruction specifies an operation with at most three operands, typically in the form x = y op z, where x is the destination, y and z are source operands (such as variables or constants), and op is a binary operator like addition or multiplication.[29] This form also accommodates unary operations (x = op y), assignments (x = y), and control flow instructions such as unconditional jumps (goto L) or conditional branches (if x rop y goto L, where rop is a relational operator).[29] Instructions are often labeled for control flow, enabling the representation of structured constructs like if-then-else or loops through sequences of these basic statements.[29] For example, the expression z = x + 2 * y might be translated into:
    t1 = 2 * y
    t2 = x + t1
    z = t2

where t1 and t2 are temporary variables.[30]
The generation of three-address code typically involves traversing the abstract syntax tree (AST) produced by the parser, often using a post-order traversal to break down complex expressions into simpler, sequential instructions.[31] This process ensures that subexpressions are evaluated and assigned to temporaries before being used in parent operations, converting nested structures into a straight-line sequence suitable for further processing.[32] For instance, during the traversal of an AST node for a binary operation, the code generator recursively produces code for the left and right subtrees, then emits the final assignment instruction.
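This post-order emission strategy can be sketched in a few lines of Python; the expression-tree type, temporary-naming scheme, and output format below are invented for illustration rather than drawn from any particular compiler.

    import itertools
    from dataclasses import dataclass

    # Invented expression tree: leaves are variable names or constants,
    # interior nodes are binary operations.
    @dataclass
    class BinOp:
        op: str
        left: object
        right: object

    _temps = (f"t{i}" for i in itertools.count(1))

    def emit(node, code):
        """Post-order traversal: emit code for both subtrees, then for the node,
        returning the name (or literal) that holds the node's value."""
        if not isinstance(node, BinOp):
            return str(node)                    # variable or constant: use directly
        left = emit(node.left, code)
        right = emit(node.right, code)
        temp = next(_temps)
        code.append(f"{temp} = {left} {node.op} {right}")
        return temp

    # z = x + 2 * y
    tree = BinOp("+", "x", BinOp("*", "2", "y"))
    code = []
    code.append(f"z = {emit(tree, code)}")
    print("\n".join(code))
    # t1 = 2 * y
    # t2 = x + t1
    # z = t2
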
This representation simplifies data-flow analysis by limiting expressions to simple operations, allowing compilers to track variable definitions and uses more straightforwardly across basic blocks.[33] The use of temporary variables facilitates the creation of straight-line code, which supports optimizations like common subexpression elimination and constant folding without handling complex expression trees.[34] Its assembly-like simplicity also aids in instruction selection and register allocation during code generation for RISC architectures.[34]
A key limitation of three-address code is the potential for generating an excessive number of temporary variables, which can inflate the symbol table size and increase compilation time due to additional entries and lookups.[34] This proliferation often arises from breaking down expressions, leading to redundant copies that must be managed through techniques like copy propagation, where assignments such as t = x are replaced by substituting x directly in subsequent uses to reduce unnecessary temporaries.[35]
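A much-simplified sketch of copy propagation over straight-line three-address tuples (an invented format) illustrates the idea; a production pass would also consult liveness, redefinitions of the copied source, and control flow.

    # Copy propagation over straight-line three-address tuples: after a copy
    # t = x, later uses of t are rewritten to x, and the copy itself is dropped.
    # Dropping the copy is safe here only because every subsequent use in this
    # straight-line sequence is rewritten.
    def propagate_copies(code):
        copies = {}                                      # copy target -> source
        out = []
        for dst, op, a, b in code:
            a = copies.get(a, a)                         # rewrite operand uses
            b = copies.get(b, b)
            if op == "copy":
                copies[dst] = a
            else:
                out.append((dst, op, a, b))
                copies.pop(dst, None)                    # dst redefined: forget it
        return out

    tac = [("t", "copy", "x", None),   # t = x
           ("y", "+", "t", "1"),       # y = t + 1
           ("z", "*", "t", "y")]       # z = t * y
    print(propagate_copies(tac))
    # [('y', '+', 'x', '1'), ('z', '*', 'x', 'y')]
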
LLVM Intermediate Representation
LLVM Intermediate Representation (IR), introduced in 2000 by Chris Lattner as part of the LLVM project at the University of Illinois, is a typed, Static Single Assignment (SSA)-based, assembly-like language designed for both static and dynamic compilation.[36] It serves as a portable, target-independent code representation that facilitates optimization and code generation across diverse architectures, existing in forms such as human-readable text, efficient bitcode, and in-memory structures.[37] This IR builds upon earlier intermediate representation concepts like three-address code by incorporating strong typing and SSA form to enable precise analyses and transformations.[37] Key syntax elements in LLVM IR include instructions, functions, and modules. Instructions follow a form like %result = add i32 %a, %b, where %result is an SSA variable, add is the operation, i32 specifies the type, and %a and %b are operands.[37] Functions are defined with define followed by the return type, name (prefixed with @), and parameters, such as define i32 @main(i32 %argc, ptr %argv) { ... }, while declarations use declare for external functions like declare i32 @puts(ptr) nounwind.[37] Modules act as top-level containers scoping global variables (e.g., @global = common global i32 0) and functions, often including target specifications like target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128".[37]
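The same module, function, and instruction structure can also be built programmatically; the sketch below uses the third-party llvmlite Python binding, which is assumed to be installed separately, and the module and function names in it are arbitrary examples.

    # Building a module, a function, and an add instruction with llvmlite
    # (assumes: pip install llvmlite). "example" and "add_two" are arbitrary names.
    from llvmlite import ir

    i32 = ir.IntType(32)
    module = ir.Module(name="example")                  # top-level container
    fnty = ir.FunctionType(i32, [i32, i32])
    fn = ir.Function(module, fnty, name="add_two")      # define i32 @add_two(i32 %a, i32 %b)
    a, b = fn.args
    a.name, b.name = "a", "b"
    block = fn.append_basic_block(name="entry")
    builder = ir.IRBuilder(block)
    total = builder.add(a, b, name="total")             # %total = add i32 %a, %b
    builder.ret(total)                                  # ret i32 %total
    print(module)                                       # emits the textual LLVM IR
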
LLVM IR features strong static typing with primitives such as i32 for 32-bit integers, float for single-precision floats, pointers (ptr), and composite types like vectors (<4 x i32>) and structures ({ i32, ptr }).[37] The SSA form ensures each variable is assigned exactly once, promoting optimizations like constant propagation, as in %x = add i32 1, 2.[37] Metadata attachments provide debugging and analysis information, such as !dbg !0 on instructions linking to source locations.[37] Exception handling is supported through instructions like invoke void @func() to label %normal unwind label %error, which branches to a landing pad for error recovery.[37]
In the broader ecosystem, LLVM IR is integral to frontends like Clang for C/C++/Objective-C, the Rust compiler for generating machine code from MIR, and the Swift compiler for lowering SIL to optimized binaries.[38][39][22] Its pass manager enables modular optimizations, including loop unrolling via the LoopUnrollPass, which transforms loops into straight-line code for performance gains on supported hardware.[37] This extensibility has made LLVM IR a cornerstone for just-in-time compilation in tools such as those in the Julia ecosystem, as well as static analyzers.[40]