C--
C-- (pronounced C minus minus) is a C-like programming language designed as a portable assembly language to serve as an efficient intermediate representation primarily generated by compilers for high-level languages, with particular emphasis on supporting garbage collection and optimizing for speed across major computer architectures.[1][2] Developed in the late 1990s by researchers including Simon Peyton Jones from the University of Glasgow, Thomas Nordin, and Dino Oliva, C-- aims to balance minimal hardware dependencies with high performance and usability, prioritizing these over strict orthogonality or minimality to make it suitable as a backend for both mainstream and research compilers.[1] The language was introduced through key publications such as the 1997 paper "C--: A Portable Assembly Language" and the 1998 C-- Language Reference Manual, which detail its syntax and semantics tailored for compiler backends.[3] Later revisions involved contributions from Norman Ramsey, enhancing its infrastructure for practical use.[4] Key features of C-- include explicit support for low-level operations like register allocation and memory management while providing abstractions for high-level constructs such as closures and garbage collection interfaces, enabling portable code generation without sacrificing efficiency.[2] It has been employed in compiler pipelines, including early versions of the Glasgow Haskell Compiler (GHC), where it facilitated the translation of functional language code to machine code.[5] The design allows programmers familiar with imperative languages to write or understand C-- code, though its primary role remains as a target for code generation rather than direct human authoring.[3]Introduction
Overview
C-- (pronounced "C minus minus") is a simplified, C-like intermediate representation (IR) language designed for code generation in compilers targeting high-level languages, especially functional or garbage-collected ones.[6] Developed as a portable assembly language, it enables efficient translation from abstract source semantics to machine code across multiple architectures.[7] Distinct from standard C—a full general-purpose programming language—C-- functions primarily as a backend target in compiler pipelines rather than for direct human-written programs.[6] It emphasizes minimalism and portability, avoiding high-level constructs like I/O while providing low-level control suitable for optimization and code emission.[7] The core purpose of C-- is to bridge high-level program semantics and low-level machine instructions through an assembly-like syntax, incorporating features such as explicit stack management and interfaces for runtime services like garbage collection.[6] This design supports compiler writers in generating efficient, architecture-independent code for complex language features.[7] The original C--, introduced in the 1990s by Simon Peyton Jones and Norman Ramsey, underwent refinements in the 2000s, culminating in Version 2 with improved specifications for broader compiler integration.[8]History
C-- originated in the mid-1990s at the University of Glasgow and Digital Equipment Corporation's Systems Research Center, where it was initiated by Simon Peyton Jones, Norman Ramsey, and collaborators including Fermín Reig to overcome limitations in existing compiler backends for functional languages like Haskell and ML.[6] The project sought to create a portable assembly language that facilitated efficient code generation while supporting high-level runtime features such as garbage collection.[6] The first edition, C-- v1, emerged around 1997, with its formal definition outlined in "The C-- Language Reference Manual" published in 1998 by Simon Peyton Jones and contributors Thomas Nordin, Dino Oliva, and Pablo Nogueira Iglesias.[2] By the early 2000s, development advanced to C-- v2, which enhanced support for segmented memory architectures and provided a more rigorous formal specification, as detailed in subsequent academic works. A pivotal publication, "The C-- Compiler Infrastructure" by Norman Ramsey and Simon L. Peyton Jones in 2004, described the evolution toward a robust infrastructure for compiler construction. Adoption of C-- reached its height in the 2000s through integration into major compilers, particularly the Glasgow Haskell Compiler (GHC), where a simplified fork called Cmm was introduced around 2005 to serve as the low-level intermediate representation for code generation.[9] This enabled efficient backend processing for Haskell while leveraging C--'s design for portability and runtime support. Post-2010, the core language saw no major version releases due to its maturity, though minor updates persisted in associated tools and forks like Cmm within GHC.[10] As of 2025, Cmm remains a foundational component of GHC's compilation pipeline, supporting native code generation across multiple architectures without significant changes to the original C-- v2 framework.[10]Design
Goals and Principles
C-- was designed primarily to serve as a portable assembly language that simplifies the generation of efficient machine code across diverse platforms for compiler writers, addressing the challenges of developing retargetable backends for high-level languages.[1] By providing a common intermediate representation, it enables the reuse of code generators while preserving the essential features of source languages, particularly those requiring advanced runtime support like garbage collection.[1] The core principles of C-- emphasize minimalism and simplicity, eschewing a standard library and restricting features to essentials in order to avoid unnecessary complexity and bloat.[1] Portability is a foundational tenet, achieved through architecture-agnostic constructs that abstract machine-specific details, allowing explicit control over low-level elements such as registers, memory layout, and calling conventions without inheriting the full intricacies and undefined behaviors of C.[1] To accommodate modern programming paradigms, especially in functional and garbage-collected languages, C-- incorporates built-in mechanisms for garbage collection interfaces and support for closures via runtime primitives, facilitating efficient implementation of these features in the generated code.[11] Design choices prioritize formal, verifiable semantics to eliminate ambiguities common in lower-level languages, with a focus on automated code generation by front-end compilers rather than manual authoring by programmers.[1]Key Features
C-- uses a procedure-based execution model with an unlimited number of local variables, which the backend maps to registers or stack locations, providing explicit control over evaluation order and facilitating straightforward code generation and optimization by compilers targeting the language.[2] This model requires explicit management of variables, enhancing portability across different architectures.[2] The language supports a minimal set of primitive types, such as word1, word2, word4, word8 (for integers and pointers) and float4, float8 (for floating-point numbers), or in later versions bits8, bits16, bits32, bits64 and float32, float64, float80.[2][11] These sized types guide resource allocation for code generation without the complexity of higher-level structures. Garbage collection safety is supported through span directives to identify pointers, ensuring portability across memory models.[2] Control flow in C-- is structured around basic blocks connected by jumps and conditional branches, providing a linear, assembly-like flow suitable for optimization passes.[12] It includes explicit support for tail calls, allowing efficient implementation of recursive or iterative patterns without stack overflow in compatible runtimes.[2] Recursion and higher-order functions are not directly supported in the core language syntax; instead, closures are handled through runtime primitives that manage environment capture and invocation, delegating complexity to the underlying implementation.[12] Built-in operations encompass arithmetic on words or bits and floats, bit manipulation for low-level efficiency, and calls to external functions, all crafted to expose opportunities for compiler optimizations such as constant folding and dead code elimination.[2] Unlike standard C, C-- omits the preprocessor to avoid macro expansion issues in generated code, flattens structs and unions into explicit memory layouts without aggregate types, and eschews automatic variable allocation in favor of manual stack or heap management, making it a more suitable target for high-level language compilers.[2]Language Elements
Syntax
C-- programs consist of a sequence of top-level declarations, including data layout directives and procedure definitions, with support for import and export declarations but no modules or include directives, assuming a flat structure suitable for compiler backends. Procedure definitions use the form[conv] proc name(type arg1, ..., type argn) { body }, where "conv" specifies a calling convention, types are explicit for parameters, and the body contains a sequence of statements. The language supports multiple return values via return(expr1, ..., exprn); and tail calls with jump name(args); for optimization. Control flow includes blocks delimited by curly braces for grouping statements.[2]
Variable declarations appear as type name1, ..., namen; within procedure bodies or blocks, without inline initialization; scoping is block-based but simplified without nested functions. Statements include assignment (name = expr;), conditionals (if expr rel expr block [else block]; where rel is ==, !=, etc.), jumps ([goto](/page/Goto) name;), and returns (return [exprs];), following a low-level, assembly-oriented notation. Labels are defined as name:. Iteration is achieved using goto statements and labels, as there is no built-in while or for loop.[2]
Expressions include constants, variable names, memory accesses (e.g., type [expr] for dereference), operators (infix binary like +, *, ==; prefix unary like -), and primitive calls (prim(op, args)). Operator precedence follows standard rules, with parentheses for grouping. Arithmetic uses +, -, *, /; comparisons use ==, !=, <, etc.; no logical && or || as high-level booleans, but conditions via comparisons. Function (procedure) calls are name(args). For example, (a + b) * c follows standard precedence.[2]
Comments use block style only: /* ... */ for multi-line annotations, non-nested. Whitespace is insignificant except to separate tokens; the language is not indentation-sensitive, and statements end with semicolons. Keywords are reserved (e.g., proc, if, return), and identifiers start with letters or underscore, followed by letters, digits, or underscores, case-sensitive.[2]
The formal grammar is context-free, with nonterminals in italics and terminals in bold. Key BNF from the specification includes:
- program ::= pal_element | program pal_element
- pal_element ::= data_decl | proc_decl
- proc_decl ::= [conv] Name ( typed_arg_list ) [ data_decl ] block
- block ::= { stm_list }
- stm ::= skip ; | type Name_list ; | Name = expr ; | if expr rel_op expr block [ else block ] ; | goto Name ; | return ( expr_list ) ; | Name : stm
- expr ::= const | Name | type [ expr ] | ( expr ) | expr binop expr | prim ( op , expr_list )
- binop ::= + | - | * | / | == | != | < | > | <= | >=
Type System
C-- uses a simple static type system for low-level code generation, focusing on explicit sizes to support verification and optimization in backends, with types mapping directly to hardware. It lacks higher-level features like subtyping, prioritizing portability for functional language compilers. The description here is for the original version 1 specification (1998); later versions and implementations like GHC's Cmm introduce variations.[2] Basic types includeword n for fixed-width values of n bytes (n=1,2,4,8, typically unsigned or machine-word), float m for IEEE 754 floating-point (m=4 for single-precision, m=8 for double-precision). Pointers are represented as word n (n matching architecture pointer size) for raw addresses. For garbage collection support, gcptr denotes heap pointers, enabling accurate GC via stack maps and tagged values without runtime checks. These types inform register allocation and operations.[2]
No implicit type conversions; all must be explicit via casts (e.g., word4 n), with compiler checks for size compatibility and alignment. Memory accesses specify type and optional alignment (e.g., word4{align2}[ptr] for 4-byte access at 2-byte alignment). The system ensures type consistency for performance, avoiding dynamic checks.[2]
The type system is monomorphic, without polymorphism, generics, or inheritance, keeping it lightweight for backend tasks. No ad-hoc polymorphism or dynamic dispatch.
Type inference is absent; all types must be explicitly annotated in declarations, parameters, and operations to maintain clarity for generated code and enable optimizations like inlining.[2]
Functions (procedures) require explicit typed signatures, e.g., word4 proc_name(word4 arg1, word8 arg2), supporting multiple returns like return(word4, word8);, ensuring consistency for code emission and analysis.[2]
Runtime Support
Memory Management
C-- employs a segmented memory model to promote portability across diverse hardware architectures, dividing memory into four distinct segments: code, stack, heap, and global. The code segment contains the program's executable instructions as initialized data in sections. The global segment accommodates static data, with its size and initial contents fixed at compile time for predictable layout. The stack segment handles local variables and function activation records via stackdata labels, dynamically expanding and contracting as procedures are invoked and returned. The heap segment supports runtime allocation for variable-sized objects, managed by the front-end runtime system and abstracting away platform-specific details to facilitate compiler backend implementation. This segmentation allows C-- programs to reason about memory independently of the underlying machine's addressing conventions.[13] Allocation and deallocation in C-- are handled by the front-end runtime system rather than language primitives, enabling compilers to generate portable code without embedded machine-specific assembly. Heap objects under garbage collection do not require manual deallocation, while non-collected memory management is delegated to the runtime. Stack space is reserved implicitly through procedure activations and stackdata declarations.[13] Access to memory in C-- uses load and store operations of the formtype[expr], where type is typically bits k for the native pointer size, applied across segments without explicit qualifiers. For example, bits32[ptr] = value stores a 32-bit value at the address given by ptr, with analogous loads like value = bits32[ptr]. Pointers use a single native type (bits k), checked at compile time for compatibility. C-- is an unsafe language, permitting unchecked runtime errors for invalid accesses such as null dereferences or out-of-bounds pointers, without built-in compile-time prevention beyond basic type matching. Alignment can be asserted optionally (e.g., aligned int), but is not enforced by default.[13]
Alignment in C-- is governed by type specifications to accommodate hardware constraints portably. For instance, a bits(64) type requires 8-byte alignment on 64-bit systems, with the compiler inserting padding as necessary to satisfy this during layout in any segment. Larger or compound types inherit alignment from their components, ensuring consistent behavior across targets. This rule-based approach minimizes runtime overhead while avoiding undefined behavior from misaligned accesses. Overall, C--'s memory model balances explicit control with safety guarantees, making it suitable as an intermediate representation for high-level language compilers.[13]
Garbage Collection
C-- incorporates a dedicated runtime interface for garbage collection, enabling compilers for high-level languages to integrate efficient memory reclamation without relying on platform-specific details. This support, introduced in the initial design and refined in later versions like v2.0, is achieved through mechanisms that expose operations for tracing and synchronization, while abstracting low-level machine dependencies. The design emphasizes portability across architectures and compatibility with diverse GC strategies, making it suitable for languages requiring automatic memory management.[11][13] Central to C--'s GC mechanism is the management of roots, where live pointers in registers or on the stack are identified using front-end descriptors and marked with thegc_root directive to inform the collector of potential references. This approach allows precise tracing during collection phases, reducing overhead by avoiding conservative approximations.[11]
For incremental and concurrent collection, C-- supports write barriers through the invariant keyword, which identifies variables unchanged across procedure calls to aid reachability analysis without halting the mutator. This enables techniques like remembered sets in generational collectors to track modifications efficiently. Such mechanisms are essential for maintaining collector invariants in low-latency applications.[11]
Object representation in C-- facilitates GC through tagged values, where allocated blocks include headers with type tags and slots for forwarding addresses during copying phases. Allocation is performed manually by the front-end (e.g., updating a heap pointer), ensuring objects are initialized for tracing. Execution interruptions are managed via yield(GC) calls at safe points, allowing the collector to suspend the program for reclamation after non-allocating operations. These features collectively enable precise, type-aware collection while supporting forwarding for compacting algorithms.[11]
The GC interface is engineered for flexibility, accommodating stop-the-world collectors for simplicity, generational schemes for performance in long-running programs, and custom implementations via extensible runtime hooks. A foundational element is the segmented heap model, which confines heap objects to designated memory segments; the collector scans only these segments for roots and pointers, bypassing irrelevant address space regions like code or stack, thereby optimizing scan times and minimizing false positives in pointer identification. This segmented approach underpins C--'s efficiency in resource-constrained or large-address-space environments.[11]
These features position C-- as an effective backend for functional language compilers, where frequent allocations and complex pointer structures necessitate robust GC support without compromising portability.[11]
Examples
Basic Program
A basic program in C-- demonstrates the language's core syntax for defining procedures, using type annotations, performing arithmetic, and returning results. The following example defines a simple procedure to add two 32-bit integers and invokes it to compute the sum of 1 and 2, returning the result.[2]This program defines the procedurecproc add (word4 x, word4 y) { return (x + y); } proc main () { let result = add (1, 2); return (result); }proc add (word4 x, word4 y) { return (x + y); } proc main () { let result = add (1, 2); return (result); }
add, which takes two parameters typed as word4—32-bit words representing integers—and returns their sum using the + arithmetic operator. The type word4 specifies a 4-byte (32-bit) integer, ensuring portable representation across architectures. The main procedure calls add(1, 2) with literal integer constants and returns the result. Literals are promoted to word4 as needed.[2]
Execution proceeds by entering the main procedure, evaluating the call to add, performing the addition (3 in decimal), and returning the value. C--'s evaluation model uses registers where possible, spilling to the stack if necessary, with arithmetic operators handling two's complement operations. This highlights C--'s low-level control while maintaining portability.[2]
When compiled, this program generates machine code that loads constants, adds them, and exits, typically producing an executable that returns 3 as its exit status. Its purpose is to illustrate a simple computation in C-- as a portable intermediate language.[2]
Compiler-Generated Snippet
C-- provides interfaces for garbage collection, allowing compiler-generated code to interact with runtime systems. The following illustrative snippet demonstrates low-level memory management for a closure-like structure, using data segments and foreign calls for allocation, as supported in the language design for high-level runtimes like those in functional compilers. Note that implementations such as the Glasgow Haskell Compiler (GHC) extend C-- with variants like Cmm for specific features like explicit GC primitives.[1][14]This snippet shows allocation via a foreign call to a GC runtime (cdata { word tag; word code_ptr; word env_ptr; } closure; proc create_closure (word env) { foreign "gc_alloc" (3 * sizeof(word)) -> r1; // Call to runtime allocator r1 ! 0 = 1; // Tag r1 ! 1 = &closure_body; // Code pointer r1 ! 2 = env; // Environment return (r1); } proc closure_body (word self, word arg) { let env = self ! 2; let x = env ! 0; let result = x + arg; return (result); }data { word tag; word code_ptr; word env_ptr; } closure; proc create_closure (word env) { foreign "gc_alloc" (3 * sizeof(word)) -> r1; // Call to runtime allocator r1 ! 0 = 1; // Tag r1 ! 1 = &closure_body; // Code pointer r1 ! 2 = env; // Environment return (r1); } proc closure_body (word self, word arg) { let env = self ! 2; let x = env ! 0; let result = x + arg; return (result); }
gc_alloc), initialization of a closure structure in a data segment, and access in the body procedure. Write barriers are handled by the runtime during pointer stores. In standard C--, GC support is via foreign interfaces rather than built-in primitives; GHC's Cmm adds explicit support for such operations to facilitate code generation from Haskell. These features enable optimization while preserving semantics for garbage-collected environments.[1][15]