Tensor Processing Unit
The Tensor Processing Unit (TPU) is a custom-designed application-specific integrated circuit (ASIC) developed by Google to accelerate machine learning workloads, particularly the training and inference of neural networks through optimized tensor computations.[1] TPUs excel at high-throughput matrix multiplications and other operations central to deep learning, offering significantly higher performance and energy efficiency than general-purpose processors such as CPUs or GPUs for AI-specific tasks. They are deployed in large-scale pods via Google Cloud, enabling scalable AI applications ranging from recommendation systems to generative models.[1]

Google initiated TPU development in 2014 to address surging computational demands for AI, such as speech recognition at global scale, which outpaced existing hardware capabilities.[2] The first-generation TPU (v1), an ASIC focused on inference with a systolic array for efficient matrix and vector math, was deployed internally in 2015; more than 100,000 units powered services such as Google Ads, Search, AlphaGo, and self-driving car projects.[2] Subsequent generations evolved to support training: v2 introduced supercomputer-scale pods with 256 chips and high-bandwidth interconnects; v3 added liquid cooling for density; and v4 incorporated optical circuit switches for faster communication.[2] TPUs became publicly available through Google Cloud in early 2018, democratizing access for researchers and enterprises.[2] Later iterations include v5e and v5p, aimed at cost-effective inference and high-performance training respectively, and v6 (Trillium), announced in 2024, which delivers 4.7 times the compute performance of v5e and supports advanced models like Gemini 1.5 Flash.[2] The seventh generation, Ironwood, unveiled in April 2025 and rolling out in November 2025, represents Google's most advanced TPU yet, optimized for the inference demands of generative AI with over four times the performance of Trillium and ten times that of v5p, while emphasizing energy efficiency.[3][4]

Architecturally, TPUs feature a systolic array core for parallel tensor processing, integrated high-bandwidth memory (HBM), and custom interconnects such as the Inter-Chip Interconnect (ICI) to enable massive parallelism across thousands of chips. This design prioritizes low-precision arithmetic (e.g., bfloat16) for speed while maintaining accuracy for AI tasks, consuming around 40 W per chip in early versions and scaling efficiently in modern pods.[5] Today, more than 60% of funded generative AI startups on Google Cloud use TPUs, and nearly 90% of generative AI unicorns use Google Cloud's AI infrastructure, including TPUs, driving innovations in large language models, computer vision, and beyond.[2]

Overview and Design
Core Architecture
The Tensor Processing Unit (TPU) is a custom application-specific integrated circuit (ASIC) developed by Google specifically to accelerate tensor operations, such as matrix multiplications and convolutions, central to neural network computations in machine learning.[6][5] At the heart of the TPU lies the matrix multiply unit (MXU), a systolic array comprising 65,536 multiply-accumulate (MAC) units organized in a 256 by 256 grid, enabling high-throughput parallel processing of matrix operations without the overhead of general-purpose instruction fetching.[6][5] The architecture integrates a host interface via a PCIe Gen3 x16 bus for data transfer between the TPU and the host CPU, alongside dedicated memory components including activation memory and weight memory to minimize latency in neural network inference and training.[6][5]

A key feature is the unified buffer, a 24 MB static random-access memory (SRAM) that holds activations and intermediate results for efficient data reuse, delivering an internal bandwidth of 600 GB/s to the MXU in early TPU versions, while weights stream in from off-chip memory through a dedicated weight FIFO.[6][5] To optimize for speed while preserving model accuracy, TPUs employ reduced-precision data types, such as 8-bit fixed-point integers for computations in initial designs and bfloat16, a 16-bit floating-point format with an 8-bit exponent, for subsequent iterations, allowing aggressive quantization without significant loss in representational power.[6][5]
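The tile-by-tile nature of MXU execution can be illustrated with a short sketch. The following NumPy example (illustrative only, not Google code) decomposes a larger matrix product into 256×256 blocks and accumulates the partial products, mirroring how a host program feeds operands to a fixed-size matrix unit one tile at a time; the tile size constant is the only TPU-specific assumption.

```python
import numpy as np

TILE = 256  # matches the 256x256 MXU dimension described above

def tiled_matmul(a: np.ndarray, b: np.ndarray, tile: int = TILE) -> np.ndarray:
    """Compute a @ b by accumulating tile-sized partial products,
    analogous to feeding a fixed-size matrix unit one tile at a time."""
    m, k = a.shape
    k2, n = b.shape
    assert k == k2, "inner dimensions must match"
    c = np.zeros((m, n), dtype=np.float32)
    for i in range(0, m, tile):
        for j in range(0, n, tile):
            for p in range(0, k, tile):
                # One "matrix-unit invocation": a tile of activations times a tile of weights.
                c[i:i+tile, j:j+tile] += a[i:i+tile, p:p+tile] @ b[p:p+tile, j:j+tile]
    return c

rng = np.random.default_rng(0)
a = rng.standard_normal((512, 512)).astype(np.float32)
b = rng.standard_normal((512, 512)).astype(np.float32)
assert np.allclose(tiled_matmul(a, b), a @ b, atol=1e-2)
```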
Systolic Array Mechanism
The systolic array is a computational architecture consisting of a grid of interconnected processing elements (PEs) that enable pipelined data flow, where data moves rhythmically through the array in a manner analogous to the pulsing of blood through the circulatory system.[5] This concept originated in the late 1970s and early 1980s, pioneered by H. T. Kung and Charles Leiserson at Carnegie Mellon University, who proposed systolic arrays as an efficient VLSI design for parallel processing of algorithms like matrix multiplication and signal processing, emphasizing regular data propagation to minimize control overhead and maximize throughput.[7] In the context of Tensor Processing Units (TPUs), Google adapted this paradigm from its 1980s roots to accelerate tensor operations in neural networks, leveraging the array's inherent parallelism for deep learning workloads.[5]

In a TPU's systolic array, computation occurs across a two-dimensional grid of multiply-accumulate (MAC) units, with data flowing systolically to facilitate efficient parallel execution. Weights are pre-loaded and remain stationary within the PEs to avoid repeated memory fetches, while activations (input feature maps) stream in from one edge, typically the left, and partial results propagate toward the opposite corner, such as the bottom-right.[5] This weight-stationary dataflow reduces memory access overhead by enabling local, neighbor-to-neighbor data passing without global interconnects or frequent off-chip transfers, allowing the array to sustain high utilization during dense computations.[8] For instance, in the first-generation TPU, the array comprises a 256×256 grid of 8-bit MAC units, where each PE performs a multiplication followed by accumulation as data pulses through in coordinated wavefronts.[5]

The mathematical foundation of the systolic array in TPUs is matrix multiplication, a core operation in neural networks represented as \mathbf{C} = \mathbf{A} \times \mathbf{B}, where \mathbf{A} denotes the activation matrix and \mathbf{B} the weight matrix. Each PE in the grid contributes to the elements of \mathbf{C} via parallel dot products, with the MAC operations accumulating

c_{ij} = \sum_{k} a_{ik} \cdot b_{kj},

where the summation runs over the inner dimension k as activations and partial sums flow through the array.[5] This formulation decomposes the multiplication into wavefronts of scalar operations across the grid, achieving pipelined execution that overlaps computation and data movement.[8]

For deep learning, the systolic array's design minimizes data movement between processing elements and external memory, which is a primary bottleneck in neural network training and inference, thereby enabling high throughput for operations like convolutions and fully connected layers that dominate model computations.[5] By keeping data local and synchronizing flows, the architecture achieves near-peak efficiency for dense tensor algebra, supporting the massive parallelism required for scaling models without proportional increases in energy or latency.[8] Subsequent TPU designs have extended the architecture to exploit sparsity as well; for example, the SparseCore units introduced in v4 accelerate the sparse, embedding-heavy computations common in recommendation models, allowing unnecessary work on zero-valued elements to be skipped.
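A functional sketch can make the weight-stationary dataflow concrete. The following Python example (an illustration under simplifying assumptions, not a cycle-accurate model of any TPU) assigns one stationary weight per processing element, streams each activation across a row while the partial sum accumulates down a column, and checks the result against an ordinary matrix product; the cycle-level skewing and pipelining of a real systolic array are abstracted away.

```python
import numpy as np

def weight_stationary_matmul(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Functional sketch of a weight-stationary systolic matmul C = A @ B.

    PE(k, n) permanently holds weight b[k, n]. For each input row i,
    the activation a[i, k] is reused as it moves right across row k of
    the grid, while the partial sum for c[i, n] accumulates as it moves
    down column n. Cycle-level skewing and pipelining are not modeled.
    """
    m, k_dim = a.shape
    _, n_dim = b.shape
    c = np.zeros((m, n_dim))
    for i in range(m):                       # one wavefront per input row
        psum = np.zeros(n_dim)               # partial sums entering the top of each column
        for k in range(k_dim):               # row k of PEs (one step down per k)
            a_val = a[i, k]                  # activation streaming across row k
            for n in range(n_dim):           # activation moves right, PE to PE
                psum[n] += a_val * b[k, n]   # local multiply-accumulate at PE(k, n)
        c[i] = psum                          # finished outputs exit the bottom edge
    return c

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 6))
B = rng.standard_normal((6, 3))
assert np.allclose(weight_stationary_matmul(A, B), A @ B)
```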
Performance Metrics
The Tensor Processing Unit (TPU) delivers high throughput tailored for the matrix multiply operations prevalent in neural networks, achieving a peak of 92 tera-operations per second (TOPS) for 8-bit integer (INT8) computations on its core systolic array. This performance stems from a 256×256 array of multiply-accumulate units operating at 700 MHz, optimized for low-precision arithmetic to maximize efficiency in inference and training tasks. Power efficiency is a cornerstone of TPU design, with the first-generation chip consuming 40 W during operation, yielding approximately 2.3 TOPS per watt for INT8 tasks—a metric that underscores its advantage in datacenter-scale deployments where energy costs dominate. Sustained throughput depends heavily on the workload: compute-bound convolutional models approach peak utilization, while memory-bound models fall well short of it, constrained by off-chip bandwidth and thermal limits. The systolic array mechanism described above enables this efficiency by streaming data through the array in a pipelined manner, reducing idle cycles and the power overhead of data movement.[5]

Memory access in TPUs features unified patterns across on-chip and off-chip storage to minimize latency, with 28 MiB of fast SRAM for activations and accumulators, and 8 GiB of off-chip DDR3 DRAM delivering 34 GB/s of bandwidth for weights and larger tensors. This hierarchy supports low-latency unified memory access, where activations flow directly from compute units to storage without frequent host intervention, achieving latencies under 1 μs for core operations. Scalability is facilitated by TPU pods, interconnected via the high-bandwidth Inter-Chip Interconnect (ICI), allowing multi-chip configurations to aggregate performance; pods of thousands of chips routinely scale to exaFLOPS of aggregate compute, enabling distributed training across large models.[5]

In benchmarks on common workloads such as ImageNet classification with the Inception v3 model, TPUs have demonstrated significant reductions in processing times, with inference throughput up to 30x higher than contemporary CPU or GPU setups. For training at similar scales, TPU clusters reduce end-to-end training times from weeks to days by leveraging high sustained FLOPS and efficient all-reduce operations over ICI, establishing their impact on large-scale deep learning pipelines.[9]
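The quoted peak figures follow directly from the array size and clock rate; the short calculation below reproduces them, counting a multiply and an add as two operations per MAC per cycle.

```python
# Back-of-the-envelope check of the first-generation TPU's quoted peak numbers.
mac_units = 256 * 256      # processing elements in the systolic array
ops_per_mac = 2            # one multiply plus one add per cycle
clock_hz = 700e6           # 700 MHz
chip_power_w = 40          # reported operating power

peak_tops = mac_units * ops_per_mac * clock_hz / 1e12
print(f"peak throughput: {peak_tops:.1f} TOPS")              # ~91.8, quoted as 92 TOPS
print(f"efficiency: {peak_tops / chip_power_w:.2f} TOPS/W")  # ~2.3 TOPS/W
```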
Comparison to Other Processors
Advantages over CPUs
Tensor Processing Units (TPUs) offer substantial advantages over central processing units (CPUs) for artificial intelligence workloads due to their specialized design as application-specific integrated circuits (ASICs) tailored for tensor operations. Unlike CPUs, which follow a general-purpose von Neumann architecture capable of handling diverse tasks but limited by sequential instruction fetching and data movement bottlenecks, TPUs focus on accelerating the matrix multiplications and convolutions prevalent in neural networks. This fixed-function approach yields 15–30× higher performance for machine learning inference on production workloads compared to contemporary server-class CPUs.[9][10]

A key benefit stems from reduced instruction overhead in TPUs, which execute predefined tensor instructions in a dataflow manner without the branching or control logic required in CPUs. CPUs incur significant latency from repeatedly decoding instructions and managing complex control flows under the von Neumann model, whereas TPUs stream data through dedicated hardware paths, minimizing idle cycles and overhead for compute-bound operations. The TPU's systolic array further enhances this by enabling massive parallelism with localized data reuse, avoiding the global memory accesses that bottleneck CPUs.[10]

TPUs also excel in energy efficiency, consuming far less power for the parallel multiply-accumulate (MAC) operations essential to deep learning. In early deployments, TPUs demonstrated a 30–80× improvement in tera-operations per watt (TOPS/W) over conventional CPUs, making them ideal for large-scale AI tasks where power constraints are critical. This efficiency arises from the TPU's streamlined architecture, which eliminates unnecessary general-purpose features and optimizes for high-throughput tensor computations without the overhead of versatile but power-hungry CPU components.[9][10]

In benchmarks of production-scale AI models, the advantage reached up to 71× for certain workloads, such as specific convolutional neural networks. Subsequent generations of TPUs, starting with v2, enabled faster training of Inception-like models compared to CPU clusters, dramatically reducing iteration times in Google's integrations for services like image search and translation. These gains underscore the TPU's role in scaling AI training and inference efficiently beyond CPU limitations.[5][10]

Advantages over GPUs
Tensor Processing Units (TPUs) are purpose-built for accelerating the tensor operations central to deep neural networks (DNNs), in contrast to graphics processing units (GPUs), which were originally designed for rendering graphics and excel at versatile parallel computing tasks such as gaming and scientific simulations.[11] This specialization allows TPUs to achieve 2–5x greater efficiency in DNN training workloads compared to GPUs; for instance, Google's TPU v2 processes the Transformer (Big) model 4.3x faster and the Evolved Transformer (Medium) 5.2x faster than NVIDIA P100 GPUs.[12] The TPU's systolic array architecture minimizes data movement overhead during matrix multiplications, a core operation in DNNs, enabling higher throughput for AI-specific computations without the general-purpose overhead inherent in GPUs.[5]

TPUs natively support lower-precision formats such as bfloat16 (BF16) for computation and INT8 for quantization, which significantly reduce memory footprint and computational demands while maintaining model accuracy in many AI tasks.[13] GPUs traditionally emphasized FP32 precision, and although modern GPUs have added lower-precision support, TPUs integrate these formats at the hardware level, allowing up to 2x faster matrix operations in BF16 and further efficiencies in INT8 for inference-heavy workloads. This design choice not only lowers power consumption but also allows larger batch sizes and models to fit within constrained memory, providing a clear edge in scaling DNN training and inference.[14]

In cloud environments, TPUs integrated into Google Cloud deliver cost-effectiveness through predictable, usage-based pricing models that avoid the supply-driven variability often seen with GPU instances from third-party providers.[15] This results in 2x or greater cost-efficiency improvements for AI inference tasks compared to equivalent GPU setups, making TPUs particularly advantageous for large-scale, sustained ML deployments.[15] A real-world example is AlphaGo, where TPUs enabled faster iterations and deeper search than GPU equivalents would have allowed, accelerating the path to superhuman performance in Go.[16]
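The bfloat16 trade-off mentioned above, full float32 dynamic range in half the storage at the cost of mantissa precision, can be seen directly from the bit layout. The sketch below uses plain NumPy and simple truncation (real hardware typically rounds) to keep only the top 16 bits of a float32 value: the sign, the 8-bit exponent, and 7 mantissa bits.

```python
import numpy as np

def to_bfloat16_bits(x: np.ndarray) -> np.ndarray:
    """Truncate float32 values to bfloat16 by keeping the top 16 bits
    (sign + 8-bit exponent + 7-bit mantissa). Truncation is used here
    only to show the format; hardware typically rounds instead."""
    return (x.astype(np.float32).view(np.uint32) >> 16).astype(np.uint16)

def from_bfloat16_bits(bits: np.ndarray) -> np.ndarray:
    """Expand bfloat16 bit patterns back to float32 by zero-filling the low mantissa bits."""
    return (bits.astype(np.uint32) << 16).view(np.float32)

x = np.array([3.1415927, 1e-20, 6.02e23], dtype=np.float32)
x_bf16 = from_bfloat16_bits(to_bfloat16_bits(x))
print(x)        # full float32 values
print(x_bf16)   # same dynamic range (8-bit exponent), roughly 2-3 decimal digits of precision
print(f"storage: {x.nbytes} bytes as float32 vs {to_bfloat16_bits(x).nbytes} bytes as bfloat16")
```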
Limitations and Trade-offs
Tensor Processing Units (TPUs) are highly specialized accelerators optimized for tensor operations in deep neural networks, but this focus imposes significant constraints on general-purpose computing. Unlike CPUs or GPUs, TPUs lack features such as caches, branch prediction, out-of-order execution, and multithreading, and they support only a narrow set of data types and operations, primarily dense tensor arithmetic in precisions such as bfloat16.[6] This design makes them unsuitable for workloads involving frequent branching, numerous element-wise operations, high-precision arithmetic, or custom operations outside standard tensor flows, and it requires models to be compiled into a static graph of supported tensor primitives.[17] As a result, TPUs cannot execute arbitrary code and are ineffective for non-AI tasks, limiting their applicability to environments where models can be fully expressed as matrix multiplications and activations.[6]

Programmability poses another key challenge, as TPUs rely on domain-specific frameworks like TensorFlow or JAX, which compile code via the XLA (Accelerated Linear Algebra) just-in-time compiler to generate TPU-compatible instructions.[17] This process demands static tensor shapes and restricts dynamic behaviors, often requiring substantial refactoring for compatibility, in contrast to the more flexible CUDA ecosystem for GPUs, which supports broader low-level control and general-purpose GPU computing.[17] The TPU's CISC instruction set is deliberately limited to a small repertoire—such as MatrixMultiply and Send/Receive—optimized for systolic array execution but hindering ease of development and adoption among developers accustomed to versatile programming models.[6] Consequently, while TPUs excel in structured AI pipelines, their framework dependency can slow prototyping and integration compared to GPU alternatives.
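The compilation constraint described above can be observed with JAX on any backend, since the same XLA model underlies Cloud TPU execution; the sketch below is a generic illustration rather than TPU-specific code.

```python
# jax.jit traces a function into a static computation graph and hands it to XLA,
# so each new input *shape* triggers a fresh compilation. This is why dynamic
# shapes and data-dependent control flow often need refactoring for TPUs.
import jax
import jax.numpy as jnp

@jax.jit
def dense_layer(x, w, b):
    # A tensor-only computation that XLA can lower to matrix/vector units.
    return jax.nn.relu(x @ w + b)

w = jnp.ones((128, 64))
b = jnp.zeros((64,))
y1 = dense_layer(jnp.ones((8, 128)), w, b)    # compiles an executable for batch size 8
y2 = dense_layer(jnp.ones((8, 128)), w, b)    # same shapes: reuses the cached executable
y3 = dense_layer(jnp.ones((32, 128)), w, b)   # new shape: triggers recompilation
print(y1.shape, y3.shape)
```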
In terms of scalability, TPUs face hurdles in small-scale deployments because the architecture emphasizes parallelism across large clusters, or "pods." Individual TPU nodes perform suboptimally with small batch sizes or models, as the runtime automatically shards batches across multiple cores (e.g., eight in a v3 device), diluting efficiency and increasing padding overhead.[18] High setup costs further exacerbate this for modest workloads: provisioning even a single TPU node incurs hourly charges (approximately $1.46 for a v5e in certain regions), while optimal performance requires multi-node slices or pods, which demand significant infrastructure commitment and quota approvals, making TPUs less viable for edge or low-volume applications without specialized variants.[19] This contrasts with more granular, on-demand GPU options better suited to variable or small-scale needs.

Thermal and cost trade-offs are additional considerations in TPU deployment. The high power density of TPU systolic arrays—delivering up to 92 teraops per second in early generations—generates substantial heat, necessitating advanced cooling solutions such as liquid cooling to mitigate hotspots and prevent reliability degradation, particularly under sustained loads in dense pods.[20] Initial investments are elevated due to custom ASIC fabrication and cloud provisioning minimums, with pod-scale setups costing thousands of dollars per hour, though these amortize effectively in large-scale AI training, where TPUs achieve 15–30× higher performance and 30–80× better performance per watt than contemporary CPUs and GPUs.[5] For hyperscale operations, such as those in Google services, this yields substantial long-term savings in energy and operational expenses, but smaller organizations may find the upfront barriers and poor energy proportionality at low utilization (e.g., 88% of full power drawn at 10% load) prohibitive.[6]

Development History
Inception at Google
The development of the Tensor Processing Unit (TPU) began in 2013 at Google, when a cross-functional team led by hardware engineer Norman Jouppi initiated a project to design custom silicon for accelerating deep neural network inference. The proposal emerged from internal discussions recognizing that the rapid growth in AI workloads demanded specialized hardware to sustain scalability without proportionally expanding datacenter infrastructure.[6] The primary motivation was a 2013 projection indicating that if Google users engaged in voice search for just three minutes per day using speech recognition deep neural networks (DNNs), the company would need to double its datacenter compute capacity—an increase deemed unaffordable with existing CPUs and GPUs.[6]

At the time, Google was already facing challenges in provisioning sufficient compute for expanding AI applications, including the neural networks integral to services like Search and Translate, where software optimizations alone proved insufficient to handle the projected scale. Jouppi, a veteran in processor design with prior work on the MIPS architecture, advocated for the shift to application-specific integrated circuits (ASICs), arguing that custom hardware could deliver the necessary performance and energy efficiency gains.[5] By 2015, the team had developed initial TPU prototypes, which underwent internal testing focused on accelerating speech recognition inference to validate their efficacy on real-world workloads.[6] These early efforts marked a pivotal transition from reliance on general-purpose processors and software tweaks to purpose-built accelerators tailored for tensor operations in DNNs, setting the foundation for broader AI hardware innovation at Google.

Key Research Milestones
The development of Tensor Processing Units (TPUs) began with the internal deployment of the first-generation TPU (v1) in 2015, optimized exclusively for inference tasks in Google's data centers to accelerate neural network computations for services like search and translation.[2] This marked a pivotal shift toward custom hardware tailored for machine learning workloads, addressing the limitations of general-purpose processors at scale.[21]

In 2016, Google publicly revealed the TPU at its I/O developer conference, highlighting its role in powering internal AI applications and providing the first high-level details of its systolic array design for efficient matrix multiplications.[22] This announcement spurred interest in application-specific integrated circuits (ASICs) for AI, though the inaugural technical paper detailing the v1's architecture and performance—titled "In-Datacenter Performance Analysis of a Tensor Processing Unit"—followed in 2017, quantifying its 15–30x speedup over contemporary CPUs and GPUs for inference.[23][6]

The second-generation TPU (v2) was introduced in 2017, expanding capabilities to include model training alongside inference through support for bfloat16 floating-point operations and delivering 180 teraflops of bfloat16 compute per four-chip board. This advancement enabled end-to-end deep learning workflows on custom silicon, with v2 chips integrated into Google Cloud for broader accessibility.[24]

Between 2018 and 2020, Google released the third- (v3) and fourth-generation (v4) TPUs, incorporating innovations like liquid cooling for higher sustained throughput in v3 pods and dedicated SparseCores for accelerating sparse matrix operations in v4, which reduced computational overhead for models with inherent sparsity, such as the embeddings used in natural language processing and recommendation systems.[25][26] These releases scaled TPU interconnects to support larger clusters, with v4 achieving over 2x performance per chip over v3 while improving energy efficiency by 2.7x.[27]

In December 2023, the v5p variant advanced pod-scale training for massive models, enabling configurations of up to 8,960 chips that delivered exaFLOPS-scale compute—approximately 4 exaFLOPS in BF16 precision—for distributed training of large language models like Gemini.[28] This generation emphasized hyper-scale interconnects and doubled high-bandwidth memory per chip, facilitating breakthroughs in foundation model development.[29]

In 2024, Google announced Trillium, the sixth-generation TPU (v6e), prioritizing energy efficiency with a 67% improvement over v5e while delivering over 4x training performance and 3x inference throughput gains, tailored for sustainable scaling of generative AI workloads.[30] By 2025, the seventh-generation Ironwood TPU was unveiled, optimized for inference-dominant, reasoning-heavy AI applications, offering more than 4x per-chip performance over Trillium and up to 42.5 exaFLOPS in FP8 precision across pods of 9,216 chips to handle real-time, agentic AI at unprecedented scale.[3][4]

Integration into Google Services
Tensor Processing Units (TPUs) have been integral to accelerating machine learning workloads across Google's core services since their initial deployment in 2015. Internally, TPUs power key features such as RankBrain in Google Search, which enhances search result relevancy by processing neural network inferences efficiently.[16] They also support recommendation systems in YouTube, where TPU v5e platforms serve personalized content on the homepage and Watch Next to billions of users daily, delivering high throughput for large-scale inference.[15] In Google Photos, TPUs enable rapid image analysis and vision models, processing millions of photos per day to support features like object recognition and search.[9]

A notable early impact was in Google Translate, where TPUs were deployed in 2016 to handle inference for neural machine translation, achieving a 99th-percentile prediction latency of approximately 7 milliseconds for consistent user responsiveness.[5] This deployment significantly reduced translation latency compared to prior CPU-based systems, enabling real-time processing for over a billion daily requests at later scales.[31]

In 2018, Google made TPUs available as a managed service on Google Cloud Platform (GCP) for external developers and enterprises, initially in beta for TensorFlow-based machine learning training and inference.[2] This Cloud TPU offering allows users to scale AI workloads without custom hardware, supporting applications from model training to serving predictions at datacenter scale.[1] To broaden accessibility, Google also integrated TPUs into collaborative platforms such as Google Colab, which provides free and paid runtime options for prototyping ML models with TPU acceleration. Through Vertex AI on GCP, TPUs enable end-to-end workflows for training, tuning, and deploying models, with support for frameworks like TensorFlow, JAX, and PyTorch, making advanced AI tools available to partners and customers worldwide.
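As a small illustration of how such a runtime is typically used, the hedged JAX snippet below lists the accelerators visible to a program; on a Colab or Cloud TPU VM it reports TPU cores, and elsewhere it simply falls back to CPU or GPU (the core count mentioned in the comment is an example, not a guarantee).

```python
# Generic device-discovery sketch; behaves the same on CPU, GPU, or TPU backends.
import jax

devices = jax.devices()
print(f"{len(devices)} device(s):", [d.platform for d in devices])
if any(d.platform == "tpu" for d in devices):
    # For example, a v3-8 slice typically exposes 8 TPU cores to the program.
    print("TPU runtime detected")
else:
    print("No TPU attached; running on", devices[0].platform)
```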
Cloud TPU Generations
First Generation (v1)
The first-generation Tensor Processing Unit (TPU v1) was developed by Google as a custom ASIC designed specifically to accelerate inference workloads for neural networks. It was initially deployed internally within Google's data centers in 2015, marking the company's shift toward specialized hardware for machine learning tasks. The chip was publicly announced in May 2016 at the Google I/O conference, highlighting its role in production-scale AI inference.[2][5]

TPU v1 is a single-chip design fabricated on a 28 nm process node, operating at 700 MHz and consuming 40 W of power. It features a 256 × 256 systolic array of processing elements optimized for matrix multiplications, delivering a peak performance of 92 tera-operations per second (TOPS) for 8-bit integer operations. The architecture supports only 8-bit integer data types for both weights and activations, enabling efficient quantized inference without floating-point support. The design pairs the matrix multiply unit with vector and activation units, but it is dedicated exclusively to inference and lacks capabilities for model training.[5][6]

Upon deployment, TPU v1 was integrated into Google's production services to handle real-time inference for applications such as speech recognition, neural machine translation, and image search in Google Photos. For instance, it powered the inference backend for models like Inception, achieving up to 15–30 times higher performance than contemporary CPUs and 30–80 times better performance per watt. A key innovation was the use of a weight-stationary systolic array, which minimized data movement and maximized throughput for the dense matrix operations central to neural network inference.[6][5]
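Because v1 executes only 8-bit integer arithmetic, floating-point models must first be quantized. The following NumPy sketch shows one simple symmetric quantization scheme and an int8 matrix multiply that accumulates into 32-bit integers before rescaling; it illustrates the general idea rather than Google's exact production pipeline.

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Map float values onto 8-bit integers with a single symmetric scale."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return q, scale

def int8_matmul(xq, wq, x_scale, w_scale):
    # 8-bit multiplies accumulate into wide 32-bit integers and are
    # rescaled back to floating point only at the end.
    acc = xq.astype(np.int32) @ wq.astype(np.int32)
    return acc.astype(np.float32) * (x_scale * w_scale)

rng = np.random.default_rng(2)
x = rng.standard_normal((4, 16)).astype(np.float32)
w = rng.standard_normal((16, 8)).astype(np.float32)
xq, xs = quantize_int8(x)
wq, ws = quantize_int8(w)
err = np.abs(int8_matmul(xq, wq, xs, ws) - x @ w).max()
print(f"max abs error vs float32 matmul: {err:.3f}")
```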
Second Generation (v2)
The second-generation Tensor Processing Unit, known as TPU v2, was announced by Google in May 2017 as an advancement over the inference-only first generation, introducing full support for neural network training workloads.[32] This generation marked a significant shift by incorporating hardware optimizations for the forward and backward passes of deep learning models, including backpropagation and gradient computations essential for optimizing parameters via stochastic gradient descent. By enabling these operations natively on the systolic array architecture, TPU v2 reduced the need for host CPU involvement in core computations, improving efficiency for large-scale training.

Architecturally, each TPU v2 chip features two cores and provides 45 teraflops (TFLOPS) of bfloat16 (BF16) compute for the matrix multiplications central to training, with 32-bit floating-point (FP32) accumulation used to preserve numerical accuracy. The Cloud TPU v2 board integrates four such chips with 64 gigabytes (GB) of high-bandwidth memory (HBM), delivering a peak of 180 TFLOPS in BF16 for mixed-precision training, where BF16 handles most computations while FP32 accumulators mitigate numerical instability.[33] This design prioritized high throughput for tensor operations while maintaining compatibility with TensorFlow's automatic differentiation for gradient calculations.[34]

TPU v2 pioneered scalable distributed training through the first TPU pods of 256 chips, interconnected via a custom two-dimensional torus network with high-speed links supporting up to 400 gigabits per second (Gbps) of inter-chip bandwidth.[33] These pods facilitated synchronous data-parallel training across hundreds of devices, enabling efficient all-reduce operations for gradient aggregation in large models—a capability that was previously limited in scale on earlier hardware. For instance, TPU v2 accelerated the training phase of AlphaZero, DeepMind's reinforcement learning system that mastered chess, shogi, and Go: 64 TPU v2 devices trained the neural networks while self-play games were generated on first-generation TPUs, achieving superhuman performance in hours of wall-clock time. This deployment highlighted TPU v2's role in enabling breakthroughs in compute-intensive AI research.
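The BF16-multiply/FP32-accumulate scheme can be expressed explicitly in JAX, which exposes the accumulator type through lax.dot_general. The snippet below is a hedged sketch of that pattern and runs on any backend, not only TPU v2-class hardware.

```python
# Mixed-precision matmul sketch: bfloat16 inputs, float32 accumulation.
import jax
import jax.numpy as jnp

def mixed_precision_matmul(a, b):
    a16 = a.astype(jnp.bfloat16)
    b16 = b.astype(jnp.bfloat16)
    # dimension_numbers: contract the last axis of a with the first axis of b (a plain matmul).
    return jax.lax.dot_general(
        a16, b16,
        dimension_numbers=(((1,), (0,)), ((), ())),
        preferred_element_type=jnp.float32,  # request an FP32 accumulator/output
    )

a = jax.random.normal(jax.random.PRNGKey(0), (64, 256))
b = jax.random.normal(jax.random.PRNGKey(1), (256, 32))
out = mixed_precision_matmul(a, b)
print(out.dtype, out.shape)  # float32 (64, 32)
```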
Third Generation (v3)
The third-generation Tensor Processing Unit (TPU v3) was announced by Google in May 2018 at the Google I/O developer conference, marking a significant advancement in AI accelerator hardware for large-scale machine learning workloads.[25] This generation built upon the training foundations established in TPU v2 by enhancing scalability and thermal management to support increasingly complex neural network models.[35] Manufactured on a 16 nm FinFET process, TPU v3 chips delivered improved power efficiency and density compared to prior iterations, enabling deployment in high-performance computing environments.[36]

Each TPU v3 chip provides a peak compute performance of 123 teraflops (TFLOPS) in bfloat16 (BF16) precision, optimized for the matrix multiplications central to deep learning. Equipped with 32 GiB of high-bandwidth memory (HBM2) per chip and a memory bandwidth of 900 GB/s, these chips address the memory bottlenecks of training large models by facilitating rapid data access during computations. A key hardware refinement in TPU v3 is its liquid-cooling system, the first such implementation in Google's data centers, which allows sustained high performance by effectively dissipating the heat generated during intensive operations—chips consume between 123 W minimum and 262 W maximum power.[37] This cooling approach reduces the physical footprint of servers while supporting denser packing, contributing to overall efficiency gains for AI training.[38]

TPU v3 introduced innovations in interconnect technology, featuring a high-speed 2D torus topology that enables seamless communication across large clusters. Pods scale to 1,024 interconnected chips, providing an aggregate peak performance of 126 petaFLOPS in BF16 and 32 TB of total HBM2 memory, with all-reduce bandwidth reaching 340 TB/s per pod. This enhanced interconnect facilitates efficient data movement in distributed training scenarios, minimizing latency and maximizing throughput for massive models.[39] In practice, TPU v3 pods were instrumental in training large language models, including BERT variants, where they accelerated pre-training on vast datasets by leveraging the system's collective compute and memory resources.[40]

Fourth Generation (v4)
The fourth-generation Tensor Processing Unit, designated TPU v4, was announced by Google in May 2021 and became available on Google Cloud shortly thereafter.[41] Fabricated on a 7 nm process node, it delivers a peak performance of 275 teraflops in BF16 precision per chip, enabling efficient handling of the matrix multiplications central to neural network training and inference. This generation prioritizes advancements in sparsity acceleration and power efficiency, making it particularly suited to diverse AI tasks including large-scale recommendation systems and transformer-based models.[42]

A major innovation in TPU v4 is the SparseCore unit, a dedicated dataflow processor that accelerates sparse operations for embeddings and other irregular computations common in recommendation workloads. It supports 2:4 structured sparsity, where two non-zero values are permitted in every group of four elements, effectively doubling throughput for compatible sparse models by skipping zero computations without significant load imbalance.[42] This sparsity handling provides up to 3x faster performance on recommendation models compared to TPU v3, while also reducing memory bandwidth demands and improving overall system utilization for real-world AI applications.[43]

TPU v4 achieves substantial power efficiency gains, with a 2.7x improvement in performance per watt over its predecessor, translating to approximately 70% lower energy use per operation through optimized systolic array designs and reduced voltage scaling on the 7 nm node.[43] These efficiencies are amplified in pod-scale deployments, where 4,096 chips form exaflop-class supercomputers interconnected via optical circuit switching (OCS).[42] The OCS enables dynamic reconfiguration of the 3D torus topology, enhancing modularity and fault tolerance for large AI training jobs while minimizing communication overhead.[43]

In Google Cloud, TPU v4 powers recommendation systems by leveraging its sparsity optimizations to process vast embedding tables efficiently, supporting services like personalized content delivery at scale.[44] This deployment extends the pod concepts of TPU v3 by incorporating OCS for greater flexibility in interconnecting thousands of chips without fixed wiring constraints.[42] Overall, these features position TPU v4 as a versatile accelerator for energy-conscious AI infrastructure.
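As a generic illustration of the 2:4 pattern described above (not TPU-specific code), the NumPy sketch below zeroes the two smallest-magnitude weights in every group of four, producing the structured layout that sparsity-aware hardware can exploit by skipping the zeros.

```python
import numpy as np

def prune_2_4(w: np.ndarray) -> np.ndarray:
    """Generic 2:4 structured pruning: in every group of four consecutive
    weights along the last axis, keep the two largest magnitudes and zero
    the rest."""
    assert w.shape[-1] % 4 == 0
    groups = w.reshape(-1, 4)
    # indices of the two smallest-magnitude entries in each group of four
    drop = np.argsort(np.abs(groups), axis=1)[:, :2]
    pruned = groups.copy()
    np.put_along_axis(pruned, drop, 0.0, axis=1)
    return pruned.reshape(w.shape)

w = np.random.default_rng(3).standard_normal((2, 8))
wp = prune_2_4(w)
print(wp)                                                   # exactly two non-zeros per group of four
print("kept fraction:", np.count_nonzero(wp) / wp.size)     # 0.5
```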
Fifth Generation (v5p)
The fifth-generation Tensor Processing Unit, designated v5p, was released by Google in December 2023 as a high-performance accelerator optimized for large-scale AI training within the Cloud TPU ecosystem.[28] Each v5p chip provides 459 teraFLOPS of peak bfloat16 compute, paired with 95 GB of HBM2e memory delivering 2,765 GB/s of bandwidth, enabling efficient handling of memory-intensive neural network operations.[45] In pod configuration, v5p scales to 8,960 interconnected chips arranged in a 3D torus topology, yielding approximately 4.1 exaFLOPS of aggregate peak performance for massive distributed training workloads.[45]

Key innovations in v5p center on enhanced interconnectivity and architectural flexibility to support exascale AI systems. The inter-chip interconnect (ICI) bandwidth reaches 4,800 Gbps per chip, a significant upgrade that allows 4x greater total FLOPS scalability per pod compared to the v4 generation, facilitating seamless data flow across thousands of devices.[28] Building on v4's sparsity support, v5p incorporates four SparseCores per chip to accelerate the sparse matrix computations common in modern models, while introducing ICI resiliency for fault-tolerant operation in large slices.[45] Additionally, v5p provides native optimizations for mixture-of-experts (MoE) models, enabling dynamic expert routing and reduced activation overhead in expansive architectures.[46]

v5p is primarily deployed for training large language models at unprecedented scale, delivering up to 2.8x faster training throughput than v4 for dense LLMs and supporting the development of advanced systems akin to PaLM 2.[28] This generation powers Google's AI Hypercomputer infrastructure, where pods enable end-to-end orchestration for generative AI tasks, emphasizing performance and adaptability over prior iterations.[28]

Sixth Generation (Trillium)
The sixth-generation Tensor Processing Unit, known as Trillium or TPU v6e, was announced by Google in May 2024 and became generally available in Google Cloud in December 2024.[30][46] This generation emphasizes cost-effective inference for generative AI workloads while maintaining strong training capabilities, delivering over 4x improvement in training performance and up to 3x higher inference throughput compared to the previous TPU v5e.[47] Trillium achieves a peak compute performance of 918 TFLOPS per chip in BF16 precision, representing a 4.7x increase over the v5e's 197 TFLOPS.

Key innovations in Trillium include doubled high-bandwidth memory capacity, at 32 GB per chip, and doubled inter-chip interconnect bandwidth, enabling efficient handling of long-context and multimodal models.[30] It supports low-precision formats such as FP8, which enhances inference efficiency for generative AI tasks by reducing computational overhead without significant accuracy loss. Additionally, Trillium is over 67% more energy-efficient than TPU v5e, contributing to lower operational costs—up to 2.1x better performance per dollar—and supporting sustainable AI scaling.[46]

Trillium is deployed in Google Cloud's AI Hypercomputer infrastructure, where it powers training and serving of models like Gemini, enabling faster development of foundation models with reduced latency.[30] Configurations scale up to pods of 256 chips, providing high-bandwidth, low-latency interconnects for large-scale distributed computing.[47] This design prioritizes broad workload support, making it suitable for both dense large language models and inference-heavy applications in production environments.[48]

Seventh Generation (Ironwood)
The seventh-generation Tensor Processing Unit, known as Ironwood, was announced by Google on April 9, 2025, as its most advanced AI accelerator, optimized for inference-intensive workloads in the generative AI era.[3] Google announced general availability on November 6, 2025, with deployment ongoing as of that month, enabling scalable use for real-time AI applications such as chatbots and agentic systems.[49] This generation prioritizes low-latency inference while maintaining training capabilities, addressing the shift toward "thinking" AI models that require efficient, high-throughput processing at scale.[3]

Ironwood achieves up to 10 times the peak performance of the fifth-generation TPU v5p and more than four times the per-chip performance of the sixth-generation Trillium for both training and inference tasks.[49] Each Ironwood chip delivers 4,614 TFLOPS of dense FP8 performance, supported by 192 GB of high-bandwidth memory (HBM) with up to 7.4 TB/s of bandwidth.[50] Superpods scale to 9,216 interconnected chips, providing 42.5 exaFLOPS of FP8 compute to handle massive inference demands, surpassing the capabilities of many supercomputers.[3] Ironwood is the first TPU to natively support FP8 precision, enhancing inference efficiency over previous generations that emulated it, while also optimizing bfloat16 operations for the reduced-precision needs of generative models.[3]

Key innovations in Ironwood focus on power efficiency and latency reduction, making it suitable for real-time generative AI applications through architectural advances in the matrix multiply units and interconnects.[3] Fabricated on an advanced semiconductor process node—specific details pending full technical disclosure—the chip emphasizes energy efficiency to support sustainable large-scale AI deployments.[51]

In deployment, Ironwood integrates deeply with Google Cloud services, powering inference at scale through partnerships such as the expanded collaboration with Anthropic, which plans to access up to one million TPUs for training and serving its Claude family of AI models.[52] This enables cost-effective, high-performance inference for enterprise and research workloads, building on Trillium's efficiency foundations.[49]
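The headline pod figure is consistent with the per-chip numbers; the short calculation below reproduces it and also derives the corresponding pod-level HBM capacity (the memory total is derived here from the per-chip figure, not quoted above).

```python
# Sanity check of Ironwood's quoted pod-scale figures from the per-chip numbers.
chips_per_pod = 9216
fp8_tflops_per_chip = 4614
hbm_gb_per_chip = 192

pod_exaflops = chips_per_pod * fp8_tflops_per_chip / 1e6   # TFLOPS -> exaFLOPS
pod_hbm_pb = chips_per_pod * hbm_gb_per_chip / 1e6         # GB -> PB (decimal, derived)
print(f"{pod_exaflops:.1f} exaFLOPS of FP8 compute per pod")   # ~42.5, as quoted
print(f"~{pod_hbm_pb:.2f} PB of HBM per pod (derived)")
```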
Edge and Consumer Variants
Edge TPU
The Edge TPU is a compact application-specific integrated circuit (ASIC) designed by Google for executing machine learning inference on resource-constrained edge devices, such as those in IoT and embedded systems. Released in early 2019 as a core component of the Coral platform, it enables developers to deploy AI models locally without relying on cloud connectivity, prioritizing low power consumption and real-time processing.

The Edge TPU achieves a peak performance of 4 tera-operations per second (TOPS) using 8-bit integer (INT8) precision while drawing approximately 2 watts of power, for an efficiency of 2 TOPS per watt. It incorporates 8 MB of on-chip static random-access memory (SRAM) to hold model weights and activations, minimizing data-movement overhead. Available in compact form factors including USB accelerators and M.2 modules, the design facilitates straightforward integration into devices like single-board computers or custom hardware. The architecture features a scaled-down systolic array similar to that in Google's cloud TPUs, optimized for low-latency execution of quantized neural networks, and it exclusively supports models compiled for it via the TensorFlow Lite framework.[53][54]

Common applications of the Edge TPU include smart cameras for real-time object detection and image classification, as well as edge analytics in manufacturing settings for predictive maintenance and quality control. For instance, it powers vision-based systems that process video streams on-device to identify defects or anomalies without transmitting sensitive data to the cloud.[55] In October 2025, Google announced the Coral NPU as the next-generation evolution of the Edge TPU platform, integrating advanced AI acceleration into system-on-chips for even lower-power, always-on applications in wearables and IoT devices, though specific performance metrics for the new hardware were not detailed at launch.[56]
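Deployment on the Edge TPU follows the Coral-documented TensorFlow Lite flow: a model is quantized, compiled with the Edge TPU compiler, and then executed through a delegate at runtime. The sketch below follows that published pattern but is only indicative: the package, the delegate library name (libedgetpu.so.1 on Linux), and the model path are assumptions that depend on the installed runtime, the operating system, and an Edge TPU being attached.

```python
# Hedged sketch of running an Edge TPU-compiled TensorFlow Lite model via the
# Coral runtime. "model_edgetpu.tflite" is a placeholder for a model produced
# by the Edge TPU compiler; this will not run without the Coral runtime and hardware.
import numpy as np
from tflite_runtime.interpreter import Interpreter, load_delegate

interpreter = Interpreter(
    model_path="model_edgetpu.tflite",
    experimental_delegates=[load_delegate("libedgetpu.so.1")],  # Linux delegate library
)
interpreter.allocate_tensors()

inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]
# Edge TPU models are fully integer-quantized, so inputs are int8/uint8 tensors.
interpreter.set_tensor(inp["index"], np.zeros(inp["shape"], dtype=inp["dtype"]))
interpreter.invoke()
print(interpreter.get_tensor(out["index"]).shape)
```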
Pixel Neural Core
The Pixel Neural Core is a dedicated neural processing unit (NPU) integrated into Google's Pixel 4 smartphones, released in October 2019. This hardware accelerator, based on an instantiation of the Edge TPU architecture, enables efficient on-device machine learning inference for privacy-sensitive tasks without relying on cloud processing.[57][58]

Optimized for mobile AI workloads, the Pixel Neural Core delivers approximately 4 TOPS of performance while maintaining minimal power draw to preserve battery life. It supports key features such as secure face unlock, using dedicated arithmetic logic units for facial recognition algorithms, and advanced photo processing, including real-time previews for effects like Night Sight and Portrait Mode. Additionally, it handles neural audio processing for enhanced voice recognition in Google Assistant interactions.[59][58][60]

The Pixel Neural Core served as a precursor to broader mobile TPU integration, evolving into the custom TPU embedded within the Google Tensor SoC starting with the Pixel 6 series in 2021. This advancement expanded capabilities to include on-device support for real-time language translation via Live Translate, leveraging optimized neural models for audio and vision tasks while maintaining the low power consumption required for always-on smartphone use.[61][62]

Google Tensor SoC
The Google Tensor system-on-chip (SoC) represents Google's integration of Tensor Processing Unit (TPU) technology into mobile consumer devices, primarily powering the Pixel smartphone lineup since 2021. This custom ARM-based SoC combines a TPU for on-device machine learning acceleration with CPU and GPU components optimized for AI-driven tasks such as image processing and natural language understanding, enabling features that run efficiently without cloud dependency. Unlike dedicated cloud TPUs, the Tensor's TPU variant is tailored for low-power edge computing in smartphones, evolving from the earlier Pixel Neural Core design to support more advanced generative AI workloads.[63]

The first-generation Tensor G1, released in the Pixel 6 series in October 2021, marked the debut of this architecture on a Samsung 5 nm process node. It features an octa-core CPU with two Cortex-X1 cores at up to 2.8 GHz, two Cortex-A76 cores at 2.25 GHz, and four Cortex-A55 cores at 1.8 GHz, paired with a Mali-G78 MP20 GPU and an integrated Edge TPU for AI tasks. The TPU in the G1 provides foundational on-device ML capabilities, handling operations like real-time translation and photo enhancement at around 4 TOPS of performance, balancing power efficiency with everyday computing demands. The second-generation Tensor G2 in the Pixel 7 series (October 2022), also on Samsung's 5 nm node, upgraded to two Cortex-X1 cores at 2.85 GHz, two Cortex-A78 at 2.35 GHz, and four Cortex-A55 at 1.8 GHz, with a Mali-G710 MP7 GPU and a TPU up to 60% faster for camera and speech processing.[64][65]

By the third-generation Tensor G3 in the Pixel 8 series (October 2023), manufactured on Samsung's 4 nm process, the SoC shifted toward an enhanced AI focus with a single Cortex-X3 core at 2.91 GHz, four Cortex-A715 at 2.37 GHz, four Cortex-A510 at 1.7 GHz, and an Immortalis-G715 MP7 GPU. Its custom TPU (codenamed Rio) enables on-device generative AI, such as Gemini Nano, with improved efficiency over previous generations. The fourth-generation Tensor G4, introduced in the Pixel 9 series (August 2024) on the same 4 nm node, refines the CPU to a Cortex-X4 at 3.1 GHz, three Cortex-A720 at 2.6 GHz, and four Cortex-A520 at 1.92 GHz, retaining the Rio TPU for consistent AI performance while improving thermal management for sustained gaming and photography workloads. Most recently, the fifth-generation Tensor G5 in the Pixel 10 series (August 2025), fabricated on TSMC's 3 nm N3E process, boosts the CPU with a Cortex-X4 at 3.78 GHz, five Cortex-A725 at 3.05 GHz, and two Cortex-A520 at 2.25 GHz, alongside a fourth-generation TPU offering up to 60% greater AI processing power than the G4, supporting more complex offline large language models (LLMs), including advanced multimodal capabilities with Gemini Nano.[66][67]

Key innovations in the Tensor SoC series center on on-device generative AI, facilitated by the integrated TPU. For instance, the G3 and later generations power features like Magic Editor in Google Photos, which uses diffusion models to seamlessly edit photo elements—such as repositioning subjects or altering skies—entirely on-device for privacy and speed, without requiring internet connectivity. The TPU also enables offline LLM support, starting with Gemini Nano on the Pixel 8 Pro, allowing local processing of multimodal queries for tasks like real-time captioning or smart replies, with the G5 further enhancing this through optimized quantization for larger models.
These advancements prioritize power-efficient AI in consumer scenarios, such as computational photography, where the TPU collaborates with the ISP for features like Best Take, which intelligently swaps faces across burst shots.[68][69][67]

In terms of performance, the Tensor SoC emphasizes balanced integration of its TPU with CPU and GPU elements to handle mixed workloads, rather than raw speed in benchmarks. For example, the G5 achieves 34% average CPU gains and 30% power savings over the G4, enabling prolonged gaming sessions on titles like Genshin Impact while the TPU offloads AI-enhanced graphics upscaling. In photography, the TPU's role in real-time object detection and noise reduction contributes to superior low-light performance, as seen in Pixel devices outperforming competitors in computational imaging tests by leveraging on-device inference for faster shutter response. This holistic design supports consumer use cases like augmented reality filters and voice assistants, where the TPU's efficiency keeps latency for AI inferences under 100 ms. Manufacturing transitions, from Samsung's nodes for G1–G4 to TSMC's 3 nm for G5, have improved yield and power density, allowing a larger TPU allocation without compromising battery life in compact form factors.[70][71][72]

| Generation | Release (Pixel Series) | Process Node | CPU Configuration | GPU | TPU Performance | Key AI Features |
|---|---|---|---|---|---|---|
| G1 | Oct 2021 (Pixel 6) | Samsung 5 nm | 2x X1 @2.8GHz, 2x A76 @2.25GHz, 4x A55 @1.8GHz | Mali-G78 MP20 | ~4 TOPS | Face Unlock, Live Translate |
| G2 | Oct 2022 (Pixel 7) | Samsung 5 nm | 2x X1 @2.85GHz, 2x A78 @2.35GHz, 4x A55 @1.8GHz | Mali-G710 MP7 | Up to 60% faster than G1 | Clear Calling, Photo Unblur |
| G3 | Oct 2023 (Pixel 8) | Samsung 4 nm | 1x X3 @2.91GHz, 4x A715 @2.37GHz, 4x A510 @1.7GHz | Immortalis-G715 MP7 | Enhanced (Rio) for generative AI | Magic Editor, Gemini Nano offline |
| G4 | Aug 2024 (Pixel 9) | Samsung 4 nm | 1x X4 @3.1GHz, 3x A720 @2.6GHz, 4x A520 @1.92GHz | Immortalis-G715 MP7 | Retained Rio TPU | Enhanced Best Take, on-device summarization |
| G5 | Aug 2025 (Pixel 10) | TSMC 3 nm | 1x X4 @3.78GHz, 5x A725 @3.05GHz, 2x A520 @2.25GHz | Immortalis-G715 (upgraded) | 60% > G4 | Advanced offline LLMs, Magic Cue |