GPU virtualization
GPU virtualization is a computing technique that enables the partitioning and sharing of a single physical graphics processing unit (GPU) among multiple virtual machines (VMs) within a virtualized environment, allowing each VM to access a portion of the GPU's computational resources as if it had dedicated hardware. This approach addresses the challenges of GPU underutilization in cloud and data center settings by providing isolation, scalability, and efficient resource allocation for parallel processing workloads.[1][2]
The development of GPU virtualization emerged in the mid-2000s alongside the rise of general-purpose GPU (GPGPU) computing, with early research focusing on enabling GPU acceleration in virtualized systems to support high-performance computing (HPC) applications. Initial efforts, such as those documented in 2008, introduced taxonomies for virtualization strategies, including API remoting—where GPU commands are forwarded over a network to a remote physical GPU—and direct device pass-through, which assigns an entire GPU to a single VM for near-native performance but limits sharing. By the early 2010s, mediated pass-through techniques like gVirt (introduced in 2014) advanced the field by combining device emulation in software with hardware isolation to support multiple VMs per GPU while running native drivers inside VMs.[3][4][2]
Major implementations have been driven by industry leaders, with NVIDIA's virtual GPU (vGPU) software, first released around 2013 as part of the GRID platform, providing hardware-accelerated sharing for virtual desktops, workstations, and applications across hypervisors like VMware vSphere and KVM. NVIDIA vGPU profiles allocate specific fractions of GPU memory and compute cores to VMs, supporting use cases in AI training, graphics-intensive design, and virtual desktop infrastructure (VDI). Similarly, AMD's MxGPU and partition-based virtualization, integrated into products like the Versal AI Edge Series, divide GPU shaders into isolated slices and partitions, ensuring secure multi-VM access via hardware arbiters and memory management units for embedded and edge computing scenarios. These solutions emphasize time-slicing, spatial partitioning, and fine-grained scheduling to balance performance and fairness.[5][1][6]
Key benefits of GPU virtualization include cost reduction through hardware consolidation, enhanced security via VM isolation to prevent interference, and improved scalability for cloud providers offering GPU-accelerated instances. However, challenges persist, such as overhead from context switching, ensuring equitable resource distribution among VMs, and maintaining low-latency performance for real-time graphics. Ongoing research explores hybrid approaches, including hardware-assisted virtualization compliant with standards like PCI-SIG SR-IOV, to further optimize for emerging demands in machine learning and remote rendering. As of 2025, NVIDIA has released vGPU versions 18 and 19, enhancing support for AI workloads including LLM fine-tuning in virtualized environments.[2][1][6][7]
Introduction
Definition and Principles
GPU virtualization is the process of abstracting a physical graphics processing unit (GPU) to enable multiple virtual machines (VMs) or containers to share its resources for graphics rendering or general-purpose computing tasks, without providing direct hardware access to any individual instance.[8] This abstraction allows a single physical GPU to be partitioned into multiple virtual GPUs (vGPUs), each appearing as a dedicated device to the guest environment, thereby facilitating efficient resource allocation in shared computing setups.[8]
The core principles of GPU virtualization revolve around resource isolation, scheduling, and balancing performance trade-offs. Resource isolation ensures that workloads from different VMs do not interfere with each other, maintaining security and stability by preventing unauthorized access to shared hardware components like memory and processing cores.[8] Scheduling mechanisms, such as time-slicing (where the GPU alternates execution between vGPUs over short intervals) or spatial partitioning (dividing the GPU into concurrent sub-units), manage access to optimize utilization while minimizing latency.[8] These principles involve inherent trade-offs, where sharing efficiency gains come at the cost of overhead compared to native GPU performance, typically resulting in a 3-10% reduction in throughput depending on the workload and virtualization technique.[9]
In the basic architecture, a host-level GPU driver acts as a mediator, intercepting and routing commands from guest VMs to the physical hardware while enforcing isolation policies.[8] Guest environments interact with virtual GPUs through paravirtualized interfaces (requiring guest driver modifications for awareness of the virtualization layer) or fully virtualized interfaces (emulating a complete GPU without guest changes), enabling seamless integration with hypervisors like KVM or VMware.[8]
GPUs play essential roles in both graphics rendering and general-purpose computing, necessitating virtualization to address resource inefficiencies. For graphics, APIs such as OpenGL—a cross-platform standard for high-performance 2D and 3D rendering—and DirectX—Microsoft's suite for hardware-accelerated 2D/3D graphics in multimedia applications—rely on GPUs to process vertex transformations, shading, and rasterization for real-time visuals in games and simulations.[10][11] In general-purpose GPU (GPGPU) computing, frameworks like CUDA (NVIDIA's parallel computing platform) and OpenCL (an open standard for heterogeneous parallel programming) offload compute-intensive tasks such as machine learning and scientific simulations to GPU cores for massive parallelism.[12][13] Virtualization becomes crucial in multi-tenant environments like cloud data centers, where GPUs often remain underutilized due to bursty workloads, leading to inefficient resource pooling without sharing mechanisms.[14]
The benefits of GPU virtualization include significant cost savings through resource pooling, allowing multiple tenants to share expensive hardware, and improved scalability for diverse workloads in cloud infrastructures.[8] However, it introduces limitations such as mediation overhead from command interception and context switching, which can degrade performance, and potential security risks like side-channel attacks or data leakage in shared memory spaces between VMs.[8][15]
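As a simple illustration of this abstraction, a vGPU presented to a guest enumerates as an ordinary PCI graphics device and is managed by a regular guest driver. A minimal check from inside a Linux guest might look like the following sketch (device addresses and reported values are illustrative):

    # Inside a Linux guest: the virtual GPU appears as a normal PCI display device
    lspci -nn | grep -iE 'vga|3d|display'
    # On an NVIDIA vGPU guest with the guest driver installed, nvidia-smi reports only
    # the frame buffer assigned by the host-side vGPU profile, not the whole physical GPU
    nvidia-smi --query-gpu=name,memory.total --format=csv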
Historical Development
The concept of GPU virtualization emerged in the late 2000s, driven by the need to enable graphics acceleration within virtual machines for improved performance in hosted environments. In 2008, researchers at VMware introduced a foundational approach through their paper on hosted I/O architecture, which proposed strategies for sharing GPU resources among multiple virtual machines, focusing initially on graphics workloads to overcome the limitations of software-only rendering. This work laid the groundwork for a taxonomy of GPU virtualization techniques, emphasizing the challenges of direct hardware access in virtualized settings.[16]
The 2010s marked significant commercialization and technical advancements, spurred by virtual desktop infrastructure (VDI) demands. NVIDIA launched GRID in 2012, the industry's first cloud-based GPU solution, enabling workstation-class graphics delivery to remote users across various devices and paving the way for scalable VDI deployments. Concurrently, the Linux kernel introduced mediated device support in 2016, facilitating secure GPU sharing via frameworks like VFIO-mdev, which allowed hypervisors to partition GPU resources without full passthrough.[17][18] AMD followed in 2016 with SR-IOV support in its Radeon Instinct accelerators, introducing MxGPU technology that conformed to the Single Root I/O Virtualization standard for multi-user GPU partitioning. A pivotal academic contribution came from the USENIX ATC 2014 paper on gVirt, which detailed a full GPU virtualization solution using mediated pass-through, enabling native drivers in guest VMs while supporting both graphics and compute workloads. The rise of general-purpose GPU (GPGPU) computing, ignited by NVIDIA's CUDA platform in 2006, further accelerated virtualization efforts, with demands intensifying around the 2017 AI boom as deep learning applications required efficient GPU resource allocation in shared environments.[4][19]
Entering the 2020s, GPU virtualization evolved to address AI and machine learning scalability, shifting focus from primarily VDI to high-performance computing in cloud and containerized setups. NVIDIA introduced Multi-Instance GPU (MIG) with the A100 Tensor Core GPU in 2020, allowing a single GPU to be partitioned into up to seven isolated instances for guaranteed resource allocation and enhanced utilization in multi-tenant environments. Integration with container orchestration advanced in 2021, as Kubernetes gained robust GPU sharing capabilities through device plugins and operators supporting MIG and time-slicing, enabling efficient workload distribution in cloud-native AI pipelines. By 2025, NVIDIA released vGPU software version 18.0, adding support for Windows Server 2025 and AI-optimized VDI, facilitating seamless Linux workloads via Windows Subsystem for Linux and broadening virtualization for generative AI applications.[20][7][21]
These developments were propelled by the transition from VDI-centric use cases to AI/ML imperatives, where efficient GPU sharing became critical for cost-effective scaling. The data center GPU market, encompassing virtualization technologies, grew from $18.4 billion in 2024 to a projected $92 billion by 2030, reflecting surging demand for virtualized compute in hyperscale environments.[22]
Virtualization Techniques
API Remoting
API remoting is a software-based technique for GPU virtualization that enables multiple virtual machines (VMs) to share a physical GPU without requiring specialized hardware support. In this approach, API calls from graphics or compute applications running in a guest VM—such as those to OpenGL for rendering or CUDA for parallel computing—are intercepted by a proxy driver or middleware in the guest. These calls are then serialized into a data stream and forwarded to the host system, either through inter-process communication (IPC) for local virtualization or over a network for remote execution. On the host, the calls are deserialized, executed on the physical GPU using the native driver, and the results are returned to the guest VM in a similar manner. This method abstracts the GPU hardware, allowing transparent access while the host retains full control over the device.[23][24]
Prominent implementations of API remoting include rCUDA, VirGL, and gVirtuS, each targeting specific APIs and use cases. rCUDA, introduced around 2010, focuses on remote CUDA execution, enabling GPU-accelerated applications in HPC clusters to offload computations to distant accelerators via network forwarding, thereby reducing the need for local GPUs in every node.[25] VirGL, developed as part of the Mesa 3D graphics library, provides OpenGL acceleration in QEMU-based VMs by translating guest OpenGL calls to host-side rendering through a virtual 3D GPU interface, supporting desktop and lightweight graphics workloads.[26] gVirtuS, originating from a 2010 framework for cloud-based GPGPU, offers general-purpose API forwarding for CUDA and other libraries, facilitating transparent virtualization across heterogeneous environments like ARM clusters accessing x86-hosted GPUs.[27]
The primary advantages of API remoting lie in its low hardware requirements and flexibility for distributed systems, as it requires no modifications to the GPU itself and supports dynamic resource sharing among VMs. It is particularly well suited for high-performance computing (HPC) environments where compute locality is less critical than resource efficiency, such as in multi-node setups integrated with Message Passing Interface (MPI) for parallel workloads like AI training or scientific simulations. However, drawbacks include significant latency from serialization, deserialization, and transmission—often requiring sub-20 μs round-trip times to limit overhead to under 5% in inference tasks—which can result in 20-50% performance degradation for bandwidth-intensive or latency-sensitive graphics applications due to data transfer overheads.[24][28]
From a security perspective, API remoting enhances isolation by confining guest access to mediated API interactions rather than direct hardware control, thereby reducing risks of GPU side-channel attacks or VM escapes that could arise in pass-through scenarios. This software-mediated approach ensures that sensitive data remains within VM boundaries, with the host enforcing access policies, though careful implementation is needed to prevent implicit resource contention.[23][29]
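As a concrete sketch of the forwarding model, an rCUDA-style deployment redirects a guest's CUDA calls to a GPU server elsewhere on the network by configuring the client library. The environment variable names below follow rCUDA's documented conventions but should be treated as illustrative, and the server name and binary name are placeholders:

    # Guest side: declare one remote GPU, served by the machine that owns the physical device
    export RCUDA_DEVICE_COUNT=1
    export RCUDA_DEVICE_0=gpuserver.example.com:0   # placeholder host and GPU index
    # Run an unmodified CUDA binary linked against the rCUDA client library; the library
    # intercepts CUDA calls, serializes them, and forwards them to the server for execution
    ./vector_add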
Device Emulation
Device emulation in GPU virtualization refers to the hypervisor's full software simulation of GPU hardware, presenting a virtual graphics device to the guest operating system that mimics physical GPU behavior without any involvement of actual hardware. The hypervisor intercepts and emulates key GPU components, including registers for configuration and control, memory mappings for framebuffers and textures, and command submission queues where the guest issues rendering instructions. These operations are handled entirely in software by the hypervisor's device models, ensuring isolation and compatibility while processing I/O traps from the guest. This technique is foundational in emulators like QEMU, where it enables basic graphics support in virtualized environments devoid of dedicated GPUs.[30][31]
A key example is QEMU's virtio-gpu device model, which implements a paravirtualized GPU interface for both 2D and limited 3D acceleration. The guest OS loads a virtio-compatible driver that communicates with the hypervisor via a standardized ring buffer, submitting graphics commands that QEMU emulates using CPU-based backends. For 3D workloads, virtio-gpu integrates with software renderers like LLVMpipe in the Mesa 3D graphics library, which translates OpenGL calls into multithreaded CPU instructions for rasterization, vertex processing, and shading without hardware acceleration. LLVMpipe leverages LLVM for just-in-time code generation, supporting up to 32 CPU cores for parallel execution, but remains constrained to basic OpenGL features.[31][32][33]
The advantages of device emulation include its independence from physical GPUs, providing broad compatibility across host hardware and allowing virtualization on standard servers or even CPU-only systems. It ensures strong isolation since no real hardware is shared, making it suitable for secure or resource-constrained deployments. However, performance drawbacks are significant: software-based rendering imposes heavy CPU overhead, limiting throughput to basic tasks and making complex 3D scenes impractically slow, often with frame rates below 30 FPS even on multi-core hosts for simple low-resolution workloads. This makes it viable only for lightweight graphics, such as desktop icons, text rendering, and simple UI elements, while failing for demanding applications like gaming or GPGPU compute due to the absence of parallel hardware execution. Unlike API remoting techniques, which can proxy compute operations to physical GPUs, device emulation cannot support high-performance GPGPU effectively.[31][32][34]
Technically, paravirtualized drivers in the guest enhance efficiency by reducing trap frequency compared to fully emulated legacy devices like VGA; the driver batches commands into virtqueues for the hypervisor to process, emulating responses for register reads/writes and memory operations. This handles straightforward workloads proficiently—such as 2D compositing in desktop environments—but bottlenecks arise in shader-heavy or texture-intensive scenarios, where CPU simulation of GPU pipelines leads to orders-of-magnitude slowdowns relative to native hardware.
GPGPU emulation is especially limited, as the model focuses on graphics APIs rather than parallel compute kernels.[33][31]
Over time, device emulation has benefited from integrations like the SPICE protocol, which enhances remote display by efficiently transporting emulated graphics output from the hypervisor to clients, supporting features such as dynamic resolution adjustment and multi-monitor setups without hardware dependencies. Where remote access was initially limited to frame-based protocols like VNC, SPICE's adoption in QEMU improved latency and bandwidth for software-rendered content, and the approach persists mainly as a fallback for hosts without GPU resources, having been largely supplanted by hardware-accelerated methods in production environments.[35][36]
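A minimal QEMU invocation illustrates the fully software path described above: a paravirtualized GPU combined with a SPICE remote display, with no physical GPU involved. The flags are standard options on recent QEMU versions, while the disk image path is a placeholder:

    # Boot a guest with a paravirtualized virtio GPU and a SPICE display; all rendering
    # is performed on the host CPU (e.g., via LLVMpipe inside the guest)
    qemu-system-x86_64 \
        -enable-kvm -m 4096 -smp 4 \
        -device virtio-gpu-pci \
        -spice port=5900,disable-ticketing=on \
        -drive file=guest.qcow2,format=qcow2   # placeholder disk image
    # Inside the guest, glxinfo -B should report a software renderer such as llvmpipe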
Fixed Pass-Through
Fixed pass-through, also known as direct device assignment or PCI passthrough, dedicates an entire physical GPU to a single virtual machine (VM) by assigning the hardware directly to the guest, allowing it to operate as if it were native hardware. This technique leverages frameworks like VFIO in Linux to bind the GPU device to the VM, bypassing the hypervisor's intervention in device operations.[37] The guest operating system interacts with the GPU through standard drivers, perceiving it as a physical device without emulation overhead.[38]
To implement fixed pass-through, an Input-Output Memory Management Unit (IOMMU), such as Intel VT-d or AMD-Vi, must be enabled in the host BIOS to provide address translation, DMA isolation, and interrupt remapping, ensuring the assigned GPU cannot access unauthorized host memory.[37] The setup involves unbinding the GPU from the host's native driver (e.g., via sysfs in Linux) and rebinding it to a VFIO driver like vfio-pci, which creates an IOMMU-protected container for the device.[37] In hypervisors such as KVM/QEMU, the GPU is then attached to the VM configuration, typically using commands or XML descriptors to specify the PCI device ID, allowing the guest to load its own vendor-specific drivers upon boot.[38]
This approach delivers near-native performance, often achieving 98-100% of bare-metal GPU efficiency in workloads like CUDA and OpenCL benchmarks across hypervisors including KVM.[39] It provides full access to GPU features, including compute capabilities and direct memory access, making it suitable for latency-sensitive applications. However, it lacks resource sharing, requiring one GPU per VM and leaving the device idle when the VM is powered off or inactive.[39] The configuration process is complex, demanding precise hardware compatibility and manual intervention for binding and isolation.[37]
Fixed pass-through is commonly employed in gaming VMs for high-fidelity rendering and single-tenant AI training environments where dedicated hardware maximizes throughput.[38] It also supports multi-GPU configurations, enabling passthrough of multiple devices to a single VM for scaled workloads. A key limitation is the dependency on one GPU per VM, which can lead to underutilization in multi-tenant setups, and challenges in error recovery; if the VM crashes, the GPU may enter an unresponsive state requiring host-level resets, as direct access prevents the hypervisor from managing device state.[40] This has prompted evolutions toward mediated techniques for safer sharing.[38]
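The host-side preparation outlined above can be sketched for a Linux/KVM host as follows; the PCI address is a placeholder for an actual GPU, and the exact steps vary by distribution:

    # 1. Enable the IOMMU via kernel parameters and reboot (Intel example): intel_iommu=on iommu=pt
    # 2. Unbind the GPU (placeholder address 0000:01:00.0) from its native driver and hand it to vfio-pci
    modprobe vfio-pci
    echo 0000:01:00.0 > /sys/bus/pci/devices/0000:01:00.0/driver/unbind
    echo vfio-pci > /sys/bus/pci/devices/0000:01:00.0/driver_override
    echo 0000:01:00.0 > /sys/bus/pci/drivers_probe
    # 3. Attach the device to a VM, e.g. on the QEMU command line:
    #    qemu-system-x86_64 ... -device vfio-pci,host=01:00.0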
Mediated Pass-Through
Mediated pass-through is a GPU virtualization technique that allows multiple virtual machines (VMs) to share a single physical GPU through kernel-level software mediation, providing each VM with a virtual GPU (vGPU) device while maintaining high performance and isolation. This method relies on the Linux mediated device (mdev) framework, which enables the creation of virtual devices backed by the physical GPU hardware. The hypervisor then schedules access to the GPU among the vGPUs using time-slicing mechanisms or lightweight approximations of Single Root I/O Virtualization (SR-IOV), ensuring fair resource allocation without dedicating the entire GPU to one VM.[41][4]
In practice, the mdev framework registers virtual device types with the VFIO (Virtual Function I/O) subsystem, allowing user-space tools to instantiate vGPUs as mediated devices. For instance, NVIDIA's vGPU driver integrates with this framework to generate mediated devices supporting configurable profiles, such as dividing a 16 GB GPU into up to 16 slices of 1 GB each, tailored to workload needs like graphics rendering or compute tasks. These vGPUs appear as PCI devices to the VMs, enabling direct driver access while the host kernel mediates command submissions and resource contention.[42][43]
The technique offers a balance between multi-tenancy and efficiency, supporting up to 32 VMs per GPU in fine-grained profiles, with performance reaching 80-95% of native execution for GPU-intensive workloads, depending on the sharing ratio and application. However, it introduces overhead from GPU context switching—typically 5-20%—and requires proprietary licensing for commercial implementations like NVIDIA vGPU. Additional technical aspects include memory pinning to prevent page faults during VM execution and error containment to limit the impact of faults to individual vGPUs rather than the host or other VMs.[4][42][44]
From a security perspective, mediated pass-through enhances isolation by leveraging Input-Output Memory Management Units (IOMMUs) to restrict DMA operations, preventing malicious VMs from accessing unauthorized memory regions on the host or peers. This mediation layer also confines GPU faults, such as invalid commands or resource exhaustion, to the affected VM, reducing the risk of denial-of-service across the system. Unlike fixed pass-through, which assigns the full GPU to a single VM, this approach enables secure sharing through scheduled, mediated access.[45]
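On a host whose vGPU driver registers mediated device types, the available profiles can be inspected through sysfs and instances created with mdevctl. The sketch below uses placeholder values; the PCI address and the profile name depend on the installed vendor driver:

    # List the mediated device types exposed by the GPU's host driver (placeholder PCI address)
    ls /sys/class/mdev_bus/0000:41:00.0/mdev_supported_types/
    # Create a vGPU instance of a chosen type (the type name here is illustrative)
    UUID=$(uuidgen)
    mdevctl start -u "$UUID" -p 0000:41:00.0 --type nvidia-222
    mdevctl list   # show running mediated devices
    # The resulting device can then be assigned to a VM, e.g. via a libvirt hostdev entry of type 'mdev'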
Hardware-Assisted Partitioning
Hardware-assisted partitioning leverages specialized GPU hardware features to divide a single physical GPU into multiple isolated sub-partitions or virtual functions, enabling direct assignment to virtual machines (VMs) with minimal software intervention. This approach primarily utilizes Single Root I/O Virtualization (SR-IOV), a PCIe standard that allows a physical function (PF) on the GPU to create multiple lightweight virtual functions (VFs), each appearing as an independent PCIe device assignable to separate VMs for direct I/O access. Complementing SR-IOV, proprietary technologies like NVIDIA's Multi-Instance GPU (MIG) further partition the GPU into isolated instances, allocating dedicated slices of compute cores, memory, and cache enforced at the hardware level to ensure resource exclusivity and security.
Under SR-IOV, the PCIe specification supports up to 256 VFs per PF, though GPU implementations typically limit this to 8–64 VFs depending on the device to balance resource granularity and overhead. Each VF provides near-direct access to GPU resources without hypervisor mediation, bypassing traditional software virtualization layers for reduced latency. In MIG, partitioning divides the GPU's streaming multiprocessors (SMs), high-bandwidth memory (HBM), and L2 cache into configurable slices—such as 1/7th or 1/3rd of total resources—with hardware mechanisms like memory protection units and fault isolation domains preventing cross-instance interference or data leakage.
Prominent examples include NVIDIA's A100 and H100 GPUs, which introduced MIG in 2020 and support up to seven isolated instances per GPU, each with independent compute (e.g., 10–40 SMs) and memory (e.g., 5–40 GB HBM) allocations tailored for data center workloads. AMD's Instinct MI-series accelerators, such as the MI25 and later models, employ SR-IOV via their MxGPU technology to generate up to 16 VFs, enabling fine-grained sharing of compute and memory resources across VMs. Intel has enabled SR-IOV on select discrete GPUs such as the Arc Pro series (e.g., the B50 and B60, introduced in 2025), allowing a card to be partitioned into multiple virtual GPUs for isolated graphics acceleration.[46]
This method delivers near-native performance, often exceeding 95% of bare-metal throughput per partition due to hardware-level resource dedication and minimal overhead, while providing strong isolation comparable to physical device passthrough. However, adoption is constrained by hardware availability—only specific high-end GPUs support these features—and partition sizes are fixed at configuration time, limiting dynamic resizing without rebooting the system. In 2025, hardware-assisted partitioning has evolved for AI applications through integration with confidential computing, where features like NVIDIA's Hopper and Blackwell GPU enclaves enable secure, attested execution of sensitive models in isolated partitions, protecting against host or multi-tenant threats during inference and training. AMD is also advancing GPU confidential computing capabilities on Instinct accelerators.[47][48]
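On Linux hosts, SR-IOV virtual functions are created through the standard PCI sysfs interface once the vendor's physical-function driver is loaded. The PCI address and VF count below are placeholders; the supported maximum depends on the GPU:

    # Query how many VFs the physical function supports (placeholder address)
    cat /sys/bus/pci/devices/0000:83:00.0/sriov_totalvfs
    # Create 8 virtual functions; each enumerates as its own PCI device assignable to a VM
    echo 8 > /sys/bus/pci/devices/0000:83:00.0/sriov_numvfs
    # List the newly created virtual functions
    ls -l /sys/bus/pci/devices/0000:83:00.0/ | grep virtfn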
Vendor-Specific Implementations
NVIDIA
NVIDIA's GPU virtualization ecosystem centers on its proprietary vGPU software, formerly known as GRID, which entered private beta in 2013 and enables multiple virtual machines to share a single physical GPU through time-slicing or hardware partitioning techniques.[49] This platform supports a range of data center GPUs, including Tesla, RTX, and A-series models, providing direct access to NVIDIA's graphics and compute capabilities in virtualized environments for applications like virtual desktops, professional visualization, and AI workloads.[1] By leveraging mediated pass-through, vGPU allows efficient resource allocation while maintaining isolation between VMs.[50]
Key features of NVIDIA vGPU include flexible profiles for VM sizing, such as the A40-8Q profile that assigns 8 GB of frame buffer to support medium-intensity graphics tasks.[51] The vGPU 18.0 release in 2025 extends compatibility to Windows Server 2025 as a guest OS, introduces AI-optimized VDI for generative AI applications, and incorporates confidential vGPU capabilities to enhance data privacy in multi-tenant setups.[7] These advancements prioritize secure, high-performance virtualization tailored for enterprise AI and remote work scenarios.
Hardware integration in NVIDIA vGPU utilizes SR-IOV on A100 and subsequent GPUs to enable virtual functions with full IOMMU protection, reducing overhead and improving VM isolation.[52] Complementing this, Multi-Instance GPU (MIG) partitioning divides a GPU like the A100 into up to seven independent instances, each with dedicated compute, memory, and bandwidth for assignment to VMs.[53] Licensing options are segmented by use case: vApps and vPC for virtual desktop infrastructure (VDI), vWS for professional visualization, and vCS for compute workloads supporting CUDA acceleration.[54] Performance scaling allows up to 32 vGPUs per physical GPU on models like the A40, optimizing density for large-scale deployments while preserving CUDA and multi-instance support for machine learning tasks.[55] The broader ecosystem integrates with NVIDIA GPU Cloud (NGC) for deploying pre-built AI containers on vGPU instances and with Kubernetes via the NVIDIA GPU Operator, facilitating orchestrated GPU sharing in containerized environments.[56][57]
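On MIG-capable hardware such as the A100, host-side partitioning can be driven with standard nvidia-smi MIG subcommands; the profile ID below is illustrative and the available profiles depend on the GPU model:

    # Enable MIG mode on GPU 0 (takes effect once the GPU is idle or after a reset)
    nvidia-smi -i 0 -mig 1
    # List the GPU instance profiles this GPU supports
    nvidia-smi mig -lgip
    # Create a GPU instance (profile ID 9 is illustrative) along with its default compute instance
    nvidia-smi mig -cgi 9 -C
    # Show the resulting MIG devices visible to workloads and hypervisors
    nvidia-smi -L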
AMD
AMD's GPU virtualization technology, known as MxGPU, was introduced in 2016 as the industry's first hardware-virtualized GPU solution, leveraging the Single Root I/O Virtualization (SR-IOV) standard to enable secure and efficient sharing of GPU resources among multiple virtual machines (VMs).[58] MxGPU partitions the physical GPU into virtual functions (VFs), each appearing as an independent device on the PCIe bus, allowing up to 16 vGPUs per physical GPU on supported models such as the Radeon Instinct MI25 and MI50 accelerators.[59] This spatial partitioning approach dedicates fixed slices of GPU resources—like compute units, memory, and engines—to each VF, eliminating the need for time-slicing and providing predictable quality of service without software-mediated overhead.[60]
Key features of MxGPU include direct hardware access for VMs, which supports both graphics and compute workloads through integration with APIs such as OpenGL and Vulkan, as well as AMD's ROCm open-source platform for high-performance computing (HPC) and AI applications.[59] Unlike time-sharing methods, MxGPU relies on SR-IOV for isolation and resource allocation, ensuring each vGPU receives dedicated hardware slices for enhanced security and minimal contention.[61] The technology supports fine-grained partitioning modes, such as SPX (single partition) with 1 VF or CPX (core partitioned) with 8 VFs, optimized for specific workloads like AI training on Instinct GPUs.[62]
Hardware support for MxGPU spans the MI-series Instinct accelerators, with SR-IOV enabling up to 16 VFs on models like MI25 and MI50, and broader capabilities on newer architectures.[59] As of 2025, updates for AI-focused workloads incorporate the CDNA architecture in Instinct MI350X and MI355X GPUs, which maintain MxGPU compatibility while delivering enhanced tensor core performance for machine learning tasks.[59] These GPUs, paired with AMD EPYC processors in server environments, facilitate scalable virtualization for data centers.
MxGPU achieves near-native performance for both graphics rendering and compute operations, with VMs accessing GPU resources directly via VFs to minimize latency and maximize throughput in virtualized setups.[61] The ecosystem relies on the open-source amdgpu driver stack, including a physical function (PF) driver for the host and virtual function (VF) drivers for guests, alongside ROCm for compute acceleration and AMD SMI tools for management.[63] This open-source emphasis promotes broad compatibility across hypervisors like KVM/QEMU, distinguishing MxGPU in enterprise deployments.
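Because MxGPU exposes each vGPU as a standard SR-IOV virtual function, the partitioning is visible with ordinary PCI tooling on the host once the physical-function driver (such as AMD's open-source GIM module) has been configured; the PCI address below is a placeholder:

    # List AMD GPU devices (PCI vendor ID 1002); virtual functions appear as additional devices
    lspci -d 1002: -nn
    # Inspect the virtual functions exposed by the physical function (placeholder address)
    ls -l /sys/bus/pci/devices/0000:43:00.0/ | grep virtfn
    # Each virtfn symlink is an independent PCIe function that can be assigned to a separate VM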
Intel and Others
Intel's approach to GPU virtualization emphasizes integrated graphics processing units (iGPUs) and low-power discrete solutions, prioritizing efficient sharing for virtual desktop infrastructure (VDI) and media workloads. Introduced with 5th generation Intel Core processors (Broadwell), Intel Graphics Virtualization Technology for graphics (GVT-g) provides mediated pass-through capabilities, enabling the creation of up to seven virtual GPUs (vGPUs) from a single iGPU. This technology emulates full or partial GPU instances, allowing multiple virtual machines to access graphics acceleration while maintaining isolation through the VFIO mediated device framework. GVT-g supports platforms up to 10th generation Intel Core processors and is particularly effective for lightweight graphics tasks, including 3D rendering and display output.[64][65][66]
A key feature of GVT-g is its integration with Intel Quick Sync Video, which enables hardware-accelerated video encoding and decoding within virtualized environments, supporting codecs like H.264 and HEVC for applications such as video conferencing and streaming. For newer integrated GPUs, such as the Iris Xe in 11th generation Intel Core processors (Tiger Lake) and beyond, Intel shifted to Single Root I/O Virtualization (SR-IOV), which partitions the iGPU into up to seven virtual functions for direct assignment to VMs, reducing overhead compared to emulation-based approaches. This SR-IOV support extends to discrete GPUs in the Intel Arc Pro series, including the Battlemage (B-series) lineup, where it facilitates time-sliced or partitioned access for multi-tenant scenarios. Additionally, the Intel Data Center GPU Flex Series, introduced in 2022, builds on SR-IOV with enhanced partitioning for VDI and visual AI, allowing flexible resource allocation across up to 32 Xe cores and 4 media engines per GPU, depending on the model (e.g., Flex 170).[67][68][69][46][70]
Performance benchmarks for these Intel solutions show strong results for light VDI use cases, with virtualized workloads achieving over 85% of native iGPU performance for tasks like office applications and basic 3D modeling, though efficiency drops for compute-heavy GPGPU operations due to emulation or partitioning overhead.[71] Intel's Gaudi3 AI accelerators, entering broader availability in 2025, incorporate SR-IOV-like virtualization through PCI passthrough in KVM environments, enabling scalable AI training and inference in virtualized data centers while supporting open-source frameworks like PyTorch.[72][73]
Other vendors contribute niche solutions tailored to specific ecosystems. ARM's Mali GPUs, common in mobile and embedded systems, support virtualization via paravirtualization extensions in hypervisors like KVM, where a modified kernel driver and arbiter remap registers and route interrupts to enable secure GPU sharing across VMs without full hardware passthrough. In cloud platforms, AWS leverages the Nitro hypervisor for underlying isolation, combined with software techniques like time-slicing in Amazon EKS, to share GPUs across EC2 instances for inference workloads. Google Cloud's Tensor Processing Units (TPUs) integrate virtualization layers through TPU VMs, allowing direct SSH access to dedicated or multi-sliced accelerators for AI tasks, optimizing for high-throughput matrix operations in virtualized setups.[74][75][76][77]
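On a GVT-g-capable host (integrated graphics up to roughly the 10th generation Core platforms), vGPUs are created through the i915 driver's mediated device types in sysfs. The type name below is one commonly exposed by the driver but should be treated as illustrative, since the available types vary by platform:

    # Kernel parameters typically required for GVT-g (then reboot): i915.enable_gvt=1 intel_iommu=on
    # Show the vGPU types exposed by the iGPU (0000:00:02.0 is the usual iGPU address)
    ls /sys/bus/pci/devices/0000:00:02.0/mdev_supported_types/
    # Create a vGPU of an illustrative type by writing a UUID to its create node
    echo "$(uuidgen)" > /sys/bus/pci/devices/0000:00:02.0/mdev_supported_types/i915-GVTg_V5_4/create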
Hypervisor and Platform Support
KVM/QEMU
KVM (Kernel-based Virtual Machine) serves as the kernel module providing hardware-assisted virtualization acceleration, while QEMU acts as the user-space emulator and orchestrator for device handling in virtual machines, enabling GPU virtualization through paravirtualized interfaces, direct passthrough, and mediated devices.[78] This combination supports VFIO (Virtual Function I/O) for fixed PCI passthrough, allowing a physical GPU to be directly assigned to a guest VM with near-native performance, and mediated devices (mdev) for sharing a single GPU across multiple VMs via vendor-specific virtualization frameworks.[78][79]
Setup for GPU virtualization in KVM/QEMU typically begins with enabling IOMMU support in the host kernel (e.g., via intel_iommu=on for Intel or amd_iommu=on for AMD in the GRUB configuration) to facilitate secure device isolation. For emulated graphics, virtio-gpu is configured as the virtual display device using QEMU's -device virtio-gpu option, paired with the virglrenderer backend on the host for 3D acceleration in guests supporting OpenGL.[80] PCI passthrough for fixed assignment is managed through libvirt by editing the VM's XML configuration to include a <hostdev> element with type='pci' and the VFIO driver, requiring the GPU to be unbound from host drivers beforehand using tools like vfio-pci.[78] For mediated vGPUs, vendor drivers (e.g., NVIDIA GRID or AMD MxGPU) are installed on the host to create mdev instances via mdevctl, which are then attached to VMs as PCI-like devices in libvirt XML with type='mdev'.[79][78]
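As a sketch of the virtio-gpu path described above, a recent QEMU can pair the paravirtualized GPU with virglrenderer so that guest OpenGL calls are executed on the host GPU; the disk image and mdev UUID are placeholders:

    # Host-accelerated guest graphics with virtio-gpu and virglrenderer (requires host OpenGL support)
    qemu-system-x86_64 \
        -enable-kvm -m 8192 -smp 4 \
        -device virtio-gpu-gl \
        -display gtk,gl=on \
        -drive file=guest.qcow2,format=qcow2   # placeholder disk image
    # A mediated vGPU created with mdevctl can instead be attached directly:
    #   -device vfio-pci,sysfsdev=/sys/bus/mdev/devices/<uuid>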
Key features include seamless integration with libvirt for declarative VM configuration, allowing GPU devices to be specified in XML without direct QEMU command-line intervention, which simplifies management in tools like virt-manager. Multi-queue support in virtio devices enhances performance by distributing workloads across multiple vCPUs, reducing bottlenecks in I/O-intensive scenarios such as graphics rendering. As of October 2025, libvirt 11.8.0 and later versions provide support for NVIDIA Multi-Instance GPU (MIG) configurations through mediated devices, enabling fine-grained GPU resource allocation to VMs on compatible hardware like A100 or H100 GPUs.[81]
Limitations include the need to configure SR-IOV virtual functions manually, since VFs must be created and bound on the host before passthrough, and the fact that optimal performance is achieved primarily with Linux guests, which have better driver support for virtio and VFIO interfaces. Proxmox VE, a popular open-source virtualization platform, leverages KVM/QEMU for GPU sharing by supporting both PCI passthrough and mediated devices in its web-based interface, facilitating multi-VM GPU utilization in datacenter environments.[82][78]
VMware
VMware vSphere 7 and later versions integrate GPU virtualization through support for NVIDIA Virtual GPU (vGPU) software, which utilizes mediated devices to enable time-sliced sharing of NVIDIA GPUs among multiple virtual machines (VMs), and AMD MxGPU technology, which leverages SR-IOV for hardware-based partitioning of AMD GPUs.[83][60] This allows enterprise environments to deploy GPU-accelerated workloads efficiently on virtualized infrastructure. VMware Horizon, a virtual desktop infrastructure (VDI) solution, builds on vSphere to deliver remote access to these GPU-enabled VMs, optimizing for graphics-intensive applications such as design and simulation.[84]
To set up GPU virtualization in vSphere, administrators install the NVIDIA vGPU Manager on the ESXi host via the vSphere Client or command line, followed by configuring VM profiles that define the allocated GPU resources, such as frame buffer size and compute capabilities.[85] For AMD MxGPU, the process involves enabling SR-IOV in the server BIOS, running the MxGPU Setup Script to create virtual functions (VFs), and assigning these PCI passthrough devices to VMs through vSphere.[60] Fixed pass-through using SR-IOV is also supported for dedicated GPU allocation to individual VMs in both NVIDIA and AMD configurations.[86]
Key features include dynamic resource allocation via configurable vGPU profiles, which allow flexible partitioning of GPU memory and cores to match workload demands, and vMotion compatibility for live migration of GPU-enabled VMs between hosts without downtime, provided both hosts share compatible GPU configurations.[87] In 2025, NVIDIA vGPU 18.0 introduced AI extensions that enhance VDI support for machine learning tasks, including compatibility with Windows Subsystem for Linux on Windows Server 2025 and improved inference acceleration in virtualized environments.[7]
Performance is optimized for graphics and compute workloads, with NVIDIA vGPU enabling low-latency rendering in VDI scenarios and AMD MxGPU providing near-native throughput for professional visualization applications.[88][60] Configurations can support up to 16 vGPUs per physical GPU, depending on the hardware and profile selected, balancing density and performance for enterprise-scale deployments.[89][60]
Security for shared GPU environments is bolstered by vSphere Virtual Machine Encryption, which protects VM data at rest and in transit, including configurations with mediated pass-through devices where multiple VMs access the same physical GPU. Additionally, both NVIDIA vGPU and AMD MxGPU enforce isolation between virtual instances to prevent cross-VM interference, ensuring compliance in multi-tenant setups.[42][60]
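Host-side installation of the NVIDIA vGPU Manager on ESXi is commonly performed with VMware's esxcli tooling from the host shell; the bundle path below is a placeholder, and the exact command and component name vary by vGPU and ESXi release:

    # Place the host in maintenance mode, then install the vGPU Manager bundle (placeholder path)
    esxcli system maintenanceMode set --enable true
    esxcli software vib install -d /vmfs/volumes/datastore1/NVIDIA-vGPU-manager-bundle.zip
    # After rebooting the host, verify that the host driver detects the physical GPU
    nvidia-smi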
Microsoft Hyper-V
Microsoft Hyper-V provides GPU virtualization through two primary mechanisms: GPU Partitioning (GPU-P), introduced in Windows Server 2022, which enables sharing a single physical GPU among multiple virtual machines (VMs) by allocating dedicated fractions of its resources, and Discrete Device Assignment (DDA), which allows full passthrough of an entire GPU to a single VM for direct hardware access without hypervisor mediation.[90][91] GPU-P leverages hardware-assisted partitioning, similar to SR-IOV techniques, to create isolated virtual functions from the physical GPU, ensuring each VM receives a consistent slice of compute, memory, and encode/decode capabilities while maintaining security isolation.[90]
To set up GPU-P, administrators use PowerShell cmdlets on the Hyper-V host to enumerate supported GPUs, create partitions (e.g., dividing a GPU into four equal 25% slices), and assign them to VMs; this process requires compatible GPU drivers from vendors like NVIDIA or AMD installed on the host.[92][93] DDA setup involves dismounting the GPU from the host using PowerShell commands like Dismount-VMHostAssignableDevice, then assigning it to a VM via Add-VMAssignableDevice, followed by VM reconfiguration to recognize the device.[91] Both methods support NVIDIA and AMD GPUs, with NVIDIA's drivers enabling advanced features like vGPU profiles in partitioned modes.[94]
Key features of GPU-P include support for up to a vendor-defined maximum of partitions per GPU—often 4 or more depending on the hardware OEM configuration—and compatibility with SR-IOV for efficient resource virtualization, allowing VMs to access GPU resources as native PCIe devices.[92][90] Windows Server 2025 enhances this with live migration support for GPU-partitioned VMs and integration with NVIDIA vGPU software version 18.0, which provides optimized profiles for partitioned deployments on compatible hardware like the NVIDIA L4 or A40.[90][94]
GPU-P is suitable for inference and training tasks in cloud environments, though it may introduce minor overhead compared to full passthrough. However, for graphics-intensive applications like VDI, Hyper-V's partitioning is less flexible than specialized tools, as it prioritizes compute sharing over advanced rendering optimizations and lacks native multi-session desktop support without additional configuration.[90]