
Hardware performance counter

Hardware performance counters (HPCs), also referred to as performance monitoring counters (PMCs), are specialized on-chip registers within a processor's performance monitoring unit (PMU) designed to count instances of predefined hardware events during software execution, such as instructions retired, cache hits and misses, branch mispredictions, and memory accesses. These counters capture both architectural events, which are consistent across processor generations (e.g., total cycles or instructions executed), and microarchitectural events specific to implementation details (e.g., pipeline stalls or TLB refills). By providing granular, low-level metrics on resource utilization, HPCs enable developers and system analysts to identify performance bottlenecks, optimize code, and measure efficiency in areas like memory access and computational throughput. Introduced as part of performance monitoring units in processors from major vendors like Intel, AMD, and ARM, HPCs typically support a limited number of simultaneous counters—often 4 to 8 programmable ones per core, plus fixed-function counters—necessitating techniques like multiplexing to monitor a broader set of hundreds of available events over time. Operating systems and tools, such as Linux's perf, FreeBSD's hwpmc, or the Performance API (PAPI) interface, abstract access to these counters, allowing sampling modes (where counters trigger interrupts at intervals for stack traces) or direct counting for precise statistics. Beyond traditional profiling, HPCs have applications in workload characterization, power estimation, and even security analysis, such as detecting anomalous behaviors indicative of malware through deviations in execution patterns. However, challenges like non-determinism in measurements (due to factors such as interrupts or context switches) and overcounting on certain microarchitectures can affect accuracy, requiring calibration or statistical adjustments for reliable insights.

Fundamentals

Definition and Purpose

Hardware performance counters (HPCs), also known as performance monitoring counters (PMCs), are specialized registers integrated into modern central processing units (CPUs) that increment in response to the occurrence of predefined microarchitectural events during program execution. These events encompass low-level processor activities, allowing for precise tracking of internal operations without requiring software intervention. The performance monitoring unit (PMU), which houses these counters, serves as the overarching hardware component responsible for configuring and counting such events across the processor core. The primary purpose of HPCs is to enable low-overhead performance analysis and optimization in software development and system tuning. By capturing metrics like cycles per instruction (CPI), developers can identify bottlenecks, such as inefficient code paths or cache misses, with minimal perturbation to the application's behavior. This facilitates profiling, benchmarking, and the derivation of key performance indicators, including branch misprediction rates and cache efficiency, ultimately aiding in the refinement of algorithms and resource utilization. Typical events tracked by HPCs include the number of retired instructions, clock cycles elapsed, branch mispredictions, and cache hits or misses at various levels (e.g., L1 instruction or data cache refills). Additional examples encompass functional-unit utilization, memory access patterns, and pipeline stalls, providing insights into resource efficiency and potential areas for improvement. These counters support both fixed-purpose tracking (e.g., total cycles) and programmable selection of events, allowing flexibility in monitoring specific aspects of processor behavior. HPCs first appeared in commercial processors with the Intel Pentium in 1993, where initial implementations focused on basic event counting via dedicated pins and model-specific registers to support early performance monitoring and debugging.

Basic Components and Operation

Hardware performance counters are primarily composed of specialized registers within the processor's performance monitoring unit (PMU), including the counters themselves, which are typically programmable or fixed-purpose registers ranging from 40 to 64 bits wide to accumulate event counts without frequent overflow during typical workloads. Event select registers configure these counters by specifying which microarchitectural events—such as instruction executions, cache accesses, or pipeline stalls—to monitor, allowing flexibility in tracking diverse performance metrics. Control registers complement these by enabling or disabling counter operation, setting overflow thresholds, and managing global PMU states, such as freezing all counters, to minimize power overhead. In operation, counters initialize to zero at the start of a measurement session and increment atomically upon each occurrence of the configured event, providing a direct tally of hardware activity over time. To address the constraint of limited counter availability (often 4 to 8 per core), multiplexing techniques rotate event configurations mid-execution, apportioning measurement time across a larger set of events while approximating simultaneous counts through statistical sampling or time-slicing. Overflow detection occurs via hardware flags or programmable interrupts when a counter saturates its bit width, such as reaching 2^{64} in 64-bit implementations, enabling software to capture and reset values to prevent data loss. For threshold-based sampling, an interrupt triggers upon hitting a preset value in the control registers, facilitating low-overhead profiling by recording program state at regular event intervals rather than continuous polling. The fundamental counting model follows: \text{Counter value} = \sum \text{Event occurrences} over the measurement interval, where the summation aggregates discrete hardware events to yield metrics like total instructions retired or branch mispredictions.
However, inaccuracies can arise from clock skew in multi-core systems, where slight timing offsets between core clocks lead to desynchronization in aggregated measurements across threads or processes. Event attribution errors also occur in out-of-order execution environments, as speculative operations and instruction reordering complicate precise mapping of counted events back to originating instructions.
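To illustrate how a tool can extrapolate a count from a multiplexed counter, the sketch below scales a raw value by the fraction of time the counter was actually scheduled, mirroring the time_enabled/time_running scaling that Linux perf reports with multiplexed events (the function name is illustrative, not a real API):

```c
#include <stdint.h>

/* Estimate the full-interval count from a multiplexed counter:
 * scaled = raw * (time_enabled / time_running), where time_running
 * is how long this event actually occupied a hardware counter. */
static uint64_t scaled_count(uint64_t raw,
                             uint64_t time_enabled_ns,
                             uint64_t time_running_ns) {
    if (time_running_ns == 0)
        return 0;  /* event never scheduled; no basis for extrapolation */
    return (uint64_t)((double)raw *
                      (double)time_enabled_ns / (double)time_running_ns);
}
```

The estimate assumes the event rate was roughly uniform over the interval, which is why multiplexed counts are approximations rather than exact tallies.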

Historical Development

Origins in Early Processors

Hardware performance counters originated in the early 1990s as processors transitioned to superscalar designs and higher clock speeds, necessitating built-in mechanisms for low-overhead debugging and performance analysis of internal events such as pipeline stalls and cache misses. Intel introduced the first such counters in its Pentium processor in 1993, featuring two programmable special-purpose counters capable of monitoring 38 distinct hardware events, including instruction execution, branch predictions, and data cache accesses. These counters operated in parallel with the processor's execution logic, allowing real-time collection of statistics to identify bottlenecks amid the Pentium's dual-pipeline superscalar architecture and rising frequencies up to 66 MHz. Early adoption extended beyond Intel to RISC architectures in the 1990s. IBM's POWER architecture in the RS/6000 systems incorporated performance monitor facilities starting with the POWER2 implementation in 1993, enabling event-based tracing and basic counting of processor activities such as instruction throughput and memory operations. Similarly, Sun Microsystems introduced event counting in its SPARC processors during the mid-1990s with the UltraSPARC series in 1995, where performance control registers enabled tracking of metrics such as cache hits and floating-point operations to optimize workstation workloads. A key milestone came with Intel's P6 microarchitecture in the Pentium Pro processor in 1995, which expanded counter capabilities to include branch prediction events, reflecting the design's advanced 512-entry branch target buffer and adaptive prediction algorithm achieving approximately 90% accuracy. Limited to 2-4 counters per core, these features allowed measurement of mispredicted branches—incurring penalties of 11-15 cycles—facilitating tuning of the decoupled superscalar pipeline. Access was improved via the new non-privileged RDPMC instruction, reducing overhead compared to prior model-specific register reads.
Early designs faced significant challenges, including a lack of cross-architecture standardization, which hindered portability of profiling tools, and high access overhead due to privileged-mode requirements and the need for multiple runs to capture different events. Counters focused primarily on simple metrics like instructions executed and basic stalls, omitting complex interactions in emerging memory hierarchies. These limitations underscored the nascent stage of the technology, prioritizing core functionality over comprehensive event coverage.

Evolution in Modern Architectures

Since the early 2000s, hardware performance counters have evolved significantly to accommodate the shift toward multi-core processors, enabling per-core monitoring to capture granular events across multiple execution units. For instance, the Intel Core 2 Duo processor, released in 2006, introduced support for programmable performance counters that operate independently on each core, allowing developers to track metrics like instruction execution and cache accesses without interference from shared resources. This per-core capability addressed the limitations of single-core designs by providing scalable profiling in multithreaded workloads, marking a pivotal advancement in handling concurrency-driven performance analysis. To support power efficiency in denser multi-core systems, later architectures incorporated counters for power and thermal management. Intel's Sandy Bridge microarchitecture, launched in 2011, added performance monitoring events related to power consumption and thermal throttling, such as those tracking bandwidth limits during temperature-induced slowdowns, which helped optimize dynamic voltage and frequency scaling (DVFS) in power-constrained environments. These developments reflected broader trends in the multi-core era, where counters expanded to include events for shared components like last-level caches, enhancing overall system-level visibility amid rising core counts. In the post-2020 era, hardware performance counters have integrated with specialized accelerators to monitor emerging workloads, particularly in machine learning and high-performance computing. NVIDIA's GPUs, starting with the Ampere architecture in 2020, provide hardware counters accessible via tools like DCGM for tracking tensor core utilization, including metrics on matrix multiply-accumulate operations and occupancy rates, which quantify efficiency in inference and training.
Similarly, AMD's Zen 4 cores, introduced in 2022, feature enhanced performance monitoring for wide execution paths, offering precise event counts for micro-op dispatch and retirement to optimize throughput in vector-heavy applications. Standardization efforts have promoted interoperability across architectures, with ARM's Performance Monitoring Unit (PMU) extensions—defined in the ARMv8 architecture and refined in subsequent versions like ARMv8.1—providing a consistent interface for counter events such as cycle counts and branch predictions, now supporting up to 32 counters in high-end implementations. This partial alignment facilitates portable profiling tools, contrasting with vendor-specific extensions in x86. Advancements driven by semiconductor process scaling have enabled counters to monitor increasingly fine-grained events in sub-5nm processes, such as AVX instruction latencies in modern x86 CPUs, where dedicated events capture execution delays from vector unit pipelines to diagnose bottlenecks in SIMD-heavy code. Recent x86 processors, for example, support 4 programmable and 3 fixed counters per logical core, scaling to dozens in multi-core configurations to handle the complexity of advanced nodes without overwhelming overhead. These evolutions underscore a focus on precision and scalability, ensuring counters remain vital for tuning performance in power-efficient, heterogeneous systems.

Implementations Across Architectures

x86 and AMD64 Implementations

In x86 and AMD64 architectures, hardware performance counters are implemented through dedicated registers that enable detailed monitoring of processor events at both core and uncore levels. Intel processors, such as those in the Alder Lake generation released in 2021, provide 8 general-purpose performance monitoring counters (PMCs) per logical core, allowing for the tracking of a wide range of microarchitectural activities. Later generations, such as Arrow Lake (released in 2024), maintain similar core PMU configurations. These counters are configured using Model-Specific Registers (MSRs) like IA32_PERFEVTSEL0–IA32_PERFEVTSEL7 (MSRs 186H–18DH), which specify the event type, unit mask, privilege levels (user/OS), and additional qualifiers such as edge detection or inversion. Additionally, Intel supports uncore counters for monitoring non-core components, including last-level cache (LLC) events like LLC misses, which are accessed via separate MSRs such as those in the UNC_CBO_* family for C-Box performance monitoring. AMD implementations in the Zen architecture follow a similar model but include enhancements tailored to its chiplet-based design. In Zen 3 processors, released in 2020, each core features 6 programmable PMCs alongside 3 fixed-function counters, providing flexibility for event selection while dedicating fixed counters to common metrics like unhalted cycles and retired instructions. Zen 5 processors, released in 2024, retain this structure but add performance monitor counter (PMC) virtualization to enhance security against side-channel attacks in virtualized environments. Configuration occurs through MSRs such as PERF_CTLx (C001_0200H–C001_0205H) for programmable counters, with separate MSRs for the fixed-function counters. Zen architectures also incorporate Instruction-Based Sampling (IBS) extensions, which enable precise sampling of instruction fetches and micro-operations for detailed branch tracing and latency analysis, distinct from standard counter overflow mechanisms.
Both Intel and AMD support over 200 configurable performance events, encompassing categories such as memory hierarchy interactions and computational operations, though the exact sets vary by microarchitecture. Representative events include L1 data load misses (e.g., Intel's MEM_LOAD_UOPS_RETIRED.L1_MISS, EventSel D1H with UMask 08H) and L2 demand read misses (e.g., L2_RQSTS.DEMAND_DATA_RD_MISS, EventSel 24H with UMask 21H), as well as floating-point operations like retired scalar double-precision additions (e.g., FP_ARITH_INST_RETIRED.SCALAR_DOUBLE, EventSel C7H with UMask 01H). These events facilitate the computation of key performance metrics via ratios, such as instructions per cycle (IPC), calculated as: \text{IPC} = \frac{\text{Instructions Retired (e.g., INST_RETIRED.ANY)}}{\text{CPU Clocks Unhalted (e.g., CPU_CLK_UNHALTED.THREAD)}} where counter values are read from appropriate PMCs or fixed registers. For AMD Zen 3, analogous events like Retired Instructions (PMCx0C0) and Core Cycles Not Halted (PMCx0076) support similar derivations. Zen 5 introduces updated event sets documented for Family 1Ah models. Key differences between Intel and AMD implementations include sampling mechanisms and counter characteristics. Intel's Precise Event-Based Sampling (PEBS) allows for low-latency, hardware-assisted recording of event samples directly into a memory buffer, capturing precise instruction pointers and data addresses with minimal OS intervention, as configured via IA32_PEBS_ENABLE and related MSRs. In contrast, AMD's IBS provides comparable precision for fetch and execution sampling but stores data in MSRs that require explicit software reads, potentially introducing slight overhead in high-frequency scenarios. Furthermore, Intel counters have varied in width across generations—typically 48 bits, with full-width writes available on recent cores for overflow handling—while AMD maintains consistent 48-bit widths in recent Zen cores, with overflow status reported through dedicated MSRs.
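The IPC ratio above reduces to a division of two raw counter reads; a minimal sketch (function name is illustrative) that guards against a zero cycle count:

```c
#include <stdint.h>

/* IPC from two raw counter values: instructions retired
 * (e.g. INST_RETIRED.ANY) over unhalted cycles
 * (e.g. CPU_CLK_UNHALTED.THREAD). */
static double compute_ipc(uint64_t instructions_retired,
                          uint64_t unhalted_cycles) {
    if (unhalted_cycles == 0)
        return 0.0;  /* avoid division by zero on an empty interval */
    return (double)instructions_retired / (double)unhalted_cycles;
}
```

The same pattern applies to any ratio-derived metric, such as cache miss rates, by substituting the appropriate event pair.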

ARM and RISC-V Implementations

In ARM architectures starting from ARMv8 (introduced in 2011), the Performance Monitoring Unit (PMU) provides up to 31 programmable 64-bit event counters alongside a dedicated cycle counter for measuring processor activity. These counters support generic events such as instruction execution and cache references, as well as architecture-specific events including SIMD instruction usage and branch predictions, enabling detailed profiling of power-efficient mobile and embedded systems. Event selection and counter enabling are managed through the Performance Monitor Control Register (PMCR), which allows configuration of counter reset behavior, clock division, and privilege-level access controls. ARMv9 (announced in 2021, with first implementations in 2022) introduces the Realm Management Extension (RME) for confidential computing environments. RME supports PMU access in secure partitions through event attribution rules, ensuring counters can only record events attributable to specific partitions (e.g., Realm or Non-secure) while preserving isolation from the host OS and hypervisor. No dedicated events for RME operations are defined, but PMU filtering by security state and exception level is enhanced. In RISC-V architectures, performance monitoring relies on the base Zicntr and optional Zihpm extensions, which provide three fixed 64-bit counters (for cycles, retired instructions, and time) and up to 29 additional programmable counters for events such as misaligned memory accesses and exceptions. The open-standard nature of RISC-V allows vendors to extend these counters; for example, SiFive's U74 cores (announced in 2018) implement an enhanced PMU with configurable events for pipeline stalls and cache performance, supporting up to 32 total counters in multi-core configurations. Ratified in 2021, the Sscofpmf extension adds supervisor-mode support for overflow interrupts and mode-based filtering of counters, improving scalability for operating systems.
The RVA23 profile, ratified in October 2024, standardizes Zicntr and Zihpm support for improved observability in applications like AI accelerators. Counter accessibility in RISC-V is controlled via the mcounteren Control and Status Register (CSR), which enables filtering by privilege level (e.g., machine, supervisor, user) to prevent unauthorized access while allowing delegated monitoring. Addressing post-2020 adoption gaps, RISC-V's integration in AI accelerators has included PMU events tailored to the RVV 1.0 vector extension (ratified 2021), such as vector instruction execution and load/store throughput, as seen in vendor implementations like SiFive's vector-enabled cores. (Note: RVV performance events are implementation-defined rather than mandated by the ISA specification.)

Access and Measurement Techniques

Direct Counter Access Methods

Direct access to hardware performance counters allows applications to read and configure counters synchronously without relying on interrupts or sampling mechanisms, enabling precise measurements at the cost of higher overhead. In user space, operating systems provide syscalls to manage counters securely. For instance, the Linux kernel's perf_event_open() syscall, introduced in 2009, creates a file descriptor linked to specific performance events, allowing user-space programs to configure, start, stop, and read counters without kernel-mode transitions for each read after initial setup. This abstracts hardware differences across architectures, supporting both hardware and software events while enforcing privilege checks to prevent unauthorized access. On x86 systems, kernel-mode access involves writing to Model-Specific Registers (MSRs) like IA32_PERFEVTSELx to select events and reading counts from IA32_PMCx, using the RDMSR and WRMSR instructions, which require ring 0 privileges. The programming steps for direct access typically begin with configuring the event select registers to specify the desired performance event, such as cache misses or branch instructions, by loading the appropriate event code and unit mask into the relevant MSR. For x86, this involves writing to IA32_PERFEVTSEL0–3 (or more on newer processors) using WRMSR, followed by enabling the counters via the CR4.PCE bit for user-mode reads or global control MSRs like IA32_PERF_GLOBAL_CTRL. Counters are then started and stopped by toggling enable bits in these registers, and values are read periodically—using RDPMC in user mode if permitted, which loads the counter specified by ECX into EDX:EAX.
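As a sketch of the perf_event_open() path described above, the following self-contained Linux example counts retired instructions around a small workload loop, returning 0 if the counter cannot be opened (for instance, due to perf_event_paranoid restrictions); the function name and workload are illustrative:

```c
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <string.h>
#include <unistd.h>

/* Count retired instructions for a small workload via perf_event_open.
 * Returns the count, or 0 if the counter could not be opened or read. */
static long long count_instructions(void) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_INSTRUCTIONS;
    attr.disabled = 1;        /* start disabled; enable explicitly below */
    attr.exclude_kernel = 1;  /* user-space events only */
    attr.exclude_hv = 1;

    /* Monitor the calling process on any CPU; no group leader. */
    int fd = (int)syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
    if (fd < 0)
        return 0;

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    volatile long long sink = 0;  /* workload: a simple loop */
    for (int i = 0; i < 100000; i++)
        sink += i;

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    long long count = 0;
    if (read(fd, &count, sizeof(count)) != sizeof(count))
        count = 0;
    close(fd);
    return count;
}
```

After the initial setup, repeated enable/read/disable cycles reuse the same file descriptor, which is what keeps per-measurement overhead low compared to reconfiguring MSRs each time.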
When the number of desired events exceeds available counters (typically 4–8 general-purpose counters per core), multiplexing is necessary, rotating configurations over time to approximate simultaneous measurement; libraries like libpfm4 assist by translating event names to raw encodings and managing multiplexing schedules across platforms. Overhead is a key consideration for direct access, as frequent reads can impact performance. The RDPMC instruction incurs approximately 10 cycles of latency on modern x86 processors, making it suitable for infrequent polling but less ideal for tight loops without careful optimization. Privilege levels further constrain access: full configuration (e.g., writing MSRs) requires ring 0 (kernel mode), while limited reads via RDPMC are possible in ring 3 (user mode) only if the CR4.PCE bit is set by the operating system, balancing security and usability. To illustrate, the following code demonstrates setting up an x86 counter for last-level cache misses (architectural event code 0x2E, unit mask 0x41) using inline assembly and polling its value:
#include <stdint.h>
#include <stdio.h>

/* Kernel-mode MSR write helper; kernel code would use <asm/msr.h>. */
static inline void wrmsr64(uint32_t msr, uint64_t value) {
    asm volatile ("wrmsr" : : "c" (msr),
                  "a" ((uint32_t)value), "d" ((uint32_t)(value >> 32)));
}

static inline uint64_t rdpmc(uint32_t counter) {
    uint32_t low, high;
    asm volatile ("rdpmc" : "=a" (low), "=d" (high) : "c" (counter));
    return ((uint64_t)high << 32) | low;
}

void measure_llc_misses(void) {
    /* Assumes kernel mode (ring 0); user mode would use perf_event_open. */
    /* Event 0x2E, umask 0x41: architectural LLC Misses. Bits 16 and 17
       enable counting in user and kernel mode; bit 22 enables the counter. */
    uint64_t config = 0x2EULL | (0x41ULL << 8)
                    | (1ULL << 16) | (1ULL << 17) | (1ULL << 22);

    wrmsr64(0x186, config);     /* IA32_PERFEVTSEL0: select event for PMC0 */
    wrmsr64(0x38F, 1ULL << 0);  /* IA32_PERF_GLOBAL_CTRL: enable PMC0 */

    uint64_t start = rdpmc(0);

    /* Workload here (e.g., loop or function call) */

    uint64_t end = rdpmc(0);

    wrmsr64(0x38F, 0);          /* Disable the counter */

    printf("LLC misses: %llu\n", (unsigned long long)(end - start));
}
This approach provides exact counts for short workloads but requires handling wraparound at the counter's bit width and ensuring the counter is frozen or read atomically during measurement.
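The wraparound concern can be handled with a modular subtraction over the counter width; this illustrative helper assumes both samples come from the same W-bit counter and that at most one wrap occurred between reads:

```c
#include <stdint.h>

/* Wraparound-safe delta for a W-bit counter (e.g. 48-bit x86 PMCs):
 * subtract modulo 2^W, then mask back to W bits. Correct as long as
 * the counter wrapped at most once between the two samples. */
static uint64_t pmc_delta(uint64_t start, uint64_t end, unsigned width_bits) {
    uint64_t mask = (width_bits >= 64) ? UINT64_MAX
                                       : ((1ULL << width_bits) - 1);
    return (end - start) & mask;
}
```

For example, a 48-bit counter read just before and just after it wraps still yields the correct small delta, whereas a plain subtraction would produce a huge bogus value.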

Sampling-Based Profiling

Sampling-based profiling leverages hardware performance counters to collect statistical data on program execution by periodically interrupting the processor and recording its state at specific sample points. In this technique, counters are configured to overflow after a predetermined number of events, such as instructions retired or cycles elapsed, triggering an interrupt that captures the program counter (PC), register values, and other architectural state. This approach enables low-overhead monitoring of long-running applications by extrapolating overall behavior from a subset of samples, contrasting with direct access methods that poll values continuously but incur higher intrusion. Two primary types of sampling exist: time-based and event-based. Time-based sampling relies on cycle counters or timers to generate interrupts at fixed intervals, such as every N clock cycles, providing a uniform temporal distribution of samples. Event-based sampling, in contrast, triggers on specific events like instruction completions, allowing granularity tuned to workload characteristics—for instance, sampling every N retired instructions to focus on computational intensity. On Intel architectures, Precise Event-Based Sampling (PEBS) enhances event-based methods by using a dedicated memory buffer to log precise event records autonomously, minimizing latency and "skid" (discrepancy between event occurrence and sample location) through reduced interrupt involvement and additional captured state, such as memory addresses for load/store operations. The benefits of sampling-based profiling stem from its minimal perturbation of the target application, achieving overheads typically under 5% compared to over 20% for pure software methods. Hardware-assisted features like PEBS further reduce this by offloading sample logging to the processor, enabling high sample rates without proportional costs. The sample rate can be calculated as: \text{Samples per second} = \frac{\text{Clock rate}}{\text{Sample interval}} For example, on a 3 GHz processor with a 100,000-cycle sample interval, this yields 30,000 samples per second, balancing resolution and overhead.
Applications of sampling-based profiling include constructing call graphs to visualize function invocation hierarchies and detecting hotspots—regions of code consuming disproportionate resources—through aggregation of PC samples across events like cache misses or branch mispredictions. These capabilities support efficient analysis in profiling tools, aiding optimization in compute-intensive workloads.
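The sample-rate formula above can also be inverted to pick a sample interval for a target rate; a minimal sketch (function names are illustrative):

```c
#include <stdint.h>

/* Samples per second = clock rate (Hz) / sample interval (events). */
static uint64_t samples_per_second(uint64_t clock_hz, uint64_t interval) {
    return interval ? clock_hz / interval : 0;
}

/* Inverse: interval needed to hit a target sample rate. */
static uint64_t interval_for_rate(uint64_t clock_hz, uint64_t target_sps) {
    return target_sps ? clock_hz / target_sps : 0;
}
```

At 3 GHz, an interval of 100,000 cycles gives 30,000 samples per second, matching the worked example above; halving the interval doubles the rate and, roughly, the profiling overhead.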

Comparisons with Alternatives

Advantages Over Software Techniques

Hardware performance counters (HPCs) provide superior efficiency compared to software-based monitoring techniques by incurring virtually no instrumentation overhead during application execution. Software methods, such as instrumenting profilers that rely on code insertion, typically add 5-30% execution slowdown due to the added probes and function calls required for tracing. In contrast, HPCs collect data transparently at the hardware level without modifying source code or binaries, allowing performance analysis under normal workloads. This near-zero-overhead approach is particularly beneficial for production environments where even modest slowdowns can distort results or impact service levels. A key strength of HPCs lies in their ability to capture hardware-specific events inaccessible or imprecisely approximated by software techniques, such as detailed cache contention, pipeline stalls, and memory subsystem interactions. For example, tools like Valgrind's Cachegrind rely on simulation to estimate cache behavior, often missing nuances of actual hardware dynamics and introducing overheads up to 20x or more for comprehensive analysis. HPCs, however, directly count these microarchitectural events with hardware fidelity, enabling precise attribution to specific threads or cores—for instance, distinguishing shared L3 misses across multi-core systems—and supporting scalable, system-wide monitoring without per-process intervention. This hardware-level precision facilitates deeper insights into resource bottlenecks that software probes cannot reliably detect. HPCs further excel in deriving accurate derived metrics from raw event counts, such as the branch misprediction rate, which quantifies inefficiencies as the ratio of mispredicted branches to total branches retired: \text{Branch Misprediction Rate} = \frac{\text{MISPREDICTED\_BRANCHES}}{\text{TOTAL\_BRANCHES}} This calculation, based on direct tallies, provides reliable performance indicators that software sampling might skew due to probe effects or coarse granularity.
Despite these benefits, HPCs are constrained by a fixed set of events predefined by the processor's performance monitoring unit, limiting flexibility for custom metrics beyond the architecture's offerings, and they risk saturation or overflow in high-event-rate scenarios if not periodically read.
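The misprediction-rate ratio above maps directly onto two raw tallies; a minimal sketch with an illustrative function name:

```c
#include <stdint.h>

/* Branch misprediction rate from raw counts of mispredicted and
 * total retired branches (e.g. from the corresponding HPC events). */
static double branch_mispredict_rate(uint64_t mispredicted,
                                     uint64_t total_retired) {
    if (total_retired == 0)
        return 0.0;  /* no branches retired in the interval */
    return (double)mispredicted / (double)total_retired;
}
```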

Specifics of Instruction-Based Sampling

Instruction-based sampling represents a specialized technique within hardware performance monitoring frameworks that enables precise attribution of events to specific instructions by sampling at instruction boundaries. This method leverages dedicated hardware mechanisms to tag and monitor individual instructions or micro-operations (micro-ops), allowing for accurate correlation of performance events without the inaccuracies common in broader sampling approaches. For instance, AMD's Instruction Based Sampling (IBS), introduced in the Family 10h processors such as the Barcelona quad-core Opteron in 2007, operates by selecting and tagging micro-ops during dispatch or fetch phases based on a configurable sampling period, capturing detailed execution data including instruction pointers (IPs), branch outcomes, and latencies. Similarly, Intel's Branch Trace Store (BTS) provides log-based reconstruction by recording branch records—such as from and to addresses along with prediction status—into a memory buffer, facilitating the retroactive mapping of control flow to attribute events to instruction sequences. The core mechanics involve hardware-initiated tagging to insert markers on sampled instructions, ensuring that performance data is directly associated with the triggering operation. In AMD IBS, for execution sampling (IBS Op), a micro-op is randomly tagged every N operations (where N is set via model-specific registers like IBS_OP_CTL), and as the micro-op retires, an interrupt delivers a record containing the linear IP, physical addresses, cache hit/miss status, and data linear addresses for load/store operations. Fetch sampling (IBS Fetch) similarly tags instructions during fetch, logging details like TLB misses and branch targets. Post-processing tools, such as Linux perf, decode these records via MSRs and map IPs to source code symbols using debugging information, enabling precise event attribution.
For memory events, IBS and analogous Intel mechanisms like Precise Event-Based Sampling (PEBS)—which complements BTS—recover exact data addresses, allowing analysis of specific load/store behaviors without ambiguity. BTS mechanics differ by maintaining a circular memory buffer of branch records, triggering an interrupt when the buffer fills to dump data for reconstruction of the execution trace. A key advantage of instruction-based sampling is its elimination of skid, the displacement of sampled IPs from the actual event-causing instruction in standard performance counter sampling. In conventional methods, interrupts can cause the processor to continue executing several instructions before handling the sample, leading to attribution errors. Instruction-based sampling achieves zero attribution error for tagged events by design, as the hardware directly captures and associates data with the precise IP at the moment of tagging or retirement. \text{Attribution error} = 0 \quad \text{for tagged events in IBS/PEBS} In contrast, standard sampling incurs an error of \pm N instructions, where N typically ranges from a few to tens depending on pipeline depth and interrupt latency. This sub-instruction accuracy is particularly valuable despite the higher overhead—arising from per-sample interrupts and buffer management, often 1-5% for moderate sampling rates compared to near-zero for pure counting—making it suitable for targeted profiling rather than continuous monitoring. Use cases for instruction-based sampling center on debugging and analyzing rare or fine-grained events where precision outweighs added overhead. For example, it excels in identifying specific load/store stalls by capturing exact addresses and latencies for sampled operations, revealing cache or TLB bottlenecks tied to particular instructions.
In IBS, this has been applied to diagnose inefficiencies, such as branch mispredictions or pipeline underutilization in hotspots, enabling developers to pinpoint and optimize problematic code sequences with precision. BTS supports similar reconstruction for control-flow-intensive applications, aiding in the attribution of events across complex execution paths. Overall, these techniques provide deeper insights into instruction-level performance, supporting advanced optimization in hardware and software development workflows.

Advanced Applications and Challenges

Integration with Profiling Tools

Hardware performance counters (HPCs) are integrated into profiling tools through kernel subsystems and libraries that enable low-overhead access to hardware events, facilitating performance analysis for developers optimizing software on diverse architectures. The Linux perf tool, for instance, leverages the perf_events subsystem to monitor HPCs such as cache misses and branch predictions, allowing users to record events system-wide or per-process without requiring source modifications. Similarly, Intel VTune Profiler incorporates Precise Event-Based Sampling (PEBS), a mechanism that captures detailed state like instruction pointers and memory addresses directly from the hardware, enhancing accuracy in identifying bottlenecks in multithreaded applications. For Arm-based systems, the Streamline Performance Analyzer uses HPCs from Arm CPUs and GPUs to profile events like pipeline stalls and cache usage, providing visualizations for debugging and optimization. Integration methods often rely on standardized libraries to abstract architecture-specific details, enabling portable event access across platforms. The Performance API (PAPI), introduced in 1999, offers a cross-platform interface for querying and controlling HPCs, supporting over 30 underlying APIs and predefined metrics like floating-point operations and cache references for user applications. Tools built on PAPI or similar layers parse raw counter data to generate visualizations, such as roofline plots, which plot an application's arithmetic intensity against peak hardware performance to distinguish compute-bound from memory-bound code regions using measured HPC values like instructions retired and bytes loaded. Advanced applications combine HPC data with interactive visualizations for deeper insights into bottlenecks. For example, perf samples can be processed into flame graphs, which are stack-trace-based representations of CPU usage derived from sampled events, highlighting hot code paths by aggregating samples from counters like cycles and instructions.
Post-2020 developments include perf's enhanced support for RISC-V Performance Monitoring Units (PMUs) via the Supervisor Binary Interface (SBI) PMU extension, merged into the Linux kernel around version 5.18 in 2022, allowing event-based sampling on emerging RISC-V hardware. Despite these advancements, challenges persist in tool portability and data management. Architecture-specific variations in HPC availability and encoding, such as differing event codes between x86 and Arm, complicate cross-platform deployment, often requiring vendor-specific adaptations or wrapper layers like PAPI to maintain consistency. Additionally, counters monitoring shared resources like last-level caches generate voluminous datasets, straining tools like perf and VTune during analysis; mitigation involves selective sampling and aggregation to handle high event rates without overwhelming storage or processing.
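The cross-platform encoding problem can be illustrated with a minimal sketch of the kind of translation layer PAPI-style wrappers provide. The event names and raw codes below are hypothetical placeholders, not real vendor encodings:

```python
# Hypothetical generic-to-native event table, in the spirit of PAPI
# preset events; the hex codes are placeholders, not real encodings.
EVENT_MAP = {
    "x86": {"cache_misses": 0x2E41, "branch_misses": 0xC5},
    "arm": {"cache_misses": 0x17,   "branch_misses": 0x10},
}

def encode_event(arch, generic_name):
    """Translate a portable event name to an architecture-specific code,
    raising a clear error when the event is unavailable on this platform."""
    try:
        return EVENT_MAP[arch][generic_name]
    except KeyError:
        raise ValueError(f"{generic_name!r} unavailable on {arch}") from None

print(hex(encode_event("x86", "cache_misses")))  # 0x2e41
print(hex(encode_event("arm", "branch_misses")))  # 0x10
```

A real wrapper must also handle events that exist on one architecture but have no counterpart on another, which is why portable layers expose an availability query alongside the encoder.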

Security and Virtualization Considerations

Hardware performance counters (HPCs) pose significant security risks when misused in side-channel attacks, particularly those exploiting microarchitectural behaviors to infer confidential information from co-located processes. In cache-based side channels such as Prime+Probe, attackers leverage HPCs to monitor eviction rates and timing variations in shared caches, thereby deducing victim activity without direct memory access. This technique has been demonstrated to extract cryptographic keys or sensitive inputs by correlating counter values with cache state changes. Furthermore, access to HPCs can facilitate information leakage, as unrestricted reading of counters may reveal kernel-level execution patterns or bypass isolation boundaries in multi-tenant systems.

The disclosure of the Spectre and Meltdown vulnerabilities in 2018 highlighted HPCs' role in amplifying side-channel threats through speculative-execution leaks. These attacks exploit out-of-order processing to transiently access privileged data, with HPCs providing measurable indicators of such transient execution, such as branch mispredictions or cache misses, that enable detection. Subsequent analyses confirmed that HPC access leaks execution details exploitable for novel side channels, including bypassing kernel-isolation protections via performance characteristics.

To mitigate these risks, operating systems enforce strict access controls on HPCs. In Linux, the perf_event_paranoid parameter governs unprivileged access to performance monitoring units, with levels ranging from -1 (full access) to 3 (restricted to privileged users only), preventing non-root processes from monitoring system-wide events that could enable side channels. Additionally, in secure boot environments, uncore counters—those tracking non-core components like memory controllers—can be disabled to limit exposure, as they often provide aggregated metrics vulnerable to broad inference attacks. These measures balance profiling utility with security, though they require careful tuning to avoid over-restriction.

In virtualized settings, HPC support introduces unique challenges for isolation and performance.
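One way to see the performance side of virtualized counters: each guest counter access that traps to the hypervisor costs a VM exit, so the aggregate cost can be approximated with a back-of-envelope model. The cycle and rate figures below are assumed for illustration, not vendor-published numbers:

```python
def vmexit_overhead_fraction(reads_per_sec, cycles_per_exit, cpu_hz):
    """Fraction of CPU cycles consumed by VM exits triggered by guest
    counter reads: a crude model ignoring caching and batching effects."""
    return reads_per_sec * cycles_per_exit / cpu_hz

# Assumed values: 100k trapped counter reads/s, 2,000 cycles per
# round-trip exit, a 2 GHz core -> 10% of cycles lost to exit handling.
print(vmexit_overhead_fraction(100_000, 2_000, 2_000_000_000))  # 0.1
```

Even under these rough assumptions, the model shows why high-frequency sampling inside guests motivates pass-through mechanisms that avoid trapping every counter access.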
Intel's Virtual Machine Control Structure (VMCS) enables virtualization of performance counters by saving and restoring guest counter state, allowing guests to access PMU registers transparently while the hypervisor manages contention. This pass-through mechanism supports guest profiling but incurs VM-exit overhead during counter reads, potentially degrading performance in high-frequency monitoring scenarios. AMD's Secure Encrypted Virtualization (SEV), launched in 2017, encrypts VM memory to isolate guests from hypervisors but leaves HPCs exposed, as counter events remain accessible to potentially malicious hosts, enabling side-channel observation. Recent evaluations reveal that up to 228 PMU events in SEV-SNP environments leak microarchitectural details, underscoring incomplete isolation. In October 2024, AMD disclosed a performance-counter side channel (AMD-SB-3013) in SEV environments, enabling malicious hypervisors to monitor guest execution and potentially recover sensitive information. Virtualization of PMUs can also introduce measurement overhead in nested setups due to frequent context switches and VM exits, complicating accurate measurements.

For RISC-V architectures, ongoing proposals extend Physical Memory Protection (PMP) to enhance PMU isolation, particularly through S-mode PMP (SPMP) mechanisms introduced around 2023. These extensions aim to restrict unprivileged access to PMU control and status registers via hardware-enforced memory boundaries, supporting secure multi-domain environments without full MMU reliance. SPMP configurations enable fine-grained protection for performance monitoring in embedded and virtualized systems, addressing side-channel vectors in resource-constrained deployments.
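Defensively, the same counters can flag the side-channel activity discussed in this section: detection schemes compare observed counter rates against a baseline of benign readings and alert on large deviations. A minimal sketch, with illustrative sample values and threshold rather than a real detector's parameters:

```python
import statistics

def is_anomalous(baseline_miss_rates, observed_rate, z_threshold=3.0):
    """Flag an observed cache-miss rate whose z-score against a baseline
    of benign readings exceeds the threshold (simple deviation detector)."""
    mean = statistics.mean(baseline_miss_rates)
    stdev = statistics.stdev(baseline_miss_rates)
    return (observed_rate - mean) / stdev > z_threshold

# Benign baseline: miss rates (misses per 1k instructions) around 2.0.
baseline = [1.9, 2.1, 2.0, 1.8, 2.2, 2.0, 1.9, 2.1]
print(is_anomalous(baseline, 2.1))  # False: within normal variation
print(is_anomalous(baseline, 9.0))  # True: Prime+Probe-like spike
```

Production detectors use richer feature sets (multiple events, per-process baselines, trained classifiers), but the underlying signal is exactly this kind of counter-rate deviation.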
