Gprof
Gprof is a performance analysis tool included in the GNU Binutils suite that generates execution profiles for C, Pascal, or Fortran 77 programs by analyzing call graph data from instrumented executions.[1] It incorporates the time spent in called routines into the profile of each caller, enabling developers to pinpoint functions and code paths that consume the most CPU time during program execution.[2] Developed as part of the GNU Project, gprof requires programs to be compiled with the -pg flag using GCC or compatible compilers, which inserts profiling code to produce a gmon.out file containing runtime statistics.[2]
The tool processes this profile data to output several report formats, including a flat profile that lists functions by time percentage and self time, a call graph showing caller-callee relationships with call counts and propagated (inclusive) times, and optional annotated source listings highlighting execution hotspots.[1] Key options allow customization, such as -p for flat profiles, -q for call graphs, -A for source annotation, and options like -k to exclude specific arcs or -E to suppress display of certain symbols.[1] Gprof handles cycles in the call graph by propagating times appropriately and supports multiple profile files for aggregated analysis, though it requires normal program termination (a return from main or a call to exit) for the profiling data to be written.[2]
GNU gprof was originally written by Jay Fenlason and has been maintained as a core component of Binutils since its early versions, with the current documentation covering releases up to version 2.45.[2] Widely used in software development for optimization, it remains a standard tool on Unix-like systems despite limitations like its focus on CPU time over I/O or memory, and lack of support for multi-threaded or dynamic linking scenarios without extensions.[2] Its integration with GNU tools makes it essential for profiling open-source and performance-critical applications.[3]
Overview and Usage
Purpose and Features
Gprof is a hybrid instrumentation and sampling profiler for Unix-like systems, integrated as part of the GNU Binutils suite of tools.[2][4] Developed originally at the University of California, Berkeley, it extends the capabilities of the earlier Unix profiler prof(1) by incorporating call graph analysis to provide more detailed insights into program execution.[5]
The primary goal of Gprof is to identify functions that account for the majority of a program's execution time and to visualize the call relationships between functions, enabling developers to pinpoint performance bottlenecks in modular software.[5] This approach attributes runtime costs not just to individual routines but also to the abstractions they implement, facilitating targeted optimizations.[5]
Key features of Gprof include the production of flat profiles, which list functions sorted by descending execution time along with call counts, and call graphs that depict caller-callee interactions while propagating time from called routines back to their callers.[5][2] These outputs help reveal unexpected call patterns and the cumulative impact of subroutine hierarchies.[5] Gprof can profile programs written in languages such as C, C++, Pascal, and Fortran 77, compiled with GCC or compatible compilers, leveraging the compiler's instrumentation to collect data within the broader binutils environment.[2] Programs are instrumented via the -pg compilation flag to generate the necessary profiling data.[2]
Basic Usage Workflow
To use Gprof for profiling a program, the first step is to compile the source code with profiling instrumentation enabled. This is achieved by including the -pg flag when using the GNU Compiler Collection (GCC), which inserts calls to profiling functions at strategic points in the code. For instance, the command gcc -pg -g -o myprogram source.c compiles the file source.c into an executable named myprogram, where -g adds debugging information for better symbol resolution in the output.[6]
Once compiled, the instrumented program is executed in the usual manner, such as ./myprogram, with any necessary input provided via standard input, arguments, or files. During runtime, the program collects profiling data on function calls and execution times, writing this information to a binary file named gmon.out upon normal termination (via return from main or an exit call). If the program crashes or is interrupted abnormally, the gmon.out file may not be generated, requiring re-execution under controlled conditions.[7]
After execution, the profiling data is analyzed using the gprof command, which processes the gmon.out file alongside the executable to produce human-readable reports. The basic invocation is gprof myprogram gmon.out > profile.txt, which generates a text file containing a flat profile of time distribution across functions and a call graph showing invocation relationships. Additional options allow customized views; for example, gprof -A myprogram gmon.out produces an annotated source listing with execution percentages overlaid on the original code lines.[8]
For scenarios involving multiple runs to accumulate more accurate data, several gmon.out files can be merged using gprof -s myprogram gmon.out1 gmon.out2, which combines the inputs into a single gmon.sum file. This summed data is then analyzed as usual, such as gprof myprogram gmon.sum > combined_profile.txt, providing aggregated statistics over repeated executions. To create multiple distinct profile files, rename or move gmon.out after each execution (e.g., mv gmon.out gmon1.out), then run the program again to generate the next file.[9]
A complete example workflow uses a simple C program (fib.c) that computes Fibonacci numbers iteratively to demonstrate time spent in a loop-heavy function:
```c
#include <stdio.h>
#include <stdlib.h>  /* for atoi */

long fib(int n) {
    long a = 0, b = 1, c;
    if (n <= 1) return n;
    for (int i = 2; i <= n; i++) {
        c = a + b;
        a = b;
        b = c;
    }
    return b;
}

int main(int argc, char *argv[]) {
    if (argc > 1) {
        int n = atoi(argv[1]);
        printf("Fib(%d) = %ld\n", n, fib(n));
    }
    return 0;
}
```
The terminal sequence proceeds as follows:
$ gcc -pg -g -o fib fib.c
$ ./fib 40
Fib(40) = 102334155
$ gprof fib gmon.out > fib_profile.txt
$ cat fib_profile.txt
This produces output including a flat profile and call graph, with fib invoked once from main. Note that an iterative fib(40) completes almost instantly, so the sampled times may be near zero; for meaningful percentages, use a heavier workload or call fib repeatedly in a loop. For multiple runs, rename gmon.out (e.g., mv gmon.out gmon1.out), execute ./fib 40 again to generate a second file, then sum and analyze as described.[10]
Implementation
Code Instrumentation
Gprof achieves precise call counting and call graph construction through compile-time instrumentation of the source code, primarily facilitated by the GNU Compiler Collection (GCC). When compiling with the -pg flag, GCC inserts calls to the monitoring function mcount (or _mcount or __mcount, depending on the platform and configuration) at the entry point of every instrumented function.[11] This instrumentation enables the collection of exact caller-callee relationships without relying on statistical sampling.[12]
The mcount function serves as the core mechanism for recording dynamic call graph arcs during program execution. Upon invocation, mcount examines the program's stack frame to determine the caller (parent) routine's address and the current (callee or child) function's address. It then increments a counter in an in-memory hash table structure, using the call site as the primary key and the callee address as a secondary key to track the number of times each arc is traversed.[5] This process builds a directed call graph that captures the program's control flow, including self-calls and recursive calls, while the function also initializes necessary data structures on its first invocation.[12] Cycles in the call graph, including those from recursion, are recorded as arcs but detected and collapsed into strongly connected components during post-processing.[13]
Leaf functions, which make no outgoing calls, have their execution times propagated directly to callers in post-processing based on call frequencies, as they contribute no descendant time.[5]
Integration with GCC's profiling support occurs through linkage to the libgmon.a library, which provides the implementation of mcount, internal profiling routines like __mcount_internal, and cleanup functions such as mcleanup for dumping data to the gmon.out file at program termination.[12] This library ensures compatibility across separately compiled modules, as no special recompilation is required beyond using -pg during both compilation and linking.[5]
Unlike pure sampling-based profilers that approximate call frequencies through periodic interrupts, gprof's instrumentation approach yields exact call counts by explicitly logging each invocation, though at the cost of added runtime overhead from the inserted calls.[5]
Runtime Profiling
Gprof collects runtime profiling data by integrating code instrumentation with statistical sampling to capture both call frequencies and execution time estimates. The instrumentation aspect relies on the mcount function, which is automatically invoked at the entry of each profiled routine during compilation with the -pg flag; this function records directed arcs in the call graph by identifying the caller-callee pair (using the return address on the stack) and incrementing the call count for that arc.[14] These arc records form the structural backbone of the call graph, enabling later attribution of time to specific callers. Meanwhile, sampling provides time data: the operating system generates periodic interrupts via a clock signal, such as through the setitimer mechanism, at a default interval of 10 milliseconds (corresponding to a 100 Hz sampling rate).[15] Each interrupt triggers a handler that records the current program counter (PC) value, approximating the location of execution at that instant.[14]
The sampled PC values are aggregated into a histogram, which serves as the primary source for estimating routine execution times. This histogram is an array of fixed-size bins (typically 16-bit counters) covering the program's text segment, where each bin corresponds to a range of addresses and tallies the number of samples falling within it; the total samples multiplied by the sampling interval yield the program's overall runtime, and per-bin counts estimate self-time for routines.[16] Upon program exit, the moncontrol or _mcleanup function writes the raw data to the gmon.out file in a binary format: a header with metadata (e.g., text segment range, histogram dimensions, and clock resolution), followed by the histogram section, and then the arc section—a sequence of records each containing from-address, self-address, and call count for each arc.[16] This structure allows post-processing to propagate histogram-derived time estimates along the arcs, attributing inclusive time (self plus descendants) to callers while preserving exact call counts from mcount.[15]
Gprof provides partial support for dynamic linking and shared libraries, but with limitations stemming from runtime loading. Both the main executable and shared libraries must be compiled with -pg to include profiling code, yet dynamically loaded libraries may lack initialized profiling structures, leading to incomplete arc recording or segmentation faults; symbol resolution across libraries relies on the runtime linker's address mapping, but accurate call graphs often require static linking (e.g., via -static-libgcc) or explicit profiling of all dependencies to avoid missing data.[17]
To manage output in environments with multiple processes or parallel runs, the GMON_OUT_PREFIX environment variable allows customization of the gmon.out filename; if set, the prefix is prepended to the default name, and the process ID is appended (e.g., GMON_OUT_PREFIX=myprof yields myprof.gmon.out.1234), preventing overwrites while facilitating per-process data isolation.[18]
Output Analysis
Gprof generates profiling reports in two primary formats: the flat profile and the call graph, which users analyze to identify performance bottlenecks in their programs.[2] The flat profile provides a straightforward summary of time spent in each function, independent of calling relationships, while the call graph illustrates the hierarchical call structure and time attribution across functions.[19] These outputs are derived from sampling-based measurements collected during program execution, allowing developers to pinpoint functions consuming the most runtime.[20]
The flat profile is organized as a table sorted in decreasing order of self seconds (time spent executing the function itself, excluding time in called subroutines), followed by call count and then alphabetically by function name.[19] Key columns include:
- % time: The percentage of total runtime spent in the function (self time relative to overall execution).
- cumulative seconds: The cumulative time up to and including this function, ordered by descending self time.
- self seconds: The direct execution time in the function.
- calls: The number of times the function was invoked (or blank for unprofiled functions).
- self ms/call: Average self time per call, in milliseconds.
- total ms/call: Average total time (self plus descendants) per call, in milliseconds.
- name: The function name or symbol.
This format helps users quickly identify hotspots, such as functions with high % time, without considering call hierarchies. For instance, in a sample output from a simple program, the flat profile might highlight a file I/O function like open consuming 33.34% of runtime across 7208 calls, indicating a potential bottleneck in repeated system interactions.[19]
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 33.34      0.02     0.02     7208     0.00     0.00  open
The call graph complements the flat profile by depicting the program's dynamic call structure as a directed graph of arcs between functions. Each function entry begins with an index number (a consecutive integer for reference), followed by the primary line showing the function's total time (self plus children), self time, and call count from its parent. Lines above the primary line represent callers, detailing how much time and calls originated from each, while lines below list callees (children) with arc descriptions like "calls=1/1" (actual calls out of estimated total). Inclusive time encompasses the function and all its descendants, whereas exclusive time (self) excludes them.[19] This visualization reveals not just individual hotspots but also how time propagates through the call stack, aiding in understanding indirect performance impacts. Cycles are collapsed into single entries during this analysis, with times propagated appropriately.[13]
Time propagation in the call graph attributes execution time from leaf functions upward to their callers based on call frequencies. Specifically, a function's total time is the sum of its self time and the propagated times from its children, weighted by the proportion of calls to each child; recursive cycles are treated as a single unit to avoid infinite loops in attribution.[19] For example, if function A calls B 10 times and B's self time is 0.1 seconds, then 0.1 seconds propagates to A as part of B's contribution, added to A's total unless adjusted for other siblings. This mechanism ensures the call graph reflects the full cost of a function, including subroutine overhead, enabling users to trace bottlenecks back to root causes like excessive calls to costly routines.[2]
Users can customize output via command-line options to focus analysis. The -p option prints only the flat profile, suppressing the call graph, while -q prints only the call graph, omitting the flat profile; by default, gprof outputs both.[21] The -z option includes functions with zero usage (never called or zero time) in the flat profile, useful for verifying completeness or spotting dead code.[22] These options, combined with symbol specifications (e.g., -p symspec), allow targeted reports for specific functions or patterns.
In practice, interpreting these reports involves cross-referencing the flat profile for top consumers and the call graph for context; for the toy example above, the call graph might show open invoked heavily from a loop in main, confirming it as the primary bottleneck and guiding optimizations like batching I/O operations.[19]
History
Berkeley Origins
Gprof was developed in the early 1980s at the University of California, Berkeley, by Susan L. Graham, Peter B. Kessler, and Marshall K. McKusick, all affiliated with the Computer Science Division of the Electrical Engineering and Computer Science Department.[23] The tool emerged from efforts to profile and optimize a code generator, addressing the challenges of evaluating abstractions in large, modular programs composed of small routines.[5] This academic work built on the Unix tradition of execution profiling, aiming to provide insights into both routine-level execution times and inter-routine call relationships.
The foundational description of gprof appeared in the 1982 paper "Gprof: A Call Graph Execution Profiler," presented at the 1982 SIGPLAN Symposium on Compiler Construction.[23] A key innovation was its hybrid approach, combining instrumentation for precise call graph tracing with sampling of the program counter to estimate execution times, enabling the attribution of a routine's time to its callers in a hierarchical manner.[5] This method allowed for low-overhead profiling that reflected the program's logical structure, distinguishing it from purely statistical tools by incorporating dynamic call information.[23]
Gprof was released as part of the 4.2BSD Unix distribution in 1983, serving as an extension to the existing prof(1) tool for basic flat profiling.[24] Its initial implementation was tailored specifically for the VAX architecture, with compiler support for inserting monitoring routines into C, Fortran 77, and Pascal programs at compile time.[5] Integration with the BSD kernel was facilitated through configuration options like -p in config(8), allowing kernel profiling via utilities such as kgmon(8), which collected data into a mon.out file for post-processing with gprof.[25] This setup enabled detailed analysis of kernel performance, such as optimizing pathname translation routines, while keeping overhead to 5-25% of execution time.[25]
GNU Development
The GNU implementation of gprof was developed by Jay Fenlason in 1988 as a profiling tool compatible with the original Berkeley Unix version, specifically designed to work with the GNU C compiler.[26][27] This effort addressed the need for performance analysis in the emerging GNU ecosystem, enabling developers to profile execution times and call graphs in programs compiled with GNU tools.[26]
Integration with the GNU Compiler Collection (GCC) was achieved through the -pg flag, which instruments source code during compilation to generate profiling data files compatible with gprof; this support has been available since early GCC versions, including the 1.x series released around the same period.[11] As part of the GNU Binutils suite, gprof's development aligned with the broader binary utilities project, ensuring its distribution and maintenance under the GNU umbrella.[2]
Key enhancements in the GNU version focused on portability beyond BSD systems, allowing gprof to operate on diverse Unix-like platforms and architectures supported by GNU tools. Improvements to output formatting provided more detailed and configurable reports, including enhanced call graph visualizations and support for basic-block counting in the profiling data format.
The version history of gprof is closely tied to GNU Binutils releases, with incremental updates adding support for new architectures; for instance, the Binutils 2.x series from the early 1990s extended compatibility to platforms like SPARC and MIPS.[2] From its inception, gprof has been released under the GNU General Public License (GPL), version 2 or later, promoting free software principles within the GNU project.[4]
Limitations and Accuracy
Sampling Errors
Gprof's time measurements rely on statistical sampling of the program counter at fixed intervals, introducing inherent inaccuracies due to the probabilistic nature of the process. The primary source of error stems from the variance in sample counts, where the expected relative error for a function's runtime estimate is approximately 1/√n, with n representing the number of samples taken during its execution (calculated as the function's total runtime divided by the sampling interval).[28] This Poisson-like distribution means that shorter executions yield larger relative errors; for instance, with the default sampling interval of 10 milliseconds (0.01 seconds), a 1-second run produces about 100 samples, resulting in an expected error of roughly 10% of the measured time.[28] In contrast, longer runs, such as 100 seconds, increase n to 10,000, reducing the error to approximately 1%.[28]
A notable bias affects short functions, which execute in less time than the sampling interval and are thus likely to be underrepresented or entirely missed in the profile. If a function's execution duration is comparable to or shorter than 10 milliseconds, it may receive zero or few samples, leading to systematic underestimation of its time contribution, even if called frequently.[28] The GNU gprof manual emphasizes that runtime figures are unreliable when not substantially larger than the sampling period, highlighting this limitation for fine-grained code regions.[28]
These sampling errors propagate and amplify within the call graph, where self-times (directly sampled) are attributed to parents via post-processing that distributes child times upward along call arcs. Errors in a callee's estimated time directly influence the propagated times to its callers, potentially magnifying inaccuracies in higher-level functions, especially in deep call chains or cycles where times are aggregated without inter-cycle propagation.[5] This attribution mechanism, while enabling a hierarchical view, compounds statistical variance from leaves to roots, reducing precision in inclusive time metrics for complex programs.[5]
To mitigate these errors, users can extend program runtime by increasing input sizes, thereby boosting sample counts and narrowing confidence intervals without altering the sampling mechanism.[28] Accumulating data across multiple independent runs using the gprof -s option merges gmon.out files to effectively increase n, improving accuracy for the same total execution effort.[28] Finer sampling intervals are possible through system-level adjustments, such as configuring higher-frequency hardware timers (e.g., modifying kernel clock ticks), though this introduces trade-offs like increased profiling overhead and potential compatibility issues on certain platforms.[29]
Overhead and Compatibility Issues
Gprof introduces notable performance overhead primarily through its instrumentation mechanism, which inserts calls to the mcount() function at each routine entry to record call graph data. This can result in execution slowdowns ranging from 30% to over 260% in call-intensive programs, such as those with frequent small function invocations or object-oriented designs, where the added cost distorts timings significantly.[30] In contrast, the sampling component, which captures program counter values at approximately 100 Hz via operating system interrupts, imposes minimal additional overhead, typically a few microseconds per sample, though it accumulates over extended runs and is more pronounced in signal-based implementations compared to kernel-assisted ones.[29][5]
Compatibility challenges further limit Gprof's applicability. It provides poor support for multi-threaded programs, as the mcount() implementation in libraries like glibc is not thread-safe, leading to inaccurate or missing per-thread data and potential race conditions in call counts.[31] Similarly, profiling kernel-mode code is unsupported, as Gprof targets user-space applications and lacks mechanisms for kernel instrumentation. For fully dynamic shared libraries, symbol mismatches and segmentation faults can occur if profiling executes before library initialization, often requiring static linking (-static or -static-libgcc) as a workaround.[31]
Platform constraints are most evident outside Unix-like systems. Gprof performs best on Unix environments such as Linux and Solaris, where it integrates seamlessly with GCC and the binutils suite. On Windows, support is limited to POSIX-emulating layers like Cygwin or MinGW, which may introduce additional inaccuracies due to differing runtime behaviors. Embedded systems, such as ARM Cortex-M, require custom adaptations like modified toolchains and no-OS environments to enable profiling.[31][32]
To mitigate these issues, Gprof is best suited for single-threaded prototype analysis during development, where overhead can be tolerated for initial hotspot identification; it should be avoided in production or multi-threaded scenarios to prevent distortion or crashes.[31]
Legacy and Modern Context
Historical Reception
Upon its release in the early 1980s, gprof received significant recognition within the programming languages community for introducing call graph profiling, a method that attributes execution time across calling relationships in programs. The original paper presenting gprof, "gprof: A Call Graph Execution Profiler" by Susan L. Graham, Peter B. Kessler, and Marshall K. McKusick (presented at the 1982 SIGPLAN Symposium on Compiler Construction), was selected as one of the 50 most influential papers in a retrospective of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI) spanning 1979 to 1999; selections were based on nominations, voting, and evaluation for impact and technical excellence by a committee of past PLDI chairs. This inclusion highlighted its enduring contribution to performance analysis tools, placing it among four standout papers from 1982.[33]
By the 1990s, gprof had achieved widespread adoption as a standard component of Unix toolchains, integrated into Berkeley Software Distribution (BSD) systems and the GNU Project's binutils, enabling developers to routinely profile C programs for optimization.[3] It extended and largely superseded earlier flat profilers like AT&T's prof by providing hierarchical call graph insights, influencing subsequent profiling methodologies in Unix-like environments. This integration facilitated its use across academic, research, and commercial software development, particularly in performance tuning for systems software.
Contemporary literature praised gprof for its innovative simplicity in delivering actionable call graph data without excessive complexity, despite acknowledged inaccuracies in handling shared subroutines and recursive calls. A 1993 analysis noted these limitations, such as erroneous time attribution in programs with common subroutines, but emphasized that gprof's straightforward design made it invaluable for initial performance investigations in production settings.[34] Overall, through the 1990s and into the early 2000s, gprof remained a cornerstone tool in BSD and GNU ecosystems, balancing utility with ease of use amid growing software complexity.
Successors and Alternatives
Over time, Gprof has become outdated for many modern applications due to its lack of native support for multi-threaded programs and its reliance on instrumentation, which introduces significant runtime overhead depending on the workload.[35][36] These limitations make it unsuitable for profiling shared libraries or concurrent code without custom modifications, leading to incomplete or inaccurate results in contemporary environments.[35]
A direct successor addressing these issues is gprofng, first integrated into the GNU Binutils suite with version 2.39 in August 2022 as the next-generation GNU profiler.[37] Derived from Oracle's Sun Studio profiler lineage, gprofng uses sampling-based techniques without requiring program recompilation, enabling low-overhead analysis of production binaries written in C, C++, Java, or Scala.[38] It provides full support for shared libraries and multi-threaded applications, generating call graphs and performance metrics that resolve Gprof's threading and instrumentation shortcomings.[37]
Beyond gprofng, several alternatives have gained prominence for their advanced capabilities. Perf, a Linux kernel-based sampler, offers low-overhead profiling via hardware performance counters, capturing events like cache misses and branch predictions without code modifications.[39] Valgrind's Callgrind tool provides deterministic call-graph profiling with detailed instruction-level data, ideal for cache and branch analysis in single- or multi-threaded code.[40] Intel VTune Profiler leverages hardware acceleration for comprehensive analysis, including threading efficiency and GPU utilization, making it suitable for high-performance computing workloads.[41]
Gprof remains included in the latest Binutils release (2.45.1 as of November 2025), but it is now primarily recommended for legacy single-threaded analysis where minimal setup is needed.[4] For ongoing development, migration to gprofng is advised to handle modern binaries effectively while maintaining compatibility with GNU ecosystems.[37]