Graph edit distance
Graph edit distance (GED) is a measure of similarity between two graphs, defined as the minimum cost required to transform one graph into the other through a sequence of edit operations, including the insertion, deletion, and substitution of nodes and edges.[1] These operations can be assigned costs based on attributes of the nodes and edges, allowing GED to handle both unlabeled and attributed graphs in an error-tolerant manner for inexact graph matching.[1] Formally, for two graphs G_1 = (V_1, E_1) and G_2 = (V_2, E_2), GED is the cost of the least expensive edit path, often modeled as an assignment problem where mappings between nodes and edges minimize the total edit cost.[2] The concept of GED was first formalized by Sanfeliu and Fu in 1983 as a distance measure between attributed relational graphs for pattern recognition tasks. It built on earlier ideas from string edit distance and was further developed by Bunke and Allermann in 1983 for inexact graph matching in structural pattern recognition, emphasizing its role in handling distortions and noise in graph representations. Subsequent work by Messmer and Bunke in the 1990s introduced extensions like error-tolerant subgraph matching and probabilistic interpretations, establishing GED as a foundational metric in graph-based machine learning.[1] Over time, GED has been generalized to include various cost functions, such as those forming convex polygons in labeling spaces, which influence the shape and properties of the edit surface.[2] Computing the exact GED is NP-hard, equivalent to finding the minimum-cost edit path in a search space that grows factorially with graph size, often reduced to a quadratic assignment problem or bipartite matching for approximations.[1] Exact algorithms, such as branch-and-bound or A* search, are feasible only for small graphs (typically under 20 vertices), while heuristics like bipartite graph matching, self-organizing maps for cost learning, or neural network-based approximations (e.g., via graph neural networks) enable efficient computation for larger structures. Key challenges include defining appropriate edit costs, handling attribute mismatches, and scaling to real-world datasets, with recent advances incorporating machine learning to estimate distances in sub-quadratic time.[3] GED finds broad applications in pattern recognition, such as offline handwritten character recognition, shape and image analysis, and video indexing, where graphs represent structural relationships tolerant to deformations.[1] In bioinformatics, it measures similarity between molecular structures or protein interaction networks; in computer vision, it supports sketch-to-photo matching and scene recognition; and in machine learning, it underpins tasks like graph classification and clustering by providing a flexible dissimilarity metric.[1] Its adaptability to different domains, combined with ongoing research into faster approximations and learned edit costs, continues to drive its relevance in graph-based data analysis.[4]Fundamentals
Definition
Graph edit distance (GED) is a measure of similarity between two graphs, defined as the minimum total cost of a sequence of edit operations required to transform a source graph G_1 into a target graph G_2. This edit path allows for structural and label differences to be quantified in a flexible manner, making GED particularly useful for inexact graph matching in scenarios where perfect isomorphism is unlikely. The concept was introduced to address error-tolerant comparisons in structural pattern recognition, where graphs represent relational data with potential noise or variations. Formally, let G_1 = (V_1, E_1, \ell_1) and G_2 = (V_2, E_2, \ell_2) be two undirected labeled graphs, where V_i denotes the vertex set, E_i the edge set, and \ell_i the labeling function assigning labels from finite alphabets to both vertices (\ell_{i_V}: V_i \to \Sigma_V) and edges (\ell_{i_E}: E_i \to \Sigma_E). The GED is given by \text{GED}(G_1, G_2) = \min \left\{ \sum_{k=1}^m c(o_k) \;\middle|\; (o_1, \dots, o_m) \in \Psi(G_1, G_2) \right\}, where \Psi(G_1, G_2) is the set of all edit paths transforming G_1 into G_2, and c(o_k) is the cost of the k-th edit operation o_k. Each edit operation incurs a non-negative cost via functions c_V: (\Sigma_V \cup \{\epsilon\}) \times (\Sigma_V \cup \{\epsilon\}) \to \mathbb{R}_{\geq 0} for vertices and c_E: (\Sigma_E \cup \{\epsilon\}) \times (\Sigma_E \cup \{\epsilon\}) \to \mathbb{R}_{\geq 0} for edges, with \epsilon as a dummy label for insertions or deletions; typically, c_V(\alpha, \alpha) = 0 for no substitution, insertions and deletions have unit cost (e.g., 1), and substitutions reflect label dissimilarity (e.g., c_V(\alpha, \beta) = d(\alpha, \beta) for some distance d). To illustrate, consider two simple undirected graphs each with two vertices connected by a single edge and no edge labels. Let G_1 have vertices labeled "A" and "B", while G_2 has vertices labeled "A" and "C". Assuming unit costs for insertions/deletions and substitution cost equal to 1 if labels differ (0 otherwise), the minimum edit path substitutes "B" with "C" (cost 1), leaving the structure intact, yielding GED(G_1, G_2) = 1. If labels were identical, GED would be 0, indicating isomorphism. This basic case highlights how GED captures label-based differences without altering connectivity.Edit Operations
The primary edit operations in graph edit distance (GED) are vertex insertion, vertex deletion, edge insertion, and edge deletion, which allow for the addition or removal of structural elements between two graphs. These operations are complemented by substitution operations, specifically vertex relabeling—where the label of a vertex is changed from l_1 to l_2 at a cost determined by a label distance function \delta(l_1, l_2)—and edge relabeling, which similarly adjusts edge labels based on their dissimilarity.[5] Costs are assigned to each operation to quantify the transformation effort, with insertions and deletions typically incurring a unit cost of 1, though higher values are sometimes used in specific applications. Substitution costs are 0 when labels match exactly, promoting mappings of identical elements, and exceed 0 otherwise to reflect the degree of change required. These cost functions ensure the total GED reflects a meaningful structural dissimilarity while allowing flexibility for domain-specific adaptations.[5] If two graphs are isomorphic, GED is zero, as an edit path exists consisting solely of zero-cost substitutions that preserve all labels and structure.[5] As an illustrative example, consider transforming a path graph with three unlabeled vertices connected as a - b - c (two edges) into a cycle graph with the same vertices but three edges (a - b - c - a). Under unit costs for insertions and deletions, with zero-cost substitutions for matching vertices and existing edges, the minimum edit path substitutes the vertices and the two existing edges at no cost, then inserts the closing edge c - a at cost 1, yielding a total GED of 1.[5]Properties and Variants
Key Properties
Graph edit distance (GED) exhibits several key mathematical properties that underpin its utility as a measure of structural similarity between graphs. It is inherently non-negative, satisfying GED(G₁, G₂) ≥ 0 for any two graphs G₁ and G₂, since edit costs are defined to be non-negative and the distance represents the minimum such cost over all possible edit paths.[6] This non-negativity holds regardless of the specific cost assignment, as long as individual edit operations do not incur negative costs. A fundamental property is the identity of indiscernibles: GED(G₁, G₂) = 0 if and only if G₁ and G₂ are isomorphic. If the graphs are isomorphic, an edit path consisting solely of zero-cost substitutions (via a bijection preserving labels and edges) achieves this minimum. Conversely, a zero-cost path implies no insertions, deletions, or non-trivial substitutions are required, confirming structural equivalence.[6] This direct link to graph isomorphism positions GED as an error-tolerant extension of exact matching, where non-zero distances quantify the minimal structural deviations.[7] Symmetry holds under the condition of symmetric edit costs, where the cost of an operation equals the cost of its inverse (e.g., insertion cost equals deletion cost for vertices and edges). In this case, GED(G₁, G₂) = GED(G₂, G₁). To see this, consider any minimum-cost edit path π from G₁ to G₂ with cost c(π). Reversing π yields an edit path from G₂ to G₁, and symmetry ensures the reversed operations have identical costs, preserving the minimum. If costs are asymmetric, symmetry may fail.[6] The triangle inequality, GED(G₁, G₃) ≤ GED(G₁, G₂) + GED(G₂, G₃), does not generally hold for GED, particularly when computed via edit paths (GED_path) and with arbitrary non-unit costs. For instance, edit paths may not compose in a way that bounds the direct path cost, even if individual edit functions satisfy the inequality. This property is satisfied only under restrictive conditions, such as unit costs for all operations or when GED is reformulated via bijections (GED_bij) with edit functions that themselves obey the triangle inequality. Consequently, GED is a pseudo-metric rather than a full metric in many settings, which can affect its use in certain geometric or optimization contexts.[6] GED also displays intuitive behavior relative to graph sizes. For vertex edits, it is lower bounded by the difference in vertex counts: GED(G₁, G₂) ≥ ||V₁| - |V₂|| ⋅ min(c_ins^v, c_del^v), where c_ins^v and c_del^v are the minimum costs for vertex insertion and deletion, respectively. This bound follows directly from the necessity of at least |V₁| - |V₂| (assuming |V₁| ≥ |V₂|) insert or delete operations to balance the vertex sets before any substitutions or edge edits can occur. Similar bounds apply to edge counts, providing quick estimates for dissimilarity in unbalanced graphs.[8]Common Variants
The concept of graph edit distance was first formalized by Sanfeliu and Fu in 1983, with initial variants for inexact matching developed by Bunke and Allermann in the same year, to enable tolerance for distortions and noise through a sequence of edit operations with associated costs.[9][10] Subsequent extensions in the 2000s adapted these variants for bioinformatics, incorporating domain-specific edit costs for molecular and genomic graphs. Error-tolerant graph edit distance, inherent to the foundational model, permits approximate matches by assigning costs to edit operations and bounding the total error to reflect practical dissimilarities in noisy data. Bounded graph edit distance restricts the maximum allowable edit cost or path length, facilitating efficient computation while ensuring the distance remains within a predefined threshold for applications requiring scalability.[11] Variants for directed graphs extend the standard operations to handle edge directionality, such as inserting, deleting, or substituting directed edges while preserving orientation in the edit sequence.[12] For weighted graphs, adaptations incorporate edge weights into substitution costs beyond mere label matching, allowing the distance to account for quantitative differences in edge properties.[13] Subgraph edit distance modifies the objective to find the minimum edit cost transforming the source graph into any subgraph of the target graph, rather than an exact isomorphic match, which is useful for partial structure detection.[14] An example of a practical variant is the size-normalized graph edit distance, which scales the raw distance by the average graph size—such as dividing by (|V_1| + |V_2|)/2, where |V_i| denotes the number of vertices in graph G_i—to enable fair comparisons between graphs of differing scales.[15]Applications
Pattern Recognition and Computer Vision
Graph edit distance (GED) was pioneered in the 1980s for syntactic pattern recognition, enabling error-tolerant matching of structural representations in visual data. Introduced by Bunke and Allermann in 1983 as a method for inexact graph matching, it addressed the limitations of exact isomorphism in handling noisy or distorted patterns common in images.[16] By the 1990s, GED had become a cornerstone for inexact graph matching, with Bunke's 1997 work establishing theoretical links between GED and maximum common subgraph approximations, facilitating its adoption in computer vision tasks.[7] In object recognition, GED facilitates graph matching by representing shapes as attributed graphs, where nodes capture features like curvature or position, and edges encode spatial relations, allowing similarity measurement through minimal edit costs. This approach excels in error-tolerant scenarios, such as recognizing deformed or partially occluded objects, by quantifying the cost of insertions, deletions, and substitutions to align graphs. For instance, Sebastian et al. (2003) applied GED to shock graphs—skeletal representations of shape boundaries—for robust object recognition, demonstrating high accuracy in matching silhouettes from databases like MPEG-7, where traditional feature-based methods falter under viewpoint variations. A practical example is handwritten character recognition, where stroke sequences form graphs, and GED measures shape similarity despite variations in writing style. For image analysis, GED supports attributed relational graphs (ARGs) in tasks like scene labeling, where nodes represent regions or objects with attributes (e.g., color, texture), and edges denote relations (e.g., adjacency, inclusion). This enables error-tolerant matching for segmenting and interpreting complex scenes, accommodating segmentation errors or noise. Robles-Kelly and Hancock (2005) utilized GED on ARGs derived from image adjacency matrices for content-based image retrieval, achieving lower intra-class distances on datasets like COIL compared to spectral methods alone. Unlike string edit distance, which is limited to linear sequences, GED captures 2D structural relations, making it superior for hierarchical or relational visual patterns, as highlighted in early applications to syntactic image understanding.[16]Chemoinformatics and Bioinformatics
In chemoinformatics, graph edit distance (GED) serves as a measure of structural similarity between molecular graphs, where atoms are represented as nodes and chemical bonds as edges, facilitating tasks in drug discovery such as virtual screening of compound libraries to identify potential therapeutic agents. The edit operations—node and edge insertions, deletions, and substitutions—directly model chemical modifications, including atom additions, replacements, or alterations in bonding patterns, thereby quantifying the minimal transformations required to align two molecules. This approach has proven effective for retrieving structurally similar compounds from databases, outperforming some fingerprint-based methods in enrichment factors for active molecules in pharmacological screens.[17] In bioinformatics, GED is applied to compare protein structures by converting them into contact map graphs, in which nodes denote amino acid residues and edges signify close spatial contacts (typically within a predefined distance threshold), allowing for the assessment of conformational similarities and differences. Such graph representations enable GED to estimate evolutionary distances between proteins by calculating the edit cost to transform one contact map into another, providing insights into structural divergence over phylogenetic time scales. For example, GED computations on contact maps have been used to align and classify protein domains, revealing conserved motifs in families like kinases or globins. As of 2025, GED has also been applied in genomics for constructing edit-distance graphs to cluster large sets of short reads from sequencing data, aiding in genome assembly and variant detection.[18][19] A specific instance of GED's utility involves approximating the maximum common edge subgraph (MCES) for pharmacophore matching, where pharmacophores—simplified graphs capturing key molecular features like hydrogen bond donors or hydrophobic regions—are compared to identify shared interaction patterns with biological targets. By optimizing GED to minimize edit costs between these pharmacophore graphs, MCES approximations highlight overlapping substructures essential for activity, supporting ligand-based virtual screening and scaffold hopping in medicinal chemistry.[20] The use of GED in chemoinformatics and bioinformatics emerged prominently in the early 2000s, with foundational evaluations by Raymond and Willett demonstrating its viability for similarity searching in chemical databases containing thousands of 2D structures. Subsequent advancements have incorporated GED into machine learning pipelines via specialized kernels that approximate edit distances, enabling supervised classification of biological graphs such as protein-protein interaction networks or metabolic pathways in datasets from sources like the Protein Data Bank. These GED kernels, often based on random walk or convolution formulations, enhance accuracy in tasks like predicting protein function or molecular bioactivity without requiring exact distance computations.[21] Recent applications as of 2024 include using GED for clustering travel-activity patterns in transportation analysis, where graphs represent weekly mobility sequences to identify behavioral similarities tolerant to variations in data collection.[22]Algorithms
Exact Methods
Exact methods for computing graph edit distance (GED) aim to find the minimum cost sequence of edit operations that transforms one graph into another, guaranteeing the optimal solution without approximation. These approaches are essential in scenarios where precision is paramount, such as verifying structural similarities in small datasets from pattern recognition tasks. However, due to the inherent complexity of the problem, exact methods are typically feasible only for graphs with a limited number of vertices, often up to 10-20 nodes depending on the graph density and edit costs.[16] One prominent exact method employs a tree search strategy based on the A* algorithm, which explores the space of possible edit paths in a best-first manner. The search tree is constructed by sequentially assigning nodes from the source graph G_1 = (V_1, E_1) to nodes in the target graph G_2 = (V_2, E_2), including dummy nodes to account for insertions and deletions. At each step, the algorithm selects the partial mapping with the lowest estimated total cost, where the total cost is the sum of the actual cost incurred so far and an admissible heuristic lower bound on the remaining cost. Admissible heuristics, such as those derived from bipartite matching on unmatched nodes or lower bounds on edge edit costs, ensure that the first complete path found is optimal. This approach was formalized in early works on inexact graph matching and has been refined in subsequent implementations to reduce memory usage, such as through depth-first variants like DF-GED.[16][23] The GED problem can be formulated as finding an optimal bijection \phi: V_1 \cup \epsilon_1 \to V_2 \cup \epsilon_2, where \epsilon_1 and \epsilon_2 are sets of dummy (null) nodes representing insertions and deletions, respectively (typically |\epsilon_1| = |V_2|, |\epsilon_2| = |V_1| to balance sizes). The total edit cost under this mapping is the sum of vertex edit costs \sum_{v \in V_1 \cup \epsilon_1} c_v(v, \phi(v)), where c_v denotes substitution, insertion, or deletion costs based on whether \phi(v) is a real node or dummy, plus edge edit costs determined by the induced mapping: for each edge (u, v) \in E_1, a cost c_e((u, v), (\phi(u), \phi(v))) for substitution (if both images are real and edge exists in E_2), deletion (if image edge missing), etc.; symmetrically, insertion costs for edges in E_2 without preimages under \phi. This formulation captures the interdependence between node and edge edits, making the search space combinatorial. For the heuristic in A*, a common lower bound is the cost of optimal bipartite matching on the remaining nodes using vertex substitution costs alone, ignoring edge dependencies to ensure admissibility.[16][23] To illustrate the A* approach, consider two small undirected graphs without labels for simplicity (unit costs: 1 for insert/delete/substitute node or edge). Let G_1 have vertices \{a, b\} with edge (a, b), and G_2 have vertices \{x, y, z\} with edges (x, y), (y, z). The goal is to transform G_1 to G_2 with minimum cost.- Initial state: Empty partial mapping, g = 0 (cost so far), heuristic h based on matching 2 nodes to 3 (e.g., h = 1 for one deletion/insertion imbalance, plus edge costs).
- Step 1: Assign a \to x (substitute cost 1, partial edge match possible), new g=1, h=2 (remaining b to y or z, plus edge inserts/deletes). Alternative: a \to \epsilon (delete cost 1), but higher h.
- Step 2: From a \to x, assign b \to y (g=2), check induced edge: (a,b) \to (x,y) matches, no edge cost; h=1 (insert z and edge (y,z)). Total f = g + h = 3.
- Step 3: Complete by inserting z and edge (y,z) (g=4), h=0, total cost 4. Explore alternatives like b \to z (would require edge substitute/delete, higher cost). The search prunes branches where f > 4, confirming optimality (actual minimum: substitute a \to x, b \to y, insert z and edge).