Duplicate code
Duplicate code, also known as code clones, refers to identical or similar fragments of source code that appear in multiple places within a software codebase, typically resulting from copy-and-paste programming practices. In recent years, the use of AI-assisted code generation tools has further contributed to the rise in duplicate code, with empirical data indicating a significant increase in clone prevalence.[1][2]
This phenomenon is widely recognized as a code smell in software engineering—a surface indication of deeper design issues that can complicate long-term development and evolution.[3] The presence of duplicate code often arises during rapid prototyping or when developers reuse snippets without abstraction, leading to redundant implementations of the same logic.[2]
Key impacts include heightened maintenance costs, as modifications to functionality in one location require corresponding changes elsewhere to maintain consistency, increasing the risk of bugs and inconsistencies if updates are overlooked.[2] Empirical studies have shown that while duplicate code may not always be modified as frequently as unique code, its persistence can still hinder software evolution by amplifying error propagation during refactoring or feature additions.[4] To mitigate these issues, the DRY (Don't Repeat Yourself) principle, introduced in foundational software development literature, emphasizes abstracting duplicated logic into single, reusable representations—such as functions, classes, or modules—to ensure knowledge is expressed unambiguously once within the system.[5]
Detection techniques, including traditional token-based, tree-based, and graph-based analysis tools, as well as emerging AI and large language model-based approaches, are commonly used to identify clones at various levels of similarity (e.g., exact matches or semantically equivalent variants), enabling targeted refactoring strategies like extraction to methods or inheritance hierarchies.[6] By addressing duplicate code proactively, developers can enhance code readability, reduce technical debt, and improve overall system reliability, though some contexts—such as performance-critical sections—may tolerate limited duplication for optimization purposes.[2]
Fundamentals
Definition
Duplicate code refers to identical or similar code fragments that appear in multiple locations within a software codebase, resulting in redundancy that can complicate maintenance and evolution.[7] This phenomenon, also known as code cloning, arises when developers copy and paste segments of source code rather than reusing abstractions, leading to repeated implementations of the same logic or structure.
Key characteristics of duplicate code include exact copies, which are textually identical except possibly for whitespace or comments; near-duplicates, featuring minor syntactic variations such as renamed variables or parameters; and functionally equivalent code, where the underlying logic performs the same operation but uses different syntactic constructs or algorithms.[8] These distinctions highlight the spectrum from superficial similarities to deeper semantic overlaps, all of which contribute to the challenges of managing codebase consistency.[7]
The term duplicate code gained prominence in the late 1990s alongside the widespread adoption of object-oriented programming in large-scale software projects, where modular design principles emphasized reuse but often encountered practical duplication issues.[3] Within this context, it was formalized as a specific type of code smell—a surface indication of deeper design problems—in Martin Fowler's seminal taxonomy of refactoring opportunities.[9]
Types
Duplicate code, often referred to as code clones in software engineering literature, is categorized into several types based on the degree of similarity and structural differences between code fragments. These categories help in understanding the spectrum of duplication from verbatim copies to semantically equivalent implementations. The most widely adopted classification distinguishes four types, primarily syntactic for the first three and semantic for the fourth.
Exact duplicates, also known as Type-1 clones, consist of identical code blocks except for differences in whitespace and comments. These arise commonly from direct copy-paste operations during development and represent the simplest form of duplication, making them the easiest to detect through straightforward textual comparison. Despite their detectability, exact duplicates are frequently overlooked in large codebases because developers may not immediately recognize the redundancy during initial coding phases.[10][8]
Near-duplicates, or parameterized duplicates (corresponding to Type-2 clones), involve code fragments that are syntactically identical except for superficial variations such as renamed variables, constants, or literals. These differences often result from adapting copied code to fit a new context, like changing parameter names to match local naming conventions. Identifying near-duplicates requires normalization techniques, such as renaming identifiers to a standard form, to align the fragments for comparison and reveal the underlying similarity.[10]
Structural duplicates (Type-3 clones) are code segments with syntactic modifications such as added, removed, or reordered statements beyond mere renaming. Functional duplicates (Type-4 clones) achieve the same functional outcome through semantically equivalent but syntactically different implementations, such as using alternative data structures or algorithms. These are the most challenging to detect automatically, as they demand analysis beyond surface-level syntax, often involving program transformation or behavioral equivalence checks.[10][11]
Classification of these types typically relies on metrics assessing similarity at the string, token, or structural levels. For instance, edit distance measures, such as the Levenshtein distance, quantify the minimum number of insertions, deletions, or substitutions needed to transform one code fragment into another, proving particularly useful for distinguishing near- and structural duplicates from exact ones. Token-based comparisons, which break code into lexical units (e.g., keywords, identifiers) and evaluate sequence similarity, provide a more robust criterion for handling syntactic variations in Type-2 and Type-3 clones. These approaches ensure precise categorization by balancing granularity and computational feasibility in clone analysis.[10]
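The distinction between string-level and token-level similarity can be made concrete with a short sketch. The following Python fragment computes a Levenshtein edit distance and a rough token-sequence similarity for two small, hypothetical fragments that differ only in identifier names; the fragments and any thresholds implied are purely illustrative, not part of any particular tool.

```python
import difflib

def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            current.append(min(previous[j] + 1,                 # deletion
                               current[j - 1] + 1,              # insertion
                               previous[j - 1] + (ca != cb)))   # substitution
        previous = current
    return previous[-1]

# Two fragments with identical structure but renamed identifiers (a Type-2-style pair).
fragment_a = "total = 0\nfor x in xs:\n    total += x"
fragment_b = "acc = 0\nfor v in vs:\n    acc += v"

print(levenshtein(fragment_a, fragment_b))   # small edit distance despite the renaming

# A crude token-level comparison: split into lexical units and compare the sequences.
ratio = difflib.SequenceMatcher(None, fragment_a.split(), fragment_b.split()).ratio()
print(f"token-sequence similarity: {ratio:.2f}")
```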
Origins
Causes
Duplicate code often arises from practical decisions made during software development to prioritize speed and functionality over long-term structure. One primary cause is copy-paste programming, where developers duplicate existing code fragments to reuse tested logic rather than implementing new solutions from scratch, particularly to save time and effort under tight constraints.[7] This practice is common in prototyping phases or when adapting code as templates, as it allows rapid iteration but frequently results in scattered repetitions across the codebase.[12]
In modern software development, AI-assisted coding tools, such as large language models, have introduced a new source of duplication. These tools generate code snippets quickly but often produce similar implementations without encouraging abstraction, leading to increased clones and technical debt in projects relying on AI for productivity.[13]
Evolving requirements, such as feature creep, contribute significantly to code duplication by necessitating the replication of similar logic in different modules to accommodate context-specific variations without immediate refactoring. In software product lines, for instance, new customer-driven features with overlapping semantics are often added informally, leading to unintended duplicates in domain or application models unless verified against existing structures.[14] This occurs because requirements evolve through natural language specifications that lack formal checks, causing developers to implement parallel functionalities independently to meet immediate needs.
Team collaboration challenges, especially in large or siloed development environments, exacerbate duplication when multiple developers implement the same functionality without awareness of each other's work. Lack of knowledge sharing or cooperation in distributed teams results in inadvertent replication, as unfamiliar team members unknowingly recreate features already present elsewhere in the system.[15] Such issues stem from organizational silos that limit visibility into the broader codebase, prompting isolated efforts that accumulate similar code segments.
Integrating legacy systems during mergers or expansions introduces duplicate code when similar logic, originally developed in separate eras, languages, or projects, is combined without thorough consolidation. In industrial settings like software product lines, the organic growth of applications—such as expanding from dozens to over 70 interconnected systems—leads to cross-project duplications that persist due to delayed refactoring efforts and the complexity of aligning disparate historical implementations.[16]
Finally, a failure to identify and abstract reusable patterns early in development fosters ad-hoc code repetitions, as developers resort to direct copying when language features or design choices make factoring out common functionality difficult. This lack of abstraction is compounded by efforts to avoid inter-module dependencies, which encourage self-contained replications rather than shared components, thereby embedding duplication into the system's architecture from the outset.[12]
Emergence
Duplicate code frequently originates during the initial stages of software development, particularly in proof-of-concept phases where developers emphasize speed and functionality over architectural elegance, often resorting to copy-and-paste techniques to quickly implement similar logic.[17] This practice is common when exploring ideas or building prototypes, as refactoring for reusability is deferred in favor of immediate progress.[18]
As projects evolve, duplicate code proliferates through ongoing maintenance activities, such as integrating new features or applying bug fixes, where developers inadvertently replicate existing code segments rather than leveraging shared abstractions, leading to widespread dissemination across modules.[19] Over time, these clones can evolve inconsistently if changes are not synchronized, further entrenching the duplication within the codebase.[20]
The emergence of duplicate code is notably more pronounced in large-scale, long-lived projects like enterprise software, compared to small scripts or prototypes, owing to the increased complexity, distributed contributions, and involvement of multiple developers.[20] Empirical analyses of open-source systems reveal that duplicates often constitute 5-20% of the codebase in mature projects, with prevalence rising alongside project scale and team size due to coordination challenges.[20]
A historical example is the Linux kernel, where duplicate code has arisen from modular contributions across diverse architectures and subsystems, with studies tracking cloning ratios of 14-16% across 19 releases from versions 2.4.0 to 2.4.18, often starting with new components cloned from established ones before gradual refinement.[21]
Impacts
Costs
Duplicate code imposes substantial maintenance overhead on software development teams, as modifications to shared logic must be replicated across all instances to ensure consistency. Failure to update every duplicate can introduce inconsistencies or bugs, with empirical studies showing that cloned code requires more maintenance effort than non-cloned code in approximately 61% of analyzed cases across multiple open-source systems.[22]
Testing duplicated code exacerbates challenges by necessitating redundant test cases for each instance, which inflates overall testing effort and increases the likelihood of overlooked defects in one or more copies. This redundancy contributes to higher development costs, as teams must verify and maintain parallel tests rather than a single, centralized implementation.
Code bloat from duplicates significantly enlarges the codebase, with empirical analyses indicating that 5-20% of code in typical software systems consists of clones, leading to slower compilation times, more complex navigation for developers, and prolonged onboarding for new team members.[16]
Economically, duplicate code elevates maintenance expenses: research has demonstrated that refactoring such duplication can reduce maintenance effort by up to 7% in large-scale projects, implying that its presence imposes a corresponding ongoing cost.[16]
From a security standpoint, duplicates heighten vulnerability risks through uneven patching, where fixes applied to one instance may leave others exposed, propagating latent bugs; one study identified 145 confirmed unpatched code clones as real security vulnerabilities across an operating system distribution comprising over 2 billion lines of code.[23]
Benefits
In performance-critical systems, such as embedded software, duplicate code can offer advantages by avoiding the overhead associated with abstractions like function calls, enabling more direct and efficient execution paths. Compiler inlining, a technique that deliberately duplicates function bodies at call sites, eliminates call-return overheads and facilitates further optimizations like dead code elimination, which can improve runtime performance by roughly 20-30% in resource-constrained environments. This is particularly valuable in real-time systems where predictability and low latency are paramount, as inlining reduces instruction fetch latencies and enhances instruction-level parallelism in architectures like VLIW processors.[24][25]
For one-off tools, prototypes, or small scripts, code duplication can enhance readability and development speed by sidestepping the complexity of generic abstractions or modular designs that might be overkill for limited-scope projects. In such contexts, copying code segments allows developers to maintain straightforward, self-contained logic without introducing unnecessary dependencies or parameterization, which can make the code more accessible for quick iterations or non-expert users. This approach is especially practical in scripting environments where the emphasis is on rapid prototyping rather than long-term maintainability.[26]
Duplicate code also supports safe experimentation by providing isolated copies of logic that can be modified without altering the core codebase, serving as a low-risk sandbox for testing changes or integrating new features gradually. This isolation minimizes the potential for unintended side effects during development branches, allowing teams to evaluate variations in behavior before committing to broader refactoring. Such practices have been observed in small-scale reuse scenarios where experimental modifications to subsystems are common.[26]
Historically, duplicate code was employed in early compilers and real-time systems due to the inefficiencies of early reuse mechanisms, such as limited support for procedures or macros that incurred significant overhead. In the 1990s, for instance, code duplication techniques were used to assist global instruction scheduling in superscalar and VLIW-based embedded processor designs. These methods were essential when abstraction tools were nascent, helping achieve performance targets in constrained hardware without viable alternatives.[25]
However, these benefits are largely contextual and tend to diminish in larger, scalable projects where maintenance demands outweigh any initial gains, as duplication amplifies error propagation risks and complicates evolution.[27]
Detection
Manual Methods
Manual methods for detecting duplicate code rely on human expertise and systematic processes to identify similarities in source code without the aid of specialized algorithms or automated tools. These approaches leverage developers' understanding of the codebase and programming patterns to spot redundancies, often integrated into established software development workflows. While effective in certain contexts, they demand significant time and are prone to oversight due to the subjective nature of visual inspection.
Code reviews serve as a primary manual method, where peer inspections occur during pull requests or change submissions to uncover visual similarities in code structure. Reviewers use checklists to systematically examine proposed changes, flagging instances of duplicated logic, such as repeated conditional statements or method bodies, to ensure consistency and reduce maintenance overhead. An empirical study of modern code reviews in large open-source projects like OpenStack and Qt found that duplicated code was the most frequently identified code smell, appearing in 709 instances across 1,539 smell-related reviews, with reviewers commonly suggesting refactoring techniques like Extract Method to address them. In these reviews, 79% of identified duplications were fixed, often within a week, highlighting the method's effectiveness when integrated into collaborative development processes.[28]
Refactoring audits involve periodic manual scans conducted by experienced developers, targeting high-risk areas such as utility functions or shared libraries where duplication is likely to accumulate. These audits typically focus on modules with frequent modifications, using walkthroughs to compare code segments for identical or near-identical implementations that could propagate errors if altered inconsistently. In one study of refactoring changes in OpenStack, code reviews encompassing audit-like inspections revealed discussions around eliminating duplicate code in driver implementations, emphasizing the need to consolidate redundancies to improve modularity. Such audits are particularly valuable in mature codebases, where developers draw on domain knowledge to prioritize areas prone to cloning during feature additions or bug fixes.
Metrics-based screening supplements visual inspection by employing simple counts, such as line similarity percentages, facilitated through integrated development environment (IDE) features like built-in diff viewers. Developers manually compare files or functions side-by-side, calculating rough similarity metrics (e.g., matching lines exceeding 70%) to flag potential duplicates without invoking full clone detection engines. For instance, IDE diff tools allow pairwise comparisons that highlight structural overlaps, enabling quick assessments in focused sessions. This approach aids in verifying suspected duplications identified during reviews, providing a lightweight quantitative layer to human judgment.
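As a concrete illustration of such lightweight screening, the sketch below uses Python's standard difflib module to compute a rough percentage of matching lines between two snippets and flags the pair for manual review above the illustrative 70% threshold mentioned above; the snippets and the threshold are hypothetical, not drawn from any particular IDE.

```python
import difflib

def line_similarity(code_a: str, code_b: str) -> float:
    """Rough percentage of matching (stripped, non-blank) lines between two fragments."""
    lines_a = [line.strip() for line in code_a.splitlines() if line.strip()]
    lines_b = [line.strip() for line in code_b.splitlines() if line.strip()]
    return 100.0 * difflib.SequenceMatcher(None, lines_a, lines_b).ratio()

snippet_a = """
total = 0
for order in orders:
    if order.valid:
        total += order.amount
"""
snippet_b = """
total = 0
for order in orders:
    if order.valid:
        total += order.amount
log.info(total)
"""

similarity = line_similarity(snippet_a, snippet_b)
if similarity > 70:   # illustrative threshold; teams pick their own cut-off
    print(f"Flag for manual side-by-side review ({similarity:.0f}% matching lines)")
```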
Best practices for manual detection include training developers to recognize common duplication patterns, such as replicated loops, conditionals, or data processing routines, through workshops or guideline documentation. Programs emphasizing the DRY (Don't Repeat Yourself) principle equip teams to proactively scan for these during development, fostering a culture of vigilance. In code review contexts, structured training has been shown to increase the detection rate of smells like duplication by encouraging explicit discussions on code reuse.[28]
Despite their merits, manual methods are time-intensive and subjective, relying on individual expertise that can lead to inconsistent results across teams. They prove most effective for small codebases under 10,000 lines of code, where comprehensive coverage is feasible, but scale poorly to larger systems without automation.
Automated Methods
Automated methods for detecting duplicate code, also known as code clones, leverage algorithms and computational techniques to identify similarities in source code at scale, enabling efficient analysis of large codebases without manual intervention. These approaches typically preprocess code into normalized representations to handle variations in formatting, identifiers, and minor syntactic differences, followed by similarity comparison using data structures or statistical models. Early techniques focused on exact matches, while contemporary methods address near-duplicates and even functional equivalents through advanced structures like trees or learning-based models.[29]
Token-based analysis forms a foundational technique in automated clone detection, where source code is tokenized—breaking it into lexical units such as keywords, operators, and identifiers—and sequences of these tokens are compared to identify duplicates. This method is tolerant to superficial differences like whitespace, comments, or layout variations, as normalization steps remove or abstract such elements prior to comparison. A seminal implementation, CCFinder, employs multilinguistic tokenization and a longest common subsequence algorithm to detect Type-1 and Type-2 clones across languages like C, Java, and COBOL, processing millions of lines of code efficiently.[30][31]
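The idea can be sketched on a small scale with Python's standard tokenize module: identifiers and literals are abstracted to placeholder tokens and layout is discarded, so two fragments that differ only in names compare as equal. The fragments and normalization rules below are deliberately simplified for illustration; production tools such as CCFinder apply far more elaborate, language-aware transformation rules.

```python
import io
import keyword
import token
import tokenize

def normalized_tokens(source: str) -> list[str]:
    """Lex Python source and abstract identifiers and literals (Type-2-style normalization)."""
    result = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == token.NAME and not keyword.iskeyword(tok.string):
            result.append("ID")                      # rename every identifier to a standard form
        elif tok.type in (token.NUMBER, token.STRING):
            result.append("LIT")                     # abstract literal values
        elif tok.type in (token.NEWLINE, token.NL, token.INDENT,
                          token.DEDENT, token.COMMENT, token.ENDMARKER):
            continue                                 # ignore layout and comments
        else:
            result.append(tok.string)
    return result

a = "def total(xs):\n    s = 0\n    for x in xs:\n        s += x\n    return s\n"
b = "def summed(vals):\n    acc = 0\n    for v in vals:\n        acc += v\n    return acc\n"

print(normalized_tokens(a) == normalized_tokens(b))   # True: the pair is a Type-2 clone
```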
Fingerprinting enhances efficiency for exact match detection by computing compact hash values, or fingerprints, for code blocks and using string-matching algorithms to locate duplicates. The Rabin-Karp algorithm, adapted for substrings, rolls hashes over tokenized sequences to identify matching blocks in linear time relative to the input size, making it suitable for large-scale scans. Pioneered in tools like those developed by Johnson, this approach hashes lines or blocks while ignoring insignificant differences, allowing rapid indexing and retrieval of potential clones in repositories exceeding millions of lines.[32][33]
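A minimal sketch of the fingerprinting idea, assuming a toy token stream and illustrative hash parameters, is shown below: a polynomial rolling hash is computed over every fixed-size window of tokens, and windows that share a fingerprint are reported as candidate duplicates. Real tools add a verification pass to rule out hash collisions and typically hash normalized tokens rather than raw strings.

```python
from collections import defaultdict

BASE, MOD = 257, (1 << 61) - 1   # illustrative polynomial rolling-hash parameters

def fingerprint_index(tokens: list[str], window: int) -> dict[int, list[int]]:
    """Map the rolling hash of every `window`-token block to its start offsets."""
    index = defaultdict(list)
    codes = [hash(t) % MOD for t in tokens]
    power = pow(BASE, window, MOD)
    h = 0
    for i, c in enumerate(codes):
        h = (h * BASE + c) % MOD
        if i >= window:                      # slide: drop the token leaving the window
            h = (h - codes[i - window] * power) % MOD
        if i >= window - 1:
            index[h].append(i - window + 1)
    return index

tokens = "a = b + c ; d = b + c ; print ( d )".split()
for fingerprint, offsets in fingerprint_index(tokens, window=5).items():
    if len(offsets) > 1:                     # same fingerprint at two offsets: candidate clone
        print("candidate duplicate blocks at token offsets", offsets)
```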
More advanced clone detection algorithms extend beyond tokens to handle structural and semantic similarities. Suffix trees, constructed from abstract syntax trees (ASTs), enable efficient discovery of near-duplicates (Type-3 clones) by representing code as compressed tries of all suffixes, allowing linear-time searches for common subsequences that account for reorderings or modifications. For functional similarity (Type-4 clones), machine learning models, such as deep neural networks, encode code into vector representations capturing control and data flow, then measure similarity via cosine distance or classification. Techniques like DeepSim use graph-based embeddings from program dependence graphs to achieve high detection rates for semantically equivalent but syntactically varied code fragments. Recent advancements as of 2025 incorporate large language models (LLMs) for clone detection, utilizing zero-shot or few-shot prompting to identify semantic clones effectively, often achieving high F1 scores on benchmarks while handling cross-lingual cases, though performance varies with prompt complexity.[34][35][36][6]
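For learning-based detection, the final comparison between two learned vector representations is typically a cosine similarity. The sketch below shows only that last step, with hypothetical four-dimensional embeddings and an illustrative threshold; producing the embeddings themselves requires a trained code encoder, which is outside the scope of this sketch.

```python
import math

def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Similarity between two code-embedding vectors (1.0 means identical direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Hypothetical embeddings produced by some learned code encoder for two fragments.
embedding_a = [0.12, 0.87, 0.33, 0.05]
embedding_b = [0.10, 0.90, 0.30, 0.07]

if cosine_similarity(embedding_a, embedding_b) > 0.95:   # illustrative threshold
    print("Fragments are likely semantic clones")
```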
These methods integrate seamlessly into integrated development environments (IDEs) and continuous integration (CI) pipelines for real-time or periodic analysis. Tools like SonarQube employ token-based detection, flagging duplicates as sequences of at least 10-20 tokens (configurable per language) and reporting clone classes—groups of related duplicate blocks—with metrics on coverage and density, often visualized in dashboards to guide developers during commits or builds.
Accuracy in automated detection involves trade-offs between precision (avoiding false positives) and recall (capturing true duplicates), with modern tools typically achieving 80-95% rates on benchmark datasets like BigCloneBench, depending on clone type and codebase scale. For instance, token- and tree-based methods excel in syntactic clones with recalls over 90%, while ML approaches boost functional detection but may introduce variability from training data.[37][38]
Refactoring Techniques
Refactoring techniques for duplicate code, often referred to as code clones, involve transforming existing code to consolidate redundancies while preserving behavior. These methods draw from established practices in software engineering to improve maintainability and reduce error-prone repetitions. A key approach is the extract method technique, which identifies duplicate code fragments and encapsulates them into a reusable method, adjusting parameters to generalize the logic for broader applicability. This refactoring is particularly effective for Type-1 (exact) and Type-2 (renamable) clones, as it promotes single responsibility and eases future modifications.[39]
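A minimal before-and-after sketch of Extract Method in Python is shown below; the domain (invoices and quotes sharing a bulk-discount rule), the names, and the threshold are invented for illustration.

```python
from collections import namedtuple

Item = namedtuple("Item", "price")

# Before: the discount logic is duplicated in two functions.
def invoice_total(items):
    total = sum(i.price for i in items)
    if total > 100:
        total *= 0.9          # bulk discount
    return total

def quote_total(items):
    total = sum(i.price for i in items)
    if total > 100:
        total *= 0.9          # bulk discount (same logic, copied)
    return total

# After: Extract Method moves the shared logic into one parameterized helper.
def discounted_total(items, threshold=100, rate=0.9):
    total = sum(i.price for i in items)
    return total * rate if total > threshold else total

assert invoice_total([Item(60), Item(60)]) == discounted_total([Item(60), Item(60)])
```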
For near-duplicates or Type-3 clones where code varies slightly, the template method pattern abstracts the common algorithmic structure into a superclass method, allowing subclasses to override variable steps while inheriting the shared skeleton. This technique is suitable when clones share a high-level flow but differ in implementation details, such as conditional branches or computations, enabling polymorphism to handle variations without proliferation. It has been applied in object-oriented languages to unify behaviors across related classes, reducing the need for manual synchronization.[39]
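The following sketch shows one possible shape of the pattern in Python, using an invented report-export example: the shared skeleton lives once in the superclass, and only the variable rendering step is overridden by subclasses.

```python
import json
from abc import ABC, abstractmethod

class ReportExporter(ABC):
    """Template Method: the shared export skeleton is defined exactly once."""

    def export(self, records):
        cleaned = [r for r in records if r]        # common step
        body = self.render(cleaned)                # variable step, supplied by subclasses
        return f"HEADER\n{body}\nFOOTER"           # common step

    @abstractmethod
    def render(self, records):
        ...

class CsvExporter(ReportExporter):
    def render(self, records):
        return "\n".join(",".join(map(str, r)) for r in records)

class JsonExporter(ReportExporter):
    def render(self, records):
        return json.dumps(records)

print(CsvExporter().export([(1, "a"), (2, "b")]))
print(JsonExporter().export([(1, "a"), (2, "b")]))
```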
In languages supporting metaprogramming, code generation techniques mitigate duplication by automating the creation of similar code segments through macros, templates, or preprocessors. For instance, in C++, macros can expand to generate boilerplate code for repetitive blocks, avoiding manual copying while ensuring consistency. This approach is ideal for structural duplicates arising from language constraints, such as low-level operations, and prevents divergence over time as the generated code remains synchronized with the source template.[39]
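The paragraph above refers to C++ preprocessor macros; as a rough Python analogue, the sketch below generates a family of near-identical conversion functions from a single template at runtime instead of copying the body by hand. The unit table and names are invented for illustration.

```python
# Instead of hand-writing near-identical converters, generate them from one template.
UNITS = {"km": 1000.0, "cm": 0.01, "mm": 0.001}

def make_converter(factor):
    def to_metres(value):
        return value * factor
    return to_metres

converters = {unit: make_converter(factor) for unit, factor in UNITS.items()}

print(converters["km"](2.5))   # 2500.0
print(converters["mm"](42))    # 0.042
```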
Type-specific strategies address functional duplicates by leveraging language paradigms: in object-oriented contexts, polymorphism redesigns clones into overridden methods or interfaces to encapsulate variations; in functional programming, higher-order functions or lambdas abstract common operations, passing differing behaviors as parameters. These methods transform ad-hoc repetitions into idiomatic, extensible constructs, enhancing code reuse without altering semantics.[39]
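A small functional-style sketch, with invented data, illustrates the second case: two loops that previously duplicated the same traversal collapse into one higher-order function, and the differing behaviour is passed in as parameters.

```python
def aggregate(values, keep, combine, start):
    """Shared traversal; the varying filter and combination logic arrive as functions."""
    result = start
    for value in values:
        if keep(value):
            result = combine(result, value)
    return result

positives_sum = aggregate([3, -1, 4], keep=lambda v: v > 0,
                          combine=lambda acc, v: acc + v, start=0)
negatives_count = aggregate([3, -1, 4], keep=lambda v: v < 0,
                            combine=lambda acc, _: acc + 1, start=0)
print(positives_sum, negatives_count)   # 7 1
```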
A structured step-by-step process underpins safe application of these techniques: first, identify clones using detection outcomes to pinpoint locations and types; second, assess risks such as dependency impacts or behavioral changes through precondition checks; third, apply transformations incrementally, refactoring one clone pair at a time; and finally, verify correctness with comprehensive tests to ensure no regressions. This iterative workflow minimizes disruptions in large codebases and supports gradual improvement.[39]
Several popular tools exist for detecting and managing duplicate code, each tailored to specific languages or offering broad support. PMD's Copy/Paste Detector (CPD) is a widely used open-source tool for Java and related languages like JSP, Kotlin, and Groovy, capable of scanning large projects via command-line or build tool integrations to identify duplicated blocks based on token similarity.[40] Simian, a commercial similarity analyzer, supports multi-language detection including Java, C#, JavaScript, Python, and HTML/XML, processing entire codebases quickly to report duplicate lines and facilitate refactoring.[41] For JavaScript and TypeScript projects, jscpd serves as an effective integrated option, supporting over 150 formats and generating reports in HTML, XML, or JSON while ignoring comments and whitespace for accurate clone identification.[42]
Integrating duplicate code checks into CI/CD pipelines ensures ongoing enforcement and prevents accumulation during development. Tools like SonarQube can be embedded in pipelines using plugins for GitHub Actions, Jenkins, or GitLab CI, where scans run automatically on pull requests and fail builds if duplication exceeds configurable thresholds, such as 5% of the codebase.[43] This approach, often combined with PMD or Simian via Maven or npm scripts, allows teams to maintain quality gates without manual intervention.[40]
Best practices for addressing duplicate code emphasize proactive policies and processes. Establishing duplication limits, such as no more than 3-5% in style guides, promotes the DRY principle and reduces maintenance overhead, as recommended in secure coding standards.[44] Teams should prioritize refactoring duplicates during sprint planning, allocating time for extracting shared utilities into libraries or modules, and document these utilities clearly to encourage reuse across projects.[18]
Recent advancements post-2020 incorporate AI and natural language processing (NLP) for detecting semantic duplicates, beyond syntactic matches. Deep learning models, such as those using tree-based convolutions or graph neural networks, analyze code semantics to identify functionally similar fragments, improving accuracy on varied clones in benchmarks like BigCloneBench.[45] Tools like CodeAnt leverage AI for cross-file semantic analysis in pull requests, supporting languages including Python and JavaScript with low false positives.[46] Concurrently, the widespread use of AI coding assistants has contributed to a rise in duplicate code; a 2025 study analyzing repositories from major tech firms reported that duplicates comprised 12.3% of changed lines in 2024, up from 8.3% in 2021, with projections indicating fourfold growth.[1]
When selecting tools, key evaluation criteria include language support to match project needs, false positive rates to minimize developer disruption (e.g., via configurable ignore options in PMD), and ease of integration with existing workflows like IDEs or CI/CD systems.[40] Prioritizing tools with proven scalability on large repositories ensures reliable performance without excessive configuration overhead.[27]
Examples
Functional Duplicates
Functional duplicates refer to code fragments that exhibit identical input-output behavior despite differences in syntax, structure, or underlying algorithms, often classified as Type-4 or semantic clones in code duplication research.[47] These duplicates arise when developers implement the same functionality using alternative approaches, leading to logically equivalent but structurally distinct code.[48]
A classic illustration involves computing the factorial of a number, where one implementation uses recursion while another employs iteration, yet both yield the same results for valid inputs. The recursive version is defined by factorial(n) = n * factorial(n-1) with a base case of factorial(0) = 1, whereas the iterative approach uses a loop to multiply from 1 to n. This equivalence holds for non-negative integers, demonstrating how varied algorithmic strategies can produce functionally identical outcomes.[47]
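A direct rendering of this example in Python, assuming non-negative integer inputs, is shown below; both functions return identical values even though their control structures differ.

```python
def factorial_recursive(n):
    """Recursive form: factorial(0) == 1 and factorial(n) == n * factorial(n - 1)."""
    return 1 if n == 0 else n * factorial_recursive(n - 1)

def factorial_iterative(n):
    """Iterative form: multiply 1 through n in a loop; same input-output behaviour."""
    result = 1
    for i in range(1, n + 1):
        result *= i
    return result

assert all(factorial_recursive(n) == factorial_iterative(n) for n in range(10))
```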
In real-world applications, such duplicates commonly appear in validation logic, such as email address checking in web forms, where one module might rely on regular expressions to match patterns while another implements a finite state machine to parse and verify structure, both ensuring the same set of valid inputs pass. These variations often stem from evolving requirements or team preferences, resulting in scattered but equivalent checks across a codebase.
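The sketch below illustrates this with a deliberately simplified email grammar (a non-empty local part, a single "@", and a domain containing an interior dot). Both implementations are hypothetical and far less strict than real-world validators, but they accept the same inputs for this simplified grammar, making them functional duplicates despite their different mechanisms.

```python
import re

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def is_valid_email_regex(address: str) -> bool:
    """Pattern-matching implementation of the simplified grammar."""
    return EMAIL_RE.match(address) is not None

def is_valid_email_fsm(address: str) -> bool:
    """Hand-rolled finite state machine accepting the same simplified grammar."""
    state = "LOCAL_START"                     # expecting the first local-part character
    for ch in address:
        if ch.isspace():
            return False
        if ch == "@":
            if state != "LOCAL":
                return False                  # "@" is only allowed once, after the local part
            state = "DOMAIN_START"
        elif state == "LOCAL_START":
            state = "LOCAL"
        elif state == "DOMAIN_START":
            state = "DOMAIN"
        elif state == "DOMAIN" and ch == ".":
            state = "AFTER_DOT"               # interior dot seen; need at least one more character
        elif state == "AFTER_DOT":
            state = "ACCEPT"
    return state == "ACCEPT"

samples = ["user@example.com", "a@mail.example.org", "missing-at.com",
           "user@nodot", "two@@signs.com", "spaced user@x.com"]
assert all(is_valid_email_regex(s) == is_valid_email_fsm(s) for s in samples)
```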
Detecting functional duplicates poses significant challenges, as it demands semantic analysis to verify behavioral equivalence rather than mere syntactic or textual matching, which traditional tools overlook due to issues like project-specific data types, external dependencies, and incomplete execution coverage.[48] For instance, dynamic testing strategies may fail on 60-87% of chunks involving custom types or I/O operations, limiting reliable identification.[48]
Refactoring functional duplicates typically involves consolidating the logic into a single polymorphic function that abstracts the common behavior, allowing callers to select implementations via parameters or interfaces, thereby reducing redundancy while preserving flexibility. This approach, such as extracting to a strategy pattern, enhances maintainability by centralizing updates to shared functionality.
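One possible shape of such a consolidation, sketched below with the factorial example and invented names, keeps both algorithms but exposes them behind a single entry point so that call sites no longer diverge.

```python
# One public entry point; callers choose an implementation (strategy) by name.
STRATEGIES = {}

def strategy(name):
    def register(fn):
        STRATEGIES[name] = fn
        return fn
    return register

@strategy("recursive")
def _factorial_recursive(n):
    return 1 if n == 0 else n * _factorial_recursive(n - 1)

@strategy("iterative")
def _factorial_iterative(n):
    result = 1
    for i in range(1, n + 1):
        result *= i
    return result

def factorial(n, implementation="iterative"):
    """Consolidated interface: the formerly duplicated callers now share one entry point."""
    return STRATEGIES[implementation](n)

print(factorial(5), factorial(5, implementation="recursive"))   # 120 120
```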
Structural Duplicates
Structural duplicates, classified as Type-2 code clones in established detection frameworks, consist of code fragments that are syntactically identical except for variations in identifiers such as variable names, function names, or literal constants. This form of duplication typically emerges from copy-paste practices where developers adapt the code minimally to the new context without altering the overall structure or logic.[49]
A representative example in Python involves processing collections with similar summation logic but differing variable nomenclature:
```python
def calculate_user_scores(user_scores):
    total_score = 0
    for score in user_scores:
        if score > 0:
            total_score += score
    return total_score

def calculate_item_values(item_values):
    total_value = 0
    for value in item_values:
        if value > 0:
            total_value += value
    return total_value
```
In this case, the loops and conditional checks mirror each other precisely, differing only in identifier choices like user_scores versus item_values.[49]
In real-world scenarios, such as building RESTful API services, structural duplicates frequently occur in error-handling blocks across endpoints; for example, validation and logging routines may be copied with tweaks to variable names for request-specific fields like user IDs or timestamps. Detection of these duplicates is straightforward via token-based or textual comparison after identifier normalization, enabling tools to identify matches with high precision.[49] Refactoring them, however, requires parameterization techniques, such as extracting the common structure into a parameterized method to handle varying identifiers uniformly.[50]
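Applied to the example above, such a parameterized extraction might look like the following sketch, in which the shared loop is hoisted into a single helper and both original functions become thin wrappers around it.

```python
def sum_positive(values):
    """Extracted, parameterized form of the two duplicated functions above."""
    total = 0
    for value in values:
        if value > 0:
            total += value
    return total

# The original call sites now delegate to the shared helper.
def calculate_user_scores(user_scores):
    return sum_positive(user_scores)

def calculate_item_values(item_values):
    return sum_positive(item_values)
```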
To illustrate impact, consider a sample project like the CTAGS open-source tool, where structural duplicates (Type-2 clones) were analyzed; maintenance tasks on these clones demanded an average of 2007 effort units compared to 1659 for non-cloned code, representing over 20% additional effort and heightening risks of inconsistent updates across instances.[51]