
SCIgen

SCIgen is a computer program that automatically generates random, nonsensical research papers in computer science, including realistic graphs, figures, and citations, using a context-free grammar to mimic the structure and style of legitimate academic publications. Developed in 2005 by three graduate students in MIT's Parallel and Distributed Operating Systems (PDOS) group within the Computer Science and Artificial Intelligence Laboratory (CSAIL)—Jeremy Stribling, Dan Aguayo, and Max Krohn—SCIgen was created primarily for amusement but quickly became a tool to expose flaws in academic peer review processes. The program's inaugural use involved submitting a generated paper titled "Rooter: A Methodology for the Typical Unification of Access Points and Redundancy" to the 9th World Multiconference on Systemics, Cybernetics and Informatics (WMSCI) in Orlando, Florida, which accepted it as a non-reviewed paper in April 2005, prompting the students to attend the conference and present it in a mock session. This incident, which garnered media attention including from Nature, highlighted concerns about the conference's quality control and led IEEE to withdraw its technical co-sponsorship of WMSCI shortly thereafter. Over the years, SCIgen has been widely adopted by researchers and students to test the rigor of conferences and journals, resulting in numerous acceptances of its output by questionable venues. Notable examples include a paper under the pseudonym "Herbert Schlangemann" accepted to the 2008 International Conference on Computer Science and Software Engineering (CSSE), and submissions by students at Sharif University in Iran that were accepted to other events. 
In 2013–2014, French computer scientist Cyril Labbé identified over 120 SCIgen-generated papers published in more than 30 conference proceedings between 2008 and 2013, primarily in computer science and engineering fields; this led the major publishers Springer (16 papers) and IEEE (over 100 papers) to retract them from their subscription databases. The open-source code, released under the GNU General Public License and mirrored on GitHub, received around 600,000 pageviews annually as of 2015, underscoring SCIgen's enduring role in critiquing academic publishing practices and inspiring related tools such as SCIpher, which embeds hidden messages in fake calls for papers. In 2025, marking its 20th anniversary, reflections on SCIgen emphasized its early demonstration of AI-like text generation, paralleling modern large language models in raising concerns about machine-generated scholarly text. Despite its satirical origins, SCIgen has influenced discussions on research integrity, with its outputs occasionally cited in real literature due to superficial plausibility, further emphasizing the need for robust review mechanisms in academic publishing.

Development and Functionality

Origins and Purpose

SCIgen was created in April 2005 by MIT graduate students Jeremy Stribling, Max Krohn, and Dan Aguayo, who were members of the Parallel and Distributed Operating Systems (PDOS) group within the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL). The program originated as a short-term project, taking about a week or two to develop amid their coursework, and built on Krohn's earlier experience with automated text generation from co-founding the study guide platform SparkNotes. The primary purpose of SCIgen was to produce humorous, nonsensical research papers as a form of satire, aimed at exposing flaws in the peer-review processes of low-tier conferences rather than at outright fraud. This motivation stemmed from the creators' frustration with receiving unsolicited invitations—often described as spam—from organizers of questionable conferences, such as the World Multiconference on Systemics, Cybernetics and Informatics (WMSCI), which they viewed as lacking rigorous standards. By generating papers that mimicked legitimate scholarship but contained meaningless content, the tool demonstrated vulnerabilities in conference acceptance practices. SCIgen was released as open-source software under the GNU General Public License version 2.0 (GPL-2.0), allowing free use, modification, and distribution to encourage community experimentation and further highlight issues in academic publishing. This licensing choice aligned with its satirical intent, enabling others to generate and test similar hoax submissions while promoting transparency about the tool's artificial nature.

Technical Implementation

SCIgen is implemented as a Perl script, originally developed by Jeremy Stribling, Max Krohn, and Dan Aguayo in MIT's Parallel and Distributed Operating Systems (PDOS) group. The program is hosted on the MIT PDOS website at pdos.csail.mit.edu/archive/scigen/ and released under the GNU General Public License (GPL), with its source code mirrored on GitHub. It operates by parsing a set of hand-written rules defined in a configuration file (scirules.in), which specify production rules for generating syntactically valid but semantically meaningless text. The core mechanism relies on this context-free grammar to randomly assemble paper components from a lexicon of computer science terminology, ensuring the output mimics the structure of real academic papers in fields like distributed systems and algorithms. For instance, the grammar defines non-terminals for sections such as the abstract, introduction, related work, methodology, evaluation, and conclusion, where sentences are probabilistically expanded into terminals such as buzzwords—"scalable," "Byzantine fault tolerance," and "red-black trees"—drawn from a database of authentic CS phrases. Figures and graphs are generated automatically via integrated tools, such as Perl scripts that plot random data points to create plausible-looking diagrams (e.g., bar charts or line graphs labeled with fabricated metrics), while citations are fabricated by randomly selecting and formatting entries from a predefined list of real or pseudo-references. Users interact with SCIgen through command-line options in the Perl executable (scigen.pl), enabling customization such as specifying output length (via verbosity parameters), incorporating user-provided keywords to influence grammar expansions, generating presentation slides instead of full papers, or including fake author names and affiliations to complete the document metadata.
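The grammar-expansion mechanism described above can be sketched in a few lines. The following is a toy illustration, not SCIgen's actual scirules.in: the rule names and phrase lists are invented stand-ins (and the spacing around punctuation is naive), but the recursive random expansion of non-terminals works the same way.

```python
import random

# Hypothetical, trimmed-down grammar in the spirit of SCIgen's scirules.in.
# Keys are non-terminals; each value lists alternative productions.
GRAMMAR = {
    "SENTENCE": [
        ["In this paper we", "VERB", "NOUN_PHRASE", "."],
        ["Our", "ADJ", "framework can", "VERB", "NOUN_PHRASE", "."],
    ],
    "VERB": [["motivate"], ["disprove"], ["synthesize"]],
    "NOUN_PHRASE": [["ADJ", "NOUN"]],
    "ADJ": [["scalable"], ["Byzantine"], ["probabilistic"]],
    "NOUN": [["red-black trees"], ["access points"], ["superpages"]],
}

def expand(symbol):
    """Recursively expand a non-terminal by picking a random production;
    anything not in the grammar is a terminal and is emitted as-is."""
    if symbol not in GRAMMAR:
        return symbol
    production = random.choice(GRAMMAR[symbol])
    return " ".join(expand(s) for s in production)

print(expand("SENTENCE"))
```

Each call yields a grammatical but meaningless sentence; SCIgen applies the same idea at the scale of whole sections, figures, and bibliographies.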
The primary output is LaTeX source code, assembled by a provided script (make-latex.pl), which produces a compilable .tex file ready for PDF generation; this allows easy integration with standard LaTeX workflows. However, the system has inherent limitations: it produces text that appears coherent at a superficial level, owing to grammatical correctness and domain-specific jargon, but lacks genuine logical connections or novel contributions, often resulting in repetitive or nonsensical arguments. Additionally, it supports only Latin-1 character encoding, excluding full Unicode, and is optimized for Unix-like environments.
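The final assembly stage can likewise be sketched: generated section text is wrapped in a LaTeX skeleton that compiles into a paper-shaped document. The titles and section bodies below are placeholders, not SCIgen output, and this is only an analogy to what the make-latex.pl step produces.

```python
# Placeholder section contents standing in for grammar-generated text.
sections = {
    "Introduction": "Many physicists would agree that ...",
    "Evaluation": "Our experiments soon proved that ...",
}

# Wrap the sections in a minimal compilable LaTeX document.
lines = [
    r"\documentclass{article}",
    r"\title{A Placeholder Title}",
    r"\begin{document}",
    r"\maketitle",
]
for heading, body in sections.items():
    lines.append(rf"\section{{{heading}}}")
    lines.append(body)
lines.append(r"\end{document}")

tex_source = "\n".join(lines)
print(tex_source.splitlines()[0])  # \documentclass{article}
```

Writing `tex_source` to a .tex file and running pdflatex yields a typeset PDF, which is what made SCIgen's output visually indistinguishable from a real submission at a glance.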

Early Hoaxes and Demonstrations

2005 WMSCI Acceptance

In 2005, three MIT graduate students—Jeremy Stribling, Daniel Aguayo, and Maxwell Krohn—used the newly developed SCIgen program to generate a nonsensical computer science paper titled "Rooter: A Methodology for the Typical Unification of Access Points and Redundancy." The paper, which featured fabricated claims about unifying voice-over-IP with public-private key pairs using stochastic models, was produced as a demonstration of vulnerabilities in academic publishing standards. This marked the first major real-world test of SCIgen, created earlier that year to highlight lax peer-review processes at certain conferences. The students submitted the paper to the World Multiconference on Systemics, Cybernetics and Informatics (WMSCI), scheduled for July 2005 in Orlando, Florida. WMSCI accepted it as a non-reviewed paper shortly after submission, without providing any reviewer feedback, and requested a $390 advance registration fee to include it in the proceedings and schedule a presentation. This acceptance process exemplified the conference's loose standards, under which papers were often approved with minimal scrutiny to fill sessions. The hoax was publicly revealed in April 2005 through media reports that detailed the generation and acceptance of the paper as a critique of predatory conference practices. In response, WMSCI organizers disinvited the students from attending and presenting, rejecting the paper's inclusion in the official agenda and stating they would review their acceptance procedures. The revelation garnered widespread media attention from outlets including Nature, BBC News, and MIT News, amplifying discussions of quality-control issues at academic conferences. The incident immediately spotlighted concerns over low-barrier conferences that prioritized volume over rigor, contributing to early awareness of predatory conference models, though WMSCI issued no formal retraction of the acceptance at the time.
Despite the disinvitation, the students crowdfunded over $2,400 from 165 donors to attend WMSCI incognito, where they staged an impromptu "presentation" in a hotel room using fake identities to further mock the event.

Schlangemann Pseudonym Publications

The pseudonym "Herbert Schlangemann" was employed in a series of targeted hoaxes submitting SCIgen-generated papers to international computer science conferences, aiming to expose weaknesses in peer-review practices. The name combines "Herbert," a common German first name, with "Schlangemann," which translates to "snake man" in German and originates from a character in the 2000 satirical short film Der Schlangemann. This recurring fake identity allowed repeated testing of submission processes under a consistent author profile purportedly affiliated with Umeå University in Sweden. In December 2008, the SCIgen-generated paper "Towards the Simulation of E-Commerce" by Herbert Schlangemann was accepted for presentation at the inaugural International Conference on Computer Science and Software Engineering (CSSE 2008), held December 12–14 in Wuhan, China, and sponsored by the IEEE Computer Society. The submission was nominally peer-reviewed and included in the conference proceedings, with the fictional author even selected as chair for one of the sessions. The hoax was publicly revealed shortly after acceptance via an anonymous blog detailing the experiment, leading to the paper's prompt removal from the proceedings on grounds of violating IEEE publishing policies. This incident marked one of the earliest documented uses of SCIgen beyond its initial 2005 demonstration, underscoring review lapses within the conference's broad topical scope. A similar hoax occurred in 2009, when another SCIgen-produced paper under the Schlangemann pseudonym, titled "PlusPug: A Methodology for the Improvement of Local-Area Networks," was accepted for oral presentation at the International Conference on E-Business and Information System Security (EBISS 2009), held in Wuhan, China, on May 23–24, with intended publication in IEEE proceedings. The acceptance followed submission to a call for papers on e-business and security topics, but post-revelation exposure prompted the conference organizers to retract the paper and at least one other suspicious submission, citing insufficient reviewer expertise and procedural oversights.
EBISS organizers acknowledged these vulnerabilities in communications following the incident, highlighting the challenge of maintaining review quality at emerging regional events. These Schlangemann publications demonstrated how automated nonsense could infiltrate supposedly rigorous venues, prompting the conferences to admit flaws in their evaluation mechanisms and reinforcing broader concerns about integrity at less-established international gatherings. The hoaxes built on earlier SCIgen demonstrations, such as the 2005 WMSCI acceptance, but focused on IEEE-affiliated events to amplify scrutiny of IEEE technical sponsorships.

Notable Acceptances and Proliferation

Conference Acceptances

One notable instance of SCIgen's use in conference submissions occurred in 2008, when a pseudonymous author named Herbert Schlangemann submitted a generated paper titled "Towards the Simulation of E-Commerce" to the International Conference on Computer Science and Software Engineering (CSSE 2008), an IEEE-sponsored event held in Wuhan, China. The paper was accepted for presentation and publication in the proceedings, highlighting vulnerabilities in the peer-review processes of certain conferences. In 2010, students from Sharif University in Iran submitted an SCIgen-generated paper to the International Conference on Computational Aspects of Social Networks (CASoN 2010), another IEEE-sponsored event. The submission was accepted, demonstrating how the tool could infiltrate even venues focused on specialized topics such as social-network algorithms. Such cases often involved paying registration fees without attending or presenting, underscoring patterns of exploitation at lower-tier venues. A broader proliferation emerged between 2008 and 2013, with over 120 SCIgen-generated papers identified in conference proceedings published by IEEE and Springer. IEEE proceedings alone contained more than 100 such papers across over 30 events, primarily in computer science subfields like software engineering and networking, while Springer identified 16 in its proceedings, mostly from engineering and computing conferences. These acceptances, detected by French researcher Cyril Labbé through automated analysis, typically featured nonsensical content on themes such as e-commerce methodologies or sensor-network simulations, and were later removed by the publishers. The incidents revealed systemic issues at low-impact or regional conferences, where submission volumes were high and review rigor low.

Journal Publications

SCIgen-generated papers have appeared in various academic journals, though far less frequently than in conference proceedings, owing to journals' typically more rigorous peer-review processes. These publications often featured fabricated topics, such as improbable algorithms for data processing or network optimization, generated through SCIgen's context-free grammar to mimic legitimate research. For instance, multiple nonsensical papers were published in the International Journal of Innovative Technology and Exploring Engineering (IJITEE), a Blue Eyes Intelligence Engineering & Sciences Publication (BEIESP) journal, including works combining unrelated technical jargon that evaded initial scrutiny. Another notable example occurred in 2008, when the SCIgen-produced paper "Rooter: A Methodology for the Typical Unification of Access Points and Redundancy," translated into another language and supplemented with local references, was accepted and published in a nationally accredited journal just one day after submission. This case highlighted vulnerabilities in rapid-review journals; the paper discussed a fictional unification of "red-black trees" with "superpages" in operating systems. Similar instances appeared in Trans Tech Publications' journals such as Applied Mechanics and Materials, with at least 27 SCIgen outputs detected by 2021, often on pseudo-engineering topics like material simulations devoid of empirical basis. Unlike conference acceptances, which proliferated in ephemeral proceedings, journal publications like these were leveraged for credential enhancement in academic hiring or promotion, exploiting metrics-driven evaluation systems. The relative scarcity in journals—estimated at around 75 SCIgen papers per million publications in the computing literature overall—stemmed from slower submission cycles and higher detection risks, allowing some papers to persist longer without notice.
These cases underscored patterns in predatory or low-impact journals, where pay-to-publish models incentivized minimal review, contrasting with the higher volume in conference settings.

Impacts on Academic Metrics and Publishing

Google Scholar Spoofing

In 2010, researcher Cyril Labbé conducted an experiment to demonstrate vulnerabilities in Google Scholar's citation tracking by generating 102 papers with SCIgen under the pseudonym "Ike Antkare." Each fabricated paper included citations to all the others in the set, along with a reference to a single real, indexed article to facilitate discovery and indexing by Google Scholar. To introduce these papers into the database, Labbé created simple web pages hosting the PDF files and submitted their URLs directly to Google via the site's "addurl" tool, bypassing traditional publication channels. This self-referential citation network rapidly inflated the metrics of the fictional Ike Antkare, achieving an h-index of 94 within months and ranking the pseudonym 21st among the most cited scientists on Google Scholar at the time—surpassing Albert Einstein, who held an h-index of 84 and ranked 36th. Tools like Scholarometer and Publish or Perish quickly indexed the profile, illustrating how automated systems could propagate unverified content without human oversight. The experiment was publicly exposed in Labbé's research report (RR-LIG-008), prompting media attention and discussion of the reliability of such metrics. The Ike Antkare hoax underscored critical flaws in automated citation aggregation, where self-citations and low-quality sources could artificially boost rankings used in evaluations. It highlighted the risks of relying on such metrics for decisions on hiring, tenure, and institutional rankings, as unverified citation counts could mislead assessments of scholarly impact without rigorous validation. This demonstration contributed to broader awareness of metric-manipulation techniques, though Google Scholar's underlying algorithms remained susceptible to similar exploits in subsequent years.
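The inflation mechanism is simple arithmetic once the citation network exists. The sketch below computes an h-index over an idealized version of the Antkare network, in which all 102 papers are fully indexed and each receives 101 citations from its siblings; the experiment's observed value of 94 was lower, presumably reflecting whichever subset of papers and citations Google Scholar actually picked up.

```python
def h_index(citation_counts):
    """h-index: the largest h such that h papers each have >= h citations."""
    counts = sorted(citation_counts, reverse=True)
    h = 0
    for rank, cites in enumerate(counts, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h

# Idealized Ike Antkare setup: 102 fake papers, each cited by the 101 others.
papers = [101] * 102
print(h_index(papers))  # 101 in this fully-indexed idealization
```

The point of the experiment is visible here: a closed clique of n mutually citing papers yields an h-index near n with zero genuine scholarly impact.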

2013 Retractions

In 2013–2014, French computer scientist Cyril Labbé uncovered over 120 papers generated by SCIgen in academic databases, primarily in conference proceedings published by Springer and the Institute of Electrical and Electronics Engineers (IEEE). These papers, dating from 2008 to 2013, were detected using software methods that identified linguistic and structural anomalies characteristic of SCIgen output, such as repetitive phrasing and nonsensical technical terms. Labbé informed Springer in 2013, prompting the publisher to retract 16 papers from its conference proceedings (with two additional papers flagged shortly after, bringing the total to 18). IEEE was notified in 2013 and retracted over 100 SCIgen-generated papers from more than 30 conference proceedings in early 2014. This marked one of the largest coordinated withdrawals in academic publishing history. Both publishers emphasized that the incidents exposed flaws in the peer-review processes of certain conferences and committed to enhanced editorial oversight and detection tools. In response to these events, Springer released SciDetect in 2015, a software tool developed by Labbé's group to automatically detect SCIgen-generated papers. The retractions sparked widespread debate on systemic issues in scholarly publishing, including the proliferation of low-quality conferences and the pressures driving researchers to inflate publication counts. SCIgen was frequently cited in editorials and analyses as a cautionary example of how automated tools could exploit weak review mechanisms, fueling discussions of publishing practices without resulting in any legal action against the authors. The episode heightened global scrutiny of submission practices and prompted publishers to implement stricter verification protocols for conference proceedings.

Detection and Analysis

SciDetect Tool

SciDetect is an open-source software tool released in March 2015 by Springer in collaboration with Cyril Labbé, a computer scientist at Université Joseph Fourier (now part of Université Grenoble Alpes), to automatically detect fake scientific papers generated by SCIgen and similar programs such as Mathgen and Physgen. Developed by Labbé's PhD student Tien Nguyen, the tool is distributed under the GNU General Public License version 3.0 and freely available to the scientific and publishing communities via a public repository hosted by Université Grenoble Alpes. Its detection relies on intertextual-distance and n-gram analysis to identify SCIgen's distinctive linguistic patterns, including repetitive phrases, limited vocabulary richness, and grammatical artifacts such as uniformly short sentences with near-Gaussian length distributions. It processes input files in PDF or XML format by converting them to plain text, tokenizing the content, and computing intertextual distances through metrics like ROUGE-N, which measure overlap in n-grams (unigrams, bigrams, and trigrams) against a reference corpus of known generated texts; high overlap flags the paper as suspicious, with adjustable thresholds for customization. In evaluations, SciDetect detected all known SCIgen-generated papers in test corpora, including the more than 120 instances published in Springer and IEEE databases between 2008 and 2013. Springer integrated the tool into its production workflow for pre-publication checks, enabling automated scanning of submissions to prevent the inclusion of generated papers and providing an additional layer of assurance beyond peer review. Despite its strengths, SciDetect is tuned specifically to SCIgen's output characteristics and performs less reliably against evolved generators that incorporate more contextual variety, or against human-edited fakes that obscure the original artifacts; it cannot replace comprehensive peer review for evaluating scientific merit.
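The ROUGE-N-style comparison described above can be illustrated with a minimal sketch. The scoring function, example phrases, and 0.5 threshold below are invented for illustration and do not reproduce SciDetect's actual reference corpus, metrics, or thresholds.

```python
def ngrams(text, n):
    """Return the set of word n-grams in a text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_score(candidate, reference, n=2):
    """Fraction of the reference's n-grams also present in the candidate
    (a ROUGE-N-style recall). A high score against known generated text
    flags the candidate as suspicious."""
    ref = ngrams(reference, n)
    if not ref:
        return 0.0
    return len(ngrams(candidate, n) & ref) / len(ref)

# Hypothetical reference text from a corpus of known SCIgen output:
known_fake = "we disprove the synthesis of red black trees"
submission = "we disprove the synthesis of write ahead logging"

score = overlap_score(submission, known_fake)
print(score > 0.5)  # True: 4 of 7 reference bigrams recur
```

Because SCIgen draws from a fixed grammar and phrase bank, its outputs share far more n-grams with one another than independently written papers do, which is what makes this simple measure effective against it and ineffective against more varied generators.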

2021 Study Findings

In 2021, researchers Guillaume Cabanac and Cyril Labbé conducted a scientometric analysis to assess the prevalence of SCIgen-generated papers in the scientific literature, identifying 243 such nonsensical publications spanning 2005 to 2020 across 19 publishers. Their study used a "search and prune" methodology, querying the Dimensions database with 258 fingerprint phrases derived from SCIgen's grammar to retrieve 3,755 candidate results, which were then manually pruned to confirm 243 papers at 83.6% precision. The analysis estimated the prevalence of these SCIgen papers at 75 per million publications in Information and Computing Sciences, representing less than 0.01% of the total output in those fields. Notably, only 19% of the 243 papers (46 in total) had been addressed by publishers: 12 through formal retractions and 34 via silent removal, with all retractions occurring after 2013. The remaining 197 papers continued to be hosted, and sometimes sold, without any warnings or corrections, predominantly in low-impact journals and conference proceedings. Author affiliations showed strong regional concentration—the study linked 64.2% of the papers (156) to China and 22.2% (54) to India, with smaller shares (1.2% or less) from other countries, while 10.3% had undetermined or no affiliations. Motivations appeared tied to career incentives, including padding curricula vitae, inflating h-indexes, and manipulating citations through edited SCIgen bibliographies that incorporated genuine references. The study highlighted ongoing gaps in detection and response, advocating routine screening before publication and better education about the risks of algorithmically generated content amid "publish or perish" pressures. It underscored the persistence of such fraud despite tools like SciDetect, emphasizing the need for proactive measures against machine-generated nonsense in the published record.
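The "search" half of the search-and-prune approach reduces to scanning candidate texts for phrases that essentially only SCIgen's grammar produces, after which hits are pruned (confirmed or rejected) by manual inspection. The fingerprint phrases and documents below are invented stand-ins for the study's 258 actual fingerprints and its Dimensions queries.

```python
# Hypothetical fingerprint phrases, standing in for the study's real list.
FINGERPRINTS = [
    "a confusing unification of",
    "the deployment of the producer-consumer problem",
]

def search(corpus):
    """Return documents containing at least one fingerprint phrase;
    in the study, such hits were then manually pruned to confirm fakes."""
    return [doc for doc in corpus
            if any(fp in doc.lower() for fp in FINGERPRINTS)]

corpus = [
    "We present a confusing unification of DHTs and superpages.",
    "A genuine study of cache coherence protocols.",
]
print(search(corpus))  # only the first document matches
```

The manual pruning step is what lifts precision: a fingerprint match is strong evidence but not proof, since a legitimate paper could quote or cite SCIgen output.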
