Recoll
Recoll is a free and open-source full-text search tool for desktops that indexes and retrieves documents based on their content and filenames, enabling users to locate files such as PDFs, Microsoft Word documents, emails, and attachments across complex storage structures like archives and nested folders.[1] Developed initially for Unix and Linux systems around 2007, Recoll originated as a personal search application to enhance file retrieval in desktop environments, later expanding to support Microsoft Windows and macOS for broader compatibility.[2][3] It is built on the Xapian search engine library, providing a Qt-based graphical interface that is intuitive and feature-rich, with ongoing development tracked on Framagit since 2020.[1] The software is licensed under the GNU General Public License (GPL), ensuring it remains freely available and modifiable within the free software ecosystem.[3]
Key features of Recoll include support for numerous document formats through built-in and external filters, multilingual indexing (including complex scripts like Tibetan via n-grams), and advanced capabilities such as optical character recognition (OCR) for images, real-time indexing, and integration with desktop environments like GNOME Shell.[1] It handles large collections efficiently with multithreaded indexing and can access content in diverse locations, including email attachments and archive members, while offering options like dark mode and customizable result presentations.[1] As of November 2025, the latest version is 1.43.7, reflecting continuous updates to improve performance and usability across platforms including Linux, macOS, and even Android via a companion server.[1]
Development and History
Origins and Development
Recoll was initiated as a personal project by Jean-François Dockes around 2005, aimed at developing a full-text desktop search tool that leverages the Xapian search engine library for efficient document retrieval.[4] This endeavor sought to provide users with a lightweight, privacy-oriented alternative to emerging desktop search solutions, emphasizing local processing without reliance on external cloud services or network dependencies.[1] From its inception, Recoll targeted Unix-like systems, reflecting Dockes' focus on open-source environments where users value control over their data and indexing processes.
The project's core technical foundations centered on Xapian's capabilities for inverted indexing and probabilistic relevance ranking, which enable fast and accurate full-text searches across local files. Additionally, integration of the Qt framework facilitated the creation of a cross-platform graphical user interface, enhancing accessibility while maintaining compatibility with diverse operating systems.[1]
Over time, Recoll evolved from a primarily command-line interface tool into a comprehensive application featuring a full GUI, with initial public releases appearing in the mid-2000s. This progression allowed for broader adoption among desktop users seeking configurable, resource-efficient search functionality on Linux and similar platforms.[5][6]
Release History and Versions
Recoll's development has followed a steady progression of stable releases since its inception, with major versions introducing enhancements in performance, cross-platform compatibility, and user experience. The project maintains a focus on incremental improvements, typically releasing new major versions every 1-2 years, alongside minor updates for bug fixes and refinements.[6]
Key milestones include version 1.19, which introduced multithreaded indexing to accelerate the processing of large document sets.[6] Version 1.22 marked the initial Windows port, built using MinGW and Qt, enabling broader adoption beyond Unix-like systems.[6] In 1.24, support for Xapian 1.4 was added, along with partial real-time indexing capabilities that allow monitoring and updating indexes without full rescans.[6] More recent major releases, such as 1.43, incorporated power-based indexer control (such as suspending operations on battery power) and various GUI improvements, including better preview handling and style sheet separation.[7]
The stable release lineage culminated in version 1.43.7, issued on November 6, 2025, which includes safeguards for handling large or pathologic files through configurable size limits and enhanced error handling for XML processing and buffer issues.[1] This update builds on prior minors like 1.43.0, emphasizing stability for diverse environments.[7]
Development trends reflect an ongoing emphasis on performance optimizations, such as faster indexing via Xapian merges in version 1.38, and platform expansions, including Flatpak packaging introduced in 2024 for easier distribution on Linux.[8] In the same year, an Android application, RecollDroid, emerged as a companion front-end for mobile access to Recoll indexes, extending usability beyond desktops.[9]
Recoll remains open-source software licensed under the GNU General Public License (GPL), with its source code hosted on the official Framagit repository.[1][10]
Core Features
Indexing Process
Recoll's indexing process begins with scanning user-specified directories to identify files for inclusion in the search database. The system recursively traverses these paths, extracting textual content and metadata from supported file types using a combination of native libraries, such as libxml2 for XML-based formats, and external helper tools like pdftotext for PDFs or antiword for Microsoft Word documents. Extracted data is then stored in a Xapian database, enabling efficient full-text search capabilities, with the process designed to be incremental, processing only files modified since the last indexing run to minimize resource usage. A full index rebuild can be initiated manually if needed.[11][12]
The indexing mechanism supports substantial scale, with Xapian's backend allowing for databases containing over 11 million documents and sizes exceeding 550 GB, as reported in practical deployments. For ongoing maintenance, Recoll offers real-time indexing through filesystem event monitoring, utilizing inotify on Linux systems to detect changes like file creations, modifications, or deletions and trigger immediate updates. On macOS, similar real-time functionality is achieved via FSEvents integration. In cases where continuous monitoring is impractical due to resource constraints, batch indexing serves as a fallback, often scheduled via cron jobs for periodic execution. To optimize for portable devices, the configuration includes an option to suspend indexing when running on battery power, preventing excessive drain on laptops.[13][14][15][16]
Recoll extends its coverage to complex file structures by recursively indexing contents within nested archives and containers, such as ZIP, TAR, and RAR files, as well as email attachments embedded in these or directly in mailbox formats like Maildir and MH. This recursive handling ensures that text from inner documents is extracted and indexed without manual intervention, though limits can be set on archive member sizes to manage processing overhead.[12][16]
Customization of the indexing behavior is primarily managed through the recoll.conf configuration file, which allows specification of top-level directories (via the topdirs parameter), exclusion of particular MIME types (using excludedmimetypes), and skipping of files or paths based on wildcard patterns (skippedNames) or absolute locations (skippedPaths). Multiple independent indexes can be created by defining separate configuration directories, each with its own recoll.conf and database, facilitating segmented indexing for different data sets or users.[16][17]
Search Capabilities
Recoll supports a variety of query types to enable precise and flexible searches across indexed documents. Users can employ Boolean operators such as AND, OR, and NOT to combine terms logically within the query language, which is accessible through the graphical interface, command line, or web UI.[18] Phrase searches are performed by enclosing terms in double quotes for exact matches, while proximity searches use the NEAR operator to find words within a specified number of terms of each other, such as "term1 NEAR/5 term2" for terms up to five positions apart.[18] Wildcards including * (for multiple characters) and ? (for single characters) allow pattern matching, and stemming expands query terms to include morphological variants at query time, supporting multiple languages based on user selection or configuration, such as English or French.[12][18]
Filtering options refine search results by various criteria without altering the core query. Searches can be limited by file date using ISO8601 format in the query language (e.g., "date:2001-03-01/2002-05-01" for a range) or through the advanced search interface's date fields for minimum and maximum modification times.[19][20] Size-based filtering uses operators like > or < with units such as k (kilobytes), m (megabytes), or g (gigabytes), as in "size>1m" for files larger than one megabyte.[19][20] Directory filtering restricts results to specific paths or subtrees (e.g., "dir:/home/user/docs"), including support for wildcards and negation, while MIME type filtering targets formats like "mime:text/plain" or categories such as media or text, with wildcard expansion against the index.[19][20] Additionally, Recoll enables searching within text extracted via OCR from image-based PDFs and images using tools like Tesseract, provided the content was indexed with OCR enabled.[12]
Search results are processed to enhance usability and accuracy. Relevance ranking is performed using the Xapian search engine library, ordering documents by estimated query match quality, which can be viewed in the result list.[12] Snippet previews provide contextual excerpts around matching terms, accessible via a dedicated Snippets window for formats like PDFs, showing page-specific extracts with hyperlinks.[21] Duplicate collapsing hides exact duplicates based on the MD5 hash of the document container, with an indicator link to reveal hidden paths if enabled in preferences.[22] Results can be sorted by relevance (default) or date (ascending or descending) using toolbar controls or column headers, with the sort state persisting across sessions if configured.[22] Query history maintains the last 100 advanced searches for quick recall via arrow keys or menus, while simple searches retain recent entries for autocompletion.[23][24]
Advanced tools facilitate complex query management and index analysis. The term explorer allows browsing and searching the full list of indexed terms using wildcards, regular expressions, stem expansion, or phonetic matching with Aspell, helping users identify variants or statistics without prior knowledge of exact terms.[25] Query fragments enable reusable components, defined in a customizable XML file, where buttons insert predefined query language elements like directory filters or MIME exclusions directly into the current search.[26] Complex queries can be saved to files via the GUI menu for later restoration, preserving parameters like filters and external indexes, though compatibility with current preferences is noted upon loading.[27]
User Interfaces
Graphical User Interface
The Recoll graphical user interface (GUI) is built using the Qt framework, providing a desktop application for searching and managing indexed documents. It features a main window with a central search entry field by default, supporting basic queries through the simple search mode. Users can switch to advanced search via the Tools menu or a toolbar icon, which opens a dialog with dedicated tabs for constructing complex queries without needing to recall the underlying search language syntax.[28] The interface includes a Tools menu that grants access to search history, preferences configuration, and index maintenance options such as scheduling incremental passes or purging the index.[29]
The simple search tab presents a single input field at the top, where users enter terms, phrases (enclosed in double quotes), or query language expressions, with modes selectable via a dropdown for matching any terms, all terms, file names, or full query language. Results appear below in a list view, displaying document titles, relevance scores, file paths, and clickable links for previewing or opening files; double-clicking a word in the results inserts it into the search field for refinement. The advanced search tab divides into "Find" and "Filter" sections: the "Find" section allows multiple clauses for terms, phrases, and proximity searches (e.g., postfixing phrases with 'p' for near matches), while the "Filter" section enables restrictions by modification date, file size, MIME type, or directory location. A clock button next to the search field provides access to recent queries for quick recall, and up/down arrow keys navigate advanced search history.[29][28][30]
Visual elements emphasize usability, with the result list rendered as customizable HTML paragraphs showing snippets of matching context, metadata like size and date, and right-click menus for actions including preview, open with a specific application, copy file name, or find similar documents. Previews open in an internal window for quick viewing, integrated with external document viewers based on MIME types (configurable via preferences); snippets windows display paginated extracts for multi-page files like PDFs, highlighting search terms. Keyboard shortcuts enhance navigation, such as Ctrl+D for preview, Ctrl+O to open, Shift+Up/Down to browse previews, and up/down arrows for recalling previous queries in the search field. Real-time feedback includes autocompletion from search history or index terms while typing, which can be disabled in preferences. The GUI supports multiple result views, toggled via Ctrl+T between list and table formats for better organization of metadata.[31][32][29]
Qt's cross-platform nature ensures a consistent layout and behavior across Linux, Windows, and macOS, with the GUI adapting to system themes where supported and allowing customization of fonts, colors, and result formats through the Preferences → GUI configuration dialog. Index maintenance tools, accessible from the File menu, include options to trigger real-time updates or view missing helper applications for file processing.[33][34][35]
Command-Line and Web Interfaces
Recoll provides command-line tools for indexing, querying, and managing the search index, enabling scriptable and automated access without a graphical interface. The primary querying tool is recollq, which executes searches specified on the command line and outputs results to standard output in a structured format, making it suitable for integration into scripts or environments lacking Qt libraries.[36] For example, recollq "query terms" retrieves matching documents with metadata such as titles, URLs, and abstracts, supporting the full Recoll query language including phrases, exclusions, and field-specific searches like title:example.[37] Options such as -n limit the number of results, -F specifies output fields in base64 encoding for programmatic parsing, and -A includes snippet abstracts to highlight query matches.[36]
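As a rough illustration of how recollq might be driven from a script, the sketch below composes a query-language string from the filter clauses described under Search Capabilities, builds a recollq argument list using the -n and -F options mentioned above, and decodes one result line. The function names are hypothetical. The sketch also assumes that -F takes a space-separated field list as a single argument and prints each result as space-separated base64 values on one line; both assumptions should be checked against the recollq manual page for the installed version.

```python
import base64


def build_query(terms, directory=None, mime=None, min_size=None, date_range=None):
    """Compose a Recoll query-language string from terms and optional filters."""
    parts = [terms]
    if directory:
        parts.append(f"dir:{directory}")
    if mime:
        parts.append(f"mime:{mime}")
    if min_size:
        parts.append(f"size>{min_size}")    # e.g. "1m" for one megabyte
    if date_range:
        parts.append(f"date:{date_range}")  # e.g. "2001-03-01/2002-05-01"
    return " ".join(parts)


def build_recollq_argv(query, fields, limit=10):
    """Build the argument list: -n limits results, -F selects output fields."""
    return ["recollq", "-n", str(limit), "-F", " ".join(fields), query]


def decode_result_line(line, fields):
    """Decode one result line, assumed to hold one base64 value per field."""
    values = line.rstrip("\n").split(" ")
    return {
        name: base64.b64decode(value).decode("utf-8", errors="replace")
        for name, value in zip(fields, values)
    }
```

The argument list can be passed to subprocess.run on a machine where recollq is installed. Splitting on a single space rather than on arbitrary whitespace keeps empty field values aligned with their field names.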
Indexing and maintenance are handled by recollindex, a versatile command for creating, updating, or purging the index. It performs incremental updates by default but can erase the entire index with -z or reset it without deletion using -Z, and supports targeted operations like indexing specific files (-i), recursively updating directories (-r), or purging removed documents (-P).[15] For cache management, particularly the web history cache, recollindex --webcache-compact recovers space by removing deleted entries, while --webcache-burst <destdir> extracts all cache entries to files for backup or analysis.[15] These tools facilitate automation in scripts, such as periodic indexing via cron jobs, and piping outputs for further processing.[38]
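A periodic update of the kind mentioned above could be wrapped in a small script along the following lines. This is only a sketch: the function names are hypothetical, the -z, -Z, and -i option letters are those listed in the paragraph above, and the -c option for selecting a configuration directory is an assumption that should be confirmed in the recollindex manual page.

```python
import subprocess

# Option letters as described in the text above; verify them against
# the recollindex manual page for the installed version.
MODES = {
    "incremental": [],   # default: only process changed files
    "rebuild": ["-z"],   # erase the index, then index everything
    "reset": ["-Z"],     # reindex everything without erasing first
}


def recollindex_argv(mode="incremental", config_dir=None, files=None):
    """Build a recollindex argument list for the requested operation."""
    if files and mode != "incremental":
        raise ValueError("individual files are only indexed incrementally")
    argv = ["recollindex"]
    if config_dir:
        argv += ["-c", config_dir]  # assumed option for an alternate config dir
    argv += MODES[mode]
    if files:
        argv += ["-i", *files]      # index only the named files
    return argv


def run_index(mode="incremental", config_dir=None, files=None):
    """Run recollindex (must be installed) and return its exit status."""
    return subprocess.run(recollindex_argv(mode, config_dir, files)).returncode
```

Keeping the argument construction separate from the subprocess call makes it possible to check the command line without touching a real index.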
The Recoll WebUI offers a browser-based interface for searching indexes over a local network, extending access to non-desktop environments or multiple users without installing the full Recoll application. Built on the Bottle Python framework and typically served via the Waitress WSGI server, it supports the same advanced query features as other interfaces and can be deployed standalone or integrated with Apache for improved multi-user performance.[39] Installation involves cloning the repository from GitHub or Framagit, ensuring the Python Recoll bindings are present, and configuring the server to point to the index directory; it listens on a specified port for HTTP requests.[40] This setup is ideal for server-like use cases, such as shared document repositories, though it lacks built-in access controls and is not designed for high-traffic web applications.[39]
Daemon control for real-time indexing is managed through recollindex -m, which starts a background monitor detecting file changes in configured directories, or via the rclmon.sh script for easier session integration on Unix-like systems.[14] The rclmon.sh start command launches the daemon during user sessions, while rclmon.sh stop halts it, ensuring low-overhead updates without manual intervention.[14] For programmatic access, the Python recoll module provides an API to connect to indexes, perform queries via Db objects, and update documents, allowing embedding of search functionality in custom applications.[41] These interfaces support use cases like automated workflows, remote querying in headless environments, and integration with other tools via scripting or APIs.[41]
Supported Formats and Platforms
File Formats
Recoll supports a wide range of file formats for indexing, enabling full-text search across diverse document types on desktops. These formats are categorized into those indexed natively without external dependencies and those requiring helper programs for processing. The tool emphasizes efficient extraction of content and metadata, with internal handling of UTF-8 encoding to support multilingual documents, including recent enhancements for scripts like Tibetan using n-gram techniques introduced in version 1.41.[12][42]
Natively indexed formats include plain text files, HTML documents, email formats such as Maildir, mh, and mailbox (compatible with clients like Mozilla, Thunderbird, and Evolution), Gaim/Pidgin logs, Scribus documents (.sla), man pages (requiring groff for rendering), and various XML-based files like OpenOffice/LibreOffice ODT/ODS/ODP, Microsoft Office Open XML (DOCX/XLSX/PPTX), AbiWord, KWord, FictionBook 2 (FB2) ebooks, SVG images, Gnumeric spreadsheets, and Okular annotations. Additionally, Recoll natively handles ZIP and TAR archives (TAR disabled by default), Joplin notes, Dia diagrams, and web archives from Konqueror or macOS (using textutil from version 1.42.2 onward). These formats are processed directly using built-in libraries like libxml2 and libxslt for XML parsing, allowing seamless indexing of text content and basic metadata without additional software.[12]
For more complex or proprietary formats, Recoll relies on external helper tools to extract indexable content. PDF files are supported via pdftotext from Poppler, including attachments (using pdfdetach), XMP metadata (via pdfinfo), annotations (via poppler-glib), and optional OCR with Tesseract or ABBYY FineReader for scanned documents. Microsoft Word (.doc) files use antiword or fallbacks like LibreOffice or wvWare; RTF files require unrtf (version 0.21.8 or later); CHM files leverage chmlib (with bundled Python bindings); and EPUB ebooks use an integrated epub module. Audio file tags (e.g., MP3 ID3) are extracted with mutagen, while image metadata (e.g., JPEG EXIF) uses exiftool. Other supported formats include Outlook PST/OST files with libpff (bundled on Windows), Hancom Hanword (.hwp) via pyhwp, WordPerfect files using libwpd (wpd2html), Jupyter notebooks with the Jupyter package, DjVu via DjVuLibre, RAR archives with unrar or rarfile, 7z archives via py7zr (version 0.22 recommended) or pylzma, iCalendar (.ics) files with the icalendar module, PostScript via Ghostscript and pdftotext, TeX via untex or detex, DVI via catdvi, and MIDI karaoke (.kar/.mid) files using chardet for encoding detection. Compressed variants like gzip and bzip2 are handled transparently during indexing, and metadata extraction is prioritized where possible, such as XMP in PDFs, to enrich search results.[12][43]
This dual approach to format support ensures broad compatibility while maintaining performance, with native processing for common types and modular helpers for specialized ones, all integrated during the indexing process to create a unified searchable corpus.[12]
Operating Systems and Installation
Recoll is primarily designed for Unix-like operating systems, with Linux serving as the main development and testing platform. It supports various Linux distributions through native packages, including Debian, Ubuntu, and Fedora-based systems via apt and yum/dnf repositories, respectively. Unix-like systems such as FreeBSD and OpenBSD are also compatible, with Recoll available in their ports and packages collections.[8][44]
Support extends to other platforms, including Windows, where Recoll runs on Windows 10 and 11 using a MinGW-based build, though with limitations such as single-threaded indexing. On macOS, Recoll has been available since early versions. Android support was introduced in 2024 through the RecollDroid application, which operates in both client app mode for direct searches and server mode via a REST API for remote access to indexed databases. Historically, Recoll was ported to OS/2 and its successor ArcaOS around 2007, but active development and binaries for these platforms are no longer maintained.[45][46][47][9][48]
Installation methods vary by platform to accommodate different environments. On Linux, users can install via distribution packages (e.g., sudo apt install recoll on Debian/Ubuntu or sudo dnf install recoll on Fedora), portable AppImage bundles for quick deployment without system changes, or Flatpak from Flathub since 2024 for sandboxed execution across distros. Source compilation is supported on Unix-like systems, requiring tools like Meson and Ninja, with commands such as meson setup build followed by ninja -C build install. For Windows, a self-contained MSI installer bundles the core application and essential helpers like poppler for PDF processing and Python extensions for various formats, downloadable from the official site for a nominal fee to cover build costs. macOS installations can use Homebrew (brew install recoll) or MacPorts, or download a universal DMG bundle that includes the GUI application for both Intel and ARM architectures. On Android, the RecollDroid APK is available via F-Droid, with server mode setup involving a companion Python backend using FastAPI.[8][49][45][46][50][9]
The core dependencies for Recoll are minimal, centering on the Xapian search library for indexing and querying, Qt5 or later for the graphical interface, and Python 3 for scripting and extensions. Zlib and X11 are also required for basic compression and display functions on Unix-like systems. Optional dependencies enhance support for advanced file formats, such as libwpd for WordPerfect documents or poppler-utils for PDFs, which must be installed separately if not bundled. On Windows and macOS bundles, many helpers like unrtf and antiword are pre-included to reduce setup complexity.[47][45][46]
Post-installation configuration involves editing the recoll.conf file, typically located in ~/.recoll/, to specify indexed directories via the topdirs parameter and set paths for external helpers with recollhelperpath. Users should install any missing format-specific tools, such as sudo apt install poppler-utils on Linux for PDF support. Initial indexing can then be triggered through the GUI by selecting "Index" or via the command-line tool recollindex -z for a full rebuild, ensuring the index covers the configured paths.[16][15]
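To make the configuration step more concrete, the sketch below renders a small recoll.conf from Python using only parameter names that appear in this article (topdirs, skippedNames, skippedPaths, excludedmimetypes). The helper names are hypothetical. The sketch assumes the file consists of "name = value" lines whose list values are space-separated, with entries containing spaces wrapped in double quotes; the Recoll manual should be consulted for the authoritative syntax.

```python
def quote_entry(entry):
    """Wrap an entry in double quotes if it contains whitespace."""
    return f'"{entry}"' if any(c.isspace() for c in entry) else entry


def render_recoll_conf(topdirs, skipped_names=(), skipped_paths=(), excluded_mime=()):
    """Return recoll.conf text for the parameters named in this article."""
    settings = {
        "topdirs": topdirs,
        "skippedNames": skipped_names,
        "skippedPaths": skipped_paths,
        "excludedmimetypes": excluded_mime,
    }
    lines = []
    for name, values in settings.items():
        if values:  # omit parameters that were left empty
            lines.append(f"{name} = " + " ".join(quote_entry(v) for v in values))
    return "\n".join(lines) + "\n"


# Example: index two directories and skip temporary files.
print(render_recoll_conf(["~/Documents", "~/My Mail"], skipped_names=["*.tmp"]))
```

The resulting text could be written to ~/.recoll/recoll.conf, or to the recoll.conf of a separate configuration directory when maintaining more than one index.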
Integrations and Extensions
Desktop Environment Support
Recoll provides native integrations with major Linux desktop environments to enable seamless access to its indexed search results within the system's global search interfaces. These integrations allow users to query Recoll's full-text index directly from desktop search overlays, such as GNOME Shell's overview or KDE Plasma's KRunner, without launching the standalone application.
For GNOME, Recoll offers a GNOME Shell Search Provider (GSSP) plugin that embeds search results into the desktop's global search functionality, accessible via the Activities overview or the Alt+F2 shortcut. This plugin, available as a separate package, supports recent GNOME versions on distributions like Ubuntu and Fedora, where indexed files from Recoll appear alongside other search categories like applications and web history. Installation typically involves adding the Recoll PPA for Ubuntu-derived systems (sudo add-apt-repository ppa:recoll-backports/recoll-1.15-on followed by sudo apt update && sudo apt install recoll) or manual compilation from source for other setups, after which the plugin is enabled automatically upon desktop restart.[8][51]
In KDE Plasma, Recoll integrates through a KIO (KDE Input/Output) slave module, which allows searching indexed content directly from Dolphin file manager or other KDE applications via the recoll:/ protocol, and a KRunner plugin for quick queries in the Plasma search bar. These components are bundled with the main Recoll package in most distributions, such as openSUSE and Ubuntu via PPA, enabling results to surface in KRunner's overlay when typing search terms. Users can configure the integration during Recoll setup to index specific paths while excluding others, such as system directories or external drives, through the application's configuration files.[52][8]
For lighter desktop environments like XFCE and LXDE, Recoll lacks native plugins but remains fully functional through its standalone graphical user interface (GUI), which can be launched via menu entries or hotkeys. Similarly, on Cinnamon and MATE—common in Linux Mint—no dedicated search provider or runner plugins exist, but Recoll operates effectively via its command-line interface (CLI) for scripting or terminal-based queries, with the GUI available as a fallback. A global hotkey script, using Python and libwnck, can be set up across any environment to toggle the Recoll GUI visibility, enhancing accessibility without deep integration.[52][53]
Key features of these integrations include the display of Recoll's ranked, snippet-highlighted results in desktop overlays, supporting filters by file type or date, and options to exclude certain applications or paths from indexing to optimize performance and privacy. Setup generally requires installing environment-specific packages like recoll-gnome or enabling components through desktop settings panels, ensuring Recoll's index is updated periodically for current results.[52]
Browser and API Integrations
Recoll supports browser integrations primarily through a dedicated Firefox extension known as Recoll WE, which enables the indexing of visited web pages for local archiving and search within the Recoll database.[54] The extension automatically downloads and enqueues web pages, including their metadata, to a designated directory in the Firefox downloads folder, from where Recoll processes them into a separate web cache index.[54] This feature indexes web history and cached pages, allowing users to search offline for previously viewed content while excluding pages from private browsing mode to respect user privacy settings.[54] Configuration options include URL inclusion/exclusion rules based on domains, wildcards, or regular expressions, as well as limits on cache size to manage storage, with the default web cache location at ~/.recoll/webcache.[54]
For email-related browser integrations, Recoll provides support for indexing Thunderbird mailboxes through its built-in handlers for mbox and Maildir formats.[12] The mhmboxquirks parameter in the Recoll configuration file enables specific handling of Thunderbird's mbox format quirks, such as accepting naked '^From ' separators, ensuring accurate extraction of email content, subjects, and metadata for full-text search.[55] This treats web-linked emails or attachments as indexable sources, complementing browser history by incorporating email-based web activity into the searchable corpus.[12]
The Python API, accessible via the recoll module, facilitates developer integrations by allowing the creation of custom indexers and searchers for extending Recoll's functionality.[56] Key functions include connect() to establish a connection to the index (returning a Db object), Db.query() to initiate searches, Query.execute(query_string) for running queries, and methods like Query.fetchone() or Query.fetchmany() for parsing results into Doc objects containing fields such as url, text, and title.[56] This API requires Python 3 and supports writable connections for updating indexes programmatically, enabling scripts to integrate Recoll with other applications.[57]
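The call sequence described above can be sketched as follows. The search helper is a hypothetical name that accepts any object exposing the Db interface, so it can be exercised without Recoll installed. The main function shows how it might be wired to a real index, assuming the module is imported as "from recoll import recoll" and that connect() with no arguments opens the default configuration; both points should be confirmed against the Python API documentation for the installed version.

```python
def search(db, query_string, limit=10):
    """Run a query on an open Db object and return a list of result dicts."""
    query = db.query()
    query.execute(query_string)
    results = []
    for doc in query.fetchmany(limit):
        # Doc objects expose fields such as url and title as attributes.
        results.append({"url": doc.url, "title": doc.title})
    return results


def main():
    # Imported here so that search() stays usable without Recoll installed.
    from recoll import recoll

    db = recoll.connect()  # assumed to open the default index
    for hit in search(db, "dir:/home/user/docs mime:text/plain report"):
        print(hit["url"], hit["title"])
```

Calling main() on a machine with an existing index prints one line per hit. Returning plain dictionaries rather than Doc objects keeps the results easy to serialize for use in other tools.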
Common use cases for these integrations include local archiving of web activity to preserve offline access to history and bookmarks without relying on cloud services, as well as building custom search applications that query Recoll indexes via Python scripts for result parsing and display in external tools.[54][56] For instance, developers can use the API to execute queries on web-indexed data and extract structured results for integration into desktop or web-based apps.[56] Limitations include the separate maintenance of the Recoll WE extension outside the core Recoll project, potentially leading to compatibility issues with newer Firefox versions, and the API's restriction to Python 3 environments without backward support for Python 2.[54][57] Additionally, the web cache is not designed as a permanent archive, with older content subject to deletion based on size limits.[54]