Scrapy

Scrapy is a free and open-source web crawling and web scraping framework written in Python, designed to efficiently extract structured data from websites for applications such as data mining, information processing, and archival purposes. It operates as a high-level application framework that supports asynchronous processing via the Twisted networking engine, enabling fast and scalable crawls while handling features like cookie handling, user-agent spoofing, session management, and robots.txt compliance. Built-in tools include CSS and XPath selectors for data extraction, an interactive shell for testing, and export formats such as JSON, CSV, and XML, making it extensible through signals, middlewares, and pipelines for custom data processing.

Originally conceived in 2007 at Mydeco for aggregating furniture data, Scrapy was open-sourced in 2008 under the BSD license by developers including Shane Evans and Pablo Hoffman, with Insophia playing a key role in its early stewardship. The project gained momentum after Zyte (formerly Scrapinghub), founded in 2010, became its primary maintainer in 2011, fostering a collaborative community that has contributed nearly 11,000 commits on GitHub. Key milestones include the 1.0 stable release in June 2015, which marked full maturity; Scrapy 1.1 introducing experimental Python 3 support; and Scrapy 2.0 in March 2020 adding asyncio integration for enhanced concurrency. As of November 2025, the latest stable version is 2.13.3, requiring Python 3.9 or higher and boasting over 97 million downloads, underscoring its widespread adoption in production environments.

At its core, Scrapy's architecture revolves around a central crawler engine that dispatches requests to spiders (user-defined classes that define crawling behavior and data extraction rules) while managing downloads, item pipelines for data cleaning, and extensions for advanced functionality like auto-throttling and politeness delays to respect server loads. This modular design allows developers to focus on writing extraction logic rather than boilerplate for networking or concurrency, positioning Scrapy as a powerful tool for large-scale web scraping beyond simple parsing libraries.

Overview

Purpose and Scope

Scrapy is an open-source framework designed for large-scale crawling and extracting structured data from websites. It serves as an application framework specifically for crawling sites and processing the extracted data for applications such as data mining, information processing, and historical archival. The core applications of Scrapy include handling HTTP and HTTPS requests and responses asynchronously to enable efficient crawling, following links to navigate through website structures, and extracting targeted data using selectors such as CSS or XPath. Once extracted, the data can be stored in various formats, including JSON, CSV, or XML, through built-in feed export mechanisms, facilitating integration with databases or further analysis tools. Scrapy's scope is primarily limited to web crawling and scraping tasks, distinguishing it from broader automation tools that handle dynamic interactions like JavaScript rendering. It supports standard protocols such as HTTP and HTTPS, but does not extend to real-time user simulations or browser-based automation. Compared to manual scraping methods, Scrapy provides automation advantages in scalability through concurrent request processing, increased speed via asynchronous operations, and built-in handling of common anti-scraping countermeasures, including cookie management, user-agent rotation, robots.txt compliance, and automatic throttling to avoid detection or bans.

Design Philosophy

Scrapy's design philosophy centers on creating a flexible, efficient framework for web scraping that prioritizes extensibility and reliability in handling dynamic web environments. At its core, the framework adopts a modular architecture built around key extensible components such as spiders for defining crawling logic, items for structuring extracted data, and pipelines for post-processing, enabling users to customize behavior without modifying the underlying engine. This ensures that developers can tailor Scrapy to specific scraping needs while maintaining a clean separation of concerns, where extraction logic in spiders remains distinct from data storage or validation in pipelines. To achieve high performance, Scrapy employs asynchronous processing powered by the Twisted networking framework, which facilitates non-blocking, event-driven handling of concurrent requests and responses. This approach allows for efficient resource utilization, enabling the framework to manage thousands of simultaneous downloads without blocking on I/O operations, thus optimizing speed and scalability for large-scale data extraction tasks. Complementing this is a strong emphasis on robustness, with built-in mechanisms for retries on failed requests, automatic handling of redirects, and comprehensive error management to cope with the inherent unreliability of web sources, such as temporary server errors or timeouts. These features are configurable through dedicated middlewares, ensuring resilient operation in production environments. Further embodying its principles, Scrapy is released under the permissive BSD-3-Clause license, fostering an open-source ethos that invites widespread community involvement in developing spiders, extensions, and improvements. Configurability is a cornerstone, achieved via a centralized settings system that allows fine-grained control over behaviors like retry policies, download delays, and component activation through project-specific settings files or command-line overrides. This design not only promotes reusability and maintainability but also aligns with the framework's goal of empowering users to build robust, scalable scraping solutions efficiently.
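The settings-driven configurability described above can be illustrated with a small, hypothetical settings.py fragment; the option names are real Scrapy settings, but the values shown are arbitrary examples rather than recommendations.
python
# Illustrative settings.py overrides (values are examples, not recommendations)
RETRY_ENABLED = True          # retry requests that fail with transient errors
RETRY_TIMES = 2               # retries in addition to the first attempt
DOWNLOAD_DELAY = 0.5          # seconds to wait between requests to the same site
AUTOTHROTTLE_ENABLED = True   # adapt delays dynamically to server response times
ROBOTSTXT_OBEY = True         # respect robots.txt directives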

Architecture

Core Components

Scrapy is built around several core components that form the foundation of its scraping and crawling capabilities. These components include spiders, items, item loaders, request and response objects, and settings, each serving a distinct role in defining, extracting, and configuring the scraping process.

Spiders are custom classes that encapsulate the crawling logic for a specific site or set of sites. They define the starting points for scraping through the start_urls attribute, which specifies the initial URLs to request, or via the start_requests() method that yields Request objects. Spiders implement parsing rules primarily through the parse() callback method, where responses are processed to extract data using selectors or other tools, and to generate follow-up requests for further crawling. Key attributes include the required name for identifying the spider, and the optional allowed_domains to restrict crawling scope. Specialized subclasses like CrawlSpider use rules with LinkExtractor instances to automate link following based on URL patterns.

Items serve as containers for holding the scraped data in a structured format, facilitating its export and further processing. They can be simple dictionaries for basic use, or more robust custom classes derived from scrapy.Item, dataclasses, or attrs classes via the itemadapter library for schema definition and field validation. When using scrapy.Item, fields are declared with Field() objects, which populate the fields attribute and enable features like preventing invalid field assignments through KeyError exceptions. Items support a dict-like interface, including copy and deepcopy methods, and include reference tracking to aid in memory leak detection during development.

Item Loaders provide an optional mechanism for populating items with extracted data, emphasizing cleaning and validation to ensure consistency. They apply input processors to raw extracted values, such as stripping whitespace or converting types, and output processors to format the final item fields before storage. Built on the itemloaders library, they integrate XPath and CSS selectors for targeted extraction from responses, and support nested loaders for handling complex, hierarchical structures. This component is particularly useful for standardizing data handling across spiders without embedding cleaning logic directly in parsing methods.

Request and Response objects manage the HTTP interactions central to Scrapy's operation, enabling the framework to fetch and handle web content. A Request object, typically generated in spiders, specifies the target URL (read-only, modifiable via replace()), HTTP method (defaulting to GET), headers as a dictionary-like structure, body as bytes, and a meta dictionary for passing contextual data like cookies or custom flags between requests. The corresponding Response object, returned after downloading, includes the HTTP status code (defaulting to 200), the final URL (potentially redirected), and the body as bytes, with access to the original request's meta via response.meta. These objects ensure seamless propagation of state and metadata throughout the scraping process.

Settings act as the global configuration mechanism, allowing customization of Scrapy's behavior through key-value pairs that influence core components, middlewares, pipelines, and extensions. Common configurations include USER_AGENT for browser emulation, DOWNLOAD_DELAY to introduce pauses between requests for politeness, and CONCURRENT_REQUESTS to limit parallel downloads and avoid overwhelming servers. Settings are structured hierarchically, with defaults in scrapy.settings.default_settings, project-level overrides in settings.py (e.g., enabling ROBOTSTXT_OBEY = True), spider-specific adjustments via custom_settings, and command-line options holding highest precedence. The SCRAPY_SETTINGS_MODULE environment variable designates the active settings module, ensuring flexible adaptation to different environments.
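As a minimal sketch tying these components together, the following example declares a scrapy.Item with two fields and a spider that overrides settings via custom_settings; the class names, field names, and URL are hypothetical.
python
import scrapy

class ProductItem(scrapy.Item):
    # Declared fields populate ProductItem.fields; assigning an
    # undeclared field raises KeyError.
    name = scrapy.Field()
    price = scrapy.Field()

class ProductSpider(scrapy.Spider):
    name = "products"                       # required spider identifier
    allowed_domains = ["example.com"]       # restricts crawling scope
    start_urls = ["https://example.com/catalog"]
    # Spider-specific overrides take precedence over the project's settings.py
    custom_settings = {
        "DOWNLOAD_DELAY": 1.0,
        "CONCURRENT_REQUESTS": 8,
    }

    def parse(self, response):
        item = ProductItem()
        item["name"] = response.css("h1::text").get()
        item["price"] = response.css(".price::text").get()
        yield item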

Execution Flow

The Scrapy execution flow is orchestrated by the Engine, which serves as the central coordinator managing interactions between components such as the Scheduler, Downloader, and Spiders through dedicated queues for requests and items. Upon initiation, the Engine retrieves the initial Requests from the Spider's start_urls or start_requests and enqueues them in the Scheduler for prioritization and queuing. This orchestration ensures a continuous, asynchronous data flow, leveraging Twisted's non-blocking I/O to handle high concurrency without blocking on network operations.

In the request cycle, the Engine asks the Scheduler for the next Request, which is then dispatched to the Downloader after passing through Downloader Middlewares for potential modifications like adding headers or handling retries. The Downloader fetches the HTTP response from the target website, processes it through the same middlewares in reverse order, and returns it to the Engine, which subsequently forwards it to the Spider via Spider Middlewares for parsing. This cycle repeats as long as Requests remain available, enabling efficient crawling of linked pages.

During parsing, the Spider processes the Response by extracting data and yielding Items or new Requests; yielded Items are routed through Spider Middlewares to the Engine and then to Item Pipelines for validation, cleaning, and storage, while new Requests are enqueued back into the Scheduler to continue the crawl. Middlewares at both the downloader and spider levels allow interception and modification of this yielding process, such as user-agent rotation or data filtering, without altering the core flow. This step-by-step yielding mechanism supports recursive crawling and structured data extraction in a modular fashion.

Duplication filtering occurs primarily within the Scheduler using the default RFPDupeFilter, which generates a unique fingerprint for each Request based on attributes like URL, method, and body to prevent revisiting identical resources and avoid infinite loops or redundant processing. This filter maintains a set of seen fingerprints, discarding duplicates before enqueuing, which optimizes resource usage during large-scale crawls.

The execution concludes with a shutdown triggered when the Scheduler's queue is exhausted, indicating no further Requests are pending, or upon receiving system signals like SIGTERM for graceful termination. In graceful shutdowns, the Engine completes in-flight downloads and processes pending items before stopping the Twisted reactor, ensuring work in progress is not lost; errors or manual interruptions invoke signals like engine_stopped to log the reason and halt operations cleanly.
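As a small sketch of how a spider can interact with the duplicate filtering described above, the dont_filter flag on a Request bypasses the scheduler's RFPDupeFilter; the spider name and URL below are hypothetical.
python
import scrapy

class RefreshSpider(scrapy.Spider):
    name = "refresh"  # hypothetical spider for illustration

    def start_requests(self):
        # dont_filter=True exempts this request from the RFPDupeFilter,
        # allowing the same URL to be scheduled more than once.
        yield scrapy.Request(
            "https://example.com/status",
            callback=self.parse,
            dont_filter=True,
        )

    def parse(self, response):
        yield {"status": response.status}
A custom filter can also be substituted globally via the DUPEFILTER_CLASS setting.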

Key Features

Selectors and Extraction

In Scrapy, selectors provide the primary mechanism for extracting data from HTML or XML responses received during web crawling. The framework supports two main selector types: XPath and CSS selectors. XPath selectors use a query language to navigate and select nodes in XML or HTML documents, allowing precise targeting of elements based on their structure, attributes, and content; for instance, response.xpath('//div[@class="title"]/text()').get() retrieves the text content of the first <div> element with a class attribute equal to "title". CSS selectors, on the other hand, employ stylesheet syntax to match elements by tag names, classes, IDs, and other attributes, such as response.css('span::text').get() to extract text from <span> elements. These selectors are built as a thin wrapper around the parsel library, ensuring seamless integration with Scrapy's Response objects for efficient data selection.

Selector objects in Scrapy, returned by methods like response.xpath() or response.css(), offer several extraction methods to handle matched data. The get() method returns the first matched string or None if no match exists, making it suitable for single-value extractions. For multiple values, getall() yields a list of all matching strings, which is essential for scraping lists or repeated elements. Additionally, the re() method applies regular expressions to extract substrings from matches, such as response.xpath('//a/text()').re(r'Name:\s*(.*)') to capture text following "Name:" in anchor elements, returning a list of results; re_first() provides the first such match. These methods support flexible post-processing directly on selectors, enhancing their utility in parse methods within spiders.

XPath expressions distinguish between absolute and relative paths to accommodate dynamic content and nested structures. Absolute paths begin with / to select from the document root, like //p to match all <p> elements anywhere in the HTML, which is robust for global searches but less efficient in large documents. Relative paths, prefixed with ., operate within a specific context, such as .//p to find <p> elements only under the current selector's node, aiding in drilling down through hierarchical or dynamically generated content without rescanning the entire response. This distinction is particularly valuable for handling complex or dynamically generated DOM trees, where context-aware selection prevents over-matching.

Scrapy extends selector functionality beyond HTML to non-HTML formats like XML and JSON through the Selector class's type parameter. For XML responses, instantiating with Selector(text=body, type="xml") enables queries on structured markup while preserving element hierarchies. For JSON, the type="json" parameter enables JMESPath queries for structured extraction, such as response.jmespath('key'); manual parsing with the json library remains an option for custom needs. These capabilities ensure versatility across the response types encountered in diverse scraping scenarios.

For link extraction, Scrapy facilitates resolving relative URLs using the response.urljoin() method on Response objects, which combines a relative path with the response's base URL to form an absolute URL. This is invoked as response.urljoin(relative_url), leveraging urllib.parse.urljoin under the hood and accounting for any <base> tag in HTML responses via TextResponse. It is particularly useful when selectors yield relative links from href attributes, enabling seamless follow-up requests in spiders without manual URL normalization.
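The extraction methods above can be combined in a spider's parse() callback; the following minimal sketch assumes hypothetical markup (class names such as title, tag, and next are illustrative).
python
import scrapy

class BooksSpider(scrapy.Spider):
    name = "books"  # hypothetical spider
    start_urls = ["https://example.com/books"]

    def parse(self, response):
        # get(): first match or None; getall(): list of all matching strings
        title = response.xpath('//div[@class="title"]/text()').get()
        tags = response.css("a.tag::text").getall()
        # re(): regular-expression post-processing on matched text
        names = response.xpath("//a/text()").re(r"Name:\s*(.*)")
        yield {"title": title, "tags": tags, "names": names}

        # urljoin() resolves a relative href against the response URL
        next_href = response.css("li.next a::attr(href)").get()
        if next_href is not None:
            yield scrapy.Request(response.urljoin(next_href), callback=self.parse)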

Item Processing

In Scrapy, item processing occurs after data extraction, where scraped items, serving as structured data containers, are passed through a series of components known as pipelines to clean, validate, and prepare them for storage or further use. These pipelines act as sequential processors, allowing developers to perform operations such as filtering out duplicate items based on unique fields or converting data types to ensure consistency across the dataset.

Each pipeline is implemented as a class with three key methods that define its lifecycle stages: open_spider(self, spider) for initialization and setup when the spider starts, process_item(self, item, spider) for handling individual items during the crawl, and close_spider(self, spider) for cleanup and teardown once the spider finishes. The process_item method processes each item in sequence across enabled pipelines, returning the modified item, a Deferred for asynchronous operations, or raising an exception to halt further processing. Scrapy includes several built-in pipelines to handle common tasks, such as the FilesPipeline, which downloads and stores media files referenced in item fields like file_urls while avoiding redundant downloads of recently fetched files. Similarly, pipelines can integrate utilities like MailSender to send notifications upon completion or when encountering specific conditions, leveraging Twisted's non-blocking I/O for efficient dispatch.

To enable pipelines, they are configured in the project's settings.py file using the ITEM_PIPELINES dictionary, where each entry maps a pipeline class to an integer priority between 0 and 1000, determining the execution order from lowest to highest value. For instance, a project might prioritize validation before storage by assigning lower numbers to the earlier stages. Error handling in pipelines focuses on robustness: invalid items, such as those with missing required fields or duplicates, can be dropped by raising the DropItem exception, preventing them from proceeding to subsequent pipelines or storage. Alternatively, issues can be logged using the spider's logger for later review, ensuring the crawl continues without interruption while maintaining data quality.
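A minimal sketch of such a pipeline, assuming items carry a unique 'id' field, shows the lifecycle methods and DropItem behavior described above.
python
from itemadapter import ItemAdapter
from scrapy.exceptions import DropItem

class DuplicatesPipeline:
    """Drops items whose 'id' field has already been seen during this crawl."""

    def open_spider(self, spider):
        # Per-crawl state initialized when the spider starts
        self.seen_ids = set()

    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        item_id = adapter.get("id")
        if item_id in self.seen_ids:
            raise DropItem(f"Duplicate item found: {item_id}")
        self.seen_ids.add(item_id)
        return item

    def close_spider(self, spider):
        # Clear state on shutdown; shown for completeness
        self.seen_ids.clear()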

Installation and Setup

System Requirements

Scrapy requires Python 3.9 or later, compatible with both the standard CPython implementation and PyPy. Support for Python 2.7 and earlier versions was officially dropped with the release of Scrapy 2.0 in March 2020. The framework depends on several core libraries for its functionality: Twisted provides the asynchronous networking and I/O capabilities essential for concurrent request handling; lxml serves as the XML and HTML parser; parsel enables CSS and XPath selectors for data extraction; w3lib provides utilities for URL handling and web page encoding; cryptography handles secure connections; and pyOpenSSL supports additional security features for network interactions. Optionally, the service_identity library enhances TLS certificate verification when using Twisted, preventing warnings during secure connections.

Scrapy is designed to run cross-platform on Windows, macOS, and Linux distributions. On Windows, users may encounter installation challenges, such as the need for Microsoft Visual C++ Build Tools to compile dependencies like Twisted or lxml when installing via pip, and it is recommended to use Anaconda or Miniconda for a smoother setup. Windows-specific path handling issues can arise in file operations or command-line usage, often resolved by using forward slashes or raw strings in code to avoid backslash escaping problems.

For resource requirements, Scrapy operates efficiently on standard hardware for small to medium projects. In large-scale crawls, performance scales with concurrency settings; optimal configurations target 80-90% CPU utilization, with memory usage increasing proportionally to manage high concurrency without bottlenecks.

Project Initialization

To initialize a Scrapy project, first install the framework using pip in a dedicated virtual environment to isolate dependencies and prevent conflicts with other packages. The recommended command is pip install scrapy, which handles the installation of core dependencies such as Twisted for asynchronous operations. Virtual environments can be created using tools like Python's built-in venv module (e.g., python -m venv scrapy_env followed by activation) or Conda (e.g., conda create -n scrapy_env python=3.9 and conda activate scrapy_env). After installation, create a new Scrapy project by running the command scrapy startproject <project_name> from the desired parent directory, where <project_name> is replaced with the chosen project identifier (e.g., scrapy startproject myproject). This command generates a standard directory structure for the project, including a top-level scrapy.cfg file for configuration and a module directory named after the project containing essential files and subdirectories. The resulting structure appears as follows:
myproject/
    scrapy.cfg
    myproject/
        __init__.py
        items.py          # Defines structured data items
        middlewares.py    # Custom middleware configurations
        pipelines.py      # Item processing pipelines
        settings.py       # Project settings
        spiders/          # Directory for spider files
            __init__.py
The settings.py file initializes with default configurations tailored to the project, such as BOT_NAME set to the project name (falling back to 'scrapybot'), USER_AGENT set to "Scrapy/VERSION (+https://scrapy.org)" for identifying requests, and ROBOTSTXT_OBEY set to True to respect robots.txt directives by default. To verify the installation and project setup, launch the Scrapy shell using the command scrapy shell <url>, where <url> is a target address (e.g., scrapy shell https://scrapy.org). This opens an interactive console within the Scrapy environment, allowing immediate testing of response handling and basic extraction without running a full spider.
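A quick verification session in the shell might look like the following sketch; the outputs shown are illustrative and will vary by page.
python
# Started with: scrapy shell https://scrapy.org
>>> response.status                      # HTTP status of the fetched page
200
>>> response.css("title::text").get()    # extract the page title (output varies)
'Scrapy | ...'
>>> fetch("https://example.com")         # load a different URL into the shell
>>> view(response)                       # open the current response in a browser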

Basic Usage

Creating Spiders

In Scrapy, spiders are the core classes responsible for defining the crawling behavior, starting from initial URLs and processing responses to extract data or follow further links. To create a basic spider, one subclasses the scrapy.Spider class, which provides the foundational structure for handling requests and responses. The spider must define a unique name attribute, serving as an identifier for the crawler, and a start_urls attribute, which is a list of initial URLs from which the spider begins scraping. The primary method to implement in a spider is parse(self, response), which receives the response object from each request and processes it to yield extracted data or additional requests. For extraction, the method typically uses Scrapy's selectors, such as CSS or XPath, to parse the HTML content and yield Python dictionaries representing items, for example:
python
def parse(self, response):
    for quote in response.css("div.quote"):
        yield {
            "text": quote.css("span.text::text").get(),
            "author": quote.css("small.author::text").get(),
        }
This approach allows the spider to return structured data incrementally without blocking the crawling process. To enable multi-page crawling, the parse method can yield new scrapy.Request objects for discovered links, specifying a callback like self.parse to handle subsequent responses, as in:
python
next_page = response.css("li.next a::attr(href)").get()
if next_page is not None:
    yield scrapy.Request(response.urljoin(next_page), callback=self.parse)
This mechanism follows links recursively while respecting the spider's configuration. To restrict the spider's scope and prevent unintended crawling outside a target site, the allowed_domains attribute can be set as a list of domain names, ensuring requests are only followed within those domains. For scenarios requiring dynamic generation of starting URLs—such as from a database or API—the start_requests() method can be overridden to yield custom scrapy.Request objects instead of relying on the static start_urls. An example implementation might look like:
python
def start_requests(self):
    urls = ["https://example.com/page/1/", "https://example.com/page/2/"]
    for url in urls:
        yield scrapy.Request(url=url, callback=self.parse)
This flexibility supports more complex initialization logic while maintaining the spider's core parsing workflow.
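Putting these pieces together, a complete minimal spider might look like the following sketch, which targets the quotes.toscrape.com practice site used in the Scrapy tutorial; the selectors assume that site's markup.
python
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"                                # unique spider identifier
    allowed_domains = ["quotes.toscrape.com"]      # restrict crawling scope
    start_urls = ["https://quotes.toscrape.com/"]  # initial requests

    def parse(self, response):
        # Extract one item per quote block on the page
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # Follow pagination links until none remain
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)
Running scrapy crawl quotes -O quotes.json from the project directory would execute this spider and write the results to a JSON file.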

Running Scrapes

To execute a Scrapy project, the primary command is scrapy crawl <spider_name>, which starts the specified spider and initiates the crawling process. This command must be run from inside the project, that is, from a directory containing the scrapy.cfg file (or one of its subdirectories). Scrapy provides flexible output options to capture scraped items during execution. The -o FILE:FORMAT flag appends extracted items to a specified file in formats such as JSON, CSV, or XML, while -O FILE:FORMAT overwrites the file instead. For instance, scrapy crawl myspider -o items.json appends items to items.json in JSON format. Additionally, the --loglevel LEVEL option controls output verbosity, with levels like INFO for standard messages or DEBUG for detailed traces, allowing users to monitor progress without overwhelming logs.

For interactive debugging, the scrapy shell [url] command launches an interactive console with the response from the given URL preloaded, enabling direct inspection of page content, selectors, and XPath/CSS queries. This mode supports spider-specific contexts via --spider=SPIDER and one-off code execution with -c 'code', such as checking response.status. Performance evaluation is facilitated by scrapy bench, which runs a standardized benchmark crawl to measure Scrapy's crawling speed and efficiency under default settings.

Monitoring runs involves Scrapy's built-in logging and statistics collection. Logs display real-time events like requests sent and items processed, with levels adjustable via --loglevel. The stats collector, accessible through the Crawler as crawler.stats, tracks key metrics such as the number of requests sent ("downloader/request_count"), responses downloaded ("downloader/response_count"), and items extracted ("item_scraped_count"). During execution, stats can be updated programmatically with methods like stats.inc_value("custom_metric"); after the run, they are available via the default MemoryStatsCollector for analysis, including timestamps and counters that quantify the scale of the scrape.
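As a sketch of programmatic stats access, a spider can increment its own counters and read the collected stats through crawler.stats; the custom key and URL below are hypothetical.
python
import scrapy

class CountingSpider(scrapy.Spider):
    name = "counting"                      # hypothetical spider
    start_urls = ["https://example.com/"]

    def parse(self, response):
        # Increment a custom counter alongside Scrapy's built-in stats
        self.crawler.stats.inc_value("custom/pages_parsed")
        yield {"url": response.url}

    def closed(self, reason):
        # Built-in keys such as "downloader/request_count" and
        # "item_scraped_count" are available once the crawl ends.
        self.logger.info("Final stats: %s", self.crawler.stats.get_stats())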

Advanced Usage

Custom Pipelines

Custom pipelines in Scrapy enable developers to extend item processing with tailored logic, such as validation, deduplication, or persistence, allowing for sophisticated handling of scraped data after extraction. These components are integrated into the item's flow post-extraction, providing a modular way to clean, enrich, or persist items without altering spider code. By defining pipeline classes, users can address specific requirements like data quality checks or external integrations, ensuring robust and scalable scraping workflows.

To implement a custom pipeline, create a Python class that defines the process_item(self, item, spider) method, which processes each incoming item and returns the modified item, a Deferred for asynchronous handling, or raises scrapy.exceptions.DropItem to halt further processing. Supporting methods include open_spider(self, spider) for setup (e.g., resource allocation) when the spider initializes and close_spider(self, spider) for teardown upon completion. For access to project settings during instantiation, implement the classmethod from_crawler(cls, crawler), which receives the crawler context and can retrieve configuration values. This interface allows pipelines to operate statelessly per item while maintaining spider-wide state if needed.

A representative example is a validation pipeline that inspects items for required fields and drops incomplete ones, preventing invalid data from advancing. The following code defines such a pipeline using the itemadapter library for consistent item handling across types like dicts or custom Items:
python
from itemadapter import ItemAdapter
from scrapy.exceptions import DropItem

class ValidationPipeline:
    required_fields = ('name', 'price')  # Define required fields

    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        for field in self.required_fields:
            if not adapter.get(field):
                raise DropItem(f"Missing required field: {field}")
        return item
This pipeline raises DropItem for items lacking 'name' or 'price', effectively filtering out deficient records while logging the reason via Scrapy's default exception handling.

For database integration, custom pipelines facilitate direct storage of items into SQL or NoSQL systems, decoupling persistence from spiders for better maintainability. Initialization occurs in open_spider to establish connections, processing in process_item to insert or update records, and cleanup in close_spider to release resources, minimizing overhead during high-volume crawls. The from_crawler method pulls database credentials from settings for secure configuration. An example for integration with MongoDB uses PyMongo to store items as documents:
python
import pymongo
from itemadapter import ItemAdapter

class MongoDBPipeline:

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get("MONGO_URI"),
            mongo_db=crawler.settings.get("MONGO_DATABASE", "scrapy_items")
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        self.db["items"].insert_one(ItemAdapter(item).asdict())
        return item
For SQL databases such as PostgreSQL or MySQL, the structure mirrors this: use a library such as SQLAlchemy to create an engine in open_spider, execute inserts via a session in process_item (e.g., session.add(ItemModel(**ItemAdapter(item).asdict()))), and commit and close in close_spider. This approach supports transactional integrity and handles connection pooling for efficiency; a sketch of it appears at the end of this section.

Enabling custom pipelines requires registering them in the ITEM_PIPELINES setting as a dictionary mapping each pipeline's import path to an integer priority (conventionally 0 to 1000, with lower values executing earlier; setting a pipeline's value to None disables it). Multiple pipelines process items sequentially based on priority, allowing ordered operations like validation before storage. Example configuration in settings.py:
python
ITEM_PIPELINES = {
    'myproject.pipelines.ValidationPipeline': 200,
    'myproject.pipelines.MongoDBPipeline': 400,
}
This setup runs the validation pipeline first (priority 200), followed by database insertion (priority 400), ensuring validation occurs prior to storage. To test custom pipelines, configure isolated environments using scrapy.cfg or test-specific settings.py files to enable only the target pipeline, then run partial scrapes or unit tests that invoke process_item with fabricated items and mock spider objects, verifying outputs and side effects like database writes. This modular testing isolates pipeline logic from full crawls, facilitating debugging and validation.
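As a hedged sketch of the SQL approach described above, the following pipeline uses SQLAlchemy to insert items into a relational table; the table layout, field names, and the DATABASE_URL setting are assumptions of this example rather than Scrapy conventions.
python
from itemadapter import ItemAdapter
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class ItemModel(Base):
    # Hypothetical table layout matching the scraped item fields
    __tablename__ = "items"
    id = Column(Integer, primary_key=True)
    name = Column(String)
    price = Column(String)

class SQLAlchemyPipeline:
    @classmethod
    def from_crawler(cls, crawler):
        # DATABASE_URL is a project-specific setting, not a Scrapy default
        return cls(crawler.settings.get("DATABASE_URL", "sqlite:///items.db"))

    def __init__(self, database_url):
        self.database_url = database_url

    def open_spider(self, spider):
        self.engine = create_engine(self.database_url)
        Base.metadata.create_all(self.engine)
        self.Session = sessionmaker(bind=self.engine)

    def process_item(self, item, spider):
        session = self.Session()
        try:
            session.add(ItemModel(**ItemAdapter(item).asdict()))
            session.commit()
        finally:
            session.close()
        return item

    def close_spider(self, spider):
        self.engine.dispose()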

Middleware and Extensions

Downloader middleware in Scrapy provides a low-level framework for globally altering the requests and responses that pass through the downloading phase. It allows developers to implement custom logic, such as modifying request headers, handling proxies, or implementing retry mechanisms, by writing classes that define one or more hook methods (a sketch of a custom downloader middleware appears at the end of this section). Key methods include process_request(request, spider), which is invoked for each request before it is sent and can return None to continue processing, a Response object, a Request object to reschedule, or raise IgnoreRequest to drop the request; and process_response(request, response, spider), which processes responses after download and similarly returns a modified response, a new request, or raises an exception.

Built-in downloader middlewares include HttpCacheMiddleware, which caches HTTP requests and responses to avoid redundant downloads, configurable via settings like HTTPCACHE_STORAGE (defaulting to scrapy.extensions.httpcache.FilesystemCacheStorage for file-based storage), and UserAgentMiddleware, which sets the User-Agent header based on the spider's user_agent attribute or the global USER_AGENT setting. These middlewares are enabled by adding their paths to the DOWNLOADER_MIDDLEWARES dictionary in the project's settings, where the value is a priority integer: lower numbers execute closer to the engine, higher ones closer to the downloader. For example, {'scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware': 900} activates caching with a priority of 900.

Spider middleware, in contrast, intercepts the spider's input and output to enable pre- and post-processing of responses, requests, and items during spider execution. Developers create spider middlewares by defining classes that implement methods such as process_spider_input(response, spider), which filters responses before they reach the spider and returns None or raises an exception; process_spider_output(response, result, spider), which modifies the iterable of requests or items yielded by the spider; and process_spider_exception(response, exception, spider), which handles exceptions raised during spider processing. Since Scrapy 2.13, an optional base class scrapy.spidermiddlewares.base.BaseSpiderMiddleware is available to simplify implementation by providing default behaviors for middleware methods, though subclassing it is not required. These middlewares are activated via the SPIDER_MIDDLEWARES setting, using a similar priority-based dictionary to control execution order, with lower priorities running nearer the engine.

Extensions in Scrapy offer a flexible way to insert custom functionality tied to global events, distinct from middlewares in that they hook into signals rather than the request/response flow. They are implemented as classes with a from_crawler class method that initializes the extension and connects it to Scrapy's signals system, which dispatches events like spider_opened or spider_closed. For instance, an extension might connect a callback to spider_closed for tasks such as exporting statistics or closing database connections, as shown in this example:
python
from scrapy import signals

class MyExtension:
    @classmethod
    def from_crawler(cls, crawler):
        ext = cls()
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        return ext

    def spider_closed(self, spider):
        # Custom logic, e.g., log stats
        pass
Extensions are enabled through the EXTENSIONS setting with priority values, such as {'myproject.extensions.MyExtension': 100}. Built-in extensions include CoreStats for collecting core statistics like request counts and StatsMailer for emailing scrape stats upon completion. The signals system, briefly, enables event-driven behaviors across the framework without tight coupling to specific components.
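As a hedged sketch of the custom downloader middleware described earlier, the class below rotates a User-Agent header on outgoing requests and inspects response status codes; the user-agent strings and status handling are illustrative.
python
import random

class RotateUserAgentMiddleware:
    # Hypothetical pool of user-agent strings
    USER_AGENTS = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
    ]

    def process_request(self, request, spider):
        # Returning None lets the request continue through the middleware chain
        request.headers["User-Agent"] = random.choice(self.USER_AGENTS)
        return None

    def process_response(self, request, response, spider):
        # Returning the response passes it on toward the spider; returning a
        # Request instead would reschedule it through the engine.
        if response.status == 503:
            spider.logger.warning("Got 503 for %s", request.url)
        return response

# Enabled in settings.py with, for example:
# DOWNLOADER_MIDDLEWARES = {"myproject.middlewares.RotateUserAgentMiddleware": 543}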

History and Development

Origins and Early Versions

Scrapy originated in 2007 as an internal tool at Mydeco, a London-based furniture startup, where it was initially developed by Shane Evans to efficiently collect product data from various websites. Pablo Hoffman, a software engineer from Insophia in Uruguay, soon joined as a co-developer, collaborating with Evans to refine the framework's architecture for scalability and usability. This partnership addressed the limitations of existing tools at the time, which lacked robust support for asynchronous web crawling and data extraction at scale, motivating the creation of a dedicated framework for such tasks.

The first public release, Scrapy 0.7, occurred in August 2008, marking its debut as an open-source project under the permissive BSD license. Hosted initially on platforms like Google Code before migrating to GitHub, the early versions emphasized a modular design built on Python, leveraging the Twisted library for asynchronous networking to handle concurrent requests without blocking, and lxml for efficient HTML and XML parsing. Internally at Mydeco, Scrapy powered projects for competitive analysis and product cataloging, demonstrating its value in real-world scenarios before broader dissemination.

Following open-sourcing, Scrapy attracted its first significant community contributions as developers began submitting patches and enhancements, particularly for expanding parser support to handle diverse web formats and additional XML dialects. These early inputs helped iterate on core components, fostering a growing community of users who adapted it for custom scraping needs beyond Mydeco's original use case. In the pre-1.0 era, a key development challenge was stabilizing the asynchronous engine's integration with Twisted, ensuring reliable handling of high-volume crawls while mitigating issues like connection pooling and error recovery in dynamic web environments. This period involved iterative refactoring to balance performance with maintainability, laying the groundwork for Scrapy's reputation as a production-ready tool.

Major Releases and Evolution

Scrapy was initially developed in 2007 by Shane Evans at Mydeco, a London-based startup, to automate data extraction from websites, and it was open-sourced in August 2008 under the BSD license by Pablo Hoffman, marking its first public release as version 0.7. Early versions, such as 0.16 released on October 18, 2012, introduced key features like Spider Contracts for formal spider testing and the AutoThrottle extension for adaptive download delays, while dropping support for the outdated Python 2.5 and Twisted 2.5.

The project reached a significant milestone with Scrapy 1.0.0 on June 19, 2015, which established a stable API by allowing spiders to return dictionaries instead of rigid Items, adding per-spider custom settings, and switching to Python's built-in logging system from Twisted's, with backward compatibility maintained. This release emphasized simplicity in the Crawler API and included numerous bug fixes, solidifying Scrapy's role as a mature framework for large-scale scraping. Subsequent 1.x versions built on this foundation; for instance, Scrapy 1.1.0 on May 11, 2016, enabled robots.txt compliance by default and introduced Python 3 support (requiring Twisted 15.5+), addressing growing demands for cross-version compatibility. By Scrapy 1.5.0 on December 29, 2017, enhancements included refined item pipeline behaviors and other improvements that benefited cloud-based deployments.

Scrapy 2.0.0, released in March 2020, represented a major evolution by fully embracing asynchronous programming with initial asyncio support, alongside changes to scheduler handling for better customization. This version dropped Python 2 support entirely, aligning with the ecosystem's shift to Python 3; later patches such as 2.5.1 in October 2021 hardened HTTP authentication handling to mitigate credential exposure risks. Security-focused updates continued, with Scrapy 2.6.0 on March 1, 2022, improving cookie handling to prevent exploits via redirects.

In recent years, Scrapy has adapted to modern runtimes and performance needs. Scrapy 2.11.0 on September 18, 2023, allowed spiders to modify settings dynamically via the from_crawler method and added periodic stats logging for better monitoring. The 2.12.0 release on November 18, 2024, dropped Python 3.8 support while adding Python 3.13 compatibility and introducing JsonResponse for streamlined JSON handling. Most notably, Scrapy 2.13.0 on May 8, 2025, made the asyncio reactor the default (replacing Twisted's), deprecated non-async middlewares, and adjusted defaults like increasing DOWNLOAD_DELAY to 1 second and reducing CONCURRENT_REQUESTS_PER_DOMAIN to 1 for more ethical scraping out of the box, with subsequent 2.13.x patches refining callback precedence and engine stability. These changes reflect Scrapy's ongoing evolution toward async-native, secure, and user-friendly web extraction, driven by community contributions exceeding 11,000 commits.
