Scrapy
Scrapy is a free and open-source web crawling and web scraping framework written in Python, designed to efficiently extract structured data from websites for applications such as data mining, information processing, and archival purposes.[1] It operates as a high-level application framework that supports asynchronous processing via the Twisted networking engine, enabling fast and scalable crawls while handling features like HTTP compression, user-agent spoofing, cookie management, and robots.txt compliance.[1] Built-in tools include CSS and XPath selectors for data extraction, an interactive shell for testing, and export formats such as JSON, CSV, and XML, making it extensible through signals, middlewares, and pipelines for custom data processing.[1]
Originally conceived in 2007 at Mydeco for aggregating furniture data, Scrapy was open-sourced in August 2008 under the BSD license by developers including Shane Evans and Pablo Hoffman, with Insophia playing a key role in its early stewardship.[2] The project gained momentum after Zyte (formerly Scrapinghub), founded in 2010, became its primary maintainer in 2011, fostering a collaborative community that has contributed nearly 11,000 commits on GitHub.[2] Key milestones include the 1.0 stable release in June 2015, which marked full maturity; Scrapy 1.1 in May 2016 introducing beta Python 3 support; and Scrapy 2.0 in March 2020 adding asyncio integration for enhanced concurrency.[2] As of November 2025, the latest stable version is 2.13.3, requiring Python 3.9 or higher and boasting over 97 million downloads, underscoring its widespread adoption in production environments.[3][4][2]
At its core, Scrapy's architecture revolves around a central crawler engine that dispatches requests to spiders—user-defined scripts that define crawling behavior and data extraction rules—while managing downloads, item pipelines for data cleaning, and extensions for advanced functionality like auto-throttling and politeness delays to respect server loads.[1] This modular design allows developers to focus on writing extraction logic rather than boilerplate code for networking or concurrency, positioning Scrapy as a powerful tool for large-scale web data acquisition beyond simple parsing libraries.[5]
Overview
Purpose and Scope
Scrapy is an open-source Python framework designed for large-scale web crawling and extracting structured data from websites.[1] It serves as an application framework specifically for crawling web sites and processing the extracted information for applications such as data mining, information processing, and historical archival.[1]
The core applications of Scrapy include handling HTTP/HTTPS requests and responses asynchronously to enable efficient data collection, following links to navigate through website structures like pagination, and extracting targeted data using selectors such as CSS or XPath.[1] Once extracted, the data can be stored in various formats, including JSON, CSV, or XML, through built-in feed export mechanisms, facilitating integration with databases or further analysis tools.[1]
Scrapy's scope is primarily limited to web crawling and scraping tasks, distinguishing it from broader web automation tools that handle dynamic interactions like JavaScript rendering.[1] It supports standard protocols such as HTTP/HTTPS and can interface with APIs, but does not extend to real-time user simulations or browser-based automation. Compared to manual scraping methods, Scrapy provides automation advantages in scalability through concurrent request processing, increased speed via asynchronous operations, and built-in mechanisms for polite, resilient crawling, including cookie management, user-agent rotation, robots.txt compliance, and automatic throttling to reduce the risk of detection or bans.[1]
Design Philosophy
Scrapy's design philosophy centers on creating a flexible, efficient framework for web scraping that prioritizes extensibility and reliability in handling dynamic web environments. At its core, the framework adopts a modular architecture built around key extensible components such as spiders for defining crawling logic, items for structuring extracted data, and pipelines for post-processing, enabling users to customize behavior without modifying the underlying codebase. This modularity ensures that developers can tailor Scrapy to specific scraping needs while maintaining a clean separation of concerns, where extraction logic in spiders remains distinct from data storage or validation in pipelines.[6]
To achieve high performance, Scrapy employs asynchronous processing powered by the Twisted networking framework, which facilitates non-blocking, event-driven handling of concurrent requests and responses. This approach allows for efficient resource utilization, enabling the framework to manage thousands of simultaneous downloads without blocking on I/O operations, thus optimizing speed and scalability for large-scale data extraction tasks. Complementing this is a strong emphasis on robustness, with built-in mechanisms for retries on failed requests, automatic handling of redirects, and comprehensive error management to cope with the inherent unreliability of web sources, such as temporary server errors or network timeouts. These features are configurable through dedicated middlewares, ensuring resilient operation in production environments.[6][7][8]
Further embodying its principles, Scrapy is released under the permissive BSD-3-Clause license, fostering an open-source ethos that invites widespread community involvement in developing spiders, extensions, and improvements. Configurability is a cornerstone, achieved via a centralized settings system that allows fine-grained control over behaviors like retry policies, download delays, and middleware activation through project-specific files or command-line overrides. This design not only promotes reusability and maintainability but also aligns with the framework's goal of empowering users to build robust, scalable scraping solutions efficiently.[9][8]
Architecture
Core Components
Scrapy is built around several core components that form the foundation of its web scraping and crawling capabilities. These components include spiders, items, item loaders, request and response objects, and settings, each serving a distinct role in defining, extracting, and configuring the scraping process.[6]
Spiders are custom classes that encapsulate the crawling logic for a specific website or set of sites. They define the starting points for scraping through the start_urls attribute, which specifies the initial URLs to request, or via the start_requests() method that yields Request objects asynchronously. Spiders implement parsing rules primarily through the parse() callback method, where responses are processed to extract data using selectors or other tools, and to generate follow-up requests for link extraction and navigation. Key attributes include the required name for unique identification, and optional allowed_domains to restrict crawling scope. Specialized subclasses like CrawlSpider use rules with LinkExtractor instances to automate link following based on patterns.[10]
Items serve as containers for holding the scraped data in a structured format, facilitating its export and further processing. They can be simple Python dictionaries for basic use, or more robust custom classes derived from scrapy.Item, dataclasses, or attrs classes via the itemadapter library for type safety and field validation. When using scrapy.Item, fields are declared with Field() objects, which populate the fields attribute and enable features like preventing invalid field assignments through KeyError exceptions. Items support a dict-like interface, including copy and deepcopy methods, and include reference tracking to aid in memory leak detection during development.[11]
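A minimal item declaration might look like the following sketch, in which the ProductItem class and its fields are hypothetical:

```python
import scrapy


class ProductItem(scrapy.Item):
    # Each Field() call declares a field; assigning an undeclared key
    # (e.g. item["color"] = "red") raises a KeyError.
    name = scrapy.Field()
    price = scrapy.Field()
    url = scrapy.Field()


item = ProductItem(name="Desk lamp", price="19.99")
item["url"] = "https://example.com/lamp"  # dict-like access to declared fields
```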
Item Loaders provide an optional mechanism for populating and processing items with extracted data, emphasizing cleaning and validation to ensure data quality. They apply input processors to raw extracted values—such as stripping whitespace or converting data types—and output processors to format the final item fields before storage. Built on the itemloaders library, they integrate XPath and CSS selectors for targeted extraction from responses, and support nested loaders for handling complex, hierarchical data structures. This component is particularly useful for standardizing data handling across spiders without embedding processing logic directly in parsing methods.[12]
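As an illustrative sketch (the item fields, selectors, and URL are hypothetical), an item loader might combine input and output processors as follows:

```python
import scrapy
from itemloaders.processors import Join, MapCompose, TakeFirst
from scrapy.loader import ItemLoader


class ProductItem(scrapy.Item):
    name = scrapy.Field()
    description = scrapy.Field()
    url = scrapy.Field()


class ProductLoader(ItemLoader):
    default_item_class = ProductItem
    default_output_processor = TakeFirst()  # keep the first collected value
    name_in = MapCompose(str.strip)         # input processor: strip whitespace
    description_out = Join(" ")             # output processor: join text fragments


class ProductSpider(scrapy.Spider):
    name = "products"
    start_urls = ["https://example.com/products"]  # hypothetical URL

    def parse(self, response):
        loader = ProductLoader(response=response)
        loader.add_css("name", "h1::text")
        loader.add_xpath("description", "//div[@id='description']//text()")
        loader.add_value("url", response.url)
        yield loader.load_item()
```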
Request and Response objects manage the HTTP interactions central to Scrapy's operation, enabling the framework to fetch and handle web content. A Request object, typically generated in spiders, specifies the target URL (read-only, modifiable via replace()), HTTP method (defaulting to GET), headers as a dictionary-like structure, body as bytes, and a meta dictionary for passing contextual data like cookies or custom flags between requests. The corresponding Response object, returned after downloading, includes the HTTP status code (defaulting to 200), the final URL (potentially redirected), and the body as bytes, with access to the original request's meta via response.meta. These objects ensure seamless propagation of state and metadata throughout the scraping process.[13]
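The following sketch (URLs and selectors are hypothetical) shows the common pattern of passing state forward through a request's meta dictionary and reading it back from the resulting response:

```python
import scrapy


class DetailSpider(scrapy.Spider):
    name = "detail"
    start_urls = ["https://example.com/listing"]

    def parse(self, response):
        for href in response.css("a.item::attr(href)").getall():
            # meta carries state from the listing page to the detail page
            yield scrapy.Request(
                response.urljoin(href),
                callback=self.parse_detail,
                meta={"listing_url": response.url},
            )

    def parse_detail(self, response):
        yield {
            "url": response.url,          # final URL after any redirects
            "status": response.status,    # HTTP status code
            "listing_url": response.meta["listing_url"],
        }
```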
Settings act as the global configuration module, allowing customization of Scrapy's behavior through key-value pairs that influence core components, middleware, pipelines, and extensions. Common configurations include USER_AGENT for browser emulation, DOWNLOAD_DELAY to introduce pauses between requests for politeness, and CONCURRENT_REQUESTS to limit parallel downloads and avoid overwhelming servers. Settings are structured hierarchically, with defaults in scrapy.settings.default_settings, project-level overrides in settings.py (e.g., enabling ROBOTSTXT_OBEY = True), spider-specific adjustments via custom_settings, and command-line options holding highest precedence. The SCRAPY_SETTINGS_MODULE environment variable designates the active settings module, ensuring flexible adaptation to different environments.[8]
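Settings are plain key-value assignments; a representative (illustrative, not exhaustive) project configuration and a spider-level override might look like this:

```python
# settings.py (project-level overrides; values are illustrative)
BOT_NAME = "myproject"
ROBOTSTXT_OBEY = True
USER_AGENT = "myproject (+https://example.com/bot)"  # identify the crawler
DOWNLOAD_DELAY = 0.5       # pause between requests to the same site
CONCURRENT_REQUESTS = 16   # global cap on parallel downloads


# In a spider module: custom_settings takes precedence over settings.py
import scrapy


class PoliteSpider(scrapy.Spider):
    name = "polite"
    custom_settings = {"DOWNLOAD_DELAY": 2.0}
```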
Execution Flow
The Scrapy execution flow is orchestrated by the Engine, which serves as the central coordinator managing interactions between components such as the Scheduler, Downloader, and Spiders through dedicated queues for requests and items. Upon initiation, the Engine retrieves the initial Requests from the Spider's start_urls or start_requests method and enqueues them in the Scheduler for prioritization and queuing. This orchestration ensures a continuous, asynchronous data flow, leveraging Twisted's non-blocking I/O to handle high concurrency without blocking on network operations.[6]
In the request cycle, the Engine signals the Scheduler to provide the next Request, which is then dispatched to the Downloader after passing through Downloader Middlewares for potential modifications like adding headers or handling retries. The Downloader fetches the HTTP response from the target website, processes it through the same middlewares in reverse, and returns it to the Engine, which subsequently forwards it to the Spider via Spider Middlewares for parsing. This cycle repeats as long as Requests remain available, enabling efficient crawling of linked pages.[6]
During parsing, the Spider processes the Response by extracting data and yielding Items or new Requests; yielded Items are routed through Spider Middlewares to the Engine and then to Item Pipelines for validation, cleaning, and storage, while new Requests are enqueued back into the Scheduler to continue the crawl. Middlewares at both the downloader and spider levels allow interception and customization of this yielding process, such as user-agent rotation or data transformation, without altering the core flow. This step-by-step yielding mechanism supports recursive crawling and structured data extraction in a modular fashion.[6]
Duplication filtering occurs primarily within the Scheduler using the default RFPDupeFilter, which generates a unique fingerprint for each Request based on attributes like URL, method, and body to prevent revisiting identical resources and avoid infinite loops or redundant processing. This filter maintains a set of seen fingerprints, discarding duplicates before enqueuing, which optimizes resource usage during large-scale crawls.[8]
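When a URL legitimately needs to be fetched more than once, the per-request dont_filter flag bypasses this check; a minimal sketch with hypothetical URLs:

```python
import scrapy


class RefreshSpider(scrapy.Spider):
    name = "refresh"
    start_urls = ["https://example.com/"]

    def parse(self, response):
        # The scheduler's RFPDupeFilter would normally drop a request whose
        # fingerprint (URL, method, body) has already been seen;
        # dont_filter=True bypasses that check for this request only.
        yield scrapy.Request(response.url, callback=self.parse_again,
                             dont_filter=True)

    def parse_again(self, response):
        self.logger.info("Fetched %s a second time", response.url)
```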
The execution concludes with a shutdown process triggered when the Scheduler's queue is exhausted, indicating no further Requests are pending, or upon receiving system signals like SIGTERM for graceful termination. In graceful shutdowns, the Engine completes in-flight downloads and processes pending items before stopping the Twisted reactor, ensuring data integrity; errors or manual interruptions invoke signals like engine_stopped to log the reason and halt operations cleanly.[6][14]
Key Features
Selectors and Extraction
In Scrapy, selectors provide the primary mechanism for extracting data from HTML or XML responses received during web crawling. The framework supports two main selector types: XPath and CSS selectors. XPath selectors use a query language to navigate and select nodes in XML documents, allowing precise targeting of elements based on their structure, attributes, and content; for instance, response.xpath('//div[@class="title"]/text()').get() retrieves the text content of the first <div> element with a class attribute equal to "title".[15] CSS selectors, on the other hand, employ a stylesheet syntax to match elements by tag names, classes, IDs, and other attributes, such as response.css('span::text').get() to extract text from <span> elements.[15] These selectors are built as a thin wrapper around the parsel library, ensuring seamless integration with Scrapy's Response objects for efficient data selection.[15]
Selector objects in Scrapy, returned by methods like response.xpath() or response.css(), offer several extraction methods to handle matched data. The get() method returns the first matched string or None if no match exists, making it suitable for single-value extractions.[15] For multiple values, getall() yields a list of all matching strings, which is essential for scraping lists or repeated elements.[15] Additionally, the re() method applies regular expressions to extract substrings from matches, such as response.xpath('//a/text()').re(r'Name:\s*(.*)') to capture text following "Name:" in anchor elements, returning a list of results; re_first() provides the first such match.[15] These methods support flexible post-processing directly on selectors, enhancing their utility in parse methods within spiders.
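The following standalone sketch, built around a hypothetical HTML fragment, illustrates these methods via Scrapy's Selector class, which backs the response.css() and response.xpath() shortcuts:

```python
from scrapy.selector import Selector

html = """
<div class="title"> First Post </div>
<div class="title"> Second Post </div>
<a>Name: Alice</a>
"""
sel = Selector(text=html)

sel.css("div.title::text").get()                   # ' First Post ' (first match or None)
sel.css("div.title::text").getall()                # list of both title strings
sel.xpath("//a/text()").re(r"Name:\s*(.*)")        # ['Alice']
sel.xpath("//a/text()").re_first(r"Name:\s*(.*)")  # 'Alice'
```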
XPath expressions distinguish between absolute and relative paths to accommodate dynamic content and nested structures. Absolute paths begin with / (or the descendant shorthand //) and are evaluated from the document root; for example, //p matches all <p> elements anywhere in the HTML, which is robust for global searches but less efficient in large documents.[15] Relative paths, prefixed with ., operate within a specific context, such as .//p to find <p> elements only under the current selector's node, aiding in drilling down through hierarchical or dynamically generated content without rescanning the entire response.[15] This distinction is particularly valuable when working with nested or dynamically generated DOM trees, where context-aware selection prevents over-matching.
Scrapy extends selector functionality beyond HTML to formats like XML and JSON through the Selector class's type parameter. For XML responses, instantiate with Selector(text=body, type="xml") to enable XPath queries on structured markup, preserving element hierarchies.[15] For JSON, the type="json" parameter enables JMESPath queries for structured extraction, such as Selector(text=body, type="json").jmespath("key"); manual parsing with the json library remains an option for custom needs.[15] These capabilities ensure versatility across the response types encountered in diverse web scraping scenarios.
For link extraction, Scrapy facilitates resolving relative URLs using the response.urljoin() method on Response objects, which combines a relative path with the response's base URL to form an absolute URL.[13] This is invoked as response.urljoin(relative_url), leveraging urllib.parse.urljoin under the hood and accounting for any <base> tag in HTML responses via TextResponse.[13] It is particularly useful when selectors yield relative links from href attributes, enabling seamless follow-up requests in spiders without manual URL normalization.[13]
Item Processing
In Scrapy, item processing occurs after data extraction: scraped items, which serve as structured data containers, are passed through a series of components known as item pipelines to clean, validate, and prepare them for storage or further use.[16] These pipelines act as sequential processors, allowing developers to perform operations such as filtering out duplicate items based on unique fields or converting data types to ensure consistency across the dataset.[16]
Each pipeline is implemented as a Python class with three key methods that define its lifecycle stages: open_spider(self, spider) for initialization and setup when the spider starts, process_item(self, item, spider) for handling individual items during the crawl, and close_spider(self, spider) for cleanup and teardown once the spider finishes.[16] The process_item method processes each item in sequence across enabled pipelines, returning the modified item, a Deferred for asynchronous operations, or raising an exception to halt further processing.[16]
Scrapy includes several built-in pipelines to handle common tasks, such as the FilesPipeline, which downloads and stores media files referenced in item fields like file_urls while avoiding redundant downloads through checksum verification.[17] Similarly, pipelines can integrate utilities like MailSender to send notifications upon completion or encountering specific conditions, leveraging Twisted's non-blocking I/O for efficient email dispatch.[18]
To enable pipelines, they are configured in the project's settings.py file using the ITEM_PIPELINES dictionary, where each entry maps a pipeline class to an integer priority between 0 and 1000, determining the execution order from lowest to highest value.[19] For instance, a configuration might prioritize data validation before storage by assigning lower numbers to earlier stages.[16]
Error handling in pipelines focuses on robustness, where invalid items—such as those with missing required fields or duplicates—can be dropped by raising the DropItem exception, preventing them from proceeding to subsequent pipelines or storage.[16] Alternatively, issues can be logged using the spider's logger for debugging, ensuring the crawl continues without interruption while maintaining data integrity.[16]
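As a sketch of this pattern (the 'id' field is hypothetical), a deduplication pipeline can combine DropItem with the spider's logger:

```python
from itemadapter import ItemAdapter
from scrapy.exceptions import DropItem


class DuplicatesPipeline:
    def __init__(self):
        self.seen_ids = set()

    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        item_id = adapter.get("id")
        if item_id in self.seen_ids:
            spider.logger.warning("Dropping duplicate item: %s", item_id)
            raise DropItem(f"Duplicate item found: {item_id}")
        self.seen_ids.add(item_id)
        return item
```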
Installation and Setup
System Requirements
Scrapy requires Python 3.9 or later, compatible with both the standard CPython implementation and PyPy.[20] Support for Python 2.7 and earlier versions was officially dropped with the release of Scrapy 2.0 in March 2020.[21]
The framework depends on several core libraries for its functionality: Twisted provides the asynchronous networking and I/O capabilities essential for concurrent request handling; lxml serves as the XML and HTML parser; parsel enables CSS and XPath selectors for data extraction; w3lib provides utilities for URL handling and web encoding; cryptography handles secure HTTPS connections; and pyOpenSSL supports additional security features for network interactions.[20] Optionally, the service_identity library enhances TLS certificate verification when using Twisted, preventing warnings during secure connections.[22]
Scrapy is designed to run cross-platform on Windows, macOS, and Linux distributions.[20] On Windows, users may encounter installation challenges, such as the need for Microsoft Visual C++ Build Tools to compile dependencies like cryptography via pip, and it is recommended to use Anaconda or Miniconda for a smoother setup.[20] Windows-specific path handling issues can arise in file operations or command-line usage, often resolved by using forward slashes or raw strings in Python code to avoid backslash escaping problems.
For resource requirements, Scrapy operates efficiently on standard hardware for small to medium projects.[23] In large-scale crawls, performance scales with concurrency settings; optimal configurations target 80-90% CPU utilization, with memory usage increasing proportionally to manage high concurrency without bottlenecks.[23]
Project Initialization
To initialize a Scrapy project, first install the framework using pip in a dedicated virtual environment to isolate dependencies and prevent conflicts with other Python packages.[20] The recommended command is [pip](/page/Pip) install scrapy, which handles the installation of core dependencies such as Twisted for asynchronous I/O operations.[20] Virtual environments can be created using tools like Python's built-in venv module (e.g., python -m venv scrapy_env followed by activation) or Conda (e.g., conda create -n scrapy_env python=3.9 and conda activate scrapy_env).[20]
After installation, create a new Scrapy project by running the command scrapy startproject <project_name> from the desired parent directory, where <project_name> is replaced with the chosen project identifier (e.g., scrapy startproject myproject).[24] This command generates a standard directory structure for the project, including a top-level scrapy.cfg file for configuration and a Python module directory named after the project containing essential files and subdirectories.[25] The resulting structure appears as follows:
```
myproject/
    scrapy.cfg
    myproject/
        __init__.py
        items.py          # Defines structured data items
        middlewares.py    # Custom middleware configurations
        pipelines.py      # Item processing pipelines
        settings.py       # Project settings
        spiders/          # Directory for spider files
            __init__.py
```
The settings.py file initializes with default configurations tailored to the project, such as BOT_NAME set to the project name (falling back to 'scrapybot'), USER_AGENT set to "Scrapy/VERSION (+https://scrapy.org)" for identifying requests, and ROBOTSTXT_OBEY set to True to respect robots.txt directives by default.[8]
To verify the installation and project setup, launch the Scrapy shell using the command scrapy shell <url>, where <url> is a target website (e.g., scrapy shell https://scrapy.org).[26] This opens an interactive Python console within the Scrapy environment, allowing immediate testing of response handling and basic extraction without running a full spider.[26]
Basic Usage
Creating Spiders
In Scrapy, spiders are the core classes responsible for defining the crawling behavior, starting from initial URLs and processing responses to extract data or follow further links. To create a basic spider, one subclasses the scrapy.Spider class, which provides the foundational structure for handling requests and responses.[10] The spider must define a unique name attribute, serving as an identifier for the crawler, and a start_urls attribute, which is a list of initial URLs from which the spider begins scraping.[25]
The primary method to implement in a spider is parse(self, response), which receives the response object from each request and processes it to yield extracted data or additional requests. For data extraction, the method typically uses Scrapy's selectors—such as CSS or XPath—to parse the HTML content and yield Python dictionaries representing items, for example:
```python
def parse(self, response):
    for quote in response.css("div.quote"):
        yield {
            "text": quote.css("span.text::text").get(),
            "author": quote.css("small.author::text").get(),
        }
```
This approach allows the spider to return structured data incrementally without blocking the crawling process.[25] To enable multi-page crawling, the parse method can yield new scrapy.Request objects for discovered links, specifying a callback function like self.parse to handle subsequent responses, as in:
```python
next_page = response.css("li.next a::attr(href)").get()
if next_page is not None:
    yield scrapy.Request(response.urljoin(next_page), callback=self.parse)
```
This mechanism follows links recursively while respecting the spider's configuration.[25]
To restrict the spider's scope and prevent unintended crawling outside a target site, the allowed_domains attribute can be set as a list of domain names, ensuring requests are only followed within those domains.[10] For scenarios requiring dynamic generation of starting URLs—such as from a database or API—the start_requests() method can be overridden to yield custom scrapy.Request objects instead of relying on the static start_urls. An example implementation might look like:
```python
def start_requests(self):
    urls = ["https://example.com/page/1/", "https://example.com/page/2/"]
    for url in urls:
        yield scrapy.Request(url=url, callback=self.parse)
```
This flexibility supports more complex initialization logic while maintaining the spider's core parsing workflow.[10]
Running Scrapes
To execute a Scrapy project, the primary command is scrapy crawl <spider_name>, which starts the specified spider and initiates the crawling process.[24] The command must be run from inside the project, that is, from the directory containing the scrapy.cfg file (or one of its subdirectories), so that Scrapy can locate the project's settings.[24]
Scrapy provides flexible output options to capture scraped items during execution. The -o FILE:FORMAT flag appends extracted items to a specified file in formats such as JSON, CSV, or XML, while -O FILE:FORMAT overwrites the file instead.[24] For instance, scrapy crawl myspider -o items.json exports items as JSON; a .jl extension selects the JSON Lines format.[24] Additionally, the --loglevel LEVEL option controls output verbosity, with levels like INFO for standard messages or DEBUG for detailed traces, allowing users to monitor progress without overwhelming logs.[24]
For interactive debugging, the scrapy shell [url] command launches an interactive console (IPython if available, otherwise bpython or the standard Python shell) with the response from the given URL preloaded, enabling direct inspection of page content, selectors, and XPath/CSS queries.[24] This mode supports spider-specific contexts via --spider=SPIDER and one-off code execution with -c 'code', such as checking response.status.[24]
Performance evaluation is facilitated by scrapy bench, which runs a standardized benchmark against sample websites to measure Scrapy's crawling speed and efficiency under default settings.[24]
Monitoring runs involves Scrapy's built-in logging and statistics collection. Logs display real-time events like requests sent and items processed, with levels adjustable via --loglevel.[24] The stats object, accessible through the Crawler API as crawler.stats, tracks key metrics such as the number of requests sent ("downloader/request_count"), responses downloaded ("downloader/response_count"), and items extracted ("item_scraped_count").[27] During execution, stats can be updated programmatically with methods like stats.inc_value("custom_metric"); after the run, they remain available through the default MemoryStatsCollector for analysis, including timestamps and counters that quantify the scale of the scrape.[27]
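A brief sketch (spider name and counter key are hypothetical) of reading and updating stats from within a spider:

```python
import scrapy


class StatsAwareSpider(scrapy.Spider):
    name = "stats_aware"
    start_urls = ["https://example.com/"]

    def parse(self, response):
        # Increment a custom counter alongside Scrapy's built-in stats
        self.crawler.stats.inc_value("custom/pages_parsed")
        yield {"url": response.url}

    def closed(self, reason):
        # get_stats() returns the full stats dictionary at shutdown
        self.logger.info("Final stats: %s", self.crawler.stats.get_stats())
```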
Advanced Usage
Custom Pipelines
Custom pipelines in Scrapy enable developers to extend item processing with tailored logic, such as validation, transformation, or storage, allowing for sophisticated handling of scraped data after extraction. These components are integrated into the item's flow post-spider, providing a modular way to clean, enrich, or persist items without altering spider code. By defining custom classes, users can address specific requirements like data quality checks or external integrations, ensuring robust and scalable scraping workflows.[16]
To implement a custom pipeline, create a Python class that defines the process_item(self, item, spider) method, which processes each incoming item and returns the modified item, a Deferred for asynchronous handling, or raises scrapy.exceptions.DropItem to halt further processing. Supporting methods include open_spider(self, spider) for setup (e.g., resource allocation) when the spider initializes and close_spider(self, spider) for teardown upon completion. For access to project settings during instantiation, implement the classmethod from_crawler(cls, crawler), which receives the crawler context and can retrieve configuration values. This interface allows pipelines to operate statelessly per item while maintaining spider-wide state if needed.[16]
A representative example is a validation pipeline that inspects items for required fields and drops incomplete ones, preventing invalid data from advancing. The following code defines such a pipeline using the itemadapter library for consistent item handling across types like dicts or custom Items:
```python
from itemadapter import ItemAdapter
from scrapy.exceptions import DropItem


class ValidationPipeline:
    required_fields = ('name', 'price')  # define required fields

    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        for field in self.required_fields:
            if not adapter.get(field):
                raise DropItem(f"Missing required field: {field}")
        return item
```
This pipeline raises DropItem for items lacking 'name' or 'price', effectively filtering out deficient records while logging the reason via Scrapy's default exception handling.[16]
For database integration, custom pipelines facilitate direct storage of items into SQL or NoSQL systems, decoupling persistence from spiders for better maintainability. Initialization occurs in open_spider to establish connections, processing in process_item to insert or update records, and cleanup in close_spider to release resources, minimizing overhead during high-volume crawls. The from_crawler method pulls database credentials from settings for secure configuration.
An example for NoSQL integration with MongoDB uses PyMongo to store items as documents:
```python
import pymongo
from itemadapter import ItemAdapter


class MongoDBPipeline:
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get("MONGO_URI"),
            mongo_db=crawler.settings.get("MONGO_DATABASE", "scrapy_items"),
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        self.db["items"].insert_one(ItemAdapter(item).asdict())
        return item
```
For SQL databases like PostgreSQL or MySQL, the structure mirrors this: use libraries such as SQLAlchemy to create an engine in open_spider, execute inserts via a session in process_item (e.g., session.add(ItemModel(**ItemAdapter(item).asdict()))), and commit/close in close_spider. This approach supports transactional integrity and handles connection pooling for efficiency.[16]
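A condensed sketch of that SQLAlchemy approach follows; the DATABASE_URL setting, the myproject.models module, and the ItemModel/Base classes are hypothetical stand-ins for a real declarative model:

```python
from itemadapter import ItemAdapter
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

from myproject.models import Base, ItemModel  # hypothetical declarative model


class SQLAlchemyPipeline:
    def __init__(self, db_url):
        self.db_url = db_url

    @classmethod
    def from_crawler(cls, crawler):
        # e.g. "postgresql://user:pass@localhost/scrapy" defined in settings.py
        return cls(db_url=crawler.settings.get("DATABASE_URL"))

    def open_spider(self, spider):
        self.engine = create_engine(self.db_url)
        Base.metadata.create_all(self.engine)   # create tables if missing
        self.Session = sessionmaker(bind=self.engine)

    def close_spider(self, spider):
        self.engine.dispose()

    def process_item(self, item, spider):
        session = self.Session()
        try:
            session.add(ItemModel(**ItemAdapter(item).asdict()))
            session.commit()
        except Exception:
            session.rollback()
            raise
        finally:
            session.close()
        return item
```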
Enabling custom pipelines requires registering them in the ITEM_PIPELINES setting as a dictionary mapping each pipeline's import path to an integer priority (conventionally 0 to 1000, with lower values executing earlier; assigning None to a class disables it). Multiple pipelines process items sequentially based on priority, allowing ordered operations such as validation before storage. Example configuration in settings.py:
```python
ITEM_PIPELINES = {
    'myproject.pipelines.ValidationPipeline': 200,
    'myproject.pipelines.MongoDBPipeline': 400,
}
```
This setup runs the validation first (priority 200), followed by database insertion (priority 400), ensuring data quality prior to persistence.[19]
To test custom pipelines, use an isolated configuration, such as a test-specific settings module referenced from scrapy.cfg, that enables only the target pipeline, then run short crawls or unit tests that invoke process_item directly with fabricated items and mock spider objects, verifying outputs and side effects like database writes. This modular testing isolates pipeline logic from full crawls, facilitating debugging and validation.[24]
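For example, a pytest-style sketch of such unit tests, assuming the ValidationPipeline shown earlier lives at myproject.pipelines (a hypothetical import path), might look like:

```python
from unittest.mock import Mock

import pytest
from scrapy.exceptions import DropItem

from myproject.pipelines import ValidationPipeline  # hypothetical import path


def test_complete_item_is_returned_unchanged():
    pipeline = ValidationPipeline()
    item = {"name": "Desk lamp", "price": "19.99"}
    # A mock stands in for the spider; this pipeline never touches it.
    assert pipeline.process_item(item, Mock()) == item


def test_incomplete_item_is_dropped():
    pipeline = ValidationPipeline()
    with pytest.raises(DropItem):
        pipeline.process_item({"name": "Desk lamp"}, Mock())
```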
Middleware and Extensions
Downloader middleware in Scrapy provides a low-level framework for globally altering the requests and responses during the downloading phase.[7] It allows developers to implement custom logic, such as modifying request headers, handling proxies, or implementing retry mechanisms, by defining classes that implement one or more of the downloader middleware methods; no particular base class is required.[7] Key methods include process_request(request, spider), which is invoked for each request before it is sent and can return None to continue processing, a Response object, a Request object to redirect, or raise IgnoreRequest to drop the request; and process_response(request, response, spider), which processes responses after download and similarly returns a modified response, request, or raises an exception.[7]
Built-in downloader middlewares include HttpCacheMiddleware, which caches HTTP requests and responses to avoid redundant downloads, configurable via settings like HTTPCACHE_STORAGE (defaulting to scrapy.extensions.httpcache.FilesystemCacheStorage for file-based storage), and UserAgentMiddleware, which rotates or sets user agents based on the spider's user_agent attribute or global settings.[7] These middlewares are enabled by adding their paths to the DOWNLOADER_MIDDLEWARES dictionary in the project's settings, where the value represents a priority integer—lower numbers execute closer to the engine, higher ones closer to the downloader.[28] For example, {'scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware': 900} activates caching at a high priority.[7]
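A custom downloader middleware follows the same interface; the sketch below (class name, user-agent strings, and priority value are illustrative) rotates the User-Agent header and logs response codes:

```python
import random


class RotateUserAgentMiddleware:
    USER_AGENTS = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
        "Mozilla/5.0 (X11; Linux x86_64)",
    ]

    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(self.USER_AGENTS)
        return None  # None lets the request continue to the downloader

    def process_response(self, request, response, spider):
        spider.logger.debug("Got %s for %s", response.status, request.url)
        return response


# settings.py
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.RotateUserAgentMiddleware": 543,
}
```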
Spider middleware, in contrast, intercepts the spider's input and output to enable pre- and post-processing of responses, requests, and items during spider execution.[29] Developers create spider middlewares by defining classes that implement methods such as process_spider_input(response, spider), which filters responses before they reach the spider and should return None or raise an exception; process_spider_output(response, result, spider), which modifies the iterable of requests or items yielded by the spider; and process_spider_exception(response, exception, spider), which handles exceptions during spider processing. Since Scrapy 2.13, an optional base class scrapy.spidermiddlewares.base.BaseSpiderMiddleware is available to simplify implementation by providing default behaviors for middleware methods, though subclassing is not required.[29] These middlewares are activated via the SPIDER_MIDDLEWARES setting, using a similar priority-based dictionary to control execution order, with lower priorities running nearer the engine.[30]
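As an illustrative sketch of the output hook (class name, field name, and priority are hypothetical, using the synchronous form), a spider middleware might tag every yielded item with the URL of the response it came from:

```python
class AnnotateItemsMiddleware:
    def process_spider_output(self, response, result, spider):
        # result is the iterable of items and requests yielded by the spider
        for obj in result:
            if isinstance(obj, dict):
                obj.setdefault("source_url", response.url)
            yield obj  # pass items and requests through otherwise unchanged


# settings.py
SPIDER_MIDDLEWARES = {
    "myproject.middlewares.AnnotateItemsMiddleware": 545,
}
```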
Extensions in Scrapy offer a flexible way to insert custom functionality tied to global events, distinct from middlewares by focusing on signals rather than request/response flows.[31] They are implemented as classes with a from_crawler class method that initializes the extension and connects it to Scrapy's signals system, which dispatches events like spider_opened or spider_closed.[31] For instance, an extension might connect a callback to spider_closed for tasks such as exporting statistics or closing database connections, as shown in this example:
```python
from scrapy import signals


class MyExtension:
    @classmethod
    def from_crawler(cls, crawler):
        ext = cls()
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        return ext

    def spider_closed(self, spider):
        # Custom logic, e.g., log stats
        pass
```
Extensions are enabled through the EXTENSIONS setting with priority values, such as {'myproject.extensions.MyExtension': 100}.[32] Built-in extensions include CoreStats for collecting core statistics like request counts and StatsMailer for emailing scrape stats upon completion.[31] The signals system, briefly, enables event-driven behaviors across the framework without tight coupling to specific components.[14]
History and Development
Origins and Early Versions
Scrapy originated in 2007 as an internal tool at Mydeco, a London-based furniture e-commerce startup, where it was initially developed by Shane Evans, the company's head of software development, to efficiently collect structured product data from various websites.[2] Pablo Hoffman, a software engineer from Insophia in Uruguay, soon joined as a co-developer, collaborating with Evans to refine the framework's architecture for scalability and usability.[2] This partnership addressed the limitations of existing tools at the time, which lacked robust support for asynchronous web crawling and data extraction at scale, motivating the creation of a dedicated framework for such tasks.[33]
The first public release, Scrapy 0.7, occurred in August 2008, marking its debut as an open-source project under the permissive BSD license.[2] Hosted initially on platforms like Google Code before migrating to GitHub, the early version emphasized a modular design built on Python, leveraging the Twisted library for asynchronous networking to handle concurrent requests without blocking, and lxml for efficient HTML and XML parsing.[2] Internally at Mydeco, Scrapy powered data aggregation projects for competitive analysis and product cataloging, demonstrating its value in real-world e-commerce scenarios before broader dissemination.[33]
Following open-sourcing, Scrapy saw its first significant community contributions around 2009, as developers began submitting patches and enhancements, particularly for expanding parser support to handle diverse web formats like JSON and additional XML dialects.[34] These early inputs helped iterate on core components, fostering a growing ecosystem of users who adapted it for custom scraping needs beyond e-commerce.
In the pre-1.0 era, a key development challenge was stabilizing the asynchronous engine's integration with Twisted, ensuring reliable handling of high-volume crawls while mitigating issues like connection pooling and error recovery in dynamic web environments.[2] This period involved iterative refactoring to balance performance with maintainability, laying the groundwork for Scrapy's reputation as a production-ready tool.[33]
Major Releases and Evolution
Scrapy was initially developed in 2007 by Shane Evans at Mydeco, a London-based e-commerce startup, to automate data collection from websites, and it was open-sourced in August 2008 under the BSD license by Pablo Hoffman, marking its first public release as version 0.7.[2] Early versions, such as 0.16 released on October 18, 2012, introduced key features like Spider Contracts for formal spider testing and the AutoThrottle extension for adaptive download delays, while dropping support for outdated Python 2.5 and Twisted 2.5.[21]
The project reached a significant milestone with Scrapy 1.0.0 on June 19, 2015, which established a stable API by allowing spiders to return dictionaries instead of rigid Items, adding per-spider custom settings, and switching to Python's built-in logging system from Twisted's, with backward compatibility maintained.[21] This release emphasized simplicity in the Crawler API and included numerous bug fixes, solidifying Scrapy's role as a mature framework for large-scale scraping.[21] Subsequent 1.x versions built on this foundation; for instance, Scrapy 1.1.0 on May 11, 2016, enabled robots.txt compliance by default and introduced beta Python 3 support (requiring Twisted 15.5+), addressing growing demands for cross-version compatibility.[21] By Scrapy 1.5.0 on December 29, 2017, enhancements included better Google Cloud Storage integration and refined item pipeline behaviors, improving efficiency for cloud-based deployments.[35]
Scrapy 2.0.0, released on March 3, 2020, represented a major evolution by fully embracing asynchronous programming with initial asyncio support, alongside changes to scheduler queue handling for better customization.[21] This version dropped Python 2 support entirely, aligning with the ecosystem's shift to Python 3; later patch releases such as 2.5.1 on October 5, 2021, tightened HTTP authentication handling to mitigate credential exposure risks.[21] Security-focused updates continued, with Scrapy 2.6.0 on March 1, 2022, enhancing cookie handling to prevent exploits via redirects.[21]
In recent years, Scrapy has adapted to modern Python runtimes and performance needs. Scrapy 2.11.0 on September 18, 2023, allowed spiders to modify settings dynamically via the from_crawler method and added periodic stats logging for better monitoring.[36] The 2.12.0 release on November 18, 2024, dropped Python 3.8 support while adding Python 3.13 compatibility and introducing JsonResponse for streamlined JSON handling.[36] Most notably, Scrapy 2.13.0 on May 8, 2025, made the asyncio reactor the default (replacing Twisted's), deprecated non-async middlewares, and adjusted defaults like increasing DOWNLOAD_DELAY to 1 second and reducing CONCURRENT_REQUESTS_PER_DOMAIN to 1 for more ethical scraping out-of-the-box, with subsequent 2.13.x patches refining callback precedence and engine stability.[21] These changes reflect Scrapy's ongoing evolution toward async-native, secure, and user-friendly web extraction, driven by community contributions exceeding 11,000 GitHub commits.[2]