Shapefile
A shapefile is a geospatial vector data format developed by Esri in the early 1990s for storing the geometric location and associated attribute information of spatial features, such as points, lines, and polygons, in geographic information system (GIS) software.[1] It was introduced alongside ArcView GIS version 2 to facilitate efficient data handling without topological relationships, enabling faster drawing, editing, and storage compared to more complex formats.[2][1] The format is publicly documented as an open specification, first detailed in Esri's 1998 technical description, which has made it a de facto standard for GIS data exchange across Esri and non-Esri applications, including tools like QGIS.[2][1]

A shapefile comprises a collection of at least three mandatory files: the main geometry file (.shp) that holds vector coordinates in a binary structure with a fixed 100-byte header followed by variable-length records; the shape index file (.shx) that provides byte offsets for rapid access to features; and the dBASE attribute file (.dbf) that stores tabular data linked to each geometric feature.[2] Optional files, such as the projection file (.prj) for coordinate system definitions or spatial index files (.sbn and .sbx), can enhance functionality but are not required for basic use.[1]

Shapefiles support four primary geometry types (Point, PolyLine, Polygon, and MultiPoint), along with variants that carry Z (elevation) or M (measure) values, but each file is limited to a single type, and the format does not enforce topological integrity, such as shared edges between polygons.[2]

Despite its widespread adoption, with over 160,000 instances in collections like those of the Library of Congress as of 2024, shapefiles have notable limitations, including a 2 GB file size cap per component, no native Unicode support in attributes, absence of null data handling beyond specific "no data" values, and incompatibility with infinity or NaN representations.[1][2] These constraints have led to recommendations for migration to more modern formats like GeoPackage for long-term sustainability, though shapefiles remain prevalent due to their simplicity and broad compatibility.[1]

Introduction
Definition and Purpose
A shapefile is an open specification binary format developed by Esri for representing vector geospatial data, such as points, lines, and polygons.[2] It serves as a vector data storage format for capturing the location, shape, and attributes of geographic features.[3] The primary purpose of a shapefile is to store geometric locations alongside associated attribute information, enabling mapping, spatial analysis, and visualization in geographic information systems (GIS).[4]

Key characteristics of shapefiles include their composition as a collection of multiple related files with specific extensions, rather than a single file, which allows for modular handling of geometry and attributes.[5] This structure supports simple feature geometries that comply with Open Geospatial Consortium (OGC) standards for nontopological vector data.[6] Shapefiles maintain a one-to-one relationship between spatial shapes and their descriptive attributes, facilitating efficient data exchange without complex topological relationships.[2]

Shapefiles are widely adopted in GIS applications for tasks such as spatial analysis, cartography, urban planning, and environmental modeling, owing to their straightforward design and interoperability across diverse software platforms.[6] In the context of geospatial data representation, shapefiles exclusively handle vector data (discrete features modeled as points, lines, and polygons), in contrast to raster formats that use a grid of cells to depict continuous surfaces like imagery or elevation.[7]

History and Development
The Shapefile format was developed by Esri in the early 1990s as a simple, non-topological vector data storage solution for geographic information systems (GIS).[1] It was introduced with the release of ArcView GIS version 2, Esri's desktop GIS software aimed at broadening access to spatial analysis beyond specialized users.[8] Designed initially for ArcView, the format combined geometry storage with attribute data in a dBASE-compatible structure, prioritizing ease of use, faster rendering, and reduced storage needs compared to earlier topological formats like those in ARC/INFO.[2]

Esri released the public technical specification for Shapefile in July 1998 through a white paper, transitioning it from a proprietary internal format to a mostly open one that promoted interoperability across GIS tools.[2] This openness facilitated its integration into Esri's next-generation ArcGIS platform, launched with version 8.0 in December 1999, where Shapefile became a core supported format for data exchange and analysis.[9] By the early 2000s, the format had achieved de facto standard status in open-source GIS ecosystems; the Geospatial Data Abstraction Library (GDAL), initiated in 2001, included robust Shapefile read/write support from its outset, enabling seamless handling in tools like QGIS, which launched in 2002 and relied on GDAL for vector data operations.[10] Esri's decision to publish the specification encouraged widespread adoption while leaving control of the format with Esri, allowing it to influence data sharing without full open-source licensing.[11]

ArcGIS 8.0 also introduced .shp.xml metadata files, which provide structured descriptions of spatial reference and dataset properties to address documentation gaps in the original design.[6] Standardization efforts aligned Shapefile geometries with the Open Geospatial Consortium (OGC) Simple Features specification, supporting common types like points, lines, and polygons for basic spatial queries and operations.[12] Its simplicity also indirectly shaped later formats, such as GeoJSON (first specified in 2008), by establishing a baseline for encoding simple feature geometries and attributes in interoperable ways.[1]

As of 2025, Shapefile remains prevalent in GIS workflows despite its age, with ongoing support in modern software like ArcGIS Pro and QGIS, and continued use by major data providers such as the U.S. Census Bureau for annual TIGER/Line releases.[13] However, its limitations, such as a 2 GB file size cap and lack of advanced features, have led to a gradual decline in favor of more flexible, standards-compliant alternatives like GeoPackage, though it endures as an interchange format in legacy systems and data archives.[1]

Components
Required Files
A shapefile dataset requires three mandatory files to function as a complete vector data format: the main geometry file (.shp), the shape index file (.shx), and the attribute database file (.dbf). These files collectively enable the storage and retrieval of geospatial features, including their shapes and associated descriptive data. Without all three, the dataset cannot be properly interpreted by GIS software, rendering it invalid or incomplete.[2]

The .shp file serves as the core component, storing the vector geometry data for each feature in a series of binary records. This includes representations such as points, lines, or polygons that define the spatial locations and shapes of geographic entities.[2]

The .shx file acts as a positional index to the .shp file, containing offsets that allow software to quickly locate and access specific geometry records without scanning the entire .shp file. This indexing supports efficient querying and rendering of spatial data.[2]

The .dbf file maintains the attribute information for each feature using the dBase III database format, where each record corresponds directly to a geometry in the .shp file by sequential order. This linkage allows attributes like names, populations, or classifications to be associated with their respective spatial elements.[2]

All required files must share the same base filename (for example, "rivers.shp", "rivers.shx", and "rivers.dbf") and reside in the same directory to ensure proper dataset integrity. The .shp file employs a mixed byte order, with big-endian for file management fields (such as the file code and file length in the header) and little-endian for data fields (such as shape type and bounding box coordinates); the .shx file uses the same header layout and stores its index records as big-endian integers.[2]

Optional Files
Shapefiles may include several optional files that enhance functionality, such as defining spatial references, handling character encoding, providing metadata, or improving query performance, without affecting the core data integrity of the required files.[2][14]

The .prj file stores the coordinate reference system (CRS) information for the shapefile, typically in Well-Known Text (WKT) format, enabling accurate georeferencing and projection during mapping and analysis in GIS software.[2][1] This file is recommended for all shapefiles to ensure interoperability across different systems and to prevent misinterpretation of spatial coordinates.[14]

The .cpg file specifies the character encoding used in the associated .dbf attribute file, such as UTF-8 or ANSI, to support international characters and non-Latin scripts in attribute data.[1][14] It is particularly useful for datasets containing multilingual text, ensuring proper display and processing in diverse software environments.[10]

Metadata can be stored in .shp.xml files using XML format, which documents details about the shapefile such as its origin, creation date, and descriptive attributes, facilitating validation, documentation, and integration with tools like ArcGIS.[14][1]

The .sbn and .sbx files provide a spatial index using spatial binning to improve query performance on large datasets.[2] The .qix file serves a similar purpose with a quadtree-based spatial index, accelerating spatial queries by organizing geometries into hierarchical quadrants; it is commonly generated by open-source tools like GDAL or MapServer for use with Esri shapefiles.[10][15]

These optional files share the same base filename as the core shapefile components (e.g., example.prj for example.shp) to maintain association. Their absence does not invalidate the dataset, though it may limit advanced features depending on the consuming software.[2] Usage of .prj is advised universally for georeferencing, while .cpg, .shp.xml, .sbn, .sbx, and .qix are employed based on data complexity, encoding needs, and query requirements in specific applications.[14][10]

Formats
Geometry Format (.shp)
The .shp file contains the geometric data of the shapefile in a binary format, using a combination of big-endian and little-endian byte orders for different elements.[2] The file begins with a fixed 100-byte header that encodes metadata essential for parsing the entire structure. This header starts at byte 0 with a file code of 9994, stored as a 4-byte big-endian integer, followed by 20 bytes of unused space initialized to zero. Bytes 24 through 27 specify the total file length as a 4-byte big-endian integer, measured in 16-bit words (each word being 2 bytes) and including the header itself. The version number, fixed at 1000 for the standard shapefile format, occupies bytes 28 through 31 as a 4-byte little-endian integer. Bytes 32 through 35 contain the shape type as a 4-byte little-endian integer, which defines the geometry type shared by all records in the file (e.g., 1 for point shapes). The remaining bytes 36 through 99 hold eight 8-byte little-endian doubles: bytes 36 through 67 form the file's bounding box (Xmin, Ymin, Xmax, Ymax), the overall spatial extent derived from all geometries, and bytes 68 through 99 hold the Z and M ranges (Zmin, Zmax, Mmin, Mmax), which are set to zero if unused.[2]

Following the header, the file consists of a sequence of variable-length records, each representing a single geometry. Each record starts with an 8-byte header: bytes 0 through 3 hold the record number as a 4-byte big-endian integer (beginning at 1 and incrementing sequentially), and bytes 4 through 7 store the content length (excluding the record header) as a 4-byte big-endian integer in 16-bit words. The record's content immediately follows, beginning at offset 8 from the start of the record with a 4-byte little-endian integer that specifies the shape type, which must either match the file header's shape type or be 0 (null). For null geometries (shape type 0), the content ends here with no additional data. Otherwise, the remaining variable-length binary data encodes the geometry specifics.[2]

Geometry encoding uses little-endian byte order for all coordinate and descriptive data, with coordinates represented as 64-bit IEEE double-precision floating-point values for high precision. A simple point geometry (shape type 1) consists solely of an X coordinate (8 bytes) followed by a Y coordinate (8 bytes). For more complex types like polylines (shape type 3) and polygons (shape type 5), the encoding is identical in structure: it begins with a per-record bounding box of four 8-byte little-endian doubles (Xmin, Ymin, Xmax, Ymax), followed by a 4-byte little-endian integer for the number of parts, a 4-byte little-endian integer for the total number of points, an array of 4-byte little-endian integers (one per part) giving the index of each part's first point in the points array, and finally the array of points (each an X-Y pair of 8-byte doubles). This part-index mechanism enables support for multi-part features, such as disconnected polyline segments or polygons with interior rings (islands or holes). The file's overall bounding box is computed as the union of all individual record extents during creation. There is no explicit end-of-file marker; the total number of records and file termination are inferred from the header's length field.[2]

| Field | Bytes | Type | Endianness | Description |
|---|---|---|---|---|
| File Code | 0-3 | Integer | Big | Must be 9994 |
| Unused | 4-23 | - | - | 20 bytes of zeros |
| File Length | 24-27 | Integer | Big | Total length in 16-bit words |
| Version | 28-31 | Integer | Little | Must be 1000 |
| Shape Type | 32-35 | Integer | Little | Geometry type for the file |
| Xmin | 36-43 | Double | Little | Minimum X coordinate |
| Ymin | 44-51 | Double | Little | Minimum Y coordinate |
| Xmax | 52-59 | Double | Little | Maximum X coordinate |
| Ymax | 60-67 | Double | Little | Maximum Y coordinate |
| (Optional Zmin, Zmax, Mmin, Mmax) | 68-99 | Double | Little | If unused, set to 0.0 |
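
The header layout in the table maps directly onto fixed-size binary fields, so it can be decoded with a few calls to Python's standard `struct` module. The following is a minimal sketch rather than a reference implementation; the names `ShpHeader`, `parse_shp_header`, and `demo_header` are hypothetical and chosen only for illustration.

```python
import struct
from typing import NamedTuple, Tuple


class ShpHeader(NamedTuple):
    file_code: int
    file_length_bytes: int
    version: int
    shape_type: int
    bbox: Tuple[float, float, float, float]   # (xmin, ymin, xmax, ymax)
    z_range: Tuple[float, float]
    m_range: Tuple[float, float]


def parse_shp_header(buf: bytes) -> ShpHeader:
    """Decode the 100-byte header of a .shp (or .shx) file."""
    if len(buf) < 100:
        raise ValueError("a shapefile header is 100 bytes long")
    # Bytes 0-27, big-endian: file code, five unused integers, file length.
    file_code, *_unused, length_words = struct.unpack(">7i", buf[:28])
    if file_code != 9994:
        raise ValueError(f"unexpected file code {file_code}")
    # Bytes 28-35, little-endian: version and shape type.
    version, shape_type = struct.unpack("<2i", buf[28:36])
    # Bytes 36-99, little-endian doubles: XY bounding box, then Z and M ranges.
    xmin, ymin, xmax, ymax, zmin, zmax, mmin, mmax = struct.unpack("<8d", buf[36:100])
    return ShpHeader(
        file_code,
        length_words * 2,            # length is stored in 16-bit words
        version,
        shape_type,
        (xmin, ymin, xmax, ymax),
        (zmin, zmax),
        (mmin, mmax),
    )


# A synthetic header for an empty Point shapefile, built only for demonstration.
demo_header = (
    struct.pack(">7i", 9994, 0, 0, 0, 0, 0, 50)
    + struct.pack("<2i", 1000, 1)
    + struct.pack("<8d", -10.0, -5.0, 10.0, 5.0, 0.0, 0.0, 0.0, 0.0)
)
print(parse_shp_header(demo_header))
```

Reading the first 100 bytes of a real file (for example with `open(path, "rb").read(100)`) and passing them to the function should work the same way, provided the file follows the layout above.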
Index Format (.shx)
The index file (.shx) serves as a positional companion to the main shapefile (.shp), enabling efficient random access to individual geometry records without requiring a full sequential scan of the larger .shp file. It stores offsets and lengths for each record in the .shp, allowing software to jump directly to specific features during reading or rendering operations. This linear indexing approach is essential for performance in applications handling large datasets, as it facilitates quick lookups by record position, which corresponds to the order of attributes in the associated .dbf file.[2]

The .shx file begins with a 100-byte header that mirrors the structure of the .shp header, ensuring consistency in basic metadata. This includes bytes 0–3 containing the file code 9994 (indicating the shapefile format), bytes 4–23 as unused (set to zero), bytes 24–27 specifying the total file length in 16-bit words, bytes 28–31 indicating version 1000, and bytes 32–35 denoting the overall shape type (an integer from 0 to 31, such as 1 for points or 5 for polygons). Bytes 36–99 hold the bounding box fields (minimum and maximum X and Y coordinates as IEEE double-precision values) together with the Z and M ranges, which match those in the .shp header to describe the spatial extent of all features; however, these are not used for indexing purposes in the .shx itself. The file length value covers the fixed 50 16-bit words of the header plus 4 words per index record, so it directly reflects the number of shapefile records.[2]

Following the header, the .shx contains one fixed-length 8-byte record for each geometry record in the .shp, resulting in a total record count identical to that of the .shp.
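
The length arithmetic above means the record count can be recovered from the header alone, and the position of any index record follows from its ordinal. The sketch below illustrates this under the stated layout; the function names are hypothetical and not taken from any library.

```python
import struct

SHX_HEADER_WORDS = 50   # the 100-byte header, expressed in 16-bit words
SHX_RECORD_WORDS = 4    # each index record occupies 8 bytes


def shx_record_count(file_length_words: int) -> int:
    """Number of index records implied by the header's file length field."""
    body_words = file_length_words - SHX_HEADER_WORDS
    if body_words < 0 or body_words % SHX_RECORD_WORDS != 0:
        raise ValueError("file length is inconsistent with the .shx layout")
    return body_words // SHX_RECORD_WORDS


def shx_record_count_from_header(header: bytes) -> int:
    """Read the big-endian file length at bytes 24-27 and derive the count."""
    (length_words,) = struct.unpack_from(">i", header, 24)
    return shx_record_count(length_words)


def shx_record_position(index: int) -> int:
    """Byte position within the .shx of the zero-based index-th record."""
    return 100 + 8 * index
```

For instance, a header reporting 62 words describes three index records, the last of which starts at byte 116 of the .shx file.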
Each .shx record consists of two 4-byte big-endian integers: the first (bytes 0–3) provides the offset in 16-bit words from the beginning of the .shp file to the start of the corresponding .shp record header (for example, the first record's offset is typically 50, as it follows the 100-byte .shp header), and the second (bytes 4–7) specifies the content length of that .shp record in 16-bit words, excluding the 8-byte .shp record header itself. These offsets point precisely to the .shp record headers, which include a record number and length, enabling seamless synchronization between the files.[2]

For the .shx to function correctly, it must maintain exact correspondence with the .shp in terms of record count, order, and content lengths; any addition, deletion, or modification of geometries in the .shp necessitates rebuilding the .shx to update the offsets and lengths accordingly. This positional alignment also links each .shx entry to the corresponding attribute row in the .dbf file by sequential order, supporting integrated access to spatial and tabular data. Unlike spatial indexing formats such as .sbn and .sbx, the .shx provides no capability for querying based on geographic location, limiting it to simple ordinal access.[2]

Attribute Format (.dbf)
The .dbf file in a Shapefile stores tabular attribute data for each geometric feature in a format compatible with dBase III database tables, ensuring a one-to-one correspondence between records and shapes in the accompanying .shp file.[2] This structure allows for the association of descriptive attributes, such as names or population values, with spatial entities without embedding them directly in the geometry data.[16] The file consists of a fixed header, field descriptors, and data records, all adhering to the dBase III specification for interoperability with legacy database applications.[17]

The file begins with a 32-byte header that provides essential metadata about the table. Byte 0 indicates the dBase version, typically 0x03 for dBase III without memo fields or 0x83 with memo support, though Shapefiles generally avoid memo fields.[18] Bytes 1 through 3 store the last update date (year minus 1900, month, and day, respectively). Bytes 4 to 7 contain the total number of records as a little-endian 32-bit integer, matching the number of shapes in the .shp file. Bytes 8 and 9 specify the header length (little-endian 16-bit), which includes the initial 32 bytes plus 32 bytes per field descriptor and a 1-byte terminator. Bytes 10 and 11 define the record length (little-endian 16-bit), determining the fixed size of each data row. The remaining bytes 12 to 31 are reserved in dBase III and typically set to 0x00; later dBASE versions use byte 14 as an incomplete-transaction flag (0x00 or 0x01), byte 15 as an encryption flag (usually 0x00 in unencrypted Shapefiles), and byte 29 as a language driver (code page) identifier.[19][17]

Following the header are field subheaders, each 32 bytes long, defining the structure of the attribute columns until terminated by a 0x0D byte. The first 11 bytes (0-10) hold the field name as an ASCII string, limited to 10 characters followed by a null terminator or space padding. Byte 11 specifies the data type: 'C' for character strings, 'N' for numeric values, 'L' for logical (true/false), or 'D' for dates in YYYYMMDD format; floating-point numbers are also handled as 'N' type. Bytes 12 to 15 provide the byte displacement of the field within each record (little-endian 32-bit, often calculated on-the-fly). Byte 16 sets the field length (1 to 255 bytes), and byte 17 indicates decimal places (0 to 15 for numerics). Bytes 18 to 31 are reserved, set to 0x00. Shapefiles support up to 255 fields, with field names limited to 10 characters to maintain dBase III compatibility.[18][19][20]

Data records follow immediately after the field descriptors, with one record per shape in positional order: the nth record in the .dbf corresponds directly to the nth shape in the .shp file, enabling straightforward linking without additional keys.[2][16] Each record is a fixed-length sequence matching the header's record length specification, starting with a 1-byte marker: 0x20 (space) for active records or 0x2A (asterisk) for deleted ones, which are skipped during processing but retained in the file. Subsequent bytes fill the fields sequentially: character fields are left-justified and space-padded; numeric fields are right-justified with leading spaces and no scientific notation; logical fields use a single byte with 'T', 'F', or space; date fields occupy 8 bytes in fixed YYYYMMDD format. The total record length is limited to 4 KB in standard Shapefile implementations to avoid exceeding dBase constraints, and complex data types like arrays or objects are not supported, restricting attributes to simple scalar values.[19][17][10]

The file concludes with a 0x1A (end-of-file) terminator byte after the last record, signaling the end of data.
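
The header, field descriptors, and fixed-length records described above can be walked with a short parser. The sketch below is illustrative only: `parse_dbf`, `convert_value`, and the synthetic `demo_dbf` table are hypothetical names, the default `ascii` encoding is an assumption, and deleted rows are returned as `None` (a design choice made here so that row positions keep matching shape positions).

```python
import struct


def convert_value(ftype: str, text: str):
    """Turn the stripped text of one field into a Python value."""
    if ftype == "N":
        if not text:
            return None
        return float(text) if "." in text else int(text)
    if ftype == "L":
        if text.upper() == "T":
            return True
        if text.upper() == "F":
            return False
        return None              # a space means "not set"
    return text                  # 'C' and 'D' are kept as text in this sketch


def parse_dbf(buf: bytes, encoding: str = "ascii"):
    """Return (fields, rows) for an in-memory dBase III table."""
    # Bytes 0-11: version, date (YY, MM, DD), record count, header and record length.
    _version, _yy, _mm, _dd, n_records, header_len, record_len = struct.unpack_from(
        "<4BIHH", buf, 0
    )
    # 32-byte field descriptors start at byte 32 and end at the 0x0D terminator.
    fields = []
    pos = 32
    while buf[pos] != 0x0D:
        name = buf[pos:pos + 11].split(b"\x00")[0].decode("ascii").strip()
        ftype = chr(buf[pos + 11])
        fields.append((name, ftype, buf[pos + 16], buf[pos + 17]))
        pos += 32
    rows = []
    for i in range(n_records):
        start = header_len + i * record_len
        rec = buf[start:start + record_len]
        if rec[:1] == b"*":      # deleted record: keep its slot, drop its data
            rows.append(None)
            continue
        row, offset = {}, 1      # skip the 1-byte deletion marker
        for name, ftype, length, _decimals in fields:
            text = rec[offset:offset + length].decode(encoding).strip()
            row[name] = convert_value(ftype, text)
            offset += length
        rows.append(row)
    return fields, rows


# Synthetic two-field, two-record table built only for demonstration.
demo_dbf = (
    struct.pack("<4BIHH20x", 0x03, 124, 1, 1, 2, 32 + 2 * 32 + 1, 1 + 8 + 6)
    + struct.pack("<11sc4xBB14x", b"NAME", b"C", 8, 0)
    + struct.pack("<11sc4xBB14x", b"POP", b"N", 6, 0)
    + b"\r"
    + b" " + b"Oslo".ljust(8) + b"709".rjust(6)
    + b"*" + b"Gone".ljust(8) + b"0".rjust(6)
    + b"\x1a"
)
fields, rows = parse_dbf(demo_dbf)
print(fields)
print(rows)
```

A production reader would also handle malformed headers, date parsing, and the encoding declared by a companion .cpg file, all of which are omitted here for brevity.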
By default, text encoding follows the dBase III standard using ASCII or OEM codepages, but Shapefiles may include an optional .cpg companion file specifying extended codepages (e.g., UTF-8 or Windows-1252) for international characters; when present, the encoding name or code page number in the .cpg determines how attribute text is decoded.[18][21]

Spatial Index Format (.sbn and .sbx)
The .sbn and .sbx files constitute an optional spatial indexing mechanism for shapefiles, enabling faster retrieval of features based on their geographic locations during queries. These files implement an R-tree data structure, which organizes the minimum bounding rectangles (MBRs) of the geometries stored in the corresponding .shp file into a balanced hierarchy of nodes. This approach minimizes the number of features examined in spatial operations, such as intersection tests or containment checks for points, lines, and polygons, particularly beneficial for large datasets exceeding thousands of records.[10][14]

The .sbn file holds the core R-tree data in a binary format with variable-length records representing internal nodes and leaf nodes. Each node encapsulates MBRs that approximate the extent of child nodes or individual features, along with pointers to facilitate tree traversal. Leaf nodes reference specific shape records by their indices, allowing the index to guide searches without loading the full geometry data. The R-tree's design ensures logarithmic-time query performance by pruning irrelevant branches early, though it permits some overlap in MBRs to maintain balance during insertions. These indexes are generated by Esri's ArcGIS software during shapefile creation or optimization, and while not universally present, they significantly enhance rendering and analysis speed in compatible tools.[10][14]

Complementing the .sbn file, the .sbx serves as a fixed-length index akin to the .shx file used for sequential access in the .shp, mapping record numbers to byte offsets and content lengths within the .sbn. This pairing allows efficient random access to R-tree nodes, streamlining the integration with the main shapefile components. Compatibility is limited to shapefiles at version 1000 or higher, where the spatial extent is initially partitioned into bins to seed the R-tree construction, promoting even distribution across the tree levels.
Open-source libraries like GDAL support reading these indexes to exploit their acceleration benefits, though creation remains proprietary to Esri tools.[10][22]

Shape Types and Records
Supported Geometry Types
The Shapefile format defines a set of geometry types to represent spatial features, each specified by a unique integer code stored in the file header and at the start of each record. These types encompass simple points, linear features, polygonal areas, and multi-part collections, with extensions for elevation (Z) values and linear measures (M) for applications like routing or surveying. All non-null geometries within a single shapefile must share the same type, ensuring uniformity. The original specification defines 14 shape types, including the Null shape, with the remaining codes reserved for future extensions.[2]

The following table enumerates the supported shape types, their codes, and basic compositions:

| Code | Type | Description and Composition |
|---|---|---|
| 0 | Null | No geometric content; serves as a placeholder record with no coordinates. |
| 1 | Point | A single 2D point defined by X and Y double-precision coordinates. |
| 3 | Polyline | One or more parts, where each part is an array of connected 2D points (doubles for X,Y); represents open linear features. |
| 5 | Polygon | One or more closed rings, each an array of 2D points (at least four per ring, first and last identical); represents areal features. |
| 8 | MultiPoint | A collection of non-connected 2D points within a bounding box, stored as an array of X,Y doubles. |
| 11 | PointZ | A single 3D point with X,Y,Z doubles; optional M value follows. |
| 13 | PolylineZ | Polyline with Z-enabled points (X,Y,Z doubles per point); includes Z range and optional M range/array. |
| 15 | PolygonZ | Polygon with Z-enabled points; includes Z range and optional M range/array per ring. |
| 18 | MultiPointZ | MultiPoint with Z-enabled points; includes Z range and optional M range/array. |
| 21 | PointM | A single 2D point with an associated M double-precision measure. |
| 23 | PolylineM | Polyline with an M value per point; includes M range and array. |
| 25 | PolygonM | Polygon with M values; includes M range and array per ring. |
| 28 | MultiPointM | MultiPoint with M values per point; includes M range and array. |
| 31 | MultiPatch | A complex 3D surface composed of patches (e.g., triangle strips, fans, rings) using X,Y,Z coordinates; supports optional M and represents volumetric objects like buildings. |
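
The codes in the table lend themselves to a simple lookup that can drive the decoding of a record's content. The sketch below covers only the Null, Point, and Polyline/Polygon layouts described in the geometry format section; `SHAPE_TYPES` and `decode_record_content` are hypothetical names, and the Z, M, MultiPoint, and MultiPatch variants are deliberately left out.

```python
import struct

SHAPE_TYPES = {
    0: "Null", 1: "Point", 3: "Polyline", 5: "Polygon", 8: "MultiPoint",
    11: "PointZ", 13: "PolylineZ", 15: "PolygonZ", 18: "MultiPointZ",
    21: "PointM", 23: "PolylineM", 25: "PolygonM", 28: "MultiPointM",
    31: "MultiPatch",
}


def decode_record_content(content: bytes) -> dict:
    """Decode the content of one .shp record (the bytes after its 8-byte header)."""
    (code,) = struct.unpack_from("<i", content, 0)
    if code not in SHAPE_TYPES:
        raise ValueError(f"unknown shape type {code}")
    name = SHAPE_TYPES[code]
    if code == 0:
        return {"type": name}
    if code == 1:
        x, y = struct.unpack_from("<2d", content, 4)
        return {"type": name, "point": (x, y)}
    if code in (3, 5):
        bbox = struct.unpack_from("<4d", content, 4)
        num_parts, num_points = struct.unpack_from("<2i", content, 36)
        # Each part entry is the index of that part's first point.
        starts = list(struct.unpack_from(f"<{num_parts}i", content, 44))
        flat = struct.unpack_from(f"<{2 * num_points}d", content, 44 + 4 * num_parts)
        points = list(zip(flat[0::2], flat[1::2]))
        bounds = starts + [num_points]
        parts = [points[bounds[i]:bounds[i + 1]] for i in range(num_parts)]
        return {"type": name, "bbox": bbox, "parts": parts}
    raise NotImplementedError(f"{name} is not handled in this sketch")


# A two-part polyline assembled by hand for demonstration.
demo_polyline = (
    struct.pack("<i4d2i2i", 3, 0.0, 0.0, 3.0, 1.0, 2, 4, 0, 2)
    + struct.pack("<8d", 0.0, 0.0, 1.0, 1.0, 2.0, 0.0, 3.0, 1.0)
)
print(decode_record_content(demo_polyline))
```

Dispatching on the integer code keeps the decoder aligned with the table, and further branches for the Z and M variants could be added in the same style.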