Data file
A data file is a computer file that contains information or data intended to be read, viewed, processed, or manipulated by software applications, rather than executable code designed to be run by the operating system.[1] Unlike program files, which hold instructions for the computer to execute, data files serve as repositories for user-generated content, configuration settings, or application-specific information, such as text documents, spreadsheets, or images.[2] These files are typically identified by extensions like .txt, .csv, or .jpg, which indicate their format and intended use.[1] Data files form the backbone of information storage in computing systems, enabling the persistence of data across sessions and devices.[3]

They can be structured, such as database files organized into records and fields with primary keys for unique identification, or unstructured, like plain text files containing streams of bytes without a predefined schema.[4] Common examples include comma-separated values (CSV) files for tabular data, Extensible Markup Language (XML) files for hierarchical information, and binary formats for multimedia like JPEG images or WAV audio.[5] In operating systems, data files are managed through file systems that use metadata structures, such as inodes, to track their location, size, and access permissions on storage media.[6] The versatility of data files supports diverse applications, from simple text editing to complex data analysis in scientific computing.[7]

For long-term preservation, open and standard formats like CSV or PDF are recommended to ensure accessibility across different software and hardware environments.[8] As data volumes grow in modern computing, effective management of data files—including backup, versioning, and format migration—remains critical to prevent loss and maintain usability.[9]

Overview and Fundamentals
Definition and Scope
A data file is a computer file designed primarily to store structured or unstructured data for reading, writing, or processing by software applications, serving as a repository of information rather than instructions for execution. In operating system contexts, such files are often categorized as ordinary or regular files, which contain user-generated or application-produced content like text, numbers, or binary sequences, without inherent capability for direct execution by the processor. This core purpose distinguishes data files from other file types, emphasizing their role in data management and persistence across computing environments.[10][11]

The scope of data files extends to various applications where data storage and retrieval are central, including input and output streams for program execution, database records for organized information management, system logs for event tracking, and configuration files for software parameterization. These uses highlight data files' versatility in supporting computational workflows, from simple data exchange to complex analytical processing, while maintaining compatibility with file system abstractions that treat them as streams of bytes accessible via standard I/O operations. Notably, this scope deliberately excludes files dedicated to program code, such as source scripts or compiled binaries, and system files like device drivers, focusing instead on passive data containment.[12][13]

In contrast to executable or script files, which emphasize active execution to perform tasks, data files prioritize passive storage and interoperability, enabling applications to interpret and manipulate their contents without initiating runtime processes. This functional separation enhances system security and modularity, as operating systems handle data files through generic access controls rather than privileged execution paths. Data files may be represented in text-based or binary forms, providing a foundational distinction in their internal organization for human readability versus machine efficiency.[14][15]

Key Characteristics
Data files possess several core attributes that facilitate their management and interaction within computing systems. These include metadata such as file size, which indicates the amount of storage occupied by the file; permissions, which control access rights like read, write, and execute; and timestamps recording creation, modification, and access times.[16] Such metadata is typically stored in structures like inodes in Unix-like file systems, enabling efficient file system operations.[16]

The content of data files is encoded in specific formats to represent information accurately. Text-based data files often use encodings like ASCII for basic 7-bit characters or UTF-8 for broader Unicode support, allowing representation of international characters while maintaining compatibility with ASCII subsets.[17] In contrast, binary data files store information as raw sequences of bytes without a human-interpretable structure, optimized for machine processing and compactness.[16] Many data file formats incorporate extensibility through headers, which contain descriptive information such as version numbers or structural details, and footers, which may include summaries or checksums, allowing formats to evolve without breaking compatibility.[18]

Accessibility of data files varies by type and design. Text files are generally human-readable, enabling direct inspection and editing using standard tools, whereas binary files require specialized software for interpretation due to their opaque byte structure.[19] Portability across different systems and platforms relies on adherence to standardized formats; for instance, using character-based encodings like ASCII or UTF-8 enhances interoperability, while binary formats may demand additional conversion to handle variations in byte order or integer sizes.[20]

To ensure reliability, data files often include integrity features. Checksums or cryptographic hashes, such as MD5 or SHA, are computed over the file content to detect errors or tampering during storage or transfer by verifying against a reference value.[21][22] Versioning mechanisms track changes over time, preserving historical states through metadata annotations or separate file iterations, which supports data recovery and auditing in long-term storage scenarios.[23]
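As a brief illustration, the following Python sketch reads this kind of file system metadata and computes a content checksum using only the standard library; the file name example.txt is a placeholder.

```python
import hashlib
import os
import time

path = "example.txt"  # placeholder file name

# Metadata: size, permission bits, and timestamps tracked by the file system
info = os.stat(path)
print("size (bytes):", info.st_size)
print("permissions:", oct(info.st_mode & 0o777))
print("modified:", time.ctime(info.st_mtime))

# Integrity: a SHA-256 checksum computed over the file's content, read in chunks
digest = hashlib.sha256()
with open(path, "rb") as f:
    for chunk in iter(lambda: f.read(8192), b""):
        digest.update(chunk)
print("sha256:", digest.hexdigest())
```

Comparing the reported checksum against a previously recorded reference value is how tampering or corruption is typically detected in practice.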
Classification by Structure
Text-Based Files
Text-based files, also known as plain text files, consist of sequences of human-readable characters encoded using standards such as ASCII or Unicode, forming a linear arrangement without embedded formatting or binary elements.[24] These files typically organize data into lines separated by newline characters, with additional delimiters like tabs or spaces used to structure content within lines, facilitating straightforward parsing and display.[25] This character-based composition allows for universal accessibility across diverse systems, as long as the encoding is properly handled.[26]

A primary advantage of text-based files is their inherent human inspectability, enabling users to view and comprehend contents directly using basic text editors without specialized software.[27] This readability supports easy manual editing and debugging, making them ideal for collaborative or iterative development processes.[28] Additionally, they impose minimal overhead for storing simple data, as the format avoids the complexity of proprietary structures or compression layers, resulting in lightweight files suitable for quick operations.[27]

Despite these benefits, text-based files exhibit limitations in handling complex or voluminous datasets, where their lack of built-in compression leads to larger storage requirements compared to more optimized formats.[29] Encoding challenges, such as the presence of a Byte Order Mark (BOM) in UTF-8 files, can introduce invisible characters that disrupt parsing or display, potentially causing issues in cross-platform compatibility or automated processing.[30]

Common examples of text-based files include configuration files like those with .ini or .cfg extensions, which store settings in a simple key-value format organized into sections for software initialization.[31] Log files, typically ending in .log, record sequential events or system activities in timestamped lines for auditing and troubleshooting purposes.[32] Comma-Separated Values (CSV) files represent tabular data in delimited rows, widely used for exporting datasets from spreadsheets or databases due to their portability.[33]
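A minimal Python sketch of working with two of these text-based formats, using the standard library's configparser and csv modules on small invented strings:

```python
import configparser
import csv
import io

# An INI-style configuration: key-value settings grouped into sections
ini_text = "[server]\nhost = localhost\nport = 8080\n"
config = configparser.ConfigParser()
config.read_string(ini_text)
print(config["server"]["port"])          # -> 8080

# CSV: rows separated by newlines, fields separated by commas
csv_text = "name,score\nalice,90\nbob,85\n"
for row in csv.DictReader(io.StringIO(csv_text)):
    print(row["name"], row["score"])     # -> alice 90, bob 85
```

Because both formats are plain text, the same data could equally be inspected or edited by hand in any text editor.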
Binary Files
Binary files consist of sequences of bytes that encode data in its raw machine-readable form, without conversion to human-readable characters. This structure typically involves the binary representation of primitive data types, such as integers stored as fixed-width bit patterns (e.g., 32-bit two's complement for signed values) and floating-point numbers adhering to the IEEE 754 standard, which defines precise bit layouts for single and double precision formats.[34] Strings are often encoded as byte arrays, potentially with length prefixes, while complex data may be organized into fixed-length or variable-length records, sometimes preceded by headers that specify the overall format, including record counts or offsets.[35] The byte order within multi-byte values follows either big-endian (most significant byte first) or little-endian (least significant byte first) conventions, which can vary by hardware architecture.[36]

The primary advantages of binary files stem from their efficiency in storage and processing. By utilizing the full range of byte values without textual encoding overhead, they achieve compact representations; for instance, a 32-bit integer occupies exactly four bytes regardless of its value, contrasting with variable-length text encodings.[37] This compactness is particularly beneficial for numerical data, multimedia, and large datasets, enabling faster read/write operations and reduced memory usage during computation.[38] Additionally, binary formats support intricate data hierarchies, such as nested structures or arrays, allowing efficient serialization of object graphs or multidimensional arrays without the parsing costs associated with text-based alternatives.[35]

Despite these benefits, binary files present notable limitations related to accessibility and interoperability. They are inherently non-human-readable, necessitating specialized software or tools to interpret and edit their contents, as direct inspection in text editors reveals only gibberish.[39] Platform dependencies further complicate usage; differences in endianness can render a file incompatible across systems (e.g., x86 little-endian vs. some network protocols' big-endian), while structure padding—added bytes for data alignment to optimize hardware access—varies by compiler and architecture, potentially altering file layouts.[40] These issues demand careful consideration of metadata for cross-platform portability, often requiring explicit handling of byte order and alignment during implementation.[41]

Representative examples illustrate the versatility of binary files across applications. SQLite database files (.db) employ a binary format with a fixed 100-byte header, beginning with a 16-byte magic string and recording parameters such as the page size, followed by fixed-size B-tree pages containing variable-length records of serialized data.[42] In Java, serialized object files (.ser) use a stream protocol that encodes class descriptors, object instances, and references in a binary stream, supporting the persistence of entire object hierarchies.[43] Similarly, Python's pickle format serializes objects into a binary stream via protocols that handle references and custom types efficiently, supporting object persistence and data transfer.[44] For scientific computing, NetCDF files provide a binary container for multidimensional arrays, with headers defining dimensions, variables, and attributes in a self-describing structure optimized for array-oriented data.[45]
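The following Python sketch uses the standard struct module to illustrate fixed-width binary encoding and the effect of byte order; the packed values are arbitrary examples.

```python
import struct

# Pack a 32-bit signed integer and a 64-bit IEEE 754 double,
# explicitly little-endian ("<") so the layout does not depend on the host CPU
record = struct.pack("<id", 1024, 3.14)
print(len(record), record.hex())   # 12 bytes: 4 for the int, 8 for the double

# Round-trip with the correct byte order recovers the original values
value_le, _ = struct.unpack("<id", record)

# Reading the same 4 bytes with the wrong byte order yields a different number,
# illustrating why binary formats must fix endianness in their specification
value_be = struct.unpack(">i", record[:4])[0]
print(value_le, value_be)          # -> 1024 vs. 262144
```

Real binary formats such as SQLite or NetCDF make the same kinds of choices (field widths, byte order, header layout) explicit in their specifications so that files remain portable across architectures.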
Common Formats and Examples
Structured Data Formats
Structured data formats are file formats that employ predefined schemas, tags, records, or other mechanisms to enforce hierarchy and relations in data representation, enabling organized and machine-readable storage and exchange. These formats define parsing rules to interpret the structure, often using optional schemas to enforce additional constraints and distinguish them from unstructured data through organized representation.[46][47][48]

A prominent example is Extensible Markup Language (XML), which uses nested elements marked by start and end tags to create hierarchical structures, allowing for extensible markup that describes data relationships. XML documents consist of a single root element containing these nested components, with schemas such as Document Type Definitions (DTDs) or XML Schema Definition Language (XSD) imposing constraints on element types, attributes, and content to ensure structural integrity. Developed as a subset of Standard Generalized Markup Language (SGML), XML facilitates the representation of complex, tree-like data suitable for documents and configurations.[46][49]

JavaScript Object Notation (JSON) provides a lightweight alternative, structuring data through key-value pairs within objects (enclosed in curly braces) and ordered arrays (enclosed in square brackets), derived from the ECMAScript programming language. This format supports simple hierarchies like nested objects and arrays, making it ideal for web-based data interchange without the verbosity of markup languages. JSON's minimal syntax ensures portability across languages and systems, with values limited to strings, numbers, booleans, null, objects, or arrays.[47]

Another common structured format is Comma-Separated Values (CSV), which stores tabular data in plain text using commas or other delimiters to separate fields within records and newlines for rows, facilitating easy import into databases and spreadsheets. While it allows flexibility in field types without enforced validation, CSV follows a basic row-and-column schema defined in standards like RFC 4180.[50]

For binary efficiency, Protocol Buffers (Protobuf), developed by Google, serializes structured data using a schema defined in .proto files, which specify messages with fields of scalar or nested types compiled into language-specific code. Unlike text-based formats, Protobuf encodes data in a compact binary form, supporting extensibility through field numbers that allow backward-compatible updates without altering existing data. This mechanism is language-neutral and platform-independent, commonly used in distributed systems for its performance advantages over textual serialization.[48]
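To illustrate the contrast between the two textual formats, the following Python sketch serializes the same small, invented record as JSON and as XML using standard-library modules:

```python
import json
import xml.etree.ElementTree as ET

# The same illustrative record expressed in two structured formats
record = {"id": 7, "name": "sensor-a", "readings": [19.5, 20.1]}

# JSON: key-value objects and ordered arrays, serialized as compact text
print(json.dumps(record))
# -> {"id": 7, "name": "sensor-a", "readings": [19.5, 20.1]}

# XML: the same data as nested elements under a single root element
root = ET.Element("record", id="7")
ET.SubElement(root, "name").text = "sensor-a"
readings = ET.SubElement(root, "readings")
for value in record["readings"]:
    ET.SubElement(readings, "reading").text = str(value)
print(ET.tostring(root, encoding="unicode"))
# -> <record id="7"><name>sensor-a</name><readings>...</readings></record>
```

The JSON form is more compact, while the XML form carries the tag-based structure that schemas such as XSD can later validate.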
These formats enhance interoperability by standardizing data encoding and decoding across diverse systems and applications, reducing integration challenges in networked environments. For instance, XML's compatibility with internet protocols and JSON's language-independent design promote seamless exchange, while Protobuf ensures cross-language consistency through generated APIs.[46][47][48]
Validation against schemas is a core benefit, allowing enforcement of data constraints to prevent errors and maintain integrity; XML uses XSD to describe and constrain document contents, JSON employs JSON Schema vocabulary for structural checks, and Protobuf relies on .proto definitions during compilation and runtime. This capability supports automated verification, ensuring compliance before processing.[49][51][48]
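As an illustration of schema validation, the sketch below checks documents against a minimal JSON Schema using the third-party jsonschema Python package; the schema and documents are invented examples.

```python
from jsonschema import validate, ValidationError  # third-party "jsonschema" package

# A minimal schema constraining an object to an integer "id" and a string "name"
schema = {
    "type": "object",
    "properties": {"id": {"type": "integer"}, "name": {"type": "string"}},
    "required": ["id", "name"],
}

try:
    validate(instance={"id": 7, "name": "sensor-a"}, schema=schema)  # passes silently
    validate(instance={"id": "seven"}, schema=schema)                # violates the schema
except ValidationError as err:
    print("validation error:", err.message)
```

The analogous step for XML would validate a document against an XSD, and for Protobuf the constraints are enforced by the types declared in the .proto definition at compile time.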
The structured nature also makes querying easier, as hierarchies enable path-based navigation, such as traversing element trees in XML or accessing nested keys in JSON, which facilitates efficient data retrieval without requiring a full parse in many cases.[46][47]
Parsing structured data formats involves dedicated mechanisms to reconstruct the schema-defined hierarchy from the encoded file. For XML, the Document Object Model (DOM), a W3C standard API, represents the document as an in-memory tree of nodes, allowing programmatic access, manipulation, and traversal via methods like getElementById or querySelector. Libraries implementing DOM, such as those in JavaScript or Python's xml.dom, load the entire document for random access.[52][53]
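A short sketch of DOM-style parsing with Python's xml.dom.minidom, which implements a subset of the DOM API; the XML content is an invented example.

```python
from xml.dom import minidom

xml_text = "<library><book id='1'><title>Example</title></book></library>"

# Parse the whole document into an in-memory DOM tree
doc = minidom.parseString(xml_text)

# Navigate the node hierarchy and read attributes and element text
for book in doc.getElementsByTagName("book"):
    title = book.getElementsByTagName("title")[0]
    print(book.getAttribute("id"), title.firstChild.data)   # -> 1 Example
```

Because the entire tree is held in memory, any node can be reached in any order, which is the trade-off DOM makes relative to streaming parsers.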
JSON parsing leverages built-in or lightweight libraries across languages; for example, JavaScript's native JSON.parse() converts text to objects, while tools like Python's json module or Java's Jackson handle deserialization into native data structures for querying key-value pairs.[47]
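For example, a small JSON document can be deserialized and queried with Python's json module; the document shown is illustrative.

```python
import json

text = '{"user": {"name": "alice", "roles": ["admin", "dev"]}}'

# Deserialize JSON text into native Python dictionaries and lists
data = json.loads(text)

# Query nested keys and array elements directly
print(data["user"]["name"])       # -> alice
print(data["user"]["roles"][0])   # -> admin
```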
Protobuf parsing uses language-specific libraries generated from the .proto schema, where the protoc compiler produces code with methods like parseFrom() to deserialize binary streams into message objects, enabling direct field access with minimal overhead. This approach supports streaming and partial parsing for large datasets.[48]
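A hedged sketch of this workflow in Python, assuming a hypothetical person.proto schema compiled with protoc; the generated module person_pb2 and the Person message are illustrative placeholders, not part of any real project.

```python
# Assumes a hypothetical person.proto containing:
#   message Person { string name = 1; int32 id = 2; }
# compiled with "protoc --python_out=. person.proto", which generates person_pb2.py
import person_pb2

# Serialize a message to its compact binary wire format
original = person_pb2.Person(name="alice", id=7)
payload = original.SerializeToString()

# Deserialize the bytes back into a message object and access fields directly
restored = person_pb2.Person()
restored.ParseFromString(payload)
print(restored.name, restored.id)   # -> alice 7
```

The generated class handles field numbering and wire encoding, so application code works with typed attributes rather than raw bytes.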