
Data

Data is the representation of facts, concepts, or instructions in a manner suitable for communication, interpretation, or processing by humans or by automatic means. In the context of computing and information technology, data serves as the foundational element for storage, analysis, and decision-making, often existing in raw form before being transformed into meaningful information through processing. Data can be broadly categorized by its structure and by its qualitative or quantitative character. Structured data is highly organized and follows a predefined format, such as rows and columns in relational databases, making it easily searchable and analyzable using standard tools like SQL. Unstructured data, in contrast, lacks a fixed schema and includes diverse formats like emails, images, videos, and social media posts, comprising the majority of data generated today and requiring advanced processing techniques for extraction. Additionally, data is distinguished as quantitative or qualitative: quantitative data consists of numerical values that can be measured and statistically analyzed, such as sales figures or temperatures, while qualitative data involves descriptive, non-numerical observations, like customer feedback or survey responses. The significance of data has grown exponentially in modern society, driving advancements across science, business, and public administration through data-driven insights. For instance, organizations leverage data analytics to enhance decision-making, optimize operations, and predict trends, resulting in improved efficiency and outcomes. In scientific research, large-scale data enables complex modeling and discovery, reshaping disciplines across the natural and social sciences. However, this proliferation also raises challenges, including privacy concerns, data management burdens, and ethical use, necessitating robust standards and regulations.

Fundamentals

Etymology and Terminology

The word "data" originates from the Latin datum, the neuter past participle of dare meaning "to give," thus translating to "something given" or "a thing granted." As the plural form data, it entered English in the mid-17th century, with the recording its earliest evidence in 1645 in the writings of Scottish author and translator , where it referred to facts or propositions given as a basis for reasoning or in scientific and mathematical contexts. Initially borrowed directly from Latin scientific texts, the term appeared in English via scholarly works emphasizing empirical observations and computations. A historical milestone in the application of data occurred in 1662 with John Graunt's Natural and Political Observations Made upon the , which analyzed parish records to derive demographic patterns, representing one of the earliest systematic uses of aggregated numerical data in what is now recognized as , even though Graunt himself did not employ the specific term "data." The concept gained further traction in scientific discourse throughout the 17th and 18th centuries. By the , "data" was widely adopted in computing, notably by in naming its systems, such as the 1953 Electronic Data Processing Machine, which processed large volumes of numerical for business and scientific purposes, solidifying the term's role in technological contexts. In the , particularly with the expansion of , the usage of "data" evolved from its traditional form—taking verbs like "are"—to a treated as singular, as in "data is," reflecting its conceptualization as an undifferentiated collection rather than discrete items; Ngram analysis shows the singular form rising from a minority in the early to parity with the by the late . Key terminological distinctions include , defined as unprocessed facts, figures, or symbols without inherent meaning or context, versus , which arises when is organized, processed, and interpreted to convey significance, as outlined in standards like the U.S. Department of Defense's . Modern style guides address the singular/ debate: the (APA) recommends treatment ("data are") in formal and scientific writing for precision, while permits either, favoring singular for general audiences but in technical contexts to honor the word's Latin roots.

Definitions and Meanings

Data is defined as the representation of facts, concepts, or instructions in a formalized manner suitable for communication, interpretation, or processing by humans or automated systems. This encompasses numerical values, textual descriptions, symbolic notations, or other discrete units that capture observations or measurements without inherent meaning or significance on their own. For instance, raw readings from a sensor recording temperatures at specific intervals exemplify data as unprocessed inputs awaiting interpretation. A key distinction lies between data and related concepts like information, where data serves as the raw, unstructured foundation, while information emerges from its organization, contextualization, and interpretation to convey meaning. This relationship is formalized in the DIKW hierarchy, which progresses from data (basic symbols or signals) to information (processed and related facts), knowledge (applied understanding through patterns and rules), and wisdom (evaluative judgment for decision-making). The hierarchy, commonly attributed to Russell Ackoff's 1989 formulation, underscores that data alone lacks meaning until transformed, as seen in examples like isolated numbers from a database becoming meaningful sales trends when aggregated and analyzed. In philosophical contexts, data refers to empirical observations or sense-data that form the basis of perceptual and epistemic justification, distinct from interpretive thought. These are immediate sensory impressions, such as visual or auditory inputs, that philosophers like Bertrand Russell analyzed as the immediate objects of perception grounding knowledge claims. In legal settings, data functions as evidentiary facts—recorded information or predicate details that support inferences in judicial proceedings, such as digital logs or witness statements admissible under rules like Federal Rule of Evidence 703. Everyday usage treats data as personal records, including health metrics, financial transactions, or browsing histories, which individuals manage for practical purposes like budgeting or fitness tracking. Since the early 2000s, the meaning of data has evolved to incorporate digital traces of user behavior, driven by the rise of online platforms and analytics, where unstructured logs from social interactions and online activities are treated as valuable raw inputs for predictive modeling. This shift, exemplified by the growth of user-generated content on early social media sites, expanded data's scope beyond traditional records to encompass behavioral patterns analyzed for prediction and personalization.
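
The data-to-information step described above can be made concrete with a minimal pandas sketch: a few raw transaction records (the column names and values are hypothetical) are aggregated into a monthly sales trend, turning isolated numbers into a summary that carries meaning.

```python
# Illustrative sketch: raw transaction records (data) aggregated into a
# monthly sales trend (information). Records and column names are invented.
import pandas as pd

raw = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-03", "2024-01-17", "2024-02-05", "2024-02-20"]),
    "amount": [120.0, 75.5, 210.0, 99.9],
})

# Isolated numbers become a meaningful trend once grouped and summarized.
monthly_trend = raw.groupby(raw["date"].dt.to_period("M"))["amount"].sum()
print(monthly_trend)
```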

Types of Data

Data can be broadly classified into qualitative and quantitative types based on its nature and measurability. Qualitative data, also known as categorical data, consists of non-numerical information that describes qualities, characteristics, or attributes, such as text, images, audio, or observations that capture themes, patterns, or meanings without assigning numerical values. In contrast, quantitative data is numerical and measurable, allowing for mathematical operations like counting, averaging, or statistical analysis; it includes values such as heights, temperatures, or sales figures that represent quantities or amounts. This distinction is fundamental in research and analysis, where qualitative data provides depth and context, while quantitative data enables precision and generalizability. Another key categorization distinguishes structured from unstructured data based on organization and format. Structured data is highly organized and stored in a predefined format, such as rows and columns in relational databases or spreadsheets, making it easily searchable, analyzable, and integrable with tools like SQL; examples include customer records in a transactional system or sensor readings in fixed schemas. Unstructured data, comprising about 80-90% of all data generated today, lacks a predetermined structure and includes free-form content like emails, social media posts, videos, or documents that require advanced processing techniques for extraction and interpretation. This divide impacts storage, processing efficiency, and application, with structured data suiting traditional databases and reporting and unstructured data fueling modern AI-driven insights. Additional classifications refine these categories further. Discrete data consists of distinct, countable values with no intermediate points, such as the number of items sold (integers) or categories like gender, which can only take specific, separated states. Continuous data, however, forms a continuum of possible values within a range, measurable to any degree of precision, as in weight, time, or temperature, often represented by real numbers. Separately, primary data is original information collected firsthand by the researcher for a specific purpose, through methods like surveys or experiments, ensuring direct relevance but requiring more resources. Secondary data, derived from existing sources compiled by others, such as published reports or databases, offers broader scope and cost savings but may introduce biases or outdated elements. Emerging types of data reflect evolving technological and analytical needs. Big data is characterized by the "three Vs"—volume (massive scale of data generation), velocity (rapid speed of data creation and processing), and variety (diverse formats from structured to unstructured sources)—demanding innovative handling beyond traditional systems, as formalized in Gartner's 2011 definition. Metadata, or "data about data," provides descriptive context for other data, including details like creation date, author, format, or location, standardized by ISO/IEC 11179 to facilitate discovery and interoperability across systems. Spatiotemporal data integrates spatial (location-based, e.g., coordinates) and temporal (time-based) dimensions, capturing changes over geographic areas and periods, essential in fields like GIS for modeling phenomena such as climate patterns or urban growth. These types underscore the increasing complexity and interconnectedness of data in contemporary applications.
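
As a minimal illustration of these distinctions, the sketch below (with invented values) places discrete, continuous, and categorical fields in a structured record alongside an unstructured free-text record that would need further processing before analysis.

```python
# Minimal sketch contrasting the data types described above; values are invented.
structured_record = {          # structured: predefined, tabular-style fields
    "customer_id": 1042,       # quantitative, discrete
    "purchase_total": 59.99,   # quantitative, continuous
    "segment": "returning",    # qualitative, categorical
}

unstructured_record = (        # unstructured: free text requiring extraction
    "Loved the fast shipping, but the packaging was damaged on arrival."
)
```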

Acquisition

Data Sources

Data sources refer to the origins and generators from which data is derived, encompassing both natural phenomena and artificial systems that produce recordable information for observation, measurement, or analysis. These sources provide the foundational inputs for analysis across scientific, social, and technological domains, generating diverse forms of information that reflect environmental conditions, biological processes, human activities, and machine operations. Natural sources yield data through inherent environmental and biological processes, often captured via observational tools. Environmental sources include weather sensors that monitor atmospheric variables such as temperature, humidity, and pressure, providing essential measurements for climate analysis; for instance, automated weather stations on the Juneau Icefield record meteorological parameters to track glacial changes. Geological samples from field trials and laboratory experiments offer data on subsurface structures and resource compositions, supporting assessments of natural systems like aquifers and mineral deposits. Biological sources encompass DNA sequences extracted from organisms, stored in databases as sets of genetic records to enable genomic research and population studies. Medical scans, such as magnetic resonance imagery, generate imaging data that reveal internal biological structures, contributing to diagnostics and heritability analyses when fused with genomic information. Human-generated sources arise from intentional activities and interactions, producing records reflective of societal behaviors. Surveys conducted by government agencies collect demographic and economic indicators, forming datasets that track population trends and policy impacts. Social media posts on networking and microblogging platforms capture user-generated content, including text, images, and interactions, which serve as records of public sentiment and network dynamics. Transaction logs from e-commerce and financial systems record user activities such as purchases and account accesses, providing chronological data for behavioral analysis. Archival records, including digitized documents and web archives, preserve human outputs like emails and news comments for longitudinal studies of behavior and history. Technological sources leverage engineered systems to automate data production, often at scale and in real time. Internet of Things (IoT) devices, such as smart sensors in urban environments, generate streams of operational data from physical assets like vehicles and appliances, enabling monitoring and efficiency optimizations. Satellites equipped with remote sensing instruments collect Earth observation data, including multispectral imagery for environmental monitoring and change detection. Automated systems like web scraping tools extract structured data from online sources, such as product listings or news feeds, facilitating large-scale aggregation for analysis. The historical development of data sources illustrates a progression from manual to digital methods, driven by technological advancements. Prior to the 1900s, data primarily originated from ledgers and paper-based records, such as handwritten enumerations in early censuses that relied on in-person visits and tally sheets. The U.S. Census, initiated in 1790, exemplifies this era, with marshals manually collecting household data using pens and bound volumes. By the post-1980s period, the shift to electronic sensors and networked systems transformed sources, enabling automated capture through computers and connected devices; for example, the U.S. Census Bureau transitioned from manual counting (1790–1880) to mechanical tabulators in 1890 and fully computerized processing by the late 20th century. This evolution expanded source diversity, from analog instruments to interconnected digital networks, vastly increasing data volume and variety.

Data Collection Methods

Data collection methods encompass a range of techniques designed to acquire data from various sources in a systematic and reliable manner. These methods are essential for ensuring the quality, accuracy, and relevance of data for subsequent analysis, with choices depending on the objectives, available resources, and the nature of the data being sought. Observational, experimental, sampling, and digital approaches each offer distinct advantages in capturing real-world phenomena or controlled outcomes, while ethical protocols safeguard participant rights throughout the process. Observational methods involve passive observation and recording of phenomena without intervention, allowing for the collection of naturally occurring data. Direct techniques, such as using calibrated thermometers to record temperature, provide precise in-situ readings by placing instruments in direct contact with the subject or environment. For instance, in hydrological studies, thermistor thermometers are submerged in stream water to measure water temperature accurately, following standardized protocols to minimize errors. Remote sensing extends this approach by acquiring data from a distance, often via satellites or aircraft equipped with optical or radar sensors, which detect reflected or emitted radiation to map large-scale environmental features without physical access. This method is particularly valuable for inaccessible areas, such as planetary surfaces or vast ecosystems, where direct measurement is impractical. Experimental methods generate data through controlled interventions to test hypotheses and establish causal relationships. In scientific fields like physics, laboratory trials manipulate variables under tightly controlled conditions—such as varying magnetic fields in experiments on particle behavior—to isolate effects and measure outcomes precisely. These setups ensure reproducibility and reduce confounding factors, enabling researchers to draw reliable inferences from the observed results. In software development, A/B testing serves as a digital analog, where two variants of an application or webpage are randomly assigned to user groups to compare performance metrics like engagement or conversion rates. This randomized controlled approach, akin to clinical trials, helps optimize user experiences by identifying superior designs based on empirical evidence. Sampling techniques are employed to select subsets of a population efficiently, reducing costs and time while maintaining representativeness to minimize bias. Random sampling assigns equal probability to each unit, ensuring unbiased estimates but potentially requiring large sample sizes for precision. Stratified sampling divides the population into homogeneous subgroups (strata) based on key characteristics, then samples proportionally from each to improve precision, particularly when subpopulations vary significantly. Cluster sampling groups the population into clusters and randomly selects entire clusters for observation, which is cost-effective for geographically dispersed populations but may introduce higher variance if clusters are heterogeneous. To optimize efficiency in stratified designs, Neyman allocation distributes sample sizes across strata in proportion to their size and variability, assigning n_h = n \frac{N_h \sigma_h}{\sum_i N_i \sigma_i} to stratum h and thereby minimizing overall variance, as derived in Jerzy Neyman's foundational 1934 work on stratified sampling theory. Digital methods leverage computational tools to gather vast quantities of data from online environments at scale. Application Programming Interfaces (APIs) enable programmatic access to structured data from platforms, such as querying databases for weather records or social media metrics, by sending standardized requests and receiving responses in formats like JSON.
Web crawling, or scraping, involves automated scripts that navigate websites, extract unstructured content like text or images, and compile it into usable datasets, often using libraries like BeautifulSoup for parsing HTML. Crowdsourcing platforms distribute microtasks to distributed workers; for example, Amazon Mechanical Turk, launched in 2005, allows researchers to collect labeled data through human intelligence tasks, such as image annotation, facilitating large-scale annotation efforts. Ethical considerations are integral to all data collection methods, particularly when involving human subjects, to protect privacy and autonomy. Informed consent requires providing participants with clear information about the study's purpose, procedures, risks, benefits, and data usage, ensuring voluntary participation without coercion. This process, often documented via signed consent forms or verbal agreements, allows individuals to withdraw at any time and fosters trust, as emphasized in federal regulations for human subjects research.
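
A brief sketch of the two digital collection routes described above follows. The endpoint URL, query parameters, and CSS selector are hypothetical placeholders, not real services; the pattern—standardized request returning JSON, or fetched HTML parsed with BeautifulSoup—is what matters.

```python
# Hedged sketch of digital data collection: an API request returning JSON and
# a page scrape parsed with BeautifulSoup. URLs and field names are invented.
import requests
from bs4 import BeautifulSoup

# API access: send a standardized request, receive structured JSON.
resp = requests.get(
    "https://api.example.com/v1/weather",
    params={"station": "CAMP10", "date": "2024-06-01"},
    timeout=10,
)
records = resp.json()  # e.g. a list of {"time": ..., "temp_c": ...} dictionaries

# Web scraping: fetch a page and extract content into a usable list.
page = requests.get("https://example.com/products", timeout=10)
soup = BeautifulSoup(page.text, "html.parser")
titles = [h.get_text(strip=True) for h in soup.select("h2.product-title")]
```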

Storage and Management

Data Formats and Documents

Data formats define the structure and representation of data in files or records, enabling efficient storage, interchange, and processing across systems. These formats vary based on the data's nature, such as tabular for structured records, hierarchical for nested relationships, and binary for compact, machine-readable encoding. Tabular formats like comma-separated values (CSV) organize data into rows and columns separated by delimiters, with CSV formalized in RFC 4180 as a standard for text-based tabular interchange. Spreadsheets, such as Microsoft Excel, introduced in 1985 for the Macintosh, extend this by providing interactive tabular documents with formulas and formatting. Hierarchical formats represent data as trees or nested structures, suitable for complex, interrelated information. Extensible Markup Language (XML), a W3C Recommendation since 1998, uses tags to define hierarchical elements for document and data exchange. JavaScript Object Notation (JSON), derived from JavaScript and standardized in RFC 8259, offers a lightweight alternative with key-value pairs and arrays for web APIs and configuration files. Binary formats encode data directly in machine-readable bytes to minimize size and parsing overhead; for instance, JPEG compresses images lossily under ISO/IEC 10918, finalized in 1992. Relational databases, queried via Structured Query Language (SQL), store tabular data in binary files or blocks for efficient indexing and transactions, as seen in systems like MySQL, first developed in 1995. Data documents encompass tools and systems for managing formatted records. Spreadsheets like Excel support user-editable tabular data with built-in computation, evolving from early versions to handle millions of cells. NoSQL databases, such as MongoDB, released in 2009, use document-oriented binary storage for flexible, schema-less hierarchical data in JSON-like documents. These documents facilitate usability by combining format with metadata, such as headers in CSV files or indexes in databases. Standardization efforts ensure interoperability in data representation. The Resource Description Framework (RDF), a W3C Recommendation from 2004, provides a schema for data as triples in graph structures, enabling semantic interoperability across domains. Electronic Data Interchange (EDI), with ANSI X12 standards established in 1979, defines protocols for structured business document exchange, reducing errors in supply chain transactions. The evolution of data formats reflects technological advances in storage and scale. Punch cards, pioneered by Herman Hollerith in the 1890s for the U.S. Census, encoded data as perforations for mechanical tabulation, marking an early shift from paper to automated processing. Modern cloud-native formats like Apache Parquet, introduced in 2013 by Twitter and Cloudera, employ columnar binary storage optimized for analytics, compressing and partitioning datasets for distributed systems like Hadoop. This progression from rigid, physical media to efficient, scalable digital structures has enabled handling vast, diverse data volumes.
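
The tabular and hierarchical formats discussed above can be contrasted with a short sketch: the same kind of record expressed as CSV (flat rows and columns, per RFC 4180) and as JSON (nested keys and arrays, per RFC 8259), read with Python's standard-library parsers. The fields are illustrative.

```python
# Sketch of one record in tabular (CSV) and hierarchical (JSON) form.
import csv
import io
import json

csv_text = "id,name,joined\n7,Ada,1843-12-10\n"
json_text = '{"id": 7, "name": "Ada", "joined": "1843-12-10", "tags": ["pioneer"]}'

row = next(csv.DictReader(io.StringIO(csv_text)))  # tabular: flat fields only
doc = json.loads(json_text)                        # hierarchical: supports nesting
print(row["name"], doc["tags"])
```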

Data Preservation and Longevity

Data preservation involves a range of strategies designed to ensure that digital information remains intact, accessible, and usable over long periods, countering the inherent fragility of digital media. Key techniques include regular backups, which can be full—capturing an entire system or dataset—or incremental, recording only changes since the last backup to optimize storage and time efficiency. Another critical method is migration, where information is transferred to newer formats or storage media to prevent obsolescence, such as converting legacy files from outdated systems to contemporary, widely supported standards for long-term archiving. Emulation further supports preservation by simulating obsolete hardware and software environments, allowing access to data on formats like early floppy disks without original equipment. Despite these approaches, several challenges threaten data longevity. Bit rot, or silent data corruption, occurs when errors accumulate in storage media over time due to hardware degradation or transmission faults, potentially rendering files unreadable without detection. Format obsolescence exacerbates this issue; for instance, floppy disks from the 1980s and 1990s became largely unreadable by the 2020s as compatible drives vanished from common use. Environmental factors also pose risks, including the high energy demands of data centers for cooling to prevent overheating, which can lead to hardware failures if power or climate controls falter. To address these challenges systematically, international standards and initiatives have emerged. The Open Archival Information System (OAIS) reference model, formalized in ISO 14721 in 2003, provides a framework for creating and maintaining digital archives, emphasizing ingest, archival storage, and dissemination processes to ensure long-term viability. Organizations like the Internet Archive, founded in 1996, exemplify practical implementation through vast digital repositories that preserve websites and other media via web crawling and redundant storage. Archival laws further institutionalize these efforts; in the United States, the National Archives Act of 1934 established federal requirements for preserving government records, later extended to digital formats. Metrics for assessing data longevity highlight the urgency of proactive preservation. Estimates for the usable lifespan of digital scientific data without intervention vary by field and storage medium, often falling within years to decades due to format shifts and hardware evolution. Such figures underscore the need for ongoing migration and curation to extend usability beyond this threshold.
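
In practice, bit rot is detected through fixity checks: a checksum recorded when a file is archived is periodically recomputed and compared. The sketch below illustrates the idea with Python's hashlib; the file path and recorded digest are placeholders.

```python
# Minimal fixity-checking sketch: recompute a stored checksum to detect
# silent corruption (bit rot). Path and recorded digest are placeholders.
import hashlib
from pathlib import Path

def sha256_of(path: Path) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # read in 1 MiB chunks
            h.update(chunk)
    return h.hexdigest()

recorded = "..."  # digest recorded when the file was first archived
current = sha256_of(Path("archive/report-1998.pdf"))
if current != recorded:
    print("Fixity check failed: possible bit rot; restore from a redundant copy.")
```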

Data Accessibility and Retrieval

Data accessibility and retrieval encompass the technologies and protocols that enable efficient location, access, and sharing of data across systems. Retrieval systems rely on indexes to optimize query performance by creating structured pointers to data, allowing query engines to avoid full table scans during searches. For instance, in relational databases, SQL queries use indexes to retrieve specific records rapidly, forming the backbone of structured data access. Search engines like Elasticsearch, first released in February 2010, extend this capability to unstructured and large-scale data through distributed indexing and full-text search. Additionally, APIs facilitate data sharing by providing standardized interfaces for programmatic access, enabling seamless integration between disparate systems without direct database exposure. Key principles guide the design of accessible data systems, emphasizing openness and usability. The open data movement promotes public release of government and institutional data under permissive licenses, as outlined in the International Open Data Charter adopted in 2015 by over 170 governments and organizations. Complementing this, the FAIR principles—Findable, Accessible, Interoperable, and Reusable—provide a framework for scientific data stewardship, introduced in a 2016 paper to ensure data can be discovered and utilized by both humans and machines. These principles advocate for persistent identifiers, metadata standards, and open protocols to enhance discoverability and reuse. Despite these advancements, significant barriers hinder data accessibility. Paywalls restrict access to subscription-based datasets, limiting availability to paying users or institutions. Proprietary formats, such as certain vendor-specific file structures, impede interoperability by requiring specialized software for decoding. Digital divides exacerbate these issues; as of 2023, approximately 33% of the global population—over 2.6 billion people—lacks Internet access, primarily in low-income regions. To address these challenges, specialized tools support data discovery and management. Data catalogs like Google Dataset Search, launched in beta in 2018, index over 25 million datasets from repositories worldwide, allowing users to search and filter based on metadata. For tracking changes and maintaining lineages, version control systems such as Git, often extended with tools like Git LFS for large files, enable collaborative data versioning and audit trails. These mechanisms ensure that data retrieval remains reliable and traceable, fostering broader usability while respecting preservation needs.
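
To make index-backed retrieval concrete, the sketch below uses Python's built-in sqlite3 module to create a table, add an index, and run a filtered query; the table, columns, and values are invented for illustration.

```python
# Sketch of index-backed retrieval with sqlite3; schema and values are invented.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE readings (station TEXT, day TEXT, temp_c REAL)")
con.executemany(
    "INSERT INTO readings VALUES (?, ?, ?)",
    [("A", "2024-06-01", 14.2), ("B", "2024-06-01", 11.7)],
)

# The index lets the engine jump to matching rows instead of scanning the table.
con.execute("CREATE INDEX idx_station ON readings(station)")
rows = con.execute(
    "SELECT day, temp_c FROM readings WHERE station = ?", ("A",)
).fetchall()
print(rows)
```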

Processing and Analysis

Data Processing Techniques

Data processing techniques encompass a range of methods used to clean, transform, and prepare data for subsequent analysis or storage, ensuring accuracy, consistency, and usability. These techniques address common issues in datasets, such as incompleteness, inconsistencies, and varying scales, which can otherwise lead to erroneous outcomes in downstream applications. Cleaning focuses on identifying and rectifying errors, while transformation standardizes data formats and structures. Automation through scripting and pipelines further enhances efficiency, particularly in large-scale environments. Cleaning is a foundational step that involves handling missing values and detecting outliers to maintain data integrity. Missing values can be addressed through imputation methods, such as mean substitution, where absent entries are replaced with the average of observed values in the same feature; this approach is simple and preserves the dataset size but may introduce bias if the data is not randomly missing. For outlier detection, the Z-score method calculates the standardized distance of a data point from the mean, defined as z = \frac{x - \mu}{\sigma}, where \mu is the mean and \sigma is the standard deviation; values with |z| > 3 are typically flagged as potential outliers, as they deviate significantly from the normal distribution under the assumption of approximate normality. These techniques are essential for mitigating the impact of anomalies. Transformation techniques prepare data by rescaling, encoding, and aggregating features to make them compatible with analytical models. Normalization via min-max scaling rescales features to a fixed range, usually [0, 1], using the formula x' = \frac{x - \min(X)}{\max(X) - \min(X)}, which preserves the relative relationships while bounding values to prevent dominance by large-scale features in algorithms like distance-based clustering. For categorical variables, one-hot encoding converts them into binary vectors, creating a new column for each category with 1 indicating presence and 0 otherwise; this avoids ordinal assumptions and enables numerical processing, though it increases dimensionality for high-cardinality features. Aggregation summarizes data by grouping, such as summing daily sales figures to monthly totals, which reduces granularity and computational load while highlighting trends like seasonal patterns. ETL (Extract-Transform-Load) processes form structured pipelines for integrating data from disparate sources into a unified repository. In ETL, data is first extracted from operational databases or files, then transformed to resolve inconsistencies—such as standardizing formats or applying business rules—and finally loaded into a target system like a data warehouse; this paradigm originated in the 1970s for mainframe data integration and remains central to data warehousing. Tools like Apache Airflow, released in 2015 by Airbnb as an open-source orchestrator, automate these pipelines by defining dependencies as directed acyclic graphs (DAGs), enabling scheduling and monitoring of complex ETL jobs. Automation in data processing leverages scripting and stream-processing paradigms to handle large volumes and continuous flows. The Python library pandas, developed by Wes McKinney starting in 2008, provides data structures like DataFrames for efficient cleaning and transformation operations, such as filling missing values or applying encoding via built-in functions, making it a standard for interactive data manipulation.
Processing can occur in batch mode, where fixed datasets are handled offline, or stream mode for real-time ingestion; Apache Kafka, introduced in 2011 by LinkedIn as a distributed messaging system, supports stream processing by enabling low-latency publish-subscribe pipelines that handle millions of events per second, contrasting with batch systems like Hadoop by processing data incrementally as it arrives.
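
A condensed pandas sketch of the cleaning and transformation steps described above follows; the column names and values are hypothetical, and the thresholds simply mirror the conventions stated in the text.

```python
# Hedged sketch of cleaning and transformation with pandas; data is invented.
import pandas as pd

df = pd.DataFrame({
    "amount": [10.0, 12.5, None, 11.0, 250.0],   # one missing value, one large value
    "channel": ["web", "store", "web", "app", "store"],
})

# Cleaning: mean imputation, then z-score outlier flagging (|z| > 3).
df["amount"] = df["amount"].fillna(df["amount"].mean())
z = (df["amount"] - df["amount"].mean()) / df["amount"].std()
df["is_outlier"] = z.abs() > 3

# Transformation: min-max scaling to [0, 1] and one-hot encoding.
rng = df["amount"].max() - df["amount"].min()
df["amount_scaled"] = (df["amount"] - df["amount"].min()) / rng
df = pd.get_dummies(df, columns=["channel"])
print(df)
```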

Data Analysis and Interpretation

Data analysis and interpretation involve applying statistical and computational methods to processed datasets to uncover patterns, test hypotheses, and derive actionable insights. This process builds on cleaned and structured data, transforming raw information into meaningful knowledge that informs decision-making across various domains. Key approaches include descriptive, inferential, and predictive analyses, each serving distinct purposes in summarizing, generalizing, and forecasting from data. Descriptive analysis focuses on summarizing the main characteristics of a dataset without making broader inferences about a population. Central tendency measures such as the mean, which calculates the arithmetic average of values, and the median, which identifies the middle value in an ordered dataset, provide essential overviews of data distributions. These summaries help identify trends and outliers; for instance, the mean is sensitive to extreme values, while the median offers robustness in skewed distributions. Visualizations enhance this process: histograms display the frequency distribution of continuous variables by dividing data into bins, revealing shape, central tendency, and variability. Scatter plots, meanwhile, illustrate relationships between two continuous variables, plotting points to highlight potential correlations or clusters. Inferential analysis extends descriptive insights to make probabilistic statements about a larger population based on sample data. Hypothesis testing evaluates claims about population parameters; for example, the Student's t-test, developed by William Sealy Gosset in 1908 under the pseudonym "Student," assesses whether observed differences between sample means are statistically significant, accounting for small sample sizes through the t-distribution. This method assumes normality and equal variances, yielding a p-value that indicates the probability of the result occurring by chance. Confidence intervals complement hypothesis testing by providing a range of plausible values for a population parameter, such as the mean, with a specified level of confidence (e.g., 95%), derived from sample statistics and their sampling distributions. These tools enable generalization while quantifying uncertainty, though they require careful consideration of assumptions to avoid misleading conclusions. Predictive analysis employs models to forecast future outcomes or classify new data points. Linear regression, pioneered independently by Adrien-Marie Legendre in 1805 and Carl Friedrich Gauss around 1795, models the linear relationship between a dependent variable y and one or more independent variables x using the equation y = mx + b, where m represents the slope and b the intercept, minimizing the sum of squared residuals via least squares estimation. This approach assumes linearity, independence, and homoscedasticity, making it foundational for predicting continuous outcomes like sales or temperatures. In machine learning, decision trees extend predictive capabilities by recursively partitioning data based on feature thresholds to minimize impurity or variance; the Classification and Regression Trees (CART) algorithm, introduced by Leo Breiman and colleagues in 1984, uses Gini impurity for classification and variance reduction for regression, creating interpretable tree structures that handle nonlinear relationships without assuming data distribution. These methods facilitate forecasting but demand validation to ensure generalizability. Interpreting analytical results presents significant challenges, particularly in distinguishing correlation from causation and avoiding practices like p-hacking.
Correlation measures the strength and direction of linear associations between variables, but it does not imply causation, as confounding factors or reverse causality may explain observed patterns; for instance, ice cream sales correlate with drownings due to seasonal weather, not direct influence. P-hacking involves selectively analyzing data—such as choosing subsets, transformations, or multiple tests—until a statistically significant p-value (typically <0.05) emerges, inflating false positives and undermining reliability. The replication crisis, highlighted by the Open Science Collaboration's 2015 study replicating 100 psychological experiments, revealed that only 36% produced significant results compared to 97% in originals, attributing low reproducibility to p-hacking, publication bias, and underpowered studies, prompting calls for preregistration and transparency in the 2010s.
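
As a compact illustration of the three approaches, the sketch below computes a mean and median (descriptive), runs a two-sample t-test (inferential), and fits a least-squares line (predictive) on made-up samples; the numbers are purely illustrative and the code is a sketch, not a recommended workflow.

```python
# Descriptive, inferential, and predictive steps on invented samples.
import numpy as np
from scipy import stats

a = np.array([4.1, 3.8, 4.4, 4.0, 3.9])
b = np.array([4.6, 4.9, 4.4, 4.8, 4.7])

print(a.mean(), np.median(a))              # descriptive: mean and median
t_stat, p_value = stats.ttest_ind(a, b)    # inferential: two-sample t-test
print(t_stat, p_value)

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])
m, b0 = np.polyfit(x, y, 1)                # predictive: least-squares line y = mx + b
print(m, b0)
```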

Applications and Implications

Data in Computing and Information Science

In computing and information science, data is fundamentally organized using data structures to enable efficient storage, retrieval, and manipulation within algorithms and programs. Basic data structures include arrays, which provide contiguous memory allocation for fast indexed access with O(1) retrieval but O(n) search in unsorted cases; linked lists, which allow dynamic insertion and deletion in O(1) time per operation at known positions through pointer-based connections; trees, such as balanced binary search trees that support O(log n) operations for search, insert, and delete; and graphs, which model relationships via nodes and edges, with traversal algorithms like breadth-first search achieving O(V + E) efficiency where V is vertices and E is edges. These structures are essential for optimizing computational performance, as analyzed through Big O notation, which upper-bounds the growth rate of resource usage relative to input size. Information theory formalizes data's quantitative aspects, particularly through Claude Shannon's concept of entropy, which measures the uncertainty or average information content in a message source. Introduced in 1948, Shannon entropy is defined as H = -\sum_{i=1}^{n} p_i \log_2 p_i where p_i is the probability of each possible symbol in the source, providing a foundation for data compression techniques like Huffman coding that minimize redundancy by assigning shorter codes to frequent symbols, and for quantifying channel capacity in noisy communication systems. This metric underpins modern data encoding and error-correcting codes, ensuring reliable transmission while maximizing efficiency. Database management systems (DBMS) handle persistent storage and concurrent access, enforcing reliability through ACID properties: Atomicity ensures transactions complete fully or not at all; Consistency maintains integrity rules; Isolation prevents interference between concurrent operations; and Durability guarantees committed changes survive failures. These principles, building on Jim Gray's 1981 work on transaction concepts and formalized in full by Härder and Reuter in 1983, enable robust operations in relational databases like SQL Server. For large-scale data, frameworks like Apache Hadoop, founded as an Apache project in 2006 as an open-source implementation inspired by Google's MapReduce and GFS, distribute processing across clusters using HDFS for fault-tolerant storage of petabyte-scale datasets. Modern trends in data handling emphasize scalability and decentralization, with data lakes emerging in the 2010s as repositories for raw data in its native format, allowing schema-on-read processing without upfront transformation. Coined by James Dixon in 2010, the data lake concept stores diverse types like images and logs using scalable distributed storage, often integrated with Hadoop for analytics on voluminous, schema-flexible data. Complementing this, edge computing has gained prominence post-2020 by shifting data processing to devices near the source, reducing latency and bandwidth demands in IoT ecosystems; the market for edge solutions grew from $44.7 billion in 2022 to a projected $101.3 billion by 2027, driven by real-time applications in telecom and healthcare.
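
The entropy formula above can be evaluated directly; the short sketch below computes H for a toy symbol distribution (the probabilities are illustrative), showing that a skewed distribution carries less uncertainty than a uniform one.

```python
# Compute Shannon entropy H = -sum(p_i * log2 p_i) for a toy distribution.
import math

def shannon_entropy(probabilities):
    # Terms with zero probability contribute nothing and are skipped.
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

print(shannon_entropy([0.5, 0.25, 0.25]))  # 1.5 bits per symbol
print(shannon_entropy([1.0]))              # 0 bits: no uncertainty
```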

Data in Statistics and Scientific Research

In statistics, data serves as the foundation for inferential processes, distinguishing between a population, which encompasses the entire set of entities of interest, and a sample, a subset from which data is collected to estimate characteristics. This distinction enables researchers to draw generalizations while accounting for variability, as samples are often used due to practical constraints in accessing full populations. Probability distributions model the likely patterns in data; for instance, the normal distribution, characterized by its symmetric bell-shaped curve, describes continuous variables like measurement errors or biological traits in many natural phenomena, while the binomial distribution applies to discrete outcomes in fixed trials with two possibilities, such as success or failure in coin flips or binary clinical responses. Within the scientific method, data functions as empirical evidence to test hypotheses, aligning with Karl Popper's principle of falsifiability, where theories are corroborated or refuted based on observational outcomes rather than proven true. Hypotheses generate predictions that data either supports through consistency or challenges via discrepancies, emphasizing rigorous experimentation to advance knowledge. Replication reinforces this integration, ensuring findings are robust; however, the 2015 reproducibility project in psychology, involving 270 researchers who replicated 100 studies from top journals, revealed that only 36% of replications yielded statistically significant results, compared to 97% in originals, highlighting systemic issues in reliability. Key tools facilitate these statistical and scientific applications. The R programming language, developed in 1993 by Ross Ihaka and Robert Gentleman at the University of Auckland, provides an open-source environment for data analysis, modeling distributions, and visualization, now used by millions for its extensibility via packages. Similarly, SPSS (Statistical Package for the Social Sciences), first released in 1968 by Norman H. Nie, C. Hadlai Hull, and Dale H. Bent, revolutionized social science research with its user-friendly interface for hypothesis testing and multivariate analysis. Experimental designs like randomized controlled trials (RCTs), pioneered by Ronald A. Fisher in his 1925 work Statistical Methods for Research Workers and expanded in The Design of Experiments (1935), minimize bias by randomly assigning subjects to treatment or control groups, ensuring causal inferences from data. Advances in open science have enhanced data's role in research. PLOS journals implemented a mandatory data availability policy in 2014, requiring authors to share underlying datasets upon publication to promote transparency, validation, and reuse across studies. Citizen science platforms like Zooniverse, launched in 2009 by the Citizen Science Alliance, engage volunteers in classifying vast datasets—such as astronomical images—contributing to over 100 peer-reviewed publications and democratizing research participation in fields like astronomy and physics.
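
The population-versus-sample distinction and the two distributions named above can be illustrated with a small simulation; the parameters are arbitrary and the code is a sketch for intuition, not a methodological recommendation.

```python
# Sketch: a sample mean estimating a population mean, plus a binomial probability.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
population = rng.normal(loc=170.0, scale=10.0, size=100_000)  # normal: e.g. heights
sample = rng.choice(population, size=200, replace=False)
print(population.mean(), sample.mean())  # the sample mean approximates the population mean

# Binomial: probability of exactly 7 successes in 10 trials with p = 0.5.
print(stats.binom.pmf(7, n=10, p=0.5))
```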

Data in Society and Ethics

Data plays a pivotal role in modern economies, driving innovation and growth while reshaping societal structures. The global data economy, encompassing data creation, storage, and analysis, is projected to contribute approximately 15% of world GDP in nominal terms, amounting to around $16 trillion, according to estimates from the International Data Center Authority (IDCA). This sector fuels a wide range of industries, including healthcare, enabling personalized services and analytics that enhance efficiency but also concentrate economic power in tech giants. However, this data-driven model has introduced concepts like surveillance capitalism, where personal data is commodified for behavioral prediction and profit, as articulated by Shoshana Zuboff in her 2019 book The Age of Surveillance Capitalism. Zuboff describes this as a new economic order that extracts human experience as raw material for commercial practices, often without adequate user awareness or consent. Ethical challenges surrounding data are profound, particularly in privacy, bias, and consent. Privacy violations have prompted stringent regulations, with the European Union's General Data Protection Regulation (GDPR), effective since May 2018, imposing fines totaling €6.74 billion as of November 2025 for non-compliance, averaging €2.55 million per penalty across 2,645 cases. These enforcement actions underscore the regulation's role in protecting individual rights amid widespread breaches. Bias in datasets exacerbates inequalities; for instance, a 2018 study by Joy Buolamwini and Timnit Gebru revealed that commercial facial recognition systems exhibited error rates up to 34.7% for darker-skinned females, compared to 0.8% for lighter-skinned males, highlighting intersectional disparities in machine learning applications. Consent models remain contentious, with traditional opt-in approaches often insufficient; ethical frameworks advocate for dynamic consent, allowing individuals to granularly control data reuse over time, as explored in health data research ethics. Policy frameworks have evolved to address these issues, balancing innovation with protection. In the United States, the California Consumer Privacy Act (CCPA) of 2018 grants residents rights to access, delete, and opt out of personal data sales, applying to businesses handling data of 50,000 or more consumers annually and influencing broader U.S. privacy standards. Internationally, the EU's Artificial Intelligence Act, adopted in 2024 and entering force on August 1, 2024, classifies AI systems by risk levels, mandating transparency, bias mitigation, and human oversight for high-risk applications involving data processing. These laws reflect a global push toward accountable data governance, with enforcement mechanisms like the GDPR's supervisory authorities ensuring compliance. Looking ahead, data sovereignty has emerged as a critical concern amid geopolitical tensions, with nations implementing controls to localize data storage and restrict cross-border flows for national security. Several jurisdictions now emphasize data residency requirements to prevent foreign influence, driven by U.S.-China tech rivalries and events like the 2022 Russia-Ukraine conflict that highlighted digital vulnerabilities. Concurrently, digital rights movements, led by organizations such as the Electronic Frontier Foundation, advocate for data privacy as a civil right, pushing for legislation that curbs discriminatory data uses and extends Fourth Amendment protections to digital spaces. These efforts aim to safeguard individual autonomy against unchecked data exploitation in an increasingly fragmented global landscape.

References

  1. [1]
    data - Glossary | CSRC - NIST Computer Security Resource Center
    Representation of facts, concepts, or instructions in a manner suitable for communication, interpretation, or processing by humans or by automatic means.
  2. [2]
    [PDF] Data.pdf - Introduction to Computing
    Internally, all data is stored just as a sequence of bits, so the type of the data is important to understand what it means. We have seen several different ...
  3. [3]
    Types of Data - Data Science Discovery
    AND —; Unstructured Data. Structured Data. Structured data refers to data that has been organized and categorized in a well-defined format.
  4. [4]
    Learn About Data - UW-IT - University of Washington
    Jun 6, 2025 · Data refers to raw, unprocessed facts and figures without context. Examples include numbers, dates, and strings of text.
  5. [5]
    Quantitative and qualitative data | Australian Bureau of Statistics
    Apr 18, 2023 · Data collected about a numeric variable will always be quantitative and data collected about a categorical variable will always be qualitative.
  6. [6]
    [PDF] NIST Big Data Interoperability Framework: Volume 1, Definitions
    Oct 2, 2019 · Computing (also with capital initials) data of a very large size, typically to the extent that its manipulation and management present.
  7. [7]
    The Advantages of Data-Driven Decision-Making - HBS Online
    Aug 26, 2019 · Data-driven decisions lead to more confident decisions, enable proactive actions, and can realize cost savings.
  8. [8]
    [PDF] Harnessing the Power of Digital Data for Science and Society
    Jan 14, 2009 · Digital technologies are reshaping the practice of science. Digital imaging, sensors, analytical instrumentation.Missing: modern | Show results with:modern
  9. [9]
    SP 800-60 Rev. 2, Guide for Mapping Types of Information and ...
    Jan 31, 2024 · This publication provides a methodology to map types of information and systems to security categories (ie, confidentiality, integrity, and availability) and ...
  10. [10]
    data, n. meanings, etymology and more | Oxford English Dictionary
    OED's earliest evidence for data is from 1645, in the writing of Thomas Urquhart, author and translator. data is a borrowing from Latin. Etymons: Latin data, ...
  11. [11]
    Data - Etymology, Origin & Meaning
    Originating in the 1640s from Latin datum, meaning "a thing given," data means facts or numerical information collected for reference or calculation.
  12. [12]
    Early Popular Computers, 1950 - 1970
    Jan 9, 2015 · In 1953, IBM delivered its 701 Electronic Data Processing Machine, a large-scale computer for scientific and engineering applications, including ...
  13. [13]
    Origins of Statistics
    Counting as Descriptive Statistics. John Graunt's 1662 Observatons on the Bills of Mortality is often cited as the first instance of descriptive statistics.
  14. [14]
    Datum isn't; data are - PMC - NIH
    A search of Google Books shows that the use of plural form of “data,” which once outnumbered the singular form by a factor of 4, has been reduced to equality ...
  15. [15]
    DM2 - Information and Data - DoD CIO - Department of War
    Data is the representation of information in a formalized manner suitable for communication, interpretation, or processing by humans or by automatic means.
  16. [16]
    Plural Nouns - APA Style - American Psychological Association
    Nouns can be singular (i.e., only one) or plural (i.e., more than one). To make a noun plural, add “s”, “es”, and sometimes “ies”.
  17. [17]
    Usage and Grammar - The Chicago Manual of Style
    Should I treat “data” as a singular or a plural noun? I have been looking for a definitive answer to this question in online style manuals and grammar guides.
  18. [18]
    Qualitative vs. Quantitative Research: What's the Difference?
    Oct 9, 2023 · The key difference is that qualitative research seeks to understand meanings, while quantitative research aims to quantify variables. Because ...
  19. [19]
    Types of Data - Education Data and Statistics
    May 2, 2025 · Qualitative data tells a story. Such data are non-numerical in nature. They also have a low level of measurability. Qualitative data can include ...
  20. [20]
    Chapter 6 Qualitative vs Quantitative Data – FITS 204
    Qualitative data involves a descriptive judgment using concept words instead of numbers. Gender, country name, animal species, and emotional state, feelings, ...
  21. [21]
    Structured vs. Unstructured Data Types - Oracle
    Apr 1, 2022 · Just as structured data comes with definition, unstructured data lacks definition. Rather than predefined fields in a purposeful format, ...
  22. [22]
    What is Big Data? | IBM
    The "V's of Big Data"—volume, velocity, variety, veracity and value—are the five characteristics that make big data unique from other kinds of data. These ...
  23. [23]
    Discrete vs. Continuous Data: A Guide for Beginners - Coursera
    Nov 13, 2024 · Discrete data is countable with distinct values, while continuous data is measurable and can take any value within a range.
  24. [24]
    Public Health Research Guide: Primary & Secondary Data Definitions
    Jul 16, 2025 · Qualitative Research Definition: Data collected that is not numerical, hence cannot be quantified. It measures other characteristics through ...
  25. [25]
    Spatiotemporal Analysis
    Spatiotemporal models arise when data are collected across time as well as space and has at least one spatial and one temporal property. An event in a ...
  26. [26]
    Chapter: 2 Types of Data and Methods for Combining Them
    This chapter briefly discusses types of data, including government surveys and data collected by government agencies while administering programs.
  27. [27]
    Weather Station Data on the Juneau Icefield
    Jun 26, 2019 · Data at many JIRP sites consist of a single temperature sensor, recording in a Stevenson-type shield. Automated weather stations at Camp 10, 17, ...
  28. [28]
    Geological & Environmental Systems | netl.doe.gov
    Our research uses new and historical data from field trials and laboratory experiments and novel computational tools to assess engineered and natural systems in ...
  29. [29]
    All Resources - Site Guide - NCBI - NIH
    Each record in the database is a set of DNA sequences. For example, a population set provides information on genetic variation within an organism, while a ...
  30. [30]
    Imaging genomics: data fusion in uncovering the missing heritability
    Feb 1, 2024 · Recent efforts to unravel this 'missing heritability' focus on garnering new insight from merging different data types including medical imaging.Missing: sources | Show results with:sources
  31. [31]
    Using Big Data - Open Publishing
    Social exchanges, for example, are captured in emails, instant messages, and social networking platforms. Expressions of attitudes and biases manifest in blog ...
  32. [32]
    [PDF] Research and Methodological Foundations of Transaction Log ...
    This chapter outlines and discusses theoretical and methodological foundations for transaction log analysis. We first address the fundamentals of ...Missing: generated | Show results with:generated
  33. [33]
    Web-archiving and social media: an exploratory analysis
    Jun 22, 2021 · Web and social media archives provide an invaluable resource for researchers to study human behaviour and history as they provide clear records ...
  34. [34]
    [PDF] Internet of Things (IoT) Advisory Board (IoTAB) Report
    Oct 21, 2024 · The IoT Advisory Board (IoTAB) has prepared findings and recommendations to realize opportunities and overcome challenges. Development of this ...Missing: scraping | Show results with:scraping
  35. [35]
    Using big data for evaluating development outcomes: A systematic ...
    This study, utilising web scraping to collect data on project characteristics and various sources of satellite data for measuring the outcomes of interest ...
  36. [36]
    [PDF] Microblogs Data Management: A Survey
    Microblogs data, the micro-length user-generated data that is posted on the web, such as tweets, online reviews, news comments, social media comments, and user ...
  37. [37]
    How Technology Is Making it Possible to Build the Largest Dataset ...
    Apr 20, 2023 · From 1790 to 1880, the U.S. Census Bureau recorded census data using manual counting processes. Only household information was captured prior to ...
  38. [38]
    The History and Growth of the United States Census
    Aug 19, 2024 · Census statistics date back to 1790 and reflect the growth and change of the United States. Past census reports contain some terms that today's ...Missing: sources manual digital
  39. [39]
    Measuring America: The Decennial Censuses From 1790 to 2000
    This report contains detailed information on the questionnaires and instructions used for each census, plus individual histories of each census.Missing: history | Show results with:history
  40. [40]
    RFC 4180 Common Format and MIME Type for CSV Files - IETF
    This RFC documents the format of comma separated values (CSV) files and formally registers the "text/csv" MIME type for CSV in accordance with RFC 2048 [1].
  41. [41]
    The History of Microsoft - 1985
    Apr 16, 2009 · September 30, 1985 Microsoft announces the shipment to retail stores of Excel for the Macintosh, a powerful, full-featured microcomputer ...
  42. [42]
    Extensible Markup Language (XML) 1.0 (Fifth Edition) - W3C
    Nov 26, 2008 · XML was developed by an XML Working Group (originally known as the SGML Editorial Review Board) formed under the auspices of the World Wide Web ...Namespaces in XML · Review Version · Abstract · First EditionMissing: history | Show results with:history
  43. [43]
    RFC 8259 - The JavaScript Object Notation (JSON) Data ...
    The JavaScript Object Notation (JSON) Data Interchange Format · RFC - Internet Standard December 2017. View errata Report errata. Obsoletes RFC 7159. Was draft- ...Missing: history | Show results with:history
  44. [44]
    JPEG 1
    The JPEG 1 standard (ISO/IEC 10918) was created in 1992 (latest version, 1994) as the result of a process that started in 1986.JPEG AI · JPEG XS · JPEG DNA · JPEG XRMissing: history | Show results with:history
  45. [45]
    A History of MySQL Database - Exadel
    Oct 12, 2017 · MySQL AB was one of the most anticipated technology IPOs of 2008, with lower costs that attracted investors over competing tools like Oracle 11.
  46. [46]
    MongoDB Evolved – Version History
    The first version of the MongoDB database shipped in August 2009. The 1.0 release and those that followed shortly after were focused on validating a new and ...What's New In The Latest... · 2024 -- Mongodb 8.0 · 2023 -- Mongodb 7.0
  47. [47]
    Resource Description Framework (RDF): Concepts and Abstract ...
    Feb 10, 2004 · The RDF Working Group has produced a W3C Recommendation for a new version of RDF which adds features to this 2004 version, while remaining ...Missing: history | Show results with:history
  48. [48]
    What is EDI? Electronic Data Interchange Explained - OpenText
    The American National Standards Institute (ANSI) X12 standard, developed in 1979, remains the primary EDI standard in North America. X12 defines specific ...
  49. [49]
    The IBM punched card
    The punched card preceded floppy disks, magnetic tape and the hard drives of later computers as the first automated information storage device, increasing ...
  50. [50]
    Understanding Parquet Data Format | ClicData Data Guides
    Rating 4.6 (187) Parquet was developed in 2013 as a joint effort between Twitter and Cloudera, built on Google's Dremel paper and inspired by its internal columnar storage ...
  51. [51]
  52. [52]
    A Closer Look at Data Retrieval in Databases - CelerData
    Aug 20, 2024 · Key Components of Data Retrieval. Queries. Structured requests formulated using query languages such as SQL (Structured Query Language) or APIs.
  53. [53]
    Elasticsearch: 15 years of indexing it all, finding what matters
    Feb 12, 2025 · When was Elasticsearch first released? Elasticsearch was released in February 2010. How many times has Elasticsearch has been downloaded?
  54. [54]
  55. [55]
    The FAIR Guiding Principles for scientific data management ... - Nature
    Mar 15, 2016 · This article describes four foundational principles—Findability, Accessibility, Interoperability, and Reusability—that serve to guide data ...
  56. [56]
    Facts and Figures 2023 - Internet use - ITU
    Oct 10, 2023 · In low-income countries, 27 per cent of the population uses the Internet, up from 24 per cent in 2022. This 66 percentage point gap reflects the ...
  57. [57]
    An Analysis of Online Datasets Using Dataset Search (Published, in ...
    Aug 25, 2020 · The result is Dataset Search, which we launched in beta in 2018 and fully launched in January 2020.
  58. [58]
    Data Version Control · DVC
    Open-source version control system for Data Science and Machine Learning projects. Git-like experience to organize your data, models, and experiments.Use Cases · DVC Tools for Data Scientists... · Get Started · DVC DocumentationMissing: lineages | Show results with:lineages
  59. [59]
    Descriptive Statistics for Summarising Data - PMC - PubMed Central
    May 15, 2020 · These two types of graphs are useful for summarising the frequency of occurrence of various values (or ranges of values) where the data are ...
  60. [60]
    Chapter 9 Visualizing data distributions | Introduction to Data Science
    Histograms and density plots provide excellent summaries of a distribution. But can we summarize even further? We often see the average and standard ...
  61. [61]
    Graphical Summaries | Introduction to Data Science
    Scatterplots. A scatterplot is a very widely-used method for visualizing bivariate data. They have many uses, but the most relevant for us is to plot the joint ...
  62. [62]
    T Test - StatPearls - NCBI Bookshelf - NIH
    William Sealy Gosset first described the t-test in 1908, when he published his article under the pseudonym 'Student' while working for a brewery.
  63. [63]
    6.6 - Confidence Intervals & Hypothesis Testing | STAT 200
    Confidence intervals use data from a sample to estimate a population parameter. Hypothesis tests use data from a sample to test a specified hypothesis.
  64. [64]
    Gauss and the Invention of Least Squares - jstor
    Gauss was the first mathematician of the age, but it was Legendre who crystalized the idea in a form that caught the mathematical public's eye. Just as the ...
  65. [65]
    6.1 Correlation and Causation - Sense & Sensibility & Science
    Feb 21, 2024 · How can we be sure? We explain the mantra that "correlation does not equal causation" by defining causation as "correlation under intervention." ...
  66. [66]
    The Extent and Consequences of P-Hacking in Science - PMC - NIH
    Mar 13, 2015 · One type of bias, known as “p-hacking,” occurs when researchers collect or select data or statistical analyses until nonsignificant results become significant.
  67. [67]
    Estimating the reproducibility of psychological science
    We conducted a large-scale, collaborative effort to obtain an initial estimate of the reproducibility of psychological science.
  68. [68]
    Lecture Notes | Introduction to Algorithms - MIT OpenCourseWare
    Lecture 1: Introduction notes (PDF) · Recitation 1 notes (PDF); Lecture 2: Data Structures notes (PDF) · Recitation 2 notes (PDF); Lecture 3: Sorting ...
  69. [69]
    [PDF] A Mathematical Theory of Communication
    A Mathematical Theory of Communication, by C. E. Shannon, pp. 379–423, 623–656, July and October 1948.
  70. [70]
    [PDF] Jim Gray - The Transaction Concept: Virtues and Limitations
    1981. Published by Tandem ... This section discusses techniques for “almost perfect” systems and explains their relationship to transaction processing.
  71. [71]
    (PDF) Principles of Transaction-Oriented Database Recovery
    Database consistency theory establishes that reliable systems must maintain fundamental ACID properties (Atomicity, Consistency, Isolation, Durability) to ...
  72. [72]
    [PDF] The Hadoop Distributed File System - cs.wisc.edu
    The Apache Hadoop project was founded in 2006. By the end of that year, Yahoo! had adopted Hadoop for internal use and had a 300-node cluster for development ...
  73. [73]
    A Brief History of Data Lakes - Dataversity
    Jul 2, 2020 · In October of 2010, James Dixon, founder and former CTO of Pentaho, came up with the term “Data Lake.” Dixon argued Data Marts come with several ...
  74. [74]
    Four Foundational Technology Trends to Watch in 2023 - IEEE SA
    Jan 20, 2023 · From a performance standpoint, edge computing can deliver much faster response times—locating key processing functions closer to end users ...
  75. [75]
    1.2 - Samples & Populations | STAT 200 - STAT ONLINE
    Population: The entire set of possible cases; Sample: A subset of the population from which data are collected; Statistic: A measure concerning a sample (e.g., ...
  76. [76]
    Population vs Sample: Uses and Examples - Statistics By Jim
    Population: The whole group of people, items, or elements of interest. Sample: A subset of the population that researchers select and include in their study.
  77. [77]
    Normal Distribution - PMC - PubMed Central - NIH
    Statistical notes: The parameters of normal distribution are mean and SD. Distribution is a function of SD. Sample size plays a role in normal distribution.
  78. [78]
    Scientific Method - Stanford Encyclopedia of Philosophy
    Nov 13, 2015 · Harper, W.L., 2011, Isaac Newton's Scientific Method: Turning Data into Evidence about Gravity and Cosmology, Oxford: Oxford University Press.
  79. [79]
    [PDF] R: A Language for Data Analysis and Graphics
    In this article we discuss our experience designing and implementing a statistical computing language. In developing this new language, we sought to combine ...
  80. [80]
    (PDF) SPSS (software) - ResearchGate
    Nov 29, 2016 · Introduced in 1968, it helped revolutionize research practices in the social sciences, enabling researchers to conduct complex statistical ...
  81. [81]
    Fisher, Bradford Hill, and randomization - Oxford Academic
    In the 1920s RA Fisher presented randomization as an essential ingredient of his approach to the design and analysis of experiments, validating significance ...
  82. [82]
    Data Access for the Open Access Literature: PLOS's Data Policy
    Feb 25, 2014 · The new PLOS Data Policy will require all submitting authors to include a data availability statement as of March 1, 2014.
  83. [83]
    Zooniverse: 10 years of people-powered research
    Dec 12, 2019 · The Zooniverse was launched on 12th December 2009 following the success of an initial 2-year project, Galaxy Zoo, ...
  84. [84]
    Global Digital Economy Report - 2025 | IDCA
    The Digital Economy comprises about 15 percent of world GDP in nominal terms, according to the World Bank. This amounts to about $16 trillion of ...
  85. [85]
    The Age of Surveillance Capitalism: The Fight for a Human Future at ...
    In this masterwork of original thinking and research, Shoshana Zuboff provides startling insights into the phenomenon that she has named surveillance ...
  86. [86]
    Numbers and Figures | GDPR Enforcement Tracker Report 2024/2025
    There were 2,245 GDPR fines (2,560 if including incomplete data) totaling around EUR 5.65 billion, with an average fine of EUR 2,360,409. The highest fine was ...
  87. [87]
    [PDF] Gender Shades: Intersectional Accuracy Disparities in Commercial ...
    Past research has also shown that the accuracies of face recognition systems used by US-based law enforcement are systematically lower for people labeled female ...
  88. [88]
    Ethical Issues in Consent for the Reuse of Data in Health Data ...
    Under a meta-consent model, individuals would be able to choose how they prefer to provide consent—for example, whether they prefer a blanket or dynamic model ...
  89. [89]
    California Consumer Privacy Act (CCPA)
    Mar 13, 2024 · The California Consumer Privacy Act of 2018 (CCPA) gives consumers more control over the personal information that businesses collect about them.
  90. [90]
    The Act Texts | EU Artificial Intelligence Act
    The original AI Act as proposed by the European Commission can be downloaded below. Please note that the 'Final draft' of January 2024 as amended is the version ...
  91. [91]
    Fragmenting the Internet: The Geopolitics of Data Sovereignty
    Aug 22, 2025 · Once states recognize data as strategic, the imperative to reassert sovereignty over it manifests most visibly in data localization mandates.
  92. [92]
    Digital Privacy Legislation is Civil Rights Legislation
    May 18, 2023 · Data surveillance is a civil rights problem, and legislation to protect data privacy can help protect civil rights.