Data set
A dataset, also known as a data set, is a structured collection of related data points organized to facilitate storage, analysis, and retrieval, typically associated with a unique body of work or research objective.[1] In essence, it represents a coherent assembly of information, such as numerical values, text, images, or measurements, that can be processed computationally or statistically to derive insights.[2] Datasets form the foundational building blocks for fields like statistics, computer science, and data science, enabling everything from hypothesis testing to predictive modeling.[3]

In statistics, a dataset consists of observations or measurements collected systematically for analysis, often arranged in tabular form with variables representing attributes and cases denoting individual entries.[3] For instance, the classic Iris dataset includes measurements of sepal and petal dimensions for three species of iris flowers, serving as a benchmark for classification tasks.[4] Datasets in this context must exhibit qualities such as completeness, accuracy, and relevance to ensure valid statistical inferences, with size influencing the reliability of results: larger datasets generally allow for more robust generalizations.[5]

Within computer science and machine learning, datasets are critical for training algorithms, where they are divided into subsets such as training data (used to fit models), validation data (for tuning hyperparameters), and test data (for evaluating performance).[6] A machine learning dataset typically comprises examples with input features (e.g., pixel values in an image) and associated labels or targets (e.g., object categories), enabling supervised learning paradigms.[6] Developing such datasets involves stages like collection, cleaning, annotation, and versioning to address challenges such as bias, imbalance, or incompleteness, which can profoundly affect model fairness and accuracy.[6]

Datasets vary by structure and format, broadly categorized as structured (e.g., relational tables in SQL databases with predefined schemas), unstructured (e.g., raw text documents or video files lacking fixed organization), and semi-structured (e.g., XML or JSON files with tags but flexible schemas).[7] They also differ by content type, including numerical (for quantitative analysis), categorical (for discrete groupings), or multimodal (combining text, audio, and visuals).[5][8] In practice, high-quality datasets are indispensable for applications ranging from business intelligence and healthcare diagnostics to climate modeling, underscoring their role in driving evidence-based decisions across industries.[1]

Fundamentals
Definition
A dataset is an organized collection of related data points, typically assembled for purposes such as analysis, research, or record-keeping.[9][10] In its most common representation, a dataset takes a tabular form in which rows correspond to individual observations or instances and columns represent variables or attributes describing those observations.[9][11] This structure facilitates systematic examination and manipulation of the data, enabling users to identify patterns, test hypotheses, or derive insights.[2]

While related to broader concepts in data handling, a dataset differs from a database and from a data file in scope and purpose. A database is a comprehensive, electronically stored and managed collection of interrelated data, often supporting multiple datasets through querying and retrieval systems, whereas a dataset constitutes a more focused, self-contained subset tied to a specific project or investigation.[1] Similarly, a data file refers to the physical or digital container holding the data, whereas the dataset emphasizes the logical organization and content within that file rather than the storage medium itself.[1]

The fundamental components of a dataset are observations, variables, and metadata. Observations are the discrete units or records captured, each forming a row in the tabular structure and representing a single entity or event under study.[11] Variables are the measurable characteristics or properties associated with those observations, organized as columns to provide consistent descriptors across the collection.[9] Metadata, in turn, encompasses descriptive information about the dataset as a whole, such as its title, creator, collection methods, and contextual details, which aids interpretation, reuse, and management without altering the core data.[12][13]
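As a minimal illustration of these components, the following Python sketch (using pandas; the table contents, column names, and metadata fields are hypothetical) builds a small tabular dataset in which rows are observations, columns are variables, and a separate dictionary records metadata about the collection as a whole.

    import pandas as pd

    # Each row is one observation; each column is a variable describing it.
    observations = pd.DataFrame({
        "age": [34, 29, 41],                 # numerical variable
        "city": ["Oslo", "Lyon", "Kyoto"],   # categorical variable
        "income": [52000, 47000, 61000],     # numerical variable
    })

    # Metadata describes the dataset as a whole without altering its contents.
    metadata = {
        "title": "Illustrative demographic sample",
        "creator": "Example author",
        "collection_method": "hypothetical survey",
        "observations": len(observations),
        "variables": list(observations.columns),
    }

    print(observations)
    print(metadata)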
Historical Development
The concept of datasets originated in the 19th century through the compilation of statistical tables by early pioneers in social and medical statistics. Adolphe Quetelet, a Belgian mathematician and statistician, systematically collected anthropometric measurements and social data across populations in the 1830s and 1840s, using these proto-datasets to derive averages and identify regularities in human behavior, work that founded the field of social physics.[14] Similarly, Florence Nightingale gathered and tabulated mortality data from British military hospitals during the Crimean War in the 1850s, employing statistical tables to quantify the effects of poor sanitation versus battlefield injuries, thereby influencing sanitary reforms and demonstrating data's persuasive power in policy.[15]

The 20th century saw datasets evolve alongside mechanical and early electronic computing. Punch cards, first mechanized by Herman Hollerith for the 1890 U.S. Census, became integral to data processing in IBM systems by the 1950s, allowing automated sorting and tabulation of large-scale records for business and scientific applications.[16] This mechanization paved the way for software innovations such as the Statistical Analysis System (SAS), developed in the late 1960s by researchers at North Carolina State University and first released in 1976, which enabled programmable analysis of agricultural and experimental datasets on mainframe computers.[17]

The digital era of the 1980s and 1990s brought standardization and accessibility to datasets through relational models and open repositories. Edgar F. Codd's 1970 relational database model, implemented commercially by systems like Oracle in 1979, structured data into tables with defined relationships, while SQL emerged as the dominant query language, standardized by ANSI in 1986 for efficient data retrieval and management.[18] Complementing this, the UCI Machine Learning Repository was established in 1987 by David Aha at the University of California, Irvine, as an FTP archive of donated datasets, fostering research in pattern recognition and becoming a foundational resource for algorithmic testing.[19]

The 21st century witnessed explosive growth in dataset scale and distribution, driven by big data technologies. Apache Hadoop, initially released in 2006, provided an open-source framework for storing and processing petabyte-scale datasets across distributed clusters, drawing on Google's MapReduce paradigm to handle unstructured volumes in web-scale applications.[20] Subsequently, Kaggle launched in 2010 as a platform for hosting open datasets and competitions, enabling global collaboration on real-world problems and accelerating the adoption of machine learning through crowdsourced data sharing.[21]

Characteristics
Key Properties
The size of a data set refers to the total number of records or samples it contains, often denoted as n, which directly influences the statistical power and reliability of analyses performed on it. Larger data sets, with millions or billions of samples, enable more robust generalizations but demand significant storage and processing resources. Dimensionality describes the number of features or attributes per sample, typically denoted as p, representing the breadth of information captured for each observation. In many modern applications, such as genomics or image processing, data sets exhibit high dimensionality where p exceeds n, leading to challenges like the curse of dimensionality that can degrade model performance without appropriate techniques. Granularity pertains to the level of detail or resolution in the data, determining how finely observations are divided; for instance, aggregating sales data at a daily versus hourly level affects the insights derivable. Finer granularity provides richer context but increases data volume and complexity in handling.

Completeness measures the extent to which data values are present, often quantified as the proportion of non-missing entries across variables, with missing values arising from collection errors or non-response.[22] Incomplete data sets can introduce bias and reduce analytical accuracy unless addressed.[22] Accuracy refers to the extent to which data correctly describes the real-world phenomena it represents, free from errors in measurement or transcription; inaccurate data can lead to flawed analyses and decisions.[23] Consistency evaluates whether data remains uniform and non-contradictory across different parts of the dataset or across multiple sources, such as matching formats or values; inconsistencies can hinder data integration and trust in results.[23] Timeliness assesses how up-to-date the data is relative to its intended use, ensuring availability when needed without obsolescence; outdated data may yield irrelevant insights.[23]

Data sets also exhibit variability in the distribution of their points, characterized by homogeneity when samples share similar attributes or heterogeneity when they display substantial diversity across features. Homogeneous data sets facilitate simpler modeling, while heterogeneous ones capture real-world complexity but require advanced techniques to manage underlying differences. Statistical measures such as the mean, which indicates central tendency, and the variance, which quantifies spread, serve as key indicators of a data set's distribution without implying uniform patterns across all cases.

Scalability concerns how these properties affect practical usability: small data sets with low dimensionality allow efficient processing on standard hardware, whereas large, high-dimensional ones escalate computational demands, often necessitating distributed systems to handle time and memory requirements that grow with n and p. For example, algorithms with quadratic complexity become infeasible for n > 10^6, highlighting the need for scalable methods in big data contexts.[24]
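The quantitative properties above can be read directly off a table. The short Python sketch below (pandas and NumPy, with made-up values) computes the size n, the dimensionality p, completeness as the proportion of non-missing entries, and the per-variable mean and variance.

    import numpy as np
    import pandas as pd

    # Hypothetical data set with one missing entry.
    df = pd.DataFrame({
        "height_cm": [170.0, 165.0, np.nan, 180.0],
        "weight_kg": [68.0, 59.0, 72.0, 81.0],
    })

    n, p = df.shape                              # size (samples) and dimensionality (features)
    completeness = df.notna().to_numpy().mean()  # proportion of non-missing entries
    means = df.mean()                            # central tendency per variable
    variances = df.var()                         # spread per variable

    print(f"n = {n}, p = {p}, completeness = {completeness:.2f}")
    print(means)
    print(variances)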
Formats and Structures
Data sets are organized and stored in various formats that reflect their structural properties, enabling efficient storage, retrieval, and exchange. Common formats include tabular structures for simple, row-and-column data; hierarchical formats for nested relationships; and graph-based formats for interconnected entities. These formats determine how data is represented, with choices often influenced by factors such as dimensionality, where higher-dimensional data may favor compressed or columnar layouts over flat files.[25][26][27]

Tabular formats, such as CSV and Excel, are widely used for storing data in a grid-like arrangement of rows and columns, ideal for relational or spreadsheet-style datasets. CSV, a delimited text format that uses commas to separate values, supports basic tabular data without complex nesting, making it human-readable and compatible with numerous tools.[28] Excel files, typically in .xlsx format based on the Office Open XML standard, extend tabular storage to include formulas, multiple sheets, and formatting, though they are less efficient for large-scale data because of their binary or zipped XML structure.[29]

Hierarchical formats like XML and JSON provide tree-like structures for representing nested data, suitable for semi-structured information. XML, a markup language derived from SGML, uses tags to define elements and attributes, allowing extensible schemas that enforce data validation and hierarchy.[30] JSON, a lightweight text-based format built on key-value pairs and arrays, supports objects and nesting for easy parsing in web and programming environments, and is often preferred for its simplicity over XML's verbosity.[25]

Graph-based formats, such as RDF, model data as networks of nodes and edges to capture semantic relationships, particularly in linked data scenarios. RDF represents information through subject-predicate-object triples forming directed graphs, where resources are identified by IRIs or literals, enabling inference and interoperability in the Semantic Web.[27]

Structural elements in data sets are often defined by schemas that specify organization and constraints. In the relational model, data is structured into tables (relations) with rows as tuples and columns as attributes, using primary and foreign keys to enforce uniqueness and links between tables, as formalized by Edgar F. Codd.[31] Non-relational alternatives, known as NoSQL, employ flexible schemas accommodating document, key-value, column-family, or graph models, avoiding rigid tables to handle varied data types and scales.[32]

Standardization of formats plays a crucial role in ensuring interoperability across systems, allowing data to be shared without loss of structure or meaning. For instance, Parquet, a columnar storage format optimized for big data, uses metadata-rich files with compression and nested support, facilitating efficient querying and exchange in ecosystems like Hadoop and Spark.[33]
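To illustrate how the same tabular content can be carried by different formats, the sketch below writes one small pandas table to CSV (flat delimited text), JSON (nested key-value records), and Parquet (columnar binary). The file names are arbitrary, and the Parquet step assumes an optional engine such as pyarrow or fastparquet is installed.

    import pandas as pd

    df = pd.DataFrame({"id": [1, 2], "label": ["a", "b"], "value": [0.5, 1.25]})

    df.to_csv("example.csv", index=False)         # delimited tabular text
    df.to_json("example.json", orient="records")  # list of key-value records
    df.to_parquet("example.parquet")              # columnar, compressed binary

    # Reading each file back recovers the same rows and columns.
    print(pd.read_csv("example.csv"))
    print(pd.read_json("example.json"))
    print(pd.read_parquet("example.parquet"))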
Types
Structured Datasets
Structured datasets are collections of data organized according to a predefined schema, typically arranged in rows and columns or relational tables, which allows for efficient storage, retrieval, and querying using standardized languages such as SQL.[34] This organization ensures a predictable format and consistent structure, making the data immediately suitable for computational processing and mathematical analysis without extensive preprocessing.[34] Key traits include the use of fixed fields for data entry, such as numerical values, categorical labels, or timestamps, which enforce data integrity and allow relationships between data elements to be explicitly defined.[31]

Common subtypes of structured datasets include tabular formats, which resemble spreadsheets with rows representing records and columns denoting attributes; relational datasets, stored in systems like MySQL that link multiple tables through keys to model complex relationships; and time-series datasets, which organize sequential observations with associated timestamps for tracking changes over time.[35] Tabular structures are often used for straightforward reporting, while relational ones support advanced joins and normalization to minimize redundancy, as pioneered in the relational model.[31] Time-series examples include stock prices or sensor readings, where each entry pairs a value with a precise temporal marker to facilitate trend analysis.[35]

The primary advantages of structured datasets lie in their high interoperability across systems and readiness for analysis, as the rigid schema reduces ambiguity and supports automated tools for querying and aggregation.[34] For instance, census data is commonly stored in fixed schemas with columns for demographics such as age, income, and location, enabling rapid statistical computations and policy insights without custom parsing.[35] Such a schema also promotes completeness by defining required fields upfront, ensuring comprehensive coverage in applications like financial transactions or operational metrics.[34]
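A minimal sketch of a structured, relational dataset follows, using Python's built-in sqlite3 module; the table, column names, and sensor readings are invented for illustration. The fixed schema is what allows a standard SQL query to aggregate the time-series rows directly.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    # Predefined schema: fixed fields with declared types and a primary key.
    conn.execute(
        "CREATE TABLE readings (id INTEGER PRIMARY KEY, sensor TEXT, ts TEXT, value REAL)"
    )
    conn.executemany(
        "INSERT INTO readings (sensor, ts, value) VALUES (?, ?, ?)",
        [
            ("s1", "2024-01-01T00:00:00", 21.5),
            ("s1", "2024-01-01T01:00:00", 21.9),
            ("s2", "2024-01-01T00:00:00", 19.8),
        ],
    )

    # Standard SQL aggregates the timestamped rows without extra preprocessing.
    for sensor, avg_value in conn.execute(
        "SELECT sensor, AVG(value) FROM readings GROUP BY sensor"
    ):
        print(sensor, round(avg_value, 2))
    conn.close()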
Semi-structured Datasets
Semi-structured datasets feature a flexible organization with tags, markers, or keys that provide some inherent structure without adhering to a fixed schema, allowing variability in format while enabling partial organization.[7] Common examples include XML and JSON files, email messages with metadata headers, and NoSQL databases using key-value or document stores.[34] This type bridges the gap between structured and unstructured data, facilitating easier extraction and querying than fully unstructured forms through tools such as XPath for XML or JSON parsers, though it often requires schema inference or validation for consistent analysis.[7] Key characteristics include self-describing elements (e.g., tagged fields) that support hierarchical or nested data representations, making them suitable for web content, log files, or API responses where the structure evolves over time.[34]

Advantages of semi-structured datasets include adaptability to diverse and irregular sources, reduced need for extensive preprocessing compared with unstructured data, and support for scalable storage in document-oriented systems such as MongoDB. They are widely used in applications such as social media feeds or configuration files, balancing flexibility with analyzability.[7]
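As a brief sketch of how self-describing, semi-structured records are handled, the following Python snippet parses a hypothetical JSON document whose tagged fields can be navigated by key, even though other records in the same feed might omit or add fields.

    import json

    raw = """
    {"user": "alice", "posted": "2024-05-01",
     "message": {"text": "hello", "tags": ["greeting"]},
     "reactions": {"like": 3}}
    """

    record = json.loads(raw)
    print(record["message"]["text"])      # navigate the nested hierarchy by key
    print(record.get("reactions", {}))    # tolerate fields other records may lack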
Unstructured Datasets
Unstructured datasets encompass free-form information lacking a predefined data model or schema, such as text documents, images, videos, and audio files, which cannot be readily queried or analyzed using conventional relational databases.[36] This absence of inherent organization distinguishes them from structured data: they do not adhere to fixed fields or formats, and they comprise the majority of data generated in modern digital environments, estimated at around 80% of global data volumes as of 2025.[37] To render them usable, unstructured datasets require preprocessing to extract features and impose artificial structure, enabling integration into analytical workflows.[38]

Key subtypes include text corpora, which consist of raw textual content such as emails, social media posts, and literary works without tagged metadata; multimedia datasets featuring images, videos, and audio recordings that capture visual or auditory information in binary formats; and sensor data streams from Internet of Things (IoT) devices, such as real-time logs from environmental monitors or wearable trackers producing continuous, unformatted signals.[39] For instance, text corpora like email archives require parsing to identify entities and relationships, while multimedia examples, such as video surveillance feeds, involve frame-by-frame analysis to detect objects or events.[40] Sensor streams, often generated in high-velocity bursts, exemplify dynamic unstructured data that defies static storage without temporal aggregation.[41]

Handling unstructured datasets presents significant challenges due to their volume, variety, and lack of standardization, demanding advanced techniques such as natural language processing (NLP) for textual data and computer vision for visual content to derive insights.[42] In social media feeds, for example, NLP algorithms must navigate slang, emojis, and context-dependent sentiments to classify posts or track trends, often contending with noise from multilingual or abbreviated content that complicates accurate extraction.[43] These processing demands can increase computational costs and error rates, as models trained on one subtype may underperform on others without domain-specific adaptations.[38] Unlike structured datasets, unstructured ones exhibit pronounced heterogeneity in format and semantics, amplifying the need for robust feature engineering prior to analysis.[44]
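As a toy illustration of imposing structure on unstructured text, the sketch below (plain Python, with invented posts) converts free-form social media messages into bag-of-words term counts, a very rough stand-in for the NLP preprocessing described above.

    from collections import Counter

    posts = [
        "Great weather today!!! #sunny",
        "great   WEATHER, terrible traffic...",
    ]

    def bag_of_words(text):
        # Crude normalization: lowercase, split on whitespace, strip punctuation.
        tokens = [t.strip(".,!#?") for t in text.lower().split()]
        return Counter(t for t in tokens if t.isalpha())

    for post in posts:
        print(bag_of_words(post))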
Creation and Management
Methods of Creation
Data sets can be created through primary methods including direct collection, synthetic generation, and aggregation of existing sources. Direct collection involves gathering raw data from real-world sources, such as surveys that solicit responses from individuals to capture opinions, behaviors, or demographics, or sensors that automatically record environmental or physiological measurements like temperature, motion, or biometric signals in real time.[45][46] These approaches ensure the data reflects authentic phenomena but require careful design to cover the target domain adequately.

Synthetic generation produces artificial data that mimics real distributions, often via simulations that model complex systems, such as weather patterns or economic scenarios, to output large volumes of controlled data without real-world constraints, or through machine learning algorithms like generative adversarial networks (GANs), introduced by Goodfellow et al. in 2014, in which a generator creates samples and a discriminator evaluates their realism to iteratively improve output quality. More recent techniques include diffusion models, which generate data through iterative denoising, and large language model-based approaches for tabular and text data, increasingly integrated into machine learning workflows as of 2025.[47][48][49] Aggregation, meanwhile, combines data from multiple disparate sources, such as merging records from various databases or files into a unified set, enabling broader analysis while resolving inconsistencies in formats or scales.[50]

Tools and processes facilitate these methods, including APIs and web scraping to extract publicly available data from websites, database querying with SQL to retrieve and filter structured information from relational systems, and sampling techniques such as simple random or stratified sampling to select subsets that maintain population representativeness without exhaustive collection.[51][52] For instance, stratified sampling divides the population into subgroups before random selection to ensure proportional inclusion of key characteristics, as sketched below.

A key consideration in dataset creation is the potential introduction of biases, particularly selection bias, where the chosen method or sample systematically excludes certain population segments, leading to skewed representations that undermine generalizability.[53] Mitigating this involves validating sampling strategies against the target population and diversifying sources to approximate true variability.
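The stratified sampling step mentioned above can be sketched in a few lines of Python; the population, the "group" attribute, and the 10% sampling fraction are invented for illustration. Each stratum contributes a share of the sample proportional to its share of the population.

    import random
    from collections import Counter

    random.seed(0)
    # Hypothetical population: 75% "urban" records, 25% "rural" records.
    population = [{"id": i, "group": "urban" if i % 4 else "rural"} for i in range(1000)]

    def stratified_sample(records, key, fraction):
        strata = {}
        for r in records:                      # partition the population into strata
            strata.setdefault(r[key], []).append(r)
        sample = []
        for members in strata.values():        # draw proportionally from each stratum
            k = max(1, round(len(members) * fraction))
            sample.extend(random.sample(members, k))
        return sample

    sample = stratified_sample(population, key="group", fraction=0.1)
    print(len(sample), Counter(r["group"] for r in sample))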
Data Cleaning and Preparation
Data cleaning and preparation involve a series of systematic processes applied to raw datasets to identify, correct, or remove errors, inconsistencies, and inaccuracies, transforming them into reliable formats suitable for subsequent analysis.[54] These steps are essential post-creation activities that address issues arising from data collection, preserving the dataset's integrity without altering its underlying meaning.[55]

A primary step in data cleaning is handling missing values, which occur when data points are absent due to collection errors or non-responses. Common imputation methods include mean substitution, in which missing values are replaced with the average of the observed values for that feature, preserving central tendency while introducing minimal bias in symmetric distributions.[56] More advanced techniques, such as multiple imputation by chained equations, generate several plausible datasets to account for uncertainty, but mean substitution remains a widely adopted baseline for its simplicity and effectiveness in preliminary preparation.[57]

Outlier detection is another critical step, mitigating the influence of anomalous data points that can skew results. The z-score method calculates the deviation of each value from the mean in standard deviation units, with thresholds typically set at ±3 identifying potential outliers under the assumption of approximate normality.[58] This statistical approach, rooted in standard deviation principles, allows robust identification, though it requires caution with small samples, where z-scores can misrepresent how extreme a value really is.[58]

Normalization, or feature scaling, ensures variables contribute equally to analysis by adjusting their ranges, preventing dominance by those with larger magnitudes. Techniques like min-max scaling transform data to the [0, 1] interval using the formula x' = \frac{x - \min(x)}{\max(x) - \min(x)}, which is particularly useful for distance-based algorithms.[59] Z-score standardization, subtracting the mean and dividing by the standard deviation (x' = \frac{x - \mu}{\sigma}), centers data around zero with unit variance, enhancing compatibility with gradient-based methods.[59] The choice of scaling affects model performance, as empirical studies demonstrate varying effectiveness across datasets and algorithms.[59]

Additional techniques encompass deduplication to eliminate redundant records, often via hashing unique identifiers or fuzzy matching for near-duplicates, reducing storage and improving query efficiency.[60] Format conversion standardizes disparate representations, such as unifying date formats or encoding categorical variables, facilitating interoperability across tools.[61] Validation against schemas enforces structural rules, checking data types, required fields, and constraints using formal specifications like JSON Schema to flag non-conformant entries early.[62]

The importance of these processes lies in their direct impact on the reliability of analysis: unclean data can propagate errors, leading to biased inferences or model failures. For instance, high data completeness, measured as the percentage of non-missing values across essential fields, correlates with improved predictive accuracy. Effective cleaning thus enhances overall data quality, minimizing downstream risks and supporting trustworthy outcomes in statistical and machine learning applications.[54]
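The core numeric steps above can be sketched on a toy column with pandas; the values are invented, and with so few points the extreme value does not actually exceed the ±3 z-score threshold, echoing the small-sample caution noted earlier.

    import numpy as np
    import pandas as pd

    s = pd.Series([12.0, 15.0, np.nan, 14.0, 13.0, 95.0])

    # 1. Mean imputation of the missing value.
    s_filled = s.fillna(s.mean())

    # 2. Z-score outlier flagging at |z| > 3 (assumes approximate normality).
    z = (s_filled - s_filled.mean()) / s_filled.std()
    outliers = s_filled[z.abs() > 3]

    # 3. Min-max scaling to the [0, 1] interval.
    minmax = (s_filled - s_filled.min()) / (s_filled.max() - s_filled.min())

    # 4. Z-score standardization (zero mean, unit variance).
    standardized = (s_filled - s_filled.mean()) / s_filled.std()

    print(s_filled.tolist())
    print(outliers.tolist())
    print(minmax.round(2).tolist())
    print(standardized.round(2).tolist())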
Applications
Statistical Analysis
Statistical analysis represents a foundational application of data sets, enabling researchers to derive insights from collected observations through systematic examination. In this context, data sets serve as the empirical foundation both for summarizing patterns within the data and for making generalizations beyond the observed sample. Proper preparation of data sets, such as cleaning and handling missing values, is essential to ensure the reliability of subsequent analyses.[63]

Descriptive statistics provide methods to summarize and characterize the key features of a data set without making inferences about a larger population. Measures of central tendency, including the mean (the arithmetic average of the values), the median (the middle value when the data are ordered), and the mode (the most frequent value), quantify the typical or central value in the data set.[64] These are complemented by measures of dispersion, such as the standard deviation, the square root of the average squared deviation from the mean, which indicates the spread or variability within the data set.[65] For instance, in a data set of exam scores, the mean might reveal the average performance, while the standard deviation highlights the consistency of results across students.[66]

Inferential statistics extend this by using data sets to test hypotheses and estimate population parameters, allowing conclusions about broader phenomena based on sample evidence. Hypothesis testing, such as the t-test, compares means between groups or against a hypothesized value to determine whether observed differences are statistically significant, often under the null hypothesis of no effect.[67] Regression analysis models relationships between variables; in simple linear regression, the model takes the form y = mx + b, where y is the dependent variable, x is the independent variable, m is the slope representing the change in y per unit change in x, and b is the y-intercept. Fitting estimates m and b by ordinary least squares, minimizing the sum of squared residuals between observed and predicted values.[68] This approach is widely used to predict outcomes or assess associations, such as linking study hours (x) to test scores (y).[69]

Software tools facilitate these analyses on structured data sets, streamlining computation and visualization. The R programming language offers an integrated environment for statistical computing, supporting functions for descriptive summaries (e.g., mean(), sd()) and inferential tests (e.g., t.test(), and lm() for regression).[70] Similarly, Python's pandas library provides data frames for efficient manipulation of tabular data, enabling quick calculation of summary statistics via methods like describe() and integration with statistical functions for hypothesis testing and modeling.[71]
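A compact worked example in Python (pandas, NumPy, and SciPy, with invented study-hours and exam-score data) ties these pieces together: a descriptive summary, a two-sample t-test, and a least-squares fit of y = mx + b.

    import numpy as np
    import pandas as pd
    from scipy import stats

    scores = pd.DataFrame({
        "hours": [1, 2, 3, 4, 5, 6, 7, 8],
        "score": [52, 55, 61, 64, 70, 72, 79, 83],
    })

    # Descriptive statistics: mean, standard deviation, and quartiles per column.
    print(scores.describe())

    # Inferential statistics: two-sample t-test comparing two halves of the scores.
    result = stats.ttest_ind(scores["score"][:4], scores["score"][4:])
    print(f"t = {result.statistic:.2f}, p = {result.pvalue:.4f}")

    # Simple linear regression y = m*x + b fitted by ordinary least squares.
    m, b = np.polyfit(scores["hours"], scores["score"], 1)
    print(f"predicted score = {m:.2f} * hours + {b:.2f}")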