Database theory
Database theory is the mathematical and logical study of finite structures underlying databases, focusing on formal representations, query processing, and data management principles to ensure efficient, correct, and scalable information handling in computer systems.[1] It encompasses the foundational principles that guide the design, analysis, and optimization of database management systems (DBMS), drawing from areas such as logic, algebra, and complexity theory to address challenges in data modeling, storage, retrieval, and integrity.[2]

The field originated in the late 1960s with Edgar F. Codd's introduction of the relational model, which proposed representing data as relations (tables) connected via keys, enabling declarative querying without procedural navigation of complex links.[3] This model, detailed in Codd's seminal 1970 paper, revolutionized database design by emphasizing data independence (the separation of logical views from physical storage) and laid the groundwork for relational algebra, a procedural query language using operations such as selection, projection, and join.[3] Codd's work addressed limitations of earlier hierarchical and network models, such as the CODASYL approach, by providing a mathematically rigorous framework that supports normalization to minimize redundancy and anomalies.[3]

Key areas in database theory include dependency theory, which analyzes functional, multivalued, and join dependencies to ensure data consistency; query optimization and equivalence, which explores how to transform queries for efficiency while preserving semantics; and constraint satisfaction, incorporating integrity rules such as primary keys and foreign keys.[4] More advanced topics extend to non-relational paradigms, such as semistructured data models for XML or JSON, and theoretical foundations for distributed databases, including transactional guarantees such as the ACID properties.[4] The field continues to evolve, integrating concepts from graph theory for NoSQL systems and probabilistic models for uncertain data, influencing modern applications in big data and machine learning.

History and Development
Origins in the 1960s and 1970s
The origins of database theory in the 1960s and 1970s were shaped by the need to manage growing volumes of data in early computing environments, where file-based systems and punched-card storage led to significant challenges such as data redundancy and inconsistency due to duplicated information across separate files for different applications.[5] These limitations prompted innovations in structured data management, transitioning from ad hoc file handling to more systematic approaches that could support shared data access in business and scientific computing.[6]

A pivotal early development was Charles Bachman's creation of the Integrated Data Store (IDS) in 1963 at General Electric, recognized as the first direct-access database management system (DBMS).[7] IDS introduced the network model, utilizing set types to represent owner-member relationships between data records, allowing navigation through linked structures rather than sequential file scans.[8] This practical system addressed the inefficiencies of prior storage methods by enabling multiple pathways to data, influencing subsequent standards. Building on IDS, the Conference on Data Systems Languages (CODASYL) Data Base Task Group (DBTG) formalized the network model through conferences starting in 1969, publishing initial specifications in October of that year that defined a schema for interconnected record types and navigational queries.[9]

In parallel, hierarchical models emerged to organize data in tree-like structures, with IBM's Information Management System (IMS) released in 1968 for the System/360 mainframe, originally developed for NASA's Apollo program.[10] IMS emphasized parent-child relationships in a rigid hierarchy, facilitating efficient access for transaction processing but limiting flexibility for complex associations compared to network approaches.[11] These pre-relational models dominated early DBMS implementations, prioritizing performance in batch-oriented environments over declarative data independence.

The landscape shifted dramatically with Edgar F. Codd's seminal 1970 paper, "A Relational Model of Data for Large Shared Data Banks," which proposed representing data as mathematical relations (essentially tables whose rows are tuples and whose columns are attributes), grounded in set theory and predicate logic to eliminate physical navigation and reduce redundancy.[3] Codd's model aimed to insulate users from storage details and supported high-level querying through relational algebra, a procedural query language he introduced in the same paper.[3] By the late 1970s, Codd's principles had gained traction, with practical prototypes such as IBM's System R project (1974–1979) implementing relational principles and developing an early version of SQL, bridging theory and practice; Codd's contributions were recognized with the 1981 ACM Turing Award.[12][13] This culminated in his 1985 articulation of 12 rules (plus a zeroth rule) for true relational systems, which traced back to his efforts to ensure data integrity and independence in shared environments.[14]

Evolution Through the 1980s and Beyond
The 1980s marked a period of standardization and theoretical consolidation in database theory, building on the relational model's foundations. The American National Standards Institute (ANSI) published the first SQL standard (ANSI X3.135-1986) in 1986, which formalized query language syntax and semantics for relational databases, enabling portability across implementations. This was followed by the International Organization for Standardization (ISO) adopting SQL as an international standard (ISO 9075:1987) in 1987; in the same period, Codd's 12 rules for relational database management systems, outlined in 1985, served as a benchmark for how faithfully products implemented the relational model. Concurrently, refinements to normalization theory emerged, such as extensions to Boyce-Codd Normal Form (BCNF) that addressed dependency preservation issues in decomposition algorithms.[15] Additionally, deductive databases gained prominence through the introduction of Datalog in the early 1980s, a logic-based query language extending relational algebra to support recursive queries and knowledge representation.

Key textbooks synthesized these advancements, with Abraham Silberschatz and Henry F. Korth's Database System Concepts (first edition, 1986) providing a comprehensive framework integrating relational theory, query optimization, and system architectures.[16] Similarly, Jeffrey D. Ullman's Principles of Database and Knowledge-Base Systems (1988) explored theoretical underpinnings of relational and deductive systems, emphasizing complexity analysis and inference rules.[17]

In the 1990s, database theory expanded to accommodate emerging paradigms like object-orientation and semistructured data. The Object Database Management Group (ODMG) released its first standard in 1993 (ODMG-93), proposing an object-oriented data model that integrated classes, inheritance, and methods with relational querying via OQL, aiming to bridge object-oriented programming and database persistence. Toward the late 1990s, semistructured data models emerged to handle irregular formats like XML, with theoretical foundations laid in works such as Serge Abiteboul's analysis of object representation and query languages for tree-structured data.

The 2000s and beyond saw database theory adapt to distributed and large-scale systems, particularly with the rise of NoSQL models. Eric Brewer's CAP theorem, presented in 2000, articulated trade-offs in distributed systems among consistency, availability, and partition tolerance, influencing theoretical analyses of key-value stores and eventual consistency models. Big data paradigms further shaped the field, with the theoretical foundations of MapReduce, introduced by Jeffrey Dean and Sanjay Ghemawat in 2004, enabling parallel query processing over massive datasets through functional decomposition into map and reduce operations. In the 2010s, theoretical work on distributed query processing advanced, focusing on optimization techniques for federated and cloud-based systems, including cost-based planning across heterogeneous data sources.

Recent developments in the 2020s have emphasized quantum and privacy-enhanced paradigms. Quantum database theory explores superposition and entanglement for query acceleration, with foundational models proposed for quantum relational algebras and indexing structures.[18] Privacy-preserving query theory has advanced through homomorphic encryption and differential privacy integrations, enabling secure data analysis without exposing sensitive information, as detailed in protocols for quantum private queries.

Core Concepts
Data Models and Abstractions
Database theory employs data models as formal abstractions to represent the structure, semantics, and constraints of data, enabling the design of systems that manage information independently of specific hardware or software implementations. These models bridge the gap between real-world entities and computational storage, emphasizing concepts like relationships, hierarchies, and integrity rules to ensure efficient data organization and retrieval. Abstractions in this context allow developers and users to focus on logical representations without concern for underlying physical details, a principle central to modern database management systems (DBMS).[19]

A foundational framework for these abstractions is the ANSI/SPARC three-schema architecture, proposed by the ANSI/X3/SPARC Study Group on Database Management Systems in 1975. This architecture delineates three levels: the external schema, which provides tailored user views of the data; the conceptual schema, which defines the overall logical structure shared across users; and the internal schema, which describes the physical storage, indexing, and access paths. By separating these levels, the architecture facilitates modular design, where changes at one level minimally impact others, promoting scalability and maintainability in DBMS.[20]

Key to this architecture are the concepts of data independence, which shield applications from storage or structural modifications. Logical data independence ensures that alterations to the conceptual schema, such as adding new entity types, do not require changes to external views, allowing users to maintain consistent interfaces. Physical data independence, conversely, permits optimizations in the internal schema, like reorganizing file structures or switching storage devices, without affecting the conceptual or external levels. These independences reduce development costs and enhance system flexibility, as evidenced in early DBMS implementations adhering to the framework.[21]

Various data models emerged to operationalize these abstractions, each suited to different complexity levels in data relationships. The hierarchical model, pioneered by IBM in 1968 with the Information Management System (IMS), structures data as a tree with parent-child linkages, ideal for one-to-many associations like organizational charts or bills of materials. IMS, initially developed for NASA's Apollo program, represented segments as nodes in a rooted hierarchy, enforcing strict navigation from root to leaves.[10] Building on this, the network model, formalized in the CODASYL Database Task Group (DBTG) report of 1971, extends to graph-like structures using record types connected via sets, supporting many-to-many relationships through owner-member pointers. This model, implemented in systems like the Integrated Data Store (IDS), allowed more flexible querying than hierarchies but required explicit pointer management, complicating maintenance.[22]

The relational model, introduced by Edgar F. Codd in 1970, abstracts data into tables (relations) composed of rows (tuples) and columns (attributes), drawing from mathematical set theory to eliminate physical pointers in favor of declarative queries. As a prominent logical data model, it underpins SQL-based systems by treating relations as subsets of Cartesian products over domains.[23] Complementing these, the entity-relationship (ER) model, developed by Peter P. Chen in 1976, serves as a high-level conceptual tool for capturing real-world semantics through entities (objects like "employee"), attributes (properties like "name"), and relationships (associations like "works in"). Unlike implementation-focused models, ER emphasizes diagrammatic representation to model domain knowledge before mapping to physical structures.[24]

Mathematical foundations underpin these models, rooted in set theory where domains represent atomic value sets (e.g., integers or strings), tuples are finite ordered sequences from those domains, and relations form sets of such tuples without duplicates. Value-based models, exemplified by the relational approach, focus on immutable data values without inherent identity, ensuring operations like union and projection preserve set properties. In contrast, object-based models introduce unique identifiers and behaviors, allowing encapsulation of complex types like multimedia, though they complicate equality comparisons.[23][19]

Views further exemplify abstractions by acting as virtual tables derived from queries on base relations, presenting filtered or joined data without storing it physically. In relational systems, views enhance security by restricting access and simplify complex schemas, aligning with the external schema in ANSI/SPARC to provide user-specific abstractions.[22]

Within the ER model, cardinality constraints refine relationship semantics: one-to-one (1:1) limits each entity instance to a single counterpart (e.g., person to passport); one-to-many (1:N) allows a single source to link multiple targets (e.g., department to employees); and many-to-many (M:N) permits multiple bidirectional associations (e.g., students to courses), often resolved via intermediate entities. Weak entities, lacking full primary keys, rely on partial discriminators and owner entity keys for uniqueness (e.g., a dependent tied to an employee), enforcing existence dependencies. Aggregation elevates a relationship to entity status, enabling it to participate in higher-level relationships (e.g., aggregating "project assignment" to relate to "budget approval"), thus modeling nested semantics hierarchically.[21]
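
To make the view abstraction concrete, the following Python sketch (illustrative table and attribute names, not tied to any particular DBMS) computes a virtual relation from base relations on each access, hiding a sensitive attribute much as an external-schema view would:

```python
# Minimal sketch: a view as a virtual relation recomputed on demand from
# base relations. Table and attribute names are illustrative assumptions.

employees = [
    {"emp_id": 1, "name": "Ann", "dept_id": 10, "salary": 90_000},
    {"emp_id": 2, "name": "Ben", "dept_id": 20, "salary": 70_000},
]
departments = [
    {"dept_id": 10, "dept_name": "HR"},
    {"dept_id": 20, "dept_name": "IT"},
]

def employee_directory_view():
    """Virtual table: join the base relations and omit the salary attribute."""
    dept_names = {d["dept_id"]: d["dept_name"] for d in departments}
    return [{"name": e["name"], "dept_name": dept_names[e["dept_id"]]}
            for e in employees]

# The view stores nothing; each access recomputes it from current base data.
print(employee_directory_view())
# [{'name': 'Ann', 'dept_name': 'HR'}, {'name': 'Ben', 'dept_name': 'IT'}]
```

Because the derived rows are recomputed rather than stored, changes to the base relations are reflected automatically, mirroring the data-independence goals discussed above.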
Database Schemas and Instances

In database theory, the schema represents the static, declarative specification of the database structure, defining elements such as table names, attributes, data types, and relationships, while the instance comprises the dynamic, actual data values that populate this structure at any given moment. This distinction ensures that the schema provides a blueprint for data organization independent of the specific content, allowing multiple instances to conform to the same schema over time. For example, in a relational database, the schema might define an "Employees" relation with attributes like EmployeeID (integer, primary key) and Name (string), whereas an instance would contain rows such as (101, "Alice Smith").[23]

Schemas are categorized into types based on the ANSI/SPARC three-level architecture, which separates concerns for better abstraction and maintenance. The logical schema, also known as the conceptual schema, describes the overall structure in terms of relations, attributes, and constraints, hiding physical storage details from users. The physical schema, or internal schema, specifies storage aspects such as file organization, indexing strategies, and access paths to optimize performance on hardware. This architecture promotes data independence, where changes to the physical schema do not affect the logical view.[20]

Integrity constraints are rules enforced by the schema to maintain data quality and consistency across instances. Domain constraints restrict attribute values to predefined types or ranges, such as ensuring ages are integers between 0 and 150. Key constraints include primary keys, which uniquely identify tuples in a relation, and foreign keys, which link relations by referencing keys in other relations. Referential integrity ensures that foreign key values match existing primary keys in referenced relations, preventing orphaned data. These constraints are typically checked during insertions, updates, or deletions to uphold the schema's semantic rules.[23]

Update anomalies arise when schema designs lead to inconsistencies during modifications to the instance, often due to redundant data storage or improper relationship modeling. Insertion anomalies occur when adding new data requires extraneous information or is blocked by missing dependencies, such as the inability to record a new department without an employee. Deletion anomalies happen when removing a tuple eliminates unrelated but valuable data, like losing supplier details upon deleting the last order. Modification anomalies result from updating one instance of repeated data without propagating changes elsewhere, causing discrepancies. These issues highlight the need for schemas that align structure with intended data behaviors to minimize such risks.[25]

Dynamic aspects of schemas and instances involve managing changes over time to accommodate evolving requirements without data loss. Schema evolution includes operations like adding or removing attributes, altering data types, or merging relations, often requiring compatibility rules to preserve existing instances. Instance versioning tracks historical states of data, enabling queries across temporal snapshots while adhering to the current schema. These mechanisms ensure long-term adaptability, as seen in systems that support backward-compatible schema changes to avoid application disruptions.[26]
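
As a rough illustration of the schema/instance distinction and of domain and key constraints, the sketch below (an assumed, simplified schema description, not any DBMS catalog format) checks whether candidate instances conform to a declared schema:

```python
# Minimal sketch of the schema/instance distinction: the schema is a static
# description (attributes, domains, key), while instances are data that must
# conform to it. The dictionary layout and checks are illustrative assumptions.

employees_schema = {
    "name": "Employees",
    "attributes": {"EmployeeID": int, "Name": str, "Age": int},
    "primary_key": ["EmployeeID"],
    "domain_checks": {"Age": lambda v: 0 <= v <= 150},
}

def conforms(schema, instance):
    """Check attribute types, domain rules, and primary-key uniqueness."""
    for t in instance:
        for attr, expected_type in schema["attributes"].items():
            if not isinstance(t[attr], expected_type):
                return False
            check = schema["domain_checks"].get(attr)
            if check and not check(t[attr]):
                return False
    keys = [tuple(t[a] for a in schema["primary_key"]) for t in instance]
    return len(keys) == len(set(keys))

instance = [{"EmployeeID": 101, "Name": "Alice Smith", "Age": 34},
            {"EmployeeID": 102, "Name": "Bob Lee", "Age": 29}]

print(conforms(employees_schema, instance))   # True: this instance fits the schema
bad = instance + [{"EmployeeID": 101, "Name": "Eve", "Age": 200}]
print(conforms(employees_schema, bad))        # False: Age 200 breaks the domain rule
                                              # (the duplicate key 101 would also fail)
```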
Relational Model
Relational Structure and Keys
In the relational model, a database is conceptualized as a collection of relations, where each relation corresponds to a table composed of attributes (columns) and tuples (rows).[3] Each attribute is defined over a specific domain, which is the set of permissible values for that attribute, ensuring type consistency and semantic meaning.[3] Formally, a relation of degree n is a finite subset of the Cartesian product of n domains, meaning it consists of ordered n-tuples where the i-th component of each tuple belongs to the i-th domain.[3] Attributes within a relation are ordered and labeled by domain names for clarity, and the components of each tuple follow this attribute ordering; the relation itself is an unordered set of such tuples with no duplicates allowed, preserving the mathematical set-theoretic foundation of the model.[3]

Keys play a central role in uniquely identifying and linking data within relations. A superkey is any set of one or more attributes whose values collectively uniquely identify each tuple in the relation, such that no two tuples share the same combination of values for those attributes. A candidate key, also known as a minimal superkey, is a superkey with no proper subset that is itself a superkey, ensuring irredundancy in identification. The primary key is a specifically chosen candidate key designated for each relation to serve as the principal means of uniquely identifying its tuples, often enforced through system constraints to maintain uniqueness. A foreign key, in contrast, is a set of attributes in one relation that refers to the primary key (or a candidate key) of another relation, enabling the establishment of relationships between tables.

To ensure data consistency, the relational model incorporates key-based integrity constraints. Entity integrity mandates that no component of a primary key can contain a null value in any tuple, as this would undermine the ability to uniquely identify entities. Referential integrity requires that for every non-null value in a foreign key attribute, there must exist a matching value in the corresponding primary key of the referenced relation, preventing orphaned references and maintaining logical consistency across relations. These constraints, rooted in the model's foundational principles, are typically enforced by the database management system to safeguard the structural integrity of the data.[3]

Consider a simple example involving employee and department relations. The Employee relation might include attributes EmpID (primary key, domain: integers), Name (domain: strings), and DeptID (foreign key referencing Department). The Department relation includes DeptID (primary key, domain: integers) and DeptName (domain: strings). A sample instance of the Employee relation could be:

| EmpID | Name | DeptID |
|---|---|---|
| 101 | Alice | 10 |
| 102 | Bob | 20 |
| 103 | Carol | 10 |

A corresponding instance of the Department relation could be:

| DeptID | DeptName |
|---|---|
| 10 | HR |
| 20 | IT |
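
Given the sample instances in the two tables above, these key constraints can be verified mechanically; the following Python sketch (illustrative helper functions, not part of any DBMS) checks entity integrity on EmpID and referential integrity from Employee.DeptID to Department.DeptID:

```python
# Minimal sketch: checking entity and referential integrity on the sample
# Employee and Department instances shown above.

employees = [
    {"EmpID": 101, "Name": "Alice", "DeptID": 10},
    {"EmpID": 102, "Name": "Bob", "DeptID": 20},
    {"EmpID": 103, "Name": "Carol", "DeptID": 10},
]
departments = [
    {"DeptID": 10, "DeptName": "HR"},
    {"DeptID": 20, "DeptName": "IT"},
]

def entity_integrity(relation, primary_key):
    """No nulls in any primary-key component, and key values are unique."""
    keys = [tuple(t[a] for a in primary_key) for t in relation]
    return all(None not in k for k in keys) and len(keys) == len(set(keys))

def referential_integrity(referencing, foreign_key, referenced, referenced_key):
    """Every non-null foreign-key value has a matching key in the referenced relation."""
    targets = {tuple(t[a] for a in referenced_key) for t in referenced}
    values = [tuple(t[a] for a in foreign_key) for t in referencing
              if all(t[a] is not None for a in foreign_key)]
    return all(v in targets for v in values)

print(entity_integrity(employees, ["EmpID"]))                                  # True
print(referential_integrity(employees, ["DeptID"], departments, ["DeptID"]))  # True
```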
Relational Algebra Operations
Relational algebra provides a procedural framework for querying and manipulating relations in the relational model, consisting of a set of operations that take one or more relations as input and produce a new relation as output. Introduced by Edgar F. Codd, it serves as a theoretical foundation for query languages like SQL, enabling the composition of complex queries through systematic combination of basic operations.[3] The algebra's operations are grounded in set theory, ensuring that results remain within the domain of relations, which supports its use in defining the expressive power of relational query languages.[27]

The core operations include selection, projection, union, set difference, Cartesian product, and rename. Selection, denoted \sigma_{\theta}(R), retrieves tuples from relation R that satisfy a condition \theta, formally defined as \sigma_{\theta}(R) = \{ t \in R \mid t \text{ satisfies } \theta \}. For example, \sigma_{A=5}(R) yields all tuples in R where attribute A equals 5.[3] Projection, denoted \pi_{A_1, \dots, A_k}(R), selects specified attributes from R and eliminates duplicates, given by \pi_{A_1, \dots, A_k}(R) = \{ (t[A_1], \dots, t[A_k]) \mid t \in R \}, where duplicates are implicitly removed to maintain set semantics.[3] Union, R \cup S, combines tuples from compatible relations R and S (same arity and domain types), producing \{ t \mid t \in R \lor t \in S \}, while set difference R - S yields \{ t \in R \mid t \notin S \}.[27] The Cartesian product R \times S generates all possible tuple pairs, resulting in a relation of degree equal to the sum of the degrees of R and S, formally R \times S = \{ (t_r, t_s) \mid t_r \in R, t_s \in S \}.[27] Rename, denoted \rho_{newname}(R), reassigns names to relations or attributes for clarity in compositions, preserving the underlying tuples.[3]

Derived operations extend the core set to handle common query patterns. Intersection can be expressed as R \cap S = R - (R - S), yielding tuples common to both compatible relations.[27] The theta-join, R \bowtie_{\theta} S, combines R and S by selecting pairs satisfying condition \theta from their Cartesian product, defined as R \bowtie_{\theta} S = \sigma_{\theta}(R \times S); a natural join occurs when \theta equates common attributes.[27] Division, R \div S, addresses "all" queries, such as finding values in R associated with every tuple in S, formally R \div S = \{ t \in \pi_K(R) \mid \forall u \in S, (t, u) \in R \}, where K is the set of attributes of R that do not appear in S.[27]

A key property of relational algebra is its closure: every operation outputs a relation, allowing arbitrary nesting of operations to form complex expressions without leaving the relational framework.[27] This closure facilitates the construction of queries like identifying employees in a specific department: given an Employees relation (emp_id, name, dept_id) and a Departments relation (dept_id, dept_name), the query \pi_{name}(\sigma_{dept_name = 'Sales'}(Employees \bowtie_{dept_id} Departments)) first joins on department ID, selects the 'Sales' department, and projects employee names.[3]

In terms of expressiveness, relational algebra is relationally complete, meaning it can express any query definable in the relational calculus, with a direct translation algorithm establishing their equivalence.[27] This completeness ensures that procedural algebraic expressions capture the full declarative power of first-order logic-based queries on relations.[27]
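
As an informal illustration of the closure property, the sketch below (relations modeled as lists of attribute-to-value dictionaries; the operator names are assumptions, not standard library functions) composes selection, projection, and natural join to express the 'Sales' example query from the text:

```python
# Minimal sketch of core relational-algebra operators over relations
# represented as lists of dictionaries (attribute name -> value).

def select(relation, predicate):
    """sigma_theta(R): keep tuples satisfying the predicate."""
    return [t for t in relation if predicate(t)]

def project(relation, attributes):
    """pi_A1..Ak(R): keep the listed attributes, eliminating duplicates."""
    seen, result = set(), []
    for t in relation:
        reduced = tuple((a, t[a]) for a in attributes)
        if reduced not in seen:
            seen.add(reduced)
            result.append(dict(reduced))
    return result

def natural_join(r, s):
    """R natural-join S: combine tuples agreeing on all shared attributes."""
    common = set(r[0]) & set(s[0]) if r and s else set()  # assumes uniform tuples
    return [{**tr, **ts} for tr in r for ts in s
            if all(tr[a] == ts[a] for a in common)]

employees = [{"emp_id": 1, "name": "Ann", "dept_id": 10},
             {"emp_id": 2, "name": "Ben", "dept_id": 20}]
departments = [{"dept_id": 10, "dept_name": "Sales"},
               {"dept_id": 20, "dept_name": "IT"}]

# pi_name( sigma_{dept_name='Sales'}( Employees natural-join Departments ) )
sales_staff = project(
    select(natural_join(employees, departments),
           lambda t: t["dept_name"] == "Sales"),
    ["name"])
print(sales_staff)  # [{'name': 'Ann'}]
```

Because each operator returns a relation of the same form as its inputs, the operators compose freely, which is exactly the closure property the text describes.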
Query Languages and Logic
Relational Calculus and Datalog
Relational calculus provides a declarative foundation for querying relational databases, expressing what data to retrieve rather than how to retrieve it, using first-order logic predicates over relations. Introduced as part of the relational model, it contrasts with procedural approaches by focusing on logical conditions that tuples must satisfy. Tuple relational calculus (TRC) and domain relational calculus (DRC) are the primary variants, both ensuring expressive power equivalent to relational algebra while restricting to safe, domain-independent queries that yield finite results on finite databases.[27]

In TRC, a query is formulated as \{ t \mid \phi(t) \}, where t is a free tuple variable ranging over tuples in the database, and \phi(t) is a first-order logic formula built from atomic predicates (e.g., t \in R for relation R, or t.A = u.B for attribute comparisons), connected by logical operators \land, \lor, \lnot, and quantifiers \forall, \exists. The formula \phi(t) may involve bound variables ranging over relations, but safety requires domain independence: all values in the result must originate from the active domain of the database instance, preventing infinite or undefined outputs from unrestricted negation or existential quantification. This restriction ensures practical computability, as unrestricted TRC can express non-domain-independent queries that produce infinite relations even on finite inputs.[27][28]

DRC shifts the focus to individual attribute values (domains) rather than entire tuples, with queries of the form \{ \langle x_1, x_2, \dots, x_n \rangle \mid \phi(x_1, x_2, \dots, x_n) \}, where each x_i ranges over a specific domain (e.g., employee IDs or department names), and \phi is a first-order formula using atomic predicates like P(x_1, \dots, x_m) for relation P. For instance, to retrieve pairs of employees and their departments, the query might be \{ \langle x_1, x_2 \rangle \mid \exists y \, (Employee(x_1, y) \land Dept(y, x_2)) \}, where x_1 and x_2 are free variables over employee and department domains, respectively. Like TRC, DRC enforces domain independence for safety, limiting variables to values appearing in the database to avoid extraneous results. DRC was proposed to emphasize domain structures explicitly, aiding in type-safe query formulation.[29][28]

Codd's theorem establishes the theoretical equivalence between relational calculus (both TRC and DRC, when restricted to domain-independent expressions) and relational algebra: any query expressible in one can be translated into the other with identical semantics on relational databases. This relational completeness criterion measures a query language's power against the full capabilities of first-order logic over relations, excluding features like aggregation or ordering. The theorem underpins the design of practical query languages, confirming that declarative logic matches procedural operations in expressiveness without loss.[27]

Datalog extends relational calculus principles into a rule-based logic programming language for deductive databases, enabling inference over base relations to derive new facts recursively. As a subset of Prolog without function symbols or complex terms, Datalog programs consist of facts (e.g., Parent(Alice, Bob).) and rules of the form Head :- Body., where Head is a positive atom (e.g., Ancestor(X,Y)) and Body is a conjunction of atoms (positive or negated literals, e.g., Parent(X,Z) \land Ancestor(Z,Y)). Evaluation uses bottom-up fixed-point semantics, iteratively applying rules until no new facts are derived, capturing the least Herbrand model.[28]

To handle negation safely, Datalog employs stratified negation, partitioning predicates into strata where negation only references lower strata, avoiding recursion through negative literals that could lead to non-monotonic or unstable semantics. Stratification ensures a unique stable model computable via fixed-point iteration, preserving domain independence and decidability. Unstratified negation introduces well-founded semantics for broader expressiveness but increases complexity.[28]

Datalog excels in applications requiring recursive queries, such as computing transitive closures in graphs or hierarchies. For example, the ancestor relation can be defined as:

\begin{align*}
&Ancestor(X,Y) :- Parent(X,Y). \\
&Ancestor(X,Y) :- Parent(X,Z), Ancestor(Z,Y).
\end{align*}

This program iteratively derives all ancestry paths, enabling queries over relational data like family trees or organizational structures without explicit loops. Such recursion addresses limitations in basic relational calculus, supporting deductive reasoning in knowledge bases.[28]
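
The fixed-point evaluation described above can be sketched in a few lines; the following Python fragment (hypothetical parent facts; naive evaluation rather than the semi-naive strategy real engines use) derives the ancestor relation by applying the two rules until no new facts appear:

```python
# Minimal sketch of naive bottom-up (fixed-point) evaluation for the
# ancestor program above. Facts are plain tuples; the loop applies both
# rules until no new facts are derived.

parent = {("alice", "bob"), ("bob", "carol"), ("carol", "dave")}

ancestor = set(parent)            # rule 1: Ancestor(X,Y) :- Parent(X,Y).
while True:
    # rule 2: Ancestor(X,Y) :- Parent(X,Z), Ancestor(Z,Y).
    derived = {(x, y) for (x, z) in parent for (z2, y) in ancestor if z == z2}
    new_facts = derived - ancestor
    if not new_facts:             # fixed point reached: least model computed
        break
    ancestor |= new_facts

print(sorted(ancestor))
# [('alice', 'bob'), ('alice', 'carol'), ('alice', 'dave'),
#  ('bob', 'carol'), ('bob', 'dave'), ('carol', 'dave')]
```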
Query Optimization Principles

Query optimization in database theory involves systematically selecting the most efficient execution plan for a given query from a multitude of possible alternatives, aiming to minimize resource consumption such as CPU time and I/O operations while preserving query semantics.[30] This process is crucial in relational database management systems (RDBMS), where declarative queries expressed in languages like SQL must be translated into procedural operations on data structures. Theoretical foundations emphasize balancing query complexity with estimation accuracy to avoid suboptimal plans that could lead to exponential performance degradation.[30]

The optimization process typically unfolds in two primary phases: algebraic optimization, also known as logical optimization, and physical optimization. In the algebraic phase, the initial query representation in relational algebra is rewritten using equivalence rules to produce semantically identical but potentially more efficient expressions, focusing on restructuring without considering storage details.[30] Physical optimization follows, mapping the logical plan to specific implementation strategies, such as selecting algorithms for operators and access paths like indexes or sequential scans, to generate a concrete execution plan.[30] These phases ensure that optimizations are both transformationally sound and practically viable.

Equivalence transformations form the core of algebraic optimization, leveraging rules derived from relational algebra properties to reorder or simplify operations. A fundamental rule is the pushdown of selections, where conjunctive predicates are applied as early as possible to reduce intermediate result sizes; for instance, \sigma_{c_1 \land c_2}(R) \equiv \sigma_{c_1}(\sigma_{c_2}(R)), allowing independent selections to be cascaded before costlier joins.[30] Join reordering exploits associativity and commutativity, such as \pi_{A,B}(R \bowtie S) \equiv \pi_{A,B}(S \bowtie R), to explore different operator trees, as pioneered in the System R optimizer developed by IBM in the 1970s, which used dynamic programming to enumerate join orders while avoiding invalid sequences like Cartesian products between non-connected relations.[31] These transformations generate a search space of equivalent plans, often represented as left-deep or bushy trees, to facilitate subsequent cost-based selection.[30]

Cost models provide the quantitative framework for comparing plans by estimating the resources required for execution, primarily focusing on I/O costs due to their dominance in traditional disk-based systems. Estimates incorporate selectivity (the fraction of tuples satisfying a predicate), cardinality (the number of tuples in relations or intermediates), and device-specific parameters like page size.[30] For a nested-loop join between relations R and S, a basic cost model approximates the total as |R| \times |S|, assuming a full scan of S for each tuple in R, though refinements account for buffering and indexing to reduce inner probes.[31] Such models enable ranking plans, with I/O often weighted highest, as in System R's formula for join cost: C_{\text{join}} = C_{\text{outer}} + N_{\text{outer}} \times C_{\text{inner}}, where N_{\text{outer}} is the estimated cardinality of the outer relation.[31]

Search strategies enumerate the plan space generated by transformations, balancing optimality with computational feasibility given the exponential growth in alternatives for queries with multiple joins.
Exhaustive strategies, like dynamic programming in System R, systematically build optimal subplans bottom-up for all subsets of relations, achieving optimality for select-project-join (SPJ) queries at O(n \cdot 2^{n-1}) complexity, where n is the number of relations.[31] Heuristic approaches, such as greedy algorithms, approximate solutions by iteratively selecting the lowest-cost partial plan, pruning the space to handle larger queries efficiently but risking suboptimal results.[30] Randomized methods, including genetic algorithms, explore the space probabilistically by evolving populations of plans through mutation and crossover, offering scalability for complex queries beyond exhaustive enumeration.[30]

Accurate cardinality estimates underpin both cost models and search strategies, relying on database statistics to predict intermediate result sizes. Histograms partition attribute values into buckets to approximate data distributions, enabling selectivity estimates for predicates; for example, equi-depth histograms ensure uniform bucket sizes for uniform assumptions, while frequency histograms capture exact counts for skewed data.[30] Sampling methods complement histograms by drawing random subsets of data to build or update statistics dynamically, as explored in foundational work on relational sampling, which supports unbiased estimates for large relations with controlled error bounds.[32] In System R, basic statistics like relation cardinality and index key counts provide selectivity factors, such as F = 1 / \text{number of distinct keys} for equality predicates, forming the basis for propagating estimates through the plan.[31]
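
A toy version of cost-based plan selection is sketched below: it enumerates left-deep join orders by brute force (System R instead uses dynamic programming over relation subsets) and ranks them with the nested-loop formula quoted above. The cardinalities and selectivity factors are invented for illustration:

```python
from itertools import permutations

# Invented base-table cardinalities and join selectivity factors.
card = {"R": 1_000, "S": 10_000, "T": 100}
selectivity = {frozenset(("R", "S")): 0.001,
               frozenset(("S", "T")): 0.01,
               frozenset(("R", "T")): 0.05}

def estimated_join_size(n_outer, right, joined):
    """Cardinality estimate: |outer| * |right| * most selective connecting predicate."""
    sels = [selectivity[frozenset((a, right))]
            for a in joined if frozenset((a, right)) in selectivity]
    return n_outer * card[right] * (min(sels) if sels else 1.0)

def plan_cost(order):
    """Left-deep plan: each join adds C_outer + N_outer * C_inner
    (tuple counts stand in for page I/O in this toy model)."""
    cost, n_outer, joined = 0.0, card[order[0]], [order[0]]
    for right in order[1:]:
        cost += n_outer + n_outer * card[right]
        n_outer = estimated_join_size(n_outer, right, joined)
        joined.append(right)
    return cost

best = min(permutations(card), key=plan_cost)
print(best, f"cost ~ {plan_cost(best):,.0f}")   # e.g. ('T', 'S', 'R') cost ~ 11,010,100
```

Even on three relations the estimated costs differ by a factor of several, which is why accurate selectivity and cardinality statistics matter as much as the search strategy itself.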
Data Integrity and Normalization
Functional Dependencies
In relational database theory, a functional dependency (FD) is a constraint that specifies how the values of certain attributes determine the values of other attributes in a relation. Formally, for sets of attributes X and Y in a relation schema R, an FD X \to Y holds if, in every instance of R, any two tuples that agree on all attributes in X also agree on all attributes in Y. This concept was introduced by Edgar F. Codd as part of the foundational relational model to capture semantic relationships among data attributes.[23] FDs are classified as trivial if Y \subseteq X, in which case the dependency always holds regardless of the data instance, or non-trivial otherwise.[33]

A practical example of an FD arises in an employee relation schema with attributes {EmpID, Name, Dept, Salary}. Here, {EmpID} \to {Name, Dept} indicates that the employee ID uniquely determines the name and department, ensuring no two employees share the same ID but have different associated names or departments. Such FDs help identify candidate keys, which are minimal sets of attributes that functionally determine all other attributes in the relation.[34]

To reason about sets of FDs, denoted F, database theory employs inference rules known as Armstrong's axioms, established in 1974. These include the following rules; a short attribute-closure computation applying them is sketched after the list.

- Reflexivity: If Y \subseteq X, then X \to Y.
- Augmentation: If X \to Y, then for any set Z, XZ \to YZ.
- Transitivity: If X \to Y and Y \to Z, then X \to Z.
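
Armstrong's axioms are sound and complete for deriving the FDs implied by F; in practice, implication is usually tested by computing the closure X^+ of an attribute set rather than by enumerating derivations. A minimal Python sketch, assuming the employee FDs above plus a hypothetical Dept \to DeptHead dependency:

```python
# Minimal sketch: attribute closure X+ under a set of FDs, which applies
# Armstrong's axioms implicitly. The Dept -> DeptHead dependency is a
# hypothetical addition for illustration.

def closure(attrs, fds):
    """Return the set of attributes functionally determined by attrs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # if X+ already covers the left-hand side, add the right-hand side
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

fds = [({"EmpID"}, {"Name", "Dept"}),   # {EmpID} -> {Name, Dept} from the example
       ({"Dept"}, {"DeptHead"})]        # hypothetical extra dependency

print(closure({"EmpID"}, fds))
# {'EmpID', 'Name', 'Dept', 'DeptHead'} (set ordering may vary). An attribute
# set is a candidate key exactly when its closure covers the whole schema.
```

The same closure test underlies normalization checks such as verifying BCNF, where the left-hand side of every non-trivial FD must be a superkey.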