
Database theory

Database theory is the mathematical and logical study of the finite structures underlying databases, focusing on formal representations, query processing, and data management principles to ensure efficient, correct, and scalable information handling in computer systems. It encompasses the foundational principles that guide the design, implementation, and optimization of database management systems (DBMS), drawing on areas such as mathematical logic, set theory, and computational complexity to address challenges in data modeling, storage, retrieval, and integrity. The field originated in the late 1960s with Edgar F. Codd's introduction of the relational model, which proposed representing data as relations (tables) connected via keys, enabling declarative querying without procedural navigation of complex links. This model, detailed in Codd's seminal 1970 paper, revolutionized data management by emphasizing data independence, the separation of logical views from physical storage, and laid the groundwork for relational algebra, a procedural query language built from operations like selection, projection, and join. Codd's work addressed limitations of earlier hierarchical and network models, such as the CODASYL approach, by providing a mathematically rigorous foundation that supports normalization to minimize redundancy and anomalies.

Key areas in database theory include dependency theory, which analyzes functional, multivalued, and join dependencies to ensure data consistency; query optimization and equivalence, exploring how to transform queries for efficiency while preserving semantics; and constraint enforcement, incorporating integrity rules such as primary keys and foreign keys. More advanced topics extend to non-relational paradigms, such as semistructured models for XML or graph data, and to theoretical foundations for distributed databases, including consistency models like the ACID properties of transactions. The field continues to evolve, integrating concepts from logic programming for deductive systems and probabilistic models for uncertain data, influencing modern applications in data analytics and machine learning.

History and Development

Origins in the 1960s and 1970s

The origins of database theory in the 1960s and 1970s were shaped by the need to manage growing volumes of data in early computing environments, where file-based systems and punched-card storage led to significant challenges such as redundancy and inconsistency due to duplicated information across separate files for different applications. These limitations prompted innovations in structured data management, transitioning from ad hoc file handling to more systematic approaches that could support shared access in business and scientific computing. A pivotal early development was Charles Bachman's creation of the Integrated Data Store (IDS) in 1963 at General Electric, recognized as the first direct-access database management system (DBMS). IDS introduced the network model, utilizing set types to represent owner-member relationships between data records, allowing navigation through linked structures rather than sequential file scans. This practical system addressed the inefficiencies of prior storage methods by enabling multiple pathways to data, influencing subsequent standards. Building on IDS, the Conference on Data Systems Languages (CODASYL) Data Base Task Group (DBTG) formalized the network model through conferences starting in 1969, publishing initial specifications in October of that year that defined a schema for interconnected record types and navigational queries.

In parallel, hierarchical models emerged to organize data in tree-like structures, with IBM's Information Management System (IMS) released in 1968 for the System/360 mainframe, originally developed for NASA's Apollo program. IMS emphasized parent-child relationships in a rigid hierarchy, facilitating efficient access for predictable workloads but limiting flexibility for complex associations compared to network approaches. These pre-relational models dominated early DBMS implementations, prioritizing performance in batch-oriented environments over declarative querying.

The landscape shifted dramatically with Edgar F. Codd's seminal 1970 paper, "A Relational Model of Data for Large Shared Data Banks," which proposed representing data as mathematical relations, essentially tables with rows (tuples) and columns (attributes), grounded in set theory and predicate logic to eliminate physical navigation and reduce redundancy. Codd's model aimed to insulate users from storage details, enabling declarative querying via relational algebra, a procedural query language he introduced in the same paper. By the late 1970s, Codd's principles had gained traction, with practical prototypes such as IBM's System R project (1974–1979) implementing relational principles and developing an early version of SQL, bridging theory and practice; Codd's contributions were recognized with the 1981 ACM Turing Award. This work culminated in his 1985 articulation of 12 rules (plus a zeroth rule) for truly relational systems, which traced back to his efforts to ensure data integrity and independence in shared environments.

Evolution Through the 1980s and Beyond

The 1980s marked a period of standardization and consolidation in database theory, building on the relational model's foundations. The American National Standards Institute (ANSI) published the first SQL standard (ANSI X3.135-1986) in 1986, which formalized syntax and semantics for relational querying, enabling portability across implementations. This was followed by the International Organization for Standardization (ISO) adopting SQL as an international standard (ISO/IEC 9075:1987) in 1987, incorporating refinements aligned with the criteria for relational database management systems that Codd had outlined in 1985. Concurrently, refinements to normalization theory emerged, such as extensions to Boyce-Codd Normal Form (BCNF) that addressed dependency-preservation issues in decomposition algorithms. Additionally, deductive databases gained prominence through the introduction of Datalog in the early 1980s, a logic-based query language extending relational queries with recursion and knowledge representation. Key textbooks synthesized these advancements, with Abraham Silberschatz and Henry F. Korth's Database System Concepts (first edition, 1986) providing a comprehensive framework integrating relational theory, query optimization, and system architectures. Similarly, Jeffrey D. Ullman's Principles of Database and Knowledge-Base Systems (1988) explored theoretical underpinnings of relational and deductive systems, emphasizing complexity analysis and inference rules.

In the 1990s, database theory expanded to accommodate emerging paradigms like object orientation and semistructured data. The Object Database Management Group (ODMG) released its first standard in 1993 (ODMG-93), proposing an object model that integrated classes, inheritance, and methods with declarative querying via OQL, aiming to bridge object-oriented programming languages and database persistence. Toward the late 1990s, semistructured data models emerged to handle irregular formats like XML, with theoretical foundations laid in works such as Serge Abiteboul's analysis of object representation and query languages for tree-structured data.

The 2000s and beyond saw database theory adapt to distributed and large-scale systems, particularly with the rise of NoSQL models. Brewer's CAP theorem, presented in 2000, formalized trade-offs in distributed systems between consistency, availability, and partition tolerance, influencing theoretical analyses of key-value stores and eventual-consistency models. Big data paradigms further shaped the field, with the theoretical foundations of MapReduce, introduced by Jeffrey Dean and Sanjay Ghemawat in 2004, enabling parallel query processing over massive datasets through functional decomposition into map and reduce operations. In the 2010s, theoretical work on distributed query processing advanced, focusing on optimization techniques for federated and cloud-based systems, including cost-based planning across heterogeneous data sources.

Recent developments in the 2020s have emphasized quantum and privacy-enhanced paradigms. Quantum database theory explores superposition and entanglement for query acceleration, with foundational models proposed for quantum relational algebras and indexing structures. Privacy-preserving query theory has advanced through cryptographic integrations, enabling secure data analysis without exposing sensitive information, as detailed in protocols for quantum private queries.

Core Concepts

Data Models and Abstractions

Database theory employs data models as formal abstractions to represent the structure, semantics, and constraints of data, enabling the design of systems that manage information independently of specific hardware or software implementations. These models bridge the gap between real-world entities and computational representations, emphasizing concepts like relationships, hierarchies, and integrity rules to ensure efficient data organization and retrieval. Abstractions in this context allow developers and users to focus on logical representations without concern for underlying physical details, a principle central to modern database management systems (DBMS).

A foundational framework for these abstractions is the ANSI/SPARC three-schema architecture, proposed by the ANSI/X3/SPARC Study Group on Database Management Systems in 1975. This architecture delineates three levels: the external schema, which provides tailored user views of the data; the conceptual schema, which defines the overall logical structure shared across users; and the internal schema, which describes the physical storage, indexing, and access paths. By separating these levels, the architecture facilitates data independence, where changes at one level minimally impact others, promoting modularity and maintainability in DBMS design. Key to this architecture are the concepts of logical and physical data independence, which shield applications from storage or structural modifications. Logical data independence ensures that alterations to the conceptual schema, such as adding new entity types, do not require changes to external views, allowing users to maintain consistent interfaces. Physical data independence, conversely, permits optimizations in the internal schema, like reorganizing file structures or switching storage devices, without affecting the conceptual or external levels. These independences reduce development costs and enhance system flexibility, as evidenced in early DBMS implementations adhering to the framework.

Various data models emerged to operationalize these abstractions, each suited to different complexity levels in data relationships. The hierarchical model, pioneered by IBM in the late 1960s with the Information Management System (IMS), structures data as a tree with parent-child linkages, ideal for one-to-many associations like organizational charts or bills of materials. IMS, initially developed for NASA's Apollo program, represented segments as nodes in a rooted tree, enforcing strict traversal from root to leaves. Building on this, the network model, formalized in the CODASYL Database Task Group (DBTG) report of 1971, extends to graph-like structures using record types connected via sets, supporting many-to-many relationships through owner-member pointers. This model, implemented in systems like the Integrated Data Store (IDS), allowed more flexible querying than hierarchies but required explicit pointer management, complicating maintenance. The relational model, introduced by Edgar F. Codd in 1970, abstracts data into tables (relations) composed of rows (tuples) and columns (attributes), drawing on mathematical set theory to eliminate physical pointers in favor of declarative queries. As the most prominent logical data model, it underpins SQL-based systems by treating relations as subsets of Cartesian products over domains. Complementing these, the entity-relationship (ER) model, developed by Peter P. Chen in 1976, serves as a high-level conceptual tool for capturing real-world semantics through entities (objects like "employee"), attributes (properties like "name"), and relationships (associations like "works in"). Unlike implementation-focused models, ER emphasizes diagrammatic representation to model domain knowledge before mapping to physical structures.
Mathematical foundations underpin these models, rooted in set theory: domains represent sets of atomic values (e.g., integers or strings), tuples are finite ordered sequences drawn from those domains, and relations form sets of such tuples without duplicates. Value-based models, exemplified by the relational approach, focus on immutable data values without inherent identity, ensuring operations like union and projection preserve set properties. In contrast, object-based models introduce unique identifiers and behaviors, allowing encapsulation of complex types like multimedia, though they complicate equality comparisons. Views further exemplify abstraction by acting as virtual tables derived from queries on base relations, presenting filtered or joined data without storing it physically. In relational systems, views enhance security by restricting access and simplify complex schemas, aligning with the external schema level of ANSI/SPARC to provide user-specific abstractions.

Within the ER model, cardinality constraints refine relationship semantics: one-to-one (1:1) limits each instance to a single counterpart (e.g., an employee to an assigned office); one-to-many (1:N) allows a single source to link to multiple targets (e.g., a department to its employees); and many-to-many (M:N) permits multiple bidirectional associations (e.g., students to courses), often resolved via an intermediate relationship set. Weak entities, lacking full primary keys of their own, rely on a partial discriminator combined with the owner entity's key for identification (e.g., a dependent tied to an employee), enforcing existence dependencies. Aggregation elevates a relationship set to entity status, enabling it to participate in higher-level relationships (e.g., aggregating "project assignment" so it can relate to "budget approval"), thus modeling nested semantics hierarchically.

Database Schemas and Instances

In database theory, the schema represents the static, declarative specification of the database structure, defining elements such as table names, attributes, data types, and relationships, while the instance comprises the dynamic, actual data values that populate this structure at any given moment. This distinction ensures that the schema provides a blueprint for data organization independent of the specific content, allowing multiple instances to conform to the same schema over time. For example, in a relational database, the schema might define an "Employees" relation with attributes like EmployeeID (integer, primary key) and Name (string), whereas an instance would contain rows such as (101, "Alice Smith"). Schemas are categorized into levels based on the ANSI/SPARC three-level architecture, which separates concerns for better abstraction and maintenance. The conceptual schema, also known as the logical schema, describes the overall structure in terms of relations, attributes, and constraints, hiding physical storage details from users. The physical schema, or internal schema, specifies storage aspects such as file organization, indexing strategies, and access paths to optimize performance on hardware. This architecture promotes data independence, where changes to the physical schema do not affect the logical view.

Integrity constraints are rules enforced over the schema to maintain data quality and consistency across instances. Domain constraints restrict attribute values to predefined types or ranges, such as ensuring ages are integers between 0 and 150. Key constraints include primary keys, which uniquely identify tuples in a relation, and foreign keys, which link relations to one another. Referential integrity ensures that foreign key values match existing primary keys in the referenced relations, preventing orphaned data. These constraints are typically checked during insertions, updates, and deletions to uphold the schema's semantic rules.

Update anomalies arise when schema designs lead to inconsistencies during modifications to the instance, often due to redundant data storage or improper relationship modeling. Insertion anomalies occur when adding new data requires extraneous information or is blocked by missing dependencies, such as an inability to record a new department until it has at least one employee. Deletion anomalies happen when removing a tuple eliminates unrelated but valuable data, like losing supplier details upon deleting the last order. Modification anomalies result from updating one copy of repeated data without propagating the change elsewhere, causing discrepancies. These issues highlight the need for schemas that align structure with intended data behaviors to minimize such risks.

Dynamic aspects of schemas and instances involve managing change over time to accommodate evolving requirements without data loss. Schema evolution includes operations like adding or removing attributes, altering data types, or merging relations, often requiring compatibility rules to preserve existing instances. Instance versioning tracks historical states of the data, enabling queries across temporal snapshots while adhering to the current schema. These mechanisms ensure long-term adaptability, as seen in systems that support backward-compatible schema changes to avoid application disruptions.
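To make the schema/instance distinction and constraint checking concrete, the following minimal Python sketch declares a schema as attribute types plus a primary key and verifies that a candidate instance respects the domain and key constraints described above; all relation and attribute names are illustrative assumptions, not tied to any particular DBMS.

```python
# Minimal sketch: a schema as declared attribute types plus a primary key,
# and an instance as a list of tuples checked against those constraints.
# The names (Employees, EmployeeID, Age, ...) are illustrative assumptions.

schema = {
    "name": "Employees",
    "attributes": [("EmployeeID", int), ("Name", str), ("Age", int)],
    "primary_key": ("EmployeeID",),
}

instance = [
    (101, "Alice Smith", 34),
    (102, "Bob Jones", 29),
]

def conforms(schema, instance):
    attrs = [a for a, _ in schema["attributes"]]
    pk_positions = [attrs.index(a) for a in schema["primary_key"]]
    seen_keys = set()
    for row in instance:
        # Domain constraints: arity, per-attribute types, and an example range rule.
        if len(row) != len(attrs):
            return False
        for value, (attr, typ) in zip(row, schema["attributes"]):
            if not isinstance(value, typ):
                return False
            if attr == "Age" and not (0 <= value <= 150):
                return False
        # Key constraint: no two tuples may share the same primary key value.
        key = tuple(row[i] for i in pk_positions)
        if key in seen_keys:
            return False
        seen_keys.add(key)
    return True

print(conforms(schema, instance))  # True: this instance satisfies the schema
```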

Relational Model

Relational Structure and Keys

In the relational model, a database is conceptualized as a collection of relations, where each relation corresponds to a table composed of attributes (columns) and tuples (rows). Each attribute is defined over a specific domain, the set of permissible values for that attribute, ensuring type consistency and semantic meaning. Formally, a relation of degree n is a finite subset of the Cartesian product of n domains, meaning it consists of ordered n-tuples where the i-th component of each tuple belongs to the i-th domain. Attributes within a relation are labeled by names for clarity, tuple components are ordered according to this attribute sequence, and the relation itself is an unordered set of such tuples with no duplicates allowed, preserving the set-theoretic character of the model.

Keys play a central role in uniquely identifying and linking data within relations. A superkey is any set of one or more attributes whose values collectively identify each tuple uniquely, such that no two tuples share the same combination of values for those attributes. A candidate key, also known as a minimal superkey, is a superkey with no proper subset that is itself a superkey, ensuring irredundancy in identification. The primary key is a specifically chosen candidate key designated for each relation to serve as the principal means of identifying its tuples, often enforced through system constraints to maintain uniqueness. A foreign key, in contrast, is a set of attributes in one relation that refers to the primary key (or a candidate key) of another relation, enabling the establishment of relationships between tables.

To ensure data consistency, the relational model incorporates key-based integrity constraints. Entity integrity mandates that no component of a primary key can contain a null value in any tuple, as this would undermine the ability to uniquely identify entities. Referential integrity requires that for every non-null value of a foreign key, there exists a matching value in the corresponding primary key of the referenced relation, preventing orphaned references and maintaining logical consistency across relations. These constraints, rooted in the model's foundational principles, are typically enforced by the database management system to safeguard the structural integrity of the data.

Consider a simple example involving employee and department relations. The Employee relation might include attributes EmpID (primary key, domain: integers), Name (domain: strings), and DeptID (foreign key referencing Department). The Department relation includes DeptID (primary key, domain: integers) and DeptName (domain: strings). A sample instance of the Employee relation could be:
EmpID   Name    DeptID
101     Alice   10
102     Bob     20
103     Carol   10
And the Department relation:
DeptID   DeptName
10       HR
20       IT
Here, {EmpID} is the primary key (and sole candidate key) of Employee, while DeptID serves as a foreign key enforcing referential integrity: only DeptID values that exist in Department are allowed, and entity integrity requires that EmpID values are non-null.
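As an illustration of how a DBMS might enforce these constraints, the following Python sketch checks entity and referential integrity over the example instance; the tuple layout and helper names are assumptions made for the example, not a standard API.

```python
# Sketch of entity and referential integrity checks over the example instance.
employee = [(101, "Alice", 10), (102, "Bob", 20), (103, "Carol", 10)]
department = [(10, "HR"), (20, "IT")]

def entity_integrity(relation, key_positions):
    """No primary-key component may be null, and key values must be unique."""
    keys = []
    for row in relation:
        key = tuple(row[i] for i in key_positions)
        if any(component is None for component in key):
            return False
        keys.append(key)
    return len(keys) == len(set(keys))

def referential_integrity(referencing, fk_position, referenced, pk_position):
    """Every non-null foreign key value must appear as a primary key in the referenced relation."""
    referenced_keys = {row[pk_position] for row in referenced}
    return all(row[fk_position] is None or row[fk_position] in referenced_keys
               for row in referencing)

print(entity_integrity(employee, [0]))                                # True
print(referential_integrity(employee, 2, department, 0))              # True
print(referential_integrity([(104, "Dave", 30)], 2, department, 0))   # False: DeptID 30 is missing
```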

Relational Algebra Operations

Relational algebra provides a procedural framework for querying and manipulating relations in the relational model, consisting of a set of operations that take one or more relations as input and produce a new relation as output. Introduced by Edgar F. Codd, it serves as a theoretical foundation for query languages like SQL, enabling the composition of complex queries through systematic combination of basic operations. The algebra's operations are grounded in set theory, ensuring that results remain within the domain of relations, which supports its use in defining the expressive power of relational query languages.

The core operations are selection, projection, union, set difference, Cartesian product, and rename. Selection, denoted \sigma_{\theta}(R), retrieves the tuples of R that satisfy a predicate \theta, formally \sigma_{\theta}(R) = \{ t \in R \mid t \text{ satisfies } \theta \}. For example, \sigma_{A=5}(R) yields all tuples in R where attribute A equals 5. Projection, denoted \pi_{A_1, \dots, A_k}(R), retains only the specified attributes, given by \pi_{A_1, \dots, A_k}(R) = \{ (t[A_1], \dots, t[A_k]) \mid t \in R \}, where duplicates are implicitly removed to maintain set semantics. Union, R \cup S, combines tuples from compatible relations R and S (same arity and domain types), producing \{ t \mid t \in R \lor t \in S \}, while set difference R - S yields \{ t \in R \mid t \notin S \}. The Cartesian product R \times S generates all possible tuple pairs, resulting in a relation whose degree equals the sum of the degrees of R and S, formally R \times S = \{ (t_r, t_s) \mid t_r \in R, t_s \in S \}. Rename, denoted \rho_{newname}(R), reassigns names to relations or attributes for clarity in compositions, preserving the underlying tuples.

Derived operations extend the core set to handle common query patterns. Intersection can be expressed as R \cap S = R - (R - S), yielding the tuples common to both compatible relations. The theta-join, R \bowtie_{\theta} S, combines R and S by selecting the pairs satisfying condition \theta from their Cartesian product, defined as R \bowtie_{\theta} S = \sigma_{\theta}(R \times S); an equijoin arises when \theta equates attributes, and the natural join additionally removes the duplicated columns. Division, R \div S, addresses "for all" queries, such as finding values in R associated with every tuple in S, formally R \div S = \{ t \in \pi_K(R) \mid \forall u \in S, (t, u) \in R \}, where K denotes the attributes of R not appearing in S.

A key property of relational algebra is closure: every operation outputs a relation, allowing arbitrary nesting of operations to form complex expressions without leaving the relational framework. This closure facilitates the construction of queries such as identifying employees in a specific department: given an Employees relation (emp_id, name, dept_id) and a Departments relation (dept_id, dept_name), the query \pi_{name}(\sigma_{dept_name = 'Sales'}(Employees \bowtie_{dept_id} Departments)) first joins on department ID, selects the 'Sales' department, and projects employee names. In terms of expressiveness, relational algebra is relationally complete, meaning it can express any query definable in the relational calculus, with a direct translation establishing their equivalence. This ensures that procedural algebraic expressions capture the full declarative power of logic-based queries on relations.
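The following Python sketch models relations as lists of attribute-value dictionaries and implements selection, projection, and natural join, then evaluates the employee query above; the sample data and function names are illustrative assumptions rather than a standard API.

```python
# Sketch of core relational algebra operators with relations modeled as lists of
# dicts (attribute -> value); the Employees/Departments data is illustrative.

def select(relation, predicate):                     # sigma_theta(R)
    return [t for t in relation if predicate(t)]

def project(relation, attributes):                   # pi_{A1,...,Ak}(R), set semantics
    seen, out = set(), []
    for t in relation:
        row = tuple(t[a] for a in attributes)
        if row not in seen:
            seen.add(row)
            out.append(dict(zip(attributes, row)))
    return out

def natural_join(r, s):                              # R joined with S on shared attribute names
    out = []
    for tr in r:
        for ts in s:
            common = set(tr) & set(ts)
            if all(tr[a] == ts[a] for a in common):
                out.append({**tr, **ts})
    return out

employees = [{"emp_id": 1, "name": "Ann", "dept_id": 10},
             {"emp_id": 2, "name": "Raj", "dept_id": 20}]
departments = [{"dept_id": 10, "dept_name": "Sales"},
               {"dept_id": 20, "dept_name": "IT"}]

# pi_name(sigma_{dept_name='Sales'}(Employees join Departments))
result = project(select(natural_join(employees, departments),
                        lambda t: t["dept_name"] == "Sales"), ["name"])
print(result)  # [{'name': 'Ann'}]
```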

Query Languages and Logic

Relational Calculus and Datalog

Relational calculus provides a declarative foundation for querying relational databases, expressing what data to retrieve rather than how to retrieve it, using predicates over relations. Introduced as part of the relational model, it contrasts with procedural approaches by focusing on logical conditions that tuples must satisfy. Tuple relational calculus (TRC) and domain relational calculus (DRC) are the primary variants, both offering expressive power equivalent to relational algebra when restricted to safe, domain-independent queries that yield finite results on finite databases.

In TRC, a query is formulated as \{ t \mid \phi(t) \}, where t is a free variable ranging over tuples in the database, and \phi(t) is a formula built from atomic predicates (e.g., t \in R for relation R, or t.A = u.B for attribute comparisons), connected by the logical operators \land, \lor, \lnot and the quantifiers \forall, \exists. The formula \phi(t) may involve bound variables ranging over relations, but it must be domain independent: all values in the result must originate from the active domain of the database instance, preventing infinite or undefined outputs from unrestricted negation or quantification. This restriction ensures practical evaluability, as unrestricted TRC can express queries that produce infinite relations even on finite inputs.

DRC shifts the focus to individual attribute values (domains) rather than entire tuples, with queries of the form \{ \langle x_1, x_2, \dots, x_n \rangle \mid \phi(x_1, x_2, \dots, x_n) \}, where each x_i ranges over a specific domain (e.g., employee IDs or department names), and \phi is a formula using atomic predicates like P(x_1, \dots, x_m) for a relation P. For instance, to retrieve pairs of employees and their departments, the query might be \{ \langle x_1, x_2 \rangle \mid \exists y \, (Employee(x_1, y) \land Dept(y, x_2)) \}, where x_1 and x_2 are free variables over employee and department domains, respectively. Like TRC, DRC enforces domain independence for safety, limiting variables to values appearing in the database to avoid extraneous results. DRC was proposed to emphasize domain structures explicitly, aiding type-safe query formulation.

Codd's theorem establishes the equivalence between relational calculus (both TRC and DRC, restricted to domain-independent expressions) and relational algebra: any query expressible in one can be translated into the other with identical semantics on relational databases. This relational completeness criterion measures a query language's power against the full capabilities of first-order logic over relations, excluding features like aggregation or ordering. The theorem underpins the design of practical query languages, confirming that declarative logic matches procedural operations in expressiveness without loss.

Datalog extends relational calculus principles into a rule-based logic programming language for deductive databases, enabling inference over base relations to derive new facts recursively. As a subset of Prolog without function symbols or complex terms, Datalog programs consist of facts (e.g., Parent(Alice, Bob).) and rules of the form Head :- Body., where Head is a positive atom (e.g., Ancestor(X,Y)) and Body is a conjunction of literals (positive or negated atoms, e.g., Parent(X,Z) \land Ancestor(Z,Y)). Evaluation uses bottom-up fixed-point semantics, iteratively applying rules until no new facts are derived, capturing the least Herbrand model.
To handle negation safely, Datalog employs stratified negation, partitioning predicates into strata where negation only references lower strata, avoiding recursion through negative literals that could lead to non-monotonic or ill-defined semantics. Stratification ensures a unique stable model computable via stratum-by-stratum fixpoint iteration, preserving domain independence and decidability. Unstratified negation introduces well-founded semantics for broader expressiveness but increases complexity. Datalog excels in applications requiring recursive queries, such as computing transitive closures over graphs or hierarchies. For example, the Ancestor relation can be defined as:

\begin{align*} &Ancestor(X,Y) :- Parent(X,Y). \\ &Ancestor(X,Y) :- Parent(X,Z), Ancestor(Z,Y). \end{align*}

This iteratively derives all ancestry paths, enabling queries over relational data like family trees or organizational structures without explicit loops. Such recursion addresses a limitation of basic relational algebra and calculus, supporting inference in knowledge bases.
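A minimal Python sketch of naive bottom-up (fixpoint) evaluation for this Ancestor program is shown below; the parent facts are invented sample data.

```python
# Sketch: naive bottom-up (fixpoint) evaluation of the recursive Ancestor program.
parent = {("alice", "bob"), ("bob", "carol"), ("carol", "dan")}

def ancestors(parent_facts):
    ancestor = set(parent_facts)                 # Ancestor(X,Y) :- Parent(X,Y).
    while True:
        new = {(x, y)                            # Ancestor(X,Y) :- Parent(X,Z), Ancestor(Z,Y).
               for (x, z) in parent_facts
               for (z2, y) in ancestor
               if z == z2} - ancestor
        if not new:                              # fixpoint reached: no new facts derivable
            return ancestor
        ancestor |= new

print(sorted(ancestors(parent)))
# [('alice', 'bob'), ('alice', 'carol'), ('alice', 'dan'),
#  ('bob', 'carol'), ('bob', 'dan'), ('carol', 'dan')]
```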

Query Optimization Principles

Query optimization in database theory involves systematically selecting the most efficient execution plan for a given query from a multitude of possible alternatives, aiming to minimize resource consumption such as CPU time and I/O operations while preserving query semantics. This process is crucial in relational database management systems (RDBMS), where declarative queries expressed in languages like SQL must be translated into procedural operations on data structures. Theoretical foundations emphasize balancing optimization effort against estimation accuracy to avoid suboptimal plans that could lead to exponential performance degradation.

The optimization process typically unfolds in two primary phases: algebraic optimization, also known as logical optimization, and physical optimization. In the algebraic phase, the initial query representation in relational algebra is rewritten using equivalence rules to produce semantically identical but potentially more efficient expressions, focusing on restructuring without considering storage details. Physical optimization follows, mapping the logical plan to specific implementation strategies, such as selecting algorithms for operators and access paths like indexes or sequential scans, to generate a concrete execution plan. These phases ensure that optimizations are both transformationally sound and practically viable.

Equivalence transformations form the core of algebraic optimization, leveraging rules derived from relational algebra properties to reorder or simplify operations. A fundamental rule is the pushdown of selections, where conjunctive predicates are applied as early as possible to reduce intermediate result sizes; for instance, \sigma_{c_1 \land c_2}(R) \equiv \sigma_{c_1}(\sigma_{c_2}(R)), allowing independent selections to be cascaded before costlier joins. Join reordering exploits associativity and commutativity, such as \pi_{A,B}(R \bowtie S) \equiv \pi_{A,B}(S \bowtie R), to explore different operator trees, as pioneered in the System R optimizer developed by IBM in the 1970s, which used dynamic programming to enumerate join orders while avoiding invalid sequences like Cartesian products between non-connected relations. These transformations generate a search space of equivalent plans, often represented as left-deep or bushy trees, to facilitate subsequent cost-based selection.

Cost models provide the quantitative framework for comparing plans by estimating the resources required for execution, primarily focusing on I/O costs due to their dominance in traditional disk-based systems. Estimates incorporate selectivity (the fraction of tuples satisfying a predicate), cardinality (the number of tuples in base relations or intermediate results), and device-specific parameters like page size. For a nested-loop join between R and S, a basic cost model approximates the total as |R| \times |S|, assuming a full scan of S for each tuple of R, though refinements account for buffering and indexing to reduce inner probes. Such models enable ranking of plans, with I/O often weighted highest, as in System R's formula for join cost: C_{\text{join}} = C_{\text{outer}} + N_{\text{outer}} \times C_{\text{inner}}, where N_{\text{outer}} is the estimated cardinality of the outer relation. Search strategies enumerate the plan space generated by transformations, balancing optimality with computational feasibility given the exponential growth in alternatives for queries with multiple joins.
Exhaustive strategies, like dynamic programming in System R, systematically build optimal subplans bottom-up for all subsets of relations, achieving optimality for select-project-join (SPJ) queries at O(n 2^{n-1}) complexity, where n is the number of relations. Heuristic approaches, such as greedy algorithms, approximate solutions by iteratively selecting the lowest-cost partial plan, pruning the space to handle larger queries efficiently but risking suboptimal results. Randomized methods, including genetic algorithms, explore the space probabilistically by evolving populations of plans through mutation and crossover, offering scalability for complex queries beyond exhaustive enumeration.

Accurate estimates underpin both cost models and search strategies, relying on database statistics to predict intermediate result sizes. Histograms partition attribute values into buckets to approximate distributions, enabling selectivity estimates for predicates; equi-depth histograms, for example, keep bucket frequencies uniform, while end-biased histograms store exact counts for the most frequent values in skewed data. Sampling methods complement histograms by drawing random subsets of tuples to build or update statistics dynamically, as explored in foundational work on relational sampling, which supports unbiased estimates for large relations with controlled error bounds. In System R, basic statistics such as relation cardinalities and distinct key counts provide selectivity factors, such as F = 1 / \text{number of distinct keys} for equality predicates, forming the basis for propagating estimates through the plan.
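The following Python sketch illustrates Selinger-style dynamic programming over subsets of relations using the simplified cost formula above; the cardinalities and join selectivities are invented for the example, and a production optimizer would additionally consider access paths, physical operators, and interesting orders.

```python
# Sketch: dynamic programming over relation subsets with the simplified cost
# C_join = C_outer + N_outer * C_inner. All cardinalities/selectivities are made up.
from itertools import combinations

card = {"R": 1000, "S": 100, "T": 10}                       # base-relation cardinalities
sel = {frozenset("RS"): 0.01, frozenset("ST"): 0.1, frozenset("RT"): 1.0}

def join_cardinality(subset):
    """Estimated result size: product of cardinalities times applicable selectivities."""
    n = 1.0
    for r in subset:
        n *= card[r]
    for pair, s in sel.items():
        if pair <= subset:
            n *= s
    return n

best = {frozenset([r]): (card[r], [r]) for r in card}       # subset -> (cost, join order)
for size in range(2, len(card) + 1):
    for subset in map(frozenset, combinations(card, size)):
        for r in subset:                                     # r is the last relation joined in
            rest = subset - {r}
            cost_rest, order = best[rest]
            cost = cost_rest + join_cardinality(rest) * card[r]
            if subset not in best or cost < best[subset][0]:
                best[subset] = (cost, order + [r])

print(best[frozenset(card)])   # lowest estimated cost and its left-deep join order
```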

Data Integrity and Normalization

Functional Dependencies

In relational database theory, a functional dependency (FD) is a constraint that specifies how the values of certain attributes determine the values of other attributes in a relation. Formally, for sets of attributes X and Y in a relation R, the FD X \to Y holds if, in every instance of R, any two tuples that agree on all attributes in X also agree on all attributes in Y. This concept was introduced by Edgar F. Codd as part of the foundational relational model to capture semantic relationships among data attributes. FDs are classified as trivial if Y \subseteq X, in which case the dependency always holds regardless of the data instance, or non-trivial otherwise. A practical example of an FD arises in an employee relation with attributes {EmpID, Name, Dept, Salary}. Here, {EmpID} \to {Name, Dept} indicates that the employee ID uniquely determines the name and department, ensuring no two employees share the same ID yet have different associated names or departments. Such FDs help identify candidate keys, which are minimal sets of attributes that functionally determine all other attributes in the relation. To reason about a set of FDs, denoted F, database theory employs inference rules known as Armstrong's axioms, established in 1974. These include:
  • Reflexivity: If Y \subseteq X, then X \to Y.
  • Augmentation: If X \to Y, then for any set Z, XZ \to YZ.
  • Transitivity: If X \to Y and Y \to Z, then X \to Z.
Derived rules, such as union and decomposition, follow from these basics. Armstrong's axioms are both sound, meaning they derive only FDs that are valid consequences of F, and complete, meaning every FD logically implied by F can be derived using the axioms. Given a set of FDs F, the closure of an attribute set X with respect to F, denoted X^+, is the set of all attributes functionally determined by X under the inference rules. Computing X^+ is an iterative algorithm: initialize X^+ = X; then, while possible, add an attribute A to X^+ if there exists an FD U \to V in F such that U \subseteq X^+ and A \in V. The process terminates when no further attributes can be added, and with careful bookkeeping it runs in time linear in the size of F. This closure is crucial for tasks such as candidate key computation and testing whether a given FD is implied by F.

A minimal cover (or canonical cover) of a set of FDs F is an equivalent set F_c that is minimal in structure: each right-hand side is a single attribute, no FD in F_c is redundant, and no attribute in the left-hand side of any FD is extraneous. To compute a minimal cover, first decompose each FD in F into single-attribute right-hand sides using the decomposition rule; then, for each resulting FD, remove extraneous attributes from the left-hand side by checking whether the FD still follows after removal (using attribute closures); finally, eliminate any redundant FDs by verifying whether they are implied by the remaining set. This process ensures F_c captures the same dependencies as F in a simplified form, facilitating normalization. Seminal work on efficient computation of such covers appears in synthesis algorithms for third normal form.
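A compact Python sketch of the attribute-closure algorithm is shown below; the FDs used are illustrative.

```python
# Sketch: attribute-set closure X+ under a set of FDs, each FD given as
# (left-hand side, right-hand side). The FDs below are illustrative.
def closure(attrs, fds):
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)     # lhs is inside X+, so X+ also determines rhs
                changed = True
    return result

fds = [({"EmpID"}, {"Name", "Dept"}), ({"Dept"}, {"Manager"})]
print(closure({"EmpID"}, fds))   # closure contains EmpID, Name, Dept, Manager
```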

Normal Forms and Decomposition

Normal forms in database theory provide a systematic approach to designing relational schemas that minimize redundancy and prevent update anomalies, building on functional dependencies (FDs) to ensure data integrity. By decomposing relations into smaller, well-structured components, normalization eliminates undesirable dependencies that lead to insertion, deletion, and modification problems. The process begins with basic forms like first normal form (1NF) and progresses to higher forms such as Boyce-Codd normal form (BCNF), fourth normal form (4NF), and fifth normal form (5NF), each addressing specific types of dependencies. These forms guide the transformation of a universal relation into a set of relations that preserve the original semantics while reducing anomalies.

First normal form (1NF) requires that all attributes in a relation contain only atomic (indivisible) values, eliminating repeating groups or nested relations. This ensures that each tuple represents a single, indivisible record, allowing the relation to be treated as a set of tuples without multivalued attributes within cells. For example, a relation storing employee skills as a list in one attribute violates 1NF; decomposing it into separate tuples for each skill achieves 1NF. Introduced with the foundational relational model, 1NF forms the basis for all higher normal forms by enforcing domain integrity at the attribute level.

Second normal form (2NF) builds on 1NF by requiring that no non-prime attribute (one not part of any candidate key) is partially dependent on any candidate key. In other words, every non-prime attribute must depend on an entire candidate key, not just a proper subset of it, to avoid partial dependencies that cause redundancy. For instance, in a relation with attributes {OrderID, ProductID, Quantity, ProductName}, where {OrderID, ProductID} is the key and ProductName depends only on ProductID, the partial dependency leads to repeated ProductName values; decomposing into separate Order and Product relations resolves this. This form reduces update anomalies related to key subsets.

Third normal form (3NF) extends 2NF by eliminating transitive dependencies, so that no non-prime attribute depends on another non-prime attribute. Each non-prime attribute must depend directly on a candidate key, preventing chains of dependencies that propagate redundancy. Consider a relation {EmployeeID, Department, DepartmentLocation} with key EmployeeID and FDs EmployeeID → Department and Department → DepartmentLocation; the transitive dependency causes redundant location data. Decomposing into Employee-Department and Department-Location relations achieves 3NF. This form balances normalization by allowing some redundancy when doing so preserves dependencies efficiently.

Boyce-Codd normal form (BCNF), proposed by Raymond Boyce and Edgar F. Codd in 1974, strengthens 3NF by requiring that for every non-trivial FD X \to Y in the relation, X is a superkey (a set containing a candidate key). Unlike 3NF, which permits determinants that are not superkeys as long as the dependent attributes are prime, BCNF ensures every determinant is a superkey, further eliminating anomalies caused by overlapping candidate keys. For example, consider a relation R(A, B, C) with FDs AC → B and B → C, giving candidate keys {A, B} and {A, C}. Here, B → C violates BCNF because B is not a superkey (it does not determine A), although the relation is in 3NF since C is a prime attribute. Decomposing into {A, B} and {B, C} resolves the violation. BCNF may not always admit a dependency-preserving decomposition, unlike 3NF. Decomposition techniques therefore aim to ensure that normalizing a schema neither loses information nor discards constraints.
A decomposition is lossless-join if the natural join of the decomposed relations always equals the original relation, preserving all tuples without introducing spurious ones; this is tested with the chase algorithm, which applies the given dependencies to a tableau of symbols until a row of all distinguished symbols is produced or no further changes are possible. Dependency-preserving decompositions maintain all original FDs locally within the subrelations, avoiding the need for joins to enforce them. The synthesis algorithm for 3NF computes a canonical cover of the FDs, groups them into relation schemas covering each dependency, and ensures the result is lossless and dependency-preserving by adding a schema containing a candidate key if necessary. For instance, given a relation R(A, B, C, D) with FDs A → B and C → D, the algorithm yields (A, B) and (C, D), made lossless by also including a schema on the candidate key (A, C). These properties enable practical designs without excessive query complexity.

Higher normal forms address more complex dependencies beyond FDs. Fourth normal form (4NF), introduced by Ronald Fagin in 1977, targets multivalued dependencies (MVDs), denoted X \to\to Y, where for a given X the set of associated Y values is independent of the remaining attributes. A relation is in 4NF if it is in BCNF and the determinant of every non-trivial MVD is a superkey, preventing redundancy from independent multivalued facts. For example, in {Employee, Skill, Child}, if Employee →→ Skill and Employee →→ Child hold independently, decomposing into Employee-Skill and Employee-Child eliminates the repetition. Introduced to handle such cases, 4NF ensures that MVD violations cannot cause anomalies.

Fifth normal form (5NF), also known as project-join normal form and proposed by Ronald Fagin in 1979, eliminates join dependencies (JDs), which arise when a relation can be decomposed into projections that join back to the original without loss or spurious tuples. A relation is in 5NF if it is in 4NF and every non-trivial JD is implied by the candidate keys. This form addresses cyclic dependencies across multiple attributes; for instance, a relation {Agent, Company, Product} subject to a join dependency over its pairwise projections requires full 5NF decomposition to avoid redundancy in recording assignments. 5NF provides the highest level of normalization for eliminating decomposition-related anomalies.
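The following Python sketch implements the tableau chase test for the lossless-join property under a set of FDs, applied to the BCNF example above; the relation, decomposition, and FDs are illustrative, and the code handles functional dependencies only.

```python
# Sketch: the tableau chase test for the lossless-join property of a decomposition
# under functional dependencies.
def lossless_join(attributes, decomposition, fds):
    attrs = list(attributes)
    # One tableau row per decomposed schema: the distinguished symbol 'a' where the
    # attribute is present, otherwise a row-specific 'b' symbol.
    table = [{a: ("a" if a in ri else f"b{i}_{a}") for a in attrs}
             for i, ri in enumerate(decomposition)]
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            for r1 in table:
                for r2 in table:
                    if r1 is not r2 and all(r1[a] == r2[a] for a in lhs):
                        for a in rhs:
                            if r1[a] != r2[a]:
                                keep, drop = sorted([r1[a], r2[a]])   # 'a' sorts first
                                for row in table:                     # rename within the column
                                    if row[a] == drop:
                                        row[a] = keep
                                changed = True
    # Lossless iff some row becomes all distinguished symbols.
    return any(all(row[a] == "a" for a in attrs) for row in table)

# R(A, B, C) with FD B -> C, decomposed into (A, B) and (B, C): lossless.
print(lossless_join({"A", "B", "C"}, [{"A", "B"}, {"B", "C"}], [({"B"}, {"C"})]))  # True
```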

Concurrency and Recovery

Transaction Properties and ACID

In database theory, a transaction is defined as a sequence of read and write operations on the database that is executed as a single logical unit of work, appearing atomic from an external perspective to ensure consistency and reliability. This atomicity means that the effects of the transaction are indivisible, either fully applied or not at all, preventing partial updates that could lead to inconsistent states. The ACID properties formalize the guarantees required for reliable transaction processing. Atomicity ensures that a transaction is treated as an indivisible unit: all operations succeed (commit) or none do (abort), with recovery mechanisms restoring the database to its pre-transaction state if failure occurs. Consistency requires that a transaction brings the database from one valid state to another, preserving all integrity constraints such as primary keys, foreign keys, and user-defined rules. Isolation guarantees that concurrent transactions do not interfere with each other, making them appear to execute serially even when overlapping in time, thus avoiding anomalies like dirty reads or lost updates. Durability ensures that once a transaction commits, its changes persist in stable storage even after failures, typically achieved through logging to non-volatile media. These properties, originally conceptualized in terms of atomicity, consistency, and durability by Jim Gray and later formalized under the ACID acronym by Theo Härder and Andreas Reuter, provide a theoretical foundation for robust transaction management.

Serializability serves as the primary correctness criterion for concurrent execution: a schedule of transactions is correct if its outcome is equivalent to some serial execution of those transactions. Conflict serializability, a sufficient condition for serializability, holds if the schedule can be transformed into a serial one by swapping adjacent non-conflicting operations, where conflicts arise between operations on the same data item by different transactions when at least one is a write (e.g., two writes, or a read followed by a write). This can be tested efficiently using a precedence graph, where nodes represent transactions and directed edges indicate the order of conflicting operations; the schedule is conflict serializable if and only if the graph is acyclic.

Recovery mechanisms underpin the ACID properties, particularly atomicity and durability, by enabling the database to recover from failures like crashes or power losses. Logging records all transaction actions to stable storage, with write-ahead logging (WAL) requiring log entries to be written before the corresponding data modifications, allowing redo or undo during recovery. Checkpoints periodically flush dirty buffer pages to disk and record the log position, bounding recovery time by limiting how much of the log must be scanned; fuzzy dumping, or fuzzy checkpointing, performs this asynchronously without halting transactions, producing a consistent image despite ongoing modifications. The ARIES algorithm exemplifies these techniques, integrating WAL with fuzzy checkpoints and compensating log records for partial rollbacks to support fine-grained locking and efficient recovery. A classic example is a funds-transfer transaction, where debiting one account and crediting another must occur atomically: if the debit succeeds but the credit fails, atomicity triggers a rollback to prevent lost funds, while durability ensures that committed transfers survive a subsequent system crash.
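The precedence-graph test can be sketched in a few lines of Python; the schedule below is an invented example whose two conflicts point in opposite directions, so it is not conflict serializable.

```python
# Sketch: build a precedence (conflict) graph from a schedule and test conflict
# serializability via cycle detection. Each operation is (transaction, 'r'/'w', item).
schedule = [("T1", "r", "x"), ("T2", "w", "x"), ("T2", "r", "y"), ("T1", "w", "y")]

def precedence_edges(schedule):
    edges = set()
    for i, (ti, opi, itemi) in enumerate(schedule):
        for tj, opj, itemj in schedule[i + 1:]:
            # Operations conflict if they come from different transactions, touch the
            # same item, and at least one of them is a write.
            if ti != tj and itemi == itemj and "w" in (opi, opj):
                edges.add((ti, tj))
    return edges

def acyclic(edges):
    graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
    visited, on_stack = set(), set()
    def dfs(node):
        visited.add(node)
        on_stack.add(node)
        for nxt in graph.get(node, ()):
            if nxt in on_stack or (nxt not in visited and dfs(nxt)):
                return True            # found a cycle
        on_stack.discard(node)
        return False
    return not any(dfs(n) for n in list(graph) if n not in visited)

edges = precedence_edges(schedule)
print(edges)            # contains both ('T1', 'T2') and ('T2', 'T1'): a cycle
print(acyclic(edges))   # False: the schedule is not conflict serializable
```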

Concurrency Control Mechanisms

Concurrency control mechanisms in database theory address the challenge of managing simultaneous access to shared data by multiple transactions, ensuring isolation while maintaining consistency in multi-user environments. These mechanisms prevent anomalies such as dirty reads, non-repeatable reads, and phantom reads by enforcing serializability, where the outcome of concurrent execution matches some serial execution order. Pessimistic approaches, like locking and timestamp ordering, proactively prevent or resolve conflicts, whereas optimistic methods defer validation until commit time.

Locking protocols form a foundational pessimistic technique, where transactions acquire locks on data items before accessing them to ensure mutual exclusion. Shared locks allow multiple transactions to read a data item concurrently but block writers, while exclusive locks permit a single transaction to read and write, blocking all other access. The two-phase locking (2PL) protocol divides execution into a growing phase, in which locks are acquired but none released, and a shrinking phase, in which locks are released but none acquired; this ensures conflict serializability by preventing cycles in the precedence order of transactions. Deadlocks arise in 2PL when transactions form a cycle of mutual waiting, detected via wait-for graphs in which nodes represent transactions and directed edges indicate one transaction waiting for a lock held by another; resolution typically involves aborting one transaction in the cycle, often chosen by age based on timestamps.

Timestamp ordering assigns a unique timestamp to each transaction upon initiation, ordering operations by these timestamps to simulate a serial schedule. For a read by transaction T on item X, if T's timestamp is at least X's write timestamp, the read proceeds; otherwise, T aborts to avoid reading "too late" values that would violate the ordering. Writes similarly check against both read and write timestamps, with conflicts leading to aborts. Thomas's write rule optimizes this by ignoring a write when a later write has already been applied, reducing unnecessary aborts without compromising correctness, since the obsolete write could not affect the final state.

Optimistic concurrency control assumes low conflict rates and allows transactions to proceed without locks, validating consistency only at commit through read, validation, and write phases. In the backward validation variant, a transaction reads committed data and then, at commit, checks for conflicts with transactions that committed during its execution; if none exist, it writes its updates. Forward validation instead checks the committing transaction's writes against the read sets of currently active transactions, aborting if inconsistencies are found. This approach enhances throughput in read-heavy workloads by avoiding upfront blocking, though aborts increase under high contention.

Multiversion concurrency control (MVCC) mitigates read-write conflicts by maintaining multiple versions of each data item, each tagged with a timestamp, allowing readers to access a consistent snapshot without blocking writers. A read retrieves the version with the latest write timestamp not exceeding the reader's timestamp, ensuring repeatable reads. Writes create new versions without overwriting existing ones, with garbage collection periodically removing obsolete versions. This non-locking read path achieves high concurrency and has been analyzed formally in multiversion serializability theory, influencing implementations such as PostgreSQL's snapshot isolation. Lock granularity refers to the level at which locks are applied, such as entire tables, pages, or individual rows, and governs the trade-off between concurrency and overhead.
Coarse granularity, such as table-level locking, simplifies lock management but reduces parallelism by blocking unrelated operations; fine granularity, such as row-level locking, permits higher concurrency but incurs greater lock-management cost and more opportunities for deadlock. Multiple-granularity locking organizes lockable items in a hierarchy and uses intention locks (e.g., intention-shared locks taken on coarser items before descending to finer ones), enabling efficient mixed access patterns while bounding overhead.
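As a sketch of the timestamp-ordering rules described above, including Thomas's write rule, the following Python fragment tracks per-item read and write timestamps; the timestamps, item names, and return values are illustrative assumptions.

```python
# Sketch of basic timestamp-ordering checks, including Thomas's write rule.
read_ts = {}    # item -> largest timestamp among transactions that read it
write_ts = {}   # item -> largest timestamp among transactions that wrote it

def read(item, ts):
    if ts < write_ts.get(item, 0):
        return "abort"          # the value was written by a "future" transaction
    read_ts[item] = max(read_ts.get(item, 0), ts)
    return "ok"

def write(item, ts):
    if ts < read_ts.get(item, 0):
        return "abort"          # a later transaction has already read the item
    if ts < write_ts.get(item, 0):
        return "ignore"         # Thomas's write rule: skip the obsolete write
    write_ts[item] = ts
    return "ok"

print(write("x", 5))   # ok
print(read("x", 3))    # abort: timestamp 3 cannot read a value written at timestamp 5
print(write("x", 4))   # ignore: superseded by the write at timestamp 5
print(read("x", 7))    # ok
```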

Advanced Theoretical Foundations

Complexity of Database Queries

The computational complexity of database queries is a central concern in database theory, as it determines the feasibility of evaluating queries over large datasets. Two primary measures are used: data complexity, which treats the query as fixed and measures resources as a function of the database size, and combined (or expression) complexity, which considers both the database and the query as inputs. For first-order (relational calculus) queries, data complexity lies in AC^0 (constant-depth circuits), while combined complexity is PSPACE-complete. Fundamental operations of relational algebra, such as selection and projection, exhibit low complexity, often linear in the database size. Joins vary more significantly: a naive nested-loop join requires O(n^2) time in the worst case for relations of size n, but hash-based joins achieve O(n + m) expected time for relations of sizes n and m by building a hash table on one relation and probing it with the other. These bounds highlight practical optimizations that mitigate worst-case scenarios in query evaluation.

Decision problems related to queries often exhibit high complexity. For instance, determining whether one conjunctive query is contained in another, when inequalities (≤ or <) are allowed, is \Pi_2^p-complete. Similarly, query optimization, in particular selecting the optimal join order for a sequence of joins, is NP-hard, necessitating heuristic approaches in practice. A seminal result by Moshe Y. Vardi characterizes the complexity of Datalog programs, a fixpoint extension of relational queries used for recursion: the data complexity of evaluating a fixed Datalog program is in PTIME, but combined complexity is EXPTIME-complete, as the number of derivable facts can grow exponentially with the arity of the program's predicates.

Dichotomy theorems provide sharp classifications of query complexity on finite structures. For instance, in consistent query answering under primary key constraints, there is a dichotomy for Boolean conjunctive queries with exactly two atoms and no self-joins: the complexity is in PTIME if the query satisfies certain structural conditions on the atoms and keys (e.g., the query atoms are contained in the positive closure of one relation, or the union of the keys is contained in the query atoms), and coNP-complete otherwise. This result exemplifies broader efforts to classify query answering precisely based on structural properties. In resource-constrained settings, such as streaming data, space-efficient algorithms enable approximate query answering: sketching techniques for aggregate queries over streams achieve polylogarithmic space while providing (1 + ε)-approximations with high probability, trading accuracy for sublinear memory usage. These methods are crucial for real-time processing in distributed databases.
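A hash join illustrating the O(n + m) expected-time bound can be sketched as follows in Python; the relations are invented sample data keyed on a join attribute.

```python
# Sketch: a simple hash join that builds a hash table on one relation and probes it
# with the other. Relations are lists of (join_key, payload) pairs; data is illustrative.
from collections import defaultdict

def hash_join(r, s):
    buckets = defaultdict(list)
    for key, payload in r:            # build phase: O(|R|)
        buckets[key].append(payload)
    return [(key, pr, ps)             # probe phase: O(|S|) plus the output size
            for key, ps in s
            for pr in buckets.get(key, [])]

r = [(10, "Alice"), (20, "Bob"), (10, "Carol")]
s = [(10, "HR"), (20, "IT")]
print(hash_join(r, s))
# [(10, 'Alice', 'HR'), (10, 'Carol', 'HR'), (20, 'Bob', 'IT')]
```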

Incomplete Information and Nulls

Incomplete information in relational databases arises when some data values are missing or uncertain, typically represented by null values. Edgar F. Codd, the originator of the relational model, initially proposed handling missing information through a single null marker in his 1979 work, but later refined this in 1986 to distinguish two types: applicable but unknown (denoted "U" or ⊥^A, indicating a value exists but is currently unavailable) and inapplicable (denoted "I" or ⊥^I, indicating no value is relevant for the attribute in that tuple). This distinction addresses semantic nuances, such as a person's phone number being unknown versus not applicable because they own no phone. In practice, most database systems, including SQL implementations, collapse these into a single null marker but employ a three-valued logic of true, false, and unknown (often denoted U), derived from Kleene's strong three-valued logic, to evaluate predicates involving nulls. For example, a comparison like "age > 30" where age is null evaluates to unknown, and this value propagates through queries so that results may include, exclude, or mark tuples as uncertain.

The theoretical foundation for querying incomplete databases relies on possible worlds semantics, under which an incomplete database instance is interpreted as the set of complete (null-free) relations consistent with the given nulls and any constraints. Seminal work by Imieliński and Lipski in 1984 formalized this model using conditional tables, where each tuple is annotated with a condition specifying the values that may replace its nulls, compactly representing the set of possible worlds. Certain answers to a query are then defined as the intersection of the query results over all possible worlds, that is, the tuples appearing in every complete instantiation, ensuring results are true regardless of how nulls are resolved. This approach contrasts with SQL's three-valued evaluation, which may return more tuples (including those true only in some worlds) and is computationally simpler but less precise in capturing uncertainty. Query answering under possible worlds semantics extends relational calculus to accommodate nulls, but computing certain answers can be coNP-complete even for conjunctive queries over conditional tables.

A further challenge in incomplete databases is the implication problem: determining whether one set of schema constraints (e.g., functional dependencies) implies another in the presence of nulls. This problem is undecidable in general, particularly when functional dependencies are combined with inclusion dependencies, as shown in foundational results on dependency implication. To address query answering efficiently, techniques like the chase and backward chase (backchase) are employed. The chase procedure, originally developed for testing dependency implication, is adapted to expand conditional tables by enforcing constraints across possible values, while the backchase prunes inconsistent worlds to compute certain answers without enumerating all possibilities. These methods, building on Imieliński and Lipski's conditional tables, enable practical approximations for conjunctive queries, though full decidability requires restrictions on null placements or query classes. Extensions of null-based incomplete information include probabilistic databases, where uncertainty is quantified by assigning probabilities to tuples or attribute values, often at the tuple level under an independence assumption.
Seminal frameworks, such as those by Cavallo and Pittarelli in 1987 and later refined by Dalvi and Suciu in 2004, model probabilistic databases as probability distributions over possible worlds, with query answers computed via probabilistic inference (e.g., summing the probabilities of the worlds in which a result holds). Another advancement is provenance tracking for query results, which annotates answers with expressions tracing them back to source tuples and thereby allows uncertainty to be propagated; Green et al.'s 2007 provenance semirings provide a unifying algebraic foundation for this, enabling annotation computation in polynomial time for positive relational queries. These extensions enhance handling of real-world scenarios, such as sensor networks or data cleaning, where nulls alone insufficiently capture graded uncertainty.
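The three-valued logic used for null comparisons earlier in this section can be sketched in Python as follows, with None standing in for the unknown truth value; the operator names are ad hoc for illustration rather than any standard library API.

```python
# Sketch: SQL-style three-valued logic (Kleene's strong logic) with None standing
# in for the unknown truth value produced by comparisons involving NULL.
def and3(a, b):
    if a is False or b is False:
        return False
    if a is None or b is None:
        return None
    return True

def or3(a, b):
    if a is True or b is True:
        return True
    if a is None or b is None:
        return None
    return False

def not3(a):
    return None if a is None else not a

def greater_than(x, y):
    return None if x is None or y is None else x > y

# "age > 30" where age is NULL evaluates to unknown, so the row is not selected.
age = None
print(greater_than(age, 30))                 # None (unknown)
print(and3(greater_than(age, 30), True))     # None
print(or3(greater_than(age, 30), True))      # True: unknown OR true is true
```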

References

  1. [1]
    Database Theory - an overview | ScienceDirect Topics
    Database theory is defined as the mathematical and logical study of finite structures, particularly focusing on developing representations for databases.
  2. [2]
    Perspectives on database theory | ACM SIGACT News
    Perspectives on database theory. Author: Mihalis Yannakakis.
  3. [3]
    A relational model of data for large shared data banks
    A relational model of data for large shared data banks. Readings in database systems (3rd ed.) Read More. Comments.
  4. [4]
    Database theory—past and future - ACM Digital Library
    One important branch is the theory of relational databases, including such areas as dependency theory, universal-relation theory, and hypergraph theory.
  5. [5]
    A Brief History of Database Management - Dataversity
    Oct 25, 2021 · The CODASYL approach was a very complicated system and required substantial training. It depended on a “manual” navigation technique using a ...Table Of Contents · Mysql · Nosql
  6. [6]
    History of DBMS - GeeksforGeeks
    Jul 28, 2025 · The first database management systems (DBMS) were created to handle complex data for businesses in the 1960s.
  7. [7]
    How Charles Bachman Invented the DBMS, a Foundation of Our ...
    Jul 1, 2016 · The Legacy of IDS. IDS and CODASYL systems did not use the relational data model, formulated years later by Ted Codd, which underlies today's ...<|control11|><|separator|>
  8. [8]
    Origin of Integrated Data Store (IDS): First Direct-Access DBMS
    Aug 6, 2025 · An early system to manage data was the Integrated Data Store (IDS) designed by Charles Bachman in 1963. 2 The IDS system maintained a collection ...
  9. [9]
    CODASYL Database Task Group; Conference Proceedings
    CODASYL ; A Survey of Generalized Database Management Systems; Technical Report;1969 CODASYL Database Task Group; October 69 Report; 1969.
  10. [10]
    Information Management Systems - IBM
    For the commercial market, IBM renamed the technology Information Management Systems and in 1968 announced its release on mainframes, starting with System/360.
  11. [11]
    The Most Important Database You've Never Heard of - Two-Bit History
    Oct 7, 2017 · By 1968, IBM had installed a working version of IMS at NASA ... IMS works according to a hierarchical model, meaning that, instead ...
  12. [12]
    Edgar Codd Publishes the Definitive Model for Relational Database ...
    In June 1970 Edgar F. Codd Offsite Link of IBM published "A Relational Model of Data for Large Shared Data Banks Offsite Link ," Communications of the ACM, ...
  13. [13]
    Codd's Twelve Rules - Simple Talk - Redgate Software
    Apr 14, 2020 · Dr. Codd, the creator of relational databases, was bothered by this, so he set up a set of 13 rules that a product had to match to be considered relational.
  14. [14]
    [PDF] Reflections on Boyce-Codd Normal Form
    A recent study showed that under common assumptions, BCNF no longer guarantees freedom from various 'anomalies', one of its purported virtues. Second, a BCNF ...
  15. [15]
    Database System Concepts - Henry F. Korth, Abraham Silberschatz
    Authors, Henry F. Korth, Abraham Silberschatz ; Photographs by, Abraham Silberschatz ; Publisher, McGraw-Hill, 1986 ; Original from, the University of Michigan.
  16. [16]
    [PDF] Principles of Database and Knowledge-base Systems
    ... Theory ... This book is the first of a two-volume set that is intended as a replacement for my earlier book Principles of Database Systems (Ullman [1982] in the ...
  17. [17]
    (PDF) Quantum Data Management: From Theory to Opportunities
    Mar 8, 2024 · Quantum computing has emerged as a transformative tool for future data management. Classical problems in database domains, including query ...
  18. [18]
    [PDF] Data Models 1 Introduction 2 Object-Based Logical Models
    A data model is a collection of tools for describing real-world entities and their relationships. Object-based models use entities/objects and relationships.
  19. [19]
    [PDF] Reference model for DBMS standardization: database architecture ...
    the ANSI/SPARC three-schema architecture of data representation (conceptual, external, and internal), and is used in the development of the DBMS RM ...
  20. [20]
    [PDF] Chapter 2: Entity-Relationship Model - Purdue Computer Science
    UML Class Diagrams (Contd.) □ Cardinality constraints are specified in the form l..h, where l denotes the minimum and h the maximum number of relationships ...
  21. [21]
    [PDF] Network Model - Database System Concepts
    The first database-standard specification, called the CODASYL DBTG 1971 report, was written in the late 1960s by the Database Task Group. Since then, a number.
  22. [22]
    [PDF] A Relational Model of Data for Large Shared Data Banks
    Future users of large data banks must be protected from having to know how the data is organized in the machine. (the internal representation). A prompting.
  23. [23]
    The entity-relationship model—toward a unified view of data
    A data model, called the entity-relationship model, is proposed. This model incorporates some of the important semantic information about the real world.
  24. [24]
    [PDF] Further Normalization of the Data Base Relational Model
    E. F. Codd, "A Relational Model of Data for Large Shared Data Banks", Comm. ACM 13(6), June 1970, 377-387. 2. E. F. Codd, "Relational Completeness of Data ...
  25. [25]
    Schema evolution in database systems: an annotated bibliography
    Schema Evolution is the ability of a database system to respond to changes in the real world by allowing the schema to evolve. In many systems this property ...
  26. [26]
    [PDF] Relational Completeness of Data Base Sublanguages
    This paper attempts to provide a theoretical basis which may be used to determine how complete a selection capability is provided in a proposed data sublanguage.
  27. [27]
    [PDF] What You Always Wanted to Know About Datalog (And Never Dared ...
    Abstract-Datalog is a database query language based on the logic programming paradigm; it has been designed and intensively studied.
  28. [28]
    Domain-Oriented Relational Languages. VLDB 1977 - SIGMOD
    This paper defines two domain-oriented relational languages. DRC is a many-sorted calculus, where the structural role of domains is emphasized by defining types ...
  29. [29]
    [PDF] An Overview of Query Optimization in Relational Systems
    Query optimization involves choosing the best execution plan from many possible plans, using a search space, cost estimation, and an enumeration algorithm.
  30. [30]
    [PDF] Access Path Selection in a Relational Database Management System
    This paper describes how System R chooses access paths for both simple (single relation) and complex queries (such as joins), given a user specification of ...
  31. [31]
    [PDF] Random Sampling from Databases - UC Berkeley
    I describe and analyze algorithms for sampling from relational operators in Chapter Five: selection, intersection, union, projection, set difference, and join.
  32. [32]
    [PDF] Dependency Structures of Data Base Relationships
    Département d'Informatique, Université de Montréal, Montréal, Québec, Canada. An axiomatic description of dependency structures, those families of dependencies ...
  33. [33]
    Candidate keys for relations - ScienceDirect.com
    An algorithm is presented that finds K, the set of all keys for a given set A of attribute names and a given set D[0] of functional dependencies.
  34. [34]
    [PDF] Synthesizing Third Normal Form Relations from Functional ...
    It has been proposed that the description of a relational database can be formulated as a set of functional relationships among database attributes.
  35. [35]
    The theory of joins in relational databases - ACM Digital Library
    In this paper we give efficient algorithms to determine whether the join of several relations has the intuitively expected value (is lossless) and to determine ...
  36. [36]
    Synthesizing third normal form relations from functional dependencies
    An effective procedure for performing a relational scheme synthesis is presented and the schema that results from this procedure is proved to be in Codd's ...
  37. [37]
    Multivalued dependencies and a new normal form for relational ...
    A new type of dependency, which includes the well-known functional dependencies as a special case, is defined for relational databases.
  38. [38]
    Normal forms and relational database operators - ACM Digital Library
    We discuss the relationship between normal forms in a relational database and an allowed set of relational operators.
  39. [39]
    [PDF] Jim Gray - The Transaction Concept: Virtues and Limitations
    ABSTRACT: A transaction is a transformation of state which has the properties of atomicity (all or nothing), durability (effects survive failures) and ...
  40. [40]
    Principles of transaction-oriented database recovery
    This paper, 'Principles of transaction-oriented database recovery', is by Theo Haerder and Andreas Reuter, published in ACM Computing Surveys.
  41. [41]
    The serializability of concurrent database updates | Journal of the ACM
    The serializability of concurrent database updates. Author: Christos H. Papadimitriou.
  42. [42]
    ARIES: a transaction recovery method supporting fine-granularity ...
    In this paper, we discuss the design and implementation of a concurrency control manager for the Tachyon, a main-memory DBMS. Since a main-memory DBMS ...
  43. [43]
    [PDF] Checkpointing Memory-Resident Databases
    Fuzzy checkpoints require little or no synchronization with executing transactions. The backup database produced by such a checkpoint is called fuzzy because it ...
  44. [44]
    [PDF] TWO PHASE LOCKING - Microsoft
    Two-phase locking is a type of scheduler, where locking synchronizes access to shared data. Each data item has a lock associated with it.
  45. [45]
    [PDF] TIMESTAMP-BASED ALGORITHMS FOR CONCURRENCY ...
    In this paper we present a framework for the design and analysis of concurrency control algorithms for distributed database management systems (DDBMS) ...
  46. [46]
    [PDF] On Optimistic Methods for Concurrency Control - Computer Science
    In this paper, two families of nonlocking concurrency controls are presented. The methods used are “optimistic” in the sense that they rely mainly on ...
  47. [47]
    Multiversion concurrency control—theory and algorithms
    This paper presents a theory for analyzing the correctness of concurrency control algorithms for multiversion database systems.
  48. [48]
    Granularity of locks in a shared data base - ACM Digital Library
    This paper proposes a locking protocol which associates locks with sets of resources. This protocol allows simultaneous locking at various granularities by ...
  49. [49]
    [PDF] The complexity of relational query languages - Rice University
    In the last years there has been a lot of interest in query languages for relational databases. Following. Codd's pioneering work [Codd] on the relational ...
  50. [50]
    [PDF] Lecture 2: Query Containment - cs.wisc.edu
    The problem of evaluating a Conjunctive Query is NP-complete (combined complexity). Even though query containment is an NP-complete problem, it is a feasible ...
  51. [51]
    Query simplification: graceful degradation for join-order optimization
    Join ordering is one of the most important, but also most challenging problems of query optimization. In general finding the optimal join order is NP-hard.
  52. [52]
    [PDF] A Dichotomy in the Complexity of Consistent Query Answering for ...
    Sep 29, 2011 · Abstract. We establish a dichotomy in the complexity of computing the consistent answers of a Boolean conjunctive query.
  53. [53]
    [PDF] Processing Complex Aggregate Queries over Data Streams
    This paper addresses processing aggregate SQL queries over continuous data streams with limited memory, using randomized techniques to provide approximate ...
  54. [54]
    Extending the database relational model to capture more meaning
    In this paper we propose extensions to the relational model to support certain atomic and molecular semantics.
  55. [55]
    Missing information (applicable and inapplicable) in relational ...
  56. [56]
    Incomplete Information in Relational Databases | Journal of the ACM
    Incomplete Information in Relational Databases. Authors: Tomasz Imieliński.
  57. [57]
    The Implication Problem for Functional and Inclusion Dependencies ...
    We show that the implication problem is undecidable for the class of functional and inclusion dependencies.
  58. [58]
    [PDF] The Theory of Probabilistic Databases - VLDB Endowment
    ABSTRACT. A theory of probabilistic databases is outlined. This theory is one component of an integrated approach to data-modelling that accommodates both ...