Multivalued dependency

In database theory, a multivalued dependency (MVD) is a full constraint between two sets of attributes in a relation, generalizing the concept of functional dependencies by capturing situations where the values of one attribute set are independent of another given a determinant set.^[1] Formally, for a relation schema R(X, Y, Z), an MVD X \to\to Y holds if, for any two tuples t_1 and t_2 in the relation that agree on X, the sets of Y-values associated with each are identical regardless of the Z-values, ensuring Yx z = Yx z' for all x, z, z' where the projections are nonempty.^[1] Introduced by Ronald Fagin in 1977, MVDs address limitations in earlier normalization forms by identifying redundancies and anomalies arising from independent multi-valued associations, such as in a relation tracking employees, their children, and salary histories where children are independent of salary details.^[1] This dependency plays a central role in defining fourth normal form (4NF), a stricter normalization level than Boyce-Codd normal form (BCNF), which requires that for every nontrivial MVD X \to\to Y in a relation schema R, the functional dependency X \to A must hold for every attribute A not in X.^[1] In 4NF, all dependencies stem from keys, preventing update anomalies from non-key-determined independencies, and any relation schema can be losslessly decomposed into 4NF components.^[1] MVDs are part of a broader framework of data dependencies, including functional and join dependencies, and their implication problem—determining if one set of MVDs infers another—has been axiomatized for efficient database design and integrity enforcement.^[1] While functional dependencies ensure unique mappings, MVDs handle scenarios like many-to-many relationships without implying uniqueness, making them essential for modeling complex real-world data, such as course prerequisites or product features.^[1]

Fundamentals

Informal Definition

A multivalued dependency is a constraint in relational database theory that generalizes functional dependencies to capture relationships where one set of attributes determines multiple independent values for another set, without reliance on additional attributes. Specifically, for attributes A and B in a relation, with other attributes denoted as C, the multivalued dependency holds when the set of B values associated with a particular A value is independent of the C values; thus, whenever tuples exist pairing A with certain B and C values, all combinations of those B and C sets must also appear in the relation.^[1] In contrast to functional dependencies, which enforce that a value of A determines a unique value of B, multivalued dependencies allow A to determine an entire set of B values that remains fixed and unaffected by variations in C, ensuring the relation reflects complete cross-products of these independent sets. This distinction addresses scenarios where attributes exhibit multi-valued associations rather than one-to-one mappings.^[2]

Historical Context

The concept of multivalued dependencies emerged within the framework of the relational model, first proposed by E.F. Codd in 1970, which emphasized relations as the primary data structure and introduced functional dependencies as a foundational constraint for ensuring data integrity.^[3] As the relational model evolved in the early 1970s, Boyce-Codd Normal Form (BCNF), defined by Codd in 1972, addressed certain anomalies in third normal form but proved insufficient for handling relations beyond first normal form (1NF), particularly those involving multivalued attributes that led to redundancy and update issues. Ronald Fagin introduced multivalued dependencies in 1977 to capture these limitations, formalizing them as a generalization of functional dependencies in his seminal paper published in ACM Transactions on Database Systems, where he also proposed fourth normal form (4NF) as a solution to eliminate multivalued dependency-induced redundancies in non-1NF relations.^[1] This work built directly on Codd's foundations, extending dependency theory to better support complex data associations without decomposing relations into purely atomic-valued tables. The complete axiomatization of functional and multivalued dependencies was provided by Catriel Beeri, Ronald Fagin, and John H. Howard in 1977.^[4] In 1979, the chase procedure for testing implications of data dependencies, including MVDs, was introduced by David Maier, Alberto O. Mendelzon, and others, solidifying their role in dependency-preserving decompositions.^[5] During the 1980s, the theory evolved with extensions like embedded multivalued dependencies, explored in works such as David S. Parker's 1980 analysis of inferences involving them alongside transitive dependencies, enabling more nuanced handling of nested and conditional data constraints in advanced relational designs.^[6]

Formal Framework

Notation and Syntax

The multivalued dependency (MVD) was introduced by Ronald Fagin in his seminal 1977 paper, where it is denoted using a double arrow as X \to\to Y, distinguishing it from the single arrow for functional dependencies (X \to Y).^[1] In contemporary relational database literature, the standard notation is X \to\to Y (or sometimes X \twoheadrightarrow Y), where X and Y are disjoint sets of attributes, and the double arrow emphasizes the multiplicity on the right-hand side compared to the single arrow for functional dependencies.^[7] In a relation schema R with attribute set partitioned into disjoint X, Y, and Z such that X \cup Y \cup Z = R and the intersections are empty, the MVD is expressed as X \to\to Y, implying that Z = R - (X \cup Y) is also multidetermined by X independently.^[1]^[7] Variations in notation appear in some texts, particularly for related concepts like join dependencies, which may use double-headed arrows (\twoheadrightarrow) to denote more general projections and joins.^[8] Standard conventions include denoting attributes in uppercase letters (e.g., A, B), sets as boldfaced or grouped (e.g., \mathbf{X}), and relation schemas as R(X_1, \dots, X_n) without assuming keys or superkeys in the notation itself.^[7]

Precise Definition

In the relational model, a relation r over schema R is a finite set of tuples, where each tuple assigns values from appropriate domains to the attributes in R, with no duplicate tuples or null values permitted.^[1] A multivalued dependency X \to\to Y holds in r(R) if, for every pair of tuples u, v \in r such that u[X] = v[X], there exist tuples w, z \in r satisfying:

w[X] = u[X] = v[X] = z[X],
w[Y] = u[Y],
z[Y] = v[Y],
w[Z] = v[Z],
z[Z] = u[Z],

where Z = R - (X \cup Y).^[1]^[8] This tuple-based condition ensures that the values for Y and Z are independent given X, capturing the essence of multivalued associations without implying functional determination.^[1] An equivalent relational algebra formulation states that X \to\to Y holds in r if and only if r = \pi_{XY}(r) \bowtie \pi_{XZ}(r), where \bowtie denotes the natural join operator.^[9]^[1] To sketch the equivalence, consider the forward direction: assuming the tuple condition, any tuple in \pi_{XY}(r) \bowtie \pi_{XZ}(r) arises from compatible projections that, by the condition, correspond to existing tuples in r, ensuring the join is contained in r; the reverse containment follows since projections of r yield subsets. The converse direction mirrors this by constructing witness tuples from the join components to satisfy the tuple matching requirements.^[9]^[1]

Illustrative Examples

Simple Example

Consider the relation Employee(EID, Skill, Child), which records employee identifiers (EID), their skills, and their children, where skills and children are independent attributes for each employee.^[10] This setup illustrates a multivalued dependency (MVD) EID →→ Skill | Child, meaning the set of skills for a given EID is independent of the set of children, and vice versa, requiring the relation to contain all combinations of these values.^[11] For employee EID=1, who possesses the skills "Programming" and "Design" and has children "Alice" and "Bob", the MVD holds if the following tuples are present, forming the full Cartesian product:

EID	Skill	Child
1	Programming	Alice
1	Programming	Bob
1	Design	Alice
1	Design	Bob

These tuples demonstrate the independence: the skills do not influence the children listed, and all pairings appear without redundancy beyond the necessary combinations.^[10] If the relation omits a tuple, such as (1, Design, Bob), the MVD is violated, as the incomplete set fails to capture the full independence between skills and children for that EID.^[11] In essence, the MVD enforces that, for any fixed EID, every skill must pair with every child in the relation.^[10]

Application in Relations

In relational schemas, multivalued dependencies often arise in scenarios modeling independent multi-valued attributes, such as enrollment data where a course can have multiple students and multiple textbooks without interdependence between those sets. Consider the schema Course(CID, Student, Instructor, Textbook), where each course (identified by CID) is taught by a specific instructor but enrolls multiple students and uses multiple textbooks independently; the multivalued dependency can be expressed as CID →→ Student | Textbook.^[1] To illustrate manifestation, suppose a dataset for CID=101 (taught by Instructor "Smith") includes the full cross-product of students and textbooks: tuples such as (101, Alice, Smith, Database Systems), (101, Alice, Smith, Algorithms), (101, Bob, Smith, Database Systems), and (101, Bob, Smith, Algorithms). This satisfies the MVD because every student-textbook pair appears, reflecting the independence. A violation occurs if, for instance, the tuple (101, Alice, Smith, Algorithms) is missing while (101, Alice, Smith, Database Systems) and (101, Bob, Smith, Algorithms) exist, as it breaks the required combinations for the given CID and instructor.^[1]^[12] Detection in such relations involves manually verifying the MVD by projecting the relation onto (CID, Student, Instructor) and (CID, Textbook, Instructor), then performing a natural join on the common attributes (CID, Instructor); the result must exactly reconstruct the original relation for the dependency to hold.^[1] Such MVDs introduce redundancy, as the instructor information ("Smith") is repeated across every student-textbook combination, leading to update anomalies—for example, changing the instructor requires modifying multiple tuples.^[1]^[12]

Theoretical Properties

Axioms and Inference Rules

The inference rules for multivalued dependencies (MVDs) form a sound and complete axiomatization, allowing derivation of all implied MVDs from a given set, analogous to Armstrong's axioms for functional dependencies. This system was established by Beeri, Fagin, and Howard in their seminal work, proving that the rules capture exactly the semantic implications of MVDs in relational schemes.^[13] The core axioms consist of four primary rules, denoted as MVDO through MVD3, which operate on sets of attributes X, Y, Z, W over a universal relation scheme U. The complementation rule (MVDO) states that if X \cup Y \cup Z = U and Y \cap Z \subseteq X, then X \to\to Y if and only if X \to\to Z.^[13] Reflexivity (MVD1) holds that if Y \subseteq X, then X \to\to Y.^[13] Augmentation (MVD2) provides that if Z \subseteq W and X \to\to Y, then XW \to\to YZ.^[13] Transitivity (MVD3) asserts that if X \to\to Y and Y \to\to Z, then X \to\to Z - Y.^[13] Note that MVD transitivity is conditional, applying only to the portion of Z excluding Y, unlike the full transitivity in functional dependencies.^[13] From these basic rules, additional derived rules follow, including union, decomposition, and pseudo-transitivity, which expand the inference capabilities. Union states that if X \to\to Y_1 and X \to\to Y_2, then X \to\to Y_1 Y_2.^[13] Decomposition implies that if X \to\to Y_1 Y_2, then X \to\to Y_1 and X \to\to Y_2.^[13] Pseudo-transitivity holds that if X \to\to Y and Y W \to\to Z, then X W \to\to Z - (Y W).^[13] Reflexivity covers trivial MVDs where the dependent set is a subset of the determinant.^[13] A set of MVDs F implies another set G if every MVD in G can be logically derived from F using these axioms and rules, ensuring closure under the implication problem for MVDs.^[13] This completeness theorem guarantees that no valid MVD is overlooked in the derivation process.^[13]

Equivalence to Join Dependencies

A join dependency (JD) is a generalization of multivalued dependencies in relational database theory. Formally, a relation r on schema R satisfies a join dependency *(X_1, X_2, \dots, X_n), where X_1, \dots, X_n are subsets of R whose union is R, if r equals the natural join of its projections onto each X_i, i.e., r = \pi_{X_1}(r) \bowtie \pi_{X_2}(r) \bowtie \dots \bowtie \pi_{X_n}(r).^[14] Multivalued dependencies represent a specific case of join dependencies. For a relation R(X, Y, Z), a single multivalued dependency X \to\to Y (or equivalently X \to\to Y \mid Z) holds if and only if R satisfies the join dependency *(XY, XZ), meaning r = \pi_{XY}(r) \bowtie \pi_{XZ}(r).^[14] This equivalence arises because the MVD condition—that for any tuples (x, y, z) and (x, y', z') in r, the tuples (x, y', z) and (x, y, z') must also be present—precisely ensures the lossless join of the projections on XY and XZ. This correspondence extends to sets of multivalued dependencies, which imply a corresponding join dependency, though the converse does not hold in general. Specifically, a set of MVDs over [R](/page/R) implies a JD whose components are formed by combining the determinants and dependent sets from the MVDs; however, not all JDs can be expressed solely in terms of MVDs, as JDs allow for more complex decompositions involving multiple components beyond binary splits.^[14] Multivalued dependencies are preserved under projection, meaning that if an MVD holds in a relation, it holds in any projection of that relation containing the relevant attributes. This property distinguishes MVDs from general JDs, which are not necessarily preserved under projection.^[14] The equivalence has significant implications for database normalization, as join dependencies provide a broader framework for higher-order normal forms such as fifth normal form (5NF), which requires lossless decomposition with respect to all JDs, including those generalizing MVDs. This generalization enables the analysis of more intricate inter-attribute independencies beyond pairwise MVDs.^[14]

Implications for Database Design

Role in Normalization

A relation schema is in fourth normal form (4NF) if it is in Boyce-Codd normal form (BCNF) and whenever a nontrivial multivalued dependency X \to\to Y holds in the schema, X is a superkey of the schema.^[1] This condition ensures that no independent multi-valued facts are mixed in a single relation, preventing anomalies arising from multivalued dependencies beyond those implied by candidate keys.^[1] To achieve 4NF, a relation R violating this form due to a nontrivial multivalued dependency X \to\to Y (where R = XYZ) is decomposed into two projections: R_1 = [XY](/page/XY) and R_2 = [XZ](/page/XZ).^[1] This binary decomposition is lossless, as the original relation can be recovered via the natural join of R_1 and R_2, since the multivalued dependency guarantees that R = R_1 \bowtie R_2.^[1] The process preserves the dependencies involved in the decomposition, with the multivalued dependency becoming trivial in each subschema.^[15] The primary benefit of this decomposition is the elimination of redundancy caused by independent multi-valued attributes, reducing update anomalies and storage waste while maintaining data integrity.^[1] For instance, consider a relation Employee(\emph{EID}, \emph{Skill}, \emph{Child}) where \emph{EID} \to\to \emph{Skill} and \emph{EID} \to\to \emph{Child}, modeling an employee's independent sets of skills and children.^[1] Decomposing it yields EmployeeSkill(\emph{EID}, \emph{Skill}) and EmployeeChild(\emph{EID}, \emph{Child}), each in 4NF, avoiding the need to repeat \emph{EID} for every skill-child combination in the original relation.^[1] The full algorithm to normalize to 4NF involves iteratively identifying all nontrivial multivalued dependencies in the current set of relations, decomposing each violating relation as described, and projecting the dependencies onto the subschemas until no violations remain.^[1] Unlike functional dependency-based normalizations up to third normal form, where dependency preservation is always achievable, decompositions for multivalued dependencies to 4NF do not guarantee preservation of all original multivalued dependencies in the resulting schemata.^[15]

Comparison with Other Dependencies

Multivalued dependencies (MVDs) differ from functional dependencies (FDs) in that FDs enforce unique determination, where a given value of attribute set X determines exactly one value for attribute set Y (denoted X → Y), whereas MVDs permit multiple independent values for Y associated with each X value (denoted X →→ Y).^[11] For instance, in an employee relation, an FD might specify that each employee has a unique salary (EMPLOYEE → SALARY), but an MVD could indicate multiple children per employee without uniqueness (EMPLOYEE →→ CHILD).^[11] This distinction allows MVDs to model set-valued associations more flexibly than FDs, which assume single-valued mappings.^[15] In relation to join dependencies (JDs), MVDs represent a specific binary form, where an MVD X →→ Y corresponds to a JD that decomposes the relation into exactly two projections over XY and XZ (with Z as the complement), ensuring the original relation is the natural join of these projections.^[11] JDs, however, generalize this to n-ary decompositions, capturing more complex independences among multiple attribute sets beyond pairwise separations.^[15] Thus, every MVD implies a JD, but JDs encompass scenarios requiring multi-way joins that MVDs alone cannot express.^[15] An MVD X →→ Y is trivial if Y is a subset of X or if the union of X and Y equals the entire relation schema R, meaning it holds in every possible relation without imposing constraints.^[16] In contrast, non-trivial MVDs introduce potential anomalies by enforcing the presence of specific tuple combinations, such as requiring "all-or-nothing" inclusions in the relation to maintain independence between attribute sets.^[11] Every FD is also an MVD, since if X → Y holds, then X →→ Y follows (provided X and Y are disjoint), but the converse does not hold, as MVDs tolerate multiplicity absent in FDs.^[16] In sets containing both FDs and MVDs, inference rules interact via mixed axioms, such as deriving an FD from an MVD combined with another FD, enabling comprehensive dependency closure computations.^[16] MVDs uniquely capture legitimate "all-or-nothing" multi-valued facts, where the full set of associated values must coexist independently, unlike multivalued interpretations of FDs, which signal semantic design errors by violating uniqueness.^[11] This property underscores MVDs' role in modeling orthogonal multi-valued relationships without implying determination hierarchies present in FDs.^[15]