Multiset
In mathematics, a multiset (also known as a bag or mset) is a generalization of a set that permits multiple instances of elements, where each element has an associated multiplicity or frequency indicating how many times it appears.[1] Unlike standard sets, which enforce uniqueness and disregard order, multisets emphasize the count of occurrences while remaining unordered collections.[2] Formally, a multiset can be constructed as a pair consisting of a base set of distinct elements and a function assigning a non-negative integer multiplicity to each element, such as the multiset represented by {a: 2, b: 3, d: 1}, meaning a appears twice, b three times, and d once.[1] Common operations on multisets include the union, which takes the maximum multiplicity for each element across the two multisets; the intersection, which takes the minimum multiplicity; and the sum (or multiset addition), which adds the multiplicities element-wise.[1] These operations extend set-theoretic concepts to handle repetitions systematically, enabling comparisons and manipulations based on cardinality, where the total cardinality of a multiset is the sum of all multiplicities.[3] The foundational development of multiset theory occurred in the late 20th century, with Wayne D. Blizard establishing a rigorous axiomatic framework in 1989 that defines multisets as structures allowing finite repetitions of elements within a first-order logic setting.[4] Blizard's subsequent 1991 survey traces the evolution of multiset theory from early informal uses in combinatorics and statistics to a formalized branch of set theory, highlighting its distinctions from related concepts like fuzzy sets or sequences.[5] Multisets play a crucial role in enumerative combinatorics, where they model selections with repetition, such as the number of multisets of size k from n types given by the binomial coefficient \binom{n+k-1}{k}, often called "stars and bars."[2] In computer science, they underpin data structures like bags or multisets in libraries (e.g., C++'s std::multiset) for efficient storage and querying of elements with duplicates, and they appear in algorithms for secure multiparty computation, such as multiset intersection cardinality protocols.[3][6] Further applications include description logics for knowledge representation, software verification with cardinality constraints, and pattern recognition in signal processing and machine learning, where multisets facilitate frequency-based similarity measures like the Jaccard index over repetitions.[7][8][1]Fundamentals
Definition
In mathematics, a multiset is a generalization of the concept of a set that allows multiple instances, or multiplicities, of elements, thereby distinguishing it from standard sets where elements are unique. Formally, a multiset M is defined as an ordered pair (X, \mu), where X is a set (often called the underlying set or the domain) and \mu: X \to \mathbb{N}_0 is a multiplicity function that assigns to each element x \in X a non-negative integer \mu(x) representing the number of times x occurs in the multiset.[9] This definition assumes familiarity with basic set theory, including the notions of sets and functions.[9] Alternative formalizations of a multiset include representing it as a set in which elements may appear more than once, such as \{a, a, b\}, where repetitions are explicitly listed; however, this notation is informal and less suitable for infinite or abstract contexts. Another equivalent approach specifies a multiset via its support—the subset of elements with positive multiplicity—and the corresponding multiplicity map restricted to that support, ensuring \mu(x) > 0 only for x in the support.[9] In a more general transfinite setting, as introduced by Richard Rado, a multiset can be viewed as a function from a set to the class of cardinal numbers, with the domain consisting of elements assigned nonzero cardinals.[9] The cardinality of a multiset M = (X, \mu) is defined as the sum of the multiplicities over its underlying set, given by |M| = \sum_{x \in X} \mu(x), which counts the total number of element occurrences, including repetitions.[9] The empty multiset, denoted \emptyset or $0, is the pair (\emptyset, \mu) where the underlying set is empty (equivalently, \mu(x) = 0 for all x in some universe), and it has cardinality 0.[9] A singleton multiset consists of an underlying set with one element x and \mu(x) = 1, yielding cardinality 1.[9]Examples
A shopping basket containing two apples and one banana exemplifies a multiset, which can be denoted as {apple, apple, banana} to indicate repetitions or more compactly as {apple: 2, banana: 1}, preserving the multiplicity of each item.[10][11] In comparison, viewing the same basket through the lens of a standard set yields {apple, banana}, which eliminates multiplicity information and thus fails to capture the full contents.[12] A deck of cards, particularly when multiple decks are combined in games allowing duplicates of suits and ranks, represents a multiset where identical cards coexist without regard to order.[13] In mathematics, the prime factorization of an integer provides a clear example: the number 12 decomposes into the multiset {2, 2, 3}, reflecting the repeated prime factor 2. Multisets are commonly notated by listing elements with repetitions, such as {a, a, b}, or by specifying multiplicities, as in {a: 2, b: 1} to denote two instances of a and one of b.[11] The cardinality of a multiset, or its total size, sums the multiplicities of its elements; for instance, the multiset {1: 3, 2: 1} has cardinality 4.[12]Properties and Operations
Basic Properties
A multiset M over a base set X is defined by its multiplicity function \mu_M: X \to \mathbb{N}_0, where \mathbb{N}_0 denotes the non-negative integers, and the multiplicity \mu_M(x) indicates the number of occurrences of each element x \in X. Two multisets M and N over the same base set are equal, denoted M = N, if and only if their multiplicity functions agree for every element, that is, \mu_M(x) = \mu_N(x) for all x \in X.[14][15] This equality condition directly follows from the definition of multisets via multiplicity functions, as any difference in multiplicities would distinguish the collections.[16] The inclusion relation for multisets generalizes set inclusion to account for multiplicities. A multiset M is a submultiset of another multiset N, denoted M \subseteq N, if \mu_M(x) \leq \mu_N(x) for every x \in X.[14][15] This relation is reflexive (M \subseteq M) and transitive (M \subseteq N and N \subseteq P imply M \subseteq P), forming a partial order on the collection of multisets over X.[16] A proper submultiset holds when the inequality is strict for at least one x, ensuring M \neq N. To verify inclusion, one compares multiplicities element-wise; if all \mu_M(x) \leq \mu_N(x) and equality holds everywhere, then M = N, otherwise M is a proper submultiset if the condition holds without equality.[14] The support of a multiset M, denoted \operatorname{supp}(M), is the underlying set consisting of elements with positive multiplicity: \operatorname{supp}(M) = \{ x \in X \mid \mu_M(x) > 0 \}.[14][16] This support forms an ordinary set without repetitions and determines the distinct elements present in M, relating directly to the base set X as a subset where multiplicities vanish outside \operatorname{supp}(M). For submultisets, \operatorname{supp}(M) \subseteq \operatorname{supp}(N) if M \subseteq N, since zero multiplicity in M is compatible with positive in N, but the converse does not hold without multiplicity checks.[15] When the base set X is equipped with a total order \leq, an ordered multiset inherits this order on its elements, allowing identification of maximal and minimal elements based on the order of distinct items in the support. The maximal element of M is the largest x \in \operatorname{supp}(M) under \leq, regardless of multiplicity, and similarly the minimal element is the smallest such x.[17] In this totally ordered context, these extremal elements coincide with the greatest and least elements of the support, providing endpoints for the ordered structure.[17]Arithmetic Operations
Multisets support a variety of arithmetic operations that extend their set-theoretic counterparts by accounting for element multiplicities, enabling the construction of new multisets from existing ones. These operations are defined via the multiplicity function \mu: U \to \mathbb{N}_0, where U is the universal set and \mathbb{N}_0 denotes non-negative integers, ensuring that the resulting multiplicity for each element is computed based on the input multiplicities. Such operations are fundamental in combinatorics and computer science for modeling collections with repetitions, like bags in databases or weighted selections.[16] The union of two multisets M and N, denoted M \cup N, is the multiset whose multiplicity function is given by \mu_{M \cup N}(x) = \max(\mu_M(x), \mu_N(x)) for all x \in U. This operation retains the highest occurrence count for each element, analogous to set union but preserving multiplicities. For example, if M = \{a: 2, b: 1\} and N = \{a: 1, c: 3\}, then M \cup N = \{a: 2, b: 1, c: 3\}. Union is commutative, as \max(\mu_M(x), \mu_N(x)) = \max(\mu_N(x), \mu_M(x)); associative, since \max(\max(a, b), c) = \max(a, \max(b, c)) for non-negative integers a, b, c; and idempotent, with M \cup M = M.[16] The intersection M \cap N takes the minimum multiplicity: \mu_{M \cap N}(x) = \min(\mu_M(x), \mu_N(x)) for all x \in U, including only elements common to both with their overlapping counts. Using the prior example, M \cap N = \{a: 1\}. Intersection shares the commutativity and associativity of union, via \min(\mu_M(x), \mu_N(x)) = \min(\mu_N(x), \mu_M(x)) and \min(\min(a, b), c) = \min(a, \min(b, c)), and is idempotent since M \cap M = M.[16] The sum, or disjoint union, M + N adds multiplicities directly: \mu_{M + N}(x) = \mu_M(x) + \mu_N(x) for all x \in U, useful in combinatorial enumeration for combining independent selections. For the example multisets, M + N = \{a: 3, b: 1, c: 3\}. This operation is commutative, as addition of non-negative integers is commutative, and associative, since (a + b) + c = a + (b + c).[16] The difference M - N subtracts multiplicities, flooring at zero: \mu_{M - N}(x) = \max(\mu_M(x) - \mu_N(x), 0) for all x \in U, removing elements from N as much as possible from M. In the example, M - N = \{a: 1, b: 1\}, but N - M = \{c: 3\}, highlighting its non-commutativity. Unlike the others, difference lacks general associativity or idempotence.[16] The Cartesian product M \times N extends pairs across elements, with multiplicity as the product: \mu_{M \times N}((x, y)) = \mu_M(x) \cdot \mu_N(y) for all x, y \in U, forming a multiset over U \times U. For M = \{a: 2\} and N = \{b: 1, c: 2\}, M \times N includes (a, b) with multiplicity 2 and (a, c) with multiplicity 4. This operation is commutative in the sense that M \times N = N \times M up to relabeling of pairs and is associative when extended to multiple factors.[18]Enumeration
Recurrence Relations
The number of multisets of cardinality k that can be formed from a base set of n distinct types, often denoted b(n, k), counts the combinations where repetitions of types are permitted and the order of elements is irrelevant.[19] A recursive approach computes b(n, k) via the relation b(n, k) = b(n-1, k) + b(n, k-1) for n \geq 1 and k \geq 1, with base cases b(n, 0) = 1 for all n \geq 0 (corresponding to the single empty multiset), b(0, 0) = 1, and b(0, k) = 0 for k > 0 (as no positive-sized multiset is possible without types).[20] This recurrence derives from partitioning multisets based on the highest type label, assuming types are labeled $1 through n. Multisets excluding type n match those of size k from n-1 types, yielding b(n-1, k) cases. Multisets including at least one type n map bijectively to those of size k-1 from n types by removing one instance of n, yielding b(n, k-1) cases. The addition principle then combines these disjoint cases.[20] For illustration, the table below shows computed values of b(n, k) for small n and k, obtained by applying the recurrence iteratively from the base cases:| n \setminus k | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 | 0 |
| 1 | 1 | 1 | 1 | 1 | 1 |
| 2 | 1 | 2 | 3 | 4 | 5 |
| 3 | 1 | 3 | 6 | 10 | 15 |
Generating Functions
Generating functions provide a powerful tool for enumerating multisets by encoding the number of such objects of various sizes into a formal power series. For multisets formed from n distinct types, the ordinary generating function arises from the independent choice of multiplicity for each type, where the multiplicity m of a type contributes the geometric series \sum_{m=0}^{\infty} x^m = \frac{1}{1-x}. Thus, the overall generating function is the product \prod_{i=1}^n \frac{1}{1-x} = \frac{1}{(1-x)^n}, which counts multisets of any size by the exponent of x. This product form reflects the independence of choices across types, as each factor corresponds to one type and the coefficient of x^k sums over all ways to distribute k indistinguishable elements into n types allowing repetitions.[19] The coefficient of x^k in \frac{1}{(1-x)^n}, denoted [x^k] \frac{1}{(1-x)^n}, gives the number of k-element multisets from n types, which equals the binomial coefficient \binom{n+k-1}{k}. This follows from the generalized binomial theorem, where (1-x)^{-n} = \sum_{k=0}^{\infty} \binom{n+k-1}{k} x^k. For unrestricted multiset size, the full series \sum_{k=0}^{\infty} b(n,k) x^k = \frac{1}{(1-x)^n}, with b(n,k) denoting the number of k-multisets, provides a compact representation for enumeration problems.[19] In the multivariate case, when types are labeled and multiplicities tracked separately, the generating function becomes \prod_{i=1}^n \sum_{m=0}^{\infty} x_i^m = \prod_{i=1}^n \frac{1}{1 - x_i}. Here, the coefficient of x_1^{a_1} \cdots x_n^{a_n} is 1 if each a_i \geq 0, corresponding to the unique multiset with those multiplicities. This form extends the univariate case and is useful for distinguishing types in applications like partition enumeration.[19] For example, consider integer partitions, which can be viewed as multisets of positive integers. The generating function for unrestricted partitions (allowing repeated parts) is \prod_{i=1}^{\infty} \frac{1}{1 - x^i}, reflecting unlimited repetitions of each part size. In contrast, partitions into distinct parts (equivalent to sets, not multisets) have generating function \prod_{i=1}^{\infty} (1 + x^i), where each part appears at most once, leading to different coefficient growth rates.[21]Connections to Binomial Expansions
The enumeration of multisets is intimately connected to the generalized binomial theorem, particularly through expansions involving negative exponents. The number of multisets of cardinality k chosen from n distinct elements is given by the binomial coefficient \binom{n + k - 1}{k}, which arises as the coefficient in the series expansion (1 - x)^{-n} = \sum_{k=0}^{\infty} \binom{n + k - 1}{k} x^k.[22] This identity follows from the generalized binomial theorem, where the binomial coefficient for negative upper index is \binom{-n}{k} = (-1)^k \binom{n + k - 1}{k}, linking the algebraic expansion directly to multiset counting upon accounting for the sign alternation.[23] The negative binomial series (1 - x)^{-n} converges for |x| < 1, providing an infinite power series whose coefficients precisely enumerate unrestricted multisets of varying sizes from n types.[22] For instance, the coefficient of x^k counts the ways to select k elements with repetition allowed from n options, equivalent to the number of non-negative integer solutions to x_1 + \cdots + x_n = k. A combinatorial proof of this coefficient uses the stars and bars theorem: represent the k selected elements as k stars (*) and the n-1 dividers between the n types as bars (| ), forming a sequence of n + k - 1 positions where n-1 are chosen for bars, yielding \binom{n + k - 1}{n-1} = \binom{n + k - 1}{k} arrangements.[24] This visualization maps directly to distributing indistinguishable selections into distinct bins, confirming the multiset count. For generalizations, when multiplicities are weighted or bounded, the exponent in the binomial expansion can be adjusted to reflect the structure, such as using fractional or variable exponents in the series to model weighted selections.[23] An example of the negative upper index connection is that the number of multisets of size k from n elements equals (-1)^k \binom{-n}{k}, illustrating how the combinatorial interpretation "negates" the usual subset selection to allow repetitions. The hockey-stick identity further relates multiset counts by summing them: \sum_{k=0}^{m} \binom{n + k - 1}{k} = \binom{n + m}{m}, which enumerates all multisets of cardinality at most m from n elements, as the partial sum of the negative binomial series coefficients.[25] This identity provides a closed form for cumulative multiset enumerations, bridging individual counts to total selections up to a limit.Applications
Combinatorics and Algebra
In partition theory, multisets provide a natural framework for representing integer partitions, where the parts of the partition correspond to the elements of the multiset, allowing repetitions to indicate multiplicity in the summands. This perspective is particularly useful for typed or colored partitions, where distinct types distinguish otherwise identical parts, extending classical unrestricted partitions to more general combinatorial objects. For instance, a multiset partition algebra has been developed to handle such structures, generalizing the Robinson-Schensted-Knuth correspondence to insertions on multiset partitions.[26] Multisets play a key role in the theory of symmetric functions by extending classical bases like power sums and elementary symmetric polynomials to account for multiplicities. The power sum symmetric functions p_k = \sum_i x_i^k and elementary symmetric polynomials e_k = \sum_{1 \leq i_1 < \cdots < i_k} x_{i_1} \cdots x_{i_k} can be generalized to multisets, where variables are weighted by their multiplicities, enabling the enumeration of symmetric objects with repeated labels. This extension aligns with Joyal's symmetric species framework, where multisets serve as the domain for functors mapping finite sets to combinatorial structures, yielding generating functions that capture multiplicity-adjusted symmetries. Seminal work in this area establishes operations like sum, product, and plethysm for symmetric species, directly corresponding to those on symmetric functions. In graph theory, multisets model the edges of multigraphs, where parallel edges between vertices form a multiset of edge incidences, allowing multiple connections without loops in simple cases. The degree sequence of a multigraph, defined as the nonincreasing list of vertex degrees, is inherently a multiset, as degrees can repeat across vertices, facilitating the study of graphical realizations and extremal properties. For example, realizing a given multiset as a degree sequence in a multigraph requires the sum of degrees to be even and no degree exceeding the possible connections.[27] Combinatorial identities involving multisets often center on permutation counts, where the number of distinct permutations of a multiset with n elements and multiplicities k_1, k_2, \dots for distinct types is given by the multinomial coefficient \frac{n!}{k_1! k_2! \cdots}. This formula arises from dividing the total permutations of a set by the repetitions within each type, providing a core identity for counting arrangements with indistinguishable objects. Such identities underpin broader results, like those in the multinomial theorem, and extend to colored multisets for multifactorial generalizations.[28] In algebraic structures, multisets appear in commutative algebra through multiset ideals, which generalize monomial ideals by allowing multiplicities in generators, particularly in zero-dimensional cases where they locally resemble monomial ideals under Gröbner bases. These ideals support efficient computations via algorithms like Cerlienco-Mureddu for lexicographic standard monomials. Additionally, in combinatorial species theory, multisets form the basis for species functors, as introduced by Joyal, enabling the algebraic manipulation of labeled structures through exponential generating functions that account for relabeling and multiplicity. Multisets are treated via extensions such as symmetric species, facilitating compositions like substitutions for complex enumerative problems.[29][30] A specific application arises in necklace counting, where multisets encode color repetitions to determine distinct configurations under rotation. Bijections to multisets with divisible subset sums simplify enumeration for bounded colors. This approach equates necklaces with at most q colors to multisets of integers modulo n whose subset sums are divisible by n, providing an explicit combinatorial correspondence.[31]Computer Science and Data Structures
In computer science, multisets are implemented using data structures that efficiently handle duplicate elements and their multiplicities, enabling operations like insertion, deletion, and querying counts. Balanced binary search trees, such as red-black trees, are commonly used for ordered multisets, as in the C++ standard library'sstd::multiset, which maintains elements in sorted order while allowing multiples of the same key. This structure supports logarithmic time complexity for key operations, making it suitable for scenarios requiring sorted access or range queries. Alternatively, hash maps provide an unordered implementation by associating each unique element with a counter for its multiplicity, as seen in libraries like Apache Commons Collections' HashMultiSet, which offers average-case constant-time insertions and lookups at the expense of order preservation.
Algorithms for multiset operations, such as union and intersection, leverage the underlying structure for efficiency. For sorted multisets, union and intersection can be computed in linear time O(n + m) using merge-like procedures on sorted sequences, similar to those in C++'s <algorithm> header for set operations adapted to handle multiplicities by taking maximum or minimum counts, respectively. These algorithms find applications in sorting with duplicates, where multisets preserve order and multiplicity during stable sorts, avoiding the need for auxiliary tracking structures. Arithmetic operations like multiset union and intersection, as defined mathematically, align with these implementations for practical computation.
Programming languages provide built-in or library support for multisets through specialized classes. In Python, the collections.Counter class implements a multiset as a dictionary subclass, enabling frequency counting of hashable objects with methods like update() for adding elements and most_common() for retrieving elements by count. Similarly, Java's Guava library offers a Multiset interface with implementations like HashMultiset and TreeMultiset, supporting operations such as count(), add(), and remove() to manage multiplicities efficiently.
In databases, multisets model duplicate records naturally, with SQL's GROUP BY clause combined with COUNT() aggregating rows by key while tracking frequencies, as standardized in SQL-92 and implemented in systems like SQL Server. For example, querying SELECT column, COUNT(*) FROM table GROUP BY column produces a multiset view of unique values and their occurrences, facilitating duplicate analysis without explicit multiplicity storage.
The time and space complexities of multiset operations vary by implementation. Tree-based multisets, like std::multiset, achieve O(log n) time for insert, delete, and search operations, with O(n) space for n elements, due to the balanced tree overhead. Hash-based versions offer amortized O(1) time for these operations but may degrade to O(n) in worst-case scenarios from hash collisions, still using O(n) space. As of 2025, modern applications include machine learning, where bag-of-words models in scikit-learn use CountVectorizer to represent documents as multisets of terms for frequency-based feature extraction in tasks like text classification. In search engines, term frequencies are handled as multisets to compute relevance scores, such as in TF-IDF weighting, powering inverted indexes in systems like Elasticsearch.