Pointwise mutual information
Introduction and Definition
Pointwise mutual information (PMI) is a measure of association between two discrete events or random variables, quantifying how much more (or less) information one event provides about the other compared to independence.[1] It is formally defined as \text{PMI}(x; y) = \log_2 \frac{P(x,y)}{P(x) P(y)}, where P(x,y) is the joint probability of x and y, and P(x) and P(y) are their marginal probabilities. The measure was introduced to computational linguistics by Kenneth Ward Church and Patrick Hanks in 1989 to statistically derive word association norms from large text corpora.[1] Estimated using frequency counts from a corpus, such as P(x) = f(x)/N where f(x) is the frequency of x and N is the total number of observations, PMI enables objective analysis of co-occurrences, replacing subjective psychological association tests.[1]

In natural language processing (NLP), PMI serves as a foundational tool for identifying significant word co-occurrences, such as in collocation extraction (e.g., "strong tea" over "powerful tea") and in constructing term co-occurrence matrices for distributional semantics, where higher PMI values indicate stronger semantic or syntactic associations.[2] It has applications in speech recognition, parsing, information retrieval, and lexicography, where it highlights lexico-syntactic patterns such as verb-preposition pairs within defined windows (e.g., five words).[1]

Despite its utility, PMI can produce negative values for unlikely co-occurrences (indicating repulsion) and is biased toward rare events. To mitigate this, variants are used: positive PMI (PPMI), defined as \max(\text{PMI}(x; y), 0), discards negative values, while normalized PMI (NPMI), given by \text{NPMI}(x;y) = \frac{\text{PMI}(x;y)}{-\log_2 P(x,y)} and ranging from -1 to 1, provides a bounded scale with 0 for independence and better handling of low frequencies.[2][3]
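For concreteness, the estimation from corpus frequency counts and the PPMI and NPMI variants described above can be sketched in Python. This is a minimal illustration; the helper function names and the toy counts are assumptions for the example, not drawn from the cited sources.

```python
import math

def pmi(joint_count, count_x, count_y, total, base=2):
    """PMI(x; y) = log2( P(x,y) / (P(x) * P(y)) ), with probabilities
    estimated as relative frequencies f/N. Requires joint_count > 0,
    since a zero joint count gives PMI of negative infinity."""
    p_xy = joint_count / total
    p_x = count_x / total
    p_y = count_y / total
    return math.log(p_xy / (p_x * p_y), base)

def ppmi(joint_count, count_x, count_y, total):
    """Positive PMI: negative scores are clipped to zero."""
    return max(pmi(joint_count, count_x, count_y, total), 0.0)

def npmi(joint_count, count_x, count_y, total):
    """Normalized PMI: PMI divided by -log2 P(x,y); ranges from -1 to 1,
    with 0 indicating independence."""
    p_xy = joint_count / total
    return pmi(joint_count, count_x, count_y, total) / (-math.log2(p_xy))

# Toy counts (illustrative): "strong" 30 times, "tea" 20 times,
# and the pair "strong tea" 12 times in a corpus of 10,000 tokens.
print(pmi(12, 30, 20, 10_000))   # ~7.64 bits: far above chance
print(ppmi(12, 30, 20, 10_000))  # same value, since the PMI is positive
print(npmi(12, 30, 20, 10_000))  # ~0.79 on the [-1, 1] scale
```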
Historical Background
Pointwise mutual information (PMI) traces its origins to the foundational work in information theory established by Claude Shannon in 1948, where mutual information was introduced as a measure of the average amount of information one random variable provides about another, quantified as an expected value over the joint distribution.[4] This average formulation laid the groundwork for understanding dependencies in communication systems, but it was not until 1961 that Robert Fano explicitly formalized the pointwise counterpart in his book Transmission of Information: A Statistical Theory of Communications, where he described the instantaneous, event-specific measure of association between outcomes, initially referring to it under the term "mutual information" (now reserved for the average).[5] Fano's contribution extended Shannon's framework by focusing on the log-ratio of joint to marginal probabilities for specific events, enabling applications in statistical communication theory and early pattern recognition tasks during the 1960s and 1970s.[5]

Warren Weaver played a key role in interpreting and disseminating these concepts through his collaboration with Shannon, co-authoring The Mathematical Theory of Communication in 1949, which popularized mutual information as a tool for analyzing redundancy and dependency in signals and thereby bridged technical theory with broader scientific interpretations. In the decades that followed, the PMI measure appeared in statistical contexts alongside other association metrics, such as log-likelihood ratios, for testing dependencies in contingency tables.

The concept gained prominence in linguistics with the seminal paper by Kenneth Ward Church and Patrick Hanks in 1989, who applied the pointwise measure to word associations in large corpora, demonstrating its utility for identifying collocations and informing lexicography by quantifying how much more often words co-occur than expected under independence.[1] This work marked a pivotal adoption in computational linguistics, shifting the measure from theoretical statistics to practical, corpus-based methods in the 1990s, where PMI became integral to natural language processing tasks such as phrase extraction and semantic analysis. Mutual information, as the expected value of PMI over all outcomes, underscores this historical progression from average dependencies to pointwise specificity.[4]

Formal Definition
Pointwise mutual information (PMI) between two discrete events x and y is formally defined as \text{PMI}(x; y) = \log_2 \left( \frac{P(x,y)}{P(x) P(y)} \right), where P(x,y) denotes the joint probability that both events occur together, and P(x) and P(y) are the respective marginal probabilities that x or y occurs alone.[5] This formulation originates from information theory, where it serves as the pointwise contribution to the overall mutual information between random variables. PMI measures the deviation of the observed co-occurrence probability P(x,y) from the probability expected under statistical independence, P(x) P(y), on a per-event basis rather than as an average across all possible events.

A positive value indicates that x and y co-occur more frequently than independence would predict, suggesting a positive association; a value of zero signifies exact independence; and a negative value implies co-occurrence less frequent than expected, indicating repulsion or negative association. PMI can be expressed in terms of self-information as \text{PMI}(x; y) = I(x) + I(y) - I(x,y), where I(z) = -\log_2 P(z) is the self-information of event z. The range of PMI extends from -\infty, which occurs when P(x,y) = 0 while P(x) > 0 and P(y) > 0 (complete repulsion, as the events never co-occur despite occurring individually), to a finite upper bound of \log_2 \left[ \min\left(1/P(x), 1/P(y)\right) \right], achieved when P(x,y) reaches its theoretical maximum of \min(P(x), P(y)) (maximum possible association). Because the defining formula is symmetric, \text{PMI}(x; y) = \text{PMI}(y; x).

To illustrate, consider a small corpus of 100 word tokens in which the target word "ice" appears 5 times, the context word "cream" appears 4 times, and the pair "ice cream" appears together 2 times. The probabilities are P(\text{ice}) = 0.05, P(\text{cream}) = 0.04, and P(\text{ice}, \text{cream}) = 0.02. Thus, \text{PMI}(\text{ice}; \text{cream}) = \log_2 \left( \frac{0.02}{0.05 \times 0.04} \right) = \log_2 10 \approx 3.32, indicating a strong positive association. For comparison, if the words co-occurred only as often as independence predicts (P(\text{ice}, \text{cream}) = 0.002), then \text{PMI} = 0. The following contingency table summarizes the example counts, with "not cream" comprising the remaining 96 tokens; a short computational check of this example appears after the table.

| | cream | not cream | Total |
|---|---|---|---|
| ice | 2 | 3 | 5 |
| not ice | 2 | 93 | 95 |
| Total | 4 | 96 | 100 |
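As a numerical check on the worked example, here is a minimal Python sketch assuming only the counts in the contingency table above; the variable names are illustrative.

```python
import math

# Counts from the contingency table above (N = 100 tokens).
N = 100
f_ice, f_cream, f_joint = 5, 4, 2

p_ice, p_cream = f_ice / N, f_cream / N  # 0.05 and 0.04
p_joint = f_joint / N                    # 0.02

# PMI(ice; cream) = log2( P(ice, cream) / (P(ice) * P(cream)) )
pmi = math.log2(p_joint / (p_ice * p_cream))
print(round(pmi, 2))  # 3.32, i.e. log2(10)

# Finite upper bound log2(min(1/P(x), 1/P(y))), reached when P(x,y) = min(P(x), P(y)).
upper_bound = math.log2(min(1 / p_ice, 1 / p_cream))
print(round(upper_bound, 2))  # 4.32, i.e. log2(20)

# Symmetry: PMI(ice; cream) equals PMI(cream; ice).
assert math.isclose(pmi, math.log2(p_joint / (p_cream * p_ice)))
```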