A correlated subquery is a subquery within a SQL query that references one or more columns from the enclosing outer query, creating a dependency that requires the subquery to be evaluated repeatedly for each row processed by the outer query.[1][2] This contrasts with non-correlated subqueries, which execute independently once and return a fixed result set usable by the outer query.[1][3]
Correlated subqueries typically appear in the WHERE, HAVING, or SELECT clauses of the outer query and are used to perform row-by-row comparisons or aggregations based on outer query values.[2] For instance, the database engine substitutes values from each outer query row into the subquery's conditions, such as matching an employee ID, before executing the subquery to filter or compute results specific to that row.[1][3] This mechanism enables complex filtering, like identifying records where a value exceeds the maximum in a related subset of data, but it demands careful correlation via shared column references (e.g., WHERE outer_table.id = subquery_table.id).[2]
Common applications include finding rows that meet conditional criteria relative to other rows, such as selecting sales records where the quantity sold equals the maximum for a given list ID. For example, in Amazon Redshift:
sql
SELECT salesid, listid, SUM(pricepaid)
FROM sales s
WHERE qtysold = (SELECT MAX(numtickets) FROM listing l WHERE s.listid = l.listid)
GROUP BY 1, 2
ORDER BY 1, 2
LIMIT 5;
SELECT salesid, listid, SUM(pricepaid)
FROM sales s
WHERE qtysold = (SELECT MAX(numtickets) FROM listing l WHERE s.listid = l.listid)
GROUP BY 1, 2
ORDER BY 1, 2
LIMIT 5;
This query correlates the outer sales table with the inner listing table via listid, executing the subquery for each outer row to compute the maximum tickets sold per list.[3] Similarly, in SQL Server, a correlated subquery might check for bonuses tied to specific employees:
sql
SELECT DISTINCT c.LastName, c.FirstName, e.BusinessEntityID
FROM Person.Person AS c
JOIN HumanResources.Employee AS e ON e.BusinessEntityID = c.BusinessEntityID
WHERE 5000.00 IN (
SELECT [Bonus](/page/Bonus)
FROM Sales.SalesPerson sp
WHERE e.BusinessEntityID = sp.BusinessEntityID
);
SELECT DISTINCT c.LastName, c.FirstName, e.BusinessEntityID
FROM Person.Person AS c
JOIN HumanResources.Employee AS e ON e.BusinessEntityID = c.BusinessEntityID
WHERE 5000.00 IN (
SELECT [Bonus](/page/Bonus)
FROM Sales.SalesPerson sp
WHERE e.BusinessEntityID = sp.BusinessEntityID
);
Here, the subquery re-evaluates for each employee to verify bonus eligibility.[1] In Snowflake, such subqueries often involve aggregates like MAX to compare row-specific values against group summaries from related tables.[2]
While powerful for expressing relational dependencies, correlated subqueries can impact performance, as the repeated executions—potentially scaling with the outer query's row count—may lead to higher computational costs compared to equivalent JOIN operations.[1][3] Database systems like SQL Server and Redshift optimize them through techniques such as subquery decorrelation, rewriting them into joins where possible, but limitations persist for complex correlations (e.g., multi-level or those in certain clauses).[3] In Snowflake, they must return a single value (as scalar subqueries) to avoid errors, and large datasets may benefit from alternatives like window functions for better efficiency.[2] Overall, they form a core part of SQL's expressive power for handling intra-table relationships without explicit self-joins.[1][2]
Fundamentals
Definition and characteristics
A correlated subquery is a type of nested query in SQL that references one or more columns from the enclosing outer query, establishing a direct dependency between the inner and outer query contexts.[4] This reference creates a situation where the subquery cannot be evaluated independently and must be re-executed for each qualifying row in the outer query's result set.[1] Unlike independent subqueries, this correlation allows the inner query to dynamically incorporate values from the outer query during processing.[5]
Key characteristics of correlated subqueries include their row-dependent execution model, where the subquery is invoked repeatedly—potentially inefficiently—for every row examined by the outer query.[4] They typically employ correlation names (table aliases) to unambiguously refer to columns in the outer query from within the subquery, such as using an alias like outer_alias.column_name to avoid scoping ambiguities.[5] Correlation is most commonly achieved through conditions in the subquery's WHERE clause that link inner table columns to outer table columns, enabling operations like row-wise comparisons or aggregations tied to specific outer values.[1] These subqueries are a standard feature of SQL, conforming to ANSI/ISO specifications for relational database query languages.
The basic syntax for a correlated subquery often appears in clauses like WHERE, integrating the inner query as a condition or filter. For instance:
SELECT column1, column2
FROM outer_table o
WHERE o.value > (
SELECT AVG(i.value)
FROM inner_table i
WHERE i.category = o.category
);
SELECT column1, column2
FROM outer_table o
WHERE o.value > (
SELECT AVG(i.value)
FROM inner_table i
WHERE i.category = o.category
);
In this structure, the subquery references o.category from the outer table via its alias o, correlating the average calculation to each outer row's category.[5] This pattern exemplifies how correlated subqueries facilitate complex, context-aware filtering without requiring explicit joins.[4]
A non-correlated subquery, also known as an independent or uncorrelated subquery, executes once independently of the outer query and returns a single value or set of values without referencing any columns from the enclosing query.[6][7][2] This independence allows the subquery to produce static results that are then used by the outer query, making it suitable for scenarios where a fixed aggregate or list is needed for filtering or comparison.[1] In contrast, a correlated subquery depends on values from the outer query, executing repeatedly for each row processed in the outer query to yield row-specific results.[6][7]
The primary differences between correlated and non-correlated subqueries lie in their execution frequency, dependency structure, and expressive complexity. Non-correlated subqueries run only once before the outer query evaluates, promoting efficiency for global computations, whereas correlated subqueries execute iteratively—once per qualifying row in the outer query—due to their reliance on outer column values.[6][1] This dependency in correlated subqueries enables more dynamic, context-aware logic but introduces greater complexity, as the inner query must synchronize with the outer one, often resulting in slower performance compared to their non-correlated counterparts.[8]
In terms of use cases, non-correlated subqueries are ideal for aggregate filters, such as retrieving records above a global maximum value computed once across the entire dataset.[2] Correlated subqueries, however, excel in row-wise comparisons, like identifying the maximum value within each group tied to an outer row, providing expressiveness for operations that mimic joins without explicit syntax.[7] Both types were introduced in the early SQL standards, with the ANSI SQL-86 specification (ISO/IEC 9075:1986) formalizing subqueries, including correlated ones to support advanced relational operations akin to nested joins in the absence of dedicated JOIN clauses.
Placement in SQL Statements
In the SELECT clause
A correlated subquery in the SELECT clause, also known as a scalar subquery in the projection list, allows for the computation of derived columns that depend on values from the outer query's current row. This placement enables row-specific calculations, such as aggregates or conditional values, to be included directly in the output alongside other columns from the outer table. For instance, it is commonly used to generate per-row metrics like rankings or comparisons within groups, ensuring that each outer row receives a tailored result from the inner query.[5][9]
The syntax typically involves embedding a subquery within the SELECT list, referencing an outer table alias in the inner query's WHERE clause to establish correlation. A general form is:
sql
SELECT outer_col,
(SELECT aggregate_func(inner_col)
FROM inner_table AS inner_alias
WHERE inner_alias.correlation_col = outer_alias.correlation_col) AS derived_col
FROM outer_table AS outer_alias;
SELECT outer_col,
(SELECT aggregate_func(inner_col)
FROM inner_table AS inner_alias
WHERE inner_alias.correlation_col = outer_alias.correlation_col) AS derived_col
FROM outer_table AS outer_alias;
This structure ensures the subquery executes once per outer row, using the correlated value to filter or aggregate data from the inner table. In Oracle SQL, for example, this can compute department averages per employee.[5] Similarly, PostgreSQL supports this for scalar expressions in the SELECT list.[9]
One practical application is determining an employee's salary rank within their department. Consider a query on an employees table:
sql
SELECT e1.employee_id, e1.name, e1.salary,
(SELECT COUNT(*) + 1
FROM employees e2
WHERE e2.department_id = e1.department_id
AND e2.salary >= e1.salary) AS salary_rank
FROM employees e1
ORDER BY e1.department_id, e1.salary DESC;
SELECT e1.employee_id, e1.name, e1.salary,
(SELECT COUNT(*) + 1
FROM employees e2
WHERE e2.department_id = e1.department_id
AND e2.salary >= e1.salary) AS salary_rank
FROM employees e1
ORDER BY e1.department_id, e1.salary DESC;
Here, the subquery counts employees in the same department with equal or higher salaries, assigning a rank to each row based on the outer query's context. This approach provides conceptual insight into intra-group positioning without requiring window functions.[5][9]
Another example involves comparing individual product prices to their category averages, useful for sales analysis. Using a products table:
sql
SELECT p1.product_name, p1.price,
(SELECT AVG(p2.price)
FROM products p2
WHERE p2.category_id = p1.category_id) AS category_avg
FROM products p1
ORDER BY p1.category_id, p1.price;
SELECT p1.product_name, p1.price,
(SELECT AVG(p2.price)
FROM products p2
WHERE p2.category_id = p1.category_id) AS category_avg
FROM products p1
ORDER BY p1.category_id, p1.price;
The correlated subquery calculates the average price for each product's category on a row-by-row basis, highlighting deviations in the output. Such uses emphasize the subquery's role in enriching projections with dynamic, context-aware computations.[5][9]
However, correlated subqueries in the SELECT clause must function as scalar subqueries, returning exactly one value (a single row and column) per outer row invocation; otherwise, the database raises an error, such as "subquery returns more than one row" in SQL Server or Oracle. This restriction limits their use to non-set-returning operations, often necessitating aggregates like AVG, MAX, or COUNT in the inner SELECT to ensure singularity. Additionally, over-reliance can impact performance due to repeated execution, though this is addressed in broader execution models.[1][9]
In the WHERE clause
Correlated subqueries in the WHERE clause serve as predicates that filter rows from the outer query by evaluating conditions dependent on values from those rows, effectively performing row-wise comparisons against the inner query's results.[4] This mechanism allows the database engine to execute the subquery repeatedly for each candidate row in the outer table, using correlated references to link the two queries.[1]
The basic syntax involves placing the subquery within a comparison operator in the WHERE clause, such as:
sql
SELECT * FROM outer_table
WHERE outer_col > (SELECT AVG(inner_col) FROM inner_table
WHERE inner_table.key = outer_table.key);
SELECT * FROM outer_table
WHERE outer_col > (SELECT AVG(inner_col) FROM inner_table
WHERE inner_table.key = outer_table.key);
Here, the subquery computes a value (e.g., an aggregate like AVG) filtered by the correlating condition, and the outer row qualifies only if it satisfies the comparison.[10] This structure enables precise row filtering based on dynamic, row-specific criteria from related data.[4]
A common purpose is to identify rows exceeding group-specific thresholds, such as selecting employees whose salary exceeds their department's average. For instance, the following query retrieves employees above their departmental salary average:
sql
SELECT EMPNO, LASTNAME, [SALARY](/page/Salary), WORKDEPT
FROM EMPLOYEE AS outer_emp
WHERE [SALARY](/page/Salary) > (SELECT AVG([SALARY](/page/Salary))
FROM EMPLOYEE AS inner_emp
WHERE inner_emp.WORKDEPT = outer_emp.WORKDEPT);
SELECT EMPNO, LASTNAME, [SALARY](/page/Salary), WORKDEPT
FROM EMPLOYEE AS outer_emp
WHERE [SALARY](/page/Salary) > (SELECT AVG([SALARY](/page/Salary))
FROM EMPLOYEE AS inner_emp
WHERE inner_emp.WORKDEPT = outer_emp.WORKDEPT);
This correlates on the department (WORKDEPT), computing the average anew for each outer row's department.[10]
Another example involves qualifying aggregated outer data against regional benchmarks, such as finding departments with total sales above their region's average sales. The query might look like:
sql
SELECT REGION_ID, DEPT_ID, [SUM](/page/Sum)(SALES) AS TOTAL_SALES
FROM SALES outer_sales
GROUP BY REGION_ID, DEPT_ID
HAVING [SUM](/page/Sum)(SALES) > (SELECT AVG(AVG_SALES)
FROM (SELECT REGION_ID, AVG(SALES) AS AVG_SALES
FROM SALES
GROUP BY REGION_ID) AS regional_avg
WHERE regional_avg.REGION_ID = outer_sales.REGION_ID);
SELECT REGION_ID, DEPT_ID, [SUM](/page/Sum)(SALES) AS TOTAL_SALES
FROM SALES outer_sales
GROUP BY REGION_ID, DEPT_ID
HAVING [SUM](/page/Sum)(SALES) > (SELECT AVG(AVG_SALES)
FROM (SELECT REGION_ID, AVG(SALES) AS AVG_SALES
FROM SALES
GROUP BY REGION_ID) AS regional_avg
WHERE regional_avg.REGION_ID = outer_sales.REGION_ID);
(Note: This assumes a grouped outer query with correlation on REGION_ID; the subquery derives region averages for comparison.) Such filtering highlights performant subgroups relative to their peers.[1][4]
Variations include the EXISTS form for semi-join semantics, which checks for the existence of matching inner rows without returning values, as in:
sql
SELECT * FROM outer_table
WHERE EXISTS (SELECT 1 FROM inner_table
WHERE inner_table.key = outer_table.key);
SELECT * FROM outer_table
WHERE EXISTS (SELECT 1 FROM inner_table
WHERE inner_table.key = outer_table.key);
This efficiently verifies row presence via correlation, often optimizing better than full joins for existence tests.[4] In contrast, IN, ANY, or ALL subqueries support set-based value comparisons, such as:
sql
SELECT * FROM outer_table
WHERE outer_col IN (SELECT inner_col FROM inner_table
WHERE inner_table.key = outer_table.key);
SELECT * FROM outer_table
WHERE outer_col IN (SELECT inner_col FROM inner_table
WHERE inner_table.key = outer_table.key);
or using ANY for comparisons like outer_col > ANY (subquery), which qualifies if the condition holds for at least one inner row, versus ALL requiring it for every inner row. These forms adapt the filtering to membership or quantified logic while maintaining correlation.[1][4]
In the FROM clause
A correlated subquery in the FROM clause functions as a derived table or inline view that references columns from the outer query, enabling row-by-row computation and integration into joins. This placement treats the subquery as a temporary table source, where the inner query depends on values from the preceding tables in the FROM list, often requiring special keywords like LATERAL (in PostgreSQL and Oracle) or APPLY (in SQL Server) to establish the correlation.[11][12][13]
The general syntax involves specifying the outer table first, followed by the correlated subquery as a derived table with an alias:
sql
SELECT ...
FROM outer_table,
LATERAL (SELECT inner_col, aggregate_func(...)
FROM inner_table
WHERE inner_table.ref_col = outer_table.ref_col) AS derived_table;
SELECT ...
FROM outer_table,
LATERAL (SELECT inner_col, aggregate_func(...)
FROM inner_table
WHERE inner_table.ref_col = outer_table.ref_col) AS derived_table;
This structure creates a correlated temporary table that generates results specific to each row of the outer table, facilitating operations like aggregating related data without multiple self-joins.[11][13]
The primary purpose is to produce row-correlated datasets for joining, such as computing per-customer order summaries or department-level statistics on the fly during the main query execution. For instance, this approach allows efficient intra-group analysis by deriving peer groups dynamically. An example query might join an employees table to a correlated subquery of departmental peers:
sql
SELECT e.name, e.[salary](/page/Salary), peer.avg_[salary](/page/Salary)
FROM employees e,
LATERAL (SELECT AVG([salary](/page/Salary)) AS avg_[salary](/page/Salary)
FROM employees p
WHERE p.department_id = e.department_id) AS peer;
SELECT e.name, e.[salary](/page/Salary), peer.avg_[salary](/page/Salary)
FROM employees e,
LATERAL (SELECT AVG([salary](/page/Salary)) AS avg_[salary](/page/Salary)
FROM employees p
WHERE p.department_id = e.department_id) AS peer;
Here, for each employee row, the subquery calculates the average salary of peers in the same department, enabling comparisons like salary rankings within groups.[11]
Support for correlated subqueries in the FROM clause is available in most relational database management systems, including PostgreSQL via LATERAL, Oracle via LATERAL inline views (introduced in Oracle Database 12c and enhanced in 19c), and SQL Server via CROSS APPLY or OUTER APPLY operators. These implementations typically require aliasing the outer table for clear correlation references and ensure the subquery appears after the referenced outer table in the FROM list.[11][13][12]
Execution Model
Computation process
The computation process for a correlated subquery involves the outer query iterating through its result set row by row, with the correlating values from each outer row being passed to the inner subquery for evaluation.[5] For each outer row, these values are substituted into the inner query's predicates, causing the inner query to execute independently and produce results specific to that outer row's context, which are then integrated back into the outer query's processing.[14]
The evaluation follows nested loop semantics, where the inner subquery is re-executed for every row processed by the outer query, ensuring that correlation dependencies are maintained even if the database management system (DBMS) applies internal rewrites such as unnesting.[5] Unlike non-correlated subqueries, which execute once independently, this row-by-row repetition aligns the inner results directly with the outer query's current row.[14]
This process can be represented in pseudocode as follows:
FOR each outer_row IN outer_query_result:
BIND outer_row.correlating_columns TO inner_query_parameters
inner_result = EXECUTE(inner_query)
PROCESS outer_row WITH inner_result
ENDFOR
FOR each outer_row IN outer_query_result:
BIND outer_row.correlating_columns TO inner_query_parameters
inner_result = EXECUTE(inner_query)
PROCESS outer_row WITH inner_result
ENDFOR
The pseudocode illustrates the iterative binding and execution, preserving the logical dependency on outer values.[14]
In edge cases, if a correlating value from the outer row is NULL, the inner query's conditions involving equality or similar operators evaluate to UNKNOWN under SQL's three-valued logic, typically resulting in no matching rows from the inner query.[5] Additionally, when the correlated subquery is used in a scalar context (e.g., as a single-value expression), it must return exactly one row; returning multiple rows triggers an error, such as ORA-01427 in Oracle Database, while returning zero rows yields NULL.[15]
Correlated subqueries incur significant performance overhead primarily because they are executed repeatedly for each qualifying row in the outer query, resulting in N executions where N represents the number of outer rows.[1] This repeated evaluation can lead to O(N²) time complexity in the worst case, as the inner query must scan or process relevant data anew for every outer row, exacerbating I/O and CPU costs through redundant table accesses or computations.[16] For instance, on a TPC-H scale factor 1 dataset (approximately 1 GB), a complex correlated subquery may take over 25 minutes to execute due to these nested dependent joins generating large intermediate results.[16]
Several factors influence the runtime efficiency of correlated subqueries. The availability of indexes on the correlation columns in the inner query is crucial, as it enables faster lookups during each execution; without them, full table scans amplify the overhead.[17] Larger data volumes intensify the problem, as the quadratic scaling becomes prohibitive for tables with millions of rows, leading to memory exhaustion or excessive disk I/O in unoptimized scenarios.[16] Additionally, decisions by the database management system's query planner play a key role; modern optimizers may rewrite correlated subqueries into equivalent joins to avoid repeated executions, though this depends on the query structure and available statistics.[1][18]
Evolving SQL standards have introduced mechanisms to address these pitfalls. The SQL:1999 standard (ISO/IEC 9075-2:1999) introduced LATERAL derived tables, which allow subqueries in the FROM clause to reference outer query columns explicitly, enabling optimizers to treat them more efficiently than traditional correlated subqueries in certain contexts, such as avoiding unnecessary repetitions through better join planning.[19] This feature mitigates some correlation-related inefficiencies by facilitating decorrelation techniques in compliant systems.[19]
Use Cases and Alternatives
Common applications
Correlated subqueries find practical use in scenarios requiring row-by-row computations that depend on outer query values, such as calculating running totals or cumulative sums, particularly in databases lacking native window function support. For instance, they enable the computation of cumulative sales for each product by aggregating prior transactions up to the current row, providing a way to emulate progressive totals without advanced analytic features.[20]
In hierarchical data structures, correlated subqueries facilitate traversal of direct relationships like employee-manager links in organizational tables, where each row's subquery references the outer row to identify direct reports or superiors in self-referential datasets.[5]
They also support conditional aggregations, such as identifying the top-N records within groups, like the highest-paid employees per department, by using the outer row's group identifier to filter and rank inner aggregates dynamically.[21]
These subqueries offer expressive power for handling intricate dependencies, especially in self-referential tables where joins would result in combinatorial explosions or require multiple self-joins, making them suitable for queries involving recursive-like logic without dedicated hierarchy operators.[5]
Historically, correlated subqueries were prevalent in legacy systems for tasks now handled by window functions, which were introduced in the SQL:2003 standard, serving as a foundational mechanism for analytic computations before broader adoption of advanced SQL features across database vendors.[22]
Correlated subqueries can often be replaced by more efficient SQL constructs such as joins combined with GROUP BY, window functions, and LATERAL joins, which improve performance and readability in many scenarios.[1][23][24]
Joins with GROUP BY are a common alternative for aggregations that would otherwise require correlated subqueries, such as computing row-wise maximums or sums across related tables. For instance, to find the highest salary per department, a correlated subquery might execute once per row, but a self-join or inner join followed by GROUP BY can process the data in a single pass, reducing computational overhead.[25][26] This approach leverages the optimizer's ability to use indexes efficiently, often resulting in faster execution for large datasets compared to repeated subquery evaluations.[27]
Window functions provide another powerful substitute, particularly for analytic queries involving rankings, running totals, or partitions, eliminating the need for subqueries altogether. For example, the ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) function can assign ranks within groups without correlating back to the outer query, allowing the database to compute results over the entire dataset in one scan.[23][28] Window functions typically outperform correlated subqueries by avoiding per-row executions and enabling vectorized processing in modern query engines.[23]
LATERAL joins, introduced in the SQL:1999 standard, extend this by allowing table expressions in the FROM clause to reference preceding tables, generalizing correlated subqueries while supporting multiple output rows and columns. Unlike traditional correlated subqueries limited to scalar results, a LATERAL join can unnest arrays or apply functions row-wise more flexibly, as in SELECT * FROM employees e CROSS JOIN LATERAL (SELECT * FROM salaries s WHERE s.emp_id = e.id LIMIT 1).[29] This construct maintains row dependencies but benefits from join optimization techniques, making it suitable for complex derivations.[24]
In comparison, joins reduce the need for multiple query executions to a single pass over the data, while window functions bypass subquery structures entirely for partition-based analytics, and LATERAL joins offer a more versatile correlated-like behavior without the performance pitfalls of traditional subqueries.[25][23][29]
These alternatives should be preferred for scalability in production queries involving large volumes of data, as they align better with query optimizer capabilities; however, correlated subqueries may still be retained for simple, ad-hoc cases where row-specific dependencies are straightforward and performance is not critical.[27][25]
Support for these alternatives varies by database management system: joins with GROUP BY are universally available across all major SQL DBMS since the early standards, window functions have been part of the SQL:2003 specification with broad adoption (e.g., PostgreSQL since 8.4, SQL Server 2005, Oracle 8i, and MySQL 8.0+), and LATERAL joins, from SQL:1999, are supported in PostgreSQL (since 9.3), Oracle (12c+), MySQL (8.0.14+), and Snowflake.[30][31][32]