Chapter 4

Chapter 4 — Subqueries

A subquery is a SELECT statement nested inside another SQL statement. It can appear almost anywhere an expression or table name is expected — in a SELECT list, a WHERE clause, a FROM clause, or even the target of an INSERT or UPDATE. The outer query sees the subquery's result as if it were a value, a row, or a temporary table.

Subqueries solve problems that joins can't easily express: "find customers who have never ordered," "find books priced above the average," "delete rows that match a condition on a different table." Understanding when to reach for a subquery — and when a join or CTE is better — is one of the most valuable judgement calls in SQL.

Chapter context: Chapter 3 covered joins; Chapter 5 covers CTEs. This chapter sits between them — subqueries pre-date both and are still the right tool in several specific situations. By the end you'll know exactly when that is.

1. Three Shapes of Subquery

Scalar subquery

Returns exactly one value (one row, one column). Can be used anywhere a single value is expected: SELECT list, WHERE, HAVING, ORDER BY.

Row subquery

Returns one row with multiple columns. Compared to a tuple of values using = (val1, val2) or row constructors. Rare in practice.

Table subquery

Returns multiple rows and columns. Used in FROM (derived table), with IN / NOT IN, EXISTS / NOT EXISTS, or ANY / ALL.

2. Scalar Subqueries

In the SELECT list — add a comparison column

-- Show each book's price alongside the overall average price SELECT title, price, (SELECT ROUND(AVG(price), 2) FROM books) AS avg_price, price - (SELECT AVG(price) FROM books) AS diff_from_avg FROM books ORDER BY diff_from_avg DESC;

title	price	avg_price	diff_from_avg
Dune	19.99	12.84	7.15
Foundation	16.50	12.84	3.66
Neuromancer	11.99	12.84	-0.85
1984	9.99	12.84	-2.85
The Hobbit	8.49	12.84	-4.35

Repeated scalar subqueries are evaluated once by the optimiser in most cases when they are not correlated. The two (SELECT AVG(price) FROM books) calls above are typically computed once and reused — but to be safe and explicit, a CTE (Chapter 5) or user variable is cleaner for complex queries.

In WHERE — filter against a computed value

-- Books priced above the catalogue average SELECT title, price FROM books WHERE price > (SELECT AVG(price) FROM books) ORDER BY price DESC;

In HAVING — filter groups against a computed value

-- Genres with more books than the average number of books per genre SELECT g.name AS genre, COUNT(bg.book_id) AS book_count FROM genres g JOIN book_genres bg ON bg.genre_id = g.genre_id GROUP BY g.genre_id, g.name HAVING COUNT(bg.book_id) > ( SELECT AVG(cnt) FROM ( SELECT COUNT(book_id) AS cnt FROM book_genres GROUP BY genre_id ) counts ) ORDER BY book_count DESC;

Scalar subqueries must return exactly one row and one column. If the subquery returns more than one row, MySQL raises error 1242: Subquery returns more than 1 row. If it returns zero rows, the result is NULL — which can silently break comparisons. Always be sure the subquery is truly scalar before using it this way.

3. IN and NOT IN

IN (subquery) tests whether a value appears in the set returned by the subquery. It's the most natural way to express "rows in table A whose key appears in table B."

IN — customers who have placed at least one order

SELECT customer_id, first_name, last_name, email FROM customers WHERE customer_id IN ( SELECT DISTINCT customer_id FROM orders ) ORDER BY last_name;

NOT IN — customers who have never ordered

SELECT customer_id, first_name, last_name, email FROM customers WHERE customer_id NOT IN ( SELECT DISTINCT customer_id FROM orders ) ORDER BY last_name;

NOT IN is broken when the subquery can return NULL. If orders.customer_id is nullable and even one row has NULL, NOT IN returns zero rows — the entire result set disappears silently. This is the most common subquery bug in production SQL.

The fix: add WHERE customer_id IS NOT NULL inside the subquery, or use NOT EXISTS instead (see below) — it handles NULLs correctly.

-- Safe version: guard against NULLs in the subquery SELECT customer_id, first_name, last_name FROM customers WHERE customer_id NOT IN ( SELECT customer_id FROM orders WHERE customer_id IS NOT NULL -- ← always add this );

4. EXISTS and NOT EXISTS

EXISTS (subquery) returns TRUE if the subquery produces at least one row — it doesn't matter what the columns are. MySQL stops as soon as it finds the first matching row, making EXISTS very efficient for existence checks.

EXISTS is always correlated — the subquery references columns from the outer query. This is the key difference from IN: IN computes its full set first, then filters; EXISTS short-circuits per outer row.

EXISTS — customers with at least one order

SELECT c.customer_id, c.first_name, c.last_name FROM customers c WHERE EXISTS ( SELECT 1 -- SELECT 1 is convention — the column doesn't matter FROM orders o WHERE o.customer_id = c.customer_id -- ← correlated reference ) ORDER BY c.last_name;

NOT EXISTS — customers who have never ordered (NULL-safe)

SELECT c.customer_id, c.first_name, c.last_name, c.email FROM customers c WHERE NOT EXISTS ( SELECT 1 FROM orders o WHERE o.customer_id = c.customer_id ) ORDER BY c.last_name;

NOT EXISTS vs LEFT JOIN … IS NULL: Both express "rows with no match." They produce identical results. In modern MySQL the optimiser often rewrites them to the same plan. Prefer whichever reads more naturally — NOT EXISTS is explicit about intent; LEFT JOIN IS NULL is familiar to join-oriented thinkers.

EXISTS with a condition — customers who spent over £100 in a single order

SELECT c.customer_id, c.first_name, c.last_name FROM customers c WHERE EXISTS ( SELECT 1 FROM orders o JOIN order_items oi ON oi.order_id = o.order_id WHERE o.customer_id = c.customer_id GROUP BY o.order_id HAVING SUM(oi.quantity * oi.unit_price) > 100 )

5. ANY and ALL

ANY and ALL compare a value against every value returned by a subquery using a comparison operator.

val > ANY (subquery) — true if val is greater than at least one value in the set (equivalent to val > MIN(subquery))
val > ALL (subquery) — true if val is greater than every value in the set (equivalent to val > MAX(subquery))
= ANY is equivalent to IN
<> ALL is equivalent to NOT IN (with the same NULL caveats)

ANY — books more expensive than at least one Penguin title

SELECT b.title, b.price FROM books b JOIN publishers p ON p.publisher_id = b.publisher_id WHERE p.name <> 'Penguin' AND b.price > ANY ( SELECT price FROM books b2 JOIN publishers p2 ON p2.publisher_id = b2.publisher_id WHERE p2.name = 'Penguin' ) ORDER BY b.price DESC;

ALL — books more expensive than every Penguin title

SELECT b.title, b.price FROM books b JOIN publishers p ON p.publisher_id = b.publisher_id WHERE p.name <> 'Penguin' AND b.price > ALL ( SELECT price FROM books b2 JOIN publishers p2 ON p2.publisher_id = b2.publisher_id WHERE p2.name = 'Penguin' ) ORDER BY b.price DESC;

In practice, most developers rewrite ANY/ALL with their aggregate equivalents (> ANY → > (SELECT MIN(…)), > ALL → > (SELECT MAX(…))) because the aggregate versions are more readable and unambiguously express intent. Know ANY/ALL for reading code; write the aggregate form.

6. Correlated vs Non-Correlated Subqueries

A non-correlated subquery is self-contained — it doesn't reference any column from the outer query. MySQL can evaluate it once and cache the result.

A correlated subquery references a column from the outer query. MySQL must re-evaluate it once for each row of the outer query. This makes them potentially slow on large tables.

Non-correlated — evaluated once

-- The inner SELECT has no reference to the outer query SELECT title, price FROM books WHERE price > (SELECT AVG(price) FROM books); -- ↑ no reference to outer "books" alias — runs once

Correlated — runs once per outer row

-- For each book, count how many reviews it has received SELECT b.title, b.price, (SELECT COUNT(*) FROM reviews r WHERE r.book_id = b.book_id -- ← b.book_id ties this to the outer row ) AS review_count FROM books b ORDER BY review_count DESC;

Correlated subquery in SELECT list → rewrite as LEFT JOIN. The correlated subquery above runs once per book row. The equivalent LEFT JOIN runs one pass over both tables — much faster at scale:

-- Equivalent — and faster for large tables SELECT b.title, b.price, COUNT(r.review_id) AS review_count FROM books b LEFT JOIN reviews r ON r.book_id = b.book_id GROUP BY b.book_id, b.title, b.price ORDER BY review_count DESC;

When correlated subqueries are the right tool

They shine when the logic genuinely requires row-by-row comparison and a join would produce fan-out. A classic case: "find the most recent order for each customer" without using a window function:

-- Each customer's most recent order — correlated approach SELECT c.first_name, c.last_name, o.order_id, o.order_date FROM customers c JOIN orders o ON o.customer_id = c.customer_id WHERE o.order_date = ( SELECT MAX(order_date) FROM orders o2 WHERE o2.customer_id = c.customer_id -- ← correlated ) ORDER BY c.last_name;

7. Derived Tables — Subqueries in FROM

A subquery in the FROM clause acts as a temporary, unnamed table — often called a derived table. It must be given an alias. Every column you want to use from it must have a name (alias expressions if needed).

SELECT outer_col FROM ( SELECT ... -- inner query FROM ... ) AS alias_required -- alias is mandatory in MySQL WHERE ...

Derived table — filter on an aggregate without a CTE

-- Authors with an average book price above £15 SELECT sub.author, sub.book_count, sub.avg_price FROM ( SELECT CONCAT(a.first_name, ' ', a.last_name) AS author, COUNT(b.book_id) AS book_count, ROUND(AVG(b.price), 2) AS avg_price FROM authors a JOIN book_authors ba ON ba.author_id = a.author_id JOIN books b ON b.book_id = ba.book_id GROUP BY a.author_id, a.first_name, a.last_name ) sub WHERE sub.avg_price > 15 ORDER BY sub.avg_price DESC;

Derived table vs CTE: Both do the same job. A CTE (Chapter 5) names the block at the top with WITH and is reusable. A derived table is inline and can only be referenced once. Prefer CTEs for anything beyond a single, simple subquery — they are easier to read and debug.

Lateral derived tables (MySQL 8.0.14+)

A LATERAL derived table can reference columns from tables listed earlier in the FROM clause — removing the need for correlated subqueries in many cases:

-- For each customer, the value of their single largest order SELECT c.first_name, c.last_name, top_order.order_id, top_order.order_total FROM customers c JOIN LATERAL ( SELECT o.order_id, SUM(oi.quantity * oi.unit_price) AS order_total FROM orders o JOIN order_items oi ON oi.order_id = o.order_id WHERE o.customer_id = c.customer_id -- ← reference outer table GROUP BY o.order_id ORDER BY order_total DESC LIMIT 1 ) top_order ON TRUE ORDER BY top_order.order_total DESC;

8. Subqueries in INSERT, UPDATE, and DELETE

INSERT … SELECT — copy rows from one table to another

-- Archive orders older than 2 years into an orders_archive table INSERT INTO orders_archive (order_id, customer_id, order_date, total_amount) SELECT order_id, customer_id, order_date, total_amount FROM orders WHERE order_date < CURDATE() - INTERVAL 2 YEAR;

UPDATE with a subquery in SET

-- Set each book's price to the average for its publisher UPDATE books b SET b.price = ( SELECT AVG(b2.price) FROM books b2 WHERE b2.publisher_id = b.publisher_id ) WHERE b.publisher_id = 4;

UPDATE with a subquery in WHERE

-- Mark orders as 'priority' if the customer is a VIP (spent > £500 lifetime) UPDATE orders SET status = 'priority' WHERE customer_id IN ( SELECT customer_id FROM v_customer_ltv WHERE lifetime_value > 500 ) AND status = 'pending';

DELETE with a subquery

-- Remove reviews left by customers who no longer exist -- (orphan cleanup after a non-cascading delete) DELETE FROM reviews WHERE customer_id NOT IN ( SELECT customer_id FROM customers WHERE customer_id IS NOT NULL );

MySQL restriction: You cannot SELECT from the same table you are modifying in a DELETE or UPDATE in older MySQL versions. Wrap the subquery in a derived table to work around it:

-- Delete duplicate reviews — keep the one with the lowest review_id DELETE FROM reviews WHERE review_id NOT IN ( SELECT min_id FROM ( SELECT MIN(review_id) AS min_id FROM reviews GROUP BY book_id, customer_id ) keeper -- ← derived table breaks the self-reference );

9. Subquery vs JOIN vs CTE — Decision Guide

Situation	Reach for	Why
"Does a related row exist?" with no columns needed from it	EXISTS / NOT EXISTS	Short-circuits. NULL-safe. Expresses intent clearly.
Filter on a single aggregate (avg price, max date)	Scalar subquery in WHERE	Concise. Non-correlated = evaluated once.
Need columns from both tables in the output	JOIN	Subqueries can't easily return data from both sides.
Complex multi-step logic, result reused more than once	CTE (WITH)	Named, readable, reusable. Replaces nested derived tables.
"Not in this list" — the list has no NULLs	NOT IN	Readable. Just ensure the subquery excludes NULLs.
"Not in this list" — NULLs possible	NOT EXISTS or LEFT JOIN IS NULL	NOT IN breaks silently on NULLs.
One result per outer row from a subquery with ORDER BY + LIMIT	LATERAL JOIN (MySQL 8.0.14+)	Only lateral derived tables can correlate and limit at the same time.
Add an aggregate alongside detail rows without collapsing them	Window function (Chapter 6)	Correlated subquery in SELECT is a common anti-pattern; window functions do it in one pass.

Quick Reference

Form	Returns	Used in	Watch out for
Scalar subquery	One value	SELECT list, WHERE, HAVING, ORDER BY, SET	Error if >1 row returned; NULL if 0 rows returned.
IN (subquery)	Set of values	WHERE	NULL in list makes IN always FALSE for that value.
NOT IN (subquery)	Set of values	WHERE	Any NULL in subquery returns zero rows. Always add `WHERE col IS NOT NULL`.
EXISTS (subquery)	Boolean	WHERE	Always correlated. Short-circuits on first match — fast.
NOT EXISTS (subquery)	Boolean	WHERE	NULL-safe alternative to NOT IN. Preferred for "no match" checks.
val > ANY (subquery)	Boolean	WHERE	True if val > at least one value. Equivalent to `val > (SELECT MIN(…))`.
val > ALL (subquery)	Boolean	WHERE	True if val > every value. Equivalent to `val > (SELECT MAX(…))`.
Derived table (FROM)	Virtual table	FROM clause	Must have an alias. Cannot be referenced more than once — use CTE instead.
LATERAL JOIN	Virtual table per outer row	JOIN … ON TRUE	MySQL 8.0.14+. Can reference outer columns. Useful with LIMIT inside the lateral.
Correlated subquery	Any of the above	WHERE, SELECT list	Runs once per outer row — expensive on large tables. Rewrite as JOIN or window function when possible.

Next: Chapter 5 — Common Table Expressions (CTEs): the WITH clause, chaining multiple CTEs, and recursive CTEs for hierarchical data.