Cornell Virtual Workshop > Relational Databases > Using SQL

SQL subqueries

A query returns a table or a single value; both of these entities can themselves be used in queries, meaning that you can embed a query inside a query! This embedded query is called a subquery and they can be very convenient. A common place for a subquery is in a WHERE clause, although it can also be used to generate the table which is the subject of the FROM part of a query.

As an example of a subquery used in the WHERE clause, consider this measurements table:

A Measurements table, with
columns for weatherstation_name, instrument_id, datetime and measurement — Our measurements table has columns for weatherstation_name, instrument_id, datetime—comprising the composite primary key—and a columns for the measurements themselves

If we wanted to see only the above-average measurements for a given instrument_id (say, instrument_id=5) then we could write a query to calculate the average measurement value for instrument_id=5 and then write it into a WHERE clause, but we can also just do it in a single query:

We use different aliases for the table inside the subquery and in the outer query so that the query parser can distinguish them. The subquery is executed first and the single numeric answer is then the comparison value for measurement values in the WHERE clause of the outer query.

This sort of subquery is only run once, and then the returned is used in the outer query; it is called a nested subquery. A more complicated and computationally intensive type of query is called a correlated subquery. Consider our Measurements table; if we want the maximum measurement for a given instrument, we know we can just use MAX and restrict to that instrument_id in the WHERE clause. But what if we want the highest measurement for each instrument and the weatherstation_name and also datetime at which it occurred? You can't do this with GROUP BY because the weatherstation and datetime information aren't part of the GROUP BY criteria, and if you did include them in the GROUP BY list it would then find the highest value for each weatherstation/datetime/instrument_id combination instead!

A correlated subquery will get information from the outer query to run; in principle this may mean the subquery is run many times as the outer query runs, which can be computationally expensive, but if we need a correlated subquery we just have to try it. The correlated subquery needed to get the highest measurement for each instrument and also the weatherstation_name and datetime is below:

The subquery looks for the maximum of the measurements for the instrument_id currently being considered by the outer query; this instrument_id is set with the "WHERE M2.instrument_id = M1.instrument_id" part of the subquery, as the necessary instrument_id is passed in from the outer query and once again we had to give the Measurements table in the outer query and subquery different aliases. If the value is indeed the MAX for that instrument_id, then that measurement value and also the weatherstation_name and datetime are also returned.

Back