/*
This is straight from the query window in Sql2k5, that is why these
lines are commented out. I'm using the pubs database as an
example.
*/
USE Pubs
/*
Every row in Titles has at least one or more matches in Roysched.
*/
SELECT T.*, RS.*
FROM titles T INNER JOIN
roysched RS ON RS.title_id = T.title_id
/*
Notice that the query returns 86 rows and title_id is duplicated. I
don't want duplicates, I want to return only the Royalty Schedule that
has the highest value in lorange for each title. How to do it?
*/
SELECT T.*, RS.*
FROM Titles T
INNER JOIN
(
SELECT TOP 1 *
FROM RoySched
) RS ON RS.title_id = T.title_id
/*
This doesn't work because it only returns 1 row from the child
table. Causing the INNER JOIN to match only one row in the parent
table. How to return the best match for all rows in the parent
table?
*/
--Identical to last query except for the WHERE clause in the sub-query
SELECT T.*, RS.*
FROM Titles T
INNER JOIN
(
SELECT TOP 1 *
FROM RoySched RsNest
WHERE RsNest.title_id = T.title_id
ORDER BY RsNest.LoRange DESC
) RS ON RS.title_id = T.title_id
/*
This doesn't work because we can't reference the parent table inside the sub-query. What can be done? Here is one option:
*/
SELECT T.*, RS.*
FROM Titles T
INNER JOIN RoySched RS ON
RS.title_id = T.title_id AND
RS.lorange =
(
SELECT MAX(LoRange)
FROM RoySched
WHERE title_id = T.title_id
)
/*
Now we have the data we want, but notice the sub-query in the Execution
Plan. It is performing an index seek which is good, but the
sub-query is executing once for every row in the parent, which
slaughters performance when the parent table has 2,000,000 rows in it.
The following pseudo-code is the feature I would like to see in SQL:
*/
SELECT T.*, RS.*
FROM Titles T
INNER JOIN TOP 1 RoySched RS ORDER BY LoRange ON
RS.title_id = T.title_id
/*
First of all, note that the code is not saying "join to the first row
in RoySched", it is saying "join to the first T.title_id match in
RoySched".
The Top-n clause let's me say I only want the best match, and the ORDER
BY clause allows me to guarantee that the best match is returned. Even
if the ORDER BY clause required a poorly optimized sort operation, it
would only have to be done once and under most circumstances would be
far more performant than 2,000,000 highly optimized sub-queries; and
once you built the appropriate index on your child table, the sort
would be very fast anyway.
Now at this point you're probably asking, why not just select the
title_id and max(lorange) from RoySched in to a temp table or table
variable and join from it? While the example shown above is the closest
I could find in Pubs that simulates my issue, it is still much simpler
than the logic required in my production scenario. Because of a
production requirement of wildcards in the criteria, extensive
processing is required before I can build the temp table. This
however is the best performing option so far and is what I'm
using. If the solution shown above were available I could
eliminate all the pre-processing and save significant processing time.
*/