Structured Query Language (SQL) Request

/*
This is straight from the query window in Sql2k5, that is why these
lines are commented out.  I’m using the pubs database as an
example.
*/

USE Pubs

/*

Every row in Titles has at least one or more matches in Roysched.

*/

SELECT     T.*, RS.*

FROM         titles T INNER JOIN

                     
roysched RS ON RS.title_id = T.title_id

/*
Notice that the query returns 86 rows and title_id is duplicated. I
don’t want duplicates, I want to return only the Royalty Schedule that
has the highest value in lorange for each title. How to do it?

*/

SELECT T.*, RS.*

FROM Titles T

    INNER JOIN

        (

        SELECT TOP 1 *

        FROM RoySched

        ) RS ON RS.title_id = T.title_id

/*
This doesn’t work because it only returns 1 row from the child
table.  Causing the INNER JOIN to match only one row in the parent
table.  How to return the best match for all rows in the parent
table?

*/

–Identical to last query except for the WHERE clause in the sub-query

SELECT T.*, RS.*

FROM Titles T

    INNER JOIN

        (

        SELECT TOP 1 *

        FROM RoySched RsNest

        WHERE RsNest.title_id = T.title_id

        ORDER BY RsNest.LoRange DESC

        ) RS ON RS.title_id = T.title_id

/*

This doesn’t work because we can’t reference the parent table inside the sub-query. What can be done? Here is one option:

*/

SELECT T.*, RS.*

FROM Titles T

    INNER JOIN RoySched RS ON

        RS.title_id = T.title_id AND

        RS.lorange =

        (

            SELECT MAX(LoRange)

            FROM RoySched

            WHERE title_id = T.title_id

        )

/*
Now we have the data we want, but notice the sub-query in the Execution
Plan.  It is performing an index seek which is good, but the
sub-query is executing once for every row in the parent, which
slaughters performance when the parent table has 2,000,000 rows in it.
The following pseudo-code is the feature I would like to see in SQL:

*/

SELECT T.*, RS.*

FROM Titles T

    INNER JOIN TOP 1 RoySched RS ORDER BY LoRange ON

        RS.title_id = T.title_id

/*
First of all, note that the code is not saying “join to the first row
in RoySched”, it is saying “join to the first T.title_id match in
RoySched”.

The Top-n clause let’s me say I only want the best match, and the ORDER
BY clause allows me to guarantee that the best match is returned. Even
if the ORDER BY clause required a poorly optimized sort operation, it
would only have to be done once and under most circumstances would be
far more performant than 2,000,000 highly optimized sub-queries; and
once you built the appropriate index on your child table, the sort
would be very fast anyway.

Now at this point you’re probably asking, why not just select the
title_id and max(lorange) from RoySched in to a temp table or table
variable and join from it? While the example shown above is the closest
I could find in Pubs that simulates my issue, it is still much simpler
than the logic required in my production scenario.  Because of a
production requirement of wildcards in the criteria, extensive
processing is required before I can build the temp table.  This
however is the best performing option so far and is what I’m
using.  If the solution shown above were available I could
eliminate all the pre-processing and save significant processing time.

*/

This entry was posted in Sql Development. Bookmark the permalink. Follow any comments here with the RSS feed for this post.

7 Responses to Structured Query Language (SQL) Request

  1. Chaos83 says:

    John Papa, good work

  2. joosh says:

    god this was so helpful,

    it’s amazing it was posted back in 2005 and it’s the only clear explanation of Joining tables with TOP 1 from second table!

  3. Renee says:

    Did you ever find a decent SQL statement that could do this? I’m faced with a similar problem I’m trying to solve and I can’t for the life of me figure out how to do it in sql.

  4. sryope says:

    how can i display the name of the author with best royalty schedule?

  5. johnpapa says:

    Minor tweak to remove the @table variable….

    SELECT T.*,
    RS.*
    FROM Titles T
    INNER JOIN RoySched RS ON T.title_id = RS.title_id
    INNER JOIN (
    SELECT title_id,
    MAX(LoRange) highestlorange
    FROM RoySched
    GROUP BY
    title_id
    )
    hiroy ON RS.title_id = hiroy.title_id AND RS.lorange = hiroy.highestlorange
    ORDER BY
    T.title_id

  6. johnpapa says:

    Ah, just read the last paragraph you posted … sorry about that. I’ll think about this and get back to you to see if I can think of something more creative that avoids the temp table, if that is what you are looking to do.

  7. johnpapa says:

    Give this a shot.

    DECLARE @tblHiRoyalty TABLE
    (
    title_id VARCHAR(6),
    highestlorange INT
    )

    INSERT INTO @tblHiRoyalty
    SELECT title_id,
    MAX(LoRange) highestlorange
    FROM RoySched
    GROUP BY
    title_id

    SELECT T.*,
    RS.*
    FROM Titles T
    INNER JOIN RoySched RS ON T.title_id = RS.title_id
    INNER JOIN @tblHiRoyalty hiroy ON RS.title_id = hiroy.title_id AND RS.lorange = hiroy.highestlorange
    ORDER BY
    T.title_id

Leave a Reply