CodeBetter.Com
CodeBetter.Com
RSS 2.0 via Feedburner
           Do you Twitter? Follow us @CodeBetter

Raymond Lewallen

Framework Design, Agile Coach, President Oklahoma City Developers Group, Microsoft MVP C#, TDD, Continuous Integration, Patterns and Practices, Domain Driven Design, Speaker, VB.Net, C# and Sql Server

Structured Query Language (SQL) Request

/*
This is straight from the query window in Sql2k5, that is why these lines are commented out.  I'm using the pubs database as an example.
*/

USE Pubs

/*
Every row in Titles has at least one or more matches in Roysched.
*/

SELECT     T.*, RS.*
FROM         titles T INNER JOIN
                      roysched RS ON RS.title_id = T.title_id

/*
Notice that the query returns 86 rows and title_id is duplicated. I don't want duplicates, I want to return only the Royalty Schedule that has the highest value in lorange for each title. How to do it?
*/

SELECT T.*, RS.*
FROM Titles T
    INNER JOIN
        (
        SELECT TOP 1 *
        FROM RoySched
        ) RS ON RS.title_id = T.title_id

/*
This doesn't work because it only returns 1 row from the child table.  Causing the INNER JOIN to match only one row in the parent table.  How to return the best match for all rows in the parent table?
*/

--Identical to last query except for the WHERE clause in the sub-query
SELECT T.*, RS.*
FROM Titles T
    INNER JOIN
        (
        SELECT TOP 1 *
        FROM RoySched RsNest
        WHERE RsNest.title_id = T.title_id
        ORDER BY RsNest.LoRange DESC
        ) RS ON RS.title_id = T.title_id

/*
This doesn't work because we can't reference the parent table inside the sub-query. What can be done? Here is one option:
*/

SELECT T.*, RS.*
FROM Titles T
    INNER JOIN RoySched RS ON
        RS.title_id = T.title_id AND
        RS.lorange =
        (
            SELECT MAX(LoRange)
            FROM RoySched
            WHERE title_id = T.title_id
        )

/*
Now we have the data we want, but notice the sub-query in the Execution Plan.  It is performing an index seek which is good, but the sub-query is executing once for every row in the parent, which slaughters performance when the parent table has 2,000,000 rows in it. The following pseudo-code is the feature I would like to see in SQL:
*/

SELECT T.*, RS.*
FROM Titles T
    INNER JOIN TOP 1 RoySched RS ORDER BY LoRange ON
        RS.title_id = T.title_id

/*
First of all, note that the code is not saying "join to the first row in RoySched", it is saying "join to the first T.title_id match in RoySched".

The Top-n clause let's me say I only want the best match, and the ORDER BY clause allows me to guarantee that the best match is returned. Even if the ORDER BY clause required a poorly optimized sort operation, it would only have to be done once and under most circumstances would be far more performant than 2,000,000 highly optimized sub-queries; and once you built the appropriate index on your child table, the sort would be very fast anyway.

Now at this point you're probably asking, why not just select the title_id and max(lorange) from RoySched in to a temp table or table variable and join from it? While the example shown above is the closest I could find in Pubs that simulates my issue, it is still much simpler than the logic required in my production scenario.  Because of a production requirement of wildcards in the criteria, extensive processing is required before I can build the temp table.  This however is the best performing option so far and is what I'm using.  If the solution shown above were available I could eliminate all the pre-processing and save significant processing time.
*/


Comments

John Papa said:

Give this a shot.

DECLARE @tblHiRoyalty TABLE
(
title_id VARCHAR(6),
highestlorange INT
)

INSERT INTO @tblHiRoyalty
SELECT title_id,
MAX(LoRange) highestlorange
FROM RoySched
GROUP BY
title_id


SELECT T.*,
RS.*
FROM Titles T
INNER JOIN RoySched RS ON T.title_id = RS.title_id
INNER JOIN @tblHiRoyalty hiroy ON RS.title_id = hiroy.title_id AND RS.lorange = hiroy.highestlorange
ORDER BY
T.title_id
# April 4, 2005 9:16 AM

John Papa said:

Ah, just read the last paragraph you posted ... sorry about that. I'll think about this and get back to you to see if I can think of something more creative that avoids the temp table, if that is what you are looking to do.
# April 4, 2005 9:23 AM

John Papa said:

Minor tweak to remove the @table variable....

SELECT T.*,
RS.*
FROM Titles T
INNER JOIN RoySched RS ON T.title_id = RS.title_id
INNER JOIN (
SELECT title_id,
MAX(LoRange) highestlorange
FROM RoySched
GROUP BY
title_id
)
hiroy ON RS.title_id = hiroy.title_id AND RS.lorange = hiroy.highestlorange
ORDER BY
T.title_id
# April 4, 2005 9:41 AM

sryope said:

how can i display the name of the author with best royalty schedule?

# September 30, 2007 12:06 AM

Leave a Comment

(required)  
(optional)
(required)  

Enter the numbers above:
Add

About Raymond Lewallen

Working primarily in the public sector during his career, Raymond has designed and built several high profile enterprise level applications for all levels of the government. Raymond now works as a solutions architect for EMC. Raymond is an agile coach, Microsoft MVP C# and also president of the Oklahoma City Developers Group and Oklahoma Agile Developers Group. Raymond spends a lot of his time learning and teaching such things as Test Driven Development, Domain Driven Design, Design Patterns and Extreme Programming practices and principles, to name a few. Raymond is also an advocate of Alt.Net. Raymond is primarily a framework guy, so don't ask him anything about UI :) Check out Devlicio.us!