Optimizing Hierarchical data for Tree Controls

On one hand, nothing can substitute good data design when it comes to performance.  On the other hand, relational databases are terrible at describing hierarchical relationships.  Vendors have been including hierarchical relationship query tools for years (i.e. Oracle and MS Sql Server 2005), but while the semantics are cleaned up, performance is lacking because tree traversals on relational structures are time consuming.  So where does that leave us?

 

In putting together this post, I ended up solving this problem once again and realized just how little accessible information is on the topic. By accessible, I mean not requiring a degree in tree theory (half joking).  Personally, I find I learn best by seeing a complete solution that works, understanding it and then applying new theory on top of it.  That begs the questions – is this post more accessible?  To me it is – so I’ll run with it.

 

Let’s set a rough goal – when asked about an item, I should be able to answer questions about or return its related tree structure in one sql statement.  I should also be able to do so without using vendor-specific extensions.  The best books that deal with this process dive into trees and graphs and analyze the pros and cons of different structures.  However, I have found that (1) most of the times that I am looking at these situations I am looking at reference data that tends to be very static and (2) I am typically populating some type of tree control with this data.  In these situations, I would argue that some of these mechanisms are more than what I need.  This post looks at a problem of delivering a tree structure to a user interface and the process of refining the database tables.  For this post I am using the typical employee/boss relationship to illustrate the problem.  This approach also applies to other areas as well: manufacturing bill of materials, order operations in medical scenarios, etc. with some caveats.

 

Digging In

 

Let’s say we are given a table of the typical employee/boss relationship.  These tables are so boring it’s not even worth coming up with names so I’ll use letters:

  




























Employee Boss
G C
E B
B A
C A
D B
F C
A null
H D

Table 1.1 – Employee

 

If I were to populate a tree control, I need to consider the minimum amount of data to populate my control.  I am going to assume that I will populate this control sequentially and that the use of simple hash-tables will be sufficient for establishing parent/child relationships.  Therefore, I need to have this ordered with the parent-less node first and the child-less nodes last.  To process sequentially, I need to have these ordered by layer – therefore the first layer is fully understood so the successive layer is able to reach it’s parent when populating the tree.  Therefore, I need a result table that looks like this:

 





































Employee Boss Layer (optional)
A Null 0
B A 1
C A 1
D B 2
E B 2
F C 2
G C 2
H D 3

Table 1.2 – ProcessingResult

 

This structure, when processed sequentially, can be added to a tree control quite easily. Therefore what structure do I need to transform Table 1.1 into Table 2.2?  Before I answer that question- lets think about other operations that are useful. 



  • If I know employee, say  ‘E’, I should be able to grab the entire hierarchary (or n-levels up from the position of ‘E’ within the tree) in a single select statement. Selecting down the tree requires a different edge structure than what I am showing here.

  • I should also be able to determine that given two employees – are they part of the same reporting hiearchy? 

  • Is one subordinate to the other either directly or indirectly or are they peers?
 

As with any structure, these are a few of the questions that drive the design.  I then decide that all of these should be query-able using a single select statement and easily optimized via one or more indices.  We’ll come back to these questions with examples of SQL-SELECT statements in a bit.

 

Breaking down the problem

 

To accomplish this task, we’ll use basic tree theory – which is a special kind of directed graph.  Graphs are data structures where nodes (our Employee table) are connected by edges (defined below as EmployeeHierarchy).  Each edge represents a one-way relationship.  To satisify our query requirements, we need to be able to identify the larger structure that each node participates in.  This could be a shared id, but in this case we’ll call it the RootBoss.

 














































RootBoss Employee Boss Layer
A A Null 0
A B A 1
A C A 1
A D B 2
A E B 2
A F C 2
A G C 2
A H D 3

Table 1.3 – EmployeeHiearchy

  

At this point, we have two solid questions (1) how do we satisfy our query requirements given these two tables and (2) how do we populate this thing?  First, lets get the fun part out of the way and look at how we query this table.

 

Performing the stated queries

 

First – given a single employee (@employee) – lets pull out the entire hierarchy:

 
select a.RootBoss, b.Employee, c.employee, c.boss, b.Layer
  from EmployeeHierarchy a
  inner join EmployeeHierarchy b on a.RootBoss = b.RootBoss 
  inner join Employee c on c.employee = b.employee
 
  where a.employee = @employee
  order by b.Layer 



Next, let’s pull up the hierarchy above and equal to the employee:


select a.RootBoss, b.EmployeeName, c.employee, c.boss, b.Layer 
  from EmployeeHierarchy a
  inner join EmployeeHierarchy b on a.RootBoss = b.RootBoss
  inner join Employee c on c.employee = b.employee
  where a.employee = @employee 
    
and a.layer >=  b.layer
 



Given @employee1 and @employee2, are they part of the same reporting structure?


select case when (select count(*) from EmployeeHierarchy a inner join EmployeeHierarchy b on b.rootboss = a.rootboss where a.employee = @employee1   and b.employee = @employee2 ) > 0  then ‘t’ else ‘f’ end as is_true 


Finally, determine if employee1 is higher on the same reporting structure than employee2 (i.e. employee 2 reports to employee 1 either directly or indirectly).

 select case when (select count(*) from EmployeeHierarchy a inner join EmployeeHierarchy b on b.rootboss = a.rootboss where a.employee = @employee1    and b.employee = @employee2     and a.layer < b.layer) > 0  then ‘t’ else ‘f’ end as is_true  

Populating the EmployeeHierarchy table (the edges)

 

Because the number of layers is variable you can use an iteration process – or – this is a great place for using vendor specific hierarchy extensions.  In the spirit of keeping this as vendor-neutral as possible, I’ll ignore the extensions.  As such this is a simple, two-step process. 

 

First, we must find all of the parent-less nodes into the table using the following command.


insert into EmployeeHiearchy  (RootBoss, Employee, Boss, Layer)
      (select employee, employee, boss, 0
            from Employee
            where boss is null ) 


Next, we need to populate each level until no more levels exist.  Once again, the depth is not known so we must loop until the process is complete.  Note that you should provide an exit condition for cyclical references.  Cyclical references amongst three or more nodes are particularly hard to discern with raw sql – using a maximum depth count is one such mechanism.

declare @rowsprocessed int
set @rowsprocessed = 1
declare @layer int
set @layer = 1
while( @rowsprocessed > 0 )
begin
      insert into EmployeeHierarchy
            ( RootBoss, Employee, Boss, Layer )
          select  eh.RootBoss,  e.Employee, e.Boss, @layer
            from EmployeeHiearchy eh
            inner join Employee e on  eh.Employee = e.Boss
            where eh.Layer = @layer – 1       
     
Set @rowsprocessed = @@ROWCOUNT
      Set @currentlevel = @currentlevel + 1
end 

Summary

 

And there you have it.  A reasonable tree structure processing that is very performant for queries.  It’s not a one size fits all by any means – keep in mind that if you have a constantly updating structure then you need to keep an eye on the cost of updating your edge table.  While the approach above could be made transactional, the sheer resource cost of that process is prohibitive – especially in real-time conditions.  But I would contend that this is a good start to understand the problem and evaluate other approaches.


 


Keep in mind that the “edge” table is really a many-to-many resolve table, and as such it can be made to accomodate tree structures that vary by external conditions.  This could be the case with certain types of bill of materials problems or medical order components.  One good example is a veterinary office where a dental prophy has differing structures based on species – the edge table can typically handle these scenarios through additional fields and multi-column key considerations.


 


When looking for performant transcational base hierarchy, there are better solutions that use other areas of graph theory for a better fit.  Whether you need this extra performance or not, I highly recommend checking out Joe Celko’s Sql for Smarties books – he has one book dedicated to hierarchical structure processing.  Celko has been on the ANSI SQL committee just shy of forever and has been writing articles on SQL programming for years. His approach is fundamentally different from sequential-oriented programmers, focusing first on set-based theory and translating that approach to the SQL dialect. 

   

 

This entry was posted in Uncategorized. Bookmark the permalink. Follow any comments here with the RSS feed for this post.

4 Responses to Optimizing Hierarchical data for Tree Controls

  1. robsid says:

    For a really interesting and flexible approach to building a tree hierarchy, Check out Tropashko’s nested intervals method.
    http://arxiv.org/vc/cs/papers/0402/0402051v1.pdf.
    also a number of articles by Tropashko in http://www.dbazine.com showing the evolution of this method

    And check out his book ‘SQL Design Patterns’ if you are really interested.

    @ Mike – CTE perfomance OK on smaller hierarchies, not so great on larger ones, especially in comparison to the nested sets method.

  2. Mike says:

    Have you tried the vendor specific stuff compared to this in regards to performance? I was going to rely on SQL 2005′s CTE WITH statement for an upcoming hierarchical query task, so I am interested in any performance statistics you might have.

  3. shebert says:

    Thanks Sergio – you are right. The configuration of the edge table only allows for upward navigation of the tree.

  4. sergiopereira says:

    “Next, let’s pull up the hierarchy below the employee”

    I think your query there may be wrong. If @employee is “C” then you will get not only the people under C (F and G) but also the people under B with level < 2 (D, E, and H).
    Unless I misunderstood the intent of the query.

Leave a Reply