CodeBetter.Com
CodeBetter.Com
RSS 2.0 via Feedburner
           Do you Twitter? Follow us @CodeBetter

Raymond Lewallen

Framework Design, Agile Coach, President Oklahoma City Developers Group, Microsoft MVP C#, TDD, Continuous Integration, Patterns and Practices, Domain Driven Design, Speaker, VB.Net, C# and Sql Server

Database Basics

  • Database Basics Part Four - Table Joins

    In an effort to provide information for those of you who are newcomers to databases, especially with the wonderul new Sql Server Express, I have committed myself to about half a dozen posts or so targeted directly at you to help you understand and build better databases.  Stay tuned to this blog for weekly updates to this series, or subscribe to the rss feed.

    Previous articles in this series:

    • The ACID Model – the cornerstone of databases and database transactions.
    • Normalization – the logical design and storage of data.
    • Isolation – degrees of seperation among concurrent transactions.

    Today we discuss joins.  Joins allow an easy way for us to retrieve information from multiple tables, presenting the data as a single table, based on the logical relationships among the tables.  Typically joins utilize columns that have an existing foreign key defined on one of its columns and an associated column in another table.

     

    INNER JOIN

    The inner join specifies that all matching rows, and only matching rows, are returned in the result set.  This is the default join type, although not the most common.  Let’s take a look at an inner join.

    Let’s say I have the following 2 tables: 

    Fig. 1Here is our cars for sale table, called CARS
    Make Color
    2001 Ford F-150 1
    1998 Chevrolet Corvette 3
    2003 Chrysler 300M 4
    2005 Kia Sedona 5

    Here is the lookup table for the colors
    ColorId Color
    1 Red
    2 Blue
    3 White
    4 Black

    For a quick explanation, if an actual constraint (relationship) between the color column of the cars table and the colorid column of the colors table existed, we wouldn’t have been able to put in the Kia Sedona with a color of 5, because 5 doesn’t exist in the lookup table.  For demonstration purposes, we are going to overlook the fact that we should have a relationship between the two tables.

    To demonstrate the inner join, let’s create a quick sql query that pulls back the rows from the cars table, along with the color description that matches from the colors table.

    Inner Join queryselect cars.make, colors.color
    from cars inner join colors
    on cars.color = colors.colorid

    And our resulting set looks like 

    Fig. 2Results of Inner Join query
    Make Color
    2001 Ford F-150 Red
    1998 Chevrolet Corvette White
    2003 Chrysler 300M Black

    You will notice two things.  First, the Kia Sedona did not come back in the result set.  This is because there was not a matching color in the colors table.  Second, the color blue was not returned in the result set.  This is because there is not a matching car with that color in the cars table.  This is the result set of an inner join.

     

    FULL JOIN

    The full join returns all rows, matching or not, into a result set.  This means that even rows that do not meet the condition of the join are returned, and the output columns that come from the other table are set to null.

    A full join sql query for the same tables from Fig. 1.

    Full Join queryselect cars.make, colors.color
    from cars full join colors
    on cars.color = colors.colorid

    The matching result set is

    Fig. 3Results of Full Join query
    Make Color
    2001 Ford F-150 Red
    1998 Chevrolet Corvette White
    2003 Chrysler 300M Black
    2005 Kia Sedona NULL
    NULL Blue

    All rows from both table were returned in the result set, including the ones that did not match our join statement.

     

    LEFT JOIN

    In order to understand the left join, you have to understand what left means in the query.  In a join condition, the left table is the table listed to the left of the JOIN statement.  The right table is, you guessed it, the table listed on the right.  In both queries above, the left table is the cars table and the right table is the colors table.  The left join is the most common join used.

    A left join specifies that all rows from the left table that do not meet the join condition are returned in the result set, and the output columns from the joining table that do not match the join condition are NULL.  Non-matching rows from the right table are not returned.

    Left Join queryselect cars.make, colors.color
    from cars left join colors
    on cars.color = colors.colorid
    The results of our left join query are
    Fig. 4Results of Left Join query
    Make Color
    2001 Ford F-150 Red
    1998 Chevrolet Corvette White
    2003 Chrysler 300M Black
    2005 Kia Sedona NULL

    From the colors table, the color Blue was not returned in the result set.  The Kia was returned, however, because we specified that all rows from the left table should be returned.

     

    RIGHT JOIN

    The right join works just like the left join, except that the full result set is returned from the right table, and non-matching rows from the left table are not returned.

    Right Join queryselect cars.make, colors.color
    from cars right join colors
    on cars.color = colors.colorid

     The results of our right join query are

    Fig. 5Results of Right Join query
    Make Color
    2001 Ford F-150 Red
    NULL Blue
    1998 Chevrolet Corvette White
    2003 Chrysler 300M Black

    This time, we got the Blue color from the colors table returned, but not the Kia Sedona.

  • Database Basics Part Three - Isolation

    In an effort to provide information for those of you who are newcomers to databases, especially with the wonderul new Sql Server Express, I have committed myself to about half a dozen posts or so targeted directly at you to help you understand and build better databases.  Stay tuned to this blog for weekly updates to this series, or subscribe to the rss feed.

    Previous articles in this series:

    Today we are going to look at isolation levels, specifically isolation levels supported within Sql Server.  By definition, an isolation level determines the degree of isolation, or seperation, of data among concurrent transactions.  See the post on the ACID model for how isolation is a cornerstone of database design and its transactions.

    There are 5 levels of isolation available in Sql Server, 1 of them being new in Sql Server 2005.

     

    Read Uncommitted Isolation

    This is also referred to as “dirty read” isolation.  This is the lowest level of isolation available and does nothing more than ensure that physically corrupt data cannot be read, but allows reads of logically corrupt data, if the data exists in that state.  Not even exclusive locks prevent other transactions from acting on this data.  Let’s take a look at what this means.

    Take the following table:

    Cars table
    Manufactuer Make Color
    GM Buick Black

    Now let’s examine what occurs given the following series of events, which will result in what are called “dirty reads”, which are a type of non-repeatable read; a type of data inconsistency.

    • TransactionA starts a transaction.
    • TransactionA updates the row to change the color to Red.
    • TransactionB starts a transaction.
    • TransactionB reads the row from the database.
    • TransactionA rolls back the transaction, therefore leaving the color intact as Black.
    • TransactionB ends.

    What data does TransactionB have?  TransactionB shows the color of the car as Red, even though, logically and transactionally, this data never really even existed in the database because TransactionA never committed that change to the database.  This is known as a “dirty read”. 

    You can also lose updates this way.  Let’s take the following scenario:

    • TransactionA starts a transaction.
    • TransactionA updates the row to change the color to Red.
    • TransactionB starts a transaction.
    • TransactionB updates the row to change the color to Blue.
    • TransactionA commits its transaction, changing the color to Red.
    • TransactionB commits its transaction, changing the color to Blue.

    What color is the car?  Its blue.  This is also known as the “last in wins” rule.  If multiple transactions change the same data at the same time, the last one to commit is going to win.

    You can see from this simple example what kind of troubles you can get into by using Read Uncommitted isolation levels.

     

    Read Committed Isolation

    This is the default isolation level used in Sql Server.  Sql Server issues shared locks under this isolation level which prevents dirty reads from occurring, like what we saw above in read uncommitted isolation levels.  In the same scenario given above, this time using read committed isolation, TransactionB would have in fact shown the color as Black, because it would be reading data as it existed in a committed state.

    What read committed isolation does not do is prevent other transactions from from changing the same data at the same time the data is already being held by another transaction.  There are 2 primary results for this type of behavior, known as phantom data and again, like we saw in read uncommitted, non-repeatable reads, but this time they are not the special type of non-repeatable reads called “dirty reads” like we saw above.

    • Phantom data occurs under the following circumstances under read committed
      • TransactionA begins a transaction.
      • TransactionA reads a row.
      • TransactionB begins a transaction.
      • TransactionB deletes the row read by TransactionA.
      • TransactionB commits its transaction.
      • TransactionA can no longer repeat its initial read in order to do an update, because the row no longer exists, resulting in phantom data.

    • Non-repeatable reads (not dirty reads) occur under the following circumstances under read committed
      • TransactionA begins a transaction.
      • TransactionA reads a row.
      • TransactionB begins a transaction.
      • TransactionB changes the color of our car to red.  TransactionA read the color of the row to be black.
      • TransactionB commits its transaction.
      • TransactionA reads the row again.
      • TransactionA has inconsistent data because the color now reads red instead of black, all within the scope of the same transaction that TransactionA began.

     

    Repeatable Read Isolation

    Here, Sql Server places locks on the data used in a query within the transaction, and this prevents other transactions from modifying data in that data set locked by the initial transaction.  This prevents non-repeatable reads, including dirty reads, from occurring.  However, another concurrent transaction can add new data to that scope of data of the original transaction, but the original transaction does not include the new row as part of its lock or data set, because it didn’t exist at the time the lock was issued.  The initial transaction will, however, get that inserted row in subsequent reads of the data.  Again, this is phantom data.

    Cars table
    Manufactuer Make Color
    GM Buick Black
    Ford Lincoln Black
    • Phantom data will occur under the following circumstances under repeatable read
      • TransactionA begins a transaction.
      • TransactionA reads all rows that have color = “Black”.
      • TransactionB begins a transaction.
      • TransactionB inserts new row with values “GM”, “Pontiac”, “Black”.
      • TransactionB commits its transaction.
      • TransactionA updates all the data from its query to color = “Red”.  This will also update the row that TransactionB inserted, because TransactionA must read the data again in order to update it.
      • TransactionA commits its transaction.

     

    Serializable Isolation

    This is the most restrictive isolation level available, and not a commonly used level of isolation, because it is also referred to as “the perfect transaction”, and perfect transactions just aren’t always the best choice.  Once a serializable transaction has started working on a particular data set, no other transaction can do anything to it until the initial transaction releases its locks.  This means no changing data, no adding new rows into the same data set, nothing.  Phantom data cannot exist and non-repeatable reads cannot occur.  However, this affects your performance because this isolation level reduces your concurrency levels because of the restrictiveness of the isolation.  Other transaction must wait for the locks to release. 

     

    Snapshot Isolation – New to Sql Server 2005

    Other databases have had a snapshot isolation level for awhile now.  Sql Server just caught up.  Snapshot isolation is an optimistic locking approach to the pessimistic locking schemes of read committed and serializable isolation levels.  There are 2 types of snapshot isolation levels:  Transaction-level snapshots provide the same level of isolation as serializable transactions, and statement-level snapshots provide read committed levels of isolation.

    Snapshot isolation basically works by row versioning.  Row versioning is nothing more than Sql Server keeping up with changes made to the database while a transaction is occuring.  What happens, and this is the beauty of snapshot isolation, a copy of a row is made for the transaction is that is where the read occurs.  This prevents a lock on the actual row in the database.  When the transaction wants to write the data, it checks to see if the actual data matches the copy of the data, and if it does, it commits the write.  If not, then the snapshot transaction fails.  This is the same as serializable isolation, except we have better performance because we didn’t have to place locks on the data that we read.  This helps to support more concurrent transactions.

    I haven’t played too much with snapshot isolation in Sql Server 2005, so please experiement for yourselves and as always, do your own research.  If you want to read what is probably the best source of information on snapshot isolation, read this 60 page whitepaper on it written by Kimberly L. Tripp.  That’s how much info there is to cover on the topic of just one isolation level, much less all 5.  Again, this is a series for newcomers to databases, so I don’t go into that level of depth.  Not to mention the fact that I can never go into Sql Server as deeply and with as much knowledge as the wonderful Kimberly L. Tripp!

  • Database Basics Part Two - Normalization

    In an effort to provide information for those of you who are newcomers to databases, especially with the wonderul new Sql Server Express, I have committed myself to about half a dozen posts or so targeted directly at you to help you understand and build better databases.  Stay tuned to this blog for weekly updates to this series, or subscribe to the rss feed.

    Normalization: organizing data in your database is an efficient way for both you and the database to use and understand.  What this really means is that normalization helps us to elimate redundant data and to store the data we have in a manner that makes sense.  Last week we talked about ACID, and you will recall the “atomicity” of a database.  Atomicity also must be maintained in tables via normalization.  Normalization helps us to create atomic tables by removing redundant data within rows of a table, thus creating atomic tables.

    Guidelines exist for normalizing data (referred to as normal forms) and exists as 5 sets of rules, labled first normal form through fifth normal form.  In most practical applications, you rarely see 4th or 5th normal forms (herein referred to as 1NF-5NF), so we’re going to look at 1NF, 2NF and 3NF, along with denormalized data.

    Denormalized Data

    Let’s take a look at some denormalized data.  I’m going to use the simplest example that comes to mind, and that is a auto repair shop that specializes in domestic automobiles (domestic to the USA, that is).  Our table lists automobile manufactuers and the makes of automobiles they manufactuer that the shop services.

    Fig. 1
    Manufactuer Make 1 Make 2 Make 3
    Ford Mercury Lincoln Ford
    GM Chevrolet Pontiac Saturn, Buick
    Chrysler Dodge      

    Fig. 1 shows us denormalized data.  For GM, we didn’t have enough columns to store all the data we wanted, so we put Buick in with Saturn.

    First Normal Form (1NF)

    We want to take fig. 1 and normalize it a bit.  Before we can do that, we need to know what the rules of 1NF are.  There are 2 rules to 1NF.

    • Remove duplicate columns.
    • Each column by row position must have a unique value.

    As you can see, 1NF is already enforcing the atomic nature of a table.  So first we need to remove duplicate columns from the table, at the same time creating column by row positions that have unique values, which means we also have to seperate Saturn, Buick into individual values.

    Fig. 2
    Manufactuer Make
    Ford Mercury
    Ford Lincoln
    Ford Ford
    GM Chevrolet
    GM Pontiac
    GM Saturn
    GM Buick
    Chrysler Dodge

    Fig. 2 meets the 2 rules of 1NF.  We have normalized our Fig. 1 data into First Normal Form.

    Second Normal Form (2NF)

    For each level of normal form, the next level must adhere to the rules of the previous level, in addition to adding its own rules.  This means in order to create a data structure that meets the requirements of 2NF, we must first meet the requirements of 1NF.  We have already discussed the rules for 1NF, so let’s look at the rules required to meet 2NF.

    • Remove duplicate data in a single column and place the data in seperate tables.
    • Create relationships between the sets of data.

    In Fig. 2, if GM gets bought by another manufactuer, say Honda, then we’d have to replace each row in the table with Honda everywhere we find GM.  That is a bit inefficient.  By meeting the 1st rule of 2NF, we can replace GM with Honda in our manufactuer table in Fig. 3, and this is must more efficient and a logical storage of data.  In order to meet the 2nd rule of 2NF, we must have keys and relationships.  The ManufactuerId in the manufactuer table is a Primary Key.  It is a unique identifier used to reference the description for the manufactuer.  We create a relationship beween it and the Make table using Manufactuer column in the Make table as the foreign key.  Fig. 4 shows the database diagram for this.  Now we are starting to adhere to consistency of our ACID model.  Fig. 3 satisfies all the rules of 1NF and 2NF.  We have normalized our Fig.2 to Second Normal Form.

    Fig. 3Manufactuer Table
    ManufactuerId Manufactuer
    1 Ford
    2 GM
    3 Chrysler

    Make table
    Manufactuer Make
    1 Mercury
    1 Lincoln
    1 Ford
    2 Chevrolet
    2 Pontiac
    2 Saturn
    2 Buick
    3 Dodge

    Fig. 4

     

    Third Normal Form (3NF)

    Again, in order to meet the rules of 3NF, we must first meet the rules of 1NF and 2NF.  In addition to meeting these requirements, 3NF introduces 1 other rule:

    • All columns that are not dependent on the primary key must be removed.

    Sounds simple enough.  First, lets take our Make table from Fig. 3 and add a column to it that shows how much the auto repair shop charges per hour for each different make, and also a column that shows if the shop works on cars or trucks for each particular make.

    Fig. 5Make table
    Manufactuer Make Charge Style
    1 Mercury $60 Cars
    1 Lincoln $60 Cars
    1 Ford $75 Trucks
    2 Chevrolet $55 Both
    2 Pontiac $60 Cars
    2 Saturn $60 Cars
    2 Buick $60 Cars
    3 Dodge $75 Trucks

    Not let’s examine the data and see if they meet the requirements of 3NF.  First, we must consider that Make is the primary key in this table.  It is a unique value that can be used as an identifier.  Ideally, that would be an integer column called MakeId and Make would be a description column.  We’ll look at that in a moment.  Knowing that Make is the primary key, are all columns in the Make table dependent on the primary key?  Manufactuer: yes.  The manufactuer depends on the make of the automobile.  Style?  Yes, the style the shop works on is dependent on the make of automobile.  Charge?  Hrm.  If we look closely, it appears as though the charge is dependent on the style, not the make.  This table does not meet 3NF.  We must remove the charge column and relate it to the Style, not to the make.  We do this by creating another table called charge.

    Fig. 6
    Style Charge
    Cars $60
    Trucks $75
    Both $55

    Now we can remove charge from the Make table in Fig. 5, because they style that is dependent on the make, gives us the charge.  The complete 3NF structure when all is said and done looks like Fig. 7 below.

    Fig. 7

    The Final DataManufactuer Table
    ManufactuerId (PK) Manufactuer
    1 Ford
    2 GM
    3 Chrysler

    Make table
    MakeId (PK) Manufactuer (FK) Make Style (FK)
    1 1 Mercury 1
    2 1 Lincoln 1
    3 1 Ford 2
    4 2 Chevrolet 3
    5 2 Pontiac 1
    6 2 Saturn 1
    7 2 Buick 1
    8 3 Dodge 2

    Style Table
    StyleId (PK) Style Charge
    1 Cars $60
    2 Trucks $75
    3 Both $55

    Next topic in this series: Isolation

  • Database Basics Quick Note - The difference in Varchar and Nvarchar data types

    I saw the following question posted on a forum and left the answer for the poster.  I thought I would share it with you all as I am in the process of the Database Basics posts, and this ties in.

    “What is the difference between nvarchar and varchar data types?”

    The difference in varchar and nvarchar datatypes is simple.  Nvarchar stores UNICODE data.  If you have requirements to store UNICODE or multilingual data, nvarchar is your choice.  Varchar stores ASCII data and should be your data type of choice for normal use.

    UNICODE requires 2 bytes for each character you store.  ASCII only requires 1 byte for each character.  This is important because of the row size limitations of Sql Server is the same as the page size limit, which is 8060 bytes.  This means a single row of a single varchar column can be varchar(8000), but a single row of a single nvarchar column can only be nvarchar (4000).

    Now, you can actually go and create a table with 5 columns of varchar(5000), and Sql Server will let you do that.  But once you go to put data into it, its going to puke down your leg.  This goes back to the size of the data page, which is 8K.  Rows cannot span pages (this does not apply to Sql Server 2005), so until the page size is increased, you have to adhere to that limitation.

    Also, you might bring up the argument of Text and Image fields.  This data is stored seperately and a 24 byte pointer is stored in the original data page.

  • Database Basics Part One - The ACID Model

    In an effort to provide information for those of you who are newcomers to databases, especially with the wonderul new Sql Server Express, I have committed myself to about half a dozen posts or so targeted directly at you to help you understand and build better databases.  Stay tuned to this blog for weekly updates to this series, or subscribe to the rss feed.

    The ACID model has been a cornerstone of database modeling for a very long time. Hopefully, its something that is still taught in CS courses. ACID lays out 4 goals that must be met before your database can even begin to be considered useful and reliable.

    Atomicity – Every transaction that occurs within the scope of the database is a single piece of work: Atomic. The atomic nature of transactions is maintained by ensuring that if any one part of a transaction fails, then the entire transaction fails. This is easy to do because of the singular purpose served by the transaction. This is also referred to as the "all or nothing" approach.

    Consistency - All data in the database must exist in a consistent state once a transaction completes. To help out with this, we use rules, keys, normalization, etc all help to maintain data integrity. All of these things combined helps to ensure our data always exists in a consistent state.

    Isolation – Modifications to your data must be isolated from other concurrent transactions. No concurrent transaction (Transaction A) should see any data being modified by another concurrent transaction (Transaction B) while it is in an intermediate state. Transaction A should see the data as it existed prior to being modified by Transaction B, or after Transaction B has completed and committed its changes.

    Durability – After a transaction has successfully completed and committed it changes, the data is permanently in place in the database, even in the event of hardware or software failures. Database backups and transaction logs help you out with this by facilitating a means of restoring committed transactions.

    Next topic in this series: Normalization.

More Posts