Sunday, March 9, 2014

5 Things a developer should know about databases

 

Database Normalization

Theoretical versus real-world

There is not a strong need for developers to know the theoretical definition of 1st thru 5th normal form, but there needs to be an understanding of some design practices. The keys are:

  1. No Duplicate data in columns in one table
  2. Relationships built as foreign keys
  3. Primary Key to identify a row
  4. Learn Many to Many relationships (this could be a whole blog)

Always look for natural key when using identity for Primary Key

The first of which is a primary key. Over the years, I have finally grasped the understanding of using an Identity column as the primary key but always design a table with a natural key. The following example that shows the difference in a Customer table.

CREATE TABLE [dbo].[Customer](
    [CustomerID] [int] IDENTITY(1,1) NOT NULL,
    [CustomerName] [varchar](40) NOT NULL,
    [AddressLine1] [varchar](60) NULL,
    [City] [varchar](30) NOT NULL,
    [StateAbbreviation] [varchar](2) NOT NULL,
    [PostalCode] [varchar](15) NOT NULL,
 CONSTRAINT [PK_Address_AddressID] PRIMARY KEY CLUSTERED 
( [CustomerID] ASC) 
) ON [PRIMARY] 
 
CREATE UNIQUE NONCLUSTERED INDEX [unc_Customer_CustomerName] 
    ON [dbo].[Customer]
([CustomerName] ASC)
ON [PRIMARY]
GO

The key is that I understand that the Identity column is not the column used to locate a customer, but the the Customer Name is really how we identify a customer. From my experience, I saw the identity column come into software development for 3 reasons:



  1. The int column takes up less space in non-clustered indexes
  2. It helps with lookups in Objects for Object Oriented coding
  3. The joins are faster and helps with related tables when used as part of the related table’s primary and/or foreign key

Lookup versus Parent-Child tables


The Customer table can be can be considered a Lookup table for customers that have Orders. The relationship between an Order Header and Order Detail would be a Parent-Child relationship. The Order Header and Detail still need to have a Natural Key just like the Customer table even though Identity is used strongly in these tables.


image


The example above show OrderID as the primary key of the OrderHeaders table, but there is a natural key to identify rows which is CustomerID + OrderDate. The OrderID column is carried down into the OrderDetail table. The column is combined with ProductID to form the primary key of the table.


 


Data Types



char vs. varchar


Remember that using a char data type will use the space required in the size in the data pages of ever row, where varchar will only take up the space used in the actual data in the column for that row.


Money not decimal


The money data type is a much better option for dollar amounts in a database than trying to specify the size and number of decimal places in a decimal data type.


nvarchar or nchar


The use of nvarchar and nchar are for storage of all Unicode characters while char and varchar are for non-Unicode characters. The nchar/nvarchar data types take up 2 times the space for a column in each row. In 25 years of IT work, I have never seen a need for Unicode characters, though some would say differently.


integers – tiny, small, int and bigint


The integer data type now have various size that are different in space in the rows of a table. If you know that a lookup table only has 10 possible rows, you do not need to use int or bigint for the ID column. You should go ahead and use smallint. See Books Online for more information.


Indexes



Learn Primary Key defaults to Clustered



As we saw in the diagram above, the CustomerID was the primary, clustered index of the Customer table without even specifying clustered. This is by default for SQL Server. You can change this to be a non-clustered primary key index, but you should always have a clustered index on a table in a transaction type database (OLTP). You could have easily specified the Unique Index on CustomerName as Clustered but be warned. The clustered index columns are included in all non-clustered indexes on a table.


Non-clustered indexes


The non-clustered index is usually created for improving the performance of queries that request data from a different column than the primary key. In the example above, a query on CustomerName in a WHERE clause of a query can return results faster because of the Unique Non-Clustered index. Another place to look for non-clustered indexes is on foreign key columns of related tables.



Overuse of Include clause


Creating covering indexes used to require including the additional columns in the actual index for help with eliminating a lookup in a query plan. The addition of the INLCUDE clause in SQL Server now lets you include additional columns at the leaf level of the index so the additional columns do not have to take of space in the tree of the index. Watch out not to overuse this option, because you can end up with more data space used in indexes than the table itself. Also, I have seen where include indexes cause Deadlocking, which is not a good thing.



T-SQL



Do not use SELECT * (specify columns)


Using SELECT * causes a couple of issues. First, it might take a full scan of the table (Clustered Index Scan). This has higher IO and Memory usage than just specifying the columns you need. Second, if someone adds additional columns to the table, it can break views and code that are not equipped to handle the additional columns


Learn difference between LEFT and INNER


Joins in T-SQL go back to the idea of Database Normalization. The INNER JOIN will return all the rows that match in both tables. The LEFT will return all rows from the table specified in the FROM part of the T-SQL and returns data to the columns specified in the JOIN table that have a match. The rows that do not have a match will have Nulls in those specified columns.

New Features of SQL Server



Windows Functions


Every new version of SQL Server comes out with improvements and additions. Yes, Microsoft does actually improve our lives. I am extremely excited to use the ROW_NUMBER() OVER PARTITION to find the latest or earliest rows in a data set. This eliminates the need to use MAX on a Date in a sub-query to join to the main query to find some matching data that is not related to the keys of the tables. There is even more functions like RANK, DENSE_RANK and NTILE.


Common Table Expressions (replaces cursors in some cases)


If you want to learn something really cool, try common table expressions. This can eliminate the use of CURSORS to loop through rows with T-SQL code. It will get you on path to understanding Set Based querying of your database.


There are many of features of SQL Server that a developer should learn, and these are just some of the common items I see new developers lack in their experience that are usually the first for me to help as a co-worker. Happy programming!!!

9 comments:

  1. Thanks Saeed, I hope to follow-up from comments I have received.

    ReplyDelete
  2. This comment has been removed by the author.

    ReplyDelete
  3. Joining is faster with an Index that is an integer? Are you sure? I would think it is either the same or just slightly faster. The big advantage with the Identity is you only have to update the "Natural Key" in one place.

    With "include indexes cause Deadlocking", are you sure it wasn't an index involved with a Deadlock that happened to have an include?

    ReplyDelete
    Replies



    1. Thomas LeBlancMarch 18, 2014 at 6:47 AM

      Maria, my point about "joining is faster with an Index that is an integer?" is about the history of what people believe an identity columns is used as a key. You are correct that sometime it is not significant.

      The index I was referring to is one with an Include clause. I have seen where this is over used and multiple ones become part of a deadlock.

      Thanks for the comments,
      Thomas

      Delete
  4. Interesting question about the statement "3.The joins are faster and helps with related tables when used as part of the related table’s primary and/or foreign key" may not be correct. If one joins on a low cardinality CHAR(2) column is it faster than using an INT or BIGINT? I don't have the time to dig right now, but I'd guess the CHAR(2) compare in machine language is a loop that might be unrolled whereas the INT compare is a single instruction. On a 64-bit implementation the BIGINT might be a single instruction too. Anyone know?

    ReplyDelete
    Replies



    1. Thomas LeBlancMarch 18, 2014 at 6:49 AM

      George,

      You are correct. That is actually a good point. Why create an identity for a column lke State Abbreviation? The Int or Bigint would be larger than the 2 character state column. I am actually for the StateAbbreviation for the key and not an identity column in this case.

      Thanks for the comments,
      Thomas

      Delete
  5. With respect to your example of a natural key, the index appears to be a unique key so how do you add 2 customers with the same name? i.e. John Smith.

    ReplyDelete
    Replies
    1. Chris,

      You would not be able to enter John Smith twice in this table. The creation of this unique constraint was to bring up this in conversation with others. You really have to think when creating tables in a transactional database and make sure you have a natural key to find the uniqueness.

      Thanks for the comments,
      Thomas

      Delete