Wednesday, June 20, 2012

Unique value for Primary Key

I used to defined the primary key in integer database until one day that I'm working on web project. I realized that the using integer as the data type for primary key might have some issues.

Issue #1 - passing the primary key in the URL. For example, the primary key value for customer A is "1001". If the user wants to edit the customer A's profile, you might use the query string to contain the primary key value so that the page can query the database and retrieve the appropriate record. This is very bad practice because the user will be able to change the query string and retrieve the customer profile that they want.

Issue #2 - bottleneck of inserting new record. For example, you are running system that requires handling many transactions per seconds. Whether you are using "identity" or generating the next ID value that stores in last ID table, it creates bottleneck while inserting new records. The bottleneck will become even worst if you have long process to be run in the triggers or stored procedure.

Issue #3 - multiple databases. In certain situation, you might have to run the same system in multiple locations and each location has it's own database. These databases requires to be merge into a huge centralized database at later time. Irregardless of the merging interval, by using integer data type in the primary key, you are facing a problem with a question: what is the appropriate range of values to be assign to each location?

After trial and error with integer primary key and nvarchar (i.e., add a "outlet code" field to make the composite key), we found an easier way to solve the above problem. Instead, we are using  UNIQUEIDENTIFIER data type (or GUID in .Net framework).

When showing the UNIQUEIDENTIFIER value in the URL query string, it's hard for the user to guess the primary key value for other customer. The bottleneck problem also disappear because UNIQUEIDENTIFIER value can be generated at anytime by calling NEWID (MSSQL) function. Finally, merging the records from various database will not be a problem and you don't have to scratch your head to look for the appropriate range of values to be allocated.

Of course, UNIQUEIDENTIFIER data type comes with some costs. The first one is that it requires more storage space and the next cost is the searching takes more time. But, my argument is that the hard disk price is cheap and will be cheaper in the future. The searching time can be reduce with proper indexing or increase the hardware specification. Another cost is the data fragmentation which slows down the query. This happens when we are using NEWID() function to generate a new random value causing the record to be inserted into somewhere in the table. To solve the fragmentation, you will to rebuild or reorganise the index (you might have to google for the related commands).

In our database design, the primary key for master and child tables are UNIQUEIDENTIFIER data type and it is single field primary key. For the child table, the foreign key is not part of the primary key. The advantage is that it allows the DBA to retrieve any records easily because each record has it's own unique value.

Lastly, in MSSQL, the primary key is a clustered index. This means, inserting new records will cause the MSSQL engine to look for the appropriate page position. Some people will say this is fast and some deny it. Some DBA even say that because of the records were not stored in sorted order, it is more beneficial to the parallelism when reading records from hard disk.

All other potential factors must be consider carefully before implementing this idea.

Technical specification:
Database size that has implemented this idea: < 10GB.
MSSQL Server: 2005/2008.
Environment: Win2003, Win2008, Win XP/7 (even in the virtual PC).

No comments:

Post a Comment