Saturday 28 January 2012

1) Introduction to normalisation

A relational database is made up of tables which consist of columns and rows. It may help to visualise the concept if you consider a table as being like a spreadsheet, where data is stored using the columns as fields. Each column in the table is of a certain data type, it could be text, a date, a number, a boolean value (true or false) or what's known as a BLOB (Binary Large OBject). These basic types are further categorised - text can be of fixed or varying lengths, a date can also have a time component, numbers can be integers, floating point, or fixed decimal. Further, a data item may have an unknown value and be NULL. It is important to remember that NULL itself is not a value, it is a state. In other words, a column in a row cannot equal NULL, it either IS NULL (unknown) or IS NOT NULL and contains valid data.

By storing the data across more than one table, you remove redundancy. Tables are linked by the use of keys, a primary key on the one side links to a foreign key on the many side. The first rule of database design is make primary keys meaningless. Ninety-nine times out of a hundred you will want to make a primary key an integer that increments by one for each row added to the table. Most databases even have an auto increment integer type or a sequence generator (that accomplishes the same thing) for just this reason. A sequence generator can be more flexible to use than the auto increment type, more on that later on.

So let's jump into the deep end and think about the design of our horse racing database.

First up, consider what is the purpose of the database. We want to be able to analyse racing history, and we want to be able to look at it from different angles - comparing individual horses in a race, comparing groups of horses (to get par times for example), comparing jockeys, trainers, tracks. We might want to be able to create our own ratings, be they class/weight à la Don Scott, time based, or purely statistical.

At a bare minimum, your database is going to need the following -

The date, start time, race no., track, distance, class and restrictions, prize money, going, and runtime of each race. The tab number, horse name, barrier, allocated weight, jockey, apprentice allowance, finish position, losing margin, and starting price of each horse in each race.

If we consider these as the columns of a single table, it would look like this:



This clearly shows that data redundancy will occur unless this data is split across more than one table.

The next post will explain how to do this. The process has a name too, it is called normalisation.

Cheers

No comments:

Post a Comment