Saturday 28 January 2012

1) Introduction to normalisation

A relational database is made up of tables which consist of columns and rows. It may help to visualise the concept if you consider a table as being like a spreadsheet, where data is stored using the columns as fields. Each column in the table is of a certain data type, it could be text, a date, a number, a boolean value (true or false) or what's known as a BLOB (Binary Large OBject). These basic types are further categorised - text can be of fixed or varying lengths, a date can also have a time component, numbers can be integers, floating point, or fixed decimal. Further, a data item may have an unknown value and be NULL. It is important to remember that NULL itself is not a value, it is a state. In other words, a column in a row cannot equal NULL, it either IS NULL (unknown) or IS NOT NULL and contains valid data.

By storing the data across more than one table, you remove redundancy. Tables are linked by the use of keys, a primary key on the one side links to a foreign key on the many side. The first rule of database design is make primary keys meaningless. Ninety-nine times out of a hundred you will want to make a primary key an integer that increments by one for each row added to the table. Most databases even have an auto increment integer type or a sequence generator (that accomplishes the same thing) for just this reason. A sequence generator can be more flexible to use than the auto increment type, more on that later on.

So let's jump into the deep end and think about the design of our horse racing database.

First up, consider what is the purpose of the database. We want to be able to analyse racing history, and we want to be able to look at it from different angles - comparing individual horses in a race, comparing groups of horses (to get par times for example), comparing jockeys, trainers, tracks. We might want to be able to create our own ratings, be they class/weight à la Don Scott, time based, or purely statistical.

At a bare minimum, your database is going to need the following -

The date, start time, race no., track, distance, class and restrictions, prize money, going, and runtime of each race. The tab number, horse name, barrier, allocated weight, jockey, apprentice allowance, finish position, losing margin, and starting price of each horse in each race.

If we consider these as the columns of a single table, it would look like this:



This clearly shows that data redundancy will occur unless this data is split across more than one table.

The next post will explain how to do this. The process has a name too, it is called normalisation.

Cheers

G'day

With this blog, I aim to create a tutorial on designing a relational database for storing horse racing data.

I will be using open source software for the database and also Python to connect to the database and manipulate the data. I will explain basic SQL, and how to write scripts in Python to extract data from free websites to populate your database.

As there's no point in storing the data unless you are going to use it, I will also show you how to run queries to generate reports or test system ideas.

I hope it will be of some interest and useful. Even if you are not a programmer and have no wish to learn, I think knowing what is possible should be helpful if you employ a coder to write your software for you.

Cheers