The Raw Database

When you construct an analytical database, you are better off analyzing your data after they are in the database rather than before. The data in your database are most useful if they are what are known as raw data: individual scores rather than summary data such as statistics (percentages, for example) or ranges (age ranges, for example).
For example, consider a database that consists of the names of twenty cities and their unemployment rates (which are, of course, percentages). If you want to work out the overall unemployment rate for all the cities combined, or for a subset of them, you can't, because you don't know how many people are in the labour force in each city and so cannot weight each city's rate properly. If, however, the database consists of the names of the cities, the number of people in the labour force in each city, and the number of unemployed in each city, you can easily work out those figures as well as any you could have worked out with the other database.
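A minimal sketch of the point, in Python. The city names and figures are invented for illustration and are not taken from any real database. With raw counts, the combined rate is total unemployed divided by total labour force; with only the percentages, the best you can do is average them, which generally gives a different and misleading figure.

```python
# Hypothetical raw data: city -> (labour force, number unemployed).
# Names and numbers are made up for illustration.
cities = {
    "Alphaville": (100_000, 5_000),   # 5% unemployment
    "Betatown": (20_000, 2_000),      # 10% unemployment
    "Gammaburg": (5_000, 1_000),      # 20% unemployment
}


def combined_rate(names):
    """Unemployment rate (%) for the named cities taken together."""
    labour_force = sum(cities[name][0] for name in names)
    unemployed = sum(cities[name][1] for name in names)
    return 100 * unemployed / labour_force


def mean_of_rates(names):
    """Unweighted average of the cities' individual rates (%).

    This is all that can be computed from a database that stores
    only the percentages.
    """
    rates = [100 * cities[name][1] / cities[name][0] for name in names]
    return sum(rates) / len(rates)


print(combined_rate(cities))   # about 6.4
print(mean_of_rates(cities))   # about 11.7
```

In this made-up case the average of the percentages is nearly double the true combined rate, because the smallest city, with the highest rate, counts as much as the largest.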
That example is a simple one for illustrative purposes, but problems like it are not rare. Databases constructed with range data rather than raw data are also common. Often, for instance, people's ages are entered according to an arbitrary range into which they fall: a 28-year-old might be entered as a 25-to-34-year-old. You can discover useful relationships with data like that, but you can also miss relationships that you would find if you entered the actual ages. If you entered the actual ages, you would still be able to investigate your age categories, as well as alternatives to them that might be more useful.
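The same idea can be sketched in Python for ages. The ages and both sets of range boundaries below are hypothetical; the function name `count_by_range` is likewise just an illustration. Because the actual ages are stored, the same data can be grouped into the original categories or into any alternative ones.

```python
from collections import Counter

# Hypothetical raw ages, invented for illustration.
ages = [19, 23, 28, 31, 34, 35, 47, 52, 68]


def count_by_range(ages, lower_bounds):
    """Count how many ages fall into each range.

    lower_bounds lists the lower limit of each range; an age belongs
    to the range with the largest lower limit that does not exceed it.
    Every age is assumed to be at least the smallest lower limit.
    """
    counts = Counter()
    for age in ages:
        start = max(bound for bound in lower_bounds if bound <= age)
        counts[start] += 1
    return dict(sorted(counts.items()))


# Ten-year ranges such as 25-to-34, keyed by their lower limit.
print(count_by_range(ages, [15, 25, 35, 45, 55, 65]))
# {15: 2, 25: 3, 35: 1, 45: 2, 65: 1}

# An alternative grouping: under 30, 30 to 59, 60 and over.
print(count_by_range(ages, [0, 30, 60]))
# {0: 3, 30: 5, 60: 1}
```

Had only the ten-year ranges been entered, the second grouping could not be computed, since the 25-to-34 range straddles the boundary at 30.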
A database of raw data is a much more powerful analytical tool than one of summary data. In compiling a database of summary data you are essentially drawing conclusions about the nature of the data before they have even been entered. Keeping your options open is much the better strategy.
The Raw Database © 2000, John FitzGerald