The Raw Database
When you construct an analytical database, it is better to analyze your data after
they are in the database than before. The data in your database are most useful if they
are raw data: individual scores rather than summary data
like statistics (such as percentages) or ranges (such as age ranges).
For example, consider a database that consists of the names of twenty cities and
their unemployment rates (which are, of course, percentages). If you want to work out the
unemployment rate for all the cities or for a subset of them, you can't, because you don't
know how many people are in the labour force in each city, and averaging the rates would
give a small city the same weight as a large one. If, however,
the database consists of the names of the cities,
the number of people in the labour force in each city, and the number
of unemployed in each city, you can easily work out those figures as well as any you
could have worked out with the first database.
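The difference can be sketched in a few lines of Python. The city names and figures below are invented purely for illustration, and the function name unemployment_rate is an assumption of this sketch, not part of any particular database tool. It shows that the raw counts yield both the per-city rates and a correct combined rate, whereas a plain average of the stored percentages gives a different answer.

```python
# Hypothetical figures, for illustration only; these are not real statistics.
cities = [
    {"name": "Avonlea", "labour_force": 200_000, "unemployed": 10_000},
    {"name": "Brookfield", "labour_force": 50_000, "unemployed": 5_000},
    {"name": "Carrow", "labour_force": 20_000, "unemployed": 3_000},
]


def unemployment_rate(rows):
    """Combined unemployment rate (percent) for any group of cities."""
    labour_force = sum(r["labour_force"] for r in rows)
    unemployed = sum(r["unemployed"] for r in rows)
    return 100 * unemployed / labour_force


# Per-city rates: all that a summary-only database would have stored.
per_city = {r["name"]: unemployment_rate([r]) for r in cities}

# Correct combined rate, computed from the raw counts.
combined = unemployment_rate(cities)

# Rate for a subset (here, the two smaller cities).
subset = unemployment_rate(cities[1:])

# Plain average of the percentages: ignores the size of each labour force.
naive_mean = sum(per_city.values()) / len(per_city)

print(per_city)
print(round(combined, 2), round(subset, 2), round(naive_mean, 2))
```

With these invented figures the combined rate is pulled toward the largest city's rate, while the plain average treats all three cities as equal in size.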
That example is a simple one for illustrative purposes, but problems like the one in
the example are not rare. Databases constructed with range data rather than raw data
are also common. Often, for example, people's ages are entered according to an arbitrary
range into which they fall: a 28-year-old might be entered as a 25-to-34-year-old.
You can discover useful relationships with data like that, but you can also
miss relationships that you would find if you entered the actual ages. If you entered the
actual ages you would still be able to investigate your age categories, as well as
alternatives to them which might be more useful.
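As a minimal sketch of that point, the Python below groups the same raw ages in two different ways. The ages, the range boundaries, and the function name bin_ages are all hypothetical choices made for this example. Had only the first set of ranges been stored, the second grouping could not be recovered.

```python
from collections import Counter

# Hypothetical raw ages, for illustration only.
ages = [19, 24, 25, 28, 31, 34, 35, 42, 47, 58]


def bin_ages(ages, lower_bounds):
    """Count ages per range, keyed by the lower bound of each range."""
    counts = Counter()
    for age in ages:
        # Largest lower bound that does not exceed the age
        # (None if the age falls below every range).
        lower = max((b for b in lower_bounds if b <= age), default=None)
        counts[lower] += 1
    return dict(counts)


# The ranges someone might have chosen in advance: 18-24, 25-34, 35-44, ...
original = bin_ages(ages, [18, 25, 35, 45, 55])

# An alternative grouping, available only because the raw ages were kept.
alternative = bin_ages(ages, [18, 30, 40, 50])

print(original)
print(alternative)
```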
A database of raw data is a much more powerful analytical tool than one of summary data.
In compiling a database of summary data you are essentially drawing conclusions about the
nature of the data before they have even been entered. Keeping your options open is much the
better strategy.