PC Database Fitness: Part 1
A database of robust, accurate data can be a powerful decisionmaking tool. However, a database of weak, unreliable data is often a liability which results in fewer good decisions being made.
Although there are many techniques for assessing, improving, and maintaining the fitness of the data in your database, they are not usually mentioned in database manuals, or taught in database training courses. The purpose of this report will be to describe some of these techniques, especially the ones that you can apply immediately. A summary of the suggestions you can apply immediately is given at the end of this report.
Unreliable data create unreliable databases. One of the key concepts of measurement theory is reliability, which is essentially the same thing as consistency. Giving consistency a new name may not seem very helpful, but in measurement theory the study of reliability has led to a better understanding of how consistency is achieved. Consistency, of course, is a prerequisite of accuracy. If you understand the different aspects of reliability, you can ensure that you are getting as much as possible out of the data in your database.
Unreliable data are like junk food -- they make your database nice and fat, but they seriously impair its fitness. Junk food provides empty calories, and unreliable data provide empty information. As the term unreliable implies, you cannot depend on them in your decisionmaking.
Data can be unreliable for many different reasons. They may have been collected with faulty instruments, they may have been entered incorrectly, they may have been analyzed incorrectly once they were entered, and so on.
Whatever the reason, unreliable data create unreliable databases. Whether you use your databases for list maintenance, for recordkeeping, or for mathematical evaluation, unreliable data can seriously impair your ability to make decisions.
In this report we will look at several ways that the reliability of PC databases can be improved. The specific aspects of databases we will look at are:Some of the suggestions I will make may seem obvious. However, I can assure you that they are not obvious to all database users, and that they are based on problems I have seen in actual databases. On the other hand, the sections on weighting and on rating and scaling may surprise many.
designing and building the database entering and cleaning data auditing data dealing with missing data data petrification weighting rating and scaling
Designing and building the database. The key to designing a successful database is simple: know the reports you want to produce. Ideally you will be able to provide mock-ups of them, and to specify how to produce them with the software. By being able to specify the reports in that detail, you will be able to get a very clear idea of how the files in your database should be designed.
This approach will also save you from collecting more data than you need. The computer can store a lot of data, and people often decide to collect as much as they can, and to worry about how to analyze it after the data have all been entered. That decision is certain to lead to the collection of unnecessary data, the expenditure of excessive amounts of money, and the use of excessive amounts of staff time.
Data do not collect and enter themselves. The more data you collect, the more staff time is required just to get the data in the software. If the reports to be produced by the database have not been specified, the design of the database is likely to be less efficient than it could be. In my experience, database projects of this type have often collapsed under their own weight before data entry is complete.
When designing any database, the most efficient approach is to start by establishing the value that each item of information is going to provide, and the cost of collecting, entering, and auditing that item. If the value doesn't justify the cost, don't collect the data.
Entering and cleaning the data. As you probably know, error in data is commonly referred to in statistical analysis as dirt. The dirtier your data are, the less useful.
Error in data arises in two ways. The first is through mistakes in recording, transcribing, entering, and so on. If your database software has validation or verification options, you can markedly reduce the frequency of these mistakes. These options will, for example, refuse to allow the entry of values which are known to be incorrect. If sex is coded as 'F' or 'M', using the verification options will prevent another letter from being entered by accident.
Verification cannot catch all errors, though. For example, if sex is entered as 'F' when it should be 'M', the value will still be accepted. Furthermore, verification is sometimes a complex process, requiring difficult programming which should not be expected of most users. Whether or not you use the validation or verification options of your database software, you will save yourself a lot of trouble if you use other techniques to take the dirt out of your data.
The process of removing dirt is naturally called cleaning the data. The techniques used to do this are often simple, and there is a very good chance that you use them already. However, I have found data cleaning is often omitted from database preparation, so we will have a quick look at how to do it.
One basic data cleaning technique is pretty obvious, but often neglected. This is simply to check newly entered data in your file against the source of the data. For example, if you're entering data from a questionnaire, you can visually check a listing of the data in your file against the questionnaires.
If data have been entered without being verified, a couple of quick and simple techniques will catch many errors that verification normally catches. The first is simply to produce a summary report for each field of the values which have been entered in it. You can then quickly spot any values which shouldn't be there. For example, if sex is to be recorded as 'F' or 'M', a record with 'Z' in the field for sex would obviously be wrong.
You can use the database's sort function to find unusual values. For example, I recently worked with a database in which some of the linking codes were alphanumeric. The codes consisted of a letter followed by two numbers -- for example, R01, S02, T10, and so on. The problem was that a capital O had occasionally been substituted for what should have been a zero. I simply sorted each of the files by these alphanumeric codes. The records with incorrect codes finished up at the end of the file, where they could be easily corrected.
Another quick and simple technique for finding dirty data is to crosstabulate fields and investigate any unusual results. For example, if you have collected information on people's ages, incomes, and job titles, you would probably be interested in verifying the accuracy of a record which showed an eighteen-year-old working as a CEO and making $500,000 a year. The record may be correct, but the chances of running across an eighteen-year-old CEO making half a million a year are slim enough that it's worth your time to verify the accuracy of the record.
It is also often possible to check your data against other sources of the same information. For example, addresses in geographical planning databases compiled from assessment rolls can be checked against addresses in databases of customers of public utilities or telephone companies. Similar services are available at reasonable rates for PC database users.
The second source of error in data is faulty measurement. Crosstabulation of fields will also help you spot some errors of this type. For example, when Statistics Canada included in the census a question asking people if they owned automatic dishwashers, it found that residents of Quebec were more likely than the residents of other provinces to report that they did.
However, Statistics Canada also noticed that in other provinces francophones were also more likely than anglophones to report that they had automatic dishwashers. A comparison of the English and French versions of the census form found that the French question seemed to be easier to misinterpret than the English question. A little further research confirmed this explanation.
If you are required by law or by regulation to maintain a database containing a huge amount of information, try entering a few fields at a time. In the end that will be quicker than entering all the data at the same time. It also makes it easier to clean your data, using the procedures we have already looked at. Finally, it makes data entry less confusing, and reduces the possibility of entry errors.
While the techniques we have been looking at in this section are simple, they are powerful. They don't take long, they save you a lot of trouble, and they improve the information on which you base important decisions.
Auditing data. Data can be correctly entered in a database but become incorrect later. For example, you can enter an address correctly in a mailing list database, but it will become incorrect when the person moves.
Before setting up any database, determine which data will have to be audited regularly after being entered, and how the checking will be done. If the auditing is going to require a great deal of staff time, you might reconsider collecting the data. That advice may seem elementary, but it is often ignored, with unpleasant results.
PC Database Fitness
© John Fitzgerald, 1995