Title: Data cleaning
1Data cleaning
GAP Toolkit 5
Training in basic drug abuse data
management and analysis
Training session 12
2Objectives
- To establish methods of uncovering coding errors
- To discuss techniques for implementing logical
tests - To present methods of selecting cases
- To reinforce the SPSS skills presented to date
3Boolean operators AND
- The AND operator is a logical operator in Boolean
algebra - Imagine two statements X and Y
- For the operation (X AND Y) to be true X has to
be true and Y has to be true - The rules for Boolean operators are commonly
displayed in Truth Tables
4Truth table AND
5Boolean operators OR
- The OR operator is a logical operator in Boolean
algebra - Imagine two statements X and Y
- For the operation (X OR Y) to be true either X is
true or Y is true or both X and Y are true
6Truth table OR
7Data cleaning
- Check the data for errors
- Clean the data before any data analysis
8Types of error
- There are two broad areas of error
- Coding errors
- Logical errors
9Coding error
- Data entry errors
- Out-of-range values
10Detecting out-of-range values
- For categorical variables, having declared valid
values, frequency counts will highlight any
peculiar entries - For continuous variables, descriptive statistics,
in particular the range and a histogram, will
highlight any peculiar values
11Examples
- Age generate descriptive statistics
- Treatment type generate a frequency distribution
12Descriptives
Statistic Std. Error
Age Mean 31.78 .315
95 Confidence Interval for Mean Lower Bound Lower Bound 31.16
Upper Bound Upper Bound 32.40
5 Trimmed Mean 31.31
Median 31.00
Variance 154.614
Std. Deviation 12.434
Minimum 1
Maximum 77
Range 76
Interquartile Range 20.00
Skewness -.427 .062
Kurtosis -.503 .124
13(No Transcript)
14Treatment type
Frequency Percent Valid Percent Cumulative Percent
Valid Inpatient 1027 65.4 65.7 65.7
Outpatient 535 34.1 34.2 99.9
4 1 .1 .1 100.0
Total 1563 99.5 100.0
Missing System 8 .5
Total 1571 100.0
15Resolving errors
- The questionnaires should be checked
- If possible, return to the interviewer or
interviewee - If still unresolved, consider setting the value
as missing - Note the importance of ID numbers for linking the
computer to the questionnaire
16Selecting cases
- The ability to select a set of cases according to
a criterion is essential in data cleaning - Generating statistics for subsets of the data is
also a useful analytical tool
17Example Age
- Descriptive statistics of Age indicate that there
is a case with a value of 1 and a case with the
value 77 - It is advisable to check the extreme values
Descriptive Statistics
N Minimum Maximum Mean Std. Deviation
Age 1563 1 77 31.78 12.434
Valid N (listwise) 1563
18Example Age
- It would be reasonable to check for values 10 and
under and 70 and over - The task is to select those cases and display the
results - Data/Select Cases generates the following
dialogue box
19Choose these options to define selection criteria.
20(No Transcript)
21Data/Select Cases
- SPSS creates a new variable in the data set
called filter_ which 1 when AGElt10 OR AGE gt
70 - All subsequent analysis will be on the reduced
data set until Data/Select Cases/All Cases is
chosen - The filtered cases are identified by a slash
through the case number
22Age
Frequency Percent Valid Percent Cumulative Percent
Valid 1 1 7.1 7.1 7.1
7 5 35.7 35.7 42.9
8 1 7.1 7.1 50.0
9 1 7.1 7.1 57.1
10 3 21.4 21.4 78.6
70 1 7.1 7.1 85.7
72 1 7.1 7.1 92.9
77 1 7.1 7.1 100.0
Total 14 100.0 100.0
23Generating a report
- Analyse/Reports/Case Summaries
- Select the variables to be included in the summary
24(No Transcript)
25Case summariesa
Case number ID Age Race Education Employment Marital status Treatment type 1st most frequently used drug
1 16 16 8 White Secondary Working full-time Married liv w. spouse Inpatient ALCOHOL
2 85 85 77 White Tertiary Pensioner Widowed Inpatient ALCOHOL
3 183 183 70 White Secondary Pensioner Married liv w. spouse Inpatient ALCOHOL
4 184 184 72 White Tertiary Pensioner Married liv w. spouse Inpatient ALCOHOL
5 903 903 1 White . Student/pupil Never married Inpatient DAGGA
6 1041 1041 7 African Primary Student/pupil Never married Outpatient DAGGA
7 1042 1042 7 African Primary Student/pupil Never married Outpatient DAGGA
8 1043 1043 7 African Primary Student/pupil Never married Outpatient DAGGA
9 1044 1044 7 African Primary Student/pupil Never married Outpatient DAGGA
10 1045 1045 7 African Primary Student/pupil Never married Outpatient DAGGA
11 1518 1518 9 African Primary Student/pupil Never married Outpatient WHITE PIPE
12 1519 1519 10 African Primary Student/pupil Never married Outpatient WHITE PIPE
13 1520 1520 10 African Primary Student/pupil Never married Outpatient WHITE PIPE
14 1521 1521 10 African Primary Student/pupil Never married Outpatient WHITE PIPE
Total N 14 14 14 13 14 14 14 14
a. Limited to first 100 cases.
26Note All Cases
- Dont forget that, once certain cases have been
selected, all subsequent analysis is on the
selected cases only - Once you have finished working with the subset,
restore the file to All Cases before doing any
further analysis - Data/Select Cases
- Select the All Cases radio button
- OK
27Locating a case
- From the Data Editor
- Data/Go To Case OR
- Select a variable, then Edit/Find
28Logical errors
- Detecting logical errors involves comparing
answers to ensure that they are consistent - The type of logical checks appropriate to
identify particular errors will depend on the
questions in the questionnaire
29Detecting logical errors
- Cross-tabulations between categorical variables
can be used to highlight errors - Check criteria using conditional statements and
the Compute facility - Some software, such as SPSS Databuilder, allows
tests for logical and coding errors to be built
into a data entry form
30Example Cross-tabulation
- Cross-tabulations provide a simple method of
investigating the joint distribution of two
variables - The following slide is a cross-tabulation of
Drug1 against Mode1 to check that appropriate
modes of ingestion have been reported
31Most Frequently Used Drug (Cross-tabulation)
Mode of ingestion Drug1 Mode of ingestion Drug1 Mode of ingestion Drug1 Mode of ingestion Drug1
Swallow Smoke Snort Inject Inject Inject Inject Total
DAGGA 1 180 181
HEROIN 31 11 29 29 29 29 71
CODEINE 5 5
COCAINE 2 44 46
CRACK 97 1 98
AMPHETAMINE 4 1 2 7
ECSTASY 24 1 25
SEDATIVES TRANQUILLIZERS 3 3
BENZODIAZEPINES 16 16
MANDRAX 12 12
VALIUM 2 2
LSD 5 5
SOLVENTS INHALANTS 2 1 3 6
WHITE PIPE 309 309
ALCOHOL 717 717
ROHYPNOL 3 3
MISC. PRESCRIPTION DRUGS 9 1 1 1 1 10
MISC. DRUGS 1 1
Total 791 634 62 30 30 30 30 1517
Most frequently used drug
32Example conditional statements
- Main.sav contains information on the three most
frequently used drugs Drug1, Drug2 and Drug3 - In a single case, no drug should appear in more
than one of the three variables - To check this, generate a test variable on the
basis of a conditional statement the test
variable should take the value 0 if all three
drug variables are different and the value 1 if
there is any duplication
33Compute Test 0
- Transform/Compute
- Enter the name of the new variable TEST
- Click the Type and Label button and declare the
variable as numeric with the label TEST
VARIABLE FOR DRUG DUPLICATION - Set TEST 0
34Compute TEST 1
- If any of the drug options are the same, TEST
should equal 1 EXCEPT when Drug2 Drug3 77
(not applicable) - The condition is if
- Drug1 Drug2 OR
- Drug1 Drug3 OR
- (Drug2 Drug3 AND Drug2 ? 77)
- THEN Test 1
35(No Transcript)
36(No Transcript)
37Case summariesa
1st most frequently used drug 2nd most frequently used drug 3rd most frequently used drug ID
1 BENZODIAZEPINES MISC. PRESCRIPTION DRUGS MISC. PRESCRIPTION DRUGS 734
2 CRACK CRACK ECSTASY 807
3 CRACK WHITE PIPE CRACK 835
4 HEROIN SEDATIVES TRANQUILLIZERS SEDATIVES TRANQUILLIZERS 1182
5 SEDATIVES TRANQUILLIZERS MISC. PRESCRIPTION DRUGS MISC. PRESCRIPTION DRUGS 1230
6 SEDATIVES TRANQUILLIZERS SEDATIVES TRANQUILLIZERS MISC. PRESCRIPTION DRUGS 1231
7 MISC. PRESCRIPTION DRUGS MISC. PRESCRIPTION DRUGS Not Applicable 1245
8 MISC. PRESCRIPTION DRUGS MISC. PRESCRIPTION DRUGS ALCOHOL 1250
Total N 8 8 8 8
a. Limited to first 100 cases.
38Exercise
- Check for consistency between the drug reported
and the method of ingestion for the second and
third drugs of use - What additional logical tests could be completed
on the data in main.sav?
39Summary
- Data entry errors
- Out-of-range errors
- Logical errors
- Conditional statements
- Selecting cases
- Reports