Data cleaning - PowerPoint PPT Presentation

About This Presentation
Title:

Data cleaning

Description:

Age: generate descriptive statistics. Treatment type: generate a ... Example: Age. It would be reasonable to check for values 10 and under and 70 and over ... – PowerPoint PPT presentation

Number of Views:125
Avg rating:3.0/5.0
Slides: 40
Provided by: test323
Learn more at: https://www.unodc.org
Category:
Tags: age | cleaning | data

less

Transcript and Presenter's Notes

Title: Data cleaning


1
Data cleaning
GAP Toolkit 5
Training in basic drug abuse data
management and analysis
Training session 12
2
Objectives
  • To establish methods of uncovering coding errors
  • To discuss techniques for implementing logical
    tests
  • To present methods of selecting cases
  • To reinforce the SPSS skills presented to date

3
Boolean operators AND
  • The AND operator is a logical operator in Boolean
    algebra
  • Imagine two statements X and Y
  • For the operation (X AND Y) to be true X has to
    be true and Y has to be true
  • The rules for Boolean operators are commonly
    displayed in Truth Tables

4
Truth table AND
5
Boolean operators OR
  • The OR operator is a logical operator in Boolean
    algebra
  • Imagine two statements X and Y
  • For the operation (X OR Y) to be true either X is
    true or Y is true or both X and Y are true

6
Truth table OR
7
Data cleaning
  • Check the data for errors
  • Clean the data before any data analysis

8
Types of error
  • There are two broad areas of error
  • Coding errors
  • Logical errors

9
Coding error
  • Data entry errors
  • Out-of-range values

10
Detecting out-of-range values
  • For categorical variables, having declared valid
    values, frequency counts will highlight any
    peculiar entries
  • For continuous variables, descriptive statistics,
    in particular the range and a histogram, will
    highlight any peculiar values

11
Examples
  • Age generate descriptive statistics
  • Treatment type generate a frequency distribution

12
Descriptives
Statistic Std. Error
Age Mean 31.78 .315
95 Confidence Interval for Mean Lower Bound Lower Bound 31.16
Upper Bound Upper Bound 32.40
5 Trimmed Mean 31.31
Median 31.00
Variance 154.614
Std. Deviation 12.434
Minimum 1
Maximum 77
Range 76
Interquartile Range 20.00
Skewness -.427 .062
Kurtosis -.503 .124
13
(No Transcript)
14
Treatment type
Frequency Percent Valid Percent Cumulative Percent
Valid Inpatient 1027 65.4 65.7 65.7
Outpatient 535 34.1 34.2 99.9
4 1 .1 .1 100.0
Total 1563 99.5 100.0
Missing System 8 .5
Total 1571 100.0
15
Resolving errors
  • The questionnaires should be checked
  • If possible, return to the interviewer or
    interviewee
  • If still unresolved, consider setting the value
    as missing
  • Note the importance of ID numbers for linking the
    computer to the questionnaire

16
Selecting cases
  • The ability to select a set of cases according to
    a criterion is essential in data cleaning
  • Generating statistics for subsets of the data is
    also a useful analytical tool

17
Example Age
  • Descriptive statistics of Age indicate that there
    is a case with a value of 1 and a case with the
    value 77
  • It is advisable to check the extreme values

Descriptive Statistics
N Minimum Maximum Mean Std. Deviation
Age 1563 1 77 31.78 12.434
Valid N (listwise) 1563
18
Example Age
  • It would be reasonable to check for values 10 and
    under and 70 and over
  • The task is to select those cases and display the
    results
  • Data/Select Cases generates the following
    dialogue box

19
Choose these options to define selection criteria.
20
(No Transcript)
21
Data/Select Cases
  • SPSS creates a new variable in the data set
    called filter_ which 1 when AGElt10 OR AGE gt
    70
  • All subsequent analysis will be on the reduced
    data set until Data/Select Cases/All Cases is
    chosen
  • The filtered cases are identified by a slash
    through the case number

22
Age
Frequency Percent Valid Percent Cumulative Percent
Valid 1 1 7.1 7.1 7.1
7 5 35.7 35.7 42.9
8 1 7.1 7.1 50.0
9 1 7.1 7.1 57.1
10 3 21.4 21.4 78.6
70 1 7.1 7.1 85.7
72 1 7.1 7.1 92.9
77 1 7.1 7.1 100.0
Total 14 100.0 100.0
23
Generating a report
  • Analyse/Reports/Case Summaries
  • Select the variables to be included in the summary

24
(No Transcript)
25
Case summariesa
Case number ID Age Race Education Employment Marital status Treatment type 1st most frequently used drug
1 16 16 8 White Secondary Working full-time Married liv w. spouse Inpatient ALCOHOL
2 85 85 77 White Tertiary Pensioner Widowed Inpatient ALCOHOL
3 183 183 70 White Secondary Pensioner Married liv w. spouse Inpatient ALCOHOL
4 184 184 72 White Tertiary Pensioner Married liv w. spouse Inpatient ALCOHOL
5 903 903 1 White . Student/pupil Never married Inpatient DAGGA
6 1041 1041 7 African Primary Student/pupil Never married Outpatient DAGGA
7 1042 1042 7 African Primary Student/pupil Never married Outpatient DAGGA
8 1043 1043 7 African Primary Student/pupil Never married Outpatient DAGGA
9 1044 1044 7 African Primary Student/pupil Never married Outpatient DAGGA
10 1045 1045 7 African Primary Student/pupil Never married Outpatient DAGGA
11 1518 1518 9 African Primary Student/pupil Never married Outpatient WHITE PIPE
12 1519 1519 10 African Primary Student/pupil Never married Outpatient WHITE PIPE
13 1520 1520 10 African Primary Student/pupil Never married Outpatient WHITE PIPE
14 1521 1521 10 African Primary Student/pupil Never married Outpatient WHITE PIPE
Total N 14 14 14 13 14 14 14 14
a. Limited to first 100 cases.
26
Note All Cases
  • Dont forget that, once certain cases have been
    selected, all subsequent analysis is on the
    selected cases only
  • Once you have finished working with the subset,
    restore the file to All Cases before doing any
    further analysis
  • Data/Select Cases
  • Select the All Cases radio button
  • OK

27
Locating a case
  • From the Data Editor
  • Data/Go To Case OR
  • Select a variable, then Edit/Find

28
Logical errors
  • Detecting logical errors involves comparing
    answers to ensure that they are consistent
  • The type of logical checks appropriate to
    identify particular errors will depend on the
    questions in the questionnaire

29
Detecting logical errors
  • Cross-tabulations between categorical variables
    can be used to highlight errors
  • Check criteria using conditional statements and
    the Compute facility
  • Some software, such as SPSS Databuilder, allows
    tests for logical and coding errors to be built
    into a data entry form

30
Example Cross-tabulation
  • Cross-tabulations provide a simple method of
    investigating the joint distribution of two
    variables
  • The following slide is a cross-tabulation of
    Drug1 against Mode1 to check that appropriate
    modes of ingestion have been reported

31
Most Frequently Used Drug (Cross-tabulation)
Mode of ingestion Drug1 Mode of ingestion Drug1 Mode of ingestion Drug1 Mode of ingestion Drug1
Swallow Smoke Snort Inject Inject Inject Inject Total
DAGGA 1 180 181
HEROIN 31 11 29 29 29 29 71
CODEINE 5 5
COCAINE 2 44 46
CRACK 97 1 98
AMPHETAMINE 4 1 2 7
ECSTASY 24 1 25
SEDATIVES TRANQUILLIZERS 3 3
BENZODIAZEPINES 16 16
MANDRAX 12 12
VALIUM 2 2
LSD 5 5
SOLVENTS INHALANTS 2 1 3 6
WHITE PIPE 309 309
ALCOHOL 717 717
ROHYPNOL 3 3
MISC. PRESCRIPTION DRUGS 9 1 1 1 1 10
MISC. DRUGS 1 1

Total 791 634 62 30 30 30 30 1517
Most frequently used drug
32
Example conditional statements
  • Main.sav contains information on the three most
    frequently used drugs Drug1, Drug2 and Drug3
  • In a single case, no drug should appear in more
    than one of the three variables
  • To check this, generate a test variable on the
    basis of a conditional statement the test
    variable should take the value 0 if all three
    drug variables are different and the value 1 if
    there is any duplication

33
Compute Test 0
  • Transform/Compute
  • Enter the name of the new variable TEST
  • Click the Type and Label button and declare the
    variable as numeric with the label TEST
    VARIABLE FOR DRUG DUPLICATION
  • Set TEST 0

34
Compute TEST 1
  • If any of the drug options are the same, TEST
    should equal 1 EXCEPT when Drug2 Drug3 77
    (not applicable)
  • The condition is if
  • Drug1 Drug2 OR
  • Drug1 Drug3 OR
  • (Drug2 Drug3 AND Drug2 ? 77)
  • THEN Test 1

35
(No Transcript)
36
(No Transcript)
37
Case summariesa
1st most frequently used drug 2nd most frequently used drug 3rd most frequently used drug ID
1 BENZODIAZEPINES MISC. PRESCRIPTION DRUGS MISC. PRESCRIPTION DRUGS 734
2 CRACK CRACK ECSTASY 807
3 CRACK WHITE PIPE CRACK 835
4 HEROIN SEDATIVES TRANQUILLIZERS SEDATIVES TRANQUILLIZERS 1182
5 SEDATIVES TRANQUILLIZERS MISC. PRESCRIPTION DRUGS MISC. PRESCRIPTION DRUGS 1230
6 SEDATIVES TRANQUILLIZERS SEDATIVES TRANQUILLIZERS MISC. PRESCRIPTION DRUGS 1231
7 MISC. PRESCRIPTION DRUGS MISC. PRESCRIPTION DRUGS Not Applicable 1245
8 MISC. PRESCRIPTION DRUGS MISC. PRESCRIPTION DRUGS ALCOHOL 1250
Total N 8 8 8 8
a. Limited to first 100 cases.
38
Exercise
  • Check for consistency between the drug reported
    and the method of ingestion for the second and
    third drugs of use
  • What additional logical tests could be completed
    on the data in main.sav?

39
Summary
  • Data entry errors
  • Out-of-range errors
  • Logical errors
  • Conditional statements
  • Selecting cases
  • Reports
Write a Comment
User Comments (0)
About PowerShow.com