Title: Data Management Techniques: How to Scrub Your Data
1. Data Management Techniques: How to Scrub Your Data
- Andrea D. Hart
- Catherine Callow-Heusser
2. Quality Assurance: The Overlooked and Underused Part of Data Management
- Plan carefully prior to starting data collection.
- Careful planning will reduce later data problems and cost increases.
- Plan with data analysis in mind!
3. Quality Assurance
- Field Edits
- Directly after an interview, the interviewer should scan for missing data points, illogical skip patterns, quantitative items that need more complete answers, observational items, or any item that needs clarification from a supervisor. Field edits should be made in a different color (e.g., green pencil) to distinguish the field edit from the data recorded at the original collection time point.
4. Quality Assurance
- Quality Control (QC)
- Whether one uses paper instruments or direct computer entry, there should be a second pair of eyes checking the data before data entry. This quality control person should be fully trained on the interviewing procedures and the data entry rules. Any edits made by the QC person should be made in a consistent color (e.g., red pen). Erasures should not be made after the interview, so that a history of decisions is documented.
5. Quality Control
- Communication should be clear between researcher, supervisor, interviewer, and quality control personnel to maintain consistent data collection rules and interpretations.
- When the quality control person gives feedback to the interviewer about problems with the interview or needed clarifications, this helps prevent similar errors in the future and guards against interviewer drift.
6. Double Data Entry
- Entering data twice and then comparing the two data files for differences has been a standard practice for researchers for many years.
- For example, datafile 1 should contain exactly the same information as datafile 2, so var1 from datafile 1 subtracted from var1 from datafile 2 should equal zero. Simple code can flag any difference that is anything but zero (see the sketch below). Then a scanned or hard copy of the original interview is checked to resolve the difference. Often, at this point, data entry rules are established to deal with ambiguous data points.
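The comparison itself can be automated. Below is a minimal sketch in Python with pandas, assuming the two entry files are CSVs keyed by a subject ID column named `id` and that both files have the same columns; the file and column names are illustrative, not part of the original presentation.

```python
import pandas as pd

# Load the two independently entered data files (illustrative file names).
entry1 = pd.read_csv("datafile1.csv").set_index("id").sort_index()
entry2 = pd.read_csv("datafile2.csv").set_index("id").sort_index()

# The two files should have identical structure: one row per id, same columns.
# DataFrame.compare() returns only the cells that differ, labeled
# "self" (entry 1) and "other" (entry 2).
differences = entry1.compare(entry2)

if differences.empty:
    print("No discrepancies: the two entries match.")
else:
    # Each row flags a subject/variable pair to verify against the
    # original (scanned or hard-copy) interview form.
    print(differences)
```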
7. Double Data Entry
- Pros
- Leaves a paper trail for checking data errors
- Low tech; no specialized software needed
- Allows another quality control check on data points (e.g., reading clarifying comments written in the margins of the questionnaire)
- Cons
- Time consuming
- Labor intensive
8. Scanning Data
- Software programs are available that scan interview forms and read data (generally filled-in dots) based on a programmed template. These programs can assign variable and value labels and provide data files for use in SAS and SPSS.
- Pros
- Faster (depending on the design of the interview form and the ease of making a template), thereby reducing data entry budgets
- Facilitates archiving hard copies as digital media (searchable by IDs), reducing storage space and tedious filing
9. Scanning Data
- Cons
- With few exceptions, generally requires expensive equipment and software; it may also require higher levels of expertise from data entry personnel.
- Forms not designed with a scanning template in mind must be redone to allow for dots to be filled in.
- Forms with dots to fill in are not user-friendly for a variety of populations.
- Does not allow for double data entry, and the reliability of accurate entry varies across scanning programs and across users' ability to fill in the dots.
10. Direct Data Entry
- Direct data entry allows participants or interviewers to enter data directly into a computer via interactive screens.
- Pros
- Circumvents errors introduced by multiple data-handling steps
- Requires little or no additional data entry personnel
- Eliminates out-of-range values and skip-pattern errors
- Cons
- Generally no paper trail to check data errors or recreate corrupted data
- Requires fairly sophisticated programming to create a foolproof data entry interface
- Computer equipment is needed
11. Range Checks
- Queries or frequencies should be run for values that are impossible or highly unlikely. For example, a yearly income of $50,000 is not impossible, but it is highly unlikely for a teen mother who qualifies for Early Head Start (see the sketch below).
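A range check can be expressed as a simple query over the entered data. The sketch below uses Python with pandas and assumes hypothetical column names (`id`, `yearly_income`, `mother_age`) and an illustrative cutoff; neither comes from the original presentation.

```python
import pandas as pd

data = pd.read_csv("timepoint1_entry.csv")  # illustrative file name

# Flag income values outside a plausible range so they can be checked
# against the original interview form (cutoff is illustrative only).
out_of_range = data[(data["yearly_income"] < 0) | (data["yearly_income"] > 40_000)]
print("Income values to verify:")
print(out_of_range[["id", "yearly_income"]])

# Frequencies also expose impossible codes quickly (e.g., ages).
print(data["mother_age"].value_counts().sort_index())
```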
Skip Patterns
- Queries should be run for values that are illogical. For example, if a respondent answers "No, I've never drunk alcohol" on an alcohol-use screen, then there should be no data in the section on alcohol use (see the sketch below).
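Skip-pattern violations can be flagged the same way. A minimal sketch, assuming a hypothetical screening item `ever_drank` (no = 0, yes = 1) and follow-up items sharing the prefix `alc_`:

```python
import pandas as pd

data = pd.read_csv("timepoint1_entry.csv")  # illustrative file name

# Follow-up alcohol items (assumed to share the prefix "alc_").
alcohol_items = [col for col in data.columns if col.startswith("alc_")]

# Anyone who answered "never drunk alcohol" (0) should have no data
# in the alcohol-use section; flag rows that violate the skip pattern.
violations = data[(data["ever_drank"] == 0) & data[alcohol_items].notna().any(axis=1)]
print(violations[["id", "ever_drank"] + alcohol_items])
```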
12. Naming Protocols
- In a longitudinal study, large amounts of data will be collected, some of it across several time points. It is crucial to have consistent, easy-to-decipher file names, variable names, and variable labels.
- It's an art to devise variable names for large data sets that can be decoded and still fit into only 8 characters (if using SPSS for data analysis).
13. Variable Naming Protocols
- One strategy is to divide the variable name into three parts: a prefix, a stem, and a suffix.
- Prefix: 2 or 3 characters denoting the time point at which the data were gathered and from whom (or how) they were gathered.
- M1 = Time 1, asked of the mother
- F2 = Time 2, asked of the father
- C3 = Time 3, asked of the childcare provider
- V1 = Time 1, video-coded data
14. Variable Naming Protocols
- Stem: After the prefix, the next set of characters should denote the construct being measured.
- V1intr = Time 1, video-coded, parent intrusiveness
- M3dep = Time 3, mother data, depression
15. Variable Naming Protocols
- Suffix: The suffix can be used for scales with item numbers or summary scores (a decoding sketch follows below).
- M3dep01 = Time 3, mother data, construct depression, item number 1
- M3deptot = Time 3, mother data, construct depression, total summary score
- M3depsu = Time 3, mother data, construct depression, suicide subscale score
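Because the prefix/stem/suffix scheme is positional, code can decode a name back into its parts. A minimal Python sketch, assuming a two-character prefix, a lowercase stem, and an optional suffix that is either a two-digit item number or the literal "tot"/"su"; the helper is illustrative, not part of the original presentation.

```python
import re

# Illustrative decoder for the prefix/stem/suffix scheme described above.
NAME_PATTERN = re.compile(r"^(?P<prefix>[A-Za-z]\d)(?P<stem>[a-z]+?)(?P<suffix>\d{2}|tot|su)?$")

SOURCES = {"M": "mother", "F": "father", "C": "childcare provider", "V": "video-coded"}

def decode(varname: str) -> dict:
    match = NAME_PATTERN.match(varname)
    if match is None:
        raise ValueError(f"{varname!r} does not follow the naming protocol")
    prefix, stem, suffix = match.group("prefix", "stem", "suffix")
    return {
        "time_point": int(prefix[1]),
        "source": SOURCES.get(prefix[0].upper(), "unknown"),
        "construct": stem,
        "item": suffix,  # None for a bare construct name
    }

print(decode("M3dep01"))   # {'time_point': 3, 'source': 'mother', 'construct': 'dep', 'item': '01'}
print(decode("M3deptot"))  # {'time_point': 3, 'source': 'mother', 'construct': 'dep', 'item': 'tot'}
```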
16. Calculate Summary Scores or Reliability Checks
- The computer should be used to calculate summary scores and reliability scores, or to check hand-calculated summary scores. This allows missing data to be treated consistently, according to explicit rules.
- For example, a researcher may want to assign a missing value to any summary score that is missing more than 25% of its individual items (see the sketch below).
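A minimal sketch of that rule in Python with pandas, assuming hypothetical depression items named M3dep01 through M3dep10; the 25% threshold comes from the example above.

```python
import pandas as pd

data = pd.read_csv("timepoint3_entry.csv")  # illustrative file name

# Depression scale items for time 3, mother data (hypothetical names M3dep01..M3dep10).
items = [f"M3dep{i:02d}" for i in range(1, 11)]

# Proportion of missing items per respondent.
prop_missing = data[items].isna().mean(axis=1)

# Total score: sum the available items, but set the score to missing
# whenever more than 25% of the individual items are missing.
data["M3deptot"] = data[items].sum(axis=1, min_count=1).mask(prop_missing > 0.25)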
17. Reports
- Databases like MS Access can be used to produce any useful collection of data that the database supports.
- These reports can be used to manage data collection, particularly when data collection time points in a longitudinal study overlap.
- E.g., weekly reports of subjects who are within the window of data collection for times 1 and 2 (see the sketch below).
- They can also be used to produce data codebooks, lists of variable name prefixes, stems, and suffixes, database rules, etc.
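The weekly window report in the example is a simple query. A sketch in Python with pandas, assuming a hypothetical tracking table with `id`, `window1_start`, `window1_end`, `window2_start`, and `window2_end` date columns; in practice the same query would be built as an MS Access report.

```python
import pandas as pd

tracking = pd.read_csv("tracking.csv", parse_dates=[
    "window1_start", "window1_end", "window2_start", "window2_end",
])  # illustrative file and column names

today = pd.Timestamp.today().normalize()

# Subjects currently inside the time 1 or time 2 data collection window.
in_window1 = tracking["window1_start"].le(today) & tracking["window1_end"].ge(today)
in_window2 = tracking["window2_start"].le(today) & tracking["window2_end"].ge(today)

weekly_report = tracking.loc[in_window1 | in_window2, ["id"]].assign(
    in_window_1=in_window1, in_window_2=in_window2
)
print(weekly_report)
```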
18. Have all the data been entered into appropriate data tables?
- Matching tracking data (completion codes for the various stages of data collection) with the data entry tables ensures that all data collected have been entered. This allows cleanup of inconsistent data OR tracking information. All the data should tell the same story (see the sketch below).
- If the tracking information states that Alice completed all protocols for time point 1 but is missing the video portion of time point 2, then the data tables should also reflect this pattern.
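One way to reconcile the two sources is an outer merge between the tracking table and the data entry table. A minimal sketch, assuming hypothetical tables keyed by `id`, a `video_t2_complete` completion code (1 = complete) in tracking, and video-coded variables prefixed `V2` in the entry table.

```python
import pandas as pd

tracking = pd.read_csv("tracking.csv")          # illustrative file names
entered = pd.read_csv("video_t2_entry.csv")

# Does each subject actually have any video-coded data entered?
video_items = [col for col in entered.columns if col.startswith("V2")]
entered["has_video_data"] = entered[video_items].notna().any(axis=1)

merged = tracking.merge(entered[["id", "has_video_data"]], on="id", how="outer")
merged["has_video_data"] = merged["has_video_data"].fillna(False)

# Mismatches: tracking says complete but no data entered, or vice versa.
mismatch = merged[merged["video_t2_complete"].eq(1) != merged["has_video_data"]]
print(mismatch[["id", "video_t2_complete", "has_video_data"]])
```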
19. Business Rules
- Business rules are database management rules used to reduce the proliferation of database junk.
- When multiple people use a database and are designing tables and queries, it becomes necessary to have rules to live by so that the database is transparent to those who use it.
20. Business Rules
- These rules may include:
- Naming protocols for permanent queries, tables, reports, or syntax.
- Articulated cleanup procedures, such as automatic deletion of non-permanent queries, data tables, or syntax after a designated amount of time.
- Designated personal databases or folders for individual use.
- Keeping a computer file for logging data errors or changes (a logging sketch follows below).
- Maintaining strict control over your computer directory structure, which helps eliminate clutter and confusion.
- Automating a backup schedule.
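The file for logging data errors or changes can be as simple as an append-only CSV. A minimal Python sketch with hypothetical field names and example values; the point is that every change records who, when, which case, which variable, and why.

```python
import csv
from datetime import datetime
from pathlib import Path

LOG_FILE = Path("data_change_log.csv")  # illustrative file name
FIELDS = ["timestamp", "editor", "subject_id", "variable", "old_value", "new_value", "reason"]

def log_change(editor, subject_id, variable, old_value, new_value, reason):
    """Append one data correction to the change log, creating the file if needed."""
    new_file = not LOG_FILE.exists()
    with LOG_FILE.open("a", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        if new_file:
            writer.writeheader()
        writer.writerow({
            "timestamp": datetime.now().isoformat(timespec="seconds"),
            "editor": editor, "subject_id": subject_id, "variable": variable,
            "old_value": old_value, "new_value": new_value, "reason": reason,
        })

# Hypothetical example entry.
log_change("ADH", "1047", "M1inc", "500000", "50000", "QC check against hard copy")
```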
21. Data Codebooks
- A database like MS Access can store ANY type of data, so you can integrate making a codebook into the data collection/entry process.
- These types of data can include your variable labels, value labels, questionnaires, or appendices.
- In preparing a longitudinal dataset, it is important to maintain consistent coding schemes across time. MS Access can be used to maintain definitions of coding schemes (a sketch follows below).
- E.g., yes/no in EVERY dataset is coded no = 0, yes = 1.
22. Two Examples of Codebooks Produced by MS Access
- Differing amounts of information on each screen
- Ease of readability
23-24. (Screenshots of the two example codebooks; no transcript available)
25. Summary
- Your research credibility depends on sound data management techniques.
- Proper data cleaning ALWAYS takes longer and requires more investment than you think it will.
- Clear advance planning of data tracking, variable naming, business rules, and data codebook writing will be invaluable for producing a dataset that is logical, consistent, and easy to use.
- Always respect your data subjects and remember to build in data procedures that maintain their right to privacy.
26. Contact Information
- Andrea D. Hart
- Hart.AndreaD@uams.edu
- University of Arkansas for Medical Sciences
- Partners for Inclusive Communities
- 2001 Pershing Circle, Suite 300
- North Little Rock, AR 72114
- 501-682-9918
- FAX 501-682-9991
- Catherine Callow-Heusser
- cheusser@cc.usu.edu
- NSF MSP/RETA Evaluation Capacity Building Project
- 2810 Old Main Hill
- Utah State University
- Logan, UT 84322-2810
- 435-797-1111
- FAX 435-797-1448