Title: Finding and Using Publicly Available Datasets for Secondary Data Analysis Research
1Finding and Using Publicly Available Datasets for
Secondary Data Analysis Research
- KL2 Seminar
- February 2011
2Disclosures and acknowledgements
- Disclosures
- None
- Acknowledgements
- Alex Smith, Michael McWilliams, Ann Nattinger,
SGIM Research Committee
3Two shout-outs
- Comparative Effectiveness Research through CTSI
Smith AK et al, JGIM 2011
4Learning objectives
- Appreciate key conceptual and practical issues
involved in secondary data analysis
- Identify and use online tools for locating and
learning about publicly available datasets
relevant to your research
- Focus on what is useful to you
5(My) Definition of Secondary Data
- Data that have been collected
- but not for you
6Types of Secondary Data
- Survey (NHIS, NHANES, HRS, BRFSS)
- Administrative (Medicare claims)
- Discharge (HCUP SID and NIS)
- Medical chart / EMR
- Disease registries (SEER)
- Aggregate (ARF, US Census)
- Research databases (SOF)
- Combinations and linkages
7Key Conceptual Issues
- Someone else's secondary data is your primary
data
- Treat data and research plan with the same rigor as
you would for a primary data collection study
- Research questions should be conceptually driven,
interesting a priori
- Some exceptions: Warren Browner rule
- Know the data as well as if you had collected it
yourself
- Who is in the cohort?
- Strengths and limitations of data collection
procedures, instruments
8Selecting a Database
- Compatibility with research question(s)
- Availability and expense
- Sample representativeness, power
- Measures of interest present and valid
- Messiness and missingness
- Local expertise
- Linkages
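A quick way to gauge "messiness and missingness" before committing to a dataset is to tabulate the share of missing values per variable. The sketch below is only illustrative: the variable names, the sample records, and the set of missing-value codes are all assumptions, and real datasets define their own codes in the codebook.

```python
import csv
import io

# Codes treated as missing. This set is an assumption for illustration;
# check each dataset's codebook for its actual missing-value codes.
MISSING_CODES = {"", "NA", ".", "-9"}


def missingness_by_variable(rows, missing_codes=MISSING_CODES):
    """Return {variable: fraction of records missing} for a list of dict records."""
    if not rows:
        return {}
    counts = {name: 0 for name in rows[0]}
    for row in rows:
        for name in counts:
            value = row.get(name)
            if value is None or value.strip() in missing_codes:
                counts[name] += 1
    return {name: counts[name] / len(rows) for name in counts}


# Hypothetical extract; the columns are made up for this example.
SAMPLE_CSV = """id,age,race_ethnicity,afib
1,67,NA,1
2,,2,0
3,71,3,.
4,58,1,0
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
report = missingness_by_variable(rows)
for name, fraction in sorted(report.items(), key=lambda item: -item[1]):
    print(f"{name}: {fraction:.0%} missing")
```

Variables with heavy missingness among the key exposures or outcomes are a reason to reconsider the dataset or to plan for it in the analysis.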
9Resources Needed
- Your effort
- Computer resources and security
- Programmer and/or statistician effort
- PhD statistical support for complex sampling or
analyses
- Coordinator if merging datasets
- Realistic timeline / Gantt chart
10Cases
- Amita is a junior faculty member interested in
doing a secondary data analysis project on the
association between race/ethnicity and the
prevalence and outcomes of atrial fibrillation.
No prior experience and limited direct
mentorship.
- Eric is a junior faculty member with past
experience. Wants to find a new dataset around
which to write a grant on the association between
SES and ADL function in elders.
11Amita Getting Started
- Amita
- Get acquainted with basics
- Find dataset and assess merit and feasibility
- Find a mentor / get expert help
- www.sgim.org/go/datasets
14Get Acquainted with Basics
16Find a Dataset, Assess Merit and Feasibility
20CARDIA
21CARDIA
22Get Expert Help
23Getting Expert Help
- Request a consultation
- 1 on 1 consultation
- Clear, defined questions about dataset
- Example: strengths and weaknesses of using XYZ to
study patterns of medication use for heart
failure
24Eric Getting Down to Business
- Identify datasets relevant to his research
interests
- Identify health statistics, validated
instruments, funding sources
- www.sgim.org/go/datasets
26Finding Additional Resources
- National Information Center on Health Services
Research and Health Care Technology (NICHSR)
- Inter-University Consortium for Political and
Social Research (ICPSR)
- Partners in Information Access for the Public
Health Workforce
- Roadmap K-12 Data Resource Center (UCSF)
- List of datasets from the American Sociological
Association
- Canadian Research Data Centers Data Sets and
Research Tools (Canada)
- Directory of Health and Human Services Data
Resources
- Publicly Available Databases from National
Institute on Aging (NIA)
- Publicly Available Databases from National Heart,
Lung, and Blood Institute (NHLBI)
- National Center for Health Statistics (NCHS) Data
Warehouse
- Medicare Research Data Assistance Center
(RESDAC) and Centers for Medicare and Medicaid
Services (CMS) Research, Statistics, Data and
Systems
- Veterans Affairs (VA) data
27CELDAC
- Comparative Effectiveness Large Dataset Analysis
Core
- UCSF CTSI
- Access to local and national datasets and
expertise
- http://ctsi.ucsf.edu/research/celdac
28National Information Center on Health Services
Research and Health Care Technology (NICHSR)
- Databases, data repositories, health statistics
- Fellowship and funding opportunities
- Glossaries, research and clinical guidelines
- Evidence-based practice and health technology
assessment
- Specialized PubMed searches on healthcare quality
and costs
- http://www.nlm.nih.gov/hsrinfo/index.html
29ISPOR
- International Society for Pharmacoeconomics
and Outcomes Research
- http://www.ispor.org/DigestOfIntDB/CountryList.aspx
30Inter-University Consortium for Political and
Social Research (ICPSR)
- World's largest archive of social science data
- Searchable
- Many sub-archives relevant to HSR
- Health and Medical Care Archive
- National Archive of Computerized Data on Aging
- http://www.icpsr.umich.edu/icpsrweb/ICPSR/access/index.jsp
31Questions?
- Specific high-value datasets
- Causal inference / comparative effectiveness
- Which comes first: RQ or dataset?
- Evaluating and managing validity of measures
- Analyzing complex survey data
32EXTRA SLIDES
- Additional brief information about specific
high-value datasets
- VA administrative data
- NHANES
- NAMCS
- NIS
33Administrative Data (VA)
- VA has multiple high-value administrative
databases
- Outpatient visit information
- Visit date, type of clinic, provider, ICD-9
diagnoses
- Inpatient information
- Admitting dx(s), discharge dx(s), CPT codes, bed
section, meds administered
- Lab data
- >40 labs
- Pharmacy data
- All inpatient and outpatient fills
- Academic affiliation
- etc
34Administrative Data (VA)
- Huge bureaucracy and paperwork
35Administrative Data (VA)
- Messy data
- Huge size
- 2 TB server
- Data analyst
36Survey Data (NHANES)
- National Health and Nutrition Examination Survey
(NHANES)
- Nationally representative sample of >10K patients
every 2 years
- Extensive interview data on clinical history
(including diseases, behaviors, psychosocial
parameters, etc.)
- Physical exam information (e.g., VS)
- Labs, biomarkers
37Survey Data (NHANES)
- Free and easy to download
- (Relatively) easy to use
- Although requires careful reading of
documentation
- Serial cross-sectional
- Disease data self-report
- Very limited information about providers and
systems of care
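One reason the documentation matters: nationally representative surveys such as NHANES come with sample weights, and unweighted summaries can be misleading. The sketch below contrasts an unweighted and a weighted prevalence on made-up records. The field names ("diabetes", "weight") and the values are hypothetical, and this covers the point estimate only; valid standard errors also need the survey's strata and sampling-unit variables and are best left to survey-aware statistical software.

```python
def weighted_prevalence(records, outcome_key, weight_key):
    """Weighted proportion with outcome == 1, skipping records with a missing outcome."""
    total_weight = 0.0
    case_weight = 0.0
    for record in records:
        outcome = record[outcome_key]
        if outcome is None:
            continue  # missing outcome: excluded from numerator and denominator
        weight = record[weight_key]
        total_weight += weight
        case_weight += weight * outcome
    if total_weight == 0:
        raise ValueError("no records with a non-missing outcome")
    return case_weight / total_weight


# Hypothetical respondents; weights are invented for illustration.
records = [
    {"diabetes": 1, "weight": 1000.0},
    {"diabetes": 0, "weight": 3000.0},
    {"diabetes": 1, "weight": 500.0},
    {"diabetes": 0, "weight": 500.0},
    {"diabetes": None, "weight": 2000.0},
]

observed = [r["diabetes"] for r in records if r["diabetes"] is not None]
unweighted = sum(observed) / len(observed)
weighted = weighted_prevalence(records, "diabetes", "weight")
print(f"unweighted: {unweighted:.2f}, weighted: {weighted:.2f}")
```

In this toy example the two estimates differ because respondents without the condition carry larger weights, which is the kind of gap that ignoring the sampling design can hide.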
38Survey Data (NAMCS)
- National Ambulatory Medical Care Survey (NAMCS)
and National Hospital Ambulatory Medical Care
Survey (NHAMCS)
- Nationally representative sample of 70K
outpatient and ED visits per year
- Physician-completed form about office visit
40Survey Data (NAMCS)
- Data more from physician perspective (diagnoses,
treatments Rxed, etc.) and some info on providers
(e.g., clinic organization, use of EMRs, etc.)
- Serial cross-sectional
- Visit-focused
- Not comprehensive; questionable value for chronic
diseases
41Discharge Data (NIS)
- National Inpatient Sample (NIS)
- Database of inpatient hospital stays collected
from 20% of US community hospitals by AHRQ
- Diagnoses and procedures, severity adjustment
elements, payment source, hospital organizational
characteristics
- Hospital and county identifiers that allow
linkage to the American Hospital Association
Annual Survey and Area Resource File
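Linkage through a shared identifier amounts to a keyed merge of discharge records with a hospital-level file. The following is a minimal sketch under stated assumptions: the key name "hospital_id", the other field names, and the records are all hypothetical and do not reflect the actual NIS or American Hospital Association file layouts.

```python
def link_discharges_to_hospitals(discharges, hospitals, key="hospital_id"):
    """Attach hospital-level fields to each discharge record.

    Returns (linked, unmatched) so that failed links can be inspected
    rather than silently dropped.
    """
    lookup = {hospital[key]: hospital for hospital in hospitals}
    linked, unmatched = [], []
    for discharge in discharges:
        hospital = lookup.get(discharge[key])
        if hospital is None:
            unmatched.append(discharge)
            continue
        merged = dict(discharge)
        # Copy hospital fields other than the key onto the discharge record.
        merged.update({k: v for k, v in hospital.items() if k != key})
        linked.append(merged)
    return linked, unmatched


# Hypothetical records for illustration only.
discharges = [
    {"discharge_id": 1, "hospital_id": "H01", "primary_dx": "427.31"},
    {"discharge_id": 2, "hospital_id": "H02", "primary_dx": "428.0"},
    {"discharge_id": 3, "hospital_id": "H99", "primary_dx": "410.71"},
]
hospitals = [
    {"hospital_id": "H01", "teaching": True, "beds": 450},
    {"hospital_id": "H02", "teaching": False, "beds": 120},
]

linked, unmatched = link_discharges_to_hospitals(discharges, hospitals)
print(f"{len(linked)} linked, {len(unmatched)} unmatched")
```

Counting and reviewing the unmatched records is worth doing before any analysis, since incomplete linkage can change who is in the cohort.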
42Discharge Data (NIS)
- Relatively easy to access (DUA, $200/yr)
- Relatively easy to use
- Though need close attention to documentation
- Limited data elements
- Huge data files