Title: Appropriate Use of Statistics for Big Data Projects
1Musings by a Statistician
Laura Lee Johnson, PhD
September 11, 2008
NIH Biomedical Computing Interest Group
2Topics on My Mind
- Study and experimental design
- 20 people sending me the Wired (23 JUN 2008)
article on the obsolescence of the scientific method
- Large (vs. small?) data
- Repeating the same mistakes
- Variance and independent measurements
- Analyses and sample size
- It is just a pilot (the 4th pilot)
3May 2006 Talk at BCIG Outline
- What is your question?
- What is your design?
- What does your data look like?
- Lots of measures, few people
- Lots of people, lots of zeros/NAs
- No data, large parameter and value space
4Outline
- Lots of numbers, few people
- fMRI, Microarray, Proteomics
- Lots of people, lots of zeros/NAs
- sparse data
- Multidimensional, hierarchical
- No data, large parameter and value space
- Data farming
- The Sims, Project Albert
5The Key
- Leverage what you have
- Do not oversell it
- Add to it this: all data has a shape.
- Can you use that shape to answer a question or to
generate hypotheses to develop a study to test a
question of interest?
- Sometimes it cannot be done well
6What is the Question?
7Not that different today
- Petabytes are great
- But do not fool yourself
- It might fool whoever keeps demanding answers
from you
8Progression Treatment of Periodontal Disease
- Two studies
- Subject level
- Each tooth of each subject
- Multiple locations on each tooth of each subject
- Longitudinal structure of the study
- Make that studIES
- The data contain multiple levels of correlation
9Variation and Correlation
10Variance and Correlation Laboratory(ies)
11Variation and Correlation Data!
12Variation and Correlation
13Generating More Questions
14Not What You Want
15Can you hit the broad side of
- A barn?
- Did you paint the bull's-eye before or after you
shot?
- Hypothesis generation is great
- Do not forget that is what you did
- Do not forget 99.9% can be insufficient
- Do not forget 90% could be a winner
16Statistical Inference
- Inferences about a population are made on the
basis of results obtained from a sample drawn
from that population
- Want to talk about the larger population from
which the subjects are drawn, not the particular
subjects!
17Linear Regression
- Model for simple linear regression
- $Y_i = \beta_0 + \beta_1 x_{1i} + \varepsilon_i$
- $\beta_0$: intercept
- $\beta_1$: slope
- Assumptions
- Observations are independent
- Normally distributed with constant variance
- Hypothesis testing
- $H_0: \beta_1 = 0$ vs. $H_A: \beta_1 \neq 0$
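The simple linear regression model and its slope test can be sketched in a few lines. This is a minimal illustration, assuming ordinary least squares on made-up numbers; the function name `simple_linear_regression` and the data are hypothetical, not from the talk. The t statistic is only meaningful if the independence and constant-variance assumptions hold.

```python
import math

def simple_linear_regression(x, y):
    """Ordinary least squares fit of y = b0 + b1 * x.

    Returns (b0, b1, t_stat), where t_stat tests H0: b1 = 0
    on n - 2 degrees of freedom.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx                 # slope estimate
    b0 = mean_y - b1 * mean_x      # intercept estimate
    # Residual sum of squares; two parameters were estimated, hence n - 2.
    rss = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    se_b1 = math.sqrt(rss / (n - 2) / sxx)
    return b0, b1, b1 / se_b1

# Made-up illustrative data (an assumption, not study data).
x = [1, 2, 3, 4, 5, 6]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]
b0, b1, t_stat = simple_linear_regression(x, y)
print(f"intercept={b0:.3f}, slope={b1:.3f}, t={t_stat:.2f}")
```

Compare the t statistic to a t distribution with n - 2 degrees of freedom to get a p-value.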
18Stretching Assumptions: Independent in any
direction?
- fMRI, microarrays
- Voxels and genes associated with others
- Repeated measures on the same person, EVEN IF
DIFFERENT SAMPLES, should be considered
non-independent
- Are all outcomes measured with equivalent
sensitivity?
- Probably not
- Variances will not be the same
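One way to see why non-independence matters is the Kish design effect for equal-sized clusters, 1 + (m - 1) * ICC, where m is the number of measurements per person and ICC is the intraclass correlation. The sketch below is an illustration under that simplifying assumption (equal cluster sizes, a single exchangeable correlation); the function names and numbers are hypothetical.

```python
def design_effect(cluster_size, icc):
    """Kish design effect for equal-sized clusters: 1 + (m - 1) * ICC."""
    return 1 + (cluster_size - 1) * icc

def effective_sample_size(n_subjects, cluster_size, icc):
    """Number of independent observations the clustered data are worth."""
    total = n_subjects * cluster_size
    return total / design_effect(cluster_size, icc)

# Hypothetical setup: 10 subjects, 100 correlated measurements each.
for icc in (0.0, 0.1, 0.5, 1.0):
    n_eff = effective_sample_size(10, 100, icc)
    print(f"ICC={icc:.1f}: 1000 measurements ~ {n_eff:.1f} independent ones")
```

With perfectly correlated measurements the 1000 numbers carry no more information than the 10 people they came from.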
19Analyses Make More Datasets
- Permutation/randomization test
- Rearrange the current data → new dataset
- Calculate the test statistic
- Repeat many times
- Compare results to the original data's test
statistic
- Bootstrap
- Sample records with replacement → new dataset
- Rest is same as above
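The two resampling recipes above can be sketched with the standard library alone. This is a minimal illustration, assuming a difference in group means as the test statistic and a percentile interval as the bootstrap summary (one common choice among several); the function names and data are hypothetical.

```python
import random
from statistics import mean

def permutation_p_value(group_a, group_b, n_perm=5000, seed=0):
    """Two-sided permutation test for a difference in group means."""
    rng = random.Random(seed)
    observed = mean(group_a) - mean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # rearrange the current data -> new dataset
        stat = mean(pooled[:n_a]) - mean(pooled[n_a:])
        if abs(stat) >= abs(observed):
            extreme += 1
    # Add one to numerator and denominator so the p-value is never exactly 0.
    return (extreme + 1) / (n_perm + 1)

def bootstrap_ci(values, n_boot=5000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean."""
    rng = random.Random(seed)
    n = len(values)
    boot_means = sorted(
        mean(rng.choices(values, k=n))  # sample records with replacement
        for _ in range(n_boot)
    )
    lower = boot_means[int((alpha / 2) * n_boot)]
    upper = boot_means[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper

# Made-up illustrative data (an assumption, not study data).
a = [5.1, 4.9, 6.2, 5.8, 6.0, 5.5]
b = [4.2, 4.8, 4.1, 5.0, 4.4, 4.6]
print("permutation p-value:", permutation_p_value(a, b))
print("bootstrap 95% CI for mean of a:", bootstrap_ci(a))
```

Both functions resample individual records, which is only valid if the records are independent. With repeated measures, resample whole subjects instead.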
20What is your design?
21When do I Cringe?
- Small sample size
- Biased sample
- Convenience samples
- Median? Mean? Range? Standard deviation?
- Interviews: who did it, of whom, with what,
where, and how?
- Completely impractical in real-world setting
22Design
- Produces/uses data to answer the question(s)
- If the question needs hierarchical data to answer
it
- Needs a hierarchical design
- Needs to have a hierarchical data analysis
- Not sparse
23Causation
- Biological plausibility
- Temporal relationship
- Dose response
- Reproducibility
- Strength of association
- Coherence with established facts
- Specificity of association
24National Health Interview Survey
- State level stratification
- Black and Hispanic populations oversampled; in 2006
Asians were added to that
- Area frame based on previous decennial census;
this changes every 10 years
- Family
- One sample adult and one sample child (if
children under 18 present)
- Household, family, and person level files
25fMRI
- Lots of data/scan
- Voxels
- Not independent
- Correlation between voxels not uniform
26What is the question? What is the design?
- Inter and intra subject correlation
- Complex dimensions of brain features
- Supervised learning
- Activity patterns predictive of XXX
- Spatial resolution
27Power
- How likely are you to see a difference if there
is one there?
- Look for a big difference!
- More subjects, more runs, longer runs
- Improved signal to noise ratio
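The trade-off between the size of the difference, the noise, and the number of subjects can be made concrete with a normal-approximation power calculation. This is a rough sketch, assuming a two-sided two-sample z-test with equal group sizes and a common standard deviation; the function names are hypothetical, and a t-based calculation would ask for slightly more subjects.

```python
import math
from statistics import NormalDist

_Z = NormalDist()  # standard normal distribution

def two_sample_power(delta, sd, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample z-test.

    delta: true difference in means; sd: common standard deviation.
    """
    se = sd * math.sqrt(2 / n_per_group)
    z_crit = _Z.inv_cdf(1 - alpha / 2)
    shift = delta / se
    # Probability the test statistic lands in either rejection region.
    return _Z.cdf(shift - z_crit) + _Z.cdf(-shift - z_crit)

def n_per_group_for_power(delta, sd, power=0.80, alpha=0.05):
    """Smallest n per group reaching the target power (normal approximation)."""
    z_a = _Z.inv_cdf(1 - alpha / 2)
    z_b = _Z.inv_cdf(power)
    return math.ceil(2 * ((z_a + z_b) * sd / delta) ** 2)

# Bigger difference, less noise, or more subjects all raise power.
for n in (10, 20, 50):
    print(n, round(two_sample_power(delta=1.0, sd=1.0, n_per_group=n), 2))
print(n_per_group_for_power(delta=0.5, sd=1.0))
```

Improving the signal-to-noise ratio shrinks `sd` relative to `delta`, which has the same effect on power as adding subjects.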
28Who? Structure of Design
- Cross-sectional study design
- Compare groups at a single time point
- Pre-post or longitudinal designs
- Look at how one group changes during an
intervention compared to another group
- If the design lets you stay simple, do
- Add hierarchy onto all this? If you need to
29Data Sweet Data
- Thank you, Chris Anderson, and everyone who sent
me that article. I already said it.
30Terabyte, meet Petabyte
- Need to turn one huge matrix, all at once
- MATLAB is your friend (say some)
- Random effects (linear mixed) models are your
friends
- 64 x 64 voxels x 10 slices per image
- Or 128 x 128, or 15 or 20 slices
- At least 10 people/group
- Preferably more
- Do you have any confounders to adjust for?
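The image dimensions on this slide translate into matrix sizes with simple arithmetic. The sketch below assumes one image per subject and two groups of 10 (so 20 subjects), which is an illustrative assumption; real studies have many images per subject, so the true matrix is far larger.

```python
def voxels_per_image(rows, cols, slices):
    """Number of voxels in one image volume."""
    return rows * cols * slices

n_subjects = 20  # assumption: two groups of 10, one image each
for rows, cols, slices in [(64, 64, 10), (128, 128, 15), (128, 128, 20)]:
    n_vox = voxels_per_image(rows, cols, slices)
    print(f"{rows}x{cols}x{slices}: {n_vox:,} voxels per image, "
          f"{n_vox * n_subjects:,} values across {n_subjects} subjects")
```

Even the smallest grid has tens of thousands of columns for a handful of rows, which is the "lots of numbers, few people" shape.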
31HAHAH 10 per group!
- Yeah, I said it, which was a high number back
then
- I thought it was too low, but everyone thought 10
was too high
- So what should the number be?
- 10, 12, 15, 36, 50, 100, i2b2
32What is the measure?
- ADNI had a nice presentation July 2008
- Formal comparisons of MRI, PET summary measures,
association of MRI, PET with cognitive change
- Can sample sizes of 400-1000 or 800-8500 per
study arm/group be reduced?
- In normals do the measures help us? Can we look
earlier into the disease process?
33Missing Data
- Data we know is missing
- Imputation
- Are the zeros real?
- Categorize continuous responses
- Data that might exist
- Not measured on anyone
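Whether zeros are real measurements or stand-ins for missing values changes even the simplest summary. The sketch below is a small illustration with made-up numbers, using `None` for values known to be missing; the function name `summarize` and the data are hypothetical.

```python
from statistics import mean

def summarize(values):
    """Summarize a measure under two readings of its zeros.

    None marks data we know is missing; 0 may be real or a placeholder.
    """
    observed = [v for v in values if v is not None]
    nonzero = [v for v in observed if v != 0]
    return {
        "n_total": len(values),
        "n_missing": len(values) - len(observed),
        "n_zero": len(observed) - len(nonzero),
        "mean_if_zeros_real": mean(observed),
        "mean_if_zeros_missing": mean(nonzero),
    }

# Made-up illustrative data (an assumption, not study data).
readings = [0, 0, 3.0, None, 5.0, 0, 4.0, None]
for key, value in summarize(readings).items():
    print(key, value)
```

If the two means differ a lot, the answer to "are the zeros real?" has to come from how the data were collected, not from the analysis.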
34How Clean is the Data?
- Measurement error
- Sample quality
- Laboratory quality assurance
- Consistency
- Are time and condition confounded?
- It might be that a numerical answer should not be
sought at this time
- Pattern recognition
35Class Comparison: Things to be/not to be worried
about
- Bias
- Systematic
- Not systematic
- Replication
- One array per specimen
- More?
36Class Comparison
- Time of day/temperature (if known)
- Serum sample age, preservation, storage
- Uniformity of sample collection
- Change in machine or technician
- May hide the truth
- May be the truth
- Try to clean up the noise
- Overprocess the data
37Class Comparison
- Fixable but not regression
- Computer buffer
- Machine problems: power cords and other
electrical interference
- Not fixable
- Matrix artifacts
- Fingerprints on scanned chip
38QC/QA that will help?
- Take samples at the same time under the same
conditions
- RCT or observational study
- If you cannot collect samples in the same way
- Differences may come from collection, not the
class difference you are hoping to measure
- Will your findings hold up to other data sets?
39QC/QA that will help?
- Randomize specimens into the lab!
- Do not process all controls, then all cases
- If there is a non-biological artifact lurking,
randomization of samples might help keep it
from hurting your outcome
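Randomizing specimens into the lab amounts to shuffling the processing order so that case/control status is not tied to run time, machine, or technician. This is a minimal sketch; the function name and specimen IDs are hypothetical, and a real lab might also block the randomization by batch or plate.

```python
import random

def randomized_run_order(case_ids, control_ids, seed=None):
    """Return specimens in a random processing order.

    Each item is (specimen_id, group). Record the seed so the
    order can be reproduced and audited later.
    """
    rng = random.Random(seed)
    specimens = [(sid, "case") for sid in case_ids]
    specimens += [(sid, "control") for sid in control_ids]
    rng.shuffle(specimens)  # avoid "all controls, then all cases"
    return specimens

# Hypothetical specimen IDs.
cases = ["C01", "C02", "C03", "C04"]
controls = ["K01", "K02", "K03", "K04"]
for position, (sid, group) in enumerate(randomized_run_order(cases, controls, seed=2008), 1):
    print(position, sid, group)
```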
40It is all about the N
- If you cannot enroll 10 per group and analyze
that data in some manner
- Why should I think you could get 50 or 600 per
group?
- As many questions as you ask
- Google and Amazon likely know more and better
data on a person than the research participant
himself or herself
- Maybe
41Call it a Pattern or a Model
- Figure something out using analytic tools
- Call it anything you please
- I think most are saying we do not need
theoretical models, or preconceived models
- Although generally you need some idea of what
data you are looking at, general as that may be
42Good Models from Good Data
- Wide range of people contributing data to the
model
- Training set, test set, validation set
- Validation set
- Locked away
- Preferably from another lab
- Run through the final model, get the calls, give
them to a person who has the truth, and see
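When a validation set from another lab is not available, the next best thing is to set one aside before any modeling starts. The sketch below splits at the subject level so that no person contributes to more than one set; the function name and the 60/20/20 fractions are illustrative assumptions, not a recommendation from the talk.

```python
import random

def three_way_split(subject_ids, frac_train=0.6, frac_test=0.2, seed=0):
    """Split subjects into training, test, and validation sets.

    Splitting by subject (not by measurement) keeps correlated
    measurements from the same person inside a single set.
    """
    rng = random.Random(seed)
    ids = list(subject_ids)
    rng.shuffle(ids)
    n_train = int(len(ids) * frac_train)
    n_test = int(len(ids) * frac_test)
    train = ids[:n_train]
    test = ids[n_train:n_train + n_test]
    validation = ids[n_train + n_test:]  # lock this away until the model is final
    return train, test, validation

# Hypothetical subject IDs.
train, test, validation = three_way_split(range(100))
print(len(train), len(test), len(validation))
```

The validation set is used once, on the final model, and the calls are scored by someone who holds the truth.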
43Analysis follows Design
- Questions → Hypotheses →
- Experimental Design → Samples →
- Data → Analyses → Conclusions
44But Google Tells Me I Do Not Need Models or
Hypotheses or
- Yeah, you need applied math
- Guess what stats is
- You need better data with better analytical tools
- You can track and measure what people do with
unprecedented fidelity, if you have the data.
Good data.
45It Works
- Business does this all the time
- Analytic tools tell you something; you try it
- It fails, you realize it, and you change quickly
- If I give a drug to all type II diabetics
- Hard to change course when 40 of them die
- At least for those 40
- Numbers speak for themselves when you have enough
of them
- Hypothesis testing does not miss penicillin.
When you need an antibiotic.
46And PS
- If you are only looking for correlation you are
probably wrong
- Associations, associations, associations
- May not be linear
- May not be bivariate
- Some things are simple, though
- But that is not why most people want big data
- Even though they tend to analyze it that way
47I Agree
- We can stop looking for models. We can analyze
the data without hypotheses about what it might
show. We can throw the numbers into the biggest
computing clusters the world has ever seen and
let statistical algorithms find patterns where
science cannot.
- And then we have to figure out what to do with
them. You can make some good guesses.
48Analysis follows Design
- Questions → Hypotheses →
- Experimental Design → Samples →
- Data → Analyses → Conclusions
49Step Forward
- Replication is finding the pattern again
- Science perhaps is figuring out why we care about
the pattern
- Translational research (medical or otherwise) is
figuring out how the use of this pattern can
change the health and well-being of people (or
animals, the planet, or whatever else you care to
look at)
50Lies, Damn Lies, and Statistics!
- Go astray from the statistical and study design
assumptions
- Same applies for analytic tools and
interpretations made from their use
- Impede accurate interpretation of resulting data
- Simulations, permutations, bootstrap
- Outliers are interesting, not a pain
- Many, many runs are needed
51Goals
- How can you ensure the numerical processing of
your data does not hurt the interpretability of
its final outcome?
- Big-data projects
- fMRI
- Proteomics
- Microarray
52Just fix it
- Statistician ≠ Miracle Worker
- If the new latest greatest technology provides
data with serious numerical bias
- Statistician's job may involve working with bias
BUT
- New better machines with less measurement error
and bias should be built
53Class Comparison Mistakes: Interactions and
Covariance
- What goes into the models?
- Regression models
- Linear Mixed models
- Variance structure is not simple
- Variance structure is not known
- Big complex data - big complex structure
- Explain in 1 sentence in the methods section
54Prediction Mistakes: Eyes on the Prize
- Prediction itself to remain stable
- Do not care about the features that get you to
the prediction remaining the same across various
prediction models
- Are you looking for a clinically useful model?
Is this a step in trying to find a simple test?
- In 15 years running the same lab method with
specimens or something else?
55Class Discovery
- Do not have pre-defined classes
- Unsupervised
- Objective: discovering new groupings or
taxonomies of specimens based on expression
profiles; discovering classes of co-expressed
and co-regulated genes
56Class Discovery Mistakes: If you look you will
find something
- Cluster analyses lead to a cluster structure
- What clusters do you believe?
- Where do you stop?
- Same data, different algorithms, different
structure
- Texas Sharpshooter Fallacy
- Reproducibility?
- Data perturbation methods
- Estimate of clusters
57Data Summary Remarks
- Enormous chances for spurious findings
- Knowing everything about 1 specimen, or 3
- Do not hold all the answers
- But sometimes you hit gold
- Validation on larger independent collections of
specimens is essential
- Yes, business does this too
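The chance of spurious findings is easy to demonstrate by simulation: generate many features with no real group difference and count how many still look "significant". This is a sketch under stated assumptions (1000 independent normal features, 10 subjects per group, an equal-variance two-sample t-test at the two-sided 0.05 level with 18 degrees of freedom); the function name is hypothetical.

```python
import random
from statistics import mean, variance

def count_null_hits(n_features=1000, n_per_group=10, t_crit=2.101, seed=1):
    """Count features with |t| > t_crit when NO real difference exists.

    t_crit = 2.101 is the two-sided 5% critical value for 18 degrees
    of freedom, matching the default 10 subjects per group.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_features):
        a = [rng.gauss(0, 1) for _ in range(n_per_group)]
        b = [rng.gauss(0, 1) for _ in range(n_per_group)]
        pooled_var = (variance(a) + variance(b)) / 2  # equal group sizes
        se = (pooled_var * 2 / n_per_group) ** 0.5
        t = (mean(a) - mean(b)) / se
        if abs(t) > t_crit:
            hits += 1
    return hits

hits = count_null_hits()
print(f"{hits} of 1000 pure-noise features pass p < 0.05")
```

On average about 5% of the noise features pass, which is why validation on independent specimens matters more than the size of the hit list.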
58Let's say the data is perfect
- New technologies → many, many different measures
from the same sample
- Measures, even if of different items, may be
associated inside the same person
- Known or unknown pathways or associations
- Variance and covariance
- Ignore it or use it
59System
- Almost everything is part of a system
- Evaluating more complex systems
- Not easy, not common
- If you care about the system
- Ask the question that way
- Design the study that way
- Collect enough data to analyze it that way
- Simple is OK: answer YOUR question
60Outline
- What is your question?
- What is your design?
- What does your data look like?
- Lots of measures, few subjects
- Lots of subjects, lots of zeros/NAs
- No data, large parameter and value space
61Topics on My Mind
- Study and experimental design
- 20 people sending me the Wired (23 JUN 2008)
article on the obsolescence of the scientific method
- Large (vs. small?) data
- Repeating the same mistakes
- Variance and independent measurements
- Analyses and sample size
- It is just a pilot (the 4th pilot)
62What is the question?
- If you make an inference, do real inference
- What is the difference of interest?
- What is compared to what?
- A difference or activation or down regulation
means TWO or more items were compared. What is
the basis of the comparison?
- What is the baseline or control condition and how
many are there?
63Summary Remarks
- Analysis tools cannot compensate for poorly
designed experiments or studies
- Fancy analysis tools do not necessarily
outperform simple ones
- Even the best analysis tools, if applied
inappropriately, can produce incorrect or
misleading results
64Summary Remarks
- Have a statistician who has thought about high
dimensional data help design the study and
experiment, do the analysis, and interpret the
results
- Not the only person, but an often forgotten one
- Skilled programmers who understand the question
- Good computing support, space, and speed
65Question → Design → Analysis
- Garbage in
- Miracle occurs
- All our fault
- The postdoc worked very hard
- We know already! Garbage out.
- If you really do, then stop
- Use correlation?
- Stop talking about testing correlations. That is
bogus.