Appropriate Use of Statistics for BigData Projects


Transcript and Presenter's Notes

Title: Appropriate Use of Statistics for BigData Projects

1
Musings by a Statistician
Laura Lee Johnson, PhD September 11, 2008 NIH
Biomedical Computing Interest Group
2
Topics on My Mind

Study and experimental design
20 people sending me the Wired (23 JUN 2008)
article on the obsolescence of the scientific method
Large (vs. small?) data
Repeating the same mistakes
Variance and independent measurements
Analyses and sample size
It is just a pilot (the 4th pilot)

3
May 2006 Talk at BCIG Outline

What is your question?
What is your design?
What does your data look like?
Lots of measures, few people
Lots of people, lots of zeros/NAs
No data, large parameter and value space

4
Outline

Lots of numbers, few people
fMRI, Microarray, Proteomics
Lots of people, lots of zeros/NAs
sparse data
Multidimensional, hierarchical
No data, large parameter and value space
Data farming
The Sims, Project Albert

5
The Key

Leverage what you have
Do not oversell it
Add to it this: all data has a shape.
Can you use that shape to answer a question or to
generate hypotheses to develop a study to test a
question of interest?
Sometimes it cannot be done well

6
What is the Question?
7
Not that different today

Petabytes are great
But do not fool yourself
It might fool whoever keeps demanding answers
from you

8
Progression and Treatment of Periodontal Disease

Two studies
Subject level
Each tooth of each subject
Multiple locations on each tooth of each subject
Longitudinal structure of the study
Make that studIES
The data contain multiple levels of correlation

9
Variation and Correlation
10
Variance and Correlation: Laboratory(ies)
11
Variation and Correlation: Data!
12
Variation and Correlation
13
Generating More Questions
14
Not What You Want
15
Can you hit the broad side of

A barn?
Did you paint the bull's-eye before or after you
shot?
Hypothesis generation is great
Do not forget that is what you did
Do not forget 99.9% can be insufficient
Do not forget 90% could be a winner

16
Statistical Inference

Inferences about a population are made on the
basis of results obtained from a sample drawn
from that population
Want to talk about the larger population from
which the subjects are drawn, not the particular
subjects!

17
Linear Regression

Model for simple linear regression
$Y_i = \beta_0 + \beta_1 x_{1i} + \varepsilon_i$
$\beta_0$: intercept
$\beta_1$: slope
Assumptions
Observations are independent
Normally distributed with constant variance
Hypothesis testing
$H_0: \beta_1 = 0$ vs. $H_A: \beta_1 \neq 0$
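As an illustration of the model and test on this slide, here is a minimal sketch of a least-squares fit and the t statistic for the slope. The function name `simple_linear_regression` and the simulated data are assumptions made for this example; they are not from the talk. The t statistic is only meaningful if the independence and constant-variance assumptions listed above hold.

```python
import math
import random


def simple_linear_regression(x, y):
    """Least-squares fit of y = b0 + b1 * x.

    Returns (b0, b1, se_b1, t_stat), where t_stat tests H0: slope = 0
    and has n - 2 degrees of freedom under the model assumptions.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    # Residual variance is estimated with n - 2 degrees of freedom.
    rss = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    se_b1 = math.sqrt(rss / (n - 2) / sxx)
    t_stat = b1 / se_b1
    return b0, b1, se_b1, t_stat


# Hypothetical data: true intercept 2.0, true slope 0.5, independent noise.
rng = random.Random(0)
x_sim = [i / 10 for i in range(50)]
y_sim = [2.0 + 0.5 * xi + rng.gauss(0, 1) for xi in x_sim]
b0_hat, b1_hat, se_b1_hat, t_slope = simple_linear_regression(x_sim, y_sim)
print(f"intercept={b0_hat:.3f} slope={b1_hat:.3f} se={se_b1_hat:.3f} t={t_slope:.2f}")
```

Compare the t statistic with a t distribution on n - 2 degrees of freedom to get a p-value for the slope.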

18
Stretching Assumptions: Independent in any
direction?

fMRI, microarrays
Voxels and genes associated with others
Repeated measures on the same person EVEN IF
DIFFERENT SAMPLES should be considered
non-independent
Are all outcomes measured with equivalent
sensitivity?
Probably not
Variances will not be the same
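To make the cost of non-independence concrete, the sketch below uses the standard design-effect formula for equal-sized clusters, 1 + (m - 1) * ICC, to turn "many measures on few people" into an effective sample size. It assumes every pair of measures within a person shares the same correlation, which is a simplification (the talk itself notes that correlation between voxels is not uniform). The function name and the numbers are hypothetical.

```python
def effective_sample_size(n_subjects, measures_per_subject, icc):
    """Approximate number of independent observations.

    Assumes equal cluster sizes and one common within-subject
    correlation (icc) between any two measures on the same person.
    """
    design_effect = 1 + (measures_per_subject - 1) * icc
    return n_subjects * measures_per_subject / design_effect


# Hypothetical study: 10 people, 100 correlated measures each.
for icc in (0.0, 0.1, 0.5, 1.0):
    n_eff = effective_sample_size(10, 100, icc)
    print(f"ICC={icc:.1f}: 1000 raw measures behave like {n_eff:.1f} independent ones")
```

With no correlation the 1000 measures count in full; with perfect correlation they count as 10, the number of people.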

19
Analyses Make More Datasets

Permutation/randomization test
Rearrange the current data → new dataset
Calculate the test statistic
Repeat many times
Compare results to the original data's test
statistic
Bootstrap
Sample records with replacement → new dataset
Rest is same as above
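The two resampling recipes above can be sketched as follows for a difference in group means. This is one possible reading, not the speaker's code: the function names, the choice of test statistic, and the example values are assumptions. The bootstrap version reports a percentile interval rather than a comparison with the original statistic, which is a common variant.

```python
import random
from statistics import mean


def mean_difference(a, b):
    return mean(a) - mean(b)


def permutation_test(a, b, n_perm=5000, seed=0):
    """Two-sided permutation p-value for a difference in group means."""
    rng = random.Random(seed)
    observed = mean_difference(a, b)
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(n_perm):
        # Rearranging the pooled data gives one new dataset under H0.
        rng.shuffle(pooled)
        stat = mean_difference(pooled[:len(a)], pooled[len(a):])
        if abs(stat) >= abs(observed):
            extreme += 1
    # The add-one correction keeps the p-value from being exactly zero.
    return (extreme + 1) / (n_perm + 1)


def bootstrap_ci(a, b, n_boot=5000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the difference in group means."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_boot):
        # Sampling records with replacement gives one new dataset.
        resampled_a = rng.choices(a, k=len(a))
        resampled_b = rng.choices(b, k=len(b))
        stats.append(mean_difference(resampled_a, resampled_b))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Hypothetical measurements, one independent record per person.
group_a = [10.1, 9.8, 10.5, 10.2, 9.9, 10.4, 10.0, 10.3]
group_b = [12.0, 11.7, 12.3, 11.9, 12.2, 12.1, 11.8, 12.4]
print("permutation p-value:", permutation_test(group_a, group_b))
print("bootstrap 95% interval:", bootstrap_ci(group_a, group_b))
```

Both functions treat each record as independent. With repeated measures on the same person, the unit to shuffle or resample would be the person, not the individual record.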

20
What is your design?
21
When do I Cringe?

Small sample size
Biased sample
Convenience samples
Median? Mean? Range? Standard deviation?
Interviews: who did it, of whom, with what,
where, and how?
Completely impractical in real world setting

22
Design

Produces/uses data to answer the question(s)
If the question needs hierarchical data to answer
it
Needs a hierarchical design
Needs to have a hierarchical data analysis
Not sparse

23
Causation

Biological plausibility
Temporal relationship
Dose response
Reproducibility
Strength of association
Coherence with established facts
Specificity of association

24
National Health Interview Survey

State level stratification
Black and Hispanic populations oversampled; 2006
added Asians to that
Area frame based on previous decennial census;
this changes every 10 years
Family
One sample adult and one sample child (if
children under 18 present)
Household, family, and person level files

25
fMRI

Lots of data/scan
Voxels
Not independent
Correlation between voxels not uniform

26
What is the question? What is the design?

Inter and intra subject correlation
Complex dimensions of brain features
Supervised learning
Activity patterns predictive of XXX
Spatial resolution

27
Power

How likely are you to see a difference if there
is one there?
Look for a big difference!
More subjects, more runs, longer runs
Improved signal to noise ratio
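A rough way to see how these levers act on power is the normal approximation for comparing two group means. The sketch below assumes a known common standard deviation and independent subjects, and it ignores the negligible chance of rejecting in the wrong direction; `approx_power` and its arguments are names invented for this example.

```python
from statistics import NormalDist


def approx_power(delta, sigma, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample z test.

    delta: true difference in means; sigma: common standard deviation.
    """
    z = NormalDist()
    standard_error = sigma * (2 / n_per_group) ** 0.5
    z_crit = z.inv_cdf(1 - alpha / 2)
    return z.cdf(abs(delta) / standard_error - z_crit)


# Bigger difference, more subjects, or less noise all raise power.
print("baseline          :", round(approx_power(1.0, 1.0, 10), 2))
print("bigger difference :", round(approx_power(2.0, 1.0, 10), 2))
print("more subjects     :", round(approx_power(1.0, 1.0, 50), 2))
print("less noise        :", round(approx_power(1.0, 0.5, 10), 2))
```

More or longer runs help through the same route as less noise: they shrink the standard error of each subject's summary measure.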

28
Who? Structure of Design

Cross-sectional study design
Compare groups at a single time point
Pre-post or longitudinal designs
Look at how one group changes during an
intervention compared to another group
If the design lets you stay simple, do
Add hierarchy onto all this? If you need to

29
Data Sweet Data

Thank you, Chris Anderson, and everyone who sent
me that article. I already said it.

30
Terabyte, meet Petabyte

Need to turn one huge matrix, all at once
MATLAB is your friend (say some)
Random effects (linear mixed) models are your
friends
64 x 64 voxels x 10 slices per image
Or 128x128 or 15 or 20 slices
At least 10 people/group
Preferably more
Do you have any confounders to adjust for?
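For a sense of scale, the arithmetic behind those image sizes is shown below. The two-group design with 10 people per group comes from the slide; the helper name and the pairing of 128 x 128 with 20 slices are assumptions for illustration.

```python
def measures_per_image(rows, cols, slices):
    """Number of voxels in one image volume."""
    return rows * cols * slices


subjects = 2 * 10  # two groups of 10 people
small = measures_per_image(64, 64, 10)
large = measures_per_image(128, 128, 20)
print(f"64 x 64 x 10   -> {small} voxels, {small // subjects} per subject enrolled")
print(f"128 x 128 x 20 -> {large} voxels, {large // subjects} per subject enrolled")
```

Either way, the number of measures per image is thousands of times the number of people, which is the "lots of measures, few people" situation.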

31
HAHAH 10 per group!

Yeah, I said it, which was a high number back
then
I thought it was too low, but everyone thought 10
was too high
So what should the number be?
10, 12, 15, 36, 50, 100, i2b2

32
What is the measure?

ADNI had a nice presentation July 2008
Formal comparisons of MRI, PET summary measures,
association of MRI, PET with cognitive change
Can sample sizes of 400-1000 or 800-8500 per
study arm/group be reduced?
In normals do the measures help us? Can we look
earlier into the disease process?

33
Missing Data

Data we know is missing
Imputation
Are the zeros real?
Categorize continuous responses
Data that might exist
Not measured on anyone

34
How Clean is the Data?

Measurement error
Sample quality
Laboratory quality assurance
Consistency
Are time and condition confounded?
It might be that a numerical answer should not be
sought at this time
Pattern recognition

35
Class Comparison: Things to be/not to be worried
about

Bias
Systematic
Not systematic
Replication
One array per specimen
More?

36
Class Comparison

Time of day/temperature (if known)
Serum sample age, preservation, storage
Uniformity of sample collection
Change in machine or technician
May hide the truth
May be the truth
Try to clean up the noise
Overprocess the data

37
Class Comparison

Fixable but not regression
Computer buffer
Machine problems: power cords and other
electrical interference
Not fixable
Matrix artifacts
Fingerprints on scanned chip

38
QC/QA that will help?

Take samples at the same time under the same
conditions
RCT or observational study
If you cannot collect samples in the same way
Differences may come from collection, not the
class difference you are hoping to measure
Will your findings hold up to other data sets?

39
QC/QA that will help?

Randomize specimens into the lab!
Do not process all controls, then all cases
If there is a non-biological artifact lurking,
randomization of samples might help keep it from
hurting your outcome
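Randomizing the order in which specimens are processed takes only a few lines. The sketch below is an assumption about how one might do it, with made-up specimen labels; a fixed seed is used so that the run order can be documented and reproduced.

```python
import random


def randomized_run_order(specimen_ids, seed=None):
    """Return the specimens in a random processing order."""
    rng = random.Random(seed)
    order = list(specimen_ids)
    rng.shuffle(order)
    return order


# Hypothetical specimen labels.
cases = [f"case_{i:02d}" for i in range(1, 7)]
controls = [f"control_{i:02d}" for i in range(1, 7)]

# Cases and controls are interleaved at random instead of run in blocks.
run_order = randomized_run_order(cases + controls, seed=2008)
for position, specimen in enumerate(run_order, start=1):
    print(position, specimen)
```

If the machine drifts or a technician changes partway through, the artifact is then spread across both classes rather than lined up with the case/control difference.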

40
It is all about the N

If you cannot enroll 10 per group and analyze
that data in some manner
Why should I think you could get 50 or 600 per
group?
As many questions as you ask
Google and Amazon likely have more and better
data on a person than the research participant
himself or herself
Maybe

41
Call it a Pattern or a Model

Figure something out using analytic tools
Call it anything you please
I think most are saying we do not need
theoretical models, or preconceived models
Although generally you need some idea of what
data you are looking at, general as that may be

42
Good Models from Good Data

Wide range of people contributing data to the
model
Training set, test set, validation set
Validation set
Locked away
Preferably from another lab
Run through the final model, get the calls, give
to a person who has the truth and see
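One way to set up the three sets is to split by person, so that no individual contributes data to more than one set. The sketch below is illustrative only: the function name, the 60/20/20 split, and the subject labels are assumptions, and a validation set from another lab would be better than any split of a single dataset.

```python
import random


def split_subjects(subject_ids, fractions=(0.6, 0.2, 0.2), seed=0):
    """Assign whole subjects to training, test, and validation sets."""
    rng = random.Random(seed)
    ids = sorted(set(subject_ids))
    rng.shuffle(ids)
    n_train = round(fractions[0] * len(ids))
    n_test = round(fractions[1] * len(ids))
    return {
        "train": ids[:n_train],
        "test": ids[n_train:n_train + n_test],
        # Locked away: only touched once, with the final model.
        "validation": ids[n_train + n_test:],
    }


# Hypothetical subject identifiers.
subject_labels = [f"subj_{i:03d}" for i in range(50)]
subject_split = split_subjects(subject_labels)
print({name: len(members) for name, members in subject_split.items()})
```

Splitting on subjects rather than on individual records keeps correlated measures from the same person out of both the model building and its evaluation.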

43
Analysis follows Design

Questions → Hypotheses →
Experimental Design → Samples →
Data → Analyses → Conclusions

44
But Google Tells Me I Do Not Need Models or
Hypotheses or

Yeah, you need applied math
Guess what stats is
You need better data with better analytical tools
You can track and measure what people do with
unprecedented fidelity, if you have the data.
Good data.

45
It Works

Business does this all the time
Analytic tools tell you something; you try it
It fails, you realize it, and you change quickly
If I give a drug to all type II diabetics
Hard to change course when 40 of them die
At least for those 40
Numbers speak for themselves when you have enough
of them
Hypothesis testing does not miss penicillin.
When you need an antibiotic.

46
And PS

If you are only looking for correlation you are
probably wrong
Associations, associations, associations
May not be linear
May not be bivariate
Some things are simple, though
But that is not why most people want big data
Even though they tend to analyze it that way

47
I Agree

We can stop looking for models. We can analyze
the data without hypotheses about what it might
show. We can throw the numbers into the biggest
computing clusters the world has ever seen and
let statistical algorithms find patterns where
science cannot.
And then we have to figure out what to do with
them. You can make some good guesses.

48
Analysis follows Design

Questions → Hypotheses →
Experimental Design → Samples →
Data → Analyses → Conclusions

49
Step Forward

Replication is finding the pattern again
Science perhaps is figuring out why we care about
the pattern
Translational research (medical or otherwise) is
figuring out how the use of this pattern can
change the health and well being of people (or
animals, the planet, or whatever else you care to
look at)

50
Lies, Damn Lies, and Statistics!

Go astray from the statistical and study design
assumptions
Same applies for analytic tools and
interpretations made from the use
Impede accurate interpretation of resulting data
Simulations, permutations, bootstrap
Outliers are interesting, not a pain
Many, many runs are needed

51
Goals

How can you ensure the numerical processing of
your data does not hurt the interpretability of
its final outcome?
Big-data projects
fMRI
Proteomics
Microarray

52
Just fix it

Statistician ≠ Miracle Worker
If the new latest greatest technology provides
data with serious numerical bias
Statistician's job may involve working with bias
BUT
New better machines with less measurement error
and bias should be built

53
Class Comparison Mistakes: Interactions and
Covariance

What goes into the models?
Regression models
Linear Mixed models
Variance structure is not simple
Variance structure is not known
Big complex data - big complex structure
Explain in 1 sentence in the methods section

54
Prediction Mistakes: Eyes on the Prize

Want the prediction itself to remain stable
Do not care whether the features that get you to
the prediction remain the same across various
prediction models
Are you looking for a clinically useful model?
Is this a step in trying to find a simple test?
In 15 years running the same lab method with
specimens or something else?

55
Class Discovery

Do not have pre-defined classes
Unsupervised
Objective: discovering new groupings or
taxonomies of specimens based on expression
profiles. Discovering classes of co-expressed
and co-regulated genes

56
Class Discovery Mistakes: If you look you will
find something

Cluster analyses lead to a cluster structure
What clusters do you believe?
Where do you stop?
Same data-different algorithms-different
structure
Texas Sharpshooter Fallacy
Reproducibility?
Data perturbation methods
Estimate of clusters

57
Data Summary Remarks

Enormous chances for spurious findings
Knowing everything about 1 specimen, or 3
Do not hold all the answers
But sometimes you hit gold
Validation on larger independent collections of
specimens is essential
Yes, business does this too
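How large the chance of spurious findings is can be seen with a small simulation in which no feature truly differs between the groups. Everything in this sketch is an assumption for illustration: the number of features, the 10 per group, and the use of a z test with a known standard deviation of 1 so that the p-values are exact under the null. Features are simulated as independent, which real voxels and genes are not.

```python
import random
from statistics import NormalDist


def null_feature_p_values(n_features, n_per_group, seed=0):
    """Two-sided z-test p-values for features with no true group difference."""
    rng = random.Random(seed)
    z = NormalDist()
    standard_error = (2 / n_per_group) ** 0.5  # known sigma = 1 in both groups
    p_values = []
    for _ in range(n_features):
        group_a = [rng.gauss(0, 1) for _ in range(n_per_group)]
        group_b = [rng.gauss(0, 1) for _ in range(n_per_group)]
        diff = sum(group_a) / n_per_group - sum(group_b) / n_per_group
        z_stat = diff / standard_error
        p_values.append(2 * (1 - z.cdf(abs(z_stat))))
    return p_values


p_values = null_feature_p_values(n_features=2000, n_per_group=10)
naive_hits = sum(p < 0.05 for p in p_values)
bonferroni_hits = sum(p < 0.05 / len(p_values) for p in p_values)
print("features 'significant' at 0.05:", naive_hits)
print("features passing a Bonferroni threshold:", bonferroni_hits)
```

Roughly 5% of the null features are expected to pass an unadjusted 0.05 threshold, which is why validation on independent specimens matters more than the size of the initial list.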

58
Let's say the data is perfect

New technologies → many many different measures
from the same sample
Measures, even if of different items, may be
associated inside the same person
Known or unknown pathways or associations
Variance and covariance
Ignore it or use it

59
System

Almost everything is part of a system
Evaluating more complex systems
Not easy, not common
If you care about the system
Ask the question that way
Design the study that way
Collect enough data to analyze it that way
Simple is OK: answer YOUR question

60
Outline

What is your question?
What is your design?
What does your data look like?
Lots of measures, few subjects
Lots of subjects, lots of zeros/NAs
No data, large parameter and value space

61
Topics on My Mind

Study and experimental design
20 people sending me the Wired (23 JUN 2008)
article on the obsolescence of the scientific method
Large (vs. small?) data
Repeating the same mistakes
Variance and independent measurements
Analyses and sample size
It is just a pilot (the 4th pilot)

62
What is the question?

If you make an inference, do real inference
What is the difference of interest?
What is compared to what?
A difference or activation or down regulation
means TWO or more items were compared. What is
the basis of the comparison?
What is the baseline or control condition and how
many are there?

63
Summary Remarks

Analysis tools cannot compensate for poorly
designed experiments or studies
Fancy analysis tools do not necessarily
outperform simple ones
Even the best analysis tools, if applied
inappropriately, can produce incorrect or
misleading results

64
Summary Remarks

Have a statistician who has thought about high
dimensional data help design the study and
experiment, do the analysis, and interpret the
results
Not the only person, but an often forgotten one
Skilled programmers who understand the question
Good computing support, space, and speed

65
Question → Design → Analysis

Garbage in
Miracle occurs
All our fault
The postdoc worked very hard
We know already! Garbage out.
If you really do, then stop
Use correlation?
Stop talking about testing correlations. That is
bogus.
