Title: Statistics: Science of Data
1 Statistics: Science of Data
- Xuming He, Department of Statistics, University of Illinois
2 Definitions of Statistics
- A branch of applied mathematics concerned with the collection and interpretation of quantitative data and the use of probability theory to estimate population parameters
- Statistics is the science and practice of developing knowledge through the use of empirical data expressed in quantitative form.
3 Traditional Core
- Design of experiments
- Probability-based models
- Parameter estimation (MLE)
- Hypothesis testing (Neyman-Pearson)
- Large sample approximations
- Decision theory
4 Statistics today
- What is new?
- Data deluge
- Increasing awareness of statistical analysis by other researchers
- Core statistics challenged
- NIH funding changed the way we do research?
- Jobs in medicine, finance, marketing, bioinformatics, ...
5 Changing landscape
- 20 years ago
  - Finding optimal procedures under strict models/assumptions
  - Proving mathematical theorems
- Today
  - Finding working procedures under flexible models
  - Running computer simulations
6 Statistics: a Vision for the Future (David Donoho)
- Statistics research driven by innovation in data
  - Gene expression arrays
  - Diffusion tensor imaging
  - Sloan Digital Sky Survey
  - Laser-scanned 3-D imagery
  - Biometry data (faces)
  - Network traffic data
7 How Statistics Operates
- People in the field of origin of the data take a crack at it using crude tools
- Statisticians provide theory, methodology, computing
- Applications in other fields that produce similar data structures
8 Statistics Theory Dispersed to Other Disciplines
- 500 recent citations of Efron's bootstrap paper, by field
  - Geosciences (31)
  - Biology (55)
  - Medicine (71)
  - Agriculture (11)
  - Education (5)
  - Mathematics (4)
  - Statistics (151)
  - Physical Sci. (24)
  - SBE (94)
  - Engineering (27)
  - Environmental (19)
  - Comp. Sci. (7)
  - Polar Research (1)
9 Statistics Theory Inspired by Some Disciplines and Dispersed to Other Disciplines
- From Medicine to the Big Bang via FDR
[Figure: flow diagram with the nodes Medical Statistics (70s-80s), FDR (Benjamini-Hochberg 1995), Sparse Estimation (Donoho-J 1998), Signal Processing, Microarrays and Data Mining, Physics (signal detection), Acoustic Oscillation]
10 Current Status
- Author affiliation in leading journals
  - Stat Dept 49%
  - Bio Stat Dept 23%
  - Industry 6%
  - Math 9%
  - Others 13%
11 Current Status (2)
- Federal Funding for Research in the US
  - NIH 40
  - NSF 38
  - NSA 9
  - ARO/ONR/EPA 4
12 Current Status
- Membership
  - ASA 16,000
  - IMS 3,500
  - ENAR/WNAR 3,500
  - AMS 30,000
  - SIAM 9,000
- Annual Ph.D.s: 400-500 (fastest growth in biostatistics)
13 Major Themes (1)
- Nonparametric methods for flexible models
- Large p, small n
- Dimension reduction
- Multiple testing, and false discovery control
- ...
14 Nonparametric methods for flexible models
- 20 years ago
  - Parametric families (e.g. normal, Weibull)
  - Linear models (e.g. LS regression, ANOVA)
  - Contingency tables
- Today
  - Semiparametric models
  - Nonparametric models
  - Generalized linear mixed models
15 Flexible models (continued)
- Classification
- Traditional: Fisher discriminant analysis
- Modern: tree-based classification
- http://www.salford-systems.com/
- Support vector machines
- http://www.support-vector-machines.org/
16 Flexible methods (continued)
- Traditional: one model
- Today: ensemble methods
- http://repositories.cdlib.org/uclastat/papers/2004072501/
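The contrast between one model and an ensemble can be sketched with bagging (bootstrap aggregating), one common ensemble method. This is only an illustration under stated assumptions: the 1-D training data, the 10% label noise, the threshold ("stump") classifier, and the names `fit_stump` and `bagged_predict` are all hypothetical and not taken from the slides.

```python
import random

def fit_stump(points):
    """Fit a 1-D threshold rule 'predict 1 when x > threshold' by minimizing training errors."""
    best_threshold, best_errors = None, None
    for threshold, _ in points:
        errors = sum((x > threshold) != bool(label) for x, label in points)
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold

def bagged_predict(thresholds, x):
    """Majority vote over the stumps fitted on the bootstrap samples."""
    votes = sum(x > t for t in thresholds)
    return int(2 * votes > len(thresholds))

rng = random.Random(0)

# Hypothetical training data: label is 1 when x > 0.5, with 10% of labels flipped.
train = []
for _ in range(200):
    x = rng.random()
    label = int(x > 0.5)
    if rng.random() < 0.1:
        label = 1 - label
    train.append((x, label))

# Traditional: one model fitted once on all the data.
single_threshold = fit_stump(train)

# Ensemble: 25 models, each fitted on a bootstrap resample (drawn with replacement).
thresholds = []
for _ in range(25):
    sample = [rng.choice(train) for _ in train]
    thresholds.append(fit_stump(sample))

print("single-model threshold:", single_threshold)
print("ensemble votes for x=0.9 and x=0.1:",
      bagged_predict(thresholds, 0.9), bagged_predict(thresholds, 0.1))
```

The ensemble prediction is a vote, so no single resample's threshold decides the outcome; that averaging is the point of the approach.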
17 Large p, small n
- Traditional problems
  - To compare 2 treatments with 20 observations each
  - Conduct a survey of 1000 people to find their preference in cars (small or large?)
- Today's problems
  - To compare 10,000 genes with 3 measurements each
  - Find out the best customers to go after based on existing records
18 Large p, small n
- Borrow information from neighbors
- Bayesian methods
- Dimension reduction first
- Special asymptotics as p/n -> 0 slowly
19 Asymptotics
- Regression: y = Xb + e
- Parameter b is p-dimensional
- Consider the OLS estimate b(n) of b
  - Consistency if p/n -> 0
  - Asymptotic normality if p/n -> 0 at some rate
- What if p/n is constant?
- What if p > n?
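One way to see why the ratio p/n governs consistency is a short risk calculation. It is a sketch under assumptions not stated on the slide: a fixed design matrix X with full column rank and uncorrelated errors e with mean zero and common variance \sigma^2. Here \hat{b}_n stands for the OLS estimate written b(n) above.

```latex
\hat{b}_n = (X^\top X)^{-1} X^\top y
\quad\Longrightarrow\quad
\hat{b}_n - b = (X^\top X)^{-1} X^\top e,
\qquad
\mathbb{E}\,\lVert \hat{b}_n - b \rVert^2
  = \sigma^2 \operatorname{tr}\!\left[(X^\top X)^{-1}\right].

% Additional simplifying assumption: an orthogonal design with X^\top X = n I_p.
\mathbb{E}\,\lVert \hat{b}_n - b \rVert^2 = \sigma^2 \, \frac{p}{n}.
```

Under the orthogonal-design assumption the total squared error vanishes exactly when p/n -> 0 and stays bounded away from zero when p/n is constant. When p > n, X^\top X is singular, so the formula for \hat{b}_n no longer applies and the least-squares solution is not unique.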
20 Asymptotics (continued)
- Shrinkage
- Residual size + c * (size of coefficients in b)
- Recall the James-Stein estimator in statistics
- Connections: regularization in engineering, Bayesian paradigm, inverse problems in math
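The James-Stein estimator recalled above can be sketched in a few lines. This is a minimal illustration, assuming independent observations x_j ~ N(theta_j, 1) for j = 1..p with p >= 3; the positive-part variant is used, and the true means, the dimension, and the number of trials are hypothetical choices.

```python
import random

def james_stein(x, sigma2=1.0):
    """Positive-part James-Stein estimate of a p-dimensional normal mean (p >= 3)."""
    p = len(x)
    norm2 = sum(v * v for v in x)
    if norm2 == 0.0:
        return list(x)
    # Shrink every coordinate toward 0 by a common, data-dependent factor.
    shrink = max(0.0, 1.0 - (p - 2) * sigma2 / norm2)
    return [shrink * v for v in x]

rng = random.Random(0)
p, trials = 50, 500
theta = [0.5] * p          # hypothetical true means

loss_raw = loss_js = 0.0
for _ in range(trials):
    x = [t + rng.gauss(0.0, 1.0) for t in theta]
    est = james_stein(x)
    loss_raw += sum((a - t) ** 2 for a, t in zip(x, theta))
    loss_js += sum((a - t) ** 2 for a, t in zip(est, theta))

avg_raw, avg_js = loss_raw / trials, loss_js / trials
print("average squared error, unshrunk estimate:", round(avg_raw, 2))
print("average squared error, James-Stein:      ", round(avg_js, 2))
```

The shrunk estimate is biased, yet its total squared error is smaller on average, which is the sense in which shrinkage helps when p is large.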
21 Residual size + c * (size of coefficients in b)
- Consistency in (1) variable selection, (2) parameter estimation
- Prediction of y is more important?
- Variable selection is more important?
- LASSO: http://www-stat.stanford.edu/~tibs/lasso.html
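For the LASSO, the penalized criterion on this slide (residual size plus c times the size of the coefficients) takes a simple form in one special case. The sketch below assumes that residual size means half the residual sum of squares, that size means the sum of absolute coefficients, and that the design is orthonormal (X'X = I); under those assumptions each LASSO coefficient is the OLS coefficient soft-thresholded at c. The coefficient values are hypothetical.

```python
def soft_threshold(z, c):
    """Minimizer over b of 0.5 * (z - b)**2 + c * abs(b)."""
    if z > c:
        return z - c
    if z < -c:
        return z + c
    return 0.0

# Hypothetical OLS coefficients from an orthonormal design.
ols = [3.0, -0.4, 0.1, -2.5, 1.0]
c = 0.5

lasso = [soft_threshold(z, c) for z in ols]
selected = [j for j, b in enumerate(lasso) if b != 0.0]

print(lasso)     # [2.5, 0.0, 0.0, -2.0, 0.5]
print(selected)  # [0, 3, 4]
```

Small coefficients are set exactly to zero (variable selection) while large ones are shrunk by c (biased estimation), which is why the two kinds of consistency listed above are separate questions.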
22 Major Themes (1)
- Nonparametric methods for flexible models
- Large p, small n
- Dimension reduction
- Multiple testing, and false discovery control
- ...
23 Dimension Reduction
- Several topics will be covered in this short course
24 False Discovery Rates
- Traditional: control type I error in hypothesis testing
- Traditional: family-wise type I error for multiple hypotheses
- Does this make sense?
- Reality: unlikely that all the null hypotheses hold true
25 Statistics in Genomics
- Microarray (cDNA and Affymetrix)
26 Microarray Data
- A look at N genes at a time
- Wish to find which genes have differential expression under two conditions (e.g., cancer vs. normal tissues)
- One test per gene -> N tests
- Type I error making sense?
- False discovery rate (FDR): among the positive discoveries, what proportion is expected to be false discoveries?
27 Example
              Accept H0   Reject H0   Total
  True pos         2          20        22
  True neg       100           8       108
  Total          102          28       130
- FDR: 8/28 = 28.6%
- Type I error: 8/108 = 7.4% (on the cases with H0 being true)
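The two rates in this example differ only in their denominators, which a short calculation makes explicit. The counts are the ones from the table; the variable names are illustrative, and "True pos" is read as H0 false while "True neg" is read as H0 true.

```python
# Counts from the example table (rows: truth, columns: test decision).
true_pos_accepted, true_pos_rejected = 2, 20     # H0 false
true_neg_accepted, true_neg_rejected = 100, 8    # H0 true

total_rejected = true_pos_rejected + true_neg_rejected    # 28 discoveries
total_true_neg = true_neg_accepted + true_neg_rejected    # 108 true nulls

# FDR: share of the rejections (discoveries) that are false.
fdr = true_neg_rejected / total_rejected
# Type I error rate: share of the true nulls that were rejected.
type_i_error = true_neg_rejected / total_true_neg

print(f"FDR = {fdr:.1%}")                    # FDR = 28.6%
print(f"Type I error = {type_i_error:.1%}")  # Type I error = 7.4%
```

The same 8 false rejections look small against the 108 true nulls but large against the 28 discoveries, which is the motivation for reporting FDR when many tests are run.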
28 Control FDR
- Independent tests
  - Benjamini, Y., and Hochberg, Y. (1995). "Controlling the false discovery rate: a practical and powerful approach to multiple testing". Journal of the Royal Statistical Society, Series B 57 (1), 289-300.
- Dependent tests
- Estimation of q-value (versus p-value)
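For the independent-tests case, the step-up procedure of the Benjamini and Hochberg (1995) paper cited above can be sketched as follows. The function name `benjamini_hochberg`, the target level q = 0.05, and the p-values are illustrative assumptions, not material from the slides.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans, True where H0 is rejected at FDR level q."""
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Step-up rule: find the largest rank k (1-based) with p_(k) <= k * q / m.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            k_max = rank
    # Reject the hypotheses with the k_max smallest p-values.
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected

# Hypothetical p-values, one per gene.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.5]
decisions = benjamini_hochberg(p_values, q=0.05)
print(decisions)
print("number of discoveries:", sum(decisions))
```

Because the rule looks for the largest qualifying rank, a hypothesis can be rejected even if its own p-value exceeds its own threshold, as long as a larger p-value further down the sorted list meets its threshold.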
29 How about a break?
- Statistics is the art of never having to say you're wrong.
- Mathematical statisticians are normal, and the rest are not.
- A poor statistician can have his head in an oven and his feet in ice, and he will say that on the average he feels fine.
30 Major Themes (2)
- Modeling of complex phenomena using hierarchical models
- Feasibility of Bayesian analysis for complex models, thanks to the development of the theory of Markov chain Monte Carlo (MCMC) methods and to the feasibility of implementing MCMC analysis with modern computing
31 Complex systems
- Many interacting parts in genomics, networks, climate change, financial engineering, etc.
- Multi-scale systems
- Satellite images
- Intelligence information
- fMRI (functional magnetic resonance imaging: the use of MRI to measure neural activity in the brain or spinal cord of humans or other animals)
32 Monte Carlo to new heights
- New methods developed to sample from any (complex) probability distribution
- Integration of prior information with new data -> posterior information, often mathematically intractable, but MCMC methods make it directly interpretable
- http://www.statslab.cam.ac.uk/~mcmc/
33 Bayesian inference
- Frequentist inference
  - Fixed parameter, random sample
- Bayesian inference
  - Fixed sample, random parameter
  - Posterior distribution involves hard integrals, but MCMC methods avoid them
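How MCMC avoids the integral can be seen in a random-walk Metropolis sampler, one of the simplest MCMC methods. The sketch below is illustrative only: the model (observations y_i ~ N(mu, 1) with a N(0, 10^2) prior on mu), the data, the step size, and the burn-in length are all assumptions, chosen so that the exact posterior mean is known and can be compared with the sampler's answer.

```python
import math
import random

def log_posterior(mu, data, prior_sd=10.0):
    """Unnormalized log posterior: N(0, prior_sd^2) prior times N(mu, 1) likelihood."""
    log_prior = -0.5 * (mu / prior_sd) ** 2
    log_lik = -0.5 * sum((y - mu) ** 2 for y in data)
    return log_prior + log_lik

def metropolis(data, n_iter=20000, step=1.0, seed=1):
    """Random-walk Metropolis draws of mu from its posterior."""
    rng = random.Random(seed)
    mu = 0.0
    current = log_posterior(mu, data)
    draws = []
    for _ in range(n_iter):
        proposal = mu + rng.gauss(0.0, step)
        candidate = log_posterior(proposal, data)
        # Only the ratio of posterior densities is needed, so the
        # normalizing integral cancels and is never computed.
        if rng.random() < math.exp(min(0.0, candidate - current)):
            mu, current = proposal, candidate
        draws.append(mu)
    return draws

data = [1.2, 0.8, 1.9, 1.4, 0.7]    # hypothetical observations
draws = metropolis(data)
kept = draws[2000:]                  # discard an assumed burn-in period
mcmc_mean = sum(kept) / len(kept)

# Closed-form posterior mean for this conjugate model, for comparison.
exact_mean = sum(data) / (len(data) + 1.0 / 10.0 ** 2)

print("MCMC posterior mean: ", round(mcmc_mean, 3))
print("exact posterior mean:", round(exact_mean, 3))
```

In this toy model the integral is easy, so the sampler is unnecessary; the same code applies unchanged when `log_posterior` is replaced by a model with no closed form.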
34 Bayesians say
- 20 years ago: Bayesian methods faced computational challenges even in simple problems
- Today: Bayesian methods are well suited for complex problems.
- New research: check out the International Meetings on Bayesian Statistics
- http://www.uv.es/valenciameeting
35 Monte Carlo to new heights (contd)
- Data augmentation methods to handle latent variables or missing data
- Missing data or partially observed data are common in survey data, biomedical studies, economics, etc.
- Latent variables are common in psychology, education testing, genetics, ...
36 Missing Values?
- The short course by Professor Jun Shao
37 Trends in Publications
- N years ago
  - More single-authored papers
  - Focus on mathematical results
  - Toy examples
- Today
  - Many co-authored papers
  - Focus on methodology development
  - Examples of substance
38 A Typical JASA Paper
- 1. Introduction (background and significance)
- 2. Problem description (often motivated by real applications)
- 3. Proposed methodology and properties
- 4. Empirical evaluation and comparison
- 5. Examples
- 6. Conclusions
39 Major Journals
- Journal of the American Statistical Association (theory and methods, applications and case studies)
- Annals of Statistics (mathematical statistics)
- Journal of the Royal Statistical Society, Series B (methodology)
- Biometrika (methodology)
40 Specialized Journals
- Journal of Computational and Graphical Statistics
- Technometrics
- Biometrics
- Psychometrika
- Journal of Multivariate Analysis
- ...
41 New Trends
- More post-docs in applied statistics
- Return of demand on core statistics
- More women in the workforce
- More collaborations
- More like science than mathematics
42 Jobs
- Statistics/Math Department: 9-month (10)
- Biostatistics Research: 12-month (20)
- Pharmaceutical industry: 12-month (50)
- IBM (and the like): 12-month
- Bank One (and the like): 12-month
- Postdoc: 11-month
- Percentage of stat/biostat Ph.D.s with jobs: > 80%
43 Best Candidates
- Broadly trained
- Math, computing, and data analysis skills
- Research or consulting experience
- Good communication skills
- Highly motivated
- Nice personality
44 Research Grants
- Increasingly important for tenure and promotion
- Increasingly competitive at NIH
- A high percentage in collaborative grants
- Success rates: 10-30%, depending on the program
45 Why statistics?