Statistics: Science of Data - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Statistics: Science of Data


1
Statistics: Science of Data
  • Xuming He Department of Statistics,
    University of Illinois

2
Definitions of Statistics
  • a branch of applied mathematics concerned with
    the collection and interpretation of quantitative
    data and the use of probability theory to
    estimate population parameters
  • Statistics is the science and practice of
    developing knowledge through the use of empirical
    data expressed in quantitative form.

3
Traditional Core
  • Design of experiments
  • Probability-based models
  • Parameter estimation (MLE)
  • Hypothesis testing (Neyman-Pearson)
  • Large sample approximations
  • Decision theory

4
Statistics today
  • What is new?
  • Data deluge
  • Increasing awareness of statistical analysis by
    other researchers
  • Core statistics challenged
  • NIH funding changed the way we do research?
  • Jobs in medicine, finance, marketing,
    bioinformatics,

5
Changing landscape
  • 20 years ago
  • Finding optimal procedures under strict
    models/assumptions
  • Proving mathematical theorems
  • Today
  • Finding working procedures under flexible models
  • Running computer simulations

6
Statistics: a Vision for the Future (David Donoho)
  • Statistics research driven by innovation in data
  • Gene expression arrays
  • Diffusion tensor imaging
  • Sloan digital sky survey
  • Laser scanned 3-d imagery
  • Biometry data (faces)
  • Network traffic data

7
How Statistics Operates
  • People in field of origin of data take a crack at
    it using crude tools
  • Statisticians provide theory, methodology,
    computing
  • Applications in other fields that produce similar
    data structures

8
Statistics Theory Dispersed to Other Disciplines
  • 500 recent citations of Efron's bootstrap paper, by field:
  • Geosciences (31)
  • Biology (55)
  • Medicine (71)
  • Agriculture (11)
  • Education (5)
  • Mathematics (4)
  • Statistics (151)
  • Physical Sci. (24)
  • SBE (94)
  • Engineering (27)
  • Environmental (19)
  • Comp. Sci. (7)
  • Polar Research (1)

9
Statistics Theory Inspired by Some Disciplines
and Dispersed to Other Disciplines
  • From Medicine to the Big Bang via FDR
  • (Diagram labels)
  • Acoustic Oscillation
  • Physics: signal detection
  • Microarrays, Data Mining
  • Sparse Estimation: Donoho-J 1998
  • FDR: Benjamini-Hochberg 1995
  • Signal Processing
  • Medical Statistics 70s-80s

10
Current Status
  • Author affiliation in leading journals
  • Stat Dept 49%
  • Bio Stat Dept 23%
  • Industry 6%
  • Math 9%
  • Others 13%

11
Current Status (2)
  • Federal Funding for Research in the US
  • NIH 40
  • NSF 38
  • NSA 9
  • ARO/ONR/EPA 4

12
Current Status
  • Membership
  • ASA 16,000
  • IMS 3,500
  • ENAR/WNAR 3,500
  • AMS 30,000
  • SIAM 9,000
  • Annual PHD 400-500
  • (fastest growth in biostatistics)

13
Major Themes (1)
  • Nonparametric methods for flexible models
  • Large p, small n
  • Dimension reduction
  • Multiple testing, and false discovery control
  • .

14
Nonparametric methods: flexible models
  • 20 Years ago
  • Parametric families (e.g. normal, Weibull)
  • Linear models (e.g. LS regression, ANOVA)
  • Contingency tables
  • Today
  • Semiparametric models
  • Nonparametric models
  • Generalized linear mixed models

15
Flexible models (continued)
  • Classification
  • Traditional: Fisher discriminant analysis
  • Modern: tree-based classification
  • http://www.salford-systems.com/
  • support vector machines
  • http://www.support-vector-machines.org/

16
Flexible methods (continued)
  • Traditional: one model
  • Today: ensemble methods
  • http://repositories.cdlib.org/uclastat/papers/2004072501/

17
Large p, small n
  • Traditional Problems
  • To compare 2 treatments with 20 observations
    each
  • Conduct a survey of 1000 people to find their
    preference in cars (small or large?)
  • Today's Problems
  • To compare 10,000 genes with 3 measurements
    each
  • Find out the best customers to go after based
    on existing records

18
Large p, small n
  • Borrow information from neighbors
  • Bayesian methods
  • Dimension reduction first
  • Special asymptotics as p/n → 0 slowly

19
Asymptotics
  • Regression: y = Xb + e
  • parameter b is p-dim
  • Consider OLS estimate b(n) of b
  • Consistency if p/n → 0
  • Asymptotic normality if p/n → 0 at some rate
  • What if p/n is constant?
  • What if p > n?
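
A short worked calculation may help show why the ratio p/n controls the behavior of OLS. It is only a sketch under assumptions the slide does not state: the rows of X are i.i.d. N(0, I_p), the errors are independent of X with variance \sigma^2, and n > p + 1.

```latex
\hat b_n = (X^\top X)^{-1} X^\top y,
\qquad
E\bigl[\|\hat b_n - b\|^2 \mid X\bigr] = \sigma^2 \,\mathrm{tr}\bigl[(X^\top X)^{-1}\bigr],
\qquad
E\|\hat b_n - b\|^2 = \sigma^2 \,\frac{p}{n - p - 1}.
```

Under these assumptions the expected squared error vanishes when p/n → 0, approaches the nonzero constant \sigma^2 c/(1-c) when p/n → c with 0 < c < 1, and is not defined when p > n, because X^\top X is then singular and the OLS estimate is no longer unique.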

20
Asymptotics (continued)
  • Shrinkage
  • Residual size + c × size of coefficients in b
  • Recall James-Stein estimator in statistics
  • Connections: regularization in engineering,
    Bayesian paradigm, inverse problems in math
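
The James-Stein estimator mentioned above can be illustrated with a small simulation. This is a minimal sketch, not part of the original slides: it assumes one observation x ~ N(theta, I_p) with p = 10, and the function names, the choice of theta, and the number of replications are hypothetical.

```python
import random


def james_stein(x):
    """Shrink the observation x toward the origin (requires len(x) >= 3)."""
    p = len(x)
    norm_sq = sum(v * v for v in x)
    factor = 1.0 - (p - 2) / norm_sq
    return [factor * v for v in x]


def simulate_risk(theta, reps=2000, seed=0):
    """Monte Carlo estimate of squared-error risk for the MLE (x itself) and James-Stein."""
    rng = random.Random(seed)
    mle_loss = 0.0
    js_loss = 0.0
    for _ in range(reps):
        # One draw of x ~ N(theta, I_p)
        x = [t + rng.gauss(0.0, 1.0) for t in theta]
        js = james_stein(x)
        mle_loss += sum((a - t) ** 2 for a, t in zip(x, theta))
        js_loss += sum((a - t) ** 2 for a, t in zip(js, theta))
    return mle_loss / reps, js_loss / reps


# Illustrative setting: p = 10, every coordinate of theta equal to 1
mle_risk, js_risk = simulate_risk([1.0] * 10)
print(f"MLE risk ~ {mle_risk:.2f}, James-Stein risk ~ {js_risk:.2f}")
```

In this setting the simulated risk of the shrunken estimate comes out below that of the unshrunken one, which is the phenomenon the slide recalls.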

21
Residual size + c × size of coefficients in b
  • Consistency in (1) variable selection,
    (2) parameter estimation
  • Prediction of y is more important?
  • Variable selection is more important?
  • LASSO: http://www-stat.stanford.edu/~tibs/lasso.html
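
To make the trade-off between variable selection and estimation concrete, here is a minimal sketch for a single coefficient. It assumes an orthonormal design, so that penalized least squares separates into one-dimensional problems in the OLS coefficient z. The function names are hypothetical, and the two penalties (absolute value for the LASSO, square for ridge) are one reading of "size of coefficients".

```python
def lasso_coefficient(z, c):
    """Minimizer of (z - b)**2 + c * abs(b): soft-thresholding at c / 2."""
    shrunk = max(abs(z) - c / 2.0, 0.0)
    return shrunk if z >= 0 else -shrunk


def ridge_coefficient(z, c):
    """Minimizer of (z - b)**2 + c * b**2: proportional shrinkage."""
    return z / (1.0 + c)


ols = [3.0, 0.5, -2.0, -0.2]  # illustrative OLS coefficients
c = 2.0                        # illustrative penalty weight
lasso = [lasso_coefficient(z, c) for z in ols]
ridge = [ridge_coefficient(z, c) for z in ols]
print("lasso:", lasso)
print("ridge:", ridge)
```

The absolute-value penalty sets small coefficients exactly to zero, which performs variable selection, while the squared penalty shrinks every coefficient but keeps all of them in the model.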

22
Major Themes (1)
  • Nonparametric methods for flexible models
  • Large p, small n
  • Dimension reduction
  • Multiple testing, and false discovery control
  • .

23
Dimension Reduction
  • Several topics will be covered in this short
    course

24
False Discovery Rates
  • Traditional: control type I error in hypothesis
    testing
  • Traditional: family-wise type I error for
    multiple hypotheses
  • Does this make sense?
  • Reality: unlikely that all the null hypotheses
    hold true

25
Statistics in Genomics
  • Microarray (cDNA and Affymetrix)

26
Microarray Data
  • A look at N genes at a time
  • Wish to find which genes are differentially
    expressed under two conditions (e.g., cancer
    vs. normal tissues)
  • One test per gene → N tests
  • Does type I error make sense?
  • False discovery rate (FDR): among the positive
    discoveries, what proportion is expected to be
    false discoveries?

27
Example
                Accept H0   Reject H0   Total
    True pos         2          20        22
    True neg       100           8       108
    Total          102          28       130

  • FDR = 8/28 = 28.6%
  • Type I error = 8/108 = 7.4% (on the cases with H0
    being true)
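
The two rates in this example can be checked with a few lines of Python. The variable names are illustrative; the counts are the ones on the slide, read as 28 rejections of which 8 are cases where H0 is true, and 108 cases where H0 is true in total.

```python
# Counts from the slide's 2x2 example
false_discoveries = 8     # H0 true but rejected
total_rejections = 28     # all rejected hypotheses
total_true_nulls = 108    # all cases where H0 is true

# FDR conditions on the rejections; type I error conditions on the true nulls
fdr = false_discoveries / total_rejections
type_i_error = false_discoveries / total_true_nulls

print(f"FDR = {fdr:.1%}")
print(f"Type I error = {type_i_error:.1%}")
```

The same 8 false rejections give very different rates depending on the denominator, which is the point of the comparison.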

28
Control FDR
  • Independent tests
  • Benjamini, Y., and Hochberg, Y. (1995).
    "Controlling the false discovery rate: a
    practical and powerful approach to multiple
    testing". Journal of the Royal Statistical
    Society, Series B 57 (1), 289-300.
  • Dependent tests
  • Estimation of q-value (versus p-value)
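
For the independent-tests case, the step-up procedure from the Benjamini and Hochberg paper can be sketched as follows. This is a minimal illustration rather than a reference implementation; the function name, the target level q = 0.05, and the p-values are assumptions made for the example.

```python
def benjamini_hochberg(pvalues, q=0.05):
    """Return a list of booleans: True where the hypothesis is rejected at FDR level q."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Find the largest rank k with p_(k) <= k * q / m
    k = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank * q / m:
            k = rank
    # Reject the hypotheses with the k smallest p-values
    rejected = [False] * m
    for idx in order[:k]:
        rejected[idx] = True
    return rejected


pvals = [0.01, 0.04, 0.03, 0.02]  # illustrative p-values
bh = benjamini_hochberg(pvals, q=0.05)
bonferroni = [p <= 0.05 / len(pvals) for p in pvals]
print("BH rejections:        ", bh)
print("Bonferroni rejections:", bonferroni)
```

On these illustrative p-values the step-up rule rejects all four hypotheses, while a family-wise Bonferroni cutoff of 0.05/4 rejects only the smallest one.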

29
How about a break?
  • Statistics is the art of never having to say
    you're wrong.
  • Mathematical statisticians are normal, and the
    rest are not.
  • A poor statistician can have his head in an oven
    and his feet in ice, and he will say that on the
    average he feels fine.

30
Major Themes (2)
  • Modeling of complex phenomena using hierarchical
    models
  • Feasibility of Bayesian analysis for complex
    models because of the development of the theory
    of Markov Chain Monte Carlo (MCMC) methods and
    because of the feasibility of implementing MCMC
    analysis with modern computing

31
Complex systems
  • Many interacting parts in genomics, networks,
    climate changes, financial engineering, etc.
  • Systems of multi-scales
  • Satellite images
  • Intelligence information
  • fMRI (functional magnetic resonance imaging:
    the use of MRI to measure neural activity in
    the brain or spinal cord of humans or other
    animals)

32
Monte Carlo to new heights
  • New methods developed to sample from any
    (complex) probability distribution
  • Integration of prior information with new data
    → posterior information, often mathematically
    intractable, but the MCMC methods make it
    directly interpretable
  • http://www.statslab.cam.ac.uk/mcmc/

33
Bayesian inference
  • Frequentist inference
  • Fixed parameter, random sample
  • Bayesian inference
  • Fixed sample, random parameter
  • Posterior distribution involves hard
    integrals, but MCMC methods avoid them
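
As a small illustration of how MCMC sidesteps the integral, here is a random-walk Metropolis sketch. Everything specific in it is an assumption made for the example: the data (7 successes in 10 trials), the uniform prior on the success probability theta, the step size, and the function names. The sampler only needs the posterior up to its normalizing constant.

```python
import math
import random


def log_posterior(theta, successes=7, failures=3):
    """Unnormalized log posterior: binomial likelihood times a uniform prior on (0, 1)."""
    if not 0.0 < theta < 1.0:
        return -math.inf
    return successes * math.log(theta) + failures * math.log(1.0 - theta)


def metropolis(n_samples, step=0.2, start=0.5, seed=0):
    """Random-walk Metropolis sampler for theta."""
    rng = random.Random(seed)
    theta = start
    current = log_posterior(theta)
    samples = []
    for _ in range(n_samples):
        proposal = theta + rng.gauss(0.0, step)
        candidate = log_posterior(proposal)
        # Accept with probability min(1, posterior ratio); no normalizing constant needed
        if rng.random() < math.exp(min(0.0, candidate - current)):
            theta, current = proposal, candidate
        samples.append(theta)
    return samples


draws = metropolis(20000)[2000:]  # discard the first 2000 draws as burn-in
posterior_mean = sum(draws) / len(draws)
print(f"Estimated posterior mean of theta: {posterior_mean:.3f}")
```

In this toy case the exact posterior is Beta(8, 4) with mean 2/3, so the simulated average can be compared against a known answer. In the complex models the slides have in mind no such closed form exists, and the draws are the only practical summary.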

34
Bayesians say
  • 20 years ago: Bayesian methods faced computational
    challenges even in simple problems
  • Today: Bayesian methods are well suited for
    complex problems.
  • New research: check out the International Meetings
    on Bayesian Statistics
  • http://www.uv.es/valenciameeting

35
Monte Carlo to new heights (continued)
  • Data augmentation methods to handle latent
    variables or missing data
  • Missing data or partially observed data are
    common in survey data, biomedical studies,
    economics, etc.
  • Latent variables are common in psychology,
    education testing, genetics,

36
Missing Values?
  • The short course by Professor Jun Shao

37
Trends in Publications
  • N Years ago
  • More single-authored papers
  • Focus on mathematical results
  • Toy examples
  • Today
  • Many co-authored papers
  • Focus on methodology development
  • Examples of substance

38
A Typical JASA Paper
  • 1. Introduction (background and significance)
  • 2. Problem description (often motivated by real
    applications)
  • 3. Proposed methodology and properties
  • 4. Empirical evaluation and comparison
  • 5. Examples
  • 6. Conclusions

39
Major Journals
  • Journal of the American Statistical Association
    (theory and methods; applications and case studies)
  • Annals of Statistics (mathematical statistics)
  • Journal of the Royal Statistical Society Series B
    (methodology)
  • Biometrika (methodology)

40
Specialized Journals
  • Journal of Computational and Graphical Statistics
  • Technometrics
  • Biometrics
  • Psychometrika
  • Journal of Multivariate Analysis

41
New Trends
  • More post-docs in applied statistics
  • Return of demand on core statistics
  • More women in the workforce
  • More collaborations
  • More like science than mathematics

42
Jobs
  • Statistics/Math Department 9-mon (10)
  • Biostatistics Research 12-mon (20)
  • Pharmaceutical industry 12-mon (50)
  • IBM (and the like) 12-mon
  • Bank One (and the like) 12-mon
  • Postdoc 11-month
  • % of stat/biostat Ph.D.s with jobs: > 80%

43
Best Candidates
  • Broadly trained
  • Math, computing, and data analysis skills
  • Research or consulting experience
  • Good communication skills
  • Highly motivated
  • Nice personality

44
Research Grants
  • Increasingly important for tenure and promotion
  • Increasingly competitive at NIH
  • A high % in collaborative grants
  • Success rates 10-30% depending on programs

45
Why statistics?
  • Your choice!