STAT 254 Lecture 1: An overview
Provided by: statUcla
Learn more at: http://www.stat.ucla.edu
1
STAT 254 Lecture 1: An overview
  • Cell biology, microarray, statistics
  • Bioinformatics and Statistics
  • Topics to cover
  • Keep a skeptical eye on everything you read or
    hear
  • Keep an eye on bigger picture while working on
    specifics
  • The shaping of bioinformatics falls on your
    shoulders
  • What to take home: not just microarray or
    high-throughput data analysis methods, but a set
    of skills and ways of thinking about quantitative
    biology

2
Exploratory data analysis: multivariate,
high-dimensional
20 min
3
Study of Gene Expression: Statistics, Biology,
and Microarrays
IMS ENAR Conference
Time: March 31, 2003
Place: Tampa, FL
  • Ker-Chau Li
  • Statistics Department
  • UCLA
  • kcli_at_stat.ucla.edu

4
Outline
  • Review of cell biology
  • Microarray gene expression data collection
  • Cell-cycle gene expression (Main Data set)
  • PCA/Nested regression SIR (Dim. red.)
  • Similarity analysis - clustering (Why Popular?)
  • Liquid association
  • Closing remarks

New statistical concept, fueled by Stein's lemma
Justification for IMS
5
PART I. Cellular Biology
  • Macromolecules DNA, mRNA, protein

6
Why is biology hot?
Because of
7
Human Genome Project
Begun in 1990, the U.S. Human Genome Project is a
13-year effort coordinated by the U.S. Department
of Energy and the National Institutes of Health.
The project originally was planned to last 15
years, but effective resource and technological
advances have accelerated the expected completion
date to 2003. Project goals are to
  • identify all of the approximately 30,000 genes
    in human DNA,
  • determine the sequences of the 3 billion
    chemical base pairs that make up human DNA,
  • store this information in databases,
  • improve tools for data analysis,
  • transfer related technologies to the private
    sector, and
  • address the ethical, legal, and social issues
    (ELSI) that may arise from the project.

Recent Milestones
  • June 2000: completion of a working draft of the
    entire human genome
  • February 2001: analyses of the working draft are
    published
Human Genome Program, U.S. Department of Energy,
Genomics and Its Impact on Medicine and Society
A 2001 Primer, 2001
8
Future Challenges: What We Still Don't Know
  1. Predicted vs experimentally determined gene
     function
  2. Gene regulation (upstream regulatory region)
  3. Coordination of gene expression, protein
     synthesis, and post-translational events
  • Gene number, exact locations, and functions
  • DNA sequence organization
  • Chromosomal structure and organization
  • Noncoding DNA types, amount, distribution,
    information content, and functions
  • Interaction of proteins in complex molecular
    machines
  • Evolutionary conservation among organisms
  • Protein conservation (structure and function)
  • Proteomes (total protein content and function)
    in organisms
  • Correlation of SNPs (single-base DNA variations
    among individuals) with health and disease
  • Disease-susceptibility prediction based on gene
    sequence variation
  • Genes involved in complex traits and multigene
    diseases
  • Complex systems biology, including microbial
    consortia useful for environmental restoration
  • Developmental genetics, genomics
9
Medicine and the New Genomics
  • Gene Testing
  • Gene Therapy
  • Pharmacogenomics

Anticipated Benefits
  • improved diagnosis of disease
  • earlier detection of genetic predispositions to
    disease
  • rational drug design
  • gene therapy and control systems for drugs
  • personalized, custom drugs

10
Anticipated Benefits
Agriculture, Livestock Breeding, and Bioprocessing
  • disease-, insect-, and drought-resistant crops
  • healthier, more productive, disease-resistant
    farm animals
  • more nutritious produce
  • biopesticides
  • edible vaccines incorporated into food products
  • new environmental cleanup uses for plants like
    tobacco
11
How does the cell work?
  • The guiding principle is the so-called
    central dogma of cell biology
17
Gene to protein: 4 nucleotides and 20 amino acids
Protein is synthesized from amino acids by the
ribosome
18
Gene to Protein
The mediator mRNA
Transcription
Translation
19
Transcription and translation
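The two steps named above can be sketched in a few lines of Python. This is only an illustration: the codon table is a small, hand-picked subset of the standard genetic code (4 nucleotides read 3 at a time give 4^3 = 64 codons for 20 amino acids), and the DNA sequence and function names are made up for the example.

```python
# Partial codon table: a hypothetical subset chosen for this example only.
# The full standard genetic code has 64 codons.
CODON_TABLE = {
    "AUG": "M",                                        # start codon, methionine
    "UUU": "F", "UUC": "F",                            # phenylalanine
    "GGU": "G", "GGC": "G", "GGA": "G", "GGG": "G",    # glycine
    "AAA": "K", "AAG": "K",                            # lysine
    "UAA": "*", "UAG": "*", "UGA": "*",                # stop codons
}


def transcribe(coding_dna: str) -> str:
    """Return the mRNA for a coding-strand DNA sequence (T becomes U)."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Translate mRNA codon by codon until a stop codon or the end."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        # "?" marks a codon that is missing from the partial table above.
        amino_acid = CODON_TABLE.get(mrna[i:i + 3], "?")
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


mrna = transcribe("ATGTTTGGCAAATAA")   # made-up sequence
protein = translate(mrna)
print(mrna, protein)
```

A microarray measures the abundance of the intermediate product (mRNA), not the protein itself.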
20
PART II. Microarray
  • Genome-wide expression profiling

21
Exploring the Metabolic and Genetic Control of
Gene Expression on a Genomic Scale
Joseph L. DeRisi, Vishwanath R. Iyer, Patrick O. Brown
22
Microarray
23
Microarray
  • Allows measuring the mRNA level of thousands of
    genes in one experiment: a system-level response
  • The data generation can be fully automated by
    robots
  • Common experimental themes:
  • Time course (when)
  • Tissue type (where)
  • Response (under what conditions)
  • Perturbation: mutation/knockout, knock-in,
    over-expression

24
Reverse-transcription
Color: Cy3 (green), Cy5 (red)
25
Example 1
5 min
  • Comparative expression
  • Normal versus cancer cells
  • ALL versus AML

E. Lander's group at MIT
26
PART III. Statistics
  • Low-level analysis
  • Comparative expression

Feature extraction
Clustering/classification
Pearson correlation
Liquid association
27
Issues related to image qualities
(not to be covered)
  • Convert an image into a number representing the
    ratio of the levels of expression between red and
    green channels
  • Color bias
  • Spatial, tip, spot effects
  • Background noise
  • cDNA, oligonucleotide arrays

28
Genome-wide expression profile: a basic structure

         cond1  cond2  ..  condp
Gene1    x11    x12    ..  x1p
Gene2    x21    x22    ..  x2p
...      ...    ...        ...
Genen    xn1    xn2    ..  xnp
29
Cond1, cond2, ..., condp denote various
environmental conditions, time points, cell
types, etc. under which mRNA samples are taken.
Note: numerous cells are involved.
Data quality issues:
1. chip (manufacturer)
2. mRNA sample (user)
It is important to have a homogeneous sample so that
cellular signals can be amplified.
Yeast cell cycle data: ideally all cells are
engaged in the same activities (synchronization).
30
Two-class problem
An application
ALL (acute lymphoblastic leukemia) versus AML (acute
myeloid leukemia)
31
Which Genes to select?
  • For each gene (row), compute a score defined by
    (sample mean of X - sample mean of Y) /
    (standard deviation of X + standard deviation
    of Y)
  • X = ALL, Y = AML
  • Genes (rows) with the highest scores are selected.
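The score above can be sketched as follows. The expression values and gene names are made up for illustration, the function names are hypothetical, and the use of the sample standard deviation (divisor n - 1) is an assumption, since the slide does not say which divisor is used.

```python
from statistics import mean, stdev


def signal_to_noise(x, y):
    """(mean of X - mean of Y) / (sd of X + sd of Y) for one gene (row)."""
    return (mean(x) - mean(y)) / (stdev(x) + stdev(y))


def rank_genes(expression, labels, class_x="ALL", class_y="AML"):
    """Score every gene and return (scores, gene names sorted by score)."""
    scores = {}
    for gene, row in expression.items():
        x = [v for v, lab in zip(row, labels) if lab == class_x]
        y = [v for v, lab in zip(row, labels) if lab == class_y]
        scores[gene] = signal_to_noise(x, y)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return scores, ranked


# Made-up toy data: three genes measured on six samples.
labels = ["ALL", "ALL", "ALL", "AML", "AML", "AML"]
expression = {
    "geneA": [5.0, 6.0, 7.0, 1.0, 2.0, 3.0],   # higher in ALL
    "geneB": [1.0, 2.0, 3.0, 1.0, 2.0, 3.0],   # no difference
    "geneC": [1.0, 2.0, 3.0, 5.0, 6.0, 7.0],   # higher in AML
}
scores, ranked = rank_genes(expression, labels)
print(scores, ranked)
```

Sorting by the absolute value of the score instead would pick up genes that are high in either class.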

They have a method that seems to work well.
  • 34 new leukemia samples
  • 29 are predicted with 100% accuracy; 5 are
    weak-prediction cases

Seems to work! Improvement?
32
Study of cell-cycle regulated genes
  • Rate of cell growth and division varies
  • Yeast (120 min), insect egg (15-30 min), nerve
    cell (no), fibroblast (healing wounds)
  • Regulation: irregular growth causes cancer
  • Goal: find which genes are expressed at each
    stage of the cell cycle
  • Yeast cells: Spellman et al (2000)
  • Fourier analysis: cyclic pattern
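One way to look for a cyclic pattern is to measure how strongly a time course projects onto a sine and cosine at the cell-cycle period. The sketch below is a generic illustration, not necessarily the procedure used in the cited study; the sampling times and both profiles are made up, and the 120-minute period is the yeast figure quoted above.

```python
import math


def fourier_magnitude(profile, times, period):
    """Magnitude of the Fourier component of a centered profile at `period`."""
    center = sum(profile) / len(profile)
    omega = 2 * math.pi / period
    a = sum((v - center) * math.sin(omega * t) for v, t in zip(profile, times))
    b = sum((v - center) * math.cos(omega * t) for v, t in zip(profile, times))
    return math.hypot(a, b)


# Hypothetical sampling: every 10 minutes over two 120-minute cycles.
times = [10 * i for i in range(24)]
cyclic = [math.sin(2 * math.pi * t / 120) for t in times]     # periodic gene
jitter = [0.1 * (-1) ** i for i in range(len(times))]         # fast wiggle only

cyclic_score = fourier_magnitude(cyclic, times, period=120)
jitter_score = fourier_magnitude(jitter, times, period=120)
print(cyclic_score, jitter_score)
```

Genes can then be ranked by this magnitude, and the angle `math.atan2(b, a)` would indicate where in the cycle the expression peaks.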

33
Yeast Cell Cycle (adapted from Molecular Cell
Biology, Darnell et al)
Most visible event
34
Example of the time curve: histone genes
(HHT2) ORF YNL031C, time course
Histone
35
EBP2 YKL172W
TSM1 YCR042C
YOR263C
37
Why does clustering make sense biologically?
The rationale is
Genes with high degree of expression similarity
are likely to be functionally related and may
participate in common pathways. They may be
co-regulated by common upstream regulatory
factors.
Rationale behind massive gene expression analysis
Simply put,
Profile similarity implies functional association
38
Some protein complexes
Proteins rarely work as single units
39
Gene profiles and correlation
  • Pearson's correlation coefficient, a simple
    way of describing the strength of linear
    association between a pair of random variables,
    has become the most popular measure of gene
    expression similarity.
  • 1. Cluster analysis: average linkage,
    self-organizing map, K-means, ...
  • 2. Classification: nearest neighbor, linear
    discriminant analysis, support vector machine
  • 3. Dimension reduction methods: PCA (SVD)
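As a small illustration of correlation used as a similarity measure, the sketch below computes Pearson's coefficient and uses it for nearest-neighbor classification, one of the methods listed above. All profiles and class names are made up, and the function names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def nearest_neighbor_label(query, training):
    """Label of the training profile most correlated with `query`."""
    best_label, best_r = None, -2.0   # any real correlation exceeds -2
    for profile, label in training:
        r = pearson(query, profile)
        if r > best_r:
            best_label, best_r = label, r
    return best_label, best_r


# Made-up profiles (four measurements each), for illustration only.
training = [
    ([1.0, 2.0, 3.0, 4.0], "class A"),
    ([4.0, 3.0, 1.0, 2.0], "class B"),
]
label, r = nearest_neighbor_label([2.0, 4.1, 6.2, 7.9], training)
print(label, round(r, 3))
```

Because the coefficient is unchanged by shifting or rescaling a profile, the query matches "class A" even though its values are roughly twice as large.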

40
The correlation coefficient (CC) has been used by
Gauss, Bravais, and Edgeworth.
Its sweeping impact in data analysis is due to
Galton (1822-1911), "Typical laws of heredity in
man".
Karl Pearson modified and popularized its use.
It is a building block in multivariate analysis,
of which clustering, classification, and dimension
reduction are recurrent themes.
As a statistician, how can you ignore the time
order? (Isn't it true that the use of sample
correlation relies on the assumption that the data
are i.i.d.?)
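The point about time order can be checked numerically: the sample correlation between two profiles does not change when the time points are shuffled (as long as both profiles are shuffled the same way), while a statistic such as the lag-1 autocorrelation does depend on the order. The profiles, the seed, and the choice of lag-1 autocorrelation as the contrasting statistic are assumptions made for this sketch.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def lag1_autocorrelation(x):
    """Correlation between a series and itself shifted by one time point."""
    return pearson(x[:-1], x[1:])


# Two made-up cyclic profiles sampled at 12 time points.
times = list(range(12))
gene_x = [math.sin(2 * math.pi * t / 12) for t in times]
gene_y = [math.sin(2 * math.pi * t / 12 + 0.5) for t in times]

# Apply the same random permutation of time points to both genes.
order = times[:]
random.Random(0).shuffle(order)
shuffled_x = [gene_x[i] for i in order]
shuffled_y = [gene_y[i] for i in order]

r_original = pearson(gene_x, gene_y)
r_shuffled = pearson(shuffled_x, shuffled_y)
print(r_original, r_shuffled)                    # equal up to rounding
print(lag1_autocorrelation(gene_x), lag1_autocorrelation(shuffled_x))
```

So any clustering based only on pairwise correlations treats the time points as exchangeable.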

41
Other methods for finding gene clusters
  • Bayesian clustering: normal mixture, (hidden)
    indicator
  • PCA plot, projection pursuit, grand tour
  • Multidimensional scaling (bi-plot for categorical
    responses, showing both cases (genes) and
    variables (different clustering methods),
    displaying results from many different clustering
    procedures)
  • Generalized association plot (Chen 2001,
    Statistica Sinica)
  • PLAID model (Statistica Sinica 2002, Lazzeroni,
    Owen)

43
1st PCA direction
2nd PCA direction
3rd PCA direction
Eigenvalues
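For readers who want to see where PCA directions and eigenvalues come from, here is a minimal sketch that finds the first principal direction by power iteration on the sample covariance matrix. The data are made up, the function names are hypothetical, and in practice a library SVD or eigenvalue routine would be used instead.

```python
import math


def covariance_matrix(rows):
    """Sample covariance matrix of the columns (conditions) of `rows`."""
    n, p = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in rows]
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(p)]
            for i in range(p)]


def leading_eigenpair(matrix, iterations=200):
    """Power iteration: largest eigenvalue and its unit eigenvector."""
    p = len(matrix)
    # Starting vector; it must not be orthogonal to the leading eigenvector.
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    mv = [sum(matrix[i][j] * v[j] for j in range(p)) for i in range(p)]
    eigenvalue = sum(a * b for a, b in zip(v, mv))   # Rayleigh quotient
    return eigenvalue, v


# Made-up data: four genes (rows) under two conditions (columns),
# lying exactly on the line cond2 = cond1.
profiles = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]]
eigenvalue, direction = leading_eigenpair(covariance_matrix(profiles))
print(eigenvalue, direction)
```

Here all of the variance lies along the direction (1, 1)/sqrt(2), so the first eigenvalue equals the total variance, 10/3.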
44
Phase Assignment
(Figure: gene counts by cell-cycle phase, G1, S, S/G2, G2/M, M/G1, for smooth and non-smooth profiles)
45
ARG1
Glutamate
ARG2
Book a flight from LA to KEGG, JAPAN in
less than 10 seconds
46
ARG1
8th place negative
Y
Head
X
Compute LA(X,Y|Z) for all Z
Backdoor
Rank and find leading genes
Adapted from KEGG
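The "compute LA for all Z, then rank" step can be sketched as follows. The sketch assumes the estimator usually associated with liquid association: with g(z) = E[XY | Z = z] and Z standard normal, Stein's lemma gives E[g'(Z)] = E[g(Z) Z] = E[XYZ], so LA is estimated by the average of the triple product after X and Y are standardized and Z is replaced by its normal scores. The preprocessing choices, function names, and data below are assumptions for illustration; the published paper should be consulted for the exact procedure.

```python
from statistics import NormalDist, mean, pstdev


def standardize(values):
    """Shift and scale to mean 0 and (population) standard deviation 1."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


def normal_scores(values):
    """Replace each value by the normal quantile of rank / (n + 1).

    Ties are not handled specially in this sketch.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    scores = [0.0] * n
    for rank, i in enumerate(order, start=1):
        scores[i] = NormalDist().inv_cdf(rank / (n + 1))
    return scores


def liquid_association(x, y, z):
    """Estimate LA(X,Y|Z) as the average of x*y*z after transformation."""
    xs, ys, zs = standardize(x), standardize(y), normal_scores(z)
    return mean([a * b * c for a, b, c in zip(xs, ys, zs)])


def rank_mediators(x, y, candidates):
    """Return (name, LA score) pairs sorted by decreasing |LA|."""
    scores = {name: liquid_association(x, y, z) for name, z in candidates.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Made-up data: X and Y are uncorrelated overall, but move together when
# Z1 is high and in opposite directions when Z1 is low.
x = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
y = [-1.0, 1.0, -1.0, 1.0, 1.0, -1.0, 1.0, -1.0]
candidates = {
    "Z1": [-4.0, -3.0, -2.0, -1.0, 1.0, 2.0, 3.0, 4.0],
    "Z2": [1.0, -1.0, 2.0, -2.0, 3.0, -3.0, 4.0, -4.0],
}
ranking = rank_mediators(x, y, candidates)
print(ranking)
```

A positive score means the correlation between X and Y tends to increase as Z increases; a negative score means it tends to decrease.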
47
Coverage of bioinformatics by areas and topics
Sequence analysis
Linkage, pedigree
Microarray
DNA RNA Protein EST Drug
Evolution
Functional prediction
SNP
Alternative splicing
System modeling
Pathway discovery
Promoter
Motif
Domain
Drug-gene-protein
3-D structure
Protein-protein
Protein-gene
TRANSFAC
48
Coverage of Bioinformatics by expertise (hat,
not person)
Computer scientist
Statistician/mathematician
Biologist
(raw data provider)
(huge data volume)
Oil-refining
(Crude oil)
(Noise, garbage, or ignorance?)
Make researchers' lives easier (pipeline)
Data cleaning
Data mining
Pattern searching /comparison
(Bio-information distilling/ Bio-data refining)
Web page browsing
Literature searching
Physical/Math/prob/stat models, computer
optimization
Database/visualization
Generalization/inference
Gene Ontology
49
Math. modeling: a nightmare
Current
Next
mRNA
FITNESS
mRNA
mRNA
Observed
protein kinase
hidden
ATP, GTP, cAMP, etc.
Cytoplasm, nucleus, mitochondria, vacuolar
localization
FUNCTION
Statistical methods become useful
DNA methylation, chromatin structure
Nutrients (carbon, nitrogen sources), temperature, water
50
Bioinformatics (knowledge integration center)
  • When
  • Where
  • Who
  • What
  • Why
  • Cell level
  • Organ level
  • Organism level
  • Species level
  • Ecosystem level

51
Special issue on bioinformatics
Want to get a quick start?
  • Statistica Sinica
  • January 2002

My paper on liquid association: PNAS 2002, 99,
16875-16880
"Genome-wide co-expression dynamics: theory and
application"
Classification: Biological Science, Genetics;
Physical Science, Statistics
52
END
53
Cautionary Notes for Seriation and Row-Column
Sorting
  • Hierarchical clustering is popular, but
  • Sharp boundaries may be artifacts due to clever
    permutation
  • How to fine-tune user-specified parameters? Some
    theoretical guidance is needed
  • What is a cluster? Criteria needed

54
Popular methods for clustering/data mining
  • Linkage: Eisen et al., Alon et al.
  • K-means: Tavazoie et al.
  • Self-organizing map: Tamayo et al.
  • SVD: Holter et al.; Alter, Brown, Botstein

55
Can statisticians take the lead?
  • Difficult
  • But not impossible
  • The key: willingness to learn more biology

February 2002: talk at UCLA Biochemistry,
feedback from David Eisenberg
March 2002: David gave an inspiring review talk
about several of his works (Nature, similarity)
schematic illustration