Title: STAT 254, Lecture 1: An overview
1. STAT 254, Lecture 1: An overview
- Cell biology, microarray, statistics
- Bioinformatics and Statistics
- Topics to cover
- Keep a skeptical eye on everything you read or hear
- Keep an eye on the bigger picture while working on specifics
- The shaping of bioinformatics falls on your shoulders
- What to take home: not just microarray or high-throughput data analysis methods, but a set of skills and ways of thinking about quantitative biology
2. Exploratory data analysis: multivariate, high-dimensional
20 min
3. Study of Gene Expression: Statistics, Biology, and Microarrays
IMS ENAR Conference
Time: March 31, 2003
Place: Tampa, FL
- Ker-Chau Li
- Statistics Department
- UCLA
- kcli_at_stat.ucla.edu
4. Outline
- Review of cell biology
- Microarray gene expression data collection
- Cell-cycle gene expression (Main Data set)
- PCA/Nested regression SIR (Dim. red.)
- Similarity analysis - clustering (Why Popular?)
- Liquid association
- Closing remarks
New statistical concept, fueled by Stein's lemma
Justification for IMS
5. PART I. Cellular Biology
- Macromolecules: DNA, mRNA, protein
6. Why is biology hot?
Because of
7. Human Genome Project
Begun in 1990, the U.S. Human Genome Project is a 13-year effort coordinated by the U.S. Department of Energy and the National Institutes of Health. The project originally was planned to last 15 years, but effective resource and technological advances have accelerated the expected completion date to 2003. Project goals are to
- identify all the approximately 30,000 genes in human DNA,
- determine the sequences of the 3 billion chemical base pairs that make up human DNA,
- store this information in databases,
- improve tools for data analysis,
- transfer related technologies to the private sector, and
- address the ethical, legal, and social issues (ELSI) that may arise from the project.
Recent Milestones
- June 2000: completion of a working draft of the entire human genome
- February 2001: analyses of the working draft are published
Human Genome Program, U.S. Department of Energy, Genomics and Its Impact on Medicine and Society: A 2001 Primer, 2001
8. Future Challenges: What We Still Don't Know
1. Predicted vs. experimentally determined gene function
2. Gene regulation (upstream regulatory region)
3. Coordination of gene expression, protein synthesis, and post-translational events
- Gene number, exact locations, and functions
- DNA sequence organization
- Chromosomal structure and organization
- Noncoding DNA types, amount, distribution, information content, and functions
- Interaction of proteins in complex molecular machines
- Evolutionary conservation among organisms
- Protein conservation (structure and function)
- Proteomes (total protein content and function) in organisms
- Correlation of SNPs (single-base DNA variations among individuals) with health and disease
- Disease-susceptibility prediction based on gene sequence variation
- Genes involved in complex traits and multigene diseases
- Complex systems biology, including microbial consortia useful for environmental restoration
- Developmental genetics, genomics
Human Genome Program, U.S. Department of Energy, Genomics and Its Impact on Medicine and Society: A 2001 Primer, 2001
9. Medicine and the New Genomics
- Gene Testing
- Gene Therapy
Anticipated Benefits
- improved diagnosis of disease
- earlier detection of genetic predispositions to disease
- rational drug design
- gene therapy and control systems for drugs
- personalized, custom drugs
Human Genome Program, U.S. Department of Energy, Genomics and Its Impact on Medicine and Society: A 2001 Primer, 2001
10. Anticipated Benefits
Agriculture, Livestock Breeding, and Bioprocessing
- disease-, insect-, and drought-resistant crops
- healthier, more productive, disease-resistant farm animals
- more nutritious produce
- biopesticides
- edible vaccines incorporated into food products
- new environmental cleanup uses for plants like tobacco
Human Genome Program, U.S. Department of Energy, Genomics and Its Impact on Medicine and Society: A 2001 Primer, 2001
11. How does the cell work?
- The guiding principle is the so-called central dogma of cell biology
16. Human Genome Program, U.S. Department of Energy, Genomics and Its Impact on Medicine and Society: A 2001 Primer, 2001
17. Gene to protein: 4 nucleotides and 20 amino acids
Proteins are synthesized from amino acids by the ribosome
18. Gene to Protein
The mediator: mRNA
Transcription
Translation
19. Transcription and translation
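The two steps named above can be sketched in a few lines. This is a minimal illustration, not a full model: the codon table is deliberately partial (only the codons used in the toy sequence, taken from the standard genetic code), the input is assumed to be the coding strand, and the function names `transcribe` and `translate` are made up for this example. With 4 nucleotides read three at a time there are 4^3 = 64 codons, which is more than enough to encode 20 amino acids.

```python
# Partial codon table from the standard genetic code (illustrative subset only).
CODON_TABLE = {
    "AUG": "M",                          # methionine, also the start codon
    "GCU": "A", "GCC": "A",              # alanine
    "UGG": "W",                          # tryptophan
    "AAA": "K",                          # lysine
    "UAA": "*", "UAG": "*", "UGA": "*",  # stop codons
}


def transcribe(coding_dna: str) -> str:
    """Coding-strand DNA -> mRNA: same sequence with T replaced by U."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Read codons from the first AUG until a stop codon is reached."""
    start = mrna.find("AUG")
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        # "?" marks a codon that is missing from the partial table above
        amino_acid = CODON_TABLE.get(mrna[i:i + 3], "?")
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


mrna = transcribe("ATGGCCTGGAAATAA")   # toy sequence, not a real gene
protein = translate(mrna)
print(mrna, protein)
```

The toy sequence is read as AUG GCC UGG AAA UAA, that is, Met-Ala-Trp-Lys followed by a stop codon.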
20. PART II. Microarray
- Genome-wide expression profiling
21. Exploring the Metabolic and Genetic Control of Gene Expression on a Genomic Scale
Joseph L. DeRisi, Vishwanath R. Iyer, Patrick O. Brown
22. Microarray
23. Microarray
- Allows measuring the mRNA levels of thousands of genes in one experiment: a system-level response
- The data generation can be fully automated by robots
- Common experimental themes
  - Time course (when)
  - Tissue type (where)
  - Response (under what conditions)
  - Perturbation: mutation/knockout, knock-in
  - Over-expression
24. Reverse transcription
Color: Cy3 (green), Cy5 (red)
25. Example 1
5 min
- Comparative expression
- Normal versus cancer cells
- ALL versus AML
E. Lander's group at MIT
26. PART III. Statistics
- Low-level analysis
- Comparative expression
Feature extraction
Clustering/classification
Pearson correlation
Liquid association
27. Issues related to image quality (not to be covered)
- Convert an image into a number representing the ratio of the levels of expression between the red and green channels
- Color bias
- Spatial, tip, spot effects
- Background noise
- cDNA, oligonucleotide arrays
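As a rough illustration of the first bullet, the sketch below turns per-spot red and green intensities into a log2 ratio after subtracting a background level. The function name `log_ratio`, the constant background values, and the `floor` parameter (used to avoid taking the log of a non-positive number) are assumptions made for this example; real pipelines also correct for the color, spatial, and tip effects listed above, which are not handled here.

```python
import numpy as np


def log_ratio(red, green, red_bg=0.0, green_bg=0.0, floor=1.0):
    """Background-corrected log2(red / green) for each spot."""
    # Subtract background, then clip at `floor` so the ratio stays positive.
    r = np.maximum(np.asarray(red, dtype=float) - red_bg, floor)
    g = np.maximum(np.asarray(green, dtype=float) - green_bg, floor)
    return np.log2(r / g)


# Hypothetical intensities for three spots.
red = [1100.0, 300.0, 500.0]
green = [350.0, 300.0, 1700.0]
m = log_ratio(red, green, red_bg=100.0, green_bg=100.0)
print(m)
```

On the log2 scale a value of 0 means equal expression in the two channels, and a fourfold change in either direction gives +2 or -2, so up- and down-regulation are treated symmetrically.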
28. Genome-wide expression profile: a basic structure
         Cond1  Cond2  ...  Condp
Gene1    x11    x12    ...  x1p
Gene2    x21    x22    ...  x2p
...      ...    ...    ...  ...
Genen    xn1    xn2    ...  xnp
29. Cond1, Cond2, ..., Condp denote the various environmental conditions, time points, cell types, etc. under which mRNA samples are taken.
Note: numerous cells are involved.
Data quality issues:
1. Chip (manufacturer)
2. mRNA sample (user)
It is important to have a homogeneous sample so that cellular signals can be amplified.
Yeast cell-cycle data: ideally all cells are engaged in the same activities (synchronization).
30. Two-class problem
An application
ALL (acute lymphoblastic leukemia) vs. AML (acute myeloid leukemia)
31. Which genes to select?
- For each gene (row), compute a score defined by
  (sample mean of X - sample mean of Y) divided by (standard deviation of X + standard deviation of Y)
- X = ALL, Y = AML
- Genes (rows) with the highest scores are selected.
They have a method that seems to work well.
- 34 new leukemia samples
- 29 are predicted with 100% accuracy; 5 are weak-prediction cases
Seems to work! Improvement?
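The gene score described on this slide can be sketched as follows. The data are made up, the function name `signal_to_noise` is hypothetical, and two choices are assumptions of this sketch rather than something the slide specifies: the sample standard deviation uses `ddof=1`, and genes are ranked by the absolute value of the score so that genes high in either class are picked up.

```python
import numpy as np


def signal_to_noise(expr, is_all):
    """Per-gene score (mean_X - mean_Y) / (sd_X + sd_Y), with X = ALL, Y = AML.

    expr: genes x samples array; is_all: boolean mask over the samples.
    """
    expr = np.asarray(expr, dtype=float)
    is_all = np.asarray(is_all, dtype=bool)
    x, y = expr[:, is_all], expr[:, ~is_all]
    diff = x.mean(axis=1) - y.mean(axis=1)
    spread = x.std(axis=1, ddof=1) + y.std(axis=1, ddof=1)
    return diff / spread


# Toy data: 3 genes, 3 ALL samples followed by 3 AML samples.
expr = np.array([
    [5.0, 6.0, 7.0, 1.0, 2.0, 3.0],   # higher in ALL
    [2.0, 4.0, 3.0, 3.0, 2.0, 4.0],   # no class difference
    [1.0, 2.0, 3.0, 7.0, 8.0, 9.0],   # higher in AML
])
is_all = np.array([True, True, True, False, False, False])

scores = signal_to_noise(expr, is_all)
top = np.argsort(-np.abs(scores))[:2]   # indices of the two most informative genes
print(scores, top)
```

Dividing by the sum of the two standard deviations penalizes genes whose class means differ but whose within-class variation is large.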
32. Study of cell-cycle regulated genes
- Rate of cell growth and division varies
- Yeast (120 min), insect egg (15-30 min), nerve cell (no division), fibroblast (healing wounds)
- Regulation: irregular growth causes cancer
- Goal: find which genes are expressed at each stage of the cell cycle
- Yeast cells: Spellman et al. (1998)
- Fourier analysis: cyclic pattern
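One simple way to look for a cyclic pattern is to measure how strongly a gene's time course projects onto a sine and cosine at a chosen period. The sketch below is a generic illustration of that idea and is not claimed to be the exact procedure of the cited yeast study; the sampling grid, the 120-minute period (borrowed from the yeast figure above purely for illustration), and the function name `fourier_score` are assumptions.

```python
import numpy as np


def fourier_score(profile, times, period):
    """Magnitude of the Fourier component of a centered profile at one period."""
    x = np.asarray(profile, dtype=float)
    t = np.asarray(times, dtype=float)
    x = x - x.mean()                      # remove the constant level
    omega = 2.0 * np.pi / period
    a = np.sum(x * np.sin(omega * t))     # projection on the sine
    b = np.sum(x * np.cos(omega * t))     # projection on the cosine
    return np.hypot(a, b) / len(x)


times = np.arange(0.0, 120.0, 10.0)           # hypothetical: every 10 min
cyclic = np.sin(2.0 * np.pi * times / 120.0)  # one full cycle
flat = np.full_like(times, 0.5)               # constant expression

score_cyclic = fourier_score(cyclic, times, period=120.0)
score_flat = fourier_score(flat, times, period=120.0)
print(score_cyclic, score_flat)
```

Using both the sine and cosine projections makes the score insensitive to the phase of the oscillation, so genes peaking at different stages of the cycle can all score highly.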
33. Yeast Cell Cycle (adapted from Molecular Cell Biology, Darnell et al.)
Most visible event
34. Example of the time curve: histone genes
(HHT2) ORF: YNL031C
Time course
Histone
35. EBP2: YKL172W
TSM1: YCR042C
YOR263C
37. Why does clustering make sense biologically?
The rationale is:
Genes with a high degree of expression similarity are likely to be functionally related and may participate in common pathways. They may be co-regulated by common upstream regulatory factors.
Rationale behind massive gene expression analysis
Simply put,
Profile similarity implies functional association
38. Some protein complexes
Proteins rarely work as single units
39. Gene profiles and correlation
- Pearson's correlation coefficient, a simple way of describing the strength of linear association between a pair of random variables, has become the most popular measure of gene expression similarity.
- 1. Cluster analysis: average linkage, self-organizing map, K-means, ...
- 2. Classification: nearest neighbor, linear discriminant analysis, support vector machine, ...
- 3. Dimension reduction methods: PCA (SVD)
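To make the similarity measure concrete, here is a minimal sketch that computes Pearson's correlation between every pair of gene profiles (rows of the expression matrix). The toy matrix and the function name `profile_correlation` are made up for illustration; the result should agree with NumPy's built-in `np.corrcoef`. A gene with a constant profile has zero norm after centering and would need special handling, which this sketch omits.

```python
import numpy as np


def profile_correlation(expr):
    """Pearson correlation between all pairs of gene profiles (rows)."""
    expr = np.asarray(expr, dtype=float)
    centered = expr - expr.mean(axis=1, keepdims=True)
    # Scale each centered profile to unit length; correlations are then dot products.
    unit = centered / np.linalg.norm(centered, axis=1, keepdims=True)
    return unit @ unit.T


expr = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],   # same shape as gene 0, different scale
    [4.0, 3.0, 2.0, 1.0],   # mirror image of gene 0
])
corr = profile_correlation(expr)
print(np.round(corr, 3))
```

Because each profile is centered and scaled, the measure compares the shape of two profiles and ignores differences in overall level or amplitude.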
40. The correlation coefficient (CC) has been used by Gauss, Bravais, and Edgeworth.
Its sweeping impact in data analysis is due to Galton (1822-1911), "Typical laws of heredity in man".
Karl Pearson modified and popularized its use.
It is a building block in multivariate analysis, of which clustering, classification, and dimension reduction are recurrent themes.
As a statistician, how can you ignore the time order? (Isn't it true that the use of sample correlation relies on the assumption that the data are i.i.d.?)
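The question about time order can be made concrete with a small numerical check: the sample correlation between two profiles is unchanged when the time points are shuffled (as long as both profiles are shuffled the same way), even though the shuffling destroys the serial structure within each profile. The simulated profiles, the seed, and the helper `lag1_autocorr` are assumptions made for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(18)
# Two simulated cyclic profiles with a little noise (period of 9 time points).
x = np.sin(2 * np.pi * t / 9) + 0.1 * rng.standard_normal(t.size)
y = np.cos(2 * np.pi * t / 9) + 0.1 * rng.standard_normal(t.size)

perm = rng.permutation(t.size)            # one shuffle applied to both profiles
r_ordered = np.corrcoef(x, y)[0, 1]
r_shuffled = np.corrcoef(x[perm], y[perm])[0, 1]


def lag1_autocorr(v):
    """Correlation between a series and itself shifted by one time step."""
    v = v - v.mean()
    return np.sum(v[:-1] * v[1:]) / np.sum(v * v)


print(r_ordered, r_shuffled)                       # identical up to rounding
print(lag1_autocorr(x), lag1_autocorr(x[perm]))    # time structure is not preserved
```

The first pair of numbers agrees, which is exactly the sense in which Pearson correlation ignores the time order; the lag-1 autocorrelation shows that the ordered series is far from independent across time.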
41. Other methods for finding gene clusters
- Bayesian clustering: normal mixture, (hidden) indicator
- PCA plot, projection pursuit, grand tour
- Multidimensional scaling (bi-plot for categorical responses, showing both cases (genes) and variables (different clustering methods), displaying results from many different clustering procedures)
- Generalized association plot (Chen 2001, Statistica Sinica)
- PLAID model (Lazzeroni and Owen, Statistica Sinica 2002)
43. 1st PCA direction
2nd PCA direction
3rd PCA direction
Eigenvalues
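The PCA directions and eigenvalues shown on this slide can be obtained from a singular value decomposition. The sketch below is one common way to do it, under the assumption that genes are treated as observations (rows) and conditions as variables (columns); the simulated data and the function name `pca` are made up for illustration and do not reproduce the lecture's figures.

```python
import numpy as np


def pca(expr, n_components=3):
    """PCA of a genes x conditions matrix via SVD of the column-centered data."""
    expr = np.asarray(expr, dtype=float)
    centered = expr - expr.mean(axis=0, keepdims=True)
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    eigenvalues = s ** 2 / (expr.shape[0] - 1)   # eigenvalues of the sample covariance
    directions = vt[:n_components]               # rows are unit-length PCA directions
    scores = centered @ directions.T             # coordinates of each gene
    return directions, eigenvalues, scores


rng = np.random.default_rng(1)
# Simulated data: 200 "genes", 6 "conditions" with unequal spread.
expr = rng.standard_normal((200, 6)) * np.array([5.0, 2.0, 1.0, 0.5, 0.5, 0.5])

directions, eigenvalues, scores = pca(expr)
print(np.round(eigenvalues, 2))
```

The eigenvalues come out in decreasing order, so a plot of them shows how much variation each successive direction explains.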
44. Phase Assignment
[Figure: gene counts by assigned cell-cycle phase (G1, S, S/G2, G2/M, M/G1), shown in two panels labeled "Smooth" and "Non-smooth"]
45. ARG1
Glutamate
ARG2
Book a flight from LA to KEGG, Japan, in less than 10 seconds
46. ARG1
8th place negative
Y
Head
X
Compute LA(X,Y|Z) for all Z
Backdoor
Rank and find leading genes
Adapted from KEGG
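The "compute LA(X,Y|Z) for all Z, then rank" step can be sketched as follows. The sketch assumes the usual estimator for liquid association: with g(z) = E[XY | Z = z], LA(X,Y|Z) = E[g'(Z)], and when Z is standard normal Stein's lemma gives E[g'(Z)] = E[XYZ], so after a normal-score transform of each profile the estimate is simply the average of the product of the three transformed profiles. The function names, the simulated data, and the tie handling are assumptions of this example, not the lecture's code.

```python
import numpy as np
from statistics import NormalDist


def normal_scores(v):
    """Rank-based normal-score transform, rescaled to mean 0 and variance 1."""
    v = np.asarray(v, dtype=float)
    ranks = np.argsort(np.argsort(v)) + 1            # ranks 1..n (ties broken by order)
    z = np.array([NormalDist().inv_cdf(r / (len(v) + 1)) for r in ranks])
    return (z - z.mean()) / z.std()


def liquid_association(x, y, z):
    """Estimate LA(X,Y|Z) as the mean of the product of transformed profiles."""
    return float(np.mean(normal_scores(x) * normal_scores(y) * normal_scores(z)))


def rank_mediators(x, y, candidates):
    """Score every candidate Z and order them by |LA|, largest first."""
    la = np.array([liquid_association(x, y, z) for z in candidates])
    return np.argsort(-np.abs(la)), la


rng = np.random.default_rng(2)
n = 500
z = rng.standard_normal(n)
x = rng.standard_normal(n)
y = np.sign(z) * x + 0.3 * rng.standard_normal(n)   # x-y correlation flips sign with z
w = rng.standard_normal(n)                          # unrelated candidate

overall = np.corrcoef(x, y)[0, 1]                   # close to 0: correlation misses it
order, la = rank_mediators(x, y, np.vstack([w, z]))
print(round(overall, 3), np.round(la, 3), order)
```

In this simulation the overall correlation between x and y is near zero because the positive and negative regimes cancel, while the liquid association score singles out z as the variable that switches the relationship.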
47. Coverage of bioinformatics by areas and topics
Sequence analysis
Linkage, pedigree
Microarray
DNA RNA Protein EST Drug
Evolution
Functional prediction
SNP
Alternative splicing
System modeling
Pathway discovery
Promoter
Motif
Domain
Drug-gene-protein
3-D structure
Protein-protein
Protein-gene
TRANSFAC
48. Coverage of bioinformatics by expertise (hat, not person)
Computer scientist
Statistician/mathematician
Biologist
(raw data provider)
(huge data volume)
Oil-refining
(Crude oil)
(Noise, garbage, or ignorance?)
Make researchers' lives easier (pipeline)
Data cleaning
Data mining
Pattern searching /comparison
(Bio-information distilling/ Bio-data refining)
Web page browsing
Literature searching
Physical/math/prob/stat models, computer optimization
Database/visualization
Generalization/inference
Gene Ontology
49. Math modeling: a nightmare
Current
Next
mRNA
FITNESS
mRNA
mRNA
Observed
protein kinase
hidden
ATP, GTP, cAMP, etc.
Cytoplasm, nucleus, mitochondria, vacuolar localization
FUNCTION
Statistical methods become useful
DNA methylation, chromatin structure
Nutrients: carbon, nitrogen sources
Temperature
Water
50. Bioinformatics (knowledge integration center)
- When
- Where
- Who
- What
- Why
- Cell level
- Organ level
- Organism level
- Species level
- Ecology system level
51. Special issue on bioinformatics
Want to get a quick start?
- Statistica Sinica
- January 2002
My paper on liquid association: PNAS 2002, 99, 16875-16880
"Genome-wide co-expression dynamics: theory and application"
Classification: Biological Sciences (Genetics); Physical Sciences (Statistics)
52. END
53. Cautionary notes for seriation and row-column sorting
- Hierarchical clustering is popular, but
- Sharp boundaries may be artifacts due to clever permutation
- How to fine-tune user-specified parameters? Some theoretical guidance is needed
- What is a cluster? Criteria needed
54. Popular methods for clustering/data mining
- Linkage: Eisen et al., Alon et al.
- K-means: Tavazoie et al.
- Self-organizing map: Tamayo et al.
- SVD: Holter et al.; Alter, Brown, Botstein
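For readers who have not seen linkage clustering written out, here is a deliberately naive sketch of average-linkage agglomerative clustering with 1 minus Pearson correlation as the distance. It is not the implementation used in any of the cited papers; the simulated profiles and the function name `average_linkage` are assumptions, and the quadratic search over cluster pairs is only suitable for small examples.

```python
import numpy as np


def average_linkage(expr, n_clusters):
    """Naive agglomerative clustering of rows, distance = 1 - Pearson correlation."""
    dist = 1.0 - np.corrcoef(np.asarray(expr, dtype=float))
    clusters = [[i] for i in range(dist.shape[0])]
    while len(clusters) > n_clusters:
        best = (np.inf, 0, 1)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # average pairwise distance between members of the two clusters
                d = dist[np.ix_(clusters[a], clusters[b])].mean()
                if d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]   # merge the closest pair
        del clusters[b]
    labels = np.empty(dist.shape[0], dtype=int)
    for k, members in enumerate(clusters):
        labels[members] = k
    return labels


rng = np.random.default_rng(3)
t = np.arange(12)
up = np.sin(2 * np.pi * t / 12)
# Four noisy copies of one pattern and four noisy copies of its mirror image.
expr = np.vstack([up + 0.1 * rng.standard_normal(12) for _ in range(4)] +
                 [-up + 0.1 * rng.standard_normal(12) for _ in range(4)])

labels = average_linkage(expr, n_clusters=2)
print(labels)
```

Note that `n_clusters` is a user-specified parameter: the procedure returns that many groups whether or not the data contain real clusters, which is one reason criteria for what counts as a cluster are needed.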
55. Can statisticians take the lead?
- Difficult
- But not impossible
- The key
- Willingness to learn more biology
February 2002: talk at UCLA Biochemistry, feedback from David Eisenberg
March 2002: David gave an inspiring review talk about several of his works (Nature, similarity)
schematic illustration