Provided by: statUcla

Learn more at: http://www.stat.ucla.edu

1
STAT 254 - Lecture 1: An overview

Cell biology, microarray, statistics
Bioinformatics and Statistics
Topics to cover
Keep a skeptical eye on everything you read or hear
Keep an eye on the bigger picture while working on specifics
The shaping of bioinformatics falls on your shoulders
What to take home: not just microarray or high-throughput data analysis methods, but a set of skills and ways of thinking about quantitative biology

2
Exploratory data analysis: multivariate, high-dimensional
20 min
3
Study of Gene Expression: Statistics, Biology, and Microarrays
IMS ENAR Conference
Time: March 31, 2003
Place: Tampa, FL

Ker-Chau Li
Statistics Department
UCLA
kcli_at_stat.ucla.edu

4
Outline

Review of cell biology
Microarray gene expression data collection
Cell-cycle gene expression (Main Data set)
PCA/Nested regression SIR (Dim. red.)
Similarity analysis - clustering (Why Popular?)
Liquid association
Closing remarks

New statistical concept, fueled by Stein's lemma
Justification for IMS
5
PART I. Cellular Biology

Macromolecules: DNA, mRNA, protein

6
Why is biology hot?
Because of
7
Human Genome Project
Begun in 1990, the U.S. Human Genome Project is a 13-year effort coordinated by the U.S. Department of Energy and the National Institutes of Health. The project was originally planned to last 15 years, but effective resource and technological advances have accelerated the expected completion date to 2003. Project goals are to
identify all of the approximately 30,000 genes in human DNA,
determine the sequences of the 3 billion chemical base pairs that make up human DNA,
store this information in databases,
improve tools for data analysis,
transfer related technologies to the private sector, and
address the ethical, legal, and social issues (ELSI) that may arise from the project.
Recent Milestones
June 2000: completion of a working draft of the entire human genome
February 2001: analyses of the working draft are published
Human Genome Program, U.S. Department of Energy, Genomics and Its Impact on Medicine and Society: A 2001 Primer, 2001
8
Future Challenges: What We Still Don't Know
Predicted vs. experimentally determined gene function (1)
Gene regulation (upstream regulatory region) (2)
Coordination of gene expression, protein synthesis, and post-translational events (3)
Gene number, exact locations, and functions
DNA sequence organization
Chromosomal structure and organization
Noncoding DNA types, amount, distribution, information content, and functions
Interaction of proteins in complex molecular machines
Evolutionary conservation among organisms
Protein conservation (structure and function)
Proteomes (total protein content and function) in organisms
Correlation of SNPs (single-base DNA variations among individuals) with health and disease
Disease-susceptibility prediction based on gene sequence variation
Genes involved in complex traits and multigene diseases
Complex systems biology, including microbial consortia useful for environmental restoration
Developmental genetics, genomics
9
Medicine and the New Genomics

Gene Testing
Gene Therapy
Pharmacogenomics

Anticipated Benefits

improved diagnosis of disease
earlier detection of genetic predispositions to disease
rational drug design
gene therapy and control systems for drugs
personalized, custom drugs

10
Anticipated Benefits
Agriculture, Livestock Breeding, and Bioprocessing
disease-, insect-, and drought-resistant crops
healthier, more productive, disease-resistant farm animals
more nutritious produce
biopesticides
edible vaccines incorporated into food products
new environmental cleanup uses for plants like tobacco
11
How does the cell work?

The guiding principle is the so-called

Central dogma of cell biology
17
Gene to protein: 4 nucleotides and 20 amino acids
Protein is synthesized from amino acids by the ribosome
18
Gene to Protein
The mediator: mRNA
Transcription
Translation
19
Transcription and translation
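
The gene-to-protein flow on the preceding slides can be sketched in a few lines. This is only an illustration: the function names `transcribe` and `translate` and the 9-base example sequence are made up here, and the sketch ignores introns and other processing steps. The codon table is the standard genetic code, in which 64 codons (4 x 4 x 4) encode 20 amino acids plus stop signals.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon in TCAG order; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Build the table on RNA codons (U instead of T).
CODON_TABLE = {
    "".join(codon).replace("T", "U"): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def transcribe(coding_dna):
    """Transcription: the mRNA copies the coding strand with T replaced by U."""
    return coding_dna.upper().replace("T", "U")

def translate(mrna):
    """Translation: read codons from the first AUG until a stop codon."""
    start = mrna.find("AUG")
    if start < 0:
        return ""
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        aa = CODON_TABLE[mrna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

mrna = transcribe("ATGGCCTAA")   # hypothetical 9-base "gene"
protein = translate(mrna)
print(mrna, protein)             # AUGGCCUAA MA
```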
20
PART II. Microarray

Genome-wide expression profiling

21
Exploring the Metabolic and Genetic Control of Gene Expression on a Genomic Scale
Joseph L. DeRisi, Vishwanath R. Iyer, Patrick O. Brown
22
Microarray
23
Microarray

Allows measuring the mRNA level of thousands of genes in one experiment: a system-level response
The data generation can be fully automated by robots
Common experimental themes:

Time course (when)
Tissue type (where)
Response (under what conditions)
Perturbation: mutation/knockout, knock-in, over-expression

24
Reverse-transcription
Color: Cy3 (green), Cy5 (red)
25
Example 1
5 min

Comparative expression
Normal versus cancer cells
ALL versus AML

E. Lander's group at MIT
26
PART III. Statistics

Low-level analysis
Comparative expression

Feature extraction
Clustering/classification
Pearson correlation
Liquid association
27
Issues related to image qualities
(not to be covered)

Convert an image into a number representing the ratio of the expression levels between the red and green channels
Color bias
Spatial, tip, and spot effects
Background noise
cDNA and oligonucleotide arrays
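
Although image processing is not covered in the lecture, the end product named above, one number per spot comparing the red and green channels, can be illustrated with a minimal sketch. The function `log_ratio`, its background-subtraction step, the `floor` guard, and the intensity values are all assumptions made for this example; real pipelines also correct for the color and spatial biases listed on the slide.

```python
import math

def log_ratio(red, green, red_bg=0.0, green_bg=0.0, floor=1.0):
    """Background-corrected log2(red / green) for one spot.

    `floor` is a hypothetical guard that keeps corrected intensities positive.
    """
    r = max(red - red_bg, floor)
    g = max(green - green_bg, floor)
    return math.log2(r / g)

# Hypothetical spot: red (Cy5) foreground 2100, green (Cy3) foreground 600,
# with a local background of 100 in each channel.
m = log_ratio(2100, 600, red_bg=100, green_bg=100)
print(m)  # 2.0, i.e. a four-fold higher signal in the red channel
```

Working on the log scale makes a four-fold increase (+2) and a four-fold decrease (-2) symmetric around zero.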

28
Genome-wide expression profile: a basic structure

         cond1  cond2  ...  condp
Gene1    x11    x12    ...  x1p
Gene2    x21    x22    ...  x2p
...      ...    ...    ...  ...
Genen    xn1    xn2    ...  xnp
29
cond1, cond2, ..., condp denote the various environmental conditions, time points, cell types, etc. under which mRNA samples are taken
Note: numerous cells are involved
Data quality issues:
1. chip (manufacturer)
2. mRNA sample (user)
It is important to have a homogeneous sample so that cellular signals can be amplified
Yeast cell cycle data: ideally all cells are engaged in the same activities (synchronization)
30
Two-class problem
An application:
ALL (acute lymphoblastic leukemia) vs. AML (acute myeloid leukemia)
31
Which Genes to select?

For each gene (row), compute a score defined by
(sample mean of X - sample mean of Y)
divided by
(standard deviation of X + standard deviation of Y),
where X = ALL and Y = AML.
Genes (rows) with the highest scores are selected.
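
The gene-selection score on this slide can be sketched as follows. The sketch reads the denominator as the sum of the two standard deviations (the signal-to-noise form commonly used for this ALL/AML comparison); that reading, the helper names `signal_to_noise` and `rank_genes`, and the toy expression values are assumptions for illustration only.

```python
from statistics import mean, stdev

def signal_to_noise(x, y):
    """(mean(x) - mean(y)) / (sd(x) + sd(y)) for one gene."""
    return (mean(x) - mean(y)) / (stdev(x) + stdev(y))

def rank_genes(expr, labels, top=2):
    """expr maps gene name -> expression values, one per sample;
    labels gives the class ("ALL" or "AML") of each sample."""
    scores = {}
    for gene, values in expr.items():
        x = [v for v, lab in zip(values, labels) if lab == "ALL"]
        y = [v for v, lab in zip(values, labels) if lab == "AML"]
        scores[gene] = signal_to_noise(x, y)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:top], scores

# Made-up toy data: 3 ALL samples followed by 3 AML samples.
labels = ["ALL"] * 3 + ["AML"] * 3
expr = {
    "geneA": [9.0, 10.0, 11.0, 1.0, 2.0, 3.0],   # high in ALL
    "geneB": [5.0, 6.0, 4.0, 5.0, 4.0, 6.0],     # no class difference
    "geneC": [1.0, 2.0, 3.0, 9.0, 10.0, 11.0],   # high in AML
}
selected, scores = rank_genes(expr, labels)
print(selected, scores)
```

Sorting by the signed score favors genes that are high in ALL; to pick up genes such as geneC that are high in AML, one could sort by the absolute value or take the most negative scores as a second list.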

They have a method
that seems to work well.

34 new leukemia samples:
29 are predicted with 100% accuracy; 5 are weak prediction cases

Seems to work! Improvement?
32
Study of cell-cycle regulated genes

Rate of cell growth and division varies:
yeast (120 min), insect egg (15-30 min), nerve cell (no division), fibroblast (healing wounds)
Regulation: irregular growth causes cancer
Goal: find which genes are expressed at each stage of the cell cycle
Yeast cells: Spellman et al. (2000)
Fourier analysis: cyclic pattern
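
One way to look for a cyclic pattern with Fourier analysis is to project each gene's time course onto a sine and cosine at the cell-cycle frequency. The sketch below is a generic single-frequency projection, not necessarily the scoring used in the cited yeast study; the function name `fourier_score`, the 120-minute period, and the synthetic time courses are assumptions for illustration.

```python
import math

def fourier_score(times, values, period):
    """Amplitude and phase of the component of `values` at the given period.

    Projects the mean-centered time course onto cos and sin of 2*pi*t/period.
    """
    n = len(values)
    center = sum(values) / n
    omega = 2.0 * math.pi / period
    a = sum((v - center) * math.cos(omega * t) for t, v in zip(times, values))
    b = sum((v - center) * math.sin(omega * t) for t, v in zip(times, values))
    amplitude = 2.0 * math.hypot(a, b) / n
    phase = math.atan2(b, a)          # radians; the peak occurs at t = phase / omega
    return amplitude, phase

# Synthetic example: 24 samples, 10 minutes apart, spanning two 120-minute cycles.
times = [10 * i for i in range(24)]
cyclic = [3.0 * math.cos(2 * math.pi * (t - 30) / 120) for t in times]
flat = [1.0, -1.0] * 12               # fast alternation, no 120-minute component

amp_cyclic, phase = fourier_score(times, cyclic, period=120)
amp_flat, _ = fourier_score(times, flat, period=120)
peak_time = phase / (2 * math.pi / 120)
print(amp_cyclic, peak_time, amp_flat)
```

A large amplitude flags a candidate cell-cycle regulated gene, and the phase gives a rough estimate of when in the cycle its expression peaks.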

33
Yeast Cell Cycle (adapted from Molecular Cell Biology, Darnell et al.)
Most visible event
34
Example of a time curve: histone genes
(HHT2) ORF: YNL031C
[Figure: time course of a histone gene]
35
EBP2 YKL172W
TSM1 YCR042C
YOR263C
37
Why does clustering make sense biologically?
The rationale is:
Genes with a high degree of expression similarity are likely to be functionally related and may participate in common pathways. They may be co-regulated by common upstream regulatory factors.
Rationale behind massive gene expression analysis
Simply put,
profile similarity implies functional association
38
Some protein complexes
Proteins rarely work as single units
39
Gene profiles and correlation

Pearson's correlation coefficient, a simple way of describing the strength of linear association between a pair of random variables, has become the most popular measure of gene expression similarity.
1. Cluster analysis: average linkage, self-organizing map, K-means, ...
2. Classification: nearest neighbor, linear discriminant analysis, support vector machine, ...
3. Dimension reduction methods: PCA (SVD)
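
As a small illustration of Pearson's correlation used as an expression-similarity measure, the sketch below computes the coefficient between gene profiles and ranks genes by similarity to a query gene. The gene names, profile values, and the helper `most_similar` are made up for the example.

```python
import math

def pearson(x, y):
    """Sample Pearson correlation between two expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def most_similar(profiles, query):
    """Rank the other genes by correlation with `query` (highest first)."""
    others = [g for g in profiles if g != query]
    return sorted(others, key=lambda g: pearson(profiles[query], profiles[g]),
                  reverse=True)

# Made-up profiles over five conditions.
profiles = {
    "g1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "g2": [2.1, 4.0, 6.2, 7.9, 10.0],   # roughly 2 * g1: co-expressed
    "g3": [5.0, 4.0, 3.0, 2.0, 1.0],    # mirror image of g1: anti-correlated
    "g4": [3.0, 1.0, 3.0, 1.0, 3.0],    # unrelated pattern
}
print(most_similar(profiles, "g1"))
print(pearson(profiles["g1"], profiles["g3"]))
```

The clustering and nearest-neighbor methods listed above typically start from a matrix of such pairwise correlations (or a distance such as 1 minus the correlation).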

40
The correlation coefficient (CC) has been used by Gauss, Bravais, and Edgeworth
Its sweeping impact in data analysis is due to Galton (1822-1911), "Typical laws of heredity in man"
Karl Pearson modified and popularized its use
A building block in multivariate analysis, of which clustering, classification, and dimension reduction are recurrent themes
As a statistician, how can you ignore the time order? (Isn't it true that the use of sample correlation relies on the assumption that the data are i.i.d.?)

41
Other methods for finding gene clusters

Bayesian clustering: normal mixture, (hidden) indicator
PCA plot, projection pursuit, grand tour
Multidimensional scaling (bi-plot for categorical responses, showing both cases (genes) and variables (different clustering methods), displaying results from many different clustering procedures)
Generalized association plot (Chen 2001, Statistica Sinica)
PLAID model (Lazzeroni and Owen 2002, Statistica Sinica)

43
1st PCA direction
2nd PCA direction
3rd PCA direction
Eigenvalues
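
The PCA directions and eigenvalues named on this slide are the eigenvectors and eigenvalues of the sample covariance matrix of the expression data. The sketch below finds them by power iteration with deflation, a simple stand-in chosen here so the example needs no external library; in practice an SVD routine would be used. The function names and the tiny two-column data set are assumptions for illustration.

```python
import math

def covariance(rows):
    """Sample covariance matrix of the columns of `rows`."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    return [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(p)]
            for i in range(p)]

def leading_eigen(mat, iters=200):
    """Largest eigenvalue and eigenvector of a symmetric PSD matrix (power iteration)."""
    p = len(mat)
    v = [1.0 + 0.1 * j for j in range(p)]          # arbitrary start vector
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iters):
        w = [sum(mat[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:                            # nothing left to extract
            break
        v = [x / norm for x in w]
    value = sum(v[i] * mat[i][j] * v[j] for i in range(p) for j in range(p))
    return value, v

def pca(rows, k):
    """First k (eigenvalue, direction) pairs of the covariance matrix."""
    mat = covariance(rows)
    p = len(mat)
    out = []
    for _ in range(k):
        value, v = leading_eigen(mat)
        out.append((value, v))
        # Deflation: subtract the component just found.
        mat = [[mat[i][j] - value * v[i] * v[j] for j in range(p)]
               for i in range(p)]
    return out

# Toy data: four "genes" measured under two conditions, spread mostly along (1, 1).
rows = [[0.1, -0.1], [0.9, 1.1], [1.9, 2.1], [3.1, 2.9]]
(ev1, d1), (ev2, d2) = pca(rows, 2)
print(ev1, d1)   # about 3.333 along (0.707, 0.707)
print(ev2, d2)   # about 0.027 along the perpendicular direction
```

The ratio of each eigenvalue to the sum of all eigenvalues gives the share of variance explained by that direction.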
44
Phase Assignment
[Figure: gene counts by cell-cycle phase (G1, S, S/G2, G2/M, M/G1), smooth vs. non-smooth]
45
[Figure: ARG1, glutamate, ARG2]
Book a flight from LA to KEGG, Japan, in less than 10 seconds
46
[Figure, adapted from KEGG: pathway diagram with labels ARG1 (8th place, negative), X, Y, Head, Backdoor]
Compute LA(X,Y|Z) for all Z
Rank and find leading genes
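
The "compute LA(X,Y|Z) for all Z, then rank" step can be sketched as below. The sketch assumes the estimator is the average of the triple product X*Y*Z after standardizing X and Y and replacing Z by its normal scores; this follows from Stein's lemma, E[g'(Z)] = E[Z g(Z)] for standard normal Z, applied to g(z) = E[XY | Z = z]. The function names, the toy profiles, and the choice to standardize (rather than normal-score) X and Y are assumptions of this example.

```python
from statistics import NormalDist, mean, pstdev

def standardize(x):
    """Center to mean 0 and scale to standard deviation 1."""
    m, s = mean(x), pstdev(x)
    return [(v - m) / s for v in x]

def normal_scores(z):
    """Replace each value by the normal quantile of its rank (ties not handled)."""
    n = len(z)
    order = sorted(range(n), key=lambda i: z[i])
    scores = [0.0] * n
    for rank, i in enumerate(order, start=1):
        scores[i] = NormalDist().inv_cdf(rank / (n + 1))
    return scores

def liquid_association(x, y, z):
    """Average of X*Y*Z with X, Y standardized and Z in normal scores."""
    xs, ys, zs = standardize(x), standardize(y), normal_scores(z)
    return mean([a * b * c for a, b, c in zip(xs, ys, zs)])

def rank_mediators(x, y, candidates):
    """Score every candidate Z and sort by absolute LA, largest first."""
    scores = {name: liquid_association(x, y, z) for name, z in candidates.items()}
    return sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)

# Toy profiles: X and Y are uncorrelated overall, but move in opposite
# directions when "switch" is low and together when "switch" is high.
x = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
y = [-1.0, 1.0, -1.0, 1.0, 1.0, -1.0, 1.0, -1.0]
candidates = {
    "switch": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    "bystander": [1.0, 8.0, 4.0, 5.0, 2.0, 7.0, 3.0, 6.0],
}
ranked = rank_mediators(x, y, candidates)
print(ranked)
```

A positive score means the correlation between X and Y increases as Z increases; a negative score means it decreases.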
47
Coverage of bioinformatics by areas and topics
Sequence analysis
Linkage, pedigree
Microarray
DNA RNA Protein EST Drug
Evolution
Functional prediction
SNP
Alternative splicing
System modeling
Pathway discovery
Promoter
Motif
Domain
Drug-gene-protein
3-D structure
Protein-protein
Protein-gene
TRANSFAC
48
Coverage of Bioinformatics by expertise (hat,
not person)
Computer scientist
Statistician/mathematician
Biologist
(raw data provider)
(huge data volume)
Oil-refining
(Crude oil)
(Noise, garbage, or ignorance?)
Make researchers' lives easier (pipeline)
Data cleaning
Data mining
Pattern searching/comparison
(Bio-information distilling / bio-data refining)
Web page browsing
Literature searching
Physical/math/prob/stat models, computer optimization
Database/visualization
Generalization/inference
Gene Ontology
49
Math. modeling: a nightmare
Current
Next
mRNA
FITNESS
mRNA
mRNA
Observed
protein kinase
hidden
ATP, GTP, cAMP, etc.
Cytoplasm, nucleus, mitochondria, vacuolar localization
FUNCTION
Statistical methods become useful
DNA methylation, chromatin structure
Nutrients: carbon, nitrogen sources
Temperature
Water
50
Bioinformatics (knowledge integration center)

When
Where
Who
What
Why
Cell level
Organ level
Organism level
Species level
Ecology system level

51
Special issue on bioinformatics
Want to get a quick start?

Statistica Sinica
January 2002

My paper on liquid association: PNAS 2002, 99, 16875-16880
"Genome-wide co-expression dynamics: theory and application"
Classification: Biological Science, Genetics; Physical Science, Statistics
52
END
53
Cautionary Notes for Seriation and row-column
sorting

Hierarchical clustering is popular, but
sharp boundaries may be artifacts due to clever permutation
How to fine-tune user-specified parameters? Some theoretical guidance is needed
What is a cluster? Criteria needed

54
Popular methods for clustering/data mining

Linkage: Eisen et al., Alon et al.
K-means: Tavazoie et al.
Self-organizing map: Tamayo et al.
SVD: Holter et al.; Alter, Brown, Botstein

55
Can statisticians take the lead?

Difficult, but not impossible
The key: willingness to learn more biology

February 2002: talk at UCLA Biochemistry, feedback from David Eisenberg
March 2002: David gave an inspiring review talk about several of his works (Nature, similarity)
schematic illustration
