Title: BioInformatics (3)
1BioInformatics (3)
2Computational Issues
- Data Warehousing
  - Organising Biological Information into a Structured Entity (World's Largest Distributed DB)
- Function Analysis (Numerical Analysis)
  - Gene Expression Analysis: applying sophisticated data mining/visualisation to understand gene activities within an environment (Clustering)
  - Integrated Genomic Study: relating structural analysis with functional analysis
- Structure Analysis (Symbolic Analysis)
  - Sequence Alignment: analysing a sequence using comparative methods against existing databases to develop hypotheses concerning relatives (genetics) and functions (Dynamic Programming and HMM)
  - Structure Prediction: predicting the 3D structure of a protein from its sequence (Inductive LP)
3Data Warehousing Mapping Biologic into Data
Logic
4Structure Analysis Alignments Scores
- Local (motif): ACCACACA vs ACACCATA, Score = 4(+1) = 4
- Global (e.g. haplotype): ACCACACA vs ACACCATA, Score = 5(+1) + 3(-1) = 2
- Suffix (shotgun assembly): ACCACACA vs ACACCATA, Score = 3(+1) = 3
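The three scores on this slide can be reproduced with a small sketch. It assumes ungapped alignments, +1 per match and -1 per mismatch, and an exact suffix/prefix overlap for the shotgun-assembly case; the function names are illustrative and not taken from the slides.

```python
def global_score(a, b, match=1, mismatch=-1):
    """Score an ungapped end-to-end alignment of two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("ungapped global scoring needs equal lengths")
    return sum(match if x == y else mismatch for x, y in zip(a, b))


def local_score(a, b, match=1, mismatch=-1):
    """Best-scoring ungapped segment over every relative offset of a and b."""
    best = 0
    for offset in range(-(len(a) - 1), len(b)):
        running = 0  # restart the running score on each diagonal
        for i in range(len(a)):
            j = i + offset
            if 0 <= j < len(b):
                step = match if a[i] == b[j] else mismatch
                running = max(0, running + step)
                best = max(best, running)
    return best


def suffix_score(a, b, match=1):
    """Score of the longest exact overlap: suffix of a against prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a[-k:] == b[:k]:
            return k * match
    return 0


s1, s2 = "ACCACACA", "ACACCATA"
print(global_score(s1, s2), local_score(s1, s2), suffix_score(s1, s2))  # 2 4 3
```

Gapped alignment by dynamic programming would need a gap penalty, which the slide does not give, so it is left out here.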
5A comparison of the homology search and the motif
search for functional interpretation of sequence
information.
[Figure: two flowcharts. Homology Search: new sequence, retrieval against the sequence database (primary data), similar sequence, expert knowledge, sequence interpretation. Motif Search: knowledge acquisition builds a motif library (empirical rules); new sequence, inference, expert knowledge, sequence interpretation.]
6 Search and learning problems in sequence analysis
7(Whole genome) Gene Expression Analysis
- Quantitative Analysis of Gene Activities
(Transcription Profiles)
Gene Expression
8Biotinylated RNA from experiment
Each probe cell contains millions of copies of a
specific oligonucleotide probe
GeneChip expression analysis probe array
Streptavidin-phycoerythrin conjugate
Image of hybridized probe array
9(Sub)cellular inhomogeneity
Cell-cycle differences in expression. XIST RNA
localized on inactive X-chromosome
(see figure)
10Cluster Analysis
Protein/protein complex
Genes
DNA regulatory elements
11Functional Analysis via Gene Expression
Pairwise Measures
Clustering
Motif Searching/...
12Clustering Algorithms
A clustering algorithm attempts to find natural
groups of components (or data) based on some
similarity measure. It also finds the centroid of
each group. To determine cluster membership, most
algorithms evaluate the distance between a point
and the cluster centroids. The output of a
clustering algorithm is essentially a statistical
description of the cluster centroids together with
the number of components in each cluster.
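As a minimal sketch of the membership rule just described, the snippet below assigns each point to its nearest centroid and reports each centroid with its member count. The data, the centroids and the function names are made up for illustration, and Euclidean distance is an assumption.

```python
import math
from collections import Counter


def nearest_centroid(point, centroids):
    """Index of the centroid closest to point (Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda k: math.dist(point, centroids[k]))


def summarize(points, centroids):
    """Return (centroid, member count) pairs, one per cluster."""
    counts = Counter(nearest_centroid(p, centroids) for p in points)
    return [(c, counts[k]) for k, c in enumerate(centroids)]


# Hypothetical two-dimensional data and centroids.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5)]
centroids = [(1.0, 1.0), (8.5, 9.0)]
summary = summarize(points, centroids)
print(summary)
```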
13Clusters of Two-Dimensional Data
14Key Terms in Cluster Analysis
- Distance and similarity measures
- Hierarchical and non-hierarchical
- Single/complete/average linkage
- Dendrograms and ordering
15Distance Measures Minkowski Metric
16Most Common Minkowski Metrics
17An Example
[Figure: points x and y with the lengths 3 and 4 marked.]
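The formulas on these slides did not survive extraction. The standard Minkowski metric of order r is d(x, y) = (sum_i |x_i - y_i|^r)^(1/r), with r = 1 giving the Manhattan distance, r = 2 the Euclidean distance, and r going to infinity the "sup" distance. The sketch below assumes the example refers to two points whose coordinates differ by 3 and 4; the coordinates (0, 0) and (3, 4) are chosen here for illustration.

```python
import math


def minkowski(x, y, r):
    """Minkowski distance of order r; r = math.inf gives the sup distance."""
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if math.isinf(r):
        return max(diffs)
    return sum(d ** r for d in diffs) ** (1.0 / r)


x, y = (0, 0), (3, 4)                 # assumed coordinates
manhattan = minkowski(x, y, 1)        # 3 + 4
euclidean = minkowski(x, y, 2)        # sqrt(3^2 + 4^2)
sup = minkowski(x, y, math.inf)       # max(3, 4)
print(manhattan, euclidean, sup)
```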
18 Manhattan distance is called Hamming distance
when all features are binary.
Gene Expression Levels Under 17 Conditions
(1-High,0-Low)
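The table of expression levels is not preserved, so the sketch below uses two invented 17-condition binary profiles to show that the Hamming distance (number of differing positions) coincides with the Manhattan distance when every feature is 0 or 1.

```python
def hamming(u, v):
    """Number of positions at which two equal-length vectors differ."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(a != b for a, b in zip(u, v))


def manhattan(u, v):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(u, v))


# Hypothetical profiles over 17 conditions (1 = high, 0 = low).
gene_a = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1]
gene_b = [1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0]
print(hamming(gene_a, gene_b), manhattan(gene_a, gene_b))  # 5 5
```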
19Similarity Measures Correlation Coefficient
20Similarity Measures Correlation Coefficient
[Figure: three panels plotting Expression Level against Time for Gene A and Gene B.]
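The formula itself is missing from the extracted slide. Assuming the usual Pearson correlation coefficient is meant, a minimal sketch looks like this; the profiles are invented, and they illustrate that correlation compares the shape of two expression profiles rather than their absolute levels.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles.

    Undefined (division by zero) if either profile is constant.
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


gene_a = [1.0, 2.0, 3.0, 4.0, 5.0]        # hypothetical time course
shifted = [a + 10.0 for a in gene_a]      # same shape, higher level
mirrored = [6.0 - a for a in gene_a]      # opposite trend
print(pearson(gene_a, shifted), pearson(gene_a, mirrored))
```

The shifted profile is far from gene_a in Euclidean terms yet perfectly correlated with it, which is why correlation is often preferred for comparing expression time courses.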
21Distance-based Clustering
- Assign a distance measure between data
- Find a partition such that
  - Distance between objects within a partition (i.e. same cluster) is minimized
  - Distance between objects from different clusters is maximised
- Issues
  - Requires defining a distance (similarity) measure in situations where it is unclear how to assign it
  - What relative weighting to give to one attribute vs another?
  - Number of possible partitions is super-exponential
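To give a feel for the last issue, the number of ways to split n labelled objects into k non-empty clusters is the Stirling number of the second kind, and the total over all k is the Bell number. The sketch below uses the standard recurrence; the function names are illustrative.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def stirling2(n, k):
    """Ways to partition n labelled objects into k non-empty clusters."""
    if n == k:
        return 1
    if k == 0 or k > n:
        return 0
    # Object n either joins one of the k existing clusters or starts its own.
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)


def bell(n):
    """Total number of partitions of n labelled objects."""
    return sum(stirling2(n, k) for k in range(n + 1))


for n in (5, 10, 20):
    print(n, bell(n))
```

Exhaustive search over partitions is therefore out of reach even for a few dozen genes, which motivates heuristic procedures.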
22hierarchical non-
Normalized Expression Data
23Hierarchical Clustering Techniques
24Hierarchical Clustering
Given a set of N items to be clustered, and an
NxN distance (or similarity) matrix, the basic
process of hierarchical clustering is this:
1. Start by assigning each item to its own cluster, so that if you have N items, you now have N clusters, each containing just one item. Let the distances (similarities) between the clusters equal the distances (similarities) between the items they contain.
2. Find the closest (most similar) pair of clusters and merge them into a single cluster, so that now you have one cluster fewer.
3. Compute distances (similarities) between the new cluster and each of the old clusters.
4. Repeat steps 2 and 3 until all items are clustered into a single cluster of size N.
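The four steps can be sketched directly, if naively. The version below assumes a distance (not similarity) matrix given as a list of lists, and recomputes cluster distances from the item distances with a pluggable reduction (min, max, or a mean); the matrix is invented for illustration.

```python
def agglomerate(dist, linkage=min):
    """Merge clusters until one remains; return the merge history.

    dist is an N x N symmetric distance matrix. linkage reduces the list of
    cross-cluster item distances to one number (min by default).
    """
    clusters = [(i,) for i in range(len(dist))]   # step 1: one item each
    history = []
    while len(clusters) > 1:                      # step 4: repeat
        best = None
        for a in range(len(clusters)):            # step 2: find closest pair
            for b in range(a + 1, len(clusters)):
                d = linkage([dist[i][j]
                             for i in clusters[a] for j in clusters[b]])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merged = tuple(sorted(clusters[a] + clusters[b]))
        history.append((merged, d))
        # step 3 is implicit: distances to the merged cluster are recomputed
        # from item distances on the next pass.
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return history


# Hypothetical distances between four items 0..3.
dist = [[0, 2, 5, 9],
        [2, 0, 3, 7],
        [5, 3, 0, 4],
        [9, 7, 4, 0]]
print(agglomerate(dist))               # min: items chain onto one cluster
print(agglomerate(dist, linkage=max))  # max: two pairs form first
```

This recomputation costs far more than an incremental update of the distance matrix, but it keeps the correspondence with the four steps easy to see.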
25The distance between two clusters is defined as
the distance between
- Single-Link Method / Nearest Neighbor: their closest members.
- Complete-Link / Furthest Neighbor: their furthest members.
- Centroid: their centroids.
- Average: the average of all cross-cluster pairs.
26Computing Distances
- In single-link clustering (also called the connectedness or minimum method), we consider the distance between one cluster and another cluster to be equal to the shortest distance from any member of one cluster to any member of the other cluster. If the data consist of similarities, we consider the similarity between one cluster and another cluster to be equal to the greatest similarity from any member of one cluster to any member of the other cluster.
- In complete-link clustering (also called the diameter or maximum method), we consider the distance between one cluster and another cluster to be equal to the longest distance from any member of one cluster to any member of the other cluster.
- In average-link clustering, we consider the distance between one cluster and another cluster to be equal to the average distance from any member of one cluster to any member of the other cluster.
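A small numeric illustration of these definitions, together with the centroid variant, is sketched below. The two clusters are invented, Euclidean distance is assumed, and the function names are not from the slides.

```python
import math
from statistics import mean


def cross_distances(c1, c2):
    """All distances between a member of c1 and a member of c2."""
    return [math.dist(p, q) for p in c1 for q in c2]


def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    return tuple(mean(axis) for axis in zip(*cluster))


def single_link(c1, c2):
    return min(cross_distances(c1, c2))


def complete_link(c1, c2):
    return max(cross_distances(c1, c2))


def average_link(c1, c2):
    return mean(cross_distances(c1, c2))


def centroid_link(c1, c2):
    return math.dist(centroid(c1), centroid(c2))


# Hypothetical clusters; the cross distances are 4, 6, 3 and 5.
c1 = [(0.0, 0.0), (1.0, 0.0)]
c2 = [(4.0, 0.0), (6.0, 0.0)]
print(single_link(c1, c2), complete_link(c1, c2),
      average_link(c1, c2), centroid_link(c1, c2))
```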
27Single-Link Method
[Figure: Euclidean distance matrix and single-link merging of points a, b, c, d in three steps: {a,b}, then {a,b,c}, then {a,b,c,d}.]
28Complete-Link Method
[Figure: Euclidean distance matrix and complete-link merging of points a, b, c, d in three steps: {a,b} and {c,d} form before the final cluster {a,b,c,d}.]
29Compare Dendrograms
[Figure: Single-Link and Complete-Link dendrograms side by side, with a distance axis running from 0 to 6.]
30Ordered dendrograms
- 2^(n-1) linear orderings of n elements (n = genes or conditions)
- Maximizing adjacent similarity is impractical, so order by
  - Average expression level,
  - Time of max induction, or
  - Chromosome positioning
Eisen98
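The count of orderings comes from the fact that each of the n-1 internal nodes of a binary dendrogram can be flipped independently. A minimal sketch, using a nested-tuple tree representation chosen here for illustration, enumerates them for a four-leaf tree:

```python
def orderings(tree):
    """All leaf orders obtainable by flipping children of internal nodes.

    A tree is either a leaf label (str) or a pair (left, right).
    """
    if isinstance(tree, str):
        return [[tree]]
    left, right = tree
    result = []
    for l in orderings(left):
        for r in orderings(right):
            result.append(l + r)   # children in the given order
            result.append(r + l)   # children flipped
    return result


tree = ((("a", "b"), "c"), "d")    # hypothetical dendrogram with n = 4 leaves
all_orders = orderings(tree)
print(len(all_orders))             # 8, i.e. 2 ** (4 - 1)
```

Because the list doubles at every internal node, enumerating all orderings is only feasible for tiny trees, which is why simple ordering rules are used in practice.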
31Which clustering methods do you suggest for the
following two-dimensional data?
32Nadler and Smith, Pattern Recognition
Engineering, 1993
33Problems of Hierarchical Clustering
- It is concerned more with the complete tree structure than with the optimal number of clusters.
- There is no possibility of correcting for a poor initial partition.
- Similarity and distance measures rarely have strict numerical significance.
34Non-hierarchical clustering
Normalized Expression Data
Tavazoie et al. 1999 (http://arep.med.harvard.edu)
35Clustering by K-means
- Given a set S of N p-dimensional vectors without any prior knowledge about the set, the K-means clustering algorithm forms K disjoint nonempty subsets such that each subset minimizes some measure of dissimilarity locally. The result is a local optimum of the total dissimilarity over all subsets; a global optimum is not guaranteed.
- The K-means algorithm has time complexity O(RKN), where K is the number of desired clusters and R is the number of iterations needed to converge.
- Using the Euclidean distance between the coordinates of any two genes in the space reflects ignorance of a more biologically relevant measure of distance.
- K-means is an unsupervised, iterative algorithm that minimizes the within-cluster sum of squared distances from the cluster mean.
- The first cluster center is chosen as the centroid of the entire data set, and subsequent centers are chosen by finding the data point farthest from the centers already chosen. 200-400 iterations.
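The initialization described in the last bullet can be sketched as follows. "Farthest from the centers already chosen" is interpreted here as the point whose distance to its nearest chosen center is largest; that reading, the function name and the data are assumptions.

```python
import math


def initial_centers(points, k):
    """Centroid of all data first, then repeatedly the farthest data point."""
    dim = len(points[0])
    centroid = tuple(sum(p[d] for p in points) / len(points)
                     for d in range(dim))
    centers = [centroid]
    while len(centers) < k:
        # Distance of a point to the chosen set = distance to nearest center.
        farthest = max(points,
                       key=lambda p: min(math.dist(p, c) for c in centers))
        centers.append(farthest)
    return centers


# Hypothetical two-dimensional data.
points = [(0, 0), (0, 1), (11, 0), (10, 1), (5, 8)]
centers = initial_centers(points, 3)
print(centers)
```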
36K-Means Clustering Algorithm
- 1) Select an initial partition of k clusters
- 2) Assign each object to the cluster with the closest center
- 3) Compute the new centers of the clusters
- 4) Repeat steps 2 and 3 until no object changes cluster
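A minimal sketch of these steps is given below. It assumes Euclidean distance and starts from initial centers supplied by the caller rather than an initial partition; the iteration cap, the function name and the data are illustrative choices.

```python
import math


def kmeans(points, centers, max_iter=100):
    """K-means from given initial centers; returns (centers, labels)."""
    centers = list(centers)               # step 1: initial centers (copied)
    labels = None
    for _ in range(max_iter):
        # Step 2: assign each object to the cluster with the closest center.
        new_labels = [
            min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
            for p in points
        ]
        if new_labels == labels:          # step 4: stop when nothing changes
            break
        labels = new_labels
        # Step 3: recompute each center as the mean of its members.
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:                   # an empty cluster keeps its center
                centers[k] = tuple(sum(axis) / len(members)
                                   for axis in zip(*members))
    return centers, labels


# Hypothetical data with two well-separated groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
final_centers, labels = kmeans(points, [(1, 1), (8, 8)])
print(final_centers, labels)
```

Different starting centers can lead to different final clusterings, so in practice the procedure is often repeated from several initializations.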
37Representation of expression data
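[Figure: normalized expression data from microarrays shown as a table of genes (Gene 1 ... Gene N) by time points (T1, T2, T3); each gene is plotted as a point on the axes Time-point 1, Time-point 2 and Time-point 3, with dij marking the distance between two genes.]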
38Identifying prevalent expression patterns (gene
clusters)
[Figure: gene clusters in the space of Time-point 1, Time-point 2 and Time-point 3, each shown with a plot of Normalized Expression against Time-point.]
39Evaluate Cluster contents
Genes
MIPS functional category
Glycolysis
Nuclear Organization
Ribosome
Translation
Unknown