BioInformatics (3) - PowerPoint PPT Presentation

Slides: 40

Provided by: Yike1

Transcript and Presenter's Notes

Title: BioInformatics (3)

1
BioInformatics (3)
2
Computational Issues

Data Warehousing
Organising biological information into a structured entity (the world's largest distributed DB)
Function Analysis (Numerical Analysis)
Gene Expression Analysis: applying sophisticated data mining/visualisation to understand gene activities within an environment (clustering)
Integrated Genomic Study: relating structural analysis with functional analysis
Structure Analysis (Symbolic Analysis)
Sequence Alignment: analysing a sequence using comparative methods against existing databases to develop hypotheses concerning relatives (genetics) and functions (dynamic programming and HMM)
Structure Prediction: predicting the 3D structure of a protein from its sequence (inductive LP)

3
Data Warehousing: Mapping Biologic into Data Logic
4
Structure Analysis Alignments Scores
Local (motif): ACCACACA vs. ACACCATA, Score = 4(+1) = 4
Global (e.g. haplotype): ACCACACA vs. ACACCATA (mismatches marked xxx), Score = 5(+1) + 3(-1) = 2
Suffix (shotgun assembly): ACCACACA vs. ACACCATA, Score = 3(+1) = 3
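The three scores above can be reproduced with a small sketch. It assumes the simplest possible scoring, +1 per match and -1 per mismatch with no gaps, which is consistent with the numbers on the slide. The function names are illustrative only; practical alignment tools use dynamic programming with gap penalties.

```python
def local_score(a: str, b: str) -> int:
    """Length of the longest common substring (ungapped local match, +1 per base)."""
    best = 0
    # prev[j] holds the length of the common run ending at the previous
    # character of a and at b[j-1].
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0] * (len(b) + 1)
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


def global_score(a: str, b: str, match: int = 1, mismatch: int = -1) -> int:
    """Position-by-position score of two equal-length sequences (no gaps)."""
    if len(a) != len(b):
        raise ValueError("ungapped global comparison needs equal lengths")
    return sum(match if x == y else mismatch for x, y in zip(a, b))


def suffix_score(a: str, b: str) -> int:
    """Length of the longest suffix of a that equals a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


s1, s2 = "ACCACACA", "ACACCATA"
print(local_score(s1, s2))   # 4
print(global_score(s1, s2))  # 2
print(suffix_score(s1, s2))  # 3
```

The suffix score is the overlap a shotgun assembler would look for when joining two reads end to end.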
5
A comparison of the homology search and the motif
search for functional interpretation of sequence
information.
Homology Search: New sequence -> Retrieval from Sequence database (Primary data) -> Similar sequence -> Expert knowledge -> Sequence interpretation
Motif Search: New sequence -> Inference from Motif library (Empirical rules, built by Knowledge acquisition) -> Expert knowledge -> Sequence interpretation
6
Search and learning problems in sequence analysis
7
(Whole genome) Gene Expression Analysis

Quantitative Analysis of Gene Activities
(Transcription Profiles)

Gene Expression
8
Biotinylated RNA from experiment
Each probe cell contains millions of copies of a
specific oligonucleotide probe
GeneChip expression analysis probe array
Streptavidin- phycoerythrin conjugate
Image of hybridized probe array
9
(Sub)cellular inhomogeneity
Cell-cycle differences in expression. XIST RNA
localized on inactive X-chromosome
(see figure)
10
Cluster Analysis
Protein/protein complex
Genes
DNA regulatory elements
11
Functional Analysis via Gene Expression
Pairwise Measures
Clustering
Motif Searching/...
12
Clustering Algorithms
A clustering algorithm attempts to find natural groups of components (or data) based on some measure of similarity. It also finds the centroid of each group. To determine cluster membership, most algorithms evaluate the distance between a point and the cluster centroids. The output of a clustering algorithm is essentially a statistical description of the cluster centroids, together with the number of components in each cluster.
13
Clusters of Two-Dimensional Data
14
Key Terms in Cluster Analysis

Distance Similarity measures
Hierarchical non-hierarchical
Single/complete/average linkage
Dendrograms ordering

15
Distance Measures Minkowski Metric
16
Most Common Minkowski Metrics
17
An Example
x
3
y
4
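The formulas on these slides did not survive extraction. As a reminder, the standard Minkowski metric of order r between two vectors is d(x, y) = (sum over i of |x_i - y_i|^r)^(1/r). The sketch below assumes the example refers to two points that lie 4 apart on one axis and 3 apart on the other. The coordinates (0, 0) and (4, 3) are chosen for illustration and are not taken from the slide.

```python
import math


def minkowski(u, v, r):
    """Minkowski distance of order r; r = math.inf gives the sup (maximum) distance."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    diffs = [abs(a - b) for a, b in zip(u, v)]
    if math.isinf(r):
        return max(diffs)
    return sum(d ** r for d in diffs) ** (1.0 / r)


p, q = (0, 0), (4, 3)  # illustrative points, 4 and 3 apart
print(minkowski(p, q, 1))         # Manhattan distance (r = 1): 7.0
print(minkowski(p, q, 2))         # Euclidean distance (r = 2): 5.0
print(minkowski(p, q, math.inf))  # sup distance (r = infinity): 4
```

As r grows, the largest coordinate difference dominates the sum, which is why the limit is the maximum difference.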
18
Manhattan distance is called Hamming distance
when all features are binary.
Gene Expression Levels Under 17 Conditions
(1-High,0-Low)
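A short sketch of this equivalence: for 0/1 features, every coordinate difference is either 0 or 1, so summing absolute differences counts the positions that disagree. The two 17-condition profiles below are made up for illustration and are not the data shown on the slide.

```python
def hamming(u, v):
    """Number of positions at which two equal-length vectors differ."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(a != b for a, b in zip(u, v))


def manhattan(u, v):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(u, v))


# Hypothetical high (1) / low (0) expression calls under 17 conditions.
gene_a = [int(c) for c in "01100100100111001"]
gene_b = [int(c) for c in "01110000100011001"]

print(hamming(gene_a, gene_b))    # 3
print(manhattan(gene_a, gene_b))  # 3, identical because every feature is binary
```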
19
Similarity Measures Correlation Coefficient
20
Similarity Measures Correlation Coefficient
(Figure: three panels plotting Expression Level against Time for Gene A and Gene B)
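The correlation formula itself is missing from the transcript. The sketch below uses the standard Pearson correlation coefficient, which is a common choice for comparing expression profiles because it compares the shape of two curves and ignores their magnitude. The three time series are invented for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two profiles of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / (norm_x * norm_y)


gene_a = [1, 2, 3, 4, 5]    # hypothetical expression over five time points
gene_b = [2, 4, 6, 8, 10]   # same shape as gene_a at twice the magnitude
gene_c = [5, 4, 3, 2, 1]    # mirror image of gene_a

print(pearson(gene_a, gene_b))  # close to +1: same pattern
print(pearson(gene_a, gene_c))  # close to -1: opposite pattern
```

Note that gene_a and gene_b are far apart in Euclidean terms yet perfectly correlated, which is the contrast a distance measure alone would miss.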
21
Distance-based Clustering

Assign a distance measure between data
Find a partition such that
Distance between objects within a partition (i.e. the same cluster) is minimised
Distance between objects from different clusters is maximised
Issues
Requires defining a distance (similarity) measure in situations where it is unclear how to assign one
What relative weighting to give to one attribute vs another?
The number of possible partitions is super-exponential
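To give a feel for how quickly the number of partitions grows, the sketch below counts them with the standard recurrence for Stirling numbers of the second kind, S(n, k) = k S(n-1, k) + S(n-1, k-1). Summing over k gives the Bell number, the total number of ways to partition n objects. The function names are illustrative.

```python
def stirling2(n: int, k: int) -> int:
    """Number of ways to split n labelled objects into exactly k nonempty clusters."""
    # table[i][j] = S(i, j), filled with S(i, j) = j*S(i-1, j) + S(i-1, j-1)
    table = [[0] * (k + 1) for _ in range(n + 1)]
    table[0][0] = 1
    for i in range(1, n + 1):
        for j in range(1, min(i, k) + 1):
            table[i][j] = j * table[i - 1][j] + table[i - 1][j - 1]
    return table[n][k]


def bell(n: int) -> int:
    """Total number of partitions of n labelled objects into any number of clusters."""
    return sum(stirling2(n, k) for k in range(n + 1))


print(stirling2(4, 2))  # 7 ways to split 4 objects into 2 clusters
print(bell(10))         # 115975 partitions of just 10 objects
```

With thousands of genes, exhaustive search over partitions is out of the question, which is why clustering methods rely on heuristics.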

22
hierarchical non-
Normalized Expression Data
23
Hierarchical Clustering Techniques
24
Hierarchical Clustering
Given a set of N items to be clustered, and an NxN distance (or similarity) matrix, the basic process of hierarchical clustering is this:
1. Start by assigning each item to its own cluster, so that if you have N items, you now have N clusters, each containing just one item. Let the distances (similarities) between the clusters equal the distances (similarities) between the items they contain.
2. Find the closest (most similar) pair of clusters and merge them into a single cluster, so that now you have one less cluster.
3. Compute distances (similarities) between the new cluster and each of the old clusters.
4. Repeat steps 2 and 3 until all items are clustered into a single cluster of size N.
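The four steps can be sketched as follows. This is a deliberately simple version that recomputes cluster distances from the item distances on every pass instead of updating a matrix, and it is not an efficient implementation. The function name, the `linkage` parameter and the 4x4 distance matrix are assumptions made for illustration.

```python
def agglomerate(dist, linkage=min):
    """Agglomerative clustering on a symmetric NxN distance matrix.

    linkage reduces the item-to-item distances between two clusters to one
    number: min gives single-link, max gives complete-link.
    Returns the merges in order as (members_a, members_b, distance).
    """
    # Step 1: every item starts in its own cluster.
    clusters = [frozenset([i]) for i in range(len(dist))]
    merges = []

    def cluster_distance(c1, c2):
        # Step 3, done on demand: distance between two clusters.
        return linkage(dist[i][j] for i in c1 for j in c2)

    # Step 4: repeat until a single cluster remains.
    while len(clusters) > 1:
        # Step 2: find the closest pair of clusters.
        best = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                d = cluster_distance(clusters[p], clusters[q])
                if best is None or d < best[0]:
                    best = (d, p, q)
        d, p, q = best
        merges.append((sorted(clusters[p]), sorted(clusters[q]), d))
        merged = clusters[p] | clusters[q]
        clusters = [c for i, c in enumerate(clusters) if i not in (p, q)]
        clusters.append(merged)
    return merges


# Hypothetical distances between four items (0, 1, 2, 3).
D = [
    [0, 2, 6, 10],
    [2, 0, 5, 9],
    [6, 5, 0, 4],
    [10, 9, 4, 0],
]
for step in agglomerate(D, linkage=min):
    print(step)
```

The recorded merge distances are the heights at which branches join in the dendrogram. Passing `linkage=max` on the same matrix changes only the final merge height here, from 5 to 10.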
25
The distance between two clusters is defined as
the distance between

Single-Link Method / Nearest Neighbor
Complete-Link / Furthest Neighbor
Their Centroids.
Average of all cross-cluster pairs.

26
Computing Distances

In single-link clustering (also called the connectedness or minimum method), we consider the distance between one cluster and another cluster to be equal to the shortest distance from any member of one cluster to any member of the other cluster. If the data consist of similarities, we consider the similarity between one cluster and another cluster to be equal to the greatest similarity from any member of one cluster to any member of the other cluster.
In complete-link clustering (also called the diameter or maximum method), we consider the distance between one cluster and another cluster to be equal to the longest distance from any member of one cluster to any member of the other cluster.
In average-link clustering, we consider the distance between one cluster and another cluster to be equal to the average distance from any member of one cluster to any member of the other cluster.
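The three definitions differ only in how the cross-cluster pair distances are reduced to a single number. A minimal sketch, assuming Euclidean distance between points (via `math.dist`, Python 3.8+) and two small made-up clusters:

```python
import math
from itertools import product


def single_link(c1, c2, dist=math.dist):
    """Shortest distance between any member of c1 and any member of c2."""
    return min(dist(p, q) for p, q in product(c1, c2))


def complete_link(c1, c2, dist=math.dist):
    """Longest distance between any member of c1 and any member of c2."""
    return max(dist(p, q) for p, q in product(c1, c2))


def average_link(c1, c2, dist=math.dist):
    """Average distance over all cross-cluster pairs."""
    return sum(dist(p, q) for p, q in product(c1, c2)) / (len(c1) * len(c2))


# Two hypothetical clusters on a line.
left = [(0, 0), (1, 0)]
right = [(4, 0), (6, 0)]

print(single_link(left, right))    # 3.0
print(complete_link(left, right))  # 6.0
print(average_link(left, right))   # 4.5
```

Single-link tends to produce long chained clusters because one close pair is enough to merge, while complete-link favours compact clusters.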

27
Single-Link Method
Euclidean Distance
(Figure: distance matrix for a, b, c, d and the resulting merges: (1) a,b; (2) a,b,c; (3) a,b,c,d)
28
Complete-Link Method
Euclidean Distance
(Figure: distance matrix for a, b, c, d and the resulting merges: (1) a,b; (2) c,d; (3) a,b,c,d)
29
Compare Dendrograms
Single-Link
Complete-Link
(Figure: the two dendrograms side by side on a distance axis running from 0 to 6)
30
Ordered dendrograms

2^(n-1) linear orderings of n elements
(n genes or conditions)
Maximizing adjacent similarity is impractical.
So order by
Average expression level,
Time of max induction, or
Chromosome positioning

Eisen98
31
Which clustering methods do you suggest for the
following two-dimensional data?
32
Nadler and Smith, Pattern Recognition
Engineering, 1993
33
Problems of Hierarchical Clustering

It is concerned more with the complete tree structure
than with the optimal number of clusters.
There is no possibility of correcting for a poor
initial partition.
Similarity and distance measures rarely have
strict numerical significance.

34
Non-hierarchical clustering
Normalized Expression Data
Tavazoie et al. 1999 (http://arep.med.harvard.edu)
35
Clustering by K-means

Given a set S of N p-dimensional vectors without any prior knowledge about the set, the K-means clustering algorithm forms K disjoint nonempty subsets such that each subset minimizes some measure of dissimilarity locally. The algorithm aims to minimize the total dissimilarity over all subsets, but it is only guaranteed to converge to a local optimum, not a global one.
The K-means algorithm has time complexity O(RKN), where K is the number of desired clusters and R is the number of iterations needed to converge.
Using the Euclidean distance between the coordinates of any two genes in the space reflects ignorance of a more biologically relevant measure of distance. K-means is an unsupervised, iterative algorithm that minimizes the within-cluster sum of squared distances from the cluster mean.
The first cluster center is chosen as the centroid of the entire data set, and subsequent centers are chosen by finding the data point farthest from the centers already chosen. The algorithm runs for 200-400 iterations.

36
K-Means Clustering Algorithm

1) Select an initial partition of k clusters
2) Assign each object to the cluster with the
closest center
3) Compute the new centers of the clusters
4) Repeat steps 2 and 3 until no object changes
cluster
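The steps above can be sketched in plain Python. The initialisation follows the rule described on the previous slide: the first center is the centroid of the whole data set, and each later center is the data point farthest from the centers already chosen. Reading "farthest" as "largest distance to its nearest chosen center" is an assumption, as are the function names, the `max_iter` cap and the toy data.

```python
import math


def centroid(points):
    """Coordinate-wise mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def initial_centers(points, k):
    """First center: centroid of all data. Others: farthest point from chosen centers."""
    centers = [centroid(points)]
    while len(centers) < k:
        # Assumption: "farthest" means the largest distance to the nearest chosen center.
        far = max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        centers.append(far)
    return centers


def kmeans(points, k, max_iter=400):
    centers = initial_centers(points, k)          # step 1
    labels = None
    for _ in range(max_iter):
        # Step 2: assign each object to the cluster with the closest center.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        if new_labels == labels:                  # step 4: stop when nothing changes
            break
        labels = new_labels
        # Step 3: recompute each center as the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:                           # an empty cluster keeps its old center
                centers[c] = centroid(members)
    return labels, centers


# Two well-separated hypothetical groups of points.
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers = kmeans(data, k=2)
print(labels)
print(centers)
```

On well-separated data like this the result is stable, but in general the outcome depends on the starting centers, so different initialisations can converge to different local optima.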

37
Representation of expression data
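(Figure: normalized expression data from microarrays. Genes 1 to N are measured at time-points T1, T2 and T3 and plotted as points on the axes Time-point 1, Time-point 2 and Time-point 3, with the distance dij marked between two genes, Gene 1 and Gene 2.)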
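A minimal sketch of this representation: each gene is a vector with one coordinate per time-point, and dij is read here as the Euclidean distance between genes i and j (an assumption, since the figure itself is lost). The three profiles are invented so that the distances come out as round numbers.

```python
import math


def distance_matrix(profiles):
    """Symmetric matrix of Euclidean distances between all pairs of profiles."""
    n = len(profiles)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d[i][j] = d[j][i] = math.dist(profiles[i], profiles[j])
    return d


# Hypothetical normalized expression at three time-points (T1, T2, T3).
profiles = [
    (0.0, 0.0, 0.0),   # gene 1
    (3.0, 4.0, 0.0),   # gene 2
    (0.0, 0.0, 12.0),  # gene 3
]
for row in distance_matrix(profiles):
    print(row)
```

A matrix like this is the NxN input that hierarchical clustering starts from.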
38
Identifying prevalent expression patterns (gene
clusters)
(Figure: genes plotted on the axes Time-point 1, Time-point 2 and Time-point 3, with one panel per cluster showing Normalized Expression against Time-point)
39
Evaluate Cluster contents
Genes
MIPS functional category
Glycolysis
Nuclear Organization
Ribosome
Translation
Unknown
