BioInformatics (3)

Transcript and Presenter's Notes

Title: BioInformatics (3)


1
BioInformatics (3)
2
Computational Issues
  • Data Warehousing
  • Organising Biological Information into a Structured Entity (the World's Largest Distributed DB)
  • Function Analysis (Numerical Analysis)
  • Gene Expression Analysis: applying sophisticated data mining/visualisation to understand gene activities within an environment (Clustering)
  • Integrated Genomic Study: relating structural analysis with functional analysis
  • Structure Analysis (Symbolic Analysis)
  • Sequence Alignment: analysing a sequence with comparative methods against existing databases to develop hypotheses about relatives (genetics) and functions (Dynamic Programming and HMMs)
  • Structure Prediction: predicting a protein's 3D structure from its sequence (Inductive Logic Programming)

3
Data Warehousing: Mapping Biologic into Data Logic
4
Structure Analysis: Alignment Scores
  • Local (motif): ACCACACA vs ACACCATA; score = 4 matches × (+1) = 4
  • Global (e.g. haplotype): ACCACACA vs ACACCATA; score = 5 matches × (+1) + 3 mismatches × (-1) = 2 (see the sketch below)
  • Suffix (shotgun assembly overlap): ACCACACA vs ACACCATA; score = 3 matches × (+1) = 3
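As an illustration (a minimal sketch, not code from the presentation), an ungapped scorer with the match = +1, mismatch = -1 scheme above reproduces the global score; gapped alignment would instead use dynamic programming (Needleman-Wunsch), as noted on slide 2.

    def ungapped_score(s, t, match=1, mismatch=-1):
        # Sum +1 for each matching position and -1 for each mismatch.
        assert len(s) == len(t)
        return sum(match if a == b else mismatch for a, b in zip(s, t))

    print(ungapped_score("ACCACACA", "ACACCATA"))  # 5 matches, 3 mismatches -> 2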
5
A comparison of the homology search and the motif
search for functional interpretation of sequence
information.
Homology search: a new sequence is compared by retrieval against a sequence database (primary data); a similar sequence is found and, together with expert knowledge, used for sequence interpretation.
Motif search: a new sequence is matched by inference against a motif library (empirical rules obtained by knowledge acquisition); together with expert knowledge, this yields the sequence interpretation.
6
Search and learning problems in sequence analysis
7
(Whole genome) Gene Expression Analysis
  • Quantitative Analysis of Gene Activities
    (Transcription Profiles)

Gene Expression
8
GeneChip expression analysis probe array (figure): biotinylated RNA from the experiment is hybridized to the array, where each probe cell contains millions of copies of a specific oligonucleotide probe; staining with a streptavidin-phycoerythrin conjugate yields the image of the hybridized probe array.
9
(Sub)cellular inhomogeneity
Cell-cycle differences in expression; XIST RNA localized on the inactive X-chromosome (see figure).
10
Cluster Analysis
Protein/protein complex
Genes
DNA regulatory elements
11
Functional Analysis via Gene Expression
Pairwise Measures
Clustering
Motif Searching/...
12
Clustering Algorithms
A clustering algorithm attempts to find natural groups of components (or data) based on some measure of similarity, and it also finds the centroid of each group. To determine cluster membership, most algorithms evaluate the distance between a point and the cluster centroids. The output of a clustering algorithm is essentially a statistical description of the cluster centroids together with the number of components in each cluster.
13
Clusters of Two-Dimensional Data
14
Key Terms in Cluster Analysis
  • Distance and similarity measures
  • Hierarchical vs. non-hierarchical clustering
  • Single/complete/average linkage
  • Dendrograms and ordering

15
Distance Measures: Minkowski Metric
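For reference, the Minkowski metric of order r between two p-dimensional points x and y is

    d_r(x, y) = \left( \sum_{i=1}^{p} |x_i - y_i|^r \right)^{1/r}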
16
Most Common Minkowski Metrics
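The most common special cases are

    d_1(x, y) = \sum_{i=1}^{p} |x_i - y_i|              (r = 1, Manhattan / city-block)
    d_2(x, y) = \sqrt{\sum_{i=1}^{p} (x_i - y_i)^2}      (r = 2, Euclidean)
    d_\infty(x, y) = \max_i |x_i - y_i|                  (r \to \infty, sup / Chebyshev)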
17
An Example: two points x and y whose coordinates differ by 3 along one axis and 4 along the other (see figure).
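Assuming those coordinate differences of 3 and 4, the three metrics above give: Manhattan = 3 + 4 = 7; Euclidean = \sqrt{3^2 + 4^2} = 5; sup = max(3, 4) = 4.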
18
Manhattan distance is called Hamming distance
when all features are binary.
Gene Expression Levels Under 17 Conditions (1 = High, 0 = Low)
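A small sketch (with hypothetical 17-condition binary profiles, chosen for illustration) of the Hamming distance between two such expression vectors:

    def hamming(u, v):
        # Number of conditions in which two binary expression profiles differ.
        assert len(u) == len(v)
        return sum(a != b for a, b in zip(u, v))

    gene_a = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0]  # hypothetical
    gene_b = [1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0]  # hypothetical
    print(hamming(gene_a, gene_b))  # 5 differing conditions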
19
Similarity Measures: Correlation Coefficient
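For reference, assuming the standard Pearson correlation coefficient between two expression profiles x and y measured over n conditions:

    r(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}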
20
Similarity Measures: Correlation Coefficient
Example plots (three panels) of expression level over time for Gene A and Gene B, illustrating different degrees of correlation between their profiles (see figure).
21
Distance-based Clustering
  • Assign a distance measure between data points
  • Find a partition such that
  • the distance between objects within a partition (i.e. the same cluster) is minimized, and
  • the distance between objects from different clusters is maximised
  • Issues
  • Requires defining a distance (similarity) measure in situations where it is unclear how to assign one
  • What relative weighting to give one attribute vs. another?
  • The number of possible partitions is super-exponential

22
Hierarchical vs. non-hierarchical clustering
Normalized Expression Data
23
Hierarchical Clustering Techniques
24
Hierarchical Clustering
Given a set of N items to be clustered and an N×N distance (or similarity) matrix, the basic process of hierarchical clustering is this:
1. Start by assigning each item to its own cluster, so that if you have N items you now have N clusters, each containing just one item. Let the distances (similarities) between the clusters equal the distances (similarities) between the items they contain.
2. Find the closest (most similar) pair of clusters and merge them into a single cluster, so that you now have one less cluster.
3. Compute the distances (similarities) between the new cluster and each of the old clusters.
4. Repeat steps 2 and 3 until all items are clustered into a single cluster of size N (a minimal sketch of these steps follows).
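A minimal from-scratch sketch of the four steps above using single-link merging (the data and names are illustrative, not from the presentation):

    import itertools

    def single_link_hierarchical(items, dist):
        # Step 1: each item starts in its own cluster.
        clusters = [[x] for x in items]
        merges = []
        while len(clusters) > 1:
            # Step 2: find the closest pair of clusters (single link: shortest
            # distance from any member of one cluster to any member of the other).
            i, j = min(
                itertools.combinations(range(len(clusters)), 2),
                key=lambda ij: min(dist(a, b) for a in clusters[ij[0]] for b in clusters[ij[1]]),
            )
            # Step 3: merge them; cross-cluster distances are recomputed from the
            # members, so no explicit update is needed in this naive version.
            merged = clusters[i] + clusters[j]
            merges.append(merged)
            clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
            # Step 4: the loop repeats until a single cluster of size N remains.
        return merges

    # Hypothetical 1-D example: points a=1, b=2, c=6, d=9
    points = {"a": 1.0, "b": 2.0, "c": 6.0, "d": 9.0}
    print(single_link_hierarchical(list(points), lambda x, y: abs(points[x] - points[y])))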
25
The distance between two clusters can be defined as the distance between
  • their nearest members (Single-Link Method / Nearest Neighbor),
  • their furthest members (Complete-Link / Furthest Neighbor),
  • their centroids, or
  • the average over all cross-cluster pairs.

26
Computing Distances
  • In single-link clustering (also called the connectedness or minimum method), the distance between one cluster and another is taken to be the shortest distance from any member of one cluster to any member of the other. If the data consist of similarities, the similarity between two clusters is the greatest similarity from any member of one cluster to any member of the other.
  • In complete-link clustering (also called the diameter or maximum method), the distance between one cluster and another is taken to be the longest distance from any member of one cluster to any member of the other.
  • In average-link clustering, the distance between one cluster and another is taken to be the average distance from any member of one cluster to any member of the other (the three rules are compared in the sketch below).
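The three linkage rules can be compared directly with SciPy's hierarchical-clustering routines; a sketch on a hypothetical gene-by-condition expression matrix (array contents and labels are illustrative):

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import linkage, dendrogram

    # Hypothetical expression matrix: 6 genes x 4 conditions.
    X = np.array([
        [2.1, 0.3, 1.8, 0.2],
        [2.0, 0.4, 1.7, 0.1],
        [0.2, 1.9, 0.3, 2.2],
        [0.1, 2.1, 0.2, 2.0],
        [1.0, 1.1, 0.9, 1.2],
        [1.1, 1.0, 1.0, 1.1],
    ])

    fig, axes = plt.subplots(1, 3, figsize=(12, 3))
    for ax, method in zip(axes, ["single", "complete", "average"]):
        Z = linkage(X, method=method, metric="euclidean")  # (N-1) x 4 merge table
        dendrogram(Z, ax=ax)
        ax.set_title(f"{method}-link")
    plt.tight_layout()
    plt.show()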

27
Single-Link Method (Euclidean distance): worked example on points a, b, c, d; starting from the distance matrix, the closest pair of clusters merges at each step, producing {a, b}, then {a, b, c}, then {a, b, c, d} (see figure).
28
Complete-Link Method (Euclidean distance): worked example on the same points; starting from the distance matrix, the merges produce {a, b} and {c, d}, and finally {a, b, c, d} (see figure).
29
Compare Dendrograms
Single-link vs. complete-link dendrograms for the same data, with merge heights from 0 to 6 (see figure).
30
Ordered dendrograms
  • There are 2^(n-1) linear orderings of n elements (n genes or conditions), since each of the n-1 internal nodes of the merge tree can be flipped.
  • Maximizing adjacent similarity is impractical, so order by
  • average expression level,
  • time of maximum induction, or
  • chromosome position.

Eisen98
31
Which clustering methods do you suggest for the
following two-dimensional data?
32
Nadler and Smith, Pattern Recognition
Engineering, 1993
33
Problems of Hierarchical Clustering
  • It is concerned more with the complete tree structure than with the optimal number of clusters.
  • There is no possibility of correcting for a poor
    initial partition.
  • Similarity and distance measures rarely have
    strict numerical significance.

34
Non-hierarchical clustering
Normalized Expression Data
Tavazoie et al. 1999 (http://arep.med.harvard.edu)
35
Clustering by K-means
  • Given a set S of N p-dimension vectors without
    any prior knowledge about the set, the K-means
    clustering algorithm forms K disjoint nonempty
    subsets such that each subset minimizes some
    measure of dissimilarity locally. The algorithm
    will globally yield an optimal dissimilarity of
    all subsets.
  • K-means algorithm has time complexity O(RKN)
    where K is the number of desired clusters and R
    is the number of iterations to converges.
  • Euclidean distance metric between the coordinates
    of any two genes in the space reflects ignorance
    of a more biologically relevant measure of
    distance. K-means is an unsupervised, iterative
    algorithm that minimizes the within-cluster sum
    of squared distances from the cluster mean.
  • The first cluster center is chosen as the
    centroid of the entire data set and subsequent
    centers are chosen by finding the data point
    farthest from the centers already chosen. 200-400
    iterations.

36
K-Means Clustering Algorithm
  • 1) Select an initial partition of k clusters
  • 2) Assign each object to the cluster with the closest center
  • 3) Compute the new centers of the clusters
  • 4) Repeat steps 2 and 3 until no object changes cluster (see the sketch below)
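A minimal sketch of these four steps in Python (illustrative data and names; a simple random initialization is used here rather than the farthest-point rule described on the previous slide):

    import numpy as np

    def kmeans(X, k, max_iter=100, seed=0):
        # Basic K-means on an (N x p) data matrix: assign each point to the
        # nearest center, recompute centers, repeat until assignments stabilize.
        rng = np.random.default_rng(seed)
        # Step 1: initial partition induced by k randomly chosen data points as centers.
        centers = X[rng.choice(len(X), size=k, replace=False)]
        labels = np.full(len(X), -1)
        for _ in range(max_iter):
            # Step 2: assign each object to the cluster with the closest center.
            dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            new_labels = dists.argmin(axis=1)
            # Step 3: compute the new centers (keep the old center if a cluster is empty).
            centers = np.array([
                X[new_labels == j].mean(axis=0) if np.any(new_labels == j) else centers[j]
                for j in range(k)
            ])
            # Step 4: stop when no object changes cluster.
            if np.array_equal(new_labels, labels):
                break
            labels = new_labels
        return labels, centers

    # Hypothetical usage: 30 "genes" measured in 3 conditions, drawn around three levels.
    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(m, 0.1, size=(10, 3)) for m in (0.0, 1.0, 2.0)])
    labels, centers = kmeans(X, k=3)
    print(labels)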

37
Representation of expression data
Normalized expression data from microarrays form a matrix of N genes (Gene 1 to Gene N) by time-points (T1, T2, T3); each gene then corresponds to a point in time-point space, and d_ij denotes the distance between genes i and j (see figure).
38
Identifying prevalent expression patterns (gene
clusters)
Figure: gene clusters in the space of time-points 1-3; each cluster's prevalent pattern is shown as normalized expression versus time-point.
39
Evaluate Cluster contents
Figure: genes in each cluster annotated by MIPS functional category, e.g. glycolysis, nuclear organization, ribosome, translation, and unknown.