A redundancy detection approach to mining bioinformatics data


1
A redundancy detection approach to mining
bioinformatics data

Problem definition
Solution approach
Graph theory optimization problem.
Experimental results

Abdellah Salhi and Horacio Camacho 15/02/2006
2
Problem definition

DNA information has not been fully exploited.
There is considerable hope that it will help devise new treatments
for illnesses such as cancer.
Searching sequences of bases has become a crucial
problem.
Searching sequences in databases is
computationally intensive.
The human genome has more than 3 billion base
pairs.
Here we are concerned with the search of 25-base
DNA sequences.

3
Problem definition

Genome data is sliced into sequences of the bases
C, G, T, A.
Those sequences are nothing more than words, or
strings, of length 25.
Thus, the task of searching for redundant
records in such a database can be approached as one
would an ordinary database containing records in
string form.

4
Solution approach

A key is the unique record identifier in a
database.
Key equivalence occurs when two or more records refer
to the same real-world object.

5
Main redundancy detection approaches

Probabilistic record linkage: learn a naïve Bayes classifier on a binary
feature vector obtained by comparing a pair of records.
Merge/purge problem: sliding windows.
As a clustering/classification problem: many classifications.
As an optimization problem: difficult to find the global optimum.

6
Key equivalence as an optimization problem

The idea is to compress a database S by
removing some or all of the redundant records
contained in it, resulting in H,
where |S| is the size of the input database and
|H| is the size of the processed dataset.
There is a cost associated with every compression
of the database:
the cost of assuming that a pair of records is
equivalent
(is record T equivalent to record U?).
If two records match, the cost of the assumption is
very small.
Minimize the size of the database by minimizing
a cost function.

7
Cost of equivalence

Option 1: As in record linkage, we can use a
probability of match given the feature vector.
Option 2: There are several string metrics
which compute weights taking values from
1 for a perfect pair match
to 0 for a non-match.

8
String metrics for DNA sequences

There are several string metrics.
Some string metrics are domain dependent,
specialized to match names, addresses, etc.
We compared around 40 different string metrics.
We generated 5 artificial datasets containing DNA
sequences.
Each dataset contains different sets of corrupted
redundant records,
with the probability of corruption ranging from 0.1 to 0.5.

9
String metrics for DNA sequences

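Evaluation of a string metric
Average precision, $\frac{1}{m}\sum_{l} \delta(l)\, c(l) / l$, where
$c(l)$ is the number of correct pairs before
rank position $l$,
$\delta(l)$ equals 1 if the pair at rank $l$ is an actual match, 0
otherwise, and $m$ is the total number of correct pairs.
Maximum F1, where
$F_1 = 2PR / (P + R)$,
with $P$ the precision and $R$ the recall.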
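The two evaluation measures can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: it assumes the candidate pairs have already been ranked by the string metric (best score first) and are given as a list of booleans marking which pairs are true matches. The function names and the convention that the count of correct pairs includes the current rank are assumptions.

```python
def average_precision(ranked_matches, total_correct=None):
    """Non-interpolated average precision of a ranked list of pairs.

    ranked_matches[l-1] is True when the pair at rank l is a real match.
    total_correct defaults to the number of matches in the list (assumption).
    """
    m = sum(ranked_matches) if total_correct is None else total_correct
    if m == 0:
        return 0.0
    correct = 0
    total = 0.0
    for rank, is_match in enumerate(ranked_matches, start=1):
        if is_match:
            correct += 1             # correct pairs up to and including this rank
            total += correct / rank  # precision at this rank
    return total / m


def max_f1(ranked_matches, total_correct=None):
    """Largest F1 obtained by cutting the ranking after any position."""
    m = sum(ranked_matches) if total_correct is None else total_correct
    best = 0.0
    correct = 0
    for rank, is_match in enumerate(ranked_matches, start=1):
        if is_match:
            correct += 1
        if correct and m:
            precision = correct / rank
            recall = correct / m
            best = max(best, 2 * precision * recall / (precision + recall))
    return best
```

For the ranking `[True, False, True]`, the first function averages the precision values 1 and 2/3, and the second picks the cut after the third pair.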

10
String metrics for DNA sequences

Needleman-Wunsch seems to be the best method, but it
requires a lot of computational time.
We propose a cheap string metric which requires
40% of the time required by Needleman-Wunsch:
Sample 50% of the characters of sequence T at
equally spaced locations.
Compare each sample with positions -1 to +1 (relative to it) of
sequence U.
If the rate of agreement between T and U is greater than
80%, then compute Needleman-Wunsch.
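The prefilter idea above can be sketched as follows. This is an illustrative reading of the slide, not the authors' implementation: the scoring scheme (match +1, mismatch -1, gap -1), the normalization to [0, 1], and the function names are all assumptions.

```python
def needleman_wunsch(t, u, match=1, mismatch=-1, gap=-1):
    """Global alignment score of t and u (higher means more similar)."""
    prev = [j * gap for j in range(len(u) + 1)]
    for i in range(1, len(t) + 1):
        curr = [i * gap]
        for j in range(1, len(u) + 1):
            diag = prev[j - 1] + (match if t[i - 1] == u[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def sampled_agreement(t, u, step=2, shift=1):
    """Fraction of sampled positions of t that agree with u within +/- shift."""
    positions = range(0, len(t), step)  # step=2 samples about 50% of t
    if len(positions) == 0:
        return 0.0
    hits = 0
    for p in positions:
        window = u[max(0, p - shift): p + shift + 1]
        if t[p] in window:
            hits += 1
    return hits / len(positions)


def cheap_similarity(t, u, threshold=0.8):
    """Similarity in [0, 1]; the alignment is skipped when the prefilter fails."""
    if sampled_agreement(t, u) <= threshold:
        return 0.0
    score = needleman_wunsch(t, u)
    return max(0.0, score / max(len(t), len(u)))
```

The saving comes from the fact that the quadratic-time alignment is only run on pairs that pass the linear-time sampling check.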

11
Cost of equivalence

We use the Needleman-Wunsch string metric to
obtain the cost of matching two names.
We can model the problem as a directed graph.
The directed graph must satisfy the following two
rules:
One record cannot be equivalent to more than one
record.
Two linked records are joined by an acyclic
relation.

12
Optimization model
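Given a graph $G = (V, E)$,
find a subset $E' \subseteq E$
that minimizes a cost function
subject to constraints (1), (2) and (3),
where $c_{ij}$ is a cost or distance from $i$ to $j$
and $x_{ij}$ represents an arc from $i$ to $j$.
(The formulas themselves are missing from this transcript.)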
13
Optimization model

The model contains a constant value k.
If k = 0, the solution is empty and no record is
merged.
If k is a very large value, all records are matched.
As k increases from 0, the optimal objective value
becomes negative.
Because we are minimizing, no selected arc has a cost greater
than k.
Thus k takes the role of a threshold.
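The formulas of the model did not survive in this transcript. One formulation that is consistent with the statements above (the two graph rules, the constant k acting as a threshold, and the similarity to TSP constraints) is sketched below. It is an assumption made for illustration, not the authors' stated model; the symbols $c_{ij}$, $x_{ij}$ and $S$ are chosen here.

```latex
\begin{align*}
\min_{x} \quad & \sum_{(i,j) \in E} (c_{ij} - k)\, x_{ij} \\
\text{subject to} \quad
  & \sum_{j : (i,j) \in E} x_{ij} \le 1
      && \forall i \in V
      && \text{(a record is linked to at most one record)} \\
  & \sum_{i, j \in S} x_{ij} \le |S| - 1
      && \forall S \subseteq V,\ S \neq \emptyset
      && \text{(no cycles)} \\
  & x_{ij} \in \{0, 1\}
      && \forall (i,j) \in E
\end{align*}
```

Under this reading, an arc lowers the objective only when $c_{ij} < k$: with $k = 0$ no arc is worth selecting, and with a very large $k$ every arc that the constraints allow is selected.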

14
Solve the model

The model belongs to a family of integer
programming problems with restrictions similar to
the TSP, but not as difficult.
It can be solved by conventional approaches
efficiently.

15
Solve the model

Any tree satisfies restrictions of our model.
A minimum-weight tree in a weighted graph which
contains all of the graph's vertices is a minimum
spanning tree.
A spanning forest of a connected graph G is a
forest whose components are subtrees of a
spanning tree of G.
A minimum spanning forest of a connected graph G
is a forest whose components are minimum spanning
trees of the corresponding components in G.
For a given k, the optimal solution to the model
can be obtained in polynomial time.

16
Proposed algorithm

1. Find the weighted complete graph G corresponding
to the given database.
2. Find the minimum spanning tree of G.
3. Assign to k the largest weight for which a good
match between records is found.
4. Remove all edges with weights greater than k from the
minimum spanning tree of G found in step 2.
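The four steps can be sketched in Python as below. This is a minimal sketch under stated assumptions: Kruskal's algorithm is used for the spanning tree (the slides do not say which algorithm was used), a Hamming-style cost stands in for the string metric, k is passed in rather than chosen automatically, and all function names are hypothetical.

```python
from itertools import combinations


def hamming_cost(t, u):
    """Stand-in cost in [0, 1]: fraction of differing positions (equal lengths assumed)."""
    return sum(a != b for a, b in zip(t, u)) / len(t)


def minimum_spanning_tree(n, edges):
    """Kruskal's algorithm; edges are (weight, i, j) tuples over vertices 0..n-1."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    tree = []
    for w, i, j in sorted(edges):
        ri, rj = find(i), find(j)
        if ri != rj:  # the edge joins two components, so it creates no cycle
            parent[ri] = rj
            tree.append((w, i, j))
    return tree


def link_records(records, k, cost=hamming_cost):
    """Build the complete graph, take its MST, keep tree edges with weight <= k."""
    edges = [(cost(records[i], records[j]), i, j)
             for i, j in combinations(range(len(records)), 2)]
    tree = minimum_spanning_tree(len(records), edges)
    return [(i, j) for w, i, j in tree if w <= k]


records = ["ACGTACGT", "ACGTACGA", "TTTTTTTT", "TTTTTTTA"]
links = link_records(records, k=0.2)
```

In this toy example each of the first two and last two records differs in one position out of eight, so only those two pairs remain linked at k = 0.2, while the expensive tree edge between the two groups is removed.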

17
Evaluate the quality of the solution

How good is our model at finding equivalent
records?
Precision $= c / |E'|$, Recall $= c / |E|$, where
c: number of correctly linked rows
|E'|: number of edges in the solution set E'
|E|: number of redundant records in the
database
We evaluate the performance of our algorithm for
31 values of k ranging from zero to one.
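A small helper for this evaluation might look as follows. It is a sketch only: it assumes links are given as pairs of record indices, that the set of true redundant links is known (as in the artificial datasets), and that the 31 values of k are equally spaced, which the slides do not state.

```python
def precision_recall(links, true_links):
    """Precision and recall of a set of proposed links against known true links."""
    found = {frozenset(pair) for pair in links}      # direction of a link is ignored
    truth = {frozenset(pair) for pair in true_links}
    c = len(found & truth)                           # correctly linked pairs
    precision = c / len(found) if found else 0.0
    recall = c / len(truth) if truth else 0.0
    return precision, recall


# 31 equally spaced thresholds from 0 to 1 (the equal spacing is an assumption).
thresholds = [i / 30 for i in range(31)]
```

Calling `precision_recall` on the links produced for each threshold gives one precision/recall point per value of k.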

18
Results

5 Artificially generated datasets
Larger datasets

19
Results

Redundant records detected in the human DNA

20
Conclusions

We cast the problem of searching DNA sequences as
a redundancy detection problem.
An optimization model of the integer programming
type has been devised for it.
We evaluate different string metrics in order to
define the most appropriate weights of our model.
A cheap metric was also proposed in order to
reduce the computational time.
The method suggested is efficient and robust.
Our model reduces the search space of linked
records from possible pairs to
The model is flexible: it can use different cost
functions.

21
K in different domains
