A redundancy detection approach to mining bioinformatics data

1
A redundancy detection approach to mining
bioinformatics data
  • Problem definition
  • Solution approach
  • Graph theory optimization problem.
  • Experimental results

Abdellah Salhi and Horacio Camacho 15/02/2006
2
Problem definition
  • DNA information has not been fully exploited.
  • There is great hope of devising new treatments
    for illnesses such as cancer.
  • Searching sequences of bases has become a crucial
    problem.
  • Searching sequences in databases is
    computationally intensive.
  • The human genome has more than 3 billion base
    pairs.
  • Here we are concerned with the search of 25-base
    DNA sequences.

3
Problem definition
  • Genome data is sliced into sequences of the
    bases C, G, T, A.
  • These sequences are simply words, or strings,
    of length 25.
  • Thus the task of searching for redundant
    records in such a database can be approached as
    one would an ordinary database containing records
    in string form.
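
As a minimal sketch of the slicing step, the snippet below cuts a genome string into 25-base words. The function name `slice_into_words` and the choice of non-overlapping windows are assumptions made for illustration; the slides do not say whether the windows overlap.

```python
def slice_into_words(genome: str, length: int = 25) -> list[str]:
    """Cut a genome string into consecutive, non-overlapping words.

    Non-overlapping windows are an assumption; a trailing fragment
    shorter than `length` is dropped.
    """
    last_start = len(genome) - length
    return [genome[i:i + length] for i in range(0, last_start + 1, length)]


genome = "ACGT" * 25          # 100 bases of toy data
words = slice_into_words(genome)
```

Each element of `words` can then be treated as an ordinary string record.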

4
Solution approach
  • Key is the unique record identifier in a
    database.
  • Key equivalence occurs when two or more records
    point to the same real-world object.

5
Main redundancy detection approaches
  • Probabilistic record linkage.
      • Learn a naïve Bayes classifier from a binary
        feature vector obtained by comparing a pair of
        records.
  • Merge/purge problem.
      • Sliding windows.
  • As a cluster/classification problem.
      • Many classifications.
  • As an optimization problem.
      • Difficult to find the global optimum.

6
Key equivalence as an optimization problem
  • The idea is to compress a database S by
    removing some or all of the redundant records
    contained in it, resulting in H,
  • where |S| is the size of the input database and
    |H| is the size of the processed dataset.
  • There is a cost associated with every compression
    of the database.
  • This is the cost of assuming that a pair of
    records is equivalent:
  • Is record T equivalent to record U?
  • If two records match, the cost of the assumption
    is very small.
  • Minimize the size of the database by minimizing
    a cost function.

7
Cost of equivalence
  • Option 1: As in record linkage, we can use the
    probability of a match given the feature vector.
  • Option 2: There are several string metrics
    which compute weights taking values
      • 1 for a perfect pair match.
      • 0 for a non-match.

8
String metrics for DNA sequences
  • There are several string metrics.
  • Some string metrics are domain dependent,
    e.g. specialized to match names, addresses, etc.
  • We compare around 40 different string metrics.
  • We generated 5 artificial datasets containing DNA
    sequences.
  • Each dataset contains a different set of
    corrupted redundant records,
  • with the probability of corruption ranging from
    0.1 to 0.5.
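
The slides do not say how the corrupted records were produced. As a rough sketch, assuming that corruption means independent base substitutions and that each original sequence gets exactly one corrupted copy (both are assumptions, as are all function names), an artificial dataset could be generated like this:

```python
import random

BASES = "ACGT"


def random_sequence(rng: random.Random, length: int = 25) -> str:
    """Draw a uniformly random DNA word."""
    return "".join(rng.choice(BASES) for _ in range(length))


def corrupt(seq: str, p: float, rng: random.Random) -> str:
    """Replace each base, independently with probability p, by a different base.

    Substitution-only corruption is an assumption, not the authors' procedure.
    """
    out = []
    for base in seq:
        if rng.random() < p:
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)


def make_dataset(n_originals: int, p: float, seed: int = 0) -> list[str]:
    """Return each original followed by one corrupted (redundant) copy."""
    rng = random.Random(seed)
    records = []
    for _ in range(n_originals):
        original = random_sequence(rng)
        records.append(original)
        records.append(corrupt(original, p, rng))
    return records


# One dataset per corruption probability mentioned on the slide.
datasets = {p: make_dataset(100, p) for p in (0.1, 0.2, 0.3, 0.4, 0.5)}
```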

9
String metrics for DNA sequences
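  • Evaluation of a string metric
  • Average precision
    $\mathrm{AvgPrec} = \frac{1}{m} \sum_{l=1}^{N} \frac{c(l)\,\delta(l)}{l}$, where
  • $c(l)$ is the number of correct pairs before
    rank position $l$,
  • $\delta(l)$ equals 1 if the pair at rank $l$
    matches, 0 otherwise,
  • $N$ is the number of ranked pairs and $m$ the
    total number of correct pairs.
  • Maximum F1 over rank positions $l$, where
  • $F1 = \frac{2PR}{P + R}$, $P = \frac{c(l)}{l}$,
    $R = \frac{c(l)}{m}$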
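
The two evaluation measures on this slide can be sketched as follows. The code assumes the standard non-interpolated average precision and the usual F1 (harmonic mean of precision and recall), with matches counted up to and including the current rank; the slide's own formulas did not survive, so this is an interpretation. The input is a list of booleans, one per candidate pair, ordered from the highest-scoring pair to the lowest.

```python
def average_precision(ranked_labels: list[bool]) -> float:
    """Non-interpolated average precision of a ranking of candidate pairs."""
    total_correct = sum(ranked_labels)
    if total_correct == 0:
        return 0.0
    correct_so_far = 0
    score = 0.0
    for rank, is_match in enumerate(ranked_labels, start=1):
        if is_match:
            correct_so_far += 1
            score += correct_so_far / rank   # precision at this rank
    return score / total_correct


def max_f1(ranked_labels: list[bool]) -> float:
    """Largest F1 obtained by cutting the ranking at any rank position."""
    total_correct = sum(ranked_labels)
    best = 0.0
    correct_so_far = 0
    for rank, is_match in enumerate(ranked_labels, start=1):
        correct_so_far += is_match
        if correct_so_far == 0:
            continue
        precision = correct_so_far / rank
        recall = correct_so_far / total_correct
        best = max(best, 2 * precision * recall / (precision + recall))
    return best


ranking = [True, False, True]   # toy ranking: pairs 1 and 3 are true matches
ap = average_precision(ranking)
f1 = max_f1(ranking)
```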

10
String metrics for DNA sequences
  • Needleman-Wunsch seems to be the best method, but
    it requires a lot of computational time.
  • We propose a cheap string metric which requires
    40% of the time required by Needleman-Wunsch:
  • Sample 50% of the characters of sequence T at
    equally spaced locations.
  • Compare each sampled position i with positions
    i-1 to i+1 of sequence U.
  • If the rate of agreement between T and U is
    > 80%, then compute Needleman-Wunsch.
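
A minimal sketch of this two-stage metric is given below. Several details are assumptions rather than the authors' implementation: the scoring scheme (match +1, mismatch -1, gap -1), the reading of the sampling rule as "the base at sampled position i of T must occur in U within one position of i", and all function names. Sequences are assumed non-empty.

```python
def needleman_wunsch(t: str, u: str, match: int = 1,
                     mismatch: int = -1, gap: int = -1) -> int:
    """Global alignment score via the standard dynamic programme (two rows)."""
    prev = [j * gap for j in range(len(u) + 1)]
    for i in range(1, len(t) + 1):
        curr = [i * gap]
        for j in range(1, len(u) + 1):
            diag = prev[j - 1] + (match if t[i - 1] == u[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def agreement_rate(t: str, u: str, fraction: float = 0.5) -> float:
    """Share of sampled positions i of t whose base occurs in u[i-1 : i+2]."""
    n_samples = max(1, int(len(t) * fraction))
    step = len(t) / n_samples
    positions = [int(s * step) for s in range(n_samples)]
    hits = sum(1 for i in positions if t[i] in u[max(0, i - 1):i + 2])
    return hits / len(positions)


def cheap_metric(t: str, u: str, threshold: float = 0.8):
    """Run Needleman-Wunsch only for pairs that pass the cheap filter.

    Returns None for pairs that are filtered out.
    """
    if agreement_rate(t, u) > threshold:
        return needleman_wunsch(t, u)
    return None


same = cheap_metric("ACGT" * 6 + "A", "ACGT" * 6 + "A")
different = cheap_metric("A" * 25, "C" * 25)
```

The saving comes from the filter: the quadratic-time alignment is skipped for pairs that disagree at most sampled positions.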

11
Cost of equivalence
  • We use the Needleman-Wunsch string metric to
    obtain the cost of matching two records.
  • We can model the equivalences as a directed graph.
  • The directed graph must satisfy the following
    two rules:
  • One record cannot be equivalent to more than one
    record.
  • Two linked records are joined by an acyclic
    relation.

[Figure: example directed graph on the nodes a, b, c]
12
Optimization model
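Given a graph $G = (V, E)$,
find a subset $E' \subseteq E$
that minimizes the function
$\sum_{(i,j) \in E'} (c_{ij} - k)$
subject to constraints (1), (2) and (3),
where
$c_{ij}$ is a cost or distance from $i$ to $j$ and
$(i, j)$ represents an arc from $i$ to $j$.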
13
Optimization model
  • The model contains a constant value k.
  • If k = 0, the solution is empty and no record is
    merged.
  • If k is very large, all records are matched.
  • As k increases from 0, the optimal objective
    value becomes negative.
  • Because we are minimizing, no selected cost is
    greater than k.
  • Thus k takes the role of a threshold.

14
Solve the model
  • The model belongs to a family of integer
    programming problems with constraints similar to
    those of the TSP, but it is not as difficult.
  • It can be solved efficiently by conventional
    approaches.

15
Solve the model
  • Any tree satisfies restrictions of our model.
  • A minimum-weight tree in a weighted graph which
    contains all of the graph's vertices is a minimum
    spanning tree.
  • A spanning forest of a connected graph G is a
    forest whose components are subtrees of a
    spanning tree of G.
  • A minimum spanning forest of a connected graph G
    is a forest whose components are minimum spanning
    trees of the corresponding components in G.
  • For a given k, the optimal solution to the model
    can be obtained in polynomial time.

16
Proposed algorithm
  • 1. Find the weighted complete graph G
    corresponding to the given database.
  • 2. Find the minimum spanning tree of G.
  • 3. Assign to k the largest weight for which a
    good match between records is found.
  • 4. Remove all edges with weights > k from the
    minimum spanning tree of G found in step 2.
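
The four steps can be sketched with Kruskal's algorithm and a union-find structure. This is an illustration under stated assumptions, not the authors' code: a normalized Hamming distance stands in for the string-metric cost, k is passed in by the caller instead of being chosen as in step 3, and the names `hamming_cost`, `minimum_spanning_tree` and `link_records` are hypothetical.

```python
from itertools import combinations


def hamming_cost(t: str, u: str) -> float:
    """Fraction of differing positions; a stand-in for the real string metric."""
    return sum(a != b for a, b in zip(t, u)) / len(t)


def minimum_spanning_tree(n: int, edges: list[tuple[float, int, int]]):
    """Kruskal's algorithm; edges are (weight, i, j) over vertices 0..n-1."""
    parent = list(range(n))

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    tree = []
    for w, i, j in sorted(edges):
        ri, rj = find(i), find(j)
        if ri != rj:                        # edge joins two components
            parent[ri] = rj
            tree.append((w, i, j))
    return tree


def link_records(records: list[str], k: float, cost=hamming_cost):
    """Steps 1, 2 and 4: complete graph, spanning tree, drop edges heavier than k."""
    edges = [(cost(records[i], records[j]), i, j)
             for i, j in combinations(range(len(records)), 2)]
    tree = minimum_spanning_tree(len(records), edges)
    return [(i, j) for w, i, j in tree if w <= k]


records = ["ACGTACGT", "ACGTACGA", "TTTTCCCC", "TTTTCCCG"]
links = link_records(records, k=0.2)
```

Removing the heavy edges turns the spanning tree into a forest; each remaining tree groups records that are assumed to be equivalent.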

17
Evaluate the quality of the solution
  • How good is our model at finding equivalent
    records?
  • Precision $= c / |E'|$, recall $= c / |E|$, where
  • c is the number of correctly linked rows,
  • |E'| is the number of edges in the solution set
    E',
  • |E| is the number of redundant records in the
    database.
  • We evaluate the performance of our algorithm for
    31 values of k ranging from zero to one.
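
A small sketch of this evaluation follows. It assumes that the ground truth is available as a set of record-index pairs and that precision is reported as 0 when no link is proposed; both are conventions chosen here for illustration. The grid of 31 equally spaced thresholds is also an assumption about how the values of k were chosen.

```python
def precision_recall(links, true_links):
    """Precision and recall of proposed links against known redundant pairs.

    Pairs are compared as unordered sets, so (0, 1) and (1, 0) are the same link.
    """
    predicted = {frozenset(pair) for pair in links}
    truth = {frozenset(pair) for pair in true_links}
    c = len(predicted & truth)                     # correctly linked pairs
    precision = c / len(predicted) if predicted else 0.0
    recall = c / len(truth) if truth else 0.0
    return precision, recall


# 31 equally spaced thresholds from 0 to 1 inclusive.
k_values = [i / 30 for i in range(31)]

p, r = precision_recall([(0, 1), (2, 3), (0, 2)], [(1, 0), (2, 3)])
```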

18
Results
  • 5 Artificially generated datasets
  • Larger datasets

19
Results
  • Redundant records detected in the human DNA

20
Conclusions
  • We cast the problem of searching DNA sequences as
    a redundancy detection problem.
  • An optimization model of the integer programming
    type has been devised for it.
  • We evaluate different string metrics in order to
    define the most appropriate weights of our model.
  • A cheap metric was also proposed in order to
    reduce the computational time.
  • The method suggested is efficient and robust.
  • Our model reduces the search space of linked
    records from possible pairs to
  • The model is flexible: it can use different cost
    functions.

21
K in different domains