Discovering Substructures in Chemical Toxicity Domain

1
Discovering Substructures in Chemical Toxicity Domain
Master's Project Defense by Ravindra Nath Chittimoori
Committee: Dr. Lawrence B. Holder, Dr. Diane J. Cook, Dr. Lynn Peterson
Department of Computer Science and Engineering, University of Texas at Arlington
2
Outline
  • Chemical Toxicity Database
  • Motivation and Goal
  • Knowledge Discovery in Databases (KDD)
  • SUBDUE Knowledge Discovery System
  • Experiments with Unsupervised SUBDUE
  • Experiments with Supervised SUBDUE
  • Discussion of Results
  • Conclusions
  • Future Work

3
Chemical Toxicity Database
  • Carcinogenesis Prediction Problem
  • Toxicology Evaluation Challenge
  • Domain
      Compounds            +      -   Total
      Training set       162    136     298
      Experimental set  ? 27   ? 25      69

4
Motivation and Goal
  • Ever-increasing number of chemical compounds
  • These compounds need analysis to obtain their structure-activity
    relationships
  • Determine SUBDUE's applicability to the chemical toxicity domain

5
Knowledge Discovery in Databases (KDD)
  • Process of identifying valid, novel, potentially useful and
    understandable patterns in data
  • Goal of Knowledge Discovery
    -- Verification
    -- Discovery
  • Data mining methods
    -- Model Representation, Evaluation and Search

6
Steps in KDD
  • Identify the goal of the process
  • Collect, create and prepare the dataset
  • Select the data mining method
  • Select the data mining algorithm
  • Transform the data
  • Execute the algorithm
  • Interpret/evaluate the discovered patterns
  • Consolidate the knowledge discovered

7
SUBDUE Knowledge Discovery System
  • SUBDUE discovers patterns (substructures) in structural data sets

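  • Vertices: objects or attributes
  • Edges: relationships

[Figure: example substructure in which an object with shape triangle is "on" an object with shape square; 4 instances of this substructure appear in the input graph.]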
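
The vertex/edge view above can be sketched in a few lines of Python. This is only an illustration of the data model, not SUBDUE's own code or input format: the `Edge` tuple, `build_scene` and `count_triangle_on_square` are hypothetical names, and the counting function is hard-wired to the triangle-on-square pattern rather than performing general substructure discovery.

```python
from collections import namedtuple

# A labeled, directed edge between two vertex ids (illustrative type).
Edge = namedtuple("Edge", ["src", "label", "dst"])

def build_scene(n_pairs):
    """Build a toy graph holding n_pairs triangle-on-square pairs."""
    vertices = {}  # vertex id -> vertex label
    edges = []
    for i in range(n_pairs):
        tri, sq = f"obj{2 * i}", f"obj{2 * i + 1}"
        vertices[tri] = "object"
        vertices[sq] = "object"
        vertices[tri + "_shape"] = "triangle"
        vertices[sq + "_shape"] = "square"
        edges.append(Edge(tri, "shape", tri + "_shape"))
        edges.append(Edge(sq, "shape", sq + "_shape"))
        edges.append(Edge(tri, "on", sq))
    return vertices, edges

def count_triangle_on_square(vertices, edges):
    """Count 'on' edges that go from a triangle object to a square object."""
    shape_of = {e.src: vertices[e.dst] for e in edges if e.label == "shape"}
    return sum(
        1
        for e in edges
        if e.label == "on"
        and shape_of.get(e.src) == "triangle"
        and shape_of.get(e.dst) == "square"
    )

vertices, edges = build_scene(4)
print(count_triangle_on_square(vertices, edges))
```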
8
SUBDUE - Input Representation
  • Each atom is represented as a vertex with directed edges to the
    name, type and partial charge of the atom
  • Bonds are represented as undirected edges
  • Each group is represented as a vertex having a string label
    specifying the group name, with directed edges to all
    participating atom vertices
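
As a rough sketch of the encoding described above, the following Python builds the vertex and edge lists for one compound. The function name `encode_compound`, the tuple layout, and the use of a boolean flag for directedness are assumptions made for illustration; this is not SUBDUE's actual input file format. The edge labels n, t, p and gr follow the legend used in the slides, and the sample values are two carbon atoms in a methyl group.

```python
def encode_compound(atoms, bonds, groups):
    """Return (vertices, edges) for one compound.

    atoms:  list of (name, atom_type, partial_charge)
    bonds:  list of (atom_index_a, atom_index_b, bond_label)
    groups: list of (group_name, [atom_index, ...])
    """
    vertices = []  # vertex labels; the list index is the vertex id
    edges = []     # (src, dst, label, directed)

    def add_vertex(label):
        vertices.append(label)
        return len(vertices) - 1

    atom_ids = []
    for name, atom_type, charge in atoms:
        a = add_vertex("atom")
        atom_ids.append(a)
        # Directed edges from the atom to its name, type and partial charge.
        edges.append((a, add_vertex(name), "n", True))
        edges.append((a, add_vertex(atom_type), "t", True))
        edges.append((a, add_vertex(charge), "p", True))

    # Bonds are undirected edges between atom vertices.
    for i, j, label in bonds:
        edges.append((atom_ids[i], atom_ids[j], label, False))

    # Each group vertex points at every atom that participates in it.
    for group_name, members in groups:
        g = add_vertex(group_name)
        for i in members:
            edges.append((g, atom_ids[i], "gr", True))

    return vertices, edges

vertices, edges = encode_compound(
    atoms=[("C", 10, 0.063), ("C", 10, 0.062)],
    bonds=[(0, 1, 1)],
    groups=[("Methyl", [0, 1])],
)
print(len(vertices), len(edges))
```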

9
SUBDUE - Input Representation
  • Representation used in Unsupervised SUBDUE
    -- A vertex having a string label specifying the alert, with
       directed edges to all the atoms in the compound
  • Representation used in Supervised SUBDUE
    -- A vertex for each compound, with the string label compound
    -- The compound vertex has directed edges to all the vertices
       representing the activity of an alert on a compound
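
One way to see the difference between the two representations is to count the edges each one adds for a compound. The sketch below is an interpretation of the bullets above, not SUBDUE code; the function names, the tuple-based vertex ids, and the 12-atom compound are hypothetical.

```python
def unsupervised_alert_edges(alert, outcome, atom_ids):
    """Unsupervised form: one alert vertex with an edge to every atom."""
    alert_vertex = ("alert", alert)
    return [(alert_vertex, atom, outcome) for atom in atom_ids]

def supervised_alert_edges(compound_id, alert_outcomes):
    """Supervised form: one compound vertex with one edge per alert result."""
    compound_vertex = ("compound", compound_id)
    return [
        (compound_vertex, ("outcome", compound_id, alert, outcome), alert)
        for alert, outcome in alert_outcomes.items()
    ]

atoms = list(range(12))  # a hypothetical 12-atom compound
unsup = unsupervised_alert_edges("Ames", "po", atoms)
sup = supervised_alert_edges("c1", {"Ames": "positive", "Salmonella": "positive"})
print(len(unsup), len(sup))
```

Under this reading, the unsupervised form grows with the number of atoms per alert, while the supervised form adds a single edge per alert regardless of compound size.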

10
Unsupervised SUBDUE Input Representation Example
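[Figure: two Atom vertices joined by a bond edge labeled 1. Each atom has edges n to C, t to 10 and p to its partial charge (0.063 and 0.062). A Methyl vertex has gr edges to both atoms, and an Ames vertex has po edges to both atoms.]
Legend: n - Name, t - Type, p - Partial charge, po - Positive, gr - Group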
11
Supervised SUBDUE Input Representation Example
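[Figure: two Atom vertices joined by a bond edge labeled 1. Each atom has edges n to C, t to 10 and p to its partial charge (0.063 and 0.062). A Methyl vertex has gr edges to both atoms. A Com vertex has contains edges to the atoms and is linked to the Ames result Positive.]
Legend: n - Name, t - Type, p - Partial charge, gr - Group, Com - Compound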
12
SUBDUE - Model Evaluation
  • Minimum Description Length Principle
  • Best theory to describe any graph
  • Minimize I(S) + I(G|S): the bits needed to encode substructure S
    plus the bits needed to encode graph G given S
  • Graph Compression
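
The idea behind minimizing I(S) + I(G|S) can be illustrated with a deliberately crude stand-in for description length. The sketch below counts vertices plus edges instead of bits, and assumes the instances of S do not overlap and that each instance collapses to a single vertex. These simplifications, and the function names, are assumptions made for illustration; SUBDUE's actual MDL encoding is more detailed.

```python
def size(num_vertices, num_edges):
    """Crude proxy for description length: vertices plus edges."""
    return num_vertices + num_edges

def compressed_size(graph, sub, instances):
    """Proxy for I(S) + I(G|S).

    graph, sub: (num_vertices, num_edges) pairs.
    instances:  number of non-overlapping occurrences of sub in graph.
    """
    gv, ge = graph
    sv, se = sub
    # Each instance is replaced by one vertex; its internal edges vanish.
    cv = gv - instances * (sv - 1)
    ce = ge - instances * se
    return size(sv, se) + size(cv, ce)

def compression(graph, sub, instances):
    """Ratio above 1 means the substructure compresses the graph."""
    return size(*graph) / compressed_size(graph, sub, instances)

# A graph of 16 vertices and 12 edges holding four copies of a
# 4-vertex, 3-edge substructure.
print(compressed_size((16, 12), (4, 3), 4))
print(round(compression((16, 12), (4, 3), 4), 3))
```

A substructure that occurs often and is itself small gives the highest ratio, which is why compression is used to rank candidate substructures.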

13
Other important Concepts of SUBDUE
  • Inexact Graph Match Approach
  • Concept Learning
  • Predefined Substructures

14
Unsupervised SUBDUE - Methodology
  • Training set further divided
  • 3 approaches to determine carcinogenicity of compounds in
    experimental set
    -- Apply SUBDUE individually to the compounds
    -- Inclusion of pre-defined substructures
    -- Check for matching of substructure in the compound to be
       classified
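
The third approach, checking whether a discovered substructure occurs in the compound to be classified, amounts to a subgraph match. A minimal brute-force sketch follows. It performs an exact match only (not SUBDUE's inexact graph match), tries every assignment of pattern vertices to compound vertices, and is therefore practical only for very small graphs. The function name and graph layout are assumptions made for this example.

```python
from itertools import permutations

def contains_substructure(compound, pattern):
    """Exact match: does `pattern` occur as a subgraph of `compound`?

    Each graph is (labels, edges): labels is a list of vertex labels and
    edges is a set of (src, dst, edge_label) triples over list indices.
    An undirected edge can be stored once in each direction.
    """
    c_labels, c_edges = compound
    p_labels, p_edges = pattern
    for mapping in permutations(range(len(c_labels)), len(p_labels)):
        # Vertex labels must agree under the candidate mapping.
        if any(c_labels[mapping[i]] != p_labels[i] for i in range(len(p_labels))):
            continue
        # Every pattern edge must exist between the mapped vertices.
        if all((mapping[s], mapping[d], lab) in c_edges for s, d, lab in p_edges):
            return True
    return False

# Toy compound: two bonded atoms named c and br.
compound = (
    ["atom", "c", "atom", "br"],
    {(0, 1, "n"), (2, 3, "n"), (0, 2, "1"), (2, 0, "1")},
)
has_br = contains_substructure(compound, (["atom", "br"], {(0, 1, "n")}))
has_cl = contains_substructure(compound, (["atom", "cl"], {(0, 1, "n")}))
print(has_br, has_cl)
```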
15
Unsupervised SUBDUE - Results
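[Figure: discovered substructure of two atom vertices joined by a bond edge labeled 1, each with n, t and p edges; names c and br, types 10 and 3, partial charges 0.062 and 0.057.]
  • Third approach used to classify compounds in experimental set
  • Accuracy level: 0.322
  • Cyanate ether groups are also discovered to be indicators of
    carcinogenic activity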

16
Supervised SUBDUE - Methodology
  • Create set of indicators of carcinogenic activity
  • Create set of indicators of noncarcinogenic activity
  • Calculate value of substructures discovered in carcinogenic and
    noncarcinogenic set
  • Select a set of substructures to be used in classifying
    compounds in experimental set

17
Supervised SUBDUE - Methodology
  • Check for the existence of these substructures in the compound
    to be classified
  • Calculate the Carcinogenic Activity Value of the compound
  • Calculate the NonCarcinogenic Activity Value of the compound
  • Determine the activity of the compound
18
Supervised SUBDUE - Results
  • A set of 12 substructures discovered by SUBDUE
    used to classify compounds in the experimental
    set
  • 6 substructures from carcinogenic set include
    substructures which form part of groups like
    amino, di10, methyl, ether, halide10 and
    substructure which indicates compound testing
    positive on AMES, Salmonella, etc.
  • 6 substructures from noncarcinogenic set include
    substructures which form part of groups like
    methoxy, Ar_Halide, di64, nitro and alkyl_halide
    and substructure which indicates compound testing
    negative on AMES, Salmonella, etc.

19
Supervised SUBDUE - Substructure Example - Carcinogenic Set
[Figure: a Compound vertex with edges labeled Ames, Salmonella and Salmonella_n, each leading to a vertex labeled positive.]
20
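Supervised SUBDUE - Substructure Example - Carcinogenic Set
[Figure: two Atom vertices, each with n, t and p edges: one with name Cl, type 93 and partial charge -0.123, the other with name C, type 10 and partial charge -0.024. A Halide10 group vertex has gr edges to both atoms.]
Legend: n - Name, t - Type, p - Partial charge, gr - Group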
21
Supervised SUBDUE - Substructure Example - NonCarcinogenic Set
[Figure: a Compound vertex with edges labeled Ames, Salmonella and Cytogen_ca, each leading to a vertex labeled negative.]
22
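Supervised SUBDUE - Substructure Example - NonCarcinogenic Set
[Figure: two Atom vertices, each with n, t and p edges: one with name Cl, type 93 and partial charge -0.124, the other with name C, type 10 and partial charge 0.477. An A-H group vertex has gr edges to both atoms.]
Legend: n - Name, t - Type, p - Partial charge, gr - Group, A-H - Alkyl Halide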
23
Supervised SUBDUE - Results
  • PTE-1 Results
      Compounds               +     -   Total
      PTE-1                  20    19      39
      Correct Prediction     12     6      18
      Incorrect Prediction    8    13      21
  • Accuracy: 0.6 (+), 0.316 (-), 0.462 (total)

24
Supervised SUBDUE - Results
  • PTE-2 Results (compounds whose activity is known)
      Compounds               +     -   Total
      PTE-2                   7     6      13
      Correct Prediction      4     3       7
      Incorrect Prediction    3     3       6
  • Accuracy: 0.571 (+), 0.5 (-), 0.538 (total)
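
The per-class and total accuracies for PTE-1 and PTE-2 can be recomputed from the correct-prediction counts and class sizes in the two results slides. The short sketch below does that arithmetic; the dictionary layout and the function name `summarize` are choices made for this example only.

```python
# (correct predictions, compounds in class), taken from the results slides.
results = {
    "PTE-1": {"+": (12, 20), "-": (6, 19)},
    "PTE-2": {"+": (4, 7), "-": (3, 6)},
}

def summarize(per_class):
    """Return accuracy per class and overall, rounded to 3 decimals."""
    correct = sum(c for c, _ in per_class.values())
    total = sum(t for _, t in per_class.values())
    out = {label: round(c / t, 3) for label, (c, t) in per_class.items()}
    out["total"] = round(correct / total, 3)
    return out

for name, per_class in results.items():
    print(name, summarize(per_class))
```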

25
Results - Discussion
  • Unsupervised SUBDUE successful in discovering lead indicators of
    carcinogenic activity
  • Supervised SUBDUE also successful in discovering lead indicators
    of carcinogenic activity
  • ILP system PROGOL: PTE-1 (0.72), PTE-2 (0.62)
  • Ashby, TOPKAT are other toxicity prediction methods

26
Conclusions
  • Consistent with results obtained by logic-based systems like
    PROGOL
  • Prefer to use Concept Learner when positive and negative
    examples of target concept are available
  • SUBDUE is capable of discovering lead indicators of
    carcinogenic/noncarcinogenic activity in chemical toxicity domain

27
Future Work
  • PTE-3 Evaluation Challenge
  • Trimmed Data Sets (Partial Charge)
  • Newer version of Concept Learning SUBDUE being developed

28
Reference
http://cygnus.uta.edu/subdue