Title: Discovering Substructures in Chemical Toxicity Domain
1Discovering Substructures in Chemical Toxicity Domain
Master's Project Defense by Ravindra Nath Chittimoori
Committee: Dr. Lawrence B. Holder, Dr. Diane J. Cook, Dr. Lynn Peterson
Department of Computer Science and Engineering, University of Texas at Arlington
2Outline
- Chemical Toxicity Database
- Motivation and Goal
- Knowledge Discovery in Databases (KDD)
- SUBDUE Knowledge Discovery System
- Experiments with Unsupervised SUBDUE
- Experiments with Supervised SUBDUE
- Discussion of Results
- Conclusions
- Future Work
3Chemical Toxicity Database
- Carcinogenesis Prediction Problem
- Toxicology Evaluation Challenge
- Domain
- Compounds: + (carcinogenic), - (noncarcinogenic), Total
- Training set: 162 (+), 136 (-), 298 (total)
- Experimental set: ? 27 (+), ? 25 (-), 69 (total)
4Motivation and Goal
- Ever-increasing number of chemical compounds
- Each compound needs analysis to obtain its structure-activity relationships
- Determine SUBDUE's applicability to the chemical toxicity domain
5Knowledge Discovery in Databases (KDD)
- Process of identifying valid, novel, potentially useful and understandable patterns in data
- Goal of Knowledge Discovery
- Verification
- Discovery
- Data mining methods
- Model Representation, Evaluation and Search
6Steps in KDD
- Identify the goal of the process
- Collect, create and prepare the dataset
- Select the data mining method
- Select the data mining algorithm
- Transform the data
- Execute the algorithm
- Interpret/evaluate the discovered patterns
- Consolidate the knowledge discovered
7SUBDUE Knowledge Discovery System
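- SUBDUE discovers patterns (substructures) in structural data sets
- Vertices: objects or attributes
- Edges: relationships
[Figure: example substructure in which an object with shape triangle is on an object with shape square; 4 instances of this substructure occur in the example graph.]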
8SUBDUE - Input Representation
- Each atom is represented as a vertex with directed edges to the name, type and partial charge of the atom
- Bonds are represented as undirected edges
- Each group is represented as a vertex having a string label specifying the group name, with directed edges to all participating atom vertices
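The representation described above can be sketched as a small labeled graph. The Python below is only an illustration and not SUBDUE's actual input format: the Graph class and the helper functions are hypothetical, the edge labels n, t, p and gr and the bond label 1 follow the legend on the example slides, and the atom values are taken from the example figure.

```python
from dataclasses import dataclass, field


@dataclass
class Graph:
    """Tiny labeled graph: vertices carry labels, edges carry a label and a direction flag."""
    vertices: dict = field(default_factory=dict)  # vertex id -> label
    edges: list = field(default_factory=list)     # (source, target, label, directed)

    def add_vertex(self, label):
        vid = len(self.vertices) + 1
        self.vertices[vid] = label
        return vid

    def add_edge(self, src, dst, label, directed=True):
        self.edges.append((src, dst, label, directed))


def add_atom(g, name, atom_type, charge):
    """Atom vertex with directed edges to its name, type and partial charge."""
    atom = g.add_vertex("atom")
    g.add_edge(atom, g.add_vertex(name), "n")
    g.add_edge(atom, g.add_vertex(atom_type), "t")
    g.add_edge(atom, g.add_vertex(charge), "p")
    return atom


def add_bond(g, a, b, bond_type):
    """Bonds are undirected edges between two atom vertices."""
    g.add_edge(a, b, bond_type, directed=False)


def add_group(g, group_name, atoms):
    """Group vertex labeled with the group name, with directed edges to its atoms."""
    group = g.add_vertex(group_name)
    for atom in atoms:
        g.add_edge(group, atom, "gr")
    return group


# Two bonded carbon atoms that both take part in a methyl group.
g = Graph()
c1 = add_atom(g, "c", 10, 0.063)
c2 = add_atom(g, "c", 10, 0.062)
add_bond(g, c1, c2, 1)
methyl = add_group(g, "methyl", [c1, c2])
```

Storing the direction flag on each edge keeps bonds and attribute edges in one list, which is a convenience of this sketch rather than a property of SUBDUE.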
9SUBDUE - Input Representation
- Representation used in Unsupervised SUBDUE
-- A vertex having a string label specifying the alert, with directed edges to all the atoms in the compound
- Representation used in Supervised SUBDUE
-- A vertex for all the compounds with the string label compound
-- The compound vertex has directed edges to all the vertices representing the activity of an alert on a compound
10Unsupervised SUBDUE Input Representation Example
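[Figure: two atom vertices joined by a bond labeled 1; both have name C and type 10, with partial charges 0.063 and 0.062. A Methyl vertex has gr edges to the atoms and an Ames vertex has po edges to the atoms.]
Legend: n - Name, t - Type, p - Partial charge, po - Positive, gr - Group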
11Supervised SUBDUE Input Representation Example
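[Figure: two atom vertices joined by a bond labeled 1; both have name C and type 10, with partial charges 0.063 and 0.062. A Methyl vertex has gr edges to the atoms. A Com vertex is attached through contains edges, together with Ames and Positive vertices.]
Legend: n - Name, t - Type, p - Partial charge, gr - Group, Com - Compound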
12SUBDUE - Model Evaluation
- Minimum Description Length Principle
- Best theory to describe any graph
- Minimize I(S) + I(G|S)
- Graph Compression
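The objective in the bullets above can be written out as follows. Here I(S) is taken to be the number of bits needed to encode a substructure S, and I(G|S) the number of bits needed to encode the graph G once its instances of S have been replaced by a single vertex. The compression ratio shown is one common way to express graph compression and is an assumption here, since the slide does not spell it out.

```latex
S^{*} = \arg\min_{S} \bigl[ I(S) + I(G \mid S) \bigr],
\qquad
\text{compression}(S, G) = \frac{I(S) + I(G \mid S)}{I(G)}
```

Under this reading, a smaller ratio means the substructure describes the graph more compactly.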
13Other important Concepts of SUBDUE
- Inexact Graph Match Approach
- Concept Learning
- Predefined Substructures
14Unsupervised SUBDUE - Methodology
- Training set further divided
- 3 approaches to determine carcinogenicity of compounds in experimental set
-- Apply SUBDUE individually to the compounds
-- Inclusion of pre-defined substructures
-- Check for matching of substructure in the compound to be classified
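The third approach, checking whether a discovered substructure occurs in a compound, can be illustrated with a small exact-match search. This is a minimal sketch under stated assumptions, not SUBDUE's matcher: SUBDUE also supports inexact graph matching, which is not shown. The function name, the dictionary-and-set graph encoding, and the choice to store each undirected bond as two directed triples are all hypothetical.

```python
def contains_substructure(sub_labels, sub_edges, g_labels, g_edges):
    """Return True if the labeled graph g contains the substructure (exact match).

    *_labels: dict mapping vertex id -> label
    *_edges:  set of (source, target, edge label) triples
    """
    sub_vertices = list(sub_labels)

    def consistent(mapping):
        # Every substructure edge whose endpoints are both mapped must exist in g.
        for u, v, label in sub_edges:
            if u in mapping and v in mapping:
                if (mapping[u], mapping[v], label) not in g_edges:
                    return False
        return True

    def extend(i, mapping):
        # Backtracking: map substructure vertices one at a time.
        if i == len(sub_vertices):
            return True
        u = sub_vertices[i]
        for cand, label in g_labels.items():
            if label != sub_labels[u] or cand in mapping.values():
                continue
            mapping[u] = cand
            if consistent(mapping) and extend(i + 1, mapping):
                return True
            del mapping[u]
        return False

    return extend(0, {})


# Substructure: an atom named br bonded (label "1") to an atom named c.
sub_labels = {1: "atom", 2: "br", 3: "atom", 4: "c"}
sub_edges = {(1, 2, "n"), (3, 4, "n"), (1, 3, "1"), (3, 1, "1")}

# Hypothetical compound: c bonded to br and to a second c.
compound_labels = {1: "atom", 2: "c", 3: "atom", 4: "br", 5: "atom", 6: "c"}
compound_edges = {
    (1, 2, "n"), (3, 4, "n"), (5, 6, "n"),
    (1, 3, "1"), (3, 1, "1"),  # c - br bond
    (1, 5, "1"), (5, 1, "1"),  # c - c bond
}
# Same atoms, but without the c - br bond.
other_edges = compound_edges - {(1, 3, "1"), (3, 1, "1")}

found = contains_substructure(sub_labels, sub_edges, compound_labels, compound_edges)
missing = contains_substructure(sub_labels, sub_edges, compound_labels, other_edges)
```

The search is exponential in the worst case, which is acceptable for substructures of a few vertices like the ones shown in these slides.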
15Unsupervised SUBDUE - Results
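[Figure: discovered substructure with two atom vertices joined by a bond labeled 1; names c and br, types 10 and 3, partial charges 0.062 and 0.057.]
Legend: n - Name, t - Type, p - Partial charge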
- Third approach used to classify compounds in experimental set
- Accuracy level: 0.322
- Cyanate ether groups are also discovered to be indicators of carcinogenic activity
16Supervised SUBDUE - Methodology
- Create set of indicators of carcinogenic activity
- Create set of indicators of noncarcinogenic activity
- Calculate value of substructures discovered in carcinogenic and noncarcinogenic set
- Select a set of substructures to be used in classifying compounds in experimental set
17Supervised SUBDUE - Methodology
- Check for the existence of these substructures in the compound to be classified
- Calculate the Carcinogenic Activity Value of the compound
- Calculate the NonCarcinogenic Activity Value of the compound
- Determine the activity of the compound
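The classification steps above can be sketched in a few lines. The slides do not give the formulas for the substructure value or for the two activity values, so everything specific here is an assumption: each activity value is taken to be the sum of the values of the indicator substructures present in the compound, the larger value decides the activity, and ties are reported as undetermined. Compounds are reduced to sets of hypothetical feature names, and the numeric values are placeholders, not results from the project.

```python
def activity_value(compound_features, indicators):
    """Sum the values of the indicator substructures found in the compound (assumed scoring)."""
    return sum(value for name, value in indicators.items() if name in compound_features)


def classify(compound_features, carcinogenic, noncarcinogenic):
    """Compare the two activity values and return the predicted activity."""
    carc = activity_value(compound_features, carcinogenic)
    noncarc = activity_value(compound_features, noncarcinogenic)
    if carc > noncarc:
        return "carcinogenic"
    if noncarc > carc:
        return "noncarcinogenic"
    return "undetermined"  # assumption: ties are left unresolved


# Placeholder indicator sets with made-up values, for illustration only.
carcinogenic_indicators = {"halide10": 1.0, "ames_positive": 2.0}
noncarcinogenic_indicators = {"alkyl_halide": 1.0, "ames_negative": 2.0}

compound_a = {"halide10", "ames_positive", "alkyl_halide"}
compound_b = {"alkyl_halide", "ames_negative"}
compound_c = {"halide10", "alkyl_halide"}

label_a = classify(compound_a, carcinogenic_indicators, noncarcinogenic_indicators)
label_b = classify(compound_b, carcinogenic_indicators, noncarcinogenic_indicators)
label_c = classify(compound_c, carcinogenic_indicators, noncarcinogenic_indicators)
```

In a fuller version, the membership test `name in compound_features` would be replaced by a graph match of each substructure against the compound's graph.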
18Supervised SUBDUE - Results
- A set of 12 substructures discovered by SUBDUE is used to classify compounds in the experimental set
- The 6 substructures from the carcinogenic set include substructures that form part of groups like amino, di10, methyl, ether and halide10, and a substructure indicating that the compound tests positive on AMES, Salmonella, etc.
- The 6 substructures from the noncarcinogenic set include substructures that form part of groups like methoxy, Ar_Halide, di64, nitro and alkyl_halide, and a substructure indicating that the compound tests negative on AMES, Salmonella, etc.
19Supervised SUBDUE - Substructure Example - Carcinogenic Set
[Figure: a Compound vertex connected to Ames, Salmonella and Salmonella_n vertices, each marked positive.]
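20Supervised SUBDUE - Substructure Example - Carcinogenic Set
[Figure: two atom vertices with names Cl and C, types 93 and 10, and partial charges -0.123 and -0.024; a Halide10 vertex has gr edges to both atoms.]
Legend: n - Name, t - Type, p - Partial charge, gr - Group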
21Supervised SUBDUE - Substructure Example - NonCarcinogenic Set
[Figure: a Compound vertex connected to Ames, Salmonella and Cytogen_ca vertices, each marked negative.]
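22Supervised SUBDUE - Substructure Example - NonCarcinogenic Set
[Figure: two atom vertices with names Cl and C, types 93 and 10, and partial charges -0.124 and 0.477; an A-H vertex has gr edges to both atoms.]
Legend: n - Name, t - Type, p - Partial charge, gr - Group, A-H - Alkyl Halide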
23Supervised SUBDUE - Results
- PTE-1 Results
- Compounds in PTE-1: 20 (+), 19 (-), 39 (total)
- Correct Prediction: 12 (+), 6 (-), 18 (total)
- Incorrect Prediction: 8 (+), 13 (-), 21 (total)
- Accuracy: 0.6 (+), 0.316 (-), 0.462 (total)
24Supervised SUBDUE - Results
- PTE-2 Results
- Compounds in PTE-2 whose activity is known: 7 (+), 6 (-), 13 (total)
- Correct Prediction: 4 (+), 3 (-), 7 (total)
- Incorrect Prediction: 3 (+), 3 (-), 6 (total)
- Accuracy: 0.571 (+), 0.5 (-), 0.538 (total)
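The accuracy figures for PTE-1 and PTE-2 follow from the prediction counts as correct predictions divided by the number of compounds in each class. The short Python check below recomputes them from the counts on the two results slides; the function and variable names are illustrative, and values are rounded to three decimals.

```python
# (correct predictions, compounds) per class, taken from the results slides.
results = {
    "PTE-1": {"+": (12, 20), "-": (6, 19)},
    "PTE-2": {"+": (4, 7), "-": (3, 6)},
}


def summarize(per_class):
    """Per-class and overall accuracy, rounded to three decimals."""
    summary = {label: round(c / t, 3) for label, (c, t) in per_class.items()}
    correct = sum(c for c, _ in per_class.values())
    total = sum(t for _, t in per_class.values())
    summary["total"] = round(correct / total, 3)
    return summary


pte1 = summarize(results["PTE-1"])
pte2 = summarize(results["PTE-2"])
```

The overall figure is computed from the pooled counts, not as the mean of the two per-class accuracies, so it weights each class by its size.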
25Results - Discussion
- Unsupervised SUBDUE successful in discovering lead indicators of carcinogenic activity
- Supervised SUBDUE also successful in discovering lead indicators of carcinogenic activity
- ILP System PROGOL: PTE-1 (0.72), PTE-2 (0.62)
- Ashby, TOPKAT are other toxicity prediction methods
26Conclusions
- Results are consistent with those obtained by logic-based systems like PROGOL
- The Concept Learner is preferable when positive and negative examples of the target concept are available
- SUBDUE is capable of discovering lead indicators of carcinogenic/noncarcinogenic activity in the chemical toxicity domain
27Future Work
- PTE-3 Evaluation Challenge
- Trimmed Data Sets (Partial Charge)
- Newer Version of Concept Learning SUBDUE being developed
28Reference
http://cygnus.uta.edu/subdue