Data Mining with Neural Networks

Slides: 38
Provided by: markemb
1
(No Transcript)
2
Data Mining with Neural Networks
  • Standard data mining terminology
  • Preprocessing data
  • Running neural networks via Analyze/StripMiner
  • Cherkassky's nonlinear regression problem
  • Magnetocardiogram data
  • CBA (chemical and biological agents) Data
  • Drug design with neural networks
  • The paradox of learning
  • Principal Component Analysis (PCA)
  • The Kernel Transformation and SVMs (Support
    Vector Machines)
  • Structural and empirical risk minimization
    (Vapnik's theory of statistical learning)

3
Standard Data Mining Terminology
  • Basic Terminology
  • - MetaNeural Format
  • - Descriptors, features, response (or activity)
    and ID
  • - Classification versus regression
  • - Modeling/Feature detection
  • - Training/Validation/Calibration
  • - Vertical and horizontal view of data
  • Outliers, rare events and minority classes
  • Data Preparation
  • - Data cleansing
  • - Scaling
  • Leave-one-out and leave-several-out validation
  • Confusion matrix and ROC curves

4
Installing Basic Version of Analyze
  • Put analyze, gnuplot, wgnuplt.hlp and
    wgnuplot.mnu in the working folder
  • gnuplot scripts for plotting are
  • - analyze resultss.ttt 3305 for scatterplot
  • - analyze resultss.ttt 3313 for errorplot
  • - analyze resultss.ttt 3362 for binary
    classification
  • Fancier graphics are in the .jar files
    (these need the Java runtime environment)
  • For basic help you can try
  • - analyze > readme.txt
  • - analyze help 998
  • - analyze help 997
  • - analyze help 008
  • For beginners (unless the Java runtime
    environment is installed), I recommend
    displaying results via gnuplot operators
    3305, -3313 and 3362
  • To familiarize yourself with Analyze, study
    the script files from this handout
  • Don't forget to scale the data

5
Running neural networks in Analyze/StripMiner
  • Prepare a.pat and a.tes files for training and
    testing (or whatever you want to name them)
  • Make sure the data are in MetaNeural format and
    properly scaled
  • (scaling: analyze a.txt 8)
  • (splitting: analyze a.txt.txt 20; seed 0
    keeps the order)
  • (copy cmatrix.txt a.pat and copy dmatrix.txt
    a.tes)
  • Run the neural network: analyze a.pat 4331
  • To override parameter settings: copy a meta,
    edit meta and run again
  • Results are in resultss.xxx and resultss.ttt for
    training and testing respectively
  • Either descale (option 4) and inspect
    results.xxx and results.ttt
  • (analyze resultss.xxx 4, analyze resultss.ttt
    4)
  • Or visualize via analyze resultss.ttt 3305 (and
    3313, and 3362)

6
A Vertical and a Horizontal View of the Data Matrix
  • Vertical view: feature space
  • Horizontal view: data space

7
Preprocessing: Basic scaling for neural networks
  • Mahalanobis scale the descriptors
  • 0-1 scale the response
  • Use operator 8 in the Analyze code
  • e.g., typing analyze a.pat 8 will give
    scaled results in a.pat.txt
  • Note: another handy operator is the splitting
    operator (20)
  • e.g., typing <analyze a.pat.txt 20>
  • will split the file into cmatrix.txt and
    dmatrix.txt
  • using 0 as the random number seed puts the
    first data in cmatrix.txt
  • using a different seed scrambles up the
    data
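
The scaling and splitting steps above can be sketched in plain Python. This is only an illustration, not the Analyze implementation: it assumes that "Mahalanobis scaling" means centering each descriptor and dividing by its sample standard deviation, and the function names and the `n_first` parameter are hypothetical.

```python
import random
import statistics


def mahalanobis_scale(column):
    """Center a descriptor column and divide by its sample standard deviation."""
    mean = statistics.fmean(column)
    std = statistics.stdev(column)
    if std == 0:
        return [0.0 for _ in column]  # a constant column carries no information
    return [(v - mean) / std for v in column]


def unit_scale(response):
    """Map the response linearly onto the interval [0, 1]."""
    lo, hi = min(response), max(response)
    if hi == lo:
        return [0.0 for _ in response]
    return [(v - lo) / (hi - lo) for v in response]


def split_rows(rows, n_first, seed=0):
    """Split rows into two lists; seed 0 keeps the original order (assumed convention)."""
    rows = list(rows)
    if seed != 0:
        random.Random(seed).shuffle(rows)  # any other seed scrambles the rows
    return rows[:n_first], rows[n_first:]


descriptor = [2.0, 4.0, 6.0, 8.0]
response = [10.0, 20.0, 15.0, 30.0]
scaled_descriptor = mahalanobis_scale(descriptor)
scaled_response = unit_scale(response)
first, rest = split_rows(range(10), 7, seed=0)
```

With seed 0 the first seven rows land in `first`, mirroring the behavior the slide describes for cmatrix.txt.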

8
Cherkassky's Nonlinear Benchmark Data
  • Generate 500 data points (400 training, 100 testing)
  • Impossible data for linear models

[Figure panels: K-PLS, PLS]
Note: eta 0.01, train to 0.02 error
9
Iris Data
  • For homework
  • copy a meta
  • Edit meta for different experiments
  • summarize and report on experiments

10
Classical Regression Analysis
[Equation residue: pseudo-inverse solution, with labels A and c]
11
LS-SVM
  • Adding the ridge makes the matrix positive
    definite
  • The ridge also performs regularization!
  • The problem is now equivalent to minimizing the
    following
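
A sketch of the ridge-regularized least-squares problem these bullets appear to describe, in assumed notation (data matrix $X$, response $\mathbf{y}$, weights $\mathbf{w}$, ridge parameter $\lambda > 0$); the slide's own symbols may differ:

```latex
\min_{\mathbf{w}} \; \lVert X\mathbf{w} - \mathbf{y} \rVert^{2} + \lambda \lVert \mathbf{w} \rVert^{2}
\qquad\Longrightarrow\qquad
\mathbf{w} = \left( X^{T}X + \lambda I \right)^{-1} X^{T}\mathbf{y}
```

For any $\lambda > 0$ the matrix $X^{T}X + \lambda I$ is positive definite, so the inverse always exists, and larger $\lambda$ shrinks the weights toward zero.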

Heuristic formula for lambda
12
Local Learning in Kernel Space
13
Local Learning in Kernel Space
[Figure: network diagram in which inputs x1, ..., xi, ..., xM feed a layer of nodes labeled S]
  • This layer gives a similarity score with each
    data point
  • Kind of a nearest-neighbor weighted prediction
    score
  • Weights correspond to the dependent variable for
    the entire training data
  • Make up kernels
14
(No Transcript)
15
What Does LS-SVM Do?
  • K-PLS is like a linear method in nonlinear
    kernel space
  • Kernel space is the latent space of support
    vector machines (SVMs)
  • How to make LS-SVM work?
  • - Select kernel transformation (usually a
    Gaussian kernel)
  • - Select regularization parameter
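
The two choices above (kernel and regularization parameter) can be illustrated with a minimal sketch. It makes several assumptions: the bias term of LS-SVM is omitted, which strictly makes this kernel ridge regression; the names `fit`, `predict` and `solve` are hypothetical; and the values of `sigma` and `lam` are arbitrary.

```python
import math


def gaussian_kernel(a, b, sigma):
    """Gaussian (RBF) similarity between two descriptor vectors."""
    sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def solve(matrix, rhs):
    """Solve a square linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def fit(train_x, train_y, sigma, lam):
    """Solve (K + lam * I) alpha = y for the kernel weights alpha."""
    n = len(train_x)
    k = [[gaussian_kernel(train_x[i], train_x[j], sigma) for j in range(n)]
         for i in range(n)]
    for i in range(n):
        k[i][i] += lam  # the ridge on the diagonal
    return solve(k, train_y)


def predict(train_x, alpha, sigma, x):
    """Prediction is a kernel-weighted sum over all training points."""
    return sum(a * gaussian_kernel(t, x, sigma) for a, t in zip(alpha, train_x))


# Toy nonlinear target y = x^2 on a small grid (illustrative data only).
train_x = [[v / 4.0] for v in range(-8, 9)]
train_y = [p[0] ** 2 for p in train_x]
alpha = fit(train_x, train_y, sigma=1.0, lam=1e-3)
estimate = predict(train_x, alpha, 1.0, [1.1])
```

A larger `lam` gives a smoother, more heavily regularized fit; a smaller `sigma` makes each training point influence only its close neighbors.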

16
What is in a Kernel?
  • A kernel can be considered as a (nonlinear)
    data transformation
  • - Many different choices for the kernel are
    possible
  • - Most popular is the Radial Basis Function
    or Gaussian kernel
  • The Gaussian kernel is a symmetric matrix
  • - Entries reflect nonlinear similarities
    amongst data descriptions
  • - As defined by
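
For reference, the usual form of the Gaussian kernel, assuming data descriptions $\mathbf{x}_i$, $\mathbf{x}_j$ and a kernel width $\sigma$ (the slide's own formula may use a different parameterization):

```latex
k_{ij} = \exp\!\left( -\frac{\lVert \mathbf{x}_i - \mathbf{x}_j \rVert^{2}}{2\sigma^{2}} \right)
```

Under this form $k_{ij} = k_{ji}$, so the kernel matrix is symmetric, and $k_{ii} = 1$ on the diagonal.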

17
(No Transcript)
18
(No Transcript)
19
(No Transcript)
20
(No Transcript)
21
(No Transcript)
22
Data Visualization with Cardiomag Program
  • Command: cardiomag patients.txt 402
  • Files in the diagram: patients.txt, pat1.txt.txt,
    pat2.txt.txt, vis.txt, vis.txt.txt, pat_ID.jpg,
    wave_val.cat, pat_view.jar
  • Data visualization mode requires the Java runtime
    environment
  • Panels: raw data and wavelet-transformed data
23
(No Transcript)
24
(No Transcript)
25
(No Transcript)
26
Worth its Weight in Gold?
27
Data Mining Applications In DDASSL
  • QSAR drug design
  • Microarrays
  • Breast Cancer Diagnosis (TransScan)

DDASSL
Drug Design and Semi-Supervised Learning
28
66 molecules, 2 classes, 469 descriptors
29
Electron Density-Derived TAE-Wavelet Descriptors
  • 1) Surface properties are encoded on the 0.002
    e/au^3 surface
  • Breneman, C.M. and Rhem, M. (1997) J.
    Comp. Chem., Vol. 18 (2), pp. 182-197
  • 2) Histograms or wavelet encodings of surface
    properties give TAE property descriptors

30
Validation Model: 100x leave-10-out validations
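
A minimal sketch of how 100 repeated leave-10-out splits could be generated. The function name, the random seed and the sample count of 66 are assumptions for illustration only; the slide does not say how the splits were drawn.

```python
import random


def leave_several_out(n_samples, n_out, n_repeats, seed=1):
    """Yield (train_idx, test_idx) pairs, each holding out n_out random samples."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    for _ in range(n_repeats):
        held_out = set(rng.sample(indices, n_out))
        train = [i for i in indices if i not in held_out]
        yield train, sorted(held_out)


# 100 repeats of leave-10-out; 66 samples assumed here for illustration.
splits = list(leave_several_out(n_samples=66, n_out=10, n_repeats=100))
```

A model would be trained on each `train` index list and scored on the matching held-out indices, with the scores pooled over all repeats.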
31
(No Transcript)
32
Data StripMining Approach for Feature Selection
PLS, K-PLS, SVM, ANN
Fuzzy Expert System Rules
GA or Sensitivity Analysis to select descriptors
33
Kernel PLS (K-PLS)
  • Introduced by Rosipal and Trejo (Journal of
    Machine Learning Research, December 2001)
  • K-PLS gives almost identical (but more stable)
    results to SVMs for QSAR data
  • - K-PLS is more transparent
  • - K-PLS allows visualization in SVM space
  • - Computationally efficient, with few
    heuristics
  • - There is no patent on K-PLS
  • Consider K-PLS as a better nonlinear PLS

34
  • Binding affinities to human serum
    albumin (HSA): log Khsa
  • Gonzalo Colmenarejo, GlaxoSmithKline,
    J. Med. Chem. 2001, 44, 4370-4378
  • 95 molecules, 250-1500 descriptors
  • 84 training, 10 testing (1 left out)
  • 551 Wavelet PEST MOE descriptors
  • Widely different compounds
  • Acknowledgements: Sean Eakins (Concurrent),
    N. Sukumar (Rensselaer)

35
WORK IN PROGRESS
GATCAATGAGGTGGACACCAGAGGCGGGGACTTGTAAATAACACTGGGCT
GTAGGAGTGA TGGGGTTCACCTCTAATTCTAAGATGGCTAGATAATGCA
TCTTTCAGGGTTGTGCTTCTA TCTAGAAGGTAGAGCTGTGGTCGTTCAA
TAAAAGTCCTCAAGAGGTTGGTTAATACGCAT GTTTAATAGTACAGTAT
GGTGACTATAGTCAACAATAATTTATTGTACATTTTTAAATAG CTAGAA
GAAAAGCATTGGGAAGTTTCCAACATGAAGAAAAGATAAATGGTCAAGGG
AATG GATATCCTAATTACCCTGATTTGATCATTATGCATTATATACATG
AATCAAAATATCACA CATACCTTCAAACTATGTACAAATATTATATACC
AATAAAAAATCATCATCATCATCTCC ATCATCACCACCCTCCTCCTCAT
CACCACCAGCATCACCACCATCATCACCACCACCATC ATCACCACCACC
ACTGCCATCATCATCACCACCACTGTGCCATCATCATCACCACCACTG T
CATTATCACCACCACCATCATCACCAACACCACTGCCATCGTCATCACCA
CCACTGTCA TTATCACCACCACCATCACCAACATCACCACCACCATTAT
CACCACCATCAACACCACCA CCCCCATCATCATCATCACTACTACCATC
ATTACCAGCACCACCACCACTATCACCACCA CCACCACAATCACCATCA
CCACTATCATCAACATCATCACTACCACCATCACCAACACCA CCATCAT
TATCACCACCACCACCATCACCAACATCACCACCATCATCATCACCACCA
TCA CCAAGACCATCATCATCACCATCACCACCAACATCACCACCATCAC
CAACACCACCATCA CCACCACCACCACCATCATCACCACCACCACCATC
ATCATCACCACCACCGCCATCATCA TCGCCACCACCATGACCACCACCA
TCACAACCATCACCACCATCACAACCACCATCATCA CTATCGCTATCAC
CACCATCACCATTACCACCACCATTACTACAACCATGACCATCACCA CC
ATCACCACCACCATCACAACGATCACCATCACAGCCACCATCATCACCAC
CACCACCA CCACCATCACCATCAAACCATCGGCATTATTATTTTTTTAG
AATTTTGTTGGGATTCAGT ATCTGCCAAGATACCCATTCTTAAAACATG
AAAAAGCAGCTGACCCTCCTGTGGCCCCCT TTTTGGGCAGTCATTGCAG
GACCTCATCCCCAAGCAGCAGCTCTGGTGGCATACAGGCAA CCCACCAC
CAAGGTAGAGGGTAATTGAGCAGAAAAGCCACTTCCTCCAGCAGTTCCCT
GT
36
APPENDIX: Downloading and Installing JAVA and
the JAVA Runtime Environment
  • To be able to make JAVA plots, the
    installation of the JRE (the JAVA Runtime
    Environment) is required.
  • The current version is the JAVA 2 Standard
    Edition Runtime Environment 1.4.
  • This provides complete runtime support for
    JAVA 2 applications.
  • In order to build a JAVA application you must
    download the SDK.
  • The JAVA 2 SDK is a development environment
    for building applications, applets, and
    components using the JAVA programming language.
  • The current version of the JRE or JDK for a
    specific platform can be downloaded from the
    following site:
  • http://java.sun.com/j2se/1.4/download.html
  • Make sure you set a path to the bin folder in
    the autoexec.bat file (or the equivalent
    for Windows NT/XP or LINUX/UNIX).

37
Performance Indicators
  • The RPI definitions include r2 and R2 for the
    training set and q2 and Q2 for the test set.
  • r2 is the squared correlation coefficient, and
    q2 is 1 minus the squared correlation
    coefficient for the test set.
  • R2 is defined as
  • Q2 is defined as R2 for the test set

Note iv) In bootstrap mode, q2 and Q2
are usually very close to each other;
significant differences between q2 and Q2
often indicate an improper choice
for the kernel width, or an error in data
scaling/pre-processing
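
A small sketch of why a correlation-based and an R2-based indicator can disagree. It assumes that R2 is the standard coefficient of determination (1 - SSE/SST), since the slide's own formula is not available; the function names and the toy data are hypothetical.

```python
def squared_correlation(actual, predicted):
    """Square of the Pearson correlation between actual and predicted values."""
    n = len(actual)
    ma, mp = sum(actual) / n, sum(predicted) / n
    cov = sum((a - ma) * (p - mp) for a, p in zip(actual, predicted))
    va = sum((a - ma) ** 2 for a in actual)
    vp = sum((p - mp) ** 2 for p in predicted)
    return cov * cov / (va * vp)


def coefficient_of_determination(actual, predicted):
    """1 - SSE/SST, the assumed form of R2."""
    mean = sum(actual) / len(actual)
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    sst = sum((a - mean) ** 2 for a in actual)
    return 1.0 - sse / sst


test_actual = [1.0, 2.0, 3.0, 4.0]
test_predicted = [1.5, 2.5, 3.5, 4.5]  # perfectly correlated, but offset by +0.5

q2 = 1.0 - squared_correlation(test_actual, test_predicted)
r2_based = coefficient_of_determination(test_actual, test_predicted)
```

Here the correlation-based indicator reports a perfect fit, while the R2-based value is penalized by the constant offset, which is the kind of gap a scaling or pre-processing error can produce.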