Title: Data Mining with Neural Networks
2 Data Mining with Neural Networks
- Standard data mining terminology
- Preprocessing data
- Running neural networks via Analyze/StripMiner
- Cherkassky's nonlinear regression problem
- Magnetocardiogram data
- CBA (chemical and biological agents) data
- Drug design with neural networks
- The paradox of learning
- Principal Component Analysis (PCA)
- The kernel transformation and SVMs (Support Vector Machines)
- Structural and empirical risk minimization (Vapnik's theory of statistical learning)
3 Standard Data Mining Terminology
- Basic terminology
- - MetaNeural format
- - Descriptors, features, response (or activity) and ID
- - Classification versus regression
- - Modeling/feature detection
- - Training/validation/calibration
- - Vertical and horizontal view of data
- Outliers, rare events and minority classes
- Data preparation
- - Data cleansing
- - Scaling
- Leave-one-out and leave-several-out validation
- Confusion matrix and ROC curves
4 Installing Basic Version of Analyze
- Put analyze and gnuplot and wgnuplt.hlp and wgnuplot.mnu in the working folder
- gnuplot scripts for plotting are
- - analyze resultss.ttt 3305 for scatterplot
- - analyze resultss.ttt 3313 for errorplot
- - analyze resultss.ttt 3362 for binary classification
- Fancier graphics are in the .jar files (needs the Java runtime environment)
- For basic help you can try
- - analyze > readme.txt
- - analyze help 998
- - analyze help 997
- - analyze help 008
- For beginners (unless the Java runtime environment is installed), I recommend displaying results via gnuplot operators 3305, 3313 and 3362
- To familiarize yourself with Analyze, study the script files from this handout
- Don't forget to scale data
5 Running Neural Networks in Analyze/StripMiner
- Prepare a.pat and a.tes files for training and testing (or whatever you want to name them)
- Make sure data are in MetaNeural format and properly scaled
- (scaling: analyze a.txt 8)
- (splitting: analyze a.txt.txt 20; seed 0 keeps order)
- (copy cmatrix.txt a.pat and copy dmatrix.txt a.tes)
- Run neural network: analyze a.pat 4331
- copy a meta, edit meta and run again for overriding parameter settings
- Results are in resultss.xxx and resultss.ttt for training and testing respectively
- Either descale (option 4) and inspect results.xxx and results.ttt
- (analyze resultss.xxx 4; analyze resultss.ttt 4)
- Or visualize via analyze resultss.ttt 3305 (and 3313, and 3362)
6 A Vertical and a Horizontal View of the Data Matrix
- Vertical view: feature space
- Horizontal view: data space
7 Preprocessing: Basic Scaling for Neural Networks
- Mahalanobis scale descriptors
- 0-1 scale response
- Use operator 8 in Analyze code
- - e.g., typing analyze a.pat 8 will give scaled results in a.pat.txt
- Note: another handy operator is the splitting operator (20)
- - e.g., typing <analyze a.pat.txt 20>
- - will split the file into cmatrix.txt and dmatrix.txt
- - using 0 as random number seed puts the first data in cmatrix.txt
- - using a different seed scrambles up the data
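The scaling and splitting steps on this slide can be sketched in plain Python. This is only an illustration, not the Analyze code: it assumes that "Mahalanobis scaling" means centering each descriptor column and dividing by its standard deviation, and the function names and the `n_test` parameter are hypothetical.

```python
import random
import statistics

def mahalanobis_scale(columns):
    """Center each descriptor column and divide by its standard deviation.

    Assumption: this per-column standardization is what the slide calls
    Mahalanobis scaling.
    """
    scaled = []
    for col in columns:
        mean = statistics.mean(col)
        std = statistics.stdev(col)  # sample standard deviation
        scaled.append([(v - mean) / std if std > 0 else 0.0 for v in col])
    return scaled

def zero_one_scale(response):
    """Map the response linearly onto the interval [0, 1]."""
    lo, hi = min(response), max(response)
    span = hi - lo
    return [(v - lo) / span if span > 0 else 0.0 for v in response]

def split_rows(rows, n_test, seed=0):
    """Split rows into (train, test).

    Mirrors the slide's description: seed 0 keeps the original order, so the
    first rows end up in the training part; any other seed shuffles first.
    """
    rows = list(rows)
    if seed != 0:
        random.Random(seed).shuffle(rows)
    cut = len(rows) - n_test
    return rows[:cut], rows[cut:]
```

For example, `split_rows(range(5), 2, seed=0)` keeps rows 0 to 2 for training and rows 3 and 4 for testing, while a nonzero seed scrambles the rows before the cut.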
8 Cherkassky's Nonlinear Benchmark Data
- Generate 500 data (400 training, 100 testing)
- Impossible data for linear models
[Figure: result panels labeled K-PLS and PLS]
Note: eta = 0.01; train to 0.02 error
9 Iris Data
- For homework
- copy a meta
- Edit meta for different experiments
- summarize and report on experiments
10 Classical Regression Analysis
[Equation slide; surviving labels: A, Pseudo inverse, c]
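The equations of this slide are not in the transcript. As a reminder of what the "pseudo inverse" label usually refers to, here is the standard least-squares solution, assuming a data matrix $X$ (one row per sample), a response vector $\mathbf{y}$ and a weight vector $\mathbf{w}$. These symbol names are assumptions and may differ from the slide's own notation.

```latex
\mathbf{y} \approx X\mathbf{w},
\qquad
\hat{\mathbf{w}} = \left(X^{T}X\right)^{-1}X^{T}\mathbf{y} = X^{+}\mathbf{y}
```

Here $X^{+} = (X^{T}X)^{-1}X^{T}$ is the (left) pseudo-inverse, which exists in this form only when $X^{T}X$ is invertible.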
11 LS-SVM
- Adding the ridge makes the matrix positive definite
- The ridge also performs regularization!
- The problem is now equivalent to minimizing the following (formula not transcribed)
- Heuristic formula for lambda (formula not transcribed)
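The minimization problem itself did not survive in the transcript. A plausible reading, assuming the standard ridge regression form with data matrix $X$, response $\mathbf{y}$, weights $\mathbf{w}$ and ridge parameter $\lambda > 0$ (all notation assumed), is:

```latex
\hat{\mathbf{w}}
  = \arg\min_{\mathbf{w}} \; \lVert X\mathbf{w} - \mathbf{y} \rVert^{2}
    + \lambda \lVert \mathbf{w} \rVert^{2}
\quad\Longleftrightarrow\quad
\hat{\mathbf{w}} = \left(X^{T}X + \lambda I\right)^{-1}X^{T}\mathbf{y}
```

Because $X^{T}X$ is positive semidefinite, adding $\lambda I$ with $\lambda > 0$ makes the matrix positive definite and hence invertible, which is the first bullet's point. The penalty term $\lambda \lVert \mathbf{w} \rVert^{2}$ is the regularization mentioned in the second bullet.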
12 Local Learning in Kernel Space
13 Local Learning in Kernel Space
[Network diagram: inputs x1 ... xi ... xM feed a layer of summation (Σ) nodes, followed by an output summation node]
- This layer gives a similarity score with each datapoint
- Kind of a nearest neighbor weighted prediction score
- Weights correspond to the dependent variable for the entire training data
- Make up kernels
15 What Does LS-SVM Do?
- K-PLS is like a linear method in nonlinear kernel space
- Kernel space is the latent space of support vector machines (SVMs)
- How to make LS-SVM work?
- - Select kernel transformation (e.g., usually a Gaussian kernel)
- - Select regularization parameter
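The two choices listed above (kernel and regularization parameter) can be made concrete with a small sketch. It assumes the bias-free kernel ridge form, in which the weights solve (K + lambda I) w = y and a prediction is a weighted sum of similarities to the training points. The function names and the parameters `sigma` and `lam` are hypothetical, and this is not the Analyze implementation.

```python
import math

def gaussian_kernel(a, b, sigma):
    """Gaussian (RBF) similarity between two descriptor vectors."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x

def fit(train_x, train_y, sigma, lam):
    """Return weights w that solve (K + lam * I) w = y on the training data."""
    n = len(train_x)
    k = [[gaussian_kernel(train_x[i], train_x[j], sigma) for j in range(n)]
         for i in range(n)]
    for i in range(n):
        k[i][i] += lam  # the ridge on the diagonal
    return solve(k, train_y)

def predict(x, train_x, weights, sigma):
    """Weighted sum of similarities between x and every training point."""
    return sum(w * gaussian_kernel(x, xi, sigma)
               for w, xi in zip(weights, train_x))
```

With a very small `lam` the model nearly interpolates the training responses; a large `lam` shrinks all predictions toward zero, which is the regularizing effect of the ridge.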
16 What is in a Kernel?
- A kernel can be considered as a (nonlinear) data transformation
- - Many different choices for the kernel are possible
- - Most popular is the Radial Basis Function or Gaussian kernel
- The Gaussian kernel is a symmetric matrix
- - Entries reflect nonlinear similarities amongst data descriptions
- - As defined by the kernel formula (not transcribed)
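The defining formula is missing from the transcript. A common form of the Gaussian kernel, assuming a single width parameter $\sigma$ and data descriptions $\mathbf{x}_i$ (both notations are assumptions), is:

```latex
K_{ij} = \exp\!\left( -\frac{\lVert \mathbf{x}_i - \mathbf{x}_j \rVert^{2}}{2\sigma^{2}} \right)
```

With this form $K_{ij} = K_{ji}$, so the matrix is symmetric, every diagonal entry is 1, and entries decay toward 0 as two data descriptions move apart.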
22 Data Visualization with Cardiomag Program
[Data-flow diagram for the command: cardiomag patients.txt 402]
- Files shown: patients.txt, pat1.txt.txt, pat2.txt.txt, vis.txt, vis.txt.txt, pat_ID.jpg, wave_val.cat, pat_view.jar
- Data visualization mode (requires the Java runtime environment)
- Panels: raw data, wavelet transformed data
26 Worth its Weight in Gold?
27 Data Mining Applications in DDASSL
- QSAR drug design
- Microarrays
- Breast cancer diagnosis (TransScan)
DDASSL: Drug Design and Semi-Supervised Learning
28 66 Molecules, 2 Classes, 469 Descriptors
29 Electron Density-Derived TAE-Wavelet Descriptors
- 1) Surface properties are encoded on the 0.002 e/au^3 surface
- - Breneman, C.M. and Rhem, M. 1997, J. Comp. Chem., Vol. 18 (2), p. 182-197
- 2) Histograms or wavelet encodings of surface properties give TAE property descriptors
30 Validation Model: 100x leave-10-out validations
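A repeated leave-several-out scheme of this kind can be sketched as follows. The sketch assumes that each of the 100 repetitions holds out 10 randomly chosen samples independently of the other repetitions. The function name, its parameters and the seed are hypothetical, and the actual Analyze procedure may differ.

```python
import random

def leave_several_out_splits(n_samples, n_out=10, n_repeats=100, seed=1):
    """Yield (train_idx, test_idx) pairs, each holding out n_out random samples."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    for _ in range(n_repeats):
        test_idx = sorted(rng.sample(indices, n_out))
        held_out = set(test_idx)
        train_idx = [i for i in indices if i not in held_out]
        yield train_idx, test_idx
```

Each pair would be used to train a model on `train_idx` and predict the held-out `test_idx`; pooling the held-out predictions over all repetitions gives the validation statistics.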
32 Data StripMining Approach for Feature Selection
PLS, K-PLS, SVM, ANN
Fuzzy Expert System Rules
GA or Sensitivity Analysis to select descriptors
33 Kernel PLS (K-PLS)
- Introduced by Rosipal and Trejo (J. Machine Learning Research, December 2001)
- K-PLS gives almost identical (but more stable) results to SVMs for QSAR data
- - K-PLS is more transparent
- - K-PLS allows visualization in SVM space
- - Computationally efficient and few heuristics
- - There is no patent on K-PLS
- Consider K-PLS as a better nonlinear PLS
34 Binding Affinities to Human Serum Albumin (HSA): log Khsa
- Gonzalo Colmenarejo, GlaxoSmithKline
- J. Med. Chem. 2001, 44, 4370-4378
- 95 molecules, 250-1500 descriptors
- 84 training, 10 testing (1 left out)
- 551 Wavelet PEST MOE descriptors
- Widely different compounds
- Acknowledgements: Sean Ekins (Concurrent), N. Sukumar (Rensselaer)
35 WORK IN PROGRESS
GATCAATGAGGTGGACACCAGAGGCGGGGACTTGTAAATAACACTGGGCT
GTAGGAGTGA TGGGGTTCACCTCTAATTCTAAGATGGCTAGATAATGCA
TCTTTCAGGGTTGTGCTTCTA TCTAGAAGGTAGAGCTGTGGTCGTTCAA
TAAAAGTCCTCAAGAGGTTGGTTAATACGCAT GTTTAATAGTACAGTAT
GGTGACTATAGTCAACAATAATTTATTGTACATTTTTAAATAG CTAGAA
GAAAAGCATTGGGAAGTTTCCAACATGAAGAAAAGATAAATGGTCAAGGG
AATG GATATCCTAATTACCCTGATTTGATCATTATGCATTATATACATG
AATCAAAATATCACA CATACCTTCAAACTATGTACAAATATTATATACC
AATAAAAAATCATCATCATCATCTCC ATCATCACCACCCTCCTCCTCAT
CACCACCAGCATCACCACCATCATCACCACCACCATC ATCACCACCACC
ACTGCCATCATCATCACCACCACTGTGCCATCATCATCACCACCACTG T
CATTATCACCACCACCATCATCACCAACACCACTGCCATCGTCATCACCA
CCACTGTCA TTATCACCACCACCATCACCAACATCACCACCACCATTAT
CACCACCATCAACACCACCA CCCCCATCATCATCATCACTACTACCATC
ATTACCAGCACCACCACCACTATCACCACCA CCACCACAATCACCATCA
CCACTATCATCAACATCATCACTACCACCATCACCAACACCA CCATCAT
TATCACCACCACCACCATCACCAACATCACCACCATCATCATCACCACCA
TCA CCAAGACCATCATCATCACCATCACCACCAACATCACCACCATCAC
CAACACCACCATCA CCACCACCACCACCATCATCACCACCACCACCATC
ATCATCACCACCACCGCCATCATCA TCGCCACCACCATGACCACCACCA
TCACAACCATCACCACCATCACAACCACCATCATCA CTATCGCTATCAC
CACCATCACCATTACCACCACCATTACTACAACCATGACCATCACCA CC
ATCACCACCACCATCACAACGATCACCATCACAGCCACCATCATCACCAC
CACCACCA CCACCATCACCATCAAACCATCGGCATTATTATTTTTTTAG
AATTTTGTTGGGATTCAGT ATCTGCCAAGATACCCATTCTTAAAACATG
AAAAAGCAGCTGACCCTCCTGTGGCCCCCT TTTTGGGCAGTCATTGCAG
GACCTCATCCCCAAGCAGCAGCTCTGGTGGCATACAGGCAA CCCACCAC
CAAGGTAGAGGGTAATTGAGCAGAAAAGCCACTTCCTCCAGCAGTTCCCT
GT
36 APPENDIX: Downloading and Installing JAVA and the JAVA Runtime Environment
- To be able to make JAVA plots, the installation of the JRE (the JAVA Runtime Environment) is required.
- The current version is the JAVA 2 Standard Edition Runtime Environment 1.4
- This provides complete runtime support for JAVA 2 applications.
- In order to build a JAVA application you must download the SDK.
- The JAVA 2 SDK is a development environment for building applications, applets, and components using the JAVA programming language.
- The current version of the JRE or JDK for a specific platform can be downloaded from the following site
- - http://java.sun.com/j2se/1.4/download.html
- Make sure you set a path to the bin folder in the autoexec.bat file (or equivalent for Windows NT/XP or LINUX/UNIX).
37 Performance Indicators
- The RPI definitions include r2 and R2 for the training set and q2 and Q2 for the test set. r2 is the (squared) correlation coefficient and q2 is 1 minus the (squared) correlation coefficient for the test set.
- R2 is defined as (formula not transcribed)
- Q2 is defined as R2 for the test set
Note iv): In bootstrap mode q2 and Q2 are usually very close to each other; significant differences between q2 and Q2 often indicate an improper choice for the kernel width, or an error in data scaling/pre-processing.
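These indicators can be sketched in a few lines. The slide's formula for R2 is not in the transcript, so the sketch assumes the common definition of 1 minus the ratio of the residual sum of squares to the total sum of squares. The function names are hypothetical.

```python
def r2_correlation(y, y_hat):
    """Squared Pearson correlation between observed and predicted values."""
    n = len(y)
    mean_y, mean_p = sum(y) / n, sum(y_hat) / n
    cov = sum((a - mean_y) * (b - mean_p) for a, b in zip(y, y_hat))
    var_y = sum((a - mean_y) ** 2 for a in y)
    var_p = sum((b - mean_p) ** 2 for b in y_hat)
    return cov * cov / (var_y * var_p)

def big_r2(y, y_hat):
    """Assumed definition: 1 - (residual sum of squares / total sum of squares)."""
    mean_y = sum(y) / len(y)
    sse = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    sst = sum((a - mean_y) ** 2 for a in y)
    return 1.0 - sse / sst

def q2(y_test, y_hat_test):
    """q2 as stated on the slide: 1 minus the test-set r2."""
    return 1.0 - r2_correlation(y_test, y_hat_test)
```

A prediction that is perfectly correlated with the response but wrongly scaled (for instance, twice the true values) still has r2 equal to 1, while the sum-of-squares based indicator drops sharply. Under the assumed definition, this is the kind of gap the note associates with a scaling or pre-processing error.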