Neural Networks in Bioinformatics - PowerPoint PPT Presentation

Slides: 32

Provided by: pci95

Transcript and Presenter's Notes

Title: Neural Networks in Bioinformatics

1
Neural Networks in Bioinformatics
I-Fang Chung (ifchung_at_ym.edu.tw)
Institute of Bioinformatics, YM
4-27-2006
2
Experience and Education

1989-2000: Electrical and Control Engineering in NCTU
2000-2003: (Postdoc) ECE Laboratory of Intelligent Control
2003-2004: (Postdoc) Laboratory of DNA Information Analysis of Human Genome Center, Institute of Medical Science, Tokyo University
2004-now: Institute of Bioinformatics, Yang-Ming

3
Outline

Motivation
To solve one problem in bioinformatics
Identification of RNA-Interacting Residues in Protein
Current projects

4
Neural Networks

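Neural networks are constructed to resemble the behavior of the human brain (neurons)
They are characterized by the ability to learn, recall, and generalize from training patterns

[Figure: neuron i with inputs x1, x2, ..., xm, weights wi1, wi2, ..., wim, net input neti, activation function a(.), and output yi on the output path]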
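The neuron in the figure can be sketched in a few lines of Python. This is a minimal illustration, assuming the usual weighted-sum form neti = wi1*x1 + ... + wim*xm and yi = a(neti); the slide does not say which activation a(.) is used, so the sigmoid below, the function names, and the numeric values are assumptions chosen for the example.

```python
import math


def sigmoid(z):
    """Logistic activation (an assumed choice for a(.))."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron_output(x, w, activation=sigmoid):
    """Return y_i = a(net_i), where net_i is the weighted sum of the inputs."""
    if len(x) != len(w):
        raise ValueError("x and w must have the same length")
    net = sum(w_j * x_j for w_j, x_j in zip(w, x))
    return activation(net)


# Toy inputs and weights (made up for illustration).
x = [1.0, 0.5, -1.0]
w = [0.2, -0.4, 0.1]
y = neuron_output(x, w)  # net = 0.2 - 0.2 - 0.1 = -0.1, then squashed by the sigmoid
```

Learning then amounts to adjusting the weights wi1 ... wim so that yi matches the desired output on the training patterns.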
5
Neural Networks (cont'd)

Good at tasks such as pattern matching, classification, function approximation, and data clustering
Good at tasks in bioinformatics such as coding region recognition, protein structure prediction, and gene clustering

6
Basic Principles of Discrimination

Each object is associated with a class label (or response) Y ∈ {1, 2, ..., K} and a feature vector (vector of predictor variables) of G measurements X = (X1, ..., XG)
Aim: predict Y from X.

[Figure: objects sorted into predefined classes 1, 2, ..., K. Example object: class label Y = 2, feature vector X = {colour, shape}. Classification rule: given X = {red, square}, Y = ?]
7
Example
Learning set
Predefined classes: clinical outcome
Bad prognosis: recurrence < 5 yrs
Good prognosis: recurrence > 5 yrs
Good prognosis: metastasis > 5
Objects: arrays
Feature vectors: gene expression
Classification rule: new array → ?
Reference: L. van 't Veer et al. (2002) Gene expression profiling predicts clinical outcome of breast cancer. Nature, Jan.
8
Design Issues
[Diagram: design issues for a neural network application; labels as extracted]
Human brain
Domain knowledge, e.g. biology (molecule, chemistry)
Problem definition (desired input/output mapping)
Molecular structure: DNA (ATGCGCTC), protein (MASSTFYI)
Pre-processing
Feature representation (knowledge extraction), input encoding
Neural network
Network architecture, learning algorithm, parameter adjustment
Post-processing
Output encoding
Applications: sequence discrimination, feature detection, classification, structure prediction
Training data sets, testing data sets
System evaluation
9
Prediction of Protein 2nd Structures
Adopted from Qian and Sejnowski, 1988
10
Sliding Window
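[Figure: sliding window moving along the amino acids of Chain_1, Chain_2, Chain_3, with the 2-D info. for each position]

Sliding window concept
Take a segment of the sequence (a window of consecutive residues) as the input
Predict the secondary structure ("2-D info.") only for the residue at the central position of the window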
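The sliding-window idea can be sketched as follows. This is a minimal illustration, not the original implementation: the function name, the pad symbol "-" for positions beyond the chain ends, and the window size 5 are assumptions; the sequence MASSTFYI is the protein example from the Design Issues slide.

```python
def sliding_windows(sequence, w, pad="-"):
    """Yield (center_index, window) for every residue of the sequence.

    w must be odd so that each window has a single central position.
    Positions beyond the chain ends are filled with the pad symbol.
    """
    if w % 2 == 0:
        raise ValueError("window size w must be odd")
    half = w // 2
    padded = pad * half + sequence + pad * half
    for i in range(len(sequence)):
        yield i, padded[i:i + w]


windows = list(sliding_windows("MASSTFYI", 5))
# windows[0] is (0, "--MAS"): the window centered on the first residue, M
# windows[3] is (3, "ASSTF"): the window centered on the fourth residue, S
```

Each window becomes one input pattern, and the label attached to it is the secondary structure of the central residue only.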

11
Binary Bit Encoding Method
Example: 000001000000000000000

Input encoding for each input pattern
Unary encoding scheme for protein sequences
21 binary bits for the 20 amino acid types (1 extra bit for window positions that overlap a chain terminal)
Input layer with multiple input patterns
A window of w consecutive residues is considered
21 × w units for sequence only
Output layer with 3 units
Describes which 2-D info. (secondary structure) occurs: (1, 0, 0) for helix, (0, 1, 0) for sheet, (0, 0, 1) for coil
One hidden layer for non-linear 2-class pattern classification
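A small sketch of the 21-bit unary encoding and the 3-unit output coding described above. The residue ordering, the pad symbol "-", and the function names are assumptions made for illustration; the slide does not state which ordering was used, so the bit position of a given amino acid may differ from the original.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 residue types; this ordering is an assumption
TERMINAL = "-"                         # assumed symbol for window positions past a chain end
ALPHABET = AMINO_ACIDS + TERMINAL      # 21 symbols, so 21 bits per window position

# Output coding from the slide: one unit per secondary-structure class.
TARGETS = {"helix": [1, 0, 0], "sheet": [0, 1, 0], "coil": [0, 0, 1]}


def encode_residue(symbol):
    """Return a 21-bit list with a single 1 at the symbol's index."""
    bits = [0] * len(ALPHABET)
    bits[ALPHABET.index(symbol)] = 1  # raises ValueError for unknown symbols
    return bits


def encode_window(window):
    """Concatenate the 21-bit codes of all w positions (21 * w input units)."""
    encoded = []
    for symbol in window:
        encoded.extend(encode_residue(symbol))
    return encoded


x = encode_window("--MAS")  # w = 5, so the input layer has 21 * 5 = 105 units
```

Under this assumed ordering, the pattern 000001000000000000000 shown on the slide would be the code for G; with a different ordering it would denote a different residue.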
12
More Complex NN Structure PHD
Multiple sequence alignment (MSA) is a way to compare multiple sequences; the result is called an alignment profile.
Breakthrough: use the evolutionary information in an MSA instead of a single sequence
Adopted from Rost and Sander, 1993
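To make the notion of an alignment profile concrete, here is a minimal sketch that turns an MSA into per-column residue frequencies. It is a simplification for illustration only and not the PHD procedure; the toy alignment, the function name, and the choice to ignore gap characters are assumptions.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def alignment_profile(aligned_sequences):
    """Return one {residue: frequency} dict per alignment column (gaps ignored)."""
    length = len(aligned_sequences[0])
    if any(len(s) != length for s in aligned_sequences):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for col in range(length):
        counts = Counter(s[col] for s in aligned_sequences if s[col] in AMINO_ACIDS)
        total = sum(counts.values())
        profile.append(
            {aa: (counts[aa] / total if total else 0.0) for aa in AMINO_ACIDS}
        )
    return profile


# Toy alignment, invented for this example ("-" marks a gap).
toy_msa = ["MASST", "MASAT", "MGS-T", "MASST"]
profile = alignment_profile(toy_msa)
# Column 0 is fully conserved (M); column 1 is A in 3 of 4 sequences and G in 1.
```

Instead of a single 1 per window position, the network input at each position is then a vector of frequencies, which carries the evolutionary information mentioned on the slide.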
13
Outline

Motivation
To solve one problem in bioinformatics
Identification of RNA-Interacting Residues in Protein
Current projects

14
Identification of RNA-Interacting Residues in Protein

Task
Predicting putative RNA-interacting sites within a protein chain
Given a protein sequence → find the RNA-binding positions (residues)
Method
Using a feedforward neural network based on sequence profiles
Analyzing and qualifying a large set of the network weights trained on sequence profiles

15
Data Generation

Source: Protein Data Bank (PDB)
Collect protein-RNA complexes resolved by X-ray with 3.0 Å resolution
Remove redundant protein structures with sequence identity over 70%
86 non-homologous protein chains (21990 residues)
Residues in interaction sites
The closest distance between atoms of the protein and the partner RNA is less than 7 Å
Hydrogen bonds, stacking, electrostatic, hydrophobic, and van der Waals interactions are considered
Residues in interaction sites: 21.7% (4782)
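The distance criterion above can be sketched as a labeling function. This is an illustrative sketch only: the function names and the toy coordinates are made up, and real use would read atom coordinates from PDB files rather than hard-coded tuples. Only the 7 Å cutoff and the residue counts come from the slide.

```python
import math

CUTOFF = 7.0  # angstroms; the threshold stated on the slide


def min_distance(atoms_a, atoms_b):
    """Smallest Euclidean distance between any atom in atoms_a and any in atoms_b."""
    return min(math.dist(a, b) for a in atoms_a for b in atoms_b)


def label_residues(residue_atoms, rna_atoms, cutoff=CUTOFF):
    """Return 1 (interaction site) or 0 for each residue.

    residue_atoms: one list of (x, y, z) atom coordinates per residue.
    rna_atoms: list of (x, y, z) coordinates of the partner RNA atoms.
    """
    return [
        1 if min_distance(atoms, rna_atoms) < cutoff else 0
        for atoms in residue_atoms
    ]


# Toy coordinates, invented for illustration.
rna = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
residues = [
    [(5.0, 0.0, 0.0), (9.0, 0.0, 0.0)],  # closest atom is 4.0 Å from the RNA
    [(10.0, 0.0, 0.0)],                  # 9.0 Å away
    [(0.0, 6.9, 0.0)],                   # 6.9 Å away
]
labels = label_residues(residues, rna)

# Share of interaction-site residues reported on the slide: 4782 of 21990.
fraction = round(100 * 4782 / 21990, 1)
```

Because only about one residue in five is a positive example, the data set is imbalanced, which is one reason total accuracy alone is not a sufficient performance measure.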

16
Classifier
[Figure: classifier applied to Chain_1, Chain_2, Chain_3. Labels: amino acids, 2D info., appearance probability; output: interaction site or not]
17
PSSM

Position-Specific Iterated BLAST (PSI-BLAST)
A strong measure of residue conservation at a given location
Position-specific scoring matrix (PSSM)
For each position, a 20-dimensional vector representing the probabilities of conservation against mutation to the 20 different amino acids, including the residue itself
Positions important to protein function tend to be conserved in the course of evolution
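As a rough illustration of what a 20-dimensional position-specific score vector looks like, the sketch below computes simple log-odds scores for one alignment column. It is not PSI-BLAST's actual scoring procedure: the uniform background frequency, the pseudocount of 1, the function name, and the toy column are all assumptions, and the column is assumed to contain no gaps.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def log_odds_column(column, pseudocount=1.0):
    """Return 20 log2(observed / background) scores for one alignment column.

    column: string of residues seen at this position across aligned sequences.
    Assumes a uniform background (1/20) and adds a pseudocount to every
    residue so that unseen residues get a finite negative score.
    """
    background = 1.0 / len(AMINO_ACIDS)
    total = len(column) + pseudocount * len(AMINO_ACIDS)
    scores = []
    for aa in AMINO_ACIDS:
        freq = (column.count(aa) + pseudocount) / total
        scores.append(math.log2(freq / background))
    return scores


# Toy column: a position that is mostly Arg with one Lys (invented).
scores = log_odds_column("RRRRKRRR")
by_residue = dict(zip(AMINO_ACIDS, scores))
# The conserved residue R scores highest, K is mildly positive, and
# residues never observed at this position score below zero.
```

A strongly conserved position gives one large positive entry and many negative ones, which is the kind of signal the network can exploit.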

18
Experimental Results (cont'd)

Agreement with structural studies of protein-RNA interactions
Arg, Lys, Ser, Thr, Asp and Glu prefer hydrogen bonding
Phe and Ser are frequently found in van der Waals and stacking interactions
Some conflicting situations
Ala, Leu and Val are known to be less preferred types in interactions
Asn is typically thought of as one of the most preferred amino acid types in hydrogen bonding

Adopted from Jeong and Miyano, 2006
19
Saliency Factor

Objective: define a matrix to represent the importance of the presence of specific residues at specific positions
Step 1: normalization of weight xij for each input unit aij

M: the window size, 1 ≤ i ≤ M
N: the # of distinct residue symbols, 1 ≤ j ≤ N
H: the # of hidden units, 1 ≤ k ≤ H
Adopted from Jeong and Miyano, 2006
20
Saliency Factor (cont'd)

Weight conservation: the amount of weight information represented at each position i in the given window, defined as the difference between the maximum entropy and the entropy of the observed weight distribution
Saliency factor of residue j at window position i
New input

M: the window size, 1 ≤ i ≤ M
N: the # of distinct residue symbols, 1 ≤ j ≤ N
H: the # of hidden units, 1 ≤ k ≤ H
Adopted from Jeong and Miyano, 2006
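The weight conservation described above (maximum entropy minus the entropy of the observed weight distribution) can be sketched as below. The slide's own formulas did not survive in this transcript, so this follows only the verbal definition; the log base 2, the function name, and the assumption that the normalized weights at one window position are non-negative are choices made for this example. The saliency factor itself is not reproduced, because its formula is not given in the text.

```python
import math


def weight_conservation(weights):
    """Maximum entropy minus observed entropy, in bits, for one window position.

    weights: N non-negative values, one per residue symbol; they are rescaled
    to sum to 1 before the entropy is computed.
    """
    total = sum(weights)
    if total <= 0 or any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative with a positive sum")
    probs = [w / total for w in weights]
    entropy = -sum(p * math.log2(p) for p in probs if p > 0)
    return math.log2(len(weights)) - entropy


N = 21  # assumed number of residue symbols (20 amino acids plus a terminal symbol)
flat = weight_conservation([1.0] * N)            # uniform weights: no information
peaked = weight_conservation([1.0] + [0.0] * (N - 1))  # all weight on one residue
```

A position whose weights are spread evenly scores near zero, while a position dominated by one residue approaches log2(N) bits, so the measure highlights window positions where specific residues matter.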
21
Notations

Four kinds of measuring parameters are defined
True Positive (TP): the number of accurately predicted interaction sites
True Negative (TN): the number of accurately predicted not-interaction sites
False Positive (FP): the number of inaccurately predicted interaction sites
False Negative (FN): the number of inaccurately predicted not-interaction sites
Examples (1: positive, 0: negative)
0101000010011001111000 ← Observed
1100001110001111110011 ← Predicted

[Figure: positions in the example marked as TN, TP, FP, or FN]
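The four counts can be tallied directly from the observed and predicted strings in the example. A minimal sketch (the function name is hypothetical):

```python
def confusion_counts(observed, predicted):
    """Count TP, TN, FP, FN for two equal-length strings of '0' and '1'."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for o, p in zip(observed, predicted):
        if o == "1" and p == "1":
            counts["TP"] += 1  # interaction site, predicted as such
        elif o == "0" and p == "0":
            counts["TN"] += 1  # not-interaction site, predicted as such
        elif o == "0" and p == "1":
            counts["FP"] += 1  # predicted interaction, actually not
        else:
            counts["FN"] += 1  # predicted not-interaction, actually interaction
    return counts


observed = "0101000010011001111000"
predicted = "1100001110001111110011"
counts = confusion_counts(observed, predicted)
# counts is {"TP": 6, "TN": 6, "FP": 7, "FN": 3} for the 22 positions above
```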
22
Measuring Performance

Total accuracy
Percentage of all correctly predicted interaction and not-interaction sites
Accuracy (Specificity)
The fraction of predicted interaction sites that are correct
Coverage (Sensitivity)
The fraction of true interaction sites that are predicted
Matthews correlation coefficient (MCC)
Takes into account both under- and over-predictions
Ranges between 1 (perfect prediction) and -1 (completely wrong prediction)
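The slide's formulas were images and are not preserved here, so the sketch below uses the standard expressions that match the verbal definitions above: accuracy as TP / (TP + FP) (often called precision elsewhere), coverage as TP / (TP + FN), and the usual MCC formula. The function name and the zero-denominator fallbacks are assumptions; the input counts are those obtained from the example strings on the Notations slide.

```python
import math


def performance(tp, tn, fp, fn):
    """Return total accuracy, accuracy (specificity), coverage (sensitivity), and MCC."""
    total = tp + tn + fp + fn
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "total_accuracy": (tp + tn) / total,
        # fraction of predicted interaction sites that are correct
        "accuracy": tp / (tp + fp) if (tp + fp) else 0.0,
        # fraction of true interaction sites that are predicted
        "coverage": tp / (tp + fn) if (tp + fn) else 0.0,
        # assumed convention: report 0 when the MCC denominator is zero
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


m = performance(tp=6, tn=6, fp=7, fn=3)
# total_accuracy = 12/22, accuracy = 6/13, coverage = 6/9, mcc = 15/117
```

For the example, coverage is fairly high (about 0.67) while the MCC is only about 0.13, which shows how over-prediction is penalized by the correlation coefficient.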

23
Receiver Operating Characteristic (ROC) Curve
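An ROC curve plots the true positive rate against the false positive rate as the decision threshold on the network output is varied. The sketch below is a minimal illustration with invented scores and labels; the function names are hypothetical, and ties between scores are not handled specially.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs.

    The threshold is swept from the highest score to the lowest; each step
    moves one more example to the "predicted positive" side.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative example")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    for _, label in ranked:
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under the piecewise-linear ROC curve."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Toy network outputs and true labels (1 = interaction site), invented.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]
points = roc_points(scores, labels)
auc = area_under_curve(points)  # 8/9 for this toy example
```

A classifier no better than chance gives an area near 0.5, and a perfect ranking gives 1.0.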
24
Experimental Results
Adopted from Jeong and Miyano, 2006
25
Experimental Results (contd)
Adopted from Jeong and Miyano, 2006
26
Experimental Results (contd)
[Figure legend: underpredicted, interaction, overpredicted, not-interaction]
Adopted from Jeong and Miyano, 2006
27
References

E. Jeong, I-F. Chung, and S. Miyano, "Prediction of Residues in Protein-RNA Interaction Sites by Neural Networks," Proc. of the 14th International Conference on Genome Informatics, pp. 506-507, 2003.
E. Jeong, I-F. Chung, and S. Miyano, "A Neural Network Method for Identification of RNA-Interacting Residues in Protein," Proc. of the 4th International Workshop on Bioinformatics and Systems Biology, pp. 105-116, 2004.
E. Jeong and S. Miyano, "A weighted profile based method for protein-RNA interacting residue prediction," Trans. on Comput. Syst. Biol., IV, LNBI 3939, pp. 123-139, 2006.

28
Current Projects

To discover the relationship between protein sequence and protein structure
To identify RNA-interacting residues in proteins
To perform protein metal-binding residue prediction
To predict phosphorylation sites
Microarray data analysis
Significant gene selection, clustering, classification
Prediction of polymorphic short tandem repeats

29
Mini-Workshop: Knowledge Discovery Techniques for Bioinformatics
Dr. Limsoon Wong
30
Hierarchy of Protein Structure
2nd structure prediction
3rd structure prediction
31
Protein Secondary Structures
Anti-parallel beta sheet
Alpha helix
loop
Parallel beta sheet
