Neural Networks in Bioinformatics

1
Neural Networks in Bioinformatics
I-Fang Chung (ifchung_at_ym.edu.tw)
Institute of Bioinformatics, YM
4-27-2006
2
Experience and Education
  • 1989-2000 Electrical and Control Engineering in
    NCTU
  • 2000-2003 (Postdoc) ECE Laboratory of
    Intelligent Control
  • 2003-2004 (Postdoc) Laboratory of DNA Information
    Analysis of Human Genome Center, Institute of
    Medical Science, Tokyo University
  • 2004-now Institute of Bioinformatics, Yang-Ming

3
Outline
  • Motivation
  • To solve one problem in bioinformatics
  • Identification of RNA-Interacting Residues in
    Protein
  • Current projects

4
Neural Networks
  • Neural networks are constructed to resemble the
    behavior of human brains (neurons)
  • Characterized by the ability to learn, recall, and
    generalize from training patterns

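[Figure: neuron i. Inputs x1, x2, ..., xm are weighted by wi1, wi2, ..., wim and combined into neti; the activation function a(.) maps neti to the output yi on the output path.]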
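A single unit of this kind can be sketched in a few lines of Python. The weighted-sum form of neti and the sigmoid used for a(.) are assumptions made for illustration; the slide only names the symbols x, w, net, a(.) and y.

```python
import math


def sigmoid(net):
    """One common choice for the activation a(.); assumed here, not taken from the slide."""
    return 1.0 / (1.0 + math.exp(-net))


def neuron_output(inputs, weights, activation=sigmoid):
    """Compute y_i = a(net_i), with net_i = sum_j w_ij * x_j (no bias term)."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    net = sum(w * x for w, x in zip(weights, inputs))
    return activation(net)


# Hypothetical inputs x1..x3 and weights wi1..wi3.
y = neuron_output([1.0, 0.0, -1.0], [0.5, 0.2, 0.5])
print(y)
```

Learning then amounts to adjusting the weights so that y matches the desired output on the training patterns.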
5
Neural Networks (contd)
  • Good at tasks such as pattern matching,
    classification, function approximation, and data
    clustering
  • Good at tasks in bioinformatics such as coding
    region recognition, protein structure prediction,
    gene clustering

6
Basic Principles of Discrimination
  • Each object is associated with a class label (or
    response) Y ∈ {1, 2, ..., K} and a feature vector
    (vector of predictor variables) of G
    measurements X = (X1, ..., XG)
  • Aim: predict Y from X.

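[Figure: objects belong to predefined classes {1, 2, ..., K}. Each object has a class label Y (e.g. Y = 2) and a feature vector X (e.g. colour, shape). For a new object with X = (red, square), the classification rule must supply Y = ?]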
7
Example
  • Learning set
  • Predefined classes: clinical outcome
  • Bad prognosis: recurrence < 5 yrs
  • Good prognosis: recurrence > 5 yrs
  • Objects: arrays
  • Feature vectors: gene expression
  • Classification rule: assigns a new array to a
    class, e.g. good prognosis (metastasis > 5)

Reference: L. van 't Veer et al. (2002) Gene
expression profiling predicts clinical outcome of
breast cancer. Nature, Jan.
8
Design Issues
  • Applications: molecular structure, sequence
    discrimination, feature detection, classification,
    structure prediction (DNA: ATGCGCTC; protein:
    MASSTFYI)
  • Human brain: domain knowledge, e.g. biology
    (molecule, chemistry)
  • Problem definition (desired input/output mapping)
  • Pre-processing: feature representation (knowledge
    extraction), input encoding
  • Neural network: network architecture, learning
    algorithm, parameter adjustment
  • Post-processing: output encoding
  • Training data sets and testing data sets
  • System evaluation
9
Prediction of Protein 2nd Structures
Adopted from Qian and Sejnowski, 1988
10
Sliding Window
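[Figure: Chain_1, Chain_2 and Chain_3 shown as strings of amino acids, each aligned with its 2-D info.]
  • Sliding window concept
  • A substring (window) of the sequence is taken as
    the input
  • Only the central position of the window is
    considered when deciding which kind of 2-D info.
    (secondary structure) occurs there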
11
Binary Bit Encoding Method
Example code for one residue: 000001000000000000000
  • Input encoding for each input pattern
  • Unary encoding scheme for protein sequence
  • 21 binary bits for 20 kinds of amino acid type (1
    extra bit for window positions that overlap a
    chain terminus)
  • Input layer with multiple input patterns
  • A window of w consecutive residues is considered
  • 21 × w units for sequence only
  • Output layer with 3 units
  • Describes which kind of 2-D info. occurs: (1, 0, 0)
    for helix, (0, 1, 0) for sheet, (0, 0, 1) for coil
  • One hidden layer for non-linear 2-class pattern
    classification
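A minimal sketch of this encoding, assuming an alphabetical ordering of the 20 one-letter amino acid codes and "-" as the terminal symbol (the slide fixes neither choice):

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering of the 20 residue types
SYMBOLS = AMINO_ACIDS + "-"           # 21st symbol: window position beyond a terminus

# Output encoding as listed on the slide.
TARGETS = {"helix": (1, 0, 0), "sheet": (0, 1, 0), "coil": (0, 0, 1)}


def encode_residue(symbol):
    """Unary (one-hot) code: 21 bits with exactly one bit set."""
    bits = [0] * len(SYMBOLS)
    bits[SYMBOLS.index(symbol)] = 1
    return bits


def encode_window(window):
    """Concatenate the codes of all residues in the window: 21 * w input units."""
    encoded = []
    for symbol in window:
        encoded.extend(encode_residue(symbol))
    return encoded


inputs = encode_window("--MAS")  # w = 5
print(len(inputs), sum(inputs))
```

A window of w = 5 residues therefore gives 105 input units, of which exactly 5 are set.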

12
More Complex NN Structure: PHD
  • Multiple sequence alignment (MSA) is a way to
    compare multiple sequences; the result is called
    an alignment profile.
  • Breakthrough: use evolutionary information in the
    MSA instead of a single sequence
Adopted from Rost and Sander, 1993
13
Outline
  • Motivation
  • To solve one problem in bioinformatics
  • Identification of RNA-Interacting Residues in
    Protein
  • Current projects

14
Identification of RNA-Interacting Residues in
Protein
  • Task
  • Predicting putative RNA-interacting sites within
    a protein chain
  • Given a protein sequence → find the
    RNA-binding positions (residues)
  • Method
  • Using feedforward neural network based on
    sequence profiles
  • Analyzing and qualifying a large set of the
    network weights trained on sequence profiles

15
Data Generation
  • Source: Protein Data Bank (PDB)
  • Collect protein-RNA complexes resolved by X-ray
    with 3.0Å resolution
  • Remove redundant protein structures with sequence
    identity over 70%
  • 86 non-homologous protein chains (21990 residues)
  • Residues in interaction sites
  • The closest distance between atoms of the protein
    and the partner RNA is less than 7Å
  • Hydrogen bonds, stacking, electrostatic,
    hydrophobic, and van der Waals interactions
    considered
  • Residues in interaction sites: 21.7% (4782)
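The distance criterion can be sketched as below. The coordinates are made-up toy values, and parsing of PDB files is left out; only the 7Å cutoff comes from the slide.

```python
import math

CUTOFF_ANGSTROM = 7.0  # cutoff stated on the slide


def closest_distance(residue_atoms, rna_atoms):
    """Smallest Euclidean distance between any residue atom and any RNA atom."""
    return min(math.dist(a, b) for a in residue_atoms for b in rna_atoms)


def is_interaction_residue(residue_atoms, rna_atoms, cutoff=CUTOFF_ANGSTROM):
    """A residue counts as an interaction site if its closest RNA atom is within the cutoff."""
    return closest_distance(residue_atoms, rna_atoms) < cutoff


# Hypothetical (x, y, z) coordinates in Å.
rna = [(4.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
near_residue = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
far_residue = [(20.0, 0.0, 0.0)]
print(is_interaction_residue(near_residue, rna))
print(is_interaction_residue(far_residue, rna))
```

Applying such a test to every residue of every chain gives the positive and negative labels; the slide reports 4782 positives out of 21990 residues.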

16
Classifier
[Figure: for each residue of Chain_1, Chain_2 and Chain_3, the classifier takes the amino acids, 2D info. and appearance probability as input and decides: interaction site or not.]
17
PSSM
  • Position-Specific Iterated BLAST (PSI-BLAST)
  • A strong measure of residue conservation at a
    given location
  • Position-specific scoring matrix (PSSM)
  • A 20-dimensional vector representing
    probabilities of conservation against mutations
    to 20 different amino acids including itself
  • Positions important to the function of a protein
    tend to be conserved in the course of evolution

18
Experimental Results (contd)
  • Agreement with structural studies of protein-RNA
    interactions
  • Arg, Lys, Ser, Thr, Asp and Glu prefer to be in
    hydrogen bonding
  • Phe and Ser are frequently found in van der
    Waals and stacking interactions
  • Some conflicting situations
  • Ala, Leu and Val are known to be less preferred
    types in interactions
  • Asn is typically thought of as one of the most
    preferred amino acid types in hydrogen bonding

Adopted from Jeong and Miyano, 2006
19
Saliency Factor
  • Objective: define a matrix to represent the
    importance of the presence of specific residues
    at specific positions
  • Step 1: normalization of weight xij for each
    input unit aij

M: the window size, 1 ≤ i ≤ M
N: the # of distinct residue symbols, 1 ≤ j ≤ N
H: the # of hidden units, 1 ≤ k ≤ H
Adopted from Jeong and Miyano, 2006
20
Saliency Factor (contd)
  • Weight conservation: the amount of weight
    information represented at each position i in
    the given window, defined as the difference
    between the maximum entropy and the entropy of
    the observed weight distribution
  • Saliency factor of residue j at window position
    i
  • New input

M: the window size, 1 ≤ i ≤ M
N: the # of distinct residue symbols, 1 ≤ j ≤ N
H: the # of hidden units, 1 ≤ k ≤ H
Adopted from Jeong and Miyano, 2006
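The formulas on these two slides did not survive in the transcript, so the following is only one possible reading of the text, not the method of Jeong and Miyano. It assumes that the normalized weight of input unit aij is the absolute weight summed over the H hidden units and divided by the total at position i, that the maximum entropy is log2(N), and that the saliency factor is the normalized weight scaled by the weight conservation (by analogy with sequence logos). All three choices are assumptions.

```python
import math


def saliency_factors(weights):
    """weights[i][j][k]: weight from input unit a_ij to hidden unit k (M x N x H).

    Returns an M x N matrix; every position is assumed to have a nonzero weight.
    """
    n_symbols = len(weights[0])
    max_entropy = math.log2(n_symbols)
    saliency = []
    for position in weights:
        # Assumed normalization: share of absolute weight per residue symbol.
        strength = [sum(abs(w) for w in unit) for unit in position]
        total = sum(strength)
        p = [s / total for s in strength]
        # Entropy of the observed weight distribution (0 * log 0 taken as 0).
        entropy = -sum(q * math.log2(q) for q in p if q > 0)
        conservation = max_entropy - entropy
        saliency.append([q * conservation for q in p])
    return saliency


# Toy weights with M = 2 positions, N = 4 symbols, H = 2 hidden units.
toy = [
    [[1, 1], [1, 1], [1, 1], [1, 1]],   # evenly spread weights: no conservation
    [[0, 0], [0, 0], [3, -1], [0, 0]],  # all weight on one symbol: maximal conservation
]
print(saliency_factors(toy))
```

Under this reading, a position whose weights are spread evenly over all residue symbols contributes nothing, while a position dominated by one symbol receives the full log2(N) bits.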
21
Notations
  • Four kinds of measuring parameters are defined
  • True Positive (TP): the number of accurately
    predicted interaction sites
  • True Negative (TN): the number of accurately
    predicted not-interaction sites
  • False Positive (FP): the number of inaccurately
    predicted interaction sites
  • False Negative (FN): the number of inaccurately
    predicted not-interaction sites
  • Example (1 positive, 0 negative)
    Observed:  0101000010011001111000
    Predicted: 1100001110001111110011
    Each position is labelled TN, TP, FP or FN by
    comparing the two strings.
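The four counts for the example strings can be tallied position by position. This is a small sketch; the function name is hypothetical.

```python
from collections import Counter

# (observed, predicted) -> label, with 1 = interaction site, 0 = not-interaction site.
LABELS = {("1", "1"): "TP", ("0", "0"): "TN", ("0", "1"): "FP", ("1", "0"): "FN"}


def confusion_counts(observed, predicted):
    """Count TP, TN, FP and FN over two equal-length strings of 0s and 1s."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    return Counter(LABELS[(o, p)] for o, p in zip(observed, predicted))


counts = confusion_counts("0101000010011001111000", "1100001110001111110011")
print(counts["TP"], counts["TN"], counts["FP"], counts["FN"])
```

For the two 22-residue strings on the slide this gives TP = 6, TN = 6, FP = 7 and FN = 3.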
22
Measuring Performance
  • Total accuracy
  • Percentage of all correctly predicted interaction
    and not-interaction sites
  • Accuracy (Specificity)
  • Measures the fraction of predicted interaction
    sites that are correct
  • Coverage (Sensitivity)
  • Measures the fraction of true interaction sites
    that are predicted
  • Matthews correlation coefficient (MCC)
  • Takes into account both under- and
    over-predictions
  • Ranges between 1 (perfect prediction) and -1
    (completely wrong prediction)
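These measures can be written out from the four counts as a sketch. The formulas follow the verbal definitions above together with the standard MCC formula; note that what the slide calls accuracy (specificity) is TP / (TP + FP), which is more often called precision elsewhere. The example counts are those obtained by comparing the observed and predicted strings on the Notations slide.

```python
import math


def performance(tp, tn, fp, fn):
    """Return the four measures from the slide as a dict of fractions."""
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "total_accuracy": (tp + tn) / (tp + tn + fp + fn),
        "accuracy": tp / (tp + fp),   # fraction of predicted sites that are correct
        "coverage": tp / (tp + fn),   # fraction of true sites that are predicted
        # Convention assumed here: MCC is reported as 0 when the denominator vanishes.
        "mcc": (tp * tn - fp * fn) / mcc_denominator if mcc_denominator else 0.0,
    }


result = performance(tp=6, tn=6, fp=7, fn=3)
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

For these counts the total accuracy is 12/22, the accuracy 6/13, the coverage 6/9 and the MCC 15/117, so a predictor can cover two thirds of the true sites and still correlate only weakly with the observation.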

23
Receiver Operating Characteristic (ROC) Curve
24
Experimental Results
Adopted from Jeong and Miyano, 2006
25
Experimental Results (contd)
Adopted from Jeong and Miyano, 2006
26
Experimental Results (contd)
[Figure legend: underpredicted, interaction, overpredicted, not-interaction]
Adopted from Jeong and Miyano, 2006
27
References
  • E. Jeong, I-F. Chung, and S. Miyano, Prediction
    of Residues in Protein-RNA Interaction Sites by
    Neural Networks, Proc. of the 14th International
    Conference on Genome Informatics, pp. 506-507,
    2003.
  • E. Jeong, I-F. Chung, and S. Miyano, A Neural
    Network Method for Identification of
    RNA-Interacting Residues in Protein, Proc. of
    the 4th International Workshop on Bioinformatics
    and Systems Biology, pp. 105-116, 2004.
  • E. Jeong and S. Miyano, A weighted profile based
    method for protein-RNA interacting residue
    prediction, Trans. on Comput. Syst. Biol., IV,
    LNBI 3939, pp. 123-139, 2006.

28
Current Projects
  • To discover the relationship between protein
    sequence and protein structure
  • To identify RNA-interacting residues in
    protein
  • To perform protein metal binding residue
    prediction
  • To predict the phosphorylation sites
  • Microarray data analysis
  • Significant gene selection, clustering,
    classification
  • Prediction of the polymorphic short tandem
    repeats

29
Mini-Workshop Knowledge Discovery Techniques
for Bioinformatics
Dr. Limsoon Wong
30
Hierarchy of Protein Structure
2nd structure prediction
3rd structure prediction
31
Protein Secondary Structures
Anti-parallel beta sheet
Alpha helix
loop
Parallel beta sheet