Self-organizing Map (SOM) in Protein Folding Based on HP Model


Slides: 51

Provided by: leo88


Transcript and Presenter's Notes


1
Self-organizing Map (SOM) in Protein Folding
Based on HP Model

Xiang-Sun ZHANG
ZHANGroup_at_bioinfoamss.org
http://zhangroup.aporc.org
2003.12.2
2 Dec. 2003 at NCSU

2
Motivation

We are all concerned with what we (OR researchers
and algorithm designers) can do in bioinformatics.
What is the intersection of operations research
and bioinformatics?

3
Abstract

Many problems in bioinformatics can be formulated
as large linear/nonlinear integer programming or
combinatorial problems that are NP-hard and
cannot be solved exactly by existing algorithms,
so efficient approximate methods are needed.
As examples, a heuristic algorithm for SBH and a
new SOM algorithm for solving the protein HP
model are presented.
Other related research in our group is
introduced.

4
Problem areas in Bioinformatics

Human Genome Project
Large molecule data in biology, such as DNA and
protein
Genomics
DNA sequencing
Gene prediction
Sequence alignment
Proteomics (50,000 entries on Google) / Protenomics
(hundreds of entries on Google)
Structure prediction
Protein alignment

Operations Research
Over 8 million entries on Google

6
DNA Sequencing

ACGTGATCGATCGAGTACGAGAGTCTA
_______________________________
ACGTGATCGATCGAGTACGAGAGTCTA
ACGTGATCGATCGAGTACGAGAGTCTA
ACGTGATCGATCGAGTACGAGAGTCTA
ACGTGATCGATCGAGTACGAGAGTCTA

Two pieces of a target sequence with a longer
overlap are preferentially joined together. This
requires that
the average size of the pieces be as long as
possible, and
the number of copies of the target sequence be as
large as possible.

8
A novel DNA sequencing technique, called
Sequencing By Hybridization (SBH), was proposed
as an alternative to traditional sequencing by
gel electrophoresis. SBH is based on the DNA
chip (or DNA array). A DNA chip contains all
probes of length k (a probe is a short
k-nucleotide fragment of DNA, also called a
k-tuple). Given a probe and a target DNA, the
target will bind (hybridize) to the probe if the
target contains a substring that fits the probe.
9
DNA Sequencing

DNA array (DNA chip)
AAATGCG (five 3-tuples, on a chip of 3-tuples)

10
SBH uses the classical probing scheme: by
hybridizing an (unknown) DNA fragment with the
chip, the target DNA is tested and the set of all
k-tuples it contains (called its spectrum) is
determined. SBH tells us which k-tuples are
present in the target DNA, but not where they
occur. The resulting problem is how to
reconstruct the target DNA from this data.
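
As a small illustration of what a spectrum is, the sketch below lists the distinct k-tuples of a sequence. The function name `spectrum` is an assumption made for this example, and the code models an ideal, error-free chip only.

```python
def spectrum(sequence, k):
    """Return the sorted set of distinct k-tuples occurring in `sequence`."""
    if k <= 0 or k > len(sequence):
        return []
    # Slide a window of width k over the sequence; a set drops repeated
    # k-tuples, which is why SBH loses multiplicity and position.
    return sorted({sequence[i:i + k] for i in range(len(sequence) - k + 1)})


if __name__ == "__main__":
    print(spectrum("AAATGCG", 3))
```

For the chip example AAATGCG with k = 3 this gives the five 3-tuples AAA, AAT, ATG, GCG and TGC, with no record of their order in the target.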
11

Because of technological limitations, k cannot
yet be made as large as desired (generally less
than 30, which already requires a big chip).
This can lead to branching in the sequence
reconstruction and to multiple reconstructions.
In addition, two kinds of errors can occur:
negative errors (k-tuples in the sequence that
are not hybridized) and positive errors
(hybridized probes that are not k-tuples in the
sequence). Therefore, for larger DNA fragments,
the problem of sequence reconstruction becomes
rather complicated and hard to analyze.

In the case of error-free SBH and an ideal
spectrum (i.e. one consisting of n-k+1 different
k-tuples, where n is the length of the DNA
fragment), it is known that the SBH
reconstruction problem is equivalent to finding
an Eulerian path in a corresponding graph, and
the algorithm can be implemented in linear time.
Positive and negative errors, and repetitions of
k-tuples in the DNA fragment, make the problem
computationally difficult: it becomes strongly
NP-hard.
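
The error-free case can be sketched in a few lines. The code below is a minimal illustration, assuming an ideal spectrum with no repeated k-tuples: each k-tuple becomes an edge from its (k-1)-prefix to its (k-1)-suffix, and an Eulerian path is traced with Hierholzer's algorithm. The function name `reconstruct_from_spectrum` is hypothetical, and this is a textbook construction rather than the heuristic presented in these slides.

```python
from collections import defaultdict


def reconstruct_from_spectrum(kmers):
    """Rebuild a sequence from an ideal spectrum via an Eulerian path.

    Returns None when no single path uses every k-tuple exactly once.
    """
    if not kmers:
        return ""
    graph = defaultdict(list)
    indeg = defaultdict(int)
    outdeg = defaultdict(int)
    for kmer in kmers:
        u, v = kmer[:-1], kmer[1:]      # edge: prefix -> suffix
        graph[u].append(v)
        outdeg[u] += 1
        indeg[v] += 1
    nodes = set(indeg) | set(outdeg)
    # An Eulerian path starts where out-degree exceeds in-degree by one;
    # if no such node exists (a cycle), any node with an edge will do.
    start = next((n for n in nodes if outdeg[n] - indeg[n] == 1),
                 kmers[0][:-1])
    stack, path = [start], []
    while stack:                         # Hierholzer's algorithm
        u = stack[-1]
        if graph[u]:
            stack.append(graph[u].pop())
        else:
            path.append(stack.pop())
    path.reverse()
    if len(path) != len(kmers) + 1:      # some edge was unreachable
        return None
    return path[0] + "".join(node[-1] for node in path[1:])


if __name__ == "__main__":
    print(reconstruct_from_spectrum(
        ["ATA", "TAC", "ACG", "CGA", "GAA", "AAG", "AGA"]))
```

Every edge is pushed and popped once, which is where the linear running time comes from. With errors or repeated k-tuples the spectrum no longer defines such a path, and this approach does not apply.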

13
Sequencing by Hybridization

DNA fragment: ATACGAAGA
↓
Spectrum
Errors: positive (misread) / negative (missing,
repetition)

Ideal case:  ATA TAC ACG CGA GAA AAG AGA
With errors: ATA TAC AGG CGA GAA AAG AGA
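
The two error types in this example can be separated by comparing the sets. The helper below, with the assumed name `classify_errors`, is only a sketch: in a real experiment the ideal spectrum is unknown, so this comparison is possible only in a constructed example like the one above.

```python
def classify_errors(ideal, observed):
    """Compare an ideal spectrum with an observed one.

    Negative errors are k-tuples of the sequence that failed to hybridize;
    positive errors are hybridized probes that are not in the sequence.
    """
    ideal_set, observed_set = set(ideal), set(observed)
    return {
        "negative": sorted(ideal_set - observed_set),
        "positive": sorted(observed_set - ideal_set),
    }


if __name__ == "__main__":
    ideal = ["ATA", "TAC", "ACG", "CGA", "GAA", "AAG", "AGA"]
    observed = ["ATA", "TAC", "AGG", "CGA", "GAA", "AAG", "AGA"]
    print(classify_errors(ideal, observed))
```

Here ACG is a negative error (missing) and AGG is a positive error (misread).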
14

1989, Pevzner: the SBH reconstruction problem is
equivalent to finding an Eulerian path in a
related graph.
1990, Fleischner: the algorithm can be implemented
in linear time.
1991, Drmanac et al.: an algorithm for SBH with
errors, under the assumption that only the first
or last nucleotide in the data can be erroneous.
1993, Lipshutz: uses empirically derived rates of
positive and negative errors and other
assumptions. No convergence analysis.
1999, Blazewicz et al.: branch and bound method
for the case of only positive errors.
2000, Blazewicz et al.: a heuristic algorithm
producing near-optimal solutions.

15
SBH Reconstruction Problem

Design efficient heuristic algorithms
Ji-Hong Zhang, Ling-Yun Wu and Xiang-Sun Zhang. A
new approach to the reconstruction of DNA
sequencing by hybridization. Bioinformatics, vol
19(1), pages 14-21, 2003.
Xiang-Sun Zhang, Ji-Hong Zhang and Ling-Yun Wu.
Combinatorial optimization problems in the
positional DNA sequencing by hybridization and
its algorithms. System Sciences and Mathematics,
vol 3, 2002. (in Chinese)
Ling-Yun Wu, Ji-Hong Zhang and Xiang-Sun Zhang.
Application of neural networks in the
reconstruction of DNA sequencing by
hybridization. In Proceedings of the 4th ISORA,
2002.

16
Basic Observation

The spectrum corresponds to a graph: each k-tuple
is a vertex and each pair of connected k-tuples
is an edge. The structure of the graph is
represented by the adjacency matrix.
A reconstruction of the spectrum is a path in
the graph. Information about all paths is
contained in the powers of the adjacency matrix.

Criteria are given that use the information in
the powers of the adjacency matrix to determine,
in polynomial time, the most likely k-tuples at
both ends and in the middle of all possible
reconstructions of the target DNA.
A novel technique that transforms negative
errors into positive errors is proposed. It
enables us to handle both types of errors easily.
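
To make the role of matrix powers concrete, the sketch below builds the adjacency matrix of a spectrum graph and raises it to a power. It assumes that two k-tuples are "connected" when the last k-1 letters of one equal the first k-1 letters of the other; that reading, and the function name `walk_counts`, are assumptions for illustration. The specific criteria mentioned on the slide are not reproduced here.

```python
def walk_counts(kmers, length):
    """Return A**length, where A is the adjacency matrix of the spectrum graph.

    Entry [i][j] counts the walks with `length` edges from kmers[i] to
    kmers[j]. Walks may revisit vertices, so they over-count simple paths.
    """
    n = len(kmers)
    # Assumed edge rule: u -> v when the suffix of u overlaps the prefix of v.
    adjacency = [[1 if u[1:] == v[:-1] else 0 for v in kmers] for u in kmers]

    def multiply(a, b):
        return [[sum(a[i][t] * b[t][j] for t in range(n)) for j in range(n)]
                for i in range(n)]

    result = [[1 if i == j else 0 for j in range(n)] for i in range(n)]
    for _ in range(length):
        result = multiply(result, adjacency)
    return result


if __name__ == "__main__":
    spec = ["ATA", "TAC", "ACG", "CGA", "GAA", "AAG", "AGA"]
    counts = walk_counts(spec, 6)
    print(counts[spec.index("ATA")][spec.index("AGA")])
```

For the spectrum of ATACGAAGA, the only 6-edge walk starting at ATA ends at AGA, which is consistent with ATA and AGA being the two ends of the reconstruction.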

18
Protein Structure Prediction

Predict protein 3D structure from the (amino acid)
sequence
Sequence → secondary structure → 3D structure →
function

19
Protein Secondary Structure

α-helix (30-35%)
β-sheet / β-strand (20-25%)
Coil (40-50%)
Loop
β-turn

20
3D Structure of Protein
Turn or coil
Alpha-helix
Beta-sheet
Loop and Turn
21
Protein 3D Structure Detection

X-ray diffraction
Expensive
Slow

22
Protein Structure Prediction

Prediction is possible because
Sequence information uniquely determines 3D
structure
Sequence similarity (> 50%) tends to imply
structural similarity
Prediction is necessary because
DNA sequence data ≫ protein sequence data ≫
structure data

23
Three Methods of Protein Structure Prediction

Goal
Find best fit of sequence to 3D structure
Comparative (homology) modeling
Construct 3D model from alignment to protein
sequences with known structure
Threading (fold recognition)
Pick best fit to sequences of known 2D / 3D
structures (folds)
Ab initio / de novo methods
Attempt to calculate 3D structure from scratch
Molecular dynamics
Energy minimization
Lattice models

24
Lattice Models

Suppose that each amino acid occupies one point
in a space lattice

It is called an Exact Model

25
HP Model (Simple Model)

The twenty amino acids can be divided into two
classes: hydrophobic/non-polar (H) and
hydrophilic/polar (P).
Contacts between H residues are favorable.

Figure legend: hydrophobic amino acid, hydrophilic
amino acid, covalent bond, H-H contact

Goal: maximize the number of H-H contacts
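
The objective can be stated precisely in code. The sketch below counts H-H contacts for a given fold on the 2D square lattice, taking a contact to be a pair of H residues that are lattice neighbors but not consecutive in the chain. The function name `hh_contacts` and the coordinate-list representation are assumptions for this example.

```python
def hh_contacts(sequence, coords):
    """Count H-H contacts of a fold on the 2D square lattice.

    `sequence` is a string over {'H', 'P'}; `coords` gives one (x, y)
    lattice point per residue, in chain order.
    """
    if len(sequence) != len(coords):
        raise ValueError("one lattice point is needed per residue")
    index_at = {point: i for i, point in enumerate(coords)}
    if len(index_at) != len(coords):
        raise ValueError("fold is not self-avoiding")
    for (x1, y1), (x2, y2) in zip(coords, coords[1:]):
        if abs(x1 - x2) + abs(y1 - y2) != 1:
            raise ValueError("consecutive residues must be lattice neighbors")
    contacts = 0
    for i, (x, y) in enumerate(coords):
        if sequence[i] != "H":
            continue
        # Looking only right and up counts each neighboring pair once.
        for neighbor in ((x + 1, y), (x, y + 1)):
            j = index_at.get(neighbor)
            if j is not None and sequence[j] == "H" and abs(i - j) > 1:
                contacts += 1
    return contacts


if __name__ == "__main__":
    square = [(0, 0), (1, 0), (1, 1), (0, 1)]
    print(hh_contacts("HPPH", square))
```

Folding HPPH into a unit square brings its two H residues together and gives one contact; the same sequence laid out in a straight line gives none.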

26
Basic Ideas

Each amino acid (neuron) in the primary sequence
occupies one lattice point (city).
The distance between the two cities to which two
neighboring neurons are mapped is forced to be 1,
matching the covalent bond length between amino
acids in a protein molecule.
Move the neurons to obtain more H-H contacts,
i.e., emphasize forming a hydrophobic core.

27
Main Observation

The problem is a Traveling Salesman Problem with
an energy function, based on the H-H contacts,
that is to be maximized.

28
Mathematical Model (in square lattice)

Let the sequence length and the lattice size be
equal. A binary variable indicates whether or not
the i-th acid takes the j-th lattice point. Each
lattice point j has a neighboring set and
coordinates.
[Symbols and formulas not preserved in transcript]

29
Complexity

NP-hard problem even in the case of two
dimensional HP model
P.Crescenzi, et al.
On the complexity of protein folding,
Journal of Computational Biology, 5(3)
423-, 1998
Many local solutions
GA, MC, SA (genetic algorithms, Monte Carlo,
simulated annealing): time consuming

30
SOM Approach

Existing algorithm
Motivated by Self-Organizing Map for TSP
Incorporation of HP information
Compact lattice (the sequence exactly fills the
lattice)
A sequence of length 36 in a 6x6 lattice
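
For readers unfamiliar with the SOM for TSP that motivates this approach, here is a minimal sketch of the generic idea: a ring of neurons is pulled toward randomly chosen cities, and the final ring order gives a tour. It does not include HP information or lattice constraints and is not the algorithm of these slides. The function name `som_tsp_order`, the neuron count, and all learning parameters are assumptions, and cities are assumed to lie in the unit square.

```python
import math
import random


def som_tsp_order(cities, iterations=3000, seed=0):
    """Return a visiting order for `cities`, a list of (x, y) points."""
    if not cities:
        return []
    rng = random.Random(seed)
    m = 3 * len(cities)                  # assumed: more neurons than cities
    neurons = [[rng.random(), rng.random()] for _ in range(m)]
    rate, radius = 0.8, max(m / 8, 1.0)  # assumed learning schedule

    def winner(city):
        # Index of the neuron closest to the city.
        return min(range(m), key=lambda i: (neurons[i][0] - city[0]) ** 2
                   + (neurons[i][1] - city[1]) ** 2)

    for _ in range(iterations):
        city = rng.choice(cities)
        w = winner(city)
        for i in range(m):
            d = abs(i - w)
            d = min(d, m - d)            # distance along the ring
            pull = rate * math.exp(-d * d / (2 * radius * radius))
            neurons[i][0] += pull * (city[0] - neurons[i][0])
            neurons[i][1] += pull * (city[1] - neurons[i][1])
        rate *= 0.9997                   # slowly reduce the step size
        radius = max(radius * 0.997, 0.5)  # and shrink the neighborhood
    # Read the tour off the ring: order cities by their winning neuron.
    return sorted(range(len(cities)), key=lambda c: winner(cities[c]))


if __name__ == "__main__":
    points = [(0.1, 0.1), (0.9, 0.1), (0.9, 0.9), (0.1, 0.9), (0.5, 0.5)]
    print(som_tsp_order(points))
```

The result is always a permutation of the city indices, but its quality depends on the assumed parameters; two cities mapped to the same neuron are ordered arbitrarily.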

31
New SOM Approach

Motivation
Consider a lattice bigger than the sequence, to
allow more flexible shapes than the rectangular
shape alone
Equivalent to a PCTSP (Prize Collecting Traveling
Salesman Problem): a salesman visits only part of
the city set, subject to a required return.
Difficulties caused
Number of cities > number of neurons

32
PCTSP

A traveling salesman who gets a prize in every
city k that he visits and pays a penalty for
every city that he fails to visit, and who
travels between cities i and j at cost c_ij,
wants to minimize the sum of his travel cost and
net penalties, while including in his tour enough
cities to collect a prescribed amount of prize
money.
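
The PCTSP objective described above can be written out for a candidate tour. The sketch below is only an evaluator, not a solver, and the names `pctsp_objective`, `cost`, `prize`, `penalty` and `min_prize` are assumptions chosen for this example.

```python
def pctsp_objective(tour, cost, prize, penalty, min_prize):
    """Evaluate a closed tour over a subset of cities.

    Returns travel cost plus penalties for unvisited cities, or None if
    the tour does not collect the prescribed amount of prize money.
    """
    visited = set(tour)
    if len(visited) != len(tour):
        raise ValueError("a city may be visited at most once")
    if sum(prize[k] for k in tour) < min_prize:
        return None                      # infeasible: too little prize
    # Cost of the closed tour, returning to the starting city.
    travel = sum(cost[tour[i]][tour[(i + 1) % len(tour)]]
                 for i in range(len(tour)))
    skipped = sum(penalty[k] for k in range(len(prize)) if k not in visited)
    return travel + skipped


if __name__ == "__main__":
    cost = [[0, 1, 4, 2],
            [1, 0, 1, 5],
            [4, 1, 0, 1],
            [2, 5, 1, 0]]
    prize = [2, 2, 2, 2]
    penalty = [3, 3, 3, 3]
    print(pctsp_objective([0, 1, 2], cost, prize, penalty, min_prize=5))
    print(pctsp_objective([0, 1, 2, 3], cost, prize, penalty, min_prize=5))
```

In this made-up instance, skipping city 3 costs 6 in travel plus a penalty of 3, while visiting all four cities costs 5 with no penalty, so the full tour is better.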

33
The new SOM model corresponds to the integer
program
[Formulation not preserved in transcript]
where m > n and the total number of variables is
(n+1)m.

34
New SOM Approach

Innovative Points
Heuristic initialization to imitate a protein
Learning sample set partition strategy
Learning sample set reduction strategy
Local search procedure to overcome the
multi-mapping phenomena

35
Numerical Results

Constructed HP sequences
(Length of 17)

HP benchmark (up to 36 amino acids)

36
SOM Approach for 2D HP-Model

Xiang-Sun Zhang, Yong Wang, Zhong-Wei Zhan,
Ling-Yun Wu, Luonan Chen. A New SOM Approach for
2D HP-Model of Proteins' Structure Prediction.
Submitted to RECOMB04.
Yong Wang, Zhong-Wei Zhan, Ling-Yun Wu, Xiang-Sun
Zhang. Improved Self-Organizing Map Algorithm for
Protein Folding and its Realization. Submitted
to J. of Systems Science and Mathematical
Sciences. (in Chinese)

37
Main Improvements

Finds configurations with the global maximum
number of H-H contacts in all the tests
Finds more optimal conformations
Fast: running time is linear in the sequence
length

38
Unique Optimal Folding Problem

Which proteins in the two-dimensional HP model
have a unique optimal (minimum-energy) folding?
(Brian Hayes, 1998)
Oswin Aichholzer proved that in the square
lattice:
There are closed chains of monomers with this
property for all even lengths.
There are open monomer chains with this property
for all lengths divisible by four.
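
For very short open chains, uniqueness can be checked by exhaustive enumeration. The sketch below enumerates self-avoiding walks on the square lattice, removes rotations and reflections by fixing the first step to the right and the first turn upward, and reports the best H-H contact count together with the number of folds attaining it. The name `optimal_foldings` is an assumption; closed chains are not handled, and the search is exponential in the chain length.

```python
def optimal_foldings(sequence):
    """Return (max H-H contacts, number of optimal folds up to symmetry)."""
    n = len(sequence)
    if n < 2:
        return (0, 1)
    best, count = -1, 0

    def contacts(coords):
        index_at = {p: i for i, p in enumerate(coords)}
        total = 0
        for i, (x, y) in enumerate(coords):
            if sequence[i] != "H":
                continue
            for nb in ((x + 1, y), (x, y + 1)):   # count each pair once
                j = index_at.get(nb)
                if j is not None and sequence[j] == "H" and abs(i - j) > 1:
                    total += 1
        return total

    def extend(coords, occupied, turned):
        nonlocal best, count
        if len(coords) == n:
            c = contacts(coords)
            if c > best:
                best, count = c, 1
            elif c == best:
                count += 1
            return
        x, y = coords[-1]
        for dx, dy in ((1, 0), (0, 1), (-1, 0), (0, -1)):
            if not turned and dy == -1:
                continue                 # first turn must go up (no mirrors)
            nxt = (x + dx, y + dy)
            if nxt in occupied:
                continue                 # keep the walk self-avoiding
            coords.append(nxt)
            occupied.add(nxt)
            extend(coords, occupied, turned or dy != 0)
            occupied.discard(nxt)
            coords.pop()

    start = [(0, 0), (1, 0)]             # first step fixed (no rotations)
    extend(start, set(start), False)
    return best, count


if __name__ == "__main__":
    print(optimal_foldings("HPPH"))
```

A chain has a unique optimal folding in this sense when the second value is 1. For HPPH the only optimal fold is the unit square, with one contact.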

39
Square Lattice and Triangular Lattice
40
Our Results

For any n = 18k (k is a positive integer), there
exists an n-node (open or closed) chain with at
least [expression missing] optimal foldings, all
with isomorphic contact graphs of size n/2.
On the 2D triangular lattice, for any integer
n > 19, there exist both closed and open chains
of n nodes with a unique optimal folding.

41
Proteins With Unique Optimal Foldings

Zhen-Ping Li, Xiang-Sun Zhang, Luo-Nan Chen,
Protein with Unique Optimal Foldings on a
Triangular Lattice in the HP Model, Submitted to
Journal of Computational Biology.

42
Examples of Optimal Foldings
43
3D Protein Structure Alignment