Self-organizing Map (SOM) in Protein Folding Based on HP Model

Slides: 51
Provided by: leo88
Transcript and Presenter's Notes

1
Self-organizing Map (SOM) in Protein Folding
Based on HP Model
  • Xiang-Sun ZHANG
  • ZHANGroup_at_bioinfoamss.org
  • http://zhangroup.aporc.org
  • 2003.12.2
  • 2 Dec. 2003 at NCSU

2
Motivation
  • We are all concerned with what we (OR researchers and
    algorithm designers) can do in bioinformatics.
  • Where do operations research and bioinformatics
    intersect?

3
Abstract
  • Many problems in bioinformatics can be formulated
    as large linear or nonlinear integer programming or
    combinatorial problems that are NP-hard and
    intractable for existing exact algorithms, so
    efficient approximate methods are needed.
  • As examples, a heuristic algorithm for SBH and a
    new SOM algorithm for solving the protein HP
    model are presented.
  • Other related research works in our group are
    introduced.

4
Problem areas in Bioinformatics
  • Human Genome Project
  • Large molecule data in biology, such as DNA and
    protein
  • Genomics
  • DNA sequencing
  • Gene prediction
  • Sequence alignment
  • Proteomics (50,000 entries on Google) / Protenomics
    (hundreds of entries on Google)
  • Structure prediction
  • Protein alignment

5
  • Operations Research
  • Over 8 million entries on Google

6
DNA Sequencing
  • ACGTGATCGATCGAGTACGAGAGTCTA
  • _______________________________
  • ACGTGATCGATCGAGTACGAGAGTCTA
  • ACGTGATCGATCGAGTACGAGAGTCTA
  • ACGTGATCGATCGAGTACGAGAGTCTA
  • ACGTGATCGATCGAGTACGAGAGTCTA

7
  • Two pieces of a target sequence are preferably
    joined where they share a long overlap. This
    requires that
  • the average size of the pieces is as long as
    possible, and
  • the target sequence is duplicated as many times as
    possible.
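
To make the overlap idea concrete, here is a minimal Python sketch. It assumes exact, error-free pieces, and the helper names `overlap` and `merge` are illustrative only, not part of the talk. The two pieces are cut from the target sequence shown on the previous slide.

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    for size in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def merge(a: str, b: str) -> str:
    """Join a and b, writing their shared overlap only once."""
    return a + b[overlap(a, b):]


left, right = "ACGTGATCGATCG", "GATCGAGTACGAG"
print(overlap(left, right))  # 5 (the shared run GATCG)
print(merge(left, right))    # ACGTGATCGATCGAGTACGAG
```

Longer pieces and more copies of the target make long overlaps like this one more likely, which is why both conditions above help.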

8
A novel DNA sequencing technique, called
Sequencing By Hybridization (SBH), was proposed
as an alternative to traditional sequencing
by gel electrophoresis. SBH is based on the DNA
chip (or DNA array). A DNA chip contains all
probes of length k (a probe is a short
k-nucleotide fragment of DNA, also called a
k-tuple). Given a probe and a target DNA, the
target will bind (hybridize) to the probe if the
target has a substring that fits the probe.
9
DNA Sequencing
  • DNA array (DNA chip)
  • AAATGCG (five 3-tuples, on a chip of 3-tuples)

10
SBH uses the classical probing scheme: by
hybridizing an (unknown) DNA fragment with the
chip, the set of all k-tuples the target contains
(called its spectrum) can be determined. SBH tells
us which k-tuples are present in the target DNA,
but not where they occur. The problem is then how
to reconstruct the target DNA from this data.
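
As a small illustration of what a spectrum is, the following Python sketch computes it for an error-free target. The function name `spectrum` is an assumption made for this example. Returning a set reflects the fact that SBH reports neither positions nor repetition counts.

```python
def spectrum(target: str, k: int) -> set[str]:
    """All distinct k-tuples that occur in the target sequence."""
    return {target[i:i + k] for i in range(len(target) - k + 1)}


# The chip example from the previous slide: AAATGCG contains five 3-tuples.
print(sorted(spectrum("AAATGCG", 3)))
# ['AAA', 'AAT', 'ATG', 'GCG', 'TGC']
```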
11
  • Because of technological limits, k cannot yet
    be made very large (generally less than 30,
    which is already a big chip). This can lead to
    branching during sequence reconstruction and
    to multiple possible reconstructions.
  • In addition, two kinds of errors can occur:
    negative errors (k-tuples in the sequence that
    are not hybridized) and positive errors
    (hybridized probes that are not k-tuples in the
    sequence). Therefore, for larger DNA fragments,
    the problem of sequence reconstruction becomes
    rather complicated and hard to analyze.

12
  • In the case of error-free SBH and an ideal
    spectrum (i.e. one consisting of n-k+1 different
    k-tuples, where n is the length of the DNA
    fragment), it is known that the SBH
    reconstruction problem is equivalent to finding
    an Eulerian path in a corresponding graph, and
    the algorithm can be implemented in linear time.
  • Positive and negative errors, and repeated
    k-tuples in the DNA fragment, make the problem
    computationally difficult: it becomes strongly
    NP-hard.
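
The error-free case can be sketched in a few lines of Python. The slide does not spell out the graph, so this sketch assumes the usual construction: every k-tuple is an edge from its (k-1)-prefix to its (k-1)-suffix, and Hierholzer's algorithm finds the Eulerian path. The function name `reconstruct` is hypothetical.

```python
from collections import defaultdict


def reconstruct(kmers: list[str]) -> str:
    """Rebuild a sequence from an ideal, error-free spectrum."""
    graph = defaultdict(list)
    balance = defaultdict(int)  # out-degree minus in-degree per vertex
    for kmer in kmers:
        head, tail = kmer[:-1], kmer[1:]
        graph[head].append(tail)
        balance[head] += 1
        balance[tail] -= 1

    # Start at the vertex with one extra outgoing edge, if there is one.
    start = next((v for v, b in balance.items() if b == 1), kmers[0][:-1])

    # Hierholzer's algorithm (iterative): each edge is used exactly once.
    stack, path = [start], []
    while stack:
        vertex = stack[-1]
        if graph[vertex]:
            stack.append(graph[vertex].pop())
        else:
            path.append(stack.pop())
    path.reverse()

    if len(path) != len(kmers) + 1:
        raise ValueError("spectrum has no Eulerian path")
    return path[0] + "".join(v[-1] for v in path[1:])


print(reconstruct(["ATA", "TAC", "ACG", "CGA", "GAA", "AAG", "AGA"]))
# ATACGAAGA
```

Each edge is pushed and popped once, which is where the linear running time comes from. When the spectrum admits several Eulerian paths, this sketch returns just one of them.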

13
Sequencing by Hybridization
  • DNA fragment: ATACGAAGA
  • Spectrum:
      ATA TAC ACG CGA GAA AAG AGA (ideal case)
      ATA TAC AGG CGA GAA AAG AGA (with errors)
  • Errors: positive (misread) / negative (missing,
    repetition)
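
The two error types in this example can be separated with plain set differences. The sketch below is illustrative: `classify_errors` is a hypothetical helper, and it assumes the ideal spectrum is known, which is only true in a constructed example like this one.

```python
def spectrum(target: str, k: int) -> set[str]:
    """All distinct k-tuples that occur in the target sequence."""
    return {target[i:i + k] for i in range(len(target) - k + 1)}


def classify_errors(ideal: set[str], observed: set[str]):
    """Return (positive, negative) errors of an observed spectrum."""
    positive = observed - ideal  # hybridized, but not in the sequence
    negative = ideal - observed  # in the sequence, but not hybridized
    return positive, negative


ideal = spectrum("ATACGAAGA", 3)
observed = {"ATA", "TAC", "AGG", "CGA", "GAA", "AAG", "AGA"}
positive, negative = classify_errors(ideal, observed)
print(positive, negative)  # {'AGG'} {'ACG'}
```

Here the misread probe AGG is a positive error and the missing ACG is a negative error.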
14
  • 1989, Pevzner: the SBH reconstruction problem is
    equivalent to finding an Eulerian path in a
    related graph.
  • 1990, Fleischner: the algorithm can be implemented
    in linear time.
  • 1991, Drmanac et al.: an algorithm for SBH with
    errors, under the assumption that only the first or
    last nucleotide in the data can be erroneous.
  • 1993, Lipshutz: uses empirically derived rates of
    positive and negative errors and other
    assumptions. No convergence analysis.
  • 1999, Blazewicz et al.: a branch and bound method
    for the case of only positive errors.
  • 2000, Blazewicz et al.: a heuristic algorithm
    producing near-optimal solutions.

15
SBH Reconstruction Problem
  • Design efficient heuristic algorithms
  • Ji-Hong Zhang, Ling-Yun Wu and Xiang-Sun Zhang. A
    new approach to the reconstruction of DNA
    sequencing by hybridization. Bioinformatics, vol
    19(1), pages 14-21, 2003.
  • Xiang-Sun Zhang, Ji-Hong Zhang and Ling-Yun Wu.
    Combinatorial optimization problems in the
    positional DNA sequencing by hybridization and
    its algorithms. System Sciences and Mathematics,
    vol 3, 2002. (in Chinese)
  • Ling-Yun Wu, Ji-Hong Zhang and Xiang-Sun Zhang.
    Application of neural networks in the
    reconstruction of DNA sequencing by
    hybridization. In Proceedings of the 4th ISORA,
    2002.

16
Basic Observation
  • The spectrum corresponds to a graph: each k-tuple
    is a vertex and each pair of connected k-tuples is
    an edge. The structure of the graph is represented
    by the adjacency matrix.
  • A reconstruction from the spectrum is a path in
    the graph. Information about all paths is implied
    by the powers of the adjacency matrix.
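
A rough Python sketch of this observation follows. It assumes that two k-tuples are "connected" when the last k-1 letters of the first equal the first k-1 letters of the second. The slide does not define this, so treat it as an assumption. Entry (i, j) of the p-th power counts walks of p edges from k-tuple i to k-tuple j. Walks may revisit vertices, so this is an upper bound on simple paths.

```python
def adjacency(kmers: list[str]) -> list[list[int]]:
    """1 where the (k-1)-suffix of u equals the (k-1)-prefix of v."""
    return [[1 if u[1:] == v[:-1] else 0 for v in kmers] for u in kmers]


def matmul(a, b):
    size = len(a)
    return [[sum(a[i][t] * b[t][j] for t in range(size))
             for j in range(size)] for i in range(size)]


def matrix_power(a, p: int):
    """Repeated multiplication, starting from the identity matrix."""
    size = len(a)
    result = [[int(i == j) for j in range(size)] for i in range(size)]
    for _ in range(p):
        result = matmul(result, a)
    return result


kmers = ["ATA", "TAC", "ACG", "CGA", "GAA", "AAG", "AGA"]
walks = matrix_power(adjacency(kmers), len(kmers) - 1)
# Exactly one walk of 6 edges starts at ATA, and it ends at AGA.
print(walks[0][6], sum(walks[0]))  # 1 1
```

In this example the row for ATA has a single non-zero entry, which singles out ATA and AGA as the two ends of the reconstruction.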

17
  • Criteria are given that use the powers of the
    adjacency matrix to determine, in polynomial
    time, the most likely k-tuples at both ends and in
    the middle of all possible reconstructions of the
    target DNA.
  • A novel technique that transforms negative
    errors into positive errors is proposed. It
    enables us to handle both types of errors easily.

18
Protein Structure Prediction
  • Predict protein 3D structure from (amino acid)
    sequence
  • Sequence → secondary structure → 3D structure →
    function

19
Proteins Secondary Structure
  • α-helix (30-35%)
  • β-sheet / β-strand (20-25%)
  • Coil (40-50%)
  • Loop
  • β-turn

20
3D Structure of Protein
Turn or coil
Alpha-helix
Beta-sheet
Loop and Turn
21
Protein 3D Structure Detection
  • X-ray diffraction
  • Expensive
  • Slow

22
Protein Structure Prediction
  • Prediction is possible because
  • Sequence information uniquely determines 3D
    structure
  • Sequence similarity (> 50%) tends to imply
    structural similarity
  • Prediction is necessary because
  • DNA sequence data >> protein sequence data >>
    structure data

23
Three Methods of Protein Structure Prediction
  • Goal
  • Find best fit of sequence to 3D structure
  • Comparative (homology) modeling
  • Construct 3D model from alignment to protein
    sequences with known structure
  • Threading (fold recognition)
  • Pick best fit to sequences of known 2D / 3D
    structures (folds)
  • Ab initio / de novo methods
  • Attempt to calculate 3D structure from scratch
  • Molecular dynamics
  • Energy minimization
  • Lattice models

24
Lattice Models
  • Suppose that each amino acid occupies one point
    in a space lattice
  • It is called an Exact Model

25
HP Model (Simple Model)
  • The twenty amino acids can be divided into two
    classes: hydrophobic/non-polar (H) and
    hydrophilic/polar (P)
  • Contacts between H points are favorable

Figure legend: hydrophobic amino acid, hydrophilic
amino acid, covalent bond, H-H contact
  • Goal: maximize the number of H-H contacts
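
The objective can be made concrete with a short Python sketch for the 2D square lattice. It assumes the usual HP convention that an H-H contact is a pair of H residues on adjacent lattice points that are not neighbors in the chain. The function name `hh_contacts` and the tiny example fold are illustrative only.

```python
def hh_contacts(sequence: str, coords: list[tuple[int, int]]) -> int:
    """Count H-H contacts of a fold on the 2D square lattice."""
    position = {xy: i for i, xy in enumerate(coords)}
    contacts = 0
    for i, (x, y) in enumerate(coords):
        if sequence[i] != "H":
            continue
        # Looking only right and up counts every adjacent pair once.
        for neighbor in ((x + 1, y), (x, y + 1)):
            j = position.get(neighbor)
            if j is not None and sequence[j] == "H" and abs(i - j) > 1:
                contacts += 1
    return contacts


# Four residues folded around a unit square: the two end H's touch.
square = [(0, 0), (1, 0), (1, 1), (0, 1)]
print(hh_contacts("HPPH", square))  # 1
```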

26
Basic Ideas
  • Each acid (neuron) in the primary sequence
    occupies one lattice point (city).
  • The distance between two cities mapped by two
    neighboring neurons is forced to be 1, the
    covalent bond length between amino acids in a
    protein molecule.
  • Move the neurons to obtain more H-H contacts, i.e.,
    emphasize forming a hydrophobic core.
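
The first two constraints can be checked mechanically. The sketch below is not the SOM itself, only a validity test that a SOM-style search would need to satisfy. It assumes a 2D square lattice with integer coordinates, and `is_valid_fold` is a hypothetical name.

```python
def is_valid_fold(coords: list[tuple[int, int]]) -> bool:
    """True if every residue has its own lattice point and
    consecutive residues sit exactly one lattice step apart."""
    if len(set(coords)) != len(coords):
        return False  # two residues mapped to the same point
    return all(
        abs(x1 - x2) + abs(y1 - y2) == 1
        for (x1, y1), (x2, y2) in zip(coords, coords[1:])
    )


print(is_valid_fold([(0, 0), (1, 0), (1, 1), (0, 1)]))  # True
print(is_valid_fold([(0, 0), (1, 1)]))                  # False
```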

27
Main Observation
  • The problem is a Traveling Salesman Problem with
    an energy function, based on the H-H contacts,
    that is to be maximized.

28
Mathematical Model (in square lattice)
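  • Let the sequence length and the lattice size both
    be n, and let a 0-1 variable indicate whether or
    not the i-th acid takes the j-th lattice point.
  • Let a neighboring set be defined for each point j.
  • Let each point j have given coordinates.
  • (The symbols and formulas on this slide were not
    preserved in the transcript.)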

29
Complexity
  • NP-hard problem even in the case of the
    two-dimensional HP model
  • P. Crescenzi, et al. On the complexity of protein
    folding. Journal of Computational Biology, 5(3),
    423-, 1998
  • Many local solutions
  • Genetic algorithms (GA), Monte Carlo (MC) and
    simulated annealing (SA) are time consuming

30
SOM Approach
  • Existing algorithm
  • Motivated by Self-Organizing-Map for TSP
  • Incorporation of HP Information
  • Compact lattice (the sequence exactly fills the
    lattice)
  • Example: a sequence of length 36 in a 6x6 lattice

31
New SOM Approach
  • Motivation
  • Consider a lattice bigger than the sequence, to
    allow more flexible shapes than the rectangular
    shape alone
  • Equivalent to a PCTSP (Prize Collecting Traveling
    Salesman Problem): a salesman visits only part of
    the city set, with some expectation.
  • Difficulties caused
  • Number of cities > number of neurons

32
PCTSP
  • A traveling salesman who gets a prize in every
    city k that he visits and pays a penalty for every
    city that he fails to visit, and who travels
    between cities i and j at cost c_ij, wants to
    minimize the sum of his travel cost and net
    penalties, while including in his tour enough
    cities to collect a prescribed amount of prize
    money.
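
To see how the pieces of this definition fit together, here is a hedged Python sketch that evaluates one candidate tour. All names (`pctsp_value`, `cost`, `prize`, `penalty`, `quota`) and the numbers are made up for illustration. It evaluates a given tour and does not search for the best one.

```python
def pctsp_value(tour, cost, prize, penalty, quota):
    """Return (objective, feasible) for a closed tour over some cities.

    objective = travel cost + penalties of the cities left out
    feasible  = collected prize reaches the prescribed quota
    """
    travel = sum(cost[a][b] for a, b in zip(tour, tour[1:] + tour[:1]))
    visited = set(tour)
    unpaid = sum(p for city, p in enumerate(penalty) if city not in visited)
    collected = sum(prize[city] for city in visited)
    return travel + unpaid, collected >= quota


cost = [[0, 1, 2, 5],
        [1, 0, 1, 5],
        [2, 1, 0, 5],
        [5, 5, 5, 0]]
prize = [1, 1, 1, 1]
penalty = [3, 3, 3, 3]

print(pctsp_value([0, 1, 2], cost, prize, penalty, quota=3))     # (7, True)
print(pctsp_value([0, 1, 2, 3], cost, prize, penalty, quota=3))  # (12, True)
```

In this toy instance, skipping the remote city 3 and paying its penalty is cheaper than visiting it. The analogy drawn in the talk is that lattice points play the role of cities and only some of them are visited by the chain.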

33
The new SOM model corresponds to an integer
program
  • where m > n and the total number of variables is
    (n+1)m.

34
New SOM Approach
  • Innovations
  • Heuristic initialization to imitate a protein
  • Learning sample set partition strategy
  • Learning sample set reduction strategy
  • Local search procedure to overcome the
    multi-mapping phenomena

35
Numerical Results
  • Constructed HP sequences
  • (Length of 17)
  • HP benchmark (up to 36 amino acids)

36
SOM Approach for 2D HP-Model
  • Xiang-Sun Zhang, Yong Wang, Zhong-Wei Zhan,
    Ling-Yun Wu, Luonan Chen. A New SOM Approach for
    2D HP-Model of Proteins' Structure Prediction.
    Submitted to RECOMB04.
  • Yong Wang, Zhong-Wei Zhan, Ling-Yun Wu, Xiang-Sun
    Zhang. Improved Self-Organizing Map Algorithm for
    Protein Folding and its Realization. Submitted
    to  J. of Systems Science and Mathematical
    Sciences. (in Chinese)

37
Main Improvements
  • Finds configurations with the globally maximal
    number of H-H contacts in all the tests
  • Finds more optimal conformations
  • Fast: running time is linear in the sequence
    length

38
Unique Optimal Folding Problem
  • Which proteins in the two-dimensional HP model
    have a unique optimal (minimum energy) folding?
    (Brian Hayes, 1998)
  • Oswin Aichholzer proved that in the square lattice
  • there are closed chains of monomers with this
    property for all even lengths;
  • there are open monomer chains with this property
    for all lengths divisible by four.

39
Square Lattice and Triangular Lattice
40
Our Results
  • For any n = 18k (k is a positive integer), there
    exists an n-node (open or closed) chain with at
    least optimal foldings all with isomorphic
    contact graphs of size n/2.
  • On the 2D triangular lattice, for any integer n > 19,
    there exist both closed and open chains of n
    nodes with a unique optimal folding.

41
Proteins With Unique Optimal Foldings
  • Zhen-Ping Li, Xiang-Sun Zhang, Luo-Nan Chen,
    Protein with Unique Optimal Foldings on a
    Triangular Lattice in the HP Model, Submitted to
    Journal of Computational Biology.

42
Examples of Optimal Foldings
43
3D Protein Structure Alignment
  • Motivation
  • Group proteins by structural similarity
  • Determine impact of individual residues on
    protein structure
  • Identify distant homologues of protein families
  • Predict function of proteins with low sequence
    similarity
  • Identify new folds / targets for x-ray
    crystallography

44
3D Protein Structure Alignment
  • Correspondence between atoms
  • Pairwise sequence alignment
  • Locations of atoms
  • Protein Data Bank (in PDB file)
  • Bond angles / lengths
  • X,Y,Z atom coordinates
  • Evaluation metric
  • 6 degrees of freedom
  • 3 degrees of translation (A)
  • 3 degrees of rotation (R)
  • Root Mean Square Deviation (RMSD)
  • n = number of atoms
  • d_i = distance between corresponding atoms i
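
With n and d_i as defined above, the standard definition is RMSD = sqrt((1/n) * sum of d_i^2). A minimal Python sketch follows. It assumes the two structures are already superimposed and that atom i of one corresponds to atom i of the other. The function name `rmsd` and the coordinates are illustrative.

```python
import math


def rmsd(atoms_a, atoms_b) -> float:
    """Root mean square deviation between corresponding atoms."""
    if len(atoms_a) != len(atoms_b) or not atoms_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared = sum(math.dist(p, q) ** 2 for p, q in zip(atoms_a, atoms_b))
    return math.sqrt(squared / len(atoms_a))


a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
b = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
print(rmsd(a, b))  # 1.0, since every atom is displaced by exactly 1
```

Finding the translation and rotation that minimize this value is the alignment problem itself and is not attempted here.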

45
Structure Alignment Problem
46
Match two rigid bodies by rotating and moving
them in 3D space
47
Structure Alignment Problem
  • A nonlinear integer programming problem

48
Structure Alignment Problem
  • Luo-Nan Chen, Tian-Shou Zhou, Yun Tang, Xiang-Sun
    Zhang. Structure of Alignment of Protein by Mean
    Field Annealing. Submitted to ICSB2003.

49
On-going Research
  • Protein structure prediction
  • Algorithms for HP model
  • Threading methods
  • Protein structure alignment
  • Novel model for structure alignment
  • SBH reconstruction
  • Algorithms for new pattern SBH methods
  • SNP (Single Nucleotide Polymorphism) and haplotype
    analysis

50
Summary
  • Problems in bioinformatics are simple to
    describe but complicated to solve
  • Many problems in proteomics are deterministic in
    nature
  • Combinatorial
  • Continuous model
  • while many problems in genomics are stochastic
    in nature
  • Model a problem accurately but solve it
    approximately