A Course on Bioinformatics - PowerPoint PPT Presentation

1 / 103
About This Presentation
Title:

A Course on Bioinformatics

Description:

The table will indicate for each position Pi if we fail at i, where to start next. O(|P|2 + |T|). Knuth-Morris-Pratt, Boyer-Moore: O(|P|+|T|). – PowerPoint PPT presentation

Number of Views:112
Avg rating:3.0/5.0
Slides: 104
Provided by: idiNtnuN6
Category:

less

Transcript and Presenter's Notes

Title: A Course on Bioinformatics


1
A Course on Bioinformatics
Ming Li Bioinformatics Lab Computer Science
Dept., UCSB
2
Chapter 0. Why Study Bioinformatics?
  • The trend of genetic data growth
  • Many experts predict 21st century will be a
    century of biotechnology.
  • Genomes Human, Rice, Mouse, Yeast, 50 species
    bacteria, biggest digital gold mine ever.

30 billion in year 2005
3
Chapter 1. Biology
Phenotype
4
Animal CELL
Mitochondrion
Nucleolus (rRNA synthesized)
Cytoplasm
Nucleus
Plasma membrance Cell coat
Chromatin
Lots of other stuff/organelles/ribosome
5
Two kinds of Cells
  • Prokaryotes no nucleus (bacteria)
  • Their genomes are circular
  • Eukaryotes have nucleus (animal,plants)
  • Linear genomes with multiple chromosomes in
    pairs. When pairing up, they look like

Middle centromere Top p-arm Bottom q-arm
6
Mitosis and Meiosis
  • Mitosis homologous chromosomes pair up,
    duplicate, one set to each cell.
  • Meiosis chromosomes split to haploid
    (reproductive) cells.
  • Haploid Number of Chromosomes
  • Human 23
  • Rice 12
  • Fruit Fly 4
  • Corn 10
  • Chimpanzee 24
  • House mouse 20

7
The DNA Sequences
  • All are made of H (hydrogen), C (carbon), N
    (nitrogen), O (oxygen), S (sulfur), P
    (phosphorus).
  • A (single chain) DNA sequence looks like

5
o
o
o
Sugar --- base
CH
A,C,G,T,U
2
c
c
Phosphate (PO )
c
c
4
Sugar --- base
A nucleotide
o
o
3
phosphate
Ribose RNA Deleting circled O, ? Deoxyribose
--DNA
Sugar --- base
etc
8
The 4 bases
Thymine
Adenine
H
H
H
C
o
A-T
N
H
H
H
N
C
C
C
c
C
C
N
N
H
H
N
C
N
C
C
N
o
H
H
H
Note this is flat!
o
H
N
H
N
C
C
C
Uracil replaces T in RNA
c
C
C
N
N
H
H
N
C
N
C
G-C
C
N
o
N
H
Cytosine
Guanine
H
9
Human 3 billion bases, 30k genes. E. coli 5
million bases, 4k genes
T
A
A
T
C
G
cDNA
T
A
reverse transcription
T
A
translation
transcription
C
G
Protein
mRNA
G
C
(20 amino acids)
(A,C,G,U)
G
C
Codon three nucleotides encode an
amino acid. 64 codons 20 amino acids, some w/more
codes
A
T
C
G
T
A
A
T
10
Genes and Proteins
  • One gene encodes one protein.
  • Like a program, it starts with start codon (e.g.
    ATG), then each three code one amino acid. Then a
    stop codon (e.g. TGA) signifies end of the gene.
  • Sometimes, in the middle of a (eukaryotic) gene,
    there are introns that are spliced out (as junk)
    during transcription. Good parts are called
    exons. This is the task of gene finding.

11
A.A. Coding Table
  • Arginine (ARG) CG
  • Asparagine (ASN) AAT, AAC
  • Glutamine (GLN) CAA, CAG
  • Cysteine (CYS) TGT, TGC
  • Methionine (MET) ATG
  • Phenylalanine (PHE) TTT,TTC
  • Tyrosine (TYR) TAT, TAC
  • Tryptophan (TRP) TGG
  • Histidine (HIS) CAT, CAC
  • Proline (PRO) CC
  • Stop TGA, TAA, TAG
  • Glycine (GLY) GG
  • Alanine(ALA) GC
  • Valine (VAL) GT
  • Leucine (LEU) CT
  • Isoleucine (ILE) AT(-G)
  • Serine (SER) AGT, AGC
  • Threonine (THR) AC
  • Aspartic Acid (ASP) GAT,GAC
  • Glutamic Acid(GLU) GAA,GAG
  • Lysine (LYS) AAA, AAG
  • Start ATG, CTG, GTG

12
Bad Genes --gt Genetic Diseases
  • Hemophilia on X chromosome.
  • Cystic firosis on chr. 7, CFTR gene. 4
    Caucasians carry the defective gene. (recessive)
  • Sickle-Cell Anemia single nucleotide mutation in
    the first exon of beta-globin gene (removes a
    cutting site). 1 in 12 African Americans are
    carriers. (sick for homozygotes)
  • BRCA1 gene (chr. 17q) responsible for ½
    inherited breast cancer (10 of breast cancer)
  • Fragile X syndrome (mentally retard) 1 in 1250
    males, 2500 females (dominate, but females have
    partially expressed good gene). FMR-1 gene
    tri-nucleotide repeats gt200 causes disease.
  • P53 gene chr. 17p, responsible for ½ of all
    cancers
  • X-rated XX, XY, XO(f), XXY(m), XYY(m)

13
Where to get data?
  • Larry Miller GenBank
  • http//www.ncbi.nlm.nih.gov
  • Dr. Huandong Sun
  • SWISS-PROT http//www.expasy.ch/sprot
  • PDB http//www.pdb.bnl.gov/
  • Go to our Bioinformatics Lab website
    www.cs.ucsb.edu/mli/Bioinf/resources/index.html
  • for all the information.

14
Chapter 2. Research and Industry
  • Bioinformatics Labs and education programs
    established in almost all major universities,
    scattered in Biology, Physics, Computer Science
  • Research topics
  • Gene-finding
  • Mapping
  • Sequencing/sequence assembly.
  • Sequence Comparison alignment, database search
  • Mining DNA/proteins, finding motifs
  • Genome arrangement
  • DNA arrays
  • Computational proteomics, protein folding
  • Phylogeny reconstruction and analysis

15
The Commercial Market
  • Current bioinformatics market is worth 300
    million / year. Half data, half software.
  • 2 billion / year bioinformatics market in 5
    years, predicted by Oscar Gruss Son.
  • Wide range of different demands
  • Bioinformatics software require sophisticated
    algorithm support, ranging from probabilistic/
    approximation algorithms, data mining, learning
    algorithms, databases, to GUI design

16
BUSINESS Landscape
Pharmaceutical Companies
Universities Research Labs
Hospitals, Biotech firms
Sell Data Service
Sell large Systems
Sell Web Service
17
Bioinformatics Companies
  • Genomatrix Software, Genaissance Pharmaceuticals,
    Lynx, Lexicon Genetics, DeCode Genetics, CuraGen,
    AlphaGene, Bionavigation, Pangene, InforMax,
    TimeLogic, GeneCodes, LabOnWeb.com, Darwin,
    Celera, Incyte, BioResearch Online, BioTools,
    Oxford Molecular, Genomica, NetGenics, Rosetta,
    Lion BioScience, DoubleTwist, eBioinformatics,
    Prospect Genomics, Neomorphic, Molecular Mining,
    GeneLogic, GeneFormatics, Molecular Simulations,
  • Total 50.

18
Challenges to Computer Science
  • Enormous size of genomics data suitable for
    asymptotic analysis
  • Many NP-hard problems defy simple
    minded/non-professional approaches multiple
    alignment, distant homology, motif finding,
    protein folding, phylogeny, gene relationship in
    expression data, mining and learning,
  • Approaching these problems from computer science
    perspective will be fruitful.

19
Algorithmic Techniques in CB
  • Dynamic programming (alignment, etc)
  • Divide and conquer (Protein folding)
  • Approximation algorithms (sequence assembly,
    phylogeny)
  • Greedy algorithms (sequence assembly)
  • Heuristics
  • Linear programming and relaxation

20
Some CS jargons
  • NP-Complete easy to guess, hard to find.
  • Approximation algorithm If the minimal solution
    is f, your solution is ggtf, then the
    approximation ratio is g/f.
  • PTAS (Polynomial time approximation scheme) this
    is best kind of approximation algorithms for any
    e, we can achieve (1e)optimal in polynomial time
    (exponent might depend on e).

21
Chapter 3. DNA Sequencing
  • Two ways to copy DNA
  • 1. Polymerase Chain Reaction (1986) make many
    copies of a fragment (500). Needs primers (end
    segments). Cleave into 2. Anneal primers (5- 3,
    and 3- 5). Make two double chains. Repeat
    1,2,4,8,16,

3

5
5
3
5
3

5
3
22
  • 2. Cloning. Insert a large piece of DNA into a
    cloning vector (virus, bacteria, yeast -YAC).
    Then the vector is inserted into a host cell to
    duplicate naturally.
  • DNA Sequencing
  • Make many copies (single strand)
  • Cut them into fragments of lengths 500.
  • For each fragment of length L, use some process
    like PCR, generating all lengths 1 L with some
    fluorescence dye. In old scheme, you generate all
    fragments end with A, then with C, then G, then
    T, run them in 4 lanes of electrophoresis gel. In
    the new scheme, you have 4 colors (of the dye)
    all fragments in 1 lane.
  • Then assemble all fragments into the shortest
    common superstring by GREEDY repeatedly merge
    the pair with max overlap until finish.

23
Shortest Common Superstring
  • In FOCS 1990, we started formalize and analyze
    the following learning problem Infer orginal DNA
    sequence from fragments. Or given n strings,
    find the shortest common superstring. (1994, J.
    ACM)
  • The problem was proved to be NPC, 1980.
  • Open for 10 years does GREEDY work? (I.e. does
    it give linear approximation?)
  • We solved this, proving 3, STOC91.
  • Improvements by many people to 2.89, 2.81, 2.79,
    2.75, 2.66,
  • Formal Statement Given Ss1, sn, find a
    shortest s such that for all i, si is a substring
    of s.
  • E.g. alf ate half lethal alpha alfalfa ?
    lethalphalfalfate

24
Theorem. GREEDY achieves 4.
  • Proof. Given Ss1, ,sm, construct G
  • Nodes are s1, ,sm
  • Edges if then
    add edge
  • where pref is the pref length. I.e.
    siprefoverlap length with sj
  • SCS(S) length shortest Hamilton cycle in G
  • (Modified) Greedy restated find all cycles with
    minimum weights in G, then open cycle,
    concatenate to obtain the final superstring.

sj
pref
si
pref
si
sj
25
This minimum cycle exists
  • Assuming initial Hamilton cycle has w(C) n
  • Then merging si with sj is equivalent to breaking
    into two cycles. We have
  • w(C1) w(C2) lt n
  • Proof We merged (si, sj) because they have max
    overlap. Picture shows
  • d(si,sj)d(s,s)ltd(si,s)d(s,sj)
  • Continue this process end with self-cycles C1,
    C2, C3, C4,
  • Sw(Ci) lt n.

C
si
sj
s
s

S
sj
C1
si
S
si
sj
s
s
C2
26
  • Then we open cycles
  • Let Wiw(Ci)
  • Li longest string in Ci
  • open Cilt Wi Li
  • n gt SWi
  • Lemma. S1 and S2 overlap lt w1w2
  • S(Li-2Wi) lt n, by lemma, since Lis must be in
    final SCS.
  • SltS(LiWi)
  • S(Li-2Wi)S3Wi
  • lt n 3n
  • 4n.
  • QED

w1
w1
w1
w1
s1
w2
w2
w2
s2
27
Chapter 4. Sequence Comparison
  • Sequence Alignment
  • s(i,j)max
  • s(i-1,j) d(vi,-)
  • s(i,j-1)d(-,wj)
  • s(i-1,j-1)d(vi,wj)
  • No gap penalties,
  • where d, for proteins, is either PAM or BLOSUM.
  • d(-,x)d(x,-) -a, d(x,y)-u. When a0,
    uinfinity, it is LCS.
  • Longest Common Subsequence (LCS).
  • Vv1v2 vn
  • Ww1w2 wm
  • s(i,j) length of LCS
  • of V1..i and W1..j
  • Dynamic Programming
  • s(i-1,j)
  • s(i,j)max s(i,j-1)
  • s(i-1,j-1)1,viwj

28
Misc. Concerns
  • Local sequence alignment, add si,j0.
  • Gap penalties. For good reasons, we charge first
    gap cost a, and then subsequent continuous
    insertions blta.
  • Space efficient sequence alignment. Hirschberg
    alg. in n2 time, O(n) space.
  • Multiple alignment of k sequences nk

29
BLAST / Psi-BLAST / Gap-BLAST
  • Popular software, using heuristics. By Altschul,
    Gish, Miller, Myers, Lipman, 1990.
  • E(epected)-value e dmne -lS, here S is score, m
    is database length and n is query length.
  • Meaning e is number of hits one can expect to
    see just by chance when searching a database of
    size m.
  • Basic Strategy For all 3 a.a. (and closely
    related) triples, remember their locations in
    database. Given a query sequence S. For all
    triples in S, find their location in database,
    then extend as long as e-value significant.
    Similar strategy for DNA (7-11 nucleotides).

30
Here is an example of a BLAST match (E-value 0)
between gene 0189 in C. pneumoniae and gene 131
in C. trachomatis. Query CPn0189
Score
(bits) E-value Aligned with CT131 hypothetical
protein 1240
0.0 Query 1 MKRRSWLKILGICLGSSIVLGFLIFLPQLLSTE
SRKYLVFSL I HKESGLSCSAEELKISW 60
MKR W KI G L L L LP SES KYL
SKEGL EL SW Sbjct 1
MKRSPWYKIFGYYLLVGVPLALLALLPKFFSSESGKYLFLSVLNKETGLQ
F EIEQLHLSW 60 Query 61 FGRQTARKIKLTG-EAKDEVFS
AEKFELDGSLLRLL I YKKPKGITLSGWSLKINEPASID 119
FG QTAKI G EFAEK GSL
RLLY PK TLGWSLIE S Sbjct 61
FGSQTAKKIRIRGIDSDSEIFAAEKI IVKGSLPRLLL
YRFPKALTLTGWSLQIDESLSMN 120 Etc. Note Because of
powerpoint character alignment, I inserted some
white space in the alignment that are not in the
BLAST output I will explain in class. These
white spaces were inserts so that things line up
right (the k-th letter goes with the k-th letter).
31
Chap. 4. Tools From CS
  • Suffix Array. Give sequence
  • she_sells_seashell_on_the_sea_shore
  • Suffix array is an array of pointers pointing to
    suffices of the sequence sorted.
  • E.g., all suffices of above are e, re, ore,
    hore, shore,
  • _shore, a_shore, ea_shore, sea_shore, and so on.
  • Then these are sorted and their pointers
    (pointing to a suffix location in sentence) are
    stored in an array (the suffix array).

32
Why Suffix Array?
  • It can be built very fast.
  • It can answer queries very fast
  • How many times ATG appears (their pointers are
    all jammed together).
  • What is G-C contents.
  • Disadvantages
  • Cant do approximate matching
  • Hard to insert new stuff (need to rebuild the
    array) dynamically.
  • Pointers can cost too much space. 3G pointers?

33
Fast Pattern Matching
  • Given T (text) and P (pattern), is P in T?
  • Slow algorithm for each position in T, check if
    P is in T. This costs O(PT).
  • Fast algorithm no back tracking. P is short, we
    can calculate a table for P first if we have
    matched PP1 Pk with T halfway fail to match at
    Pi with Tj, since P1 Pi-1 already matched the
    text, we should know where to start in P and
    continue to match at Tj1 The table will
    indicate for each position Pi if we fail at i,
    where to start next. O(P2 T).
  • Knuth-Morris-Pratt, Boyer-Moore O(PT).

34
Chapter 5 Multiple alignment
  • To do phylogenetic analysis
  • Same protein from different species
  • Optimal multiple alignment probably implies
    history
  • Discover irregularities, such as Systic Fibrosis
    gene
  • To find conserved regions
  • Local multiple alignment reveals conserved
    regions
  • Conserved regions usually are key functional
    regions
  • These regions are prime targets for drug
    developments

35
Definitions
  • Given sequences s1, sn, multiple alignment M
    puts them in n rows, one sequence per row, with
    spaces inserted to get supersequences S1, , Sn,
    SiL.
  • SP-Alignment Minimize sum of Hamming(Si,Sj) over
    all pairs i, j.
  • Star-Alignment find center sequence S, minimize
    sum of Hamming(S,Si) over all i.
  • We will concentrate on SP-alignment.

36
CLUSTAL-W
  • Standard popular software
  • It does multiple alignment as follows
  • Align 2
  • Repeat keep on adding a new sequence to the
    alignment until no more, or do tree-like
    heuristics.
  • Problem It is simply a heuristics.
  • Alternative dynamic programming nk for k
    sequences. This is simply too slow.
  • We need to understand the problem and solve it
    right.

37
Making the Problem Simpler!
  • Multiple alignment is very hard
  • For k sequences, nk time, by dynamic programming
  • NP hard in general, not clear how to approximate
  • Popular practice -- alignment within a band the
    p-th letter in one sequence is not more than c
    places away from the p-th letter in another
    sequence in the final alignment the alignment
    is along a diagonal bandwidth 2c.
  • Used in final stage of FASTA program.

38
Literature (for details, see our STOC00 paper)
  • NP hardness under various models Wang-Jiang
    (JCB), Li-Ma-Wang (STOC99), W. Just
  • Approximation results Gusfield (2- 1/L), Bafna,
    Lawler, Pevzner (CPM94, 2-k/L), star alignment.
  • Sankoff, Kruskal discussed within a band
  • Pearson showed alignment within a band gives very
    good results for a lot protein superfamilies.
  • Altschul and Lipman, Chao-Pearson-Miller,
    Fickett, Ukkonen, Spouge (survey) all have
    studied alignment within a band.

39
The following were proved
  • SP-Alignment
  • NP hard
  • PTAS for constant band
  • PTAS for constant number of insertion/deletion
    gaps per sequence on average (for coding regions,
    this assumption makes a lot of sense)
  • Star-Alignment
  • PTAS in constant band
  • PTAS for constant number of insertion/deletion
    gaps per sequence on average

40
We will do only SP-alignment
  • Notation in an alignment, a block of inserted
    --- is called a gap. If a multiple alignment
    has c gaps on the average for each sequence, we
    call it average c-gap alignment.
  • We first design a PTAS for the average c-gap SP
    alignment.
  • Then using the PTAS for the average c-gap SP
    alignment, we design a PTAS for SP-alignment
    within a band.

41
Average c-gap SP Alignment
  • Key Idea choose r representative sequences, we
    find their correct alignment in the optimal
    alignment, by exhaustive search. Then we use this
    alignment as reference.
  • Then we align every other sequence against this
    alignment.
  • Then choose the best.
  • All we have to show is that there are r sequences
    whose letter frequencies in each column of their
    alignment approximates the complete alignment.

42
Some over-simplified reasoning
  • If M is optimal average c-gap SP alignment
  • In this alignment, many sequences have less than
    cl gaps.
  • So if we take r of these sequences, and try every
    possibly way, one way coincides with M.
  • Then hopefully, its letter frequencies in each
    column more or less approximates that of Ms
  • Then we can simply optimally align all the rest
    of the sequences one by one according to this
    frequency matrix.

43
Sampling r sequences
Complete Alignment
Alignment with r sequences
j
j
We also expect this column has k percent as
If this column has k percent as
44
AverageSPAlign
  • for Lm to nm
  • for any r sequences
  • for all possible alignment M of length L
  • and with no more than cl gaps
  • align all other sequences to M //one
    alignment
  • Output the best alignment.

45
SP Alignment within c-Band
  • Basic Idea
  • Dynamically cut seq-uences into segments
  • Each segment satisfies the average c-gap
    condition. Hence use previous algorithm
  • Then assemble the segments together
  • Divide-Conquer.

Cutting these sequences into 6 segments, each
segment has c-gaps per sequence on average in
optimal alignment.
46
The final Algorithm diagonalAlign
  • while (not finished)
  • find a maximum prefix for each sequence (same
    length) such that AverageSPAlign returns
    lowcost. Keep the multiple alignment for this
    segment
  • Concatenate the multiple alignments for all
    segments together to as final alignment.

47
Discussion
  • Current PTAS is extremely slow.
  • However, the design of PTAS might provide useful
    hints to design fast programs.
  • It is an interesting project to implement this
    PTAS (combined with some heuristics) and test
    this idea.

48
Chapter 6. Motif Finding
  • Finding motifs/conserved regions in proteins is
    important in drug design and proteomics.
  • The problem is also called local alignment.
  • Many programs exist -- all based on heuristics
  • We proved in STOC99 it is NP-hard.
  • We provided a polynomial time approximation
    scheme with guaranteed performance in polynomial
    time in STOC99.
  • Based on this theoretical result, we have
    implemented the COPIA system.

49
Given k protein sequences, find a conserved
region
L
K sequences
Red regions are conserved regions, or,
motifs. The dont have to be exactly same, they
match with higher scores than other regions.
50
The PTAS (Li, Ma, Wang, STOC99)
  • Input S1, , Sm, integer L.
  • Output t1, ,tm, tiL (motifs)
  • For every r length L substring, compute the
    consensuse t of them.
  • In all strings, find closest substring to t of
    length L. From these, find the new consensus s.
  • Choose the best.
  • Theorem This algorithm outputs a solution no
    worse than 11/sqrt(r) optimal. Time complexity
    is (nm)O(r).

51
COPIA (COnsensus Pattern Identification
Analysis, Liang-Li-Ma) (http//dna.cs.ucsb.edu/
copia/copia_submit.html) and Others
  • Straightforward implementation of PTAS is
    extremely slow.
  • But we can do a few things to speed up
  • E.g. instead of choosing all r, choose randomly
  • Many programs exist. Some use HMM. Most popular
    perhaps is the Gibbs sampling method by Lawrence
    et al (see page 149 textbook). It is an iterative
    procedure as follows Input, S1, , Sm
  • Randomly choose a word wi from each Si
  • Randomly choose r
  • Create a frequency matrix (qij) from the m-1
    words not from Sr.
  • For each position in Sr, compute the probability
    a word fits the frequency matrix (qij), and use
    that word with the highest probability, repeat.

52
Linear ProgrammingRelaxation(Reference book
Randomized algorithms, Motwani, Raghavan)
  • We introduce here a new technique for algorithm
    design (approximation).
  • For simplicity, lets consider a simplified
    problem. Given a set Ss1, sn sequences, each
    of length n, find string t not in S that is far
    away from all si in S find max d such that
    H(t,si) gt d.
  • Formulating an LP problem. If xx1 xn and
    sis1, sn, then H(x,si)x1 x2 xn,
    where xj xj if sj0, xj 1-xj if sj1, now we
    have an integer program
  • max D
  • S aijxjgtCiD for each si, aij0,1,-1 Ci are
    from Sxi
  • xi 0, 1

53
LP Relaxation
  • Integer Program (as we have just described) is
    NP-hard. So it is hard to solve it directly.
  • However, Linear Program is polynomial time
    solvable. So we relax our requirement that
    solution x1, xn have to be 0,1. Instead we may
    use 0?xi?1.
  • Then using a technique called randomized rounding
    we convert our fraction solution to integer
    solution by set xi1 with probability xi, xi0
    with probability 1-xi.
  • What does this give us? Well, for each aij xj
    Ci, after randomized rounding, we lose about
    logn, thus if D is O(n), we are within D-logD
    with high probability, so we have a very good
    approximation.
  • We can then derandomize the process getting a PTAS

54
Chapter 7 Protein Folding by Threading
  • Target ketosteroid isomerase KSI
  • Template 1ounA C_alpha-RMSD 1.97A and seq.
    identity 9 with KSI
  • 2.73A RMSD for all backbone atoms
  • 200 residues (M. Summers, in Science one of
    HIV proteins)

Blue predicted model by PROSPECT Red NMR
structure
55
Protein Folding Prediction by ThreadingThere are
many interesting work by Sippl-Weitchus,
Lathrop-Smith, Jones-Thornton, Godzik et al,
Bryant-Altschul, and many others. But we will
concentrate on one program PROSPECT by Ying Xu et
al.
  • Template library FSSP. It has currently about
    2000 known protein structures and specifies, for
    each protein
  • Core (?-helix, ?-sheet) regions.
  • For each a.a. position in sequence, secondary
    structure (3 options), Sol(vent) exposed,
    buried, or intermediate. (s.s.xsol total 9
    possibilities).
  • Distance interaction when lt8 angstrom. If in the
    same core, if gt4angstrom, also specify
    interaction.
  • Input Query sequence.

core
core
core
loop
loop
loop
loop
Protein
56
Goal Find Best Template
  • To minimize, over all templates T in FSSP
  • ETw1Emutatew2Esingletonw3Epairw4Egap
  • where
  • Emutate is alignment cost (as sequences, using
    PAM250)
  • Esingleton is a residues local preference of the
    ss and solvent environment. (Stored in table)
  • Epair specifies some (non-neighbor) pairs should
    be close in space (Energy level also in a table).
  • Egap is gap penalty. Not allowed in cores in
    PROSPECT.

57
Threading Algorithm
  • The problem in general is NPC.
  • The problematic part is Epair. Other parts can be
    optimized in polynomial time.
  • In order to compute Epair, draw an edge between
    each pair when there is pairwise interaction in
    FSSP.
  • PROSPECT uses a divide and conquer plus dynamic
    programming each step find best place to cut the
    problem into two parts (with smallest cut)
  • DP1,n,1,mmini,kDP1,i,1,k,x1, xc
  • DPi1,n,k1,mx1,,
    xc
  • Can precompute optimal cut fix i above.

58
Time Complexity
  • O(ncnm), c3,4 on average. c8 max. for all
    proteins in FSSP.
  • User provided information will also help
    threading, these include
  • Secondary structure information
  • Inter-residue distances (including NMR data).
  • Gap length requirements
  • Problems remain
  • Coefficients for all (wis). It is better to be
    specific.
  • Can this be done random polynomial time? Or
    efficient approximation? Currently too slow.
  • Fast computation is important for distant
    homology.

59
Side Topic Structure Alignment
  • Problem given two proteins with structures,
    align them together.
  • A simple algorithm repeat for each combination
    of three aa from one protein and three from
    another, align the six, then align the rest (by
    dynamic programming). This algorithm is about n8
    by Takutsu.

60
Chapter 8. Repeating Patterns
  • Over 45 human genome are (approximate) repeats.
    More in plants.
  • Some claim that genomics is a science of
    finding/studying genomic repeats. Genomes evolves
    by duplicating (then evolve) parts.
  • How do we find approximate genome level repeats?
  • We wrote the such a program PattenHunter (Ma,
    Tromp) and visualization tool (Miller)
  • Larry Miller/John Tromp demo.

61
Repeats Arabidopsis Chr 2 vs Chr 1p
62
Chapter 9. Coding Region Detection
  • Prokaryotic genome and enkaryotic genomes have
    different properties
  • Regulatory regions
  • Starting and ending codons
  • Gene density
  • Introns
  • Such programs usually use HMM to detect ORFs and
    intron/exon regions, and use EST databases, BLAST
    search, and species-specific statistics GenScan,
    Genie, GenMark..
  • Other programs use Neural Networks GRAIL

63
How it works in nature
  • How does it work in nature
  • Prokaryotes do not have introns
  • Single cell eukaryotes have less introns
  • Other eukaryotes may have up to 90 introns.

64
  • Spliceosome cuts off the introns which often
  • start with GU, ends with AG
  • In the middle, branch site ends with AC (match
    GU)

Note Yc/u, Nany
65
Hidden Markov Model
  • HMM is lt?,Q,A,Egt where
  • ? is symbol alphabet
  • Q is set of states (that emit symbols)
  • A is QxQ matrix of state transition
    probabilities
  • E is Qx? matrix of emission probabilities.
  • A path ??1 ?n is a sequence of states. The
    probability a sequence is generated by ? is
  • P(x?) ? P(xi?i)P(?i ?i1)
  • Decoding Problem Given x, find ? maximizing
    P(x?).
  • Fake/fair Coin Example Coinlt?,Q,A,Egt
  • ?0,1 (tail/head)
  • QF,B (fair, biased)
  • aFFaBB0.9, aFBaBF0.1
  • eF(0)1/2, eF(1)1/2, eB(0)1/4, eB(1)3/4
  • Decoding Problem solved by Viterbi in 1967, using
    Dynamic Programming.
  • Parameter estimation

66
A HMM Scheme for Gene Finder
67
Chapter 10. Phylogeny (Good reference Biological
sequence analysis, probabilistic models of
proteins Durbin, Eddy, Krogh, Mitchison,
Cambridge Univ Press.)
  • Consider a gene seq. from each of k species, it
    is possible to infer evolutionary relationship of
    these k species from these k genes.
  • Many heuristics for building evolutionary trees
    exist. The problem is known to be NP-hard.
  • We will discuss various algorithms
  • Max likelihood method fastNDAml/PROTml
  • Neighbor Joining
  • Pasimony
  • UPGMA
  • Quartet cleaning (SODA00)
  • PTAS finding the most consistent tree (FOCS98).

68
Tree of Evolution
  • AAAGGTACC

G ? T mutation
AAATGTACC
A ? G mutation
AAATGTACC
AAATGTGCC
A ? T mutation
AAATGTGCC
TAATGTGCC
69
UPGMA (Sokal, Michener, 1958)
0.1
0.1
0.1
0.4
0.4
  • Initialize
  • Ci si, for all i.
  • Repeat until one cluster left
  • Find two clusters Ci, Cj with minij
    dij(?dpq)/CiCj, p?Ci, q?Cj
  • Define node k with i,j as children, edge weight
    dij
  • Form cluster k, remove i,j clusters.

Problem of UPGMA
70
Neighbor Joining (Saitou-Nei, 1987)
  • Initialize
  • Tsequences, LT
  • Choose i,j?L such that dij-ri-rj minimized. Rest
    similar to UPGMA with similar modification on
    edge weights to k.
  • Here, ri, rj are the average distances from i,j
    to other nodes in L to compensate long edges.

71
Parsimony
  • Finds the tree with minimal number of
    substitutions.
  • Extremely slow. Use branch and bound.
  • Even this problem is not easy given the tree
    with leaves as sequences, find the best way to
    assign sequences to internal nodes so that the
    mutations are minimized. (Jiang-Lawler-Wang, 1994
    STOC)

72
Maximum Likelihood
  • Jukes-Cantor (1969) model equal substitution
    probability (1 parameter ?)
  • Kimura (1980) (2 parameter) model purine (A,G)
    to purine and pyrimidine (C,T,U) to pyrimidine
    substitution at different rates (? and ?).
  • Max Likelihood method wants a tree with max
    probability under these models. You have to
    exhaustively search through all trees, extremely
    slow process.

73
Quartet Methods
  • For every four sequences, construct a tree of 4
    nodes (quartet), using max likelihood say.
  • Then build a large tree consistent with (most of)
    these small trees (of size 4). But this step is
    (NP) hard.
  • New appoach data correction (see our papers in
    FOCS98, SODA00 on our bioinformatics lab
    website)

74
Quartets and Correction
c
a
Original tree
d
b
c
a
b
e
c
d
a
d
a
b
e
e
b
d
b
d
a
e
c
correction
c
a
e
c
error
e
d
75
HyperCleaning Software
  • For less than 30 taxa, HyperCleaning is
    comparable to fastDNAml (using maximum likehood
    score), and better than NJ.
  • For more than 30 taxa, true ML and MP methods
    take days and produces poor results.
    HyperCleaning do well, with better ML score.

76
(No Transcript)
77
Chapter 11. DNA Compression
  • Life represents order, not chaos. It should be
    compressible. Biological evidences
  • In eukaryotes long tandem repeats
  • multiple copies of essential genes (rRNA)
  • only 1000 protein folding patterns
  • genes duplicate themselves for evolutionary or
    selfishness purposes

78
GenCompress (Chen, Kwong, Li, GIW99,RECOMB00)
  • Lempel-Ziv style, one pass.
  • Encodes approx. matches (edit distance).
  • Encodes approximate reverse complements.
  • Carefully designed gain function and encoding.
    Arithmetic encoding when needed.
  • Works for conditional compression.
  • Best compression ratio for benchmark data.

79

Comparison with Biocompress-2
Biocompress-2 is by Grumbach-Tahi
80
Comparison with Cfact
Cfact is by Rivals et al
81
Chapter 12. Whole-Genome Phylogeny
  • Completely Sequenced
  • C. elegans, H. sapiens, D. melanogaster, Rice
  • 50 species of archaea bacteria and eubacteria
  • What can we do with them?
  • Snel, Bork, Huynen compare gene contents
  • Boore, Brown gene order
  • Sankoff, Pevzner, Kececioglu reversal/translocati
    on/synteny

82
New Thinking
  • Lets first give up traditional a priori
    conception of important similarities between 2
    sequences.
  • Can we simply define a distance d(x,y) that is
    universal if two seqs are close in any sense,
    then they are close under d(x,y)?

83
Defining a Distance
  • Shared Information
  • K(xy) is Kolmogorov complexity
  • of x condition on y, defined as
  • the length of shortest program
  • that outputs x on input y.
  • K(x) K(xy) is mutual
  • information

K(x) - K(xy) d(x,y) 1 -
----------------------
K(xy)
84
Universality
  • It can be proved, for any other computable
    normalized measure D(x,y) that satisfies some
    reasonable neighborhood density property, then we
    have there is a constant c such that for all
    x,y, we have
  • d(x,y) lt cD(x,y)
  • Informally speaking any similarity detected by D
    is also detected by d !

85
Approximate K(xy)
  • But K(xy) is not computable!!
  • We approximate K(xy) by Compress(xy).
  • DNAs are over alphabet A,C,G,T. Trivial
    algorithm gives 2 bits per base.
  • All commercial software like compress,
    compact, pkzip, arj give gt 2 bits/base
  • We will use GenCompress, since this is the best
    available compression alg for DNA sequences.

86
First Experiment Genome Tree
  • On a single gene (or several), current methods
  • Max. likelihood assumes statistical evolutionary
    models, compute the most likely tree. Based on
    multiple alignment.
  • Max. parsimony needs multiple alignment, then
    finds the best tree, minimizing cost.
  • Distance-based methods NJ Quartet methods,
    Fitch-Margoliash method.
  • Trouble different gene trees, manual alignment,
    horizontally transferred genes, do not handle
    genome level events.

87
Whole Genome Phylogeny
  • We provide an alternative approach. It
  • uses all the information in a genome
  • completely automated
  • needs no alignment robust, fast, accurate
  • no model -- simply universal
  • generalizes alignment distance, reversal
    distance, translocation distance
  • one genome, one evolution, one tree.

88
Eutherian Orders
  • It has been a disputed issue which two groups of
    placental mammals are closer Primates,
    Ferungulates, Rodents.
  • Hasegawas group concatenated 12 proteins from
    rat, house mouse, grey seal, harbor seal, cat,
    white rhino, horse, finback whale, blue whale,
    cow, gibbon, gorilla, human, chimpanzee, pygmy
    chimpanzee, orangutan, sumatran orangutan, with
    opossum, wallaroo, platypus as out group, 1998.
  • They used Max Likelihood method.

89
Eutherian Orders ...
  • We use exactly the same species.
  • And complete mtDNA genome
  • We computed d(x,y) for each pair of species, and
    used Neighbor Joining in Molphy package (and our
    own hypercleaning).
  • We constructed exactly the same tree. Confirming
  • ((Primates, Ferungulates), Rodents)

90
Evolutionary Tree of Mammals
91
2nd Exp Chaining Chain Letters
  • Charles Bennett collected 33 chain letters
    1980--1997. Using our new method, we
    reconstructed their history. Answered open
    questions of chain letter experts. Will appear in
    Scientific American.
  • Like a gene, they are about 2000 characters like
    a virus, they have infected billions of people,
    they mutate just like a genome, even has
    horizontal transfer. Traditional phylogeny
    methods should also work on them. But they dont
    alignment fail--translocated sentences no model
    of evolution.
  • We used our own method to calculate shared
    information between each pair of chain letters.

92
A sample letter
93
Another typical chain letter
with love all things are possible this paper has
been sent to you for good luck. the original is
in new england. it has been around the world
nine times. the luck has been sent to you. you
will receive good luck within four days of
receiving this letter. provided, in turn, you
send it on. this is no joke. you will receive
good luck in the mail. send no money. send
copies to people you think need good luck. do
not send money as faith has no price. do not keep
this letter. It must leave your hands within 96
hours. an r.a.f. (royal air force)
officer received 470,000. joe elliot received
40,000 and lost them because he broke the
chain. while in the philippines, george welch
lost his wife 51 days after he received the
letter. however before her death he received
7,755,000. please, send twenty copies and see
what happens in four days. the chain comes from
venezuela and was written by saul anthony de
grou, a missionary from south america. since
this letter must tour the world, you must make
twenty copies and send them to friends and
associates. after a few days you will get a
surprise. this is true even if you are not
superstitious. do note the following
constantine dias received the chain in 1953. he
asked his secretary to make twenty copies and
send them. a few days later, he won a lottery of
two million dollars. carlo daddit, an office
employee, received the letter and forgot it had
to leave his hands within 96 hours. he lost his
job. later, after finding the letter again, he
mailed twenty copies a few days later he got a
better job. dalan fairchild received the letter,
and not believing, threw the letter away, nine
days later he died. in 1987, the letter was
received by a young woman in california, it was
very faded and barely readable. she promised
herself she would retype the letter and send it
on, but she put it aside to do it later. she was
plagued with various problems including
expensive car repairs, the letter did not leave
her hands in 96 hours. she finally typed the
letter as promised and got a new car. remember,
send no money. do not ignore this. it works. st.
jude
94
Phylogeny of 33 Chain Letters
Confirmed by VanArsdales study, answers an open
question
95
Chapter 13. Expression Arrays
  • Expression arrays can hold thousands of genes
    (DNA) such that cDNA (complementary DNA) can
    hybridize on them.
  • The process is explained next page (J. Buhler).
  • This has opened great new opportunities for
    proteomics and biological pathway studies. For
    example, there are expression arrays containing
    complete set of genes for Yeast (Pat Brown Lab,
    Stanford).
  • Now that the human genome is completed with about
    30k genes. It is conceivable for a chip to
    contain the complete set of human genes.

96
1. Two kind cells
2. Reverse transcribe mRNA to cDNA
3. Label cDNA with fluorescein
4. Hybridize on the DNA array 5. Read
array 6. Analyze it.
97
Analysis Techniques
  • Finding genes expressed in one kind of cells and
    not in the other.
  • Cluster analysis finding what genes expressed
    together under some condition, to deduce
    pathways.
  • Data mining, machine learning techniques gene
    expression dependency. E.g. the expression of
    gene A implies the expression of gene B and no
    expression of gene C.
  • Endless possibilities, enormous data.
  • http//www.cs.ucsb.edu/mli/expression.html

98
An example of clustering
99
Chapter 14. Bioinformatics Platform
  • GCG (Wisconsin package)
  • Biotools
  • MacVector
  • Doubletwist webservice
  • NCBI webservice
  • Omega
  • Specialized packages like PAUP (phylogeny),
    GenePix (expression array)
  • Many more.

100
DryLab Open, distributed, platform independent
101
Chapter 15 Projects/Missing Topics
  • I wish I could cover all, but the time is short.
    Part of the missing topics will be covered by
    student projects/presentations.
  • Each student will work on one project.
  • And present the project in class.
  • Most projects involve serious programming in
    either C or Java.
  • Please start early and discuss with me first.
    Check our web page for projects.
  • When a project is large, I will allow two
    students to work on a project together.

102
Major missing topics
  • Genome rearrangement distances
  • Physical mapping
  • DNA arrays / Sequence by hybridization
  • Drug target design (from computational biology to
    computational chemistry).
  • RNA structure / folding.
  • Statistical studies / properties.

103
Acknowledgements
  • I would like to thank many coauthors, colleagues,
    students whose work I used here J. Badger, C.
    Bennett, B. Brejova, X. Chen, T. Jiang, P.
    Kearney, S. Kwong, C. Liang, G.H. Lin, B. Ma, L.
    Miller, H.D. Sun, J. Tromp, T. Vinar, L. Wang, Z.
    Wang, D. Xu, Y. Xu, H. Zhang.
  • I have copied some materials/figures from the
    internet I thank these authors as well.
  • Most importantly, thanks to the enthusiastic 290I
    students of Spring 2001 at UCSB who made this
    course fun to teach.
Write a Comment
User Comments (0)
About PowerShow.com