A Course on Bioinformatics

1
A Course on Bioinformatics
Ming Li, Bioinformatics Lab, Computer Science Dept., UCSB
2
Chapter 0. Why Study Bioinformatics?

The trend of genetic data growth
Many experts predict the 21st century will be a century of biotechnology.
Genomes: human, rice, mouse, yeast, 50 species of bacteria; the biggest digital gold mine ever.

30 billion in year 2005
3
Chapter 1. Biology
Phenotype
4
Animal CELL
Mitochondrion
Nucleolus (rRNA synthesized)
Cytoplasm
Nucleus
Plasma membrane, cell coat
Chromatin
Lots of other stuff/organelles/ribosome
5
Two kinds of Cells

Prokaryotes: no nucleus (bacteria)
Their genomes are circular
Eukaryotes: have a nucleus (animals, plants)
Linear genomes with multiple chromosomes in pairs. When pairing up, they look like

Middle: centromere. Top: p-arm. Bottom: q-arm
6
Mitosis and Meiosis

Mitosis: chromosomes duplicate, and one full set goes to each daughter cell.
Meiosis: chromosomes split into haploid (reproductive) cells.
Haploid number of chromosomes:
Human: 23
Rice: 12
Fruit fly: 4
Corn: 10
Chimpanzee: 24
House mouse: 20

7
The DNA Sequences

All are made of H (hydrogen), C (carbon), N
(nitrogen), O (oxygen), S (sulfur), P
(phosphorus).
A (single chain) DNA sequence looks like

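[Figure: a single DNA chain running from the 5' end to the 3' end. Each unit is a sugar carrying a base (A, C, G, T, U), linked to the next sugar through a phosphate (PO4). One sugar with its base and phosphate is a nucleotide.]
Ribose: RNA. Deleting the circled O gives deoxyribose: DNA.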
8
The 4 bases
[Figure: chemical structures of the four bases, drawn as the A-T pair (adenine, thymine) and the G-C pair (guanine, cytosine).]
Note: this is flat!
Uracil replaces T in RNA.
9
Human: 3 billion bases, 30k genes. E. coli: 5 million bases, 4k genes.
[Figure: double-stranded DNA (base pairs T-A, A-T, C-G, G-C) is transcribed into mRNA (A, C, G, U), which is translated into protein (20 amino acids). Reverse transcription of mRNA gives cDNA.]
Codon: three nucleotides encode an amino acid. 64 codons, 20 amino acids; some amino acids have more than one code.
10
Genes and Proteins

One gene encodes one protein.
Like a program, a gene starts with a start codon (e.g. ATG); then each group of three nucleotides codes for one amino acid. A stop codon (e.g. TGA) signifies the end of the gene.
Sometimes, in the middle of a (eukaryotic) gene, there are introns that are spliced out (as junk) during transcription. The good parts are called exons. Locating them is the task of gene finding.

11
A.A. Coding Table

N stands for any of the four bases.

Arginine (ARG): CGN, AGA, AGG
Asparagine (ASN): AAT, AAC
Glutamine (GLN): CAA, CAG
Cysteine (CYS): TGT, TGC
Methionine (MET): ATG
Phenylalanine (PHE): TTT, TTC
Tyrosine (TYR): TAT, TAC
Tryptophan (TRP): TGG
Histidine (HIS): CAT, CAC
Proline (PRO): CCN
Stop: TGA, TAA, TAG

Glycine (GLY): GGN
Alanine (ALA): GCN
Valine (VAL): GTN
Leucine (LEU): CTN, TTA, TTG
Isoleucine (ILE): ATN except ATG (ATT, ATC, ATA)
Serine (SER): TCN, AGT, AGC
Threonine (THR): ACN
Aspartic Acid (ASP): GAT, GAC
Glutamic Acid (GLU): GAA, GAG
Lysine (LYS): AAA, AAG
Start: ATG, CTG, GTG
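
To make the "gene as a program" picture concrete, here is a minimal sketch that reads a DNA string from the first ATG to the first stop codon, using the standard genetic code. The function name `translate` and the compact table layout are illustrative choices, not something from the course; the sketch assumes an uppercase A/C/G/T string and ignores introns and alternative start codons.

```python
from itertools import product

# Standard genetic code, listed with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate from the first ATG up to (not including) the first stop codon."""
    start = dna.find("ATG")
    if start < 0:
        return ""
    protein = []
    for i in range(start, len(dna) - 2, 3):
        amino = CODON_TABLE[dna[i:i + 3]]
        if amino == "*":          # stop codon: TGA, TAA or TAG
            break
        protein.append(amino)
    return "".join(protein)
```

For example, `translate("CCATGGCCAAATGATT")` reads ATG GCC AAA and stops at TGA.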

12
Bad Genes -> Genetic Diseases

Hemophilia: on the X chromosome.
Cystic fibrosis: on chr. 7, CFTR gene. 4% of Caucasians carry the defective gene (recessive).
Sickle-cell anemia: single nucleotide mutation in the first exon of the beta-globin gene (removes a cutting site). 1 in 12 African Americans are carriers (homozygotes are sick).
BRCA1 gene (chr. 17q): responsible for ½ of inherited breast cancer (10% of breast cancer).
Fragile X syndrome (intellectual disability): 1 in 1250 males, 1 in 2500 females (dominant, but females have a partially expressed good gene). FMR-1 gene: more than 200 tri-nucleotide repeats cause the disease.
P53 gene: chr. 17p, responsible for ½ of all cancers.
X-rated: XX, XY, XO(f), XXY(m), XYY(m)

13
Where to get data?

Larry Miller GenBank
http://www.ncbi.nlm.nih.gov
Dr. Huandong Sun
SWISS-PROT http://www.expasy.ch/sprot
PDB http://www.pdb.bnl.gov/
Go to our Bioinformatics Lab website
www.cs.ucsb.edu/mli/Bioinf/resources/index.html
for all the information.

14
Chapter 2. Research and Industry

Bioinformatics labs and education programs are established in almost all major universities, scattered in Biology, Physics, Computer Science.
Research topics:
Gene-finding
Mapping
Sequencing/sequence assembly
Sequence comparison: alignment, database search
Mining DNA/proteins, finding motifs
Genome arrangement
DNA arrays
Computational proteomics, protein folding
Phylogeny reconstruction and analysis

15
The Commercial Market

The current bioinformatics market is worth USD 300 million / year. Half data, half software.
A USD 2 billion / year bioinformatics market in 5 years is predicted by Oscar Gruss & Son.
Wide range of different demands.
Bioinformatics software requires sophisticated algorithm support, ranging from probabilistic/approximation algorithms, data mining, learning algorithms, and databases to GUI design.

16
BUSINESS Landscape
Pharmaceutical Companies
Universities Research Labs
Hospitals, Biotech firms
Sell Data Service
Sell large Systems
Sell Web Service
17
Bioinformatics Companies

Genomatrix Software, Genaissance Pharmaceuticals,
Lynx, Lexicon Genetics, DeCode Genetics, CuraGen,
AlphaGene, Bionavigation, Pangene, InforMax,
TimeLogic, GeneCodes, LabOnWeb.com, Darwin,
Celera, Incyte, BioResearch Online, BioTools,
Oxford Molecular, Genomica, NetGenics, Rosetta,
Lion BioScience, DoubleTwist, eBioinformatics,
Prospect Genomics, Neomorphic, Molecular Mining,
GeneLogic, GeneFormatics, Molecular Simulations,
Total 50.

18
Challenges to Computer Science

Enormous size of genomics data: suitable for asymptotic analysis.
Many NP-hard problems defy simple-minded/non-professional approaches: multiple alignment, distant homology, motif finding, protein folding, phylogeny, gene relationships in expression data, mining and learning, ...
Approaching these problems from a computer science perspective will be fruitful.

19
Algorithmic Techniques in CB

Dynamic programming (alignment, etc)
Divide and conquer (Protein folding)
Approximation algorithms (sequence assembly,
phylogeny)
Greedy algorithms (sequence assembly)
Heuristics
Linear programming and relaxation

20
Some CS jargon

NP-complete: easy to guess, hard to find.
Approximation algorithm: if the minimal solution is f and your solution is g > f, then the approximation ratio is g/f.
PTAS (polynomial time approximation scheme): this is the best kind of approximation algorithm. For any ε > 0, we can achieve (1 + ε) × optimal in polynomial time (the exponent might depend on ε).

21
Chapter 3. DNA Sequencing

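Two ways to copy DNA
1. Polymerase Chain Reaction (1986): make many copies of a fragment (length about 500). Needs primers (end segments). Cleave the double chain into 2 single chains. Anneal primers (5' -> 3' and 3' -> 5'). Make two double chains. Repeat: 1, 2, 4, 8, 16, ...

[Figure: the two chains, labeled 5'-3' and 3'-5', being separated and copied.]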
22

2. Cloning. Insert a large piece of DNA into a cloning vector (virus, bacteria, yeast: YAC). Then the vector is inserted into a host cell to duplicate naturally.
DNA Sequencing
Make many copies (single strand).
Cut them into fragments of length about 500.
For each fragment of length L, use some process like PCR to generate all prefixes of lengths 1 to L, labeled with a fluorescent dye. In the old scheme, you generate all fragments ending with A, then with C, then G, then T, and run them in 4 lanes of an electrophoresis gel. In the new scheme, you have 4 colors (of the dye) and all fragments run in 1 lane.
Then assemble all fragments into the shortest common superstring by GREEDY: repeatedly merge the pair with maximum overlap until finished.

23
Shortest Common Superstring

In FOCS 1990, we started to formalize and analyze the following learning problem: infer the original DNA sequence from fragments. Or: given n strings, find the shortest common superstring. (1994, J. ACM)
The problem was proved to be NP-complete in 1980.
Open for 10 years: does GREEDY work? (I.e. does it give a linear approximation?)
We solved this, proving a factor of 3 (STOC'91).
Improvements by many people to 2.89, 2.81, 2.79, 2.75, 2.66, ...
Formal statement: given S = {s1, ..., sn}, find a shortest s such that for all i, si is a substring of s.
E.g. alf ate half lethal alpha alfalfa -> lethalphalfalfate

24
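Theorem. GREEDY achieves 4n, where n is the length of the shortest common superstring.

Proof. Given S = {s1, ..., sm}, construct a graph G:
Nodes are s1, ..., sm.
Edges: for each pair si, sj, add an edge from si to sj with weight pref(si, sj), where pref is the prefix length, i.e. |si| = pref(si, sj) + overlap length of si with sj.
|SCS(S)| >= length of the shortest Hamilton cycle in G.
(Modified) Greedy restated: find all cycles with minimum weights in G, then open each cycle and concatenate to obtain the final superstring.

[Figure: si followed by sj with an overlap; pref is the part of si before the overlap.]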
25
This minimum cycle exists

Assume the initial Hamilton cycle C has w(C) = n.
Then merging si with sj is equivalent to breaking C into two cycles C1 and C2. We have
w(C1) + w(C2) < n
Proof: we merged (si, sj) because they have maximum overlap. The picture shows
d(si, sj) + d(s', s'') < d(si, s'') + d(s', sj)
Continue this process; we end with self-cycles C1, C2, C3, C4, ...
Σ w(Ci) < n.

[Figure: the cycle C through si, sj, s', s'' is split into the cycles C1 and C2.]
26

Then we open cycles.
Let Wi = w(Ci)
Li = |longest string in Ci|
|open Ci| < Wi + Li
n > Σ Wi
Lemma. S1 and S2 overlap < w1 + w2
Σ (Li - 2Wi) < n, by the lemma, since the Li's must be in the final SCS.
|S| < Σ (Li + Wi)
= Σ (Li - 2Wi) + Σ 3Wi
< n + 3n
= 4n.
QED

[Figure: strings s1 and s2 from different cycles, with periods w1 and w2.]
27
Chapter 4. Sequence Comparison

Sequence Alignment
$s(i,j) = \max \{\, s(i-1,j) + d(v_i,-),\ s(i,j-1) + d(-,w_j),\ s(i-1,j-1) + d(v_i,w_j) \,\}$
No gap penalties, where d, for proteins, is either PAM or BLOSUM.
$d(-,x) = d(x,-) = -a$, $d(x,y) = -u$ for $x \ne y$. When $a = 0$, $u = \infty$, it is LCS.

Longest Common Subsequence (LCS).
$V = v_1 v_2 \cdots v_n$
$W = w_1 w_2 \cdots w_m$
$s(i,j)$ = length of the LCS of $V[1..i]$ and $W[1..j]$
Dynamic Programming
$s(i,j) = \max \{\, s(i-1,j),\ s(i,j-1),\ s(i-1,j-1) + 1 \text{ if } v_i = w_j \,\}$
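
The recurrence translates directly into a table-filling loop. The sketch below assumes a simple scoring scheme (one score for a match, one for a mismatch, one per gap symbol) in place of PAM or BLOSUM; the function name and default values are illustrative. Setting the gap score to 0 and the mismatch score to minus infinity gives the LCS length, as on the slide.

```python
def align_score(v, w, match=1, mismatch=-1, gap=-1):
    """Score of the best global alignment of v and w."""
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):           # v[:i] aligned against nothing
        s[i][0] = s[i - 1][0] + gap
    for j in range(1, m + 1):           # w[:j] aligned against nothing
        s[0][j] = s[0][j - 1] + gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = match if v[i - 1] == w[j - 1] else mismatch
            s[i][j] = max(s[i - 1][j] + gap,     # v_i against a gap
                          s[i][j - 1] + gap,     # w_j against a gap
                          s[i - 1][j - 1] + d)   # v_i against w_j
    return s[n][m]


def lcs_length(v, w):
    """LCS as the special case a = 0, u = infinity."""
    return align_score(v, w, match=1, mismatch=float("-inf"), gap=0)
```

Both functions use O(nm) time and space; only the score is returned, not the alignment itself.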

28
Misc. Concerns

Local sequence alignment: add the option $s(i,j) = 0$ to the maximum.
Gap penalties. For good reasons, we charge the first gap cost a, and then each subsequent contiguous insertion b < a.
Space-efficient sequence alignment: Hirschberg algorithm, in $O(n^2)$ time and $O(n)$ space.
Multiple alignment of k sequences: $O(n^k)$
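
As a small illustration of the first point, the sketch below adds the 0 option to the recurrence and keeps only two rows of the table, so the score is computed in linear space. It assumes the same simple match/mismatch/gap scores as before (not PAM or BLOSUM, and no separate gap-open cost). Note that this is not Hirschberg's algorithm: recovering the alignment itself in linear space needs his divide-and-conquer step.

```python
def local_align_score(v, w, match=1, mismatch=-1, gap=-1):
    """Best local alignment score of v and w, using two table rows."""
    best = 0
    prev = [0] * (len(w) + 1)
    for i in range(1, len(v) + 1):
        cur = [0] * (len(w) + 1)
        for j in range(1, len(w) + 1):
            d = match if v[i - 1] == w[j - 1] else mismatch
            cur[j] = max(0,                  # start a fresh local alignment
                         prev[j] + gap,
                         cur[j - 1] + gap,
                         prev[j - 1] + d)
            best = max(best, cur[j])
        prev = cur
    return best
```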

29
BLAST / Psi-BLAST / Gap-BLAST

Popular software, using heuristics. By Altschul, Gish, Miller, Myers, Lipman, 1990.
E(xpected)-value: $E = d\,m\,n\,e^{-\lambda S}$; here S is the score, m is the database length and n is the query length.
Meaning: E is the number of hits one can expect to see just by chance when searching a database of size m.
Basic strategy: for all triples of 3 amino acids (and closely related triples), remember their locations in the database. Given a query sequence S, for all triples in S, find their locations in the database, then extend as long as the E-value is significant.
Similar strategy for DNA (7-11 nucleotides).
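
The seed-and-extend idea can be sketched as follows. This is a deliberately simplified illustration, not BLAST itself: it indexes exact k-letter words only (no "closely related" triples), and it extends a seed while the letters match exactly instead of tracking a score or an E-value. All function names are hypothetical.

```python
from collections import defaultdict


def build_index(database, k=3):
    """Map every k-letter word to the positions where it occurs."""
    index = defaultdict(list)
    for pos in range(len(database) - k + 1):
        index[database[pos:pos + k]].append(pos)
    return index


def seed_hits(query, index, k=3):
    """All (query position, database position) pairs sharing a k-letter word."""
    hits = []
    for qpos in range(len(query) - k + 1):
        for dpos in index.get(query[qpos:qpos + k], []):
            hits.append((qpos, dpos))
    return hits


def extend(query, database, qpos, dpos, k=3):
    """Grow a seed left and right while letters agree (no gaps).

    Returns (query start, database start, length)."""
    left = 0
    while (qpos - left > 0 and dpos - left > 0
           and query[qpos - left - 1] == database[dpos - left - 1]):
        left += 1
    right = k
    while (qpos + right < len(query) and dpos + right < len(database)
           and query[qpos + right] == database[dpos + right]):
        right += 1
    return qpos - left, dpos - left, left + right
```

Building the index once and reusing it for many queries is what makes this strategy attractive for database search.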

30
Here is an example of a BLAST match (E-value 0) between gene 0189 in C. pneumoniae and gene 131 in C. trachomatis.

Query: CPn0189
Aligned with CT131 hypothetical protein. Score (bits): 1240. E-value: 0.0

Query 1 MKRRSWLKILGICLGSSIVLGFLIFLPQLLSTE SRKYLVFSL I HKESGLSCSAEELKISW 60
MKR W KI G L L L LP SES KYL SKEGL EL SW
Sbjct 1 MKRSPWYKIFGYYLLVGVPLALLALLPKFFSSESGKYLFLSVLNKETGLQ F EIEQLHLSW 60

Query 61 FGRQTARKIKLTG-EAKDEVFS AEKFELDGSLLRLL I YKKPKGITLSGWSLKINEPASID 119
FG QTAKI G EFAEK GSL RLLY PK TLGWSLIE S
Sbjct 61 FGSQTAKKIRIRGIDSDSEIFAAEKI IVKGSLPRLLL YRFPKALTLTGWSLQIDESLSMN 120

Etc.
Note: Because of PowerPoint character alignment, I inserted some white space in the alignment that is not in the BLAST output; I will explain in class. These white spaces were inserted so that things line up right (the k-th letter goes with the k-th letter).
31
Chap. 4. Tools From CS

Suffix Array. Given the sequence
she_sells_seashell_on_the_sea_shore
the suffix array is an array of pointers pointing to the suffixes of the sequence, sorted.
E.g., all suffixes of the above are e, re, ore, hore, shore, _shore, a_shore, ea_shore, sea_shore, and so on.
Then these are sorted and their pointers (each pointing to a suffix location in the sentence) are stored in an array (the suffix array).
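
A minimal sketch of the definition: sort the starting positions by the suffix that begins there. This naive version copies suffixes while sorting, so it is only meant for short examples; fast construction algorithms exist but are not shown here.

```python
def suffix_array(text):
    """Start positions of all suffixes of text, in sorted suffix order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


sentence = "she_sells_seashell_on_the_sea_shore"
sa = suffix_array(sentence)
for start in sa[:5]:
    # Each pointer is just an index into the sentence.
    print(start, sentence[start:])
```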

32
Why Suffix Array?

It can be built very fast.
It can answer queries very fast:
How many times does ATG appear? (The pointers are all jammed together.)
What is the G-C content?
Disadvantages:
Can't do approximate matching.
Hard to insert new stuff dynamically (need to rebuild the array).
Pointers can cost too much space. 3G pointers?
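
Because all suffixes starting with the same word sit next to each other in the array, counting occurrences takes two binary searches. The sketch below uses a naive construction and hypothetical helper names; it compares only the first |P| letters of each suffix, so a query costs about |P| log n letter comparisons.

```python
def suffix_array(text):
    """Naive construction, adequate for short examples."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def _boundary(text, sa, pattern, after_equal):
    """First index in sa whose suffix prefix is >= pattern (or > pattern)."""
    lo, hi = 0, len(sa)
    m = len(pattern)
    while lo < hi:
        mid = (lo + hi) // 2
        prefix = text[sa[mid]:sa[mid] + m]
        if prefix < pattern or (after_equal and prefix == pattern):
            lo = mid + 1
        else:
            hi = mid
    return lo


def count_occurrences(text, sa, pattern):
    """Number of positions in text where pattern occurs."""
    first = _boundary(text, sa, pattern, after_equal=False)
    last = _boundary(text, sa, pattern, after_equal=True)
    return last - first
```

For example, with `text = "ATGCATGATG"` and `sa = suffix_array(text)`, `count_occurrences(text, sa, "ATG")` counts the three ATG positions.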

33
Fast Pattern Matching

Given T (text) and P (pattern), is P in T?
Slow algorithm: for each position in T, check if P occurs there. This costs O(|P||T|).
Fast algorithm: no backtracking. P is short, so we can calculate a table for P first. Suppose we have matched P = P1 ... Pk against T and halfway fail to match Pi with Tj. Since P1 ... Pi-1 already matched the text, we should know where to restart in P and continue to match at Tj+1. The table will indicate, for each position Pi, where to start next if we fail at i. O(|P|^2 + |T|).
Knuth-Morris-Pratt, Boyer-Moore: O(|P| + |T|).
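
Here is a sketch of the table idea in its Knuth-Morris-Pratt form, where the table itself is computed in O(|P|) time. `fail[i]` is the length of the longest proper prefix of P[0..i] that is also a suffix of it, which tells the matcher where to resume in P after a mismatch. The function names are illustrative.

```python
def failure_table(p):
    """fail[i] = length of the longest proper prefix of p[:i+1] that is also its suffix."""
    fail = [0] * len(p)
    k = 0
    for i in range(1, len(p)):
        while k > 0 and p[i] != p[k]:
            k = fail[k - 1]
        if p[i] == p[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_search(t, p):
    """Index of the first occurrence of p in t, or -1. Never moves backwards in t."""
    if not p:
        return 0
    fail = failure_table(p)
    k = 0                                  # number of pattern letters matched so far
    for j, ch in enumerate(t):
        while k > 0 and ch != p[k]:
            k = fail[k - 1]                # fall back inside the pattern only
        if ch == p[k]:
            k += 1
        if k == len(p):
            return j - k + 1
    return -1
```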

34
Chapter 5 Multiple alignment

To do phylogenetic analysis:
Same protein from different species
Optimal multiple alignment probably implies history
Discover irregularities, such as the Cystic Fibrosis gene
To find conserved regions:
Local multiple alignment reveals conserved regions
Conserved regions usually are key functional regions
These regions are prime targets for drug development

35
Definitions

Given sequences s1, ..., sn, a multiple alignment M puts them in n rows, one sequence per row, with spaces inserted to get supersequences S1, ..., Sn, |Si| = L.
SP-Alignment: minimize the sum of Hamming(Si, Sj) over all pairs i, j.
Star-Alignment: find a center sequence S, minimizing the sum of Hamming(S, Si) over all i.
We will concentrate on SP-alignment.
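
The two objectives are easy to state in code. The sketch below only scores a given alignment; it does not search for the best one. It assumes rows of equal length with "-" for an inserted space, and it counts a space against a letter as a mismatch and a space against a space as a match, which is one possible convention.

```python
from itertools import combinations


def hamming(a, b):
    """Number of columns in which two aligned rows differ."""
    if len(a) != len(b):
        raise ValueError("aligned rows must have equal length")
    return sum(x != y for x, y in zip(a, b))


def sp_score(rows):
    """Sum-of-pairs cost: Hamming distance summed over all pairs of rows."""
    return sum(hamming(a, b) for a, b in combinations(rows, 2))


def star_score(center, rows):
    """Star cost: Hamming distance from the center to every row."""
    return sum(hamming(center, r) for r in rows)
```

For the rows `["AC-GT", "ACAGT", "TC-GT"]` the SP cost adds up the three pairwise distances 1, 1 and 2.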

36
CLUSTAL-W

Standard popular software.
It does multiple alignment as follows:
Align 2 sequences.
Repeat: keep on adding a new sequence to the alignment until no more are left, or use tree-like heuristics.
Problem: it is simply a heuristic.
Alternative: dynamic programming, $O(n^k)$ for k sequences. This is simply too slow.
We need to understand the problem and solve it right.

37
Making the Problem Simpler!

Multiple alignment is very hard:
For k sequences, $O(n^k)$ time by dynamic programming.
NP-hard in general, and not clear how to approximate.
Popular practice: alignment within a band. The p-th letter in one sequence is not more than c places away from the p-th letter in another sequence in the final alignment; the alignment is along a diagonal band of width 2c.
Used in the final stage of the FASTA program.

38
Literature (for details, see our STOC'00 paper)

NP-hardness under various models: Wang-Jiang (JCB), Li-Ma-Wang (STOC'99), W. Just.
Approximation results: Gusfield (2 - 1/L), Bafna, Lawler, Pevzner (CPM'94, 2 - k/L), star alignment.
Sankoff, Kruskal discussed alignment within a band.
Pearson showed alignment within a band gives very good results for a lot of protein superfamilies.
Altschul and Lipman, Chao-Pearson-Miller, Fickett, Ukkonen, Spouge (survey) all have studied alignment within a band.

39
The following were proved

SP-Alignment
NP hard
PTAS for constant band
PTAS for constant number of insertion/deletion
gaps per sequence on average (for coding regions,
this assumption makes a lot of sense)

Star-Alignment
PTAS in constant band
PTAS for constant number of insertion/deletion
gaps per sequence on average

40
We will do only SP-alignment

Notation: in an alignment, a block of inserted --- is called a gap. If a multiple alignment has c gaps on average for each sequence, we call it an average c-gap alignment.
We first design a PTAS for the average c-gap SP alignment.
Then, using the PTAS for the average c-gap SP alignment, we design a PTAS for SP-alignment within a band.

41
Average c-gap SP Alignment

Key idea: choose r representative sequences and find their correct alignment in the optimal alignment by exhaustive search. Then we use this alignment as a reference.
Then we align every other sequence against this alignment.
Then choose the best.
All we have to show is that there are r sequences whose letter frequencies in each column of their alignment approximate those of the complete alignment.

42
Some over-simplified reasoning

Let M be the optimal average c-gap SP alignment.
In this alignment, many sequences have fewer than cl gaps.
So if we take r of these sequences and try every possible way to align them, one way coincides with M.
Then, hopefully, its letter frequencies in each column more or less approximate those of M.
Then we can simply optimally align all the rest of the sequences, one by one, according to this frequency matrix.

43
Sampling r sequences
[Figure: the complete alignment next to the alignment of the r sampled sequences. If column j of the complete alignment has k percent a's, we also expect column j of the sampled alignment to have k percent a's.]
44
AverageSPAlign

for L = m to nm
    for any r sequences
        for all possible alignments M of length L with no more than cl gaps
            align all other sequences to M   // one alignment
Output the best alignment.

45
SP Alignment within c-Band

Basic idea:
Dynamically cut the sequences into segments.
Each segment satisfies the average c-gap condition. Hence use the previous algorithm.
Then assemble the segments together.
Divide and conquer.

[Figure: cutting these sequences into 6 segments; each segment has c gaps per sequence on average in the optimal alignment.]
46
The final Algorithm diagonalAlign

while (not finished)
    find a maximum prefix for each sequence (same length) such that AverageSPAlign returns low cost
    keep the multiple alignment for this segment
Concatenate the multiple alignments for all segments together as the final alignment.

47
Discussion

Current PTAS is extremely slow.
However, the design of PTAS might provide useful
hints to design fast programs.
It is an interesting project to implement this
PTAS (combined with some heuristics) and test
this idea.

48
Chapter 6. Motif Finding

Finding motifs/conserved regions in proteins is important in drug design and proteomics.
The problem is also called local alignment.
Many programs exist, all based on heuristics.
We proved in STOC'99 that it is NP-hard.
In the same STOC'99 paper, we provided a polynomial time approximation scheme with guaranteed performance.
Based on this theoretical result, we have implemented the COPIA system.

49
Given k protein sequences, find a conserved region.
[Figure: k sequences, each with a marked region of length L.]
Red regions are conserved regions, or motifs. They don't have to be exactly the same; they match with higher scores than other regions.
50
The PTAS (Li, Ma, Wang, STOC99)

Input: S1, ..., Sm, integer L.
Output: t1, ..., tm, |ti| = L (motifs).
For every choice of r length-L substrings, compute their consensus t.
In all strings, find the closest substring to t of length L. From these, find the new consensus s.
Choose the best.
Theorem: this algorithm outputs a solution no worse than $(1 + 1/\sqrt{r})$ × optimal. Time complexity is $(nm)^{O(r)}$.
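
The steps above can be sketched for small inputs as follows. This is a brute-force illustration under stated assumptions, not the COPIA code: cost is the total Hamming distance from the chosen motifs to their consensus, ties in a column are broken by first occurrence, every input is assumed to be at least L long, and the default r = 2 is an arbitrary small value.

```python
from collections import Counter
from itertools import combinations_with_replacement


def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))


def consensus(words):
    """Column-wise majority letter of equal-length words."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*words))


def closest_substring(s, t):
    """The length-|t| substring of s with minimum Hamming distance to t."""
    L = len(t)
    return min((s[i:i + L] for i in range(len(s) - L + 1)),
               key=lambda w: hamming(w, t))


def consensus_pattern(seqs, L, r=2):
    """Try every choice of r length-L substrings; return (cost, consensus, motifs)."""
    windows = sorted({s[i:i + L] for s in seqs for i in range(len(s) - L + 1)})
    best = None
    for sample in combinations_with_replacement(windows, r):
        t = consensus(sample)
        motifs = [closest_substring(s, t) for s in seqs]
        s_new = consensus(motifs)
        cost = sum(hamming(m, s_new) for m in motifs)
        if best is None or cost < best[0]:
            best = (cost, s_new, motifs)
    return best
```

The number of samples grows like the r-th power of the number of distinct windows, which is why a straightforward implementation is slow for larger r.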

51
COPIA (COnsensus Pattern Identification Analysis, Liang-Li-Ma) (http://dna.cs.ucsb.edu/copia/copia_submit.html) and Others

A straightforward implementation of the PTAS is extremely slow.
But we can do a few things to speed it up.
E.g. instead of choosing all r, choose randomly.
Many programs exist. Some use HMM. Most popular perhaps is the Gibbs sampling method by Lawrence et al. (see page 149 of the textbook). It is an iterative procedure as follows:
Input: S1, ..., Sm
Randomly choose a word wi from each Si
Randomly choose r
Create a frequency matrix (qij) from the m-1 words not from Sr.
For each position in Sr, compute the probability that the word there fits the frequency matrix (qij), and use the word with the highest probability; repeat.

52
Linear Programming Relaxation (Reference book: Randomized Algorithms, Motwani, Raghavan)

We introduce here a new technique for algorithm design (approximation).
For simplicity, let's consider a simplified problem. Given a set $S = \{s_1, \ldots, s_n\}$ of binary sequences, each of length n, find a string t not in S that is far away from all $s_i$ in S: find the maximum d such that $H(t, s_i) > d$ for all i.
Formulating an LP problem: if $x = x_1 \cdots x_n$ and $s_i = s_1 \cdots s_n$, then $H(x, s_i) = x'_1 + x'_2 + \cdots + x'_n$, where $x'_j = x_j$ if $s_j = 0$ and $x'_j = 1 - x_j$ if $s_j = 1$. Now we have an integer program:
max D
$\sum_j a_{ij} x_j > C_i + D$ for each $s_i$, with $a_{ij} \in \{0, 1, -1\}$; the constants $C_i$ come from expanding $\sum_j x'_j$
$x_i \in \{0, 1\}$

53
LP Relaxation

Integer Program (as we have just described) is NP-hard. So it is hard to solve it directly.
However, Linear Program is polynomial time solvable. So we relax our requirement that the solution $x_1, \ldots, x_n$ has to be 0 or 1. Instead we may use $0 \le x_i \le 1$.
Then, using a technique called randomized rounding, we convert our fractional solution to an integer solution by setting $x_i = 1$ with probability $x_i$, and $x_i = 0$ with probability $1 - x_i$.
What does this give us? Well, for each constraint $\sum_j a_{ij} x_j > C_i + D$, after randomized rounding we lose about log n; thus if D is O(n), we are within D - log D with high probability, so we have a very good approximation.
We can then derandomize the process, getting a PTAS.
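
The rounding step itself is tiny. The sketch below assumes a fractional solution is already available (solving the LP is not shown) and simply rounds it several times, keeping the rounded string that is farthest from the given binary strings. The function names and the choice of 20 trials are illustrative.

```python
import random


def randomized_round(x_frac, rng):
    """Set x_i = 1 with probability x_frac[i], else 0."""
    return [1 if rng.random() < x else 0 for x in x_frac]


def min_hamming(t, strings):
    """Distance from t to the nearest string: the objective D to maximize."""
    return min(sum(a != b for a, b in zip(t, s)) for s in strings)


def best_rounding(x_frac, strings, trials=20, seed=0):
    """Round the fractional solution several times and keep the farthest string."""
    rng = random.Random(seed)
    candidates = [randomized_round(x_frac, rng) for _ in range(trials)]
    return max(candidates, key=lambda t: min_hamming(t, strings))
```

Coordinates that the LP already set to 0 or 1 are never changed by the rounding; only the fractional ones are random.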

54
Chapter 7 Protein Folding by Threading

Target: ketosteroid isomerase (KSI)
Template: 1ounA. C_alpha RMSD 1.97 Å and seq. identity 9% with KSI
2.73 Å RMSD for all backbone atoms
200 residues (M. Summers, in Science: one of the HIV proteins)

Blue: predicted model by PROSPECT. Red: NMR structure
55
Protein Folding Prediction by Threading
There is much interesting work by Sippl-Weitchus, Lathrop-Smith, Jones-Thornton, Godzik et al., Bryant-Altschul, and many others. But we will concentrate on one program, PROSPECT, by Ying Xu et al.

Template library: FSSP. It currently has about 2000 known protein structures and specifies, for each protein:
Core (α-helix, β-sheet) regions.
For each a.a. position in the sequence: secondary structure (3 options); sol(vent) exposed, buried, or intermediate (s.s. × sol: 9 possibilities in total).
Distance: interaction when < 8 angstrom. If in the same core, also specify interaction if > 4 angstrom.
Input: query sequence.

[Figure: a protein drawn as alternating core and loop regions.]
56
Goal Find Best Template

To minimize, over all templates T in FSSP:
E_T = w1 Emutate + w2 Esingleton + w3 Epair + w4 Egap
where
Emutate is the alignment cost (as sequences, using PAM250).
Esingleton is a residue's local preference for the s.s. and solvent environment (stored in a table).
Epair specifies that some (non-neighbor) pairs should be close in space (energy level also in a table).
Egap is the gap penalty. Gaps are not allowed in cores in PROSPECT.

57
Threading Algorithm

The problem in general is NP-complete.
The problematic part is Epair. Other parts can be optimized in polynomial time.
In order to compute Epair, draw an edge between each pair when there is pairwise interaction in FSSP.
PROSPECT uses divide and conquer plus dynamic programming: at each step, find the best place to cut the problem into two parts (with the smallest cut):
$DP[1,n,1,m] = \min_{i,k} \{\, DP[1,i,1,k,x_1,\ldots,x_c] + DP[i+1,n,k+1,m,x_1,\ldots,x_c] \,\}$
Can precompute the optimal cut: fix i above.

58
Time Complexity

$O(n^c\,nm)$, with c = 3 or 4 on average and c = 8 at most, over all proteins in FSSP.
User-provided information will also help threading; this includes:
Secondary structure information
Inter-residue distances (including NMR data)
Gap length requirements
Problems remain:
Coefficients for all the wi's. It is better to be specific.
Can this be done in random polynomial time? Or by efficient approximation? Currently too slow.
Fast computation is important for distant homology.

59
Side Topic Structure Alignment

Problem: given two proteins with structures, align them together.
A simple algorithm: for each combination of three a.a. from one protein and three from the other, align the six, then align the rest (by dynamic programming). This algorithm, by Takutsu, takes about $O(n^8)$ time.

60
Chapter 8. Repeating Patterns

Over 45% of the human genome consists of (approximate) repeats. More in plants.
Some claim that genomics is a science of finding/studying genomic repeats. Genomes evolve by duplicating parts, which then evolve.
How do we find approximate genome-level repeats?
We wrote such a program, PatternHunter (Ma, Tromp), and a visualization tool (Miller).
Larry Miller/John Tromp demo.

61
Repeats Arabidopsis Chr 2 vs Chr 1p
62
Chapter 9. Coding Region Detection

Prokaryotic and eukaryotic genomes have different properties:
Regulatory regions
Starting and ending codons
Gene density
Introns
Such programs usually use HMM to detect ORFs and intron/exon regions, and use EST databases, BLAST search, and species-specific statistics: GenScan, Genie, GeneMark, ...
Other programs use neural networks: GRAIL

63
How it works in nature

Prokaryotes do not have introns
Single-cell eukaryotes have fewer introns
Other eukaryotes may have up to 90 introns.

The spliceosome cuts off the introns, which often start with GU and end with AG
In the middle, the branch site ends with AC (matching GU)

Note: Y = C/U, N = any
65
Hidden Markov Model

HMM is $\langle \Sigma, Q, A, E \rangle$ where
$\Sigma$ is the symbol alphabet
Q is the set of states (that emit symbols)
A is the $|Q| \times |Q|$ matrix of state transition probabilities
E is the $|Q| \times |\Sigma|$ matrix of emission probabilities.
A path $\pi = \pi_1 \cdots \pi_n$ is a sequence of states. The probability that a sequence x is generated by $\pi$ is
$P(x \mid \pi) = \prod_{i=1}^{n} P(x_i \mid \pi_i)\, P(\pi_i \to \pi_{i+1})$
Decoding Problem: given x, find $\pi$ maximizing $P(x \mid \pi)$.

Fake/fair Coin Example: Coin = $\langle \Sigma, Q, A, E \rangle$
$\Sigma = \{0, 1\}$ (tail/head)
$Q = \{F, B\}$ (fair, biased)
$a_{FF} = a_{BB} = 0.9$, $a_{FB} = a_{BF} = 0.1$
$e_F(0) = 1/2$, $e_F(1) = 1/2$, $e_B(0) = 1/4$, $e_B(1) = 3/4$
Decoding Problem: solved by Viterbi in 1967, using Dynamic Programming.
Parameter estimation
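
The decoding step for the coin example can be sketched as below. It uses the transition and emission probabilities from the slide, works with logarithms to avoid underflow on long sequences, and assumes a uniform start distribution, which the slide does not specify.

```python
import math

STATES = "FB"                                   # fair, biased
TRANS = {("F", "F"): 0.9, ("F", "B"): 0.1,
         ("B", "B"): 0.9, ("B", "F"): 0.1}
EMIT = {"F": {0: 0.5, 1: 0.5},
        "B": {0: 0.25, 1: 0.75}}


def viterbi(x, start=None):
    """Most probable state path for a sequence x of 0s (tails) and 1s (heads)."""
    if not x:
        return ""
    start = start or {s: 1 / len(STATES) for s in STATES}   # assumed uniform
    score = {s: math.log(start[s]) + math.log(EMIT[s][x[0]]) for s in STATES}
    back = []                                    # back[i][s] = best predecessor of s
    for sym in x[1:]:
        prev, score, ptr = score, {}, {}
        for s in STATES:
            best = max(STATES, key=lambda q: prev[q] + math.log(TRANS[(q, s)]))
            score[s] = prev[best] + math.log(TRANS[(best, s)]) + math.log(EMIT[s][sym])
            ptr[s] = best
        back.append(ptr)
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for ptr in reversed(back):                   # trace the pointers backwards
        state = ptr[state]
        path.append(state)
    return "".join(reversed(path))
```

The running time is O(n |Q|^2) for a sequence of length n.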

66
A HMM Scheme for Gene Finder
67
Chapter 10. Phylogeny (Good reference: Biological sequence analysis, probabilistic models of proteins, by Durbin, Eddy, Krogh, Mitchison, Cambridge Univ. Press.)

Given a gene sequence from each of k species, it is possible to infer the evolutionary relationship of these k species from these k genes.
Many heuristics for building evolutionary trees exist. The problem is known to be NP-hard.
We will discuss various algorithms:
Max likelihood method: fastDNAml/PROTml
Neighbor Joining
Parsimony
UPGMA
Quartet cleaning (SODA'00)
PTAS finding the most consistent tree (FOCS'98).

68
Tree of Evolution

AAAGGTACC

G -> T mutation
AAATGTACC
A -> G mutation
AAATGTACC
AAATGTGCC
A -> T mutation
AAATGTGCC
TAATGTGCC
69
UPGMA (Sokal, Michener, 1958)
[Figure: example tree with edge lengths 0.1, 0.1, 0.1, 0.4, 0.4.]

Initialize
$C_i = \{s_i\}$, for all i.
Repeat until one cluster is left:
Find two clusters $C_i$, $C_j$ with minimum $d_{ij} = \left(\sum_{p \in C_i,\, q \in C_j} d_{pq}\right) / (|C_i|\,|C_j|)$
Define node k with i, j as children, edge weight $d_{ij}$
Form cluster k, remove clusters i, j.
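
A compact sketch of these steps is shown below. It takes a symmetric distance matrix over the leaves and returns the tree as nested tuples (left, right, d), where d is the average distance between the two merged clusters. The representation and the name `upgma` are choices made for this sketch; it recomputes averages from scratch, so it is cubic or worse and only suitable for small inputs.

```python
def upgma(dist):
    """UPGMA on a symmetric matrix; leaves are the indices 0..n-1 (n >= 1)."""
    n = len(dist)
    clusters = {i: [i] for i in range(n)}      # cluster id -> leaf indices
    trees = {i: i for i in range(n)}           # cluster id -> subtree
    next_id = n
    while len(clusters) > 1:
        ids = sorted(clusters)
        best = None
        for a in range(len(ids)):
            for b in range(a + 1, len(ids)):
                ci, cj = clusters[ids[a]], clusters[ids[b]]
                # Average distance over all leaf pairs across the two clusters.
                d = sum(dist[p][q] for p in ci for q in cj) / (len(ci) * len(cj))
                if best is None or d < best[0]:
                    best = (d, ids[a], ids[b])
        d, i, j = best
        clusters[next_id] = clusters.pop(i) + clusters.pop(j)
        trees[next_id] = (trees.pop(i), trees.pop(j), d)
        next_id += 1
    return next(iter(trees.values()))
```

With three leaves where 0 and 1 are at distance 2 and both are at distance 6 from leaf 2, the two close leaves are joined first and leaf 2 is attached last.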

Problem of UPGMA
70
Neighbor Joining (Saitou-Nei, 1987)

Initialize
T = sequences, L = T
Choose $i, j \in L$ such that $d_{ij} - r_i - r_j$ is minimized. The rest is similar to UPGMA, with a similar modification on edge weights to k.
Here, $r_i$, $r_j$ are the average distances from i, j to the other nodes in L, to compensate for long edges.

71
Parsimony

Finds the tree with minimal number of
substitutions.
Extremely slow. Use branch and bound.
Even this problem is not easy: given the tree with leaves as sequences, find the best way to assign sequences to internal nodes so that the mutations are minimized. (Jiang-Lawler-Wang, STOC 1994)

72
Maximum Likelihood

Jukes-Cantor (1969) model: equal substitution probability (1 parameter, α).
Kimura (1980) 2-parameter model: purine (A, G) to purine and pyrimidine (C, T, U) to pyrimidine substitutions occur at a different rate from purine-pyrimidine substitutions (rates α and β).
The max likelihood method wants a tree with maximum probability under these models. You have to exhaustively search through all trees, an extremely slow process.
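
As a small worked instance of the one-parameter model, the sketch below computes the standard Jukes-Cantor corrected distance between two aligned sequences, d = -(3/4) ln(1 - 4p/3), where p is the fraction of differing sites. This formula is the usual textbook consequence of the model rather than something stated on the slide, and it gives only a pairwise distance, not a maximum likelihood tree.

```python
import math


def p_distance(a, b):
    """Fraction of aligned sites at which the two sequences differ."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be aligned and non-empty")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def jukes_cantor_distance(a, b):
    """Expected substitutions per site under the Jukes-Cantor model."""
    p = p_distance(a, b)
    if p >= 0.75:                 # saturation: the correction is undefined
        return math.inf
    return -0.75 * math.log(1 - 4 * p / 3)
```

The corrected distance is always at least p, because some sites have changed more than once.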

73
Quartet Methods

For every four sequences, construct a tree of 4 nodes (a quartet), using max likelihood, say.
Then build a large tree consistent with (most of) these small trees (of size 4). But this step is (NP) hard.
New approach: data correction (see our papers in FOCS'98 and SODA'00 on our bioinformatics lab website).

74
Quartets and Correction
[Figure: an original tree on the leaves a, b, c, d, e, together with its quartets; one quartet contains an error, which is then fixed by correction.]
75
HyperCleaning Software

For fewer than 30 taxa, HyperCleaning is comparable to fastDNAml (using maximum likelihood score), and better than NJ.
For more than 30 taxa, true ML and MP methods take days and produce poor results.
HyperCleaning does well, with a better ML score.

77
Chapter 11. DNA Compression

Life represents order, not chaos. It should be compressible. Biological evidence:
In eukaryotes: long tandem repeats
multiple copies of essential genes (rRNA)
only 1000 protein folding patterns
genes duplicate themselves for evolutionary or selfish purposes

78
GenCompress (Chen, Kwong, Li, GIW99,RECOMB00)

Lempel-Ziv style, one pass.
Encodes approx. matches (edit distance).
Encodes approximate reverse complements.
Carefully designed gain function and encoding.
Arithmetic encoding when needed.
Works for conditional compression.
Best compression ratio for benchmark data.

79

Comparison with Biocompress-2
Biocompress-2 is by Grumbach-Tahi
80
Comparison with Cfact
Cfact is by Rivals et al
81
Chapter 12. Whole-Genome Phylogeny

Completely sequenced:
C. elegans, H. sapiens, D. melanogaster, rice
50 species of archaea bacteria and eubacteria
What can we do with them?
Snel, Bork, Huynen: compare gene contents
Boore, Brown: gene order
Sankoff, Pevzner, Kececioglu: reversal/translocation/synteny

82
New Thinking

Let's first give up the traditional a priori conception of important similarities between 2 sequences.
Can we simply define a distance d(x,y) that is universal: if two sequences are close in any sense, then they are close under d(x,y)?

83
Defining a Distance

Shared Information
K(x|y) is the Kolmogorov complexity of x conditioned on y, defined as the length of the shortest program that outputs x on input y.
K(x) - K(x|y) is the mutual information.

$d(x,y) = 1 - \dfrac{K(x) - K(x \mid y)}{K(xy)}$
84
Universality

It can be proved that for any other computable normalized measure D(x,y) that satisfies some reasonable neighborhood density property, there is a constant c such that for all x, y, we have
d(x,y) < c D(x,y)
Informally speaking: any similarity detected by D is also detected by d!

85
Approximate K(x|y)

But K(x|y) is not computable!!
We approximate K(x|y) by Compress(x|y).
DNA sequences are over the alphabet {A, C, G, T}. The trivial algorithm gives 2 bits per base.
All commercial software like compress, compact, pkzip, arj gives > 2 bits/base.
We will use GenCompress, since this is the best available compression algorithm for DNA sequences.
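
To show the shape of the computation, the sketch below plugs a general-purpose compressor into the distance formula d(x,y) = 1 - (K(x) - K(x|y)) / K(xy). It swaps in Python's `zlib` for GenCompress, which, as noted above, is a poor compressor for DNA, so the numbers are only indicative. The conditional term is estimated as C(yx) - C(y), which is an assumption of this sketch rather than the conditional compression mode of GenCompress.

```python
import zlib


def compressed_size(data):
    """Stand-in for K(.): length in bytes of the zlib-compressed input."""
    return len(zlib.compress(data, 9))


def distance(x, y):
    """1 - (K(x) - K(x|y)) / K(xy), with K approximated by compression."""
    k_x = compressed_size(x)
    # What x still costs once y is known: extra bytes needed after y.
    k_x_given_y = max(compressed_size(y + x) - compressed_size(y), 0)
    k_xy = compressed_size(x + y)
    return 1 - (k_x - k_x_given_y) / k_xy
```

Inputs are byte strings, e.g. `distance(b"ACGT" * 500, b"TTGA" * 500)`. A sequence compared with itself should come out much closer than two unrelated sequences.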

86
First Experiment Genome Tree

On a single gene (or several), current methods:
Max. likelihood: assumes statistical evolutionary models, computes the most likely tree. Based on multiple alignment.
Max. parsimony: needs multiple alignment, then finds the best tree, minimizing cost.
Distance-based methods: NJ, quartet methods, Fitch-Margoliash method.
Trouble: different gene trees, manual alignment, horizontally transferred genes; they do not handle genome-level events.

87
Whole Genome Phylogeny

We provide an alternative approach. It
uses all the information in a genome
is completely automated
needs no alignment: robust, fast, accurate
needs no model: simply universal
generalizes alignment distance, reversal distance, translocation distance
one genome, one evolution, one tree.

88
Eutherian Orders

It has been a disputed issue which two groups of placental mammals are closer: Primates, Ferungulates, Rodents.
In 1998, Hasegawa's group concatenated 12 proteins from rat, house mouse, grey seal, harbor seal, cat, white rhino, horse, finback whale, blue whale, cow, gibbon, gorilla, human, chimpanzee, pygmy chimpanzee, orangutan, sumatran orangutan, with opossum, wallaroo, platypus as outgroup.
They used the Max Likelihood method.

89
Eutherian Orders ...

We use exactly the same species.
And complete mtDNA genome
We computed d(x,y) for each pair of species, and
used Neighbor Joining in Molphy package (and our
own hypercleaning).
We constructed exactly the same tree. Confirming
((Primates, Ferungulates), Rodents)

90
Evolutionary Tree of Mammals
91
2nd Exp Chaining Chain Letters

Charles Bennett collected 33 chain letters, 1980-1997. Using our new method, we reconstructed their history and answered open questions of chain letter experts. Will appear in Scientific American.
Like a gene, they are about 2000 characters long; like a virus, they have infected billions of people; they mutate just like a genome, and even have horizontal transfer. Traditional phylogeny methods should also work on them. But they don't: alignment fails (translocated sentences), and there is no model of evolution.
We used our own method to calculate the shared information between each pair of chain letters.

92
A sample letter
93
Another typical chain letter
with love all things are possible this paper has
been sent to you for good luck. the original is
in new england. it has been around the world
nine times. the luck has been sent to you. you
will receive good luck within four days of
receiving this letter. provided, in turn, you
send it on. this is no joke. you will receive
good luck in the mail. send no money. send
copies to people you think need good luck. do
not send money as faith has no price. do not keep
this letter. It must leave your hands within 96
hours. an r.a.f. (royal air force)
officer received 470,000. joe elliot received
40,000 and lost them because he broke the
chain. while in the philippines, george welch
lost his wife 51 days after he received the
letter. however before her death he received
7,755,000. please, send twenty copies and see
what happens in four days. the chain comes from
venezuela and was written by saul anthony de
grou, a missionary from south america. since
this letter must tour the world, you must make
twenty copies and send them to friends and
associates. after a few days you will get a
surprise. this is true even if you are not
superstitious. do note the following
constantine dias received the chain in 1953. he
asked his secretary to make twenty copies and
send them. a few days later, he won a lottery of
two million dollars. carlo daddit, an office
employee, received the letter and forgot it had
to leave his hands within 96 hours. he lost his
job. later, after finding the letter again, he
mailed twenty copies a few days later he got a
better job. dalan fairchild received the letter,
and not believing, threw the letter away, nine
days later he died. in 1987, the letter was
received by a young woman in california, it was
very faded and barely readable. she promised
herself she would retype the letter and send it
on, but she put it aside to do it later. she was
plagued with various problems including
expensive car repairs, the letter did not leave
her hands in 96 hours. she finally typed the
letter as promised and got a new car. remember,
send no money. do not ignore this. it works. st.
jude
94
Phylogeny of 33 Chain Letters
Confirmed by VanArsdale's study; answers an open
question.
95
Chapter 13. Expression Arrays

Expression arrays can hold thousands of genes
(DNA) such that cDNA (complementary DNA) can
hybridize on them.
The process is explained on the next page (J.
Buhler).
This has opened great new opportunities for
proteomics and biological pathway studies. For
example, there are expression arrays containing
the complete set of genes for Yeast (Pat Brown
Lab, Stanford).
Now that the human genome is completed, with
about 30k genes, it is conceivable for a chip to
contain the complete set of human genes.

96
1. Two kinds of cells
2. Reverse transcribe mRNA to cDNA
3. Label cDNA with fluorescein
4. Hybridize on the DNA array
5. Read array
6. Analyze it.
97
Analysis Techniques

Finding genes expressed in one kind of cells and
not in the other.
Cluster analysis: finding which genes are
expressed together under some condition, to
deduce pathways.
Data mining, machine learning techniques: gene
expression dependency. E.g. the expression of
gene A implies the expression of gene B and no
expression of gene C.
Endless possibilities, enormous data.
http://www.cs.ucsb.edu/mli/expression.html
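The first technique in the list, finding genes expressed in one kind of cells and not in the other, is commonly approached by comparing the two intensities per gene on a log scale. A minimal sketch, assuming each array readout is a mapping from gene name to intensity; the gene names, the intensities, the two-fold cutoff, and the function name are all hypothetical.

```python
import math


def differential_genes(cells_a, cells_b, min_fold=2.0, floor=1.0):
    """Return {gene: log2(a / b)} for genes whose intensity differs by at
    least min_fold between the two kinds of cells.

    floor guards against dividing by zero or taking the log of zero on
    spots with no signal (an assumed convention, not a standard).
    """
    threshold = math.log2(min_fold)
    changed = {}
    for gene in cells_a.keys() & cells_b.keys():
        a = max(cells_a[gene], floor)
        b = max(cells_b[gene], floor)
        log_ratio = math.log2(a / b)
        if abs(log_ratio) >= threshold:
            changed[gene] = log_ratio
    return changed


# Invented readouts for four hypothetical genes.
cells_a = {"g1": 800, "g2": 120, "g3": 50, "g4": 400}
cells_b = {"g1": 100, "g2": 110, "g3": 400, "g4": 390}

changed = differential_genes(cells_a, cells_b)
for gene in sorted(changed):
    print(gene, changed[gene])
```

A positive value means the gene is more strongly expressed in the first kind of cells, a negative value in the second. Real array data would also need background correction and normalization before this comparison, which the sketch leaves out.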

98
An example of clustering
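A small sketch of what such a clustering computes: genes whose expression profiles rise and fall together across conditions are grouped by average-linkage agglomerative clustering on 1 minus the Pearson correlation. This is one common choice, not necessarily the method behind the example on this slide; the profiles, gene names, and function names are invented.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length profiles.
    Assumes neither profile is constant (otherwise the divisor is zero)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def cluster_genes(profiles, k):
    """Merge the two closest clusters repeatedly until k clusters remain."""
    clusters = [[gene] for gene in profiles]

    def distance(c1, c2):
        # Average of (1 - correlation) over all cross-cluster gene pairs
        pairs = [(a, b) for a in c1 for b in c2]
        total = sum(1 - pearson(profiles[a], profiles[b]) for a, b in pairs)
        return total / len(pairs)

    while len(clusters) > k:
        c1, c2 = min(combinations(clusters, 2), key=lambda p: distance(*p))
        rest = [c for c in clusters if c is not c1 and c is not c2]
        clusters = rest + [c1 + c2]
    return clusters


# Invented expression levels of four genes under four conditions.
profiles = {
    "g1": [1, 2, 3, 4],
    "g2": [2, 4, 6, 8.5],
    "g3": [4, 3, 2, 1],
    "g4": [8, 6, 4.5, 2],
}
groups = cluster_genes(profiles, 2)
print(groups)
```

Here g1 and g2 rise together and g3 and g4 fall together, so they end up in separate groups. Genes that land in the same group are candidates for belonging to the same pathway.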
99
Chapter 14. Bioinformatics Platform

GCG (Wisconsin package)
Biotools
MacVector
Doubletwist webservice
NCBI webservice
Omega
Specialized packages like PAUP (phylogeny),
GenePix (expression array)
Many more.

100
DryLab: open, distributed, platform independent
101
Chapter 15. Projects/Missing Topics

I wish I could cover everything, but time is
short. Some of the missing topics will be covered
by student projects/presentations.
Each student will work on one project and present
it in class.
Most projects involve serious programming in
either C or Java.
Please start early and discuss with me first.
Check our web page for projects.
When a project is large, I will allow two
students to work on a project together.

102
Major missing topics

Genome rearrangement distances
Physical mapping
DNA arrays / Sequence by hybridization
Drug target design (from computational biology to
computational chemistry).
RNA structure / folding.
Statistical studies / properties.

103
Acknowledgements

I would like to thank many coauthors, colleagues,
students whose work I used here: J. Badger, C.
Bennett, B. Brejova, X. Chen, T. Jiang, P.
Kearney, S. Kwong, C. Liang, G.H. Lin, B. Ma, L.
Miller, H.D. Sun, J. Tromp, T. Vinar, L. Wang, Z.
Wang, D. Xu, Y. Xu, H. Zhang.
I have copied some materials/figures from the
internet; I thank these authors as well.
Most importantly, thanks to the enthusiastic 290I
students of Spring 2001 at UCSB who made this
course fun to teach.