Title: A Course on Bioinformatics
1A Course on Bioinformatics
Ming Li Bioinformatics Lab Computer Science
Dept., UCSB
2 Chapter 0. Why Study Bioinformatics?
- The trend of genetic data growth
- Many experts predict 21st century will be a
century of biotechnology. - Genomes Human, Rice, Mouse, Yeast, 50 species
bacteria, biggest digital gold mine ever.
30 billion in year 2005
3Chapter 1. Biology
Phenotype
4Animal CELL
Mitochondrion
Nucleolus (rRNA synthesized)
Cytoplasm
Nucleus
Plasma membrance Cell coat
Chromatin
Lots of other stuff/organelles/ribosome
5Two kinds of Cells
- Prokaryotes no nucleus (bacteria)
- Their genomes are circular
- Eukaryotes have nucleus (animal,plants)
- Linear genomes with multiple chromosomes in
pairs. When pairing up, they look like
Middle centromere Top p-arm Bottom q-arm
6Mitosis and Meiosis
- Mitosis homologous chromosomes pair up,
duplicate, one set to each cell. - Meiosis chromosomes split to haploid
(reproductive) cells. - Haploid Number of Chromosomes
- Human 23
- Rice 12
- Fruit Fly 4
- Corn 10
- Chimpanzee 24
- House mouse 20
7The DNA Sequences
- All are made of H (hydrogen), C (carbon), N
(nitrogen), O (oxygen), S (sulfur), P
(phosphorus). - A (single chain) DNA sequence looks like
-
5
o
o
o
Sugar --- base
CH
A,C,G,T,U
2
c
c
Phosphate (PO )
c
c
4
Sugar --- base
A nucleotide
o
o
3
phosphate
Ribose RNA Deleting circled O, ? Deoxyribose
--DNA
Sugar --- base
etc
8 The 4 bases
Thymine
Adenine
H
H
H
C
o
A-T
N
H
H
H
N
C
C
C
c
C
C
N
N
H
H
N
C
N
C
C
N
o
H
H
H
Note this is flat!
o
H
N
H
N
C
C
C
Uracil replaces T in RNA
c
C
C
N
N
H
H
N
C
N
C
G-C
C
N
o
N
H
Cytosine
Guanine
H
9Human 3 billion bases, 30k genes. E. coli 5
million bases, 4k genes
T
A
A
T
C
G
cDNA
T
A
reverse transcription
T
A
translation
transcription
C
G
Protein
mRNA
G
C
(20 amino acids)
(A,C,G,U)
G
C
Codon three nucleotides encode an
amino acid. 64 codons 20 amino acids, some w/more
codes
A
T
C
G
T
A
A
T
10Genes and Proteins
- One gene encodes one protein.
- Like a program, it starts with start codon (e.g.
ATG), then each three code one amino acid. Then a
stop codon (e.g. TGA) signifies end of the gene. - Sometimes, in the middle of a (eukaryotic) gene,
there are introns that are spliced out (as junk)
during transcription. Good parts are called
exons. This is the task of gene finding.
11A.A. Coding Table
- Arginine (ARG) CG
- Asparagine (ASN) AAT, AAC
- Glutamine (GLN) CAA, CAG
- Cysteine (CYS) TGT, TGC
- Methionine (MET) ATG
- Phenylalanine (PHE) TTT,TTC
- Tyrosine (TYR) TAT, TAC
- Tryptophan (TRP) TGG
- Histidine (HIS) CAT, CAC
- Proline (PRO) CC
- Stop TGA, TAA, TAG
- Glycine (GLY) GG
- Alanine(ALA) GC
- Valine (VAL) GT
- Leucine (LEU) CT
- Isoleucine (ILE) AT(-G)
- Serine (SER) AGT, AGC
- Threonine (THR) AC
- Aspartic Acid (ASP) GAT,GAC
- Glutamic Acid(GLU) GAA,GAG
- Lysine (LYS) AAA, AAG
- Start ATG, CTG, GTG
12Bad Genes --gt Genetic Diseases
- Hemophilia on X chromosome.
- Cystic firosis on chr. 7, CFTR gene. 4
Caucasians carry the defective gene. (recessive) - Sickle-Cell Anemia single nucleotide mutation in
the first exon of beta-globin gene (removes a
cutting site). 1 in 12 African Americans are
carriers. (sick for homozygotes) - BRCA1 gene (chr. 17q) responsible for ½
inherited breast cancer (10 of breast cancer) - Fragile X syndrome (mentally retard) 1 in 1250
males, 2500 females (dominate, but females have
partially expressed good gene). FMR-1 gene
tri-nucleotide repeats gt200 causes disease. - P53 gene chr. 17p, responsible for ½ of all
cancers - X-rated XX, XY, XO(f), XXY(m), XYY(m)
13Where to get data?
- Larry Miller GenBank
- http//www.ncbi.nlm.nih.gov
- Dr. Huandong Sun
- SWISS-PROT http//www.expasy.ch/sprot
- PDB http//www.pdb.bnl.gov/
- Go to our Bioinformatics Lab website
www.cs.ucsb.edu/mli/Bioinf/resources/index.html - for all the information.
14Chapter 2. Research and Industry
- Bioinformatics Labs and education programs
established in almost all major universities,
scattered in Biology, Physics, Computer Science - Research topics
- Gene-finding
- Mapping
- Sequencing/sequence assembly.
- Sequence Comparison alignment, database search
- Mining DNA/proteins, finding motifs
- Genome arrangement
- DNA arrays
- Computational proteomics, protein folding
- Phylogeny reconstruction and analysis
15The Commercial Market
- Current bioinformatics market is worth 300
million / year. Half data, half software. - 2 billion / year bioinformatics market in 5
years, predicted by Oscar Gruss Son. - Wide range of different demands
- Bioinformatics software require sophisticated
algorithm support, ranging from probabilistic/
approximation algorithms, data mining, learning
algorithms, databases, to GUI design
16BUSINESS Landscape
Pharmaceutical Companies
Universities Research Labs
Hospitals, Biotech firms
Sell Data Service
Sell large Systems
Sell Web Service
17Bioinformatics Companies
- Genomatrix Software, Genaissance Pharmaceuticals,
Lynx, Lexicon Genetics, DeCode Genetics, CuraGen,
AlphaGene, Bionavigation, Pangene, InforMax,
TimeLogic, GeneCodes, LabOnWeb.com, Darwin,
Celera, Incyte, BioResearch Online, BioTools,
Oxford Molecular, Genomica, NetGenics, Rosetta,
Lion BioScience, DoubleTwist, eBioinformatics,
Prospect Genomics, Neomorphic, Molecular Mining,
GeneLogic, GeneFormatics, Molecular Simulations, - Total 50.
18 Challenges to Computer Science
- Enormous size of genomics data suitable for
asymptotic analysis - Many NP-hard problems defy simple
minded/non-professional approaches multiple
alignment, distant homology, motif finding,
protein folding, phylogeny, gene relationship in
expression data, mining and learning, - Approaching these problems from computer science
perspective will be fruitful.
19Algorithmic Techniques in CB
- Dynamic programming (alignment, etc)
- Divide and conquer (Protein folding)
- Approximation algorithms (sequence assembly,
phylogeny) - Greedy algorithms (sequence assembly)
- Heuristics
- Linear programming and relaxation
20Some CS jargons
- NP-Complete easy to guess, hard to find.
- Approximation algorithm If the minimal solution
is f, your solution is ggtf, then the
approximation ratio is g/f. - PTAS (Polynomial time approximation scheme) this
is best kind of approximation algorithms for any
e, we can achieve (1e)optimal in polynomial time
(exponent might depend on e).
21Chapter 3. DNA Sequencing
- Two ways to copy DNA
- 1. Polymerase Chain Reaction (1986) make many
copies of a fragment (500). Needs primers (end
segments). Cleave into 2. Anneal primers (5- 3,
and 3- 5). Make two double chains. Repeat
1,2,4,8,16,
3
5
5
3
5
3
5
3
22- 2. Cloning. Insert a large piece of DNA into a
cloning vector (virus, bacteria, yeast -YAC).
Then the vector is inserted into a host cell to
duplicate naturally. - DNA Sequencing
- Make many copies (single strand)
- Cut them into fragments of lengths 500.
- For each fragment of length L, use some process
like PCR, generating all lengths 1 L with some
fluorescence dye. In old scheme, you generate all
fragments end with A, then with C, then G, then
T, run them in 4 lanes of electrophoresis gel. In
the new scheme, you have 4 colors (of the dye)
all fragments in 1 lane. - Then assemble all fragments into the shortest
common superstring by GREEDY repeatedly merge
the pair with max overlap until finish.
23Shortest Common Superstring
- In FOCS 1990, we started formalize and analyze
the following learning problem Infer orginal DNA
sequence from fragments. Or given n strings,
find the shortest common superstring. (1994, J.
ACM) - The problem was proved to be NPC, 1980.
- Open for 10 years does GREEDY work? (I.e. does
it give linear approximation?) - We solved this, proving 3, STOC91.
- Improvements by many people to 2.89, 2.81, 2.79,
2.75, 2.66, - Formal Statement Given Ss1, sn, find a
shortest s such that for all i, si is a substring
of s. - E.g. alf ate half lethal alpha alfalfa ?
lethalphalfalfate
24Theorem. GREEDY achieves 4.
- Proof. Given Ss1, ,sm, construct G
- Nodes are s1, ,sm
- Edges if then
add edge - where pref is the pref length. I.e.
siprefoverlap length with sj - SCS(S) length shortest Hamilton cycle in G
- (Modified) Greedy restated find all cycles with
minimum weights in G, then open cycle,
concatenate to obtain the final superstring.
sj
pref
si
pref
si
sj
25This minimum cycle exists
- Assuming initial Hamilton cycle has w(C) n
- Then merging si with sj is equivalent to breaking
into two cycles. We have - w(C1) w(C2) lt n
- Proof We merged (si, sj) because they have max
overlap. Picture shows -
-
- d(si,sj)d(s,s)ltd(si,s)d(s,sj)
- Continue this process end with self-cycles C1,
C2, C3, C4, - Sw(Ci) lt n.
C
si
sj
s
s
S
sj
C1
si
S
si
sj
s
s
C2
26- Then we open cycles
- Let Wiw(Ci)
- Li longest string in Ci
- open Cilt Wi Li
- n gt SWi
- Lemma. S1 and S2 overlap lt w1w2
- S(Li-2Wi) lt n, by lemma, since Lis must be in
final SCS. - SltS(LiWi)
- S(Li-2Wi)S3Wi
- lt n 3n
- 4n.
- QED
w1
w1
w1
w1
s1
w2
w2
w2
s2
27Chapter 4. Sequence Comparison
- Sequence Alignment
- s(i,j)max
- s(i-1,j) d(vi,-)
- s(i,j-1)d(-,wj)
- s(i-1,j-1)d(vi,wj)
- No gap penalties,
- where d, for proteins, is either PAM or BLOSUM.
- d(-,x)d(x,-) -a, d(x,y)-u. When a0,
uinfinity, it is LCS.
- Longest Common Subsequence (LCS).
- Vv1v2 vn
- Ww1w2 wm
- s(i,j) length of LCS
- of V1..i and W1..j
- Dynamic Programming
- s(i-1,j)
- s(i,j)max s(i,j-1)
- s(i-1,j-1)1,viwj
28Misc. Concerns
- Local sequence alignment, add si,j0.
- Gap penalties. For good reasons, we charge first
gap cost a, and then subsequent continuous
insertions blta. - Space efficient sequence alignment. Hirschberg
alg. in n2 time, O(n) space. - Multiple alignment of k sequences nk
29BLAST / Psi-BLAST / Gap-BLAST
- Popular software, using heuristics. By Altschul,
Gish, Miller, Myers, Lipman, 1990. - E(epected)-value e dmne -lS, here S is score, m
is database length and n is query length. - Meaning e is number of hits one can expect to
see just by chance when searching a database of
size m. - Basic Strategy For all 3 a.a. (and closely
related) triples, remember their locations in
database. Given a query sequence S. For all
triples in S, find their location in database,
then extend as long as e-value significant.
Similar strategy for DNA (7-11 nucleotides).
30Here is an example of a BLAST match (E-value 0)
between gene 0189 in C. pneumoniae and gene 131
in C. trachomatis. Query CPn0189
Score
(bits) E-value Aligned with CT131 hypothetical
protein 1240
0.0 Query 1 MKRRSWLKILGICLGSSIVLGFLIFLPQLLSTE
SRKYLVFSL I HKESGLSCSAEELKISW 60
MKR W KI G L L L LP SES KYL
SKEGL EL SW Sbjct 1
MKRSPWYKIFGYYLLVGVPLALLALLPKFFSSESGKYLFLSVLNKETGLQ
F EIEQLHLSW 60 Query 61 FGRQTARKIKLTG-EAKDEVFS
AEKFELDGSLLRLL I YKKPKGITLSGWSLKINEPASID 119
FG QTAKI G EFAEK GSL
RLLY PK TLGWSLIE S Sbjct 61
FGSQTAKKIRIRGIDSDSEIFAAEKI IVKGSLPRLLL
YRFPKALTLTGWSLQIDESLSMN 120 Etc. Note Because of
powerpoint character alignment, I inserted some
white space in the alignment that are not in the
BLAST output I will explain in class. These
white spaces were inserts so that things line up
right (the k-th letter goes with the k-th letter).
31Chap. 4. Tools From CS
- Suffix Array. Give sequence
- she_sells_seashell_on_the_sea_shore
- Suffix array is an array of pointers pointing to
suffices of the sequence sorted. - E.g., all suffices of above are e, re, ore,
hore, shore, - _shore, a_shore, ea_shore, sea_shore, and so on.
- Then these are sorted and their pointers
(pointing to a suffix location in sentence) are
stored in an array (the suffix array).
32Why Suffix Array?
- It can be built very fast.
- It can answer queries very fast
- How many times ATG appears (their pointers are
all jammed together). - What is G-C contents.
- Disadvantages
- Cant do approximate matching
- Hard to insert new stuff (need to rebuild the
array) dynamically. - Pointers can cost too much space. 3G pointers?
33Fast Pattern Matching
- Given T (text) and P (pattern), is P in T?
- Slow algorithm for each position in T, check if
P is in T. This costs O(PT). - Fast algorithm no back tracking. P is short, we
can calculate a table for P first if we have
matched PP1 Pk with T halfway fail to match at
Pi with Tj, since P1 Pi-1 already matched the
text, we should know where to start in P and
continue to match at Tj1 The table will
indicate for each position Pi if we fail at i,
where to start next. O(P2 T). - Knuth-Morris-Pratt, Boyer-Moore O(PT).
34Chapter 5 Multiple alignment
- To do phylogenetic analysis
- Same protein from different species
- Optimal multiple alignment probably implies
history - Discover irregularities, such as Systic Fibrosis
gene - To find conserved regions
- Local multiple alignment reveals conserved
regions - Conserved regions usually are key functional
regions - These regions are prime targets for drug
developments
35Definitions
- Given sequences s1, sn, multiple alignment M
puts them in n rows, one sequence per row, with
spaces inserted to get supersequences S1, , Sn,
SiL. - SP-Alignment Minimize sum of Hamming(Si,Sj) over
all pairs i, j. - Star-Alignment find center sequence S, minimize
sum of Hamming(S,Si) over all i. - We will concentrate on SP-alignment.
36CLUSTAL-W
- Standard popular software
- It does multiple alignment as follows
- Align 2
- Repeat keep on adding a new sequence to the
alignment until no more, or do tree-like
heuristics. - Problem It is simply a heuristics.
- Alternative dynamic programming nk for k
sequences. This is simply too slow. - We need to understand the problem and solve it
right.
37Making the Problem Simpler!
- Multiple alignment is very hard
- For k sequences, nk time, by dynamic programming
- NP hard in general, not clear how to approximate
- Popular practice -- alignment within a band the
p-th letter in one sequence is not more than c
places away from the p-th letter in another
sequence in the final alignment the alignment
is along a diagonal bandwidth 2c. - Used in final stage of FASTA program.
38Literature (for details, see our STOC00 paper)
- NP hardness under various models Wang-Jiang
(JCB), Li-Ma-Wang (STOC99), W. Just - Approximation results Gusfield (2- 1/L), Bafna,
Lawler, Pevzner (CPM94, 2-k/L), star alignment. - Sankoff, Kruskal discussed within a band
- Pearson showed alignment within a band gives very
good results for a lot protein superfamilies. - Altschul and Lipman, Chao-Pearson-Miller,
Fickett, Ukkonen, Spouge (survey) all have
studied alignment within a band.
39 The following were proved
- SP-Alignment
- NP hard
- PTAS for constant band
- PTAS for constant number of insertion/deletion
gaps per sequence on average (for coding regions,
this assumption makes a lot of sense)
- Star-Alignment
- PTAS in constant band
- PTAS for constant number of insertion/deletion
gaps per sequence on average
40We will do only SP-alignment
- Notation in an alignment, a block of inserted
--- is called a gap. If a multiple alignment
has c gaps on the average for each sequence, we
call it average c-gap alignment. - We first design a PTAS for the average c-gap SP
alignment. - Then using the PTAS for the average c-gap SP
alignment, we design a PTAS for SP-alignment
within a band.
41Average c-gap SP Alignment
- Key Idea choose r representative sequences, we
find their correct alignment in the optimal
alignment, by exhaustive search. Then we use this
alignment as reference. - Then we align every other sequence against this
alignment. - Then choose the best.
- All we have to show is that there are r sequences
whose letter frequencies in each column of their
alignment approximates the complete alignment.
42Some over-simplified reasoning
- If M is optimal average c-gap SP alignment
- In this alignment, many sequences have less than
cl gaps. - So if we take r of these sequences, and try every
possibly way, one way coincides with M. - Then hopefully, its letter frequencies in each
column more or less approximates that of Ms - Then we can simply optimally align all the rest
of the sequences one by one according to this
frequency matrix.
43Sampling r sequences
Complete Alignment
Alignment with r sequences
j
j
We also expect this column has k percent as
If this column has k percent as
44AverageSPAlign
- for Lm to nm
- for any r sequences
- for all possible alignment M of length L
- and with no more than cl gaps
- align all other sequences to M //one
alignment -
- Output the best alignment.
45SP Alignment within c-Band
- Basic Idea
- Dynamically cut seq-uences into segments
- Each segment satisfies the average c-gap
condition. Hence use previous algorithm - Then assemble the segments together
- Divide-Conquer.
Cutting these sequences into 6 segments, each
segment has c-gaps per sequence on average in
optimal alignment.
46The final Algorithm diagonalAlign
- while (not finished)
- find a maximum prefix for each sequence (same
length) such that AverageSPAlign returns
lowcost. Keep the multiple alignment for this
segment -
- Concatenate the multiple alignments for all
segments together to as final alignment.
47Discussion
- Current PTAS is extremely slow.
- However, the design of PTAS might provide useful
hints to design fast programs. - It is an interesting project to implement this
PTAS (combined with some heuristics) and test
this idea.
48Chapter 6. Motif Finding
- Finding motifs/conserved regions in proteins is
important in drug design and proteomics. - The problem is also called local alignment.
- Many programs exist -- all based on heuristics
- We proved in STOC99 it is NP-hard.
- We provided a polynomial time approximation
scheme with guaranteed performance in polynomial
time in STOC99. - Based on this theoretical result, we have
implemented the COPIA system.
49Given k protein sequences, find a conserved
region
L
K sequences
Red regions are conserved regions, or,
motifs. The dont have to be exactly same, they
match with higher scores than other regions.
50The PTAS (Li, Ma, Wang, STOC99)
- Input S1, , Sm, integer L.
- Output t1, ,tm, tiL (motifs)
- For every r length L substring, compute the
consensuse t of them. - In all strings, find closest substring to t of
length L. From these, find the new consensus s. - Choose the best.
- Theorem This algorithm outputs a solution no
worse than 11/sqrt(r) optimal. Time complexity
is (nm)O(r).
51COPIA (COnsensus Pattern Identification
Analysis, Liang-Li-Ma) (http//dna.cs.ucsb.edu/
copia/copia_submit.html) and Others
- Straightforward implementation of PTAS is
extremely slow. - But we can do a few things to speed up
- E.g. instead of choosing all r, choose randomly
- Many programs exist. Some use HMM. Most popular
perhaps is the Gibbs sampling method by Lawrence
et al (see page 149 textbook). It is an iterative
procedure as follows Input, S1, , Sm - Randomly choose a word wi from each Si
- Randomly choose r
- Create a frequency matrix (qij) from the m-1
words not from Sr. - For each position in Sr, compute the probability
a word fits the frequency matrix (qij), and use
that word with the highest probability, repeat.
52Linear ProgrammingRelaxation(Reference book
Randomized algorithms, Motwani, Raghavan)
- We introduce here a new technique for algorithm
design (approximation). - For simplicity, lets consider a simplified
problem. Given a set Ss1, sn sequences, each
of length n, find string t not in S that is far
away from all si in S find max d such that
H(t,si) gt d. - Formulating an LP problem. If xx1 xn and
sis1, sn, then H(x,si)x1 x2 xn,
where xj xj if sj0, xj 1-xj if sj1, now we
have an integer program - max D
- S aijxjgtCiD for each si, aij0,1,-1 Ci are
from Sxi - xi 0, 1
53LP Relaxation
- Integer Program (as we have just described) is
NP-hard. So it is hard to solve it directly. - However, Linear Program is polynomial time
solvable. So we relax our requirement that
solution x1, xn have to be 0,1. Instead we may
use 0?xi?1. - Then using a technique called randomized rounding
we convert our fraction solution to integer
solution by set xi1 with probability xi, xi0
with probability 1-xi. - What does this give us? Well, for each aij xj
Ci, after randomized rounding, we lose about
logn, thus if D is O(n), we are within D-logD
with high probability, so we have a very good
approximation. - We can then derandomize the process getting a PTAS
54Chapter 7 Protein Folding by Threading
- Target ketosteroid isomerase KSI
- Template 1ounA C_alpha-RMSD 1.97A and seq.
identity 9 with KSI - 2.73A RMSD for all backbone atoms
- 200 residues (M. Summers, in Science one of
HIV proteins)
Blue predicted model by PROSPECT Red NMR
structure
55Protein Folding Prediction by ThreadingThere are
many interesting work by Sippl-Weitchus,
Lathrop-Smith, Jones-Thornton, Godzik et al,
Bryant-Altschul, and many others. But we will
concentrate on one program PROSPECT by Ying Xu et
al.
- Template library FSSP. It has currently about
2000 known protein structures and specifies, for
each protein - Core (?-helix, ?-sheet) regions.
- For each a.a. position in sequence, secondary
structure (3 options), Sol(vent) exposed,
buried, or intermediate. (s.s.xsol total 9
possibilities). - Distance interaction when lt8 angstrom. If in the
same core, if gt4angstrom, also specify
interaction. - Input Query sequence.
core
core
core
loop
loop
loop
loop
Protein
56Goal Find Best Template
- To minimize, over all templates T in FSSP
- ETw1Emutatew2Esingletonw3Epairw4Egap
- where
- Emutate is alignment cost (as sequences, using
PAM250) - Esingleton is a residues local preference of the
ss and solvent environment. (Stored in table) - Epair specifies some (non-neighbor) pairs should
be close in space (Energy level also in a table). - Egap is gap penalty. Not allowed in cores in
PROSPECT.
57Threading Algorithm
- The problem in general is NPC.
- The problematic part is Epair. Other parts can be
optimized in polynomial time. - In order to compute Epair, draw an edge between
each pair when there is pairwise interaction in
FSSP. - PROSPECT uses a divide and conquer plus dynamic
programming each step find best place to cut the
problem into two parts (with smallest cut) - DP1,n,1,mmini,kDP1,i,1,k,x1, xc
- DPi1,n,k1,mx1,,
xc - Can precompute optimal cut fix i above.
58Time Complexity
- O(ncnm), c3,4 on average. c8 max. for all
proteins in FSSP. - User provided information will also help
threading, these include - Secondary structure information
- Inter-residue distances (including NMR data).
- Gap length requirements
- Problems remain
- Coefficients for all (wis). It is better to be
specific. - Can this be done random polynomial time? Or
efficient approximation? Currently too slow. - Fast computation is important for distant
homology.
59Side Topic Structure Alignment
- Problem given two proteins with structures,
align them together. - A simple algorithm repeat for each combination
of three aa from one protein and three from
another, align the six, then align the rest (by
dynamic programming). This algorithm is about n8
by Takutsu.
60Chapter 8. Repeating Patterns
- Over 45 human genome are (approximate) repeats.
More in plants. - Some claim that genomics is a science of
finding/studying genomic repeats. Genomes evolves
by duplicating (then evolve) parts. - How do we find approximate genome level repeats?
- We wrote the such a program PattenHunter (Ma,
Tromp) and visualization tool (Miller) - Larry Miller/John Tromp demo.
61Repeats Arabidopsis Chr 2 vs Chr 1p
62Chapter 9. Coding Region Detection
- Prokaryotic genome and enkaryotic genomes have
different properties - Regulatory regions
- Starting and ending codons
- Gene density
- Introns
- Such programs usually use HMM to detect ORFs and
intron/exon regions, and use EST databases, BLAST
search, and species-specific statistics GenScan,
Genie, GenMark.. - Other programs use Neural Networks GRAIL
63How it works in nature
- How does it work in nature
- Prokaryotes do not have introns
- Single cell eukaryotes have less introns
- Other eukaryotes may have up to 90 introns.
64- Spliceosome cuts off the introns which often
- start with GU, ends with AG
- In the middle, branch site ends with AC (match
GU)
Note Yc/u, Nany
65Hidden Markov Model
- HMM is lt?,Q,A,Egt where
- ? is symbol alphabet
- Q is set of states (that emit symbols)
- A is QxQ matrix of state transition
probabilities - E is Qx? matrix of emission probabilities.
- A path ??1 ?n is a sequence of states. The
probability a sequence is generated by ? is - P(x?) ? P(xi?i)P(?i ?i1)
- Decoding Problem Given x, find ? maximizing
P(x?).
- Fake/fair Coin Example Coinlt?,Q,A,Egt
- ?0,1 (tail/head)
- QF,B (fair, biased)
- aFFaBB0.9, aFBaBF0.1
- eF(0)1/2, eF(1)1/2, eB(0)1/4, eB(1)3/4
- Decoding Problem solved by Viterbi in 1967, using
Dynamic Programming. - Parameter estimation
66A HMM Scheme for Gene Finder
67Chapter 10. Phylogeny (Good reference Biological
sequence analysis, probabilistic models of
proteins Durbin, Eddy, Krogh, Mitchison,
Cambridge Univ Press.)
- Consider a gene seq. from each of k species, it
is possible to infer evolutionary relationship of
these k species from these k genes. - Many heuristics for building evolutionary trees
exist. The problem is known to be NP-hard. - We will discuss various algorithms
- Max likelihood method fastNDAml/PROTml
- Neighbor Joining
- Pasimony
- UPGMA
- Quartet cleaning (SODA00)
- PTAS finding the most consistent tree (FOCS98).
68Tree of Evolution
G ? T mutation
AAATGTACC
A ? G mutation
AAATGTACC
AAATGTGCC
A ? T mutation
AAATGTGCC
TAATGTGCC
69UPGMA (Sokal, Michener, 1958)
0.1
0.1
0.1
0.4
0.4
- Initialize
- Ci si, for all i.
- Repeat until one cluster left
- Find two clusters Ci, Cj with minij
dij(?dpq)/CiCj, p?Ci, q?Cj - Define node k with i,j as children, edge weight
dij - Form cluster k, remove i,j clusters.
Problem of UPGMA
70Neighbor Joining (Saitou-Nei, 1987)
- Initialize
- Tsequences, LT
- Choose i,j?L such that dij-ri-rj minimized. Rest
similar to UPGMA with similar modification on
edge weights to k. - Here, ri, rj are the average distances from i,j
to other nodes in L to compensate long edges.
71Parsimony
- Finds the tree with minimal number of
substitutions. - Extremely slow. Use branch and bound.
- Even this problem is not easy given the tree
with leaves as sequences, find the best way to
assign sequences to internal nodes so that the
mutations are minimized. (Jiang-Lawler-Wang, 1994
STOC)
72Maximum Likelihood
- Jukes-Cantor (1969) model equal substitution
probability (1 parameter ?) - Kimura (1980) (2 parameter) model purine (A,G)
to purine and pyrimidine (C,T,U) to pyrimidine
substitution at different rates (? and ?). - Max Likelihood method wants a tree with max
probability under these models. You have to
exhaustively search through all trees, extremely
slow process.
73Quartet Methods
- For every four sequences, construct a tree of 4
nodes (quartet), using max likelihood say. - Then build a large tree consistent with (most of)
these small trees (of size 4). But this step is
(NP) hard. - New appoach data correction (see our papers in
FOCS98, SODA00 on our bioinformatics lab
website)
74Quartets and Correction
c
a
Original tree
d
b
c
a
b
e
c
d
a
d
a
b
e
e
b
d
b
d
a
e
c
correction
c
a
e
c
error
e
d
75HyperCleaning Software
- For less than 30 taxa, HyperCleaning is
comparable to fastDNAml (using maximum likehood
score), and better than NJ. - For more than 30 taxa, true ML and MP methods
take days and produces poor results.
HyperCleaning do well, with better ML score.
76(No Transcript)
77Chapter 11. DNA Compression
- Life represents order, not chaos. It should be
compressible. Biological evidences - In eukaryotes long tandem repeats
- multiple copies of essential genes (rRNA)
- only 1000 protein folding patterns
- genes duplicate themselves for evolutionary or
selfishness purposes
78GenCompress (Chen, Kwong, Li, GIW99,RECOMB00)
- Lempel-Ziv style, one pass.
- Encodes approx. matches (edit distance).
- Encodes approximate reverse complements.
- Carefully designed gain function and encoding.
Arithmetic encoding when needed. - Works for conditional compression.
- Best compression ratio for benchmark data.
79 Comparison with Biocompress-2
Biocompress-2 is by Grumbach-Tahi
80Comparison with Cfact
Cfact is by Rivals et al
81Chapter 12. Whole-Genome Phylogeny
- Completely Sequenced
- C. elegans, H. sapiens, D. melanogaster, Rice
- 50 species of archaea bacteria and eubacteria
- What can we do with them?
- Snel, Bork, Huynen compare gene contents
- Boore, Brown gene order
- Sankoff, Pevzner, Kececioglu reversal/translocati
on/synteny
82New Thinking
- Lets first give up traditional a priori
conception of important similarities between 2
sequences. - Can we simply define a distance d(x,y) that is
universal if two seqs are close in any sense,
then they are close under d(x,y)?
83Defining a Distance
- Shared Information
- K(xy) is Kolmogorov complexity
- of x condition on y, defined as
- the length of shortest program
- that outputs x on input y.
- K(x) K(xy) is mutual
- information
K(x) - K(xy) d(x,y) 1 -
----------------------
K(xy)
84Universality
- It can be proved, for any other computable
normalized measure D(x,y) that satisfies some
reasonable neighborhood density property, then we
have there is a constant c such that for all
x,y, we have - d(x,y) lt cD(x,y)
- Informally speaking any similarity detected by D
is also detected by d !
85Approximate K(xy)
- But K(xy) is not computable!!
- We approximate K(xy) by Compress(xy).
- DNAs are over alphabet A,C,G,T. Trivial
algorithm gives 2 bits per base. - All commercial software like compress,
compact, pkzip, arj give gt 2 bits/base - We will use GenCompress, since this is the best
available compression alg for DNA sequences.
86 First Experiment Genome Tree
- On a single gene (or several), current methods
- Max. likelihood assumes statistical evolutionary
models, compute the most likely tree. Based on
multiple alignment. - Max. parsimony needs multiple alignment, then
finds the best tree, minimizing cost. - Distance-based methods NJ Quartet methods,
Fitch-Margoliash method. - Trouble different gene trees, manual alignment,
horizontally transferred genes, do not handle
genome level events.
87Whole Genome Phylogeny
- We provide an alternative approach. It
- uses all the information in a genome
- completely automated
- needs no alignment robust, fast, accurate
- no model -- simply universal
- generalizes alignment distance, reversal
distance, translocation distance - one genome, one evolution, one tree.
88Eutherian Orders
- It has been a disputed issue which two groups of
placental mammals are closer Primates,
Ferungulates, Rodents. - Hasegawas group concatenated 12 proteins from
rat, house mouse, grey seal, harbor seal, cat,
white rhino, horse, finback whale, blue whale,
cow, gibbon, gorilla, human, chimpanzee, pygmy
chimpanzee, orangutan, sumatran orangutan, with
opossum, wallaroo, platypus as out group, 1998. - They used Max Likelihood method.
89Eutherian Orders ...
- We use exactly the same species.
- And complete mtDNA genome
- We computed d(x,y) for each pair of species, and
used Neighbor Joining in Molphy package (and our
own hypercleaning). - We constructed exactly the same tree. Confirming
- ((Primates, Ferungulates), Rodents)
90Evolutionary Tree of Mammals
912nd Exp Chaining Chain Letters
- Charles Bennett collected 33 chain letters
1980--1997. Using our new method, we
reconstructed their history. Answered open
questions of chain letter experts. Will appear in
Scientific American. - Like a gene, they are about 2000 characters like
a virus, they have infected billions of people,
they mutate just like a genome, even has
horizontal transfer. Traditional phylogeny
methods should also work on them. But they dont
alignment fail--translocated sentences no model
of evolution. - We used our own method to calculate shared
information between each pair of chain letters.
92A sample letter
93Another typical chain letter
with love all things are possible this paper has
been sent to you for good luck. the original is
in new england. it has been around the world
nine times. the luck has been sent to you. you
will receive good luck within four days of
receiving this letter. provided, in turn, you
send it on. this is no joke. you will receive
good luck in the mail. send no money. send
copies to people you think need good luck. do
not send money as faith has no price. do not keep
this letter. It must leave your hands within 96
hours. an r.a.f. (royal air force)
officer received 470,000. joe elliot received
40,000 and lost them because he broke the
chain. while in the philippines, george welch
lost his wife 51 days after he received the
letter. however before her death he received
7,755,000. please, send twenty copies and see
what happens in four days. the chain comes from
venezuela and was written by saul anthony de
grou, a missionary from south america. since
this letter must tour the world, you must make
twenty copies and send them to friends and
associates. after a few days you will get a
surprise. this is true even if you are not
superstitious. do note the following
constantine dias received the chain in 1953. he
asked his secretary to make twenty copies and
send them. a few days later, he won a lottery of
two million dollars. carlo daddit, an office
employee, received the letter and forgot it had
to leave his hands within 96 hours. he lost his
job. later, after finding the letter again, he
mailed twenty copies a few days later he got a
better job. dalan fairchild received the letter,
and not believing, threw the letter away, nine
days later he died. in 1987, the letter was
received by a young woman in california, it was
very faded and barely readable. she promised
herself she would retype the letter and send it
on, but she put it aside to do it later. she was
plagued with various problems including
expensive car repairs, the letter did not leave
her hands in 96 hours. she finally typed the
letter as promised and got a new car. remember,
send no money. do not ignore this. it works. st.
jude
94Phylogeny of 33 Chain Letters
Confirmed by VanArsdales study, answers an open
question
95Chapter 13. Expression Arrays
- Expression arrays can hold thousands of genes
(DNA) such that cDNA (complementary DNA) can
hybridize on them. - The process is explained next page (J. Buhler).
- This has opened great new opportunities for
proteomics and biological pathway studies. For
example, there are expression arrays containing
complete set of genes for Yeast (Pat Brown Lab,
Stanford). - Now that the human genome is completed with about
30k genes. It is conceivable for a chip to
contain the complete set of human genes.
961. Two kind cells
2. Reverse transcribe mRNA to cDNA
3. Label cDNA with fluorescein
4. Hybridize on the DNA array 5. Read
array 6. Analyze it.
97Analysis Techniques
- Finding genes expressed in one kind of cells and
not in the other. - Cluster analysis finding what genes expressed
together under some condition, to deduce
pathways. - Data mining, machine learning techniques gene
expression dependency. E.g. the expression of
gene A implies the expression of gene B and no
expression of gene C. - Endless possibilities, enormous data.
- http//www.cs.ucsb.edu/mli/expression.html
98An example of clustering
99Chapter 14. Bioinformatics Platform
- GCG (Wisconsin package)
- Biotools
- MacVector
- Doubletwist webservice
- NCBI webservice
- Omega
- Specialized packages like PAUP (phylogeny),
GenePix (expression array) - Many more.
100DryLab Open, distributed, platform independent
101Chapter 15 Projects/Missing Topics
- I wish I could cover all, but the time is short.
Part of the missing topics will be covered by
student projects/presentations. - Each student will work on one project.
- And present the project in class.
- Most projects involve serious programming in
either C or Java. - Please start early and discuss with me first.
Check our web page for projects. - When a project is large, I will allow two
students to work on a project together.
102Major missing topics
- Genome rearrangement distances
- Physical mapping
- DNA arrays / Sequence by hybridization
- Drug target design (from computational biology to
computational chemistry). - RNA structure / folding.
- Statistical studies / properties.
103Acknowledgements
- I would like to thank many coauthors, colleagues,
students whose work I used here J. Badger, C.
Bennett, B. Brejova, X. Chen, T. Jiang, P.
Kearney, S. Kwong, C. Liang, G.H. Lin, B. Ma, L.
Miller, H.D. Sun, J. Tromp, T. Vinar, L. Wang, Z.
Wang, D. Xu, Y. Xu, H. Zhang. - I have copied some materials/figures from the
internet I thank these authors as well. - Most importantly, thanks to the enthusiastic 290I
students of Spring 2001 at UCSB who made this
course fun to teach.