Title: Pattern Discovery in Biological Sequences: A Review
1Pattern Discoveryin Biological Sequences A
Review
ChengXiang Zhai Language Technologies
Institiute School of Computer Science Carnegie
Mellon University
Presentation at the Biological Language Modeling
Seminar, June17, 2002
2Outline
Computer Science
Algorithm
Pattern Discovery
Application
Biology
Motivation
Formalization
Basic Concepts (Common Language)
3Basic Concepts
- Alphabet Language
- Alphabet set of symbols, e.g., ?A, T, G, C
is the nucleotide alphabet - String/Sequence (over an alphabet) finite seq.
of symbols, e.g., wAGCTGC ( How many different
nucleotide strings of length 3 are there?) - Language (over an alphabet) set of strings,
e.g., LAAA, AAT, ATA, AGC, , AGG all
nucleotide triplets starting with A.
4ExampleEssential AA Language
The language (set) of essential amino acids on
the alphabet A, U, C, G LCAC, CAU, , UAC,
UAU
The Genetic Code
5Questions to Ask about a Language (L)
- Syntax Semantics
- How do we describe L and interpret L?
- Recognition
- Is sequence s in L or not?
- Learning
- Given example sequences in L and not in L, how do
we learn L? What if given sequences that either
match or do not match a sub-sequence in L ?
6Syntax Semantics of Language
- Syntax description of the form of sequences
- Surface description enumeration
- Deep description a concise decision rule or a
characterizing pattern, e.g., - L contains all the triplets ending with A, or
- L contains all sequences that match AGGGGGA
- Semantics meaning of sequences
- Functional description of a amino acid sequence
- Gene regulation of a nucleotide sequence
7Recognizing Sequences in L
- Recognizer (for L) given a sequence s, it tells
us if s is in L or not. An operational way of
describing L!
Algorithm (G-rec. Recgonizer)
0 (no) 1 (yes)
L (G-receptors)
? ( all protein sequences)
Is the sequence SNASCTTNAPTGAK a G-receptor?
8More than recognizing...
- Can the recognizer explain why a sequence is a
G-receptor? Is the explanation biologically
meaningful? - The explanatory power reflects the recognizers
understanding of the language. - Two possible explanations/decision rules
- It is longer than 300 AAs
- The four AAs A, P, K, B co-occur within a
window of 50 AAs
9Learning a Language (from Examples)
Positive examples
L
?
Negative examples
Learn a recognizer (Classification) - Given a
new sequence, decide if it is in L
Learn meaningful features (Feature
Extraction/Selection) - Characterize, in a
meaningful way, how L is different from the rest
of ?
10More Basic Concepts
- Pattern/Motif ? sequence template, e.g., A..GT
- Different views of a pattern
- A pattern defines a language L seqs that
match the pattern gt - Language learning pattern learning?
- Given a language, can we summarize it with a
pattern? - A pattern is a feature The feature is on for a
sequence that matches the pattern gt - Feature extraction pattern extraction?
- A pattern is a sequence of a pattern language
11The Need of Probabilities
- We have many uncertainties due to
- incomplete data and knowledge
- noise in data (incorrectly labeled, measurement
errors, etc) - So, we relax our criteria
- L could potentially contain all the sequences,
but with different probabilities (statistical
LM) - How likely is a sequence s in L?
- How do we learn such an L? (LM estimation)
12Biological Motivation for Pattern Discovery
- Motifs or preserved sequence patterns are
believed to exist - Motifs determine a sequences biological function
or structure - Many successful stories (Brejova et al. 2000)
- Tuberculosis detecting secretary proteins 90
confirmed - Coiled coils in histidine kinases detecting
coiled coil
13Amino Acid Patterns Patterns in Protein
- Possible biological meanings They may
- determine a proteins 3-D structure
- determine a proteins function
- indicate a proteins evolutionary history or
family - Suspected properties They may be
- long and with gaps
- flexible to permit substitutions
- weak in its primary sequence form
- strong in its structural form
14Nucleotide Patterns Inon-coding regions
- Possible biological meanings They may
- determine the global function of a genome, e.g.,
where all the promotors are - regulate specific gene expression
- play other roles in the complex gene reg. network
- Suspected properties They may be
- in the non-coding regions
- relatively more continuous and short
- working together with many other factors
15 Nucleotide Patterns II Patterns in RNA
- Possible biological meanings They may
- determine RNAs 3-D structure, thus indirectly
transcription behavior - Suspected properties They may
- be long and with gaps(?)
- contain many coordinating/interacting elements
- weak in its primary sequence form
- strong in its structural form
16 Nucleotide Patterns III Tandem Repeats
- Possible biological meanings They may
- be a result of mutations from the same original
segment - play a role in gene regulation
- be related to several diseases
- Suspected properties They may
- be contiguous
- approximate copies of the same root form
- be hard to detect
17 Pattern Discovery Problem Formulation
- The ultimate goal is to find Meaningful
Patterns - Broadly three types of sub-problems
- Pattern Generation/Enumeration
- Sequence Classification/Retrieval/Mining
- Pattern Extraction
18Map of Pattern Discovery Problems
Sequences
Pattern Generation
Seq. Classification
Candidate Patterns
Function Info
Pattern Extraction
19Pattern Generation/Enumeration
- Given a (usually big) collection of sequences
- Generate/enumerate all the significant patterns
that satisfy certain constraints - Issues
- Design a pattern language (e.g., max. length?)
- Design significance criteria (e.g., freq gt 3)
- Design a search/enumeration strategy
- Algorithm has to be efficient
20Sequence Classification
- Finding structures on a group of sequences
- Categorization group sequences into known
families - Clustering explore any natural grouping tendency
- Retrieval find sequences that satisfy certain
need - Goal maximize classification accuracy
- Issues
- Dealing with noise Using good features/patterns
- Breaking the limit of linear similarity
21Sequence Categorization
- 2 or more meaningful classes
- Examples available for each class
- Goal is to predict the class label of a new
instance as accurately as possible - E.g., protein categorization, G-receptor
recognition
Learn the boundaries
C2
C3
Examples
C1
22Sequence Clustering
- Given sequences, but no pre-defined classes
- Design similarity criteria to group sequences
that are similar - Goal is to reveal any interesting structure
- E.g., gene clustering based on expression
information
Learn the boundaries
23Sequence Retrieval
- Given some desired property of sequences
- Find all sequences with the desired property
- E.g., find all Enzyme sequences that are similar
to my G-receptor sequence
Query
Find these sequences
24Pattern Extraction
- Suppose you are given a lot of text in a foreign
language unknown to you - Can you identify proper names in the text?
- Issues
- Need to know the possible form of a meaningful
pattern (Will a name have more than three words?) - Need to identify useful clues (e.g.,
Capitalized) - The extraction criteria must be based on some
information about the functions or structures of
a sequence
25Entering the Algorithm Zone ...
- The most influential ones seem to be
- Pattern Generation Algorithms
- TEIRESIAS SPLASH
- Pattern Classification Algorithms
- I believe that most standard classification
algorithms have been tried - HMMs are very popular
- Pattern Extraction Algorithms ???
26TEIRESIAS SPLASH
- Find deterministic patterns (exact algorithm)
- Pattern Language
- Allowing gaps, e.g. A..HC
- Constraints on density of the wild-card .
- Less powerful than the regular language/expression
- Significance criteria
- Longer more significant
- Higher frequency more significant
- Statistical test How likely is it a random
effect?
27Basic Idea in TEIRESIAS SPLASH
- Generate Test
- Pruning strategy If a (short) pattern occurs
fewer than 5 times, so do ALL longer patterns
containing it! - A Bottom-up Inductive Procedure
- Start with high frequency short patterns
- At any step, try to extend the short patterns
slightly
28Possible Applications of TEIRESIAS SPLASH
- Defining a feature space (biological words)
- Suggesting structures for probabilistic models
(e.g. HMM structure) - A general tool for any sequence mining task
(e.g., mining the web click-log data?)
29Map of Pattern Discovery Problems
Pattern ? Meaningful Structure?
Sequences
Structure Analysis (Alignment)
Function Analysis (Classification)
Structure Info
Function Info
Pattern Extraction/Interpretation
30Probabilistic Pattern Finding
- Probabilistic vs. Deterministic Patterns
- Functional comparison
- A deterministic pattern either matches or does
not match a sequence - A probabilistic pattern potentially matches every
sequence, but with different probabilities - Deterministic patterns are special cases of
prob. Patterns - Structural comparison
- Deterministic patterns are easier to interpret
31Hidden Markov Models (HMMs)
- Probabilistic Models for Sequence Data
- The System is in one of K states at any moment
- At the next moment, the system
- moves to another state probabilistically
- outputs a symbol probabilistically
- To generate a sequence of n symbols, the system
makes n moves (transitions)
32Examples
1.0
P(w1wn) p(w1s)p(wns)
p(w1)p(wn)
Unigram LM
s
p(ws)
Position Weight Matrix(PWM)
1.0
1.0
1.0
1.0
...
Start
End
s1
s2
sk
p(ws1)
p(ws2)
p(wsk)
P(w1wn) p(w1s)p(wns)
Deterministic Pattern AUGUAGUGAUAA
A
A
Start
1.0
A
End
U
G
1
2
3
U
G
A
p(A)1
p(U)1
p(G)1
A
G
p(A) p(U) p(G) p(C)1
33Three Tasks/Uses
- Prediction (Forward/Backward algorithms)
- Given a HMM, how likely would the HMM generate a
particular sequence? - Useful for, e.g., recognizing unknown proteins
- Decoding (Viterbi algorithm)
- Given a HMM, what is the most likely transition
path for a sequence (discover hidden structure
or alignment) - Training (Baum-Welch algorithm)
- HMM unknown, how to estimate parameters?
- Supervised (known state transitions) vs.
unsupervised
34Applications of HMM
- MANY!!!
- Protein family characterization (Profile HMM)
- A generative model for proteins in one family
- Useful for classifying/recognizing unknown
proteins - Discovering weak structure
- Gene finding
- A generative model for DNA sequences
- Identify coding-regions and non-coding regions
35An Example Profile HMM
- Three types of states
- Match
- Insert
- Delete
- One delete and one match per position in model
- One insert per transition in model
- Start and end dummy states
delete
insert
alignment
Match (at position 2)
Example borrowed from Cline, 1999
36Profile HMMs Basic Idea
- Goal Use HMM to represent the common pattern of
proteins in the same family/domain - First proposed in (Krogh et al. 1994)
- Trained on multiple sequence alignments
- match-states consensus columns
- Supervised learning
- Trained on a set of raw sequences
- match-states avg-length
- Unsupervised learning
37Uses of Profile HMMs
- Identify new proteins of a known family
- Match a profile HMM with a database of sequences
- Score a sequence by likelihood ratio (w.r.t. a
background model), apply a threshold - Identify families/domains of a given sequence
- Match a sequence with a database of profile
HMMs, - Return top N domains
- Multiple alignments
- Identify similar sequences Iterative search
38Profile HMMs Major Issues
- Architecture Explain sub-families, more
constrained (motif HMMs) - Local vs. global alignment
- Avoid over-fitting Mixture Dirichlet prior, Use
labeled data - Avoid local-maxima Annealing, labeled data
- Sequence weighting Address sample bias
- Computational efficiency
39Profile HMMs More Details
- Dirichlet Mixture Prior
- Generate an AA from a Dirichlet distribution
Dir(p?) in two-stages - Given observed AA counts, we can estimate the
prior parameters ?s - Assume a mixture of k Dirichlet distributions
Dir(p?) - For each column of multiple alignment
- Assume that the counts (of different AAs) are a
sample of the mixture model
40Protein Structure Prediction with HMMs
- SAM-T98
- Best method that made use of no direct structural
information at CASP 3 (Current Assessment of
Structure Prediction) - Create a model of your target sequence
- Search a database of proteins using that model
- Whichever sequence scores highest, predict that
structure
41How do we build a model using only one sequence?
42Application Example Pfam (HMMER)
- Pfam is a large collection of protein multiple
sequence alignments and profile hidden Markov
models. Pfam is available on the World Wide Web
in the UK,, Sweden, , France, , US. The latest
version (6.6) of Pfam contains 3071 families,
which match 69 of proteins in SWISS-PROT 39 and
TrEMBL 14. Structural data, where available, have
been utilised to ensure that Pfam families
correspond with structural domains, and to
improve domain-based annotation. Predictions of
non-domain regions are now also included. In
addition to secondary structure, Pfam multiple
sequence alignments now contain active site
residue mark-up. New search tools, including
taxonomy search and domain query, greatly add to
the functionality and usability of the Pfam
resource.
43HMM Gene Finders
- Goal Use HMM to find the exact boundary of genes
- Usually Generalized HMMs
- With Class (GeneMark GeneMark.hmm?)
- State Neural Network (Genie)
- Architecture 2 modules interleaved
- Boundary module start codon, stop codon, binding
sites, transcription factors, etc. - Region module exons, introns, etc.
- A lot of domain knowledge encoded
44HMMs Pros Cons
- Advantages
- Statistics
- Modularity
- Transparency
- Prior Knowledge
- Disadvantages
- State independence
- Over-fitting
- Local Maximums
- Speed
45More Applications Discussions
- Ultimately how useful are these algorithms for
biology discovery? - Integrated with biological experiment design
(reinforcement learning?) - Biological verification of patterns/classification
- Evaluation of these algorithms is generally hard
and expensive?
46Some Fundamental Questions
- How powerful should the pattern language be? Is
regular expression sufficient? - How do we formulate biologically meaningful or
biologically motivated classification/extraction
criteria? - How do we evaluate a pattern without expensive
biological experiments?
47The End Thank you!