Title: Local Multiple Sequence Alignment: Sequence Motifs
1 Local Multiple Sequence Alignment: Sequence Motifs
2 Motifs
- A motif is a short sequence pattern common to several sequences
- Regulatory motifs (TF binding sites)
- Functional site in proteins (DNA binding motif)
3 Regulatory Motifs
- The DNA in every cell of an organism is essentially identical
- Different cells have different functions
- Transcription is a crucial aspect of regulation
- Transcription factors (TFs) affect transcription rates
- TFs bind to regulatory motifs
- Motifs are 6-20 nucleotides long
- TFs act as activators or repressors
- Motifs are usually located near the target gene, mostly upstream
[Figure: the TFs SBF and MCM1 bound to the SBF motif and the MCM1 motif, upstream of the transcription start site of Gene X]
4 E. coli promoter sequences
5 Challenges
- How to recognize a regulatory motif?
- Can we identify new occurrences of known motifs in genome sequences?
- Can we discover new motifs within upstream sequences of genes?
6 1. Motif Representation
- Exact motif: CGGATATA
- A consensus represents only the deterministic (invariant) nucleotides.
- Example: HAP1 binding sites in 5 sequences.

CGGATATACCGG
CGGTGATAGCGG
CGGTACTAACGG
CGGCGGTAACGG
CGGCCCTAACGG
------------
CGGNNNTANCGG

- Consensus motif: CGGNNNTANCGG
- N stands for any nucleotide.
- Representing only the consensus loses information. How can this be avoided?
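The loss of information can be made concrete with a small sketch. The helper names `consensus` and `column_counts` are illustrative and do not come from the slides; the input is the five HAP1 sites above. The `N` at position 9, for instance, hides that three of the five sites carry an A there.

```python
from collections import Counter

# HAP1 binding sites from the slide (already aligned, equal length).
hap1_sites = [
    "CGGATATACCGG",
    "CGGTGATAGCGG",
    "CGGTACTAACGG",
    "CGGCGGTAACGG",
    "CGGCCCTAACGG",
]

def consensus(aligned, wildcard="N"):
    """Keep a base only where every sequence agrees; otherwise emit the wildcard."""
    return "".join(
        column[0] if len(set(column)) == 1 else wildcard
        for column in zip(*aligned)
    )

def column_counts(aligned):
    """Per-position base counts: the detail that a bare consensus discards."""
    return [Counter(column) for column in zip(*aligned)]

print(consensus(hap1_sites))          # CGGNNNTANCGG
print(column_counts(hap1_sites)[8])   # base counts at position 9 (index 8)
```

Keeping the per-column counts rather than the consensus string is the step that leads to a weight matrix.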
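7 Transcription start site
Consensus considerations
- -35 hexamer: TTGACA
- -10 hexamer: TATAAT
- Spacer (between the two hexamers): 15 - 19 bases
- Interval (between the -10 hexamer and the transcription start site): 5 - 9 bases
A weight matrix contains more information (based on 450 known promoters):
  1   2   3   4   5   6
A 0.1 0.1 0.1 0.5 0.2 0.5
T 0.7 0.7 0.2 0.2 0.2 0.2
G 0.1 0.1 0.5 0.1 0.1 0.2
C 0.1 0.1 0.2 0.2 0.5 0.1
[Figure: weight matrices for the -35 and -10 hexamers. Only the values above survive; their most probable bases spell the -35 consensus TTGACA.]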
9 PSPM: Position Specific Probability Matrix
- Represents a motif of length k.
- Defines $P_i$ over $\{A, C, G, T\}$ for $i = 1, \dots, k$.
- $P_i(A)$ is the frequency of nucleotide A in position i.
- Each k-mer is assigned a probability.
- Example: $P(\mathrm{TCCAG}) = 0.5 \times 0.25 \times 0.8 \times 0.7 \times 0.2 = 0.014$
1 2 3 4 5
A 0.1 0.25 0.05 0.7 0.6
C 0.3 0.25 0.8 0.1 0.15
T 0.5 0.25 0.05 0.1 0.05
G 0.1 0.25 0.1 0.1 0.2
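As a minimal sketch of how a PSPM assigns a probability to a k-mer, the matrix above can be stored as one list per nucleotide and multiplied out position by position. The dictionary layout and the function name `kmer_probability` are illustrative choices, not something the slides prescribe.

```python
import math

# The 5-column PSPM from the slide: PSPM[base][i] is P_{i+1}(base).
PSPM = {
    "A": [0.1, 0.25, 0.05, 0.7, 0.6],
    "C": [0.3, 0.25, 0.8, 0.1, 0.15],
    "T": [0.5, 0.25, 0.05, 0.1, 0.05],
    "G": [0.1, 0.25, 0.1, 0.1, 0.2],
}

def kmer_probability(kmer, pspm):
    """Product of the per-position probabilities of the k-mer's bases."""
    k = len(pspm["A"])
    if len(kmer) != k:
        raise ValueError(f"expected a k-mer of length {k}")
    return math.prod(pspm[base][i] for i, base in enumerate(kmer))

print(kmer_probability("TCCAG", PSPM))  # approximately 0.014
```

The product form assumes that positions are independent of each other, which is the assumption built into a PSPM.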
10 Graphical Representation: Sequence Logo
- Horizontal axis: position of the base in the sequence.
- Vertical axis: amount of information.
- Letter stack order indicates importance.
- Letter height indicates frequency.
- The consensus can be read across the top of the letter columns.
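The slides do not say how the "amount of information" is computed. A common convention for DNA logos, assumed here, is 2 bits minus the Shannon entropy of the column (ignoring any small-sample correction), with each letter drawn at a height of its frequency times that information. The sketch below applies this to the 5-column PSPM used in the earlier examples; the function names are illustrative.

```python
import math

# PSPM from the earlier slides: PSPM[base][i] is the frequency at position i+1.
PSPM = {
    "A": [0.1, 0.25, 0.05, 0.7, 0.6],
    "C": [0.3, 0.25, 0.8, 0.1, 0.15],
    "T": [0.5, 0.25, 0.05, 0.1, 0.05],
    "G": [0.1, 0.25, 0.1, 0.1, 0.2],
}

def column_information(pspm, i):
    """Information in bits at column i: 2 - entropy (assumed logo convention)."""
    entropy = -sum(
        probs[i] * math.log2(probs[i]) for probs in pspm.values() if probs[i] > 0
    )
    return 2.0 - entropy

def letter_heights(pspm, i):
    """Height of each letter at column i: frequency times column information."""
    info = column_information(pspm, i)
    return {base: probs[i] * info for base, probs in pspm.items()}

for i in range(5):
    # Sort so the tallest (most frequent) letter comes first, i.e. on top.
    stack = sorted(letter_heights(PSPM, i).items(), key=lambda kv: -kv[1])
    print(i + 1, round(column_information(PSPM, i), 3), stack[0][0])
```

Under this convention a uniform column such as position 2 carries no information and would be drawn with zero height.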
11 2. Identification of Known Motifs within Genomic Sequences
- Motivation:
- Identify new genes controlled by the same TF.
- Infer the function of these genes.
- Enable a better understanding of the regulation mechanism.
14 Detecting a Known Motif within a Sequence using PSPM
- The PSPM is moved along the query sequence.
- At each position the sub-sequence is scored for a match to the PSPM.
- Example sequence: ATGCAAGTCT
- Position 1: ATGCA, $0.1 \times 0.25 \times 0.1 \times 0.1 \times 0.6 = 1.5 \times 10^{-4}$
- Position 2: TGCAA, $0.5 \times 0.25 \times 0.8 \times 0.7 \times 0.6 = 0.042$
1 2 3 4 5
A 0.1 0.25 0.05 0.7 0.6
C 0.3 0.25 0.8 0.1 0.15
T 0.5 0.25 0.05 0.1 0.05
G 0.1 0.25 0.1 0.1 0.2
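The sliding-window procedure can be sketched as follows. The function name `scan` and the tuple layout of its result are illustrative assumptions; the matrix and the query sequence are the ones from the slide.

```python
import math

PSPM = {
    "A": [0.1, 0.25, 0.05, 0.7, 0.6],
    "C": [0.3, 0.25, 0.8, 0.1, 0.15],
    "T": [0.5, 0.25, 0.05, 0.1, 0.05],
    "G": [0.1, 0.25, 0.1, 0.1, 0.2],
}

def scan(sequence, pspm):
    """Score every length-k window; returns (1-based position, window, probability)."""
    k = len(pspm["A"])
    hits = []
    for start in range(len(sequence) - k + 1):
        window = sequence[start:start + k]
        prob = math.prod(pspm[base][i] for i, base in enumerate(window))
        hits.append((start + 1, window, prob))
    return hits

for position, window, prob in scan("ATGCAAGTCT", PSPM):
    print(position, window, f"{prob:.2e}")
```

A sequence of length 10 scanned with a motif of length 5 yields six windows; positions 1 and 2 reproduce the values worked out on the slide.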
15 Detecting a Known Motif within a Sequence using PSSM
- Is it a random match, or is it indeed an occurrence of the motif?
- PSPM -> PSSM (Position Specific Scoring Matrix).
- Odds score matrix: $O_i(n)$, where $n \in \{A, C, G, T\}$, for $i = 1, \dots, k$.
- Defined as $O_i(n) = P_i(n) / P(n)$, where $P(n)$ is the background frequency of nucleotide n.
- The larger $O_i(n)$, the higher the odds that n at position i is part of a real motif.
16 PSSM as Odds Score Matrix
- Assumption: the background frequency of each nucleotide is 0.25.
- Original PSPM ($P_i$), row A:
  1    2    3    4   5
A 0.1  0.25 0.05 0.7 0.6
- Odds matrix ($O_i = P_i / 0.25$), row A:
  1    2    3    4   5
A 0.4  1    0.2  2.8 2.4
- Going to log scale we get an additive score, the log-odds matrix ($\log_2 O_i$), row A:
  1      2 3      4     5
A -1.322 0 -2.322 1.485 1.263
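The conversion from probabilities to log-odds can be sketched in a few lines, assuming the uniform background of 0.25 stated on the slide. The name `log_odds_matrix` is illustrative. Every entry of this PSPM is positive; a matrix containing zeros would need pseudocounts before taking logarithms, which is an assumption beyond what the slides cover.

```python
import math

PSPM = {
    "A": [0.1, 0.25, 0.05, 0.7, 0.6],
    "C": [0.3, 0.25, 0.8, 0.1, 0.15],
    "T": [0.5, 0.25, 0.05, 0.1, 0.05],
    "G": [0.1, 0.25, 0.1, 0.1, 0.2],
}

def log_odds_matrix(pspm, background=0.25):
    """log2(P_i(n) / P(n)) for every base n and position i."""
    return {
        base: [math.log2(p / background) for p in probs]
        for base, probs in pspm.items()
    }

pssm = log_odds_matrix(PSPM)
print([round(x, 3) for x in pssm["A"]])
```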
17 Calculating using Log Odds Matrix
- A log-odds score ≤ 0 implies a random match; a log-odds score > 0 implies a real match (?).
- Example sequence: ATGCAAGTCT
- Position 1: ATGCA, $-1.32 + 0 - 1.32 - 1.32 + 1.26 = -2.7$; odds $= 2^{-2.7} = 0.15$
- Position 2: TGCAA, $1 + 0 + 1.68 + 1.48 + 1.26 = 5.42$; odds $= 2^{5.42} = 42.8$
1 2 3 4 5
A -1.32 0 -2.32 1.48 1.26
C 0.26 0 1.68 -1.32 -0.74
T 1 0 -2.32 -1.32 -2.32
G -1.32 0 -1.32 -1.32 -0.32
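Because the scores are logarithms, scoring a window reduces to a sum of table lookups. The sketch below uses the rounded log-odds table exactly as printed on the slide; `log_odds_score` and `scan_log_odds` are illustrative names.

```python
# Log-odds matrix copied from the slide (values rounded to two decimals).
LOG_ODDS = {
    "A": [-1.32, 0, -2.32, 1.48, 1.26],
    "C": [0.26, 0, 1.68, -1.32, -0.74],
    "T": [1, 0, -2.32, -1.32, -2.32],
    "G": [-1.32, 0, -1.32, -1.32, -0.32],
}

def log_odds_score(window, matrix):
    """Additive score: sum of the log-odds entries of the window's bases."""
    return sum(matrix[base][i] for i, base in enumerate(window))

def scan_log_odds(sequence, matrix):
    """Return (1-based position, window, log-odds score, odds) for each window."""
    k = len(matrix["A"])
    results = []
    for start in range(len(sequence) - k + 1):
        window = sequence[start:start + k]
        score = log_odds_score(window, matrix)
        results.append((start + 1, window, score, 2 ** score))
    return results

for position, window, score, odds in scan_log_odds("ATGCAAGTCT", LOG_ODDS):
    print(position, window, round(score, 2), round(odds, 2))
```

Working in log space also avoids multiplying many small probabilities, which matters for longer motifs.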
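21 Calculating the probability of a match
- Sequence: ATGCAAG
- Position 1: ATGCA, odds S = 0.15
- Position 2: TGCAA, odds S = 42.8
- Position 3: GCAAG, odds S = 0.18
- $P(i) = S_i / \sum_j S_j$
- Example: $P(1) = 0.15 / (0.15 + 42.8 + 0.18) \approx 0.003$
- $P(1) \approx 0.003$, $P(2) \approx 0.992$, $P(3) \approx 0.004$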
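The normalisation step is small enough to sketch directly. It takes the odds at each candidate position and divides by their sum; reading the result as a probability assumes that the motif occurs exactly once in the sequence and that every position is equally likely beforehand, which is an interpretation rather than something the slide states. The name `match_probabilities` is illustrative.

```python
def match_probabilities(odds):
    """Normalise per-position odds so that they sum to 1."""
    total = sum(odds)
    return [s / total for s in odds]

# Odds for ATGCA, TGCAA and GCAAG within the sequence ATGCAAG.
odds = [0.15, 42.8, 0.18]
probabilities = match_probabilities(odds)
print([round(p, 3) for p in probabilities])
```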
22 Building a PSSM
- Collect all known sequences that bind a certain TF.
- Align all sequences (using multiple sequence alignment).
- Compute the frequency of each nucleotide in each position (PSPM).
- Incorporate the background frequency for each nucleotide (PSSM).
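The last two steps can be sketched as below, taking already collected and aligned sites as input (the first two steps are assumed done). The function names, the add-one pseudocount and the uniform default background are assumptions of this sketch: the pseudocount is not mentioned in the slides and is included only so that a base never observed in a column does not produce log2(0).

```python
import math

BASES = "ACGT"

def build_pspm(aligned, pseudocount=1.0):
    """Column-wise base frequencies of equal-length aligned sites."""
    k = len(aligned[0])
    if any(len(site) != k for site in aligned):
        raise ValueError("all aligned sites must have the same length")
    pspm = {base: [] for base in BASES}
    for column in zip(*aligned):
        total = len(column) + len(BASES) * pseudocount
        for base in BASES:
            pspm[base].append((column.count(base) + pseudocount) / total)
    return pspm

def build_pssm(pspm, background=None):
    """Log-odds of each frequency against the background frequency."""
    if background is None:
        background = {base: 0.25 for base in BASES}  # assumed uniform background
    return {
        base: [math.log2(p / background[base]) for p in probs]
        for base, probs in pspm.items()
    }

# Example input: the five aligned HAP1 sites from the consensus slide.
sites = ["CGGATATACCGG", "CGGTGATAGCGG", "CGGTACTAACGG",
         "CGGCGGTAACGG", "CGGCCCTAACGG"]
pspm = build_pspm(sites)
pssm = build_pssm(pspm)
print(round(pssm["C"][0], 3), round(pssm["A"][0], 3))
```

With only five sites the pseudocount has a large influence; with hundreds of known sites its effect becomes negligible.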
23 PROBLEMS
- When searching for a motif in a genome using a PSSM or other methods, the motif is usually found all over the place.
- -> The motif is considered real if found in the vicinity of a gene.
- When checking experimentally for the binding sites of a specific TF (location analysis), the sites that bind the TF are in some cases similar to the PSSM and sometimes not!
24 3. Finding new Motifs
- We are given a group of genes, which presumably contain a common regulatory motif.
- We know nothing of the TF that binds to the putative motif.
- The problem: discover the motif.
25 Difficulties in Computational Identification
- Given n sequences of length m and a motif of length k, the motif can start at any of $m-k+1$ columns in each sequence: there are $(m-k+1)^n$ possibilities.
- Noise: mismatches are allowed, the motif is not exact. Not all sequences contain the motif.
- Statistical significance: k is short (6-20 nucleotides), while m ranges from 10s (prokaryotes) to 1000s (eukaryotes) of nucleotides. -> A random motif can appear by chance in the sequences.
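To get a feel for these numbers, the sketch below counts the ways to place one motif occurrence in each sequence, using $m-k+1$ start positions per sequence (a window of length k fits at that many offsets), and estimates how often a fixed k-mer is expected to appear by chance. The uniform, independent background of 0.25 per base and both function names are assumptions of the sketch.

```python
def placement_count(m, k, n):
    """Ways to choose one motif start in each of n sequences of length m."""
    return (m - k + 1) ** n

def expected_chance_occurrences(m, k, n, base_probability=0.25):
    """Expected exact occurrences of one fixed k-mer under a uniform background."""
    return n * (m - k + 1) * base_probability ** k

# Hypothetical setting: 10 upstream regions of 1000 nucleotides each.
print(placement_count(m=1000, k=6, n=10))
print(expected_chance_occurrences(m=1000, k=6, n=10))
print(expected_chance_occurrences(m=1000, k=20, n=10))
```

For short motifs the expected number of chance occurrences exceeds one, which is why a match alone is weak evidence.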
26 Computational Methods
- This problem has received a lot of attention from CS people.
- Methods include:
- Probabilistic methods: hidden Markov models (HMMs), expectation maximization (EM), Gibbs sampling, etc.
- Enumeration methods: problematic for inexact motifs of length k > 10.
- Current status: the problem is still open.
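Why enumeration struggles can be seen from a brute-force sketch: it tries all $4^k$ candidate k-mers and counts how many sequences contain each one within a given number of mismatches. This is one simple enumeration scheme chosen for illustration, not a specific published tool, and the three toy sequences are made up.

```python
from itertools import product

def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def occurs(candidate, sequence, max_mismatches):
    """True if some window of the sequence is within max_mismatches of the candidate."""
    k = len(candidate)
    return any(
        hamming(candidate, sequence[i:i + k]) <= max_mismatches
        for i in range(len(sequence) - k + 1)
    )

def enumerate_motifs(sequences, k, max_mismatches=1):
    """Return (support, motifs): the k-mers matched by the most sequences."""
    best_support, best = 0, []
    for letters in product("ACGT", repeat=k):  # 4**k candidates
        candidate = "".join(letters)
        support = sum(occurs(candidate, s, max_mismatches) for s in sequences)
        if support > best_support:
            best_support, best = support, [candidate]
        elif support == best_support:
            best.append(candidate)
    return best_support, best

# Toy data: ACGT planted exactly twice and once with one mismatch (ACTT).
toy = ["TTACGTAA", "GGACGTCC", "CAACTTGG"]
support, motifs = enumerate_motifs(toy, k=4, max_mismatches=1)
print(support, "ACGT" in motifs)
```

At k = 4 there are 256 candidates; at k = 12 there are already about 16.8 million, each of which must be compared against every window of every sequence.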
27 Tools on the Web
- MEME: Multiple EM for Motif Elicitation. http://meme.sdsc.edu/meme/website/
- metaMEME: uses the HMM method. http://meme.sdsc.edu/meme
- MAST: Motif Alignment and Search Tool. http://meme.sdsc.edu/meme
- TRANSFAC: database of eukaryotic cis-acting regulatory DNA elements and trans-acting factors. http://transfac.gbf.de/TRANSFAC/
- eMotif: lets you scan, make and search for motifs at the protein level. http://motif.stanford.edu/emotif/