Local Multiple Sequence Alignment: Sequence Motifs

Slides: 28

Provided by: esti86

Learn more at: http://darwin.informatics.indiana.edu

Transcript and Presenter's Notes

1
Local Multiple Sequence Alignment: Sequence Motifs
2
Motifs

Motifs represent a short common sequence
Regulatory motifs (TF binding sites)
Functional site in proteins (DNA binding motif)

3
Regulatory Motifs

DNA in every cell of an organism is identical
Different cells have different functions
Transcription is a crucial aspect of regulation
Transcription factors (TFs) affect transcription rates
TFs bind to regulatory motifs
Motifs are 6-20 nucleotides long
Activators and repressors
Usually located near the target gene, mostly upstream

[Figure: SBF and MCM1 bound at the SBF motif and the MCM1 motif, upstream of the transcription start site of Gene X]
4
E. coli promoter sequences
5
Challenges

How to recognize a regulatory motif?
Can we identify new occurrences of known motifs
in genome sequences?
Can we discover new motifs within upstream
sequences of genes?

6
1. Motif Representation

Exact motif: CGGATATA
A consensus represents only the deterministic nucleotides.
Example: HAP1 binding sites in 5 sequences.
Consensus motif: CGGNNNTANCGG
N stands for any nucleotide.
Representing only the consensus loses information.
How can this be avoided?

CGGATATACCGG
CGGTGATAGCGG
CGGTACTAACGG
CGGCGGTAACGG
CGGCCCTAACGG
------------
CGGNNNTANCGG
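
The consensus rule used on this slide (keep a base where all aligned sites agree, write N elsewhere) can be sketched in a few lines of Python. The function name `consensus` is hypothetical, and real consensus notations often use the full IUPAC ambiguity codes rather than only N; this is a minimal sketch of the slide's simpler rule.

```python
def consensus(sites):
    """Return the shared base where every aligned site agrees, else 'N'."""
    if len({len(s) for s in sites}) != 1:
        raise ValueError("all sites must be non-empty in number and equal in length")
    out = []
    for column in zip(*sites):  # walk the alignment column by column
        out.append(column[0] if len(set(column)) == 1 else "N")
    return "".join(out)


# The five HAP1 binding sites listed on the slide.
hap1_sites = [
    "CGGATATACCGG",
    "CGGTGATAGCGG",
    "CGGTACTAACGG",
    "CGGCGGTAACGG",
    "CGGCCCTAACGG",
]

print(consensus(hap1_sites))  # CGGNNNTANCGG
```

The output hides how often each base occurs in the N columns, which is the information loss the slide points out.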
7
Transcription start site: consensus considerations

-35 hexamer: TTGACA
Spacer: 15-19 bases
-10 hexamer: TATAAT
Interval: 5-9 bases

A weight matrix contains more information.
Weight matrix for the -35 hexamer (positions 1-6), based on 450 known promoters:

    1    2    3    4    5    6
A   0.1  0.1  0.1  0.5  0.2  0.5
T   0.7  0.7  0.2  0.2  0.2  0.2
G   0.1  0.1  0.5  0.1  0.1  0.2
C   0.1  0.1  0.2  0.2  0.5  0.1

[Weight matrix for the -10 hexamer: values not preserved]
9
PSPM: Position-Specific Probability Matrix

Represents a motif of length k
Defines $P_i(n)$ for $n \in \{A, C, G, T\}$ and $i = 1, \dots, k$.
$P_i(A)$: frequency of nucleotide A in position i.
Each k-mer is assigned a probability.
Example: $P(\text{TCCAG}) = 0.5 \times 0.25 \times 0.8 \times 0.7 \times 0.2 = 0.014$

    1    2     3     4    5
A   0.1  0.25  0.05  0.7  0.6
C   0.3  0.25  0.8   0.1  0.15
T   0.5  0.25  0.05  0.1  0.05
G   0.1  0.25  0.1   0.1  0.2
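
As a minimal sketch of how a PSPM assigns a probability to a k-mer, the matrix above can be stored as a dictionary of rows and the per-position probabilities multiplied together. The names `PSPM` and `kmer_probability` are illustrative choices, not part of the presentation.

```python
import math

# The 5-column PSPM from the slide, one row per nucleotide.
PSPM = {
    "A": [0.1, 0.25, 0.05, 0.7, 0.6],
    "C": [0.3, 0.25, 0.8, 0.1, 0.15],
    "T": [0.5, 0.25, 0.05, 0.1, 0.05],
    "G": [0.1, 0.25, 0.1, 0.1, 0.2],
}


def kmer_probability(kmer, pspm):
    """Multiply the probability of each base at its position in the motif."""
    if len(kmer) != len(pspm["A"]):
        raise ValueError("k-mer length must equal the motif length")
    return math.prod(pspm[base][i] for i, base in enumerate(kmer))


print(round(kmer_probability("TCCAG", PSPM), 4))  # 0.014
```

The product form treats motif positions as independent, which is the assumption built into a PSPM.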
10
Graphical Representation: Sequence Logo

Horizontal axis: position of the base in the sequence.
Vertical axis: amount of information.
Letter stack order indicates importance.
Letter height indicates frequency.
The consensus can be read across the top of the letter columns.
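
The slide does not give the formula behind the vertical axis. A common convention, assumed here, measures the information of a DNA column as 2 bits minus its Shannon entropy and scales each letter by its frequency; the small-sample correction used by some logo tools is left out. The function names are hypothetical.

```python
import math


def column_information(probs):
    """Information content in bits of one DNA column: 2 - Shannon entropy."""
    entropy = -sum(p * math.log2(p) for p in probs if p > 0)
    return 2.0 - entropy


def letter_heights(column):
    """Height of each letter: its frequency times the column's information."""
    info = column_information(column.values())
    return {base: p * info for base, p in column.items()}


# A uniform column carries no information; a fully conserved one carries 2 bits.
print(column_information([0.25, 0.25, 0.25, 0.25]))
print(column_information([1.0, 0.0, 0.0, 0.0]))
print(letter_heights({"A": 0.05, "C": 0.8, "T": 0.05, "G": 0.1}))
```

Under this convention a position where all four bases are equally likely contributes nothing to the logo, while a perfectly conserved position reaches the maximum height of 2 bits.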

11
2. Identification of Known Motifs within Genomic Sequences

Motivation:
Identify new genes controlled by the same TF.
Infer the function of these genes.
Enable better understanding of the regulation mechanism.

14
Detecting a Known Motif within a Sequence using PSPM

The PSPM is moved along the query sequence.
At each position the sub-sequence is scored for a match to the PSPM.
Example sequence: ATGCAAGTCT
Position 1: ATGCA, $0.1 \times 0.25 \times 0.1 \times 0.1 \times 0.6 = 1.5 \times 10^{-4}$
Position 2: TGCAA, $0.5 \times 0.25 \times 0.8 \times 0.7 \times 0.6 = 0.042$

    1    2     3     4    5
A   0.1  0.25  0.05  0.7  0.6
C   0.3  0.25  0.8   0.1  0.15
T   0.5  0.25  0.05  0.1  0.05
G   0.1  0.25  0.1   0.1  0.2
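
The sliding-window procedure on this slide can be sketched as follows. The function name `scan` and the choice to report 1-based positions (matching the slide's "Position 1", "Position 2") are assumptions for illustration.

```python
import math

# The PSPM from the slide, one row per nucleotide.
PSPM = {
    "A": [0.1, 0.25, 0.05, 0.7, 0.6],
    "C": [0.3, 0.25, 0.8, 0.1, 0.15],
    "T": [0.5, 0.25, 0.05, 0.1, 0.05],
    "G": [0.1, 0.25, 0.1, 0.1, 0.2],
}


def scan(sequence, pspm):
    """Score every window of motif length; return (position, window, probability)."""
    k = len(pspm["A"])
    hits = []
    for start in range(len(sequence) - k + 1):
        window = sequence[start:start + k]
        prob = math.prod(pspm[base][i] for i, base in enumerate(window))
        hits.append((start + 1, window, prob))  # 1-based position, as on the slide
    return hits


for position, window, prob in scan("ATGCAAGTCT", PSPM):
    print(position, window, f"{prob:.2e}")
```

For the example sequence the window at position 2 (TGCAA) scores far above the others, in line with the two values worked out on the slide.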
15
Detecting a Known Motif within a Sequence using PSSM

Is it a random match, or is it indeed an occurrence of the motif?
PSPM → PSSM (Position-Specific Scoring Matrix)
Odds score matrix: $O_i(n)$, where $n \in \{A, C, G, T\}$, for $i = 1, \dots, k$,
defined as $O_i(n) = P_i(n) / P(n)$, where $P(n)$ is the background frequency of nucleotide n.
The higher $O_i(n)$, the higher the odds that n at position i is part of a real motif.

16
PSSM as Odds Score Matrix

Assumption: the background frequency of each nucleotide is 0.25.
Going to log scale gives an additive score: the log-odds matrix ($\log_2 O_i$).

Original PSPM ($P_i$), row A:
    1       2     3       4      5
A   0.1     0.25  0.05    0.7    0.6

Odds matrix ($O_i$), row A:
    1       2     3       4      5
A   0.4     1     0.2     2.8    2.4

Log-odds matrix ($\log_2 O_i$), row A:
    1       2     3       4      5
A   -1.322  0     -2.322  1.485  1.263
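
The conversion from probabilities to log-odds can be sketched as below, assuming the uniform background of 0.25 stated on the slide. The function name `log_odds_matrix` is hypothetical, and the sketch does not handle zero probabilities, for which the logarithm is undefined.

```python
import math

# The PSPM used throughout the slides.
PSPM = {
    "A": [0.1, 0.25, 0.05, 0.7, 0.6],
    "C": [0.3, 0.25, 0.8, 0.1, 0.15],
    "T": [0.5, 0.25, 0.05, 0.1, 0.05],
    "G": [0.1, 0.25, 0.1, 0.1, 0.2],
}
BACKGROUND = {base: 0.25 for base in "ACGT"}  # uniform background, as assumed on the slide


def log_odds_matrix(pspm, background):
    """Return log2(P_i(n) / P(n)) for every base n and position i.

    Raises ValueError if any probability is zero (log2 of 0 is undefined).
    """
    return {
        base: [math.log2(p / background[base]) for p in row]
        for base, row in pspm.items()
    }


pssm = log_odds_matrix(PSPM, BACKGROUND)
print([round(x, 3) for x in pssm["A"]])  # [-1.322, 0.0, -2.322, 1.485, 1.263]
```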
17
Calculating using Log Odds Matrix

Log-odds score ≤ 0 implies a random match; log-odds score > 0 implies a real match (?).
Example sequence: ATGCAAGTCT
Position 1: ATGCA, $-1.32 + 0 - 1.32 - 1.32 + 1.26 = -2.7$; odds $= 2^{-2.7} \approx 0.15$
Position 2: TGCAA, $1 + 0 + 1.68 + 1.48 + 1.26 = 5.42$; odds $= 2^{5.42} \approx 42.8$

    1      2  3      4      5
A   -1.32  0  -2.32  1.48   1.26
C   0.26   0  1.68   -1.32  -0.74
T   1      0  -2.32  -1.32  -2.32
G   -1.32  0  -1.32  -1.32  -0.32
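
Because log-odds are additive, scoring a window reduces to a sum of table lookups. The sketch below uses the rounded matrix from this slide; the names `PSSM` and `scan_log_odds` are illustrative, and the 0 threshold is the slide's own tentative rule rather than an established cutoff.

```python
# Rounded log-odds matrix from the slide, one row per nucleotide.
PSSM = {
    "A": [-1.32, 0.0, -2.32, 1.48, 1.26],
    "C": [0.26, 0.0, 1.68, -1.32, -0.74],
    "T": [1.0, 0.0, -2.32, -1.32, -2.32],
    "G": [-1.32, 0.0, -1.32, -1.32, -0.32],
}


def scan_log_odds(sequence, pssm):
    """Return (position, window, log-odds score) for every window of motif length."""
    k = len(pssm["A"])
    hits = []
    for start in range(len(sequence) - k + 1):
        window = sequence[start:start + k]
        score = sum(pssm[base][i] for i, base in enumerate(window))
        hits.append((start + 1, window, score))  # 1-based position
    return hits


for position, window, score in scan_log_odds("ATGCAAGTCT", PSSM):
    label = "candidate" if score > 0 else "background-like"
    print(position, window, round(score, 2), round(2 ** score, 2), label)
```

Raising 2 to the summed score recovers the odds, so the first two windows reproduce the slide's values of about 0.15 and 42.8.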
21
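Calculating the probability of a match

ATGCAAG
Position 1: ATGCA, odds 0.15
Position 2: TGCAA, odds 42.8
Position 3: GCAAG, odds 0.18

$P(i) = S_i / \sum_j S_j$, where $S_i$ is the odds score at position i.
Example: $P(1) = 0.15 / (0.15 + 42.8 + 0.18) \approx 0.003$
$P(1) \approx 0.003$, $P(2) \approx 0.992$, $P(3) \approx 0.004$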
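
The normalisation on this slide divides each odds score by the sum over all windows. Reading the result as the probability that the motif starts at a given position implicitly assumes exactly one occurrence in the sequence and no prior preference for any position; that reading is an interpretation, not something the slide states. The function name is hypothetical.

```python
def match_probabilities(odds_scores):
    """Normalise per-position odds scores so that they sum to 1."""
    total = sum(odds_scores)
    if total <= 0:
        raise ValueError("odds scores must have a positive sum")
    return [score / total for score in odds_scores]


# Odds for the three windows of ATGCAAG (ATGCA, TGCAA, GCAAG).
odds = [0.15, 42.8, 0.18]
for position, p in enumerate(match_probabilities(odds), start=1):
    print(f"P({position}) = {p:.3f}")
```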
22
Building a PSSM

Collect all known sequences that bind a certain
TF.
Align all sequences (using multiple sequence
alignment).
Compute the frequency of each nucleotide in each
position (PSPM).
Incorporate background frequency for each
nucleotide (PSSM).
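
The last two steps can be sketched in Python, starting from sites that are already aligned. Two things here are assumptions rather than part of the slide: a pseudocount of 1 per base, added so that a base never seen in a column does not produce log2(0), and a uniform background of 0.25. The function names are hypothetical, and the HAP1 sites from the earlier slide serve only as sample input.

```python
import math
from collections import Counter

BASES = "ACGT"


def build_pspm(aligned_sites, pseudocount=1.0):
    """Per-position base frequencies from equal-length aligned sites."""
    length = len(aligned_sites[0])
    if any(len(site) != length for site in aligned_sites):
        raise ValueError("aligned sites must all have the same length")
    pspm = {base: [] for base in BASES}
    for column in zip(*aligned_sites):
        counts = Counter(column)
        total = len(column) + pseudocount * len(BASES)
        for base in BASES:
            pspm[base].append((counts[base] + pseudocount) / total)
    return pspm


def build_pssm(pspm, background=None):
    """Log2 odds of each frequency against the background frequency."""
    background = background or {base: 0.25 for base in BASES}
    return {
        base: [math.log2(p / background[base]) for p in row]
        for base, row in pspm.items()
    }


hap1_sites = [
    "CGGATATACCGG",
    "CGGTGATAGCGG",
    "CGGTACTAACGG",
    "CGGCGGTAACGG",
    "CGGCCCTAACGG",
]
pspm = build_pspm(hap1_sites)
pssm = build_pssm(pspm)
print([round(x, 2) for x in pssm["C"]])
```

With a pseudocount of 0 the frequencies match raw counts, but any base absent from a column then makes `build_pssm` fail, which is why some smoothing is usually applied.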

23
PROBLEMS

When searching for a motif in a genome using a PSSM or other methods, the motif is usually found all over the place.
→ The motif is considered real if it is found in the vicinity of a gene.
When the binding sites of a specific TF are checked experimentally (location analysis), the sites that bind the TF are in some cases similar to the PSSM and in other cases not.

24
3. Finding new Motifs

We are given a group of genes, which presumably
contain a common regulatory motif.
We know nothing of the TF that binds to the
putative motif.
The problem: discover the motif.

25
Difficulties in Computational Identification

Each motif can appear in any of about $m-k$ columns (exactly $m-k+1$ start positions) in each of the n sequences, where m is the sequence length and k the motif length: there are about $(m-k)^n$ possibilities.
Noise: mismatches are allowed, the motif is not exact. Not all sequences contain the motif.
Statistical significance: k is short (6-20 nucleotides). m ranges from 10s (prokaryotes) to 1000s (eukaryotes) of nucleotides.
→ A random motif can appear by chance in sequences.
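
To give a feel for the size of this search space, the sketch below counts the ways of choosing one motif start per sequence. It uses the exact number of start positions, m - k + 1, where the slide approximates it as m - k; the function name and the sample values (10 sequences of 1000 nucleotides, motif length 10) are hypothetical.

```python
def alignment_search_space(n_sequences, seq_length, motif_length):
    """Number of ways to pick one motif start position in each sequence."""
    starts = seq_length - motif_length + 1  # valid start positions per sequence
    if starts < 1:
        raise ValueError("motif is longer than the sequence")
    return starts ** n_sequences


# Hypothetical example: 10 upstream regions of 1000 nt, motif of length 10.
space = alignment_search_space(10, 1000, 10)
print(f"{space:.2e} candidate alignments")
```

Even this modest example is far beyond exhaustive evaluation, which is why the methods on the next slide rely on sampling, iterative refinement, or restricted enumeration.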

26
Computational Methods

This problem has received a lot of attention from computer scientists.
Methods include:
Probabilistic methods: hidden Markov models (HMMs), expectation maximization (EM), Gibbs sampling, etc.
Enumeration methods: problematic for inexact motifs of length k > 10.
Current status: the problem is still open.
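
As a rough illustration of the enumeration approach, and of why it becomes impractical for longer motifs, the sketch below tries all 4^k candidate k-mers and counts how many sequences contain each one within a given number of mismatches. This is a naive toy version, not the algorithm of any specific tool; the function names and the three toy sequences are hypothetical.

```python
from itertools import product


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def occurs(kmer, sequence, max_mismatches):
    """True if some window of the sequence is within max_mismatches of kmer."""
    k = len(kmer)
    return any(
        hamming(kmer, sequence[i:i + k]) <= max_mismatches
        for i in range(len(sequence) - k + 1)
    )


def enumerate_motifs(sequences, k, max_mismatches):
    """Return the k-mers found in the most sequences, and that sequence count."""
    best, best_support = [], 0
    for letters in product("ACGT", repeat=k):  # 4**k candidates
        kmer = "".join(letters)
        support = sum(occurs(kmer, s, max_mismatches) for s in sequences)
        if support > best_support:
            best, best_support = [kmer], support
        elif support == best_support and support > 0:
            best.append(kmer)
    return best, best_support


toy_sequences = ["TTACGTAA", "GGACGTCC", "ACGTTTTT"]
print(enumerate_motifs(toy_sequences, k=4, max_mismatches=0))  # (['ACGT'], 3)
```

The outer loop grows as 4^k, so at k = 10 there are already more than a million candidates to test against every window of every sequence, before allowing for mismatches.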

27
Tools on the Web

MEME: Multiple EM for Motif Elicitation.
http://meme.sdsc.edu/meme/website/
metaMEME: uses the HMM method.
http://meme.sdsc.edu/meme
MAST: Motif Alignment and Search Tool.
http://meme.sdsc.edu/meme
TRANSFAC: database of eukaryotic cis-acting regulatory DNA elements and trans-acting factors.
http://transfac.gbf.de/TRANSFAC/
eMotif: scans, builds, and searches for motifs at the protein level.
http://motif.stanford.edu/emotif/