Motif Finding Workshop Project - PowerPoint PPT Presentation

Slides: 31

Provided by: rsha7

1
Motif Finding Workshop Project
Chaim Linhart January 2008
2
Outline

1. Some background again
2. The project

3
1. Background

Slides with Ron Shamir and Adi Akavia

4
Gene: from DNA to protein
DNA -> (transcription) -> Pre-mRNA -> (splicing) -> Mature mRNA -> (translation) -> protein
5
DNA

DNA: a string over the alphabet of 4 bases (nucleotides): A, C, G, T
Resides in chromosomes
Complementary strands: A-T, C-G
Forward/sense strand: AACTTGCG
Reverse-complement/anti-sense strand: TTGAACGC (shown aligned under the forward strand, i.e. written 3' to 5'; read 5' to 3' it is CGCAAGTT)
Directional: from 5' to 3'
5' end (upstream) AACTTGCGATACTCCTA (downstream) 3' end
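
The base-pairing rules on this slide translate directly into code. The following is a minimal Java sketch, not part of the original slides; the class and method names (`DnaStrand`, `complement`, `reverseComplement`) are hypothetical, and passing `N` through unchanged is an assumption based on the input description later in the deck.

```java
// Hypothetical helper class, for illustration only.
final class DnaStrand {

    /** Complement of one base (A-T, C-G). N (unknown/masked) is assumed to stay N. */
    static char complement(char base) {
        switch (Character.toUpperCase(base)) {
            case 'A': return 'T';
            case 'T': return 'A';
            case 'C': return 'G';
            case 'G': return 'C';
            case 'N': return 'N';
            default:
                throw new IllegalArgumentException("Not a DNA base: " + base);
        }
    }

    /** Reverse complement, i.e. the opposite strand read in its own 5' to 3' direction. */
    static String reverseComplement(String seq) {
        StringBuilder sb = new StringBuilder(seq.length());
        for (int i = seq.length() - 1; i >= 0; i--) {
            sb.append(complement(seq.charAt(i)));
        }
        return sb.toString();
    }
}
```

For the forward strand AACTTGCG, `DnaStrand.reverseComplement("AACTTGCG")` gives CGCAAGTT, which is the slide's TTGAACGC read from right to left.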
6
Gene structure (eukaryotes)
Promoter
DNA
Coding strand
Transcription start site (TSS)
Transcription (RNA polymerase)
Pre-mRNA
Exon
Intron
Exon
Splicing (spliceosome)
5' UTR
3' UTR
Mature mRNA
Stop codon
Start codon
Coding region
Translation (ribosome)
Protein
7
Translation

Codon: a triplet of bases that codes for a specific amino acid (except the stop codons); many-to-1 relation
Stop codons: signal termination of the protein synthesis process

http://ntri.tamuk.edu/cell/ribosomes.html
8
Genome sequences

Many genomes have been sequenced, including those
of viruses, microbes, plants and animals.
Human
23 pairs of chromosomes
3 Gbps (bps = base pairs), only 3% are genes
25,000 genes
Yeast
16 chromosomes
20 Mbps
6,500 genes

9
Regulation of Expression

Each cell contains an identical copy of the whole
genome - but utilizes only a subset of the genes
to perform diverse, unique tasks
Most genes are highly regulated: their expression is limited to specific tissues, developmental stages, physiological conditions
Main regulatory mechanism: transcriptional regulation

10
Transcriptional regulation

Transcription is regulated primarily by transcription factors (TFs): proteins that bind to DNA subsequences called binding sites (BSs)
TFBSs are located mainly (not always!) in the gene's promoter: the DNA sequence upstream of the gene's transcription start site (TSS)
BSs of a particular TF share a common pattern, or motif
Some TFs operate together: TF modules

11
TFBS motif models

Consensus (degenerate) string

[AC]ACT[CG]T
[Figure: promoters of genes 1-10 with binding-site occurrences AACTGT, CACTGT, CACTCT, CACTGT, AACTGT]

Statistical models
Motif logo representation
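
To make the two model types concrete, here is a small Java sketch (not from the slides) that builds a position count matrix, the raw material of a statistical model or logo, from the five binding-site occurrences listed on this slide, and derives a bracket-style degenerate consensus from it. The class name `MotifCounts` and the bracket notation are assumptions made for illustration.

```java
// Hypothetical helper class, for illustration only.
final class MotifCounts {
    static final String BASES = "ACGT";

    /** counts[b][p] = number of sites with base BASES.charAt(b) at position p. */
    static int[][] countMatrix(String[] sites) {
        int len = sites[0].length();
        int[][] counts = new int[BASES.length()][len];
        for (String site : sites) {
            if (site.length() != len) {
                throw new IllegalArgumentException("Sites must have equal length");
            }
            for (int pos = 0; pos < len; pos++) {
                int row = BASES.indexOf(Character.toUpperCase(site.charAt(pos)));
                if (row < 0) {
                    throw new IllegalArgumentException("Unexpected base in " + site);
                }
                counts[row][pos]++;
            }
        }
        return counts;
    }

    /** Degenerate consensus: every base observed at a position, bracketed if more than one. */
    static String consensus(int[][] counts) {
        StringBuilder sb = new StringBuilder();
        for (int pos = 0; pos < counts[0].length; pos++) {
            StringBuilder seen = new StringBuilder();
            for (int row = 0; row < BASES.length(); row++) {
                if (counts[row][pos] > 0) {
                    seen.append(BASES.charAt(row));
                }
            }
            if (seen.length() == 1) {
                sb.append(seen);
            } else {
                sb.append('[').append(seen).append(']');
            }
        }
        return sb.toString();
    }
}
```

Calling `consensus(countMatrix(...))` on AACTGT, CACTGT, CACTCT, CACTGT, AACTGT yields `[AC]ACT[CG]T`. The count matrix keeps the extra information a consensus string discards, for example that G is four times as frequent as C at the fifth position.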

12
Human G2+M cell-cycle genes: the CHR + NF-Y module
CDCA3 (trigger of mitotic entry 1): CTCAGCCAATAGGGTCAGGGCAGGGGGCGTGGCGGGAAGTTTGAAACT  -18
CDCA8 (cell division cycle associated 8): TTGTGATTGGATGTTGTGGGA 25bp TGACTGTGGAGTTTGAATTGG  23
CDC2 (cell division control protein 2 homolog): CTCTGATTGGCTGCTTTGAAAGTCTACGGGCTACCCGATTGGTGAATCCGGGGCCCTTTAGCGCGGTGAGTTTGAAACTGCT  0
CDC42EP4 (cdc42 effector protein 4): GCTTTCAGTTTGAACCGAGGA 25bp CGACGGCCATTGGCTGCTGC  -110
CCNB1 (G2/mitotic-specific cyclin B1): AGCCGCCAATGGGAAGGGAG 30bp AGCAGTGCGGGGTTTAAATCT  45
CCNB2 (G2/mitotic-specific cyclin B2): TTCAGCCAATGAGAGT 15bp GTGTTGGCCAATGAGAAC 15bp GGGCCGCCCAATGGGGCGCAAGCGACGCGGTATTTGAATCCTGGA  10
BSs are short, non-specific, hiding in both strands and at various locations along the promoters
TFs: NF-Y, CHR
13
The computational challenge

Given a set of co-regulated genes (e.g., from
gene expression chips)
Find a motif that is over-represented (occurs
unusually often) in their promoters
This may be the TF binding site motif
Find TF modules: over-represented motifs that tend to co-occur

14
The computational challenge (II)

Motifs can also be found without a given target set (genome-wide)
Find a motif that is localized: occurs more often near the TSS of genes
Find a motif with a strand bias: occurs more often on the gene's coding strand
Find TF modules with biases in their order / orientation / distance

15
Motif finding algorithms

>100 motif-finding algorithms
Main differences between them:
Type of analysis / input
Target-set vs. genome-wide
Single vs. multi-species (conservation)
Single motifs vs. modules
Motif model
Score for evaluating motif
Motif search technique
Combinatorial (enumeration) vs. statistical optimization

16
Example - Amadeus
Over-represented motifs in the promoters of genes
expressed in the G2 and G2/M phases of the human
cell cycle
CHR
NF-Y
17
2. The project
18
General goals

Develop software from A-Z
Design
Implementation
(Optimization)
Execution & analysis of real data
A taste of bioinformatics
Have fun
Get credit

19
The computational task

Given a set of DNA sequences
Find interesting pairs of motifs
Order bias
Other scores
Main challenges
Performance (time, memory)
Output redundancy

20
Input

File with DNA sequences in FASTA format
>sequence-name1 <space> header1
ACCCGNNNNTCGGAAATGANN
CGGAGTAAAATATGCGAGCGT
>sequence-name2 <space> header2
cggattnnnaccgcannnnnnnnaccgtga
>sequence-name3 <space> header3
agtttagactgctagctcgatcgcta
gcggatnggctannnnnatctag

21
Input (II)

Ignore the header lines
Sequence may span multiple lines or one long line
Sequence contains the characters A,C,G,T,N in
upper or lower case
N means unknown or masked base
Sample input files will be supplied
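
The input rules above can be sketched as a small FASTA reader. This is a hedged Java illustration, not the supplied course code: the class name `FastaReader`, its methods, the choice to normalize everything to upper case, and the choice to reject characters other than A, C, G, T, N are all assumptions.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, for illustration only.
final class FastaReader {

    /** Reads all sequences; header lines (starting with '>') are ignored except as separators. */
    static List<String> readSequences(Reader in) throws IOException {
        List<String> sequences = new ArrayList<>();
        StringBuilder current = null;
        BufferedReader reader = new BufferedReader(in);
        String line;
        while ((line = reader.readLine()) != null) {
            line = line.trim();
            if (line.isEmpty()) {
                continue;
            }
            if (line.charAt(0) == '>') {
                // A new header closes the previous sequence, which may span several lines.
                if (current != null) {
                    sequences.add(current.toString());
                }
                current = new StringBuilder();
            } else {
                if (current == null) {
                    throw new IOException("Sequence data before the first header line");
                }
                for (int i = 0; i < line.length(); i++) {
                    char c = Character.toUpperCase(line.charAt(i));
                    if ("ACGTN".indexOf(c) < 0) {
                        throw new IOException("Unexpected character '" + c + "'");
                    }
                    current.append(c);
                }
            }
        }
        if (current != null) {
            sequences.add(current.toString());
        }
        return sequences;
    }

    /** Convenience wrapper for in-memory text. */
    static List<String> parse(String text) {
        try {
            return readSequences(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Normalizing case at read time keeps the later motif search simple, since it only ever sees the five upper-case symbols.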

22
Input (III)

Search parameters
Length of motifs (5-10)
Min. & max. distance between the motifs
ACGGATTGATNNNTGGATGCCAT
distance = 9
Single vs. two strands search
Min. number of occurrences (hits) of pair (don't count overlaps, e.g. AAAAAA)
GCGGATTCAGTGATGCCANGNATGCCTCAGGATTGNAATGCCA
[Figure: three hits marked along the sequence]
Max. p-value
Additional parameters
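
A hit of an ordered pair under a distance constraint can be counted as in the Java sketch below. Several things here are assumptions, not statements from the slides: the class name `PairHits`, the definition of distance as the number of bases between the end of the first motif and the start of the second, and the motifs GGATT and ATGCC used in the usage note (the slide does not name the motifs in its example). The sketch searches one strand only and, for brevity, does not filter overlapping occurrences.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, for illustration only.
final class PairHits {

    /** Start positions of all exact occurrences of motif in seq (overlaps included). */
    static List<Integer> occurrences(String seq, String motif) {
        List<Integer> positions = new ArrayList<>();
        for (int i = seq.indexOf(motif); i >= 0; i = seq.indexOf(motif, i + 1)) {
            positions.add(i);
        }
        return positions;
    }

    /**
     * Counts pairs where 'first' occurs upstream of 'second' and the gap between
     * them (assumed definition: bases strictly between the two motifs) lies in
     * [minDist, maxDist].
     */
    static int countOrdered(String seq, String first, String second,
                            int minDist, int maxDist) {
        List<Integer> firstStarts = occurrences(seq, first);
        List<Integer> secondStarts = occurrences(seq, second);
        int hits = 0;
        for (int start : firstStarts) {
            int end = start + first.length();
            for (int other : secondStarts) {
                int gap = other - end;
                if (gap >= minDist && gap <= maxDist) {
                    hits++;
                }
            }
        }
        return hits;
    }
}
```

With the slide's example sequence ACGGATTGATNNNTGGATGCCAT, `countOrdered(seq, "GGATT", "ATGCC", 9, 9)` returns 1 under this definition of distance, and swapping the two motifs returns 0. Calling the method twice, once per order, gives the two counts that the order-bias score compares.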
23
Output

A list of the string pairs with the best order-bias score (smallest p-values)
Motif A   Motif B   A→B   B→A   p-value
ACGTT     GGATT     97    17    4.3E-15
ACGTT     GATTC     87    16    2.7E-13
TTAAC     CAGCC     31    114   1.2E-12
A non-redundant list of motif pairs (motif consensus string)
logos, # of hits, additional scores

24
Part A: String pairs with order bias

nA = # of A→B, nB = # of B→A
WLOG, nA > nB
n = nA + nB
H0: random order, nA ~ B(n, 0.5)
p-value = prob. for at least nA occurrences of A→B = tail of B(n, 0.5)
Normal approximation (central limit thm.)
Fix for multiple testing: x2
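
The order-bias score can be sketched in Java as follows. Note the substitution: this sketch sums the exact tail of B(n, 0.5) in log space instead of using the normal approximation the slide mentions, and it also exposes the z-score on which such an approximation would be based. The class and method names (`OrderBias`, `binomialUpperTail`, `orderBiasPValue`, `zScore`) are hypothetical, and capping the corrected p-value at 1 is an assumption.

```java
// Hypothetical helper class, for illustration only.
final class OrderBias {

    /** P(X >= k) for X ~ Binomial(n, 0.5), summed exactly in log space. */
    static double binomialUpperTail(int n, int k) {
        if (k <= 0) {
            return 1.0;
        }
        if (k > n) {
            return 0.0;
        }
        double[] logFact = new double[n + 1];
        for (int i = 1; i <= n; i++) {
            logFact[i] = logFact[i - 1] + Math.log(i);
        }
        double logHalfPowN = n * Math.log(0.5);
        double sum = 0.0;
        for (int i = k; i <= n; i++) {
            // log of C(n, i) * 0.5^n
            sum += Math.exp(logFact[n] - logFact[i] - logFact[n - i] + logHalfPowN);
        }
        return Math.min(1.0, sum);
    }

    /** Tail for the larger of the two counts, doubled as the x2 multiple-testing fix. */
    static double orderBiasPValue(int countAB, int countBA) {
        int n = countAB + countBA;
        int larger = Math.max(countAB, countBA);
        return Math.min(1.0, 2.0 * binomialUpperTail(n, larger));
    }

    /** Standardized excess of the larger count: (nA - n/2) / sqrt(n/4). */
    static double zScore(int countAB, int countBA) {
        int n = countAB + countBA;
        int larger = Math.max(countAB, countBA);
        return (larger - n / 2.0) / Math.sqrt(n / 4.0);
    }
}
```

As a small check, with nA = 3 and nB = 1 the tail is P(X >= 3) = 5/16 = 0.3125 and the corrected p-value is 0.625. Working in log space avoids overflow in the binomial coefficients when n is in the hundreds.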

25
Part B: Non-redundant list of motif pairs

Collect similar strings to motif with better score (motif consensus)
String pairs (p-value):
ACGTT , GGATT (4.3E-15)
ACGAT , GGATT (2.4E-11)
AGGAT , GGTTT (1.7E-5)
AGGTT , GGTTT (5.9E-5)
Motif pair (p-value):
[Figure: two motif logos] (8.1E-31)
Don't report similar motif pairs
Motifs that consist of similar strings
Motif pairs that are small shifts of one another
Palindromes

26
Part B (cont.): Additional score

Option I: Co-occurrence rate
N = total # of sequences
sA = # of sequences that contain motif A
sAB = # of sequences that contain motifs A and B
H0: motifs occur independently and randomly
p-value = prob. for at least sAB joint occurrences, given the number of hits of each single motif = tail of hypergeometric distribution
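
The co-occurrence score can be sketched in Java as a hypergeometric upper tail. The names below (`CoOccurrence`, `hypergeometricUpperTail`) are hypothetical, and the parameter `withB` (the number of sequences that contain motif B) is an assumption introduced by analogy with sA, since the slide does not name it.

```java
// Hypothetical helper class, for illustration only.
final class CoOccurrence {

    private static double logChoose(double[] logFact, int n, int k) {
        return logFact[n] - logFact[k] - logFact[n - k];
    }

    /**
     * P(X >= withBoth), where X is the number of sequences containing both motifs
     * when withA of the total sequences contain A and withB contain B, and the
     * two motifs occur independently (H0).
     */
    static double hypergeometricUpperTail(int total, int withA, int withB, int withBoth) {
        if (withA < 0 || withB < 0 || withA > total || withB > total) {
            throw new IllegalArgumentException("Counts must lie between 0 and total");
        }
        double[] logFact = new double[total + 1];
        for (int i = 1; i <= total; i++) {
            logFact[i] = logFact[i - 1] + Math.log(i);
        }
        int minK = Math.max(0, withA + withB - total); // smallest possible overlap
        int maxK = Math.min(withA, withB);             // largest possible overlap
        double logDenominator = logChoose(logFact, total, withB);
        double sum = 0.0;
        for (int k = Math.max(withBoth, minK); k <= maxK; k++) {
            // C(withA, k) * C(total - withA, withB - k) / C(total, withB)
            sum += Math.exp(logChoose(logFact, withA, k)
                    + logChoose(logFact, total - withA, withB - k)
                    - logDenominator);
        }
        return Math.min(1.0, sum);
    }
}
```

For example, with N = 4 sequences of which 2 contain A and 2 contain B, seeing both motifs together in 2 sequences has probability 1/6 under H0, and seeing them together in at least 1 sequence has probability 5/6.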

27
Part B (cont.): Additional score

Option II: Distance bias
Is the distance between the two motifs uniform (H0), or are there specific distances that are very common?
Option III: Gap variability
Are the sequences between the motifs conserved (H0), or are they highly variable?
Other options??

28
Implementation

Java (Eclipse), Linux
GUI: simple graphical user interface for supplying the input parameters and reporting the results
Packages for motif logo and statistical scores
will be supplied
Time performance will be measured only for part A
Reasonable documentation
Separate packages for data-structures, scores,
GUI, I/O, etc.

29
Design document