Title: Motif Finding Workshop Project
1Motif Finding Workshop Project
Chaim Linhart January 2008
2Outline
- 1. Some background again
- 2. The project
31. Background
- Slides with Ron Shamir and Adi Akavia
4Gene: from DNA to protein
DNA → (transcription) → pre-mRNA → (splicing) → mature mRNA → (translation) → protein
5DNA
- DNA: a string over the alphabet of 4 bases (nucleotides): A, C, G, T
- Resides in chromosomes
- Complementary strands: A-T, C-G
- Forward/sense strand: AACTTGCG
- Reverse-complement/anti-sense strand: TTGAACGC (written 3' to 5', aligned under the forward strand; read 5' to 3' it is CGCAAGTT)
- Directional: from 5' to 3'
- 5' end (upstream) AACTTGCGATACTCCTA (downstream) 3' end
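A two-strand search needs the reverse complement of every sequence or motif. The following is a minimal sketch, not part of the original slides; the class name DnaStrand and its methods are hypothetical. For the forward strand AACTTGCG it yields CGCAAGTT, which is the slide's TTGAACGC read in the 5' to 3' direction.

```java
final class DnaStrand {
    private DnaStrand() {}

    /** Complement of a single base; N (unknown/masked) stays N, case is preserved. */
    static char complement(char base) {
        switch (base) {
            case 'A': return 'T';
            case 'T': return 'A';
            case 'C': return 'G';
            case 'G': return 'C';
            case 'a': return 't';
            case 't': return 'a';
            case 'c': return 'g';
            case 'g': return 'c';
            case 'N':
            case 'n': return base;
            default: throw new IllegalArgumentException("Not a DNA base: " + base);
        }
    }

    /** Reverse complement: the opposite strand, read in its own 5' to 3' direction. */
    static String reverseComplement(String seq) {
        StringBuilder sb = new StringBuilder(seq.length());
        for (int i = seq.length() - 1; i >= 0; i--) {
            sb.append(complement(seq.charAt(i)));
        }
        return sb.toString();
    }
}
```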
6Gene structure (eukaryotes)
Figure labels:
- DNA (coding strand): promoter, transcription start site (TSS)
- Transcription (RNA polymerase) → pre-mRNA: exon, intron, exon
- Splicing (spliceosome) → mature mRNA: 5' UTR, start codon, coding region, stop codon, 3' UTR
- Translation (ribosome) → protein
7Translation
- Codon: a triplet of bases that codes for a specific amino acid (except the stop codons); many-to-1 relation
- Stop codons: signal termination of the protein synthesis process
http://ntri.tamuk.edu/cell/ribosomes.html
8Genome sequences
- Many genomes have been sequenced, including those of viruses, microbes, plants and animals.
- Human
- 23 pairs of chromosomes
- 3 Gbps (bps = base pairs), only 3% are genes
- 25,000 genes
- Yeast
- 16 chromosomes
- 20 Mbps
- 6,500 genes
9Regulation of Expression
- Each cell contains an identical copy of the whole genome, but utilizes only a subset of the genes to perform diverse, unique tasks
- Most genes are highly regulated: their expression is limited to specific tissues, developmental stages, physiological conditions
- Main regulatory mechanism: transcriptional regulation
10Transcriptional regulation
- Transcription is regulated primarily by transcription factors (TFs): proteins that bind to DNA subsequences, called binding sites (BSs)
- TFBSs are located mainly (not always!) in the gene's promoter: the DNA sequence upstream of the gene's transcription start site (TSS)
- BSs of a particular TF share a common pattern, or motif
- Some TFs operate together: TF modules
TSS
11TFBS motif models
- Consensus (degenerate) string, e.g. [AC]ACT[CG]T
- Figure: promoters of gene 1 to gene 10, with binding sites AACTGT, CACTGT, CACTCT, CACTGT, AACTGT marked on some of them
- Statistical models
- Motif logo representation
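The consensus (degenerate) string model can be sketched as a list of allowed bases per position. The example below is an illustration only: the class ConsensusMotif is hypothetical, and the pattern [AC]ACT[CG]T is assumed because it covers the sites AACTGT, CACTGT and CACTCT listed on the slide.

```java
import java.util.ArrayList;
import java.util.List;

final class ConsensusMotif {
    // allowed[i] holds the upper-case bases accepted at motif position i
    private final String[] allowed;

    ConsensusMotif(String... allowedPerPosition) {
        this.allowed = allowedPerPosition.clone();
    }

    int length() {
        return allowed.length;
    }

    /** True if the motif matches seq starting at index start (case-insensitive). */
    boolean matchesAt(String seq, int start) {
        if (start < 0 || start + allowed.length > seq.length()) {
            return false;
        }
        for (int i = 0; i < allowed.length; i++) {
            char base = Character.toUpperCase(seq.charAt(start + i));
            if (allowed[i].indexOf(base) < 0) {
                return false; // also rejects N, which is in no allowed set
            }
        }
        return true;
    }

    /** Start indices (0-based) of all matches on the given strand. */
    List<Integer> findAll(String seq) {
        List<Integer> hits = new ArrayList<>();
        for (int s = 0; s + allowed.length <= seq.length(); s++) {
            if (matchesAt(seq, s)) {
                hits.add(s);
            }
        }
        return hits;
    }
}
```

For example, `new ConsensusMotif("AC", "A", "C", "T", "CG", "T")` matches CACTCT and AACTGT but not CACTAT.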
12Human G2M cell-cycle genes: The CHR NF-Y module
- CDCA3 (trigger of mitotic entry 1): CTCAGCCAATAGGGTCAGGGCAGGGGGCGTGGCGGGAAGTTTGAAACT  -18
- CDCA8 (cell division cycle associated 8): TTGTGATTGGATGTTGTGGGA...25bp...TGACTGTGGAGTTTGAATTGG  23
- CDC2 (cell division control protein 2 homolog): CTCTGATTGGCTGCTTTGAAAGTCTACGGGCTACCCGATTGGTGAATCCGGGGCCCTTTAGCGCGGTGAGTTTGAAACTGCT  0
- CDC42EP4 (cdc42 effector protein 4): GCTTTCAGTTTGAACCGAGGA...25bp...CGACGGCCATTGGCTGCTGC  -110
- CCNB1 (G2/mitotic-specific cyclin B1): AGCCGCCAATGGGAAGGGAG...30bp...AGCAGTGCGGGGTTTAAATCT  45
- CCNB2 (G2/mitotic-specific cyclin B2): TTCAGCCAATGAGAGT...15bp...GTGTTGGCCAATGAGAAC...15bp...GGGCCGCCCAATGGGGCGCAAGCGACGCGGTATTTGAATCCTGGA  10
BSs are short, non-specific, hiding in both strands and at various locations along the promoters
TFs: NF-Y, CHR
13The computational challenge
- Given a set of co-regulated genes (e.g., from gene expression chips)
- Find a motif that is over-represented (occurs unusually often) in their promoters
- This may be the TF binding site motif
- Find TF modules: over-represented motifs that tend to co-occur
14The computational challenge (II)
- Motifs can also be found w/o a given target-set: genome-wide
- Find a motif that is localized: occurs more often near the TSS of genes
- Find a motif with a strand bias: occurs more often on the gene's coding strand
- Find TF modules with biases in their order / orientation / distance
15Motif finding algorithms
- >100 motif finding algorithms
- Main differences between them
- Type of analysis / input
- Target-set vs. genome-wide
- Single vs. multi-species (conservation)
- Single motifs vs. modules
- Motif model
- Score for evaluating motif
- Motif search technique
- Combinatorial (enumeration) vs. statistical optimization
16Example - Amadeus
Over-represented motifs in the promoters of genes
expressed in the G2 and G2/M phases of the human
cell cycle
CHR
NF-Y
172. The project
18General goals
- Develop software from A-Z
- Design
- Implementation
- (Optimization)
- Execution & analysis of real data
- A taste of bioinformatics
- Have fun
- Get credit
19The computational task
- Given a set of DNA sequences
- Find interesting pairs of motifs
- Order bias
- Other scores
- Main challenges
- Performance (time, memory)
- Output redundancy
20Input
- File with DNA sequences in FASTA format:
  >sequence-name1 <space> header1
  ACCCGNNNNTCGGAAATGANN
  CGGAGTAAAATATGCGAGCGT
  >sequence-name2 <space> header2
  cggattnnnaccgcannnnnnnnaccgtga
  >sequence-name3 <space> header3
  agtttagactgctagctcgatcgcta
  gcggatnggctannnnnatctag
21Input (II)
- Ignore the header lines
- Sequence may span multiple lines or one long line
- Sequence contains the characters A, C, G, T, N in upper or lower case
- N means unknown or masked base
- Sample input files will be supplied
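The input rules above can be sketched as a small parser. This is one possible approach, not the required design: the class FastaParser is hypothetical, sequences are normalized to upper case (an assumption, since the slides only say both cases may appear), and header lines are recognized by a leading `>` and ignored.

```java
import java.util.ArrayList;
import java.util.List;

final class FastaParser {
    private FastaParser() {}

    /** Returns one upper-case sequence per FASTA record; header text is discarded. */
    static List<String> parse(List<String> lines) {
        List<String> sequences = new ArrayList<>();
        StringBuilder current = null;
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue; // tolerate blank lines
            }
            if (line.charAt(0) == '>') {
                // A header starts a new record: flush the previous one
                if (current != null) {
                    sequences.add(current.toString());
                }
                current = new StringBuilder();
                continue;
            }
            if (current == null) {
                throw new IllegalArgumentException("Sequence data before first header");
            }
            for (int i = 0; i < line.length(); i++) {
                char c = Character.toUpperCase(line.charAt(i));
                if ("ACGTN".indexOf(c) < 0) {
                    throw new IllegalArgumentException("Unexpected character: " + line.charAt(i));
                }
                current.append(c);
            }
        }
        if (current != null) {
            sequences.add(current.toString());
        }
        return sequences;
    }
}
```

The lines of a file can be obtained with `java.nio.file.Files.readAllLines(path)`; for very large inputs a streaming reader would use less memory.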
22Input (III)
- Search parameters
- Length of motifs (between 5-10)
- Min. & max. distance between the motifs
- ACGGATTGATNNNTGGATGCCAT (distance = 9)
- Single vs. two strands search
- Min. number of occurrences (hits) of pair
- GCGGATTCAGTGATGCCANGNATGCCTCAGGATTGNAATGCCA (three hits marked in the figure)
- Max. p-value
- Additional parameters (don't count overlaps, e.g. AAAAAA)
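To make the distance parameters concrete, here is a naive single-strand sketch that counts how often one string is followed by another within a distance window. It rests on assumptions not stated in the slides: the distance is taken to be the number of bases between the end of the first string and the start of the second, and the names PairCounter and countOrdered are hypothetical. A real solution would need a faster scheme than scanning once per pair.

```java
final class PairCounter {
    private PairCounter() {}

    /**
     * Counts occurrences of `first` followed by `second` on the given strand,
     * where the gap between them is between minDist and maxDist bases (inclusive).
     * Assumes 0 <= minDist <= maxDist.
     */
    static int countOrdered(String seq, String first, String second, int minDist, int maxDist) {
        int count = 0;
        int from = 0;
        while ((from = seq.indexOf(first, from)) >= 0) {
            int endOfFirst = from + first.length();
            for (int d = minDist; d <= maxDist; d++) {
                // startsWith returns false when the offset runs past the end
                if (seq.startsWith(second, endOfFirst + d)) {
                    count++;
                }
            }
            from++; // continue the search one position further
        }
        return count;
    }
}
```

Under this definition, GGATT and ATGCC in the slide's sequence ACGGATTGATNNNTGGATGCCAT are separated by 9 bases, so `countOrdered(seq, "GGATT", "ATGCC", 9, 9)` counts one hit. Whether these are the two motifs highlighted on the slide is an assumption.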
23Output
- A list of the string pairs with the best order-bias score (smallest p-values):
  Motif A   Motif B   # A→B   # B→A   p-value
  ACGTT     GGATT     97      17      4.3E-15
  ACGTT     GATTC     87      16      2.7E-13
  TTAAC     CAGCC     31      114     1.2E-12
- A non-redundant list of motif pairs (motif = consensus string)
- Plus logos, # of hits, additional scores
24Part A: String pairs with order bias
- nA = # of A→B, nB = # of B→A
- WLOG, nA > nB
- n = nA + nB
- H0: random order, nA ~ B(n, 0.5)
- p-value: prob. for at least nA occurrences of A→B = tail of B(n, 0.5)
- Normal approximation (central limit thm.)
- Fix for multiple testing: ×2
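The order-bias score can be sketched as follows. Note the swap: the slide suggests a normal approximation, while this sketch sums the exact tail of B(n, 0.5) in log space, which is simple and adequate for moderate n. The class OrderBias and its methods are hypothetical, and capping the doubled value at 1 is an added assumption.

```java
final class OrderBias {
    private OrderBias() {}

    /** P(X >= k) for X ~ B(n, 0.5), summed term by term in log space. */
    static double binomialUpperTail(int n, int k) {
        if (k <= 0) {
            return 1.0;
        }
        if (k > n) {
            return 0.0;
        }
        double logHalfPowN = n * Math.log(0.5);
        double logChoose = 0.0; // log C(n, 0)
        double tail = 0.0;
        for (int i = 0; i <= n; i++) {
            if (i >= k) {
                tail += Math.exp(logChoose + logHalfPowN);
            }
            if (i < n) {
                // log C(n, i+1) = log C(n, i) + log(n - i) - log(i + 1)
                logChoose += Math.log(n - i) - Math.log(i + 1);
            }
        }
        return Math.min(1.0, tail);
    }

    /** Order-bias p-value: tail for the more frequent order, doubled (the x2 fix). */
    static double pValue(int countAB, int countBA) {
        int n = countAB + countBA;
        int larger = Math.max(countAB, countBA);
        return Math.min(1.0, 2.0 * binomialUpperTail(n, larger));
    }
}
```

As a small worked case, 3 occurrences of A→B and 1 of B→A give a tail of (4 + 1) / 16 = 0.3125, so the doubled p-value is 0.625.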
25Part B: Non-redundant list of motif pairs
- Collect similar strings to motif with better score (motif consensus)
- String pair (p-value) → Motif pair
- ACGTT , GGATT (4.3E-15)
- ACGAT , GGATT (2.4E-11)
- AGGAT , GGTTT (1.7E-5)
- AGGTT , GGTTT (5.9E-5)
- Resulting motif pair (shown as two logos in the slide): p-value 8.1E-31
- Don't report similar motif pairs
- Motifs that consist of similar strings
- Motif pairs that are small shifts of one another
- Palindromes
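The slides leave the definition of "similar" open. One possible building block, shown here purely as an assumption, is to compare strings by Hamming distance and to detect small shifts by checking whether one string's suffix equals the other's prefix. The class Redundancy and both method names are hypothetical.

```java
final class Redundancy {
    private Redundancy() {}

    /** Number of mismatching positions between two strings of equal length. */
    static int hammingDistance(String a, String b) {
        if (a.length() != b.length()) {
            throw new IllegalArgumentException("Lengths differ");
        }
        int d = 0;
        for (int i = 0; i < a.length(); i++) {
            if (a.charAt(i) != b.charAt(i)) {
                d++;
            }
        }
        return d;
    }

    /**
     * True if one string equals the other shifted by 1..maxShift positions,
     * i.e. a suffix of one matches the prefix of the other exactly.
     */
    static boolean isShiftOf(String a, String b, int maxShift) {
        for (int s = 1; s <= maxShift && s < a.length(); s++) {
            int overlap = a.length() - s;
            if (a.regionMatches(s, b, 0, overlap) || b.regionMatches(s, a, 0, overlap)) {
                return true;
            }
        }
        return false;
    }
}
```

With these helpers, ACGTT and ACGAT are at Hamming distance 1, and GGATT and GATTC are shifts of one another by one position.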
26Part B (cont.): Additional score
- Option I: Co-occurrence rate
- N = total # of sequences
- sA = # of sequences that contain motif A
- sAB = # of sequences that contain motifs A and B
- H0: motifs occur independently and randomly
- p-value: prob. for at least sAB joint occurrences, given the number of hits of each single motif = tail of hypergeometric distribution
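The co-occurrence score can be sketched as a hypergeometric tail. This is a minimal illustration under stated assumptions: sB (the number of sequences containing motif B) is introduced by analogy with sA, the class CoOccurrence is hypothetical, and the inputs are assumed to satisfy 0 <= sA, sB <= N.

```java
final class CoOccurrence {
    private CoOccurrence() {}

    /** log C(n, k), or negative infinity when the coefficient is zero. */
    static double logChoose(int n, int k) {
        if (k < 0 || k > n) {
            return Double.NEGATIVE_INFINITY;
        }
        double sum = 0.0;
        for (int i = 1; i <= k; i++) {
            sum += Math.log(n - k + i) - Math.log(i);
        }
        return sum;
    }

    /**
     * P(X >= sAB) where X is the number of sequences containing both motifs
     * when sB sequences are drawn at random from N, of which sA contain motif A.
     */
    static double pValue(int total, int sA, int sB, int sAB) {
        double logDenominator = logChoose(total, sB);
        double tail = 0.0;
        for (int k = Math.max(sAB, 0); k <= Math.min(sA, sB); k++) {
            // exp(-infinity) is 0, so impossible terms contribute nothing
            tail += Math.exp(logChoose(sA, k) + logChoose(total - sA, sB - k) - logDenominator);
        }
        return Math.min(1.0, tail);
    }
}
```

For example, with N = 4, sA = 2 and sB = 2, observing sAB = 2 has probability C(2,2)·C(2,0)/C(4,2) = 1/6.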
27Part B (cont.): Additional score
- Option II: Distance bias
- Is the distance between the two motifs uniform (H0), or are there specific distances that are very common?
- Option III: Gap variability
- Are the sequences between the motifs conserved (H0), or are they highly variable?
- Other options??
28Implementation
- Java (Eclipse), Linux
- GUI: Simple graphical user interface for supplying the input parameters and reporting the results
- Packages for motif logo and statistical scores will be supplied
- Time performance will be measured only for part A
- Reasonable documentation
- Separate packages for data-structures, scores, GUI, I/O, etc.
29Design document
- Due in 3 weeks (Feb 24)
- 3-5 pages (Word), Hebrew/English
- Briefly describe main goal, input and output of program
- Describe main data structures, algorithms, and scores for parts A & B
- Meet with me before submission
30Fin