Motif Finding Workshop Project

1
Motif Finding Workshop Project
Chaim Linhart, January 2008
2
Outline
  • 1. Some background again
  • 2. The project

3
1. Background
  • Slides with Ron Shamir and Adi Akavia

4
Gene: from DNA to protein
DNA → (transcription) → Pre-mRNA → (splicing) → Mature mRNA → (translation) → protein
5
DNA
  • DNA: a string over the alphabet of 4 bases
    (nucleotides): A, C, G, T
  • Resides in chromosomes
  • Complementary strands: A-T, C-G
  • Forward/sense strand: AACTTGCG
  • Reverse-complement/anti-sense strand:
    TTGAACGC (written aligned under the forward strand,
    i.e. 3' to 5'; read 5' to 3' it is CGCAAGTT)
  • Directional: from 5' to 3'
  • 5' end (upstream) AACTTGCGATACTCCTA (downstream) 3' end

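
The strand conventions above can be sketched in a few lines of Java. This is a minimal illustration, not part of the original slides; the class name `Dna` and its method names are hypothetical. It assumes the alphabet A, C, G, T plus N (unknown base) in either case.

```java
final class Dna {
    // Complement of a single base (A-T, C-G). N (unknown) stays N; case is preserved.
    static char complement(char base) {
        switch (base) {
            case 'A': return 'T';
            case 'C': return 'G';
            case 'G': return 'C';
            case 'T': return 'A';
            case 'a': return 't';
            case 'c': return 'g';
            case 'g': return 'c';
            case 't': return 'a';
            case 'N':
            case 'n': return base;
            default: throw new IllegalArgumentException("Not a DNA base: " + base);
        }
    }

    // Reverse complement: the opposite strand, read in its own 5' to 3' direction.
    static String reverseComplement(String seq) {
        StringBuilder sb = new StringBuilder(seq.length());
        for (int i = seq.length() - 1; i >= 0; i--) {
            sb.append(complement(seq.charAt(i)));
        }
        return sb.toString();
    }
}
```

For the forward strand AACTTGCG, `Dna.reverseComplement` returns CGCAAGTT, which is the slide's TTGAACGC read from its 5' end. A two-strand motif search can scan a sequence once for a motif and once for its reverse complement.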
6
Gene structure (eukaryotes)
[Figure: gene structure, from DNA to protein]
  • DNA (coding strand): promoter, transcription start site (TSS)
  • Transcription (RNA polymerase) → Pre-mRNA: exon, intron, exon
  • Splicing (spliceosome) → Mature mRNA: 5' UTR, start codon,
    coding region, stop codon, 3' UTR
  • Translation (ribosome) → Protein
7
Translation
  • Codon: a triplet of bases that codes for a specific
    amino acid (except the stop codons); a many-to-1
    relation
  • Stop codons: signal termination of the protein
    synthesis process

http://ntri.tamuk.edu/cell/ribosomes.html
8
Genome sequences
  • Many genomes have been sequenced, including those
    of viruses, microbes, plants and animals.
  • Human
    • 23 pairs of chromosomes
    • 3 Gbps (bps = base pairs), only 3% are genes
    • 25,000 genes
  • Yeast
    • 16 chromosomes
    • 20 Mbps
    • 6,500 genes

9
Regulation of Expression
  • Each cell contains an identical copy of the whole
    genome - but utilizes only a subset of the genes
    to perform diverse, unique tasks
  • Most genes are highly regulated: their expression
    is limited to specific tissues, developmental
    stages, and physiological conditions
  • Main regulatory mechanism: transcriptional
    regulation

10
Transcriptional regulation
  • Transcription is regulated primarily by
    transcription factors (TFs): proteins that bind
    to DNA subsequences, called binding sites (BSs)
  • TFBSs are located mainly (not always!) in the
    gene's promoter: the DNA sequence upstream of the
    gene's transcription start site (TSS)
  • BSs of a particular TF share a common pattern, or
    motif
  • Some TFs operate together: TF modules

11
TFBS motif models
  • Consensus (degenerate) string: [AC]ACT[CG]T
    [Figure: promoters of gene 1 to gene 10, with binding-site
    instances marked on some of them: AACTGT, CACTGT, CACTCT,
    CACTGT, AACTGT]
  • Statistical models
  • Motif logo representation
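
A degenerate consensus string can be checked against candidate binding sites with a regular expression. The sketch below is illustrative only. It assumes the consensus is written in bracket notation, and `[AC]ACT[CG]T` is one reading of the slide's figure that agrees with the listed instances AACTGT, CACTGT and CACTCT. The class name `ConsensusMotif` is hypothetical.

```java
import java.util.regex.Pattern;

final class ConsensusMotif {
    private final Pattern pattern;

    // consensus uses bracket notation for degenerate positions, e.g. "[AC]ACT[CG]T".
    ConsensusMotif(String consensus) {
        this.pattern = Pattern.compile(consensus, Pattern.CASE_INSENSITIVE);
    }

    // True if the whole candidate string is an instance of the motif.
    boolean matches(String candidate) {
        return pattern.matcher(candidate).matches();
    }
}
```

With this reading, `new ConsensusMotif("[AC]ACT[CG]T").matches("CACTCT")` is true, while GACTGT is rejected because G is not allowed at the first position.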

12
Human G2M cell-cycle genes: The CHR NF-Y module
CDCA3 (trigger of mitotic entry 1)
  CTCAGCCAATAGGGTCAGGGCAGGGGGCGTGGCGGGAAGTTTGAAACT  -18
CDCA8 (cell division cycle associated 8)
  TTGTGATTGGATGTTGTGGGA 25bp TGACTGTGGAGTTTGAATTGG  23
CDC2 (cell division control protein 2 homolog)
  CTCTGATTGGCTGCTTTGAAAGTCTACGGGCTACCCGATTGGTGAATCCGGGGCCCTTTAGCGCGGTGAGTTTGAAACTGCT  0
CDC42EP4 (cdc42 effector protein 4)
  GCTTTCAGTTTGAACCGAGGA 25bp CGACGGCCATTGGCTGCTGC  -110
CCNB1 (G2/mitotic-specific cyclin B1)
  AGCCGCCAATGGGAAGGGAG 30bp AGCAGTGCGGGGTTTAAATCT  45
CCNB2 (G2/mitotic-specific cyclin B2)
  TTCAGCCAATGAGAGT 15bp GTGTTGGCCAATGAGAAC 15bp GGGCCGCCCAATGGGGCGCAAGCGACGCGGTATTTGAATCCTGGA  10
BSs are short, non-specific, hiding in both strands and at
various locations along the promoters
TFs: NF-Y, CHR
13
The computational challenge
  • Given a set of co-regulated genes (e.g., from
    gene expression chips)
  • Find a motif that is over-represented (occurs
    unusually often) in their promoters
  • This may be the TF binding site motif
  • Find TF modules: over-represented motifs that
    tend to co-occur

14
The computational challenge (II)
  • Motifs can also be found w/o a given target-set:
    genome-wide
  • Find a motif that is localized: occurs more
    often near the TSS of genes
  • Find a motif with a strand bias: occurs more
    often on the gene's coding strand
  • Find TF modules with biases in their order /
    orientation / distance

15
Motif finding algorithms
  • >100 motif finding algorithms
  • Main differences between them:
    • Type of analysis / input
      • Target-set vs. genome-wide
      • Single vs. multi-species (conservation)
      • Single motifs vs. modules
    • Motif model
    • Score for evaluating motif
    • Motif search technique
      • Combinatorial (enumeration) vs. statistical
        optimization

16
Example - Amadeus
Over-represented motifs in the promoters of genes
expressed in the G2 and G2/M phases of the human
cell cycle
CHR
NF-Y
17
2. The project
18
General goals
  • Develop software from A-Z
    • Design
    • Implementation
    • (Optimization)
    • Execution & analysis of real data
  • A taste of bioinformatics
  • Have fun
  • Get credit

19
The computational task
  • Given a set of DNA sequences
  • Find interesting pairs of motifs
    • Order bias
    • Other scores
  • Main challenges
    • Performance (time, memory)
    • Output redundancy

20
Input
  • File with DNA sequences in FASTA format:
      >sequence-name1 <space> header1
      ACCCGNNNNTCGGAAATGANN
      CGGAGTAAAATATGCGAGCGT
      >sequence-name2 <space> header2
      cggattnnnaccgcannnnnnnnaccgtga
      >sequence-name3 <space> header3
      agtttagactgctagctcgatcgcta
      gcggatnggctannnnnatctag

21
Input (II)
  • Ignore the header lines
  • Sequence may span multiple lines or one long line
  • Sequence contains the characters A, C, G, T, N in
    upper or lower case
    • N means unknown or masked base
  • Sample input files will be supplied
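
The input rules above (ignore header lines, join multi-line sequences, accept A/C/G/T/N in either case) could be implemented roughly as follows. This is a sketch under stated assumptions, not the course's supplied code. The class `FastaReader` and method `parse` are hypothetical names, header lines are assumed to start with `>`, and sequences are normalized to upper case for convenience.

```java
import java.util.ArrayList;
import java.util.List;

final class FastaReader {
    // Parses FASTA text into a list of sequences (upper case), one per header line.
    static List<String> parse(String fastaText) {
        List<String> sequences = new ArrayList<>();
        StringBuilder current = null;
        for (String rawLine : fastaText.split("\\R")) {
            String line = rawLine.trim();
            if (line.isEmpty()) {
                continue; // skip blank lines
            }
            if (line.charAt(0) == '>') {
                // Header line: its content is ignored, it only starts a new sequence.
                if (current != null) {
                    sequences.add(current.toString());
                }
                current = new StringBuilder();
            } else {
                if (current == null) {
                    current = new StringBuilder(); // tolerate a missing first header
                }
                for (int i = 0; i < line.length(); i++) {
                    char c = Character.toUpperCase(line.charAt(i));
                    if ("ACGTN".indexOf(c) < 0) {
                        throw new IllegalArgumentException("Unexpected character: " + line.charAt(i));
                    }
                    current.append(c);
                }
            }
        }
        if (current != null) {
            sequences.add(current.toString());
        }
        return sequences;
    }
}
```

The file contents can be read into a string with the standard `java.nio.file.Files` API before calling `parse`. For very large inputs a streaming reader would use less memory.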

22
Input (III)
  • Search parameters
    • Length of motifs (between 5-10)
    • Min. & max. distance between the motifs
        ACGGATTGATNNNTGGATGCCAT
        distance = 9
    • Single vs. two strands search
    • Min. number of occurrences (hits) of pair
      (don't count overlaps, e.g. AAAAAA)
        GCGGATTCAGTGATGCCANGNATGCCTCAGGATTGNAATGCCA
        (three hits marked on the slide)
    • Max. p-value
    • Additional parameters

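
The hit-counting parameters can be illustrated with a small sketch. Several choices here are assumptions, since the slide's markings did not survive: the distance is taken to be the number of bases between the end of the first motif and the start of the second, a "hit" is taken to be any occurrence of A followed by an occurrence of B within the allowed distance range, and the motifs GGATT and ATGCC are one reading of the example sequences. The class `MotifHits` and its methods are hypothetical names. Only the forward strand is searched.

```java
import java.util.ArrayList;
import java.util.List;

final class MotifHits {
    // Start positions of non-overlapping occurrences of motif, scanning left to right.
    static List<Integer> occurrences(String sequence, String motif) {
        List<Integer> starts = new ArrayList<>();
        int from = 0;
        while (true) {
            int pos = sequence.indexOf(motif, from);
            if (pos < 0) {
                break;
            }
            starts.add(pos);
            from = pos + motif.length(); // jump past this hit so overlaps are not counted
        }
        return starts;
    }

    // Number of bases between the end of a motif starting at startA and a motif starting at startB.
    static int gap(int startA, int motifLength, int startB) {
        return startB - (startA + motifLength);
    }

    // Counts A->B hits: pairs (occurrence of a, later occurrence of b) with minDist <= gap <= maxDist.
    static int countOrderedPairs(String sequence, String a, String b, int minDist, int maxDist) {
        int count = 0;
        List<Integer> bStarts = occurrences(sequence, b);
        for (int startA : occurrences(sequence, a)) {
            for (int startB : bStarts) {
                int d = gap(startA, a.length(), startB);
                if (d >= minDist && d <= maxDist) {
                    count++;
                }
            }
        }
        return count;
    }
}
```

Under these assumptions, GGATT and ATGCC in ACGGATTGATNNNTGGATGCCAT are separated by a gap of 9, and `occurrences("AAAAAA", "AAA")` reports two hits rather than four. Running the same count with the arguments swapped gives the B→A count.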
23
Output
  • A list of the string pairs with the best
    order-bias score (smallest p-values)

      Motif A   Motif B   A→B   B→A   p-value
      ACGTT     GGATT      97    17   4.3E-15
      ACGTT     GATTC      87    16   2.7E-13
      TTAAC     CAGCC      31   114   1.2E-12

  • A non-redundant list of motif pairs (motif
    consensus string)
  • Logos, # of hits, additional scores

24
Part A: String pairs with order bias
  • nA = # of A→B, nB = # of B→A
  • WLOG, nA > nB
  • n = nA + nB
  • H0: random order, nA ~ B(n, 0.5)
  • p-value: prob. for at least nA occurrences of A→B
    = tail of B(n, 0.5)
  • Normal approximation (central limit thm.)
  • Fix for multiple testing: x2
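
The order-bias score can be sketched as below. Note the swap: instead of the normal approximation mentioned on the slide, this sketch sums the exact binomial tail in log space, which stays usable for very small p-values. Reading "x2" as doubling the one-sided tail (to account for picking the larger of the two counts) is an assumption, as are the class and method names.

```java
final class OrderBias {
    // P(X >= k) for X ~ Binomial(n, 0.5), with each term computed in log space.
    static double upperTail(int n, int k) {
        if (k <= 0) {
            return 1.0;
        }
        if (k > n) {
            return 0.0;
        }
        double logHalfPowN = n * Math.log(0.5);
        double logChoose = 0.0; // log C(n, 0)
        double tail = 0.0;
        for (int i = 0; i <= n; i++) {
            if (i >= k) {
                tail += Math.exp(logChoose + logHalfPowN);
            }
            if (i < n) {
                // log C(n, i+1) = log C(n, i) + log(n - i) - log(i + 1)
                logChoose += Math.log(n - i) - Math.log(i + 1);
            }
        }
        return Math.min(1.0, tail);
    }

    // Order-bias p-value for nAB occurrences of A->B and nBA occurrences of B->A.
    // The one-sided tail is doubled (capped at 1) because the larger count is chosen after the fact.
    static double pValue(int nAB, int nBA) {
        int n = nAB + nBA;
        int larger = Math.max(nAB, nBA);
        return Math.min(1.0, 2.0 * upperTail(n, larger));
    }
}
```

For example, `OrderBias.upperTail(114, 97)` is the one-sided tail for a pair seen 97 times as A→B and 17 times as B→A. The loop is linear in n, so when many pairs share the same n the tails could be cached.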

25
Part B: Non-redundant list of motif pairs
  • Collect similar strings to motif with better
    score (motif consensus)
  • String pair (p-value) → Motif pair
      ACGTT, GGATT (4.3E-15)
      ACGAT, GGATT (2.4E-11)
      AGGAT, GGTTT (1.7E-5)
      AGGTT, GGTTT (5.9E-5)
      → [motif pair consensus not preserved in this transcript] (8.1E-31)
  • Don't report similar motif pairs
    • Motifs that consist of similar strings
    • Motif pairs that are small shifts of one another
    • Palindromes

26
Part B (cont.): Additional score
  • Option I: Co-occurrence rate
    • N = total # of sequences
    • sA = # of sequences that contain motif A
    • sAB = # of sequences that contain motifs A and B
    • H0: motifs occur independently and randomly
    • p-value: prob. for at least sAB joint occurrences,
      given the number of hits of each single motif
      = tail of hypergeometric distribution
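
The co-occurrence score can be sketched as a hypergeometric tail. This assumes a count sB (the number of sequences that contain motif B), defined analogously to sA; it is not listed on the slide but the distribution needs it. The tail is P(X ≥ sAB) with P(X = k) = C(sA, k) · C(N − sA, sB − k) / C(N, sB). The class and method names are hypothetical.

```java
final class CoOccurrence {
    // log(i!) for i = 0..max, built by summing logs.
    private static double[] logFactorials(int max) {
        double[] lf = new double[max + 1];
        for (int i = 1; i <= max; i++) {
            lf[i] = lf[i - 1] + Math.log(i);
        }
        return lf;
    }

    private static double logChoose(double[] lf, int n, int k) {
        return lf[n] - lf[k] - lf[n - k];
    }

    // P(at least sAB of the sB sequences containing B also contain A), out of total sequences,
    // sA of which contain A, under independent random placement (hypergeometric tail).
    static double pValue(int total, int sA, int sB, int sAB) {
        double[] lf = logFactorials(total);
        int lo = Math.max(sAB, Math.max(0, sA + sB - total)); // smallest feasible overlap
        int hi = Math.min(sA, sB);                            // largest feasible overlap
        double logDenominator = logChoose(lf, total, sB);
        double tail = 0.0;
        for (int k = lo; k <= hi; k++) {
            tail += Math.exp(logChoose(lf, sA, k)
                    + logChoose(lf, total - sA, sB - k)
                    - logDenominator);
        }
        return Math.min(1.0, tail);
    }
}
```

As a small check by hand: with N = 4, sA = 2 and sB = 2, the probability that both B-containing sequences also contain A is 1/6, which is what `CoOccurrence.pValue(4, 2, 2, 2)` evaluates.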

27
Part B (cont.): Additional score
  • Option II: Distance bias
    • Is the distance between the two motifs uniform
      (H0), or are there specific distances that are
      very common?
  • Option III: Gap variability
    • Are the sequences between the motifs conserved
      (H0), or are they highly variable?
  • Other options??
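
One possible way to quantify Option II is a Pearson chi-square statistic that compares the observed distribution of distances with a uniform one. The slides do not prescribe a test, so the choice of statistic, the class `DistanceBias` and its method names are all assumptions. Converting the statistic into a p-value would need a chi-square distribution function, for instance from the statistical package the course supplies.

```java
final class DistanceBias {
    // Histogram of observed distances over the allowed range [minDist, maxDist].
    static int[] histogram(int[] distances, int minDist, int maxDist) {
        int[] counts = new int[maxDist - minDist + 1];
        for (int d : distances) {
            if (d >= minDist && d <= maxDist) {
                counts[d - minDist]++;
            }
        }
        return counts;
    }

    // Pearson chi-square statistic of the counts against a uniform distribution over the bins.
    // Larger values mean that some distances are much more common than others.
    static double chiSquareVsUniform(int[] counts) {
        long total = 0;
        for (int c : counts) {
            total += c;
        }
        if (total == 0) {
            return 0.0;
        }
        double expected = (double) total / counts.length;
        double stat = 0.0;
        for (int c : counts) {
            double diff = c - expected;
            stat += diff * diff / expected;
        }
        return stat;
    }
}
```

A perfectly even histogram gives a statistic of 0. With few hits per distance the chi-square approximation is poor, so grouping distances into wider bins may be preferable.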

28
Implementation
  • Java (Eclipse), Linux
  • GUI: Simple graphical user interface for
    supplying the input parameters and reporting the
    results
  • Packages for motif logo and statistical scores
    will be supplied
  • Time performance will be measured only for part A
  • Reasonable documentation
  • Separate packages for data-structures, scores,
    GUI, I/O, etc.

29
Design document
  • Due in 3 weeks (Feb 24)
  • 3-5 pages (Word), Hebrew/English
  • Briefly describe main goal, input and output of
    program
  • Describe main data structures, algorithms, and
    scores for parts A & B
  • Meet with me before submission

30
Fin