Title: Algorithms in Bioinformatics: A Practical Introduction
1 Algorithms in Bioinformatics: A Practical Introduction
- Project
- Motif finding using ChIP-seq peak data
2 Transcriptional Control (I)
3 Transcriptional Control (II)
TATAAT is the motif!
4 Motif model
TTGACA TCGACA TTGACA TTGAAA ATGACA TTGACA GTGACA TTGACT TTGACC TTGACA
Consensus Pattern
TTGACA
Positional Weight Matrix (PWM)
- A motif can be described in two ways, based on the discovered binding sites.
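To make the two descriptions concrete, here is a minimal sketch that derives both from the ten binding sites listed on this slide. It assumes the matrix holds plain relative frequencies per position (no pseudocounts or log-odds scaling, which the slide does not specify); the function names are illustrative only.

```python
from collections import Counter

# The ten binding sites listed on the slide.
SITES = ["TTGACA", "TCGACA", "TTGACA", "TTGAAA", "ATGACA",
         "TTGACA", "GTGACA", "TTGACT", "TTGACC", "TTGACA"]

def position_frequency_matrix(sites, alphabet="ACGT"):
    """Return one {base: relative frequency} dict per motif position."""
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all binding sites must have the same length")
    matrix = []
    for column in zip(*sites):  # iterate over aligned positions
        counts = Counter(column)
        matrix.append({b: counts[b] / len(sites) for b in alphabet})
    return matrix

def consensus(matrix):
    """Pick the most frequent base at every position."""
    return "".join(max(col, key=col.get) for col in matrix)

pwm = position_frequency_matrix(SITES)
print(consensus(pwm))
for base in "ACGT":
    print(base, " ".join(f"{col[base]:.1f}" for col in pwm))
```

The consensus keeps only the top base per position, while the matrix also records how often the other bases occur there.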
5 ChIP experiment
- Chromatin immunoprecipitation (ChIP) experiment
- Detects the interaction between a protein (transcription factor) and DNA.
6 Peak data
- Peak data represents the locations where a particular TF binds.
- The data tells us the locations and intensities.
- (Note that, due to experimental error, peaks of low intensity may be noise.)
ChIP-seq data for human (MCF7), E2 treatment at 45 min
chr1883,686-958,485
7 Our aim
- Given the DNA sequences of those peaks, find motifs which occur in those peak regions.
- For the example below, we have two motifs: TTGACA and GCATC.
- Note that each instance has at most 1 mutation.
GCACGCGGTATCGTTAGCTTGACAATGAAGAATCCCCCCGCTCGACAGT
GCATACTTTGACACTGACTTCGCTTCTTTAATGTTTAATGAAACATGCG
CCCTCTGGAAATTAGTGCGGCATCTCACAACCCGAGGAATGACCAAATG
GTATTGAAAGTAAGGCAACGGTGATCCCCATGACACCAAAGATGCTAAG
CAACGCTCAGGCAACGTTGACAGGTGACACGTTGACTGCGGCCTCCTGC
GTCTCTTGACCGCTTAATCCTAAAGGCCTCCTATTAGTATCCGCAATGT
GAACAGGAGCGCGAGCCATCAATTGAAGCGAAGTTGACACCTAATAACT
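The "at most 1 mutation" condition can be checked with a simple sliding window. The sketch below assumes a mutation means a single-base substitution (Hamming distance), not an insertion or deletion; `find_instances` is a hypothetical helper name, not part of the project specification.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def find_instances(sequence, motif, max_mismatches=1):
    """Return (position, site) pairs where the window is within
    max_mismatches substitutions of the motif (0-based positions)."""
    k = len(motif)
    hits = []
    for i in range(len(sequence) - k + 1):
        window = sequence[i:i + k]
        if hamming(window, motif) <= max_mismatches:
            hits.append((i, window))
    return hits

# First example sequence from the slide.
seq = "GCACGCGGTATCGTTAGCTTGACAATGAAGAATCCCCCCGCTCGACAGT"
for motif in ("TTGACA", "GCATC"):
    for pos, site in find_instances(seq, motif):
        print(motif, pos, site)
```

Scanning every window costs time proportional to the sequence length times the motif length, which is fine for peak regions of a few hundred bases.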
8 Input (I)
- From every peak, we get the DNA sequence within approximately ±200 bp of the peak.
>cmyc_1_chr1_4842133_4842148_range_chr1_4841934_4842348_intensity_20
CCTCCATACCAGCCCCAATGTTCTGCGTTCCCGAATGAAAGACACACAAC
ACAGCCTTTATATTTTGATATGCCTAAAACTGCTCAATGGCTGGGCCACT
TCCTAGCTAGTATCCACGTGGCTATCCCACCTCTCTCTGATATTCCCAAG
TCATTACTTACTAAAATCTGTAATTACATCTTTGCTGCCCTAGGCCCAAT
CTGGCAGCCCTCCTGTGGCCCCTCAGGCTACTACATGGCAGCTAAGCTCT
CTGACCCACATCTTCTCAGGCACCGTGCCTCCTCTTCTCCACCTTATTCA
AACATGGTGGCTCTCCTTCCTCCTTCTTCCTGTCTGTCCCCAGCCTGGGA
ATTCTAAAAGTCCCACCTCTGTCTGCCCTGTTCAGCCATTGGCTGTCGGC
ATCTTTATTTACGAG
>cmyc_2_chr1_5073201_5073215_range_chr1_5073002_5073415_intensity_15
GGTCATAAACCAAGCTTCTTCAAAGATTTTTGGCTTTTTGGCACCAGTGG
CCTGCAGGGTGGCGAGCTCTGCCAGTTTGAAGTGACCAAGTTAAGTGGCC
TGGGAAAGGCCATTTGGTGCGCGGTCCAGCAGTTTTGGGCGCTCTCGGCT
TCCGCCCTCAGCTGCGGTCACGTGCGGCTGCTCACGTGCCAGACGCTGCT
GTCACTTCGTAGCTGTTCCGGCTTCCTCTGAGTGAGGCTCGCAACGTCTC
CCACGGAGTCGCCTTCGTTCTGCTCTGGGTCTCCCGTGGCCACTGAGACC
TCGGAGCTCGACCGGCGCCTGCCCGCCCGTGCGGCCCTCACTCCCCGAGG
CTATCCAGGTGAGGCCGCCTGGGGTCCCTCCCCGGCTCCGGAGAGCCGAC
TGGTTTCCCTGCCG
>cmyc_3_chr1_9530642_9530652_range_chr1_9530443_9530852_intensity_36
GTAGTCCCAACCAGGTCCTGAGCTGGTTAGCCAACCCTCAGCGCCAGTCG
GGCCAACATCCGGTGACGAATCCAAGTCCCGCCTCTAAGCCCATCTGCTG
TCCAATGCCGCCCTCTGCCGGTCTTTACCTCCCCGCCTAGCTGTGAGCCG
CTTCCAGACAACCCGGAAGTGATCTTTCCTCTTCCGGATTACGGGTCCGG
ACGTCCGCACGTGGTTGCCGGTTTAGGGTGCTGCTGTAGTGGCGATACGT
CCCGCCGCTGTCCCGAAGTGAGGGATCCGAGCCGCAGCGAGAGCCATGGA
GGGCCAGCGCGTGGAGGAGCTGCTGGCCAAGGCAGAGCAGGAGGAGGCGG
AGAAGCTGCAGCGCATCACGGTGCACAAGGAGCTGGAGCTGGAGTTCGAC
CTGGGCAACC
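A possible way to read this input is sketched below. It assumes the file is in FASTA format with header lines starting with `>`, and that the header fields mean what their names suggest: peak name, chromosome, peak start and end, the extracted range, and the peak intensity. That reading is inferred from the three headers shown here, not from a documented specification; `Peak` and `parse_peaks` are hypothetical names.

```python
import re
from dataclasses import dataclass

# Header layout inferred from the examples on this slide (an assumption).
HEADER_RE = re.compile(
    r"^(?P<name>.+?)_(?P<chrom>chr[^_]+)_(?P<peak_start>\d+)_(?P<peak_end>\d+)"
    r"_range_(?P<range_chrom>chr[^_]+)_(?P<range_start>\d+)_(?P<range_end>\d+)"
    r"_intensity_(?P<intensity>\d+(?:\.\d+)?)$"
)

@dataclass
class Peak:
    name: str
    chrom: str
    peak_start: int
    peak_end: int
    range_start: int
    range_end: int
    intensity: float
    sequence: str = ""

def parse_peaks(lines):
    """Parse FASTA lines into Peak records, joining wrapped sequence lines."""
    peaks = []
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            m = HEADER_RE.match(line[1:])
            if m is None:
                raise ValueError(f"unrecognised header: {line}")
            current = Peak(
                name=m["name"], chrom=m["chrom"],
                peak_start=int(m["peak_start"]), peak_end=int(m["peak_end"]),
                range_start=int(m["range_start"]), range_end=int(m["range_end"]),
                intensity=float(m["intensity"]),
            )
            peaks.append(current)
        elif current is not None:
            current.sequence += line.upper()
    return peaks

# First record from the slide, with the sequence truncated to one line for brevity.
example = [
    ">cmyc_1_chr1_4842133_4842148_range_chr1_4841934_4842348_intensity_20",
    "CCTCCATACCAGCCCCAATGTTCTGCGTTCCCGAATGAAAGACACACAAC",
]
peak = parse_peaks(example)[0]
print(peak.name, peak.intensity, peak.peak_start - peak.range_start)
```

Keeping the intensity and the peak offset alongside each sequence makes it possible to filter weak peaks or to favour positions near the peak later on.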
9 Input (II)
- A set of sequences that are likely to contain no motif.
>SEQ_1
AACAAGGGAAAGAGTAGTGAGTGCTTCTTTCTATTCAGAGGGAGGGGAAG
TTGCTGTTAGCTAAGACAGTCAGGACTGAGAAGGGGGGGGGGGGTTTAAC
TCTCCTGGAGGGAGCTGAGAGGTAAAGGGAGGGGCGTGAGGTAGAACAAG
CCGAGAACACAGGGCAGGTTGGTCTGACTCCAGAGCACAGTGCAGGAGCC
CGGAAGTTGACTCAGTTCAGTTAGCAAGTATTTTCACACAAGGCGTGAAC
ACTGAAGACAAAAGCAAGAGACACAGCTCTATCTCTAAGAAGATTTTCAG
AGCCAAGATCGATGGGGCACACCTGTTAATCCCAGCACTTAGGAGGCTGA
GGCAGGAGGATCCCAAGTTCAAAACCAGCCTGGACTTGTTTTAAGGAAAA
>SEQ_2
AAAAAAAAAAAAAAGACTTCCAGTTTAATAAATGACCAATTCAGGAATGG
AGATTAGGGCTGGATGACAAGTTTTTAATTGTCAAGGACTCAATTCTGTT
TATCAGTTGGTATGGAATTATGTAAGCTTTTAGCGATATGACCGCACGGA
GCAGTGTAGAGAGTGATCTGAGAGACGCTTGGGGGTCAGGATGGAGATAG
AACTCCCTCTCTATTAGAAGGTGTTTGGTGGTAGGTAACCCTGGGCTAGC
ATGGTGGGTCTCTTCTTACTTAGGCTTCCATCTTTGTGGTTCAAATCCAA
GAAGGACCTGCGTTCCCTCCCTCCTTGTGATCAGCTGATTGCTAGAGCAT
AACTCATCTTAACTTCTCATGTACTCTCCGGGTACAGGAAGGGAGGGGGC
>SEQ_3
CCACTGCTGACAGTGGAGCATGAAACGACCGGCTTCCTGACTATGTTGGT
ACCCTTTCAGGAGCCTAAAACAGTGCTTTCAATACTTGTGTCTATGTCTG
TTAGCCACAACTTTCTAGTTTCCCAGAGAGATTTTGAAGTGTAGTTTTGT
ATTTGCTCAAATATATATTCATATGGTGAGGTGCACATTTTTTATATTAT
ATTTTTATTCATTTATTTTTGGTGCTTGGGAATTATACTCTAGGAATAAA
GCGCCTGGTAGAAAGTGGCACACATCTTTAATCCCAGCACTCAGGAAGCA
GAGGCAGACAAATCTCTGCGTTCCAGGACAGCCTGGTCTATAGAGCAAGG
TCCAAGCCAGCCAGGTTTACACAAAGAAACCTAGTGTGGAAAAGACAAAA
10 Output
- You need to output a ranked list of candidate motifs.
- You can model the motif as a PWM or a consensus sequence.
- If you model the motif as a PWM, one of the answers for the previous dataset is: [PWM figure]
- You may also return other significant motifs.
11 Aim of the project
- Given a sample file and a background file, you need to implement a method which outputs a list of motifs.
- You need to take advantage of the fact that this is a ChIP-seq dataset.
- Hint: read papers on ChIP-seq and understand its properties.
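As a starting point only, the sketch below ranks exact k-mers by how much more often they occur in the sample sequences than in the background sequences. This is one simple baseline, not the method the project expects: it ignores mutations and uses no ChIP-seq-specific information. The function names, the default `k=6`, and the pseudocount are illustrative assumptions.

```python
import math
from collections import Counter

def kmer_sequence_counts(sequences, k):
    """For each k-mer, count how many sequences contain it at least once."""
    counts = Counter()
    for seq in sequences:
        counts.update({seq[i:i + k] for i in range(len(seq) - k + 1)})
    return counts

def rank_kmers(sample, background, k=6, pseudocount=1.0):
    """Return (log2 enrichment, k-mer) pairs, most enriched first."""
    fg = kmer_sequence_counts(sample, k)
    bg = kmer_sequence_counts(background, k)
    scored = []
    for kmer, n_fg in fg.items():
        # Smoothed fraction of sequences containing the k-mer in each set.
        fg_rate = (n_fg + pseudocount) / (len(sample) + 2 * pseudocount)
        bg_rate = (bg[kmer] + pseudocount) / (len(background) + 2 * pseudocount)
        scored.append((math.log2(fg_rate / bg_rate), kmer))
    scored.sort(key=lambda t: (-t[0], t[1]))  # ties broken alphabetically
    return scored

# Toy data (made up for illustration, not taken from the project files).
sample = ["AATTGACAGG", "CCTTGACATT", "GGTTGACACC"]
background = ["AAAAAAAAAA", "CCCCCCCCCC", "GGGGGGGGGG"]
for score, kmer in rank_kmers(sample, background)[:3]:
    print(f"{kmer} {score:.2f}")
```

Possible extensions that use the ChIP-seq setting include weighting each sequence by its peak intensity or counting only k-mers close to the peak position; these are suggestions to explore, not requirements stated on the slides.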