Title: Systematic Evaluation of Matrix-Based Pattern Matching in Mammals
1Systematic Evaluation of Matrix-Based Pattern
Matching in Mammals
- Jean Valery Turatsinze
- Universite Libre de Bruxelles
- SCMBB-ULB
- EMBRACE RSMD Workshop 2006-11-10
- Uppsala - Sweden
2Introduction
- An important step in understanding transcriptional regulation of genes is to locate
precisely all functional occurrences of transcription factor binding sites (TFBS) in
genomes
- Several tools have been developed to predict putative TFBS in DNA sequences (patser,
MatInspector, Match, TESS, MotifLocator...)
- General problem: trade-off between sensitivity and specificity
  - High score threshold: high specificity, but loss in sensitivity
  - Low score threshold: high sensitivity, but poor predictive value
3Questions
- Which are the optimal parameters for predicting binding sites in genome sequences?
  - Threshold on score
  - Choice of the background model
- Which level of accuracy can we hope to reach?
- We performed a systematic evaluation on the basis of a large collection (166 regulons,
287 PSSM)
4Matrix model representation of TFBS
Aligned binding sites (positions 1 to 10)
1 2 3 4 5 6 7 8 9 10
G G G A C T T T C C
G G G G A T T T C C
G G G G T T T C C C
G G G A A T C T C C
G G G A G A T T C C
G G G G A T T C C C
G G G G A A G C C C
G G G A C T T C C C
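As an illustration of how such an alignment becomes a matrix, the sketch below counts residues per column. It assumes the flattened figure reads as eight aligned 10-bp sites (an interpretation of the slide, not something it states), and the function name `count_matrix` is made up for this example.

```python
from collections import Counter

# Eight aligned 10-bp sites, as read from the slide's alignment (interpretation).
SITES = [
    "GGGACTTTCC", "GGGGATTTCC", "GGGGTTTCCC", "GGGAATCTCC",
    "GGGAGATTCC", "GGGGATTCCC", "GGGGAAGCCC", "GGGACTTCCC",
]

def count_matrix(sites):
    """Return {residue: [count in each column]} for equal-length aligned sites."""
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all sites must have the same length")
    # One Counter per alignment column.
    columns = [Counter(site[j] for site in sites) for j in range(width)]
    return {residue: [col[residue] for col in columns] for residue in "ACGT"}

counts = count_matrix(SITES)
for residue in "ACGT":
    print(residue, counts[residue])
```

Each column sums to the number of sites, which is a quick sanity check on the alignment.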
5PSSM calculation of the probability of a segment S given the matrix model M: P(S|M)
1st option: identically distributed pseudo-weight
or
2nd option: pseudo-weight distributed according to residue priors
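The equations of this slide did not survive extraction, so the sketch below uses a commonly used form of the pseudo-weight correction, (count + pseudo-weight x prior) / (number of sites + pseudo-weight). This form is an assumption and may differ from the slide. The names `corrected_frequencies`, `prob_given_matrix`, `toy_counts` and `toy_priors` are hypothetical, as are the toy numbers.

```python
def corrected_frequencies(counts, pseudo=1.0, priors=None):
    """Column-wise residue frequencies corrected with a pseudo-weight.

    priors=None  -> pseudo-weight shared equally between residues (1st option)
    priors given -> pseudo-weight shared according to residue priors (2nd option)
    """
    residues = sorted(counts)
    width = len(counts[residues[0]])
    if priors is None:
        priors = {r: 1.0 / len(residues) for r in residues}
    freqs = {r: [0.0] * width for r in residues}
    for j in range(width):
        n_sites = sum(counts[r][j] for r in residues)
        for r in residues:
            freqs[r][j] = (counts[r][j] + pseudo * priors[r]) / (n_sites + pseudo)
    return freqs

def prob_given_matrix(segment, freqs):
    """P(S|M): product over positions of the corrected frequency of each residue."""
    p = 1.0
    for j, residue in enumerate(segment):
        p *= freqs[residue][j]
    return p

# Toy 3-column count matrix and priors (illustrative values only).
toy_counts = {"A": [4, 0, 1], "C": [0, 0, 1], "G": [0, 4, 1], "T": [0, 0, 1]}
toy_priors = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}

equal = corrected_frequencies(toy_counts)
prior_based = corrected_frequencies(toy_counts, priors=toy_priors)
print(prob_given_matrix("AGA", equal))
print(prob_given_matrix("AGA", prior_based))
```

With either option, no cell of the matrix has a null frequency, so P(S|M) never collapses to zero because of a residue that was absent from the training sites.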
6Matrix-based pattern matching tools
Scan each segment of the sequence and assign
a score to it
Seq A T G C G G G A T T T C C G A
A T C C T G G A A T C G G A
Score
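The scoring formula is not legible in this transcript. A minimal sketch of the scanning step, assuming the score is the log-likelihood ratio ln(P(S|M) / P(S|B)) with a Bernoulli background (a common choice for this family of tools, taken here as an assumption), could look like this. `scan`, `TOY_FREQS` and `PRIORS` are hypothetical names and values; only the sequence comes from the slide.

```python
import math

def scan(sequence, freqs, priors, threshold=None):
    """Slide a window over the sequence and score every segment.

    Returns (start, segment, score) tuples; if a threshold is given,
    only segments scoring at least that value are kept.
    Frequencies must be non-zero (e.g. pseudo-weight corrected).
    """
    width = len(next(iter(freqs.values())))
    hits = []
    for start in range(len(sequence) - width + 1):
        segment = sequence[start:start + width]
        p_model = math.prod(freqs[r][j] for j, r in enumerate(segment))
        p_background = math.prod(priors[r] for r in segment)
        score = math.log(p_model / p_background)
        if threshold is None or score >= threshold:
            hits.append((start, segment, score))
    return hits

# Toy 3-column frequency matrix favouring "GG" followed by any residue.
TOY_FREQS = {
    "A": [0.05, 0.05, 0.25],
    "C": [0.05, 0.05, 0.25],
    "G": [0.85, 0.85, 0.25],
    "T": [0.05, 0.05, 0.25],
}
PRIORS = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
SEQ = "ATGCGGGATTTCCGA"  # sequence shown on the slide

all_hits = scan(SEQ, TOY_FREQS, PRIORS)
kept = scan(SEQ, TOY_FREQS, PRIORS, threshold=0.0)
for start, segment, score in kept:
    print(start, segment, round(score, 3))
```

Raising the threshold keeps fewer segments, which is the sensitivity versus specificity trade-off described in the introduction.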
7Background model representation of the sequence
Bernoulli model
P(S|B): probability of the sequence S given the background model B
r_i: residue found at position i of sequence S
p_{r_i}: prior probability of the residue r_i
Markov model
P(S|B): probability of the sequence S given the background model B
p_{r_j}: probability of the residue r at position j
S_i: residue at position i
m: order of the Markov model
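The equations themselves are missing from this transcript. The definitions above are consistent with the standard forms below, given as a reconstruction under that assumption. The symbol w (segment length) and the notation S_{a..b} for the sub-segment from position a to position b are introduced here for illustration.

```latex
% Bernoulli model: positions are independent
P(S \mid B) = \prod_{i=1}^{w} p_{r_i}

% Markov model of order m: each residue depends on the m preceding residues
P(S \mid B) = P(S_{1..m}) \prod_{i=m+1}^{w} P\left(r_i \mid S_{i-m..i-1}\right)
```

A Markov model of order 0 reduces to the Bernoulli model.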
8Calculation of the probability of a segment S
given the background model B: P(S|B)
Markov chain-based background model
The probability of each nucleotide depends on
the m preceding nucleotides, m being the order
of the Markov model
Example: probability of a sequence segment under
an order 2 Markov model
Seq: A T G C G G G A T T
P(S|B): probability of the sequence given the
background model
P(S|B) = P(AT) . P(G|AT) . P(C|TG) . P(G|GC) . ... . P(T|AT)
Oligonucleotide frequencies
Transition matrix
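The step from oligonucleotide frequencies to a transition matrix, and from there to P(S|B), can be sketched as follows. This is a minimal illustration, not the RSAT implementation: `train_markov`, `markov_prob` and the training sequence are hypothetical, and no pseudo-count is applied.

```python
from collections import Counter, defaultdict

def train_markov(training_seq, order):
    """Estimate prefix frequencies and transition probabilities by counting
    oligonucleotides of length order and order + 1 in the training sequence."""
    prefix_counts = Counter(
        training_seq[i:i + order] for i in range(len(training_seq) - order + 1)
    )
    total = sum(prefix_counts.values())
    prefix_freq = {word: n / total for word, n in prefix_counts.items()}

    trans_counts = defaultdict(Counter)
    for i in range(len(training_seq) - order):
        trans_counts[training_seq[i:i + order]][training_seq[i + order]] += 1
    # Each row of the transition matrix sums to 1.
    transitions = {
        prefix: {r: n / sum(nxt.values()) for r, n in nxt.items()}
        for prefix, nxt in trans_counts.items()
    }
    return prefix_freq, transitions

def markov_prob(segment, prefix_freq, transitions, order):
    """P(S|B) = P(first `order` residues) x product of transition probabilities."""
    p = prefix_freq.get(segment[:order], 0.0)
    for i in range(order, len(segment)):
        prefix, residue = segment[i - order:i], segment[i]
        p *= transitions.get(prefix, {}).get(residue, 0.0)
    return p

# Deliberately tiny training sequence (illustrative only).
prefix_freq, transitions = train_markov("ATATATAT", order=2)
print(markov_prob("ATAT", prefix_freq, transitions, order=2))
print(markov_prob("ATGCGGGATT", prefix_freq, transitions, order=2))
```

The second call returns 0 because the transition from AT to G was never observed in the tiny training sequence, which is exactly the problem that short training sets cause.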
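14General approach: comparison of the predictions
with experimentally well-characterized sites
Genomic sequence
TRANSFAC annotated sites
predicted sites
compare-features
- Intersection (inter) of annotated and predicted sites: true positives (TP)
- Difference (diff), annotated but not predicted: false negatives (FN)
- Difference (diff), predicted but not annotated: false positives (FP)
- Partial overlap: to be decided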
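The comparison can be sketched as follows. This is not the compare-features program itself: sites are modelled as closed (start, end) intervals, and the `min_overlap` parameter is a hypothetical way to settle the partial-overlap question. Sensitivity and PPV use their usual definitions, TP / (TP + FN) and TP / (TP + FP).

```python
def overlap(a, b):
    """Number of positions shared by two closed intervals (start, end)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)

def compare_sites(annotated, predicted, min_overlap=1):
    """Count TP, FN and FP between annotated and predicted sites.

    An annotated site is a true positive if at least one prediction
    overlaps it by min_overlap positions or more. Sites are assumed unique.
    """
    found = [a for a in annotated
             if any(overlap(a, p) >= min_overlap for p in predicted)]
    supported = [p for p in predicted
                 if any(overlap(a, p) >= min_overlap for a in annotated)]
    tp = len(found)
    fn = len(annotated) - tp
    fp = len(predicted) - len(supported)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    ppv = tp / (tp + fp) if (tp + fp) else 0.0
    return {"TP": tp, "FN": fn, "FP": fp, "Sn": sensitivity, "PPV": ppv}

# Illustrative coordinates only.
annotated = [(10, 19), (50, 59), (100, 109)]
predicted = [(12, 21), (70, 79), (105, 114), (200, 209)]

lenient = compare_sites(annotated, predicted, min_overlap=1)
strict = compare_sites(annotated, predicted, min_overlap=6)
print(lenient)
print(strict)
```

The two calls show how the rule chosen for partial overlaps changes the counts: the site at 100-109 is a true positive under the lenient rule and a false negative under the strict one.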
15Distribution of all human, rat and mouse
annotated sites
- Testing set for the evaluation
  - All the human transcription factors having a PSSM in TRANSFAC annotations
  - TRANSPRO promoters from -1000 to -1 and from -500 to -1 relative to the
  transcription start site (TSS)
  - This choice was justified by the fact that most TRANSFAC annotations are
  restricted to this proximal region (probably due to experimental biases).
16Choice of the background sequence set
Global model
Sliding window model
Input model
17pMatchingEval flow chart
18Accuracy optimizing score AP-1
19Accuracy optimizing score NF-kB
20Accuracy optimizing score Sp1
21Accuracy profiles (500 bp promoter)
Global BG
input BG
22Accuracy profiles (500 bp promoter)
Sliding window 500nt BG
Global BG
Sliding window 400nt BG
Sliding window 300nt BG
Sliding window 100nt BG
Sliding window 200nt BG
23Score, accuracy, PPV and Sensitivity median
profiles (500)
24Score, accuracy, PPV and Sensitivity median
profiles (1000)
25Conclusions
- Score optimizing accuracy: variable according to the matrix considered
  - Even for the same TF, different matrices give different optimal parameters
- Background model impact
  - Global calibration is generally slightly better than factor-specific and
  sliding-window calibration
- Order of the Markov chain
  - For some matrices the effect is marginal
  - For other matrices the effect is erratic
  - General trends (median profiles): almost no effect for the global model
  - For sliding windows, higher-order Markov chains (> 0) give bad results due to
  the short size of the training sets (several transitions are not observed)
- Optimal parameters should be selected on a case-by-case basis using this approach
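The remark about short training sets can be made concrete with a simple count. A model of order m needs estimates for 4^(m+1) transitions, while a window of L nucleotides contains at most L - m oligonucleotides of length m + 1. The helper names below are hypothetical and the sketch assumes a plain A, C, G, T alphabet.

```python
def max_observable(window_length, order):
    """Upper bound on the number of distinct transitions a window can contain."""
    k = order + 1  # a transition is an oligonucleotide of length order + 1
    return min(4 ** k, max(0, window_length - k + 1))

def unobserved_transitions(window, order):
    """Number of the 4^(order+1) possible transitions absent from the window."""
    k = order + 1
    observed = {window[i:i + k] for i in range(len(window) - k + 1)}
    return 4 ** k - len(observed)

# Window sizes taken from the sliding-window slides; orders 0 to 3.
for window_length in (100, 200, 300, 400, 500):
    for order in range(4):
        possible = 4 ** (order + 1)
        print(window_length, order, possible, max_observable(window_length, order))

toy_window = "ACGT" * 25  # 100 nt, deliberately repetitive
print(unobserved_transitions(toy_window, order=1))
```

For a 100 nt window and order 3, at most 97 of the 256 possible transitions can be observed at all, so most rows of the transition matrix cannot be estimated without some form of pseudo-count.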
26Acknowledgements
SCMBB Lab: Jacques van Helden, Olivier Sand, Raphaël Leplae, Rekins Janky, Karoline
Faust, Sylvain Brohée, Ariane Toussaint, Gipsi Lima Mendez, Marc Lesink, Benoit
Dessailly, Raul Mendez
RSAT: http://rsat.scmbb.ulb.ac.be/rsat/
PhD funding: F.R.I.A. (FNRS)
28Adaptive background models: Motivations
- Heterogeneity of nucleotide composition of promoters
- GC content analysis of promoters (500 bp) and matrices