Systematic Evaluation of Matrix-Based Pattern Matching in Mammals

1
Systematic Evaluation of Matrix-Based Pattern
Matching in Mammals
  • Jean Valery Turatsinze
  • Universite Libre de Bruxelles
  • SCMBB-ULB
  • EMBRACE RSMD Workshop 2006-11-10
  • Uppsala - Sweden

2
Introduction
  • An important step in understanding the
    transcriptional regulation of genes is to locate
    precisely all functional occurrences of
    transcription factor binding sites (TFBS) in
    genomes
  • Several tools have been developed to predict
    putative TFBS in DNA sequences (patser,
    MatInspector, Match, TESS, MotifLocator...)
  • General problem: trade-off between sensitivity and
    specificity
  • High score threshold: high specificity, but loss
    in sensitivity
  • Low score threshold: high sensitivity, but poor
    predictive value

3
Questions
  • Which are the optimal parameters for predicting
    binding sites in genome sequences?
  • Threshold on score
  • Choice of the background model
  • What level of accuracy can we hope to reach?
  • We performed a systematic evaluation on the basis
    of a large collection (166 regulons, 287 PSSMs)

4
Matrix model representation of TFBS
Position  1 2 3 4 5 6 7 8 9 10
Site 1    G G G A C T T T C C
Site 2    G G G G A T T T C C
Site 3    G G G G T T T C C C
Site 4    G G G A A T C T C C
Site 5    G G G A G A T T C C
Site 6    G G G G A T T C C C
Site 7    G G G G A A G C C C
Site 8    G G G A C T T C C C
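A minimal sketch of how aligned sites become a count matrix, the starting point of a PSSM. It reads the letters on this slide as eight aligned sites of length 10; that grouping, and the names `SITES` and `count_matrix`, are assumptions made for illustration.

```python
from collections import Counter

# Eight aligned sites of length 10, as read from the letters on this slide
# (the grouping into rows is an assumption about the original layout).
SITES = [
    "GGGACTTTCC", "GGGGATTTCC", "GGGGTTTCCC", "GGGAATCTCC",
    "GGGAGATTCC", "GGGGATTCCC", "GGGGAAGCCC", "GGGACTTCCC",
]

def count_matrix(sites, alphabet="ACGT"):
    """Return {residue: [count at position 1, ..., count at position w]}."""
    width = len(sites[0])
    assert all(len(s) == width for s in sites), "sites must be aligned"
    # zip(*sites) walks the alignment column by column.
    columns = [Counter(column) for column in zip(*sites)]
    return {r: [col[r] for col in columns] for r in alphabet}

counts = count_matrix(SITES)
for residue, row in counts.items():
    print(residue, row)
```

Each column of the result sums to the number of sites, which is a quick sanity check on the alignment.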
5
PSSM: calculation of the probability of a segment
S given the matrix model M, P(S|M)
1st option: identically distributed pseudo-weight
or
2nd option: pseudo-weight distributed according to
residue priors
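The formulas for this slide did not survive in the transcript. The sketch below uses the usual pseudo-count correction, (count + pseudo-weight × share) / (column total + pseudo-weight), which is an assumption and not necessarily the exact form shown on the slide. The function names and the toy matrix are hypothetical.

```python
def corrected_frequencies(counts, pseudo=1.0, priors=None):
    """Turn a count matrix into per-position frequencies with a pseudo-weight.

    priors=None  -> 1st option: pseudo-weight shared equally among residues.
    priors=dict  -> 2nd option: pseudo-weight shared according to residue priors.
    """
    residues = list(counts)
    width = len(next(iter(counts.values())))
    if priors is None:
        priors = {r: 1.0 / len(residues) for r in residues}
    freqs = {r: [] for r in residues}
    for j in range(width):
        total = sum(counts[r][j] for r in residues)
        for r in residues:
            freqs[r].append((counts[r][j] + pseudo * priors[r]) / (total + pseudo))
    return freqs

def prob_given_matrix(segment, freqs):
    """P(S|M): product over positions of the frequency of the observed residue."""
    p = 1.0
    for j, r in enumerate(segment):
        p *= freqs[r][j]
    return p

# Hypothetical 3-column count matrix built from 4 sites, for illustration only.
toy_counts = {"A": [4, 0, 1], "C": [0, 0, 1], "G": [0, 4, 1], "T": [0, 0, 1]}
uniform = corrected_frequencies(toy_counts, pseudo=1.0)
at_rich = corrected_frequencies(
    toy_counts, pseudo=1.0, priors={"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}
)
print(prob_given_matrix("AGT", uniform), prob_given_matrix("AGT", at_rich))
```

With either option a residue never seen at a position keeps a small non-zero frequency, so a single mismatch does not drive P(S|M) to zero.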
6
Matrix-based pattern matching tools
Scan each segment of the sequence and assign it a
score
Seq A T G C G G G A T T T C C G A
A T C C T G G A A T C G G A
Score
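A small sketch of the scanning step. The transcript does not give the score formula, so the log-likelihood ratio ln(P(S|M) / P(S|B)) used here is an assumption (it is a common choice for matrix-based tools). The frequency matrix `FREQS`, the uniform background `BG`, and the function name `scan` are hypothetical.

```python
import math

def scan(sequence, freqs, bg, threshold=None):
    """Slide a window of the matrix width along the sequence.

    Yields (start, segment, score) with score = ln(P(S|M) / P(S|B)),
    where B is a Bernoulli background given as residue priors `bg`.
    """
    width = len(next(iter(freqs.values())))
    for start in range(len(sequence) - width + 1):
        segment = sequence[start:start + width]
        score = sum(math.log(freqs[r][j] / bg[r]) for j, r in enumerate(segment))
        if threshold is None or score >= threshold:
            yield start, segment, score

# Hypothetical 3-column frequency matrix (each column sums to 1).
FREQS = {
    "A": [0.85, 0.05, 0.25],
    "C": [0.05, 0.05, 0.25],
    "G": [0.05, 0.85, 0.25],
    "T": [0.05, 0.05, 0.25],
}
BG = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

hits = list(scan("ATGCGGGATTTCCGA", FREQS, BG))
best = max(hits, key=lambda h: h[2])
print(len(hits), best)
```

Passing a `threshold` keeps only the segments scoring at or above it, which is where the sensitivity versus specificity trade-off enters.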
7
Background model representation of the sequence
Bernoulli model
$P(S|B) = \prod_{i=1}^{L} p_{r_i}$
P(S|B): probability of the sequence S given the
background model B; L: length of S; r_i: residue
found at position i of sequence S; p_{r_i}: prior
probability of the residue r_i
Markov model
$P(S|B) = P(S_{1..m}) \prod_{i=m+1}^{L} P(r_i \mid S_{i-m..i-1})$
P(S|B): probability of the sequence S given the
background model B; r_i: residue at position i;
S_{i-m..i-1}: the m residues preceding position i;
m: order of the Markov model
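As a minimal illustration of the Bernoulli case, the probability of a segment is the product of the prior probabilities of its residues. The AT-rich priors below are hypothetical values chosen only to make the arithmetic visible.

```python
import math

def bernoulli_prob(segment, priors):
    """P(S|B) under a Bernoulli model: product of the residue priors."""
    return math.prod(priors[r] for r in segment)

# Hypothetical AT-rich priors, for illustration only.
PRIORS = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}
p = bernoulli_prob("ATGCGGGATT", PRIORS)
print(p)
```

The segment contains five A/T and five C/G residues, so the product reduces to 0.3^5 × 0.2^5.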
8
Calculation of the probability of a segment S
given the background model B, P(S|B)
Markov chain-based background model
The probability of each nucleotide depends on
the m preceding nucleotides, m being the order
of the Markov model
Example: probability of a sequence segment under
an order-2 Markov model
Seq: A T G C G G G A T T
P(S|B): probability of the sequence given the
background model
Oligonucleotide frequencies
Transition matrix
13
Calculation of the probability of a segment S
given the background model B, P(S|B)
Markov chain-based background model
The probability of each nucleotide depends on
the m preceding nucleotides, m being the order
of the Markov model
Seq: A T G C G G G A T T
P(S|B): probability of the sequence given the
background model
$P(S|B) = P(AT) \cdot P(G|AT) \cdot P(C|TG) \cdot P(G|GC) \cdot \ldots \cdot P(T|AT)$
Oligonucleotide frequencies
Transition matrix
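The order-2 example can be sketched as follows: one term for the leading dinucleotide, then one transition term per following residue. The training sequence, the pseudo-count used for the transition table, and all function names are assumptions for illustration; the slides do not say how the oligonucleotide frequencies were smoothed.

```python
from collections import Counter

def markov_terms(segment, order):
    """Factors of P(S|B): the leading m-mer, then one 'r|prefix' term per residue."""
    return [segment[:order]] + [
        f"{segment[i]}|{segment[i - order:i]}" for i in range(order, len(segment))
    ]

def train_markov(training_seq, order, alphabet="ACGT", pseudo=1.0):
    """Estimate prefix frequencies and a transition function from one sequence."""
    prefix_counts = Counter(
        training_seq[i:i + order] for i in range(len(training_seq) - order + 1)
    )
    word_counts = Counter(
        training_seq[i:i + order + 1] for i in range(len(training_seq) - order)
    )
    n_prefixes = sum(prefix_counts.values())
    prefix_freq = {p: c / n_prefixes for p, c in prefix_counts.items()}

    def transition(prefix, residue):
        # Pseudo-count so that unobserved transitions do not get probability 0.
        total = sum(word_counts[prefix + r] for r in alphabet)
        return (word_counts[prefix + residue] + pseudo / len(alphabet)) / (total + pseudo)

    return prefix_freq, transition

def markov_prob(segment, order, prefix_freq, transition):
    """P(S|B) = P(first m residues) * product of P(r_i | m preceding residues)."""
    p = prefix_freq.get(segment[:order], 0.0)
    for i in range(order, len(segment)):
        p *= transition(segment[i - order:i], segment[i])
    return p

TRAIN = "ATGCGGGATTTCCGAATCCTGGAATCGGA"  # hypothetical training sequence
prefix_freq, transition = train_markov(TRAIN, order=2)
print(" . ".join(markov_terms("ATGCGGGATT", 2)))
print(markov_prob("ATGCGGGATT", 2, prefix_freq, transition))
```

`markov_terms` only lists the factors, which makes it easy to check the decomposition by eye before trusting the numeric value.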
14
General approach: comparison of the predictions
with experimentally well-characterized sites
Genomic sequence
TRANSFAC annotated sites
Predicted sites
compare-features
[Figure labels: inter, inter, diff, diff]
False negative (FN)
False positive (FP)
True positive (TP)
Partial overlap --> to decide
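A rough sketch of the comparison step, not the compare-features tool itself: predicted and annotated sites are treated as coordinate intervals and classified by overlap. Because the slide leaves the handling of partial overlaps open, the minimum overlap is exposed as a parameter. The coordinates and function names are hypothetical.

```python
def overlap(a, b):
    """Number of shared positions between two closed intervals (start, end)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)

def compare_sites(annotated, predicted, min_overlap=1):
    """Classify sites into TP, FP and FN from coordinate overlap.

    A predicted site is a true positive if it shares at least `min_overlap`
    positions with some annotated site. How to treat partial overlaps is
    left open on the slide, so `min_overlap` is an explicit choice here.
    """
    tp = [p for p in predicted
          if any(overlap(p, a) >= min_overlap for a in annotated)]
    fp = [p for p in predicted if p not in tp]
    fn = [a for a in annotated
          if not any(overlap(p, a) >= min_overlap for p in predicted)]
    return tp, fp, fn

# Hypothetical coordinates relative to the TSS.
annotated = [(-120, -111), (-60, -51)]
predicted = [(-118, -109), (-300, -291), (-58, -55)]
tp, fp, fn = compare_sites(annotated, predicted, min_overlap=5)
print(len(tp), len(fp), len(fn))
```

Lowering `min_overlap` to 1 turns the short prediction at (-58, -55) into a true positive, which shows how much the partial-overlap rule can shift the counts.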
15
Distribution of all human, rat and mouse
annotated sites
  • Testing set for the evaluation
  • All the human transcription factors having a PSSM
    in TRANSFAC annotations
  • TRANSPRO promoters, from -1000 to -1 and from
    -500 to -1 relative to the transcription start
    site (TSS)
  • This choice was justified by the fact that most
    TRANSFAC annotations are restricted to this
    proximal region (probably due to experimental
    biases).

16
Choice of the background sequence set
Global model
Sliding window model
Input model
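The three labels are not defined in the transcript. The sketch below assumes one plausible reading: a global model trained on the whole promoter collection, an input model trained on the sequences being scanned, and a sliding window model trained on a local window around the scanned position. For brevity it estimates an order-0 background only; the function names and the pseudo-count are hypothetical.

```python
from collections import Counter

def residue_priors(sequence, alphabet="ACGT", pseudo=1.0):
    """Order-0 background: residue frequencies with a small pseudo-count."""
    counts = Counter(sequence)
    total = sum(counts[r] for r in alphabet) + pseudo
    return {r: (counts[r] + pseudo / len(alphabet)) / total for r in alphabet}

def global_background(all_promoters):
    """One model trained on the whole promoter collection."""
    return residue_priors("".join(all_promoters))

def input_background(input_sequences):
    """One model trained on the sequences being scanned."""
    return residue_priors("".join(input_sequences))

def window_background(sequence, position, window=200):
    """A local model trained on a window centred on the scanned position."""
    half = window // 2
    start = max(0, position - half)
    return residue_priors(sequence[start:position + half])

# Hypothetical promoter: GC-rich upstream half, AT-rich downstream half.
promoter = "GC" * 100 + "AT" * 100
left = window_background(promoter, 50, window=100)
right = window_background(promoter, 350, window=100)
print(round(left["G"], 3), round(right["G"], 3))
```

On a sequence with heterogeneous composition the local model changes along the sequence, while the global model averages the differences out.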
17
pMatchingEval flow chart
18
Accuracy-optimizing score: AP-1
19
Accuracy-optimizing score: NF-kB
20
Accuracy-optimizing score: Sp1
21
Accuracy profiles (500 bp promoter)
Global BG
input BG
22
Accuracy profiles (500 bp promoter)
Sliding window 500nt BG
Global BG
Sliding window 400nt BG
Sliding window 300nt BG
Sliding window 100nt BG
Sliding window 200nt BG
23
Score, accuracy, PPV and Sensitivity median
profiles (500)
24
Score, accuracy, PPV and Sensitivity median
profiles (1000)
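A sketch of how a score threshold could be chosen from sensitivity, PPV and accuracy. The transcript does not define accuracy, so the geometric mean of sensitivity and PPV used below is an assumption, and the scored predictions are invented for illustration.

```python
import math

def sensitivity(tp, fn):
    """Fraction of annotated sites that were predicted."""
    return tp / (tp + fn) if tp + fn else 0.0

def ppv(tp, fp):
    """Fraction of predicted sites that match an annotated site."""
    return tp / (tp + fp) if tp + fp else 0.0

def accuracy(tp, fp, fn):
    """Geometric mean of sensitivity and PPV (an assumed definition)."""
    return math.sqrt(sensitivity(tp, fn) * ppv(tp, fp))

def best_threshold(scored_predictions, n_annotated):
    """Return (threshold, accuracy) maximising accuracy.

    `scored_predictions` is a list of (score, is_true_site) pairs; each
    annotated site is assumed to be matched by at most one prediction.
    """
    best = (None, -1.0)
    for threshold in sorted({score for score, _ in scored_predictions}):
        kept = [hit for score, hit in scored_predictions if score >= threshold]
        tp = sum(kept)
        fp = len(kept) - tp
        fn = n_annotated - tp
        acc = accuracy(tp, fp, fn)
        if acc > best[1]:
            best = (threshold, acc)
    return best

# Hypothetical scored predictions: (weight score, overlaps an annotated site?)
preds = [(9.1, True), (7.4, True), (6.8, False), (5.2, True),
         (4.9, False), (4.1, False), (3.7, False)]
threshold, acc = best_threshold(preds, n_annotated=4)
print(threshold, acc)
```

Raising the threshold above the optimum loses true sites faster than it removes false ones, and lowering it does the opposite, which is the shape the accuracy profiles trace.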
25
Conclusions
  • Score optimizing accuracy: variable according to
    the matrix considered
  • Even for the same TF, different matrices give
    different optimal parameters
  • Background model impact
  • Global calibration is generally slightly better
    than factor-specific and sliding window
    calibration
  • Order of the Markov chain
  • For some matrices the effect is marginal
  • For other matrices the effect is erratic
  • General trends (median profiles): almost no
    effect for the global model
  • For sliding windows, higher-order Markov chains
    (>0) give bad results due to the short size of
    the training sets (several transitions are not
    observed)
  • Optimal parameters should be selected on a
    case-by-case basis using this approach
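The point about short training sets can be illustrated by counting how many of the possible transitions are actually seen in a window. An order-m model has 4^(m+1) transitions, while a window of n nucleotides contains at most n - m of them. The random sequence below is a hypothetical stand-in for a promoter window.

```python
from collections import Counter
import random

def observed_fraction(sequence, order, alphabet="ACGT"):
    """Fraction of the 4**(order+1) possible transitions seen in `sequence`."""
    words = Counter(
        sequence[i:i + order + 1] for i in range(len(sequence) - order)
    )
    possible = len(alphabet) ** (order + 1)
    return len(words) / possible

random.seed(0)  # hypothetical random sequence standing in for a promoter window
window = "".join(random.choice("ACGT") for _ in range(100))
for order in range(0, 5):
    print(order, round(observed_fraction(window, order), 3))
```

For a 100 nt window, an order-4 model can observe at most 96 of its 1024 transitions, so most transition probabilities would rest on pseudo-counts alone.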

26
Acknowledgements
SCMBB Lab: Jacques van Helden, Olivier Sand,
Raphaël Leplae, Rekins Janky, Karoline Faust,
Sylvain Brohée, Ariane Toussaint, Gipsi Lima
Mendez, Marc Lesink, Benoit Dessailly, Raul Mendez
RSAT: http://rsat.scmbb.ulb.ac.be/rsat/
PhD funding: F.R.I.A. (FNRS)
28
Adaptive background models: motivations
  • Heterogeneity of nucleotide composition of
    promoters
  • GC content analysis of promoters (500bp) and
    matrices