Comparing Database Search Methods

Slides: 34
Provided by: Alts4

1
Comparing Database Search Methods & Improving the
Performance of PSI-BLAST
  • Stephen Altschul

2
Gold standards for protein classification
  • Traditional curated sequence databases with
    family and superfamily classifications
  • PIR
  • SWISS-PROT
  • Structure-based protein domain classification
  • SCOP

3
Measuring retrieval accuracy
                   Sequence related       Sequence unrelated
Search positive    TP (true positive)     FP (false positive)    P = TP + FP
Search negative    FN (false negative)    TN (true negative)     N = FN + TN
                   R = TP + FN            U = FP + TN
Sensitivity = TP/R
Specificity = TP/P
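The two ratios above can be sketched in a few lines of Python. This is only an illustration: the function name and the counts in the usage line are made up, not taken from the talk. Note that the slide's "specificity" (TP/P) is the quantity many other texts call precision.

```python
def retrieval_accuracy(tp, fp, fn):
    """Return (sensitivity, specificity) using the slide's definitions."""
    related = tp + fn      # R = TP + FN: all related sequences
    positives = tp + fp    # P = TP + FP: everything the search returned
    if related == 0 or positives == 0:
        raise ValueError("need at least one related sequence and one search positive")
    sensitivity = tp / related     # TP / R
    specificity = tp / positives   # TP / P, as defined on the slide
    return sensitivity, specificity


# Hypothetical counts, chosen only to exercise the function.
print(retrieval_accuracy(tp=40, fp=10, fn=60))
```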
4
Receiver Operating Characteristic curve
[Figure: ROC plot; the only surviving axis labels are "False" and "True"]
5
Random retrieval on a ROC plot
6
Line of fixed sensitivity
7
Line of fixed specificity
8
Line of fixed crossover ratio
9
Line of fixed crossover ratio
10
ROC score: area under the ROC curve
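A minimal sketch of how the area under the ROC curve could be computed from a ranked hit list. It assumes the search output is a best-first list of booleans (True for a related sequence) and ignores tied scores; the function name is hypothetical.

```python
def roc_score(ranked_is_related):
    """Area under the ROC curve for a best-first ranking of booleans."""
    related_total = sum(ranked_is_related)
    unrelated_total = len(ranked_is_related) - related_total
    if related_total == 0 or unrelated_total == 0:
        raise ValueError("ranking must contain related and unrelated sequences")
    true_positives = 0
    area = 0
    for is_related in ranked_is_related:
        if is_related:
            true_positives += 1
        else:
            # Each false positive adds one column whose height is the
            # number of true positives ranked above it.
            area += true_positives
    # Normalize so that a perfect ranking scores 1.
    return area / (related_total * unrelated_total)


print(roc_score([True, True, False, True, False, False]))
```

A ranking that places every related sequence first scores 1, and one that places them all last scores 0.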
11
Region of interest in ROC analysis
12
Region of interest in ROC analysis
13
Truncated ROC, or ROCn curve
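[Figure: truncated ROC curve; x-axis: fraction unrelated accepted; surviving tick labels "0" and "103" (the latter is probably a garbled exponent)]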
14
ROCn score: area under the ROCn curve
15
Questions concerning ROC analysis
  • What false-positive cutoff value should be used?
  • When does it make sense to pool the results of
    database searches?
  • When are the ROC scores for two different methods
    significantly different?

16
Marginal ratio of true to false positives
17
Definition of the ROCn score
t_i = number of related sequences (true positives)
returned before the i-th false positive
t = total number of related sequences
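The formula itself did not survive in this transcript. A common form in the ROC literature cited later in the deck is ROCn = (t_1 + ... + t_n) / (n t), and the sketch below assumes that form. The function name and the handling of rankings with fewer than n false positives are illustrative choices, not taken from the talk.

```python
def rocn_score(ranked_is_related, n):
    """ROCn for a best-first ranking of booleans (True = related)."""
    total_related = sum(ranked_is_related)          # t
    if n <= 0 or total_related == 0:
        raise ValueError("need n > 0 and at least one related sequence")
    seen_related = 0
    false_positives = 0
    sum_t = 0
    for is_related in ranked_is_related:
        if is_related:
            seen_related += 1
        else:
            false_positives += 1
            sum_t += seen_related                   # t_i for this false positive
            if false_positives == n:
                break
    # Assumption: if the ranking holds fewer than n false positives, the
    # missing t_i are set to the number of related sequences found.
    sum_t += (n - false_positives) * seen_related
    return sum_t / (n * total_related)


print(rocn_score([True, True, False, True, False, False], n=2))
```

In the usage line, t_1 = 2, t_2 = 3 and t = 3, so the score is 5/6.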
18
Random distribution of ROCn scores
  • Bootstrap resampling can be used to assign a
    statistical significance to differences in ROCn
    scores.
  • Under reasonable assumptions, the distribution of
    bootstrapped ROCn scores is approximately normal.
  • Resampling a small subset in a large database is
    equivalent to resampling the subset with
    independent Poisson distributions with mean 1.
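The Poisson remark suggests a simple way to simulate the resampling: keep every related sequence once and give each unrelated sequence an independent Poisson(1) number of copies. The sketch below does this for a single ranking. The ROCn formula (sum of t_i divided by n t), the padding rule for short rankings, and all function names are assumptions made for illustration.

```python
import math
import random
import statistics


def poisson1(rng):
    """Draw from a Poisson distribution with mean 1 (Knuth's method)."""
    limit = math.exp(-1.0)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def rocn(flags, n):
    """ROCn of a best-first list of booleans, assuming sum(t_i) / (n * t)."""
    total_related = sum(flags)
    seen_related = false_positives = sum_t = 0
    for is_related in flags:
        if is_related:
            seen_related += 1
        else:
            false_positives += 1
            sum_t += seen_related
            if false_positives == n:
                break
    sum_t += (n - false_positives) * seen_related  # pad short rankings
    return sum_t / (n * total_related)


def bootstrap_rocn(ranking, n, draws=1000, seed=0):
    """Mean and standard deviation of ROCn when only false positives are resampled."""
    if not any(ranking):
        raise ValueError("ranking must contain at least one related sequence")
    rng = random.Random(seed)
    scores = []
    for _ in range(draws):
        resampled = []
        for is_related in ranking:
            copies = 1 if is_related else poisson1(rng)
            resampled.extend([is_related] * copies)
        scores.append(rocn(resampled, n))
    return statistics.mean(scores), statistics.stdev(scores)


mean, sd = bootstrap_rocn([True, True, False, True, False, False, True, False], n=2)
print(mean, sd)
```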

19
Bootstrap resampling of false positives
[Figure: retrieval ranking of the database]
20
Mean and variance for the normal distribution of
ROCn scores yielded by resampling only the false
positives
21
Mean and variance for the normal distribution of
the difference of two ROCn scores, yielded by
resampling only the false positives
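The closed-form mean and variance from this slide are not preserved in the transcript. As a rough stand-in, the difference of two ROCn scores can be simulated: both methods are scored on the same resampled database, so each unrelated sequence gets one shared Poisson(1) copy count per draw. This is a sketch under that assumption, with hypothetical names and an assumed ROCn formula (sum of t_i divided by n t), not the authors' derivation.

```python
import math
import random
import statistics


def _poisson1(rng):
    """Poisson(1) draw via Knuth's method."""
    limit = math.exp(-1.0)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def _rocn(flags, n):
    """ROCn of a best-first list of booleans; short rankings are padded."""
    total_related = sum(flags)
    seen = false_positives = sum_t = 0
    for is_related in flags:
        if is_related:
            seen += 1
        else:
            false_positives += 1
            sum_t += seen
            if false_positives == n:
                break
    sum_t += (n - false_positives) * seen
    return sum_t / (n * total_related)


def rocn_difference(ranking_a, ranking_b, n, draws=1000, seed=0):
    """Mean and standard deviation of ROCn(a) - ROCn(b).

    Each ranking is a best-first list of (sequence_id, is_related) pairs.
    """
    rng = random.Random(seed)
    unrelated = sorted({sid for sid, rel in ranking_a + ranking_b if not rel})
    diffs = []
    for _ in range(draws):
        # One shared copy count per unrelated sequence for both methods.
        copies = {sid: _poisson1(rng) for sid in unrelated}

        def resample(ranking):
            flags = []
            for sid, rel in ranking:
                flags.extend([rel] * (1 if rel else copies[sid]))
            return flags

        diffs.append(_rocn(resample(ranking_a), n) - _rocn(resample(ranking_b), n))
    return statistics.mean(diffs), statistics.stdev(diffs)
```

Dividing the returned mean by the returned standard deviation gives a z-like statistic, which is meaningful only if the normal approximation mentioned in the talk holds for the data at hand.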
22
PSI-BLAST in a nutshell
  • With a protein sequence as query, use BLAST to
    search a protein sequence database.
  • Collapse significant local alignments (those with
    E-value less than or equal to a set threshold h)
    into a multiple alignment, using the residues of
    the query sequence as alignment-column
    placeholders.
  • Abstract a position-specific score matrix from
    the multiple alignment.
  • Search the database with the score matrix as
    query.
  • Iterate a fixed number of times, or until
    convergence.
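The control flow of these steps can be sketched as a loop. This is not the PSI-BLAST implementation: `search` and `build_model` are hypothetical callables standing in for the database search and for the construction of the multiple alignment and score matrix, and convergence is taken to mean that the set of accepted sequences stops changing.

```python
def iterate_search(query, search, build_model, threshold_h, max_iterations=5):
    """Iterate search and model building until convergence.

    search(model) must return {sequence_id: e_value};
    build_model(query, accepted_ids) must return the next model.
    Both are placeholders supplied by the caller.
    """
    accepted = frozenset()
    model = build_model(query, accepted)      # first pass: query alone
    for iteration in range(1, max_iterations + 1):
        hits = search(model)
        # Keep alignments with E-value <= h.
        newly_accepted = frozenset(
            sid for sid, e_value in hits.items() if e_value <= threshold_h
        )
        if newly_accepted == accepted:
            return model, accepted, iteration  # converged
        accepted = newly_accepted
        model = build_model(query, accepted)
    return model, accepted, max_iterations


# Toy stand-ins: sequence "b" is only found once "a" is in the model.
def toy_build(query, accepted):
    return (query, accepted)


def toy_search(model):
    _, accepted = model
    hits = {"a": 1e-10, "c": 5.0}
    if "a" in accepted:
        hits["b"] = 1e-5
    return hits


print(iterate_search("QUERY", toy_search, toy_build, threshold_h=1e-3))
```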

23
Protocol for evaluating PSI-BLAST
  • For each query sequence, search a comprehensive
    protein sequence database (e.g. NCBI's nr)
    through a fixed number of PSI-BLAST iterations,
    or until convergence.
  • Use the resulting position-specific score matrix
    to search the gold standard database.
  • Pool the search results for ROC analysis.
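One way to read the pooling step: merge the hit lists of all queries into a single list ordered by E-value and score that list as a whole. The sketch below assumes each hit is an (e_value, is_related) pair, that every related sequence appears somewhere in the hit lists, and that ROCn is the sum of t_i divided by n t. The names are hypothetical.

```python
def pooled_rocn(results_by_query, n):
    """ROCn over the hits of all queries, pooled and sorted by E-value."""
    pooled = [hit for hits in results_by_query.values() for hit in hits]
    pooled.sort(key=lambda hit: hit[0])          # smallest E-value first
    total_related = sum(1 for _, is_related in pooled if is_related)
    if n <= 0 or total_related == 0:
        raise ValueError("need n > 0 and at least one related hit")
    seen_related = false_positives = sum_t = 0
    for _, is_related in pooled:
        if is_related:
            seen_related += 1
        else:
            false_positives += 1
            sum_t += seen_related
            if false_positives == n:
                break
    sum_t += (n - false_positives) * seen_related  # pad short pooled lists
    return sum_t / (n * total_related)


# Made-up hit lists for two queries.
results = {
    "query1": [(1e-20, True), (1e-3, False), (0.5, True)],
    "query2": [(1e-8, True), (1e-5, False), (2.0, False)],
}
print(pooled_rocn(results, n=2))
```

Pooling by E-value only makes sense if E-values are comparable across queries, which is one of the questions the talk raises about when pooling is appropriate.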

24
The effect of acceptance threshold h on PSI-BLAST
accuracy
25
Some ideas for improving PSI-BLAST
  • 1. New statistical parameters
  • 2. Smith-Waterman alignment
  • 3. Substitution matrix frequency ratios
  • 4. Apply SEG to database sequences
  • 5. Composition-based statistics
  • 6. Concentrated accounting of gaps
  • 7. Dispersed accounting of gaps
  • 8. Exponentiate Henikoff weights
  • 9. Reverse sequence normalization
  • 10. Window for amino acid composition
  • 11. Use pseudocounts with composition window
  • 12. Vary gap costs
  • 13. Generalized affine gap costs
  • 14. Substitution score offset
  • 15. Information-dependent pseudocount parameter
  • 16. Database-sequence length-normalization
  • 17. Restricted score rescaling
  • 18. Adjust purging percentage
  • 19. Adjust pseudocount parameter
  • 20. Adjust acceptance threshold

26
The effect of composition-based statistics on
PSI-BLAST accuracy
27
Composition-based statistics
  • Statistics based on standard amino acid
    frequencies can differ by orders of magnitude
    from those based upon the peculiar composition of
    two proteins.
  • Standard protein: 4.5 N
  • DNA pol III, β chain (M. genitalium): 12.1 N
  • DNA pol III, β chain (C. jejuni): 7.6 N
  • Depending upon the composition assumed, a
    search of nr with M. genitalium DNA pol III as
    query yields different E-values for C. jejuni DNA
    pol III, as well as for the highest-scoring false
    positive
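  • Standard statistics: 10^-10 (C. jejuni DNA pol III),
    0.0002 (highest-scoring false positive)
  • Composition-based statistics: 0.001 (C. jejuni DNA
    pol III), 0.2 (highest-scoring false positive)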
  • At a threshold of 0.0001, standard
    statistics yield 54 true positives, while at 0.1,
    composition-based statistics yield 55 true
    positives.

28
The effect of dispersed accounting of gaps on
PSI-BLAST accuracy
29
The effect of restricted score rescaling and
parameter tuning on PSI-BLAST accuracy
30
Accuracy of PSI-BLAST
Program version                                    ROC100 score
Original (h = 10^-6)                               0.758 ± 0.005
Composition-based statistics (h = 0.002)           0.879 ± 0.003
Dispersed gap accounting (h = 0.005)               0.884 ± 0.002
Restricted score rescaling (b = 9, p = 0.94)       0.895 ± 0.003
31
PSI-BLAST accuracy as a function of the number of
iterations
32
Literature
  • ROC analysis
  • Swets, J.A. (1988) Science 240:1285-1293
  • Gribskov, M. & Robinson, N.L. (1996) Comput.
    Chem. 20:25-33
  • PSI-BLAST
  • Altschul, S.F. et al. (1997) Nucl. Acids Res.
    25:3389-3402
  • Composition-based statistics
  • Karplus, K. et al. (1998) Bioinformatics
    14:846-856
  • Schäffer, A.A. et al. (1999) Bioinformatics
    15:1000-1011
  • Mott, R. (2000) J. Mol. Biol. 300:649-659
  • Statistics of ROCn resampling
  • Schäffer, A.A. et al. (2001) Nucl. Acids Res.
    29:2994-3005
  • Spouge, J.L. & Czabarka, E. (2002) ISMB Poster
    133A

33
Acknowledgements
  • Analysis of ROCn score distribution
  • John Spouge
  • Eva Czabarka
  • Improvements to PSI-BLAST
  • Alejandro Schäffer
  • L. Aravind
  • Thomas Madden
  • Sergei Shavirin
  • John Spouge
  • Yuri Wolf
  • Eugene Koonin