Title: Database Searching
1Database Searching
2Searching for Data
- Text Patterns
- LookUp
- Sequence Patterns
- FindPatterns
- ProfileSearch
- Sequence Similarity
- FastA, TFastA
- BLAST, NetBLAST
3Introduction to Data Base Searching
- What are you looking for?
4"Exact" matches
- "Have I cloned something that someone else has
already worked on?"
5"Related" sequences
- Is there something similar to my sequence?
- Evolutionary relationships
- Convergent function
6Search Program Considerations
- Sensitivity
- Stringency
- Speed
- Cost
7Speed and Cost
- Time and cost of the search depend on the size of the database and the size of the query
- Restrict the size of the database
- Use the -batch qualifier to save money
- Use GenBank's Services
8Results
- Histogram
- Plot of "match scores" vs. number of sequences
- Allows you to distinguish background noise from significant matches
- Sequence Names
- Alignments
9FindPatterns
- Locate short sequence patterns in sequences
- Nucleic acid or Protein
- Searches both strands of a nucleic acid sequence
10Pattern Definitions
- FindPatterns, Map, MapSort, MapPlot, and Motifs all let you search with ambiguous expressions
- Expressions can include any legal GCG sequence character
- Expressions can also specify
- OR and NOT matching
- Begin and end constraints
- Repeat counts
11Repeats
- Parentheses () enclose one or more symbols that can be repeated
- Braces {} enclose numbers that tell how many times the symbol(s) must be found
- (GA){2,10} - GA repeated 2 to 10 times
- G{2,} - G repeated 2 to 350,000 times
- (GAT){,10} - GAT repeated 0 to 10 times
12TAATA(N){20,30}ATG
- TAATA, followed by 20 to 30 of any base,
followed by ATG
13OR Matching
- Enclose the different choices in parentheses and separate the choices with commas
- RGF(Q,A)S
- RGF followed by either Q or A followed by S.
- GAT(TG,T,G){1,4}A means
- GAT followed by any combination of TG, T, or G repeated from 1 to 4 times followed by A
14NOT Matching
- Use the ~ symbol
- GC~CAT
- GC, followed by any symbol except C, followed by AT
- GC~(A,T)CC
- GC followed by any symbol except A or T, followed by CC.
15 BEGIN AND END Constraints
- The pattern <GACCAT can only be found at the beginning of the sequence
- The pattern GACCAT> can only be found at the end of the sequence
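The pattern rules on the preceding slides map closely onto regular expressions. The Python sketch below is not part of GCG. It assumes the notation described for FindPatterns (parentheses with commas for OR, braces for repeat counts, ~ for NOT, < and > for begin and end constraints), and the names `findpatterns_to_regex` and `symbol_class` are made up for illustration. It searches one strand only and ignores mismatches.

```python
import re

# IUPAC nucleotide ambiguity codes mapped to the bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def symbol_class(symbols, nucleic, negate=False):
    """Return a regex fragment matching (or excluding) the given symbols."""
    if nucleic:
        members = "".join(IUPAC[s] for s in symbols)
    else:
        members = symbols
    members = "".join(sorted(set(members)))
    if len(members) == 1 and not negate:
        return members
    return "[" + ("^" if negate else "") + members + "]"


def findpatterns_to_regex(pattern, nucleic=True):
    """Translate a FindPatterns-style expression into a Python regex string."""
    p = pattern.upper()
    out = []
    i = 0
    while i < len(p):
        ch = p[i]
        if ch == "<":                      # begin constraint
            out.append("^")
        elif ch == ">":                    # end constraint
            out.append("$")
        elif ch == "~":                    # NOT: one symbol or a (a,b) list
            if p[i + 1] == "(":
                close = p.index(")", i)
                excluded = p[i + 2:close].replace(",", "")
                i = close
            else:
                excluded = p[i + 1]
                i += 1
            out.append(symbol_class(excluded, nucleic, negate=True))
        elif ch == "(":
            out.append("(?:")
        elif ch == ")":
            out.append(")")
        elif ch == ",":                    # OR between choices
            out.append("|")
        elif ch == "{":                    # repeat count {m,n}, {m,} or {,n}
            close = p.index("}", i)
            body = p[i + 1:close]
            if "," in body:
                low, high = body.split(",")
                out.append("{%s,%s}" % (low or "0", high))
            else:
                out.append("{" + body + "}")
            i = close
        else:
            out.append(symbol_class(ch, nucleic))
        i += 1
    return "".join(out)


if __name__ == "__main__":
    regex = findpatterns_to_regex("TAATA(N){20,30}ATG")
    sequence = "GG" + "TAATA" + "C" * 25 + "ATG" + "TT"
    print(regex, bool(re.search(regex, sequence)))
```

For protein patterns such as RGF(Q,A)S, pass `nucleic=False` so that letters like R are treated as amino acids rather than nucleotide ambiguity codes.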
16analyze findpatterns -check
FindPatterns identifies sequences that contain short patterns like GAATTC or YRYRYRYR. You can define the patterns ambiguously and allow mismatches. You can provide the patterns in a file or simply type them in from the terminal.
Minimal Syntax: findpatterns -INfile=GenbankHumig -Default
Prompted Parameters:
-PATterns=GAATTC,RGGAY       patterns to be found
-OUTfile=findpatterns.find   the output file name
Local Data Files:
-DATa=pattern.dat            a file with a set of patterns
17Optional Parameters:
-MISmatch=1      allows mismatches in the search for your subsequence
-NAMes           writes the output as a list file
-ONEstrand       searches only the top strand of nucleotide sequences
-SIXbase         searches only for patterns with six or more symbols
-CIRcular        searches all sequences as if they were circular
-ALL             does an "overlapping-set" search in nucleotide sequences
-PERFect         looks only for perfect matches
-APPend          appends the pattern data file to the output file
-SHOw            shows every file searched even if there are no finds
-TERminal        writes output to the terminal screen instead of a file
-NOMONitor       suppresses the screen trace showing each file
-ONCe            limits finds to patterns found a maximum of 1 time
-MINCuts=1       limits finds to patterns found a minimum of 1 time
-MAXCuts=3       limits finds to patterns found a maximum of 3 times
-EXCLude=n1,n2   excludes patterns found between positions n1 and n2
-SINce=6.90      limits search to sequences dated on or after June 1990
-BATch           submits the program to run in the batch queue
Add what to the command line ?
18 FINDPATTERNS in what sequence(s) ? swp
Enter patterns individually, one per line. End the list with a blank line.
Pattern 1   ygdd
Pattern 2
What should I call the output file ( findpatterns.find ) ? ygdd.find
findpatterns will run as a batch or at job.
findpatterns was submitted using the command " atnow "
Job class000.894911339.a will be run at Mon May 11 13:28:59 CDT 1998.
analyze
19! FINDPATTERNS on swp allowing 0 mismatches
! 1 YGDD   May 11, 1998 11:02 ..
AAC1_PSEAE  ck 7052  len 177  ! P23181 pseudomonas aeruginosa. gentamicin 3'-acetyltransferase (ec 2.3.1.6
1 YGDD
    148  YVQAD YGDD PAVAL
AMDZ_YEAST  ck 8601  len 464  ! Q03557 saccharomyces cerevisiae (baker's yeast). probable amidase ymr293c
1 YGDD
    450  QVVGQ YGDD STVLD
AMOB_NITEU  ck 4649  len 420  ! Q04508 nitrosomonas europaea. ammonia monooxygenase (ec 1.13.12.-). 2/96
1 YGDD
    227  RVLLA YGDD LLMDP
AMYM_BACST  ck 5976  len 717  ! P19531 bacillus stearothermophilus. maltogenic alpha-amylase precursor (ec
20
POLG_HRV1B  VPSGCSGTSI FNTMINNIII RTLVLDAYKN IDLDKLKIIA YGDDVIFSYK
POLG_HRV2   VPSGCSGTSI FNTMINNIII RTLVLDAYKN IDLDKLKIIA YGDDVIFSYI
POLG_HRV89  MPSGCAGTSI FNTIINNIII RTLVLDAYKN IDLDKLKILA YGDDVIFSYN
POLG_CXA16  MPSGCSGTSI FNSMINNIII RTLLIKTFKG IDLDELNMVA YGDDVLASYP
POLG_HE71M  MPSGCSGTSI FNSMINNIII RTLLIKTFKG IDLDELNMVA YGDDVLASYP
POLG_HE71B  MPSGCSGTSI FNSMINNIII RTLLIKTFKG IDLDELKMVA YGDDVLASYP
POLG_SVDVU  MPSGCSGTSI FNSMINNIII RTLMLKVYKG IDLDQFRMIA YGDDVIASYP
POLG_SVDVH  MPSGCSGTSI FNSMINNIII RTLMLKVYKG IDLDQFRMIA YGDDVIASYP
POLG_COXB5  MPSGCSGTSI FNSMINNIII RTLMLKVYKG IDLDQFRMIA YGDDVIASYP
POLG_COXB3  MPSGCSGTSI FNSMINNIII RTLMLKVYKG IDLDQFRMIA YGDDVIASYP
POLG_CXA9   MPSGCSGTSI FNSMINNIII RTLMLKVYKG IDLDQFRMIA YGDDVIASYP
POLG_COXB4  MPSGCSGTSI FNSMINNIII RTLMLKVYKG IDLDQFRMIA YGDDVIASYP
POLG_COXB1  MPSGCSGTSI FNSMINNIII RTLMLKVYKG IDLDQFRMIA YGDDVIASYP
POLG_EC11G  MPSGYSGTSM FNSMINNIII RTLMLKVYKG IDLDQFRMIA YGDDVIASYP
POLG_FMDV1  MPSGCSATSI INTILNNIYV LYALRRHYEG VELDTYTMIS YGDDIVVASD
POLG_FMDVO  MPSGCSATSI INTILNNIYV LYALRRHYEG VELDTYTMIS YGDDIVVASD
POLG_FMDVZ  MPSGCSATSI INTILNNIYV LYALRRHYEG VELDTYTMIS YGDDIVVASD
POLG_FMDVA  MPSDCSATGI INTILNNIYV LYALRRHYEG VELDTYTMIS YGDDIVVASD
POLG_FMDVS  MPSGCSATSI VNTILNNIYV LYALRRHYEG VELDTYTMIS YGDDIVVASD
POLG_TMEVB  LPSGCAATSM LNTIMNNVII RAALYLTYSN FDFDDIKVLS YGDDLLIGTN
POLG_TMEVG  LPSGCAATSM LNTIMNNVII RAALYLTYSN FEFDDIKVLS YGDDLLIGTN
POLG_TMEVD  LLSGCAATSM LNTIMNNVII RAALYLTYSN FEFDDIKVLS YGDDLLIGTN
POLG_EMCVD  LPSGCAATSM LNTIMNNIII RAGLYLTYKN FEFDDVKVLS YGDDLLVATN
POLG_EMCVB  LPSGCAATSM LNTIMNNIII RAGLYLTYKN FEFDDVKVLS YGDDLLVATN
POLG_EMCV   LPSGCAATSM LNTIMNNIII RAGLYLTYKN FEFDDVKVLS YGDDLLVATN
21! FINDPATTERNS on swp allowing 0 mismatches
! 1 (L,I,V)(S,A)YGDD(L,I,V){2}   May 11, 1998 11:31 ..
AMOB_NITEU  ck 4649  len 420  ! Q04508 nitrosomonas europaea. ammonia monooxygenase (ec 1.13.12.-). 2/96
1 (L,I,V)(S,A)YGDD(L,I,V){2}   (L)(A)YGDD(L){2}
    225  RSRVL LAYGDDLL MDPMD
POLG_BOVEV  ck 7260  len 2,175  ! P12915 bovine enterovirus (strain vg-5-27) genome polyprotein (coat protei
1 (L,I,V)(S,A)YGDD(L,I,V){2}   (I)(A)YGDD(L,V){2}
    2,038  DDLKI IAYGDDVL ASYPY
POLG_COXB1  ck 4153  len 2,182  ! P08291 coxsackievirus b1. genome polyprotein (coat proteins vp1 to vp4 co
1 (L,I,V)(S,A)YGDD(L,I,V){2}   (I)(A)YGDD(I,V){2}
    2,045  DQFRM IAYGDDVI ASYPW
POLG_COXB3  ck 7699  len 2,185  ! P03313 coxsackievirus b3. genome polyprotein (coat proteins vp1 to vp4 co
22FastA
- Search nucleotide sequences with a nucleotide query
- Search protein sequences with a peptide query
23FastA Algorithm
- Uses a word search algorithm
- Breaks the search into steps
- Only the sequences with the best scores are searched in subsequent steps
- Relatively fast
- Sensitive
24Step 1
- Scan the sequence database for the best hits
- Uses a word-match type search
- Looks for runs of short, perfect matches
- Essentially a dotplot-like search
- Then find the 10 Best diagonals for each sequence
pair
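Step 1 can be sketched in a few lines of Python. This is an illustration of the word-match idea only, not the FastA source: the function name `best_diagonals` is hypothetical, every word hit counts as 1, and the real program scores runs of hits along each diagonal rather than simply counting them.

```python
from collections import Counter, defaultdict


def best_diagonals(query, subject, ktup=2, top=10):
    """Count exact word matches per diagonal and return the best ones.

    A diagonal is identified by (query position - subject position); all
    word matches that share that offset lie on the same dotplot diagonal.
    """
    # Index every word of length ktup in the query by its start position.
    index = defaultdict(list)
    for i in range(len(query) - ktup + 1):
        index[query[i:i + ktup]].append(i)

    # Scan the subject once and look each of its words up in the index.
    hits = Counter()
    for j in range(len(subject) - ktup + 1):
        for i in index.get(subject[j:j + ktup], ()):
            hits[i - j] += 1

    return hits.most_common(top)


if __name__ == "__main__":
    # The query occurs in the subject shifted by two positions,
    # so diagonal -2 collects the most word hits.
    print(best_diagonals("ACGTACGT", "TTACGTACGT"))
```

Because the query is indexed once and each database sequence is scanned once, the cost of this step grows roughly with the database size, which is why it is used as a fast filter before the slower alignment steps.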
25Initial Scan (DotPlot)
DatabaseSequences
Query Sequence(s)
26Best Diagonals
27Step 2
- Rescore the initial diagonals
- Conservative replacements
- Uses the Blosum50 symbol comparison table
28Initial regions
- Regions are areas of diagonals with the highest scores
- Score reported as init1
29Diagonals with the Highest Scores
30Step 3
- Join adjacent diagonals.
- Find the optimal subset of initial regions which can be joined together.
- Corrects for gaps
- Score reported as initn
31Joined Diagonals
32Step 4
- Align the sequences with the best matches
- Uses BestFit Algorithm
- Aligns the joined diagonals from step 3 with the query sequence
- Score reported as opt
33FastA Summary
SequenceAlignment
34Specifying the Word Size
- 1 to 6 for nt
- 1 or 2 for aa
- Smaller words
- Increased sensitivity
- Decreased stringency
- Higher backgrounds
- Increases cpu time
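The trade-off above can be made concrete with a rough count of chance word matches. The sketch assumes a uniform, independent random-sequence model, which real sequences do not follow, and the function name is hypothetical. It is meant only to show why smaller words give higher backgrounds and more work for the later steps.

```python
def expected_chance_word_hits(query_len, target_len, word_size, alphabet_size):
    """Expected number of identical word pairs between two random sequences."""
    query_words = query_len - word_size + 1
    target_words = target_len - word_size + 1
    return query_words * target_words / alphabet_size ** word_size


if __name__ == "__main__":
    # Protein query of 461 residues against a 400-residue database entry.
    for k in (1, 2):
        print("protein word size", k,
              round(expected_chance_word_hits(461, 400, k, 20), 1))
    # Nucleotide query of 1383 bases against a 5000-base database entry.
    for k in (2, 4, 6):
        print("nucleotide word size", k,
              round(expected_chance_word_hits(1383, 5000, k, 4), 1))
```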
35Output
- Histogram
- Shows number of sequences falling in a particular score range
- Sequence names with scores
- Alignment of sequences to the query
36Features
- More Sensitive than BLAST (?)
- Slower than BLAST
37analyze fasta -check
FastA does a Pearson and Lipman search for similarity between a query sequence and a group of sequences of the same type (nucleic acid or protein). For nucleotide searches, FastA may be more sensitive than BLAST.
Minimal Syntax: fasta -INfile1=ggamma.pep -Default
Prompted Parameters:
-INfile2=pir             specifies the search set
-OUTfile=ggamma.fasta    specifies the output file name
-BEGin=1 -END=148        sets the range of interest
-WORdsize=2              sets the word size
-EXPect=2.0              lists scores until E() value reaches 2.0
Local Data Files:
-MATRix=fastadna.cmp     assigns the scoring matrix for nucleic acids
-MATRix=blosum50.cmp     assigns the scoring matrix for proteins
38Optional Parameters:
-PROCessors=2     sets the number of threads devoted to the analysis on a multiprocessor computer
-MINLength=1000   searches only sequences of 1000 or more residues
-MAXLength=5000   searches only sequences of 5000 or fewer residues
-SINce=6.90       limits search to sequences dated on or after June 1990
-ONEstrand        searches using only the top strand of nucleotide queries
-PAMfactor        uses scoring matrix to calculate initial diagonal scores
-GAPweight=16     sets gap creation penalty (12 is protein default)
-LENgthweight=4   sets gap extension penalty (2 is protein default)
-OPTall=20        computes opt score when the initn score is 20 or higher; sorts on opt score
-NOOPTall         doesn't compute opt score during search; sorts on initn
-SWalign          creates final alignment as unlimited Smith-Waterman for nuc
-LIStsize=40      shows the best 40 scores (overrides EXPect)
-ALIgn=20         shows the best 20 alignments
-NOALIgn          suppresses sequence alignments
-SHOWall          shows complete sequences in alignment, not just overlaps
-MARKx=3          sets the alignment display mode
-NOHIStogram      suppresses printing the histogram
-LINesize=60      sets number of sequence symbols per line of the alignment
-NODOCLines       suppresses sequence documentation in the alignment
-BATch            submits the program to run in the batch queue
-NOMONitor        suppresses the screen trace for each search set sequence
Add what to the command line ?
39 FASTA with what query sequence ? pol.pep
Begin ( 1 ) ?
End ( 461 ) ?
Search for query in what sequence(s) ( SwissProt ) ?
What word size ( 2 ) ?
Don't show scores whose E() value exceeds ( 10.0 )
What should I call the output file ( pol.fasta ) ?
fasta will run as a batch or at job.
fasta was submitted using the command " atnow "
job class000.894911721.a at Mon May 11 13:35:21 1998
40!!SEQUENCE_LIST 1.0
(Nucleotide) FASTA of pol.seq from 1 to 1383   May 11, 1998 12:44
TO GenEMBL   Sequences 436,425   Symbols 769,709,871   Word Size 6
Sequences too short to analyze 25 (113 symbols)
Databases searched
GenBank, Release 105.0, Released on 15Feb1998, Formatted on 19Feb1998
EMBL, Release 53.0, Released on 16Dec1997, Formatted on 20Feb1998
Searching with both strands of the query.
Scoring matrix GenRunDatafastadna.cmp
Constant pamfactor used
Gap creation penalty 16   Gap extension penalty 4
Histogram Key
Each histogram symbol represents 1771 search set sequences
Each inset symbol represents 16 search set sequences
z-scores computed from opt scores
41
z-score      obs      exp
< 20         216        0
22            57        0
24            75        1
26           114       18
28           100      197
30           433     1195
32          1343     4619
34          6372    12526
36         17948    25726
38         37141    42515
40         65424    59305
42         82308    72494
42
z-score      obs      exp
84          2119     1599
86          1556     1237
88          1236      957
90          1075      741
92           755      573
94           635      443
96           442      343
98           395      266
100          276      205
102          218      159
104          164      123
106          121       95
108          105       74
110           67       57
112           61       44
114           57       34
116           73       26
118           50       20
> 120        353       16
43
z-score      obs      exp
< 20         216        0
22            57        0
24            75        1
26           114       18
28           100      197
30           433     1195
32          1343     4619
34          6372    12526
36         17948    25726
38         37141    42515
40         65424    59305
42         82308    72494
44        106229    79967
46         85717    81448
48         80192    77978
50         64551    71155
52         54078    62557
54         48895    53435
56         37786    44634
58         33037    36644
60         26505    29684
62         22308    23798
64         18406    18926
66         15179    14959
68         13671    11766
70         10703     9221
72          7870     7205
74          6143     5618
76          5397     4372
78          3856     3399
80          3153     2639
82          2636     2019
84          2119     1599
86          1556     1237
88          1236      957
90          1075      741
92           755      573
94           635      443
96           442      343
98           395      266
100          276      205
102          218      159
104          164      123
106          121       95
108          105       74
110           67       57
112           61       44
114           57       34
116           73       26
118           50       20
> 120        353       16
44 Results sorted and z-values calculated from opt score
1614 scores saved that exceeded 99
471579 optimizations performed
Joining threshold 62, optimization threshold 47, opt. width 16
The best scores are (columns after the description: init1 initn opt z-sc E(867084)) ..
GB_VIPOL1 Begin 5987 End 7369 ! J02281 Poliovirus type 1 (Mahoney s...   6915 6915 6915 6938.1 0
GB_VIPOLIO1B Begin 5987 End 7369 ! V01149 Genome of human poliovirus t...   6915 6915 6915 6938.1 0
GB_VIPOL1B31B Begin 889 End 2271 ! M17494 Poliovirus type 1 (Mahoney) ...   6879 6879 6879 6908.0 0
GB_VIPOLIO1A Begin 5979 End 7361 ! V01148 Genome of human poliovirus t...   6879 6879 6879 6901.9 0
GB_VIPOLIOS1 Begin 5987 End 7369 ! V01150 Genome of human poliovirus, ...   6825 6825 6825 6847.6 0
GB_PATI00480 Begin 1 End 1227 ! I00480 Sequence 9 from Patent US 47...   6135 6135 6135 6162.7 0
GB_VIPIPOLS2 Begin 5986 End 7368 ! X00595 Poliovirus type 2 genome (st...   5275 5275 5340 5353.8 0
GB_VICXA24CG Begin 6010 End 7392 ! D90457 Coxsackievirus A24, complete...   5142 5142 5142 5154.6 0
GB_PATI22065 Begin 5978 End 7360 ! I22065 Sequence 1 from patent US 55...   5124 5124 5124 5136.5 0
GB_VIPIPO3119 Begin 5978 End 7360 ! X01076 Poliovirus type 3 complete s...   5115 5115 5115 5127.5 0
GB_VIPOL3L37 Begin 5978 End 7360 ! K01392 Poliovirus P3/Leon/37 (type ...   5115 5115 5115 5127.5 0
45FastA Results (nt search)Polio Polymerase vs
GenEMBL
46FastA Results (protein search)Polio Polymerase
vs SwissProt word2
47FastA Results (protein search)Polio Polymerase
vs SwissProt word1
48TFastA
- Translates the nucleotide sequence database in all 6 reading frames
- Search the translated sequences with a peptide query
- Algorithm is the same as for FastA
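The six-frame translation that TFastA applies to each database entry can be sketched as follows. This is an illustration only: it assumes the standard genetic code, writes stop codons as * and any codon containing an ambiguous base as X, and the function names are hypothetical.

```python
from itertools import product

# Standard genetic code, with codons ordered T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna, frame=0):
    """Translate one reading frame (0, 1 or 2) of a DNA string."""
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translation(dna):
    """Return three forward frames followed by three reverse-strand frames."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]
    return [translate(strand, frame)
            for strand in (dna, reverse)
            for frame in range(3)]


if __name__ == "__main__":
    for number, peptide in enumerate(six_frame_translation("ATGGCCTAA"), 1):
        print(number, peptide)
```

Each of the six peptide strings is then compared with the peptide query, which is why a translated search does roughly six times the work of a search against a protein database of the same size.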
49analyze tfasta -check
TFastA does a Pearson and Lipman search for similarity between a query peptide sequence and any group of nucleotide sequences. TFastA translates the nucleotide sequences in all six reading frames before performing the comparison. It is designed to answer the question, "What implied peptide sequences in a nucleotide sequence database are similar to my peptide sequence?"
Minimal Syntax: tfasta -INfile1=ggamma.pep -Default
Prompted Parameters:
-INfile2=GenEMBL          search set (all of GenEMBL)
-OUTfile=ggamma.tfasta    output file name
-BEGin=1 -END=148         range of interest
-WORdsize=2               word size
-EXPect=2.0               lists scores until E() value reaches 2.0
Local Data Files:
-MATRix=blosum50.cmp      scoring matrix for peptides
50Optional Parameters:
-GAPweight=16     gap creation penalty
-LENgthweight=4   gap extension penalty
-SINce=6.90       limits search to sequences dated on or after June 1990
-THREEFrames      translates and searches only the three forward reading frames
-FRAme=1          translates and searches only the frame specified.
-NOPAMfactor      uses a constant factor to calculate initial diagonal scores
-LIStsize=40      shows the best 40 scores (overrides EXPect)
-NOATTRibutes     suppresses writing the Begin, End, and Strand list attributes to the list of best scores
-ALIgn=20         shows the best 20 alignments
-NOALIgn          suppresses sequence alignments
-OPTall=20        immediately computes opt score when the initn score is 20 or higher; sorts on opt score
-NOOPTall         doesn't compute opt score during search; sorts on initn
-SWalign          does final alignment as Smith-Waterman
-SHOWall          shows complete sequences in alignment, not just overlaps
-MARKx=3          determines the alignment display mode
-NOHIStogram      suppresses printing the histogram
-LINEsize=60      number of sequence symbols per line of the alignment
-NODOCLines       suppresses sequence documentation in the alignment
-NOMONitor        suppresses the screen trace for each search set sequence
-BATch            submits the program to run in the batch queue
-MINLength=1000   searches only sequences of 1000 or more residues
-MAXLength=5000   searches only sequences of 5000 or fewer residues
Add what to the command line ?
51 TFASTA with what query sequence ? pol.pep
Begin ( 1 ) ?
End ( 461 ) ?
Search for query in what sequence(s) ( GenEMBL ) ?
What word size ( 2 ) ?
Don't show scores whose E() value exceeds ( 10.0 )
What should I call the output file ( pol.tfasta ) ?
tfasta will run as a batch or at job.
tfasta was submitted using the command " atnow "
job class000.894911765.a at Mon May 11 13:36:05 1998
52TFastA ResultsPolio polymerase vs. GenEMBL
53BLAST
- Basic Local Alignment Search Tool
- Altschul, Gish, Miller, Myers, and Lipman
- NCBI
- J. Mol. Biol. 215:403 (1990)
54BLAST Searches
- Locate regions of similarity between a query sequence and database sequences
- High Scoring Segment Pair
- Starts with a word search comparison
- Current versions will introduce gaps as necessary
- Will find multiple regions of similarities between the query and any one database sequence
- Provides statistical data for similarity significance
55BLAST Options
- BLAST
- Uses local, GCG-supplied databases
- Uses your own BLAST-formatted databases
- Format with GCGToBLAST
- NetBLAST
- Uses NCBI's BLAST Server
- Uses most recent version of Genbank
- Updated daily
56NetBLAST
- NetBLAST automatically submits the sequence to NCBI's BLAST network server
- Results are returned to a file in your directory
57Flavors of BLAST
- BLASTN
- nt query vs. nt database
- BLASTP
- protein query vs. protein database
- BLASTX
- nt query vs. protein database
- nt query translated in all six frames
- TBLASTN
- protein query vs. translated nt database
- TBLASTX
- translated nt query vs. translated nt database
58analyze gcgff
analyze blast pol.pep
BLAST searches one or more nucleic acid or protein databases for sequences similar to one or more query sequences of any type. BLAST can produce gapped alignments for the matches it finds.
Begin ( 1 ) ?
End ( 461 ) ?
ERROR no databases found!
analyze
59analyze blast -batch -check pol.pep
BLAST searches one or more nucleic acid or protein databases for sequences similar to one or more query sequences of any type. BLAST can produce gapped alignments for the matches it finds.
Minimal Syntax: blast -INfile1=pirmywhp -Default
Prompted Parameters:
-INfile2=pir             specifies database(s) to search
-EXPect=10.0             ignores scores that would occur by chance more than 10 times
-LIStsize=500            sets maximum number of sequences listed in the output
-OUTfile=mywhp.blastp    names the output file
Local Data Files:
-DATa2=blast.ldbs        names the list of available local databases
-DATa3=blast.sdbs        names the list of available site-specific databases
60Optional Parameters:
-PROCessors=1         sets the number of processors to use
-TBLASTX              if query and database are both nucleotide, translates both and does protein comparisons
-DBNucleotideonly     searches only nucleic databases
-DBProteinonly        searches only protein databases
-WORdsize=0           sets word size (0 selects program default)
-MATch=1              sets nucleotide match reward
-MISmatch=-3          sets nucleotide mismatch penalty
-MATRix=BLOSUM62      assigns the scoring matrix for proteins
-GAPweight=0          sets gap creation penalty
-LENgthweight=0       sets gap extension penalty
-HITEXTTHRESHold=0    sets minimum score to extend hits
-NOFILter             suppresses filtering of low complexity segments out of nucleotide and protein query sequences
-TRANSlate=1          names genetic code for translating query
-DBTRANSlate=1        names genetic code for translating database
-EFFdbsize=0          sets effective database size (0 selects program default)
-NOFRAgments          suppresses showing list file entries as fragments
-ALIgnments=250       sets number of sequences for which to show alignments
-VIEW=0               selects alignment view type (0-6 allowed)
-NOGAPS               suppresses gapped alignments
-XDRopoff=0           sets X dropoff value for gapped alignments
-NATive               produces unmodified BLAST2 output
-APPend="string"      appends "string" to pass-through command line
-BATch                submits program to batch queue
61 Add what to the command line ?
Begin ( 1 ) ?
End ( 461 ) ?
Search for query in what sequence database
1) GCGPROT  p  GCG SeqStore Protein Database
2) GCGNUC   n  GCG SeqStore Nucleotide Database
3) GCGEST   n  GCG SeqStore EST Database
Please choose one ( 1 )
Ignore hits expected to occur by chance more than ( 10.0 ) times?
Limit the number of sequences in my output to ( 500 ) ?
What should I call the output file ( pol.blastp ) ?
blast will run as a batch or at job.
blast was submitted using the command " atnow "
commands will be executed using /bin/csh
job 989254200.a at Mon May 7 11:50:00 2001
analyze
62Local BLAST Results
63analyze netblast -check pol.pep
NetBLAST searches for sequences similar to a query sequence. The query and the database searched can be either peptide or nucleic acid in any combination. NetBLAST can search only databases maintained at the National Center for Biotechnology Information (NCBI) in Bethesda, Maryland, USA.
Minimal Syntax: netblast -INfile1=pirzizm99 -Default
Prompted Parameters:
-INfile2=nr                  specifies database to search
-EXPect=10.0                 ignores scores that would occur by chance more than 10 times
-LIStsize=250                sets maximum number of sequences listed in the output
-OUTfile=zizm99.netblastp    names the output file
Local Data Files:
-DATa1=netblast.rdbs         names the list of available remote databases
-MATRix=blosum62             assigns a scoring matrix for proteins
64Optional Parameters:
-NOFILter                    suppresses filtering of low complexity regions out of nucleotide and protein query sequences
-GAPweight=11                sets gap creation penalty
-LENgthweight=1              sets gap extension penalty
-TBLASTX                     if query and database are both nucleotide, translates both and does protein comparisons
-TRANSlate=1                 names genetic code for translating query
-DBNucleotideonly            searches only nucleic databases
-DBProteinonly               searches only protein databases
-URL=www.ncbi.nlm.nih.gov/cgi-bin/BLAST/nph-blast_report   sends HTTP query to NCBI's net server (default)
-MAIL=blast_at_ncbi.nlm.nih.gov   sends email to NCBI's email server
-ALIgnments=100              sets number of sequences for which to show alignments
-NOGAPS                      produce ungapped alignments using sum statistics
-BATch                       submits program to batch queue
-PROXY="gateway.company.com99/"   specifies the host and port of a proxy server
-APPend="stringstring..."    appends each string, on a separate line, to the query (NCBI's email format)
Add what to the command line ?
65 Search for query in what sequence database
 1) nr         p  Non-redundant GenBank CDS translations+PDB+SwissProt+PIR
 2) pdb        p  PDB protein sequences
 3) swissprot  p  SwissProt sequences
 4) yeast      p  Saccharomyces cerevisiae protein sequences
 5) kabat      p  Kabat Sequences of Proteins of Immunological Interest
 6) alu        p  Translations of Select Alu Repeats from REPBASE
 7) month      p  All new or revised GenBank CDS translation+PDB+SwissProt+PI
 8) ecoli      p  E. coli genomic CDS translations
 9) nr         n  Non-redundant GenBank+EMBL+DDBJ+PDB sequences (but no EST's
10) pdb        n  PDB nucleotide sequences
11) vector     n  Vector subset of GenBank
12) yeast      n  Saccharomyces cerevisiae genomic nucleotide sequences
13) est        n  Non-redundant Database of GenBank+EMBL+DDBJ EST Division
14) sts        n  Non-redundant Database of GenBank+EMBL+DDBJ STS Division
15) htgs       n  High Throughput Genomic Sequences
16) mito       n  Database of mitochondrial sequences, Rel. 1.0, July 1995
17) kabat      n  Kabat Sequences of Nucleic Acid of Immunological Interest
18) epd        n  Eukaryotic Promotor Database
19) alu        n  Select Alu Repeats from REPBASE
20) month      n  All new or revised GenBank+EMBL+DDBJ+PDB sequences released
21) gss        n  Genome Survey Sequence, includes single_pass genomic data,
22) ecoli      n  E. coli genomic nucleotide sequences.
Please choose one ( 1 )
66 Ignore hits expected to occur by chance more than ( 10.0 ) times?
Limit the number of sequences in my output to ( 250 ) ?
What should I call the output file ( pol.netblastp ) ?
Sending query... Awaiting results... Done.
Wrote search results to pol.netblastp
analyze
67NetBLAST ResultsPolio polymerase vs. Genbank nr
68Web-based BLAST Searches
- http://www.ncbi.nlm.nih.gov/BLAST/
- Gapped BLAST with graphic summary
- PSI-BLAST
- Gapped BLAST followed by BLAST using a position-specific scoring matrix
- More sensitive
- Repeat as many times as desired
69Running NCBI Blast
- http://www.ncbi.nlm.nih.gov/blast
>POL ID POLG_POL1M STANDARD PRT 2206 AA.
GEIPWMRPSKDAGYPIINAPSKTKLEPSAFHYVFEGVKEPAVLTKNDPRLKTDFEEAIFS
KYVGNKITEVDEYMKEAVDHYAGQLMSLDINIEQMCLEDAMYGTDGLEALDLSTSAGYPY
VAMGKKKRDILNKQTRDTKEMQKLLDTYGINLPLVTYVKDELRSKTKVEQGKSRLIEASS
LNDSVAMRMAFGNLYAAFHKNPGVITGSAVGCDPDLFWSKIPVLMEEKLFAFDYTGYDAS
LSPAWFEALKMVLEKIGFGDRVDYIDYLNHSHHLYKNKTYCVKGGMPSGCSGTSIFNSMI
NNLIIRTLLLKTYKGIDLDHLKMIAYGDDVIASYPHEVDASLLAQSGKDYGLTMTPADKS
ATFETVTWENVTFLKRFFRADEKYPFLIHPVMPMKEIHESIRWTKDPRNTQDHVRSLCLL
AWHNGEEEYNKFLAKIRSVPIGRALLLPEYSTLYRRWLDSF
70SSearch
- Very sensitive database searching
71SSearch
- Rigorous Smith-Waterman search for similarity between a query sequence and a group of sequences of the same type
- This may be the most sensitive method available for similarity searches
- VERY slow!
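The cost of a rigorous search comes from filling a complete dynamic-programming table for every database sequence. The Python sketch below shows the Smith-Waterman recurrence in its simplest form. It is not the SSearch implementation: it assumes a flat match/mismatch score in place of a scoring matrix such as BLOSUM50, a single linear gap penalty in place of separate gap creation and extension penalties, and it returns only the best local score, not the alignment.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between strings a and b."""
    previous = [0] * (len(b) + 1)   # row i-1 of the score table
    best = 0
    for i in range(1, len(a) + 1):
        current = [0]               # column 0 is always 0
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = previous[j] + gap          # gap in b
            left = current[j - 1] + gap     # gap in a
            # A local alignment may restart anywhere, so scores never drop below 0.
            current.append(max(0, diagonal, up, left))
            best = max(best, current[j])
        previous = current
    return best


if __name__ == "__main__":
    print(smith_waterman_score("YGDD", "IDLDKLKIIAYGDDVIFSYK"))
```

Every cell of the table is computed for every query and database sequence pair, so the work grows with the product of the query length and the total database length. The word-based heuristics in FastA and BLAST avoid most of that work.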
72analyze ssearch -batch -check pol.pep
SSearch does a rigorous Smith-Waterman search for similarity between a query sequence and a group of sequences of the same type (nucleic acid or protein). This may be the most sensitive method available for similarity searches. Compared to BLAST and FastA, it can be very slow.
Minimal Syntax: ssearch -INfile1=ggamma.pep -Default
Prompted Parameters:
-INfile2=pir               specifies the search set
-OUTfile=ggamma.ssearch    names the output file
-BEGin=1 -END=148          sets the range of interest
-EXPect=2.0                lists scores until E() value reaches 2.0
Local Data Files:
-MATRix=fastadna.cmp       assigns the scoring matrix for nucleic acids
-MATRix=blosum50.cmp       assigns the scoring matrix for proteins
73Optional Parameters:
-PROCessors=2     sets the number of threads devoted to the analysis on a multiprocessor computer
-MINLength=1000   searches only sequences of 1000 or more residues
-MAXLength=5000   searches only sequences of 5000 or fewer residues
-SINce=6.90       limits search to sequences dated on or after June 1990
-ONEstrand        searches using only the top strand of nucleotide queries
-GAPweight=16     sets the gap creation penalty (12 is protein default)
-LENgthweight=4   sets the gap extension penalty (2 is protein default)
-LIStsize=40      shows the best 40 scores (overrides EXPect)
-ALIgn=20         shows the best 20 alignments
-NOALIgn          suppresses sequence alignments
-SHOWall          shows complete sequences in alignment, not just overlaps
-MARKx=3          sets the alignment display mode
-NOHIStogram      suppresses printing the histogram
-LINesize=60      sets number of sequence symbols per line of the alignment
-NODOCLines       suppresses sequence documentation in the alignment
-BATch            submits the program to run in the batch queue
-NOMONitor        suppresses the screen trace for each search set sequence
74 Add what to the command line ?
Begin ( 1 ) ?
End ( 461 ) ?
Search for query in what sequence(s) ( PIR ) ? gcgprot
Don't show scores whose E() value exceeds ( 10.000000 )
Maximum number of alignments ( 40 ) ?
What should I call the output file ( pol.ssearch ) ?
ssearch will run as a batch or at job.
ssearch was submitted using the command " atnow "
commands will be executed using /bin/csh
job 989256600.a at Mon May 7 12:30:00 2001
analyze
75SSearch Output
76FrameSearch
- Optimal alignments including reading frame shifts
77FrameSearch
- Finds similarities between a protein sequence and a nucleotide sequence database
- Finds similarities between a nucleotide sequence and a protein sequence database
- Aligns amino acids to nucleotide codons
- Allows for frameshifts in the nucleotide sequence(s)
78Running FrameSearch
- Takes a LONG LONG time to run
- Run in Batch mode
- Limit the size of the database
81analyze framesearch -check -batch
FrameSearch searches a group of protein sequences for similarity to one or more nucleotide query sequences, or searches a group of nucleotide sequences for similarity to one or more protein query sequences. For each sequence comparison, the program finds an optimal alignment between the protein sequence and all possible codons on each strand of the nucleotide sequence. Optimal alignments may include reading frame shifts.
Minimal Syntax: framesearch -INfile1=ESTAtts0012 -Default
Prompted Parameters:
-BEGin1=1 -END1=286              range of interest for a single query sequence
-INfile2=SwissProt               search set
-GAPweight=12                    gap creation penalty
-LENgthweight=4                  gap extension penalty
-FRAmeweight=0                   frameshift gap penalty
-OUTfile=atts0012.framesearch    output file name
Local Data Files:
-MATRix=blosum62.cmp             amino acid substitution matrix
-TRANSlate=translate.txt         contains the genetic code
82Optional Parameters:
-BEGin1=1 -END1=100   range of interest for each query sequence
-ONEstrand            searches only the top strand of nucleotide seqs
-LIStsize=40          number of scores to show
-ALIgn=40             number of alignments to show (-NOALIgn suppresses alignments)
-GLObal               searches by global alignment
-ENDWeight            penalizes end gaps in global alignments like other gaps
-HIGhroad             among equally optimal alignments, shows one with maximum gaps in protein sequence
-LOWroad              among equally optimal alignments, shows one with maximum gaps in nucleotide sequence
-LINesize=70          length of documentation for each sequence in the output list
-PAIr=x,2,1           thresholds for displaying '|', ':', and '.'
-WIDth=50             the number of sequence symbols per line
-PAGe=60              adds a line with a form feed every 60 lines
-NOBIGGaps            suppresses abbreviation of large gaps with '.'s
-NOPLOt               suppresses the plot of the search score distribution
-BATch                submits program to the batch queue
-NOMonitor            suppresses the screen trace of program progress
-NOSUMmary            suppresses the screen summary
83 FRAMESEARCH with what query sequence(s) ? uu001a.seq
Begin ( 1 ) ?
End ( 1371 ) ?
Search for query in what sequence(s) ( SwissProt ) ?
I read your local translation table "translate.txt"
What is the gap creation penalty ( 12 ) ?
What is the gap extension penalty ( 4 ) ?
What is the frameshift penalty ( 0 ) ?
What should I call the output file ( uu001a.framesearch ) ?
framesearch will run as a batch or at job.
framesearch was submitted using the command " atnow "
commands will be executed using /bin/csh
job 894913723.a at Mon May 11 14:08:43 1998
84!!SEQUENCE_LIST 1.0
FRAMESEARCH of /export/home/lefkowit/temp/uu001a.seq UU001
TO swdnaa_   Sequences 31   Total-length 13,393   May 12, 1998 11:08
Databases searched
SWISS-PROT, Release 35.0, Released on 13Dec97, Formatted on 13Dec1997
Scoring matrix GenRunDatablosum62.cmp
Translation table translate.txt
Gap creation penalty 12
Gap extension penalty 4
Frameshift penalty 0
The best scores are ..
SWDNAA_MYCCA P24116 mycoplasma capricolum. chromosomal replicatio...   346
SWDNAA_MYCGE P35888 mycoplasma genitalium. chromosomal replicatio...   316
SWDNAA_SPICI P34028 spiroplasma citri. chromosomal replication in...   308
SWDNAA_BORBU P33768 borrelia burgdorferi (lyme disease spirochete...   297
SWDNAA_MYCPN Q59549 mycoplasma pneumoniae. chromosomal replicatio...   275
SWDNAA_MYCMY P35889 mycoplasma mycoides. chromosomal replication ...   264
85uu001a.seq DNAA_MYCCA
Quality 346   Length 882   Ratio 1.197   Gaps 5
Percent Similarity 44.406   Percent Identity 33.916
(the match-symbol lines between each nucleotide and protein line did not survive in this transcript)
438 AACCCTTTATTTTTATTTGGTAAAGTTGGTGTTGGTAAAACGCATATCGT 487
142 AsnProLeuPheIleTyrGlyGluSerGlyMetGlyLysThrHisLeuLe 158
488 GGCTGCTGCTGGTAATCGTTTTGCTAATAGTAA.TCCTAATTTAAAATTT 536
159 uLysAlaAlaLysAsnTyrIleGluSerAsnPheSerAspLeuLysValS 175
537 ATTATTATGAAGGGCAAGATTTTTTTCGAAAGTTTTGTTCTGCTTCGTTA 586
176 erTyrMetSerGlyAspGluPheAlaArgLysAlaValAspIleLeuGln 191
587 AAAGGGACTAGTTATGTTGAAGAGTTTAAAAAAGAAATTGCTTCAGCAGA 636
192 LysThrHisLysGluIleGluGlnPheLysAsnGluValCysGlnAsnAs 208
637 TTTATTAATTTTTGAAGATATTCAAAATATCCAATCACGTGATTCAACGG 686
209 pValLeuIleIleAspAspValGlnPheLeuSerTyrLysGluLysThrA 225
687 CTGAATTGTTTTTTAATATCTTTAATGATATAAAATTAAATGGTGGAAAA 736
226 snGluIlePhePheThrIlePheAsnAsnPheIleGluAsnAspLysGln 241
86FrameAlign
- Align a protein sequence to the codons in all possible reading frames of a nucleotide sequence
- Allows for frameshifts
- Local or Global alignment
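To make "all possible reading frames" concrete, here is a minimal Python sketch that translates a nucleotide sequence in the three forward and three reverse frames. It is not how FrameAlign is implemented; it only illustrates the frames a protein could be aligned against. The function names are hypothetical. The codon table is the standard genetic code, with an optional Mycoplasma/Spiroplasma variant in which TGA codes for Trp.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
STANDARD = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), STANDARD)}

# Mycoplasma/Spiroplasma variant: TGA is read as Trp (W) instead of stop.
MYCOPLASMA_TABLE = dict(CODON_TABLE, TGA="W")

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.upper().translate(COMPLEMENT)[::-1]

def translate(dna, table=CODON_TABLE):
    """Translate complete codons from position 0; unknown codons become X."""
    dna = dna.upper()
    return "".join(table.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))

def six_frames(dna, table=CODON_TABLE):
    """Translate both strands at codon offsets 0, 1 and 2."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:], table)
    return frames

# First 21 bases of the uu001a.seq example used in these slides.
for name, protein in six_frames("ATGGCTAATAATTATCAAACT", MYCOPLASMA_TABLE).items():
    print(name, protein)
```

A frameshift-tolerant aligner can be thought of as being allowed to switch between these frames in the middle of an alignment, at the cost of the frameshift penalty.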
87analyze framealign -check
FrameAlign creates an optimal alignment of the best segment of similarity (local alignment) between a protein sequence and the codons in all possible reading frames of a nucleotide sequence. Optimal alignments may include reading frame shifts.
Minimal Syntax
framealign -INfile1=EST:Atts0012 \
-INfile2=SW:G3pc_Arath -Default
Prompted Parameters
-BEGin1=1 -END1=286 range of interest for first sequence
-BEGin2=1 -END2=338 range of interest for second sequence
-REVerse strand for nucleotide sequence
-GAPweight=12 gap creation penalty
-LENgthweight=4 gap extension penalty
-FRAmeweight=0 frameshift gap penalty
-OUTfile1=gamma.pair output file for alignment
Local Data Files
-MATRix=blosum62.cmp amino acid substitution matrix
-TRANSlate=translate.txt contains the genetic code
88Optional Parameters
-GLObal creates global alignment (default is local)
-ENDWeight penalizes end gaps in global alignments like other gaps
-LIMit1=337 gap shift limit for nucleotide sequence
-LIMit2=285 gap shift limit for protein sequence
-HIGhroad among equally optimal alignments, shows one with maximum gaps in protein sequence
-LOWroad among equally optimal alignments, shows one with maximum gaps in nucleotide sequence
-PAIr=x,2,1 thresholds for displaying '|', ':', and '.'
-WIDth=50 the number of sequence symbols per line
-PAGe=60 adds a line with a form feed every 60 lines
-NOBIGGaps suppresses abbreviation of large gaps with '.'s
-OUTfile2=atts0012.gap new file for nucleotide sequence with gaps added
-OUTfile3=g3pc_arath.gap new file for protein sequence with gaps added
-BATch submits program to the batch queue
-NOMonitor suppresses the screen trace of program progress
-NOSUMmary suppresses the screen summary
Add what to the command line ?
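The gap creation and gap extension penalties combine into what is usually called an affine gap penalty. The short Python sketch below shows the arithmetic with the default values from these slides (12 and 4). It assumes the extension penalty is charged for every gap symbol, including the first; that convention is an assumption here and can differ between programs and versions. The function name is hypothetical.

```python
def gap_penalty(gap_length, creation=12, extension=4):
    """Affine gap cost: a one-time creation charge plus a per-symbol charge.

    Assumption: the extension penalty applies to every gap symbol,
    including the first one.
    """
    if gap_length <= 0:
        return 0
    return creation + extension * gap_length

# One long gap costs less than several short gaps of the same total length.
print(gap_penalty(6))      # a single gap of six symbols
print(3 * gap_penalty(2))  # three separate gaps of two symbols each
```

Because the creation charge is paid once per gap, this scheme favors a few long gaps over many scattered short ones.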
89 Local alignment of what sequence 1 ? uu001.pep
Begin ( 1 ) ?
End ( 457 ) ?
to what nucleotide sequence ? uu001a.seq
Begin ( 1 ) ?
End ( 1371 ) ?
Reverse ( No ) ?
I read your local translation table "translate.txt"
What is the gap creation penalty ( 12 ) ?
What is the gap extension penalty ( 4 ) ?
What is the frameshift penalty ( 0 ) ?
What should I call the paired output display file ( uu001.pair ) ? uu001.fran
Aligning ......................-.......................
Gaps 3
Quality 2285
Quality Ratio 5.011
Similarity 100.000
Length 1371
90 Local alignment of uu001a.seq check 9730 from 1 to 1371 UU001
to uu001.pep check 6522 from 1 to 457
Scoring matrix /export/home0/gcg/gcgcore/data/rundata/blosum62.cmp
Translation table /export/home/lefkowit/temp/translate.txt
This file contains the Mold, Protozoan, and Coelenterate Mitochondrial and the Mycoplasma/Spiroplasma Code translation table, specified in the Feature Definition, Version 1.08, formatted for use with GCG programs. It names amino acids in both one and three-letter form and lists the codons which should translate into . . .
Gap Weight 12 Average Match 2.912
Length Weight 4 Average Mismatch -2.003
Frameshift Weight 0
Quality 2285 Length 1371
Ratio 5.011 Gaps 3
Percent Similarity 100.000 Percent Identity 100.000
Match display thresholds for the alignment(s)
| = IDENTITY
: = 2
. = 1
91
1 ATGGCTAATAATTATCAAACTTTATATGATTCAGCAATAAAAAGGATTCC 50
1 MetAlaAsnAsnTyrGlnThrLeuTyrAspSerAlaIleLysArgIlePr 17

51 ATACGATCTTATTTCTGATCAAGCTTATGCAATTCTACAAAATGCTAAAA 100
18 oTyrAspLeuIleSerAspGlnAlaTyrAlaIleLeuGlnAsnAlaLysT 34

101 CTCATAAGTT.TGCGATGGTGTTTTATATATAATTGTAGCCAATGCCTTT 149
35 hrHisLysValCysAspGlyValLeuTyrIleIleValAlaAsnAlaPhe 50

150 GAAAAAAGTATTATTAACGGTAATTTTATTAACATTATTTCTAAATATCT 199
51 GluLysSerIleIleAsnGlyAsnPheIleAsnIleIleSerLysTyrLe 67

200 AAGCGAAGAATTCAAAAAGGAAAATATTGTTAATTTTGAATTTATTATAG 249
68 uSerGluGluPheLysLysGluAsnIleValAsnPheGluPheIleIleA 84

250 ACAATGAAAAATTATTAATTAATAGCAATTTTTTAATTAAAGAAACTAAT 299
85 spAsnGluLysLeuLeuIleAsnSerAsnPheLeuIleLysGluThrAsn 100
92Next
- Multiple Sequence Analysis