Title: de novo Analysis of Sequences
1de novo Analysis of Sequences
Slides by Jane Loveland and Dustin Schones
2Introduction
- A large number of sequence analysis tools are available on the web.
- Sequences submitted to public databases will probably be annotated and incorporated in Ensembl and NCBI databases.
HOWEVER
- Sequences may also be analysed and annotated manually.
- Some of the tools available are not used by Ensembl, NCBI etc. and may provide you with useful additional information.
3Lab work
de novo cDNA analysis
de novo genomic sequence analysis
de novo protein analysis
Confirms/disagrees with in silico predictions
Development of programs
Predictions
Sequence analysis tools
4 de novo analysis of sequence
- BLAST similarity searching
- BLAT rapid genome searching
- PSI-BLAST to pull out similar proteins
- ORF finder to highlight putative protein products of cDNA
- BLASTP link from ORF finder to investigate what these potential protein products might be
- SPIDEY to align cDNA to genomic DNA
- CLUSTALW to align similar sequences
- View and edit alignments in JALVIEW and GENEDOC to produce a coloured and shaded alignment
- InterProScan to search for protein domains
5sequence alignment
- sequence analysis ? sequence alignment
- what
- why
- similar sequence
- infer homology
- infer function
sequence → structure → function
6pairwise alignments
multiple sequence alignments
7Global vs. Local
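The difference between the two can be illustrated with a small dynamic-programming sketch. This is only an illustration: the function name align_score and the scoring values (match 1, mismatch -1, gap -2) are assumptions chosen for the example, not taken from the slides. A global alignment scores both sequences end to end, while a local alignment scores only the best-matching pair of substrings.

```python
def align_score(a, b, local=False, match=1, mismatch=-1, gap=-2):
    """Best alignment score of a and b by dynamic programming.

    local=False: global alignment (Needleman-Wunsch style), end to end.
    local=True:  local alignment (Smith-Waterman style), best substrings.
    Scoring values are illustrative assumptions.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # Global: leading gaps are penalised, so fill the borders.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if local:
                # Local: an alignment may restart anywhere, so never go below 0.
                cell = max(cell, 0)
                best = max(best, cell)
            score[i][j] = cell
    return best if local else score[-1][-1]


x = "TTTTACGTGGGG"
y = "CCCCACGTAAAA"
global_score = align_score(x, y)
local_score = align_score(x, y, local=True)  # only the shared ACGT contributes
```

For these two sequences the shared core scores well locally, while the global score is dragged down by the unrelated flanks.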
8BLAST
Basic Local Alignment Search Tool
- idea: find high-scoring local alignments between the query sequence and the target database
- assumption: true match alignments are very likely to contain within them very high-scoring matches
- heuristic theme: search quickly for homologous regions, then do slow, exact alignments
9BLAST family
10BLAST Steps
- For each word of length W in the query,
generate a list of all possible words
(neighborhood) with a score of at least threshold
T (determined by using the scoring matrix)
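A minimal sketch of this neighborhood step, under stated assumptions: the function names, the nucleotide alphabet and the toy scoring (match 5, mismatch -4) are hypothetical choices for illustration. Neighborhood words are mainly used in protein searches, where the score comes from a substitution matrix such as BLOSUM62; the simple scoring below stands in for that matrix.

```python
from itertools import product


def score_words(w1, w2, match=5, mismatch=-4):
    """Score two equal-length words position by position (toy scoring)."""
    return sum(match if x == y else mismatch for x, y in zip(w1, w2))


def neighborhood(query, W=3, T=6, alphabet="ACGT"):
    """Map each query offset to all words of length W scoring at least T
    against the query word that starts at that offset."""
    result = {}
    for i in range(len(query) - W + 1):
        qword = query[i:i + W]
        result[i] = [
            "".join(cand)
            for cand in product(alphabet, repeat=W)
            if score_words(qword, "".join(cand)) >= T
        ]
    return result


nb = neighborhood("ACGTA")
# With these toy values, each list holds the exact word plus every
# word that differs from it at a single position.
```

Raising T shrinks each list towards the exact word only; lowering it admits more distant words and makes the search more sensitive but slower.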
11Determine the locations of all common words
between the query and the database (word hits).
13BLAST Steps
- use dynamic programming to extend hits until the score drops by a value of X below the best score found so far
- expensive: about 90% of the run time
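The stopping rule can be sketched as follows. This is a simplification under stated assumptions: it extends a hit to the right only and without gaps, it assumes the seed word is an exact match, and the function name and scoring values (match 5, mismatch -4, X = 10) are hypothetical.

```python
def extend_right(query, subject, q_start, s_start, W, X=10, match=5, mismatch=-4):
    """Extend a word hit to the right without gaps.

    Stops once the running score has fallen more than X below the best
    score seen so far. Returns (best_score, length_of_best_extension).
    """
    score = best = W * match  # assumption: the seed word matches exactly
    best_len = W
    i = W
    while q_start + i < len(query) and s_start + i < len(subject):
        score += match if query[q_start + i] == subject[s_start + i] else mismatch
        i += 1
        if score > best:
            best, best_len = score, i
        elif best - score > X:
            break  # X-drop: give up on this extension
    return best, best_len


result = extend_right("ACGTACGTTTTT", "ACGTACGTCCCC", 0, 0, W=3)
```

In this example the first eight bases match, so the best extension has length 8; the mismatches that follow lower the score until the drop exceeds X and the extension stops.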
14Evaluates the statistical significance of
extended hits and reports only those above the
determined threshold.
16BLAST statistical evaluation
- for local, ungapped alignments
- m: size of query; n: size of database
- E: expected number of HSPs with score at least S
- p: probability of finding at least one HSP with score at least S
- good tutorial at http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html
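For local, ungapped alignments these quantities are related by the Karlin-Altschul formulas E = K m n e^(-lambda S) and p = 1 - e^(-E). The sketch below just evaluates them. The function names are hypothetical, and the default values of K and lambda are placeholders: the real values depend on the scoring system and are not given in the slides.

```python
import math


def expected_hsps(S, m, n, K=0.1, lam=0.3):
    """E = K * m * n * exp(-lam * S).

    K and lam depend on the scoring system; the defaults here are
    placeholder assumptions for illustration only.
    """
    return K * m * n * math.exp(-lam * S)


def p_at_least_one(E):
    """Probability of at least one HSP with score >= S, given E."""
    return 1.0 - math.exp(-E)


m, n = 300, 1_000_000  # hypothetical query and database sizes
e_low = expected_hsps(40, m, n)
e_high = expected_hsps(80, m, n)
```

E falls exponentially as S rises and grows in proportion to both m and n, which is why the same score is less significant against a larger database. For small E, p is approximately equal to E.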
17BLAT
- Good for aligning mRNA, ESTs to genome
- fast
- aligns whole mRNA, not just exons
- handles introns and splice-sites
18BLAT
- Steps for cDNA alignment
- 1 break cDNA into n-base chunks
- 2 use index to find regions in genome similar to each chunk of cDNA
- 3 detailed alignment between genome region and cDNA chunk
- 4 dynamic programming: stitch together detailed alignments of chunks into alignment of the whole cDNA
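19- genome: cacaattatcacgaccgc (K = 8-13 for a real genome)
- K-mers (non-overlapping) and their genome positions:
    cac aat tat cac gac cgc
    0   3   6   9   12  15
- cDNA: aattctcac
- 3-mers (overlapping) and their cDNA positions:
    aat att ttc tct ctc tca cac
    0   1   2   3   4   5   6
- hits (cDNA position, genome position, difference):
    aat 0,3 -3
    cac 6,0  6
    cac 6,9 -3
- clump: cacAATtatCACgaccgc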
example from Jim Kent
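The example above can be reproduced with a short sketch. The function names are hypothetical and this shows only the indexing and hit-finding idea with K = 3, not BLAT's actual implementation: the genome is indexed by non-overlapping K-mers, every overlapping K-mer of the cDNA is looked up, and hits that share the same diagonal (cDNA position minus genome position) are clumped together.

```python
from collections import defaultdict


def index_genome(genome, k):
    """Index the non-overlapping k-mers of the genome: k-mer -> positions."""
    index = defaultdict(list)
    for pos in range(0, len(genome) - k + 1, k):
        index[genome[pos:pos + k]].append(pos)
    return index


def find_hits(cdna, index, k):
    """Look up every overlapping k-mer of the cDNA in the index.

    Returns (cdna_pos, genome_pos, diagonal) tuples.
    """
    hits = []
    for q in range(len(cdna) - k + 1):
        for g in index.get(cdna[q:q + k], []):
            hits.append((q, g, q - g))
    return hits


def clump_by_diagonal(hits):
    """Group hits that lie on the same diagonal."""
    clumps = defaultdict(list)
    for q, g, diag in hits:
        clumps[diag].append((q, g))
    return dict(clumps)


genome = "cacaattatcacgaccgc"
cdna = "aattctcac"
hits = find_hits(cdna, index_genome(genome, 3), 3)
# hits == [(0, 3, -3), (6, 0, 6), (6, 9, -3)]
clumps = clump_by_diagonal(hits)
```

The two hits on diagonal -3 form the clump shown in capitals on the slide; the single hit on diagonal 6 is left on its own.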
20PSI-BLAST
Position Specific Iterated-BLAST
- database searches using position-specific scoring matrices are more powerful than searches using a single sequence
- STEPS
- collect all DB sequences that align with E-value < T
- align these to make a position-specific scoring matrix
- use the scoring matrix to search for new hits
- iterate
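A minimal sketch of the matrix-building and scoring steps, under stated assumptions: the aligned sequences have equal length and no gaps, the background frequencies are uniform, and a simple pseudocount is used. The function names and example sequences are hypothetical; PSI-BLAST itself uses sequence weighting and more careful pseudocounts, and the iteration loop is not shown.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned, alphabet=AMINO_ACIDS, pseudocount=1.0):
    """Log-odds score per column and residue from gap-free aligned sequences."""
    background = 1.0 / len(alphabet)  # assumption: uniform background
    pssm = []
    for col in zip(*aligned):
        total = len(col) + pseudocount * len(alphabet)
        pssm.append({
            aa: math.log2(((col.count(aa) + pseudocount) / total) / background)
            for aa in alphabet
        })
    return pssm


def score_window(pssm, seq):
    """Sum the position-specific scores of seq against the matrix."""
    return sum(col[aa] for col, aa in zip(pssm, seq))


def best_window(pssm, seq):
    """Return (score, offset) of the best-scoring window in seq."""
    width = len(pssm)
    return max(
        (score_window(pssm, seq[i:i + width]), i)
        for i in range(len(seq) - width + 1)
    )


aligned = ["HEAGW", "HEAGW", "HDAGW", "HEAGY"]  # hypothetical aligned hits
pssm = build_pssm(aligned)
score, offset = best_window(pssm, "MKLHEAGWTT")
```

Unlike a single query sequence, the matrix rewards residues seen at each conserved position and tolerates variation where the alignment varies, which is what lets later iterations pick up more distant relatives.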
21PSI-Blast
22ORF-finder
- graphical analysis tool which finds all open reading frames in a sequence
- looks for start and stop codons
- assumes upstream start and downstream stop if ORF is at least 100 amino acids
- ORFs can be selected to view as DNA sequence or amino acid sequence
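The scan for start and stop codons can be sketched as below. This is not how ORF finder itself is implemented: the function names are hypothetical, the sketch requires an explicit ATG start and an in-frame stop, and it scans the three frames of each strand. The min_codons parameter plays the role of the 100 amino acid threshold.

```python
STOPS = {"TAA", "TAG", "TGA"}


def revcomp(seq):
    """Reverse complement of an upper-case DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_orfs(seq, min_codons=100):
    """Return (strand, frame, start, end) for ATG...stop ORFs.

    start and end are 0-based offsets on the strand being read; end is
    exclusive and includes the stop codon. min_codons counts the codons
    before the stop.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", revcomp(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first in-frame start codon
                elif start is not None and codon in STOPS:
                    if (i - start) // 3 >= min_codons:
                        orfs.append((strand, frame, start, i + 3))
                    start = None
    return orfs


# Short made-up sequence, so the threshold is lowered for the example.
toy = "CC" + "ATG" + "GCT" * 5 + "TAA" + "GG"
orfs = find_orfs(toy, min_codons=5)
```

With the default threshold of 100 codons the toy sequence yields no ORFs, which shows how the length cut-off filters out short, probably spurious, reading frames.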
23BLAST (Basic Local Alignment Search Tool) is a
set of similarity search programs designed to
explore all of the available sequence databases
regardless of whether the query is protein or
DNA. The BLAST programs have been designed for
speed, with a minimal sacrifice of sensitivity to
distant sequence relationships. The concept of
BLAST is shown below
Database of sequences: seq1, seq2, seq3, ..., seq11, etc.
Sequence of interest
Query   133 agcagccgtttcgactttggcattcggtaccgg
Subject 232 agcagccgtttcgactttggcattcggtaccgg
BLAST queries can be run against publicly available databases, within defined data sets, or from the command line against user-defined sets of sequences.
24PSI-BLAST Position Specific Iterative Blast
- Pulls out similar proteins, creates an alignment of these proteins, and then produces a Position Specific Scoring Matrix (PSSM)
- Blast parameters
- A BLAST search is then performed again, looking
for proteins which are similar at highly
conserved regions of the PSSM.
- Several iterations can be performed.
25Graphical analysis tool which finds all ORFs in a sequence
Looks for start and stop codons
Assumes upstream start and downstream stop when ORF is 100 aa or over
Graphical overview of all ORFs, which can be selected to view the DNA sequence and the corresponding aa sequence
Integral Blastp
26- Aligns a set of spliced nucleotide sequences
(ESTs, cDNAs or mRNAs) to an unspliced genomic
DNA sequence, inserting introns of arbitrary
length when needed.
- Aligns sequences with high-stringency BLAST against the genomic sequence
- Sorts the BLAST output by score and then uses splice matrices to assign intron/exon boundaries
- Outputs a list of the exons and introns it has found, the alignments and a protein translation for each exon
- May be used for interspecies alignments by selecting divergent species
27Clustalw DNA and Protein alignments
Copy and paste sequences
Alignment may be viewed and edited in Jalview
29GeneDoc Alignment Viewer and Editor
30- Integrated documentation resource for protein domains, families and sites
- Integrated view of databases
- Intuitive interface for text and sequence searches
31Summary de novo analysis of sequence
- Similarity searching: BLAST, BLAT and PSI-BLAST
- Find possible ORFs: ORF finder, BLASTP
- Align cDNA to genomic DNA: Spidey and BLAT
- Align similar sequences: Clustalw, Jalview, GeneDoc
- Find protein domains: InterProScan