Title: Sequence Analysis Tools Introduction
1 Sequence Analysis Tools: Introduction
2 Bioinformatics
A definition?
The design, construction and use of software tools to generate, store, annotate, access and analyse data and information relating to Molecular Biology.
Here we consider the use of Bioinformatics tools rather than their design and construction.
Here we consider the access and analysis of data and information items rather than their generation, storage or annotation.
3 Introduction to sequence analysis tools
Different analysis tools
- Standard Unix tools (e.g., the grep family, sed, awk, and cut).
- Publicly available tools (e.g., BLAST, the EMBOSS package, DNASTAR, Vector NTI).
- Open source libraries (e.g., BioPerl, BioJava, BioPython, BioRuby).
- Custom tools.
4 Sequence Analysis: an Overview
Sequencing Project Management
Database Retrieval
Restriction Mapping
Primer Design
DNA/RNA Folding
Nucleic Acid Sequence Analysis
Database Retrieval
Seeking Coding regions
Database Similarity Searching
Translation to amino acids
Pairwise Sequence Comparison
Multiple Sequence Alignment
Protein Sequence analysis
Prediction of Function
Structure prediction
Phylogeny
Motifs and Patterns
Structure analysis
5 The field of sequence analysis
- Many software tools are used in sequence analysis, which is one of the many subfields of computational molecular biology. The field of sequence analysis includes:
- Primer Design
- Pattern and motif searching
- Sequence comparison
- Multiple sequence alignment
- Sequence composition determination
- Secondary structure and 3D prediction
- Phylogenetic analysis
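As an illustration of the simplest item on this list, sequence composition determination, here is a minimal Python sketch. The function names and the example sequence are made up for illustration and do not come from any of the packages named in these slides.

```python
from collections import Counter


def base_composition(sequence: str) -> dict[str, int]:
    """Count each residue in a sequence, ignoring case and whitespace."""
    cleaned = "".join(sequence.split()).upper()
    return dict(Counter(cleaned))


def gc_content(sequence: str) -> float:
    """Return the fraction of G and C among unambiguous A/C/G/T bases."""
    counts = base_composition(sequence)
    gc = counts.get("G", 0) + counts.get("C", 0)
    acgt = gc + counts.get("A", 0) + counts.get("T", 0)
    return gc / acgt if acgt else 0.0


if __name__ == "__main__":
    example = "ATGCGCTA"  # made-up sequence, for illustration only
    print(base_composition(example))
    print(f"GC content: {gc_content(example):.2f}")
```

Ambiguity codes such as N are counted by `base_composition` but left out of the GC fraction; that is one possible convention, not the only one.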
7 Software Tools for Sequence Analysis
WWW Resources
Database Retrieval
Sequence Retrieval System (SRS)
Retrieves MUCH more than sequences
Core elements free to academic sites
Implemented in many places
It is possible to integrate analysis tools
Elements of SRS are incorporated into EMBOSS
8 Software Tools for Sequence Analysis
WWW Resources
Database Retrieval
Retrieves MUCH more than sequences
Access to NCBI databases only
Entrez client software available by anonymous ftp
Most general packages include tools to access
local sequence databases
EMBOSS programs can access sequences from remote
SRS servers
9 Readseq
- Readseq is a classic sequence format conversion tool.
- 1989. Developed by Don Gilbert.
- Function: this program reads and writes nucleotide and protein sequences in many useful formats.
- To run Readseq, use:
- java -cp readseq.jar run options inputfiles
- Supported formats: GCG, FASTA, GenBank, EMBL, MSF
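To make the idea of format conversion concrete, here is a minimal Python sketch that reads FASTA records and writes them in another layout. It is not Readseq's implementation. The tab-delimited output is a hypothetical format chosen for brevity and is not one of the formats listed above; the function names are also assumptions.

```python
from typing import Iterator


def read_fasta(lines) -> Iterator[tuple[str, str]]:
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def fasta_to_tab(fasta_text: str) -> str:
    """Convert FASTA text to one 'header<TAB>sequence' line per record."""
    records = read_fasta(fasta_text.splitlines())
    return "\n".join(f"{name}\t{seq}" for name, seq in records)


if __name__ == "__main__":
    demo = ">s1 demo\nACGT\nTTAA\n>s2\nGG\n"  # made-up records
    print(fasta_to_tab(demo))
```

A real converter also has to handle format-specific metadata (features, checksums, alignment gaps), which this sketch ignores.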
11 Software Tools for Sequence Analysis
Specialised Packages
Sequencing Project Management
Free academic licence
The Phred-Phrap Package, by Phil Green et al.
Excellent base call confidence estimation (phred)
Excellent large scale contig assembler (phrap)
Available by anonymous ftp
Excellent GUI
Excellent contig editor
Excellent finishing tools
Simple confidence estimation
Contig assembler not good for big projects, but phred and phrap can be accessed from the Staden GUI
13 Primer Design
- Oligo 6
- Primer Premier
- Vector NTI Suite
- DNASIS
- Omiga
- DNASTAR
- Primer3
15 Software Tools for Sequence Analysis
Specialised Packages
DNA/RNA Folding
Free for academic use
Can be installed locally or run via a WWW page
Incorporated into the GCG general package
Michael Zuker's Programs
Protein Structure Analysis
Nominal fee for academic use
LINUX, IRIX, Windows
Whatif by Gert Vriend
16 Software Tools for Sequence Analysis
Specialised Packages
Protein Structure Analysis for very rich people
IRIX, HP-UX, LINUX
IRIX, AIX, LINUX
Both systems are very impressive and very expensive
18 http://evolution.genetics.washington.edu/phylip/software.html
Here are some 195 of the phylogeny packages, and
18 free servers, that I know about. It is an
attempt to be completely comprehensive.
19 Software Tools for Sequence Analysis
Specialised Packages
Phylogeny
Available by anonymous ftp
Windows, Macintosh, UNIX
Incorporated into the EMBOSS general package
Commercial, but reasonable
UNIX, VMS, DOS and Windows
Incorporated into the GCG general package
21 Software Tools for Sequence Analysis
WWW Resources
Database Similarity Searching
Very popular, very widely available
Not sensitive, but extremely fast
Popular, widely available
Not sensitive, much slower than blast
Can be installed locally or run via a WWW page
Available by anonymous ftp (blast, fasta)
Both blast and fasta:
DNA/Protein query vs. DNA/Protein database
Incorporated into the GCG general package
22 Clustal
- Clustal is a general-purpose multiple sequence alignment program for nucleotide sequences or proteins.
- Function: It produces biologically meaningful multiple sequence alignments of divergent sequences. It calculates the best match for the selected sequences, and lines them up so that the identities, similarities and differences can be seen.
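The last point, making identities visible once the sequences are lined up, can be sketched in a few lines of Python. This sketch does not perform an alignment; it assumes the rows are already aligned (with `-` for gaps) and only marks fully identical columns with `*`. Clustal's own output also marks similar residues with `:` and `.`, which is left out here. The function name is hypothetical.

```python
def identity_line(aligned: list[str]) -> str:
    """Return '*' under columns where all rows share one residue, else ' '."""
    if not aligned:
        return ""
    length = len(aligned[0])
    if any(len(row) != length for row in aligned):
        raise ValueError("aligned sequences must have equal length")
    marks = []
    for column in zip(*aligned):
        residues = set(column)
        # A column of gaps only is not counted as an identity.
        identical = len(residues) == 1 and "-" not in residues
        marks.append("*" if identical else " ")
    return "".join(marks)


if __name__ == "__main__":
    rows = ["ACG-T", "ACGAT", "ATGAT"]  # made-up, pre-aligned rows
    for row in rows:
        print(row)
    print(identity_line(rows))
```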
23 Clustal
- Download address
- ftp://ftp-igbmc.u-strasbg.fr/pub/ClustalX/clustalx1.8.msw.zip
- ClustalW for DOS
- ClustalX for Windows
- ClustalV for Unix or Linux
25 HMMER
- HMMER is a collection of programs that create a hidden Markov model (HMM) of a sequence family which can be utilized as a query against a sequence database to identify (and/or align) additional homologs of the sequence family.
- HMMER was developed by Sean Eddy at Washington University.
26 HMMER
- Download address
- ftp://ftp.genetics.wustl.edu/pub/eddy/hmmer/2.2g/hmmer-2.2g.bin.dos-cygwin.zip
- Different downloads for Linux, Solaris, Mac, IRIX and DOS
27 HMMER
28 Application of HMM
- Pfam: Protein families database of alignments and HMMs
- Pfam is a collection of protein families and domains. Pfam contains multiple protein alignments and profile-HMMs of these families. Pfam is a semi-automatic protein family database, which aims to be comprehensive as well as accurate.
- http://www.sanger.ac.uk/Software/Pfam/
29 Application of HMM: Pfam
30 Application of HMM
- TMHMM: Prediction of transmembrane helices in proteins
- http://www.cbs.dtu.dk/services/TMHMM/
31 The MEME/MAST
- Motif Discovery and Search tools
- MEME: Discover motifs (highly conserved regions) in groups of related DNA or protein sequences.
- MAST: Search sequence databases using motifs.
- http://meme.sdsc.edu/meme/website/intro.html
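The idea of searching a set of sequences with a known motif can be sketched in Python as below. This is a deliberately simplified stand-in: it uses a plain regular expression rather than the scored motif models that MEME and MAST work with, and the pattern, sequence names and function name are all made up for illustration.

```python
import re


def find_motif(pattern: str, sequences: dict[str, str]) -> list[tuple[str, int, str]]:
    """Return (name, 1-based start, matched text) for every motif hit.

    A lookahead is used so that overlapping occurrences are reported too.
    """
    regex = re.compile(f"(?=({pattern}))", re.IGNORECASE)
    hits = []
    for name, seq in sequences.items():
        for match in regex.finditer(seq):
            hits.append((name, match.start() + 1, match.group(1)))
    return hits


if __name__ == "__main__":
    database = {"s1": "GGTATAAATATATA", "s2": "CCCC"}  # made-up sequences
    for name, start, text in find_motif("TATA[AT]A", database):
        print(name, start, text)
```

A regular expression gives only yes/no matches; a scored motif model can also rank weak and strong matches, which is the main reason dedicated motif tools exist.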
32 Software Tools for Sequence Analysis
General Packages
Commercial
UNIX only
WWW and X GUIs
Comprehensive
Widely available
Open source
UNIX only
Several GUIs (java, WWW, X)
Comprehensive
Similar structure to the GCG package
Open source
Windows, MacOS X, UNIX
Excellent GUI including interactive graphical
output
Not comprehensive but allows access to EMBOSS
33 Genetics Computer Group
Molecular biologists worldwide use the GCG
Wisconsin Package as their software of choice
for comprehensive sequence analysis. The
Wisconsin Package meets research
34 Founded in 1982 as a service of the Department of
Genetics at the University of Wisconsin, GCG
became a private company in 1990 and was acquired
by Oxford Molecular Group in 1997. The company
was one of the pioneers of bioinformatics and its
Wisconsin Package sequence analysis tools are
widely used and well regarded throughout the
pharmaceutical and biotechnology industries and
in academia.
35 EMBOSS
- EMBOSS (European Molecular Biology Open Software Suite) is an open source package of sequence analysis tools. This software covers a wide range of functionality and can handle data in a variety of formats.
- Download address
- ftp://ftp.uk.embnet.org/pub/EMBOSS/EMBOSS-2.8.0.tar.gz
- Linux/Unix version only; no version for Windows/DOS
36
- 150 programs in total
- Sequence alignment.
- Rapid database searching with sequence patterns.
- Protein motif identification, including domain analysis.
- Nucleotide sequence pattern analysis, for example to identify CpG islands or repeats.
- Codon usage analysis for small genomes.
- Rapid identification of sequence patterns in large scale sequence sets.
- Presentation tools for publication.
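As an example of the nucleotide pattern analysis mentioned above, the following Python sketch computes the observed/expected CpG ratio commonly used when looking for CpG islands. It is not EMBOSS code: it scores a whole sequence at once, whereas a real CpG island finder would slide a window along the sequence and apply thresholds. The function name is an assumption.

```python
def cpg_observed_expected(sequence: str) -> float:
    """Return the observed/expected CpG ratio for a DNA sequence.

    Expected CpG count is taken as (count of C * count of G) / length.
    """
    seq = sequence.upper()
    n = len(seq)
    c_count = seq.count("C")
    g_count = seq.count("G")
    if c_count == 0 or g_count == 0:
        return 0.0  # ratio is undefined without both C and G; report 0
    observed = seq.count("CG")
    return observed * n / (c_count * g_count)


if __name__ == "__main__":
    example = "CGCGCG"  # made-up sequence, for illustration only
    print(f"CpG obs/exp: {cpg_observed_expected(example):.2f}")
```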