Title: Marine Biological Laboratory
1. Marine Biological Laboratory Workshop on
Molecular Evolution
Woods Hole, Massachusetts
July 25, 2006, 7 to 10 PM
2. Multiple Sequence Alignment and Analysis through GCG's SeqLab
- Steven M. Thompson
- Florida State University School of Computational
Science (SCS)
More data yields stronger analyses if done
carefully! Mosaic ideas and evolutionary
importance.
3. But first, a prelude: my definitions
- Biocomputing and computational biology are synonymous and describe the use of computers and computational techniques to analyze any biological system, from molecules, through cells, tissues, organisms, and populations, to complete ecologies.
- Bioinformatics describes using computational techniques to access, analyze, and interpret the biological information in any of the available online biological databases.
- Sequence analysis is the study of molecular sequence data for the purpose of inferring the function, mechanism, interactions, evolution, and perhaps structure of biological molecules.
- Genomics analyzes the context of genes or complete genomes (the total DNA content of an organism) within and across genomes.
- Proteomics is a subdivision of genomics concerned with analyzing the complete protein complement, i.e. the proteome, of organisms, both within and between different organisms.
4. And a way to think about it: the reverse biochemistry analogy
- From a virtual DNA sequence to actual molecular physical characterization, not the other way round.
- Using bioinformatics tools, you can infer all sorts of functional, evolutionary, and structural insights into a gene product, without the need to isolate and purify massive amounts of protein! Eventually you can go on to clone and express the gene based on that analysis using PCR techniques.
- The computer and molecular databases are an essential part of this process.
5. The exponential growth of molecular sequence databases and CPU power
- Year BasePairs Sequences
- 1982 680338 606
- 1983 2274029 2427
- 1984 3368765 4175
- 1985 5204420 5700
- 1986 9615371 9978
- 1987 15514776 14584
- 1988 23800000 20579
- 1989 34762585 28791
- 1990 49179285 39533
- 1991 71947426 55627
- 1992 101008486 78608
- 1993 157152442 143492
- 1994 217102462 215273
- 1995 384939485 555694
- 1996 651972984 1021211
- 1997 1160300687 1765847
- 1998 2008761784 2837897
- 1999 3841163011 4864570
Doubling time: roughly 1 year!
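The doubling-time claim can be checked against the table itself. The sketch below copies the BasePairs column and assumes simple exponential growth between two chosen years; the function name `doubling_time` is only illustrative. Over the whole 1982 to 1999 span it gives roughly 1.4 years, and for the last interval just over one year.

```python
import math

# (year, base pairs) copied from the table above
GROWTH = [
    (1982, 680338), (1983, 2274029), (1984, 3368765), (1985, 5204420),
    (1986, 9615371), (1987, 15514776), (1988, 23800000), (1989, 34762585),
    (1990, 49179285), (1991, 71947426), (1992, 101008486), (1993, 157152442),
    (1994, 217102462), (1995, 384939485), (1996, 651972984),
    (1997, 1160300687), (1998, 2008761784), (1999, 3841163011),
]


def doubling_time(first, last):
    """Years needed to double, assuming exponential growth between two rows."""
    (year0, bp0), (year1, bp1) = first, last
    return (year1 - year0) / math.log2(bp1 / bp0)


overall = doubling_time(GROWTH[0], GROWTH[-1])   # 1982 -> 1999
recent = doubling_time(GROWTH[-2], GROWTH[-1])   # 1998 -> 1999
print(f"overall: {overall:.2f} years, most recent: {recent:.2f} years")
```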
6. Back to multiple sequence alignment: applicability?
So what; why even bother? Applications:
- Probe/primer and motif/profile design
- Graphical illustrations
- Comparative homology inference
- Molecular evolutionary analysis
OK, well, how do you do it?
7. Dynamic programming's complexity increases exponentially with the number of sequences being compared
- N-dimensional matrix . . . .
- complexity = [sequence length]^(number of sequences)
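To get a feel for that formula, here is a tiny sketch that counts the cells of the N-dimensional matrix. It assumes all sequences have the same length and ignores the extra border row in each dimension; `dp_cells` is just an illustrative name.

```python
def dp_cells(seq_length, n_sequences):
    """Cells in a full N-dimensional dynamic programming matrix."""
    return seq_length ** n_sequences


# A modest 300-residue protein family
for n in (2, 3, 4, 5):
    print(f"{n} sequences: {dp_cells(300, n):,} cells")
```

Two sequences need 90,000 cells; three already need 27 million, which is why exhaustive multiple alignment is abandoned so quickly.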
8. Global heuristic solutions
See MSA (global within a bounding box) and PIMA (local portions only) on the multiple alignment page at the Baylor College of Medicine's Search Launcher, http://searchlauncher.bcm.tmc.edu/, but both come with severely limiting restrictions!
9. Multiple Sequence Dynamic Programming
Therefore, pairwise, progressive dynamic programming restricts the solution to the neighborhood of only two sequences at a time. All sequences are compared, pairwise, and then each is aligned to its most similar partner or group of partners. Each group of partners is then aligned to finish the complete multiple sequence alignment.
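The order in which sequences and groups get joined can be sketched in a few lines. This is not PileUp's or ClustalW's actual code: as a loud assumption, `difflib.SequenceMatcher` stands in for a real pairwise dynamic programming score, groups are merged by average similarity, and the alignment step itself is left out. The names `pairwise_similarity` and `join_order` are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def pairwise_similarity(a, b):
    # Stand-in for a pairwise dynamic programming alignment score (assumption).
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def join_order(seqs):
    """Return the order in which sequences/groups would be aligned.

    seqs maps a name to an unaligned sequence. Each step joins the two
    groups with the highest average pairwise similarity.
    """
    sim = {
        frozenset((x, y)): pairwise_similarity(seqs[x], seqs[y])
        for x, y in combinations(seqs, 2)
    }

    def average(g1, g2):
        total = sum(sim[frozenset((a, b))] for a in g1 for b in g2)
        return total / (len(g1) * len(g2))

    groups = [(name,) for name in seqs]
    order = []
    while len(groups) > 1:
        g1, g2 = max(combinations(groups, 2), key=lambda pair: average(*pair))
        order.append((g1, g2))
        groups = [g for g in groups if g not in (g1, g2)] + [g1 + g2]
    return order


example = {"a": "ACGTACGTAC", "b": "ACGTACGTAA", "c": "TTTTGGGGCC"}
for step, (left, right) in enumerate(join_order(example), start=1):
    print(step, left, "+", right)
```

The two near-identical sequences pair up first, and the outlier is only aligned to that finished pair at the end.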
10. Reliability and the Comparative Approach
- Explicit homologous correspondence.
- Manual adjustments should be encouraged based on knowledge, especially of structural, regulatory, and functional sites.
- Therefore, editors like SeqLab and the Ribosomal Database Project: http://rdp.cme.msu.edu/index.jsp
11. Structural and functional correspondence in the Wisconsin Package's SeqLab
12. Work with proteins! If at all possible
- Twenty match symbols versus four, plus similarity! Way better signal to noise.
- Also guarantees no indels are placed within codons. So translate, then align.
- Nucleotide sequences will only reliably align if they are very similar to each other. And they will require extensive hand editing and careful consideration.
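"Translate, then align" implies a last step: threading the original codons back onto the protein alignment, so every gap becomes three nucleotide gaps and never lands inside a codon. A minimal sketch, assuming the coding sequence has no stop codon and matches the protein residue for residue; `thread_codons` is a hypothetical helper, not a GCG program.

```python
def thread_codons(aligned_protein, cds, gap="-"):
    """Map an unaligned coding sequence onto its aligned protein.

    Each residue takes the next codon; each protein gap becomes a
    three-nucleotide gap, so indels never split a codon.
    """
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    residues = sum(1 for symbol in aligned_protein if symbol != gap)
    if len(cds) % 3 != 0 or residues != len(codons):
        raise ValueError("coding sequence does not match the protein")
    next_codon = iter(codons)
    return "".join(
        gap * 3 if symbol == gap else next(next_codon)
        for symbol in aligned_protein
    )


print(thread_codons("M-KV", "ATGAAAGTT"))  # ATG---AAAGTT
```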
13. Beware of aligning apples and oranges and grapefruit!
- Paralogous versus orthologous
- genomic versus cDNA
- mature versus precursor.
14. Mask out uncertain areas
15. Complications
- Order dependence.
- Not that big of a deal.
- Substitution matrices and gap penalties.
- A very big deal!
- Regional realignment becomes incredibly
important, especially with sequences that have
areas of high and low similarity (GCG PileUp
-InSitu option).
16. Complications, cont.
- Format hassles!
- Specialized format conversion tools such as GCG's SeqConv program and PAUPSearch, and
- Don Gilbert's public domain ReadSeq program.
17. Still more complications
- Indels and missing data symbols (i.e. gaps) designation discrepancy headaches:
- ., -, ~, ?, N, or X
- . . . . . Help!
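One way to tame the symbol zoo before moving an alignment between programs is to pick a single gap character and convert to it. The sketch below assumes that the period, the hyphen, and the tilde (which GCG multiple sequence files use for padding at the ends) all mean "gap", and it deliberately leaves ?, N, and X alone, since N and X are also legitimate ambiguity codes. Beware that some formats use the period to mean "same as the first sequence", in which case this mapping would be wrong. `normalize_gaps` is an illustrative name.

```python
GAP_SYMBOLS = ".-~"  # assumption: all three mark an indel or padding


def normalize_gaps(aligned, gap="-"):
    """Rewrite every assumed gap symbol as one chosen gap character."""
    table = str.maketrans({symbol: gap for symbol in GAP_SYMBOLS})
    return aligned.translate(table)


print(normalize_gaps("~~ACG..TN?X-A"))  # --ACG--TN?X-A
```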
18. Web resources for pairwise, progressive multiple alignment
- http://www.techfak.uni-bielefeld.de/bcd/Curric/MulAli/welcome.html
- http://pbil.univ-lyon1.fr/alignment.html
- http://www.ebi.ac.uk/clustalw/
- http://searchlauncher.bcm.tmc.edu/
However, problems with very large datasets and huge multiple alignments make doing multiple sequence alignment on the Web impractical after your dataset has reached a certain size. You'll know it when you're there!
19. If large datasets become intractable for analysis on the Web, what other resources are available?
- Desktop software solutions: public domain programs are available, but . . . complicated to install, configure, and maintain. User must be pretty computer savvy. So,
- commercial software packages are available, e.g. MacVector, DS Gene, DNAsis, DNAStar, etc.,
- but . . . license hassles, big expense per machine, and Internet and/or CD database access all complicate matters!
20. Therefore, UNIX server-based solutions
- Public domain solutions also exist, but now a very cooperative systems manager needs to maintain everything for users, so,
- commercial products, e.g. the Accelrys GCG Wisconsin Package and the SeqLab Graphical User Interface, simplify matters for administrators and users. One format, one look-and-feel.
- One license fee for an entire institution and very fast, convenient database access on local server disks. Connections from any networked terminal or workstation anywhere!
- Operating system: UNIX command line operation hassles; communications software: telnet, ssh, and terminal emulation; X graphics; file transfer: ftp and scp/sftp; and editors: vi, emacs, pico (or desktop word processing followed by file transfer; save as "text only!"). See my supplement pdf file.
21. The Genetics Computer Group
- The Accelrys Wisconsin Package for Sequence Analysis
- GCG began in 1982 in Oliver Smithies' Genetics Dept. lab at the University of Wisconsin, Madison. Starting in 1990 it became a private company, which was acquired by the Oxford Molecular Group, U.K., in 1997 and then by Pharmacopeia Inc., U.S.A., in 2000. In 2004 Accelrys, San Diego, California, left Pharmacopeia to become an independent entity.
- The suite contains around 150 programs designed to work in a toolbox fashion. Several simple programs used in succession can lead to very sophisticated results.
- Also internal compatibility, i.e. once you learn to use one program, all programs can be run similarly, and the output from many programs can be used as input for other programs.
- Used all over the world at over 950 institutions, so learning it will likely be useful at other research institutions as well.
22. To answer the always perplexing GCG question: What sequence(s)? . . . .
Specifying sequences, GCG style, in order of increasing power and complexity:
- The sequence is in a local GCG format single sequence file in your UNIX account. (GCG Reformat and SeqConv programs)
- The sequence is in a local GCG database, in which case you point to it by using any of the GCG database logical names. A colon, ":", always sets the logical name apart from either an accession number or a proper identifier name or a wildcard expression, and they are case insensitive.
- The sequence is in a GCG format multiple sequence file, either an MSF (multiple sequence format) file or an RSF (rich sequence format) file. To specify sequences contained in a GCG multiple sequence file, supply the file name followed by a pair of braces, "{}", containing the sequence specification, e.g. a wildcard, "{*}".
- Finally, the most powerful method of specifying sequences is in a GCG list file. It is merely a list of other sequence specifications and can even contain other list files within it. The convention to use a GCG list file in a program is to precede it with an at sign, "@". Furthermore, you can supply attribute information within list files to specify something special about the sequence, such as begin and end constraints.
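The four styles can be told apart purely by their punctuation. Here is a rough sketch of that decision, checking the at sign first, then braces, then the colon; the function name and the returned labels are my own, not anything GCG provides, and real GCG parsing surely handles more corner cases.

```python
def classify_spec(spec):
    """Guess which GCG sequence specification style a string uses."""
    spec = spec.strip()
    if spec.startswith("@"):
        return "list file"
    if "{" in spec and spec.endswith("}"):
        return "multiple sequence file"
    if ":" in spec:
        return "database entry"
    return "single sequence file"


for spec in ("my-special.pep", "SwissProt:EfTu_Ecoli",
             "Ef1a-Tu.msf{*}", "@another.list"):
    print(f"{spec:24} {classify_spec(spec)}")
```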
23. Clean GCG format single sequence file after reformat (or the SeqConv program)

!!NA_SEQUENCE 1.0
This is a small example of GCG single sequence format. Always put some documentation on top, so in the future you can figure out what it is you're dealing with! The line with the two periods is converted to the checksum line.

example.seq  Length: 77  July 21, 1999 09:30  Type: N  Check: 4099  ..

       1  ACTGACGTCA CATACTGGGA CTGAGATTTA CCGAGTTATA CAAGTATACA
      51  GATTTAATAG CATGCGATCC CATGGGA

SeqLab's Editor mode can also Import native GenBank format and ABI or LI-COR trace files!
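Reading such a file back is mostly a matter of finding the line that ends in two periods: everything before it is documentation, everything after it is sequence with position numbers and spacing to strip. A minimal sketch, assuming the documentation itself contains no line ending in ".." and that the divider line carries a "Length" field; `parse_gcg_single` is a hypothetical helper, and the checksum is not verified here.

```python
import re

EXAMPLE = """!!NA_SEQUENCE 1.0
This is a small example of GCG single sequence format.

example.seq  Length: 77  July 21, 1999 09:30  Type: N  Check: 4099  ..

       1  ACTGACGTCA CATACTGGGA CTGAGATTTA CCGAGTTATA CAAGTATACA
      51  GATTTAATAG CATGCGATCC CATGGGA
"""


def parse_gcg_single(text):
    """Return (declared_length, sequence) from GCG single sequence text."""
    lines = text.splitlines()
    for index, line in enumerate(lines):
        if line.rstrip().endswith(".."):
            break
    else:
        raise ValueError("no '..' divider line found")
    match = re.search(r"Length:?\s*(\d+)", line)
    declared = int(match.group(1)) if match else None
    # Drop position numbers and all whitespace from the sequence lines.
    sequence = re.sub(r"[\s\d]", "", "".join(lines[index + 1:]))
    return declared, sequence


declared, sequence = parse_gcg_single(EXAMPLE)
print(declared, len(sequence), sequence[:10])
```

Comparing the declared length with the number of symbols actually read is a cheap sanity check after any format conversion.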
24. Logical terms for the Wisconsin Package
Sequence databases, nucleic acids:
- GENBANKPLUS, GBP: all of GenBank plus EST, HTC, and GSS subdivisions
- GENBANK, GB: all of GenBank except EST, HTC, and GSS subdivisions
- BACTERIAL, BA: GenBank bacterial subdivision
- EST: GenBank EST (Expressed Sequence Tags) subdivision
- GSS: GenBank GSS (Genome Survey Sequences) subdivision
- HTC: GenBank High Throughput cDNA
- HTG: GenBank High Throughput Genomic
- INVERTEBRATE, IN: GenBank invertebrate subdivision
- OTHERMAMM, OM: GenBank other mammalian subdivision
- OTHERVERT, OV: GenBank other vertebrate subdivision
- PATENT, PAT: GenBank patent subdivision
Sequence databases, amino acids:
- GENPEPT, GP: GenBank CDS translations
- UNIPROT or UNI, SWISSPROTPLUS, SWP: all of Swiss-Prot and all of SPTrEMBL
- UNISPROT, SWISSPROT, SWISS, SW: all of Swiss-Prot (fully annotated)
- UNITREMBL, SPTREMBL, SPT: Swiss-Prot preliminary EMBL translations
- PIR, P: all of PIR Protein
- PIR1: PIR fully annotated subdivision
- PIR2: PIR preliminary subdivision
- PIR3: PIR unverified subdivision
- PIR4: PIR unencoded subdivision
These are easy: they make sense and you'll have a vested interest.
25. GCG MSF and RSF format

!!AA_MULTIPLE_ALIGNMENT 1.0

  small.pfs.msf  MSF: 735  Type: P  July 20, 2001 14:53  Check: 6619 ..

  Name: a49171  Len: 425  Check:  537  Weight: 1.00
  Name: e70827  Len: 577  Check:   21  Weight: 1.00
  Name: g83052  Len: 718  Check: 9535  Weight: 1.00
  Name: f70556  Len: 534  Check: 3494  Weight: 1.00
  Name: t17237  Len: 229  Check: 9552  Weight: 1.00
  Name: s65758  Len: 735  Check:  111  Weight: 1.00
  Name: a46241  Len: 274  Check: 3514  Weight: 1.00

//
//////////////////////////////////////////////////

!!RICH_SEQUENCE 1.0
..
{
name           ef1a_giala
descrip        PileUp of @/users1/thompson/.seqlab-mendel/pileup_28.list
type           PROTEIN
longname       /users1/thompson/seqlab/EF1A_primitive.orig.msf{ef1a_giala}
sequence-ID    Q08046
checksum       7342
offset         23
creation-date  07/11/2001 16:51:19
strand         1
comments
//////////////////////////////////////////////////

This is SeqLab's native format.
- The trick is to not forget the braces and wildcard, e.g. filename{*}, when specifying!
26. The List File Format
Remember the @ sign!

!!SEQUENCE_LIST 1.0
An example GCG list file of many elongation 1a and Tu factors follows. As with all GCG data files, two periods separate documentation from data. ..

my-special.pep begin:24 end:134
SwissProt:EfTu_Ecoli
Ef1a-Tu.msf{*}
/usr/accounts/test/another.rsf{ef1a_*}
@another.list

The way SeqLab works!
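Because a list file is just documentation, a ".." divider, and then one specification per line, it is easy to read with a few lines of code. The sketch below assumes attributes are whitespace-separated `name:value` tokens following the specification, as in the begin and end constraints of the example; it does not expand wildcards or follow nested list files, and `parse_list_file` is a hypothetical name.

```python
LIST_FILE = """!!SEQUENCE_LIST 1.0
An example GCG list file. Two periods separate documentation from data. ..

my-special.pep begin:24 end:134
SwissProt:EfTu_Ecoli
Ef1a-Tu.msf{*}
/usr/accounts/test/another.rsf{ef1a_*}
@another.list
"""


def parse_list_file(text):
    """Return a list of (specification, attributes) pairs."""
    lines = text.splitlines()
    for index, line in enumerate(lines):
        if line.rstrip().endswith(".."):
            break
    else:
        raise ValueError("no '..' divider line found")
    entries = []
    for line in lines[index + 1:]:
        tokens = line.split()
        if not tokens:
            continue  # skip blank lines
        spec = tokens[0]
        # Remaining tokens are assumed to be name:value attributes.
        attributes = dict(token.split(":", 1) for token in tokens[1:])
        entries.append((spec, attributes))
    return entries


for spec, attributes in parse_list_file(LIST_FILE):
    print(spec, attributes)
```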
27. SeqLab: GCG's X-based GUI!
- SeqLab is the merger of Steve Smith's Genetic Data Environment and GCG's Wisconsin Package Interface:
- GDE + WPI = SeqLab
- Requires an X-Windowing environment, either native on UNIX computers (including LINUX, but not installed by default on Mac OS X v.10 systems; however, see Apple's free X11 package or XDarwin), or emulated with X-Server software on personal computers.
28. Conclusions
Gunnar von Heijne, in his old but quite readable treatise, Sequence Analysis in Molecular Biology: Treasure Trove or Trivial Pursuit (1987), provides a very appropriate conclusion: "Think about what you're doing; use your knowledge of the molecular system involved to guide both your interpretation of results and your direction of inquiry; use as much information as possible; and do not blindly accept everything the computer offers you." He continues: ". . . if any lesson is to be drawn . . . it surely is that to be able to make a useful contribution one must first and foremost be a biologist, and only second a theoretician . . . . We have to develop better algorithms, we have to find ways to cope with the massive amounts of data, and above all we have to become better biologists. But that's all it takes."
FOR MORE INFO...
- Explore my Web Home: http://bio.fsu.edu/stevet/cv.html.
- Contact me (stevet@bio.fsu.edu) for specific long-distance bioinformatics assistance and collaboration.
29. AND FOR EVEN MORE INFO...
Many texts are now available in the field. To honk-my-own-horn a bit, check out:
- Current Protocols in Bioinformatics from John Wiley & Sons, Inc. (http://www.does.org/cp/bioinfo.html)
- Horizon Scientific Press' Computational Genomics: Theory and Application (http://www.horizonpress.com/hsp/books/com.html)
- Humana Press' Introduction to Bioinformatics: A Theoretical And Practical Approach (http://www.humanapress.com/Product.pasp?txtCatalogHumanaBookstxtCategorytxtProductID1-58829-241-XisVariant0)
They all asked me to contribute chapters on multiple sequence alignment and analysis using GCG software.
30. On to a demonstration of some of SeqLab's multiple sequence dataset capabilities, some of my prebuilt alignments, and . . . Elongation Factor 1α/Tu, how to do it.