Title: Introduction to Bioinformatics
1Introduction to Bioinformatics
Monday, November 17, 2008
Jonathan Pevsner, pevsner_at_jhmi.edu
Bioinformatics M.E800.707
2Teaching assistants!
Bethany Drehman bethfoxglove_at_gmail.com
Cheng Ran (Lisa) Huang huangchengran_at_gmail.com
3Who is taking this course?
- People with very diverse backgrounds in biology
- Some people with backgrounds in computer science and biostatistics
- Most people (will) have a favorite gene, protein, or disease
4What are the goals of the course?
- To provide an introduction to bioinformatics with a focus on the National Center for Biotechnology Information (NCBI) and EBI
- To focus on the analysis of DNA, RNA and proteins
- To introduce you to the analysis of genomes
- To combine theory and practice to help you solve research problems
5Themes throughout the course
- Textbooks
- Web sites
- Literature references
- Gene/protein families
- Computer labs
6Textbook
The course has no required textbook. I wrote Bioinformatics and Functional Genomics (Wiley, 2003). The seven lectures in this course correspond closely to chapters. An electronic version is available on the Welch Library website. A few copies will be available on reserve at Welch Library, and the library has six more copies.
I recommend several other bioinformatics texts:
- Baxevanis and Ouellette
- David Mount
- Durbin et al.
7Visit http://www.welch.jhu.edu
Search for bioinformatics in ebook titles
9Web sites
The course website is reached via moodle: http://pevsnerlab.kennedykrieger.org/moodle (or Google moodle bioinformatics)
--This site contains the powerpoints for each lecture, including color and black & white versions
--The weekly quizzes are here
--You can ask questions via the forum
The textbook website is http://www.bioinfbook.org
This has powerpoints, URLs, etc. organized by chapter
10Literature references
You are encouraged to read original source
articles (posted on moodle). They will enhance
your understanding of the material. Readings are
optional but recommended.
11Themes throughout the course: gene/protein families
We will use beta globin and retinol-binding protein 4 (RBP4) as model genes/proteins throughout the course. Globins, including hemoglobin and myoglobin, carry oxygen. RBP4 is a member of the lipocalin family. It is a small, abundant carrier protein. We will study globins and lipocalins in a variety of contexts, including
--sequence alignment
--gene expression
--protein structure
--phylogeny
--homologs in various species
13Computer labs
There are three computer labs. I STRONGLY encourage you to bring a laptop to class. Also, the seven weekly quizzes function as a computer lab: to solve the questions, you may need to go to a website and use databases or software.
14Grading
60%: moodle quizzes (best six out of seven). Quizzes are taken at the moodle website, and are due one week after the relevant lecture.
40%: final exam, Tuesday, January 12 (in class). Closed book, cumulative, no computer, short answer / multiple choice. Past exams will be made available ahead of time.
15Google moodle bioinformatics to get here
Click Introduction to Bioinformatics to sign in
The enrollment key is
16Outline for the course
1. Accessing information about DNA and proteins (Nov. 17)
2. Pairwise alignment (Nov. 24)
3. BLAST (Dec. 1); LAB 1 of 3 (Dec. 1)
4. Multiple sequence alignment (Dec. 8)
5. Molecular phylogeny and evolution (Dec. 15); LAB 2 of 3 (Dec. 15)
6. Proteomics (Dec. 22)
7. Gene expression microarrays (Jan. 5); LAB 3 of 3 (Jan. 5)
Final exam (Jan. 12)
17Outline for today
Definition of bioinformatics
Overview of the NCBI website
Accessing information about DNA and proteins
--Definition of an accession number
--Five ways to find information on proteins and DNA
Access to biomedical literature
Pairwise alignment introduction
18What is bioinformatics?
- Interface of biology and computers
- Analysis of proteins, genes and genomes using computer algorithms and computer databases
- Genomics is the analysis of genomes. The tools of bioinformatics are used to make sense of the billions of base pairs of DNA that are sequenced by genomics projects.
19On bioinformatics
Science is about building causal relations
between natural phenomena (for instance, between
a mutation in a gene and a disease). The
development of instruments to increase our
capacity to observe natural phenomena has,
therefore, played a crucial role in the
development of science - the microscope being the
paradigmatic example in biology. With the human
genome, the natural world takes an unprecedented
turn: it is better described as a sequence of
symbols. Besides high-throughput machines such as
sequencers and DNA chip readers, the computer and
the associated software becomes the instrument to
observe it, and the discipline of bioinformatics
flourishes.
20On bioinformatics
However, as the separation between us (the
observers) and the phenomena observed increases
(from organism to cell to genome, for instance),
instruments may capture phenomena only
indirectly, through the footprints they leave.
Instruments therefore need to be calibrated: the
distance between the reality and the observation
(through the instrument) needs to be accounted
for. This issue of Genome Biology is about
calibrating instruments to observe gene
sequences: more specifically, computer programs
to identify human genes in the sequence of the
human genome.
Martin Reese and Roderic Guigó, Genome Biology 2006, 7(Suppl 1):S1, introducing EGASP, the Encyclopedia of DNA Elements (ENCODE) Genome Annotation Assessment Project
21Tool-users
Tool-makers
22Three perspectives on bioinformatics
The cell The organism The tree of life
Page 4
24DNA
RNA
phenotype
protein
Page 5
25Time of development
Body region, physiology, pharmacology, pathology
Page 5
26After Pace NR (1997) Science 276:734
Page 6
27DNA
RNA
phenotype
protein
28Growth of GenBank
[Figure: base pairs of DNA (billions) and sequences (millions) plotted by year, 1982-2002]
Fig. 2.1 Page 17
29Growth of GenBank and Whole Genome Shotgun (1982-November 2008)
[Figure: number of sequences in GenBank (millions), base pairs of DNA in GenBank (billions), and base pairs in GenBank WGS (billions) plotted by year, 1982-2007; vertical axis 0-250]
30Central dogma of molecular biology
DNA
RNA
protein
31DNA
RNA
phenotype
protein
protein sequence databases
cDNA ESTs UniGene
genomic DNA databases
Fig. 2.2 Page 20
32There are three major public DNA databases
GenBank
EMBL
DDBJ
The underlying raw DNA sequences are identical
Page 16
33There are three major public DNA databases
GenBank: housed at NCBI (National Center for Biotechnology Information)
EMBL: housed at EBI (European Bioinformatics Institute)
DDBJ: housed in Japan
Page 16
34The Trace Archive at NCBI contains over 2 billion
traces
11/08
35Taxonomy at NCBI 200,000 species are
represented in GenBank
http://www.ncbi.nlm.nih.gov/Taxonomy/txstat.cgi
11/08
36The most sequenced organisms in GenBank
Homo sapiens                    13.1 billion bases
Mus musculus                    8.4b
Rattus norvegicus               6.1b
Bos taurus                      5.2b
Zea mays                        4.6b
Sus scrofa                      3.6b
Danio rerio                     3.0b
Oryza sativa (japonica)         1.5b
Strongylocentrotus purpuratus   1.4b
Nicotiana tabacum               1.1b
Updated 11-6-08, GenBank release 168.0. Excluding WGS, organelles, metagenomics
Table 2-2 Page 18
37National Center for Biotechnology Information
(NCBI) www.ncbi.nlm.nih.gov
Page 24
38Fig. 2.5 Page 25
www.ncbi.nlm.nih.gov
39Fig. 2.5 Page 25
40- PubMed is
- National Library of Medicine's search service
- 16 million citations in MEDLINE
- links to participating online journals
- PubMed tutorial (via Education on side bar)
Page 24
42- Entrez integrates
- the scientific literature
- DNA and protein sequence databases
- 3D protein structure data
- population study data sets
- assemblies of complete genomes
Page 24
43Entrez is a search and retrieval system that
integrates NCBI databases
Page 24
44- BLAST is
- Basic Local Alignment Search Tool
- NCBI's sequence similarity search tool
- supports analysis of DNA and protein databases
- 100,000 searches per day
Page 25
45- OMIM is
- Online Mendelian Inheritance in Man
- catalog of human genes and genetic disorders
- created by Dr. Victor McKusick; led by Dr. Ada Hamosh at JHMI
Page 25
46- Books is
- searchable resource of on-line books
Page 26
47- TaxBrowser is
- browser for the major divisions of living organisms (archaea, bacteria, eukaryota, viruses)
- taxonomy information such as genetic codes
- molecular data on extinct organisms
Page 26
48- Structure site includes
- Molecular Modeling Database (MMDB)
- biopolymer structures obtained from the Protein Data Bank (PDB)
- Cn3D (a 3D-structure viewer)
- vector alignment search tool (VAST)
Page 26
49Outline for today
Definition of bioinformatics
Overview of the NCBI website
Accessing information about DNA and proteins
--Definition of an accession number
--Five ways to find information on proteins and DNA
Access to biomedical literature
Pairwise alignment introduction
50Accession numbers are labels for sequences
NCBI includes databases (such as GenBank) that
contain information on DNA, RNA, or protein
sequences. You may want to acquire information
beginning with a query such as the name of a
protein of interest, or the raw nucleotides
comprising a DNA sequence of interest. DNA
sequences and other molecular data are tagged
with accession numbers that are used to identify
a sequence or other record relevant to molecular
data.
Page 26
51What is an accession number?
An accession number is a label used to identify a sequence. It is a string of letters and/or numbers that corresponds to a molecular sequence.
Examples (all for retinol-binding protein, RBP4):
DNA
X02775      GenBank genomic DNA sequence
NT_030059   Genomic contig
rs7079946   dbSNP (single nucleotide polymorphism)
RNA
N91759.1    An expressed sequence tag (1 of 170)
NM_006744   RefSeq DNA sequence (from a transcript)
protein
NP_006735   RefSeq protein
AAC02945    GenBank protein
Q28369      SwissProt protein
1KT7        Protein Data Bank structure record
Page 27
52Five ways to access DNA and protein sequences
1 Entrez Gene with RefSeq
2 UniGene
3 European Bioinformatics Institute (EBI) and Ensembl (separate from NCBI)
4 ExPASy Sequence Retrieval System (separate from NCBI)
5 UCSC Genome Browser
Page 27
53Five ways to access protein and DNA sequences
1 Entrez Gene with RefSeq
Entrez Gene is a great starting point: it collects key information on each gene/protein from major databases. It covers all major organisms. RefSeq provides a curated, optimal accession number for each DNA (NM_006744) or protein (NP_006735)
Page 27
54From the NCBI home page, type beta globin and
hit Go
revised 11/08 Fig. 2.7 Page 29
55revised Fig. 2.7 Page 29
58By applying limits, there are now fewer entries
59Entrez Gene (top of page)
Note that links to many other HBB database
entries are available
revised Fig. 2.8 Page 30
60Entrez Gene (middle of page)
61Entrez Gene (middle of page, continued)
62Entrez Gene (bottom of page) RefSeqs
63Entrez Gene (bottom of page) non-RefSeq
accessions
64Fig. 2.9 Page 32
65Fig. 2.9 Page 32
66Fig. 2.9 Page 32
67FASTA format: versatile and compact, with one header line (beginning with >) followed by a string of nucleotides or amino acids in the single letter code
Fig. 2.10 Page 32
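The format above is simple enough to read with a few lines of code. The following is only a minimal sketch: the function name parse_fasta and the header text of the example record are illustrative assumptions, not part of the lecture. The residues are the first 50 amino acids of RBP4 as they appear in the alignment slides later in this deck.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header line closes the previous record, if there is one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence lines are concatenated until the next header.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Hypothetical record: the header text is made up for illustration.
example = """>example_RBP4 first 50 residues
MKWVWALLLLAAWAAAERDCRVSSF
RVKENFDKARFSGTWYAMAKKDPEG
"""
records = parse_fasta(example)
```

The sequence may be wrapped over several lines in the file; the parser joins them into a single string per record.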
68What is an accession number?
An accession number is a label used to identify a sequence. It is a string of letters and/or numbers that corresponds to a molecular sequence.
Examples:
DNA
X02775      GenBank genomic DNA sequence
NT_030059   Genomic contig
rs7079946   dbSNP (single nucleotide polymorphism)
RNA
N91759.1    An expressed sequence tag (1 of hundreds)
NM_006744   RefSeq DNA sequence (from a transcript)
protein
NP_006735   RefSeq protein
AAC02945    GenBank protein
Q28369      SwissProt protein
1KT7        Protein Data Bank structure record
Page 27
69NCBI's important RefSeq project: best representative sequences
RefSeq (accessible via the main page of NCBI) provides an expertly curated accession number that corresponds to the most stable, agreed-upon reference version of a sequence.
RefSeq identifiers include the following formats:
Complete genome       NC_
Complete chromosome   NC_
Genomic contig        NT_
mRNA (DNA format)     NM_   e.g. NM_006744
Protein               NP_   e.g. NP_006735
Page 29-30
70NCBI's RefSeq project: accession for genomic, mRNA, protein sequences
Accession         Molecule   Method          Note
AC_123456         Genomic    Mixed           Alternate complete genomic
AP_123456         Protein    Mixed           Protein products alternate
NC_123456         Genomic    Mixed           Complete genomic molecules
NG_123456         Genomic    Mixed           Incomplete genomic regions
NM_123456         mRNA       Mixed           Transcript products mRNA
NM_123456789      mRNA       Mixed           Transcript products 9-digit
NP_123456         Protein    Mixed           Protein products
NP_123456789      Protein    Curation        Protein products 9-digit
NR_123456         RNA        Mixed           Non-coding transcripts
NT_123456         Genomic    Automated       Genomic assemblies
NW_123456         Genomic    Automated       Genomic assemblies
NZ_ABCD12345678   Genomic    Automated       Whole genome shotgun data
XM_123456         mRNA       Automated       Transcript products
XP_123456         Protein    Automated       Protein products
XR_123456         RNA        Automated       Transcript products
YP_123456         Protein    Auto. Curated   Protein products
ZP_12345678       Protein    Automated       Protein products
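Because the two-letter prefix carries the molecule type, an accession can be sorted into a category by inspecting the text before the underscore. The sketch below is illustrative only: the dictionary is transcribed from the table above, and the names REFSEQ_PREFIXES and classify_refseq are assumptions made for this example. It ignores the 9-digit NP_ variant, whose method is listed as Curation.

```python
# Prefix -> (molecule, method), transcribed from the RefSeq table above.
REFSEQ_PREFIXES = {
    "AC": ("Genomic", "Mixed"),
    "AP": ("Protein", "Mixed"),
    "NC": ("Genomic", "Mixed"),
    "NG": ("Genomic", "Mixed"),
    "NM": ("mRNA", "Mixed"),
    "NP": ("Protein", "Mixed"),
    "NR": ("RNA", "Mixed"),
    "NT": ("Genomic", "Automated"),
    "NW": ("Genomic", "Automated"),
    "NZ": ("Genomic", "Automated"),
    "XM": ("mRNA", "Automated"),
    "XP": ("Protein", "Automated"),
    "XR": ("RNA", "Automated"),
    "YP": ("Protein", "Auto. Curated"),
    "ZP": ("Protein", "Automated"),
}


def classify_refseq(accession):
    """Return (molecule, method) for a RefSeq-style accession, else None."""
    prefix, underscore, _rest = accession.partition("_")
    if not underscore:
        return None  # no underscore, so not a RefSeq-style accession
    return REFSEQ_PREFIXES.get(prefix.upper())


rbp4_mrna = classify_refseq("NM_006744")
genbank_dna = classify_refseq("X02775")
```

Accessions without an underscore, such as the GenBank record X02775, are reported as non-RefSeq by returning None.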
71Five ways to access DNA and protein sequences
1 Entrez Gene with RefSeq
2 UniGene
3 European Bioinformatics Institute (EBI) and Ensembl (separate from NCBI)
4 ExPASy Sequence Retrieval System (separate from NCBI)
5 UCSC Genome Browser
Page 31
72protein
DNA
RNA
complementary DNA (cDNA)
UniGene
Fig. 2.3 Page 23
73UniGene: unique genes via ESTs
- Find UniGene at NCBI: www.ncbi.nlm.nih.gov/UniGene
- UniGene clusters contain many expressed sequence tags (ESTs), which are DNA sequences (typically 500 base pairs in length) corresponding to the mRNA from an expressed gene. ESTs are sequenced from a complementary DNA (cDNA) library.
- UniGene data come from many cDNA libraries. Thus, when you look up a gene in UniGene you get information on its abundance and its regional distribution.
Pages 20-21
74Cluster sizes in UniGene
This is a gene with 1 EST associated; the cluster size is 1
Fig. 2.3 Page 23
75Cluster sizes in UniGene
This is a gene with 10 ESTs associated; the cluster size is 10
76Cluster sizes in UniGene (human)
Cluster size (ESTs)   Number of clusters
1                     40,300
2                     18,500
3-4                   18,000
5-8                   13,400
9-16                  8,100
17-32                 5,200
500-1000              1,900
1000-4000             940
4000-16,000           74
16,000-65,000         8
UniGene build 216, 11/08
77UniGene: unique genes via ESTs
Conclusion: UniGene is a useful tool to look up information about expressed genes. UniGene displays information about the abundance of a transcript (expressed gene), as well as its regional distribution of expression (e.g. brain vs. liver). We will discuss UniGene further on January 5 (gene expression).
Page 31
78Five ways to access DNA and protein sequences
1 Entrez Gene with RefSeq
2 UniGene
3 European Bioinformatics Institute (EBI) and Ensembl (separate from NCBI)
4 ExPASy Sequence Retrieval System (separate from NCBI)
5 UCSC Genome Browser
Page 31
79Ensembl to access protein and DNA sequences
Try Ensembl at www.ensembl.org for a
premier human genome web browser. We will
encounter Ensembl as we study the human
genome, BLAST, and other topics.
80click human
81enter RBP4
83Five ways to access DNA and protein sequences
1 Entrez Gene with RefSeq
2 UniGene
3 European Bioinformatics Institute (EBI) and Ensembl (separate from NCBI)
4 ExPASy Sequence Retrieval System (separate from NCBI)
5 UCSC Genome Browser
Page 33
84ExPASy to access protein and DNA sequences
ExPASy sequence retrieval system (ExPASy = Expert Protein Analysis System)
Visit http://www.expasy.ch/
Page 33
85Fig. 2.11 Page 33
87Five ways to access DNA and protein sequences
1 Entrez Gene with RefSeq
2 UniGene
3 European Bioinformatics Institute (EBI) and Ensembl (separate from NCBI)
4 ExPASy Sequence Retrieval System (separate from NCBI)
5 UCSC Genome Browser
Page 33
88(1) Visit http://genome.ucsc.edu/, click Genome Browser
(2) Choose organisms, enter query (beta globin), hit submit
89Example of how to access sequence data: HIV-1 pol
There are many possible approaches. Begin at the main page of NCBI, and type an Entrez query: hiv-1 pol
Page 34
9011/08
91Searching for HIV-1 pol: following the genome link yields a manageable five results
Page 34
92Example of how to access sequence data: HIV-1 pol
For the Entrez query hiv-1 pol there are about 80,000 nucleotide or protein records (and >200,000 records for a search for hiv-1), but these can easily be reduced in two easy steps:
--specify the organism, e.g. hiv-1[organism]
--limit the output to RefSeq!
Page 34
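The two narrowing steps can also be written as a single query string. The sketch below only builds the text of such a query and a URL for NCBI's E-utilities esearch endpoint; it does not contact NCBI. Several pieces are assumptions rather than lecture content: the helper names build_term and esearch_url, the use of the refseq[filter] term to restrict results to RefSeq, and the choice of the nucleotide database.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint (assumed here; not shown in the slides).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_term(organism, keyword=None, refseq_only=False):
    """Combine an organism restriction, an optional keyword, and a RefSeq limit."""
    parts = [f"{organism}[organism]"]      # step 1: specify the organism
    if keyword:
        parts.append(keyword)
    if refseq_only:
        parts.append("refseq[filter]")     # step 2: limit output to RefSeq (assumed filter name)
    return " AND ".join(parts)


def esearch_url(term, db="nucleotide"):
    """Return a URL-encoded esearch request for the given query term."""
    return ESEARCH_URL + "?" + urlencode({"db": db, "term": term})


term = build_term("hiv-1", keyword="pol", refseq_only=True)
url = esearch_url(term)
```

Pasting the value of term into the Entrez search box should have the same effect as applying the limits by hand, under the assumptions stated above.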
93over 200,000 nucleotide entries for HIV-1
only 1 RefSeq
94Examples of how to access sequence data: histone
Query for histone         Results
protein records           21847
RefSeq entries            7544
RefSeq (limit to human)   1108
NOT deacetylase           697
At this point, select a reasonable candidate (e.g. histone 2, H4) and follow its link to Entrez Gene. There, you can confirm you have the right gene/protein.
8-12-06
96Outline for today
Definition of bioinformatics
Overview of the NCBI website
Accessing information about DNA and proteins
--Definition of an accession number
--Five ways to find information on proteins and DNA
Access to biomedical literature
Pairwise alignment introduction
97PubMed at NCBI to find literature information
98PubMed is the NCBI gateway to MEDLINE. MEDLINE contains bibliographic citations and author abstracts from over 4,600 journals published in the United States and in 70 foreign countries. It has >18 million records dating back to the 1950s.
Updated 11-08
Page 35
99MeSH is the acronym for "Medical Subject
Headings." MeSH is the list of the vocabulary
terms used for subject analysis of biomedical
literature at NLM. MeSH vocabulary is used for
indexing journal articles for MEDLINE. The
MeSH controlled vocabulary imposes uniformity
and consistency to the indexing of biomedical
literature.
Page 35
102PubMed search strategies
- Try the tutorial (education on the left sidebar)
- Use boolean queries (capitalize AND, OR, NOT), e.g. lipocalin AND disease
- Try using limits
- Try Links to find Entrez information and external resources
- Obtain articles on-line via Welch Medical Library (and download pdf files): http://www.welch.jhu.edu/
Page 35
103lipocalin AND disease (60 results): 1 AND 2
lipocalin OR disease (1,650,000 results): 1 OR 2
lipocalin NOT disease (530 results): 1 NOT 2
[Venn diagrams: circle 1 is lipocalin, circle 2 is disease]
Fig. 2.12 Page 34
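The three boolean operators behave like set operations on the articles matched by each term. The sketch below uses made-up article IDs purely for illustration; nothing here reflects real PubMed results.

```python
# Hypothetical article IDs; the numbers are invented for illustration.
lipocalin = {1, 2, 3, 4}        # articles matching term 1
disease = {3, 4, 5, 6, 7}       # articles matching term 2

both = lipocalin & disease             # lipocalin AND disease (intersection)
either = lipocalin | disease           # lipocalin OR disease (union)
only_lipocalin = lipocalin - disease   # lipocalin NOT disease (difference)
```

AND always returns the fewest articles and OR the most, which matches the pattern of result counts on the slide.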
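104Search result compared with article contents
Search result: globin is found
--globin is present in the article: true positive
--globin is absent from the article: false positive (article does not discuss globins)
Search result: globin is not found
--globin is present in the article: false negative (article discusses globins)
--globin is absent from the article: true negative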
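The four outcomes follow from two yes/no facts about each article. A minimal sketch, with a hypothetical function name and argument names chosen for this example:

```python
def classify_hit(article_discusses_globin, search_found_article):
    """Label one article as a true/false positive/negative search outcome."""
    if search_found_article:
        # The search returned the article: correct only if it is really about globins.
        return "true positive" if article_discusses_globin else "false positive"
    # The search missed the article: an error only if it is really about globins.
    return "false negative" if article_discusses_globin else "true negative"


outcomes = {
    (present, found): classify_hit(present, found)
    for present in (True, False)
    for found in (True, False)
}
```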
105WelchWeb is available at http//www.welch.jhu.edu
106WelchWeb is available at http//www.welch.jhu.edu
Welch Medical Library liaisons to the basic sciences
107November 24, 2008
Pairwise sequence alignment
Jonathan Pevsner, Ph.D.
Bioinformatics, Johns Hopkins M.E440.707
108Outline: pairwise alignment
- Overview and examples
- Definitions: homologs, paralogs, orthologs
- Assigning scores to aligned amino acids
- Dayhoff's PAM matrices
- Alignment algorithms: Needleman-Wunsch, Smith-Waterman
- Statistical significance of pairwise alignments
109Pairwise alignments in the 1950s
β-corticotropin (sheep)   ala gly glu asp asp glu
Corticotropin A (pig)     asp gly ala glu asp glu
Oxytocin      CYIQNCPLG
Vasopressin   CYFQNCPRG
110Early example of sequence alignment: globins (1961)
[Figure labels: myoglobin, α- and β-globins]
H.C. Watson and J.C. Kendrew, Comparison Between the Amino-Acid Sequences of Sperm Whale Myoglobin and of Human Hæmoglobin. Nature 190:670-672, 1961.
111Pairwise sequence alignment is the most fundamental operation of bioinformatics
- It is used to decide if two proteins (or genes) are related structurally or functionally
- It is used to identify domains or motifs that are shared between proteins
- It is the basis of BLAST searching (next week)
- It is used in the analysis of genomes
114Pairwise alignment: protein sequences can be more informative than DNA
- protein is more informative (20 vs 4 characters); many amino acids share related biophysical properties
- codons are degenerate: changes in the third position often do not alter the amino acid that is specified
- protein sequences offer a longer look-back time
- DNA sequences can be translated into protein, and then used in pairwise alignments
115Page 54
116Pairwise alignment: protein sequences can be more informative than DNA
DNA can be translated into six potential proteins
5' CAT CAA
5' ATC AAC
5' TCA ACT
5' CATCAACTACAACTCCAAAGACACCCTTACACATCAACAAACCTACCCAC 3'
3' GTAGTTGATGTTGAGGTTTCTGTGGGAATGTGTAGTTGTTTGGATGGGTG 5'
5' GTG GGT
5' TGG GTA
5' GGG TAG
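The six reading frames (three on each strand) can be generated programmatically. The following is a minimal sketch using the standard genetic code; the function names are illustrative assumptions, and the input is the 50-base sequence shown on the slide. It assumes the sequence contains only A, C, G and T.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(dna):
    """Return the reverse complement of a DNA string (A<->T, C<->G)."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons from the start of the string; '*' marks a stop."""
    dna = dna.upper()
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Return translations for frames +1, +2, +3 and -1, -2, -3."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


seq = "CATCAACTACAACTCCAAAGACACCCTTACACATCAACAAACCTACCCAC"
frames = six_frames(seq)
```

Frame +1 begins with the codons CAT CAA and frame -1 begins with GTG GGT, as drawn on the slide.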
117Pairwise alignment: protein sequences can be more informative than DNA
- Many times, DNA alignments are appropriate
--to confirm the identity of a cDNA
--to study noncoding regions of DNA
--to study DNA polymorphisms
--example: Neanderthal vs modern human DNA
Query 181 catcaactacaactccaaagacacccttacacccactaggatatcaacaaacctacccac 240
Sbjct 189 catcaactgcaaccccaaagccacccct-cacccactaggatatcaacaaacctacccac 247
118β-lactoglobulin (P02754)
retinol-binding protein 4 (NP_006735)
Page 42
119Outline: pairwise alignment
- Overview and examples
- Definitions: homologs, paralogs, orthologs
- Assigning scores to aligned amino acids
- Dayhoff's PAM matrices
- Alignment algorithms: Needleman-Wunsch, Smith-Waterman
- Statistical significance of pairwise alignments
120Definitions
Pairwise alignment: The process of lining up two
sequences to achieve maximal levels of identity
(and conservation, in the case of amino acid
sequences) for the purpose of assessing the
degree of similarity and the possibility of
homology.
121Definitions
Homology: Similarity attributed to descent from a
common ancestor.
Page 42
122Definitions
Homology: Similarity attributed to descent from a common ancestor.
Identity: The extent to which two (nucleotide or amino acid) sequences are invariant.
RBP        26 RVKENFDKARFSGTWYAMAKKDPEGLFLQDNIVA 59
                K         GTW  MA        L     A
glycodelin 23 QTKQDLELPKLAGTWHSMAMA-TNNISLMATLKA 55
Page 44
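Identity can be computed directly from two aligned strings by counting the columns in which both sequences carry the same residue. The sketch below is one possible convention, not the lecture's definition: it divides by the full alignment length (gap columns included), and the function name percent_identity is an assumption. The two strings are the RBP and glycodelin segments from the slide.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Return (identical columns, percent identity) for two aligned strings.

    Assumption: the denominator is the alignment length, gap columns included.
    Other conventions (e.g. dividing by the shorter sequence) give other values.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    identical = sum(
        1 for x, y in zip(aligned_a, aligned_b) if x == y and x != gap
    )
    return identical, 100.0 * identical / len(aligned_a)


rbp = "RVKENFDKARFSGTWYAMAKKDPEGLFLQDNIVA"
glycodelin = "QTKQDLELPKLAGTWHSMAMA-TNNISLMATLKA"
matches, pct = percent_identity(rbp, glycodelin)
```

For this segment the two proteins share the residues K, GTW, MA, L and A, that is 8 identical positions out of 34 aligned columns.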
123Definitions: two types of homology
Orthologs: Homologous sequences in different species that arose from a common ancestral gene during speciation; they may or may not be responsible for a similar function.
Paralogs: Homologous sequences within a single species that arose by gene duplication.
Page 43
124Orthologs: members of a gene (protein) family in various organisms. This tree shows RBP orthologs.
[Tree figure; labels: common carp, zebrafish, rainbow trout, teleost, African clawed frog, chicken, human, mouse, rat, horse, rabbit, cow, pig. Scale bar: 10 changes]
Page 43
125Paralogs: members of a gene (protein) family within a species
[Tree figure; labels: apolipoprotein D, retinol-binding protein 4, complement component 8, alpha-1 microglobulin/bikunin, prostaglandin D2 synthase, progestagen-associated endometrial protein, neutrophil gelatinase-associated lipocalin, odorant-binding protein 2A, lipocalin 1. Scale bar: 10 changes]
Page 44
127Pairwise alignment of retinol-binding protein 4 and β-lactoglobulin
RBP             1 MKWVWALLLLAAWAAAERDCRVSSFRVKENFDKARFSGTWYAMAKKDPEG 50
lactoglobulin   1 ...MKCLLLALALTCGAQALIVT..QTMKGLDIQKVAGTWYSLAMAASD. 44
RBP            51 LFLQDNIVAEFSVDETGQMSATAKGRVR.LLNNWD..VCADMVGTFTDTE 97
lactoglobulin  45 ISLLDAQSAPLRV.YVEELKPTPEGDLEILLQKWENGECAQKKIIAEKTK 93
RBP            98 DPAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAV...........QYSC 136
lactoglobulin  94 IPAVFKIDALNENKVL........VLDTDYKKYLLFCMENSAEPEQSLAC 135
RBP           137 RLLNLDGTCADSYSFVFSRDPNGLPPEAQKIVRQRQ.EELCLARQYRLIV 185
lactoglobulin 136 QCLVRTPEVDDEALEKFDKALKALPMHIRLSFNPTQLEEQCHI....... 178
Page 46
128Definitions
Similarity: The extent to which nucleotide or protein sequences are related. It is based upon identity plus conservation.
Identity: The extent to which two sequences are invariant.
Conservation: Changes at a specific position of an amino acid (or, less commonly, DNA) sequence that preserve the physico-chemical properties of the original residue.
129Pairwise alignment of retinol-binding protein and β-lactoglobulin
RBP             1 MKWVWALLLLAAWAAAERDCRVSSFRVKENFDKARFSGTWYAMAKKDPEG 50
lactoglobulin   1 ...MKCLLLALALTCGAQALIVT..QTMKGLDIQKVAGTWYSLAMAASD. 44
RBP            51 LFLQDNIVAEFSVDETGQMSATAKGRVR.LLNNWD..VCADMVGTFTDTE 97
lactoglobulin  45 ISLLDAQSAPLRV.YVEELKPTPEGDLEILLQKWENGECAQKKIIAEKTK 93
RBP            98 DPAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAV...........QYSC 136
lactoglobulin  94 IPAVFKIDALNENKVL........VLDTDYKKYLLFCMENSAEPEQSLAC 135
RBP           137 RLLNLDGTCADSYSFVFSRDPNGLPPEAQKIVRQRQ.EELCLARQYRLIV 185
lactoglobulin 136 QCLVRTPEVDDEALEKFDKALKALPMHIRLSFNPTQLEEQCHI....... 178
Identity (bar)
Page 46
130Pairwise alignment of retinol-binding protein and β-lactoglobulin
RBP             1 MKWVWALLLLAAWAAAERDCRVSSFRVKENFDKARFSGTWYAMAKKDPEG 50
lactoglobulin   1 ...MKCLLLALALTCGAQALIVT..QTMKGLDIQKVAGTWYSLAMAASD. 44
RBP            51 LFLQDNIVAEFSVDETGQMSATAKGRVR.LLNNWD..VCADMVGTFTDTE 97
lactoglobulin  45 ISLLDAQSAPLRV.YVEELKPTPEGDLEILLQKWENGECAQKKIIAEKTK 93
RBP            98 DPAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAV...........QYSC 136
lactoglobulin  94 IPAVFKIDALNENKVL........VLDTDYKKYLLFCMENSAEPEQSLAC 135
RBP           137 RLLNLDGTCADSYSFVFSRDPNGLPPEAQKIVRQRQ.EELCLARQYRLIV 185
lactoglobulin 136 QCLVRTPEVDDEALEKFDKALKALPMHIRLSFNPTQLEEQCHI....... 178
Very similar (two dots)
Somewhat similar (one dot)
Page 46
131Definitions
Pairwise alignment: The process of lining up two
sequences to achieve maximal levels of identity
(and conservation, in the case of amino acid
sequences) for the purpose of assessing the
degree of similarity and the possibility of
homology.
Page 47
132Pairwise alignment of retinol-binding protein and β-lactoglobulin
RBP             1 MKWVWALLLLAAWAAAERDCRVSSFRVKENFDKARFSGTWYAMAKKDPEG 50
lactoglobulin   1 ...MKCLLLALALTCGAQALIVT..QTMKGLDIQKVAGTWYSLAMAASD. 44
RBP            51 LFLQDNIVAEFSVDETGQMSATAKGRVR.LLNNWD..VCADMVGTFTDTE 97
lactoglobulin  45 ISLLDAQSAPLRV.YVEELKPTPEGDLEILLQKWENGECAQKKIIAEKTK 93
RBP            98 DPAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAV...........QYSC 136
lactoglobulin  94 IPAVFKIDALNENKVL........VLDTDYKKYLLFCMENSAEPEQSLAC 135
RBP           137 RLLNLDGTCADSYSFVFSRDPNGLPPEAQKIVRQRQ.EELCLARQYRLIV 185
lactoglobulin 136 QCLVRTPEVDDEALEKFDKALKALPMHIRLSFNPTQLEEQCHI....... 178
Internal gap
Terminal gap
Page 46
133Gaps
Positions at which a letter is paired with a
null are called gaps. Gap scores are
typically negative. Since a single mutational
event may cause the insertion or deletion of
more than one residue, the presence of a gap
is ascribed more significance than the length
of the gap. In BLAST, it is rarely necessary
to change gap values from the default.
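One common way to give the presence of a gap more weight than its length is an affine gap penalty: a large cost for opening a gap plus a small cost per gapped residue. The term "affine" and the numeric values below are assumptions made for this sketch and are not taken from the slide; the defaults are chosen only to illustrate the idea.

```python
def affine_gap_score(length, gap_open=-11, gap_extend=-1):
    """Score one gap of `length` residues as opening cost plus per-residue extension.

    The default values are illustrative assumptions, not lecture content.
    """
    if length <= 0:
        return 0  # no gap, no penalty
    return gap_open + gap_extend * length


one_long_gap = affine_gap_score(5)          # a single gap spanning 5 residues
five_short_gaps = 5 * affine_gap_score(1)   # five separate 1-residue gaps
```

With these values a single 5-residue gap is penalized far less than five separate 1-residue gaps, which reflects the idea that one mutational event can insert or delete several residues at once.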
134Pairwise alignment of retinol-binding protein and β-lactoglobulin
RBP             1 MKWVWALLLLAAWAAAERDCRVSSFRVKENFDKARFSGTWYAMAKKDPEG 50
lactoglobulin   1 ...MKCLLLALALTCGAQALIVT..QTMKGLDIQKVAGTWYSLAMAASD. 44
RBP            51 LFLQDNIVAEFSVDETGQMSATAKGRVR.LLNNWD..VCADMVGTFTDTE 97
lactoglobulin  45 ISLLDAQSAPLRV.YVEELKPTPEGDLEILLQKWENGECAQKKIIAEKTK 93
RBP            98 DPAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAV...........QYSC 136
lactoglobulin  94 IPAVFKIDALNENKVL........VLDTDYKKYLLFCMENSAEPEQSLAC 135
RBP           137 RLLNLDGTCADSYSFVFSRDPNGLPPEAQKIVRQRQ.EELCLARQYRLIV 185
lactoglobulin 136 QCLVRTPEVDDEALEKFDKALKALPMHIRLSFNPTQLEEQCHI....... 178
135Pairwise alignment of retinol-binding protein from human (top) and rainbow trout (O. mykiss)
human   1 .MKWVWALLLLA.AWAAAERDCRVSSFRVKENFDKARFSGTWYAMAKKDP 48
trout   1 MLRICVALCALATCWA...QDCQVSNIQVMQNFDRSRYTGRWYAVAKKDP 47
human  49 EGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWDVCADMVGTFTDTED 98
trout  48 VGLFLLDNVVAQFSVDESGKMTATAHGRVIILNNWEMCANMFGTFEDTPD 97
human  99 PAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAVQYSCRLLNLDGTCADS 148
trout  98 PAKFKMRYWGAASYLQTGNDDHWVIDTDYDNYAIHYSCREVDLDGTCLDG 147
human 149 YSFVFSRDPNGLPPEAQKIVRQRQEELCLARQYRLIVHNGYCDGRSERNLL 199
trout 148 YSFIFSRHPTGLRPEDQKIVTDKKKEICFLGKYRRVGHTGFCESS...... 192
136Pairwise sequence alignment allows us to look back billions of years ago (BYA)
[Timeline figure, 4 to 0 BYA; labeled events: origin of life, earliest fossils, origin of eukaryotes, eukaryote/archaea, plant/animal, fungi/animal, insects]
Page 48
137Multiple sequence alignment of glyceraldehyde 3-phosphate dehydrogenases
fly        GAKKVIISAP SAD.APM..F VCGVNLDAYK PDMKVVSNAS CTTNCLAPLA
human      GAKRVIISAP SAD.APM..F VMGVNHEKYD NSLKIISNAS CTTNCLAPLA
plant      GAKKVIISAP SAD.APM..F VVGVNEHTYQ PNMDIVSNAS CTTNCLAPLA
bacterium  GAKKVVMTGP SKDNTPM..F VKGANFDKY. AGQDIVSNAS CTTNCLAPLA
yeast      GAKKVVITAP SS.TAPM..F VMGVNEEKYT SDLKIVSNAS CTTNCLAPLA
archaeon   GADKVLISAP PKGDEPVKQL VYGVNHDEYD GE.DVVSNAS CTTNSITPVA
fly        KVINDNFEIV EGLMTTVHAT TATQKTVDGP SGKLWRDGRG AAQNIIPAST
human      KVIHDNFGIV EGLMTTVHAI TATQKTVDGP SGKLWRDGRG ALQNIIPAST
plant      KVVHEEFGIL EGLMTTVHAT TATQKTVDGP SMKDWRGGRG ASQNIIPSST
bacterium  KVINDNFGII EGLMTTVHAT TATQKTVDGP SHKDWRGGRG ASQNIIPSST
yeast      KVINDAFGIE EGLMTTVHSL TATQKTVDGP SHKDWRGGRT ASGNIIPSST
archaeon   KVLDEEFGIN AGQLTTVHAY TGSQNLMDGP NGKP.RRRRA AAENIIPTST
fly        GAAKAVGKVI PALNGKLTGM AFRVPTPNVS VVDLTVRLGK GASYDEIKAK
human      GAAKAVGKVI PELNGKLTGM AFRVPTANVS VVDLTCRLEK PAKYDDIKKV
plant      GAAKAVGKVL PELNGKLTGM AFRVPTSNVS VVDLTCRLEK GASYEDVKAA
bacterium  GAAKAVGKVL PELNGKLTGM AFRVPTPNVS VVDLTVRLEK AATYEQIKAA
yeast      GAAKAVGKVL PELQGKLTGM AFRVPTVDVS VVDLTVKLNK ETTYDEIKKV
archaeon   GAAQAATEVL PELEGKLDGM AIRVPVPNGS ITEFVVDLDD DVTESDVNAA
Page 49
138Multiple sequence alignment of human lipocalin paralogs
lipocalin 1                  EIQDVSGTWYAMTVDREFPEMNLESVTPMTLTTL.GGNLEAKVTM
odorant-binding protein 2a   LSFTLEEEDITGTWYAMVVDKDFPEDRRRKVSPVKVTALGGGNLEATFTF
progestagen-assoc. endo.     TKQDLELPKLAGTWHSMAMATNNISLMATLKAPLRVHITSEDNLEIVLHR
apolipoprotein D             VQENFDVNKYLGRWYEIEKIPTTFENGRCIQANYSLMENGNQELRADGTV
retinol-binding protein      VKENFDKARFSGTWYAMAKDPEGLFLQDNIVAEFSVDETGNWDVCADGTF
neutrophil gelatinase-ass.   LQQNFQDNQFQGKWYVVGLAGNAI.LREDKDPQKMYATIDKSYNVTSVLF
prostaglandin D2 synthase    VQPNFQQDKFLGRWFSAGLASNSSWLREKKAALSMCKSVDGGLNLTSTFL
alpha-1-microglobulin        VQENFNISRIYGKWYNLAIGSTCPWMDRMTVSTLVLGEGEAEISMTSTRW
complement component 8       PKANFDAQQFAGTWLLVAVGSACRFLQRAEATTLHVAPQGSTFRKLD...
Page 49