Title: Genome Analysis II: Comparative Genomics
1 Genome Analysis II: Comparative Genomics
- Jiangbo Miao
- Apr. 25, 2002
- CISC889-02S Bioinformatics
2 Why Comparative Genomics?
- It tells us what is common and what is unique between different species at the genome level.
- Genome comparison may be the surest and most reliable way to identify genes and predict their functions and interactions.
  - e.g., to distinguish orthologs from paralogs
- The functions of human genes and other DNA regions can be revealed by studying their counterparts in lower organisms.
3 Outline
- All-against-all Self-comparison of Proteome
- Between-proteome Comparisons
- Family and Domain Analysis
- Ancient Conserved Regions (ACRs)
- Horizontal Gene Transfer
- Functional Classification of Genes
- Gene-order Comparisons
4 All-against-all Self-comparison
- How?
  - Make a database of the proteome
  - Use each protein as a query in a similarity search against the database (BLAST, WU-BLAST or FASTA)
  - Generate a matrix of alignment scores (P or E values)
  - Apply a conservative cutoff, e.g., E value < 10^-6
- Why?
  - Number of gene families: this comparison distinguishes unique proteins from proteins that have arisen by gene duplication, and also reveals the number of gene families.
  - Paralogs: significantly matched pairs of protein sequences may be paralogs.
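A minimal sketch of the "How?" steps above, assuming the similarity search has already been run and its hits are available as (query, subject, E value) tuples. The tuple layout, the function name `significant_pairs`, and the toy hits are illustrative assumptions, not the output of a real BLAST run.

```python
from typing import Iterable


def significant_pairs(
    hits: Iterable[tuple[str, str, float]], cutoff: float = 1e-6
) -> dict[frozenset, float]:
    """Keep the lowest E value for each unordered pair of distinct proteins below the cutoff."""
    best: dict[frozenset, float] = {}
    for query, subject, evalue in hits:
        if query == subject or evalue >= cutoff:
            continue  # skip self-hits and non-significant matches
        pair = frozenset((query, subject))
        best[pair] = min(evalue, best.get(pair, evalue))
    return best


# Hypothetical hits, for illustration only.
toy_hits = [
    ("P1", "P1", 0.0),
    ("P1", "P2", 1e-30),
    ("P2", "P1", 1e-28),
    ("P1", "P3", 0.5),
    ("P4", "P4", 0.0),
]
pairs = significant_pairs(toy_hits)

# Proteins with no significant partner are candidates for "unique" proteins.
all_proteins = {p for q, s, _ in toy_hits for p in (q, s)}
unique = all_proteins - {p for pair in pairs for p in pair}
```

In this toy data, P1 and P2 form a candidate paralog pair, while P3 and P4 have no significant partner.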
5 All-against-all Comparison Example
6 Cluster Analysis
- Purpose: to sort out relationships among all of the proteins found to be related in the above search.
- Clustering organizes the proteins into groups by some objective criterion
  - P or E value (< 0.01-0.05)
  - Distance between each pair of sequences in a multiple sequence alignment (number of amino acid changes between the aligned sequences)
- Methods
  - By making subgraphs
  - By single linkage
7 Clustering by making subgraphs
- Each protein sequence is a vertex
- Each matched pair of sequences with a significant score is joined by an edge
- The edges are weighted according to the P/E value
- Simple algorithm: remove the weaker links, starting from the weakest one
- Rubin et al. (2000)
  - Edges with E value > 10^-6 are removed
  - Remaining subgraphs comprise sequences that share a significant relationship to each other but not to other sequences
  - Criterion: the members of a group should mutually share > 2/3 of all of the edges from this group to all proteins in the proteome
  - This algorithm favors the selection of proteins with the same domain structure, reflecting that these proteins are most probably paralogs
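The edge-removal step can be sketched as follows, assuming the graph is given as (protein, protein, E value) triples. This only drops edges above the cutoff and collects the remaining connected subgraphs; it does not implement the 2/3 shared-edge criterion, and the function name and toy edges are hypothetical.

```python
def subgraph_clusters(edges, cutoff=1e-6):
    """Drop edges with E value above the cutoff, then return the connected subgraphs."""
    graph = {}
    for a, b, evalue in edges:
        # Register both vertices even if the edge itself is removed.
        graph.setdefault(a, set())
        graph.setdefault(b, set())
        if evalue <= cutoff:
            graph[a].add(b)
            graph[b].add(a)

    seen, clusters = set(), []
    for start in graph:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:  # depth-first walk over the surviving edges
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        clusters.append(component)
    return clusters


# Hypothetical weighted edges, for illustration only.
toy_edges = [("A", "B", 1e-40), ("B", "C", 1e-12), ("C", "D", 1e-3), ("D", "E", 1e-9)]
clusters = subgraph_clusters(toy_edges)
```

Removing the weak C-D edge splits the toy graph into two subgraphs, {A, B, C} and {D, E}.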
8 Clustering by making subgraphs: Example
9 Clustering by single linkage
- Based on the distance criterion
- A group of related sequences found in the all-against-all proteome comparison is subjected to a multiple sequence alignment (CLUSTALW)
- A distance matrix is made
- This matrix is used to cluster the sequences by a neighbor-joining algorithm (the same procedure as that used to make a phylogenetic tree)
- Cluster representation: tree or dendrogram
- As smaller groups are chosen, the most strongly supported clusters are more likely to be made up of paralogs (?)
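To illustrate clustering from a distance matrix, here is a naive single-linkage sketch. It is a simpler stand-in, not the neighbor-joining procedure named above: clusters are merged while the closest pair of members across two clusters lies within a threshold. The distances, names, and threshold are made up for the example.

```python
def single_linkage(names, dist, threshold):
    """Merge clusters while the closest cross-cluster pair is within the threshold."""
    clusters = [{name} for name in names]

    def gap(c1, c2):
        # Single linkage: cluster distance is the smallest member-to-member distance.
        return min(dist[frozenset((a, b))] for a in c1 for b in c2)

    while len(clusters) > 1:
        d, i, j = min(
            (gap(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if d > threshold:
            break
        clusters[i] |= clusters[j]
        del clusters[j]
    return clusters


# Hypothetical pairwise distances (e.g., fraction of differing aligned residues).
toy_dist = {
    frozenset("ab"): 0.1, frozenset("ac"): 0.4, frozenset("ad"): 0.9,
    frozenset("bc"): 0.3, frozenset("bd"): 0.8, frozenset("cd"): 0.7,
}
groups = single_linkage(["a", "b", "c", "d"], toy_dist, threshold=0.35)
```

Lowering the threshold yields smaller, more strongly supported groups, which is the sense in which smaller clusters are more likely to consist of paralogs.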
10 Clustering by single linkage: Example
11 Core Proteome
- All-against-all comparison reveals the number of protein/gene families in an organism.
- This number represents the core proteome of the organism, from which all biological functions have diversified.
Organism                    No. of genes   No. of gene families   No. of duplicated genes
H. influenzae (bacterium)   1709           1425                   284
S. cerevisiae (yeast)       6241           4383                   1858
C. elegans (worm)           18,424         9453                   8971
D. melanogaster (fly)       13,600         8065                   5536
- In Haemophilus, 1247 out of 1709 proteins do not have paralogs
- The core proteome of the multicellular organisms is only about twice that of yeast
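The "about twice that of yeast" statement can be checked directly from the gene-family counts in the table; the short calculation below uses only those numbers.

```python
# Gene-family counts (core proteome sizes) taken from the table above.
families = {"yeast": 4383, "worm": 9453, "fly": 8065}

# Ratio of each multicellular core proteome to the yeast core proteome.
ratios = {
    organism: round(count / families["yeast"], 2)
    for organism, count in families.items()
    if organism != "yeast"
}
```

The worm ratio comes out slightly above 2 and the fly ratio slightly below 2.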
13 Between-Proteome Comparisons: Why?
- To identify orthologs, gene families, and domains
- Orthologs (proteins that share a common ancestry and function)
  - A pair of proteins in two organisms that align along most of their lengths with a highly significant alignment score.
  - These proteins perform the core biological functions shared by the two organisms.
- Two matched sequences (X in A, Y in B) may not be orthologs (e.g., Y and Z are paralogs in B, and X and Z are orthologs)
- Identifying true orthologs
  - Highest-scoring match (best hit)
  - E value < 0.01
  - Alignment over > 60% of both proteins
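The three criteria for a candidate ortholog can be sketched as a filter on the hits of one query protein. The `Hit` record, its field names, and the toy values are assumptions made for illustration; a real search tool reports these quantities in its own format.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    subject: str
    score: float
    evalue: float
    query_cov: float    # fraction of the query inside the alignment
    subject_cov: float  # fraction of the subject inside the alignment


def candidate_ortholog(hits, max_evalue=0.01, min_cov=0.60):
    """Return the best-scoring subject if it passes the E value and coverage filters, else None."""
    if not hits:
        return None
    best = max(hits, key=lambda h: h.score)
    passes = best.evalue < max_evalue and min(best.query_cov, best.subject_cov) > min_cov
    return best.subject if passes else None


# Hypothetical hits for one query protein.
toy = [Hit("W1", 250.0, 1e-40, 0.90, 0.85), Hit("W2", 90.0, 1e-5, 0.40, 0.70)]
result = candidate_ortholog(toy)
```

A best hit that covers only part of either protein is rejected, since it may reflect a shared domain rather than orthology.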
14 Between-Proteome Comparisons: How?
1. Choose a yeast protein and perform a database similarity search of the worm proteome with WU-BLAST (a yeast-versus-worm search)
2. Group the worm sequences that match the yeast query sequence with a highly significant P value (10^-10 to 10^-100); also include the yeast query sequence in the group
3. From the group made in step 2, choose a worm sequence and search the yeast proteome, using the same P limit
4. Add any matching yeast sequences to the group made in step 2
5. Repeat steps 3 and 4 for all initially matched sequences in the group
6. Repeat steps 1-5 for every yeast protein
7. As in steps 1-6, perform a comparable worm-versus-yeast search
8. Coalesce the groups of related sequences and remove any redundancies so that every sequence is represented only once
9. Eliminate any matched pairs in which less than 80% of each sequence is in the alignment
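A rough sketch of the grouping and coalescing steps, assuming the significant matches have already been computed and stored as dictionaries mapping each protein to the set of proteins it matches in the other proteome. The worm-versus-yeast pass and the 80% alignment filter are omitted, and all names and data here are hypothetical.

```python
def grow_group(seed, yeast_vs_worm, worm_vs_yeast):
    """Build one group from a yeast query: its worm matches, then their yeast matches."""
    worm_members = set(yeast_vs_worm.get(seed, ()))
    yeast_members = {seed}
    for worm in worm_members:
        yeast_members |= set(worm_vs_yeast.get(worm, ()))
    return yeast_members, worm_members


def coalesce(groups):
    """Merge groups that share any sequence so that each sequence appears only once."""
    merged = []
    for yeast, worm in groups:
        yeast, worm = set(yeast), set(worm)
        untouched = []
        for other_yeast, other_worm in merged:
            if other_yeast & yeast or other_worm & worm:
                yeast |= other_yeast
                worm |= other_worm
            else:
                untouched.append((other_yeast, other_worm))
        merged = untouched + [(yeast, worm)]
    return merged


# Hypothetical significant matches between the two proteomes.
yeast_vs_worm = {"Y1": {"W1", "W2"}, "Y2": {"W2"}, "Y3": {"W3"}}
worm_vs_yeast = {"W1": {"Y1"}, "W2": {"Y1", "Y2"}, "W3": {"Y3"}}

raw = [grow_group(y, yeast_vs_worm, worm_vs_yeast) for y in sorted(yeast_vs_worm)]
final_groups = coalesce(raw)
```

Here the groups seeded by Y1 and Y2 overlap and are merged into one, while Y3 and W3 stay in a group of their own.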
15 Between-Proteome Comparison: Result
Cut-off P value                                            < 10^-10     < 10^-20     < 10^-50    < 10^-100
No. of sequence groups                                     1171         984          552         236
No. of groups with > 2 members                             560          442          230         79
No. (%) of all yeast proteins (6217) represented in groups 2697 (40%)   1848 (30%)   888 (14%)   330 (5%)
No. (%) of all worm proteins represented in groups         3653 (19%)   2497 (13%)   1094 (6%)   370 (2%)
- The sequences also align over 80% of their lengths, so they represent highly conserved sets of genes
16 Clusters of Orthologous Groups (COGs)
- Motivation
  - In the above database search, a protein sequence will match not only the orthologous sequence in the second proteome, but also the paralogs of that orthologous sequence.
- Objective
  - To identify all matching proteins as an orthologous group related by both speciation (ortholog) and gene duplication (paralog) events.
- Meaning
  - COGs usually correspond to classes of metabolic function
- Application (example)
  - Produce a COG database by analysis of microbial and yeast genomes
  - Search a newly identified microbial protein against this database
  - A significant match will provide an indication of its metabolic function
17 Comparison of Proteome to EST database
- Why?
  - For many (eukaryotic) organisms, the complete genome sequence is not available, while a large collection of EST sequences is available
  - An EST database of an organism can also be analyzed for the presence of gene families, orthologs, and paralogs.
  - e.g., a protein from the yeast or fly proteome can be used as a query of a human EST database (the EST sequences are translated in all six possible reading frames)
- Problem
  - EST sequences are usually short (the equivalent of 100-150 amino acids)
- Solution
  - Identify overlapping EST sequences so that a longer alignment can be produced
  - Perform an exhaustive search for a protein family
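Translating an EST in all six reading frames can be sketched with the standard genetic code: three frames on the given strand and three on its reverse complement. The function names are illustrative, and codons containing ambiguous bases are translated as "X" by assumption.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))


def six_frames(est):
    """Return the three forward-strand frames followed by the three reverse-complement frames."""
    est = est.upper()
    reverse = est.translate(COMPLEMENT)[::-1]
    return [translate(strand[offset:]) for strand in (est, reverse) for offset in range(3)]


frames = six_frames("ATGGCCTAA")
```

Search programs such as TFASTY perform this kind of translation internally, so a protein query can be compared against nucleotide ESTs.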
18 Search for orthologs to a protein family in EST database
- Retief et al. (1999): FAST-PAN scans an EST database with multiple queries from a protein family, sorts the alignment scores, and produces charts and alignments of the matches found.
- Example
  - Protein family: glutathione transferase proteins
  - Mammalian EST database
  - TFASTY3 search system
  - Shown are matches of two mouse ESTs to a query sequence
19 Search for orthologs to a protein family in EST database
- A large number of known glutathione transferase proteins were first subjected to multiple sequence alignment, and a phylogenetic tree was made to identify classes of proteins within the family
- The object was to choose class representatives
[Figure: flow chart of the procedure; panel labels: class, search, result]
21 Family and Domain Analysis
- What is a domain?
  - Proteins are modular and often comprise separate domains
  - Domains represent modules of structure and function
- Domain comparison
  - Comparison of the domain content of one proteome with that of another reveals the biological roles of diverse domains in different organisms.
- Example: an analysis of the fly, worm, and yeast proteomes
  - 744 families and domains were common to all three organisms
  - > 2000 fly and worm proteins are multidomain proteins (1/3 in yeast)
22 Ancient Conserved Regions (ACRs)
- What is an ACR?
  - A protein or protein domain that is found in phylogenetically diverse groups of organisms and has been conserved over long periods of evolutionary time.
- How to find ACRs?
  - Database similarity search of the SwissProt database with human, worm, yeast and E. coli genes
  - Identify matches with sequences from a different phylum than the query sequence
  - The number of ACRs may be estimated from the proportion of genes that match database sequences of known function
  - e.g., 70% prokaryotic genomes contain ACRs
23 Horizontal Gene Transfer
- Horizontal transfer (HT)
  - The acquisition of genetic material from a different organism; the transferred material then becomes a permanent addition to the recipient
  - HT is a significant source of genome variation for bacteria
  - Comparisons of bacterial genomes reveal that they are mosaics of ancestral (vertical) and horizontally transferred sequences
  - 12.8% of the E. coli genome is due to HT DNA (the highest level)
- How to detect HT?
  - Fact: the genome of each bacterial species has a unique base composition
  - HT can be detected as an island of sequence with a different composition
  - If the amino acid composition of transferred genes is typical, these islands may be detected by a codon usage analysis
  - The time of the transfer may be estimated from the degree to which the island's composition has blended into that of the recipient
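Detection of compositional islands can be illustrated with a sliding-window GC-content scan. This is a minimal sketch using base composition only (not codon usage); the window size, step, tolerance, and the synthetic sequence are arbitrary assumptions chosen to make the example small.

```python
def gc_fraction(seq):
    """Fraction of G and C bases in a sequence."""
    return sum(base in "GC" for base in seq) / len(seq)


def atypical_windows(genome, window=1000, step=500, tolerance=0.10):
    """Return (start, GC fraction) for windows that deviate from the genome-wide GC content."""
    genome = genome.upper()
    overall = gc_fraction(genome)
    flagged = []
    for start in range(0, len(genome) - window + 1, step):
        gc = gc_fraction(genome[start:start + window])
        if abs(gc - overall) > tolerance:
            flagged.append((start, gc))
    return flagged


# Synthetic genome: an AT-rich background with one GC-rich 500 bp island at position 2000.
toy_genome = "AT" * 1000 + "GC" * 250 + "AT" * 1000
islands = atypical_windows(toy_genome, window=500, step=500, tolerance=0.25)
```

An older transfer would be expected to show a smaller deviation from the genome-wide value, since its composition has had more time to blend in.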
25 Functional Classification of Genes
- Genes that are significantly similar within an organism, i.e., paralogous sequences, frequently are found to have a related biological function.
- Classification scheme
  - Eight related groups of E. coli genes: enzymes, transport elements, regulators, membranes, structural elements, protein factors, leader peptides, and carriers.
  - 90% of E. coli genes fell into these same broad categories
  - Specialized commissions, e.g., the Enzyme Commission of the IUBMB, provide detailed classes of enzymes based on the biochemical reactions they catalyze
- Examine relationships among multiple enzymes that perform the same biochemical function in the same organism (these enzymes showed variations in the metabolic regulation of their activity)
27 Gene Order Comparison
- Observations about gene order
  - Gene order is highly conserved in closely related species but is changed by rearrangements over evolutionary time
  - Groups of genes that have a similar biological function tend to remain localized in a group or cluster
- Chromosomal rearrangement
  - Occasional chromosomal breaks (at random chromosomal locations)
  - Random rejoining of the fragments by a DNA repair mechanism
- Rearrangement analysis
  - By comparing the locations of orthologs
28 Chromosomal Rearrangement
29 Computational Analysis of Genome Rearrangements
- Challenges
  - The number and types of rearrangements that have occurred
  - When did they occur?
  - Example: a comparison of human and mouse chromosomes
- Computational approach
  - Genome alignment
  - Alignment reduction: reconstruct the number and types of rearrangements
30 Computational Analysis of Genome Rearrangement
- Human chromosomes were cut into > 100 pieces and reassembled into a reasonable facsimile of the mouse chromosomes.
31 Computational Analysis of Gene Rearrangement
[Figure: comparison of circular genomes]
- Lines indicate homologous positions
- The more rearrangements there are, the more intersections will occur
- Sankoff and Goldstein (1989) devised a shuffling model for estimating the number of rearrangements given the number of intersections.
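Counting intersections can be sketched for the simplified case of two linear gene orders drawn as parallel maps, where two connecting lines cross exactly when the two genes appear in opposite relative order. This is only the counting step under that simplifying assumption; it is not the Sankoff and Goldstein shuffling model and it ignores circularity. Gene names are hypothetical.

```python
def count_intersections(order_a, order_b):
    """Count gene pairs whose connecting lines cross between two parallel gene maps."""
    position_b = {gene: i for i, gene in enumerate(order_b)}
    ranks = [position_b[gene] for gene in order_a]
    # A crossing corresponds to a pair of genes in opposite relative order.
    return sum(
        1
        for i in range(len(ranks))
        for j in range(i + 1, len(ranks))
        if ranks[i] > ranks[j]
    )


# Genome B carries an inversion of the segment B-C-D relative to genome A.
crossings = count_intersections(list("ABCDE"), list("ADCBE"))
```

A single inversion of three genes already produces three crossings, so the crossing count grows quickly with the number and size of rearrangements.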
32 Computational Analysis of Gene Rearrangement
- Assume that the rearrangements have occurred by transposition or recombination events
- Identify the rearrangements by undoing those events
- The goal is to minimize the number of rearrangements, which represents a genetic distance between the two genome sequences
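Finding the true minimum number of rearrangements is a hard combinatorial problem, so the sketch below computes a simpler, related quantity instead: the number of breakpoints between two linear, unsigned gene orders, which gives a lower bound on the number of reversals needed. Restricting the events to reversals and ignoring gene orientation are assumptions of this example, not of the slide.

```python
def breakpoints(order, reference):
    """Count adjacencies in `order` that are not adjacencies in `reference`."""
    rank = {gene: i + 1 for i, gene in enumerate(reference)}
    # Frame the permutation with fixed ends so that moved end genes are counted too.
    perm = [0] + [rank[gene] for gene in order] + [len(reference) + 1]
    return sum(abs(a - b) != 1 for a, b in zip(perm, perm[1:]))


def min_reversals_lower_bound(order, reference):
    """One reversal cuts the genome at two points, so it repairs at most two breakpoints."""
    return -(-breakpoints(order, reference) // 2)  # ceiling division


bp = breakpoints(list("ADCBE"), list("ABCDE"))
bound = min_reversals_lower_bound(list("ADCBE"), list("ABCDE"))
```

For the toy orders, the two breakpoints flank the inverted B-C-D segment, and the bound of one reversal is exactly what is needed to restore the reference order.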
33 Clusters of Genes on Chromosomes
- In a given organism, genes are found in a given order that is maintained on the chromosomes.
- On the other hand, genes with a related function are frequently found to be clustered at one chromosomal location
- Example: tryptophan (trp) genes in different prokaryotic organisms
- Observations
  - At least some of the trp genes are also clustered together on the chromosomes of other species of Bacteria and Archaea
  - The order of genes within the cluster is conserved within the first four species (Bacteria)
  - The order is much less conserved in the last three species (Archaea)
  - Gene fusions occur, generating a new protein that performs both biochemical functions of the single-gene parent proteins
34 Clusters of Genes on Chromosomes
35 Cluster of Genes on Chromosomes
- How to identify those clusters of coordinately regulated genes?
- Overbeek et al. (1999)
  - Perform a full reciprocal search between the proteomes of two organisms
  - Protein pairs in which each protein is the best hit of the other in the other genome, with an E value < 10^-5, were identified; such a pair is called a bidirectional best hit (BBH)
  - Pairs of close BBHs (PCBBHs), i.e., BBHs whose genes are within 300 bp of each other on the chromosomes of the respective organisms and are transcribed from the same strand, as in a typical operon, were then identified
  - A score for these pairs was formulated: the score is higher when the number of organisms in which the pair is observed is greater and the phylogenetic distance between the organisms is larger
  - 40% of the higher-scoring pairs correspond to proteins that are known to act in a common metabolic pathway
- Conclusion: a significant proportion of the PCBBHs correspond to genes that have a related function and lie on the same pathway.
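The BBH and PCBBH definitions can be sketched as follows, assuming the best hit of every protein in each direction is already known and that gene coordinates are stored as (start, end, strand) tuples. The data layout, function names, and toy values are hypothetical, and the scoring across organisms is not implemented.

```python
def bidirectional_best_hits(best_a_to_b, best_b_to_a, max_evalue=1e-5):
    """Each input maps a gene to (its best hit in the other genome, E value)."""
    bbh = {}
    for gene_a, (gene_b, evalue) in best_a_to_b.items():
        back, back_evalue = best_b_to_a.get(gene_b, (None, None))
        if back == gene_a and evalue < max_evalue and back_evalue < max_evalue:
            bbh[gene_a] = gene_b
    return bbh


def close(gene1, gene2, max_gap=300):
    """True if two (start, end, strand) genes share a strand and lie within max_gap bp."""
    (_, end1, strand1), (start2, _, strand2) = sorted((gene1, gene2))
    return strand1 == strand2 and start2 - end1 <= max_gap


def pairs_of_close_bbh(bbh, coords_a, coords_b):
    """Pairs of BBHs whose genes are close in both genomes."""
    genes = sorted(bbh)
    return [
        (x, y)
        for i, x in enumerate(genes)
        for y in genes[i + 1:]
        if close(coords_a[x], coords_a[y]) and close(coords_b[bbh[x]], coords_b[bbh[y]])
    ]


# Hypothetical best hits and coordinates, for illustration only.
best_a_to_b = {"a1": ("b1", 1e-40), "a2": ("b2", 1e-30), "a3": ("b9", 1e-20)}
best_b_to_a = {"b1": ("a1", 1e-38), "b2": ("a2", 1e-25), "b9": ("a7", 1e-50)}
coords_a = {"a1": (100, 900, "+"), "a2": (950, 1800, "+")}
coords_b = {"b1": (5000, 5800, "-"), "b2": (5900, 6700, "-")}

bbh = bidirectional_best_hits(best_a_to_b, best_b_to_a)
pcbbh = pairs_of_close_bbh(bbh, coords_a, coords_b)
```

In the toy data, a3 is dropped because its best hit b9 points back to a different gene, and the remaining two BBHs form one PCBBH because their genes are adjacent and on the same strand in both genomes.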