Title: Paramvir S Dehal, Jeffrey L. Boore
1 A phylogenomic gene cluster resource: the
Phylogenetically Inferred Groups (PhIGs) database
- Paramvir S Dehal, Jeffrey L. Boore
Presented By Jared Carter
2 Principal Objectives
- Define and add context to the growing need for
better methodology for high-throughput sorting of
genes into orthologous families.
- Outline the steps involved in the computational
framework for the identification of these sets.
- Examine model output data.
- Analyze overall model effectiveness.
3 Introduction
- Increasingly large whole-genome projects require
further methodologies to increase the overall
contextual value of the data set.
- Delving into the evolutionary history of each
gene leads to a more robust understanding of its
function.
- Gives definition to the evolution of the genome
and species.
- Answers questions regarding the functional and
biochemical processes of the genome.
- Previously, homologs identified by pairwise
comparisons have been used, with many drawbacks:
  - Incorrect assignments caused by gene-duplication
  events
  - Accelerated rates of amino acid (AA) substitution
  - Domain shuffling
4 Methodology Contextual Background
- Uses known evolutionary relationships to
construct gene clusters of queried genomes.
- Analyzes each cluster for evolutionary
relationships between the queried genes.
- Goal is to reconstruct the evolutionary history
of each family.
- Allows for whole genome analysis with emphasis on
the evolutionary background of each gene.
- Can discriminate and classify numerous additional
types of evolutionary events (gene duplications,
AA substitution rates, and gene loss).
5 Methodology Overview
- Creates and populates a relational database with
all known annotations for each taxon.
  - All available sequence data and annotations are
  included.
  - Data such as previous sequence alignments and
  trees are also included.
- General process involves 5 steps:
  - Step 1: All-against-all BLASTp
  - Step 2: Global alignment and distance calculation
  - Step 3: Hierarchical clustering
  - Step 4: Multiple Sequence Alignment (MSA)
  - Step 5: Gene tree construction
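The relational store described at the top of this slide could look roughly like the sketch below, which builds an in-memory SQLite database. The table names, column names, and helper functions (`build_db`, `add_gene`, `annotations_for`) are hypothetical and are not taken from the PhIGs schema; the sketch only shows per-taxon genes with attached annotations.

```python
import sqlite3

# Hypothetical schema: names are illustrative assumptions,
# not the actual PhIGs database layout.
SCHEMA = """
CREATE TABLE taxon (
    taxon_id INTEGER PRIMARY KEY,
    name     TEXT NOT NULL UNIQUE
);
CREATE TABLE gene (
    gene_id  INTEGER PRIMARY KEY,
    taxon_id INTEGER NOT NULL REFERENCES taxon(taxon_id),
    name     TEXT NOT NULL,
    protein  TEXT NOT NULL
);
CREATE TABLE annotation (
    gene_id INTEGER NOT NULL REFERENCES gene(gene_id),
    source  TEXT NOT NULL,
    note    TEXT NOT NULL
);
"""


def build_db():
    """Create an empty in-memory database with the sketch schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def add_gene(conn, taxon, name, protein, annotations=()):
    """Insert one gene, creating its taxon row if needed."""
    conn.execute("INSERT OR IGNORE INTO taxon(name) VALUES (?)", (taxon,))
    taxon_id = conn.execute(
        "SELECT taxon_id FROM taxon WHERE name = ?", (taxon,)
    ).fetchone()[0]
    cur = conn.execute(
        "INSERT INTO gene(taxon_id, name, protein) VALUES (?, ?, ?)",
        (taxon_id, name, protein),
    )
    gene_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO annotation(gene_id, source, note) VALUES (?, ?, ?)",
        [(gene_id, source, note) for source, note in annotations],
    )
    return gene_id


def annotations_for(conn, gene_name):
    """Return all (source, note) annotations stored for a gene name."""
    rows = conn.execute(
        "SELECT a.source, a.note FROM annotation a "
        "JOIN gene g ON g.gene_id = a.gene_id "
        "WHERE g.name = ? ORDER BY a.source",
        (gene_name,),
    )
    return rows.fetchall()
```

Later pipeline stages (pairwise distances, clusters, trees) would be further tables keyed on `gene_id`; they are left out to keep the sketch short.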
6 Methodology Workflow Diagram
Figure 1: Workflow diagram illustrating the analysis
pipeline for processing selected gene models. [1]
7 BLASTp and Global Alignment
- All-against-all BLASTp generates local alignments,
which must then be processed into global
alignments.
- ClustalW is used to align each protein pair.
- The resulting alignments are stored in the PhIGs
database.
- Distances are calculated using the JTT matrix as
implemented in the Protdist program (PHYLIP).
- These distances are used later in the clustering
process, in conjunction with the gap-free
alignment lengths.
Figure 2: BLASTp query output excerpt. [2]
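To make the two quantities carried forward from this step concrete, here is a minimal sketch that takes one already-aligned protein pair and returns its gap-free alignment length and a distance. The distance used here is a Poisson-corrected p-distance, a deliberately simple stand-in and not the JTT-based distance named on the slide; the function names are hypothetical.

```python
import math


def gap_free_columns(seq_a, seq_b, gap="-"):
    """Return the aligned column pairs in which neither sequence has a gap."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    return [(a, b) for a, b in zip(seq_a, seq_b) if a != gap and b != gap]


def pairwise_distance(seq_a, seq_b):
    """Return (gap_free_length, distance) for one aligned protein pair.

    Assumption: the distance is a Poisson-corrected p-distance,
    d = -ln(1 - p), used only as a simple substitute for a
    JTT-matrix distance.
    """
    cols = gap_free_columns(seq_a, seq_b)
    n = len(cols)
    if n == 0:
        return 0, math.inf  # nothing aligned, distance undefined
    p = sum(a != b for a, b in cols) / n  # fraction of differing sites
    if p == 0:
        return n, 0.0
    if p >= 1.0:
        return n, math.inf  # saturated, correction diverges
    return n, -math.log(1.0 - p)
```

For example, `pairwise_distance("MKT-LV", "MKSALV")` sees five gap-free columns with one mismatch, so it returns a length of 5 and a distance of `-ln(0.8)`.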
8 Gene Cluster Analysis
- The process takes as input the known evolutionary
relationships of the selected organisms and all
pairwise protein distances.
- It uses an iterative approach, starting at the base
of the best known evolutionary tree, at the common
ancestral gene, and extending upwards.
- At each bifurcating node, two new clades, A and
B, are created, with the remaining taxa being
assigned to an out-group.
- Genes within clade A are more similar to each other
than to genes in clade B.
- Similarly, genes from A and B are more closely
related to each other than to those in the
out-group.
- This is accomplished by comparing seeds (pairs of
sequences) and assigning a match quality.
- Clusters grow by determining the shortest
distance between A and B.
- If a protein has a shorter distance than the A-B
distance, it is added to the cluster.
- This is repeated until all proteins have been
clustered.
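The seed-and-grow idea above can be sketched for a single bifurcating node as follows. This is one possible reading of the slide, not the PhIGs implementation: a seed is taken to be an A-B pair that is mutually closest across the two clades and closer than either gene is to any out-group gene, and a cluster grows by adding any in-group gene whose distance to a current member is shorter than the seed's A-B distance. The function names and the distance-dictionary layout are assumptions.

```python
from itertools import product


def dist(d, x, y):
    """Look up a symmetric distance stored under either key order."""
    return d[(x, y)] if (x, y) in d else d[(y, x)]


def find_seeds(clade_a, clade_b, outgroup, d):
    """Return A-B pairs that are mutual nearest neighbours across the two
    clades and closer to each other than either is to any out-group gene."""
    seeds = []
    for a, b in product(clade_a, clade_b):
        ab = dist(d, a, b)
        best_for_a = min(dist(d, a, x) for x in clade_b)
        best_for_b = min(dist(d, x, b) for x in clade_a)
        nearest_out = min(
            (dist(d, g, o) for g in (a, b) for o in outgroup),
            default=float("inf"),
        )
        if ab == best_for_a and ab == best_for_b and ab < nearest_out:
            seeds.append((a, b))
    return seeds


def grow_cluster(seed, candidates, d):
    """Grow a cluster from a seed pair.

    A candidate joins when its distance to any current member is shorter
    than the seed's A-B distance; passes repeat until nothing is added.
    """
    threshold = dist(d, *seed)
    cluster = set(seed)
    changed = True
    while changed:
        changed = False
        for gene in sorted(candidates - cluster):
            if any(dist(d, gene, member) < threshold for member in cluster):
                cluster.add(gene)
                changed = True
    return cluster
```

With a recent paralog `a2` sitting close to `a1`, the seed `(a1, b1)` pulls `a2` into the cluster, while a divergent gene whose distances all exceed the A-B distance stays out.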
9 Protein Distance Map and Cluster Tree
Figure 3: Protein distance map of a pairwise
alignment. [1]
Figure 4: Species tree generated by cluster
analysis. [1]
10 MSA and Phylogenetic Tree Extrapolation
- An MSA is performed on each cluster using
ClustalW.
- Alignments are trimmed to remove gaps, or discarded
if fewer than 100 AA are aligned successfully.
- Trees are generated using a precise method:
  - The quartet puzzling maximum likelihood method,
  implemented by TREE-PUZZLE using the JTT model.
- The generated tree is assisted by the known
evolutionary relationships, further defining the
phylogeny.
- MSAs are also used to build Hidden Markov Models,
which give better structural representation to
sparsely sampled genomes.
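The trimming rule above can be illustrated with a short sketch. It assumes that "removing gaps" means dropping every alignment column that contains a gap in at least one sequence, and that the 100 AA threshold applies to the number of columns left; both are readings of the slide rather than confirmed details, and the function names are hypothetical.

```python
MIN_ALIGNED = 100  # threshold stated on the slide


def trim_gapped_columns(msa, gap="-"):
    """Drop every column that contains a gap in at least one sequence.

    `msa` maps sequence names to aligned strings of equal length.
    """
    names = list(msa)
    rows = [msa[name] for name in names]
    if len({len(row) for row in rows}) > 1:
        raise ValueError("aligned sequences must have equal length")
    kept = [col for col in zip(*rows) if gap not in col]
    if kept:
        trimmed = ["".join(chars) for chars in zip(*kept)]
    else:
        trimmed = [""] * len(rows)
    return dict(zip(names, trimmed))


def filter_alignment(msa, min_aligned=MIN_ALIGNED):
    """Return the trimmed alignment, or None if too few columns survive."""
    trimmed = trim_gapped_columns(msa)
    length = len(next(iter(trimmed.values()), ""))
    return trimmed if length >= min_aligned else None
```

An alignment that keeps at least `min_aligned` gap-free columns passes through trimmed; anything shorter returns `None` and would be dropped before tree construction.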
11 Phylogenetic Tree
Figure 5: Phylogenetic tree incorporating the
cluster tree and known phylogenetic data.
Illustrates numerous duplication events as well
as the proportionality of AA substitutions
(represented by varying branch lengths). [1]
12 Gene View, Annotation and Synteny Maps
- Gene View
  - All known annotations for a gene can be displayed,
  as well as its location on the chromosome in all
  selected species.
- Querying
  - The database can be searched for exact and
  related sequences, or for text matches in search
  fields.
  - Several specialized search tools exist, such as
  HMMER, which searches directly against Hidden
  Markov Models.
- Synteny Maps
  - Produce a set of one-to-one orthologs plotted in
  their relative locations across multiple genomes.
  - Performed by selecting a reference sequence span
  in one species and the species to be queried.
13 Synteny Map
Figure 6: Synteny map of a selected sequence, with
rectangles representing genes and orthologs shown
to the left and right. Note: black lines indicate
similar transcriptional orientation; red lines
indicate inversion. [1]
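The line colouring described in the caption can be sketched as a small data transformation. The version below assumes each gene carries a start coordinate and a strand, and that orientation is judged simply by comparing the strands of a reference gene and its one-to-one ortholog. The `Gene` type, the `synteny_links` function, and the ortholog mapping are hypothetical names for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    name: str
    start: int
    strand: str  # "+" or "-"


def synteny_links(reference, query, orthologs):
    """Return (ref_name, query_name, colour) for each one-to-one ortholog
    pair, ordered by position along the reference span.

    Colour follows the figure legend: black for the same transcriptional
    orientation, red for an inversion.
    """
    query_by_name = {gene.name: gene for gene in query}
    links = []
    for ref in sorted(reference, key=lambda gene: gene.start):
        partner = orthologs.get(ref.name)
        if partner is None or partner not in query_by_name:
            continue  # no one-to-one ortholog in the queried species
        same = ref.strand == query_by_name[partner].strand
        links.append((ref.name, partner, "black" if same else "red"))
    return links
```

A plotting layer would then draw one rectangle per gene and one line per link in the returned colour.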
14 Model Effectiveness Overview
- Very efficient at analyzing evolutionary patterns
within whole genomes.
- Assists in transferring annotations.
- Identifies sources of evolutionary pressure on
genes.
- Utilizes the applicable known evolutionary
history of the genomes studied.
- Creates clusters as descendants of one
ancestral gene.
- Combines phylogenetic data with functional
annotation, gene structure and genomic position.
- Multiple, varied applications:
  - Organismal phylogeny construction
  - Genome evolution via gene duplications
  - Gene structure evolution via gain/loss of exons,
  introns and domains
  - Identification of gene family expansions and
  losses
  - Genome evolution as a whole
15 Conclusions
- With continuing development, this highly modular
tool, which is well founded in current analytical
molecular methodology, will prove to be quite
useful in adding historical context to genome
data.
- Understanding the history of a gene allows for a
better understanding of that gene's function in the
organism by relating it to other, similar
organisms which have also conserved its function.
16 Resources
- [1] Dehal, Paramvir S., Boore, Jeffrey L., A
phylogenomic gene cluster resource: the
Phylogenetically Inferred Groups (PhIGs)
database, BMC Bioinformatics, 2006, 7:201
- [2] NCBI, BLASTp protein-protein query,
http://www.ncbi.nlm.nih.gov/BLAST/Blast.cgi