Title: Bioinformatics and Evolutionary Genomics
1Bioinformatics and Evolutionary Genomics
2Request
- We have a small group
- that is also heterogeneous with respect to
previous knowledge
- PLEASE interrupt / ask questions when I am going
too fast, when I use jargon, or when I make
jumps/conclusions that seem obvious and 100%
logical to me but erratic to you. Please point
out my implicit assumptions regarding what
everybody knows.
3Lectures and computer exercises
- Homology, trees
- Genomic context, genome evolution, pathway
evolution
- HTP data
- Eukaryotic genome evolution, tree of life
- Exercises: basic abilities, plus an impression of
what is possible / how this type of research is
done (albeit on a larger scale)
4Literature Discussion
- Each (set of) article(s) will be introduced
(presentation) by 1 or 2 persons; the presentation
should last approximately half an hour, followed
by a discussion
- What to discuss
- What are the articles actually saying? What have
the authors done? (so that everybody knows)
- What does this mean in a larger context? (e.g. a
discussion of the discussion)
5Homology and Domains
6Gene / protein sequence evolution: what is
homology?
- Definition of homology (biology)
- Structures are said to be homologous if they are
alike because of shared ancestry.
- Classic example: arms, bird wings, bat wings
- Genes/proteins/stretches of DNA: sequence
similarity because they are derived from the same
ancestral sequence
- Instead of analogy, with sequences we have
convergence, but this is thought to be limited to
specific cases (e.g. coiled-coil, regulatory
motifs); with function, however, we do have
analogy, e.g. analogous enzymes
7Why are we interested in homology
- Function prediction: homologous proteins tend to
have similar functions
- Evolutionary dynamics: tracing the evolution of
genes (duplication, gene trees, origin of new
gene families)
8How do we detect homology
- Similarity of
- 3D structure: the most conserved aspect, yet not
all structures are available. Structures are
compared and classified by eye and by software
packages (Dali). (NB classical homology
criterion: shared idiosyncratic features that
are not strictly necessary for function;
sequence features.)
- Sequence: less conserved, but many sequences are
available. Homology determination is mainly
based on models of sequence evolution and on
the likelihood that, when you compare a sequence
to a database, you will find a sequence of at
least that similarity by chance.
- NB Manually curated databases of 3D structure
similarity are used as a benchmark for the
detection of homology by sequence similarity
(SCOP, Blundels Bus).
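The "likelihood of finding a sequence of at least that similarity" is usually reported as an expected number of chance hits (an E-value). The sketch below uses the Karlin-Altschul form E = K * m * n * exp(-lambda * S). The function names and the default values of lambda and K are illustrative assumptions (of the order used for gapped BLOSUM62 searches), not values from this course, and real tools apply further corrections.

```python
import math


def expected_chance_hits(raw_score, query_len, db_len, lam=0.267, k=0.041):
    """Karlin-Altschul estimate of the number of chance hits scoring >= raw_score.

    lam and k are illustrative assumptions; the real values depend on the
    scoring matrix and the gap penalties used in the search.
    """
    return k * query_len * db_len * math.exp(-lam * raw_score)


def p_at_least_one(e_value):
    """Probability of at least one chance hit, treating chance hits as Poisson."""
    return 1.0 - math.exp(-e_value)


if __name__ == "__main__":
    # Hypothetical query of 300 residues against a database of 50 million residues.
    for score in (40, 80, 120):
        e = expected_chance_hits(score, query_len=300, db_len=50_000_000)
        print(f"score={score:4d}  E={e:.3g}  P(>=1 chance hit)={p_at_least_one(e):.3g}")
```

The same raw score becomes less significant as the database grows, which is why a hit is always judged relative to the database that was searched.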
9Gene / protein evolution beyond blast, distant
homology
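- Not obvious by BLAST
- Substantial divergence, due to time and/or speed
- Use a profile (HMMer or PSI-BLAST)
- In general profiles work better because they
capture which residues/positions characterize
the family
(Figure: example sequences ECGHR, ECNHN, TCQQL and
SIGNL, with positions C and G marked)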
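Why a profile can be more sensitive than a single-sequence comparison can be sketched with a toy position-specific scoring matrix. ECGHR and ECNHN are taken from the slide; the other two family members, the candidate ECAHS, the uniform background and the pseudocount are assumptions made purely for illustration. This is not how PSI-BLAST or HMMer weight sequences or handle gaps.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(alignment, pseudocount=1.0):
    """Return per-column log-odds scores (bits) from an ungapped alignment."""
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    total = len(alignment) + pseudocount * len(AMINO_ACIDS)
    profile = []
    for column in zip(*alignment):
        scores = {}
        for aa in AMINO_ACIDS:
            freq = (column.count(aa) + pseudocount) / total
            scores[aa] = math.log2(freq / background)
        profile.append(scores)
    return profile


def score(profile, sequence):
    """Sum the position-specific scores of an ungapped, equal-length sequence."""
    return sum(column[aa] for column, aa in zip(profile, sequence))


# ECGHR and ECNHN come from the slide; ECGHK and DCGHR are made-up members.
family = ["ECGHR", "ECNHN", "ECGHK", "DCGHR"]
profile = build_profile(family)

# Made-up candidate that keeps the conserved E, C and H, versus an unrelated
# sequence from the slide that only shares the C.
for candidate in ("ECAHS", "TCQQL"):
    print(candidate, round(score(profile, candidate), 2))
```

A candidate that matches the conserved columns scores well even where it differs from every individual family member, because variable columns cost little.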
10Gene / protein evolution beyond blast, distant
homology
- PSI-BLAST: a multiple sequence alignment is
generated on the fly to detect which
residues/positions characterize the family.
- OR use CDD, PFAM or SMART
- Experts have collected representative and
divergent members of a gene family; HMMer or
RPS-BLAST is then used to see if your query
sequence belongs to this gene family (i.e. is
homologous to the members)
- Clearer/cleaner than PSI-BLAST or BLAST.
11How to detect very distant homology /
superfamilies
- When two protein families are homologous but the
homology is not obvious, they are part of the same
so-called superfamily
- How to detect
- In-depth PSI-BLAST
- Reciprocal
- Use of the right seed
- Hopping (homology is by definition transitive)
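Hopping relies on transitivity: if A hits B and B hits C, then A and C are linked even without a direct hit. A minimal sketch, assuming a list of significant pairwise hits is already available; the protein names and the hit list are hypothetical.

```python
from collections import deque


def reachable_by_hopping(hits, start):
    """Return every protein reachable from start via chains of significant hits."""
    graph = {}
    for a, b in hits:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)  # homology is treated as symmetric
    seen = {start}
    queue = deque([start])
    while queue:  # breadth-first search over the hit graph
        current = queue.popleft()
        for neighbour in graph.get(current, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen


# Hypothetical hits: "bridge" links family A to family B; "other_*" is unrelated.
hits = [
    ("famA_1", "famA_2"),
    ("famA_2", "bridge"),
    ("bridge", "famB_1"),
    ("other_1", "other_2"),
]
linked = reachable_by_hopping(hits, "famA_1")
print(sorted(linked))
```

Transitivity only holds for the same homologous region, so in practice each hop should be checked to cover the same part of the protein.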
12Gene / protein evolution Distant homology
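- Alignment-vs-alignment, profile-vs-profile, HMM
vs HMM comparison (whereas HMMer and PSI-BLAST
compare a profile to a single sequence)
- Unfortunately the statistics are still poor
- Works because the positions that characterize
each family can be compared, even when individual
sequences differ
(Figure: two alignments, ACRNG / ACGNR and TCQQL /
TFQQI / TCILL, with the shared position C marked)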
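The idea of comparing two alignments position by position can be sketched with the toy sequences from the slide. The column score used here (the probability that residues drawn from the two columns are identical) is a deliberately simple assumption, not the scoring used by any real profile-vs-profile tool, and the two alignments are assumed to be already in register.

```python
from collections import Counter


def column_frequencies(alignment):
    """Residue frequencies for each column of an ungapped alignment."""
    profile = []
    for column in zip(*alignment):
        counts = Counter(column)
        profile.append({aa: n / len(column) for aa, n in counts.items()})
    return profile


def column_match_scores(profile_a, profile_b):
    """Per-column probability that residues drawn from each profile are identical."""
    return [
        sum(freq * col_b.get(aa, 0.0) for aa, freq in col_a.items())
        for col_a, col_b in zip(profile_a, profile_b)
    ]


family_1 = ["ACRNG", "ACGNR"]
family_2 = ["TCQQL", "TFQQI", "TCILL"]
scores = column_match_scores(
    column_frequencies(family_1), column_frequencies(family_2)
)
print([round(s, 2) for s in scores])
```

TFQQI on its own shares no identical position with ACRNG, yet the two profiles still agree on the conserved C in the second column.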
13Gene / protein evolution Distant homology
- 3D structure comparison/alignment plus visual
inspection of multiple sequence alignments by
Alexey Murzin
- The results of this are stored in the SCOP
database
- (Blundels bus)
14Structural alignment
- Secondary structure elements
- Alpha-helices
- Beta strands (beta sheets)
- Loops
- Fold vs superfamily?
15An example of distant homology
- E.g. the superfamily "P-loop containing
nucleoside triphosphate hydrolase"
- In humans: AAA 130, ABC_tran 182, SMC_N 29
- Families: Zot, UPF0079, TraG, SMC_N, SKI,
Sigma54_activat, Rep_fac_C, Rad17, NACHT,
Mg_chelatase, MCM, KTI12, IstB, GSPII_E, DUF853,
DNA_pol3_delta, Bac_DnaA, APS_kinase, ABC_tran,
AAA_PrkA, AAA_5, AAA_3, AAA_2, AAA
16Apart from sequence and structural features:
conservation of basic molecular function
17Distant Homology: Applications to function
prediction
- Bacterial protein of unknown function (DUF853)
- Member of the P-loop containing nucleoside
triphosphate hydrolase superfamily
- Thus thought to be an ATPase
21Relevance of homology for function prediction:
similar function. What is function?
- Various levels of description
- Sequence similarity / homology has the largest
relevance for molecular function. This is the
aspect of protein function that is best
conserved; protein sequence and structure can
often be interpreted in terms of function.
22Using distant homology for function prediction:
an example from (just) before PSI-BLAST / HMMer
- Secreted Fringe-like Signaling Molecules May Be
Glycosyltransferases.
- Cell. 1997 Jan 10;88(1):9-11.
- Y. Yuan, J. Schultz, M. Mlodzik, P. Bork
23Distant Homology: Application to evolution
- Invention vs (duplication and) divergence
- First determine homology before putting sequences
into multiple sequence alignment / tree building
software
- Two (or more) protein families that are present
in all three kingdoms of life and that can be
determined to be homologous to each other:
information from before the Last Universal Common
Ancestor, i.e. information about very early
evolution
24Protein domains, structural definition: separate
in structure
- a structural domain ("domain") is an element of
overall structure that is self-stabilizing and
often folds independently of the rest of the
protein chain
25Protein domains, sequence/evolutionary
definition: separate in evolution
- Homologous parts of proteins that occur with
different partners
- Mobile
- Modules
- Almost always the same as the structural
definition
26Implications of domains for homology
- The shared ancestry is not a property of the
whole gene but only of part of the gene.
- When studying the evolution of gene families,
consider fusions / domain combinations (also when
making trees etc.)
27Domain repeats. Homology?
- Blast homology vs the real homology unit
- Q8TKV1 (Methanosarcina acetivorans)
- ?
28Q8TKV1
29Ramifications for function prediction /
understanding of cellular processes: one domain,
one (molecular) function (in contrast to one
gene, one function)
- This bit does this and that bit does that
- E.g.
- multidomain enzymes
- Transcriptional regulators
30Example: multidomain enzyme TrpG (E. coli)
31Ramifications for function prediction: when doing
BLAST, mind the domains
- Protein B is wrongly annotated as having the
function of domain 1, based on homology with the
multidomain protein A, although B is not
homologous to domain 1 itself
- (the multi-domain architecture problem for
annotating proteins via BLAST)
(Figure: proteins A and B with domains 1 and 2)
32Ramifications for function prediction: when doing
BLAST, mind the domains
- Protein B is incompletely annotated as having
only the function of domain 2, based on homology
with the single-domain protein A; the second
domain is missed in the annotation
(Figure: proteins A and B with domains 1 and 2)
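Both annotation pitfalls can be guarded against by checking which part of each protein the alignment actually covers before transferring a function. A minimal sketch, assuming domain coordinates for the annotated protein and alignment coordinates from the search are available; all names, coordinates and thresholds are hypothetical.

```python
def covered_domains(domains, hit_start, hit_end, min_fraction=0.8):
    """Return names of domains that are mostly covered by the aligned region.

    domains: list of (name, start, end) on the annotated protein,
    1-based inclusive coordinates.
    """
    covered = []
    for name, start, end in domains:
        overlap = min(end, hit_end) - max(start, hit_start) + 1
        if overlap > 0 and overlap / (end - start + 1) >= min_fraction:
            covered.append(name)
    return covered


def unaligned_query_regions(query_len, q_start, q_end, min_len=50):
    """Return query stretches outside the alignment that could hold another domain."""
    regions = []
    if q_start - 1 >= min_len:
        regions.append((1, q_start - 1))
    if query_len - q_end >= min_len:
        regions.append((q_end + 1, query_len))
    return regions


# Pitfall 1: the hit covers only domain 2 of the multidomain protein A,
# so only the function of domain 2 should be transferred to B.
protein_a = [("domain_1", 10, 150), ("domain_2", 180, 320)]
print(covered_domains(protein_a, hit_start=175, hit_end=330))

# Pitfall 2: a long part of query B lies outside the alignment and may
# contain a domain that the annotation would otherwise miss.
print(unaligned_query_regions(query_len=300, q_start=160, q_end=295))
```

Domain databases such as SMART, PFAM or CDD provide this kind of per-domain view directly, which is the simpler route in practice.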
33Ramifications for function prediction: when doing
BLAST, do PSI-BLAST or CDD / PFAM instead.
- Rather than discovering the domain structure by
BLAST yourself, use e.g. SMART / PFAM / CDD to do
it for you
- NB CDD
34Domains and distant homologies
- Promiscuous domains (i.e. those present in
many proteins) are often quite diverged and
thus need sensitive homology detection tools in
order to be recognized.
- Moreover, it is often only the most general
functional property of the domain that is
conserved over such long evolutionary distances
- Over long evolutionary distances genes are often
only homologous in the sense that they share a
domain, rather than being full-length homologous
- We THUS use PFAM/SMART etc. for
- the domains
- and to improve upon BLAST / be cleaner than
PSI-BLAST
- and because most sequences are covered by
these databases. No need to reinvent the wheel.
The ones that are not covered are often
non-globular, recent inventions, or very fast
evolving
35Disclaimer non-globular regions
- Low complexity
- Unstructured, elongated (as opposed to globular)
- Many polar/charged residues, few hydrophobic
residues
- Parts of proteins that do not possess a clear 3D
structure
- Convergence
- Do not obey PAM or BLOSUM
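Low-complexity stretches can be flagged roughly by their low compositional diversity. The sketch below uses the Shannon entropy of a sliding window; the window size, the threshold and the example sequence are assumptions for illustration, and this is not the algorithm of any established low-complexity filter.

```python
import math
from collections import Counter


def shannon_entropy(segment):
    """Shannon entropy (bits) of the residue composition of a segment."""
    counts = Counter(segment)
    total = len(segment)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def low_complexity_windows(sequence, window=12, threshold=2.2):
    """Return 0-based start positions of windows whose entropy is below threshold."""
    return [
        i
        for i in range(len(sequence) - window + 1)
        if shannon_entropy(sequence[i:i + window]) < threshold
    ]


# Made-up sequence with a poly-Q stretch in the middle.
example = "MKVLAAGIVGLQQQQQQQQQQQQSPEERTYW"
print(low_complexity_windows(example))
```

Hits that fall entirely within such windows deserve suspicion, since similarity there can arise by convergence rather than shared ancestry.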
36Disclaimer Coiled coil
- All alpha; thought to arise independently
(convergence)
- Hypothesis: reservoir for new folds, all-alpha
folds (Koonin EV)
- E.g. ras / rho / rab / ran / -GAPs
37Disclaimer Other protein motifs
- Signal peptides
- Lipid anchoring
- Convergence, yet still important to predict
- Trans-membrane?
38Interesting result on protein evolution regarding
domains and duplications: neutral?
(Figure legend: black = observed; blue = model with
recombination and duplication separate; red = model
that also includes duplication of combinations)