Title: Identification of Protein Domains
1Identification of Protein Domains
2Orthologs and Paralogs
- Describing evolutionary relationships among genes (proteins).
- The two major ways of creating homologous genes are gene duplication and speciation.
- Homology alone is not sufficiently well-defined, so additional terms are used.
3- Orthologs are two genes from two different
species that derive from a single gene in the
last common ancestor of the species.
- Paralogs are genes that derive from a single gene
that was duplicated within a genome.
4- Co-orthologs are paralogs produced by duplications of orthologs subsequent to a given speciation event.
5- Inparalogs are paralogs in a given lineage that all evolved by gene duplications that happened after the speciation event.
- Outparalogs are paralogs in the given lineage that evolved by gene duplications that happened before the speciation event.
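The definitions above can be sketched in code. This is a minimal illustration, not a method from the slides: it assumes a small gene tree whose internal nodes are already labeled "speciation" or "duplication", and it classifies a pair of genes by the event at their last common ancestor. The tree, the gene names, and the `Node`, `path_to`, and `classify` helpers are all hypothetical, and the tree is assumed to contain only one speciation event of interest.

```python
class Node:
    """Gene-tree node: internal nodes carry an event, leaves carry a gene name."""
    def __init__(self, event=None, children=(), gene=None):
        self.event = event          # "speciation" or "duplication" (internal nodes)
        self.children = children
        self.gene = gene            # set on leaves only

def path_to(node, gene, path=()):
    """Return the nodes from the root down to the leaf for `gene`, or None."""
    path = path + (node,)
    if node.gene == gene:
        return path
    for child in node.children:
        found = path_to(child, gene, path)
        if found:
            return found
    return None

def classify(root, gene_a, gene_b):
    """Classify a gene pair by the event at their last common ancestor."""
    path_a, path_b = path_to(root, gene_a), path_to(root, gene_b)
    shared = [x for x, y in zip(path_a, path_b) if x is y]
    lca, ancestors = shared[-1], shared[:-1]
    if lca.event == "speciation":
        return "orthologs"
    # Duplication below a speciation node happened after that speciation.
    if any(n.event == "speciation" for n in ancestors):
        return "inparalogs"
    return "outparalogs"

# Hypothetical tree: an old duplication (A vs B), then the human/mouse
# speciation in each copy, then a human-specific duplication of B.
tree = Node("duplication", (
    Node("speciation", (Node(gene="A_human"), Node(gene="A_mouse"))),
    Node("speciation", (
        Node("duplication", (Node(gene="B1_human"), Node(gene="B2_human"))),
        Node(gene="B_mouse"),
    )),
))

print(classify(tree, "A_human", "A_mouse"))    # orthologs
print(classify(tree, "B1_human", "B2_human"))  # inparalogs
print(classify(tree, "A_human", "B1_human"))   # outparalogs
```

In this toy tree, B1_human and B2_human are each orthologous to B_mouse, which makes them co-orthologs of B_mouse in the sense defined above.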
6Orthologs and Paralogs
- Orthologs - evolutionary functional counterparts in different species.
- Inparalogs - important for detecting lineage-specific adaptations.
7Proteins
- Rapidly growing databases of protein sequences due to genome sequencing projects.
- Many new proteins belong to protein families with known functions (significant sequence similarity).
- Only a small fraction of known proteins have functions determined by experiment.
- Databases providing computational sequence analysis allow us to classify new proteins into known families, and thus infer their function.
8Protein Domains
- A domain is an independent structural unit which can be found alone or in conjunction with other domains or repeats.
- Module: a mobile domain.
- Different domains have distinct functions.
- Many eukaryotic proteins have multiple domains.
9Protein Domains
PX domain with ligand
SH3 domain with ligand
10Identifying Protein Domains
- Problems
- Defining the members of each family.
- Building multiple alignments of the members.
- Finding the boundaries of the domain.
12Identifying Protein Domains
- Little structural data → identification by sequence analysis.
- Even when the structure of the domain is not
known it may be possible to define its boundaries
from sequence alone.
- Sequence characterization of families helps determine 3D structure and molecular functions.
13Identifying Protein Domains
Motif matches are often useful for indicating functional sites; however:
- They do not give a clear picture of the domain boundaries.
- They lack sensitivity.
14Identifying Protein Domains
- Automatic methods
- Fast, effective, deal with a lot of information.
- Might fragment domain families.
- Might cause fusion of domain families.
- Manual methods
- Knowledge of protein experts is put to use.
- Slow, require a lot of manpower.
16SMART (Simple Modular Architecture Research
Tool)
- Web-based resource used for
- rapid annotation of protein domains.
- analysis of domain architectures.
17Domain Architecture
Protein PA-3427CG
Species Drosophila melanogaster
Protein ENSMUSP00000023109
Species Mus musculus
Protein ENSANGP00000009529
Species Anopheles gambiae
18SMART (Simple Modular Architecture Research Tool)
- There are over 600 domain families.
- Provides information about
- function.
- subcellular localization.
- phyletic distribution.
- tertiary structure.
- Based on HMMs (Hidden Markov Models).
19SMART (Simple Modular Architecture Research Tool)
- HMM based on seed alignment.
- Threshold values used to determine homology of
domains.
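The idea of a model built from a seed alignment plus a score threshold can be illustrated with a much simpler stand-in. The sketch below is not a profile HMM and is not SMART's implementation: it assumes an ungapped seed alignment, builds position-specific log-odds scores against a uniform background, and calls a domain present when the best window score reaches a threshold. The seed sequences, the pseudocount, and the threshold value are all made up for illustration.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_profile(seed, pseudocount=1.0):
    """Per-column log-odds scores from an ungapped seed alignment."""
    background = 1 / len(AMINO_ACIDS)        # assumed uniform background
    profile = []
    for column in zip(*seed):
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        profile.append({
            aa: math.log2(((column.count(aa) + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile

def best_hit(profile, sequence):
    """Return (score, start) of the best-scoring window in the sequence."""
    width = len(profile)
    hits = [
        (sum(profile[i].get(sequence[start + i], 0.0) for i in range(width)), start)
        for start in range(len(sequence) - width + 1)
    ]
    return max(hits) if hits else (float("-inf"), -1)

def has_domain(profile, sequence, threshold):
    """Apply the score threshold to decide whether the sequence is a match."""
    return best_hit(profile, sequence)[0] >= threshold

seed = ["ATHD", "ATHE", "ATHD"]              # hypothetical seed alignment
profile = build_profile(seed)
score, start = best_hit(profile, "ALRDFATHDDF")
print(round(score, 2), start)
print(has_domain(profile, "ALRDFATHDDF", threshold=5.0))
print(has_domain(profile, "GGGGGGGG", threshold=5.0))
```

A real profile HMM also models insertions and deletions, which is what makes gapped alignments and domain boundary prediction possible; this sketch leaves that out.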
20SMART (Simple Modular Architecture Research Tool)
- Alignments of proteins by
- Minimizing insertions/deletions in conserved alignment blocks.
- Optimizing amino acid property conservation.
- Closing unnecessary gaps.
- Gapped alignments are preferred over ungapped ones for
- prediction of domain boundaries.
- greater information content.
- Alignment of entire structural domains.
23PROSITE - database of protein families and domains
- Database of biologically significant sites and patterns. Contains 1,609 profiles.
- Pattern: a conserved sequence of a few amino acids.
- Identifies to which known family of proteins (if any) the new sequence belongs.
- Used to determine the function of uncharacterized proteins translated from genomic or cDNA sequences.
24PROSITE - database of protein families and domains
- A protein too distant from any other for its resemblance to be detected by overall sequence alignment can still be classified according to a pattern.
- Patterns arise because the requirements of binding sites impose very tight constraints on the evolution of portions of the protein.
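To make pattern-based classification concrete, here is a small sketch that converts a pattern written in PROSITE-style syntax into a Python regular expression and scans a sequence with it. It covers only the common syntax elements (`x` for any residue, `[...]` for allowed residues, `{...}` for forbidden residues, `(n)` or `(n,m)` for repeats, `<` and `>` for terminal anchors). The function name `prosite_to_regex` and both example patterns are made up for illustration and are not real PROSITE entries.

```python
import re

def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern string into a regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        element = element.replace("{", "[^").replace("}", "]")   # forbidden residues
        element = element.replace("(", "{").replace(")", "}")    # repeat counts
        element = element.replace("x", ".")                      # any residue
        element = element.replace("<", "^").replace(">", "$")    # terminal anchors
        parts.append(element)
    return "".join(parts)

# Hypothetical patterns, for illustration only.
print(prosite_to_regex("C-x(2,4)-C-x(3)-{P}-H."))   # C.{2,4}C.{3}[^P]H
print(prosite_to_regex("A-T-H-[DE]"))               # ATH[DE]

match = re.search(prosite_to_regex("A-T-H-[DE]"), "ALRDFATHDDF")
print(match.start(), match.group())                 # 5 ATHD
```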
25PROSITE: how is a pattern developed?
- As short as possible.
- Detects all/most sequences it describes.
- As few false results as possible.
26PROSITE: how is a pattern developed?
- First study reviews on a protein family.
- Then build an alignment table with particular attention to residues and regions important to the biological function of that family:
- Enzyme catalytic sites.
- Prosthetic group attachment sites (heme).
- Amino acids involved in binding a metal ion.
- Cysteines involved in disulfide bonds.
- Regions involved in binding a molecule (ADP/ATP, GDP/GTP, calcium, DNA, etc.) or another protein.
27PROSITE: steps in the development of a pattern
- Find a core pattern: 4-5 biologically significant residues.
- Test the pattern on a large database.
- If lucky, there is correlation in this region, which indicates a good pattern.
- Mostly, there is no correlation:
- Gradually increase the size of the pattern.
- Search over other patterns.
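The "test, then extend" loop can be sketched as a simple count of hits inside and outside the family. This is an illustration under stated assumptions, not the PROSITE curation procedure: the family members are the three sequences from the example slide, the non-family sequences are invented, and `evaluate_pattern` is a hypothetical helper that takes a regular expression.

```python
import re

def evaluate_pattern(regex, family, others):
    """Count family hits, hits outside the family, and missed family members."""
    compiled = re.compile(regex)
    true_pos = sum(1 for seq in family if compiled.search(seq))
    false_pos = sum(1 for seq in others if compiled.search(seq))
    return {
        "true_pos": true_pos,
        "false_pos": false_pos,
        "false_neg": len(family) - true_pos,
    }

family = ["ALRDFATHDDF", "SMTAEATHDSI", "ECDQAATHEAS"]
others = ["MKATHGLV", "GGSATHDPL", "PLVNQW"]      # invented non-family sequences

print(evaluate_pattern("ATH", family, others))      # short core pattern
print(evaluate_pattern("ATH[DE]", family, others))  # pattern extended by one position
```

With these toy sequences, extending the core by one position keeps all three family members while dropping one of the two hits outside the family, which is the trade-off the steps above describe.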
28PROSITE: An example
This pattern is small and would probably pick up too many false positive results:
- ALRDFATHDDF
- SMTAEATHDSI
- ECDQAATHEAS
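The transcript does not preserve the pattern itself, but the conserved positions can be read off the three sequences. The sketch below assumes the sequences are aligned exactly as listed (same length, no gaps) and writes each column as a fixed residue, a short set of alternatives, or `x`. The `consensus_pattern` helper and its `max_alternatives` cutoff are hypothetical choices for illustration.

```python
def consensus_pattern(aligned, max_alternatives=2):
    """Derive a PROSITE-style pattern from equal-length aligned sequences."""
    elements = []
    for column in zip(*aligned):
        residues = sorted(set(column))
        if len(residues) == 1:
            elements.append(residues[0])                    # fully conserved
        elif len(residues) <= max_alternatives:
            elements.append("[" + "".join(residues) + "]")  # few alternatives
        else:
            elements.append("x")                            # variable position
    # Trim uninformative wildcard positions at both ends.
    while elements and elements[0] == "x":
        elements.pop(0)
    while elements and elements[-1] == "x":
        elements.pop()
    return "-".join(elements)

sequences = ["ALRDFATHDDF", "SMTAEATHDSI", "ECDQAATHEAS"]
print(consensus_pattern(sequences))   # A-T-H-[DE]
```

Only four positions are constrained, which is consistent with the slide's remark that such a short pattern would probably match many unrelated sequences.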
29- Patterns - small regions, high sequence
similarity.
- Profiles characterize a protein family or
domain over its entire length.
31Research: Finding new domain families: Automatic methods
- The team started with 107 nuclear domains.
- Using SMART: get all proteins with at least one of these domains and characterize their complete domain structure.
- Regions not annotated using known SMART domain models were extracted with their domain context.
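The extraction step can be sketched as interval arithmetic over a protein's annotated domains. This is a hedged illustration, not the team's code: domain coordinates are assumed to be 1-based and inclusive, the minimum region length of 50 residues is an arbitrary choice, and the example protein with its SH3 and PX positions is invented. Each region is returned with its neighbouring domains, as a stand-in for "domain context".

```python
def unannotated_regions(length, domains, min_length=50):
    """Return (start, end, left_neighbour, right_neighbour) for each
    stretch of at least `min_length` residues not covered by any domain.

    `domains` is a list of (name, start, end), 1-based and inclusive.
    """
    regions = []
    position, previous = 1, "N-term"
    for name, start, end in sorted(domains, key=lambda d: d[1]):
        if start - position >= min_length:
            regions.append((position, start - 1, previous, name))
        if end + 1 > position:               # tolerate overlapping domains
            position, previous = end + 1, name
    if length - position + 1 >= min_length:
        regions.append((position, length, previous, "C-term"))
    return regions

# Invented example: a 500-residue protein with two annotated domains.
domains = [("SH3", 10, 70), ("PX", 200, 310)]
for region in unannotated_regions(500, domains):
    print(region)
# (71, 199, 'SH3', 'PX')
# (311, 500, 'PX', 'C-term')
```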
32Finding new domain families: Automatic methods
- Grouping proteins by region similarity.
- Finding homologs using PSI-BLAST on the longest member of every group (threshold E-value …).
- Finding domain organization via SMART.
- Homologous regions: candidates for a novel domain family.
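The grouping step and the choice of one query per group can be sketched as follows. The slides do not say how regions were grouped, so this example assumes single-linkage grouping over a precomputed list of similar pairs; the region names, sequences, and the `group_regions` and `pick_queries` helpers are hypothetical. The homology search itself is not shown.

```python
def group_regions(regions, similar_pairs):
    """Single-linkage grouping: regions connected by any chain of
    similar pairs end up in the same group (union-find)."""
    parent = {name: name for name in regions}

    def find(name):
        while parent[name] != name:
            parent[name] = parent[parent[name]]   # path compression
            name = parent[name]
        return name

    for a, b in similar_pairs:
        parent[find(a)] = find(b)

    groups = {}
    for name in regions:
        groups.setdefault(find(name), []).append(name)
    return list(groups.values())

def pick_queries(regions, groups):
    """Choose the longest region of every group as the search query."""
    return [max(group, key=lambda name: len(regions[name])) for group in groups]

# Invented regions (name -> sequence) and invented similarity results.
regions = {"r1": "A" * 120, "r2": "A" * 80, "r3": "C" * 60, "r4": "D" * 200}
similar_pairs = [("r1", "r2")]

groups = group_regions(regions, similar_pairs)
print(groups)
print(pick_queries(regions, groups))
```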
33Finding new domain families
34Finding new domain families: Manual confirmation
- Different context → novel module family.
- Proteins with nuclear AND extracellular domains excluded.
- Multiple alignments and known locations of domains → definition of domain borders.
- Automatic searches to find more members, E-value …
- Marginal similarity to a domain family → possible divergent family.
35Prediction of Function: Chromatin-Binding Domains
- Protein SPT6, which contains a CSZ domain, regulates transcription through a histone-binding capability.
- It also contains two other types of domains, which are unlikely to bind histones.
- Therefore it was predicted that the CSZ domain has that function.
36Research
- Search of the C-terminal region by PSI-BLAST (E-value …) found UBX-containing proteins and metazoan homologs of PNGases.
- PNGases: proteins involved in the UPR.
- UPR: unfolded protein response.
- PUG: the homologous regions.
- PUG domains are found in proteins with domains central to ubiquitin-mediated proteolysis (UBA and UBX).
37- Conclusion
- PUG-containing proteins might link the UPR to ubiquitin-mediated protein degradation.
38PUG
UBA
Believed to have a role in the UPR
40Apoptosis: UBX domain from human FAF1
DNA-binding protein: C-terminal UBA domain of the human homologue of RAD23A (HHR23A)
41- Orthologs of PNGases in metazoans are present singly (not in multiple paralogs) and are likely to have similar cellular localization.
- The ortholog in Saccharomyces cerevisiae is known to be localized mainly in the nucleus.
42- An HMM built from the PUG domain shows marginal similarity to IRE1p-like kinases, which are known to initiate the UPR as well.
- The authors suggest the presence of divergent PUG domains in the C termini of these proteins.
- Analysis revealed a conserved region in metazoan PNGases, which was named PAW and added to SMART.
43- The team found 28 novel nuclear domain families.
- Most of them have representatives in diverse molecular contexts in different species.
- Some are specific to a single species.
- Others are divergent members of previously recognized families.
44The End