Title: BCB 444/544
1BCB 444/544
Lecture 16 Protein Function Prediction 16 Oct 1
- Thanks to Drena Dobbs for many borrowed &
modified PPTs
2Required Reading (before lecture)
- Wed Oct 1 for Lecture 16
- Chp 7
- Fri Oct 3 for Lecture 17
- Chp 8
3Homework Assignments
- HW3 on HMMs is posted on the course webpage under
homework
- Due Oct 6th by 5pm
- 544 Students (and any in 444 who want to do a
project)
- Extra HW1 posted
- Due Friday Oct 3rd by 11AM
4Chp 7 - Protein Motifs & Domain Prediction
- SECTION II SEQUENCE ALIGNMENT
- Xiong Chp 7
- Protein Motifs and Domain Prediction
- Motif Discovery in Unaligned Sequences
- Identification of Motifs & Domains in MSAs
- Motif & Domain Databases Using Regular
Expressions
- Motif & Domain Databases Using Statistical Models
- Protein Family Databases
- Sequence Logos
5Protein Function
- Proteins are the primary molecules responsible
for carrying out cellular functions
- Most enzymes that catalyze chemical reactions are
proteins (but some are RNAs!)
- Proteins have complex structures that are
critical for their functions
Protein structure for dystrophin, encoded by the
largest known gene in humans
6Protein Structure - 4 levels of organization
7Key Aspects of Protein Function - Localization &
Interactions
Protein localization - function depends on
proteins being in right place at right time!
Protein interactions - function depends on
proteins interacting with correct partners inside
cells!
8Protein Sequence-Structure-Function
- Amino acid sequence determines protein structure
- But some proteins need help folding
("chaperones") in vivo
- Proteins fold to a single "native" structure
(under a specific set of conditions)
- Protein structure determines function
- But level, timing & location of expression are
important
- Interactions with other proteins, DNA, RNA,
small ligands are also very important!!
- PROBLEMS
- We don't know the "folding code" that determines
how proteins fold!
- We don't know the "recognition code" that
determines how proteins find and bind their
correct partners!
9Motifs & Domains
- Motif - short conserved sequence pattern
- Associated with distinct function in protein or
DNA
- Avg 10 residues (usually 6-20 residues)
- e.g., zinc finger motif - in protein
- e.g., TATA box - in DNA
- Domain - "longer" conserved sequence pattern,
defined as an independent functional and/or
structural unit
- Avg 100 residues (range from 40-700 in
proteins)
- e.g., kinase domain or transmembrane domain - in
protein
- Domains may (or may not) include motifs
102 Approaches for Representing "Consensus"
Information in Motifs & Domains
- Regular expression
- Statistical model
11Regular Expressions
- Reduce information from MSA
- e.g., protein phosphorylation site motif
[ST]-X-[RK]
- Symbols represent specific or unspecified
residues, spaces, etc.
- 2 mechanisms for matching
- Exact
- "Fuzzy" (inexact, approximate) - flexible, more
permissive to detect "near matches"
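A minimal sketch of the two matching mechanisms, using the phosphorylation-site motif from this slide written as a Python regular expression. The function names, the toy sequence, and the "count mismatched positions" definition of fuzzy matching are assumptions for illustration, not the method of any particular database.

```python
import re

# Motif from the slide: S or T, then any residue, then R or K.
PATTERN = r"[ST].[RK]"

def exact_matches(seq):
    """Return (start, matched_text) for every exact match, overlaps included."""
    # The lookahead lets overlapping sites be reported separately.
    return [(m.start(), m.group(1)) for m in re.finditer(f"(?=({PATTERN}))", seq)]

def fuzzy_matches(seq, allowed_sets, max_mismatches=1):
    """Return windows that violate at most max_mismatches motif positions.

    allowed_sets has one entry per motif position: a set of permitted
    residues, or None for an unspecified position (X).
    """
    width = len(allowed_sets)
    hits = []
    for i in range(len(seq) - width + 1):
        window = seq[i:i + width]
        mismatches = sum(
            1 for res, allowed in zip(window, allowed_sets)
            if allowed is not None and res not in allowed
        )
        if mismatches <= max_mismatches:
            hits.append((i, window, mismatches))
    return hits

# Toy sequence (hypothetical, for illustration only).
seq = "MASAKTPRGSQE"
motif_sets = [set("ST"), None, set("RK")]
print(exact_matches(seq))
print(fuzzy_matches(seq, motif_sets, max_mismatches=1))
```

With `max_mismatches=0` the fuzzy search reports the same windows as the exact search; raising it trades specificity for sensitivity.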
12Statistical Models
- Includes probability information derived from MSA
- PSSM
- Profile
- HMM
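To make "probability information derived from MSA" concrete, here is a hedged sketch of building a simple PSSM as log-odds scores. It assumes an ungapped alignment, a uniform background, and a flat pseudocount; real tools use more careful sequence weighting and background models. All names and the toy alignment are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(alignment, pseudocount=1.0, background=None):
    """Return one dict per alignment column mapping residue -> log2-odds score."""
    if background is None:
        # Assumption: uniform background frequencies.
        background = {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}
    n_seqs = len(alignment)
    total = n_seqs + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for column in zip(*alignment):
        counts = Counter(column)
        scores = {}
        for aa in AMINO_ACIDS:
            freq = (counts.get(aa, 0) + pseudocount) / total
            scores[aa] = math.log2(freq / background[aa])
        pssm.append(scores)
    return pssm

def score(pssm, segment):
    """Sum the per-column scores for a segment as long as the PSSM."""
    return sum(col[aa] for col, aa in zip(pssm, segment))

# Toy ungapped alignment (hypothetical).
alignment = ["SAK", "TPR", "SGK", "TLR"]
pssm = build_pssm(alignment)
print(round(score(pssm, "SAK"), 2), round(score(pssm, "WWW"), 2))
```

Unlike a regular expression, the PSSM gives every candidate segment a graded score rather than a yes/no answer.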
13Motif Discovery in Unaligned Sequences
- Expectation Maximization
- Gibbs Sampling
14Expectation Maximization
- Generate "random" alignment of all sequences,
derive PSSM, iteratively match individual
sequences to PSSM to edit & improve it
- Problems? Can hit a local optimum (premature
convergence)
- Sensitive to initial alignment
- MEME - Multiple EM for Motif Elicitation -
modified EM, avoids local optimum issues
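The EM loop described on this slide can be sketched as follows. This is a simplified illustration that assumes exactly one motif occurrence per sequence, a uniform background, and a fixed motif width; it is not MEME's algorithm, and all names are hypothetical. Because the start is random, different seeds can converge to different answers, which is the local-optimum problem noted above.

```python
import random

def em_motif(seqs, width, n_iter=50, pseudocount=0.5, seed=0):
    """Return (pwm, starts) after n_iter rounds of EM."""
    rng = random.Random(seed)
    alphabet = sorted(set("".join(seqs)))
    bg = 1.0 / len(alphabet)  # assumption: uniform background
    # "Random" alignment: z[i][j] is the probability that the motif in
    # sequence i starts at position j; begin with one random start each.
    z = []
    for s in seqs:
        n_pos = len(s) - width + 1
        start = rng.randrange(n_pos)
        z.append([1.0 if j == start else 0.0 for j in range(n_pos)])
    pwm = []
    for _ in range(n_iter):
        # M-step: derive column probabilities from the weighted alignment.
        pwm = [{a: pseudocount for a in alphabet} for _ in range(width)]
        for s, zi in zip(seqs, z):
            for j, weight in enumerate(zi):
                if weight == 0.0:
                    continue
                for k in range(width):
                    pwm[k][s[j + k]] += weight
        for col in pwm:
            total = sum(col.values())
            for a in col:
                col[a] /= total
        # E-step: re-match every sequence against the current model.
        for i, s in enumerate(seqs):
            ratios = []
            for j in range(len(s) - width + 1):
                r = 1.0
                for k in range(width):
                    r *= pwm[k][s[j + k]] / bg
                ratios.append(r)
            total = sum(ratios)
            z[i] = [r / total for r in ratios]
    starts = [max(range(len(zi)), key=zi.__getitem__) for zi in z]
    return pwm, starts

# Toy DNA sequences (hypothetical).
seqs = ["ACGTTGACA", "TTGACAGG", "CCTTGACA"]
pwm, starts = em_motif(seqs, width=6)
print(starts)
```

Rerunning with several `seed` values and keeping the best-scoring model is one simple way to reduce sensitivity to the initial alignment.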
15Gibbs Sampling
- Generate "trial" PSSM from random alignment
first, as in EM, but leave one sequence out of
initial alignment, then iteratively match PSSM to
left-out sequences
- Gibbs Sampler - web-based motif search via Gibbs
sampling
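A hedged sketch of the leave-one-out idea, assuming one motif occurrence per sequence and a fixed width. It is a bare-bones site sampler for illustration, not the code behind the Gibbs Sampler web server; all names and the toy sequences are hypothetical.

```python
import random

def gibbs_motif(seqs, width, n_iter=200, pseudocount=0.5, seed=0):
    """Return one sampled motif start per sequence."""
    rng = random.Random(seed)
    alphabet = sorted(set("".join(seqs)))
    # Random initial alignment: one start position per sequence.
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]
    for _ in range(n_iter):
        held_out = rng.randrange(len(seqs))
        # Build the trial PSSM from every sequence except the held-out one.
        pwm = [{a: pseudocount for a in alphabet} for _ in range(width)]
        for i, (s, st) in enumerate(zip(seqs, starts)):
            if i == held_out:
                continue
            for k in range(width):
                pwm[k][s[st + k]] += 1.0
        for col in pwm:
            total = sum(col.values())
            for a in col:
                col[a] /= total
        # Score every window of the held-out sequence against the PSSM.
        s = seqs[held_out]
        weights = []
        for j in range(len(s) - width + 1):
            w = 1.0
            for k in range(width):
                w *= pwm[k][s[j + k]]
            weights.append(w)
        # Sample (rather than pick the maximum) a new start position.
        starts[held_out] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts

# Toy DNA sequences (hypothetical).
seqs = ["ACGTTGACA", "TTGACAGG", "CCTTGACA"]
starts = gibbs_motif(seqs, width=6)
print(starts)
```

Sampling the new position in proportion to its score, instead of always taking the best one, is what lets the procedure escape some local optima.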
16Practical Advice
18Motif & Domain Databases
- Based on regular expressions
- Prosite (Interpro)
- Emotif
- Limitation - these don't take probability info
into account
- Based on statistical models
- PRINTS
- BLOCKS
- ProDom
- Pfam
- SMART
- CDART
- Reverse PsiBLAST
- READ your textbook & try some of these at home -
there are distinct advantages/disadvantages
associated with each
- TAKE HOME LESSON
- Always try several methods!
- (not just one!)
19Prosite
- Database of protein families and domains
- Attempts to derive a signature for each protein
family or domain that distinguishes the protein
from all others
- Currently contains over 1000 signatures
- Uses both patterns (regular expressions) and
profiles
20Prosite Patterns
- Start with a MSA of related sequences - pay
special attention to things like active sites,
residues that bind a ligand
- Try to find a conserved 4-5 amino acid sequence
that includes these functional sites
- Scan Swiss-Prot DB with this sequence
- If Swiss-Prot hits are specific for the family,
stop
- If not specific, lengthen the pattern, and search
again
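The scan-and-lengthen loop above can be sketched on toy data. This assumes an ungapped MSA, a small dictionary standing in for Swiss-Prot, and known family membership; the helper names and sequences are hypothetical, and Prosite's real curation involves manual judgment that this loop does not capture.

```python
import re

def column_class(column):
    """Turn one MSA column into a literal residue or a character class."""
    residues = sorted(set(column))
    return residues[0] if len(residues) == 1 else "[" + "".join(residues) + "]"

def pattern_from_columns(msa, start, end):
    """Build a regular expression from MSA columns start..end-1."""
    return "".join(column_class(col) for col in list(zip(*msa))[start:end])

def grow_pattern(msa, site, database, family_ids, min_width=4):
    """Lengthen a pattern around column `site` until all hits are family members."""
    n_cols = len(msa[0])
    start = max(0, site - min_width // 2)
    end = min(n_cols, start + min_width)
    while True:
        pattern = pattern_from_columns(msa, start, end)
        hits = {name for name, seq in database.items() if re.search(pattern, seq)}
        if hits <= family_ids:
            return pattern, hits  # specific for the family: stop
        if start == 0 and end == n_cols:
            return pattern, hits  # nothing left to add
        # Not specific: lengthen by one column on each side and search again.
        start = max(0, start - 1)
        end = min(n_cols, end + 1)

# Toy family alignment and toy "database" (both hypothetical).
msa = ["GDSGGPLV", "GDSGGPMV", "GDSGAPLV"]
database = {
    "fam1": "AAGDSGGPLVCC",
    "fam2": "TTGDSGGPMVKK",
    "other": "MMGDSGKKWWLL",
}
pattern, hits = grow_pattern(msa, site=2, database=database,
                             family_ids={"fam1", "fam2"})
print(pattern, sorted(hits))
```

Here the initial four-residue pattern also hits the unrelated sequence, so the loop adds one more column before the hits become family-specific.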
21Prosite Profiles
- Prosite Patterns attempt to find a very short
signature
- Prosite Profiles attempt to characterize the
family or domain over the entire length
- Issue - profile describes both conserved and
non-conserved regions, so it is possible to match
the profile but not be a member of the family
- Prosite tests its profiles to try to avoid this
problem
22Emotif
- More than 170,000 highly specific and sensitive
protein sequence motifs
- Uses regular expressions
- Based on MSA of related proteins
- Enumerates all possible motifs, not just one per
family or domain
- Sensitivity is defined by how many of the
sequences in the MSA fit the motif
- Specificity is based on how often you would
expect to see the motif in randomly generated
sequences
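A rough illustration of these two definitions. Sensitivity is computed as the fraction of MSA sequences that match the motif; for specificity, the sketch computes the chance that a random window matches, assuming independent positions and uniform residue frequencies. The function names and toy sequences are hypothetical, and Emotif's actual scoring may differ.

```python
import re

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Assumption: uniform background residue frequencies.
UNIFORM = {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}

def sensitivity(motif_regex, msa_sequences):
    """Fraction of the family sequences that contain the motif."""
    matched = sum(1 for s in msa_sequences if re.search(motif_regex, s))
    return matched / len(msa_sequences)

def chance_probability(position_sets, background=UNIFORM):
    """Probability that one random window matches every motif position.

    position_sets holds a set of allowed residues per position, or None
    for an unspecified position.
    """
    p = 1.0
    for allowed in position_sets:
        if allowed is not None:
            p *= sum(background[aa] for aa in allowed)
    return p

# Toy family sequences (hypothetical).
family = ["ASAKA", "GTPRG", "AAAAA", "QSLKQ"]
sens = sensitivity(r"[ST].[RK]", family)
p_random = chance_probability([set("ST"), None, set("RK")])
print(sens, p_random)
```

Multiplying `p_random` by the number of windows in a database gives the expected number of chance hits, so a lower value means a more specific motif.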
23PRINTS
- Collection of fingerprints - a set of motifs
that can predict the occurrence of other motifs
- Uses a MSA with only a small number of sequences
and attempts to define fingerprints
- These fingerprints are used to scan their
database for more sequences containing the
fingerprint
- Rebuild the fingerprints, and iterate until the
fingerprint converges
24Protein Family Databases
- In addition to databases of "related" protein
sequences, based on shared motifs or domains
(Pfam, BLOCKS, CDART), some databases "cluster"
sequences into families based on near full-length
sequence comparisons
- COGs - Clusters of Orthologous Groups (at NCBI)
- Mostly Prokaryotic sequences
- KOG - newer Eukaryotic version
- COGnitor - software to search database
- ProtoNet - also clusters of homologous protein
sequences
- Advantages - tree-like hierarchical structure
- Provides GO (gene ontology) annotations
- Provides InterPro keywords
25Pfam
- Contains over 10,000 protein families
- Pfam-A - high quality, manually curated
- Pfam-B - automatically generated, comprehensive
collection
- Uses 2 profile HMMs to describe each protein
family
- Why 2?
- One for global matches, one for local matches
26CDART
- Conserved Domain Architecture Retrieval Tool
- Given a query sequence
- Identifies functional domains
- Lists proteins with a similar domain architecture
(sequential order of conserved domains)
- Can detect remote homologs - domains are
conserved, but sequence of each domain may vary
quite a bit
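To illustrate what "domain architecture" means here, a small sketch that reduces each protein to the ordered tuple of its domain names and looks for other proteins with the same tuple. The data layout, names, and the exact-equality notion of "similar" are simplifying assumptions for illustration, not how CDART itself is implemented.

```python
def architecture(domain_hits):
    """Order (start, domain_name) hits by position and keep only the names."""
    return tuple(name for _, name in sorted(domain_hits))

def similar_architectures(query_hits, database):
    """Return IDs of proteins whose ordered domain list equals the query's."""
    query_arch = architecture(query_hits)
    return sorted(
        pid for pid, hits in database.items()
        if architecture(hits) == query_arch
    )

# Toy domain annotations as (start position, domain name) pairs (hypothetical).
database = {
    "protA": [(10, "SH3"), (120, "SH2"), (260, "Kinase")],
    "protB": [(300, "Kinase"), (15, "SH3"), (140, "SH2")],
    "protC": [(20, "SH2"), (150, "Kinase")],
}
query = [(5, "SH3"), (90, "SH2"), (210, "Kinase")]
matches = similar_architectures(query, database)
print(matches)
```

Because only the order of domains is compared, two proteins can match even when the residues inside each domain differ substantially, which is why this view can surface remote homologs.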
27PIRSF
- Protein Information Resource SuperFamily
classification system
- Network classification system based on
evolutionary relationships of whole proteins
- Advantages
- Allows for both generic and specific functions to
be assigned
- Allows for classification of proteins without
well-defined domains
28PIRSF
- Multiple levels of classification
- Superfamily
- one or more common domains
- Homeomorphic family
- proteins that are homologous (full-length
sequence similarity) AND share a common domain
architecture
- Homeomorphic subfamily
- functional specialization
29What we didn't talk about
- Machine learning!
- Given a protein sequence, predict the function
- Many, many methods exist, no clear favorite
- Most are specific for some type of function,
e.g., kinase
- In lab - Jafa metaserver sends your sequence to
5 different prediction servers and combines the
results into a single prediction
30What we didn't talk about
- Protein function prediction based on protein
structure
- Coming later