BCB 444/544 - PowerPoint PPT Presentation

About This Presentation
Title:

BCB 444/544

Slides: 31
Provided by: dobbslabG
Transcript and Presenter's Notes

1
BCB 444/544
Lecture 16: Protein Function Prediction (Oct 1)
  • Thanks to Drena Dobbs for many borrowed and
    modified PPTs

2
Required Reading (before lecture)
  • Wed Oct 1 - for Lecture 16
      • Chp 7
  • Fri Oct 3 - for Lecture 17
      • Chp 8

3
Homework Assignments
  • HW3 on HMMs is posted on the course webpage under
    homework
      • Due Oct 6th by 5pm
  • 544 Students (and any in 444 who want to do a
    project)
      • Extra HW1 posted
      • Due Friday Oct 3rd by 11AM

4
Chp 7 - Protein Motifs & Domain Prediction
  • SECTION II: SEQUENCE ALIGNMENT
  • Xiong Chp 7
  • Protein Motifs and Domain Prediction
  • Motif Discovery in Unaligned Sequences
  • Identification of Motifs & Domains in MSAs
  • Motif & Domain Databases Using Regular
    Expressions
  • Motif & Domain Databases Using Statistical Models
  • Protein Family Databases
  • Sequence Logos

5
Protein Function
  • Proteins are the primary molecules responsible
    for carrying out cellular functions
  • Most enzymes that catalyze chemical reactions are
    proteins (but some are RNAs!)
  • Proteins have complex structures that are
    critical for their functions

Protein structure for dystrophin, encoded by the
largest known gene in humans

6
Protein Structure: 4 levels of organization

7
Key Aspects of Protein Function: Localization &
Interactions
  • Protein localization - function depends on
    proteins being in the right place at the right time!
  • Protein interactions - function depends on
    proteins interacting with the correct partners
    inside cells!

8
Protein Sequence-Structure-Function
  • Amino acid sequence determines protein structure
  • But some proteins need help folding
    ("chaperones") in vivo
  • Proteins fold to a single "native" structure
    (under a specific set of conditions)
  • Protein structure determines function
  • But level, timing & location of expression are
    important
  • Interactions with other proteins, DNA, RNA,
    small ligands are also very important!!
  • PROBLEMS
  • We don't know the "folding code" that determines
    how proteins fold!
  • We don't know the "recognition code" that
    determines how proteins find and bind their
    correct partners!

9
Motifs & Domains
  • Motif - short conserved sequence pattern
      • Associated with distinct function in protein or
        DNA
      • Avg 10 residues (usually 6-20 residues)
      • e.g., zinc finger motif - in protein
      • e.g., TATA box - in DNA
  • Domain - "longer" conserved sequence pattern,
    defined as an independent functional and/or
    structural unit
      • Avg 100 residues (range from 40-700 in
        proteins)
      • e.g., kinase domain or transmembrane domain - in
        protein
  • Domains may (or may not) include motifs

10
2 Approaches for Representing "Consensus"
Information in Motifs & Domains
  • Regular expression
  • Statistical model

11
Regular Expressions
  • Reduce information from MSA
  • e.g., protein phosphorylation site motif:
    [ST]-X-[RK]
  • Symbols represent specific or unspecified
    residues, spaces, etc.
  • 2 mechanisms for matching
      • Exact
      • "Fuzzy" (inexact, approximate) - flexible, more
        permissive to detect "near matches"
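
The two matching mechanisms can be sketched in Python for the phosphorylation-site motif above, written in bracket notation as [ST]-X-[RK]. This is a minimal illustration, not how any particular database implements matching: the function names are made up for this example, and "fuzzy" is taken here to mean "allow up to a fixed number of mismatched positions", which is only one of several possible definitions.

```python
import re

# The slide's phosphorylation-site motif [ST]-X-[RK]; "." plays the role of X.
PATTERN = r"[ST].[RK]"

def exact_matches(seq, pattern=PATTERN):
    """Return (start, text) for every exact match, overlapping ones included."""
    # Wrapping the pattern in a lookahead lets overlapping matches be reported.
    finder = re.compile("(?=(" + pattern + "))")
    return [(m.start(), m.group(1)) for m in finder.finditer(seq)]

def fuzzy_matches(seq, allowed, max_mismatch=1):
    """Return (start, text, mismatches) for windows with few enough mismatches.

    allowed holds one set of permitted residues per motif position;
    None means any residue is accepted at that position.
    """
    width = len(allowed)
    hits = []
    for i in range(len(seq) - width + 1):
        window = seq[i:i + width]
        mismatches = sum(
            1 for res, ok in zip(window, allowed)
            if ok is not None and res not in ok
        )
        if mismatches <= max_mismatch:
            hits.append((i, window, mismatches))
    return hits

# The same motif as per-position residue sets, for the fuzzy matcher.
ALLOWED = [set("ST"), None, set("RK")]

if __name__ == "__main__":
    toy = "MSAQKTLR"  # made-up sequence
    print(exact_matches(toy))
    print(fuzzy_matches(toy, ALLOWED, max_mismatch=1))
```

With max_mismatch=0 the fuzzy matcher reduces to exact matching; raising it trades specificity for the ability to pick up near matches.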

12
Statistical Models
  • Includes probability information derived from MSA
      • PSSM (position-specific scoring matrix)
      • Profile
      • HMM (hidden Markov model)
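
As a rough illustration of "probability information derived from MSA", the sketch below builds a log-odds PSSM from a tiny ungapped alignment and scores a segment against it. It assumes a uniform background over the 20 amino acids and a simple additive pseudocount; both are simplifications chosen for this example, and the function names are not from any library.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(aligned, pseudocount=1.0):
    """Build a log2-odds PSSM from equal-length, ungapped sequences.

    Assumes a uniform background frequency (1/20 per residue).
    """
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("all sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)
    total = len(aligned) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for col in range(length):
        counts = Counter(s[col] for s in aligned)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm

def score(pssm, segment):
    """Sum the per-column log-odds scores for a segment of matching length."""
    return sum(col[res] for col, res in zip(pssm, segment))

if __name__ == "__main__":
    toy_alignment = ["SAK", "TGR", "SLK", "TPR"]  # made-up aligned motif instances
    pssm = build_pssm(toy_alignment)
    print(score(pssm, "SAK"), score(pssm, "AAA"))
```

Unlike a regular expression, the matrix keeps how often each residue was seen in each column, so a segment gets a graded score rather than a yes/no answer.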

13
Motif Discovery in Unaligned Sequences
  • Expectation Maximization
  • Gibbs Sampling

14
Expectation Maximization
  • Generate "random" alignment of all sequences,
    derive PSSM, iteratively match individual
    sequences to PSSM to edit & improve it
  • Problems? Can hit a local optimum (premature
    convergence)
  • Sensitive to initial alignment
  • MEME - Multiple EM for Motif Elicitation -
    modified EM, avoids local optimum issues
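
The EM loop described above can be sketched as follows. This is a simplified version that assumes exactly one motif occurrence per sequence, a uniform background, and a fixed number of iterations; it is not MEME's algorithm, and the function name and parameters are chosen for this example only. Because it starts from a random alignment, it can converge to a local optimum, exactly as the slide warns.

```python
import random

def em_motif(seqs, width, iterations=50, seed=0, pseudocount=0.5):
    """Toy EM motif finder: returns (profile, best start per sequence)."""
    rng = random.Random(seed)
    alphabet = sorted(set("".join(seqs)))
    # "Random" initial alignment: all weight on one random start per sequence.
    weights = []
    for s in seqs:
        n_starts = len(s) - width + 1
        start = rng.randrange(n_starts)
        weights.append([1.0 if i == start else 0.0 for i in range(n_starts)])

    for _ in range(iterations):
        # M-step: derive the profile (a probability PSSM) from weighted counts.
        profile = [{a: pseudocount for a in alphabet} for _ in range(width)]
        for s, w in zip(seqs, weights):
            for i, wi in enumerate(w):
                if wi == 0.0:
                    continue
                for j in range(width):
                    profile[j][s[i + j]] += wi
        for col in profile:
            total = sum(col.values())
            for a in col:
                col[a] /= total
        # E-step: match each sequence to the profile and update the
        # probability of every possible start position.
        weights = []
        for s in seqs:
            likes = []
            for i in range(len(s) - width + 1):
                p = 1.0
                for j in range(width):
                    p *= profile[j][s[i + j]]
                likes.append(p)
            z = sum(likes)
            weights.append([p / z for p in likes])

    best = [max(range(len(w)), key=w.__getitem__) for w in weights]
    return profile, best

if __name__ == "__main__":
    toy = ["ACGTTGACA", "TTGACAGGG", "CCTTGACAA"]  # made-up sequences
    profile, starts = em_motif(toy, width=6)
    print(starts)
```

Running it with several different seeds and keeping the best-scoring result is one common way to reduce the sensitivity to the initial alignment.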

15
Gibbs Sampling
  • Generate "trial" PSSM from random alignment
    first, as in EM, but leave one sequence out of
    the alignment, then match the PSSM to the
    left-out sequence to choose its new position;
    repeat, leaving out a different sequence each time
  • Gibbs Sampler - web-based motif search via Gibbs
    sampling
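
A minimal sketch of the leave-one-out idea, assuming one motif occurrence per sequence and a uniform background. It is meant only to show the sampling step and is not the algorithm behind the Gibbs Sampler web server; the function name and defaults are invented for this example.

```python
import random

def gibbs_motif(seqs, width, sweeps=200, seed=0, pseudocount=0.5):
    """Toy Gibbs sampler: returns one motif start position per sequence."""
    rng = random.Random(seed)
    alphabet = sorted(set("".join(seqs)))
    # Random initial alignment: one start position per sequence.
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]

    for _ in range(sweeps):
        for held_out in range(len(seqs)):
            # Trial PSSM from every sequence except the held-out one.
            profile = [{a: pseudocount for a in alphabet} for _ in range(width)]
            for k, (s, st) in enumerate(zip(seqs, starts)):
                if k == held_out:
                    continue
                for j in range(width):
                    profile[j][s[st + j]] += 1.0
            for col in profile:
                total = sum(col.values())
                for a in col:
                    col[a] /= total
            # Score every window of the held-out sequence against the PSSM.
            s = seqs[held_out]
            window_weights = []
            for i in range(len(s) - width + 1):
                p = 1.0
                for j in range(width):
                    p *= profile[j][s[i + j]]
                window_weights.append(p)
            # Sample (rather than pick the maximum) the new start position.
            starts[held_out] = rng.choices(
                range(len(window_weights)), weights=window_weights
            )[0]
    return starts

if __name__ == "__main__":
    toy = ["ACGTTGACA", "TTGACAGGG", "CCTTGACAA"]  # made-up sequences
    print(gibbs_motif(toy, width=6))
```

Sampling the new position in proportion to its score, instead of always taking the best one, is what lets the procedure escape some of the local optima that trap plain EM.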

16
Practical Advice

17
(No Transcript)

18
Motif & Domain Databases
  • Based on regular expressions
      • Prosite (Interpro)
      • Emotif
      • Limitation: these don't take probability info
        into account
  • Based on statistical models
      • PRINTS
      • BLOCKS
      • ProDom
      • Pfam
      • SMART
      • CDART
      • Reverse PSI-BLAST
  • READ your textbook & try some of these at home:
    there are distinct advantages/disadvantages
    associated with each
  • TAKE HOME LESSON
      • Always try several methods!
        (not just one!)

19
Prosite
  • Database of protein families and domains
  • Attempts to derive a signature for each protein
    family or domain that distinguishes the protein
    from all others
  • Currently contains over 1000 signatures
  • Uses both patterns (regular expressions) and
    profiles

20
Prosite Patterns
  • Start with an MSA of related sequences; pay
    special attention to things like active sites and
    residues that bind a ligand
  • Try to find a conserved 4-5 amino acid sequence
    that includes these functional sites
  • Scan Swiss-Prot DB with this sequence
      • If Swiss-Prot hits are specific for the family,
        stop
      • If not specific, lengthen the pattern, and search
        again
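
The scan-and-lengthen loop above can be sketched like this. Everything here is a toy stand-in: the "database" is a small dictionary rather than Swiss-Prot, the MSA is assumed to be ungapped, "lengthen" is taken to mean "add one alignment column on each side", and the function names are invented for the example.

```python
import re

def column_class(msa, col):
    """Regex for one alignment column: a letter, or a [..] class if it varies."""
    residues = sorted({s[col] for s in msa})
    return residues[0] if len(residues) == 1 else "[" + "".join(residues) + "]"

def build_pattern(msa, start, end):
    """Regex covering alignment columns start..end-1."""
    return "".join(column_class(msa, c) for c in range(start, end))

def refine_pattern(msa, family_ids, database, start, end):
    """Lengthen the pattern until every database hit is a family member.

    Returns (pattern, hits, false_hits). Stops early if the pattern already
    spans the whole alignment, even if false hits remain.
    """
    n_cols = len(msa[0])
    while True:
        pattern = build_pattern(msa, start, end)
        hits = {name for name, seq in database.items() if re.search(pattern, seq)}
        false_hits = hits - family_ids
        if not false_hits or (start == 0 and end == n_cols):
            return pattern, hits, false_hits
        # Not specific yet: lengthen by one column on each side and rescan.
        start = max(0, start - 1)
        end = min(n_cols, end + 1)

if __name__ == "__main__":
    msa = ["GASCKT", "GASCRT", "GTSCKT"]  # made-up ungapped alignment
    database = {                          # made-up stand-in for Swiss-Prot
        "famA": "MMGASCKTLL",
        "famB": "QQGASCRTPP",
        "famC": "GTSCKTAA",
        "other1": "WWASCKWW",
    }
    print(refine_pattern(msa, {"famA", "famB", "famC"}, database, 2, 5))
```

In this toy run the 3-residue core also hits an unrelated sequence, so the pattern is extended once, after which only family members match.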

21
Prosite Profiles
  • Prosite Patterns attempt to find a very short
    signature
  • Prosite Profiles attempt to characterize the
    family or domain over the entire length
  • Issue: a profile describes both conserved and
    non-conserved regions, so it is possible to match
    the profile but not be a member of the family
  • Prosite tests its profiles to try to avoid this
    problem

22
Emotif
  • More than 170,000 highly specific and sensitive
    protein sequence motifs
  • Uses regular expressions
  • Based on MSA of related proteins
  • Enumerates all possible motifs, not just one per
    family or domain
  • Sensitivity is defined by how many of the
    sequences in the MSA fit the motif
  • Specificity is based on how often you would
    expect to see the motif in randomly generated
    sequences
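
The two definitions above can be made concrete with a short sketch. It assumes a uniform, independent background over 20 amino acids when estimating how often a motif would match a random sequence window; Emotif's actual calculation may differ, and the function names are invented for this example.

```python
import re

def sensitivity(pattern, msa_sequences):
    """Fraction of the sequences in the MSA that contain the motif."""
    hits = sum(1 for s in msa_sequences if re.search(pattern, s))
    return hits / len(msa_sequences)

def random_match_probability(allowed_per_position, alphabet_size=20):
    """Chance that one random window matches the motif.

    Assumes residues are independent and uniformly distributed, so each
    position contributes (number of allowed residues) / alphabet_size.
    A lower value means a more specific motif.
    """
    p = 1.0
    for allowed in allowed_per_position:
        p *= len(allowed) / alphabet_size
    return p

if __name__ == "__main__":
    family = ["AASAK", "TLRGG", "GGGGG", "QSPKQ"]  # made-up sequences
    print(sensitivity(r"[ST].[RK]", family))
    any_residue = set("ACDEFGHIKLMNPQRSTVWY")
    print(random_match_probability([set("ST"), any_residue, set("RK")]))
```

Longer or more restrictive motifs drive the random-match probability down (more specific) but usually fit fewer family members (less sensitive), which is the trade-off behind enumerating many candidate motifs.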

23
PRINTS
  • Collection of fingerprints: a fingerprint is a set
    of motifs that can predict the occurrence of other
    motifs
  • Uses an MSA with only a small number of sequences
    and attempts to define fingerprints
  • These fingerprints are used to scan their
    database for more sequences containing the
    fingerprint
  • Rebuild the fingerprints, and iterate until the
    fingerprint converges
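
The scan, rebuild, and iterate cycle can be sketched as a fixed-point loop. The build and scan functions below are deliberately crude placeholders (per-column residue sets, and "at most one mismatching column"), chosen only so the loop can run; they are assumptions for this example and not how PRINTS derives or scores fingerprints.

```python
def build(members):
    """Toy fingerprint: the set of residues seen in each column."""
    return [set(column) for column in zip(*members)]

def scan(fingerprint, database):
    """Toy scan: sequences with at most one column outside the fingerprint."""
    return [
        s for s in database
        if len(s) == len(fingerprint)
        and sum(res not in col for res, col in zip(s, fingerprint)) <= 1
    ]

def iterate_fingerprint(seed, database, max_rounds=20):
    """Rebuild and rescan until the member set stops changing."""
    members = frozenset(seed)
    for _ in range(max_rounds):
        fingerprint = build(members)
        updated = members | frozenset(scan(fingerprint, database))
        if updated == members:  # converged: no new sequences were found
            return fingerprint, members
        members = updated
    return build(members), members

if __name__ == "__main__":
    database = ["ACDE", "ACDF", "ACEF", "GGGG"]  # made-up equal-length sequences
    fingerprint, members = iterate_fingerprint(["ACDE"], database)
    print(fingerprint, sorted(members))
```

In this toy run each round admits one more sequence, which widens the fingerprint enough to admit the next, until a round adds nothing new.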

24
Protein Family Databases
  • In addition to databases of "related" protein
    sequences, based on shared motifs or domains
    (Pfam, BLOCKS, CDART), some databases "cluster"
    sequences into families based on near full-length
    sequence comparisons
  • COGs - Clusters of Orthologous Groups (at NCBI)
      • Mostly prokaryotic sequences
      • KOG - newer eukaryotic version
      • COGnitor - software to search database
  • ProtoNet - also clusters of homologous protein
    sequences
      • Advantages: tree-like hierarchical structure
      • Provides GO (gene ontology) annotations
      • Provides InterPro keywords

25
Pfam
  • Contains over 10,000 protein families
  • Pfam-A: high quality, manually curated
  • Pfam-B: automatically generated, comprehensive
    collection
  • Uses 2 profile HMMs to describe each protein
    family
      • Why 2?
      • One for global matches, one for local matches

26
CDART
  • Conserved Domain Architecture Retrieval Tool
  • Given a query sequence:
      • Identifies functional domains
      • Lists proteins with a similar domain architecture
        (sequential order of conserved domains)
  • Can detect remote homologs: domains are
    conserved, but sequence of each domain may vary
    quite a bit
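
Comparing proteins by domain architecture rather than by residue sequence can be illustrated with a small sketch. Each protein is reduced to an ordered list of domain names, and a candidate counts as "similar" if it contains the query's domains in the same order. That similarity rule, the function names, and the protein entries are all assumptions made for this example; CDART's own scoring is not described in the slides.

```python
def shares_architecture(query_domains, candidate_domains):
    """True if the query's domains occur in the candidate in the same order.

    Other domains may be interspersed in the candidate.
    """
    remaining = iter(candidate_domains)
    # "d in remaining" advances the iterator, so order is enforced.
    return all(d in remaining for d in query_domains)

def find_similar(query_domains, architectures):
    """Names of proteins whose architecture contains the query's, in order."""
    return sorted(
        name for name, domains in architectures.items()
        if shares_architecture(query_domains, domains)
    )

if __name__ == "__main__":
    architectures = {  # made-up proteins and domain orders
        "P1": ["SH3", "SH2", "Kinase"],
        "P2": ["SH2", "SH3", "Kinase"],
        "P3": ["PH", "SH3", "SH2", "Kinase", "C1"],
        "P4": ["Kinase"],
    }
    print(find_similar(["SH3", "SH2", "Kinase"], architectures))
```

Because only domain identity and order are compared, two proteins can match even when the residues within each domain have diverged considerably, which is why this kind of search can pick up remote homologs.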

27
PIRSF
  • Protein Information Resource SuperFamily
    classification system
  • Network classification system based on
    evolutionary relationships of whole proteins
  • Advantages
      • Allows for both generic and specific functions to
        be assigned
      • Allows for classification of proteins without
        well-defined domains

28
PIRSF
  • Multiple levels of classification
      • Superfamily
          • one or more common domains
      • Homeomorphic family
          • proteins that are homologous, with full-length
            sequence similarity AND a common domain
            architecture
      • Homeomorphic subfamily
          • functional specialization

29
What we didn't talk about
  • Machine learning!
      • Given a protein sequence, predict the function
      • Many, many methods exist, no clear favorite
      • Most are specific for some type of function,
        e.g., kinase
  • In lab: Jafa metaserver - sends your sequence to
    5 different prediction servers and combines the
    results into a single prediction

30
What we didn't talk about
  • Protein function prediction based on protein
    structure
      • Coming later