Functional Analysis of Proteins and Proteomes - PowerPoint PPT Presentation

Slides: 161
Provided by: douglasl7
Transcript and Presenter's Notes

Title: Functional Analysis of Proteins and Proteomes


1
Functional Analysis of Proteins and Proteomes
  • CSB2003 Tutorial
  • Steve Bennett, Ph.D.
  • steve_at_bennett.org

2
Introduction
  • Although genetic material contains all the
    information required for cellular function, DNA
    itself does not carry out much of the work in cells.
  • Rather, it is the products of those genes,
    proteins and sometimes regulatory and catalytic
    RNAs, that carry out the chemical and mechanical
    work in biological systems.

3
Central Dogma in Biology
  • DNA
  • RNA (sequence, structure)
  • Protein (sequence, structure)

4
Introduction
  • With the completion of numerous genome projects,
    resources and focus are shifting from genomes to
    proteomes.
  • Once researchers have an accurate collection of
    gene sequences, the next question is what these
    genes do.

5
Introduction
  • Although there are numerous definitions of
    function with respect to proteins, here we
    define it as what a particular gene product, or
    protein, does in the cell.
  • Examples
  • Molecular motor (kinesin, myosin)
  • Zinc-finger transcription factor

6
Introduction
  • In this tutorial, I will first give a general
    protein introduction, followed by historical and
    current computational approaches for assigning
    function to a protein.
  • Background
  • Overview of some selected algorithms and
    approaches
  • Demos and software examples

7
Forming a Peptide Bond
Creates the Primary Structure, or protein sequence
8
Polypeptide Chain
The chemical nature of each R group determines
which amino acid occupies each position in the peptide sequence
9
Translation
10
Planar Peptide Bond
11
Alpha Helix Structure
12
Alpha Helix End-On
13
Alpha Helix Variable Pitch
14
Anti-Parallel Beta Sheets
15
Corrugated Beta Sheets
16
Beta Turn at End of Anti-Parallel Sheet
17
Protein Data Bank (PDB) Growth
  • 19,225 released atomic coordinate entries
  • 17,315 proteins, peptides, and viruses
  • 1,892 nucleic acids, protein-nucleic acid
    complexes

18
Structural Challenges
  • Compare all known structures to each other
  • Classify and organize all structures in a
    biological way
  • Find common folding patterns and structural
    motifs
  • Compute evolutionary distances between protein
    structures
  • Study interactions between structures and other
    molecules (Protein Docking)
  • Use known structures to predict structure from
    sequence (Protein Threading)
  • Many more ...

19
Classification of Protein Structures
  • Class
  • Similar secondary structure content
  • All-α, all-β, α+β, α/β, etc.
  • Fold (Architecture)
  • Major structural similarity
  • SSEs in similar arrangement
  • globin-like fold, TIM barrel fold
  • Superfamily (Topology)
  • Probable common ancestry
  • globins, phycocyanins
  • Family
  • Clear evolutionary relationship
  • Sequence similarity usually > 25%

20
Class
Fold / Architecture
Superfamily
21
Classes of Protein Structures
  • Mainly α
  • Mainly β
  • α/β
  • Parallel β sheets, β-α-β units
  • α+β
  • Anti-parallel β sheets, segregated α and β
    regions
  • helices mostly on one side of sheet

22
Classes of Protein Structures
  • Others
  • Multi-domain, membrane and cell surface, small
    proteins, peptides and fragments, designed
    proteins

23
Folds / Architectures
  • α/β and α+β
  • Closed
  • Barrel
  • Roll, ...
  • Open
  • Sandwich
  • Clam, ...
  • Mainly α
  • Bundle
  • Non-Bundle
  • Mainly β
  • Single sheet
  • Roll
  • Barrel
  • Clam
  • Sandwich
  • Prism
  • 4/6/7/8 Propeller
  • Solenoid

24
e.g., the TIM Barrel Fold
25
Growth in PDB Folds
Gold: old folds. White: new folds.
26
Databases of Folds
  • SCOP
  • Murzin AG, Brenner SE, Hubbard T, Chothia C
  • Structural Classification of Proteins
  • Manual assembly by inspection
  • All nodes are annotated (eg. All-alpha,
    alpha/beta)
  • Structural similarity search using 3dSearch
    (Singh and Brutlag)
  • CATH
  • Dr. C.A. Orengo, Dr. A.D. Michie, Dr. S. Jones,
    Dr. M.B. Swindells, Dr. G. Hutchinson, Dr. A.
    Martin, Dr. D.T. Jones, Prof. J.M. Thornton
  • Class - Architecture - Topology - Homologous
    Superfamily
  • Manual classification at Architecture level
  • Automated topology classification using the SSAP
    algorithm
  • No structural similarity search

27
Databases of Folds
  • FSSP
  • L. Holm and C. Sander
  • Fully automated using the DALI algorithm (Holm
    and Sander)
  • No internal node annotations
  • Structural similarity search using DALI
  • Pclass
  • A. Singh, X. Liu, J. Chang, D. Brutlag
  • Fully automated using the LOCK and 3dSearch
    algorithms
  • All internal nodes automatically annotated with
    common terms
  • JAVA based classification browser
  • Structural similarity search using 3dSearch

28
Protein Structure Prediction
Sequence of 984 amino acids
PISPIETVPVKLKPGMDGPKVKQWPLTEEKIKALVEICTEMEKEGKISKI
G PENPYNTPVFAIKKKDSTKWRKLVDFRELNKRTQDFWEVQLGIPHPAG
LKK KKSVTVLDVGDAYFSVPLDEDFRKYTAFTIPSINNETPGIRYQYNV
LPQGW KGSPAIFQSSMTKILEPFKKQNPDIVIYQYMDDLYVGSDLEIGQ
HRTKIEE LRQHLLRWGLTTPDKKHQKEPPFLWMGYELHPDKWTVQPIVL
PEKDSWTVN DIQKLVGKLNWASQIYPGIKVRQLCKLLRGTKALTEVIPL
TEEAELELAEN REILKEPVHGVYYDPSKDLIAEIQKQGQGQWTYQIYQE
PFKNLKTGKYARM RGAHTNDVKQLTEAVQKITTESIVIWGKTPKFKLPI
QKETWETWWTEYWQA TWIPEWEFVNTPPLVKLWYQLEKEPIVGAETFYV
DGAANRETKLGKAGYVT NKGRQKVVPLTNTTNQKTELQAIYLALQDSGL
EVNIVTDSQYALGIIQAQP DKSESELVNQIIEQLIKKEKVYLAWVPAHK
GIGGNEQVDKLVSAGI PISPIETVPVKLKPGMDGPKVKQWPLTEEKIKA
LVEICTEMEKEGKISKIG PENPYNTPVFAIKKKDSTKWRKLVDFRELNK
RTQDFWEVQLGIPHPAGLKK KKSVTVLDVGDAYFSVPLDEDFRKYTAFT
IPSINNETPGIRYQYNVLPQGW KGSPAIFQSSMTKILEPFKKQNPDIVI
YQYMDDLYVGSDLEIGQHRTKIEE LRQHLLRWGLTTPDKKHQKEPPFLW
MGYELHPDKWTVQPIVLPEKDSWTVN DIQKLVGKLNWASQIYPGIKVKQ
LCKLLRGTKALTEVIPLTEEAELELAEN REILKEPVHGVYYDPSKDLIA
EIQKQGQGQWTYQIYQEPFKNLKTGKYARM RGAHTNDVKQLTEAVQKIT
TESIVIWGKTPKFKLPIQKETWETWWTEYWQA TWIPEWEFVNTPPLVKL
WYQ
HIV reverse transcriptase
3D coordinates of 7404 atoms
29
Abstracting the problem
3D coords of C-alpha backbone
3D coords of all atoms
3D coords of secondary structure elements
C-alpha groups
30
Defining the secondary structure of a protein
sequence
Alpha helix and anti-parallel beta sheet
31
The Secondary Structure Prediction Problem
  • Given a protein sequence
  • NWVLSTAADMQGVVTDGMASGLDKD...
  • Predict a secondary structure sequence
  • LLEEEELLLLHHHHHHHHHHLHHHL...
  • 3-state problem: {A,R,N,D,C,Q,E,G,H,I,L,K,M,F,P,S,T,W,Y,V}^n →
    {L,H,E}^n

32
Amphipathic helix: End view
33
Amphipathic helix: backbone and sidechains
34
Amphipathic helix: hydrophobic sidechains
35
Amphipathic helix: hydrophobic sidechains
36
Amphipathic helix: sidechain periodicity
Sequence: NLAKMVVKTAEAILKD
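The sidechain periodicity can be illustrated with a small helical-wheel calculation. This is only a sketch under stated assumptions: an ideal alpha helix with 3.6 residues per turn (100 degrees per residue) and an illustrative hydrophobic residue set; the function names are hypothetical and none of this comes from the slides.

```python
# Helical-wheel sketch: project each residue of a helix onto a circle.
# Assumptions (not from the slides): an ideal alpha helix with 100 degrees
# of rotation per residue, and a simple hydrophobic residue set.
import math

DEGREES_PER_RESIDUE = 100.0      # 360 degrees / 3.6 residues per turn
HYDROPHOBIC = set("AVLIMFWC")    # illustrative choice only


def wheel_angles(sequence):
    """Return the wheel angle (0 to 360 degrees) of every residue."""
    return [(i * DEGREES_PER_RESIDUE) % 360.0 for i in range(len(sequence))]


def hydrophobic_clustering(sequence):
    """Length of the mean direction vector of the hydrophobic residues.

    Values near 1 mean the hydrophobic sidechains point the same way
    (as on an amphipathic helix); values near 0 mean they are spread
    evenly around the helix axis.
    """
    angles = [math.radians(angle)
              for angle, aa in zip(wheel_angles(sequence), sequence)
              if aa in HYDROPHOBIC]
    if not angles:
        return 0.0
    x = sum(math.cos(a) for a in angles) / len(angles)
    y = sum(math.sin(a) for a in angles) / len(angles)
    return math.hypot(x, y)


if __name__ == "__main__":
    seq = "NLAKMVVKTAEAILKD"
    for aa, angle in zip(seq, wheel_angles(seq)):
        label = "hydrophobic" if aa in HYDROPHOBIC else ""
        print(f"{aa} {angle:5.0f} {label}")
    print("clustering:", round(hydrophobic_clustering(seq), 2))
```

Residues whose angles fall on the same side of the wheel face the same side of the helix, which is why hydrophobic positions recur every three to four residues in the sequence.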
37
Structural Correlations in Alpha-Helices
38
Structural Correlations in Beta-Strands
39
Functional Analysis of Proteins
40
Sequence methods
  • The earliest general approach for assigning
    function to a protein sequence was to compare the
    sequence of unknown function to a sequence (or
    sequences) of known function.

Seqs of known function
?
47
Sequence methods
  • One such method for doing this is sequence
    alignment in which two sequences are aligned to
    determine how similar they are to one another.

48
Sequence methods
  • If an alignment is of sufficient quality, one
    might assign the function of the known sequence
    to be that of the unknown sequence as well.

Score(alignment) > threshold
Function: Zn finger transcription factor
Function assignment
49
Sequence Alignment
  • We'll briefly discuss two different alignment
    approaches, pairwise sequence alignment and
    multiple sequence alignment, before moving on to
    other topics.
  • Alignments are most often in one of two forms:
    local or global.

50
Amino Acid Similarity
  • To discuss alignment methods, we first need to
    discuss methods for determining if characters in
    different sequences are similar.
  • Identity
  • Biochemical properties
  • PAM, BLOSUM matrices

51
PAM Matrices
  • Percent Accepted Mutation Matrices (Dayhoff)
  • Examine amino acid changes in groups of related
    proteins with at least 85% sequence similarity.
  • The differing amino acids are assumed to be
    accepted over evolutionary time.
  • Counts are normalized and used to estimate a
    matrix representing all possible amino acid
    changes.

52
PAM Matrices
53
BLOSUM Matrices
54
Dot Matrices
  • Dot matrices create an n x m matrix from the
    two sequences to be compared.
  • A match is scored in the matrix by strict
    character identity, chemical similarity of the
    amino acids, or the use of a symbol comparison
    matrix such as PAM or BLOSUM.
  • Mark the matrix location of each match with a
    dot. Connected regions of similarity will
    appear as diagonal lines.
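A minimal sketch of the identity-based dot matrix described above. It scores matches by strict character identity only (no PAM or BLOSUM lookup), and the two short sequences and function names are chosen purely for illustration.

```python
# Dot matrix by strict identity: one row per residue of seq_a,
# one column per residue of seq_b.

def dot_matrix(seq_a, seq_b):
    """Return an n x m grid of booleans, True where seq_a[i] == seq_b[j]."""
    return [[a == b for b in seq_b] for a in seq_a]


def render(seq_a, seq_b, matrix):
    """Draw the matrix as text, marking each match with '*'."""
    lines = ["  " + " ".join(seq_b)]
    for a, row in zip(seq_a, matrix):
        lines.append(a + " " + " ".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


if __name__ == "__main__":
    rows, cols = "AESCVVLV", "ADSCTFGVVLI"
    print(render(rows, cols, dot_matrix(rows, cols)))
```

In the printed grid, the run of `*` marks along a diagonal corresponds to the shared subsequence VVL.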

55
Dot Matrices
  • A D S C T F G V V L I
  • A
  • E º
  • S
  • C
  • V
  • V
  • L
  • V º

56
Dot Matrices
  • Drosophila SLIT

57
Dot Matrices: SLIT vs. itself
58
Dot Matrices

59
Dot Matrices
  • Improving signal-to-noise in dot matrices
  • Sliding Window
  • Scoring a match at position i, j in the matrix
    is not independent: downstream positions are
    considered as well. This helps screen out
    spurious matches in favor of meaningful local
    regions of similarity.
  • Variable Stringency
  • Used with the sliding window method: denotes
    how many characters in the window must match for
    a hit to be declared at position i, j.
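The sliding window and stringency filter can be sketched as follows. The parameter names `window` and `stringency` and their default values are assumptions made for this example, and matches are again scored by identity only.

```python
# Sliding-window dot matrix: a hit at (i, j) requires at least
# `stringency` identical positions among the `window` positions that
# start at (i, j) and run down the diagonal.

def windowed_dot_matrix(seq_a, seq_b, window=3, stringency=2):
    """Return the list of (i, j) start positions that qualify as hits."""
    hits = []
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            matches = sum(seq_a[i + k] == seq_b[j + k] for k in range(window))
            if matches >= stringency:
                hits.append((i, j))
    return hits


if __name__ == "__main__":
    a, b = "AESCVVLV", "ADSCTFGVVLI"
    print("2 of 3:", windowed_dot_matrix(a, b, window=3, stringency=2))
    print("3 of 3:", windowed_dot_matrix(a, b, window=3, stringency=3))
```

Raising the stringency removes isolated single-character matches and keeps only positions that sit inside a longer diagonal run.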

60
Dot Matrices
  • Advantages
  • Intuitive and Straightforward
  • Immediate visualization of similar subsequences
  • Limitations
  • Although related subsequences are easily seen, it
    is unclear what the best alignment is between
    them.
  • Difficult to assess the quality of different
    alignments: there is no scoring system

61
Dynamic Programming
  • Considerable improvement to the basic dot matrix
    approach: DP generates a provably-optimal
    alignment between a pair of sequences.
  • Produces a score which can be evaluated for
    statistical significance given the aligned
    sequences and conditions.
  • Allows for the inclusion of gaps without the
    extremely large number of computations required
    in any direct computation.
  • Can be used for both local and global alignments.

62
Dynamic Programming
  • As observed for dot matrices, DP uses a scoring
    system that favors identical and similar amino
    acids, and penalizes dissimilar amino acids and
    gaps.
  • Values for the scoring system are usually derived
    from amino acid substitution tables such as PAM
    or BLOSUM matrices. Each position in a potential
    alignment is evaluated according to these
    substitution tables.
  • The scores for all positions are then summed to
    generate an overall log-odds score for the
    alignment.

63
Dynamic Programming
  • Similar to the dot matrix algorithm, we construct
    an n x m matrix from the 2 sequences to
    be aligned, plus a gap row and gap column, allowing
    each sequence to begin with a gap if necessary.
  • Instead of marking a dot, we calculate a
    running best score that depends on the scores of
    the cells calculated previously. The matrix is
    built left to right, top to bottom.

64
Dynamic Programming
  • Specifically, given two sequences, p and q
  • p = p1 p2 ... pi ... pn
  • q = q1 q2 ... qj ... qm
  • then the score at each position i in sequence p
    and position j in sequence q (that is, the score
    Sij in each matrix cell) is given by
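The recurrence itself is not present in this transcript. A commonly used form for global alignment, given here as an assumption rather than as the slide's exact formula, is S[i][j] = max(S[i-1][j-1] + s(pi, qj), S[i-1][j] + gap, S[i][j-1] + gap). The sketch below applies it with a deliberately simple scoring scheme (match +1, mismatch -1, gap -1) instead of a PAM or BLOSUM table; the function name and parameters are hypothetical.

```python
# Global alignment by dynamic programming (Needleman-Wunsch style),
# using a simple match/mismatch/gap scheme chosen for illustration.

def global_alignment(p, q, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_p, aligned_q) for one optimal alignment."""
    n, m = len(p), len(q)
    # Row 0 and column 0 are the gap row and gap column.
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = i * gap
    for j in range(1, m + 1):
        S[0][j] = j * gap
    # Fill the matrix left to right, top to bottom.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if p[i - 1] == q[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + sub,   # align pi with qj
                          S[i - 1][j] + gap,       # gap in q
                          S[i][j - 1] + gap)       # gap in p
    # Trace back from the bottom-right cell to recover one alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if p[i - 1] == q[j - 1] else mismatch
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + sub:
            top.append(p[i - 1]); bottom.append(q[j - 1]); i -= 1; j -= 1
        elif i > 0 and S[i][j] == S[i - 1][j] + gap:
            top.append(p[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(q[j - 1]); j -= 1
    return S[n][m], "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    score, a, b = global_alignment("VDFS", "VET")
    print(score)
    print(a)
    print(b)
```

Several alignments can share the optimal score under this scheme, so the traceback shown returns just one of them; a substitution matrix would usually break such ties.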

65
Example VDFS and VET
72
Example VDFS and VET
  • VDFS
  • V E -T

73
Dynamic Programming
  • Smith-Waterman: the more widely used implementation
    for local alignments.
  • Software packages
  • BESTFIT (Smith-Waterman)
  • GAP (Needleman-Wunsch)
  • On the web: http://motif.stanford.edu/alion/

74
Dynamic Programming
  • Advantages
  • Alignments are optimal
  • Quantitative score associated with the alignments
  • Limitations
  • Costly in time and space: hardware required for
    database-sized searches (increasingly important
    for modern bioinformatics applications).

75
Rapid Database Searching
  • Goal 1: Execute with lower demands on time and
    space than dynamic programming.
  • Goal 2: Perform reasonably well using
    heuristics, as compared to the DP optimal
    solution.
  • Drawback: Resulting sequence alignments are not
    guaranteed to be optimal.

76
Rapid Database Searching FASTA
  • Dynamic Programming approaches match single
    characters at a time
  • FASTA matches groups of characters, called words
    or k-tuples, which are managed in a table.

ADCGPH
ADCGPH
ADCGPH
ADCGPH
77
Rapid Database Searching FASTA
Database sequences: db1, db2, db3
Assume k = 3: 20^3 = 8000 possible 3-character words.
Scan each database sequence, recording the
position of each 3-tuple in a lookup table of
size 8000 keyed on the 3-tuple
78
Rapid Database Searching FASTA
Database sequences db1, db2, and db3 each contain the 3-tuple AAC.
Assume k = 3: 20^3 = 8000 possible 3-character words.
  • Assume that the 3-tuple AAC occurs
  • at position 12 in db1
  • at position 52 in db2
  • at position 20 in db3

After scanning the database sequences and
building the lookup table, the table element
corresponding to AAC would look like
AAC → db1:12, db2:52, db3:20
79
Rapid Database Searching FASTA
AAC → db1:12, db2:52, db3:20
Suppose a query sequence, q, has the 3-tuple AAC
at position 60. The table returns the 3 database
sequences and the locations where the matching
tuple occurs in those sequences. Assuming this is
done for another 3-tuple, DFE, we might have
3-tuple   db1   q
AAC       12    60
DFE       56    104
80
Rapid Database Searching FASTA
AAC → db1:12, db2:52, db3:20
3-tuple   db1   q
AAC       12    60
DFE       56    104
Next, FASTA computes the offsets between the locations
of matched tuples. Here, we see that for AAC and
DFE, the offset is identical, equal to 48. This
indicates that these 3-tuples are in-phase, or
part of a larger locally-aligned region. FASTA
then rescores these local alignments using a
PAM250 matrix, takes the 10 highest-scoring regions
of identity, and performs a joining step in an
attempt to join the regions.
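The lookup table and offset calculation can be sketched as below. This is a simplified illustration and not the FASTA implementation: positions are 0-based, the table is a dictionary rather than a fixed array of 8000 slots, the function names are hypothetical, and the PAM250 rescoring and joining steps are omitted.

```python
# k-tuple lookup table and offset counting, in the spirit of the
# first stage of a FASTA-style search.
from collections import defaultdict


def build_lookup(database, k=3):
    """Map each k-tuple to the (sequence name, position) pairs where it occurs."""
    table = defaultdict(list)
    for name, seq in database.items():
        for pos in range(len(seq) - k + 1):
            table[seq[pos:pos + k]].append((name, pos))
    return table


def offset_counts(query, table, k=3):
    """Count, per database sequence, how many query k-tuples share each offset.

    offset = position in query - position in database sequence.
    Many tuples with the same offset lie on the same diagonal.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for qpos in range(len(query) - k + 1):
        for name, dpos in table.get(query[qpos:qpos + k], []):
            counts[name][qpos - dpos] += 1
    return counts


if __name__ == "__main__":
    database = {"db1": "GGAACWWWDFE"}
    table = build_lookup(database)
    counts = offset_counts("TTTTTAACYYYDFE", table)
    print(dict(counts["db1"]))
```

In this toy example both shared tuples (AAC and DFE) have the same offset, so they fall on one diagonal, which is the signal the offset step looks for.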
81
Rapid Database Searching BLAST
  • BLAST is a much faster algorithm than FASTA, and
    has been shown to be just as sensitive.
  • As such, BLAST is considerably more widely used.
  • Similar to FASTA in that it uses words, but the
    size is fixed at k = 3.

82
Rapid Database Searching BLAST
  • BLAST first extracts all overlapping 3-tuples
    from a query sequence.
  • Then, the tuples in the query are evaluated
    against the possible 8000 tuples using a BLOSUM
    matrix. This determines if inexact matches
    between query words and potential database words
    are above a certain threshold score. Those tuples
    remaining are assembled into a tree for rapid
    database search.

ADCGPH
ADC
DCG
CGP
GPH
83
Rapid Database Searching BLAST
  • Suppose we observed the tuple SEI in the query
    sequence.
  • Step 1. Score against all 8000 tuples, keeping
    only those that are above our predetermined
    scoring threshold.
  • SEI scored against SEI gives a score of 13 (S-S,
    E-E, I-I in the BLOSUM matrix)
  • SEI scored against SDI gives a score of 10
  • SEI scored against SDG gives a score of only 2.
  • Hence, if our cutoff score were 9, we would keep
    SEI and SDI, but not SDG when assembling the
    search tree.
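The thresholding in Step 1 can be reproduced with a few lines of code. The sketch below hard-codes only the substitution scores needed for this example; they are BLOSUM62 values, which is an assumption on my part since the slide does not say which BLOSUM matrix was used (the slide's scores of 13, 10, and 2 are consistent with it). A real implementation would score all 8000 candidate words against a full matrix, and the function names here are hypothetical.

```python
# Word neighborhood scoring for a single query 3-tuple.
# Only the BLOSUM62 entries required for the SEI example are included.
BLOSUM62_SUBSET = {
    ("S", "S"): 4, ("E", "E"): 5, ("I", "I"): 4,
    ("E", "D"): 2, ("I", "G"): -4,
}


def pair_score(a, b, matrix=BLOSUM62_SUBSET):
    """Look up a symmetric substitution score."""
    if (a, b) in matrix:
        return matrix[(a, b)]
    return matrix[(b, a)]


def word_score(word_a, word_b, matrix=BLOSUM62_SUBSET):
    """Sum the position-by-position substitution scores of two words."""
    return sum(pair_score(a, b, matrix) for a, b in zip(word_a, word_b))


def neighborhood(query_word, candidates, threshold, matrix=BLOSUM62_SUBSET):
    """Keep the candidate words that score at or above the threshold."""
    return [w for w in candidates
            if word_score(query_word, w, matrix) >= threshold]


if __name__ == "__main__":
    for candidate in ("SEI", "SDI", "SDG"):
        print(candidate, word_score("SEI", candidate))
    print(neighborhood("SEI", ["SEI", "SDI", "SDG"], threshold=9))
```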

84
Rapid Database Searching BLAST
  • Step 2. Once the possible matching tuples are
    stored in the tree, database sequences are
    searched for exact matches to these possible
    tuples.
  • Matches are examined for regions that are on the
    same diagonal and within some distance, A, of one
    another. These regions serve as starting points
    for a longer ungapped alignment between the
    words. These joined regions are then extended in
    each direction as long as the score is
    increasing.
  • PSI-BLAST Iterative BLAST approach that includes
    conservation information within a family of
    proteins as opposed to just between two proteins.

85
Multiple Sequence Alignments
  • So far, we have only discussed pairwise
    alignments, since they are the most commonly used
    and have optimal solutions.
  • Multiple sequence alignments are vitally
    important to understanding true evolutionary
    conservation between sequences in a family.

86
Multiple Sequence Alignments
  • Allows for the extraction of probes for new
    members of a family (motifs / patterns).
  • Helps identify the functionally important amino
    acids in a protein family. Amino acids not
    required for function or structural integrity
    will in general not be highly conserved within a
    family
  • VTDIAYRCGFSDSNHFSTLFRREFNWSPRDI
  • VTEIAYRCGFGDSNHFSTLFRREFNWSPRDI
  • VFQISHRCGFGSNAYFCDVFKRKYNMTPSQF
  • VFQISHRCGFGSNAYFCDAFKRKYGMTPSQF

87
Representations for Similarity and Alignments
  • Short, simple representations for conserved
    sequence information in MSAs make assigning new
    proteins to the family considerably easier.
  • Short representations can suggest functional and
    biological conclusions regarding why certain
    amino acids are conserved at certain positions.
  • Such representations can identify function in a
    protein that more global homology methods (such
    as BLAST) might miss.

88
Profiles
  • A highly conserved local region in an MSA is
    identified, then a profile (a type of PSSM) is
    constructed to describe it.
  • 20 x n matrix with each column describing the
    scores, or probabilities of different amino acids
    appearing at a given position.
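A minimal sketch of building such a 20 x n matrix from an ungapped alignment. The log-odds scoring against a uniform background, the pseudocount of 1, and the function names are all assumptions made for illustration; real profile methods typically use sequence weighting and background frequencies estimated from data.

```python
# Build a simple position-specific scoring matrix (PSSM) from an
# ungapped multiple sequence alignment and use it to score a segment.
import math

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def build_profile(alignment, pseudocount=1.0):
    """Return one dict per column mapping amino acid -> log-odds score (bits)."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("aligned sequences must have equal length")
    background = 1.0 / len(AMINO_ACIDS)   # uniform background (assumed)
    profile = []
    for col in range(length):
        counts = {aa: pseudocount for aa in AMINO_ACIDS}
        for seq in alignment:
            counts[seq[col]] += 1          # assumes no gap characters
        total = sum(counts.values())
        profile.append({aa: math.log2(counts[aa] / total / background)
                        for aa in AMINO_ACIDS})
    return profile


def score_segment(profile, segment):
    """Sum the column scores of a segment the same length as the profile."""
    return sum(column[aa] for column, aa in zip(profile, segment))


if __name__ == "__main__":
    msa = [
        "VTDIAYRCGFSDSNHFSTLFRREFNWSPRDI",
        "VTEIAYRCGFGDSNHFSTLFRREFNWSPRDI",
        "VFQISHRCGFGSNAYFCDVFKRKYNMTPSQF",
        "VFQISHRCGFGSNAYFCDAFKRKYGMTPSQF",
    ]
    profile = build_profile(msa)
    print(len(profile), "columns x", len(profile[0]), "amino acids")
    print(round(score_segment(profile, msa[0]), 2))
```

Fully conserved columns contribute the largest positive scores, so a candidate sequence that matches the conserved positions scores well even if it differs elsewhere.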

89
PROSITE patterns
  • Patterns (motifs, signatures, fingerprints) are
    short regular expression-like text strings that
    describe a conserved region.
  • PROSITE is a manually curated database of
    profiles and patterns.
  • Focuses on particular regions of family MSAs
    shown in the literature to be biologically
    important (usually catalytic sites, metal-binding
    sites, reduced cysteines, or ligand binding
    sites).

90
PROSITE patterns
  • Short conserved sequences from the MSA are then
    extracted as a core and used to search
    SWISS-PROT. If no additional sequences are found,
    the core is designated as the actual signature.
    If numerous false positives are picked up, then
    the core is increased in size until good
    discrimination is achieved, or until it is clear
    that good discrimination won't be possible.
  • C-x(15)-A-x(3,4)-G-x(3)-C-x(2)-G-x(8,9)-P-x(7)-C
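Because PROSITE patterns are regular-expression-like, they translate almost directly into a regular expression. The converter below is a sketch covering only a subset of the PROSITE syntax (single residues, x, bracketed sets, braced exclusions, and repeat counts); the function name is hypothetical and this is not PROSITE's own scanning software.

```python
# Translate a PROSITE-style pattern into a Python regular expression.
import re

# One pattern element: a residue, x, [allowed set] or {forbidden set},
# optionally followed by a repeat count such as (3) or (3,4).
ELEMENT = re.compile(
    r"([A-Z]|x|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Return a regex string equivalent to the given PROSITE-style pattern."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        m = ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = m.groups()
        if core == "x":
            core = "."                          # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"      # any residue except these
        repeat = ""
        if low is not None:
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(core + repeat)
    return "".join(parts)


if __name__ == "__main__":
    signature = "C-x(15)-A-x(3,4)-G-x(3)-C-x(2)-G-x(8,9)-P-x(7)-C"
    regex = prosite_to_regex(signature)
    print(regex)
    print(bool(re.search(prosite_to_regex("C-x(2)-G"), "AACKKGA")))
```

The resulting string can be passed to `re.search` or `re.finditer` to scan protein sequences for occurrences of the signature.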

91
Blocks
  • Blocks are short, ungapped conserved regions in
    multiple sequence alignments.

92
Blocks
  • They are created starting from one of two
    sources:
  • Unaligned sequences from PROSITE families
  • An existing MSA.
  • Since PROSITE's manual curation limits its size,
    the BLOCKS database currently includes families
    from PRINTS-S and InterPro in addition to PROSITE.

93
BLOCKS Contains Many Protein Families (Henikoff &
Henikoff, 1999)
94
Properties of eMOTIFs (http://emotif.stanford.edu/)
  • Discrete motifs that represent specific functions
  • Highly specific motifs for searching entire
    proteomes
  • Maintain sensitivity with multiple motifs
  • Generate motifs automatically from protein
    alignments
  • Resistant to sequence errors, misalignment, and
    misclassification
  • Robust with respect to protein subclasses
  • Generates structural motifs potential drug
    targets
  • Biological generalization from known examples

95
eMOTIFs (http://emotif.stanford.edu/)
fly..h...hst..krpfy.c
96
Generating Motifs from Aligned Protein Sequences
TEAESNMNDPVAEYQQYTDARQDLYELEVDYANLTEARENIAVLERDF
EEVTEAESNMNDLVSEYQQYTEVRANMNDLVAEYQQYSEAESNMNDL
VSEYQQYTEAREDLAALEKDYEEVTEAREDLAALERDYIEVSEARED
LAALEKDYEEVAEAREDLAALEKDYIEVSEAREDLAALEKDYEEVSE
AREDLAALERDYEEV
97
Generating Motifs from Aligned Protein Sequences
TEAESNMNDPVAEYQQYTDARQDLYELEVDYANLTEARENIAVLERDF
EEVTEAESNMNDLVSEYQQYTEVRANMNDLVAEYQQYSEAESNMNDL
VSEYQQYTEAREDLAALEKDYEEVTEAREDLAALERDYIEVSEARED
LAALEKDYEEVAEAREDLAALEKDYIEVSEAREDLAALEKDYEEVSE
AREDLAALERDYEEV
TEARENIAVLERDFEEV SDVESDNNDPVAEYIQL A A LYE V
ANY Q A S Q K
98
Generating Motifs from Aligned Protein Sequences
TEAESNMNDPVAEYQQYTDARQDLYELEVDYANLTEARENIAVLERDF
EEVTEAESNMNDLVSEYQQYTEVRANMNDLVAEYQQYSEAESNMNDL
VSEYQQYTEAREDLAALEKDYEEVTEAREDLAALERDYIEVSEARED
LAALEKDYEEVAEAREDLAALEKDYIEVSEAREDLAALEKDYEEVSE
AREDLAALERDYEEV
TEAREDLAALERDYEEV S K I A
99
Generating Motifs from Aligned Protein Sequences
TEAESNMNDPVAEYQQYTDARQDLYELEVDYANLTEARENIAVLERDF
EEVTEAESNMNDLVSEYQQYTEVRANMNDLVAEYQQYSEAESNMNDL
VSEYQQYTEAREDLAALEKDYEEVTEAREDLAALERDYIEVSEARED
LAALEKDYEEVAEAREDLAALEKDYIEVSEAREDLAALEKDYEEVSE
AREDLAALERDYEEV
TEARENIAVLERDFEEV SDVESDNNDPVAEYIQL A A LYE V
ANY Q A S Q K
100
Amino Acid Substitution Groups Based on Physical
Properties
  • Only permit groups of amino acids
    sharing some chemical or physical property

Group


AG

ST

PAGST

QN

QNED

KR

VLI

VLIM

FYW

KRH

DE
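The role of the substitution groups can be sketched as follows: each alignment column is replaced by the smallest allowed group that covers every residue seen in it, or by a wildcard when no group fits. This shows the idea only and is not the eMOTIF algorithm, which enumerates many candidate motifs and weighs sensitivity against specificity. The bracketed output notation and the function names are assumptions.

```python
# Derive one motif from an ungapped alignment using the allowed
# amino acid substitution groups listed above.
ALLOWED_GROUPS = ["AG", "ST", "PAGST", "QN", "QNED", "KR",
                  "VLI", "VLIM", "FYW", "KRH", "DE"]


def smallest_group(column, groups=ALLOWED_GROUPS):
    """Return the conserved residue, the smallest covering group, or None."""
    residues = set(column)
    if len(residues) == 1:
        return column[0]
    covering = [g for g in groups if residues <= set(g)]
    return min(covering, key=len) if covering else None


def motif_from_alignment(alignment):
    """Build a motif string: residue, [group], or '.' for each column."""
    parts = []
    for column in zip(*alignment):
        group = smallest_group(column)
        if group is None:
            parts.append(".")
        elif len(group) == 1:
            parts.append(group.lower())
        else:
            parts.append("[" + group.lower() + "]")
    return "".join(parts)


if __name__ == "__main__":
    print(motif_from_alignment(["KAV", "RAL", "KAW"]))
```

Restricting columns to these groups keeps a motif from matching arbitrary residue mixtures, which is what makes the resulting motifs specific.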



101
Allowable Amino Acid Substitution Groups
fly..h...hst..krpfy.c
102
(No Transcript)
103
Discovery of eMOTIFs (http://emotif.stanford.edu/)
104
Discovery of eMOTIFs (http://emotif.stanford.edu/)
105
  • Each red dot is an eMOTIF
  • Most specific eMOTIFs along pareto-optimal curve
  • High Sensitivity → Low Specificity
  • High Specificity → Low Sensitivity

106
(No Transcript)
107
(No Transcript)
108
(No Transcript)
109
Protein Function with eMOTIF Search
(http://emotif.stanford.edu/)
110
Protein Function with eMOTIF-Search
(http://emotif.stanford.edu/)
111
3MOTIFs and 3MATRICES (http://3motif.stanford.edu/)
112
(No Transcript)
113

Searched for 3est
Visualization Features
  • Conservation strength shading
  • Relative and overall solvent accessibilities per
    residue, and for the eMOTIF as a whole
  • Accessibility shading
  • Multiple display and manipulation options
114
Visualization Features
3est - cgg.lilv...wvilmvstaahc
115
(No Transcript)
116
(No Transcript)
117
(No Transcript)
118
(No Transcript)
119
(No Transcript)
120
(No Transcript)
121
(No Transcript)
122
(No Transcript)
123

3motif Pipeline Construction Query
124
(No Transcript)
125
eMotifs and SCOP
  • eMotifs were observed to correlate strongly with
    SCOP classification, even when global sequences
    were not overly similar.
  • eMotifs that were found to hit proteins in
    different SCOP locations were particularly
    interesting.

126
(No Transcript)
127
(No Transcript)
128
(No Transcript)
129
(No Transcript)
130
(No Transcript)
131
eMATRIX: Position-Specific Scoring Matrices
132
An eMATRIX (http://ematrix.stanford.edu/)
133
eMATRIX Scan (http://ematrix.stanford.edu/)
134
eMATRIX Scan Results
(http://ematrix.stanford.edu/ematrix-scan/)
135
eMATRIX Search (http://ematrix.stanford.edu/)
136
eMATRIX Search Results
(http://ematrix.stanford.edu/)
137
eMATRIX Maker (http://ematrix.stanford.edu/)
138
3MATRIX (http://3matrix.stanford.edu/)
139
(No Transcript)
140
ePROTEOME: A Functional Genomics Database
(http://eproteome.stanford.edu/)
141
BLOCKS Is Based On Several Protein Family
Databases
142
eBLOCKs - Discovering Protein Motifs
(http://eblocks.stanford.edu/)
Higher Specificity
A
B
C
Higher Sensitivity
143
Building eBLOCKs with PSI-BLAST
  • 1) Compare the query to database with BLAST
  • 2) Construct profile from significant
    similarities
  • 3) Compare the profile to database
  • 4) Repeat step 2 and 3 until convergence

144
Generating Multiple Overlapping eBLOCKs from
PSI-BLAST Results
G2B1
G2B2
1
2
G3B1
G3B2
G3B3
G1B1
G1B2
Step 1: Clustering and Grouping. Step 2: Aligning and Trimming.
145
Clusters Are Organized Into Groups with Varying
Specificity and Sensitivity
Higher Specificity
A
B
C
Higher Sensitivity
146
eBLOCKs Summary
  • SWISS-PROT
  • 79,449 Sequences
  • Filtered Target Set
  • Homologous, putative, fragment, hypothetical,
    probable, possible
  • 57,266 Sequences
  • PSI-BLAST Searches
  • 17,415
  • Final Number Of Groups
  • 19,889
  • Final Number Of Blocks
  • 81,413

147
eBLOCKs are More Comprehensive
148
Properties of 52,671 Novel eBLOCKs
  • New eBLOCKs are the same width as BLOCKS blocks
  • Average new eBLOCK 34 positions, others 37
  • New eBLOCKs have fewer sequences than BLOCKS
    blocks
  • Average new eBLOCK has 18 sequences, BLOCKS 27
  • New eBLOCKs have similar information content
  • New eBLOCKs have 2.82 bits/position, BLOCKS 2.88
    bits
  • One half of new eBLOCKs (26,254) are in known
    families
  • One half of new eBLOCKs (26,471) are in 6,713 new
    families

149
Example of New eBLOCK in a Known Family
14-3-3 Family of Proteins
68 Sequences, 72 Sequences
BL00796A, P29358G1B1 BL00796B,
P29358G1B2 BL00796C, P29358G1B4
(28, 33)
150
Catalytic Site of ATP Synthase
P-[SAP]-[LIV]-[DNH]-x(3)-S-x-S
PROSITE PS00152
eBLOCKs P19483G1B2
BLOCKS BL00152F
151
Protein Functional Analysis Using BLOCKS or
eBLOCKs
Motifs Significant at an Expectation of 10^-4
Red: eBLOCKs. Black: BLOCKS.
152
Two Human Protein Sets
  • Ensembl (http://www.ensembl.org)
  • 29,304 proteins (Feb 2002) from the human genome
    project
  • Based on GenScan Models
  • Shorter, more fragmentary protein sequences
  • RefSeq (http://www.ncbi.nlm.nih.gov/)
  • 21,724 -- curated (XP)
  • 11,407 reviewed sequences (NP)
  • Based on full length cDNAs
  • Longer, more reliable protein sequences

153
eBLOCKs Assignments for RefSeq and ENSEMBL
Proteins
154
Web Access to eBLOCKs (http://eblocks.stanford.edu/)
155
An Entry From eBLOCKs (http://eblocks.stanford.edu/)
156
Another Entry from eBLOCKs (http://eblocks.stanford.edu/)
157
A Sample Keyword Search (http://eblocks.stanford.edu/)
158
Search A Sequence (http://eblocks.stanford.edu/)
159
Software demos
  • Dot Matrices
  • http://bioinf.ibun.unal.edu.co/java/dotlet/Dotlet.html
  • Software Smith-Waterman alignments
  • http://motif.stanford.edu/alion/
  • Hardware Smith-Waterman alignments
  • http://decypher.stanford.edu
  • eMotif
  • http://motif.stanford.edu/emotif/
  • eMatrix
  • http://motif.stanford.edu/ematrix/
  • 3motif / 3matrix
  • http://motif.stanford.edu/3motif/
  • http://motif.stanford.edu/3matrix/
  • eBlocks
  • http://eblocks.stanford.edu/
  • LOCK
  • http://dlb3.stanford.edu/lock/
  • SCOP / PDB
  • http://scop.berkeley.edu

160
Conclusion
  • Functional Analysis is more important than ever
    with the rate of growth of sequence databases.
  • Important for understanding biology: gives
    researchers a head start on how to experimentally
    examine proteins.
  • Important in pharmaceuticals: allows rapid
    discovery of targets.