CSE182-L4: Scoring matrices, Dictionary Matching

Slides: 61
Provided by: vineet50
Learn more at: https://cseweb.ucsd.edu

1
CSE182-L4: Scoring matrices, Dictionary Matching
2
Class Mailing List
  • fa05_182_at_cs.ucsd.edu
  • To subscribe, send email to
  • fa05_182-subscribe_at_cs.ucsd.edu
  • You can subscribe from the course web page
  • Use the list for all course-related queries and
    discussions

3
Protein Sequence Analysis
  • What can you do if BLAST does not return a hit?
  • Sometimes, homology (evolutionary similarity)
    exists at very low levels of sequence similarity.
  • A: Accept hits at a higher P-value.
  • This increases the probability that the sequence
    similarity is a chance event.
  • How can we get around this paradox?
  • Reformulated Q: Suppose two sequences B and C have
    the same level of sequence similarity to sequence
    A. If A and B are related in function, can we assume
    that A and C are? If not, how can we distinguish?

4
Protein sequence motifs
  • Premise:
  • The sequence of a protein gives clues
    about its structure and function.
  • Not all residues are equally important in
    determining function.
  • How can we identify these key residues?

5
Prosite
  • In some cases the sequence of an unknown protein
    is too distantly related to any protein of known
    structure to detect its resemblance by overall
    sequence alignment. However, relationships can be
    revealed by the occurrence in its sequence of a
    particular cluster of residue types, which is
    variously known as a pattern, motif, signature or
    fingerprint. These motifs arise because specific
    region(s) of a protein which may be important,
    for example, for their binding properties or for
    their enzymatic activity are conserved in both
    structure and sequence. These structural
    requirements impose very tight constraints on the
    evolution of this small but important portion(s)
    of a protein sequence. The use of protein
    sequence patterns or profiles to determine the
    function of proteins is becoming very rapidly one
    of the essential tools of sequence analysis. Many
    authors (3,4) have recognized this reality.
    Based on these observations, we decided in 1988,
    to actively pursue the development of a database
    of regular expression-like patterns, which would
    be used to search against sequences of unknown
    function.
  • Kay Hofmann, Philipp Bucher, Laurent Falquet and
    Amos Bairoch
  • The PROSITE database, its status in 1999

6
Basic idea
  • It is a heuristic approach. Start with the
    following:
  • A collection of sequences with the same function.
  • Region/residues known to be significant for
    maintaining structure and function.
  • Develop a pattern of conserved residues around
    the residues of interest
  • Iterate for appropriate sensitivity and
    specificity

7
Zinc Finger domain
8
Proteins containing zf domains
How can we find a motif corresponding to a zf
domain?
9
From alignment to regular expressions
ALRDFATHDDF
SMTAEATHDSI
ECDQAATHEAS
A-T-H-[DE]
  • Search Swissprot with the resulting pattern
  • Refine pattern to eliminate false positives
  • Iterate
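
The first step, going from aligned sequences to a pattern, can be sketched in Python. This is only an illustration under assumptions: the helper name `columns_to_pattern` and the choice of columns 5 to 8 (0-based) as the conserved window are hypothetical, and real pattern building also involves the search-and-refine loop listed above.

```python
def columns_to_pattern(aligned, start, end):
    """Build a PROSITE-style pattern from columns [start, end) of an
    ungapped alignment (all strings must cover those columns)."""
    elements = []
    for i in range(start, end):
        residues = sorted({seq[i] for seq in aligned})
        if len(residues) == 1:
            elements.append(residues[0])                      # fully conserved column
        else:
            elements.append("[" + "".join(residues) + "]")    # set of allowed residues
    return "-".join(elements)

aligned = ["ALRDFATHDDF", "SMTAEATHDSI", "ECDQAATHEAS"]
pattern = columns_to_pattern(aligned, 5, 9)   # window chosen by hand (assumption)
print(pattern)
```

For the three sequences on this slide the window yields A-T-H-[DE].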

10
The sequence analysis perspective
  • Zinc Finger motif
  • C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H
  • 2 conserved C, and 2 conserved H
  • How can we search a database using these motifs?
  • The motif is described using a regular
    expression. What is a regular expression?
  • How can we search for a match to a regular
    expression? Not allowed to use Perl :-)
  • The regular expression motif is weak. How can
    we make it stronger?
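
As a quick way to check what the zinc finger pattern matches, it can be translated into a Python regular expression. This is a minimal sketch, not the matching algorithm the lecture develops: the helper names `prosite_to_regex` and `find_motif` are hypothetical, only the pattern elements used on this slide are handled (single residues, x, bracketed sets, and (n) or (n,m) repeats), and the test sequence is synthetic.

```python
import re

def prosite_to_regex(pattern):
    """Translate a simple PROSITE pattern into Python regex syntax."""
    parts = []
    for element in pattern.split("-"):
        m = re.fullmatch(r"(x|[A-Z]|\[[A-Z]+\])(?:\((\d+)(?:,(\d+))?\))?", element)
        if m is None:
            raise ValueError(f"unsupported element: {element!r}")
        core, n_min, n_max = m.groups()
        piece = "." if core == "x" else core          # x means any residue
        if n_min is not None:                         # (n) or (n,m) repeat count
            piece += "{" + n_min + ("," + n_max if n_max else "") + "}"
        parts.append(piece)
    return "".join(parts)

ZF = "C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H"
zf_regex = re.compile(prosite_to_regex(ZF))

def find_motif(sequence):
    """Return (start offset, matched substring) for each non-overlapping match."""
    return [(m.start(), m.group()) for m in zf_regex.finditer(sequence)]

# Synthetic sequence built to contain one match (not a real database entry).
print(find_motif("MKCPECGKSFSQKSNLQKHQRTHTG"))
```

The translated expression is `C.{2,4}C.{3}[LIVMFYWC].{8}H.{3,5}H`.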

11
Profiles
  • Start with an alignment of strings of length m
    over an alphabet A.
  • Build an |A| × m matrix F = (fki).
  • Each entry fki represents the frequency of symbol
    k in position i.
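
A minimal sketch of building such a frequency matrix in Python. The function name `build_profile`, the dictionary-of-lists layout, and the toy alignment are assumptions made for illustration.

```python
from collections import Counter

def build_profile(aligned, alphabet):
    """Return F as a dict: F[k][i] = frequency of symbol k in column i."""
    m = len(aligned[0])
    if any(len(s) != m for s in aligned):
        raise ValueError("all aligned strings must have the same length m")
    n = len(aligned)
    profile = {k: [0.0] * m for k in alphabet}
    for i in range(m):
        counts = Counter(seq[i] for seq in aligned)   # symbol counts in column i
        for k in alphabet:
            profile[k][i] = counts[k] / n
    return profile

aligned = ["ATHD", "ATHD", "ATHE"]     # toy ungapped alignment (assumption)
F = build_profile(aligned, "ADEHT")
print(F["D"][3], F["E"][3])
```

Each column of F sums to 1 as long as every symbol in the alignment belongs to the alphabet.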

(Figure: example profile; the legible entries are frequencies 0.71, 0.71, 0.28, 0.14.)
12
Scoring Profiles
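Scoring Matrix
(Figure: a string s scored against the scoring matrix, with rows indexed by symbol k, columns by position i, and entries fki.)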
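
The transcript does not preserve the scoring formula from this slide. One common choice, assumed here purely for illustration, is a log-odds score: sum over positions i of log(fki / pk), where pk is a background frequency for symbol k. The pseudocount, the uniform background, and the toy profile below are all assumptions.

```python
import math

def score_against_profile(s, profile, background, pseudocount=0.01):
    """Log-odds score of string s against a frequency profile.
    profile[k][i] is the frequency of symbol k at position i;
    the pseudocount (an assumption) avoids log(0) for unseen symbols."""
    total = 0.0
    for i, k in enumerate(s):
        total += math.log((profile[k][i] + pseudocount) / background[k])
    return total

# Toy profile over 4 positions (hypothetical values).
profile = {
    "A": [1.0, 0.0, 0.0, 0.0],
    "T": [0.0, 1.0, 0.0, 0.0],
    "H": [0.0, 0.0, 1.0, 0.0],
    "D": [0.0, 0.0, 0.0, 2 / 3],
    "E": [0.0, 0.0, 0.0, 1 / 3],
}
background = {k: 0.2 for k in profile}    # uniform background (assumption)

for s in ("ATHD", "ATHE", "AAAA"):
    print(s, round(score_against_profile(s, profile, background), 2))
```

Strings that agree with the frequent symbol in each column score higher than strings that do not.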
13
Psi-BLAST idea
  • Multiple alignments are important for capturing
    remote homology.
  • Profile based scores are a natural way to handle
    this.
  • Q: What if the query is a single sequence?
  • A: Iterate
  • Find homologs using Blast on query
  • Discard very similar homologs
  • Align, make a profile, search with profile.

14
Psi-BLAST speed
  • Two time-consuming steps:
  • Multiple alignment of homologs
  • Searching with profiles
  • Does the keyword search idea work?
  • Pigeonhole principle again:
  • If a profile of length m must score > T,
  • then some sub-profile of length l must score >
    lT/m
  • Generate all l-mers that score at least lT/m
  • Search using an automaton
  • Multiple alignment:
  • Use ungapped multiple alignments only
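
The "generate all l-mers that score at least lT/m" step can be sketched as a pruned enumeration over the columns of a sub-profile. This is an illustration under assumptions: the function name `high_scoring_lmers`, the per-column score dictionaries, and the toy numbers (m = 6, T = 12, l = 3, so the threshold is lT/m = 6) are hypothetical.

```python
def high_scoring_lmers(score_columns, threshold):
    """Enumerate strings (one symbol per column) whose total score is
    at least threshold. score_columns[i] maps symbol -> score in column i."""
    n = len(score_columns)
    # best_rest[i]: best score still achievable from columns i..n-1 (for pruning)
    best_rest = [0.0] * (n + 1)
    for i in range(n - 1, -1, -1):
        best_rest[i] = best_rest[i + 1] + max(score_columns[i].values())

    results = []

    def extend(i, prefix, score):
        if score + best_rest[i] < threshold:
            return                        # no completion can reach the threshold
        if i == n:
            results.append(("".join(prefix), score))
            return
        for symbol, column_score in score_columns[i].items():
            prefix.append(symbol)
            extend(i + 1, prefix, score + column_score)
            prefix.pop()

    extend(0, [], 0.0)
    return results

m, T, l = 6, 12, 3                        # toy values (assumption)
sub_profile = [{"A": 3, "C": -1}, {"A": -1, "C": 2}, {"A": 2, "C": 1}]
words = high_scoring_lmers(sub_profile, l * T / m)
print(words)
```

The resulting word list is what would then be loaded into a keyword-matching automaton for the database scan.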

15
(No Transcript)
16
CSE182-L6
  • Regular Expression Matching
  • Protein structure basics

17
Zinc Finger domain
18
The sequence analysis perspective
  • Zinc Finger motif
  • C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H
  • 2 conserved C, and 2 conserved H
  • How can we search a database using these motifs?
  • The motif is described using a regular
    expression. What is a regular expression?

19
Regular Expressions
  • Concise representation of a set of strings over
    alphabet Σ.
  • Described by a string over
  • R is a r.e. if and only if

20
Regular Expression
  • Q: Let Σ = {A, C, E}
  • Is (A+C)*EEC* a regular expression?
  • *(A+C)?
  • AC*..E?
  • Q: When is a string s in a regular expression?
  • R = (A+C)*EEC*
  • Is CEEC in R?
  • AEC?
  • ACEE?

21
Regular Expression Automata
  • Every R.E can be expressed by an automaton (a
    directed graph) with the following properties
  • The automaton has a start and end node
  • Each edge is labeled with a symbol from Σ, or ε
  • Suppose R is described by automaton A
  • s ∈ R if and only if there is a path from start
    to end in A, labeled with s.

22
Examples: Regular Expression Automata
  • (A+C)*EEC*

(Figure: automaton with start and end nodes; edges labeled C, A, E, E, C.)
23
Constructing automata from R.E
  • R = ε
  • R = σ, σ ∈ Σ
  • R = R1 + R2
  • R = R1 R2
  • R = R1*

(Figure: automaton fragment for each case.)
24
Regular Expression Matching
  • Given a database D, and a regular expression R,
    is a substring of D in R?
  • Is there a string D[l..c] that is accepted by the
    automaton of R?
  • Simpler Q: Is D[1..c] accepted by the automaton
    of R?

25
Alg. For matching R.E.
  • If D[1..c] is accepted by the automaton RA:
  • There is a path labeled D1...Dc that goes from
    START to END in RA

(Figure: path through the automaton with edges labeled D1, D2, ..., Dc.)
26
Alg. For matching R.E.
  • If D[1..c] is accepted by the automaton RA:
  • There is a path labeled D1...Dc that goes from
    START to END in RA
  • There is a path labeled D1...Dc-1 from START
    to node u, and a path labeled Dc from u to the
    END

(Figure: path labeled D1 .. Dc-1 from START to u, then an edge labeled Dc.)
27
D.P. to match regular expression
  • Define
  • A[u,σ]: automaton node reached from u after
    reading σ
  • Eps(u): set of all nodes reachable from node u
    using epsilon transitions.
  • N[c]: subset of nodes reachable from START node
    after reading D[1..c]
  • Q: When is v ∈ N[c]?

(Figure: an edge from u to v labeled σ, and the set Eps(u) reached from u.)
28
D.P. to match regular expression
  • Q: When is v ∈ N[c]?
  • A: If for some u ∈ N[c-1], w = A[u,Dc], and
  • v ∈ {w} ∪ Eps(w)
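
The recurrence above can be sketched in Python as a set-based simulation of the automaton. This is an illustrative sketch, not the lecture's own code: the function names, the dictionary encoding of edges, and the example automaton (a hand-built one for the expression (A+C)*EEC*, with node names q0, q1, q2) are all assumptions.

```python
def eps_closure(nodes, eps):
    """Return nodes plus everything reachable from them via epsilon edges."""
    closure, stack = set(nodes), list(nodes)
    while stack:
        u = stack.pop()
        for w in eps.get(u, ()):
            if w not in closure:
                closure.add(w)
                stack.append(w)
    return closure

def accepts(D, delta, eps, start="START", end="END"):
    """True if the whole string D labels a path from start to end.
    delta maps (node, symbol) -> set of nodes; eps maps node -> set of nodes."""
    N = eps_closure({start}, eps)                 # N[0]
    for symbol in D:                              # build N[c] from N[c-1]
        step = set()
        for u in N:
            step |= delta.get((u, symbol), set()) # w = A[u, Dc]
        N = eps_closure(step, eps)                # v in {w} ∪ Eps(w)
    return end in N

# Hypothetical automaton for (A+C)*EEC*.
delta = {
    ("q0", "A"): {"q0"}, ("q0", "C"): {"q0"},
    ("q0", "E"): {"q1"}, ("q1", "E"): {"q2"},
    ("q2", "C"): {"q2"},
}
eps = {"START": {"q0"}, "q2": {"END"}}

for s in ("CEEC", "AEC", "ACEE"):
    print(s, accepts(s, delta, eps))
```

Each step touches every node in the current set once, so the running time is proportional to the length of D times the size of the automaton.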

29
Algorithm
30
The final step
  • We have answered the question:
  • Is D[1..c] accepted by R?
  • Yes, if END ∈ N[c]
  • We need to answer:
  • Is D[l..c] (for some l, and some c) accepted by R?
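
One standard way to handle an arbitrary start position l, sketched here as an assumption since the transcript does not preserve the lecture's answer, is to re-inject the START node into the active set at every position. The function name, edge encoding, and example automaton (hand-built for (A+C)*EEC*) are hypothetical.

```python
def match_end_positions(D, delta, eps, start="START", end="END"):
    """Return every c (1-based) such that some non-empty substring D[l..c]
    is accepted. delta maps (node, symbol) -> set of nodes;
    eps maps node -> set of nodes reachable by one epsilon edge."""
    def closure(nodes):
        seen, stack = set(nodes), list(nodes)
        while stack:
            u = stack.pop()
            for w in eps.get(u, ()):
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        return seen

    start_set = closure({start})
    N = set(start_set)
    ends = []
    for c, symbol in enumerate(D, start=1):
        step = set()
        for u in N:
            step |= delta.get((u, symbol), set())
        reached = closure(step)
        if end in reached:
            ends.append(c)
        N = reached | start_set       # a new match may begin at position c+1
    return ends

# Hypothetical automaton for (A+C)*EEC*.
delta = {
    ("q0", "A"): {"q0"}, ("q0", "C"): {"q0"},
    ("q0", "E"): {"q1"}, ("q1", "E"): {"q2"},
    ("q2", "C"): {"q2"},
}
eps = {"START": {"q0"}, "q2": {"END"}}
print(match_end_positions("AEACEECE", delta, eps))
```

For the database string AEACEECE this reports positions 6 and 7, where the substrings EE and EEC end.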

31
A structural view of proteins
32
CS view of a protein
  • >sp|P00974|BPT1_BOVIN Pancreatic trypsin
    inhibitor precursor (Basic protease inhibitor)
    (BPI) (BPTI) (Aprotinin) - Bos taurus (Bovine).
  • MKMSRLCLSVALLVLLGTLAASTPGCDTSNQAKAQRPDFCLEPPYTGPCK
    ARIIRYFYNAKAGLCQTFVYGGCRAKRNNFKSAEDCMRTCGGAIGPWENL

33
Protein structure basics
34
Side chains determine amino-acid type
  • The residues may have different properties.
  • Aspartic acid (D), and Glutamic Acid (E) are
    acidic residues

35
Bond angles form structural constraints
36
Various constraints determine 3d structure
  • Constraints:
  • Structural constraints due to physicochemical
    properties
  • Constraints due to bond angles
  • H-bond formation
  • Surprisingly, a few conformations are seen over
    and over again.

37
Alpha-helix
  • 3.6 residues per turn
  • H-bonds between residue i and residue i+4 (e.g.
    the 1st and 5th) stabilize the structure.
  • First discovered by Linus Pauling

38
Beta-sheet
  • Each strand by itself has 2 residues per turn,
    and is not stable.
  • Adjacent strands hydrogen-bond to form stable
    beta-sheets, parallel or anti-parallel.
  • Beta sheets have long range interactions that
    stabilize the structure, while alpha-helices have
    local interactions.

39
Domains
  • The basic structures (helix, strand, loop)
    combine to form complex 3D structures.
  • Certain combinations are popular. Many sequences,
    but only a few folds

40
3D structure
  • Predicting tertiary structure is an important
    problem in Bioinformatics.
  • Premise: Clues to structure can be found in the
    sequence.
  • While de novo tertiary structure prediction is
    hard, there are many intermediate, and tractable
    goals.

41
Protein Domains
  • An important realization (in the last decade) is
    that proteins have a modular architecture of
    domains/folds.
  • Example: The zinc finger domain is a DNA-binding
    domain.
  • What is a domain?
  • Part of a sequence that can fold independently,
    and is present in other sequences as well

42
Proteins containing zf domains
How can we find a motif corresponding to a zf
domain?
43
Domain review
  • What is a domain?
  • How are domains expressed?
  • Motifs (regular expressions and others)
  • Multiple alignments
  • Profiles
  • Profile HMMs

44
Databases of protein domains
45
http://pfam.wustl.edu/ (also at Sanger)
46
PROSITE
http://us.expasy.org/prosite/
47
(No Transcript)
48
(No Transcript)
49
http://hmmer.wustl.edu
50
HMMER programs
  • hmmalign: Align a sequence to an HMM
  • hmmbuild: Build a model from a multiple alignment
  • hmmemit: Emit a probabilistic sequence from an HMM
  • hmmpfam: Search PFAM with a sequence query
  • hmmsearch: Search a sequence database with an HMM
    query

51
(No Transcript)
52
Post-translational modification
  • Residues undergo modification, usually by
    addition of a chemical group.
  • Key mechanism for signal transduction, and many
    other cellular functions
  • Some modifications might require single residues
    (e.g., phosphorylation). Others might require a
    pattern.

53
(No Transcript)
54
Protein targeting
55
Protein targeting
  • In 1970, Günter Blobel showed that proteins have
    an N-terminal signal sequence which directs
    proteins to the membrane.
  • Proteins have to be transported to other
    organelles: nucleus, mitochondria, ...
  • Can we computationally identify the signal
    which distinguishes the cellular compartment?

56
  • For transmembrane proteins, can we predict the
    transmembrane, outer, and inner regions?

57
(No Transcript)
58
Multiple alignment tools
59
Tools for secondary structure prediction
  • Each residue must be given a state:
    Helix, Loop, or Strand
  • HMMs/neural networks are used to predict

60
Next topic: Gene finding