Entropy, Information contents - PowerPoint PPT Presentation

Provided by: Thomas1448
1
Entropy, Information content, Logo plots
By Thomas Nordahl Petersen
2
Biological information
Exon Intron Exon
  • Multiple alignment of acceptor sites from 268
    yeast DNA sequences
  • What is the biological signal around the site?
  • What are the important positions?
  • How can it be visualized?
  • Logo plot with Information Content

3
Entropy - Definition
  • The entropy of a random variable is a measure of its
    uncertainty
  • In thermodynamics: $\Delta G = \Delta H - T \Delta S$
  • The entropy S of a system is its degree of
    disorder

4
Entropy - Definition
  • Entropy of a distribution of amino acids
  • The Shannon entropy:
  • $H(p) = -\sum_a p_a \log_2(p_a)$, where p is an amino
    acid distribution.
  • H(p) is measured in bits: $\log_2(2) = 1$, $\log_2(4) = 2$
  • Multiple alignment of 3 sequences
  • Seq1 A L P K
  • Seq2 A V P R
  • Seq3 A I K R
  • High entropy - high disorder
  • Low entropy - low disorder

5
Entropy - example
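  • $H(p) = -\sum_a p_a \log_2(p_a)$
  • Multiple alignment of 3 sequences
  • Seq1 A L R
  • Seq2 A V R
  • Seq3 A I K
  • Pos1: $H(p) = -1 \cdot \log_2(1) = 0$
  • Pos2: $H(p) = -\left[\tfrac{1}{3}\log_2(\tfrac{1}{3}) + \tfrac{1}{3}\log_2(\tfrac{1}{3}) + \tfrac{1}{3}\log_2(\tfrac{1}{3})\right] = \log_2(3) \approx 1.58$
  • Pos3: $H(p) = -\left[\tfrac{2}{3}\log_2(\tfrac{2}{3}) + \tfrac{1}{3}\log_2(\tfrac{1}{3})\right] \approx 0.92$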
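
The per-position calculation above can be reproduced with a few lines of Python. This is only a minimal sketch: the helper name `column_entropy` is made up for illustration, and frequencies are taken as raw counts with no correction for the small number of sequences.

```python
from collections import Counter
from math import log2

def column_entropy(column):
    """Shannon entropy (in bits) of the residues observed in one alignment column."""
    counts = Counter(column)
    total = len(column)
    # p * log2(1/p) is the same as -p * log2(p); written this way to avoid a "-0.0" result
    return sum((n / total) * log2(total / n) for n in counts.values())

# The three-sequence alignment from the slide
alignment = ["ALR", "AVR", "AIK"]

# Transpose the alignment so that each string is one column (position)
columns = ["".join(col) for col in zip(*alignment)]
entropies = [column_entropy(col) for col in columns]

for pos, (col, h) in enumerate(zip(columns, entropies), start=1):
    print(f"Pos{pos} ({col}): H = {h:.2f} bits")
```

A fully conserved column gives zero entropy, and a column where every sequence differs gives the largest value possible for three sequences, log2(3).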

6
Relative Entropy: The Kullback-Leibler distance D
  • How different is an amino acid distribution $p_a$
    compared to a background distribution $q_a$, i.e.
    what is the distance D between them?
  • $D(p\|q) = \sum_a p_a \log_2(p_a / q_a)$
  • Normally a background distribution of the amino
    acids is obtained as frequencies from a large
    database like UniProt.
  • Background frequencies (in %):
    Ala (A) 7.82   Gln (Q) 3.94   Leu (L) 9.62   Ser (S) 6.87
    Arg (R) 5.32   Glu (E) 6.60   Lys (K) 5.93   Thr (T) 5.46
    Asn (N) 4.20   Gly (G) 6.94   Met (M) 2.37   Trp (W) 1.16
    Asp (D) 5.30   His (H) 2.27   Phe (F) 4.01   Tyr (Y) 3.07
    Cys (C) 1.56   Ile (I) 5.90   Pro (P) 4.85   Val (V) 6.71
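
As an illustration of the distance D, the sketch below computes it for two made-up column distributions against the background values listed above. The function name `kl_divergence` and the example distributions are assumptions for illustration; the background numbers are copied from the slide and renormalised, since the rounded values sum to roughly 99.9.

```python
from math import log2

# Background frequencies in percent, copied from the slide
BACKGROUND_PERCENT = {
    "A": 7.82, "Q": 3.94, "L": 9.62, "S": 6.87,
    "R": 5.32, "E": 6.60, "K": 5.93, "T": 5.46,
    "N": 4.20, "G": 6.94, "M": 2.37, "W": 1.16,
    "D": 5.30, "H": 2.27, "F": 4.01, "Y": 3.07,
    "C": 1.56, "I": 5.90, "P": 4.85, "V": 6.71,
}

# Convert to probabilities that sum to exactly 1
_total = sum(BACKGROUND_PERCENT.values())
q = {aa: value / _total for aa, value in BACKGROUND_PERCENT.items()}

def kl_divergence(p, q):
    """D(p||q) in bits; residues with p_a = 0 contribute nothing to the sum."""
    return sum(pa * log2(pa / q[aa]) for aa, pa in p.items() if pa > 0)

# Hypothetical column distributions, chosen only to show the behaviour
p_trp_conserved = {"W": 1.0}           # rare residue, fully conserved
p_leu_conserved = {"L": 1.0}           # common residue, fully conserved
p_mixed = {"R": 2 / 3, "K": 1 / 3}     # partly conserved column

for name, p in [("W only", p_trp_conserved), ("L only", p_leu_conserved), ("R/K", p_mixed)]:
    print(f"{name}: D = {kl_divergence(p, q):.2f} bits")
```

With a non-uniform background, a conserved rare residue such as Trp is further from the background than a conserved common residue such as Leu.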

7
Information content
  • $D(p\|q) = \sum_a p_a \log_2(p_a / q_a)$
  • Often the Information content is used as a
    measure of the degree of conservation.
  • $I = \sum_a p_a \log_2(p_a / q_a)$
  • A special case is that where all amino acids have
    the same background frequency, $q_a = 1/20$

8
Information content
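  • $I = \sum_a p_a \log_2\left(\frac{p_a}{1/20}\right)$
  • $= \sum_a p_a \left[\log_2 p_a - \log_2(1/20)\right]$
  • $= -H(p) - \sum_a p_a \log_2(1/20)$
  • $= -H(p) + \sum_a p_a \log_2(20)$
  • $= -H(p) + \log_2(20)$, since $\sum_a p_a = 1$
  • $\approx -H(p) + 4.32$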

9
Information content
  • $I = -H(p) + 4.32 = \sum_a p_a \log_2 p_a + 4.32$
  • The Information content is at its maximum when
    the entropy is zero, i.e. at a fully conserved
    position in a multiple alignment.
  • Multiple alignment of 3 sequences
  • Seq1 A L R
  • Seq2 A V R
  • Seq3 A I K
  • Pos1: $I = 1 \cdot \log_2(1) + 4.32 = 4.32$
  • Pos2: $I = \tfrac{1}{3}\log_2(\tfrac{1}{3}) + \tfrac{1}{3}\log_2(\tfrac{1}{3}) + \tfrac{1}{3}\log_2(\tfrac{1}{3}) + 4.32 \approx 2.74$
  • Pos3: $I = \tfrac{2}{3}\log_2(\tfrac{2}{3}) + \tfrac{1}{3}\log_2(\tfrac{1}{3}) + 4.32 \approx 3.40$
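
The shortcut I = log2(20) - H(p) can be checked numerically against the relative-entropy definition with a uniform background. The sketch below does this for the three example columns; the function names are illustrative and frequencies are plain counts.

```python
from collections import Counter
from math import log2

def frequencies(column):
    """Observed residue frequencies in one alignment column."""
    counts = Counter(column)
    return {aa: n / len(column) for aa, n in counts.items()}

def information_content(column, alphabet_size=20):
    """I = log2(alphabet_size) - H(p), assuming a uniform background."""
    p = frequencies(column)
    entropy = sum(pa * log2(1 / pa) for pa in p.values())
    return log2(alphabet_size) - entropy

def relative_entropy_uniform(column, alphabet_size=20):
    """The same quantity, computed directly as sum_a p_a * log2(p_a / q_a) with q_a = 1/alphabet_size."""
    q = 1 / alphabet_size
    return sum(pa * log2(pa / q) for pa in frequencies(column).values())

# Columns of the three-sequence example alignment
columns = ["AAA", "LVI", "RRK"]
info = [information_content(col) for col in columns]

for pos, (col, i_bits) in enumerate(zip(columns, info), start=1):
    print(f"Pos{pos} ({col}): I = {i_bits:.2f} bits, "
          f"relative entropy = {relative_entropy_uniform(col):.2f} bits")
```

Both functions give the same value for every column, which is the content of the derivation: with q_a = 1/20 the relative entropy reduces to log2(20) minus the Shannon entropy.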

10
Logo plots - HowTo
11
Logo plots - Information Content
Calculate Information Content: $I = \sum_a p_a \log_2 p_a + \log_2(4)$; the maximal value is 2 bits
[Figure: example logo columns, labelled "Completely conserved" and "0.5 each"]
  • The total height at a position is the Information
    Content measured in bits.
  • The height of each letter is proportional to the
    frequency of that letter.
  • A Logo plot is a visualization of a multiple
    alignment.
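
The two rules above (total height equals the information content, letter height is proportional to frequency) can be sketched for DNA as follows. The function name `logo_column` and the toy alignment are assumptions for illustration, and no small-sample correction is applied.

```python
from collections import Counter
from math import log2

def logo_column(column, alphabet="ACGT"):
    """Return the height in bits of each letter in one logo column."""
    counts = Counter(column)
    n = len(column)
    p = {base: counts[base] / n for base in alphabet}
    entropy = sum(pb * log2(1 / pb) for pb in p.values() if pb > 0)
    total_height = log2(len(alphabet)) - entropy   # information content, at most 2 bits for DNA
    # Each letter takes a share of the stack equal to its frequency
    return {base: p[base] * total_height for base in alphabet if p[base] > 0}

# Toy alignment (hypothetical): column 1 is fully conserved, column 2 is half G, half T
toy_alignment = ["AG", "AG", "AT", "AT"]
heights = [logo_column("".join(col)) for col in zip(*toy_alignment)]

for pos, column_heights in enumerate(heights, start=1):
    stack = ", ".join(f"{base}={h:.2f}" for base, h in column_heights.items())
    print(f"Pos{pos}: {stack}")
```

In this toy case the conserved column yields a single letter of 2 bits, while the 50/50 column has an information content of 1 bit that is split into two letters of 0.5 bits each.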

12
Programs to make a Logo plot
  • WebLogo
    • Requires a multiple alignment as input
    • Protein or DNA sequences
    • More output formats
  • Blast2Logo
    • Requires a fasta file as input
    • Only protein sequences
    • Runs PSI-blast and makes a table of frequencies
    • PDF logo plot

13
WebLogo - http://weblogo.berkeley.edu/
14
WebLogo - http://weblogo.berkeley.edu/
15
Find important positions
>sp|Q00017|RHA1_ASPAC Rhamnogalacturonan acetylesterase
MKTAALAPLFFLPSALATTVYLAGDSTMAKNGGGS
GTNGWGEYLASYLSATVVNDAVAGRSARSYTREGRFENIADVVTAGDYV
IVEFGHNDGGSLSTDNGRTDCSGTGAEVCYSVYDGVNETILTFPAYLEN
AAKLFTAKGAKVILSSQTPNNPWETGTFVNSPTRFVEYAELAAEVAGVE
YVDHWSYVDSIYETLGNATVNSYFPIDHTHTSPAGAEVVAEAFLKAVVCT
GTSLKSVLTTTSFEGTCL
What is the next step?
  1. Find homologous sequences - how?
  • Blast or PsiBlast
  • Download sequences
  • Make a multiple alignment
  • ClustalW or others
  • or use Blast2Logo program

16
Multiple alignment programs
17
Blast2logo - http://www.cbs.dtu.dk/biotools/Blast2logo-1.0/
18
Important positions
  • Important positions in proteins are conserved
    positions => high Information Content.
  • Conserved for a reason
  • Functionally important positions
  • Catalytic residues
  • Structurally important positions
  • Maintain the correct fold of the protein

19
Blast2logo
  • Runs iterative blast, i.e. Psi-Blast
  • Searches for homologous sequences by use
    of Position Specific Scoring Matrices (PSSM).
  • Iteration - use Blosum62 scoring matrix
  • Iteration - make profile of seq found in
    iteration 1
  • Iteration - make profile of seq found in
    iteration 2
  • Iteration - calculate aa freq at each position
    in the query sequence. Correct for low counts and
    weight seq such that very similar seq are
    down-weighted
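
The slide does not say how Blast2logo corrects for low counts or how it weights sequences, so the sketch below swaps in two common, simple choices purely as an illustration: Henikoff-style position-based sequence weights and a flat add-one pseudocount. All function names and the toy alignment are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def position_based_weights(alignment):
    """Henikoff-style weights: sequences that share residues with many others get less weight."""
    weights = [0.0] * len(alignment)
    for column in zip(*alignment):
        counts = Counter(column)
        k = len(counts)  # number of distinct residues in this column
        for i, residue in enumerate(column):
            weights[i] += 1.0 / (k * counts[residue])
    total = sum(weights)
    return [w / total for w in weights]  # normalised to sum to 1

def weighted_frequencies(alignment, position, alphabet=AMINO_ACIDS, pseudocount=1.0):
    """Weighted residue frequencies at one column, with a flat pseudocount for low counts."""
    weights = position_based_weights(alignment)
    n = len(alignment)
    counts = {aa: pseudocount for aa in alphabet}
    for w, seq in zip(weights, alignment):
        counts[seq[position]] += w * n  # rescale so the weights add up to the number of sequences
    total = sum(counts.values())
    return {aa: c / total for aa, c in counts.items()}

# Toy alignment (hypothetical): the first two sequences are identical
toy_alignment = ["ALR", "ALR", "AVK"]
weights = position_based_weights(toy_alignment)
freqs = weighted_frequencies(toy_alignment, position=1)

print("weights:", [round(w, 3) for w in weights])
print("L:", round(freqs["L"], 3), "V:", round(freqs["V"], 3))
```

The two identical sequences each receive less weight than the divergent third sequence, and residues never observed in the column still get a small non-zero frequency from the pseudocount.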

20
Important positions - counting
21
Example. Where is the active site?
  • Sequence profiles might show you where to look!
  • The active site could be around S9, G42, N74, and H195

22
Exercise
  1. Calculate nucleotide frequencies from a multiple
    alignment of human donor sites
  2. Calculate Entropy and Information content
  3. Draw (by hand) a Logo plot
  4. Use 2 Logo plot programs
  5. Learn to interpret Logo frequency plots
  6. Active site residues and structural residues