Title: Entropy, Information contents
1Entropy, Information contents, Logo plots
By Thomas Nordahl Petersen
2Biological information
Exon Intron Exon
- Multiple alignment of acceptor sites from 268 yeast DNA sequences
- What is the biological signal around the site?
- What are the important positions?
- How can it be visualized?
- Logo plot with Information Content
3Entropy - Definition
- The entropy of a random variable is a measure of its uncertainty
- In thermodynamics: $\Delta G = \Delta H - T \Delta S$
- The entropy S of a system is the degree of disorder
4Entropy - Definition
- Entropy of a distribution of amino acids
- The Shannon entropy
- $H(p) = -\sum_a p_a \log_2(p_a)$, where $p$ is an amino acid distribution.
- $H(p)$ is measured in bits: $\log_2(2) = 1$, $\log_2(4) = 2$
- Multiple alignment of 3 sequences
- Seq1 A L P K
- Seq2 A V P R
- Seq3 A I K R
- High entropy - high disorder
- Low entropy - low disorder
5Entropy - example
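- $H(p) = -\sum_a p_a \log_2(p_a)$
- Multiple alignment of 3 sequences
- Seq1 A L R
- Seq2 A V R
- Seq3 A I K
- Pos1: $H(p) = -1 \cdot \log_2(1) = 0$
- Pos2: $H(p) = -[\tfrac{1}{3}\log_2(\tfrac{1}{3}) + \tfrac{1}{3}\log_2(\tfrac{1}{3}) + \tfrac{1}{3}\log_2(\tfrac{1}{3})] = \log_2(3) \approx 1.58$
- Pos3: $H(p) = -[\tfrac{2}{3}\log_2(\tfrac{2}{3}) + \tfrac{1}{3}\log_2(\tfrac{1}{3})] \approx 0.92$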
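The per-position entropies of the three-sequence example can be checked with a few lines of Python. This is a minimal sketch, not part of the original slides; the function name `column_entropy` is an illustrative choice, and the observed residue frequencies are used directly as $p_a$ (no pseudocounts, no gap handling).

```python
import math
from collections import Counter

def column_entropy(column):
    """Shannon entropy (in bits) of the residues observed in one alignment column."""
    counts = Counter(column)
    total = sum(counts.values())
    # -sum p*log2(p), written as sum p*log2(1/p) to avoid printing "-0.00"
    return sum((n / total) * math.log2(total / n) for n in counts.values())

# The alignment from the slide: Seq1 ALR, Seq2 AVR, Seq3 AIK
alignment = ["ALR", "AVR", "AIK"]
columns = ["".join(col) for col in zip(*alignment)]
entropies = [column_entropy(col) for col in columns]

for pos, h in enumerate(entropies, start=1):
    print(f"Pos{pos}: H = {h:.2f} bits")
# Pos1: H = 0.00 bits
# Pos2: H = 1.58 bits
# Pos3: H = 0.92 bits
```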
6Relative Entropy
The Kullback-Leibler distance D
- How different is an amino acid distribution $p_a$ compared to a background distribution $q_a$, i.e. what is the distance D between them?
- $D(p\|q) = \sum_a p_a \log_2(p_a/q_a)$
- Normally a background distribution of the amino acids is obtained as frequencies from a large database like UniProt.
- Background frequencies (percent):
- Ala (A) 7.82   Gln (Q) 3.94   Leu (L) 9.62   Ser (S) 6.87
- Arg (R) 5.32   Glu (E) 6.60   Lys (K) 5.93   Thr (T) 5.46
- Asn (N) 4.20   Gly (G) 6.94   Met (M) 2.37   Trp (W) 1.16
- Asp (D) 5.30   His (H) 2.27   Phe (F) 4.01   Tyr (Y) 3.07
- Cys (C) 1.56   Ile (I) 5.90   Pro (P) 4.85   Val (V) 6.71
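As an illustration of the relative entropy formula, the sketch below computes D for two hypothetical fully conserved positions against the background frequencies listed on this slide. The function name `relative_entropy` and the two example distributions are assumptions made for illustration; the listed percentages sum to 99.90, so they are renormalized to sum to 1 before use.

```python
import math

# Background frequencies in percent, copied from the slide.
BACKGROUND_PERCENT = {
    "A": 7.82, "Q": 3.94, "L": 9.62, "S": 6.87,
    "R": 5.32, "E": 6.60, "K": 5.93, "T": 5.46,
    "N": 4.20, "G": 6.94, "M": 2.37, "W": 1.16,
    "D": 5.30, "H": 2.27, "F": 4.01, "Y": 3.07,
    "C": 1.56, "I": 5.90, "P": 4.85, "V": 6.71,
}
# Renormalize so the background is a proper probability distribution.
total = sum(BACKGROUND_PERCENT.values())
background = {aa: v / total for aa, v in BACKGROUND_PERCENT.items()}

def relative_entropy(p, q):
    """Kullback-Leibler distance D(p||q) in bits; terms with p_a = 0 contribute 0."""
    return sum(pa * math.log2(pa / q[aa]) for aa, pa in p.items() if pa > 0)

# Hypothetical examples: a position that is always Trp, and one that is always Leu.
d_trp = relative_entropy({"W": 1.0}, background)
d_leu = relative_entropy({"L": 1.0}, background)
print(f"conserved Trp: D = {d_trp:.2f} bits")
print(f"conserved Leu: D = {d_leu:.2f} bits")
```

With a non-uniform background, conserving a rare amino acid such as Trp gives a larger D than conserving a common one such as Leu, and a position that simply matches the background gives D = 0.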
7Information content
- $D(p\|q) = \sum_a p_a \log_2(p_a/q_a)$
- Often the Information content is used as a measure of the degree of conservation.
- $I = \sum_a p_a \log_2(p_a/q_a)$
- A special case is that where all amino acids have the same background frequency, $q_a = 1/20$
8Information content
- $I = \sum_a p_a \log_2(p_a/(1/20))$
- $= \sum_a p_a [\log_2 p_a - \log_2(1/20)]$
- $= -H(p) - \sum_a p_a \log_2(1/20)$
- $= -H(p) + \sum_a p_a \log_2(20)$
- $= -H(p) + \log_2(20)$
- $= -H(p) + 4.32$
9Information content
- $I = -H(p) + 4.32 = \sum_a p_a \log_2 p_a + 4.32$
- The Information content is at its maximum when the entropy is zero, i.e. a fully conserved position in a multiple alignment.
- Multiple alignment of 3 sequences
- Seq1 A L R
- Seq2 A V R
- Seq3 A I K
- Pos1: $I = 1 \cdot \log_2(1) + 4.32 = 4.32$
- Pos2: $I = \tfrac{1}{3}\log_2(\tfrac{1}{3}) + \tfrac{1}{3}\log_2(\tfrac{1}{3}) + \tfrac{1}{3}\log_2(\tfrac{1}{3}) + 4.32 \approx 2.74$
- Pos3: $I = \tfrac{2}{3}\log_2(\tfrac{2}{3}) + \tfrac{1}{3}\log_2(\tfrac{1}{3}) + 4.32 \approx 3.40$
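The information content of each position in the three-sequence example can be computed the same way. The sketch below assumes the uniform background $q_a = 1/20$ from the slides, so that I = log2(20) - H(p); the function name `information_content` is illustrative and gaps are not handled.

```python
import math
from collections import Counter

def information_content(column, alphabet_size=20):
    """Information content in bits, assuming a uniform background 1/alphabet_size."""
    counts = Counter(column)
    total = sum(counts.values())
    entropy = sum((n / total) * math.log2(total / n) for n in counts.values())
    return math.log2(alphabet_size) - entropy

# The alignment from the slide: Seq1 ALR, Seq2 AVR, Seq3 AIK
alignment = ["ALR", "AVR", "AIK"]
info = [information_content("".join(col)) for col in zip(*alignment)]

for pos, i in enumerate(info, start=1):
    print(f"Pos{pos}: I = {i:.2f} bits")
# Pos1: I = 4.32 bits
# Pos2: I = 2.74 bits
# Pos3: I = 3.40 bits
```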
10Logo plots - HowTo
11Logo plots - Information Content
Calculate Information Content: $I = \sum_a p_a \log_2 p_a + \log_2(4)$. The maximal value is 2 bits.
[Figure annotations: "Completely conserved"; "0.5 each"]
- Total height at a position is the Information Content measured in bits.
- Height of a letter is proportional to the frequency of that letter.
- A Logo plot is a visualization of a multiple alignment.
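The two rules for letter heights can be sketched in Python for DNA (alphabet size 4, maximum 2 bits). The function name `logo_heights` and the two toy columns are assumptions made for illustration; real logo programs typically also apply small-sample corrections that are left out here.

```python
import math
from collections import Counter

def logo_heights(column, alphabet_size=4):
    """Height in bits of each letter in one logo column.

    Total column height = information content; each letter gets a share
    proportional to its frequency.
    """
    counts = Counter(column)
    total = sum(counts.values())
    freqs = {base: n / total for base, n in counts.items()}
    entropy = sum(p * math.log2(1 / p) for p in freqs.values())
    info = math.log2(alphabet_size) - entropy
    # Least frequent letters first, i.e. drawn at the bottom of the stack.
    return {base: p * info for base, p in sorted(freqs.items(), key=lambda kv: kv[1])}

conserved = logo_heights("GGGG")   # completely conserved column
half_half = logo_heights("AAGG")   # two nucleotides at frequency 0.5 each
print(conserved)  # {'G': 2.0}
print(half_half)  # {'A': 0.5, 'G': 0.5}
```

A completely conserved column reaches the full 2 bits, while a column split evenly between two nucleotides has a total height of 1 bit, shared equally.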
12Programs to make a Logo plot
- WebLogo
- Requires a multiple alignment as input
- Protein or DNA sequences
- More output formats
- Blast2Logo
- Requires a fasta file as input
- Only protein sequences
- Runs PSI-BLAST and makes a table of frequencies
- PDF logo plot
13WebLogo - http://weblogo.berkeley.edu/
14WebLogo - http://weblogo.berkeley.edu/
15Find important positions
>sp|Q00017|RHA1_ASPAC Rhamnogalacturonan acetylesterase
MKTAALAPLFFLPSALATTVYLAGDSTMAKNGGGSGTNGWGEYLASYLSATVVNDAVAGR
SARSYTREGRFENIADVVTAGDYVIVEFGHNDGGSLSTDNGRTDCSGTGAEVCYSVYDGV
NETILTFPAYLENAAKLFTAKGAKVILSSQTPNNPWETGTFVNSPTRFVEYAELAAEVAG
VEYVDHWSYVDSIYETLGNATVNSYFPIDHTHTSPAGAEVVAEAFLKAVVCTGTSLKSVL
TTTSFEGTCL
What is the next step?
- Find homologous sequences. How?
- BLAST or PSI-BLAST
- Download sequences
- Make a multiple alignment
- ClustalW or others
- or use the Blast2Logo program
16Multiple alignment programs
17Blast2logo - http://www.cbs.dtu.dk/biotools/Blast2logo-1.0/
18Important positions
- Important positions in proteins are conserved positions -> high Information Content.
- Conserved for a reason
- Functionally important positions
- Catalytic residues
- Structurally important positions
- Maintain the correct fold of the protein
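One simple way to look for candidate important positions is to rank alignment columns by information content. The sketch below is illustrative only: the toy alignment and the names `information_content` and `most_conserved` are hypothetical, a uniform background of 1/20 is assumed, and gaps, pseudocounts and sequence weighting are ignored.

```python
import math
from collections import Counter

def information_content(column, alphabet_size=20):
    """Information content in bits, assuming a uniform background."""
    counts = Counter(column)
    total = sum(counts.values())
    entropy = sum((n / total) * math.log2(total / n) for n in counts.values())
    return math.log2(alphabet_size) - entropy

def most_conserved(alignment, top=3):
    """Return the 1-based positions with the highest information content."""
    scored = [
        (information_content(col), pos)
        for pos, col in enumerate(zip(*alignment), start=1)
    ]
    # Highest information content first; ties broken by position.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [pos for _, pos in scored[:top]]

# Hypothetical toy alignment: columns 1 and 5 are fully conserved.
toy_alignment = ["GDSAH", "GDSVH", "GESLH", "GDTIH"]
print(most_conserved(toy_alignment, top=2))  # [1, 5]
```

High information content only flags a position as conserved; whether it is functionally or structurally important still has to be judged from other evidence.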
19Blast2logo
- Runs iterative BLAST, i.e. PSI-BLAST
- Searches for homologous sequences by use of Position Specific Scoring Matrices (PSSM).
- Iteration 1: use the Blosum62 scoring matrix
- Iteration 2: make a profile of the sequences found in iteration 1
- Iteration 3: make a profile of the sequences found in iteration 2
- Iteration 4: calculate amino acid frequencies at each position in the query sequence. Correct for low counts and weight the sequences such that very similar sequences are down-weighted.
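The slides do not say which weighting or low-count correction Blast2logo uses. As a rough sketch of the idea only, the code below uses position-based sequence weights in the style of Henikoff and Henikoff and a simple uniform pseudocount mix; both choices, the parameter `pseudo_weight`, and all function names are assumptions for illustration, not the program's actual method. Gap characters are not handled.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def position_based_weights(alignment):
    """Sequence weights that down-weight very similar sequences (sum to 1)."""
    weights = [0.0] * len(alignment)
    for column in zip(*alignment):
        counts = Counter(column)
        distinct = len(counts)  # number of different residues in this column
        for i, residue in enumerate(column):
            # Residues shared by many sequences contribute less per sequence.
            weights[i] += 1.0 / (distinct * counts[residue])
    total = sum(weights)
    return [w / total for w in weights]

def weighted_frequencies(column, weights, pseudo_weight=0.05):
    """Weighted residue frequencies, mixed with a uniform pseudocount distribution."""
    freqs = {aa: 0.0 for aa in AMINO_ACIDS}
    for residue, w in zip(column, weights):
        freqs[residue] += w
    uniform = 1.0 / len(AMINO_ACIDS)
    return {aa: (1 - pseudo_weight) * f + pseudo_weight * uniform
            for aa, f in freqs.items()}

# Hypothetical toy alignment: the first two sequences are identical.
toy_alignment = ["ALR", "ALR", "AVK"]
weights = position_based_weights(toy_alignment)
column2 = [seq[1] for seq in toy_alignment]
freqs2 = weighted_frequencies(column2, weights)
print([round(w, 3) for w in weights])
print(round(freqs2["L"], 3), round(freqs2["V"], 3))
```

In this toy case the two identical sequences share their weight, so the single divergent sequence counts for more than one third, and the pseudocount mix keeps unobserved amino acids from getting a frequency of exactly zero.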
20Important positions - counting
21Example. Where is the active site?
- Sequence profiles might show you where to look!
- The active site could be around S9, G42, N74, and H195
22Exercise
- Calculate nucleotide frequencies from a multiple alignment of human donor sites
- Calculate Entropy and Information content
- Draw (by hand) a Logo plot
- Use 2 Logo plot programs
- Learn to interpret Logo frequency plots
- Active site residues, structural residues