Title: CS177 Review/Summary of the Madej lectures
1CS177 Review/Summary of the Madej lectures
2Overview
- Basic biology.
- Protein/DNA sequence comparison.
- Protein structure comparison/classification.
- NCBI databases overview.
- Miscellaneous topics.
4Lodish et al., Molecular Cell Biology, W.H. Freeman, 2000
5Protein/DNA sequence comparison
- What is the meaning of a sequence alignment?
- Scoring methods: amino acid substitution matrices, PSSMs.
- Basic computational methods, e.g. BLAST.
- Know how to run PSI-BLAST and interpret the results.
6Homology
- "... whenever statistically significant sequence or structural similarity between proteins or protein domains is observed, this is an indication of their divergent evolution from a common ancestor or, in other words, evidence of homology."
- E.V. Koonin and M.Y. Galperin, Sequence - Evolution - Function, Kluwer, 2003
8A simple phylogenetic tree
9Human hemoglobin and more distantly related
globins
- Human and horse
- Human and fish
- Human and insect
- Human and bacteria
10Alignment notation: different notations for the same alignment!
VISDWNMPN-------MDGLE
CILVV----AANDGPMPQTRE

VISDWnm---pnMDGLE
CILVVaandgpmPQTRE
11Computing sequence alignments
- You must be able to recognize the answer (correct alignment) when you see it (scoring system).
- You must be able to find the answer, i.e. compute it efficiently.
12Scoring and computing alignments
- Position-independent amino acid substitution tables, e.g. BLOSUM62.
- Local alignment algorithms such as Smith-Waterman (dynamic programming, guaranteed optimal) or fast heuristics such as BLAST.
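As a rough illustration of the dynamic programming behind Smith-Waterman, the sketch below computes the best local alignment score of two sequences. For brevity it assumes a simple match/mismatch score and a linear gap penalty; the values 2, -1 and -1 are illustrative choices, not the BLOSUM62 and affine gap settings used elsewhere in these slides.

```python
def smith_waterman_score(seq1, seq2, match=2, mismatch=-1, gap=-1):
    """Return the best local alignment score (linear gap penalty)."""
    previous = [0] * (len(seq2) + 1)  # row 0 of the DP table
    best = 0
    for a in seq1:
        current = [0]  # column 0 is always 0
        for j, b in enumerate(seq2, start=1):
            diagonal = previous[j - 1] + (match if a == b else mismatch)
            up = previous[j] + gap        # residue of seq1 aligned to a gap
            left = current[j - 1] + gap   # residue of seq2 aligned to a gap
            cell = max(0, diagonal, up, left)  # 0 lets an alignment restart
            current.append(cell)
            best = max(best, cell)
        previous = current
    return best


# Only the shared ACGT segment can contribute to the local score.
print(smith_waterman_score("WWWACGTWWW", "YYACGTYY"))
```

Only two rows of the table are kept because the sketch reports the score alone; recovering the alignment itself would require storing the full table and tracing back from the best cell.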
14Score this alignment
VISDWnm---pnMDGLE
CILVVaandgpmPQTRE
Use the BLOSUM62 matrix, gap opening penalty 10, gap extension penalty 1.
(-1 + 4 - 2 - 3 - 3) - (10 + 11 × 1) + (-2 + 0 - 2 - 2 + 5) = -27
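The arithmetic for this slide can be checked with a short script. This is a minimal sketch: it hard-codes only the ten BLOSUM62 entries this alignment needs, writes the alignment in the dash notation from the alignment-notation slide, and assumes (as the slide's total implies) that the 11 adjacent gap columns count as one gap costing 10 + 11 × 1.

```python
# Only the BLOSUM62 entries needed for this particular alignment.
BLOSUM62_SUBSET = {
    ("V", "C"): -1, ("I", "I"): 4, ("S", "L"): -2, ("D", "V"): -3, ("W", "V"): -3,
    ("M", "P"): -2, ("D", "Q"): 0, ("G", "T"): -2, ("L", "R"): -2, ("E", "E"): 5,
}


def pair_score(a, b):
    """Look up a substitution score; the matrix is symmetric."""
    if (a, b) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(a, b)]
    return BLOSUM62_SUBSET[(b, a)]


def score_alignment(row1, row2, gap_open=10, gap_extend=1):
    """Score a gapped alignment; each run of gap columns is charged once."""
    assert len(row1) == len(row2)
    total = 0
    gap_run = 0
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            gap_run += 1
            continue
        if gap_run:  # a gap run just ended
            total -= gap_open + gap_extend * gap_run
            gap_run = 0
        total += pair_score(a, b)
    if gap_run:  # alignment ended inside a gap
        total -= gap_open + gap_extend * gap_run
    return total


print(score_alignment("VISDWNMPN-------MDGLE", "CILVV----AANDGPMPQTRE"))
```

Other programs charge gaps differently (for example, opening penalty for the first gapped position and extension only for the rest), so the same alignment can receive a slightly different score elsewhere.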
15BLAST (Basic Local Alignment Search Tool)
- Extremely fast: can be on the order of 50-100 times faster than Smith-Waterman.
- Method of choice for database searches.
- Statistical theory for significance of results (extreme value distribution).
- Heuristic: does not guarantee optimal results.
- Many variants, e.g. PHI-, PSI-, RPS-BLAST.
16Why database searches?
- Gene finding.
- Assigning likely function to a gene.
- Identifying regulatory elements.
- Understanding genome evolution.
- Assisting in sequence assembly.
- Finding relations between genes.
17Issues in database searches
- Speed.
- Relevance of the search results (selectivity).
- Recovering all information of interest (sensitivity).
- The results depend on the search parameters, e.g. gap penalty, scoring matrix.
- Sometimes searches with more than one matrix should be performed.
18E-values, P-values
- E-value (expectation value): the number of hits with at least the given score that you would expect by random chance in the search database.
- P-value (probability value): the probability that a hit would attain at least the given score by random chance in the search database.
- E-values are easier to interpret than P-values.
- If the E-value is small enough, e.g. no more than 0.10, then it is approximately equal to the P-value.
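The last point can be made concrete. Assuming the number of chance hits follows a Poisson distribution (the usual assumption in BLAST statistics), the probability of at least one such hit is P = 1 - exp(-E). The sketch below tabulates this for a few E-values; the function name is illustrative.

```python
import math


def p_value_from_e_value(e_value):
    """P = 1 - exp(-E), assuming chance hits are Poisson distributed."""
    return -math.expm1(-e_value)  # expm1 keeps precision for tiny E


for e in (0.001, 0.01, 0.1, 1.0, 10.0):
    print(f"E = {e:<6} P = {p_value_from_e_value(e):.4f}")
```

For small E the two numbers nearly coincide, while for large E the P-value saturates near 1 and stops being informative, which is one reason E-values are the easier quantity to read.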
19PSI-BLAST
- Position Specific Iterated BLAST.
- As a first step, runs a (regular) BLAST.
- Hits that cross the threshold are used to construct a position specific score matrix (PSSM).
- A new search is done using the PSSM to find more remotely related sequences.
- The last two steps are iterated until convergence.
20PSSM (Position Specific Score Matrix)
- One column per residue in the query sequence.
- Per-column residue frequencies are computed so that log-odds scores may be assigned to each residue type in each column.
- There are difficulties: e.g. pseudo-counts are needed if there are few sequences, and the sequences must be weighted to compensate for redundancy.
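To make the log-odds idea concrete, here is a simplified sketch for a single alignment column. It assumes a uniform background frequency of 1/20 for every residue and a single pseudo-count spread according to that background, and it skips sequence weighting entirely; PSI-BLAST's actual scheme is more elaborate, so treat the numbers as illustrative only.

```python
import math

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def column_log_odds(column, pseudocount=1.0):
    """Log-odds scores (in bits) for one alignment column.

    Assumes a uniform background of 1/20 per residue and adds `pseudocount`
    extra observations distributed according to that background.
    """
    background = 1.0 / len(AMINO_ACIDS)
    residues = [r for r in column.upper() if r in AMINO_ACIDS]  # skip gaps
    total = len(residues) + pseudocount
    scores = {}
    for aa in AMINO_ACIDS:
        frequency = (residues.count(aa) + pseudocount * background) / total
        scores[aa] = math.log2(frequency / background)
    return scores


conserved = column_log_odds("SSSSSSSS")  # e.g. an invariant serine
variable = column_log_odds("SATSGNSA")   # serine mixed with other residues
print(round(conserved["S"], 2), round(variable["S"], 2))
```

Without the pseudo-count, any residue absent from a column would have frequency zero and a log-odds score of minus infinity, which is the difficulty the slide alludes to when few sequences are available.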
21Two key advantages of PSSMs
- More sensitive scoring because of improved estimates of probabilities for amino acids at specific positions.
- Describes the important motifs that occur in the protein family and therefore enhances the selectivity.
22Position Specific Substitution Rates
Weakly conserved serine
Active site serine
23Position Specific Score Matrix (PSSM)
Pos Res   A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
206 D     0  -2   0   2  -4   2   4  -4  -3  -5  -4   0  -2  -6   1   0  -1  -6  -4  -1
207 G    -2  -1   0  -2  -4  -3  -3   6  -4  -5  -5   0  -2  -3  -2  -2  -1   0  -6  -5
208 V    -1   1  -3  -3  -5  -1  -2   6  -1  -4  -5   1  -5  -6  -4   0  -2  -6  -4  -2
209 I    -3   3  -3  -4  -6   0  -1  -4  -1   2  -4   6  -2  -5  -5  -3   0  -1  -4   0
210 D    -2  -5   0   8  -5  -3  -2  -1  -4  -7  -6  -4  -6  -7  -5   1  -3  -7  -5  -6
211 S     4  -4  -4  -4  -4  -1  -4  -2  -3  -3  -5  -4  -4  -5  -1   4   3  -6  -5  -3
212 C    -4  -7  -6  -7  12  -7  -7  -5  -6  -5  -5  -7  -5   0  -7  -4  -4  -5   0  -4
213 N    -2   0   2  -1  -6   7   0  -2   0  -6  -4   2   0  -2  -5  -1  -3  -3  -4  -3
214 G    -2  -3  -3  -4  -4  -4  -5   7  -4  -7  -7  -5  -4  -4  -6  -3  -5  -6  -6  -6
215 D    -5  -5  -2   9  -7  -4  -1  -5  -5  -7  -7  -4  -7  -7  -5  -4  -4  -8  -7  -7
216 S    -2  -4  -2  -4  -4  -3  -3  -3  -4  -6  -6  -3  -5  -6  -4   7  -2  -6  -5  -5
217 G    -3  -6  -4  -5  -6  -5  -6   8  -6  -8  -7  -5  -6  -7  -6  -4  -5  -6  -7  -7
218 G    -3  -6  -4  -5  -6  -5  -6   8  -6  -7  -7  -5  -6  -7  -6  -2  -4  -6  -7  -7
219 P    -2  -6  -6  -5  -6  -5  -5  -6  -6  -6  -7  -4  -6  -7   9  -4  -4  -7  -7  -6
220 L    -4  -6  -7  -7  -5  -5  -6  -7   0  -1   6  -6   1   0  -6  -6  -5  -5  -4   0
221 N    -1  -6   0  -6  -4  -4  -6  -6  -1   3   0  -5   4  -3  -6  -2  -1  -6  -1   6
222 C     0  -4  -5  -5  10  -2  -5  -5   1  -1  -1  -5   0  -1  -4  -1   0  -5   0   0
223 Q     0   1   4   2  -5   2   0   0   0  -4  -2   1   0   0   0  -1  -1  -3  -3  -4
224 A    -1  -1   1   3  -4  -1   1   4  -3  -4  -3  -1  -2  -2  -3   0  -2  -2  -2  -3
Serine scored differently in these two positions
Active site nucleophile
24PSI-BLAST key points
- The first PSSM is constructed from all hits that cross the significance threshold using standard BLAST.
- The search is then carried out with the PSSM to draw in new significant hits.
- If new hits are found then a new PSSM is constructed; these last two steps are iterated.
- The computation terminates upon convergence, i.e. when no new sequences are found to cross the significance threshold.
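The control flow of these key points can be sketched as a loop. This is only a schematic: `search` and `build_profile` are hypothetical callables standing in for the database search and PSSM construction, and the toy data at the bottom replaces real sequences with names and a lookup table of "significant" similarities.

```python
def iterate_to_convergence(query, search, build_profile, max_rounds=20):
    """Repeat profile building and searching until no new hits appear.

    `search` and `build_profile` are hypothetical callables supplied by the
    caller; they stand in for the database search and PSSM construction.
    """
    included = set(search(query))  # round 1: ordinary search with the query
    for _ in range(max_rounds):
        profile = build_profile(query, included)
        new_hits = set(search(profile)) - included
        if not new_hits:  # convergence: nothing new crossed the threshold
            break
        included |= new_hits
    return included


# Toy stand-ins: each "sequence" is a name; SIMILAR lists significant pairs.
SIMILAR = {"q": {"a"}, "a": {"q", "b"}, "b": {"a", "c"}, "c": {"b"}, "z": set()}


def toy_profile(query, members):
    return {query} | set(members)


def toy_search(model):
    names = {model} if isinstance(model, str) else model
    return set().union(*(SIMILAR[n] for n in names)) - {"q"}


print(sorted(iterate_to_convergence("q", toy_search, toy_profile)))
```

In the toy data the query only resembles "a" directly, but each round's profile pulls in one more distant relative until nothing new appears, mirroring how PSI-BLAST reaches remote homologs; "z" is never found because nothing links to it.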
25Protein structure comparison/classification
- Protein secondary structure elements.
- Supersecondary structures (simple structure motifs).
- Folds and domains.
- Comparing structures (VAST).
- Superfolds.
- Fold classification (SCOP).
- Conserved Domain Database (CDD).
26α-helix (3chy)
backbone atoms
with sidechains
27Parallel β-strands (3chy)
28Anti-parallel β-strands (1hbq)
29Higher level organization
- A single protein may consist of multiple domains. Examples: 1liy A, 1bgc A. The domains may or may not perform different functions.
- Proteins may form higher-level assemblies. Useful for complicated biochemical processes that require several steps, e.g. processing/synthesis of a molecule. Example: 1l1o chains A, B, C.
30Supersecondary structures
- β-hairpin
- α-hairpin
- βαβ-unit
- β4 Greek key
- βα Greek key
31Supersecondary structure: simple units
G.M. Salem et al., J. Mol. Biol. (1999) 287: 969-981
32Supersecondary structure: Greek key motifs
G.M. Salem et al., J. Mol. Biol. (1999) 287: 969-981
33Protein folds
- There is a continuum of similarity!
- Fold definition: two folds are similar if they have a similar arrangement of SSEs (architecture) and connectivity (topology). Sometimes a few SSEs may be missing.
- Fold classification: to get an idea of the variety of different folds, one must adjust for sequence redundancy and also try to correctly assign homologs that have low sequence identity (e.g. below 25%).
34Vector Alignment Search Tool (VAST)
- Fast structure comparison based on representing SSEs by vectors.
- A measure of statistical significance (VAST E-value) is computed (very differently from a BLAST E-value).
- VAST structure neighbor lists are useful for recognizing structural similarity.
35Superfolds (Orengo, Jones, Thornton)
- Distribution of fold types is highly non-uniform.
- There are about 10 types of folds, the superfolds, to which about 30% of the other folds are similar. There are many examples of isolated fold types.
- Superfolds are characterized by a wide range of sequence diversity and span a range of non-similar functions.
- The evolutionary relationships of the superfolds are a research question, i.e. do they arise by divergent or convergent evolution?
36Superfolds and examples
- Globin: 1hlm sea cucumber hemoglobin; 1cpcA phycocyanin; 1colA colicin
- α-up-down: 2hmqA hemerythrin; 256bA cytochrome B562; 1lpe apolipoprotein E3
- Trefoil: 1i1b interleukin-1β; 1aaiB ricin; 1tie erythrina trypsin inhibitor
- TIM barrel: 1timA triosephosphate isomerase; 1ald aldolase; 5rubA rubisco
- OB fold: 1quqA replication protein A 32kDa subunit; 1mjc major cold-shock protein; 1bcpD pertussis toxin S5 subunit
- α/β doubly-wound: 5p21 Ras p21; 4fxn flavodoxin; 3chy CheY
- Immunoglobulin: 2rhe Bence-Jones protein; 2cd4 CD4; 1ten tenascin
- UB αβ roll: 1ubq ubiquitin; 1fxiA ferredoxin; 1pgx protein G
- Jelly roll: 2stv tobacco necrosis virus; 1tnfA tumor necrosis factor; 2ltnA pea lectin
- Plaitfold (split αβ sandwich): 1aps acylphosphatase; 1fxd ferredoxin; 2hpr histidine-containing phosphocarrier
37Fold classification (when you have the structure)
- First, look up PubMed abstracts for any relevant papers. E.g. if this is from a PDB file there will be references in it.
- Try checking SCOP or CATH.
- Look at VAST neighbors. See if the structure in question is highly similar to another structure with a known fold.
38SCOP (Structural Classification of Proteins)
- http://scop.mrc-lmb.cam.ac.uk/scop/
- Levels of the SCOP hierarchy:
- Family: clear evolutionary relationship
- Superfamily: probable common evolutionary origin
- Fold: major structural similarity
40Bioinformatics databases
- Entrez is by far the most useful, because of the links between the individual databases, e.g. literature, sequence, structure, taxonomy, etc.
- Other specialty databases available on the internet can also be very useful, of course!
41The (ever expanding) Entrez System
[Figure: Entrez database nodes, including NLM Catalog, PubChem (Compounds, BioAssays, Substances), Literature, Organism, Expression, HomoloGene, Gene]
42 Links Between and Within Nodes
[Figure: computational links between and within Entrez nodes; labels include word weight, 3-D structures (VAST), phylogeny, and protein sequences (BLAST)]
43Entrez queries
- Be able to formulate queries using index terms
(Preview/Index), and limits.
45Exercises!
- How many protein structures are there that include DNA and are from bacteria?
- In PubMed, how many articles are there from the journal Science that have Alzheimer in the title or abstract, and amyloid beta anywhere? How many since the year 2000?
- Notice that the results are not 100% accurate!
- In 3D Domains, how many domains are there with no more than two helices and 8 to 10 strands that are from the mouse?
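The PubMed exercise can also be phrased as a fielded query string. The sketch below only builds the request URLs for NCBI's E-utilities ESearch service and does not send them; the field tags ([Journal], [Title/Abstract], [All Fields], [dp]) and the open-ended date range 2000:3000 are written here as assumptions to be checked against Preview/Index, and the helper name is illustrative.

```python
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def pubmed_count_url(term):
    """Build (but do not send) an ESearch URL that asks only for the hit count."""
    params = {"db": "pubmed", "term": term, "rettype": "count"}
    return ESEARCH + "?" + urlencode(params)


term = "Science[Journal] AND Alzheimer[Title/Abstract] AND amyloid beta[All Fields]"
since_2000 = term + " AND 2000:3000[dp]"  # publication date from 2000 onward

print(pubmed_count_url(term))
print(pubmed_count_url(since_2000))
```

The same query strings can be pasted directly into the PubMed search box, which is the route the exercise itself expects.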
46P53 tumor suppressor protein
- Li-Fraumeni syndrome: having only one functional copy of p53 predisposes to cancer.
- Mutations in p53 are found in most tumor types.
- p53 binds to DNA and stimulates another gene to produce p21, which binds to another protein, cdk2. This prevents the cell from progressing through the cell cycle.
47G. Giglia-Mari, A. Sarasin, Hum. Mutat. (2003) 21: 217-228.
48Exercise!
- Use Cn3D to investigate the binding of p53 to DNA.
- Formulate a query for Structure that will require the DNA molecules to be present (there are 2 structures like this).
49Miscellaneous topics
- BLAST a sequence against a genome; locate hits on chromosomes with Map Viewer.
- Obtain genomic sequence with Map Viewer.
- Spidey to predict intron/exon structure (but we won't use Spidey on the exam!).
- How sequence variations can affect protein structure/function.
50EST exercise summary
- BLAST the EST (or other DNA seq) against the genome.
- From the BLAST output you can get the genomic coordinates of any nucleotide differences.
- Use Map Viewer to locate the hit on a chromosome; assume the hit is in the region of a gene.
- By following the gene link you can get an accession for mRNA.
- By using the dl link you can get an accession for the genomic sequence.
- Use Spidey with the mRNA and genomic sequence to locate changed residues in the protein.
51EST exercise summary (cont.)
- From the gene report you can follow the protein link, and then BLink.
- From the BLAST Link (BLink) page you can get to CDD and related structures.
- Since you know where the changed residues are, you can use the structures to study what effect the changes might have on the function of the protein.
52Gene variants that can affect protein function
- Mutation to a stop codon truncates the protein product!
- Insertion/deletion of multiple bases changes the sequence of amino acid residues.
- Single point change could alter folding properties of the protein.
- Single point change could affect the active site of the protein.
- Single point change could affect an interaction site with another molecule.
53Important note!
- Most diseases (e.g. cancer) are complex and
involve multiple factors (not just a single
malfunctioning protein!).
54Investigating a genetic disease
- The following EST comes from a hemochromatosis patient; your task is to identify the gene and the specific mutation causing the illness, and why the protein is not functioning properly.
- The sequence:
- TGCCTCCTTTGGTGAAGGTGACACATCATGTGACCTCTTCAG
- TGACCACTCTACGGTGTCGGGCCTTGAACTACTACCCCCAGA
- ACATCACCATGAAGTGGCTGAAGGATAAGCAGCCAATGGAT
- GCCAAGGAGTTCGAACCTAAAGACGTATTGCCCAATGGGGA
- TGGGACCTACCAGGGCTGGATAACCTTGGCTGTACCCCCTGG
- GGAAGAGCAGAGATATACGTACCAGGTGGAGCACCCAGGCC
- TGGATCAGCCCCTCATTGTGATCTGGG
55ESTs
- Expressed Sequence Tags: useful for discovering genes, obtaining data on gene expression/regulation, and in genome mapping.
- Short nucleotide sequences (200-500 bases or so) derived from mRNA expressed in cells.
- The introns from the genes will already be spliced out.
- mRNA is unstable, however, and so it is reverse transcribed into cDNA.
56Hemochromatosis 2
- BLAST the EST vs. the human genome (could take a few minutes).
  - Which chromosome is hit?
  - What is the contig that is hit (reference assembly)?
  - Is the EST identical to the genomic sequence?
  - Take note of the coords of the difference.
- Click on Genome View.
- Select the map element at the bottom corresponding to the contig.
57Hemochromatosis 3
- What gene is hit? Zoom in on the BLAST hit a few times.
- Display the entire gene sequence via dl and Display.
- Copy and save the genomic sequence.
- Record the coords for the start of the genomic sequence.
58Hemochromatosis 4
- Click on the UniGene link Hs.233325.
- Note: the expression profile presents data for the expression level of the gene in various tissues.
- How many mRNAs and ESTs are there for the HFE gene?
- Take note of the mRNA accession NM_000410.
59Hemochromatosis 5
- Go to Spidey: http://www.ncbi.nlm.nih.gov/spidey/
- To determine the intron/exon structure, paste the HFE gene sequence into the upper box, and enter the HFE mRNA accession NM_000410 in the lower box.
- Click Align.
60Hemochromatosis 6
- How many exons are there?
- Which exon codes the residue that is changed in the original EST? (You have to do a little arithmetic!)
- Record some of the protein sequence around the changed residue: EQRYTCQVEHPG
61Hemochromatosis 7
- From the Map Viewer page click on the HFE gene link.
- How many HFE transcripts are there? Which is the longest isoform?
- Follow Links to Protein and then to the report for NP_000401.
- Determine the residue number that corresponds to the mutation.
62RNA splicing and isoforms
63Hemochromatosis 8
- What effect does the mutation in the original EST have on the protein? (Look at the table for the Genetic Code.)
- Go back to the Gene Report; read the summary and take note of the GeneRIF bibliography.
- Now go to Links and then to GeneView in dbSNP to get a list of known SNPs.
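The genetic-code lookup can be done by script as well as by table. The sketch below translates a 36-base stretch copied from the patient EST and compares it with the reference peptide EQRYTCQVEHPG recorded earlier; the choice of fragment and reading frame is an assumption made so that the translation lines up with that peptide.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate DNA in frame 1 using the standard genetic code."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


est_fragment = "GAGCAGAGATATACGTACCAGGTGGAGCACCCAGGC"  # from the patient EST
reference = "EQRYTCQVEHPG"                             # reference HFE peptide

observed = translate(est_fragment)
changes = [
    (position, ref, obs)
    for position, (ref, obs) in enumerate(zip(reference, observed), start=1)
    if ref != obs
]
print(observed, changes)
```

Positions in `changes` are counted within the 12-residue fragment, not within the full protein, so mapping the change to a residue number still requires the protein report.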
64Hemochromatosis 9
- In the SNP list note that the one you want is currently shown.
- Select view rs in gene region and then click on view rs.
- How many nonsynonymous substitutions do you see?
- Do you see the one we are particularly interested in?
65Digression: SNPs
- Single Nucleotide Polymorphisms.
- A single base change that can occur in a person's DNA.
- On average SNPs occur about 1% of the time; most are outside of protein coding regions.
- Some SNPs may cause a disease; some may be associated with a disease; others may affect disposition to a disease; others may be simple genetic variation.
- dbSNP archives SNPs and other variations such as small-scale deletion/insertion polymorphisms (DIPs), etc.
67Hemochromatosis 10
- Back to the Gene Report, click on Links and go to OMIM (can also get there via the Map Viewer).
- In the OMIM entry you can read a bit; also click on View List for Allelic Variants, where you can see the mutation again.
68Hemochromatosis 11
- From the Gene Report, again follow Links to Protein and scroll down to NP_000401.
- Click on Domains and then Show Details.
- What is the Conserved Domain in the region of interest?
- Follow the link to the CD.
- Click on View 3D Structure.
69Hemochromatosis 12
- Look for residue position 282 in the query sequence.
- Highlight that column.
- Is the Cys282 conserved in the family?
- The C282Y mutation therefore likely has the
effect of