Title: Bioinformatics: an introduction ii
1Bioinformatics: an introduction (ii)
- Nick Rhodes
- Room 323 (Lynch Laboratory)
- n.rhodes_at_sheffield.ac.uk
2Week 1
- Lecture
- I - definitions
- II - the central dogma - DNA, RNA, proteins
- III - data in biology molecular biology
- IV - (information management and interoperation)
- Practical
- information retrieval
- display of macromolecules
3Week 2
- Lecture - Extracting meaning from sequences
- I - comparing sequences
- II - alignment
- III - content and features
- IV - (content analysis in more detail) -
subcellular location prediction
- Practical
- prediction servers
- working with multiple sequences
4I Comparisons
- The ability to compare sequences is fundamental
to bioinformatics.
- By comparing an unknown protein or nucleic acid
sequence against the databases, the unknown may
be identified if a match is found.
- If highly similar entries are identified,
conclusions may be drawn about the identity and
function of the unknown.
5Definitions
- similarity
- sequences are in some sense similar; carries no
evolutionary connotations
- homology
- evolutionarily related sequences stemming from a
common ancestor
6Comparisons - two techniques
- sequence comparison
- merely detects common features
- sequence alignment
- places the residues or bases into an optimal
one-to-one correspondence
- implies selection of an appropriate scoring
scheme, which takes the form of a scoring matrix
7Comparison criteria
- comparisons utilise certain criteria
- e.g.
- substitution of K for R is a conservative
substitution (chemically similar)
- substitution of D for N is not conservative in
chemical terms but is in structural ones
- many sets (schemes) of criteria
- SCORE according to significance
- C v. C, W v. W score high
- C v. W scores low
8Comparison schemes
- chemical
- acidic (DE), aliphatic (AGILV), amide (NQ),
aromatic (FWY), basic (RHK), hydroxyl (ST), imino
(P), sulphur (CM)
- functional
- acidic, basic, hydrophobic (AILMFPWV), polar
(NCQGSTY)
- charge
- acidic, basic, neutral
- structural
- ambivalent (ACGPSTWY), external (RNDQEHK),
internal (ILMFV)
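The schemes above can be treated as lookup tables. The following is a minimal sketch, assuming a simple rule that is not stated on the slides: a substitution counts as "conservative" under a scheme when both residues fall in the same group. The function names are hypothetical.

```python
# Residue groups copied from the "chemical" and "structural" schemes above.
CHEMICAL = {
    "acidic": "DE", "aliphatic": "AGILV", "amide": "NQ", "aromatic": "FWY",
    "basic": "RHK", "hydroxyl": "ST", "imino": "P", "sulphur": "CM",
}
STRUCTURAL = {
    "ambivalent": "ACGPSTWY", "external": "RNDQEHK", "internal": "ILMFV",
}


def group_of(residue, scheme):
    """Return the name of the group that contains residue, or None."""
    for name, members in scheme.items():
        if residue in members:
            return name
    return None


def is_conservative(a, b, scheme):
    """Assumed rule: conservative if both residues share a group in the scheme."""
    group = group_of(a, scheme)
    return group is not None and group == group_of(b, scheme)


print(is_conservative("K", "R", CHEMICAL))    # K and R are both basic
print(is_conservative("D", "N", CHEMICAL))    # acidic versus amide
print(is_conservative("D", "N", STRUCTURAL))  # D and N are both external
```

This reproduces the two examples from the comparison criteria slide: K for R is conservative chemically, while D for N is conservative only in the structural scheme.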
9Willie Taylor's classification
10The PAM 250 scoring matrix
11Similarity helps identify residues
- functional
- catalytic residues
- binding residues
- structural
- interactions may be essential for maintenance of
3D structure
- formation of e.g. helices and sheets
- S-S bridges
12Similarity aids interpretation
- inferred relationships
- e.g. GPCRs, to elucidate relationships between
receptor families
- functional
- structure prediction
- (secondary and perhaps tertiary)
- interpretation of genome sequence data
inferences concerning the proteome
13Similarity - phylogenetic analysis
- evolutionary inference
- illustrates relationships and divergence between
sequences
- stages
- multiple alignment
- calculate similarity scores between each pair of
sequences
- construct a dendrogram or evolutionary tree
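The middle stage (pairwise similarity scores) can be sketched as follows. This is an illustration only: it assumes the sequences are already rows of one multiple alignment, uses percent identity as the similarity score, and works on made-up toy sequences. Tree-building programs typically use more refined distance measures.

```python
from itertools import combinations


def percent_identity(a, b, gap="-"):
    """Percent identical columns, skipping columns where either row has a gap."""
    if len(a) != len(b):
        raise ValueError("rows must come from the same alignment")
    pairs = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not pairs:
        return 0.0
    matches = sum(x == y for x, y in pairs)
    return 100.0 * matches / len(pairs)


def pairwise_scores(alignment):
    """Map each pair of sequence names to its percent identity."""
    return {
        (n1, n2): percent_identity(alignment[n1], alignment[n2])
        for n1, n2 in combinations(alignment, 2)
    }


# Toy alignment (hypothetical data, not from the lecture).
toy = {"seq1": "ABCD-EFGHI", "seq2": "ABCDXEFGHI", "seq3": "ABCDXEFGAA"}
scores = pairwise_scores(toy)
closest = max(scores, key=scores.get)  # the pair a dendrogram would join first
print(scores)
print(closest)
```

The most similar pair is the natural first join when the dendrogram is built; the remaining joins depend on the clustering method chosen.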
14Looking for similarities
- Simplest method is DIAGON
- plot every residue in one sequence
against every residue in the other
- this gives a matrix
- plot a dot for an identity
- plotting two identical sequences should produce a
diagonal line
15Diagon DotPlot
16Removing noise from DotPlots
Use a window, where for each diagonal the number
of hits must exceed a certain threshold (here 2,
2)
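A dot plot and the window filter can be sketched in a few lines. This is a minimal illustration rather than the DIAGON implementation: it assumes a dot survives when the diagonal window it belongs to contains at least `threshold` identities, and the function names are hypothetical.

```python
def dot_matrix(seq_a, seq_b):
    """matrix[i][j] is True where seq_a[i] is identical to seq_b[j]."""
    return [[a == b for b in seq_b] for a in seq_a]


def filter_window(matrix, window=2, threshold=2):
    """Keep only dots lying in a diagonal window with enough identities."""
    rows = len(matrix)
    cols = len(matrix[0]) if rows else 0
    kept = [[False] * cols for _ in range(rows)]
    for i in range(rows - window + 1):
        for j in range(cols - window + 1):
            hits = sum(matrix[i + k][j + k] for k in range(window))
            if hits >= threshold:
                for k in range(window):
                    if matrix[i + k][j + k]:
                        kept[i + k][j + k] = True
    return kept


def render(matrix):
    """Draw the matrix as text: '*' for a dot, '.' for a blank."""
    return "\n".join("".join("*" if cell else "." for cell in row) for row in matrix)


raw = dot_matrix("GATTACA", "GATTACA")
clean = filter_window(raw, window=2, threshold=2)
print(render(raw))
print()
print(render(clean))
```

In the raw plot the repeated A and T residues produce isolated off-diagonal dots; after filtering with a window of 2 and a threshold of 2 only the main diagonal remains.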
17Scoring identities
18II Aligning two sequences
- 2 sequences
- ABCDXEFGHI, ABCDEFGHI
- alignment gives
- ABCDXEFGHI
- 5 identities
- ABCDEFGHI
- but introduce a gap (and penalise...)
- ABCDXEFGHI
- 9 identities
- ABCD EFGHI
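The effect of introducing a penalised gap can be reproduced with a small global alignment. The sketch below uses a Needleman-Wunsch style dynamic programme with assumed scores (match +1, mismatch 0, gap -1); these values are illustrative and are not taken from the lecture.

```python
def count_identities(row_a, row_b):
    """Count identical, non-gap columns in two aligned rows of equal length."""
    return sum(x == y and x != "-" for x, y in zip(row_a, row_b))


def global_align(a, b, match=1, mismatch=0, gap=-1):
    """Return (aligned_a, aligned_b, score) for one optimal global alignment."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover the alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[n][m]


# Without an internal gap (shorter sequence shifted right by one position).
print(count_identities("ABCDXEFGHI", "-ABCDEFGHI"))
# With a gap opposite the X.
row_a, row_b, total = global_align("ABCDXEFGHI", "ABCDEFGHI")
print(row_a)
print(row_b)
print(count_identities(row_a, row_b), total)
```

The ungapped comparison finds only the five identities in EFGHI, whereas the gapped alignment recovers all nine at the cost of one gap penalty.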
19Sample alignment
- mark identities
- indicate conservative substitutions
- gaps obvious
- remainder mismatches
EAGIGAVLKVLTTGLPALISWIKRKRQQG
--GIGAVLKVLTTGLPALISWISRKKRQQ-
20Pairwise alignment
- Basic Local Alignment Search Tool
- BLAST
- finds high-scoring local alignments between a
target sequence and a database (both either NA or
protein)
- used for searching databases for sequences
similar to a query sequence
http://www.ncbi.nlm.nih.gov/BLAST
21BLAST comparisons
- blastp
- compares a protein query sequence against the
protein sequence databases
- blastn
- compares a nucleotide query against the
nucleotide sequence databases
- blastx
- compares the translation products of all six
frames of a nucleotide query against the protein
sequence databases
22TBLAST comparisons
- tblastn
- compares a protein query against a nucleotide
sequence database dynamically translated in all
six frames
- tblastx
- compares the six frame translations of a
nucleotide query against a nucleotide sequence
database dynamically translated in all six frames
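The "six frame translation" used by blastx, tblastn and tblastx can be illustrated with a short sketch. It assumes the standard genetic code and an unambiguous DNA alphabet; it is not BLAST's own code, and the frame labels (+1 to +3, -1 to -3) are simply a convenient convention.

```python
from itertools import product

# Standard genetic code, laid out in the conventional TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons; unknown codons become 'X', stops become '*'."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frames(dna):
    """Return the translations of all three frames on both strands."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


frames = six_frames("ATGGCCTAA")
for label, protein in frames.items():
    print(label, protein)
```

Each of the six protein strings can then be compared against a protein database (blastx) or against the six translations of each database entry (tblastx).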
23Multiple alignments
- CLUSTALW
- Higgins D., Thompson J., Gibson T., Nucleic Acids
Res. 22:4673-4680 (1994).
CLUSTAL W (1.5) multiple sequence alignment
Bovine_Ins MALWTRLRPLLALLALWPPPPARAFVNQHLCGSHLVEALYLVCGERGFFYTPKARREVEG
Human_Insu MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAED

Bovine_Ins PQVGALELAGGPGAG-----GLEGPPQKRGIVEQCCASVCSLYQLENYCN
Human_Insu LQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN
24III Sequence content: proteins
- endoprotease and modification sites
- hydrophobicity
- secondary structure prediction
- e.g. Chou and Fasman, Garnier and Robson
- http://bmbsgi11.leeds.ac.uk/bmb5dp/gor.html
- transmembrane helix prediction server
- http://www.cbs.dtu.dk/services/TMHMM-1.0/
- secretory and transport signal peptides
- http://psort.nibb.ac.jp/
- phosphorylation, glycosylation etc.
25Structural predictions
- secondary structure (helices, sheets)
- simple, reasonably well understood
- association of secondary structures
- hydrophobic/hydrophilic faces
- transmembrane domains
- tertiary structure
- given a very similar molecule
26Predicting hydrophobicity
Hydrophobic regions of the molecule tend to
associate with other hydrophobic regions.
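Hydrophobic regions are commonly located with a sliding-window average of per-residue hydropathy values. The sketch below assumes the Kyte-Doolittle scale and a 9-residue window; the lecture does not say which scale or window its plot used, so both are illustrative choices.

```python
# Kyte-Doolittle hydropathy values (assumed scale; positive = hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(protein, window=9):
    """Mean hydropathy of every window of the given length along the sequence.

    Raises KeyError if the sequence contains a non-standard residue letter.
    """
    values = [KYTE_DOOLITTLE[res] for res in protein.upper()]
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


# N-terminal region of the human insulin precursor from the CLUSTAL W slide.
profile = hydropathy_profile("MALWMRLLPLLALLALWGPDPAAA", window=9)
peak = max(range(len(profile)), key=profile.__getitem__)
print(f"most hydrophobic window starts at residue {peak + 1}")
```

Peaks in the profile mark candidate buried or membrane-spanning segments; troughs mark regions likely to be exposed to solvent.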
27Prediction of helices (GOR)
helices in blue, sheet in yellow taking the last
helix... (next slide)
28Helical wheel plots
Can provide information about the association of
helices.
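A helical wheel places successive residues 100 degrees apart, because an alpha-helix has about 3.6 residues per turn. The sketch below computes those positions and a crude measure of how strongly the hydrophobic residues cluster on one face. The clustering measure and the function names are assumptions made for illustration; the hydrophobic set is the one from the functional scheme earlier in the lecture.

```python
import math

HYDROPHOBIC = set("AILMFPWV")  # "hydrophobic" group from the functional scheme


def wheel_positions(seq, degrees_per_residue=100.0):
    """Return (residue, angle in degrees) for each position around the wheel."""
    return [(res, (i * degrees_per_residue) % 360.0) for i, res in enumerate(seq)]


def hydrophobic_face_strength(seq):
    """Length of the mean direction vector of the hydrophobic residues.

    Close to 1 when they point the same way (one hydrophobic face),
    close to 0 when they are spread evenly around the helix.
    """
    angles = [math.radians(a) for res, a in wheel_positions(seq) if res in HYDROPHOBIC]
    if not angles:
        return 0.0
    x = sum(math.cos(a) for a in angles) / len(angles)
    y = sum(math.sin(a) for a in angles) / len(angles)
    return math.hypot(x, y)


for residue, angle in wheel_positions("LKKLLKKL"):  # hypothetical peptide
    print(residue, angle)
print(round(hydrophobic_face_strength("LKKLLKKL"), 2))
```

A helix with one clearly hydrophobic face is a candidate for packing against another helix or against a membrane surface.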
29Transmembrane helices
30Sequence content: DNA and RNA
- cognate sites within NA sequences
- endonuclease and modification sites
- coding regions
- promoter sequences
- terminator sequences
- local deviations in structure due to variations
in sequence
31Restriction mapping
32Searching for ORFs
- in any given DNA sequence
- 3 reading frames for each strand
- identify start and stop codons
- identify promoter/terminator sequences?
- look for the longest?
- some genomes have overlapping genes
- exploit codon preferences
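The first few steps above (six reading frames, start and stop codons, longest candidate) can be sketched as a naive ORF scan. This is a simplified illustration: it assumes ATG as the only start codon, ignores promoters, terminators and codon preference, and reports coordinates relative to the strand being scanned.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def find_orfs(dna, min_codons=1):
    """List (strand, frame, start, end, sequence) for each ATG...stop stretch.

    start and end are 0-based, end-exclusive positions on the scanned strand;
    min_codons counts codons before the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            start = None
            for i in range(frame, len(seq) - 2, 3):
                codon = seq[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    if (i - start) // 3 >= min_codons:
                        orfs.append((strand, frame + 1, start, i + 3, seq[start:i + 3]))
                    start = None
    return orfs


def longest_orf(dna):
    """Return the longest ORF found, or None if there is none."""
    return max(find_orfs(dna), key=lambda orf: len(orf[4]), default=None)


example = "GGATGAAACCCTAGTT"  # hypothetical sequence
print(find_orfs(example))
print(longest_orf(example))
```

Picking the longest ORF is only a heuristic, as the slide's question mark suggests: genomes with overlapping genes, or short genuine genes, will defeat it.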
33IV Subcellular location
- closely correlated with protein function
- predict location, predict function?
- all proteins synthesised in cytoplasm
- Blobel, Milstein and Sabatini
- proposed "signal hypothesis"
- secretory proteins marked by a signal peptide
34Sorting signals
- generally, the signals for any such "sorting"
reside in the primary sequence and take two
forms
- short and contiguous sequences of amino acid
residues
- specific 3D structures formed by two or more
non-contiguous sequences of amino acid residues
35Mammalian signal system
- well characterised
- proteins synthesised with N-terminal signal
peptides 13-36 residues long
- consist of
- 7 to 13 residue hydrophobic core
- several relatively hydrophilic residues
- at least one basic residue close to the N-terminus
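The features listed above lend themselves to a toy rule-based check. The sketch below is deliberately naive and is not how real prediction servers work: it assumes the hydrophobic core is the longest unbroken run of residues from the lecture's hydrophobic group (AILMFPWV) within the first 36 residues, and it only asks for a basic residue (RHK) somewhere before that run.

```python
HYDROPHOBIC = set("AILMFPWV")
BASIC = set("RHK")


def longest_hydrophobic_run(seq):
    """Return (start, length) of the longest unbroken hydrophobic run."""
    best_start, best_len, start = 0, 0, None
    for i, res in enumerate(seq + "-"):  # trailing sentinel closes a final run
        if res in HYDROPHOBIC:
            if start is None:
                start = i
        else:
            if start is not None and i - start > best_len:
                best_start, best_len = start, i - start
            start = None
    return best_start, best_len


def looks_like_signal_peptide(protein, scan=36, core_min=7, core_max=13):
    """Toy rule: 7-13 residue hydrophobic core preceded by a basic residue."""
    head = protein[:scan].upper()
    core_start, core_len = longest_hydrophobic_run(head)
    has_basic = any(res in BASIC for res in head[:core_start])
    return core_min <= core_len <= core_max and has_basic


# Human insulin precursor, first alignment block from the CLUSTAL W slide.
human_insulin = "MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAED"
print(longest_hydrophobic_run(human_insulin[:36]))
print(looks_like_signal_peptide(human_insulin))
```

A rule this rigid is easy to break: a slightly longer core, or an unusual residue inside it, flips the answer, which illustrates why such rules need knowledge of the processing system to tune.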
36Rule-based prediction methods
- "expert system" servers
- e-mail access
- require knowledge of processing system
- rely on existence of leader sequences
- unreliability of automatically assigned 5'-regions
37Statistical prediction methods (i)
based on amino acid (AA) composition
- Kuo-Chen Chou and David W. Elrod, Protein
Engineering 12(2), 107-118, 1999.
- different organelles have different
physico-chemical environments
- protein surface is highly likely to correlate
with AA composition
- structural class correlates with AA composition
- hence total AA composition may carry a signal that
identifies the subcellular location
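The underlying idea can be sketched by turning each protein into a 20-component composition vector and comparing it with reference compositions. The nearest-reference rule and the toy reference data below are assumptions for illustration; the classifier used in the cited paper is not reproduced here.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(protein):
    """Fraction of each of the 20 standard amino acids, in AMINO_ACIDS order."""
    counts = Counter(res for res in protein.upper() if res in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no standard residues in sequence")
    return [counts[aa] / total for aa in AMINO_ACIDS]


def distance(comp_a, comp_b):
    """Euclidean distance between two composition vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(comp_a, comp_b)))


def nearest_class(protein, references):
    """Return the reference label whose composition is closest to the query."""
    query = composition(protein)
    return min(references, key=lambda label: distance(query, references[label]))


# Hypothetical reference compositions; real ones would be averaged over many
# proteins of known location.
references = {
    "location_x": composition("AAAA"),
    "location_y": composition("CCCC"),
}
print(nearest_class("AAC", references))
```

With real training data each reference vector would summarise one subcellular location, and the quality of the prediction would depend heavily on how well the locations separate in composition space.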
38Statistical prediction methods (ii)
- three stage refinement of the technique
- predict
- intracellular / extracellular
- extracellular / integral membrane / anchored
membrane / intracellular / nuclear
- chloroplast / cytoplasm / cytoskeleton /
endoplasmic reticulum / extracellular / Golgi
apparatus / lysosome / mitochondrion / nucleus /
peroxisome / plasma membrane / vacuole
39The orexin story
- Sakurai, T. et al. Cell 92, 573-585, 1998.
- Background
- ESTs - Expressed Sequence Tags
- active sequences
- tissue specific
- low quality sequence
40Some ESTs were GPCRs
- GPCRs are
- largest class of targets of pharmaceutical
intervention
- superfamily - structural and functional
similarities but no sequence similarity
- Of the GPCRs identified
- some had known ligands
- unknown ligands - orphan GPCRs
41Supporting research
- molecular biology, tissue culture and analytical
biochemistry
- Cell lines expressing a range of orphan GPCRs
- Challenged with tissue extract fractions.
- protein chemistry
- identification of active component
- Orexin-A - 33 residues, N-terminal pyroglutamate,
C-terminally amidated. Bovine is identical to
rat.
- Orexin-B - 28 residues, C-terminally amidated
42Isolation of precursor polypeptide
- PCR based methods using highly degenerate primers
yielded a cDNA fragment encoding part of orexin-A
- extension by RACE methods to obtain the full
length cDNA
- found to encode both orexin-A and orexin-B
43The role of orexins?
- 46% identity with each other
- BLAST search revealed
- no similarity with other peptides
- not contained in any other existing GenBank entry
44Characterisation of orexins (i)
- 5'-most ATG preceded by an in-frame stop codon
- ORF 130 residues long
- residues 1-33 secretory signal sequence
- (hydrophobic core followed by small polar
residues)
- SignalP server predicted cleavage site to be
Ala32/Gln33
- Last residue of mature peptide is followed by Gly66,
serving as NH2 donor for C-terminal amidation
- Lys67-Arg68 - recognition site for prohormone
convertases
45Characterisation of orexins (ii)
- Arg69-Met96 Orexin-B
- Gly-Gly-Arg, a C-terminal amidation signal
- orexin-A
- rat, mouse and human are identical
- orexin-B
- rat and mouse identical
- human has two substitutions
- mouse and rat prepro-orexins show 95% identity
- human and rat show 83% identity
46Characterisation of receptors (i)
- OX1R most similar to
- neuropeptide Y receptor
- TRH receptor
- cholecystokinin type-A receptor
- NK2 neurokinin receptor
- consistent with orexins being small regulatory
peptides.
- Radioimmunoassays indicated that orexin-A is a
specific, high-affinity agonist for OX1R
- Orexin-B had much lower affinity - another higher
affinity receptor?
47Characterisation of receptors (ii)
- tBLASTn against dbEST revealed
- two highly similar entries which were later
shown to be both derived from the same transcript
coding for a GPCR designated OX2R
- AA identity between human OX1R and OX2R
- 64%
- OX2R binds both orexins with high affinity
- Both receptors were mapped and were the subjects of
extensive immunohistochemical and physiological
studies.
48GenBank accession numbers
- prepro-orexin cDNA
- human AF041240
- rat AF041241
- mouse AF041242
- OX1R cDNA
- human AF041243
- rat AF041244
- OX2R cDNA
- human AF041245
- rat AF041246
49References - books (i)
- Richard Durbin and Sean Eddy
- "Biological sequence analysis - probabilistic
models of proteins and nucleic acids", Cambridge
University Press, 1998.
- Teresa K. Attwood and David J. Parry-Smith
- Introduction to Bioinformatics, Addison Wesley
Longman Higher Education, 1999.
50References - books (ii)
- Andreas Baxevanis and B. F. Francis Ouellette
- Bioinformatics: A Practical Guide to the
Analysis of Genes and Proteins, John Wiley &
Sons, 1998.
- Gunnar von Heijne
- Sequence analysis in molecular biology:
treasure trove or trivial pursuit, Academic
Press, 1987.
51Conference proceedings
- H.A. Lim and C.R. Cantor
- Bioinformatics and Genome Research, World
Scientific Publishing, 1995.
- R. Hofestadt, T. Lengauer, M. Loffler and D.
Schomburg
- Bioinformatics, Springer-Verlag, 1997.
52Week (ii) exercises
- Use a browser to access the file below at the
CISRG web site. The exercises are all web-based.
http://cisrg.shef.ac.uk/inf6140/week2.htm