Bioinformatics: an introduction ii - PowerPoint PPT Presentation

1
Bioinformatics: an introduction (ii)
  • Nick Rhodes
  • Room 323 (Lynch Laboratory)
  • n.rhodes_at_sheffield.ac.uk

2
Week 1
  • Lecture
  • I - definitions
  • II - the central dogma - DNA, RNA, proteins
  • III - data in biology & molecular biology
  • IV - (information management & interoperation)
  • Practical
  • information retrieval
  • display of macromolecules

3
Week 2
  • Lecture - Extracting meaning from sequences
  • I - comparing sequences
  • II - alignment
  • III - content and features
  • IV - (content analysis in more detail)
    subcellular location prediction
  • Practical
  • prediction servers
  • working with multiple sequences

4
I Comparisons
  • The ability to compare sequences is fundamental
    to bioinformatics.
  • by comparing an unknown protein or nucleic acid
    sequence against the databases, the unknown may
    be identified if a match is found
  • if highly similar entries are identified,
    conclusions may be drawn about the identity and
    function of the unknown

5
Definitions
  • similarity
  • sequences are in some sense similar; the term
    carries no evolutionary connotations
  • homology
  • evolutionarily related sequences stemming from a
    common ancestor

6
Comparisons - two techniques
  • sequence comparison
  • merely detects common features
  • sequence alignment
  • places the residues or bases into an optimal
    one-to-one correspondence
  • Implies selection of an appropriate scoring
    scheme which takes the form of a scoring matrix.

7
Comparison criteria
  • comparisons utilise certain criteria
  • e.g.
  • substitution of K for R is a conservative
    substitution (chemically similar)
  • substitution of D for N is not conservative in
    chemical terms but is in structural ones
  • many sets (schemes) of criteria
  • SCORE according to significance
  • C v. C, W v. W score high
  • C v. W scores low

8
Comparison schemes
  • chemical
  • acidic (DE), aliphatic (AGILV), amide (NQ),
    aromatic (FWY), basic (RHK), hydroxyl (ST), imino
    (P), sulphur (CM)
  • functional
  • acidic, basic, hydrophobic (AILMFPWV), polar
    (NCQGSTY)
  • charge
  • acidic, basic, neutral
  • structural
  • ambivalent (ACGPSTWY), external (RNDQEHK),
    internal (ILMFV)
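
The groupings above can be turned into a simple lookup. The sketch below is only an illustration: the group memberships are copied from the slide, while the helper names `group_of` and `is_conservative` are made up here. It reproduces the K/R and D/N examples from the "Comparison criteria" slide.

```python
# Residue groupings copied from the slide; helper names are illustrative only.
CHEMICAL = {
    "acidic": "DE", "aliphatic": "AGILV", "amide": "NQ", "aromatic": "FWY",
    "basic": "RHK", "hydroxyl": "ST", "imino": "P", "sulphur": "CM",
}
STRUCTURAL = {
    "ambivalent": "ACGPSTWY", "external": "RNDQEHK", "internal": "ILMFV",
}


def group_of(residue, scheme):
    """Return the name of the group that contains `residue`, or None."""
    for name, members in scheme.items():
        if residue in members:
            return name
    return None


def is_conservative(a, b, scheme):
    """True when both residues fall in the same group of the scheme."""
    group = group_of(a, scheme)
    return group is not None and group == group_of(b, scheme)


print(is_conservative("K", "R", CHEMICAL))    # both basic
print(is_conservative("D", "N", CHEMICAL))    # acidic vs. amide
print(is_conservative("D", "N", STRUCTURAL))  # both external
```

A real scoring matrix such as PAM 250 assigns a graded score to every residue pair rather than a yes/no answer; the lookup only shows how a scheme partitions the alphabet.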

9
Willie Taylor's classification
10
The PAM 250 scoring matrix
11
Similarity helps identify residues
  • functional
  • catalytic residues
  • binding residues
  • structural
  • interactions may be essential for maintenance of
    3D structure
  • formation of e.g. helices & sheets
  • S-S bridges

12
Similarity aids interpretation
  • inferred relationships
  • e.g. GPCRs to elucidate relationships between
    receptor families
  • functional
  • structure prediction
  • (secondary and perhaps tertiary)
  • interpretation of genome sequence data:
    inferences concerning the proteome

13
Similarity - phylogenetic analysis
  • evolutionary inference
  • illustrates relationships and divergence between
    sequences
  • stages
  • multiple alignment
  • calculate similarity scores between each pair of
    sequences
  • construct a dendrogram or evolutionary tree

14
Looking for similarities
  • Simplest method is DIAGON
  • plot every residue in one sequence
  • against every residue in the other
  • this gives a matrix
  • plot a dot for an identity
  • plotting two identical sequences should produce a
    diagonal line
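
A dot matrix of this kind takes only a few lines to build. The following is a minimal sketch of the idea, not the DIAGON program itself; the function names `dot_matrix` and `render` are assumptions made for the example.

```python
def dot_matrix(seq_a, seq_b):
    """Cell [i][j] is True when residue i of seq_a equals residue j of seq_b."""
    return [[a == b for b in seq_b] for a in seq_a]


def render(matrix):
    """Draw the matrix as text, one '*' per identity and '.' elsewhere."""
    return "\n".join(
        "".join("*" if hit else "." for hit in row) for row in matrix
    )


# Two nearly identical sequences: the diagonal shifts where the extra X sits.
print(render(dot_matrix("ABCDXEFGHI", "ABCDEFGHI")))
```

Comparing a sequence with itself puts a `*` on every cell of the main diagonal, which is the diagonal line the slide describes.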

15
Diagon DotPlot
16
Removing noise from DotPlots
Use a window, where for each diagonal the number
of hits must exceed a certain threshold (here 2,
2)
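
One way to read the windowing rule is sketched below. It is an interpretation, not the slide's exact procedure: "2, 2" is taken to mean a window of 2 residues along the diagonal in which at least 2 identities are required, and the names `filtered_dots`, `window` and `min_hits` are illustrative.

```python
def filtered_dots(seq_a, seq_b, window=2, min_hits=2):
    """Return the (i, j) positions that survive the window filter.

    A dot is kept at (i, j) only if the diagonal stretch of length `window`
    starting there contains at least `min_hits` identities.
    """
    kept = set()
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            hits = sum(seq_a[i + k] == seq_b[j + k] for k in range(window))
            if hits >= min_hits:
                kept.add((i, j))
    return kept


# The isolated A/A matches disappear; only the AB/AB run survives.
print(filtered_dots("ABXA", "ABYA"))
```

Larger windows with a threshold below the window length tolerate mismatches inside a run, which is usually what is wanted for protein comparisons.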
17
Scoring identities
18
II Aligning two sequences
  • 2 sequences
  • ABCDXEFGHI, ABCDEFGHI
  • ungapped alignment gives 4 identities
  • ABCDXEFGHI
  • ABCDEFGHI
  • but introduce a gap (and penalise...) to get 9
    identities
  • ABCDXEFGHI
  • ABCD-EFGHI
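
The gapped alignment can be found automatically by dynamic programming. The sketch below uses a basic global alignment (Needleman-Wunsch style) with a linear gap penalty; the scores of +1, -1 and -1 are arbitrary values chosen for the example, not parameters taken from the lecture.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align two sequences; return the two gapped strings."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    def pair(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill the table: each cell is the best of diagonal, up and left moves.
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + pair(i, j),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )

    # Trace back from the bottom-right corner to recover one best alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def identities(x, y):
    """Count columns where both rows carry the same residue (gaps excluded)."""
    return sum(p == q and p != "-" for p, q in zip(x, y))


top, bottom = global_align("ABCDXEFGHI", "ABCDEFGHI")
print(top)
print(bottom)
print(identities(top, bottom))
```

BLAST, introduced on the following slides, looks for local rather than global alignments and uses heuristics to avoid filling a full table for every database entry.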

19
Sample alignment
  • mark identities
  • indicate conservative substitutions
  • gaps obvious
  • remainder mismatches

EAGIGAVLKVLTTGLPALISWIKRKRQQG
--GIGAVLKVLTTGLPALISWISRKKRQQ-
20
Pairwise alignment
  • Basic Local Alignment Search Tool
  • BLAST
  • finds high-scoring local alignments between
    target sequence and a database (both either NA or
    protein)
  • used for searching databases for sequences
    similar to a query sequence

http://www.ncbi.nlm.nih.gov/BLAST
21
BLAST comparisons
  • blastp
  • compares a protein query sequence against the
    protein sequence databases
  • blastn
  • compares a nucleotide query against the
    nucleotide sequence databases
  • blastx
  • compares the translation products of all six
    frames of a nucleotide query against the protein
    sequence databases

22
TBLAST comparisons
  • tblastn
  • compares a protein query against a nucleotide
    sequence database dynamically translated in all
    six frames
  • tblastx
  • compares the six frame translations of a
    nucleotide query against a nucleotide sequence
    database dynamically translated in all six frames
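
The "six frames" used by blastx, tblastn and tblastx are the three reading frames of the sequence plus the three of its reverse complement. A minimal sketch of such a translation follows; it assumes the standard genetic code and is not how BLAST itself is implemented.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def reverse_complement(dna):
    """Complement every base and reverse the strand."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons; unknown codons become 'X', stops '*'."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frames(dna):
    """Return the translations of frames +1..+3 and -1..-3."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


for name, protein in six_frames("ATGGCCTAA").items():
    print(name, protein)
```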

23
Multiple alignments
  • CLUSTALW
  • Higgins D., Thompson J., Gibson T., Nucleic Acids
    Res. 22:4673-4680 (1994).

CLUSTAL W (1.5) multiple sequence alignment

Bovine_Ins  MALWTRLRPLLALLALWPPPPARAFVNQHLCGSHLVEALYLVCGERGFFYTPKARREVEG
Human_Insu  MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAED

Bovine_Ins  PQVGALELAGGPGAG-----GLEGPPQKRGIVEQCCASVCSLYQLENYCN
Human_Insu  LQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN
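
Once sequences are aligned, a pairwise similarity score can be read straight off the columns. The sketch below computes percent identity for the two insulin rows shown above; ignoring columns that contain a gap is one common convention and is an assumption here, as is the function name `percent_identity`.

```python
# Aligned rows copied from the CLUSTAL W output above.
BOVINE = (
    "MALWTRLRPLLALLALWPPPPARAFVNQHLCGSHLVEALYLVCGERGFFYTPKARREVEG"
    "PQVGALELAGGPGAG-----GLEGPPQKRGIVEQCCASVCSLYQLENYCN"
)
HUMAN = (
    "MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAED"
    "LQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN"
)


def percent_identity(row_a, row_b):
    """Percentage of identical residues over the gap-free aligned columns."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    columns = [(a, b) for a, b in zip(row_a, row_b) if a != "-" and b != "-"]
    if not columns:
        return 0.0
    same = sum(a == b for a, b in columns)
    return 100.0 * same / len(columns)


print(round(percent_identity(BOVINE, HUMAN), 1))
```

Scores of this kind, computed for every pair in a multiple alignment, are the input to the tree-building step described under phylogenetic analysis.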

24
III Sequence content: proteins
  • endoprotease and modification sites
  • hydrophobicity
  • secondary structure prediction
  • e.g. Chou and Fasman, Garnier and Robson
  • http://bmbsgi11.leeds.ac.uk/bmb5dp/gor.html
  • transmembrane helix prediction server
  • http://www.cbs.dtu.dk/services/TMHMM-1.0/
  • secretory and transport signal peptides
  • http://psort.nibb.ac.jp/
  • phosphorylation, glycosylation etc.

25
Structural predictions
  • secondary structure (helices, sheets)
  • simple, reasonably well understood
  • association of secondary structures
  • hydrophobic/hydrophilic faces
  • transmembrane domains
  • tertiary structure
  • given a very similar molecule

26
Predicting hydrophobicity
Hydrophobic regions of the molecule tend to
associate with other hydrophobic regions.
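
A hydrophobicity plot is usually a sliding-window average of per-residue values. The slide does not say which scale was used; the sketch below assumes the Kyte-Doolittle hydropathy scale and a 9-residue window, both chosen only for illustration.

```python
# Kyte-Doolittle hydropathy values (assumed scale; the slide names none).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(protein, window=9):
    """Mean hydropathy of every window of `window` consecutive residues."""
    scores = [KYTE_DOOLITTLE[residue] for residue in protein.upper()]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]


# A made-up peptide: hydrophobic stretch followed by an acidic stretch.
profile = hydropathy_profile("LLLLLLLLLDDDDDDDDD")
print([round(value, 2) for value in profile])
```

Long stretches of high positive values are the kind of signal that transmembrane helix predictors look for, although dedicated servers such as TMHMM use far more elaborate models.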
27
Prediction of helices (GOR)
helices in blue, sheet in yellow taking the last
helix... (next slide)
28
Helical wheel plots
Can provide information about the association of
helices.
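
A helical wheel places successive residues 100 degrees apart around the helix axis (an ideal alpha helix has 3.6 residues per turn). The sketch below computes those angles and a crude measure of how strongly the hydrophobic residues cluster on one face. The measure is a simplified, unweighted stand-in for a hydrophobic moment, and the hydrophobic set is the "functional" group from the comparison schemes slide.

```python
import math

HYDROPHOBIC = set("AILMFPWV")  # hydrophobic group from the comparison schemes


def helical_wheel(seq, degrees_per_residue=100.0):
    """Pair each residue with its angle around the helix axis."""
    return [
        (residue, (i * degrees_per_residue) % 360.0)
        for i, residue in enumerate(seq)
    ]


def hydrophobic_clustering(seq, degrees_per_residue=100.0):
    """Length of the summed unit vectors of the hydrophobic residues.

    Values near zero mean the hydrophobic residues are spread evenly
    around the wheel; larger values mean they share one face.
    """
    x = y = 0.0
    for residue, angle in helical_wheel(seq.upper(), degrees_per_residue):
        if residue in HYDROPHOBIC:
            x += math.cos(math.radians(angle))
            y += math.sin(math.radians(angle))
    return math.hypot(x, y)


for residue, angle in helical_wheel("LKKLLKKL"):
    print(residue, angle)
print(round(hydrophobic_clustering("LKKLLKKL"), 2))
```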
29
Transmembrane helices
30
Sequence content: DNA & RNA
  • cognate sites within NA sequences
  • endonuclease and modification sites
  • coding regions
  • promoter sequences
  • terminator sequences
  • local deviations in structure due to variations
    in sequence

31
Restriction mapping
32
Searching for ORFs
  • in any given DNA sequence
  • 3 reading frames for each strand
  • identify start and stop codons
  • identify promoter/terminator sequences?
  • look for the longest?
  • some genomes have overlapping genes
  • exploit codon preferences

33
IV Subcellular location
  • closely correlated with protein function
  • predict location, predict function?
  • all proteins synthesised in cytoplasm
  • Blobel, Milstein and Sabatini
  • proposed the "signal hypothesis"
  • secretory proteins marked by a signal peptide

34
Sorting signals
  • generally, the signals for any such "sorting"
    reside in the primary sequence and take two
    forms
  • short and contiguous sequences of amino acid
    residues
  • specific 3D structures formed by two or more
    non-contiguous sequences of amino acid residues

35
Mammalian signal system
  • well characterised
  • proteins synthesised with N-terminal signal
    peptides 13-36 residues long
  • consist of
  • 7 to 13 residue hydrophobic core
  • several relatively hydrophilic residues
  • ≥ 1 basic residue close to the N-terminus

36
Rule-based prediction methods
  • "expert system" servers
  • EMail access
  • require knowledge of processing system
  • rely on existence of leader sequences
  • unreliability of automatically assigned 5'-regions

37
Statistical prediction methods (i)
based on amino acid (AA) composition
  • Kou-Chen Chou and David W. Elrod, Protein
    Engineering 12(2), 107-118, 1999.
  • different organelles have different
    physico-chemical environments
  • protein surface is highly likely to correlate
    with AA composition
  • structural class correlates with AA composition
  • ⇒ total AA composition may carry a signal that
    identifies the subcellular location
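
The starting point for this kind of method is a 20-component composition vector for each protein. The sketch below computes that vector and assigns a query to the closest reference profile by Euclidean distance. The nearest-profile step is a simple stand-in chosen for illustration, not the discriminant algorithm of the cited paper, and the reference sequences are invented.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(protein):
    """Fraction of each of the 20 standard amino acids, in fixed order."""
    counts = Counter(protein.upper())
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]


def nearest_location(protein, reference_profiles):
    """Name of the reference profile closest to the query's composition."""
    query = composition(protein)
    return min(
        reference_profiles,
        key=lambda name: math.dist(query, reference_profiles[name]),
    )


# Hypothetical reference profiles built from invented sequences.
profiles = {
    "membrane": composition("LLIVVAFLLIVAGL"),
    "nuclear": composition("KKRRKEEKRPKKRS"),
}
print(nearest_location("LLVVIAGLFL", profiles))
```

In practice each profile would be estimated from many proteins of known location, and the classifier would account for the variance within each class.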

38
Statistical prediction methods (ii)
  • three stage refinement of the technique
  • predict
  • intracellular / extracellular
  • extracellular / integral membrane / anchored
    membrane / intracellular / nuclear
  • chloroplast / cytoplasm / cytoskeleton /
    endoplasmic reticulum / extracellular / Golgi
    apparatus / lysosome / mitochondrion / nucleus /
    peroxisome / plasma membrane / vacuole

39
The orexin story
  • Sakurai, T. et al. Cell 92, 573-585, 1998.
  • Background
  • ESTs - Expressed Sequence Tags
  • active sequences
  • tissue specific
  • low quality sequence

40
Some ESTs were GPCRs
  • GPCRs are
  • largest class of targets of pharmaceutical
    intervention
  • superfamily - structural and functional
    similarities but no sequence similarity
  • Of the GPCRs identified
  • some had known ligands
  • unknown ligands - orphan GPCRs

41
Supporting research
  • molecular biology, tissue culture and analytical
    biochemistry
  • Cell lines expressing range of orphan GPCRs
  • Challenged with tissue extract fractions.
  • protein chemistry
  • identification of active component
  • Orexin-A - 33 residues, N-terminal pyroglutamate,
    C-terminally amidated. Bovine is identical to
    rat.
  • Orexin-B - 28 residues, C-terminally amidated

42
Isolation of precursor polypeptide
  • PCR based methods using highly degenerate primers
    yielded cDNA fragment encoding part of orexin-A
  • extension by RACE methods to obtain the full
    length cDNA
  • found to encode both orexin-A and orexin-B

43
The role of orexins?
  • 46% identity with each other
  • BLAST search revealed
  • no similarity with other peptides
  • not contained in any other existing GenBank entry

44
Characterisation of orexins (i)
  • 5'-most ATG preceded by an in-frame stop codon
  • ORF 130 residues long
  • residues 1-33 secretory signal sequence
  • (hydrophobic core followed by small polar
    residues)
  • SignalP server predicted cleavage site to be
    Ala32/Gln33
  • Last residue of mature peptide is followed by
    Gly66, serving as NH2 donor for C-terminal
    amidation
  • Lys67-Arg68 - recognition site for prohormone
    convertases

45
Characterisation of orexins (ii)
  • Arg69-Met96 Orexin-B
  • Gly-Gly-Arg, a C-terminal amidation signal
  • orexin-A
  • rat, mouse and human are identical
  • orexin-B
  • rat and mouse identical
  • human has two substitutions
  • mouse & rat prepro-orexins show 95%
    identity
  • human & rat show 83% identity

46
Characterisation of receptors (i)
  • OX1R most similar to
  • neuropeptide Y receptor
  • TRH receptor
  • cholecystokinin type-A receptor
  • NK2 neurokinin receptor
  • consistent with orexins being small regulatory
    peptides.
  • Radioimmunoassays indicated that orexin-A is a
    specific, high-affinity agonist for OX1R
  • Orexin-B had much lower affinity - another higher
    affinity receptor?

47
Characterisation of receptors (ii)
  • tBLASTn against dbEST revealed
  • two highly similar entries which were later
    shown to be both derived from the same transcript
    coding for a GPCR designated OX2R
  • AA identity between human OX1R and OX2R
  • 64%
  • OX2R binds both orexins with high affinity
  • Both receptors were mapped and were the subjects
    of extensive immunohistochemical and
    physiological studies.

48
GenBank accession numbers
  • prepro-orexin cDNA
  • human AF041240
  • rat AF041241
  • mouse AF041242
  • OX1R cDNA
  • human AF041243
  • rat AF041244
  • OX2R cDNA
  • human AF041245
  • rat AF041246

49
References - books (i)
  • Richard Durbin and Sean Eddy
  • "Biological sequence analysis - probabilistic
    models of proteins and nucleic acids", Cambridge
    University Press, 1998.
  • Teresa K. Attwood and David J. Parry-Smith
  • Introduction to Bioinformatics, Addison Wesley
    Longman Higher Education, 1999.

50
References - books (ii)
  • Andreas Baxevanis and B. F. Francis Ouellette
  • Bioinformatics: A Practical Guide to the
    Analysis of Genes and Proteins, John Wiley &
    Sons, 1998.
  • Gunnar von Heijne
  • Sequence analysis in molecular biology:
    treasure trove or trivial pursuit, Academic
    Press, 1987.

51
Conference proceedings
  • H.A. Lim and C.R. Cantor
  • Bioinformatics and Genome Research, World
    Scientific Publishing, 1995.
  • R. Hofestadt, T. Lengauer, M. Loffler and D.
    Schomburg
  • Bioinformatics, Springer-Verlag, 1997.

52
Week (ii) exercises
  • Use a browser to access the file below at the
    CISRG web site. The exercises are all web-based.

http://cisrg.shef.ac.uk/inf6140/week2.htm