Bioinformatics A reintroduction - PowerPoint PPT Presentation



Slides: 42

Provided by: markp95


Transcript and Presenter's Notes

Title: Bioinformatics A reintroduction

1
Bioinformatics: A (re-)introduction

Professor Mark Pallen
University of Birmingham

2
Bioinformatics: Definitions

Fusion of Biology, Computer science, Mathematics
Broad Meaning
Any computationally intensive research with
biological relevance
Narrow meaning
Computer-based analysis and archiving of
macromolecular sequence data

3
(No Transcript)
4
The Scope of Bioinformatics
Algorithm Development

Domain Hunting
Detailed Analysis by Expert

Interface Design
Graphics
Web

Large-scale (semi-) Automated Analysis
5
Scale

Large-scale
(semi)-automated analysis of genome sequence
Interface with functional genomics
Medium-scale
Domain hunting
Small-scale
Analysis of individual sequence by power-user

6
(No Transcript)
7
Bioinformatics: Enabling Technologies

Maths
Algorithm Development
Computer science
Ever-increasing list of free software
Bioinformatics programs
Operating system: Linux
Scripting languages glue programs together
Perl
Growth of Internet
Distance is dead! Distributed resources
User-friendliness of Web
Just Cut and Paste!

8
Bioinformatics: Challenges

DNA and Protein Sequences
Exponential increase
Genome Sequencing
Need for annotation
Molecular stamp collecting?
Role in drug discovery
Big business

9
Bioinformatics: Challenges

Interplay between the wet and the dry
Bioinformatics Predictions
range from the very general to the very specific
range from highly speculative to the almost
certain
But in the end they are still only predictions
Need for experimental confirmation

10
Challenges: The Data Flood

170 bacterial genomes completed and published
340,000 genes
491 genomes ongoing
982,000 more genes when finished!!
>1,300,000 bacterial genes
>40 x the number in the human genome!

11
Challenges: Genomics

What is genomics?
Acquisition and exploitation of whole genome
sequences
Think big!
Study of 1000s of genes
High-throughput
Global approaches
Automation and Technology-driven

1995: shotgun sequencing of H. influenzae, 1.8 Mb; M. genitalium, 0.6 Mb
1996: S. cerevisiae, 13 Mb
1998: C. elegans, 100 Mb
2000: D. melanogaster, 120 Mb
2001: human (3 Gb); >100 complete genome sequences, mostly microbial
2002: mouse
2003: pufferfish, D. pseudoobscura
2004: C. briggsae, rat, chimp, chicken; many more coming

12
Challenges: Genomics

Uses of a Genome Sequence
Fuelling hypothesis driven research
Functional genomics
High-throughput global approaches
Genome, transcriptome, proteome, metabolome,
interaction maps, mass mutagenesis

In conventional biology, experiments are small
and designed to test a specific hypothesis
clearly and directly.
In genomics, experiments are massive and not
designed for a single hypothesis.
Every biology question about genomics data
corresponds to an information problem: how to
find the desired pattern in a dataset.

13
The Post-Genomic Iceberg
Discovered Biology
The Undiscovered Genotype
Most genes are of unknown function
Undiscovered genomic diversity
Undiscovered Biology
The Undiscovered Phenotype
Most bacterial physiology inapparent in the lab
Undiscovered regulators and regulons
14
Role of Sequence Analysis in the Pre-Genomic Era
Confirm

Sequence Analysis
Homology
Structural Features

Identify and Clone Gene
Obtain Sequence
15
Role of Sequence Analysis in the Post-Genomic Era
Obtain Sequence from Genome Project
Formulate Hypothesis

Sequence Analysis
Homology
Structural Features
Genomic Context

Novel Lab-based Experimental Programme
Amplify, Clone and Express Gene
Create mutant etc.
16
Bioinformatics Approaches

Multiple levels of analysis
Gene Finding
Protein function prediction
Power of homology
Pitfalls of homology
Comparative genomics
Metabolism reconstruction
Interface with functional genomics

17
What is a Sequence?

DNA Sequence, double stranded, antiparallel
Conventionally written 5' to 3'
5'-ATGAGTACCG CTAAATTAGT TAAATCAAAA-3'
3'-TACTCATGGC GATTTAATCA ATTTAGTTTT-5'
RNA sequence, single stranded, U instead of T
5'-AUGAGUACCG CUAAAUUAGU UAAAUCAAAA-3'
Protein sequence
conventionally written N-terminal to C-terminal
3-letter code: Met Ser Thr Ala Lys Leu
1-letter code: MSTAKLVKSKATN
Sequences usually written in a monospaced font
like Courier
Times Courier
AGCGGGCGG AGCGGGCGG
ATCGTTCTG ATCGTTCTG
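
The strand conventions on this slide can be sketched in a few lines of Python. This is a minimal illustration, not part of the original presentation: the function names are made up for the example, and it assumes an upper-case sequence containing only A, C, G and T.

```python
# Complement table for unambiguous, upper-case DNA bases (assumption:
# no IUPAC ambiguity codes such as N or R appear in the input).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the opposite strand, itself written 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def transcribe(dna: str) -> str:
    """Return the RNA version of a strand: same letters, U instead of T."""
    return dna.replace("T", "U")


top_strand = "ATGAGTACCGCTAAATTAGTTAAATCAAAA"  # top strand from the slide

# Reading the reverse complement backwards gives the 3'-to-5' bottom
# strand exactly as it is drawn under the top strand on the slide.
print("bottom strand, 3' to 5':", reverse_complement(top_strand)[::-1])
print("bottom strand, 5' to 3':", reverse_complement(top_strand))
print("RNA:", transcribe(top_strand))
```

Because the two strands are antiparallel, the bottom strand has to be reversed as well as complemented before it can be compared with other sequences written 5' to 3'.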

18
(No Transcript)
19
First get your sequence!

Most sequencing is now...
Performed on DNA (rather than RNA or protein)
Performed using the Sanger dideoxy method
Exceptions...
rRNA is sometimes sequenced directly
N-terminal and mass spectrometry sequencing of
proteins
Template for sequencing can be
DNA cloned in a plasmid (e.g. pUC19)
DNA cloned in a single-stranded phage (e.g. M13)
PCR products

20
First get your sequence!

Automated Sequencing
Fluorescent dyes used
Extract sequence from chromatogram
Must extract only the error-free region

21
(No Transcript)
22
Purely Sequence-Based Assembly
23
Analysis of nucleotide sequence data

Search for Sequence Features
Promoters
Ribosome-binding Sites
Repeats
Inverted Repeats (e.g. terminators)
Consensus Sequences for regulator binding sites

24
Searching for coding regions

Any given DNA sequence can be translated in 6
different reading frames, 3 on each strand
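
A six-frame translation can be sketched as follows. This is an illustrative example only: it assumes the standard genetic code, upper-case unambiguous DNA, and a made-up frame numbering (1 to 3 for the given strand, -1 to -3 for the reverse complement). Stop codons are written as `*`.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna: str) -> str:
    """Translate complete codons from the start of dna; '*' marks a stop."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE[codon] for codon in codons)


def six_frames(dna: str) -> dict:
    """Return translations keyed by frame: 1..3 forward, -1..-3 reverse."""
    reverse = dna.translate(COMPLEMENT)[::-1]
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(dna[offset:])
        frames[-(offset + 1)] = translate(reverse[offset:])
    return frames


seq = ("ATGAGTACCGCTAAATTAGTTAAATCAAAAGCGACCAATCTGCTTTAT"
       "ACCCGCAACGATGTCTCCGACAGCGAGAAA")
for frame, protein in sorted(six_frames(seq).items()):
    print(f"{frame:+d} {protein}")
```

Only the frame without internal stops is a plausible coding region here, which is the idea behind ORF maps.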

25
ORF maps
26
The Problem of Frameshift Errors
Actual sequence
        10        20        30        40        50        60        70
ATGAGTACCGCTAAATTAGTTAAATCAAAAGCGACCAATCTGCTTTATACCCGCAACGATGTCTCCGACAGCGAGAAA
Translations in the three forward frames (* marks a stop codon)
Frame 1: M S T A K L V K S K A T N L L Y T R N D V S D S E K
Frame 2: * V P L N * L N Q K R P I C F I P A T M S P T A R K
Frame 3: E Y R * I S * I K S D Q S A L Y P Q R C L R Q R E K

Frameshifted sequence after single base error
        10        20        30        40        50        60        70
ATGAGTACCGCTAAATTAGTTAAATCAAAAAGCGACCAATCTGCTTTATACCCGCAACGATGTCTCCGACAGCGAGAA
Frame 1: M S T A K L V K S K S D Q S A L Y P Q R C L R Q R E
Frame 2: * V P L N * L N Q K A T N L L Y T R N D V S D S E K
Frame 3: E Y R * I S * I K K R P I C F I P A T M S P T A R K
Markov Models (GLIMMER) now commonly used to
predict coding regions
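
The effect of a single-base error can be reproduced in a short script. This sketch is not how GLIMMER works: it only translates frame 1 before and after inserting one extra A after base 30 of the slide's sequence. The helper names and the standard genetic code table are assumptions of the example.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna: str) -> str:
    """Translate complete codons from the first base; '*' marks a stop."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE[codon] for codon in codons)


def first_difference(a: str, b: str):
    """Index of the first position where a and b differ, or None."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    return None


actual = ("ATGAGTACCGCTAAATTAGTTAAATCAAAAGCGACCAATCTGCTTTAT"
          "ACCCGCAACGATGTCTCCGACAGCGAGAAA")
# Simulated sequencing error: one extra A inserted after base 30.
shifted = actual[:30] + "A" + actual[30:]

before = translate(actual)
after = translate(shifted)
print(before)
print(after)
print("translations diverge at residue", first_difference(before, after) + 1)
```

Everything downstream of the error is read in the wrong frame, so the correct protein continues in a different frame of the erroneous sequence.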
27
Analysis of Protein Sequence Data
28
Analysis of Protein Sequence Data: Signal peptides
SignalP uses neural networks
29
Homology

Similarity that arises because of descent from a
common ancestor
"The formation of different languages and of
distinct species, and the proofs that both have
been developed through a gradual process, are
curiously parallel ... We find in distinct languages
striking homologies due to community of descent,
and analogies due to a similar process of
formation ... Languages, like organic beings, can be
classed in groups under groups; and they can be
classed either naturally according to descent, or
artificially by other characters ... The survival or
preservation of certain favoured words in the
struggle for existence is natural selection."
Charles Darwin, 1871, The Descent of Man, Chapter 3

30
Homology
the cat sat on the mat
die Katze sass auf der Matte

vgeGBant88-2      ITLITCVSVKDNSKRYVVAG
vgeGEfae9-1 78    LTLITCDQATKTTGRIIVIA
vgeGSpne1-403     MTLITCDPIPTFNKRLLVNF
sortase_staur     LTLITCDDYNEKTGVWEKRK
31
Homology
Sequence homology is not just sequence similarity!
Sequences 1, 1A, 1B and 2 are all homologous to
one another.
Another sequence 2 is similar to sequences 1, 1A,
1B and 2, but not homologous to them, as it does
not share a common ancestor with them.
Another sequence 1 is neither homologous nor
similar to sequences 1, 1A, 1B and 2.
32
Types of Homology
33
Sequence Databases

All sequences when published are deposited in
Sequence Databases
Nucleic Acid Sequence Databases
EMBL, Heidelberg, http://www.embl-heidelberg.de/
GenBank, in the NCBI, USA, http://www.ncbi.nlm.nih.gov/
Protein Sequence Databases
GenPept and TREMBL
Curated database SwissProt, Geneva,
http://www.expasy.ch/sprot/
Numerous others, reviewed every year in NAR
Problem of sequence formats
Simplest format is FASTA
>sequence name
AATGATGCGTGATGATGATGATGACTGACTGATGATGAT
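
Reading FASTA takes only a few lines of code, which is one reason the format is so widely used. The following is a minimal sketch: `parse_fasta` is a hypothetical helper written for this example, and it assumes each record starts with a `>` header line followed by one or more sequence lines.

```python
def parse_fasta(text: str) -> dict:
    """Parse FASTA text into a dict mapping header line to sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            name = line[1:].strip()  # header: everything after '>'
            records[name] = []
        elif name is not None:
            records[name].append(line)  # sequence may span several lines
    return {header: "".join(parts) for header, parts in records.items()}


example = """>sequence name
AATGATGCGTGATGATGATGATGACTGACTGATGATGAT
"""
for header, sequence in parse_fasta(example).items():
    print(header, len(sequence), sequence)
```

A production parser would also need to handle duplicate headers and very large files; this version keeps everything in memory for clarity.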

34
Homology Searches

The aim of homology searches is to identify
sequences within these databases that are
homologous to your sequence.
This involves comparing your sequence with all
the database sequences, looking for stretches of
sequence that appear to be similar, then scoring
the matches and ranking them. Usually a measure
of the significance of the match will be given.
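
The compare, score and rank steps described above can be sketched with a plain Smith-Waterman local alignment score. This is a toy illustration, not BLAST: the scoring values (match 2, mismatch -1, gap -2) are arbitrary assumptions, real searches use substitution matrices and heuristics for speed, and no significance measure is computed here. The small database reuses the peptide fragments from the earlier homology slide.

```python
def local_alignment_score(a: str, b: str, match=2, mismatch=-1, gap=-2) -> int:
    """Best Smith-Waterman local alignment score between a and b."""
    previous = [0] * (len(b) + 1)
    best = 0
    for ch_a in a:
        current = [0]
        for j, ch_b in enumerate(b, start=1):
            diagonal = previous[j - 1] + (match if ch_a == ch_b else mismatch)
            score = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            current.append(score)
            best = max(best, score)
        previous = current
    return best


def rank_hits(query: str, database: dict) -> list:
    """Score the query against every database entry, best hit first."""
    scored = [(name, local_alignment_score(query, seq))
              for name, seq in database.items()]
    return sorted(scored, key=lambda hit: hit[1], reverse=True)


database = {
    "vgeGBant": "ITLITCVSVKDNSKRYVVAG",
    "vgeGEfae": "LTLITCDQATKTTGRIIVIA",
    "vgeGSpne": "MTLITCDPIPTFNKRLLVNF",
}
for name, score in rank_hits("LTLITCDDYNEKTGVWEKRK", database):
    print(name, score)
```

The raw score alone says nothing about whether a match could have arisen by chance, which is why search programs also report a probability or expectation value.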

35
Homology Searches Translate first!
36
Homology Searches with BLAST

BLASTN
Nucleotide query vs nucleotide database
BLASTP
protein query vs protein database
BLASTX
automatic 6-frame translation of nucleotide query
vs protein database
TBLASTN
protein query vs automatic 6-frame translation of
nucleotide database
TBLASTX
automatic 6-frame translation of nucleotide query
vs automatic 6-frame translation of nucleotide
database
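
The choice between the five programs depends only on the type of the query, the type of the database, and whether nucleotide sequences should be compared at the protein level. A small lookup makes that explicit. The function name and its arguments are hypothetical, written for this example rather than taken from any BLAST package.

```python
PROGRAMS = {
    ("nucleotide", "nucleotide"): "BLASTN",
    ("protein", "protein"): "BLASTP",
    ("nucleotide", "protein"): "BLASTX",   # query translated in 6 frames
    ("protein", "nucleotide"): "TBLASTN",  # database translated in 6 frames
}


def choose_blast_program(query: str, database: str,
                         translate_both: bool = False) -> str:
    """Pick a BLAST program from the query type and database type.

    translate_both selects TBLASTX, which translates a nucleotide query
    and a nucleotide database in all six frames before comparing them.
    """
    if (query, database) not in PROGRAMS:
        raise ValueError("types must be 'nucleotide' or 'protein'")
    if translate_both:
        if (query, database) != ("nucleotide", "nucleotide"):
            raise ValueError("TBLASTX needs a nucleotide query and database")
        return "TBLASTX"
    return PROGRAMS[(query, database)]


print(choose_blast_program("nucleotide", "protein"))
print(choose_blast_program("nucleotide", "nucleotide", translate_both=True))
```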

37
Typical Blast Output
                                                                          Sum
                                                         Reading  High  Probability
Sequences producing High-scoring Segment Pairs:            Frame Score  P(N)      N
emb|X69337|ECDPS E.coli dps gene for binding protein         +2   834  6.4e-109  1
gb|U04242|ECU04242 Escherichia coli core starvation p...     +3   828  2.7e-106  1
emb|X14180|ECGLNHPQ Escherichia coli glutamine permeas...    +3   443  2.8e-53   1
gb|U18769|HDU18769 Haemophilus ducreyi fine tangled p...     +1   150  4.0e-18   2
dbj|D01016|ANALTI46 Anabaena variabilis lti46 gene. >e...    +2   129  4.8e-12   2
gb|M84990|P26BPO Plasmid pOP2621 ORF1 gene, 5' end...        -2   131  6.7e-09   1
gb|U16121|HPU16121 Helicobacter pylori neutrophil act...     +1   112  1.8e-06   1
gb|M32401|TRPTYF1 T.pallidum pallidum antigen TyF1 g...      +3   101  5.6e-06   2
emb|X71436|RPNTRB R.phaseoli ntrB gene                       +1    67  0.76      2
gb|L35598|DRODGC1A Drosophila melanogaster receptor g...     +1    48  0.97      3
38
Typical Blast Output
>gb|U18769|HDU18769 Haemophilus ducreyi fine tangled pili major pilin subunit gene
Length = 780
Plus Strand HSPs:
Score = 150 (68.0 bits), Expect = 4.0e-18, Sum P(2) = 4.0e-18
Identities = 36/89 (40%), Positives = 46/89 (51%), Frame = +1

Query:  30 ELLNRQVIQFIDLSLITKQAHWNMRGANFIAVHEMLDGFRTALIDHLDTMAERAVQLGGV 89
           E L         L LI K AHWN  G  FIAVHEMLD       D  D  AER   LG
Sbjct: 253 EALQMRLQGLNELALILKHAHWNVVGPQFIAVHEMLDSQVDEVRDFIDEIAERMATLGVA 432

Query:  90 ALGTTQVINSKTPLKSYPLDIHNVQDHLK 118
             G             YPL     QDHLK
Sbjct: 433 PNGLSGNLVETRQSPEYPLGRATAQDHLK 519
39
Sequence alignments
Dps_trepo   .........N MCTDGKKYHS TATSAAVGAS APGVPDARAI AAICEQLRRH
Dps_helico  .......... .......... .......... .......... MKTFEILKHL
Dps_anab    .......... .......... .......MPR INIGLTDEQR QGVINLLNQD
MrgA        .......... .......... .......... MKTENAKTNQ TLVENSLNTQ
Dps_haemo   MRSKTITFPV LKLTGQSQAL TNDMHKNADH TVPGLTVATG HLIAEALQMR
Dps_ecoli   .......... ....MSTAKL VKSKATNLLY TRNDVSDSEK KATVELLNRQ
Dps_strep   ........MT SQPHLHQHAA EIQEFGTVTQ LPIALSHDAR QYSCQRLNRV

Dps_trepo   VADLGVLYIK LHNYHWHIYG IEFKQVHELL EEYYVSVTEA FDTIAERLLQ
Dps_helico  QADAIVLFMK VHNFHWNVKG TDFFNVHKAT EEIYEEFADM FDDLAERLVQ
Dps_anab    LADSYLLLVK TKKYHWDVVG PQFRSLHQLW EEHYEKLTEN IDAIAERVRT
MrgA        LSNWFLLYSK LHRFHWYVKG PHFFTLHEKF EELYD..... .HAAETWIPS
Dps_haemo   LQGLNELALI LKHAHWNVVG PQFIAVHEML DSQVDEVRDF IDEIAERMAT
Dps_ecoli   VIQFIDLSLI TKQAHWNMRG ANFIAVHEML DGFRTALIDH LDTMAERAVQ
Dps_strep   LADTQFLYAL YKKCHWGMRG PTAYQLHLLF DKHAQEQLEL VDALAERVQT

ClustalW most commonly used program
Note problems of indels and ragged ends
Need for manual refinement
Multiple alignments useful for identifying active sites and distant homology
40
Into the twilight zone: The search for distant
homologies
Signal Peptide
A
Proteins consist of domains
B
Signal Peptide
Transitivity of Homology
Coiled coil domain
C
Distant Homology
D
41
(No Transcript)
