Introduction to Bioinformatics and Biological databases

About This Presentation

Description:

Biological databases. http://cbio.uct.ac.za/training/courses ... UniProt-specific taxonomy database is Newt: http://www.ebi.ac.uk/newt. Example taxonomy entry ... – PowerPoint PPT presentation

Slides: 40

Provided by: bea121

Transcript and Presenter's Notes

1
Introduction to Bioinformatics and Biological
databases

Nicky Mulder nicola.mulder_at_uct.ac.za

2
What is Bioinformatics?

The application of computer technology to the
management of biological information.
Specifically, it is the science of developing
computer databases and algorithms to facilitate
and expedite biological research, particularly in
genomics. www.informatics.jax.org/mgihome/other/glossary.shtml
Doing biology on a computer (computational
biology)
Some parts of biology involve strings of
characters, e.g. DNA and protein sequences; these
are best read by a computer.
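
As a small illustration of treating a sequence as a string of characters, the sketch below computes the reverse complement and GC content of a short DNA fragment. The function names and the example fragment are assumptions made for this illustration; they are not part of the original slides.

```python
# Translation table mapping each unambiguous DNA base to its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of a DNA string (case-insensitive input)."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def gc_content(dna: str) -> float:
    """Return the fraction of bases that are G or C (0.0 for an empty string)."""
    dna = dna.upper()
    if not dna:
        return 0.0
    return (dna.count("G") + dna.count("C")) / len(dna)


if __name__ == "__main__":
    fragment = "gcggccgcga"  # illustrative fragment
    print(reverse_complement(fragment))
    print(gc_content(fragment))
```

Ambiguity codes such as N are left unchanged by this sketch; a real tool would need to handle them explicitly.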

3
Why is Bioinformatics needed?

Small- and large-scale biological analyses
New laboratory technologies, e.g. sequencing
Move away from single gene to whole genome
Collection and storage of biological information
Manipulation of biological information
Computers are capable of both, and are cheap

4
Hypothesis-driven bioinformatics
Retrieve articles
Search PubMed for more information
Looking at whole genomes
Sequence similarity search
Retrieve the DNA sequence
Analysing DNA sequence
Gene of interest
Sequence alignments, finding motifs
Retrieve the protein sequence
Sequence similarity search
Finding domains, Classifying proteins
Phylogenetics
5
Hypothesis-generating bioinformatics
High-throughput experiment (microarray,
Proteomics, NGS)
Data mining
Gene lists
Data processing
Gene set enrichment
Experimental design
Pathway analysis
Data integration
Statistics
Systems biology
6
Two major components to Bioinformatics

Storing and retrieving data
Biological databases
Querying these to retrieve data
Manipulating the data: tools, e.g.
Sequence similarity searches
Protein families and function prediction
Comparing sequences: phylogenetics

7
What is a database?

An organized body of related information
www.cogsci.princeton.edu/cgi-bin/webwn
Data collection that is
Structured (computer readable)
Searchable
Updatable
Cross-linked
Publicly available

8
Biological Databases

Make data available to public
So much data available, needs ordering
Turn data into computer-readable form
Ability to retrieve data from various sources
Can have primary (archival) or secondary
databases (curated)
Most commonly used are sequence databases

9
Biological systems
Sequence data
Protein folding and 3D structure
Taxonomic data
Literature
Pathways and networks
Protein families and domains
Small molecules
Whole genome data
Biological systems
11
Biological systems
Sequence data
Protein folding and 3D structure
Taxonomic data
Literature
Pathways and networks
Protein families and domains
Small molecules
Whole genome data
Ontologies - GO
Biological systems
12
Sequence databases

Used for retrieving a known gene/protein sequence
Useful for finding information on a gene/protein
Can find out how many genes are available for a
given organism
Can compare your sequence to the others in the
database
Can submit your sequence to store with the rest
Main databases: nucleotide and protein sequence
DBs

13
Requirements for a good sequence database

It must be complete with minimal redundancy
It must contain as much up-to-date information
(annotation) as possible on each sequence
All the information items must be retrievable by
computer programs in a consistent manner
It must be highly interoperable with other
databases

14
Nucleotide sequence databases
Exons
Promoter
CDS (coding sequence)

EMBL, DDBJ, GenBank
Data submitted by sequence owner
Must provide certain information and CDS if
applicable
No additional annotation added
Entries are never merged, so there is some redundancy

15
Example EMBL entry 1: general info
Accession number
ID   AB083336   standard; genomic DNA; MAM; 6116 BP.
AC   AB083336;
XX
SV   AB083336.1
DT   06-JAN-2005 (Rel. 82, Created)
DT   06-JAN-2005 (Rel. 82, Last updated, Version 1)
DE   Sus scrofa p27Kip1 gene for p27Kip1, p27Kip1R, complete cds, alternative
DE   splicing.
OS   Sus scrofa (pig)
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia;
OC   Eutheria; Cetartiodactyla; Suina; Suidae; Sus.
RN   [1]
RP   1-6116
RA   Hirano K., Shintani Y., Hirano M., Kanaide H.;
RT   ;
RL   Submitted (08-APR-2002) to the EMBL/GenBank/DDBJ databases.
RL   Katsuya Hirano, Graduate School of Medical Sciences, Kyushu University,
RL   Division of Molecular Cardiology, Research Institute of Angiocardiology
RL   3-1-1 Maidashi, Higashi-ku, Fukuoka, Fukuoka, 812-8582, Japan
RL   (E-mail:khirano_at_molcar.med.kyushu-u.ac.jp, Tel:81-92-642-5550,
RL   Fax:81-92-642-5552)
RN   [2]
RA   Shintani Y., Hirano K., Hirano M., Nishimura J., Nakano H., Kanaide H.;
RT   "Cloning and Charaterization of full sequence of porcine p27Kip1 gene and
RT   expression of splice isoform p27Kip1R";
RL   Unpublished.
Description of gene
References
16
Example EMBL entry 2: features on the sequence
- CDS
FH   Key             Location/Qualifiers
FT   source          1..6116
FT                   /db_xref="taxon:9823"
FT                   /mol_type="genomic DNA"
FT                   /organism="Sus scrofa"
FT                   /cell_type="liver"
FT                   /clone_lib="lambda Fix II porcine genomic DNA"
FT   exon            784..1714
FT                   /evidence=NOT_EXPERIMENTAL
FT                   /note="The residue 2591 corresponds to the transcription
FT                   initiation site determined in human gene"
FT   CDS             join(1240..1714,2261..2271,5104..5160)
FT                   /codon_start=1
FT                   /gene="p27Kip1"
FT                   /product="p27Kip1R"
FT                   /protein_id="BAD83612.1"
FT                   /translation="MSNVRVSNGSPSLERMDARQAEYPKPSACRNLFGPVNHEELTRDL
FT                   EKHCRDMEEASQRKWNFDFQNHKPLEGKYEWQEVEKGSLPEFYYRPPRPPKGACKVPAQ
FT                   EGQGVSGTRQAVPLIGSQANSEDTHLVDQKTDAPDSQTGLAEQCTGIRKRPATDDSSPP
FT                   SVSLKIGMYQLNYSSVW"
Feature type and location
Corresponding protein sequence
Feature name and information
17
Example EMBL entry 3: features on the sequence
- introns and exons
FT   intron          1715..2260
FT                   /cons_splice=(5'site:NO,3'site:NO)
FT   exon            2261..2390
FT                   /number=2
FT   intron          2391..4494
FT                   /cons_splice=(5'site:NO,3'site:NO)
FT   exon            4495..5824
FT                   /note="ending at a putative poly A site following a polyA
FT                   signal"
FT                   /number=3
FT   polyA_signal    5802..5807
XX
SQ   Sequence 6116 BP; 1583 A; 1392 C; 1438 G; 1703 T; 0 other;
     gcggccgcga gctcaattaa ccctcactaa agggagtcga ctcgatctcg aagccctttt        60
     cttgttttta ttgagggaga gcttgggttc agaatacatt acaaatgcag catctattcc       120
     agtctactta tagaaagacg tcctcctggg cttcccccct aagccccctg cctcccctag       180
     aacagcacag acttctaggt taagggtgag ctaaccactg ctcaccccca gctaaggcac       240
     ccaggctcag gggctccccg cctcccccgc tgagcgagcg gtgggggccc ccccgggaga       300
     gagcccagct gggggccgag cgcccagcgg cgagcccagc tgcccgcccc tacccgctcg       360
     gcgagcgagg ggaaaataag atcgccctcg gcgaggagag ggaggtcggg gctccggagc       420
DNA sequence
18
Feature lines in EMBL entries

Describe features on a sequence that are important for function,
replication, recombination, structure, etc.
Feature key: functional group, e.g. CDS (protein-coding sequence),
ribosome binding site
Location: how to find the feature
Qualifier: additional info
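
The key / location / qualifier structure described above can be sketched in code. This is a minimal illustration and not an official parser: it assumes the usual flat-file layout in which a feature key starts in column 6 of an `FT` line and qualifiers are written as `/name=value`. The function names `parse_feature_table` and `location_segments` are hypothetical, and the sample lines are a shortened version of the example entry.

```python
import re


def parse_feature_table(lines):
    """Group EMBL 'FT' lines into dicts with key, location and qualifiers."""
    features = []
    last_qualifier = None
    for line in lines:
        if not line.startswith("FT"):
            continue  # skip FH header lines and anything that is not a feature line
        body = line[2:].strip()
        if not body:
            continue
        if len(line) > 5 and not line[5].isspace():
            # A feature key in column 6 starts a new feature.
            key, _, location = body.partition(" ")
            features.append({"key": key, "location": location.strip(), "qualifiers": {}})
            last_qualifier = None
        elif not features:
            continue
        elif body.startswith("/"):
            name, _, value = body[1:].partition("=")
            features[-1]["qualifiers"][name] = value.strip('"')
            last_qualifier = name
        elif last_qualifier is not None:
            # Continuation of a qualifier value that spans several lines.
            quals = features[-1]["qualifiers"]
            quals[last_qualifier] = (quals[last_qualifier] + " " + body).strip('"')
        else:
            # Continuation of a long location string.
            features[-1]["location"] += body
    return features


def location_segments(location):
    """Return (start, end) pairs found in a location such as join(1..5,9..12)."""
    return [(int(a), int(b)) for a, b in re.findall(r"(\d+)\.\.(\d+)", location)]


SAMPLE = [
    "FH   Key             Location/Qualifiers",
    "FT   source          1..6116",
    'FT                   /organism="Sus scrofa"',
    "FT   CDS             join(1240..1714,2261..2271,5104..5160)",
    "FT                   /codon_start=1",
    'FT                   /gene="p27Kip1"',
]

if __name__ == "__main__":
    for feature in parse_feature_table(SAMPLE):
        print(feature["key"], location_segments(feature["location"]), feature["qualifiers"])
```

The sketch joins continuation lines with a space, which suits free-text notes but would need adjusting for multi-line `/translation` values. It also ignores location operators such as `complement(...)`.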

19
Summary of information in EMBL entries

Describes sequence type, e.g. genomic DNA, RNA,
EST
Provides taxonomy from which sequence came
Provides information on submitters and references
Describes features on a sequence that are important for function,
replication, recombination, structure, etc.
Shows if the DNA encodes a protein (CDS) and
provides protein sequence
Provides actual nucleotide sequence
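
Each kind of information listed above sits on lines that begin with a two-letter code (ID, AC, DE, OS and so on), so a program can collect it by code. The sketch below is one possible way to do that, assuming the content starts in column 6 of each line. `group_by_line_code` is a hypothetical helper name, and the sample text is abbreviated from the example entry.

```python
from collections import defaultdict


def group_by_line_code(entry_text):
    """Collect the content of an EMBL entry by two-letter line code."""
    fields = defaultdict(list)
    for line in entry_text.splitlines():
        code = line[:2]
        content = line[5:].strip()
        # Skip blank lines, 'XX' spacer lines and indented sequence lines.
        if code.strip() and code != "XX":
            fields[code].append(content)
    # Join multi-line fields (for example two DE lines) into one string.
    return {code: " ".join(parts) for code, parts in fields.items()}


ENTRY = """\
ID   AB083336   standard; genomic DNA; MAM; 6116 BP.
AC   AB083336;
XX
DE   Sus scrofa p27Kip1 gene for p27Kip1, p27Kip1R, complete cds, alternative
DE   splicing.
OS   Sus scrofa (pig)
"""

if __name__ == "__main__":
    for code, content in group_by_line_code(ENTRY).items():
        print(code, "->", content)
```

Joining repeated codes into one string is convenient for DE lines but loses the boundaries between separate references (RN, RA, RT, RL), which a fuller parser would keep apart.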

20
Protein sequences
All this info needs to be captured in a database
21
Protein Sequence Databases

UniProt
Swiss-Prot - manually curated, distinguishes
between experimental and computationally derived
annotation
TrEMBL - Automatic translation of EMBL, no manual
curation, some automatic annotation
GenPept - GenBank translations
RefSeq - Non-redundant sequences for certain
organisms

22
Example of a Swiss-Prot entry 1
General information
References
23
Example of a Swiss-Prot entry 2
Functional information
Cross-references
24
Example of a Swiss-Prot entry 3
Keywords
Features
Sequence
25
Swiss-Prot annotation mainly found in

Comment (CC) lines
Feature table (FT)
features on the sequence, e.g. domain, active
site
Keyword (KW) lines
Set of a few hundred controlled vocabulary terms
Description (DE) lines
Protein name/function

26
Types of the CC lines

ALTERNATIVE PRODUCTS
CATALYTIC ACTIVITY
CAUTION
COFACTOR
DEVELOPMENTAL STAGE
DISEASE
DOMAIN
ENZYME REGULATION
FUNCTION
INDUCTION

MASS SPECTROMETRY
PATHWAY
PHARMACEUTICALS
POLYMORPHISM
PTM
SIMILARITY
SUBCELLULAR LOCATION
SUBUNIT
TISSUE SPECIFICITY

27
Feature types handle

Change indicators (conflict, variant, varsplic)
Amino-acid modifications (PTM, disulphide bond,
glycosylation site, binding site etc.)
Regions (signal, chain, peptide, domain, repeat,
transmembrane region etc.)
Secondary structure
Other features

28
Other features

ACT_SITE - Amino acid(s) involved in the activity
of an enzyme
SITE - Any other interesting site on the sequence
INIT_MET - The sequence is known to start with an
initiator methionine
NON_TER - The residue at an extremity of the
sequence is not the terminal residue
NON_CONS - Non consecutive residues
UNSURE - Uncertainties in the sequence

29
Other line types

References
Taxonomy
Gene Ontology
Database cross-references
Sequence!

30
TrEMBL

Swiss-Prot can't cope with the quantities of new
sequence data
They don't want to dilute the quality of
Swiss-Prot
Solution: TrEMBL (TRanslation of EMBL) contains
all translations of CDS in the Nucleotide
Sequence Database that are not in Swiss-Prot
TrEMBL is automatically generated and annotated
using software tools

31
Other parts to UniProt

UniParc: archive of all sequences
UniProt: Swiss-Prot + TrEMBL
UniProt NREF100 (100% identical seqs merged)
UniProt NREF90 (90% identical seqs merged)
UniProt NREF50 (50% identical seqs merged)

32
Submitting sequences to EMBL or UniProt
WEB-IN: web-based submission tool for submitting
DNA sequences to the EMBL database.
Protein sequences are submitted when the peptides
have been directly sequenced. Submit through SPIN.
33
Sequence formats

Not MSWord, but text!
Most include an ID/name/annotation of some sort
FASTA, e.g.
>xyz some other comment
ttcctctttctcgactccatct
tcgcggtagctgggaccgccgttcagtcgccaatatgcgctctttgtccg
cgcccaggagctacacaccttcgaggtgaccggccaggaaacggtcgcca
gatcaaggctcatgtagcctcactgg
Others specific to programs, e.g. GCG, abi,
clustal, etc.
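
Because FASTA is plain text with a header line starting with `>` followed by sequence lines, it is simple to read programmatically. The sketch below is a minimal reader written for illustration; `parse_fasta` is a hypothetical name, and the record is the example from this slide.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


EXAMPLE = """\
>xyz some other comment
ttcctctttctcgactccatct
tcgcggtagctgggaccgccgttcagtcgccaatatgcgctctttgtccg
cgcccaggagctacacaccttcgaggtgaccggccaggaaacggtcgcca
gatcaaggctcatgtagcctcactgg
"""

if __name__ == "__main__":
    for name, sequence in parse_fasta(EXAMPLE):
        print(name, len(sequence))
```

The header is returned whole; splitting it into an identifier and a free-text comment is left to the caller, since conventions differ between databases.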

34
Accession numbers

GenBank/EMBL/DDBJ: 1 letter + 5 digits, e.g.
U12345, or 2 letters + 6 digits, e.g. AY123456
GenPept sequence records: 3 letters + 5 digits,
e.g. AAA12345
UniProt: all 6 characters, [A,B,O,P,Q] [0-9]
[A-Z,0-9] [A-Z,0-9] [A-Z,0-9] [0-9], e.g. P12345
and Q9JJS7
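
The accession formats above translate directly into regular expressions. The sketch below encodes only the patterns listed on this slide; the databases have since introduced additional formats, which it ignores. The names `ACCESSION_PATTERNS` and `classify_accession` are assumptions made for this illustration.

```python
import re

# Patterns taken from the formats listed on the slide, nothing more.
ACCESSION_PATTERNS = {
    "GenBank/EMBL/DDBJ": re.compile(r"[A-Z]\d{5}|[A-Z]{2}\d{6}"),
    "GenPept": re.compile(r"[A-Z]{3}\d{5}"),
    "UniProt": re.compile(r"[ABOPQ]\d[A-Z0-9]{3}\d"),
}


def classify_accession(accession):
    """Return the names of all databases whose format the accession matches."""
    return [
        name
        for name, pattern in ACCESSION_PATTERNS.items()
        if pattern.fullmatch(accession)
    ]


if __name__ == "__main__":
    for acc in ("U12345", "AY123456", "AAA12345", "P12345", "Q9JJS7"):
        print(acc, classify_accession(acc))
```

The function returns a list because the formats overlap: an identifier such as P12345 fits both the one-letter nucleotide pattern and the UniProt pattern, so format alone cannot always tell which database an accession belongs to.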

35
Cross-referencing identifiers

So many different IDs for same thing, e.g.
Ensembl, EMBL, HGNC, UniGene, UniProt, Affy ID,
etc.
Need mapping files to move between them to avoid
having to parse every entry
PICR (http://www.ebi.ac.uk/Tools/picr/) enables
mapping between IDs
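
A mapping file is typically a simple table with one column per identifier type. The sketch below shows one way such a file could be loaded into a lookup, assuming a tab-delimited layout with a header row. The column names, the helper `load_id_mapping` and every identifier in the example are made up for illustration and do not come from PICR or any real database.

```python
import csv
import io
from collections import defaultdict


def load_id_mapping(handle, source_column, target_column):
    """Read a tab-delimited mapping file into {source_id: {target_ids}}."""
    mapping = defaultdict(set)
    for row in csv.DictReader(handle, delimiter="\t"):
        source, target = row[source_column], row[target_column]
        if source and target:  # ignore rows with no mapping for this pair
            mapping[source].add(target)
    return dict(mapping)


# Made-up identifiers for illustration only; not real database records.
EXAMPLE_FILE = (
    "uniprot\tembl\tensembl\n"
    "PROT_A\tNUC_1\tGENE_X\n"
    "PROT_A\tNUC_2\tGENE_X\n"
    "PROT_B\tNUC_3\t\n"
)

if __name__ == "__main__":
    uniprot_to_embl = load_id_mapping(io.StringIO(EXAMPLE_FILE), "uniprot", "embl")
    print(uniprot_to_embl)
```

Values are stored as sets because one identifier in one database often corresponds to several in another, for example one protein with several nucleotide entries.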

36
Literature database PubMed/Medline

Source of Medical-related scientific literature
PubMed has articles published after 1965
Can search by many different means, e.g. author,
title, date, journal etc., or keywords for each
PubMed has list of tags to search specific
fields, e.g. AU, TI, DP etc.
Can save queries and results
Can usually retrieve abstracts and full papers
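
The field tags mentioned above (AU for author, TI for title, DP for date of publication) are appended to search terms in square brackets. The sketch below only builds such a query string and does not contact PubMed. The function name `build_pubmed_query` and its parameters are assumptions made for this illustration.

```python
def build_pubmed_query(author=None, title_word=None, year=None):
    """Combine optional search terms into a PubMed query using field tags."""
    terms = []
    if author is not None:
        terms.append(f"{author}[AU]")      # author field
    if title_word is not None:
        terms.append(f"{title_word}[TI]")  # title field
    if year is not None:
        terms.append(f"{year}[DP]")        # date of publication field
    return " AND ".join(terms)


if __name__ == "__main__":
    print(build_pubmed_query(author="Hirano K", title_word="p27Kip1"))
```

The resulting string can be pasted into the PubMed search box. Terms are combined with AND here; OR and NOT would need a more flexible builder.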

37
Taxonomy Databases

Most used is NCBI's taxonomy database:
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Taxonomy
Provides entries for all known organisms
Provides taxonomic lineage and translation table
for organisms
Sequence entries for organism
UniProt-specific taxonomy database is Newt:
http://www.ebi.ac.uk/newt
