Title: Introduction to Bioinformatics and Biological databases
1Introduction to Bioinformatics and Biological databases
- Nicky Mulder nicola.mulder_at_uct.ac.za
2What is Bioinformatics?
- The application of computer technology to the management of biological information. Specifically, it is the science of developing computer databases and algorithms to facilitate and expedite biological research, particularly in genomics. (www.informatics.jax.org/mgihome/other/glossary.shtml)
- Doing biology on a computer (computational biology)
- Some parts of biology involve strings of characters, e.g. DNA and protein sequences; these are best read by a computer.
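To illustrate the point that sequences are just strings of characters, here is a minimal Python sketch. The function names are illustrative only, and the ten-base string is the start of the example EMBL entry shown later in these slides.

```python
# Minimal sketch: a DNA sequence handled as an ordinary character string.
# Function names are illustrative, not taken from any particular library.

COMPLEMENT = str.maketrans("acgtACGT", "tgcaTGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def gc_content(seq: str) -> float:
    """Return the fraction of G and C bases in a DNA string."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


dna = "gcggccgcga"  # first ten bases of the example EMBL entry
print(reverse_complement(dna))
print(gc_content(dna))
```

Real analyses would normally use an established toolkit, but the idea is the same: once the sequence is text, ordinary string operations apply.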
3Why is Bioinformatics needed?
- Small- and large-scale biological analyses
- New laboratory technologies, e.g. sequencing
- Move away from single gene to whole genome
- Collection and storage of biological information
- Manipulation of biological information
- Computers are capable of both, and are cheap
4Hypothesis-driven bioinformatics
Retrieve articles
Search PubMed for more information
Looking at whole genomes
Sequence similarity search
Retrieve the DNA sequence
Analysing DNA sequence
Gene of interest
Sequence alignments, finding motifs
Retrieve the protein sequence
Sequence similarity search
Finding domains, Classifying proteins
Phylogenetics
5Hypothesis-generating bioinformatics
High-throughput experiment (microarray, proteomics, NGS)
Data mining
Gene lists
Data processing
Gene set enrichment
Experimental design
Pathway analysis
Data integration
Statistics
Systems biology
6Two major components to Bioinformatics
- Storing and retrieving data
- Biological databases
- Querying these to retrieve data
- Manipulating the data: tools, e.g.
- Sequence similarity searches
- Protein families and function prediction
- Comparing sequences: phylogenetics
7What is a database
- An organized body of related information (www.cogsci.princeton.edu/cgi-bin/webwn)
- Data collection that is:
- Structured (computer readable)
- Searchable
- Updatable
- Cross-linked
- Publicly available
8Biological Databases
- Make data available to public
- So much data available, needs ordering
- Turn data into computer-readable form
- Ability to retrieve data from various sources
- Can be primary (archival) or secondary (curated) databases
- Most commonly used are sequence databases
9Biological systems
Sequence data
Protein folding and 3D structure
Taxonomic data
Literature
Pathways and networks
Protein families and domains
Small molecules
Whole genome data
Biological systems
11Biological systems
Sequence data
Protein folding and 3D structure
Taxonomic data
Literature
Pathways and networks
Protein families and domains
Small molecules
Whole genome data
Ontologies - GO
Biological systems
12Sequence databases
- Used for retrieving a known gene/protein sequence
- Useful for finding information on a gene/protein
- Can find out how many genes are available for a given organism
- Can compare your sequence to the others in the database
- Can submit your sequence to store with the rest
- Main databases: nucleotide and protein sequence DBs
13Requirements for good sequence database
- It must be complete with minimal redundancy
- It must contain as much up-to-date information (annotation) as possible on each sequence
- All the information items must be retrievable by computer programs in a consistent manner
- It must be highly interoperable with other databases
14Nucleotide sequence databases
Exons
Promoter
CDS (coding sequence)
- EMBL, DDBJ, GenBank
- Data submitted by sequence owner
- Must provide certain information and CDS if applicable
- No additional annotation added
- Entries never merged: some redundancy
15Example EMBL entry 1: general info
Accession number
ID   AB083336   standard; genomic DNA; MAM; 6116 BP.
AC   AB083336;
XX
SV   AB083336.1
DT   06-JAN-2005 (Rel. 82, Created)
DT   06-JAN-2005 (Rel. 82, Last updated, Version 1)
DE   Sus scrofa p27Kip1 gene for p27Kip1, p27Kip1R, complete cds, alternative
DE   splicing.
OS   Sus scrofa (pig)
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia;
OC   Eutheria; Cetartiodactyla; Suina; Suidae; Sus.
RN   [1]
RP   1-6116
RA   Hirano K., Shintani Y., Hirano M., Kanaide H.;
RT   ;
RL   Submitted (08-APR-2002) to the EMBL/GenBank/DDBJ databases.
RL   Katsuya Hirano, Graduate School of Medical Sciences, Kyushu University,
RL   Division of Molecular Cardiology, Research Institute of Angiocardiology
RL   3-1-1 Maidashi, Higashi-ku, Fukuoka, Fukuoka, 812-8582, Japan
RL   (E-mail:khirano_at_molcar.med.kyushu-u.ac.jp, Tel:81-92-642-5550,
RL   Fax:81-92-642-5552)
RN   [2]
RA   Shintani Y., Hirano K., Hirano M., Nishimura J., Nakano H., Kanaide H.;
RT   "Cloning and Charaterization of full sequence of porcine p27Kip1 gene and
RT   expression of splice isoform p27Kip1R";
RL   Unpublished.
Description of gene
References
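Because every line of an EMBL entry starts with a two-letter line code, the general information can be pulled out by a simple program. The following is a minimal Python sketch, not an official parser: the function name is made up for illustration, and the punctuation in the sample text (semicolons after fields) is assumed from the usual EMBL flat-file layout.

```python
from collections import defaultdict


def parse_embl_lines(text: str) -> dict[str, list[str]]:
    """Group the lines of an EMBL-style entry by their two-letter line code."""
    fields = defaultdict(list)
    for line in text.splitlines():
        code, content = line[:2], line[2:].strip()
        # Skip blank lines and the XX spacer lines.
        if code.strip() and code != "XX":
            fields[code].append(content)
    return dict(fields)


# A few lines of the example entry; punctuation is an assumption (see above).
sample = """\
ID   AB083336   standard; genomic DNA; MAM; 6116 BP.
AC   AB083336;
XX
DE   Sus scrofa p27Kip1 gene for p27Kip1, p27Kip1R, complete cds, alternative
DE   splicing.
OS   Sus scrofa (pig)
"""

entry = parse_embl_lines(sample)
accession = entry["AC"][0].rstrip(";")
description = " ".join(entry["DE"])  # DE can span several lines
print(accession)
print(description)
```

Grouping by line code keeps multi-line fields such as DE together, which is what "retrievable by computer programs in a consistent manner" means in practice.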
16Example EMBL entry 2: features on the sequence - CDS
FH   Key             Location/Qualifiers
FT   source          1..6116
FT                   /db_xref="taxon:9823"
FT                   /mol_type="genomic DNA"
FT                   /organism="Sus scrofa"
FT                   /cell_type="liver"
FT                   /clone_lib="lambda Fix II porcine genomic DNA"
FT   exon            784..1714
FT                   /evidence=NOT_EXPERIMENTAL
FT                   /note="The residue 2591 corresponds to the transcription
FT                   initiation site determined in human gene"
FT   CDS             join(1240..1714,2261..2271,5104..5160)
FT                   /codon_start=1
FT                   /gene="p27Kip1"
FT                   /product="p27Kip1R"
FT                   /protein_id="BAD83612.1"
FT                   /translation="MSNVRVSNGSPSLERMDARQAEYPKPSACRNLFGPVNHEELTRDL
FT                   EKHCRDMEEASQRKWNFDFQNHKPLEGKYEWQEVEKGSLPEFYYRPPRPPKGACKVPAQ
FT                   EGQGVSGTRQAVPLIGSQANSEDTHLVDQKTDAPDSQTGLAEQCTGIRKRPATDDSSPP
FT                   SVSLKIGMYQLNYSSVW"
Feature type and location
Corresponding protein sequence
Feature name and information
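The CDS location in this entry, join(1240..1714,2261..2271,5104..5160), says that the coding sequence is spliced together from three ranges. A minimal Python sketch of reading such a location follows. The function names are illustrative, and only plain start..end ranges are handled (no complement() or partial-end markers).

```python
import re


def parse_location(location: str) -> list[tuple[int, int]]:
    """Extract (start, end) pairs from a simple feature location string.

    Handles plain ranges such as '784..1714' and join(...) of plain ranges.
    """
    return [(int(start), int(end))
            for start, end in re.findall(r"(\d+)\.\.(\d+)", location)]


def spliced_length(segments: list[tuple[int, int]]) -> int:
    """Total bases covered; coordinates are 1-based and inclusive."""
    return sum(end - start + 1 for start, end in segments)


def splice(sequence: str, segments: list[tuple[int, int]]) -> str:
    """Concatenate the given 1-based inclusive ranges of a sequence."""
    return "".join(sequence[start - 1:end] for start, end in segments)


cds = parse_location("join(1240..1714,2261..2271,5104..5160)")
print(cds)
print(spliced_length(cds))
```

For this entry the three ranges add up to 543 bases, i.e. 181 codons, which appears consistent with the 180 residues in the /translation qualifier plus a stop codon.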
17Example EMBL entry 3: features on the sequence - introns and exons
FT   intron          1715..2260
FT                   /cons_splice=(5'site:NO,3'site:NO)
FT   exon            2261..2390
FT                   /number=2
FT   intron          2391..4494
FT                   /cons_splice=(5'site:NO,3'site:NO)
FT   exon            4495..5824
FT                   /note="ending at a putative poly A site following a polyA
FT                   signal"
FT                   /number=3
FT   polyA_signal    5802..5807
XX
SQ   Sequence 6116 BP; 1583 A; 1392 C; 1438 G; 1703 T; 0 other;
     gcggccgcga gctcaattaa ccctcactaa agggagtcga ctcgatctcg aagccctttt        60
     cttgttttta ttgagggaga gcttgggttc agaatacatt acaaatgcag catctattcc       120
     agtctactta tagaaagacg tcctcctggg cttcccccct aagccccctg cctcccctag       180
     aacagcacag acttctaggt taagggtgag ctaaccactg ctcaccccca gctaaggcac       240
     ccaggctcag gggctccccg cctcccccgc tgagcgagcg gtgggggccc ccccgggaga       300
     gagcccagct gggggccgag cgcccagcgg cgagcccagc tgcccgcccc tacccgctcg       360
     gcgagcgagg ggaaaataag atcgccctcg gcgaggagag ggaggtcggg gctccggagc       420
DNA sequence
18Feature lines in EMBL entries
- Describe features on a sequence that are important for function, replication, recombination, structure etc.
- Feature key: the functional group, e.g. CDS (protein-coding sequence), ribosome binding site
- Location: how to find the feature
- Qualifier: additional info
19Summary of information in EMBL entries
- Describes sequence type, e.g. genomic DNA, RNA, EST
- Provides the taxonomy of the organism from which the sequence came
- Provides information on submitters and references
- Describes features on a sequence that are important for function, replication, recombination, structure etc.
- Shows if the DNA encodes a protein (CDS) and provides the protein sequence
- Provides the actual nucleotide sequence
20Protein sequences
All this info needs to be captured in a database
21Protein Sequence Databases
- UniProt
- Swiss-Prot - Manually curated, distinguishes between experimental and computationally derived annotation
- TrEMBL - Automatic translation of EMBL, no manual curation, some automatic annotation
- GenPept - GenBank translations
- RefSeq - Non-redundant sequences for certain organisms
22Example of a Swiss-Prot entry 1
General information
References
23Example of a Swiss-Prot entry 2
Functional information
Cross-references
24Example of a Swiss-Prot entry 3
Keywords
Features
Sequence
25Swiss-Prot annotation mainly found in
- Comment (CC) lines
- Feature table (FT)
- Features on the sequence, e.g. domain, active site
- Keyword (KW) lines
- Set of a few hundred controlled vocabulary terms
- Description (DE) lines
- Protein name/function
26Types of the CC lines
- ALTERNATIVE PRODUCTS
- CATALYTIC ACTIVITY
- CAUTION
- COFACTOR
- DEVELOPMENTAL STAGE
- DISEASE
- DOMAIN
- ENZYME REGULATION
- FUNCTION
- INDUCTION
- MASS SPECTROMETRY
- PATHWAY
- PHARMACEUTICALS
- POLYMORPHISM
- PTM
- SIMILARITY
- SUBCELLULAR LOCATION
- SUBUNIT
- TISSUE SPECIFICITY
27Feature types handle
- Change indicators (conflict, variant, varsplic)
- Amino-acid modifications (PTM, disulphide bond, glycosylation site, binding site etc.)
- Regions (signal, chain, peptide, domain, repeat, transmembrane region etc.)
- Secondary structure
- Other features
28Other features
- ACT_SITE - Amino acid(s) involved in the activity of an enzyme
- SITE - Any other interesting site on the sequence
- INIT_MET - The sequence is known to start with an initiator methionine
- NON_TER - The residue at an extremity of the sequence is not the terminal residue
- NON_CONS - Non-consecutive residues
- UNSURE - Uncertainties in the sequence
29Other line types
- References
- Taxonomy
- Gene Ontology
- Database cross-references
- Sequence!
30TrEMBL
- Swiss-Prot can't cope with the quantities of new sequence data
- They don't want to dilute the quality of Swiss-Prot
- Solution: TrEMBL (TRanslation of EMBL) contains all translations of CDS in the Nucleotide Sequence Database that are not in Swiss-Prot
- TrEMBL is automatically generated and annotated using software tools
31Other parts to UniProt
- UniParc - archive of all sequences
- UniProt = Swiss-Prot + TrEMBL
- UniProt NREF100 (100% identical seqs merged)
- UniProt NREF90 (90% identical seqs merged)
- UniProt NREF50 (50% identical seqs merged)
32Submitting sequences to EMBL or UniProt
WEB-IN - web-based tool for submitting DNA sequences to the EMBL database.
Protein sequences are submitted when the peptides have been directly sequenced. Submit through SPIN.
33Sequence formats
- Not MSWord, but text!
- Most include an ID/name/annotation of some sort
- FASTA, e.g.
>xyz some other comment
ttcctctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgccaatatgcgctctttgtccg
cgcccaggagctacacaccttcgaggtgaccggccaggaaacggtcgcca
gatcaaggctcatgtagcctcactgg
- Others specific to programs, e.g. GCG, abi, clustal, etc.
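FASTA is simple enough to read with a few lines of code: a header line starting with ">" (the identifier, then an optional comment) followed by one or more lines of sequence. Below is a minimal Python sketch. The function name is illustrative, and the sample is a shortened version of the slide's example.

```python
def parse_fasta(text: str) -> dict[str, str]:
    """Parse FASTA-formatted text into {identifier: sequence}."""
    records: dict[str, list[str]] = {}
    current_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # The identifier is the first word after '>'; the rest is comment.
            header = line[1:].split(maxsplit=1)
            current_id = header[0] if header else ""
            records[current_id] = []
        elif current_id is not None:
            records[current_id].append(line)
    # Sequence lines are joined, since line breaks carry no meaning.
    return {name: "".join(parts) for name, parts in records.items()}


sample = """>xyz some other comment
ttcctctttc
tcgactccat
"""

sequences = parse_fasta(sample)
print(sequences)
```

This sketch keeps only the identifier and drops the comment; a fuller parser would keep both and guard against duplicate identifiers.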
34Accession numbers
- GenBank/EMBL/DDBJ - 1 letter + 5 digits, e.g. U12345, or 2 letters + 6 digits, e.g. AY123456
- GenPept Sequence Records - 3 letters + 5 digits, e.g. AAA12345
- UniProt - All 6 characters: [A,B,O,P,Q] [0-9] [A-Z,0-9] [A-Z,0-9] [A-Z,0-9] [0-9], e.g. P12345 and Q9JJS7
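The accession formats above can be written as regular expressions. The following Python sketch encodes only the patterns listed on this slide; the databases have since extended their accession formats, so treat it as an illustration rather than a complete validator. The names PATTERNS and classify_accession are made up for this example.

```python
import re

# Patterns as described on the slide (assumption: upper-case letters only).
PATTERNS = {
    "GenBank/EMBL/DDBJ": re.compile(r"[A-Z]\d{5}|[A-Z]{2}\d{6}"),
    "GenPept": re.compile(r"[A-Z]{3}\d{5}"),
    "UniProt": re.compile(r"[ABOPQ][0-9][A-Z0-9]{3}[0-9]"),
}


def classify_accession(accession: str) -> list[str]:
    """Return every database whose accession pattern matches the whole string."""
    return [name for name, pattern in PATTERNS.items()
            if pattern.fullmatch(accession)]


for acc in ["U12345", "AY123456", "AAA12345", "P12345", "Q9JJS7"]:
    print(acc, classify_accession(acc))
```

Note that P12345 fits both the one-letter nucleotide pattern and the UniProt pattern, so the format alone does not always tell you which database an identifier belongs to.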
35Cross-referencing identifiers
- So many different IDs for same thing, e.g. Ensembl, EMBL, HGNC, UniGene, UniProt, Affy ID, etc.
- Need mapping files to move between them to avoid having to parse every entry
- PICR (http://www.ebi.ac.uk/Tools/picr/) enables mapping between IDs
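A mapping file can be as simple as a tab-delimited table with one column per identifier type. The Python sketch below assumes such a layout; the column names and the EXAMPLE rows are hypothetical placeholders, and only the AB083336 / BAD83612 pair comes from the EMBL entry shown earlier. It does not reflect the actual PICR output format.

```python
import csv
import io


def load_mapping(handle, source: str, target: str) -> dict[str, set[str]]:
    """Build a source-ID -> set of target-IDs lookup from a tab-delimited file."""
    mapping: dict[str, set[str]] = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        src, tgt = row[source], row[target]
        if src and tgt:
            # One source ID may map to several target IDs.
            mapping.setdefault(src, set()).add(tgt)
    return mapping


# Hypothetical mapping table; only the first data row is taken from the slides.
sample = (
    "embl\tprotein_id\n"
    "AB083336\tBAD83612\n"
    "EXAMPLE0001\tEXAMPLEP01\n"
    "EXAMPLE0001\tEXAMPLEP02\n"
)

embl_to_protein = load_mapping(io.StringIO(sample), "embl", "protein_id")
print(embl_to_protein["AB083336"])
```

Storing sets of targets matters because identifier mappings are often one-to-many.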
36Literature database PubMed/Medline
- Source of Medical-related scientific literature
- PubMed has articles published after 1965
- Can search by many different means, e.g. author, title, date, journal etc., or keywords for each
- PubMed has list of tags to search specific fields, e.g. [AU], [TI], [DP] etc.
- Can save queries and results
- Can usually retrieve abstracts and full papers
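Field tags are appended to a search term in square brackets, and terms can be combined with AND. A minimal Python sketch of building such a query string follows. The function name is illustrative, only the three tags named on the slide are accepted, and the search terms are arbitrary examples; nothing is sent to PubMed.

```python
# Tags named on the slide: author, title, date of publication.
KNOWN_TAGS = {"AU", "TI", "DP"}


def build_pubmed_query(terms: dict[str, str]) -> str:
    """Combine {tag: term} pairs into a single AND query, e.g. 'Smith J[AU]'."""
    parts = []
    for tag, term in terms.items():
        if tag not in KNOWN_TAGS:
            raise ValueError(f"unsupported field tag: {tag}")
        parts.append(f"{term}[{tag}]")
    return " AND ".join(parts)


query = build_pubmed_query({"AU": "Hirano K", "TI": "p27Kip1"})
print(query)
```

The resulting string can be pasted into the PubMed search box; restricting each term to a field usually gives far fewer, more relevant hits than a free-text search.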
37Taxonomy Databases
- Most used is NCBI's taxonomy database: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Taxonomy
- Provides entries for all known organisms
- Provides taxonomic lineage and translation table for organisms
- Sequence entries for organism
- UniProt-specific taxonomy database is Newt
- http://www.ebi.ac.uk/newt
38Example taxonomy entry
39Where to find the databases
- Table of addresses for major databases and tools
- Nucleic Acids Research Database issue, January each year
- Nucleic Acids Research Software issue (new)
- Amos' list of tools: http://www.expasy.ch/alinks.html