Introduction to Bioinformatics and Biological databases

1
Introduction to Bioinformatics and Biological databases
  • Nicky Mulder nicola.mulder_at_uct.ac.za

2
What is Bioinformatics?
  • The application of computer technology to the
    management of biological information.
    Specifically, it is the science of developing
    computer databases and algorithms to facilitate
    and expedite biological research, particularly in
    genomics
    (www.informatics.jax.org/mgihome/other/glossary.shtml).
  • Doing biology on a computer (computational
    biology).
  • Some parts of biology involve strings of
    characters, e.g. DNA and protein sequences; these
    are best read by a computer.
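
To make the "strings of characters" point concrete, here is a minimal sketch of treating a DNA sequence as an ordinary Python string. The helper names (clean, base_counts, gc_content, reverse_complement) are illustrative and not taken from the slides; the fragment is the first 20 bases of the EMBL entry AB083336 shown later in the deck.

```python
from collections import Counter

# Lookup table for complementing DNA bases (both cases).
COMPLEMENT = str.maketrans("acgtACGT", "tgcaTGCA")

def clean(seq: str) -> str:
    """Drop whitespace (EMBL entries print bases in blocks of ten)."""
    return "".join(seq.split())

def base_counts(seq: str) -> Counter:
    """Count each base, ignoring case."""
    return Counter(clean(seq).lower())

def gc_content(seq: str) -> float:
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    counts = base_counts(seq)
    total = sum(counts.values())
    return (counts["g"] + counts["c"]) / total if total else 0.0

def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the string."""
    return clean(seq).translate(COMPLEMENT)[::-1]

fragment = "gcggccgcga gctcaattaa"  # first 20 bases of the example entry
print(base_counts(fragment))
print(gc_content(fragment))
print(reverse_complement("atgc"))  # gcat
```

The same string operations scale from a 20-base fragment to a whole genome, which is why sequence data suits computers so well.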

3
Why is Bioinformatics needed?
  • Small- and large-scale biological analyses
  • New laboratory technologies, e.g. sequencing
  • Move away from single-gene to whole-genome studies
  • Collection and storage of biological information
  • Manipulation of biological information
  • Computers are capable of both, and cheaply

4
Hypothesis-driven bioinformatics
(Flowchart; box labels:)
  • Retrieve articles
  • Search PubMed for more information
  • Looking at whole genomes
  • Sequence similarity search
  • Retrieve the DNA sequence
  • Analysing DNA sequence
  • Gene of interest
  • Sequence alignments, finding motifs
  • Retrieve the protein sequence
  • Sequence similarity search
  • Finding domains, classifying proteins
  • Phylogenetics

5
Hypothesis-generating bioinformatics
(Flowchart; box labels:)
  • High-throughput experiment (microarray,
    proteomics, NGS)
  • Data mining
  • Gene lists
  • Data processing
  • Gene set enrichment
  • Experimental design
  • Pathway analysis
  • Data integration
  • Statistics
  • Systems biology

6
Two major components to Bioinformatics
  • Storing and retrieving data
    • Biological databases
    • Querying these to retrieve data
  • Manipulating the data with tools, e.g.
    • Sequence similarity searches
    • Protein families and function prediction
    • Comparing sequences (phylogenetics)

7
What is a database?
  • An organized body of related information
    (www.cogsci.princeton.edu/cgi-bin/webwn)
  • A data collection that is:
    • Structured (computer readable)
    • Searchable
    • Updatable
    • Cross-linked
    • Publicly available

8
Biological Databases
  • Make data available to the public
  • So much data is available that it needs ordering
  • Turn data into computer-readable form
  • Ability to retrieve data from various sources
  • Can be primary (archival) or secondary (curated)
    databases
  • Most commonly used are sequence databases

9
Biological systems
(Diagram; data types surrounding biological systems:)
  • Sequence data
  • Protein folding and 3D structure
  • Taxonomic data
  • Literature
  • Pathways and networks
  • Protein families and domains
  • Small molecules
  • Whole genome data

11
Biological systems
(Diagram; data types surrounding biological systems:)
  • Sequence data
  • Protein folding and 3D structure
  • Taxonomic data
  • Literature
  • Pathways and networks
  • Protein families and domains
  • Small molecules
  • Whole genome data
  • Ontologies (GO)

12
Sequence databases
  • Used for retrieving a known gene/protein sequence
  • Useful for finding information on a gene/protein
  • Can find out how many genes are available for a
    given organism
  • Can compare your sequence to the others in the
    database
  • Can submit your sequence to store with the rest
  • Main databases: nucleotide and protein sequence
    DBs

13
Requirements for a good sequence database
  • It must be complete with minimal redundancy
  • It must contain as much up-to-date information
    (annotation) as possible on each sequence
  • All the information items must be retrievable by
    computer programs in a consistent manner
  • It must be highly interoperable with other
    databases

14
Nucleotide sequence databases
(Diagram labels: promoter, exons, CDS (coding sequence))
  • EMBL, DDBJ, GenBank
  • Data submitted by sequence owner
  • Must provide certain information and CDS if
    applicable
  • No additional annotation added
  • Entries never merged, so some redundancy

15
Example EMBL entry 1: general info
(Slide callouts: accession number, description of gene, references)

ID   AB083336 standard; genomic DNA; MAM; 6116 BP.
AC   AB083336;
XX
SV   AB083336.1
DT   06-JAN-2005 (Rel. 82, Created)
DT   06-JAN-2005 (Rel. 82, Last updated, Version 1)
DE   Sus scrofa p27Kip1 gene for p27Kip1, p27Kip1R, complete cds, alternative
DE   splicing.
OS   Sus scrofa (pig)
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia;
OC   Eutheria; Cetartiodactyla; Suina; Suidae; Sus.
RN   [1]
RP   1-6116
RA   Hirano K., Shintani Y., Hirano M., Kanaide H.;
RT   ;
RL   Submitted (08-APR-2002) to the EMBL/GenBank/DDBJ databases.
RL   Katsuya Hirano, Graduate School of Medical Sciences, Kyushu University,
RL   Division of Molecular Cardiology, Research Institute of Angiocardiology
RL   3-1-1 Maidashi, Higashi-ku, Fukuoka, Fukuoka, 812-8582, Japan
RL   (E-mail:khirano_at_molcar.med.kyushu-u.ac.jp, Tel:81-92-642-5550,
RL   Fax:81-92-642-5552)
RN   [2]
RA   Shintani Y., Hirano K., Hirano M., Nishimura J., Nakano H., Kanaide H.;
RT   "Cloning and Charaterization of full sequence of porcine p27Kip1 gene and
RT   expression of splice isoform p27Kip1R";
RL   Unpublished.

16
Example EMBL entry 2: features on the sequence (CDS)
(Slide callouts: feature type and location, feature name and
information, corresponding protein sequence)

FH   Key             Location/Qualifiers
FT   source          1..6116
FT                   /db_xref="taxon:9823"
FT                   /mol_type="genomic DNA"
FT                   /organism="Sus scrofa"
FT                   /cell_type="liver"
FT                   /clone_lib="lambda Fix II porcine genomic DNA"
FT   exon            784..1714
FT                   /evidence=NOT_EXPERIMENTAL
FT                   /note="The residue 2591 corresponds to the transcription
FT                   initiation site determined in human gene"
FT   CDS             join(1240..1714,2261..2271,5104..5160)
FT                   /codon_start=1
FT                   /gene="p27Kip1"
FT                   /product="p27Kip1R"
FT                   /protein_id="BAD83612.1"
FT                   /translation="MSNVRVSNGSPSLERMDARQAEYPKPSACRNLFGPVNHEELTRDL
FT                   EKHCRDMEEASQRKWNFDFQNHKPLEGKYEWQEVEKGSLPEFYYRPPRPPKGACKVPAQ
FT                   EGQGVSGTRQAVPLIGSQANSEDTHLVDQKTDAPDSQTGLAEQCTGIRKRPATDDSSPP
FT                   SVSLKIGMYQLNYSSVW"

17
Example EMBL entry 3: features on the sequence (introns and exons)
(Slide callout: DNA sequence; the first 420 of 6116 bp are shown)

FT   intron          1715..2260
FT                   /cons_splice=(5'site:NO,3'site:NO)
FT   exon            2261..2390
FT                   /number=2
FT   intron          2391..4494
FT                   /cons_splice=(5'site:NO,3'site:NO)
FT   exon            4495..5824
FT                   /note="ending at a putative poly A site following a polyA
FT                   signal"
FT                   /number=3
FT   polyA_signal    5802..5807
XX
SQ   Sequence 6116 BP; 1583 A; 1392 C; 1438 G; 1703 T; 0 other;
     gcggccgcga gctcaattaa ccctcactaa agggagtcga ctcgatctcg aagccctttt        60
     cttgttttta ttgagggaga gcttgggttc agaatacatt acaaatgcag catctattcc       120
     agtctactta tagaaagacg tcctcctggg cttcccccct aagccccctg cctcccctag       180
     aacagcacag acttctaggt taagggtgag ctaaccactg ctcaccccca gctaaggcac       240
     ccaggctcag gggctccccg cctcccccgc tgagcgagcg gtgggggccc ccccgggaga       300
     gagcccagct gggggccgag cgcccagcgg cgagcccagc tgcccgcccc tacccgctcg       360
     gcgagcgagg ggaaaataag atcgccctcg gcgaggagag ggaggtcggg gctccggagc       420

18
Feature lines in EMBL entries
  • Describe features on a sequence that are important
    for function, replication, recombination,
    structure, etc.
  • Feature key, e.g. CDS (protein-coding sequence),
    ribosome binding site
    • Functional group
  • Location
    • How to find the feature
  • Qualifier
    • Additional info
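
As a rough illustration of how a feature location tells a program where to find the feature, the sketch below reads the CDS location from the example entry and adds up the exon segments. The function names are hypothetical, and only plain a..b spans and join(...) are handled; complement(), partial-location markers and remote references are deliberately left out.

```python
import re

def parse_location(location: str) -> list[tuple[int, int]]:
    """Return (start, end) pairs, 1-based and inclusive.

    Only simple 'a..b' spans, optionally wrapped in join(...), are supported.
    """
    spans = re.findall(r"(\d+)\.\.(\d+)", location)
    return [(int(start), int(end)) for start, end in spans]

def spliced_length(location: str) -> int:
    """Total number of bases covered by all segments of the location."""
    return sum(end - start + 1 for start, end in parse_location(location))

cds = "join(1240..1714,2261..2271,5104..5160)"  # CDS of the example entry
print(parse_location(cds))           # [(1240, 1714), (2261, 2271), (5104, 5160)]
print(spliced_length(cds))           # 543 bases
print(spliced_length(cds) // 3 - 1)  # 180 codons once the stop codon is excluded
```

The 180 coding codons agree with the 180 residues in the /translation qualifier of the same entry, which is a quick sanity check on the location.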

19
Summary of information in EMBL entries
  • Describes sequence type, e.g. genomic DNA, RNA,
    EST
  • Provides the taxonomy of the organism from which
    the sequence came
  • Provides information on submitters and references
  • Describes features on a sequence that are
    important for function, replication,
    recombination, structure, etc.
  • Shows if the DNA encodes a protein (CDS) and
    provides protein sequence
  • Provides actual nucleotide sequence
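
Because every line of an entry starts with a two-letter code, the information summarised above can be pulled out with very little code. The following is a minimal sketch, not a full parser: parse_flatfile is a hypothetical helper, and the sample lines assume the usual flat-file layout (code, three spaces, content, with semicolons as separators).

```python
from collections import defaultdict

def parse_flatfile(text: str) -> dict[str, list[str]]:
    """Group the lines of one EMBL-style entry by their two-letter line code."""
    fields = defaultdict(list)
    for line in text.splitlines():
        code, content = line[:2], line[5:].strip()
        # Skip spacer lines (XX), the terminator (//) and indented sequence data.
        if code.strip() and code not in ("XX", "//"):
            fields[code].append(content)
    return dict(fields)

# A few lines of the example entry; punctuation is assumed as described above.
entry = """\
ID   AB083336 standard; genomic DNA; MAM; 6116 BP.
AC   AB083336;
XX
DE   Sus scrofa p27Kip1 gene for p27Kip1, p27Kip1R, complete cds, alternative
DE   splicing.
OS   Sus scrofa (pig)
"""

fields = parse_flatfile(entry)
accession = fields["AC"][0].rstrip(";")
description = " ".join(fields["DE"])
print(accession)
print(description)
```

For real work an established parser is the safer choice; the sketch only shows why a consistent line structure makes entries retrievable by programs.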

20
Protein sequences
All this info needs to be captured in a database

21
Protein Sequence Databases
  • UniProt
    • Swiss-Prot: manually curated, distinguishes
      between experimental and computationally derived
      annotation
    • TrEMBL: automatic translation of EMBL, no manual
      curation, some automatic annotation
  • GenPept: GenBank translations
  • RefSeq: non-redundant sequences for certain
    organisms

22
Example of a Swiss-Prot entry 1
(Screenshot callouts: general information, references)

23
Example of a Swiss-Prot entry 2
(Screenshot callouts: functional information, cross-references)

24
Example of a Swiss-Prot entry 3
(Screenshot callouts: keywords, features, sequence)

25
Swiss-Prot annotation is mainly found in
  • Comment (CC) lines
  • Feature table (FT)
    • Features on the sequence, e.g. domain, active
      site
  • Keyword (KW) lines
    • Set of a few hundred controlled vocabulary terms
  • Description (DE) lines
    • Protein name/function

26
Types of the CC lines
  • ALTERNATIVE PRODUCTS
  • CATALYTIC ACTIVITY
  • CAUTION
  • COFACTOR
  • DEVELOPMENTAL STAGE
  • DISEASE
  • DOMAIN
  • ENZYME REGULATION
  • FUNCTION
  • INDUCTION
  • MASS SPECTROMETRY
  • PATHWAY
  • PHARMACEUTICALS
  • POLYMORPHISM
  • PTM
  • SIMILARITY
  • SUBCELLULAR LOCATION
  • SUBUNIT
  • TISSUE SPECIFICITY

27
Feature types handle
  • Change indicators (conflict, variant, varsplic)
  • Amino-acid modifications (PTM, disulphide bond,
    glycosylation site, binding site etc.)
  • Regions (signal, chain, peptide, domain, repeat,
    transmembrane region etc.)
  • Secondary structure
  • Other features

28
Other features
  • ACT_SITE - Amino acid(s) involved in the activity
    of an enzyme
  • SITE - Any other interesting site on the sequence
  • INIT_MET - The sequence is known to start with an
    initiator methionine
  • NON_TER - The residue at an extremity of the
    sequence is not the terminal residue
  • NON_CONS - Non-consecutive residues
  • UNSURE - Uncertainties in the sequence

29
Other line types
  • References
  • Taxonomy
  • Gene Ontology
  • Database cross-references
  • Sequence!

30
TrEMBL
  • Swiss-Prot can't cope with the quantities of new
    sequence data
  • They don't want to dilute the quality of
    Swiss-Prot
  • Solution: TrEMBL (TRanslation of EMBL) contains
    all translations of CDS in the Nucleotide
    Sequence Database that are not in Swiss-Prot
  • TrEMBL is automatically generated and annotated
    using software tools

31
Other parts to UniProt
  • UniParc: archive of all sequences
  • UniProt = Swiss-Prot + TrEMBL
  • UniProt NREF100 (sequences with 100% identity
    merged)
  • UniProt NREF90 (sequences with 90% identity
    merged)
  • UniProt NREF50 (sequences with 50% identity
    merged)

32
Submitting sequences to EMBL or UniProt
  • WEB-IN: web-based submission tool for submitting
    DNA sequences to the EMBL database.
  • Protein sequences are submitted when the peptides
    have been directly sequenced. Submit through SPIN.

33
Sequence formats
  • Not MSWord, but text!
  • Most include an ID/name/annotation of some sort
  • FASTA, e.g.
    >xyz some other comment
    ttcctctttctcgactccatct
    tcgcggtagctgggaccgccgttcagtcgccaatatgcgctctttgtccg
    cgcccaggagctacacaccttcgaggtgaccggccaggaaacggtcgcca
    gatcaaggctcatgtagcctcactgg
  • Others specific to programs, e.g. GCG, ABI,
    Clustal, etc.
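
Since FASTA is plain text (a header line starting with ">" followed by sequence lines), reading it takes only a few lines of code. The sketch below is a minimal reader, assuming the record ID is the first word of the header; parse_fasta is an illustrative name, not a library function.

```python
def parse_fasta(text: str) -> dict[str, str]:
    """Map each record ID (first word after '>') to its joined sequence."""
    records: dict[str, str] = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            words = line[1:].split()
            current = words[0] if words else ""
            records[current] = ""
        elif current is not None:
            records[current] += line  # sequence may span several lines
    return records

example = """\
>xyz some other comment
ttcctctttctcgactccatct
tcgcggtagctgggaccgcc
"""
print(parse_fasta(example))
```

Everything after the first word of the header is treated as free-text annotation and dropped here; a fuller reader would keep it.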

34
Accession numbers
  • GenBank/EMBL/DDBJ: 1 letter + 5 digits, e.g.
    U12345, or 2 letters + 6 digits, e.g. AY123456
  • GenPept sequence records: 3 letters + 5 digits,
    e.g. AAA12345
  • UniProt: all 6 characters,
    [A,B,O,P,Q][0-9][A-Z,0-9][A-Z,0-9][A-Z,0-9][0-9],
    e.g. P12345 and Q9JJS7
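
The accession formats above translate directly into regular expressions. This is a sketch under the assumption that the patterns are exactly those listed on the slide; the formats in use today may be broader. Note that the patterns overlap, so the hypothetical classify_accession helper returns every format that matches rather than a single answer.

```python
import re

# Patterns transcribed from the slide; treat them as an approximation.
ACCESSION_PATTERNS = {
    "GenBank/EMBL/DDBJ": re.compile(r"[A-Z]\d{5}|[A-Z]{2}\d{6}"),
    "GenPept": re.compile(r"[A-Z]{3}\d{5}"),
    "UniProt": re.compile(r"[ABOPQ][0-9][A-Z0-9]{3}[0-9]"),
}

def classify_accession(accession: str) -> list[str]:
    """Return every format whose pattern matches the whole accession."""
    return [
        name
        for name, pattern in ACCESSION_PATTERNS.items()
        if pattern.fullmatch(accession)
    ]

for acc in ["U12345", "AY123456", "AAA12345", "Q9JJS7", "P12345"]:
    print(acc, classify_accession(acc))
```

P12345 fits both the one-letter nucleotide pattern and the UniProt pattern, so the format alone cannot always tell you which database an identifier belongs to.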

35
Cross-referencing identifiers
  • So many different IDs for same thing, e.g.
    Ensembl, EMBL, HGNC, UniGene, UniProt, Affy ID,
    etc.
  • Need mapping files to move between them to avoid
    having to parse every entry
  • PICR (http://www.ebi.ac.uk/Tools/picr/) enables
    mapping between IDs
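
A mapping file is essentially a table with one column per identifier type, so moving between IDs becomes a dictionary lookup. The sketch below assumes a tab-separated file with a header row; the column names and every identifier in it are made-up placeholders, not real database entries.

```python
import csv
import io

# Hypothetical mapping file; all identifiers are placeholders.
MAPPING_TSV = """\
uniprot\tembl\tgene_symbol
PLACEHOLDER_P1\tPLACEHOLDER_E1\tGENE_A
PLACEHOLDER_P2\tPLACEHOLDER_E2\tGENE_B
"""

def build_index(tsv_text: str, key: str, value: str) -> dict[str, str]:
    """Build a lookup from one ID column to another."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {row[key]: row[value] for row in reader}

embl_to_uniprot = build_index(MAPPING_TSV, key="embl", value="uniprot")
print(embl_to_uniprot["PLACEHOLDER_E1"])  # PLACEHOLDER_P1
```

Building the index once and reusing it avoids parsing every database entry just to translate an identifier.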

36
Literature database: PubMed/Medline
  • Source of medical-related scientific literature
  • PubMed has articles published after 1965
  • Can search by many different means, e.g. author,
    title, date, journal etc., or keywords for each
  • PubMed has list of tags to search specific
    fields, e.g. AU, TI, DP etc.
  • Can save queries and results
  • Can usually retrieve abstracts and full papers
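
Field tags are appended to a search term in square brackets, and clauses can be combined with AND. The helper below is a small sketch that only builds the query string; it does not contact PubMed, pubmed_query is a hypothetical name, and the search terms are illustrative.

```python
def pubmed_query(**fields: str) -> str:
    """Join 'term[TAG]' clauses with AND.

    Keyword names are field tags such as au (author), ti (title), dp (date).
    """
    return " AND ".join(f"{term}[{tag.upper()}]" for tag, term in fields.items())

query = pubmed_query(au="Mulder N", ti="bioinformatics", dp="2005")
print(query)  # Mulder N[AU] AND bioinformatics[TI] AND 2005[DP]
```

The resulting string can be pasted into the PubMed search box to restrict each term to its field.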

37
Taxonomy Databases
  • Most used is NCBI's taxonomy database:
    http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Taxonomy
  • Provides entries for all known organisms
  • Provides taxonomic lineage and translation table
    for organisms
  • Sequence entries for organism
  • UniProt-specific taxonomy database is Newt:
    http://www.ebi.ac.uk/newt

38
Example taxonomy entry

39
Where to find the databases
  • Table of addresses for major databases and tools
  • Nucleic Acids Research Database issue, January
    each year
  • Nucleic Acids Research Software issue (new)
  • Amos's list of tools:
    http://www.expasy.ch/alinks.html