Title: Biological Databases
1Biological Databases
- What types of data are available?
- What is a database?
- What are Genbank and Entrez?
- What does a typical entry look like?
- How does one use the database?
BIO520 Bioinformatics Jim Lund
2Biological Data
- Central Dogma-o-centric
- Genomic DNA sequence
- mRNA/cDNA sequence
- Protein sequence
- Protein 3D structure
- Literature (Function)
3Biological Data
- Genomic DNA sequence (complete)
- mRNA/cDNA sequence
- Gene expression data (NEW)
- Microarrays, SAGE
- Expression catalogs
- Protein sequence
- Protein interaction/complex data (NEW)
- Protein 3D structure
- Literature (Function)
- Organism databases (NEW)
- Annotation and classification projects (NEW)
4What is a Biological Database?
- An organized body of persistent data and associated computer software for updating, querying, and retrieving data records.
- Collection of records and files
- Organized for a particular purpose
- The database is separate from the interface and can have several interfaces.
- NCBI Protein can be searched by protein name or using BLAST (Basic Local Alignment Search Tool).
5Common database features
- Relational Databases
- Tables
- Relationships between tables
- Version Control
- Consistency enforcement
- Multiauthor/multiuser with security
6BIO520 Student Database
Table: 2005
Name  ID   Grade
Amy   123  A
Joe   456  B
Sue   789  C
.
Record: one row of the table (e.g. Amy 123 A)
Attribute: one column of the table (e.g. Grade)
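The student table can be sketched as a relational table using Python's built-in sqlite3 module. This is only an illustration: the table name students_2005, the column names, and the primary-key constraint are assumptions chosen for the example, not part of the course database.

```python
import sqlite3

# In-memory database; table and column names are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE students_2005 (
        name  TEXT NOT NULL,
        id    INTEGER PRIMARY KEY,
        grade TEXT NOT NULL
    )
    """
)
conn.executemany(
    "INSERT INTO students_2005 (name, id, grade) VALUES (?, ?, ?)",
    [("Amy", 123, "A"), ("Joe", 456, "B"), ("Sue", 789, "C")],
)

# A record is one row; here it is looked up by its ID.
record = conn.execute(
    "SELECT name, id, grade FROM students_2005 WHERE id = ?", (456,)
).fetchone()

# An attribute is one column, read across all records.
grades = [row[0] for row in conn.execute(
    "SELECT grade FROM students_2005 ORDER BY id"
)]

# Consistency enforcement: a second record with ID 123 is rejected.
try:
    conn.execute("INSERT INTO students_2005 VALUES ('Ann', 123, 'B')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
```

The query interface (SQL here) is separate from the stored data, which is the same split between database and interface described earlier.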
7Genbank Entry
LOCUS       BC005255     495 bp    mRNA    linear   PRI       23-JUN-2006
DEFINITION  Homo sapiens insulin, mRNA (cDNA clone IMAGE:3950204),
            complete cds.
ACCESSION   BC005255
VERSION     BC005255.1  GI:13528923
KEYWORDS    MGC.
SOURCE      Homo sapiens (human)
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
            Catarrhini; Hominidae; Homo.
FEATURES             Location/Qualifiers
     source          1..495
                     /organism="Homo sapiens"
     gene            1..495
                     /gene="INS"
                     /db_xref="GeneID:3630"
     CDS             60..392
                     /gene="INS"
                     /translation="MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCG
                     ERGFFYTPKTRREAEDLQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSL
                     YQLENYCN"
ORIGIN
        1 agccctccag gacaggctgc atcagaagag gccatcaagc agatcactgt ccttctgcca
      421 ccgcctcctg caccgagaga gatggaataa agcccttgaa ccaacaaaaa aaaaaaaaaa
      481 aaaaaaaaaa aaaaa
//
8The CORE: DDBJ, EMBL, and Genbank
9Genbank DNA Sequence Database
- Genbank/EMBL/DDBJ mirror each other and exchange sequence records.
- Primary vs Secondary Databases
- nr (non-redundant database)
- Primary vs secondary records
- Sequence vs inferred property (coding region)
- Format vs content
10Genbank Entry
LOCUS       PCU30791     1234 bp    mRNA            PLN       31-MAY-1996
DEFINITION  Pneumocystis carinii carinii form 6 guanine nucleotide binding
            protein alpha subunit (pcg1) mRNA, complete cds.
ACCESSION   U30791
NID         g1345098
VERSION     U30791.1  GI:1345098
ACCESSION: Unique ID
VERSION: Version Control
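The VERSION line combines the unique ID with version control: the accession stays fixed and the number after the dot identifies the sequence version. A minimal sketch of splitting that line, assuming the "ACCESSION.version GI:number" layout shown in this entry (the function name is hypothetical):

```python
def parse_version_line(line):
    """Split a GenBank VERSION line into accession, version and GI number.

    Assumes the layout 'VERSION  U30791.1  GI:1345098'; the GI token is
    treated as optional.
    """
    tokens = line.split()
    if len(tokens) < 2 or tokens[0] != "VERSION":
        raise ValueError("not a VERSION line: %r" % line)
    accession, _, version = tokens[1].partition(".")
    if not accession or not version.isdigit():
        raise ValueError("expected ACCESSION.version, got %r" % tokens[1])
    gi = None
    if len(tokens) > 2 and tokens[2].startswith("GI:"):
        gi = tokens[2][3:]
    return {"accession": accession, "version": int(version), "gi": gi}


info = parse_version_line("VERSION     U30791.1  GI:1345098")
```

Here info["accession"] is "U30791" and info["version"] is 1; a revised sequence would keep the accession and carry a higher version number.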
11Content-Taxonomy
SOURCE      Pneumocystis carinii f. sp. carinii.
  ORGANISM  Pneumocystis carinii f. sp. carinii
            Eukaryota; Fungi; Ascomycota; Archiascomycetes;
            Pneumocystidaceae; Pneumocystis.
12Reference
REFERENCE   1  (bases 1 to 1234)
  AUTHORS   Smulian,A.G., Ryan,M., Staben,C. and Cushion,M.
  TITLE     Signal transduction in Pneumocystis carinii: characterization of
            the genes (pcg1) encoding the alpha subunit of the G protein
            (PCG1) of Pneumocystis carinii carinii and Pneumocystis carinii
            ratti
  JOURNAL   Infect. Immun. 64 (3), 691-701 (1996)
  PUBMED    96186460
- Unique cross-reference
- Can be >1 reference
13Features
FEATURES             Location/Qualifiers
     source          1..1234
                     /organism="Pneumocystis carinii f. sp. carinii"
                     /strain="Form 6"
                     /note="450 kb chromosome"
                     /db_xref="taxon:38081"
     5'UTR           1..90
     gene            91..1155
                     /gene="pcg1"
14CDS
     CDS             91..1155
                     /gene="pcg1"
                     /note="G-protein alpha subunit"
                     /codon_start=1
                     /product="guanosine nucleotide binding protein alpha
                     subunit"
                     /protein_id="AAC49295.1"
                     /db_xref="PID:g1345099"
                     /db_xref="GI:1345099"
                     /translation="MGCCFSATYNQDTLRSKEIESYLRQEQEHACHEAKILLLGAGES
.
INFERRED
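Because the CDS is an inferred property, it is worth checking that its coordinates are at least self-consistent. The sketch below handles only simple "start..end" locations like the one in this entry (no join() or complement()); the function name and the returned keys are assumptions made for the example.

```python
def cds_summary(location, codon_start=1):
    """Summarize a simple 'start..end' CDS location (1-based, inclusive)."""
    start_text, end_text = location.split("..")
    start, end = int(start_text), int(end_text)
    # Bases that are actually translated, given the /codon_start offset.
    length = end - start + 1 - (codon_start - 1)
    codons, remainder = divmod(length, 3)
    return {"length": length, "codons": codons, "in_frame": remainder == 0}


pcg1 = cds_summary("91..1155", codon_start=1)
```

For 91..1155 this gives 1065 bases and 355 codons with no remainder, which would correspond to 354 residues if the last codon is a stop codon.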
15DNA
BASE COUNT      421 a    171 c    195 g    447 t
ORIGIN
        1 tgaattctaa attttatatt
     1201 tattttttta tgctccagat aaaa
//
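The BASE COUNT line can be cross-checked against the sequence length on the LOCUS line (1234 bp). A small sketch of both the check and the counting itself; the function name is hypothetical, and only the first 20 bases shown on the slide are counted here, since the full sequence is not reproduced.

```python
from collections import Counter


def base_count(seq):
    """Count a, c, g and t in a sequence, ignoring case."""
    counts = Counter(seq.lower())
    return {base: counts.get(base, 0) for base in "acgt"}


# Values reported on the BASE COUNT line of the entry.
reported = {"a": 421, "c": 171, "g": 195, "t": 447}
reported_total = sum(reported.values())

# Counting the first 20 bases of the ORIGIN block.
first_20 = base_count("tgaattctaaattttatatt")
```

The reported counts add up to 1234, matching the LOCUS length.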
16Genbank entries
- Combination of required (LOCUS, SOURCE) and optional fields.
- The entry is hierarchical; some fields contain subfields (REFERENCE -> AUTHORS).
- Some fields can appear multiple times (REFERENCE, /gene).
- Some fields are numerical, others are text. Some fields contain free text, others use a controlled vocabulary or a database ID.
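The hierarchy can be recovered from indentation alone. The sketch below is a simplified reader for the header part of an entry, assuming top-level keywords start in column 1, subfield keywords are indented by a few spaces, and deeper indentation means a continuation line. It is not a full GenBank parser, and the function name is hypothetical.

```python
def parse_header(text):
    """Group flat-file lines into top-level fields with optional subfields."""
    records = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("//"):
            continue
        indent = len(line) - len(line.lstrip(" "))
        key, _, value = line.strip().partition(" ")
        value = value.strip()
        if indent == 0:                      # top-level field, e.g. LOCUS
            records.append({"key": key, "value": value, "sub": []})
        elif indent <= 5:                    # subfield, e.g. AUTHORS
            records[-1]["sub"].append({"key": key, "value": value})
        else:                                # continuation of the last field
            last = records[-1]
            target = last["sub"][-1] if last["sub"] else last
            target["value"] = (target["value"] + " " + line.strip()).strip()
    return records


ENTRY = """\
LOCUS       PCU30791     1234 bp    mRNA            PLN       31-MAY-1996
DEFINITION  Pneumocystis carinii carinii form 6 guanine nucleotide binding
            protein alpha subunit (pcg1) mRNA, complete cds.
ACCESSION   U30791
REFERENCE   1  (bases 1 to 1234)
  AUTHORS   Smulian,A.G., Ryan,M., Staben,C. and Cushion,M.
  JOURNAL   Infect. Immun. 64 (3), 691-701 (1996)
"""

fields = parse_header(ENTRY)
top_level = [f["key"] for f in fields]
reference_subfields = [s["key"] for s in fields[-1]["sub"]]
```

Fields are kept in a list rather than a dictionary so that repeated fields such as REFERENCE are not overwritten.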
17Other Genbank Formats
- FASTA
- Simple, little annotation information
- Easy to use
- Common denominator format
- ASN.1
- Computer friendly, human unfriendly
- XML, INSDSeqXML, TinySeqXML
- Graph (graphical map of seq features)
- and more
18DNA Sequence Files Common formats
- Genbank (used by VectorNTI)
- FASTA
- GCG
- Accelrys GCG package
- formerly known as the GCG Wisconsin Package
- (GCG Genetics Computer Group)
- Many others!
19FASTA
One annotation line only!
>gi|1345098|gb|U30791.1|PCU30791
TGAATTCTAAATTTTATATTTCTAATTGCATTTTATATTTTTGATAATACTAGATTTATTCCTGGAAACT
TAAATTAGTTATTTTAAGTTATGGGATGTTGTTTTTCTGCTACATATAACCAAGATACACTTCGTTCCAA
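Because FASTA has a single annotation line followed by sequence lines, it is easy to read. A minimal reader might look like the sketch below; the function names are hypothetical, and the header parsing assumes the pipe-delimited "gi|number|gb|accession|locus" style of NCBI identifier.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.replace(" ", "").upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def parse_ncbi_id(header):
    """Read tag/value pairs such as gi and gb; a trailing locus name is dropped."""
    parts = header.split()[0].split("|")
    return dict(zip(parts[0::2], parts[1::2]))


FASTA = """\
>gi|1345098|gb|U30791.1|PCU30791
TGAATTCTAAATTTTATATTTCTAATTGCATTTTATATTTTTGATAATACTAGATTTATTCCTGGAAACT
TAAATTAGTTATTTTAAGTTATGGGATGTTGTTTTTCTGCTACATATAACCAAGATACACTTCGTTCCAA
"""

header, sequence = parse_fasta(FASTA)[0]
ids = parse_ncbi_id(header)
```

Only the 140 bases shown on the slide are included, so the sequence here is an excerpt of the 1234 bp entry.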
20Submitting sequences to Genbank
- Sequin
- Stand-alone sequence submission tool.
- BankIt
- Web based sequence submission.
21Genbank is an ARCHIVE
- The literature and secondary databases are the knowledge sources.
- There are many additional NCBI annotation databases
22NCBI annotation databases!
- Genbank -> RefSeq (Single sequence for each gene)
- Entrez Gene (Gene-based links to annotation sources).
- HomoloGene (Homologs)
- OMIM
- Conserved domains, 3D domains
- GEO (Gene expression datasets)
- DNA, protein, 3D structures
- Interaction data
- Links to other databases!
- NCBI Genomes
- NCBI Map viewer
23Accessing/Editing DNA files
- Find DNA: Entrez
- Downloading files
- Format Conversion
- Sequence viewing/editing
24Entrez
- (Relational database manager)
- Database searching/browsing
- Example Pneumocystis G-proteins
- PCR a cDNA to express in E. coli
- Read about it and related genes
- Check similarity to related G-proteins
- View the 3D structure??
- http://www.ncbi.nlm.nih.gov/Entrez/
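Entrez can also be queried from a program. The sketch below only builds request URLs for NCBI's E-utilities web service and does not contact the server. The E-utilities endpoint is an addition not mentioned on the slide, the helper names are hypothetical, and the search term is just one plausible way to phrase the Pneumocystis G-protein example.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities service (assumption: not from the slide).
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(term, db="nuccore"):
    """URL for a text search of one Entrez database."""
    return EUTILS + "/esearch.fcgi?" + urlencode({"db": db, "term": term})


def efetch_url(accession, db="nuccore", rettype="gb"):
    """URL for downloading one record as plain text (rettype 'gb' or 'fasta')."""
    query = {"db": db, "id": accession, "rettype": rettype, "retmode": "text"}
    return EUTILS + "/efetch.fcgi?" + urlencode(query)


search = esearch_url("Pneumocystis carinii[Organism] AND G protein alpha")
fetch = efetch_url("U30791", rettype="fasta")
```

Opening the fetch URL in a browser, or with urllib.request, would be the next step; that part is left out so the example runs without network access.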
25Entrez Neighbors-Literature
[Diagram: an Article is linked to DNA, Protein, Structure, Genome, and Popset records and to other Articles; link labels: keyword, authors, citation]
26Entrez Neighbors-DNA
[Diagram: a DNA record is linked to Literature (citation), to Protein (encoding), and to other DNA records (BLASTN)]
27Entrez Neighbors-Protein
[Diagram: a Protein record is linked to 3D Structure, to Literature (citation), to DNA (encoding), and to other Protein records (BLASTP)]
28Entrez Neighbors-Structure
[Diagram: a Structure record is linked to Protein, to Literature (citation), and to other Structure records (VAST)]
29Nucleic Acid Manipulations
30File Conversion
- Readseq
- Download program
- http://iubio.bio.indiana.edu/soft/molbio/readseq
- Use online
- http://www.ebi.ac.uk/cgi-bin/readseq.cgi
- http://searchlauncher.bcm.tmc.edu/seq-util/readseq.html
- VectorNTI
- Other utilities
- Readseq ---->
Beware: Information Loss
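The information loss is easy to see in a hand-written conversion. The sketch below is not Readseq; it is a simplified, hypothetical converter that keeps only the accession and the ORIGIN sequence from a GenBank-style entry and discards everything else. The sample entry is a shortened excerpt.

```python
def genbank_to_fasta(entry, width=70):
    """Convert a GenBank-style entry to FASTA, keeping accession and sequence."""
    accession, parts, in_origin = "", [], False
    for line in entry.splitlines():
        if line.startswith("ACCESSION"):
            accession = line.split()[1]
        elif line.startswith("ORIGIN"):
            in_origin = True
        elif line.startswith("//"):
            in_origin = False
        elif in_origin:
            # Drop the position numbers and spaces, keep the bases.
            parts.append("".join(ch for ch in line if ch.isalpha()))
    seq = "".join(parts).upper()
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return ">" + accession + "\n" + "\n".join(lines) + "\n"


ENTRY = """\
LOCUS       PCU30791     1234 bp    mRNA
ACCESSION   U30791
FEATURES             Location/Qualifiers
     CDS             91..1155
                     /gene="pcg1"
ORIGIN
        1 tgaattctaa attttatatt
//
"""

fasta = genbank_to_fasta(ENTRY)
```

The FEATURES table, including the CDS location and gene name, does not survive the conversion, and it cannot be recovered by converting back.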
31Reverse Complementing
5'-GAATCA-3'
5'-TGATTC-3' NOT 5'-ACTAAG-3'
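Reverse complementing means complementing each base and reversing the order, so the result still reads 5' to 3'. A minimal sketch for unambiguous DNA bases (the function name is hypothetical; IUPAC ambiguity codes are not handled):

```python
# Translation table mapping each base to its complement, in both cases.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


result = reverse_complement("GAATCA")   # the slide's example
reverse_only = "GAATCA"[::-1]           # the common mistake: reversed, not complemented
```

Applying the function twice returns the original sequence, which is a quick sanity check.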
32Sequence Statistics
- Nucleotide frequencies (di, tri)
- UV Absorbance
- MW
- Tm
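Two of these statistics are simple enough to sketch. The code below computes mono-, di-, or trinucleotide frequencies and a rough Tm using the Wallace rule, Tm = 2(A+T) + 4(G+C) degrees C. The Wallace rule is an assumption swapped in for illustration: it is a rule of thumb for short oligonucleotides, and analysis programs generally use more detailed models. UV absorbance and MW are not covered.

```python
from collections import Counter


def kmer_frequencies(seq, k):
    """Relative frequency of each overlapping k-mer (k=1 bases, k=2 dinucleotides)."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


def wallace_tm(seq):
    """Rough melting temperature in degrees C for a short oligo (Wallace rule)."""
    seq = seq.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2 * at + 4 * gc


mono = kmer_frequencies("GAATCA", 1)
di = kmer_frequencies("GAATCA", 2)
tm = wallace_tm("GAATCA")
```

The six-base sequence is only there to keep the numbers easy to check by hand; a real primer would be longer.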
33Restriction Map
- Linear vs Circular
- Enzyme sets
- Which enzymes, where they cut.
- Gel simulation
- Gel-to-map MUCH harder!!
- Useful for
- Cloning
- Southern blots
- Specialized mol bio techniques
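A restriction map and a simple gel simulation come down to finding recognition sites and measuring the distances between cuts. The sketch below uses EcoRI (GAATTC, cutting after the G) and shows why linear and circular molecules differ. The function names are hypothetical, and the second test sequence is made up to place a site across the circular junction.

```python
def cut_points(seq, site, offset, circular=False):
    """Positions (bases to the left of each cut) for one recognition site."""
    seq, site = seq.upper(), site.upper()
    n = len(seq)
    # On a circle a site may span the junction, so wrap the start around.
    text = seq + seq[:len(site) - 1] if circular else seq
    cuts = []
    start = text.find(site)
    while start != -1:
        cuts.append((start + offset) % n if circular else start + offset)
        start = text.find(site, start + 1)
    return sorted(cuts)


def fragments(length, cuts, circular=False):
    """Fragment sizes produced by the given cuts (the bands on a gel)."""
    if not cuts:
        return [length]
    if circular:
        following = cuts[1:] + [cuts[0] + length]
        return [b - a for a, b in zip(cuts, following)]
    bounds = [0] + cuts + [length]
    return [b - a for a, b in zip(bounds, bounds[1:])]


seq = "TGAATTCTAAATTTTATATT"          # first 20 bases of the example entry
ecori = cut_points(seq, "GAATTC", 1)
linear_bands = fragments(len(seq), ecori)
circular_bands = fragments(len(seq), ecori, circular=True)
```

One cut gives two fragments from a linear molecule but a single full-length fragment from a circular one. Going the other way, from gel bands to a map, is the harder problem noted above and is not attempted here.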
34Translation/ORFs
- Translation table
- Standard vs non-standard
- Frame (1,2,3,4,5,6)
- Segmental translation (exon-intron)
- Primary translation vs mature polypeptide
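A six-frame translation with the standard table can be sketched as follows. Frames 1 to 3 are read from the given strand and frames 4 to 6 from its reverse complement; that numbering, the function names, and the use of "X" for unrecognized codons are assumptions made for the example. A non-standard code can be supplied through the table argument. Segmental (exon-intron) translation is not handled.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(seq, table=STANDARD_TABLE):
    """Translate complete codons from the first base; '*' marks a stop."""
    seq = seq.upper()
    return "".join(table.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3))


def six_frames(seq, table=STANDARD_TABLE):
    """Translations for frames 1-3 (forward) and 4-6 (reverse complement)."""
    seq = seq.upper()
    rev = seq.translate(COMPLEMENT)[::-1]
    frames = {}
    for shift in range(3):
        frames[shift + 1] = translate(seq[shift:], table)
        frames[shift + 4] = translate(rev[shift:], table)
    return frames


# Bases 91-140 of the example entry, i.e. the start of the pcg1 CDS.
cds_start = "ATGGGATGTTGTTTTTCTGCTACATATAACCAAGATACACTTCGTTCCAA"
frames = six_frames(cds_start)
```

Frame 1 reproduces the start of the /translation qualifier in the entry; this is the primary translation, before any processing to a mature polypeptide.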
35Sequence File Editing
- VectorNTI
- Windows editor
- (e.g. Word, save as TEXT)
- Text editor
- Notepad, Simpletext
- Word processor
- vi
MWGTCC IIIIII MWGTCC IIIIII
Nonproportional fonts (courier, monospaced)
36Plasmids-challenges
- Parent/child database
- Dynamic updating
- Known/unknown segments
- Heuristic constructions (PCR, Restriction digests)
- No uniform nomenclature
- No uniform datastructure
- No public database
VECTOR NTI HELPS
37Primer design program Primer3
http://frodo.wi.mit.edu/cgi-bin/primer3/primer3_www.cgi