Title: Bioinformatic Databases
1 Bioinformatic Databases
2 Take home
- The internet is a powerful resource containing a large volume of data and tools to manipulate them. Unfortunately, connecting data between them can sometimes be tricky.
3 Overview
- Whirlwind tour of Web databases
- The Rat Genome Database: data, tools, and operations
4 Bioinformatic databases on the WWW
- Loose definition of database here
- Vary widely in terms of offerings, data, tools and specialization
- Vary widely in terms of data collection methodologies
5 Some classifications per NAR (Nucleic Acids Research)
- Major sequence repositories
- Gene Expression
- Comparative genomics
- Gene Identification and Structure
- Genetic and physical maps
- Genomic Databases
- Intermolecular interactions
- Metabolic Pathways and Cellular Regulation
- Mutation Databases
- Pathology
6 Some classifications per NAR (continued)
- Protein Databases
- Protein sequence Motifs
- Proteome Resources
- Retrieval systems
- RNA Sequences
- Structure
- Transgenics
- Varied Biomedical Content
7 Major Sequence Repositories
- GenBank
- RefSeq
- DDBJ
- Ensembl
- UniGene
- Collection of sequence data
  - Genomic
  - Markers
  - Genes
  - Proteins
- Some provide tools to expedite access
  - BLAST search
  - Alignment tools
  - Translation tools, etc.
- Varying degrees of quality control
  - Machine data upload
  - Human curation and QC
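As an illustration of what a "translation tool" does, here is a minimal sketch that translates a DNA coding sequence into protein using the standard genetic code. The function name `translate` and the choice to mark unknown codons with `X` are assumptions made for this example and do not describe any particular repository's tool.

```python
# Standard genetic code, written in the conventional T, C, A, G ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(dna: str) -> str:
    """Translate a DNA sequence in reading frame 1.

    '*' marks a stop codon and 'X' marks a codon that is not in the table
    (for example one containing N). Trailing bases that do not fill a
    complete codon are ignored.
    """
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    codons = (dna[i:i + 3] for i in range(0, usable, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


print(translate("ATGGCCTAA"))
```

Real tools typically also offer all six reading frames and alternative genetic codes, which this sketch leaves out.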
8 Major Sequence Repositories: GenBank
- All known nucleotide and protein sequences
- Provides submission system for various authors
- Little QC
9 Major Sequence Repositories: RefSeq
- Non-redundant collection of naturally occurring biological molecules
- Human QC
- Comprehensive, integrated set of sequences for major research organisms
- Provides a stable reference for further characterization of sequences, including comparative analyses, mutations, expression, etc.
10 Major Sequence Repositories: UniGene
- Attempts to cluster GenBank sequences into gene-oriented clusters
- Each cluster contains sequences that represent one gene
- Provides a stable reference for further characterization of sequences, including comparative analyses, mutations, expression, etc.
11 Major Sequence Repositories: DDBJ (DNA Data Bank of Japan)
- Japanese equivalent to NCBI efforts
- Attempting to gather all known nucleotide and protein sequences
- Part of the International Nucleotide Sequence Database Collaboration
12 Major Sequence Repositories: EMBL Nucleotide Sequence Database
- European equivalent to NCBI efforts
- Attempting to gather all known nucleotide and protein sequences
- Part of the International Nucleotide Sequence Database Collaboration
13 Major Sequence Repositories: UCSC Genome Browser
- Visual representation of genome and sequence data
- Run by the University of California, Santa Cruz
14 Comparative Genomics
- Examines the similarities and differences in genome organization
- Clustering of like data across multiple genomes, e.g. protein motifs
- Cross-referencing of genome data across genomes
15 Comparative Genomics: Microbial Genome Database for Comparative Analysis (MBGD)
- MBGD is a database for comparative analysis of completely sequenced microbial genomes
- MBGD aims to facilitate comparative genomics from various points of view, such as ortholog identification, paralog clustering, motif analysis and gene order comparison
16 Comparative Genomics: Some specialized sites
- Homophila: human diseases and Drosophila gene relationships
- CORG: conserved non-coding sequence blocks
- ParaDB: paralog mapping in human genomes
17 Comparative Genomics: Clusters of Orthologous Groups (COGs)
- Phylogenetic classification of the proteins encoded in complete genomes
- Proteins grouped according to sequence by a program called COGNITOR
- Must be represented in at least three species in a group of 43 species representing phylogenetic lineages
- Each COG consists of individual proteins or groups of paralogs from at least 3 lineages and thus corresponds to an ancient conserved domain.
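The "at least three lineages" rule above can be sketched as a simple filter over candidate clusters. This is only an illustration of the threshold, not the COGNITOR program itself; the function name, the input layout, and the toy species and lineage labels are all hypothetical.

```python
MIN_LINEAGES = 3  # threshold stated on the slide


def candidate_cogs(cluster_members, lineage_of):
    """Keep clusters whose proteins span at least MIN_LINEAGES lineages.

    cluster_members: dict mapping cluster id -> list of (protein_id, species)
    lineage_of:      dict mapping species -> lineage label
    Returns a dict mapping each kept cluster id -> sorted list of lineages.
    """
    kept = {}
    for cluster_id, members in cluster_members.items():
        lineages = {lineage_of[species] for _, species in members}
        if len(lineages) >= MIN_LINEAGES:
            kept[cluster_id] = sorted(lineages)
    return kept


# Toy data (hypothetical identifiers).
lineage_of = {"sp1": "lineageA", "sp2": "lineageA", "sp3": "lineageB", "sp4": "lineageC"}
clusters = {
    "c1": [("p1", "sp1"), ("p2", "sp3"), ("p3", "sp4")],  # 3 lineages: kept
    "c2": [("p4", "sp1"), ("p5", "sp2"), ("p6", "sp3")],  # 2 lineages: dropped
}
print(candidate_cogs(clusters, lineage_of))
```

Note that paralogs from the same species add members to a cluster but do not add a lineage, which is why the count is taken over the set of lineages rather than over proteins.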
18 Gene Expression
- Analysis of gene expression patterns
- Repositories of microarray data
- Analysis of tissue specificities of gene expression
- Analysis of expression patterns for genes linked to specific diseases
- Analysis of gene expression regulatory networks
19 Gene Expression: ArrayExpress
- ArrayExpress is a new public database of microarray gene expression data at the EBI
- The ArrayExpress infrastructure consists of
  - the database itself,
  - data submissions in MAGE-ML format or via an online submission tool, MIAMExpress,
  - an online database query interface, and
  - the Expression Profiler online analysis tool.
20 Gene Expression: Edinburgh Mouse Atlas Project
- Database intended as a resource for spatially mapped data such as in situ gene expression and cell lineage
- The gene expression database (EMAGE) is being developed as part of the Mouse Gene Expression Information Resource (MGEIR) in collaboration with the Jackson Laboratory, USA
21 Gene Expression: HugeIndex (Human Gene Expression Index)
- Aims to provide a comprehensive database to understand the expression of human genes in normal human tissues
- mRNA expression levels of thousands of genes are obtained using high-density oligonucleotide array technology and used to create a public database.
22 Gene Expression: Other specialized sites
- Kidney Development Database
- TRIPLES: transposon-insertion phenotypes, localization and expression in Saccharomyces
- Tooth Development Database
- MethDB: DNA methylation data, patterns and profiles
23 Gene Identification and Structure
- Focuses on the analysis of sequences to determine gene structures
- Analysis of gene expression control signals
- Analysis of coding signals
- Analysis of variations in the exons (alleles)
- Analysis of codon usage
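"Analysis of codon usage" boils down to counting how often each codon appears in coding sequences. The following is a minimal sketch under the assumption that the input is a single in-frame coding sequence; the function name `codon_usage` is made up for this example.

```python
from collections import Counter


def codon_usage(cds: str) -> dict:
    """Return the relative frequency of each codon in an in-frame sequence.

    Trailing bases that do not form a complete codon are ignored.
    An empty or too-short sequence yields an empty dict.
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    counts = Counter(cds[i:i + 3] for i in range(0, usable, 3))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {codon: n / total for codon, n in counts.items()}


usage = codon_usage("ATGGCCGCCTAA")
for codon, freq in sorted(usage.items()):
    print(codon, freq)
```

In practice the counts would be accumulated over many genes of one organism and compared per amino acid, since the interesting signal is which synonymous codon is preferred.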
24 Gene Identification and Structure: SNP Consortium database
- Collaboration that has to date discovered and characterized nearly 1.8 million SNPs
- Now that the SNP discovery phase of the TSC (The SNP Consortium) project is essentially complete, the emphasis has shifted to studying SNPs in populations
25 Gene Identification and Structure: Alternative Splicing Annotation Project (ASAP)
- For biologists to access and mine the enormous wealth of alternative splicing information coming from genomics and proteomics
- Uses the UniGene clusters of human Expressed Sequence Tags (ESTs) to identify splices
26 Gene Identification and Structure: PromEC
- Database of promoters of characterized genes in E. coli
27 Gene Identification and Structure: Some other specialized sites
- PLACE: plant cis-acting regulatory elements
- Sputnik: functional annotation of clustered plant ESTs
- VIDA: virus open reading frames
- HS3D: human exon, intron, and splice regions
28 Genetic and physical maps
- Repository for marker information
- Data on gene locations within the genome
- Map of cloned sequences
- Tools to integrate information across genomes
29 Genetic and Physical Maps: HuGeMap
- Collections of human genetic maps from Genethon and the Cooperative Human Linkage Center
- Collections of physical maps from Genethon and the Whitehead Institute
30 Genetic and Physical Maps: GeneMap99
- A map of 30,181 human gene-based markers was assembled and integrated with the current genetic map by radiation hybrid mapping.
- Constitutes an important infrastructure and tool for the study of complex genetic traits, the positional cloning of disease genes, the cross-referencing of mammalian genomes, and validated human transcribed sequences for large-scale studies of gene expression
31 Genomic Databases
- Data repositories for research results on various model organisms
  - Rat
  - Human
  - Fruit fly
  - Worm
  - Arabidopsis
  - Some other rodent
- Linking information across databases
- Tools to organize and integrate information
32 Genomic Databases: The Rat Genome Database
- Consolidates and integrates rat research data
- Presents data on genes, QTLs, SSLPs, ESTs, etc.
- Fields a series of tools to help analysis and integration with internal and external data.
33 Genomic Databases: FlyBase
- Focuses on Drosophila genome data
- Presents data on genes, stocks, ESTs, transposons, sequences.
- Not a lot of tools
34 Genomic Databases: EcoGene
- EcoGene is a collection of information about the genes, proteins, and intergenic regions of the E. coli K-12 genome and proteome
- Collaborative effort between many laboratories
35 Genomic Databases: Some other examples
- WormBase: C. elegans
- Oryzabase: rice
- TAIR: Arabidopsis
- IRIS: rice germplasm
- MitoDat: mitochondrial proteins
- MGI: Medicago
- CropNet: crop plants
- MGD: another rodent
36 Mutation Databases
- Allele distributions in populations
- Inherited genetic diseases
- Mutations in proteins implicated in disease development
37 Mutation Databases: ALFRED
- Designed to make allele frequency data on anthropologically defined human population samples readily available to the scientific community
- Links these polymorphism data to the molecular genetics and human genome databases
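For readers unfamiliar with the term, an allele frequency for a biallelic site can be derived from genotype counts in a population sample. The sketch below shows that arithmetic only; the function name and the genotype labels are assumptions for this example and do not reflect how ALFRED stores or reports its data.

```python
def allele_frequencies(n_AA: int, n_Aa: int, n_aa: int):
    """Return (freq_A, freq_a) from genotype counts at a biallelic site.

    Each individual carries two alleles, so a sample of N individuals
    contributes 2N alleles: homozygotes contribute two copies of one
    allele, heterozygotes one copy of each.
    """
    total_alleles = 2 * (n_AA + n_Aa + n_aa)
    if total_alleles == 0:
        raise ValueError("sample contains no individuals")
    freq_A = (2 * n_AA + n_Aa) / total_alleles
    return freq_A, 1.0 - freq_A


# Hypothetical sample: 30 AA, 40 Aa, 30 aa individuals.
print(allele_frequencies(30, 40, 30))
```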
38 Mutation Databases: Human Gene Mutation Database
- An attempt to collate known (published) gene lesions responsible for human inherited disease
- Provides information of practical diagnostic importance to
  - researchers and diagnosticians in human molecular genetics
  - physicians interested in a particular inherited condition in a given patient or family
  - genetic counsellors.
39 Mutation Databases: Online Mendelian Inheritance in Man (OMIM)
- Catalog of human genes and genetic disorders
- Contains textual information, pictures, and reference information
40 Mutation Databases: Other examples
- Atlas of Genetics and Cytogenetics in Oncology and Haematology
- Database of Germline p53 Mutations
- SV40 Large T-Antigen Mutant Database
- KinMutBase: disease-causing kinase mutations
41 Protein Databases
- Protein sequences collection
- Clustering of protein data into families
- Specialized protein sites
- Organism
- Function
- Large variety of enzymes
42 Protein Databases: InterPro
- A database of protein families, domains and functional sites in which identifiable features found in known proteins can be applied to unknown protein sequences
- Amalgamates the major protein signature databases; their data have been manually integrated and curated and are available in InterPro
  - PROSITE
  - Pfam
  - PRINTS
  - ProDom
  - SMART
  - TIGRFAMs
43 Protein Databases: ProtoNet
- Provides global classification of the proteins from the SWISS-PROT database into hierarchical clusters
- Clustering is based on an all-against-all BLAST similarity search
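To make "hierarchical clusters from an all-against-all similarity search" concrete, the sketch below builds a merge hierarchy from pairwise similarity scores. It uses plain single-linkage agglomeration, which is a stand-in chosen for brevity; the slide does not say which merging rule ProtoNet applies, and the protein names and scores here are invented.

```python
def single_linkage_merges(proteins, scores):
    """Build a single-linkage hierarchy from pairwise similarity scores.

    proteins: iterable of protein identifiers
    scores:   dict mapping (a, b) -> similarity, higher meaning more similar
    Returns the list of merges (a, b, score) in the order they occur,
    i.e. from the tightest clusters up to the loosest.
    """
    parent = {p: p for p in proteins}

    def find(x):
        # Follow parent links to the cluster representative,
        # shortening the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    merges = []
    for (a, b), score in sorted(scores.items(), key=lambda kv: -kv[1]):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:  # the pair joins two separate clusters
            parent[root_b] = root_a
            merges.append((a, b, score))
    return merges


# Hypothetical all-against-all scores for four proteins.
scores = {("A", "B"): 90, ("C", "D"): 80, ("B", "C"): 30, ("A", "D"): 10}
for merge in single_linkage_merges("ABCD", scores):
    print(merge)
```

Cutting the merge list at a chosen score threshold gives flat clusters at that level of the hierarchy.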
44 Protein Databases: iProClass
- An integrated resource that provides comprehensive family relationships and structural/functional features of proteins
- Currently consists of non-redundant PIR and SwissProt/TrEMBL proteins
  - 36,200 PIR superfamilies
  - 145,300 families
  - 5,720 domains
  - 1,300 motifs
  - 280 post-translational modification sites
  - links to over 50 biological databases.
45 Protein Databases: Other Examples
- Nuclear Protein Database: proteins localized in the nucleus
- PLANT-PIs: plant protease inhibitors
- SWISS-PROT/TrEMBL: curated protein sequences
- SENTRA: sensory signal transduction proteins
- Ribonuclease P Database
46 Protein Sequence Motifs
- Alignment of protein sequences
- Organization of proteins into families
47 Protein Sequence Motifs: BLOCKS
- Multiply aligned ungapped segments corresponding to the most highly conserved regions of proteins
- Tools
  - Block Searcher: compare a protein or DNA sequence to a database of protein blocks
  - Get Blocks: retrieve blocks
  - Block Maker: create new blocks
48 Protein Sequence Motifs: Pfam
- A large collection of multiple sequence alignments and hidden Markov models covering many common protein domains and families.
- For each family in Pfam you can
  - Look at multiple alignments
  - View protein domain architectures
  - Examine species distribution
  - Follow links to other databases
  - View known protein structures
49 Protein Sequence Motifs: PROSITE
- Database of protein families and domains. It consists of biologically characterized sites, patterns and profiles that help to reliably identify to which known protein family (if any) a new sequence belongs
- Currently contains patterns and profiles specific for more than a thousand protein families or domains.
- Each of these signatures comes with documentation providing background information on the structure and function of these proteins
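PROSITE patterns are written in a compact notation (elements separated by `-`, `x` for any residue, `[...]` for allowed residues, `{...}` for forbidden residues, `(n)` or `(n,m)` for repeats, `<` and `>` for the sequence ends). A minimal sketch of converting a subset of that notation into a Python regular expression follows. The helper name `prosite_to_regex` is made up for this example, it does not cover every feature of the notation, and the test sequences are invented.

```python
import re

# One pattern element: a residue specification with an optional repeat count.
_ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern: str) -> str:
    """Convert a simple PROSITE-style pattern into a regular expression."""
    pattern = pattern.strip().rstrip(".")
    prefix = suffix = ""
    if pattern.startswith("<"):  # anchored at the N-terminus
        prefix, pattern = "^", pattern[1:]
    if pattern.endswith(">"):  # anchored at the C-terminus
        suffix, pattern = "$", pattern[:-1]
    parts = []
    for element in pattern.split("-"):
        match = _ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        residue, low, high = match.groups()
        if residue == "x":
            residue = "."  # any residue
        elif residue.startswith("{"):
            residue = "[^" + residue[1:-1] + "]"  # forbidden residues
        repeat = ""
        if low:
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(residue + repeat)
    return prefix + "".join(parts) + suffix


# The N-glycosylation site pattern, a commonly cited example.
regex = prosite_to_regex("N-{P}-[ST]-{P}")
print(regex)
print(bool(re.search(regex, "AANGSAA")), bool(re.search(regex, "AANPSAA")))
```

Short patterns like this one match by chance quite often, which is one reason each signature comes with documentation on how to interpret a hit.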
50 Protein Sequence Motifs: Other Examples
- ASC (Active Sequence Collection): biologically active oligopeptides
- CluSTr: automatic classification of SWISS-PROT and TrEMBL proteins
- TMPDB: experimentally characterized transmembrane topology
- O-GLYCBASE: O- and C-linked glycosylation sites in proteins
51 RNA Sequences
- Repository of RNA sequences
- RNA structure data
- RNA metabolism information
- Specialized sites by organism, function, etc.
52 RNA Sequences: HyPaLib
- Contains annotated structural elements characteristic for certain classes of structural and/or functional RNAs
- Developing software tools that allow a user to search sequence databases for any pattern in HyPaLib
53 RNA Sequences: Rfam
- A collection of multiple sequence alignments and covariance models representing non-coding RNA families
- Allows the user to search a query sequence against a library of covariance models, and view multiple sequence alignments and family annotation
54 RNA Sequences: tRNA sequences
- Compilation of tRNA sequences and sequences of tRNA genes
55 RNA Sequences: Other Examples
- 16S and 23S Ribosomal RNA Mutation Database
- ACTIVITY: functional DNA/RNA site activity
- PLANTncRNAs: plant non-coding RNAs
- RNA Modification Database: naturally modified nucleosides in RNA
56 Structure
- Information on protein structure derived from physical data: crystallography, NMR
- Classification of proteins according to tertiary structures
- Specialized sites for specific proteins
57 Structure: ASTRAL
- Provides databases and tools useful for analyzing protein structures and their sequences
- Partially derived from the SCOP database (Structural Classification of Proteins)
58 Structure: SCOP
- Comprehensive ordering of proteins of known structure based on their evolutionary and structural relationships
- Protein domains are grouped into species and hierarchically classified into families, superfamilies, folds, and classes
59 Structure: PDB
- Structure data determined by X-ray crystallography and NMR
60 Structure: Other Examples
- CADB: conformation angles of protein structures, with associated crystallographic data
- Database of Macromolecular Movements
- DSDBASE: disulfide bonds in proteins
- PSSH: alignment between sequences and tertiary structures
- SUPERFAMILY: assignments of proteins to structural superfamilies
61 Other Databases
- Intermolecular Interactions
- Metabolic Pathways and Cellular Regulation
- Pathology
- Proteome Resources
- Retrieval Systems and Database Structure
- Transgenics
- Varied Medical Content
62 Other Databases: Intermolecular Interactions
- BIND: molecular interactions, complexes and pathways
- DIP (Database of Interacting Proteins): experimentally determined protein-protein interactions
- KDBI: kinetic data on biomolecular interactions
63 Other Databases: Metabolic Pathways and Cellular Regulation
- KEGG: Kyoto Encyclopedia of Genes and Genomes
- MetaCyc: metabolic pathways and enzymes from various organisms
- PathDB
- EcoCyc: E. coli K-12 genome and pathway data
- PRODORIC: gene regulation and regulatory networks in prokaryotes
64 Other Databases: Pathology
- BayGenomics: cardiovascular and pulmonary disease
- INFEVERS: hereditary inflammatory disorders
- GOLD.db: lipid-associated disorders
- Mouse Tumor Biology Database
65 Other Databases: Proteome Resources
- GELBANK: 2D gel data repository
- REBASE: restriction enzymes and associated methylases
- SWISS-2DPAGE: annotated two-dimensional gel electrophoresis database
66 Other Databases: Retrieval Systems and Database Structure
- TESS: Transcription Element Search System
- Virgil: database interconnectivity
67 Other Databases: Transgenics
- Cre Transgenic Database: Cre transgenic mouse lines
- Transgenic/Targeted Mutation Database: information on transgenic animals and targeted mutations
68 Other Databases: Varied Medical Content
- Tree of Life: phylogeny and biodiversity
- PubMed: biomedical literature
- NCBI Taxonomy Browser: organisms with at least one sequence deposited in the database
- PharmGKB: pharmacogenomics and variations in drug response based on human variation
69 The Rat Genome Database
70 The Rat Genome Database: data
- Genes
- Maps and Markers
- QTLs
- Strains
- Homologs
71 The Rat Genome Database: tools
- VCMap
- Mapserver
- Meta Gene
- Genome Scanner
- Ontology Browser
72 The Rat Genome Database: operations
- Curation
- Data QC and Loading
- Data development
- Tool development
73 The Rat Genome Database Operations: Curation
- Information gathering from peer-reviewed work
- Coordination with other model organism databases
- Data quality policy development and assessment
74 The Rat Genome Database Operations: Data development
- Development of data integration strategies
- Development of ontology annotation protocols
- Some development of curation policies
- Outreach
- Ontology development
75 The Rat Genome Database Operations: Tool development
- Ontology system development
- Systems analysis
- Tool integration
- Tool building
- Software system migration