Title: Gene Expression: Microarray Theory and NCBI Gene-Centered Resources
1Gene Expression: Microarray Theory and NCBI Gene-Centered Resources
June 20, 2007
2Eucaryote Gene Expression Control
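[Figure: control points of eukaryotic gene expression. Nucleus: DNA to primary RNA transcript (transcriptional control), primary RNA transcript to mRNA (RNA processing control). Across the nuclear membrane: mRNA export to the cytosol (RNA transport control). Cytosol: mRNA to protein (translation control), mRNA to inactive mRNA (mRNA degradation control), protein to inactive protein (protein activity control).]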
3Gene expression profiles
[Figure: gene expression profiles. Y-axis: expression level relative to a reference point at 0; x-axis: time/condition.]
4Goal of Microarray Experiments
- Regulation/function in pathway/cellular state/phenotype
- Disease diagnosis / disease gene identification
[Figure labels: gene expression, microarray data, biological pathway]
5Similarity between Profiles
- Similarity measures
- Euclidean distance
- Correlation coefficient
- Trend
- The correlation coefficient often works better.
[Figure: expression profiles. Y-axis: expression (reference level at 0); x-axis: time.]
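The two numeric similarity measures on this slide can be compared on a toy example. The sketch below is only an illustration: the three profiles are made-up values, and the function names are chosen here, not taken from the slides. It shows one case where Euclidean distance and the correlation coefficient disagree about which gene is most similar to `gene_a`.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pearson_correlation(a, b):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

# Hypothetical expression profiles over five time points (made-up values).
gene_a = [0.0, 1.0, 2.0, 1.0, 0.0]
gene_b = [0.0, 2.0, 4.0, 2.0, 0.0]  # same trend as gene_a, twice the amplitude
gene_c = [1.0, 1.0, 1.0, 1.0, 0.5]  # different trend, but numerically close to gene_a

for name, other in (("gene_b", gene_b), ("gene_c", gene_c)):
    print(name,
          "distance:", round(euclidean_distance(gene_a, other), 3),
          "correlation:", round(pearson_correlation(gene_a, other), 3))
```

By distance, `gene_c` is the nearer neighbor of `gene_a`; by correlation, `gene_b` is, because it follows the same trend at a different amplitude. This is one reason a correlation-based measure can be preferable when the shape of the profile matters more than its magnitude.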
6Data-Mining through Clustering
- Assumptions for clustering analysis
- Expression level of a gene reflects the gene's activity.
- Genes involved in the same biological process exhibit a statistical relationship in their expression profiles.
7Idea of Clustering
- Clustering groups objects into clusters so that
- objects in each cluster have similar features
- objects of different clusters have dissimilar features
8Methods of Clustering
- discriminant analysis (Fisher, 1931)
- self-organizing maps (Kohonen, 1980)
- support vector machines (Vapnik, 1985)
- single linkage (dendrogram)
- minimal spanning tree based clustering
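As a rough illustration of the single-linkage method listed above, the sketch below repeatedly merges the two closest clusters, where the distance between clusters is the smallest distance between any pair of their members. It is a naive version written for readability, not an efficient or canonical implementation; the profiles are made-up values and the function names are chosen here.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def single_linkage(profiles, n_clusters):
    """Merge the two closest clusters until n_clusters remain.

    Cluster-to-cluster distance is the minimum pairwise distance
    between members (single linkage). Returns lists of row indices.
    """
    clusters = [[i] for i in range(len(profiles))]
    while len(clusters) > n_clusters:
        best = None  # (distance, index of cluster i, index of cluster j)
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(euclidean(profiles[p], profiles[q])
                        for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]

# Hypothetical profiles: genes 0 and 1 rise over time, genes 2 and 3 fall.
profiles = [
    [0.0, 1.0, 2.0],
    [0.1, 1.1, 2.1],
    [2.0, 1.0, 0.0],
    [2.2, 0.8, 0.2],
]
print(single_linkage(profiles, 2))
```

Recording the order and distance of each merge, rather than stopping at a fixed number of clusters, is what yields the dendrogram mentioned on the slide.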
9Issues in Cluster Analysis
- A lot of clustering algorithms
- A lot of distance/similarity metrics
- Which clustering algorithm runs faster and uses less memory?
- How many clusters after all?
- Are the clusters stable?
- Are the clusters meaningful?
10Which Clustering Method Should I Use?
- What is the biological question?
- Do I have a preconceived notion of how many clusters there should be?
- How strict do I want to be? Split or join?
- Can a gene be in multiple clusters?
- Hard or soft boundaries between clusters?
11Clustering Algorithms
- K-Means clustering (most popular)
- Hierarchical clustering
- Minimum spanning tree-based clustering
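For K-means, a minimal sketch of the usual assign-then-update loop (Lloyd's algorithm) might look like the following. It assumes squared Euclidean distance and, purely to keep the example deterministic, seeds the centroids with the first `k` points; practical implementations normally use random or smarter initialization. The data and names are illustrative only.

```python
def kmeans(points, k, n_iter=20):
    """Basic K-means. Returns (labels, centroids).

    Simplification: the first k points are used as initial centroids.
    """
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(k),
                key=lambda c: sum((x - m) ** 2 for x, m in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Hypothetical profiles: two rising genes and two falling genes.
points = [
    [0.0, 1.0, 2.0],
    [2.0, 1.0, 0.0],
    [0.1, 1.1, 2.1],
    [2.2, 0.8, 0.2],
]
labels, centroids = kmeans(points, k=2)
print(labels)
```

Unlike hierarchical clustering, K-means needs the number of clusters up front, which ties back to the question of whether there is a preconceived notion of how many clusters to expect.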
12NCBI Databases and Services
- GenBank: largest sequence database
- Free public access to biomedical literature
- PubMed: free MEDLINE
- PubMed Central: full-text online access
- Entrez: integrated molecular and literature databases
- BLAST: highest-volume sequence search service
- VAST: structure similarity searches
- Software and Databases
13Types of Databases
- Primary Databases
- Original submissions by experimentalists
- Content controlled by the submitter
- Examples: GenBank, SNP, GEO, Probe, GENSAT
- Derivative Databases
- Built from primary data
- Content controlled by third party (NCBI)
- Examples: RefSeq, TPA, RefSNP, UniGene, NCBI Protein, Structure, Conserved Domain, Gene
14Gene-centered Databases
- Sequences
- GenBank
- Reference Sequences
- Expression
- UniGene
- GEO
- Probe
- GenSat
- Variation (dbSNP)
- HomoloGene
- Gene
15What is GenBank? NCBI's Primary Sequence Database
- Nucleotide-only sequence database
- Archival in nature
- Historical
- Reflective of submitter point of view (subjective)
- Redundant
- GenBank Data
- Direct submissions (traditional records)
- Batch submissions (EST, GSS, STS)
- ftp accounts (genome data)
- Three collaborating databases
- GenBank
- DNA Database of Japan (DDBJ)
- European Molecular Biology Laboratory (EMBL)
Database
16RefSeq: NCBI's Derivative Sequence Database
- Curated transcripts and proteins (NM_, NP_)
- reviewed
- human, mouse, rat, fruit fly, zebrafish, Arabidopsis
- microbial genomes (proteins), and more
- Model transcripts and proteins (XM_, XP_)
- Assembled Genomic Regions (contigs) (NT_, NW_)
- human
- mouse
- rat
- Chromosome records (NC_)
- Human genome
- microbial
- organelle
- chicken
- honeybee
- sea urchin
srcdb_refseq[Properties]
ftp://ftp.ncbi.nih.gov/refseq/release/
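The accession prefixes listed on this slide lend themselves to a simple lookup. The sketch below maps each prefix to the record type named on the slide; the helper name and the placeholder accession numbers are hypothetical, and no NCBI service is contacted.

```python
# Prefix-to-record-type mapping, taken from the bullets on this slide.
REFSEQ_PREFIXES = {
    "NM_": "curated transcript",
    "NP_": "curated protein",
    "XM_": "model transcript",
    "XP_": "model protein",
    "NT_": "assembled genomic region (contig)",
    "NW_": "assembled genomic region (contig)",
    "NC_": "chromosome record",
}

def classify_refseq(accession):
    """Return the RefSeq record type implied by an accession's prefix."""
    prefix = accession[:3].upper()
    return REFSEQ_PREFIXES.get(prefix, "prefix not listed on this slide")

# Placeholder accession numbers, used only to exercise the lookup.
for acc in ("NM_000000", "XP_000000", "NC_000000", "AB000000"):
    print(acc, "->", classify_refseq(acc))
```

The last accession has no underscore-delimited prefix and so falls through to the default, which is a quick way to tell a GenBank-style accession from a RefSeq one.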
17GenBank to RefSeq Transcript (NM_)
18RefSeqs: Annotation Reagents
[Figure: RefSeq annotation diagram. Labels: genomic DNA (NC, NT, NW); scanning; model mRNA (XM, XR); model protein (XP); curated mRNA (NM, NR); curated protein (NP); RefSeq; GenBank sequences.]
19Mouse Assembly
20Expressed Sequences
21Expressed Sequence Tags in Entrez
- Total: 41 million records
- Human: 7.9 million
- Mouse: 4.7 million
- Cow: 1.3 million
- Rice: 1.2 million
- Zebrafish: 1.2 million
- Maize: 1.2 million
- Xenopus tropicalis: 1.0 million
- Rat: 0.9 million
- Wheat: 0.9 million
- Chicken: 0.6 million
- Barley: 0.4 million
22What is UniGene?
A gene-oriented view of sequence entries
- MegaBLAST-based automated sequence clustering
- Now informed by genome hits (new)
- Nonredundant set of gene-oriented clusters
- Each cluster a unique gene
- Information on tissue types and map locations
- Includes known genes and uncharacterized ESTs
- Useful for gene discovery and selection of
mapping reagents
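To give a feel for how pairwise sequence hits can be turned into gene-oriented clusters, here is a generic union-find sketch. It is not UniGene's actual procedure, which the slide only summarizes; it simply assumes a list of sequence pairs that have already been judged to match (for example by a MegaBLAST run) and groups everything connected by such hits. All sequence names are hypothetical.

```python
def cluster_by_hits(sequence_ids, hits):
    """Group sequences into clusters connected by pairwise hits.

    sequence_ids: iterable of sequence names.
    hits: iterable of (name_a, name_b) pairs considered matching.
    Returns a sorted list of sorted clusters.
    """
    parent = {s: s for s in sequence_ids}

    def find(s):
        # Follow parent pointers to the cluster root, compressing the path.
        while parent[s] != s:
            parent[s] = parent[parent[s]]
            s = parent[s]
        return s

    for a, b in hits:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    clusters = {}
    for s in sequence_ids:
        clusters.setdefault(find(s), []).append(s)
    return sorted(sorted(members) for members in clusters.values())

# Hypothetical input: two ESTs hit a known mRNA; two other ESTs hit only each other.
ids = ["est1", "est2", "est3", "est4", "mrna1"]
hits = [("est1", "mrna1"), ("est2", "mrna1"), ("est3", "est4")]
print(cluster_by_hits(ids, hits))
```

In this toy input the second cluster contains only ESTs, which mirrors the slide's point that clusters include uncharacterized ESTs as well as known genes.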
23EST hits: Human mRNA
[Figure: albumin mRNA with 5' EST hits and 3' EST hits]
24UniGene
25Gene Catalog: X. laevis MLH1 Cluster
Uncharacterized ESTs
26Associating Sequences Human ALB Cluster
27Expression Data
28Gene Expression Omnibus
- Expression data repository
- Primary GEO data records
- Platform (e.g. Affymetrix Gene Chip, Expression Set 230 Array)
- Sample (e.g. E19 type II epithelial cells, 16h, control)
- Series (e.g. lung development, fetal type II epithelial cells, strain response, microarray, 6 samples)
- Derivative GEO Databases
- DataSets: biologically and statistically compatible Samples, curated and suitable for GEO analysis tools
- Profiles: expression measurements for an individual gene across all Samples in a DataSet
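The relationship between a DataSet and a Profile can be sketched with plain dictionaries: a DataSet holds measurements per Sample, and a Profile is the slice for one gene across those Samples. The structure, sample names, and values below are hypothetical and are not GEO's file format or API.

```python
# Hypothetical DataSet: sample name -> {gene symbol: expression value}.
# Values are made up for illustration.
dataset = {
    "sample_1": {"Gad1": 5.2, "Gad2": 7.1},
    "sample_2": {"Gad1": 5.0, "Gad2": 8.3},
    "sample_3": {"Gad1": 4.8, "Gad2": 9.0},
}

def expression_profile(dataset, gene):
    """Collect one gene's measurements across all Samples in a DataSet.

    Samples that lack a measurement for the gene are skipped.
    Returns a list of (sample, value) pairs in DataSet order.
    """
    return [(sample, values[gene])
            for sample, values in dataset.items()
            if gene in values]

print(expression_profile(dataset, "Gad2"))
```

A list of such profiles, one per gene, is the kind of input that the similarity measures and clustering methods from the first half of the talk operate on.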
29Experiment Types in GEO
- Array-based
- Single Channel
- Dual Channel
- Antibody arrays
- Universal Microarray System (UMAS)
- Genomic (Comparative Genomic Hybridization (CGH), Chromatin ImmunoPrecipitation (ChIP-chip), tiling arrays, SNP detection)
- Non-array-based
- Serial Analysis of Gene Expression (SAGE)
- Massively Parallel Signature Sequencing (MPSS)
- Serial Analysis of Ribosomal Sequence Tags (SARST)
- Mass Spectrometry Proteomics Data
30Expression Platform
31Expression Platform
32Expression Sample
33Expression Series
34GEO Dataset
35Expression Profile: GAD1
36Entrez Probe
- Nucleic acid reagents
- Sequence targeted oligos
- Over 7 million entries
- Applications
- Genotyping
- Gene Silencing
- SNP Discovery
- Genome Mapping
- Gene Expression
- Real time PCR
- In Situ hybridization
37Ataxia Telangiectasia Mutated Gene Probes
38Resequencing Amplicon SNP Discovery
39Gene Silencing Probes
40Gene Expression Probe to GENSAT
41Gene Expression Nervous System Atlas
- NINDS and Rockefeller University
- BAC Transgenic Mouse Strains
- Enhanced GFP expression
- Driven by endogenous gene regulatory sequences
- Histologic sections: mouse brain
- Confocal microscopy: GFP reporter
- Immunohistochemical staining: transgenics
- In situ hybridization: native expression
- Results: important for understanding brain microanatomy and development
42GENSAT Dopamine Receptor 2 (Drd2)
43Bright Field Immunostaining BAC Transgenic
44dbSNP Nucleotide Polymorphisms
45NCBI's SNP Database
- Primary Database and Derivative (RefSNP)
- Single Nucleotide Polymorphism
- Repeat polymorphisms
- Insertion-Deletion Polymorphisms
- 29 Species
- Over 46 million submissions (submitted SNPs)
- Over 26 million submissions (reference SNPs)
46Submitted SNP
Hemochromatosis SNP
47RefSNP
48RefSNP Analysis GeneView
49Mapping to Structure
50Population Genetics Genotype
51The Gene Database
- Gene Centered Information
- Unifies LocusLink and microbial Genomes
- 2.4 million records for 3,822 taxa
52Genes MLH1 One Stop Shopping
53Genes MLH1 One Stop Shopping (cont.)
54Genes Display Options and Links
55HomoloGene
- Completely Annotated Eukaryotic Genomes
- Homologous UniGene determined for other organisms
- Protein similarities first
- Guided by taxonomic tree
- Includes orthologs and paralogs
- Provides Neighboring function for Entrez Gene
database
56HomoloGene Cluster
57WWW Access
Entrez BLAST
58Entrez Database Integration
[Figure: Entrez database integration. Databases: PubMed abstracts, nucleotide sequences, protein sequences, 3-D structure, genomes, taxonomy. Neighbor links within a database: word weight (PubMed abstracts), BLAST (related sequences; BLink and domains for proteins), VAST (related structures), phylogeny (taxonomy). Hard links connect records across databases.]
59Entrez Use Gene for everything
BLink
Homologene Gene Neighbors
60Glutamate decarboxylases
- Finding
- Sequences
- Expression Information
- Molecular Probes
- Homologs in other species
- Genomic Context
- Polymorphisms
- Structures and Conserved Domains
www.ncbi.nlm.nih.gov
61Mouse GAD Genes
62GAD2 Gene Record
63GAD2 Sequences
64Gene Table Genomic Sequences
65The Map Viewer Genomic Sequences
Download data here
Zoom or enter region of interest
66Expression Data
GENSAT
GEO
UniGene
67GenSat BAC transgenics
68GEO Profiles
69Preview / Index Refining Results
70Gad2 Expression Profile
71Genes with similar expression
72Molecular Probes
73RSA Primers (Human)
Known SNPs in Red
Download entire tiling set
74Finding Homologs HomoloGene
Protein mRNA Genomic
75HomoloGene Cluster
Gene Links
Protein Links
76Homologous UniGene
77UniGene Hs.231829 GAD2
78Expression Profile
79Finding Homologs 2 BLink
80BLink BLAST Link (Best Hits)
Redundant Proteins
First 200 only
Goldfish homolog
BLAST
81Finding Polymorphisms Human
82SNP GeneView
83Related Structures structure model
84Sequence Similar Structures
Conserved Domain
Link to Structure
Link to Alignment
85Pig DOPA decarboxylase
Structure Neighbors
Cn3D viewer
3D Domain Neighbors
Conserved Domains
PubChem compound
86Alignment Based Model
87Better Model Conserved Domain
Ala-Gly from dbSNP
88Service Addresses
- info@ncbi.nlm.nih.gov
- blast-help@ncbi.nlm.nih.gov