Gene Expression: Microarray Theory and NCBI Gene-Centered Resources

Slides: 88
Provided by: peterc87

Transcript and Presenter's Notes


1
Gene Expression: Microarray Theory and NCBI
Gene-Centered Resources
June 20, 2007
2
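Eukaryotic Gene Expression Control
[Diagram: control points in eukaryotic gene expression]
  • Nucleus: DNA to primary RNA transcript
    (transcriptional control), then primary RNA
    transcript to mRNA (RNA processing control)
  • Across the nuclear membrane: mRNA in the nucleus
    to mRNA in the cytosol (RNA transport control)
  • Cytosol: mRNA to protein (translation control),
    mRNA to inactive mRNA (mRNA degradation control),
    protein to inactive protein (protein activity
    control)
  • Expressed genes: mRNA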
3
Gene expression profiles
[Plot: expression (levels relative to a reference
point at 0) versus time/condition]
4
Goal of Microarray Experiments
  • Regulation/function in pathway/cellular
    state/phenotype
  • Disease diagnosis / disease gene identification

Gene expression
Microarray data
Biological pathway
5
Similarity between Profiles
  • Similarity measures
  • Euclidean distance
  • Correlation coefficient
  • Trend: the correlation coefficient often works
    better.

[Plot: an expression profile, expression versus time,
with the reference level at 0]
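
The contrast between the two similarity measures on this slide can be sketched in a few lines of Python. This is a minimal illustration, not part of the original presentation: the three profiles are made-up values, and the function names are hypothetical. It shows why a correlation coefficient may capture a shared trend that Euclidean distance misses.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pearson_correlation(a, b):
    """Pearson correlation coefficient; compares the shape (trend) of two profiles."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

# Made-up relative expression levels over five time points (assumption).
gene_a = [0.0, 1.0, 2.0, 1.0, 0.0]
gene_b = [0.0, 3.0, 6.0, 3.0, 0.0]  # same trend as gene_a, three times the amplitude
gene_c = [1.0, 1.0, 1.0, 0.5, 1.0]  # different trend, but numerically close to gene_a

dist_ab = euclidean_distance(gene_a, gene_b)
dist_ac = euclidean_distance(gene_a, gene_c)
corr_ab = pearson_correlation(gene_a, gene_b)
corr_ac = pearson_correlation(gene_a, gene_c)

print(f"Euclidean: a-b {dist_ab:.3f}, a-c {dist_ac:.3f}")
print(f"Pearson:   a-b {corr_ab:.3f}, a-c {corr_ac:.3f}")
```

With these values, Euclidean distance ranks gene_c as closer to gene_a, while the correlation coefficient ranks gene_b (the profile with the same trend) as the better match.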
6
Data-Mining through Clustering
  • Assumptions for clustering analysis
  • The expression level of a gene reflects the
    gene's activity.
  • Genes involved in the same biological process
    exhibit a statistical relationship in their
    expression profiles.

7
Idea of Clustering
  • Clustering groups objects into clusters so that
  • objects in each cluster have similar features
  • objects of different clusters have dissimilar
    features

8
Methods of Clustering
  • discriminant analysis (Fisher, 1931)
  • K-means (Lloyd, 1948)
  • hierarchical clustering
  • self-organizing maps (Kohonen, 1980)
  • support vector machines (Vapnik, 1985)
  • single linkage (dendrogram)
  • minimal spanning tree based clustering
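
As an illustration of one method in the list above, here is a minimal sketch of single-linkage hierarchical clustering. It is not taken from the presentation: the function name, the stopping rule (merge until a requested number of clusters remains), and the sample points are assumptions. Single linkage is closely related to the minimal-spanning-tree approach, since removing the longest edges of a minimum spanning tree yields the same partition.

```python
import math

def single_linkage(points, num_clusters):
    """Agglomerative clustering: repeatedly merge the two closest clusters.

    The distance between two clusters is the smallest Euclidean distance
    between any member of one and any member of the other (single linkage).
    Returns clusters as lists of indices into `points`.
    """
    clusters = [[i] for i in range(len(points))]

    def cluster_distance(c1, c2):
        return min(math.dist(points[i], points[j]) for i in c1 for j in c2)

    while len(clusters) > num_clusters:
        best = None  # (distance, index of first cluster, index of second cluster)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = cluster_distance(clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]

# Made-up two-condition expression profiles (assumption).
profiles = [[0, 0], [0, 1], [5, 5], [5, 6], [10, 0]]
result = single_linkage(profiles, num_clusters=3)
print(result)
```

Recording the order and distance of each merge, rather than stopping at a fixed count, would give the data needed to draw a dendrogram.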

9
Issues in Cluster Analysis
  • A lot of clustering algorithms
  • A lot of distance/similarity metrics
  • Which clustering algorithm runs faster and uses
    less memory?
  • How many clusters after all?
  • Are the clusters stable?
  • Are the clusters meaningful?

10
Which Clustering Method Should I Use?
  • What is the biological question?
  • Do I have a preconceived notion of how many
    clusters there should be?
  • How strict do I want to be? Split or join?
  • Can a gene be in multiple clusters?
  • Hard or soft boundaries between clusters?

11
Clustering Algorithms
  • K-Means clustering (most popular)
  • Hierarchical clustering
  • Minimum spanning tree-based clustering
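
A minimal sketch of K-means clustering follows, assuming the standard alternating scheme (assign each profile to its nearest centroid, then recompute each centroid as the mean of its members). The function name, the choice to pass starting centroids explicitly, and the sample profiles are illustrative assumptions rather than anything specified in the slides.

```python
import math

def k_means(points, centroids, max_iterations=100):
    """Cluster `points` around the given starting `centroids`.

    Returns (assignment, centroids), where assignment[i] is the index of
    the centroid nearest to points[i] once the assignments stop changing.
    """
    centroids = [list(c) for c in centroids]
    assignment = None
    for _ in range(max_iterations):
        # Assignment step: nearest centroid by Euclidean distance.
        new_assignment = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[k] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignment, centroids

# Made-up two-condition expression profiles forming two obvious groups (assumption).
profiles = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
            [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]]
labels, centers = k_means(profiles, centroids=[profiles[0], profiles[3]])
print(labels)
print(centers)
```

The number of clusters has to be chosen up front, which ties back to the question of whether there is a preconceived notion of how many clusters there should be. The result can also depend on the starting centroids, so in practice the procedure is often repeated from several starting points.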

12
NCBI Databases and Services
  • GenBank: largest sequence database
  • Free public access to biomedical literature
  • PubMed: free MEDLINE
  • PubMed Central: full-text online access
  • Entrez: integrated molecular and literature
    databases
  • BLAST: highest-volume sequence search service
  • VAST: structure similarity searches
  • Software and Databases

13
Types of Databases
  • Primary Databases
  • Original submissions by experimentalists
  • Content controlled by the submitter
  • Examples: GenBank, SNP, GEO, Probe, GenSat
  • Derivative Databases
  • Built from primary data
  • Content controlled by third party (NCBI)
  • Examples: RefSeq, TPA, RefSNP, UniGene, NCBI
    Protein, Structure, Conserved Domain, Gene

14
Gene-centered Databases
  • Sequences
  • GenBank
  • Reference Sequences
  • Expression
  • UniGene
  • GEO
  • Probe
  • GenSat
  • Variation (dbSNP)
  • HomoloGene
  • Gene

15
What is GenBank? NCBI's Primary Sequence Database
  • Nucleotide only sequence database
  • Archival in nature
  • Historical
  • Reflective of submitter point of view
    (subjective)
  • Redundant
  • GenBank Data
  • Direct submissions (traditional records)
  • Batch submissions (EST, GSS, STS)
  • ftp accounts (genome data)
  • Three collaborating databases
  • GenBank
  • DNA Database of Japan (DDBJ)
  • European Molecular Biology Laboratory (EMBL)
    Database

16
RefSeq: NCBI's Derivative Sequence Database
  • Curated transcripts and proteins (NM_, NP_)
  • reviewed
  • human, mouse, rat, fruit fly, zebrafish,
    arabidopsis
  • microbial genomes (proteins), and more
  • Model transcripts and proteins (XM_, XP_)
  • Assembled Genomic Regions (contigs) (NT_, NW_)
  • human
  • mouse
  • rat
  • Chromosome records (NC_)
  • Human genome
  • microbial
  • organelle
  • chicken
  • honeybee
  • sea urchin
  • zebrafish
  • cow
  • dog
  • black poplar

srcdb_refseq[Properties]
ftp://ftp.ncbi.nih.gov/refseq/release/
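
The accession prefixes listed on this slide lend themselves to a simple lookup. The sketch below is illustrative only: the function name and the example accessions are made up, and the NR_ and XR_ entries (non-coding RNA) are added from general knowledge of RefSeq rather than from this slide.

```python
# Prefix -> record type, following the categories on the slide.
REFSEQ_PREFIXES = {
    "NM_": "curated transcript (mRNA)",
    "NP_": "curated protein",
    "XM_": "model transcript (mRNA)",
    "XP_": "model protein",
    "NT_": "assembled genomic region (contig)",
    "NW_": "assembled genomic region (contig)",
    "NC_": "chromosome record",
    # Not on this slide (assumption from general RefSeq conventions):
    "NR_": "curated non-coding RNA",
    "XR_": "model non-coding RNA",
}

def classify_refseq_accession(accession):
    """Return the RefSeq record type implied by the accession prefix, or None."""
    prefix = accession.strip().upper()[:3]
    return REFSEQ_PREFIXES.get(prefix)

# Made-up accessions used only to show the format (assumption).
for acc in ["NM_123456.1", "XP_123456.1", "NC_000001.1", "U12345"]:
    kind = classify_refseq_accession(acc)
    print(acc, "->", kind if kind else "no RefSeq prefix recognized")
```

An accession without one of these prefixes, such as a traditional GenBank record, returns None here.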
17
GenBank to RefSeq Transcript (NM_)
18
RefSeqs Annotation Reagents
Genomic DNA (NC, NT, NW)
Scanning....
Model mRNA (XM) (XR)
Model protein (XP)
?
Curated mRNA (NM) (NR)
Curated Protein (NP)
RefSeq
GenBank Sequences
19
Mouse Assembly
20
Expressed Sequences
  • UniGene
  • GEO
  • Probe
  • GenSat

21
Expressed Sequence Tags in Entrez
Total: 41 million records
  • Human: 7.9 million
  • Mouse: 4.7 million
  • Cow: 1.3 million
  • Rice: 1.2 million
  • Zebrafish: 1.2 million
  • Maize: 1.2 million
  • Xenopus tropicalis: 1.0 million
  • Rat: 0.9 million
  • Wheat: 0.9 million
  • Chicken: 0.6 million
  • Barley: 0.4 million
22
What is UniGene?
A gene-oriented view of sequence entries
  • MegaBlast based automated sequence clustering
  • Now informed by genome hits (new)
  • Nonredundant set of gene-oriented clusters
  • Each cluster represents a unique gene
  • Information on tissue types and map locations
  • Includes known genes and uncharacterized ESTs
  • Useful for gene discovery and selection of
    mapping reagents

23
EST Hits: Human mRNA
Albumin mRNA
5' EST hits
3' EST hits
24
UniGene
25
Gene Catalog: X. laevis MLH1 Cluster
Uncharacterized ESTs
26
Associating Sequences: Human ALB Cluster
27
Expression Data
28
Gene Expression Omnibus
  • Expression data repository
  • Primary GEO data records
  • Platform (e.g. Affymetrix GeneChip, Expression
    Set 230 Array)
  • Sample (e.g. E19 type II epithelial cells, 16h,
    control)
  • Series (e.g. lung development, fetal type II
    epithelial cells, strain response, microarray, 6
    samples)
  • Derivative GEO Databases
  • Datasets
  • Biologically and statistically compatible Samples
    curated and suitable for GEO analysis tools
  • Profiles
  • Expression measurements for an individual gene
    across all Samples in a DataSet.
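
To make the relationship between Samples, DataSets, and Profiles concrete, here is a minimal data-model sketch. It is not GEO's actual schema or API: the class names, field names, accession strings, and expression values are all hypothetical and chosen only to mirror the definitions on this slide.

```python
from dataclasses import dataclass, field

@dataclass
class Sample:
    """One hybridization: a description plus one measurement per gene."""
    accession: str
    description: str
    values: dict  # gene symbol -> expression measurement

@dataclass
class DataSet:
    """A curated collection of comparable Samples."""
    accession: str
    samples: list = field(default_factory=list)

    def profile(self, gene):
        """Measurements for one gene across all Samples in this DataSet."""
        return [(s.accession, s.values.get(gene)) for s in self.samples]

# Placeholder accessions and made-up measurements (assumptions).
dataset = DataSet(
    accession="GDS_EXAMPLE",
    samples=[
        Sample("GSM_EXAMPLE_1", "type II epithelial cells, control",
               {"Gad1": 7.2, "Gad2": 3.4}),
        Sample("GSM_EXAMPLE_2", "type II epithelial cells, treated",
               {"Gad1": 9.1, "Gad2": 3.0}),
    ],
)
gad1_profile = dataset.profile("Gad1")
print(gad1_profile)
```

In this picture a Profile is a column-wise view of a DataSet: one gene, every Sample.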

29
Experiment Types in GEO
  • Array-based
  • Single Channel
  • Dual Channel
  • Antibody arrays
  • Universal Microarray System (UMAS)
  • Genomic (Comparative Genomic Hybridization (CGH),
    Chromatin ImmunoPrecipitation (ChIP-chip),
    tiling arrays, SNP detection)
  • Non-array-based
  • Serial Analysis of Gene Expression (SAGE)
  • Massively Parallel Signature Sequencing (MPSS)
  • Serial Analysis of Ribosomal Sequence Tags (SARST)
  • Mass Spectrometry Proteomics Data

30
Expression Platform
31
Expression Platform
32
Expression Sample
33
Expression Series
34
GEO Dataset
35
Expression Profile: GAD1
36
Entrez Probe
  • Nucleic acid reagents
  • Sequence targeted oligos
  • Over 7 million entries
  • Applications
  • Genotyping
  • Gene Silencing
  • SNP Discovery
  • Genome Mapping
  • Gene Expression
  • Real time PCR
  • In Situ hybridization

37
Ataxia Telangiectasia Mutated Gene: Probes
38
Resequencing Amplicon SNP Discovery
39
Gene Silencing Probes
40
Gene Expression: Probe to GENSAT
41
Gene Expression Nervous System Atlas
  • NINDS and Rockefeller University
  • BAC Transgenic Mouse Strains
  • Enhanced GFP expression
  • Driven by endogenous gene regulatory sequences
  • Histologic sections: mouse brain
  • Confocal microscopy: GFP reporter
  • Immunohistochemical staining: transgenics
  • In situ hybridization: native expression
  • Results: important for understanding brain
    microanatomy and development

42
GENSAT Dopamine Receptor 2 (Drd2)
43
Bright Field Immunostaining BAC Transgenic
44
dbSNP: Nucleotide Polymorphisms
45
NCBI's SNP Database
  • Primary Database and Derivative (RefSNP)
  • Single Nucleotide Polymorphism
  • Repeat polymorphisms
  • Insertion-Deletion Polymorphisms
  • 29 Species
  • Over 46 million submissions (submitted SNPs)
  • Over 26 million submissions (reference SNPs)

46
Submitted SNP
Hemochromatosis SNP
47
RefSNP
48
RefSNP Analysis: GeneView
49
Mapping to Structure
50
Population Genetics Genotype
51
The Gene Database
  • Gene Centered Information
  • Unifies LocusLink and microbial Genomes
  • 2.4 million records for 3,822 taxa

52
Genes MLH1 One Stop Shopping
53
Genes MLH1 One Stop Shopping (cont.)
54
Genes Display Options and Links
55
HomoloGene
  • Completely Annotated Eukaryotic Genomes
  • Homologous UniGene determined for other organisms
  • Protein similarities first
  • Guided by taxonomic tree
  • Includes orthologs and paralogs
  • Provides Neighboring function for Entrez Gene
    database

56
HomoloGene Cluster
57
WWW Access
Entrez BLAST
58
Entrez Database Integration
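[Diagram: Entrez databases and the links between them]
  • PubMed abstracts: neighbors computed by word
    weight
  • Nucleotide sequences: neighbors (related
    sequences) computed by BLAST
  • Protein sequences: neighbors (related sequences)
    computed by BLAST; BLink; Domains
  • 3-D Structure: neighbors (related structures)
    computed by VAST
  • Taxonomy: phylogeny
  • Genomes
  • Hard links connect records across databases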
59
Entrez Use Gene for everything
BLink
Homologene Gene Neighbors
60
Glutamate decarboxylases
  • Finding
  • Sequences
  • Expression Information
  • Molecular Probes
  • Homologs in other species
  • Genomic Context
  • Polymorphisms
  • Structures and Conserved Domains

www.ncbi.nlm.nih.gov
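
The searches on this slide are run interactively through the Entrez web pages. For readers who want to script a comparable lookup, the sketch below builds a search URL for NCBI's E-utilities, a programmatic interface to Entrez that the presentation itself does not cover. The helper name, the query string, and the field tags used in it are assumptions for illustration; the code only constructs the URL and does not contact NCBI.

```python
from urllib.parse import urlencode

# Base address of the E-utilities ESearch service.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def build_esearch_url(database, term, max_results=20):
    """Build an ESearch URL for one Entrez database and one query term."""
    params = {"db": database, "term": term, "retmax": max_results}
    return EUTILS_ESEARCH + "?" + urlencode(params)

# Example query for the mouse Gad2 gene in the Gene database.
# The field tags in square brackets are an assumption; check the
# Entrez help pages for the tags each database supports.
url = build_esearch_url("gene", "Gad2[Gene Name] AND Mus musculus[Organism]")
print(url)
```

Changing the database argument would point the same query pattern at a different Entrez database, in keeping with the idea of starting from Gene and following links outward.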
61
Mouse GAD Genes
62
GAD2 Gene Record
63
GAD2 Sequences
64
Gene Table Genomic Sequences
65
The Map Viewer Genomic Sequences
Download data here
Zoom or enter region of interest
66
Expression Data
GENSAT
GEO
UniGene
67
GenSat BAC transgenics
68
GEO Profiles
69
Preview / Index Refining Results
70
Gad2 Expression Profile
71
Genes with similar expression
72
Molecular Probes
73
RSA Primers (Human)
Known SNPs in Red
Download entire tiling set
74
Finding Homologs HomoloGene
Protein, mRNA, Genomic
75
HomoloGene Cluster
Gene Links
Protein Links
76
Homologous UniGene
77
UniGene Hs.231829 GAD2
78
Expression Profile
79
Finding Homologs 2 BLink
80
BLink: BLAST Link (Best Hits)
Redundant Proteins
First 200 only
Goldfish homolog
BLAST
81
Finding Polymorphisms Human
82
SNP GeneView
83
Related Structures structure model
84
Sequence Similar Structures
Conserved Domain
Link to Structure
Link to Alignment
85
Pig DOPA decarboxylase
Structure Neighbors
Cn3D viewer
3D Domain Neighbors
Conserved Domains
PubChem compound
86
Alignment Based Model
87
Better Model Conserved Domain
Ala-Gly from dbSNP
88
Service Addresses
  • info@ncbi.nlm.nih.gov
  • blast-help@ncbi.nlm.nih.gov