Analysis of Affymetrix GeneChip Data - PowerPoint PPT Presentation

About This Presentation
Title:

Analysis of Affymetrix GeneChip Data

Description:

For details on reading vignettes, see. the openVignette help page. openVignette() Please select (by number) a vignette. 1: affy primer. 2: affy: Built-in ... – PowerPoint PPT presentation

Number of Views:19
Avg rating:3.0/5.0
Slides: 32
Provided by: davidm133
Category:

less

Transcript and Presenter's Notes

Title: Analysis of Affymetrix GeneChip Data


1
Analysis of Affymetrix GeneChip Data
  • EPP 245/298
  • Statistical Analysis of
  • Laboratory Data

2
Basic Design of Expression Arrays
  • For each gene that is a target for the array, we
    have a known DNA sequence.
  • mRNA is reverse transcribed to DNA, and if a
    complementary sequence is on the on a chip, the
    DNA will be more likely to stick
  • The DNA is labeled with a dye that will fluoresce
    and generate a signal that is monotonic in the
    amount in the sample

3
Intron
Exon
TAAATCGATACGCATTAGTTCGACCTATCGAAGACCCAACACGGATTCGA
TACGTTAATATGACTACCTGCGCAACCCTAACGTCCATGTATCTAATACG
ATTTAGCTATGCGTAATCAAGCTGGATAGCTTCTGGGTTGTGCCTAAGC
TATGCAATTATACTGATGGACGCGTTGGGATTGCAGGTACATAGATTATG
C
Probe Sequence
  • cDNA arrays use variable length probes
  • Long oligoarrays use 60-70mers
  • Affymetrix GeneChips use multiple 25-mers
  • For each exon, 8-20 distinct probes
  • May overlap
  • May cover more than one exon
  • Affymetrix chips also use mismatch (MM) probes
    that have the same sequence as perfect match
    probes except for the middle base which is
    changed to inhibitbinding.
  • This is supposed to act as a control, but often
    instead binds to another mRNA species, so many
    analysts do not use them

4
Expression Indices
  • A key issue with Affymetrix chips is how to
    summarize the multiple data values on a chip for
    each probe set (aka gene).
  • There have been a large number of suggested
    methods.
  • Generally, the worst ones are those from Affy, by
    a long way worse means less able to detect real
    differences

5
Usable Methods
  • Li and Wongs dCHIP and follow on work is
    demonstrably better than MAS 4.0 and MAS 5.0, but
    not as good as RMA and GLA
  • The RMA method of Irizarry et al. is available in
    Bioconductor.
  • The GLA method (Durbin, Rocke, Zhou) is also
    available in Bioconductor

6
How to Install R and Bioconductor
  • Download R install file from website and run
    installation
  • Download Bioconductor as given below

gtsource("http//www.bioconductor.org/biocLite.R")
gtbiocLite()
7
Bioconductor Documentation
  • gt library(affy)
  • Welcome to Bioconductor
  • Vignettes contain introductory material.
    To view,
  • simply type openVignette()
  • For details on reading vignettes, see
  • the openVignette help page.
  • gt openVignette()
  • Please select (by number) a vignette
  • 1 affy primer
  • 2 affy Built-in Processing Methods
  • 3 affy Custom Processing Methods (HowTo)
  • 4 affy Automatic downloading of cdfenvs
    (HowTo)
  • 5 affy Import Methods (HowTo)
  • 6 Biobase Primer
  • 7 Howto Bioconductor
  • 8 HowTo HowTo
  • 9 esApply Introduction
  • Selection 6
  • 1 TRUE

8
The exprSet class
  • An object of class exprSet has several slots
  • exprs A matrix of expression levels with chips as
    columns and genes as rows.
  • se.exprs A matrix of standard errors for exprs if
    available.
  • phenoData An object of class phenoData containing
    phenotype or experimental data.

9
  • description A description of the experiment,
    object of class MIAME
  • annotation A character string indicating the base
    name for the associated annotation
  • notes Notes describing features or aspects of the
    data, experiment, etc.
  • The phenoData class has slots
  • pData A dataframe where rows are cases and
    columns are variables
  • varLabels A list of labels and descriptions for
    the variables
  • The number of rows in pData is the same as the
    number of columns in exprs.

10
gt library(affy) gt data(geneData) gt
class(geneData) 1 "matrix" gt dim(geneData) 1
500 26 gt data(geneCov) gt dim(geneCov) 1 26
3 gt covN lt- list(cov1 "Covariate 1 2
levels",cov2 "Covariate 2 2 levels",
cov3 lt- "Covariate 3 3 levels") gt pD lt-
new("phenoData",pDatageneCov, varLabels
covN) gt eSet lt- new("exprSet",exprsgeneData,pheno
DatapD) gt dim(eSet_at_exprs) 1 500 26
11
Reading Affy Data into R
  • The CEL files contain the data from an array. We
    will look at data from an older type of array,
    the U95A which contains 12,625 probe sets and
    409,600 probes.
  • The CDF file contains information relating probe
    pair sets to locations on the array. These are
    built into the affy package for standard types.

12
The ReadAffy function
  • ReadAffy() function reads all of the CEL files in
    the current working directory into an object of
    class AffyBatch
  • ReadAffy(widgetT) does so in a GUI that allows
    entry of other characteristics of the dataset
  • You can also specify filenames, phenotype or
    experimental data, and MIAME information

13
gt getClass("exprSet") Slots

Name
exprs se.exprs phenoData
description Class exprMatrix
exprMatrix phenoData characterORMIAME
annotation notes
character character gt
getClass("AffyBatch") Slots


Name
cdfName nrow ncol
exprs Class character
numeric numeric exprMatrix
se.exprs phenoData description
annotation notes exprMatrix
phenoData characterORMIAME character
character Extends "exprSet"
14
A Sample Experiment
  • This consists of 10 samples one each from 10 cell
    lines.
  • The cell lines are of 5 types
  • We have 10 Affymetrix U95A arrays

15
group lt- data.frame(as.factor(rep(15,each2))) co
vN lt- list(group "Type of Cell Line") pD lt-
new("phenoData",pDatagroup,varLabelscovN) myMIAM
E lt- new("MIAME",name"John Post, PhD",
lab"Jane Q. Investigator, MD",
contact"916-734-0000", title"Sample
Experiment for EPP 298") Data lt-
ReadAffy(filenamesc("LN0A.CEL","LN0B.CEL","LN1A.C
EL","LN1B.CEL", "LN2A.CEL","LN2B.CEL","LN
3A.CEL","LN3B.CEL","LN4A.CEL","LN4B.CEL"),
phenoDatapD,descriptionmyMIAME)
gt dim(Data_at_exprs) 1 409600 10 gt
Data_at_exprs15, LN0A.CEL LN0B.CEL LN1A.CEL
LN1B.CEL LN2A.CEL LN2B.CEL LN3A.CEL LN3B.CEL
LN4A.CEL LN4B.CEL 1, 137.0 103.0 120.0
133.0 124.0 188.0 183.0 143.0
117.0 93 2, 2484.0 2040.8 2065.3
1687.5 9454.5 7732.8 7813.5 9873.8
2286.8 1930 3, 274.0 146.0 145.0
133.0 159.0 190.0 185.0 153.0
133.0 141 4, 1966.8 2387.0 2465.0
1961.0 9753.0 8088.0 7047.0 14705.0
2593.0 2125 5, 71.5 51.5 65.0
58.5 73.3 68.3 66.3 87.3
58.8 53
16
gt Data_at_phenoData phenoData object with 1
variables and 10 cases varLabels
group Type of Cell Line gt
Data_at_phenoData_at_pData as.factor.rep.1.5..each...
2.. 1 1 2
1 3
2 4 2 5
3 6
3 7 4 8
4 9
5 10 5 gt
Data_at_description Experimenter name John Post,
PhD Laboratory Jane Q. Investigator, MD
Contact information 916-734-0000 Title Sample
Experiment for EPP 298 URL No abstract
available. Information is available on
preprocessing
17
Expression Indices
  • The 409,600 rows of the expression matrix in the
    AffyBatch object Data each correspond to a probe
    (25-mer)
  • Ordinarily to use this we need to combine the
    probe level data for each probe set into a single
    expression number
  • This has conceptually several steps

18
Steps in Expression Index Construction
  • Background correction is the process of adjusting
    the signals so that the zero point is similar on
    all parts of all arrays.
  • We like to manage this so that zero signal after
    background correction corresponds approximately
    to zero amount of the mRNA species that is the
    target of the probe set.

19
  • Data transformation is the process of changing
    the scale of the data so that it is more
    comparable from high to low.
  • Common transformations are the logarithm and
    generalized logarithm
  • Normalization is the process of adjusting for
    systematic differences from one array to another.
  • Normalization may be done before or after
    transformation, and before or after probe set
    summarization.

20
  • One may use only the perfect match (PM) probes,
    or may subtract or otherwise use the mismatch
    (MM) probes
  • There are many ways to summarize 20 PM probes and
    20 MM probes on 10 arrays (total of 200 numbers)
    into 10 expression index numbers

21
The RMA Method
  • Background correction that does not make 0 signal
    correspond to 0 amount
  • Quantile normalization
  • Log2 transform
  • Median polish summary of PM probes

22
gt eidata lt- rma(Data) Background
correcting Normalizing Calculating Expression gt
class(eidata) 1 "exprSet" attr(,"package") 1
"Biobase" gt dim(eidata_at_exprs) 1 12625 10
23
gt summary(eidata_at_exprs) LN0A.CEL
LN0B.CEL LN1A.CEL LN1B.CEL
Min. 2.724 Min. 2.598 Min. 2.634
Min. 2.655 1st Qu. 4.489 1st Qu.
4.469 1st Qu. 4.471 1st Qu. 4.482 Median
6.090 Median 6.071 Median 6.076
Median 6.086 Mean 6.135 Mean 6.139
Mean 6.135 Mean 6.142 3rd Qu.
7.450 3rd Qu. 7.485 3rd Qu. 7.474 3rd
Qu. 7.476 Max. 12.175 Max. 12.258
Max. 12.140 Max. 11.999 LN2A.CEL
LN2B.CEL LN3A.CEL LN3B.CEL
Min. 2.622 Min. 2.743 Min.
2.687 Min. 2.655 1st Qu. 4.461 1st
Qu. 4.474 1st Qu. 4.433 1st Qu. 4.447
Median 6.016 Median 6.067 Median 6.021
Median 6.034 Mean 6.124 Mean
6.139 Mean 6.130 Mean 6.132 3rd
Qu. 7.438 3rd Qu. 7.434 3rd Qu. 7.456
3rd Qu. 7.465 Max. 13.277 Max. 13.346
Max. 13.267 Max. 13.287 LN4A.CEL
LN4B.CEL Min. 2.775 Min.
2.667 1st Qu. 4.483 1st Qu. 4.449
Median 6.077 Median 6.053 Mean 6.136
Mean 6.134 3rd Qu. 7.469 3rd Qu.
7.488 Max. 12.058 Max. 12.153
24
The GLA Method
  • The Glog Average (GLA) method is simpler than the
    RMA method, though it can require estimation of a
    parameter
  • Background correction is intended to make a
    measured value of zero correspond to a zero
    quantity in the sample
  • Transformation uses the glog ln for large
    values
  • Normalization via lowess
  • Summary is a simple average of PM probes

25
gt source("gla.r")gt emat.gla lt- gla(Data)gt
class(emat.gla) 1 "matrix" gt dim(emat.gla) 1
12625 10
26
gt summary(emat.gla) X1 X2
X3 X4 Min. 3.859
Min. 3.801 Min. 3.811 Min. 3.816
1st Qu.4.948 1st Qu.4.939 1st Qu.4.942
1st Qu.4.942 Median 5.456 Median 5.449
Median 5.446 Median 5.453 Mean 5.569
Mean 5.573 Mean 5.569 Mean 5.566
3rd Qu.6.032 3rd Qu.6.052 3rd Qu.6.038
3rd Qu.6.034 Max. 8.794 Max. 8.724
Max. 8.733 Max. 8.731 X5
X6 X7 X8
Min. 3.800 Min. 3.896 Min. 3.792
Min. 3.866 1st Qu. 4.950 1st Qu. 4.952
1st Qu.4.945 1st Qu.4.936 Median 5.455
Median 5.459 Median 5.458 Median 5.441
Mean 5.595 Mean 5.578 Mean 5.597
Mean 5.578 3rd Qu. 6.058 3rd Qu.
6.030 3rd Qu.6.062 3rd Qu.6.036 Max.
10.083 Max. 10.270 Max. 9.948 Max.
9.976 X9 X10 Min.
3.907 Min. 3.835 1st Qu.4.949 1st
Qu.4.937 Median 5.452 Median 5.453
Mean 5.569 Mean 5.579 3rd Qu.6.033
3rd Qu.6.053 Max. 8.772 Max. 8.732
27
glog lt- function(y,lambda) yt lt-
log(ysqrt(y2lambda)) return(yt)
28
lnormlt- function(mat1,span.1) print ("Start
Lowess Normalization") mat2 lt-
as.matrix(mat1) p lt- dim(mat2)1 n lt-
dim(mat2)2 rmeans lt- apply(mat2,1,mean) rlt-r
norm(p,0,1e-7) rmeanslt-rrmeans rranks lt-
rank(rmeans) matsort lt- mat2order(rranks),
r0 lt- 1p lcol lt- function(x) lx lt-
lowess(r0,x,fspan)y returns vector of
length p lmeans lt- apply(matsort,2,lcol)
column lowess lgrand lt- apply(lmeans,1,mean)
grand lowess lgrand lt- matrix(rep(lgrand,n),
byrowF,ncoln) grand lowess matnorm0 lt-
matsort-lmeanslgrand normalized matrix
matnorm1 lt- matnorm0rranks, returns row
order to original print ("End Lowess
Normalization") return(matnorm1)
29
gla lt- function(AB,lambda1000,alpha50) gn
lt- geneNames(AB) I lt- length(gn) ind
will be a vector with i repeated as many times
as gni has PM probes This shows which gene
number is associated with each probe
indlt-vector() for (i in 1I) ind lt-
c(ind, rep(i, dim(pm(AB,gni))1)) mat
lt- pm(AB) EI lt- matrix(0, I, dim(mat)2)
mat.gloglt-lnorm(glog(mat-alpha,lambda)) for (i
in 1I) tmp lt- apply(mat.glogindi,,2,m
ean) EIi, lt- t(tmp) return(EI)
30
Probe Sets not Genes
  • It is unavoidable to refer to a probe set as
    measuring a gene, but nevertheless it can be
    deceptive
  • The annotation of a probe set may be based on
    homology with a gene of possibly known function
    in a different organism
  • Only a relatively few probe sets correspond to
    genes with known function and known structure in
    the organism being studied

31
Exercise (Optional)
  • Download the ten arrays from the web site, and
    the gla.r file
  • Load the arrays into R using Read.Affy and
    construct the RMA expression indices
  • Source the gla.r functions, and construct the GLA
    expression indices
Write a Comment
User Comments (0)
About PowerShow.com