Title: Bioconductor
1Bioconductor
- Sandrine Dudoit
- Division of Biostatistics, UC Berkeley
- www.stat.berkeley.edu/sandrine
- MGED7
- September 8, 2004
- Toronto, Canada
2Core Development Team
- Douglas Bates, University of Wisconsin, Madison, USA.
- Benjamin Bolstad, Division of Biostatistics, UC Berkeley, USA.
- Vincent Carey, Harvard Medical School, USA.
- Marcel Dettling, Federal Inst. Technology, Switzerland.
- Sandrine Dudoit, Division of Biostatistics, UC Berkeley, USA.
- Byron Ellis, Department of Statistics, Harvard University, USA.
- Laurent Gautier, Technical University of Denmark, Denmark.
- Robert Gentleman, Harvard Medical School, USA.
- Jeff Gentry, Dana-Farber Cancer Institute, USA.
- Kurt Hornik, Technische Universität Wien, Austria.
- Torsten Hothorn, Institut fuer Medizininformatik, Biometrie und Epidemiologie, Germany.
- Wolfgang Huber, DKFZ Heidelberg, Germany.
- Stefano Iacus, University of Milan, Italy.
- Rafael Irizarry, Department of Biostatistics, Johns Hopkins University, USA.
- Friedrich Leisch, Technische Universität Wien, Austria.
- James MacDonald, University of Michigan, USA.
- Martin Maechler, Federal Inst. Technology, Switzerland.
- Crispin Miller, The Paterson Institute Bioinformatics Group, UK.
- Colin Smith, NASA Center for Astrobioinformatics, USA.
3References
- Bioconductor: www.bioconductor.org
- software, data, and documentation (vignettes)
- training materials from short courses
- mailing list.
- R: www.r-project.org, cran.r-project.org
- software: base and contributed (CRAN)
- documentation
- newsletter: R News
- mailing list.
- Bioconductor Project Working Papers:
- www.bepress.com/bioconductor.
- Personal:
- www.stat.berkeley.edu/sandrine.
4Outline
- Overview of the Bioconductor Project.
- Getting Started: R and Bioconductor.
- Hands On!
5Overview of the Bioconductor Project
6Bioconductor
- Bioconductor is an open-source and open-development software project for the analysis of biomedical and genomic data.
- The project was started in the Fall of 2001 and includes 25 core developers in the US, Europe, and Australia.
- R and the R package system are used to design and distribute software.
- Semi-annual releases:
- v 1.0: May 2nd, 2002, 15 packages.
-
- v 1.4: May 17th, 2004, 81 packages.
- ArrayAnalyzer: commercial port of Bioconductor packages in S-Plus.
7Goals
- Provide access to powerful statistical and graphical methods for the analysis of biomedical and genomic data.
- Facilitate the integration of biological metadata from WWW in the analysis of experimental data.
- E.g. GenBank, GO, LocusLink, PubMed.
- Allow the rapid development of extensible, interoperable, and scalable software.
- Promote high-quality documentation and reproducible research.
- Provide training in computational and statistical methods.
8Bioconductor Packages
- Bioconductor software consists of R add-on packages.
- An R package is a structured collection of code (R, C, or other), documentation, and/or data for performing specific types of analyses.
- E.g. affy, cluster, graph, hexbin packages provide implementations of specialized statistical and graphical methods.
9Bioconductor Packages
- Statistical methods: cluster analysis, estimation and testing for linear and non-linear models (with possibly censored continuous and polychotomous outcomes), multiple hypothesis testing, resampling, visualization, etc.
- Biological assays: cell-based assays, DNA microarrays (transcript levels, DNA copy number from CGH), proteomics, SAGE, SELDI-TOF, SNP, etc.
- Biological metadata from WWW: GenBank, GO, KEGG, PubMed, etc.
- Interfaces with other languages: C, Java, Perl, Python, XML, etc. (Omega Project, www.omegahat.org).
- Interactions with other projects: BGL, GeneSpring, Graphviz, MAGE-ML, Resourcerer, etc.
- R as a broker.
10Bioconductor Packages
- Data packages
- Biological metadata: mappings between different gene identifiers (e.g., AffyID, GO ID, LocusID, PMID), CDF and probe sequence information for Affy arrays. E.g. hgu95av2, GO, KEGG.
- Experimental data: code, data, and documentation for specific experiments or projects.
- ALL: Chiaretti et al. (2004) ALL data.
- golubEsets: Golub et al. (2000) ALL/AML data.
- yeastCC: Spellman et al. (1998) yeast cell cycle.
- Course packages: code, data, documentation, and labs for the instruction of a particular course. E.g. EMBO03 course package.
11Bioconductor Packages
- Bioconductor provides two main classes of software packages.
- End-user packages:
- aimed at users unfamiliar with R or computer programming
- polished and easy-to-use interfaces to a wide variety of computational and statistical methods for the analysis of biomedical and genomic data.
- Developer packages: aimed at software developers, in the sense that they provide software to write software.
12Bioconductor Packages: Release 1.4, May 17th, 2004. Over 80 packages!
- General infrastructure: Biobase, Biostrings, DynDoc, reposTools, rhdf5, ruuid, tkWidgets, widgetTools.
- Annotation: annotate, AnnBuilder, metadata packages.
- Graphics: geneplotter, hexbin.
- Pre-processing Affymetrix oligonucleotide chip data: affy, affycomp, affydata, affylmGUI, affyPLM, annaffy, gcrma, makecdfenv, vsn.
- Pre-processing two-color spotted DNA microarray data: arrayMagic, arrayQuality, limma, limmaGUI, marray, vsn.
- Other assays: aCGH, DNAcopy, prada, PROcess, RSNPer, SAGElyzer.
- Differential gene expression: EBarrays, edd, factDesign, genefilter, limma, limmaGUI, multtest, ROC.
- Graphs and networks: graph, RBGL, Rgraphviz.
- Gene Ontology: GOstats, goTools.
- MAGE: RMAGEML.
N.B. Many new packages in Bioconductor development version.
13Ongoing Efforts
Many methods already implemented in CRAN packages.
- Variable/model selection
- Prediction
- Cluster analysis
- Resampling: bootstrap, cross-validation
- Multiple testing procedures
- Quality measures for microarray data
- Other biological data types: e.g., proteomics, sequence analysis
- Interactions with other projects
- Web services.
14Microarray Data Analysis
[Diagram: microarray data analysis pipeline]
- Input files: .gpr, .Spot (marray, limma, vsn); CEL, CDF (affy, vsn).
- Pre-processing produces an exprSet.
- Annotation: annotate, annaffy, metadata packages.
- Differential expression: edd, genefilter, limma, multtest, ROC, CRAN.
- Graphs and networks: graph, RBGL, Rgraphviz.
- Cluster analysis: CRAN class, cluster, MASS, mva.
- Prediction: CRAN class, e1071, ipred, LogitBoost, MASS, nnet, randomForest, rpart.
- Graphics: geneplotter, hexbin, CRAN.
15Microarray Data Analysis
- Pre-processing of
- spotted array data with marray packages
- Affymetrix array data with affy packages.
- List of differentially expressed genes from genefilter, limma, or multtest packages.
- Prediction of tumor class using randomForest package.
- Clustering of genes using cluster package.
- Use of annotate package
- to retrieve and search PubMed abstracts
- to generate an HTML report with links to LocusLink for each gene.
16marray
- Pre-processing two-color spotted array data
- diagnostic plots,
- robust adaptive normalization (lowess, loess).
maImage
maBoxplot
maPlot hexbin
17 arrayMagic
[Figure: Spatial effects. Panels: R, Rb, R-Rb (color scale by rank); another array: print-tip (color scale log(G), color scale rank(G)).]
18affy
- Pre-processing oligonucleotide chip data
- diagnostic plots,
- background correction,
- probe-level normalization,
- computation of expression measures.
plotAffyRNADeg
barplot.ProbeSet
image
plotDensity
19vsn
- Variance stabilization (shrinkage): more stable expression estimates in cases where there are few replicates.
- Model-based normalization: parameter estimation for affine calibration and additive-multiplicative error model.
20limma
- LInear Models for MicroArrays: pre-processing and differential expression.
- Pre-processing: background correction, normalization.
- Complex experimental designs, e.g., multifactorial.
- Empirical Bayes methods for identifying differentially expressed genes: t-statistics, F-statistics, posterior odds.
- Inference methods for duplicate spots and technical replicates.
- Analysis based on log-ratios or absolute log-intensities.
- Spot quality weights.
- Graphics: heat diagrams, Venn diagrams.
21limmaGUI
22aCGH
- Pre-processing: imputation of missing values (lowess), filtering.
- Visualization: measured and derived information as a function of genomic position.
- HMM-based algorithm for finding genomic events, e.g., copy number transitions and high-level amplifications.
- Perform and interpret tests for associations between clinical variables and copy number of individual loci as well as collective features of genomic profiles.
23Statistics and significance cut-off
Copy number transitions
Frequency plot
Genomic profile
24annotate, annaffy, and AnnBuilder
Metadata package hgu95av2: mappings between different gene identifiers for hgu95av2 chip.
- Assemble and process genomic annotation data from public repositories.
- Build annotation data packages or XML data documents.
- Associate experimental data in real time to biological metadata from web databases such as GenBank, GO, KEGG, LocusLink, and PubMed.
- Process and store query results: e.g., search PubMed abstracts.
- Generate HTML reports of analyses.
GENENAME: zinc finger protein 261
LOCUSID: 9203
ACCNUM: X95808
MAP: Xq13.1
AffyID: 41046_s_at
SYMBOL: ZNF261
PMID: 10486218, 9205841, 8817323
GO: GO:0003677, GO:0007275, GO:0016021
(many other mappings)
25 stats
heatmap
26R Cluster Analysis Packages
- cclust: convex clustering methods.
- class: self-organizing maps (SOM).
- cluster:
- AGglomerative NESting (agnes),
- Clustering LARge Applications (clara),
- DIvisive ANAlysis (diana),
- Fuzzy Analysis (fanny),
- MONothetic Analysis (mona),
- Partitioning Around Medoids (pam).
- e1071:
- fuzzy C-means clustering (cmeans),
- bagged clustering (bclust).
- flexmix: flexible mixture modeling.
- fpc: fixed point clusters, clusterwise regression and discriminant plots.
- GeneSOM: self-organizing maps.
- mclust, mclust98: model-based cluster analysis.
- mva:
- hierarchical clustering (hclust),
- k-means (kmeans).
Download these and other packages from CRAN.
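As a minimal sketch of the two methods the slide lists under mva, the code below runs hclust and kmeans on simulated data. The data, the object names, and the choice of two clusters are illustrative assumptions, not part of the original course material. In current R releases both functions are found in the stats package, which is attached by default.

```r
# Toy data: 20 simulated "genes" measured on 6 "samples" (not real data).
set.seed(1)
x <- rbind(matrix(rnorm(60, mean = 0), nrow = 10),
           matrix(rnorm(60, mean = 4), nrow = 10))
rownames(x) <- paste0("gene", seq_len(nrow(x)))

# Hierarchical clustering on Euclidean distances with average linkage.
hc <- hclust(dist(x), method = "average")
hc_groups <- cutree(hc, k = 2)   # cut the tree into two clusters

# k-means with two centers; several random starts reduce the risk
# of ending in a poor local optimum.
km <- kmeans(x, centers = 2, nstart = 10)

# Cross-tabulate the two partitions to see how far they agree.
agreement <- table(hierarchical = hc_groups, kmeans = km$cluster)
print(agreement)
```

Cluster labels are arbitrary, so agreement shows up as a table with one large count per row rather than as matching label numbers.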
27R Class Prediction Packages
Download these and other packages from CRAN.
- class:
- k-nearest neighbor (knn),
- learning vector quantization (lvq).
- classPP: projection pursuit.
- e1071: support vector machines (svm).
- ipred: bagging, resampling based estimation of prediction error.
- knnTree: k-nn classification with variable selection inside leaves of a tree.
- LogitBoost: boosting for tree stumps.
- MASS: linear and quadratic discriminant analysis (lda, qda).
- mlbench: machine learning benchmark problems.
- nnet: feed-forward neural networks and multinomial log-linear models.
- pamr: prediction analysis for microarrays.
- randomForest: random forests.
- rpart: classification and regression trees.
- sma: diagonal linear and quadratic discriminant analysis, naïve Bayes (stat.diag.da).
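A short sketch of k-nearest neighbor prediction with knn from the class package, one of the functions listed above. It assumes the class package is installed (it normally ships with R as a recommended package). The iris data set, the 100/50 train/test split, and k = 3 are illustrative choices that do not come from the slides.

```r
library(class)  # provides knn; assumed to be installed

set.seed(1)
data(iris)

# Random split: 100 observations for training, the remaining 50 for testing.
train_idx <- sample(nrow(iris), 100)
train_x <- iris[train_idx, 1:4]
test_x  <- iris[-train_idx, 1:4]
train_y <- iris$Species[train_idx]
test_y  <- iris$Species[-train_idx]

# Predict the class of each test observation from its 3 nearest
# training neighbors.
pred <- knn(train = train_x, test = test_x, cl = train_y, k = 3)

# Confusion matrix and test-set error rate.
confusion <- table(predicted = pred, observed = test_y)
test_error <- mean(pred != test_y)
print(confusion)
```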
28Getting Started: R and Bioconductor
29About R
- R Project (r-project.org): language and environment for statistical computing and graphics.
- R is an open-source implementation of the S language; S-Plus is a commercial implementation.
- Comprehensive R Archive Network, CRAN (cran.r-project.org): source code and pre-compiled binaries for Linux, Windows, MacOS; contributed packages; documentation; FAQs; mailing lists.
- Omega Project (www.omegahat.org): bi-directional intersystem interfaces, e.g., R/Java, R/Perl, R/Python, R/XML.
30Installation
- Main R software: download from CRAN (cran.r-project.org), use latest release, now 1.9.1.
- Bioconductor packages: download from Bioconductor (www.bioconductor.org), use latest release, now 1.4.
- Available for Linux/Unix, Windows, and MacOS.
31Installing R
- Latest release is R version 1.9.1.
- From CRAN:
- Sources.
- Linux: Debian (apt-get), Mandrake, RedHat RPMs, Suse, Vine.
- Windows: installer rw1091.exe; double-click on icon and follow instructions.
- MacOS X: RAqua.
- To customize installation, see R FAQs.
- May need to set some environment variables, e.g., R_HOME, R_LIBS, R_PROFILE.
32Installing Bioconductor
- After installing R, install Bioconductor packages using the getBioC install script.
- From R:
- > source("http://www.bioconductor.org/getBioC.R")
- > getBioC()
- Can customize installation via arguments of getBioC.
- Other packages (biological metadata, experimental data, courses) can be installed as described below, using Windows pull-down menus or R functions install.packages or installDataPackage.
33R Packages
- An R package is a structured collection of code (R, C, or other), documentation, and/or data for performing specific types of analyses.
- Packages:
- Base packages (CRAN): e.g., base, methods, nls, stats.
- Contributed packages (CRAN): e.g., ellipse, XML.
- Bioconductor packages: e.g., annotate, affy, marray, multtest, hgu95av2, ALL.
- In Linux, have a look at directory /usr/lib/R (or wherever you've installed R).
- In Windows, have a look at folders in C:\Program Files\R\rw1091.
34Installing vs. Loading
- Packages only need to be installed once, but they must be loaded with each new R session.
- Installing: functions install.packages, installDataPackage
- Unix: command R INSTALL
- Windows: Packages pull-down menu.
- Loading: function library
- Windows: Packages pull-down menu.
- > library(Biobase)
- Updating: function update.packages
- Windows: Packages pull-down menu.
35Starting and Quitting R
- Start: R command.
- Quit: q(). Prompted to save workspace image.
- Save:
- current environment with save.image (default is in .RData file)
- specific R objects with save.
- Can be read back using load.
- Working directory: getwd, setwd.
- List objects: ls, objects.
- Search path for R objects: search, searchpaths, attach, detach.
- Function arguments: e.g., ? lm or args(lm).
- R for Windows provides pull-down menus for the above actions.
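The session-management functions above can be tried out as follows. This is only a sketch: the data frame, the object names, and the file name fit.RData are made up for illustration, and a temporary directory is used so that no existing workspace file is overwritten.

```r
# Work in a temporary directory and remember where we started.
old_wd <- getwd()
setwd(tempdir())

# A small fitted model to have something worth saving (toy numbers).
fit_input <- data.frame(x = 1:5, y = c(2.1, 3.9, 6.2, 7.8, 10.1))
fit <- lm(y ~ x, data = fit_input)

print(ls())                     # list objects in the current environment
save(fit, file = "fit.RData")   # save one specific object
rm(fit)                         # remove it from the session
load("fit.RData")               # read it back under its original name
fit_restored <- exists("fit")

print(args(lm))                 # show the arguments of a function
print(search())                 # search path for R objects

setwd(old_wd)                   # restore the original working directory
```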
36Documentation and Help
- Manuals, FAQs, and tutorials available from R and Bioconductor websites and on-line in an R session.
- R on-line help system: detailed on-line documentation, available in text, HTML, PDF, and LaTeX formats.
- > help.start()
- > help(lm)
- > ? hclust
- > help.search(apropos="print")
- > apropos("mean")
- > example(hclust)
- > demo()
- > demo(image)
- > data()
- R and Bioconductor mailing lists: search archives, post.
- Short courses: lecture notes, computer labs, and course packages available on WWW for self-instruction.
- Vignettes: openVignette(), vExplorer().
- Google.
- All on WWW.
37Vignettes
- Bioconductor has adopted a new documentation paradigm, the vignette.
- A vignette is an executable document consisting of a collection of code chunks and documentation text chunks.
- Vignettes provide dynamic, integrated, and reproducible statistical documents that can be automatically updated if either data or analyses are changed.
- Vignettes can be generated using the Sweave function from the R tools package.
38Vignettes
- Each Bioconductor package contains at least one vignette, providing task-oriented descriptions of the package's functionality.
- Vignettes are located in the doc subdirectory of an installed package and are accessible from the help browser.
- Vignettes can be used interactively.
- Vignettes are also available separately from the Bioconductor website.
39Vignettes
- Tools are being developed for managing and using this repository of step-by-step tutorials:
- Biobase: openVignette. Menu of available vignettes and interface for viewing vignettes (PDF).
- tkWidgets: vExplorer. Interactive use of vignettes.
- reposTools.
40Vignettes
- HowTos: task-oriented descriptions of package functionality.
- Executable documents consisting of documentation text and code chunks.
- Dynamic, integrated, and reproducible statistical documents.
- Can be used interactively: vExplorer.
- Generated using Sweave (tools package).
vExplorer
41Hands On!
42Extra Slides
43Annotation
- One of the greatest challenges in analyzing genomic data is associating the experimental data with the available biological metadata, e.g., sequence, gene annotation, chromosomal maps, literature.
- It is essential to make these data available for computation.
- Bioconductor provides three main packages for this purpose:
- annotate (end-user)
- AnnBuilder (developer)
- annaffy (end-user).
44WWW Resources
- Nucleotide databases: e.g., GenBank.
- Gene databases: e.g., LocusLink, UniGene.
- Protein sequence and structure databases: e.g., Protein DataBank (PDB), SwissProt.
- Literature databases: e.g., PubMed, OMIM.
- Chromosome maps: e.g., NCBI Map Viewer.
- Pathways: e.g., KEGG.
- Entrez is a search and retrieval system that
integrates information from databases at NCBI
(National Center for Biotechnology Information).
45annotate Matching IDs
- Important tasks:
- Associate manufacturers' or in-house probe identifiers to other available identifiers. E.g.
- Affymetrix IDs -> LocusLink LocusID
- Affymetrix IDs -> GenBank accession number.
- Associate probes with biological data such as chromosomal position, pathway membership.
- Associate probes with published literature data via PubMed (need PMID).
46annotate Matching IDs
47annotate Versioning
- It is important to keep version information for the mappings.
- It is important to allow for new mappings to be used when they become available.
- There are some interesting challenges and concerns that arise when comparing the strategies of on-line mappings versus compiled mappings.
48Annotation Data Packages
- The Bioconductor project provides annotation data packages that contain many different mappings.
- Mappings between Affy IDs and other probe IDs: hgu95av2 for HGU95Av2 GeneChip series; also hgu133a, hu6800, mgu74a, rgu34a, YG.
- Affy CDF data packages.
- Probe sequence data packages.
- These packages are updated and expanded regularly as new data become available.
- They can be downloaded from the Bioconductor website and also using installDataPackage.
- DPExplorer: a widget for interacting with data packages.
- AnnBuilder: tools for building annotation data packages.
49annotate Matching IDs
- Much of what annotate does relies on matching symbols.
- This is basically the role of a hash table in most programming languages.
- In R, we rely on environments.
- The annotation data packages provide R environment objects containing key and value pairs for the mappings between two sets of probe identifiers.
- Keys can be accessed using the R ls function.
- Matching values in different environments can be accessed using the get or multiget functions.
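The hash-table role of environments can be illustrated with base R alone. In this sketch the environments symbol_env and locusid_env are hypothetical stand-ins for the objects an annotation data package would provide. Only the 41046_s_at entries are taken from the slides; the second key is invented. mget is base R's vectorised lookup and is used here on the assumption that it plays a role similar to multiget.

```r
# Environment mapping probe IDs to gene symbols.
symbol_env <- new.env(hash = TRUE)
assign("41046_s_at", "ZNF261", envir = symbol_env)
assign("hypothetical_at", "GENE2", envir = symbol_env)   # made-up entry

# A second environment mapping the same probe ID to a LocusLink ID.
locusid_env <- new.env(hash = TRUE)
assign("41046_s_at", 9203, envir = locusid_env)

# Keys of an environment are listed with ls.
keys <- ls(symbol_env)

# The same key retrieves matching values from different environments.
sym <- get("41046_s_at", envir = symbol_env)
ll  <- get("41046_s_at", envir = locusid_env)

# Look up several keys at once; the result is a named list.
syms <- mget(c("41046_s_at", "hypothetical_at"), envir = symbol_env)

# Check for a key without triggering an error on a miss.
has_key <- exists("not_a_probe", envir = symbol_env, inherits = FALSE)
```

Calling get on a missing key stops with an error, which is why the exists check with inherits = FALSE is useful before a lookup.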
50annotate Matching IDs
- > library(hgu95av2)
- > get("41046_s_at", env = hgu95av2ACCNUM)
- [1] "X95808"
- > get("41046_s_at", env = hgu95av2LOCUSID)
- [1] "9203"
- > get("41046_s_at", env = hgu95av2SYMBOL)
- [1] "ZNF261"
- > get("41046_s_at", env = hgu95av2GENENAME)
- [1] "zinc finger protein 261"
- > get("41046_s_at", env = hgu95av2SUMFUNC)
- [1] "Contains a putative zinc-binding motif (MYM)Proteome"
- > get("41046_s_at", env = hgu95av2UNIGENE)
- [1] "Hs.9568"
51annotate Matching IDs
- > get("41046_s_at", env = hgu95av2CHR)
- [1] "X"
- > get("41046_s_at", env = hgu95av2CHRLOC)
- X
- -68692698
- > get("41046_s_at", env = hgu95av2MAP)
- [1] "Xq13.1"
- > get("41046_s_at", env = hgu95av2PMID)
- [1] "10486218" "9205841" "8817323"
- > get("41046_s_at", env = hgu95av2GO)
- TAS TAS IEA
- "GO:0003677" "GO:0007275" "GO:0016021"
52annotate Matching IDs
- Instead of relying on the general R functions for environments, new user-friendly functions have been written for accessing and working with specific identifiers.
- E.g. getGO, getGOdesc, getLL, getPMID, getSYMBOL.
53annotate Matching IDs
- > getSYMBOL("41046_s_at", data = "hgu95av2")
- 41046_s_at
- "ZNF261"
- > gg1 <- getGO("41046_s_at", data = "hgu95av2")
- > getGOdesc(gg1, "MF")
- "GO:0003677"
- "DNA binding activity"
- > getLL("41046_s_at", data = "hgu95av2")
- 41046_s_at
- 9203
- > getPMID("41046_s_at", data = "hgu95av2")
- "41046_s_at"
- [1] 10486218 9205841 8817323
54annotate WWW Queries
- The annotate package provides tools for
- Querying and processing information from various WWW biological databases:
- GenBank,
- LocusLink,
- PubMed.
- Regular expression searching of PubMed abstracts.
- Generating nice HTML reports of analyses, with
links to biological databases.
55annotate WWW Queries
- Functions for querying WWW databases from R rely on the browseURL function:
- browseURL("www.r-project.org")
- Other tools: HTMLPage class, getTDRows, getQueryLink, getQuery4UG, getQuery4LL, makeAnchor.
- The XML package is used to parse query results.
56annotate Querying GenBank www.ncbi.nlm.nih.gov/Genbank/index.html
- Given a vector of GenBank accession numbers or NCBI UIDs, the genbank function
- opens a browser at the URLs for the corresponding GenBank queries
- returns an XMLdoc object with the same data.
- > genbank("X95808", disp = "browser")
- http://www.ncbi.nih.gov/entrez/query.fcgi?tool=bioconductor&cmd=Search&db=Nucleotide&term=X95808
- > genbank(1430782, disp = "data", type = "uid")
57annotate Querying LocusLink www.ncbi.nlm.nih.gov/LocusLink/
- locuslinkByID: given one or more LocusIDs, the browser is opened at the URL corresponding to the first gene.
- > locuslinkByID(9203)
- http://www.ncbi.nih.gov/LocusLink/LocRpt.cgi?l=9203
- locuslinkQuery: given a search string, the results of the LocusLink query are displayed in the browser.
- > locuslinkQuery("zinc finger")
- http://www.ncbi.nih.gov/LocusLink/list.cgi?Q=zinc finger&ORG=Hs&V=0
- getQuery4LL.
58annotate Querying PubMed www.ncbi.nlm.nih.gov
- For any gene there is often a large amount of data available from PubMed.
- The annotate package provides the following tools for interacting with PubMed:
- pubMedAbst: a class structure for PubMed abstracts in R.
- pubmed: the basic engine for talking to PubMed (pmidQuery).
59annotate pubMedAbst Class
- Class structure for storing and processing PubMed abstracts in R, with slots:
- pmid
- authors
- abstText
- articleTitle
- journal
- pubDate
60annotate High-Level Tools for PubMed
- pm.getabst: download the specified PubMed abstracts (stored in XML) and create a list of pubMedAbst objects.
- pm.titles: extract the titles from a list of PubMed abstracts.
- pm.abstGrep: regular expression matching on the abstracts.
61annotate PubMed Example
- > pmid <- getPMID("41046_s_at", data = "hgu95av2")
- > pubmed(pmid, disp = "browser")
- http://www.ncbi.nih.gov/entrez/query.fcgi?tool=bioconductor&cmd=Retrieve&db=PubMed&list_uids=10486218%2c9205841%2c8817323
- > absts <- pm.getabst("41046_s_at", base = "hgu95av2")
- > pm.titles(absts)
- > pm.abstGrep("mouse", absts[[1]])
62annotate PubMed Example
63annotate PubMed HTML Report
- The function pmAbst2HTML takes a list of pubMedAbst objects and generates an HTML report with the titles of the abstracts and links to their full page on PubMed.
- > pmAbst2HTML(absts[[1]], filename = "pm.html")
64pmAbst2html function from annotate package
pm.html
65annotate Analysis Reports
- A simple interface, ll.htmlpage, can be used to generate an HTML report of analysis results.
- The page consists of a table with one row per gene, with links to LocusLink.
- Entries can include various gene identifiers and statistics.
66ll.htmlpage function from annotate package
genelist.html
67Data Complexity
- Dimensionality.
- Dynamic/evolving data: e.g., gene annotation, sequence, literature.
- Multiple data sources and locations: in-house, WWW.
- Multiple data types: numeric, textual, graphical.
- No longer a single n x p data matrix X!
- We distinguish between biological metadata and
experimental metadata.
68Experimental Metadata
- Gene expression measures
- scanned images, i.e., raw data
- image quantitation data, i.e., output from image analysis
- normalized expression measures, i.e., log ratios or Affy expression measures.
- Reliability/quality information for the expression measures.
- Information on the probe sequences printed on the arrays (array layout).
- Information on the target samples hybridized to the arrays.
- See Minimum Information About a Microarray Experiment (MIAME) standards and new MAGEML package.
69Biological Metadata
- Biological attributes that can be applied to the experimental data.
- E.g. for genes:
- chromosomal location
- gene annotation (LocusLink, GO)
- relevant literature (PubMed).
- Biological metadata sets are large, evolving rapidly, and typically distributed via the WWW.
- Tools: annotate, annaffy, and AnnBuilder packages, and annotation data packages.
70OOP
- The Bioconductor project has adopted the object-oriented programming (OOP) paradigm proposed in J. M. Chambers (1998). Programming with Data.
- This object-oriented class/method design allows efficient representation and manipulation of large and complex biological datasets of multiple types.
- Tools for programming using the class/method mechanism are provided in the R methods package.
- Tutorial: www.omegahat.org/RSMethods/index.html.
71OOP Classes
- A class provides a software abstraction of a real world object. It reflects how we think of certain objects and what information these objects should contain.
- Classes are defined in terms of slots which contain the relevant data.
- An object is an instance of a class.
- A class defines the structure, inheritance, and initialization of objects.
72OOP Methods
- A method is a function that performs an action on data (objects).
- Methods define how a particular function should behave depending on the class of its arguments.
- Methods allow computations to be adapted to particular data types, i.e., classes.
- A generic function is a dispatcher: it examines its arguments and determines the appropriate method to invoke.
- Examples of generic functions in R include plot, summary, print.
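The class, slot, generic, and method vocabulary of the last two slides can be sketched with the methods package. The class ToyExpr and the generic nGenes are hypothetical names introduced only for this illustration; they are not part of Bioconductor.

```r
library(methods)

# A class is defined in terms of slots that hold the relevant data.
setClass("ToyExpr",
         representation(exprs = "matrix", notes = "character"))

# A generic function is a dispatcher: it chooses a method based on
# the class of its argument.
setGeneric("nGenes", function(object) standardGeneric("nGenes"))

# Method of the new generic for ToyExpr objects.
setMethod("nGenes", "ToyExpr", function(object) nrow(object@exprs))

# Method of the existing generic 'show', used when the object is printed.
setMethod("show", "ToyExpr", function(object) {
  cat("ToyExpr with", nrow(object@exprs), "genes and",
      ncol(object@exprs), "samples\n")
})

# An object is an instance of a class.
obj <- new("ToyExpr",
           exprs = matrix(rnorm(12), nrow = 4, ncol = 3),
           notes = "simulated values")

n <- nGenes(obj)            # dispatches to the ToyExpr method
print(obj)                  # dispatches to the show method above
is_toy <- is(obj, "ToyExpr")
```

Slots are typed, so new stops with an error if, for example, exprs is given something that is not a matrix.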
73exprSet Class
Processed Affymetrix or spotted array data
- exprs: Matrix of expression measures, genes x samples
- se.exprs: Matrix of SEs for expression measures, genes x samples
- phenoData: Sample level covariates, instance of class phenoData
- annotation: Name of annotation data
- description: MIAME information
- notes: Any notes
- Use of object-oriented programming to deal with data complexity.
- S4 class/method mechanism (methods package).
74marrayRaw Class
Pre-normalization intensity data for a batch of
arrays
- maRf, maGf: Matrix of red and green foreground intensities
- maRb, maGb: Matrix of red and green background intensities
- maW: Matrix of spot quality weights
- maLayout: Array layout parameters (marrayLayout)
- maGnames: Description of spotted probe sequences (marrayInfo)
- maTargets: Description of target samples (marrayInfo)
- maNotes: Any notes
75AffyBatch Class
Probe-level intensity data for a batch of arrays
(same CDF)
- cdfName: Name of CDF file for arrays in the batch
- nrow, ncol: Dimensions of the array
- exprs, se.exprs: Matrices of probe-level intensities and SEs; rows correspond to probe cells, columns to arrays.
- phenoData: Sample level covariates, instance of class phenoData
- annotation: Name of annotation data
- description: MIAME information
- notes: Any notes
76Sweave
- The Sweave system allows the generation of dynamic, integrated, and reproducible statistical documents intermixing text, code, and code output (textual and graphical).
- Functions are available in the R tools package.
- See ? Sweave and manual: www.ci.tuwien.ac.at/leisch/Sweave/.
77Sweave Input
- Input: a text file which consists of a sequence of code chunks and documentation text chunks (noweb file).
- Documentation chunks
- start with @
- text in a markup language like LaTeX.
- Code chunks
- start with <<name>>=
- R or S-Plus code.
- File extension: .rnw, .Rnw, .snw, .Snw.
78Sweave Output
- Output: a single document, e.g., .tex file or .pdf file, containing
- the documentation text,
- the R code,
- the code output: text and graphs.
- The document can be automatically regenerated whenever the data, code, or documentation text change.
- Stangle or tangleToR: extract only the code chunks.
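A small sketch of the noweb input format and of code extraction with Stangle. The file names and chunk contents are invented for illustration. The call is written as utils::Stangle on the assumption of a current R release, where Sweave and Stangle are provided by the utils package rather than tools.

```r
# Write a tiny noweb (.Rnw) file: documentation text, then one code chunk
# that starts with <<name>>= and ends with @.
rnw <- file.path(tempdir(), "main.Rnw")
writeLines(c(
  "\\documentclass{article}",
  "\\begin{document}",
  "The mean of the toy data is computed below.",
  "<<toy-mean>>=",
  "x <- c(1, 2, 3)",
  "mean(x)",
  "@",
  "\\end{document}"
), rnw)

# Stangle extracts only the code chunks into an R script.
r_out <- file.path(tempdir(), "main.R")
utils::Stangle(rnw, output = r_out)

code_lines <- readLines(r_out)
print(code_lines)
```

Running Sweave on the same file would instead produce a .tex file containing the text, the code, and the code output.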
79Sweave
[Diagram: Sweave workflow]
- main.Rnw -> Stangle -> main.R
- main.Rnw -> Sweave -> main.tex, fig.pdf, fig.eps
- main.tex -> latex -> main.dvi -> dvips -> main.ps
- main.tex -> pdflatex -> main.pdf
80Widgets
- Widgets: small-scale graphical user interfaces (GUI), providing point-and-click access for specific tasks.
- E.g. file browsing and selection for data input, basic analyses.
- Packages:
- tkWidgets: dataViewer, fileBrowser, fileWizard, importWizard, objectBrowser.
- widgetTools.
81Widgets
Reading in phenoData
tkSampleNames
tkphenoData
tkMIAME