1
Bioconductor
  • Sandrine Dudoit
  • Division of Biostatistics, UC Berkeley
  • www.stat.berkeley.edu/sandrine
  • MGED7
  • September 8, 2004
  • Toronto, Canada

2
Core Development Team
  • Douglas Bates, University of Wisconsin,
    Madison, USA.
  • Benjamin Bolstad, Division of
    Biostatistics, UC Berkeley, USA.
  • Vincent Carey, Harvard Medical School,
    USA.
  • Marcel Dettling, Federal Inst. Technology,
    Switzerland.
  • Sandrine Dudoit, Division of
    Biostatistics, UC Berkeley, USA.
  • Byron Ellis, Department of Statistics,
    Harvard University, USA.
  • Laurent Gautier, Technical University of
    Denmark, Denmark.
  • Robert Gentleman, Harvard Medical School,
    USA.
  • Jeff Gentry, Dana-Farber Cancer Institute,
    USA.
  • Kurt Hornik, Technische Universität Wien,
    Austria.
  • Torsten Hothorn, Institut fuer
    Medizininformatik, Biometrie und Epidemiologie,
    Germany.
  • Wolfgang Huber, DKFZ Heidelberg, Germany.
  • Stefano Iacus, University of Milan, Italy.
  • Rafael Irizarry, Department of
    Biostatistics, Johns Hopkins University, USA.
  • Friedrich Leisch, Technische Universität
    Wien, Austria.
  • James MacDonald, University of Michigan, USA.
  • Martin Maechler, Federal Inst. Technology,
    Switzerland.
  • Crispin Miller, The Paterson Institute
    Bioinformatics Group, UK.
  • Colin Smith, NASA Center for Astrobioinformatics,
    USA.

3
References
  • Bioconductor www.bioconductor.org
  • software, data, and documentation (vignettes)
  • training materials from short courses
  • mailing list.
  • R www.r-project.org, cran.r-project.org
  • software base and contributed (CRAN)
  • documentation
  • newsletter R News
  • mailing list.
  • Bioconductor Project Working Papers
  • www.bepress.com/bioconductor.
  • Personal
  • www.stat.berkeley.edu/sandrine.

4
Outline
  • Overview of the Bioconductor Project.
  • Getting Started: R and Bioconductor.
  • Hands On!

5
Overview of the Bioconductor Project
6
Bioconductor
  • Bioconductor is an open-source and
    open-development software project for the
    analysis of biomedical and genomic data.
  • The project was started in the Fall of 2001 and
    includes 25 core developers in the US, Europe,
    and Australia.
  • R and the R package system are used to design and
    distribute software.
  • Semi-annual releases:
  • v 1.0: May 2nd, 2002, 15 packages.
  • v 1.4: May 17th, 2004, 81 packages.
  • ArrayAnalyzer: commercial port of Bioconductor
    packages in S-Plus.

7
Goals
  • Provide access to powerful statistical and
    graphical methods for the analysis of biomedical
    and genomic data.
  • Facilitate the integration of biological metadata
    from WWW in the analysis of experimental data.
  • E.g. GenBank, GO, LocusLink, PubMed.
  • Allow the rapid development of extensible,
    interoperable, and scalable software.
  • Promote high-quality documentation and
    reproducible research.
  • Provide training in computational and statistical
    methods.

8
Bioconductor Packages
  • Bioconductor software consists of R add-on
    packages.
  • An R package is a structured collection of code
    (R, C, or other), documentation, and/or data for
    performing specific types of analyses.
  • E.g. affy, cluster, graph, hexbin packages
    provide implementations of specialized
    statistical and graphical methods.

9
Bioconductor Packages
  • Statistical methods: cluster analysis, estimation
    and testing for linear and non-linear models
    (with possibly censored continuous and
    polychotomous outcomes), multiple hypothesis
    testing, resampling, visualization, etc.
  • Biological assays: cell-based assays, DNA
    microarrays (transcript levels, DNA copy number
    from CGH), proteomics, SAGE, SELDI-TOF, SNP, etc.
  • Biological metadata from WWW: GenBank, GO, KEGG,
    PubMed, etc.
  • Interfaces with other languages: C, Java, Perl,
    Python, XML, etc. See the Omega Project
    (www.omegahat.org).
  • Interactions with other projects: BGL,
    GeneSpring, Graphviz, MAGE-ML, Resourcerer, etc.
  • R as a broker.

10
Bioconductor Packages
  • Data packages:
  • Biological metadata: mappings between different
    gene identifiers (e.g., AffyID, GO ID, LocusID,
    PMID), CDF and probe sequence information for
    Affy arrays.
  • E.g. hgu95av2, GO, KEGG.
  • Experimental data: code, data, and documentation
    for specific experiments or projects.
  • ALL: Chiaretti et al. (2004) ALL data.
  • golubEsets: Golub et al. (2000) ALL/AML data.
  • yeastCC: Spellman et al. (1998) yeast cell
    cycle.
  • Course packages: code, data, documentation, and
    labs for the instruction of a particular course.
    E.g. EMBO03 course package.

11
Bioconductor Packages
  • Bioconductor provides two main classes of
    software packages.
  • End-user packages:
  • aimed at users unfamiliar with R or computer
    programming;
  • polished and easy-to-use interfaces to a wide
    variety of computational and statistical methods
    for the analysis of biomedical and genomic data.
  • Developer packages: aimed at software developers,
    in the sense that they provide software to write
    software.

12
Bioconductor Packages: Release 1.4, May 17th, 2004
Over 80 packages!
  • General infrastructure: Biobase, Biostrings,
    DynDoc, reposTools, rhdf5, ruuid, tkWidgets,
    widgetTools.
  • Annotation: annotate, AnnBuilder, metadata
    packages.
  • Graphics: geneplotter, hexbin.
  • Pre-processing Affymetrix oligonucleotide chip
    data: affy, affycomp, affydata, affylmGUI,
    affyPLM, annaffy, gcrma, makecdfenv, vsn.
  • Pre-processing two-color spotted DNA microarray
    data: arrayMagic, arrayQuality, limma, limmaGUI,
    marray, vsn.
  • Other assays: aCGH, DNAcopy, prada, PROcess,
    RSNPer, SAGElyzer.
  • Differential gene expression: EBarrays, edd,
    factDesign, genefilter, limma, limmaGUI,
    multtest, ROC.
  • Graphs and networks: graph, RBGL, Rgraphviz.
  • Gene Ontology: GOstats, goTools.
  • MAGE: RMAGEML.

N.B. Many new packages in Bioconductor development version.
13
Ongoing Efforts
Many methods already implemented in CRAN packages.
  • Variable/model selection
  • Prediction
  • Cluster analysis
  • Resampling: bootstrap, cross-validation
  • Multiple testing procedures
  • Quality measures for microarray data
  • Other biological data types: e.g., proteomics,
    sequence analysis
  • Interactions with other projects
  • Web services.

14
Microarray Data Analysis
  • Input: .gpr, .Spot files (marray, limma, vsn);
    CEL, CDF files (affy, vsn).
  • Pre-processing produces an exprSet.
  • Annotation: annotate, annaffy, metadata packages.
  • Differential expression: edd, genefilter, limma,
    multtest, ROC, CRAN.
  • Graphs and networks: graph, RBGL, Rgraphviz.
  • Cluster analysis: CRAN packages class, cluster,
    MASS, mva.
  • Prediction: CRAN packages class, e1071, ipred,
    LogitBoost, MASS, nnet, randomForest, rpart.
  • Graphics: geneplotter, hexbin, CRAN.
15
Microarray Data Analysis
  • Pre-processing of
  • spotted array data with marray packages
  • Affymetrix array data with affy packages.
  • List of differentially expressed genes from
    genefilter, limma, or multtest packages.
  • Prediction of tumor class using randomForest
    package.
  • Clustering of genes using cluster package.
  • Use of annotate package
  • to retrieve and search PubMed abstracts
  • to generate an HTML report with links to
    LocusLink for each gene.

16
marray
  • Pre-processing two-color spotted array data
  • diagnostic plots,
  • robust adaptive normalization (lowess, loess).
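
The lowess-based, intensity-dependent normalization mentioned on this slide can be sketched in base R. This is only an illustration on simulated data, not the marray implementation; the variable names (A, M, Mnorm), the simulated dye bias, and the smoother span f = 0.4 are all assumptions made for the example.

```r
# Simulated two-color data: A = average log-intensity, M = log-ratio.
set.seed(1)
n <- 500
A <- runif(n, min = 6, max = 14)
# Intensity-dependent dye bias plus noise (purely illustrative).
M <- 0.5 + 0.1 * (A - 10) + rnorm(n, sd = 0.3)

# Robust local regression of M on A (lowess from the stats package).
fit <- lowess(A, M, f = 0.4)

# lowess() returns fitted values ordered by A; map them back to each spot.
trend <- approx(fit$x, fit$y, xout = A)$y

# Normalized log-ratios: subtract the intensity-dependent trend.
Mnorm <- M - trend

# Compare the average log-ratio before and after normalization.
round(c(before = mean(M), after = mean(Mnorm)), 3)
```

After subtraction, the log-ratios should be centered near zero across the intensity range, which is the goal of this kind of within-array normalization.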

Figures: maImage, maBoxplot, maPlot, hexbin.
17
arrayMagic
Figure: spatial effects. Panels: R, Rb, and R-Rb with color scale by rank; another array (print-tip); color scale log(G); color scale rank(G).
18
affy
  • Pre-processing oligonucleotide chip data
  • diagnostic plots,
  • background correction,
  • probe-level normalization,
  • computation of expression measures.

Figures: plotAffyRNADeg, barplot.ProbeSet, image, plotDensity.
19
vsn
  • Variance stabilization (shrinkage): more stable
    expression estimates in cases where there are few
    replicates.
  • Model-based normalization: parameter estimation
    for affine calibration and additive-multiplicative
    error model.

20
limma
  • LInear Models for MicroArrays: pre-processing
    and differential expression.
  • Pre-processing: background correction,
    normalization.
  • Complex experimental designs, e.g.,
    multifactorial.
  • Empirical Bayes methods for identifying
    differentially expressed genes: t-statistics,
    F-statistics, posterior odds.
  • Inference methods for duplicate spots and
    technical replicates.
  • Analysis based on log-ratios or absolute
    log-intensities.
  • Spot quality weights.
  • Graphics: heat diagrams, Venn diagrams.

21
limmaGUI
22
aCGH
  • Pre-processing: imputation of missing values
    (lowess), filtering.
  • Visualization: measured and derived information
    as a function of genomic position.
  • HMM-based algorithm for finding genomic events,
    e.g., copy number transitions and high-level
    amplifications.
  • Perform and interpret tests for associations
    between clinical variables and copy number of
    individual loci as well as collective features of
    genomic profiles.

23
Statistics and significance cut-off
Copy number transitions
Frequency plot
Genomic profile
24
annotate, annaffy, and AnnBuilder
Metadata package hgu95av2: mappings between
different gene identifiers for the hgu95av2 chip.
  • Assemble and process genomic annotation data from
    public repositories.
  • Build annotation data packages or XML data
    documents.
  • Associate experimental data in real time to
    biological metadata from web databases such as
    GenBank, GO, KEGG, LocusLink, and PubMed.
  • Process and store query results, e.g., search
    PubMed abstracts.
  • Generate HTML reports of analyses.

Example mappings:
  • AffyID: 41046_s_at
  • GENENAME: zinc finger protein 261
  • LOCUSID: 9203
  • ACCNUM: X95808
  • MAP: Xq13.1
  • SYMBOL: ZNF261
  • PMID: 10486218, 9205841, 8817323
  • GO: GO:0003677, GO:0007275, GO:0016021
  • many other mappings
25
stats

heatmap
26
R Cluster Analysis Packages
  • cclust: convex clustering methods.
  • class: self-organizing maps (SOM).
  • cluster:
  • AGglomerative NESting (agnes),
  • Clustering LARge Applications (clara),
  • DIvisive ANAlysis (diana),
  • Fuzzy Analysis (fanny),
  • MONothetic Analysis (mona),
  • Partitioning Around Medoids (pam).
  • e1071:
  • fuzzy C-means clustering (cmeans),
  • bagged clustering (bclust).
  • flexmix: flexible mixture modeling.
  • fpc: fixed point clusters, clusterwise regression
    and discriminant plots.
  • GeneSOM: self-organizing maps.
  • mclust, mclust98: model-based cluster analysis.
  • mva:
  • hierarchical clustering (hclust),
  • k-means (kmeans).

Download these and other packages from CRAN.
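
As a small illustration of two of the methods listed above, the sketch below runs hclust and kmeans on a made-up matrix. The data, the object names, and the choice of average linkage and k = 2 are assumptions for the example only. In current R releases both functions are found in the stats package.

```r
# Toy expression matrix: 6 "genes" (rows) measured on 4 "samples" (columns).
# Rows 1-3 are simulated around 0, rows 4-6 around 5.
set.seed(42)
x <- rbind(matrix(rnorm(12, mean = 0, sd = 0.2), nrow = 3),
           matrix(rnorm(12, mean = 5, sd = 0.2), nrow = 3))
rownames(x) <- paste0("gene", 1:6)

# Hierarchical clustering on Euclidean distances between genes.
hc <- hclust(dist(x), method = "average")
hc_groups <- cutree(hc, k = 2)

# k-means with two centers; nstart guards against a poor random start.
km <- kmeans(x, centers = 2, nstart = 10)

# Cross-tabulate the two partitions.
table(hclust = hc_groups, kmeans = km$cluster)
```

On data this well separated, both methods should recover the same two groups, although the cluster labels themselves are arbitrary.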
27
R Class Prediction Packages
Download these and other packages from CRAN.
  • class:
  • k-nearest neighbor (knn),
  • learning vector quantization (lvq).
  • classPP: projection pursuit.
  • e1071: support vector machines (svm).
  • ipred: bagging, resampling based estimation of
    prediction error.
  • knnTree: k-nn classification with variable
    selection inside leaves of a tree.
  • LogitBoost: boosting for tree stumps.
  • MASS: linear and quadratic discriminant analysis
    (lda, qda).
  • mlbench: machine learning benchmark problems.
  • nnet: feed-forward neural networks and
    multinomial log-linear models.
  • pamr: prediction analysis for microarrays.
  • randomForest: random forests.
  • rpart: classification and regression trees.
  • sma: diagonal linear and quadratic discriminant
    analysis, naïve Bayes (stat.diag.da).

28
Getting Started: R and Bioconductor
29
About R
  • R Project (r-project.org): language and
    environment for statistical computing and
    graphics.
  • R is an open-source implementation of the S
    language; S-Plus is a commercial implementation.
  • Comprehensive R Archive Network, CRAN
    (cran.r-project.org): source code and
    pre-compiled binaries for Linux, Windows, MacOS;
    contributed packages; documentation; FAQs;
    mailing lists.
  • Omega Project (www.omegahat.org): bi-directional
    intersystem interfaces, e.g., R/Java, R/Perl,
    R/Python, R/XML.

30
Installation
  • Main R software: download from CRAN
    (cran.r-project.org), use latest release, now
    1.9.1.
  • Bioconductor packages: download from Bioconductor
    (www.bioconductor.org), use latest release, now
    1.4.
  • Available for Linux/Unix, Windows, and MacOS.

31
Installing R
  • The latest release is R 1.9.1.
  • From CRAN:
  • Sources.
  • Linux: Debian (apt-get), Mandrake, RedHat RPMs,
    Suse, Vine.
  • Windows: installer rw1091.exe; double-click on
    the icon and follow the instructions.
  • MacOS X: RAqua.
  • To customize installation, see R FAQs.
  • May need to set some environment variables,
    e.g., R_HOME, R_LIBS, R_PROFILE.

32
Installing Bioconductor
  • After installing R, install Bioconductor packages
    using the getBioC install script.
  • From R:
  • > source("http://www.bioconductor.org/getBioC.R")
  • > getBioC()
  • Can customize installation via arguments of
    getBioC.
  • Other packages (biological metadata, experimental
    data, courses) can be installed as described
    below, using Windows pull-down menus or R
    functions install.packages or installDataPackage.

33
R Packages
  • An R package is a structured collection of code
    (R, C, or other), documentation, and/or data for
    performing specific types of analyses.
  • Packages:
  • Base packages (CRAN): e.g., base, methods, nls,
    stats.
  • Contributed packages (CRAN): e.g., ellipse, XML.
  • Bioconductor packages: e.g., annotate, affy,
    marray, multtest, hgu95av2, ALL.
  • In Linux, have a look at directory
    /usr/lib/R (or wherever you've installed R).
  • In Windows, have a look at folders in
    C:\Program Files\R\rw1091.

34
Installing vs. Loading
  • Packages only need to be installed once, but
    they must be loaded with each new R session.
  • Installing: functions install.packages,
    installDataPackage
  • Unix command: R INSTALL
  • Windows: Packages pull-down menu.
  • Loading: function library
  • Windows: Packages pull-down menu.
  • > library(Biobase)
  • Updating: function update.packages
  • Windows: Packages pull-down menu.
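
The difference between an installed package and a loaded one can be checked from within a session. The sketch below uses the tools package only because it ships with R; the variable names are illustrative.

```r
pkg <- "tools"  # a package shipped with R, used here only as an example

# Installed? system.file() returns "" when the package cannot be found.
is_installed <- nzchar(system.file(package = pkg))

# Loaded for this session? Attach it, then look for it on the search path.
if (is_installed) {
  library(pkg, character.only = TRUE)
}
is_attached <- paste0("package:", pkg) %in% search()

c(installed = is_installed, attached = is_attached)
```

Installation is a one-time step on disk, while the library call has to be repeated in every new session.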

35
Starting and Quitting R
  • Start: R command.
  • Quit: q(). Prompted to save workspace image.
  • Save:
  • current environment with save.image (default is
    in .RData file)
  • specific R objects with save.
  • Can be read back using load.
  • Working directory: getwd, setwd.
  • List objects: ls, objects.
  • Search path for R objects: search, searchpaths,
    attach, detach.
  • Function arguments: e.g., ? lm or args(lm).
  • R for Windows provides pull-down menus for the
    above actions.
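
A minimal round trip with save and load might look as follows. The object names and the temporary file are made up for the example; in practice the file would usually live in the working directory reported by getwd().

```r
# Two small objects to save (names and values are illustrative).
expr_means <- c(geneA = 1.5, geneB = 2.5)
run_label <- "demo"

# Save specific objects to a file, here a temporary one.
rda <- tempfile(fileext = ".RData")
save(expr_means, run_label, file = rda)

# Remove them from the workspace, then read them back.
rm(expr_means, run_label)
restored <- load(rda)   # load() returns the names of the restored objects

restored
expr_means
```

save.image() works the same way but writes every object in the workspace rather than a chosen few.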

36
Documentation and Help
  • Manuals, FAQs, and tutorials: available from the R
    and Bioconductor websites and on-line in an R
    session.
  • R on-line help system: detailed on-line
    documentation, available in text, HTML, PDF, and
    LaTeX formats.
  • > help.start()
  • > help(lm)
  • > ? hclust
  • > help.search(apropos = "print")
  • > apropos("mean")
  • > example(hclust)
  • > demo()
  • > demo(image)
  • > data()
  • R and Bioconductor mailing lists: search
    archives, post.
  • Short courses: lecture notes, computer labs, and
    course packages available on WWW for
    self-instruction.
  • Vignettes: openVignette(), vExplorer().
  • Google.
  • All on WWW.

37
Vignettes
  • Bioconductor has adopted a new documentation
    paradigm, the vignette.
  • A vignette is an executable document consisting
    of a collection of code chunks and documentation
    text chunks.
  • Vignettes provide dynamic, integrated, and
    reproducible statistical documents that can be
    automatically updated if either data or analyses
    are changed.
  • Vignettes can be generated using the Sweave
    function from the R tools package.

38
Vignettes
  • Each Bioconductor package contains at least one
    vignette, providing task-oriented descriptions of
    the package's functionality.
  • Vignettes are located in the doc subdirectory of
    an installed package and are accessible from the
    help browser.
  • Vignettes can be used interactively.
  • Vignettes are also available separately from the
    Bioconductor website.

39
Vignettes
  • Tools are being developed for managing and using
    this repository of step-by-step tutorials:
  • Biobase, openVignette: menu of available
    vignettes and interface for viewing vignettes
    (PDF).
  • tkWidgets, vExplorer: interactive use of
    vignettes.
  • reposTools.

40
Vignettes
  • HowTos: task-oriented descriptions of package
    functionality.
  • Executable documents consisting of documentation
    text and code chunks.
  • Dynamic, integrated, and reproducible
    statistical documents.
  • Can be used interactively: vExplorer.
  • Generated using Sweave (tools package).

vExplorer
41
Hands On!
42
Extra Slides
43
Annotation
  • One of the greatest challenges in analyzing
    genomic data is associating the experimental data
    with the available biological metadata, e.g.,
    sequence, gene annotation, chromosomal maps,
    literature.
  • It is essential to make these data available for
    computation.
  • Bioconductor provides three main packages for
    this purpose
  • annotate (end-user)
  • AnnBuilder (developer)
  • annaffy (end-user).

44
WWW Resources
  • Nucleotide databases: e.g., GenBank.
  • Gene databases: e.g., LocusLink, UniGene.
  • Protein sequence and structure databases: e.g.,
    Protein DataBank (PDB), SwissProt.
  • Literature databases: e.g., PubMed, OMIM.
  • Chromosome maps: e.g., NCBI Map Viewer.
  • Pathways: e.g., KEGG.
  • Entrez is a search and retrieval system that
    integrates information from databases at NCBI
    (National Center for Biotechnology Information).

45
annotate: Matching IDs
  • Important tasks:
  • Associate manufacturers' or in-house probe
    identifiers to other available identifiers.
  • E.g.
  • Affymetrix IDs -> LocusLink LocusID
  • Affymetrix IDs -> GenBank accession number.
  • Associate probes with biological data such as
    chromosomal position, pathway membership.
  • Associate probes with published literature data
    via PubMed (need PMID).

46
annotate Matching IDs
47
annotate Versioning
  • It is important to keep version information for
    the mappings.
  • It is important to allow for new mappings to be
    used when they become available.
  • There are some interesting challenges and
    concerns that arise when comparing the strategies
    of on-line mappings versus compiled mappings.

48
Annotation Data Packages
  • The Bioconductor project provides annotation data
    packages that contain many different mappings.
  • Mappings between Affy IDs and other probe IDs:
    hgu95av2 for HGU95Av2 GeneChip series; also
    hgu133a, hu6800, mgu74a, rgu34a, YG.
  • Affy CDF data packages.
  • Probe sequence data packages.
  • These packages are updated and expanded regularly
    as new data become available.
  • They can be downloaded from the Bioconductor
    website and also using installDataPackage.
  • DPExplorer: a widget for interacting with data
    packages.
  • AnnBuilder: tools for building annotation data
    packages.

49
annotate Matching IDs
  • Much of what annotate does relies on matching
    symbols.
  • This is basically the role of a hash table in
    most programming languages.
  • In R, we rely on environments.
  • The annotation data packages provide R
    environment objects containing key and value
    pairs for the mappings between two sets of probe
    identifiers.
  • Keys can be accessed using the R ls function.
  • Matching values in different environments can be
    accessed using the get or multiget functions.
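
The key/value idea can be tried without any annotation package installed. The sketch below builds two small environments by hand: the identifiers for 41046_s_at are the ones quoted in these slides, while probe_B and its values are made up. Base R's mget is used here as a stand-in for looking up several keys at once; it is not the annotate function named above.

```r
# A hand-built probe-to-symbol mapping stored in an environment.
probe2symbol <- new.env(hash = TRUE)
assign("41046_s_at", "ZNF261", envir = probe2symbol)
assign("probe_B", "GENE_B", envir = probe2symbol)   # made-up entry

# A second mapping with the same keys but different values.
probe2acc <- new.env(hash = TRUE)
assign("41046_s_at", "X95808", envir = probe2acc)
assign("probe_B", "ACC_B", envir = probe2acc)       # made-up entry

# Keys are listed with ls(); a single value is retrieved with get().
ls(probe2symbol)
get("41046_s_at", envir = probe2symbol)

# Several keys at once, returned as a named list.
mget(c("41046_s_at", "probe_B"), envir = probe2acc)
```

Because both environments share the same probe identifiers as keys, a value found in one can be matched to the corresponding value in the other.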

50
annotate: Matching IDs
  • > library(hgu95av2)
  • > get("41046_s_at", env = hgu95av2ACCNUM)
  • [1] "X95808"
  • > get("41046_s_at", env = hgu95av2LOCUSID)
  • [1] "9203"
  • > get("41046_s_at", env = hgu95av2SYMBOL)
  • [1] "ZNF261"
  • > get("41046_s_at", env = hgu95av2GENENAME)
  • [1] "zinc finger protein 261"
  • > get("41046_s_at", env = hgu95av2SUMFUNC)
  • [1] "Contains a putative zinc-binding motif
    (MYM)Proteome"
  • > get("41046_s_at", env = hgu95av2UNIGENE)
  • [1] "Hs.9568"

51
annotate: Matching IDs
  • > get("41046_s_at", env = hgu95av2CHR)
  • [1] "X"
  • > get("41046_s_at", env = hgu95av2CHRLOC)
  • X
  • -68692698
  • > get("41046_s_at", env = hgu95av2MAP)
  • [1] "Xq13.1"
  • > get("41046_s_at", env = hgu95av2PMID)
  • [1] "10486218" "9205841" "8817323"
  • > get("41046_s_at", env = hgu95av2GO)
  • TAS TAS IEA
  • "GO:0003677" "GO:0007275" "GO:0016021"

52
annotate Matching IDs
  • Instead of relying on the general R functions for
    environments, new user-friendly functions have
    been written for accessing and working with
    specific identifiers.
  • E.g. getGO, getGOdesc, getLL, getPMID, getSYMBOL.

53
annotate: Matching IDs
  • > getSYMBOL("41046_s_at", data = "hgu95av2")
  • 41046_s_at
  • "ZNF261"
  • > gg1 <- getGO("41046_s_at", data = "hgu95av2")
  • > getGOdesc(gg1, "MF")
  • "GO:0003677"
  • "DNA binding activity"
  • > getLL("41046_s_at", data = "hgu95av2")
  • 41046_s_at
  • 9203
  • > getPMID("41046_s_at", data = "hgu95av2")
  • "41046_s_at"
  • [1] 10486218 9205841 8817323

54
annotate WWW Queries
  • The annotate package provides tools for
  • Querying and processing information from various
    WWW biological databases
  • GenBank,
  • LocusLink,
  • PubMed.
  • Regular expression searching of PubMed abstracts.
  • Generating nice HTML reports of analyses, with
    links to biological databases.

55
annotate: WWW Queries
  • Functions for querying WWW databases from R rely
    on the browseURL function:
  • browseURL("www.r-project.org")
  • Other tools: HTMLPage class, getTDRows,
    getQueryLink, getQuery4UG, getQuery4LL,
    makeAnchor.
  • The XML package is used to parse query results.

56
annotate: Querying GenBank (www.ncbi.nlm.nih.gov/Genbank/index.html)
  • Given a vector of GenBank accession numbers or
    NCBI UIDs, the genbank function
  • opens a browser at the URLs for the corresponding
    GenBank queries;
  • returns an XMLdoc object with the same data.
  • > genbank("X95808", disp = "browser")
  • http://www.ncbi.nih.gov/entrez/query.fcgi?tool=bioconductor&cmd=Search&db=Nucleotide&term=X95808
  • > genbank(1430782, disp = "data", type = "uid")

57
annotate: Querying LocusLink (www.ncbi.nlm.nih.gov/LocusLink/)
  • locuslinkByID: given one or more LocusIDs, the
    browser is opened at the URL corresponding to the
    first gene.
  • > locuslinkByID(9203)
  • http://www.ncbi.nih.gov/LocusLink/LocRpt.cgi?l=9203
  • locuslinkQuery: given a search string, the
    results of the LocusLink query are displayed in
    the browser.
  • > locuslinkQuery("zinc finger")
  • http://www.ncbi.nih.gov/LocusLink/list.cgi?Q=zinc finger&ORG=Hs&V=0
  • getQuery4LL.

58
annotate: Querying PubMed (www.ncbi.nlm.nih.gov)
  • For any gene there is often a large amount of
    data available from PubMed.
  • The annotate package provides the following tools
    for interacting with PubMed:
  • pubMedAbst: a class structure for PubMed
    abstracts in R.
  • pubmed: the basic engine for talking to PubMed
    (pmidQuery).

59
annotate: pubMedAbst Class
  • Class structure for storing and processing
    PubMed abstracts in R:
  • pmid
  • authors
  • abstText
  • articleTitle
  • journal
  • pubDate

60
annotate: High-Level Tools for PubMed
  • pm.getabst: download the specified PubMed
    abstracts (stored in XML) and create a list of
    pubMedAbst objects.
  • pm.titles: extract the titles from a list of
    PubMed abstracts.
  • pm.abstGrep: regular expression matching on the
    abstracts.
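
The idea of regular expression matching on abstract text can be illustrated with base R alone. The sketch below does not call the annotate functions and does not query PubMed; the abstract strings and their identifiers are invented for the example.

```r
# Made-up abstract texts keyed by made-up identifiers (not real PubMed data).
abstracts <- c(
  id1 = "A zinc finger gene is expressed in mouse brain.",
  id2 = "Copy number changes in human tumors.",
  id3 = "Mouse and rat orthologs share a conserved motif."
)

# Which abstracts mention "mouse", ignoring case?
hits <- grepl("mouse", abstracts, ignore.case = TRUE)

# Identifiers of the matching abstracts.
names(abstracts)[hits]
```

The same pattern-matching step, applied to abstract text retrieved from PubMed, is what lets a gene list be filtered by keywords in the literature.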

61
annotate: PubMed Example
  • > pmid <- getPMID("41046_s_at", data = "hgu95av2")
  • > pubmed(pmid, disp = "browser")
  • http://www.ncbi.nih.gov/entrez/query.fcgi?tool=bioconductor&cmd=Retrieve&db=PubMed&list_uids=10486218%2c9205841%2c8817323
  • > absts <- pm.getabst("41046_s_at", base = "hgu95av2")
  • > pm.titles(absts)
  • > pm.abstGrep("mouse", absts[[1]])

62
annotate PubMed Example
63
annotate: PubMed HTML Report
  • The function pmAbst2HTML takes a list of
    pubMedAbst objects and generates an HTML report
    with the titles of the abstracts and links to
    their full page on PubMed.
  • > pmAbst2HTML(absts[[1]], filename = "pm.html")

64
pmAbst2html function from annotate package
pm.html
65
annotate Analysis Reports
  • A simple interface, ll.htmlpage, can be used to
    generate an HTML report of analysis results.
  • The page consists of a table with one row per
    gene, with links to LocusLink.
  • Entries can include various gene identifiers and
    statistics.

66
ll.htmlpage function from annotate package
genelist.html
67
Data Complexity
  • Dimensionality.
  • Dynamic/evolving data: e.g., gene annotation,
    sequence, literature.
  • Multiple data sources and locations: in-house,
    WWW.
  • Multiple data types: numeric, textual, graphical.
  • No longer just an n x p data matrix X!
  • We distinguish between biological metadata and
    experimental metadata.

68
Experimental Metadata
  • Gene expression measures
  • scanned images, i.e., raw data
  • image quantitation data, i.e., output from image
    analysis
  • normalized expression measures, i.e., log ratios
    or Affy expression measures.
  • Reliability/quality information for the
    expression measures.
  • Information on the probe sequences printed on the
    arrays (array layout).
  • Information on the target samples hybridized to
    the arrays.
  • See Minimum Information About a Microarray
    Experiment (MIAME) standards and new MAGEML
    package.

69
Biological Metadata
  • Biological attributes that can be applied to the
    experimental data.
  • E.g. for genes:
  • chromosomal location
  • gene annotation (LocusLink, GO)
  • relevant literature (PubMed).
  • Biological metadata sets are large, evolving
    rapidly, and typically distributed via the WWW.
  • Tools: annotate, annaffy, and AnnBuilder
    packages, and annotation data packages.

70
OOP
  • The Bioconductor project has adopted the
    object-oriented programming (OOP) paradigm
    proposed in J. M. Chambers (1998). Programming
    with Data.
  • This object-oriented class/method design allows
    efficient representation and manipulation of
    large and complex biological datasets of multiple
    types.
  • Tools for programming using the class/method
    mechanism are provided in the R methods package.
  • Tutorial: www.omegahat.org/RSMethods/index.html.

71
OOP Classes
  • A class provides a software abstraction of a real
    world object. It reflects how we think of
    certain objects and what information these
    objects should contain.
  • Classes are defined in terms of slots which
    contain the relevant data.
  • An object is an instance of a class.
  • A class defines the structure, inheritance, and
    initialization of objects.
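
A short sketch of these ideas using the methods package follows. The class names MiniExprs and MiniExprsSE and their slots are hypothetical, chosen only to show slots, instances, and inheritance; they are not Bioconductor classes.

```r
library(methods)

# A class is defined by its slots (hypothetical example class).
setClass("MiniExprs",
         representation(exprs = "matrix", annotation = "character"))

# An object is an instance of a class.
obj <- new("MiniExprs",
           exprs = matrix(1:6, nrow = 3,
                          dimnames = list(paste0("g", 1:3), c("s1", "s2"))),
           annotation = "hgu95av2")

# Slots are accessed with @.
dim(obj@exprs)
obj@annotation

# Inheritance: a subclass adds a slot and keeps those of its parent.
setClass("MiniExprsSE", contains = "MiniExprs",
         representation(se = "matrix"))
obj2 <- new("MiniExprsSE", obj, se = matrix(0.1, nrow = 3, ncol = 2))
is(obj2, "MiniExprs")
```

Slot types are checked when an object is created, so supplying, say, a character vector for the exprs slot would raise an error.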

72
OOP Methods
  • A method is a function that performs an action on
    data (objects).
  • Methods define how a particular function should
    behave depending on the class of its arguments.
  • Methods allow computations to be adapted to
    particular data types, i.e., classes.
  • A generic function is a dispatcher: it examines
    its arguments and determines the appropriate
    method to invoke.
  • Examples of generic functions in R include plot,
    summary, print.
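
Method dispatch can be sketched with a toy generic. The generic centerValue and the classes Spotted and Oligo are hypothetical names introduced only for this illustration.

```r
library(methods)

# Two hypothetical classes holding different kinds of measurements.
setClass("Spotted", representation(M = "numeric"))
setClass("Oligo", representation(intensity = "numeric"))

# The generic function only dispatches; the methods do the work.
setGeneric("centerValue", function(object) standardGeneric("centerValue"))

# Log-ratios are summarized directly.
setMethod("centerValue", "Spotted",
          function(object) median(object@M))

# Raw intensities are log2-transformed first.
setMethod("centerValue", "Oligo",
          function(object) median(log2(object@intensity)))

centerValue(new("Spotted", M = c(-1, 0, 2)))
centerValue(new("Oligo", intensity = c(2, 4, 8)))
```

The same call, centerValue(x), runs different code depending on the class of x, which is what allows computations to be adapted to particular data types.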

73
exprSet Class
Processed Affymetrix or spotted array data.
  • exprs: matrix of expression measures, genes x
    samples.
  • se.exprs: matrix of SEs for expression measures,
    genes x samples.
  • phenoData: sample-level covariates, instance of
    class phenoData.
  • annotation: name of annotation data.
  • description: MIAME information.
  • notes: any notes.
  • Use of object-oriented programming to deal with
    data complexity.
  • S4 class/method mechanism (methods package).

74
marrayRaw Class
Pre-normalization intensity data for a batch of
arrays.
  • maRf, maGf: matrices of red and green foreground
    intensities.
  • maRb, maGb: matrices of red and green background
    intensities.
  • maW: matrix of spot quality weights.
  • maLayout: array layout parameters
    (marrayLayout).
  • maGnames: description of spotted probe sequences
    (marrayInfo).
  • maTargets: description of target samples
    (marrayInfo).
  • maNotes: any notes.
75
AffyBatch Class
Probe-level intensity data for a batch of arrays
(same CDF).
  • cdfName: name of CDF file for arrays in the
    batch.
  • nrow, ncol: dimensions of the array.
  • exprs, se.exprs: matrices of probe-level
    intensities and SEs; rows correspond to probe
    cells, columns to arrays.
  • phenoData: sample-level covariates, instance of
    class phenoData.
  • annotation: name of annotation data.
  • description: MIAME information.
  • notes: any notes.
76
Sweave
  • The Sweave system allows the generation of
    dynamic, integrated, and reproducible statistical
    documents intermixing text, code, and code output
    (textual and graphical).
  • Functions are available in the R tools package.
  • See ? Sweave and the manual at
    www.ci.tuwien.ac.at/leisch/Sweave/.

77
Sweave: Input
  • Input: a text file which consists of a sequence
    of code chunks and documentation text chunks
    (noweb file).
  • Documentation chunks:
  • start with @
  • text in a markup language like LaTeX.
  • Code chunks:
  • start with <<name>>=
  • R or S-Plus code.
  • File extension: .rnw, .Rnw, .snw, .Snw.

78
Sweave: Output
  • Output: a single document, e.g., .tex file or
    .pdf file, containing
  • the documentation text,
  • the R code,
  • the code output: text and graphs.
  • The document can be automatically regenerated
    whenever the data, code, or documentation text
    change.
  • Stangle or tangleToR: extract only the code
    chunks.
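
The input and output described on these slides can be tried end to end with a tiny file. The sketch below writes a minimal noweb-style source to a temporary directory, then tangles and weaves it. The file names and the chunk name meanchunk are arbitrary, and the functions are called from the utils package, which is where current R releases provide Sweave and Stangle (the slides refer to the tools package).

```r
# Write a tiny noweb-style source file (file names here are arbitrary).
workdir <- tempfile("sweave_demo")
dir.create(workdir)
rnw <- file.path(workdir, "main.Rnw")
writeLines(c(
  "\\documentclass{article}",
  "\\begin{document}",
  "The mean of the first ten integers is computed below.",
  "<<meanchunk>>=",
  "x <- 1:10",
  "mean(x)",
  "@",
  "\\end{document}"
), rnw)

# Stangle: extract only the code chunks into an R script.
rfile <- file.path(workdir, "main.R")
utils::Stangle(rnw, output = rfile, quiet = TRUE)

# Sweave: weave text, code, and code output into a LaTeX file.
texfile <- file.path(workdir, "main.tex")
utils::Sweave(rnw, output = texfile, quiet = TRUE)

# Inspect the extracted code.
readLines(rfile)
```

Turning main.tex into a PDF additionally requires a LaTeX installation, which this sketch does not assume.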

79
Sweave
  • main.Rnw -> (Stangle) -> main.R
  • main.Rnw -> (Sweave) -> main.tex, fig.pdf, fig.eps
  • main.tex -> (latex) -> main.dvi -> (dvips) -> main.ps
  • main.tex -> (pdflatex) -> main.pdf
80
Widgets
  • Widgets: small-scale graphical user interfaces
    (GUIs), providing point-and-click access for
    specific tasks.
  • E.g. file browsing and selection for data input,
    basic analyses.
  • Packages:
  • tkWidgets: dataViewer, fileBrowser, fileWizard,
    importWizard, objectBrowser.
  • widgetTools.

81
Widgets
Reading in phenoData
tkSampleNames
tkphenoData
tkMIAME