"R" Programming for Life Sciences - PowerPoint PPT Presentation

1
"R" Programming for Life Sciences
  • Raymond R. Balise, Ph.D.
  • Health Research and Policy
  • SPCTRM/SCCTER

2
Roadmap
  • What makes R different from the rest?
  • Setting up R
  • Types of data
  • Working with collections of data
  • Importing and exporting data
  • Writing functions
  • Graphics

3
When to Use R
  • Shoestring budget
  • Cutting edge statistics
  • Developing your own or fine-tuning existing
    methods
  • Local expertise

4
Programming Languages
  • Procedural languages
  • C, Fortran, Cobol, Basic
  • use a model where the logic flows from the top of
    the page to the bottom with calls to goto
    subroutines as needed
  • It is hard to encapsulate the code.
  • Object oriented languages
  • C++, Visual Basic, Java
  • involve creating objects and then operating on
    them

5
R is Object Oriented (OO)
  • You create objects
  • vector of numbers, a graphic, etc.
  • You call methods/functions to operate on the
    objects.
  • Working with an OO language requires you to learn
    about special methods to create, access, modify,
    or destroy objects and their properties.
  • R hides these processes.
  • It helps a lot if you want to write new
    statistics and methods and is required for making
    new packages.

6
OO Example
  • With R you write code in the editor which I will
    show you in a minute.
  • You can create an object which holds a bunch of
    numbers (a vector, if you remember math)
  • You can then use (aka call) a function (aka
    method) to operate on the object.
  • The summary() function
  • Create and display a numeric summary object
  • The plot() function
  • Create and display a graphic summary object

7
Make the ages object
Call the summary function
Call the plot function
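The three steps named on this slide can be sketched as follows. The values in ages are borrowed from a later slide, and ageSummary is a name chosen here for illustration.

```r
# Make the ages object: a numeric vector
ages = c(9, 11, 40, 41)

# Call the summary function: creates and displays a numeric summary object
ageSummary = summary(ages)
print(ageSummary)

# Call the plot function: draws each value against its position
plot(ages)
```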
8
OO Programming in R
  • OO programming requires
  • objects
  • classes
  • describe specific properties for groups of
    objects
  • inheritance
  • classes related to each other (derived from other
    classes) have related properties
  • polymorphism
  • the same function name applied to different
    classes does different things
  • R vs. Java: R typically has separate classes for
    actions instead of bundling them with the data
    structures
  • Java
  • Animal - domesticated - dog (walks)
  • R
  • Animal - domesticated - dog
  • Movement - Walks

9
S3 and S4
  • R (really S) predates object oriented programming
  • There are two OO programming systems in the core
    of R
  • S3 (Chambers and Hastie 1992)
  • did not have rigid class structure
  • easy to program
  • less reliable
  • fine for small-time development
  • S4 (Chambers 1998)
  • stronger support for classes
  • harder to program
  • more reliable
  • good for industrial strength development
  • Knowing they exist helps explain differences you
    see in R code and lets you modify both types of
    objects.
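A minimal sketch of how the two systems look in code. The class names, the speak and move generics, and the dog called Rex are all invented for illustration; only the mechanisms (UseMethod for S3, setClass/setGeneric/setMethod for S4) come from R itself.

```r
library(methods)   # already attached in most sessions; harmless to repeat

# S3: a class is just an attribute, and methods are found by naming convention
dog = structure(list(name = "Rex"), class = c("dog", "animal"))
speak = function(x, ...) UseMethod("speak")
speak.animal = function(x, ...) paste(x$name, "makes a sound")
speak.dog = function(x, ...) paste(x$name, "barks")
speak(dog)                       # dispatches to speak.dog

# S4: classes and generics are declared formally and slots are type checked
setClass("Animal", representation(name = "character"))
setClass("Dog", contains = "Animal")
setGeneric("move", function(object) standardGeneric("move"))
setMethod("move", "Animal", function(object) paste(object@name, "moves"))
setMethod("move", "Dog", function(object) paste(object@name, "walks"))
rex4 = new("Dog", name = "Rex")
move(rex4)                       # dispatches to the Dog method

isS4(dog)                        # S3 object
isS4(rex4)                       # S4 object
```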

10
Where to Get R
  • R has two main websites. One describes the
    project
  • http://www.r-project.org/
  • The other has most of the stuff you want to
    download
  • http://cran.r-project.org/
  • Because the R project has people working all over
    the globe, the software download site is
    mirrored everywhere. The closest mirror is USA
    CA1 (aka UC Berkeley).

11
http://cran.cnr.berkeley.edu/
  • There is an R installer for all the common
    operating systems
  • cran.cnr.berkeley.edu/bin/windows/base/
  • cran.cnr.berkeley.edu/bin/macosx/
  • cran.cnr.berkeley.edu/bin/linux/
  • Each is basically self explanatory.

13
Installing on Windows
  • Double click the installer and just push next
    until you get to this screen.

Specify that you want to do customized
startup. This will let you set up R to work with
other programs nicely.
14
Customize
  • Use these options, then hit next a bunch.

15
  • Type help.start() and push enter to start the help.
  • Type q() and push enter to quit, but don't yet.

16
GUI
17
GUI
18
GUI
Shows the add on packages currently accessible
19
Packages in R
  • User-supplied packages are typically found at one
    of three places
  • CRAN for all kinds of stuff
  • Omegahat for web-based statistics
  • Bioconductor for genomic analysis
  • R packages update often.
  • Your colleagues will recommend task-specific
    packages.
  • Rcmdr is my favorite.

20
GUI
21
GUI
This is useful
22
HTML help
This is useful but not Google.
23
Rseek.org is Google-driven
  • I highly recommend it.

24
Mac Install
  • Download and double click the dmg file.

Click customize and make sure Tcl/Tk is checked
on.
25
X11
  • Some packages for R on the Mac (like Rcmdr)
    require X11 to be installed.
  • I think it is part of the standard Leopard
    installation but was an option with Tiger. If
    you need it, try to install it off of the DVD
    that came with your machine because people have
    reported problems using the dmg files from
    Apple.com.

26
X11 and Add-on Packages
To get add on packages use this menu.
Help Search
You can click here to make sure X11 works.
27
Getting or Updating Packages
  • Click Get List, click the package name, be sure
    install dependencies is checked on, then click
    install.

28
Your First Package
  • I suggest you install the Rcmdr package first
    thing.
  • Use the Install packages option on the package
    menu to download Rcmdr
  • To make it available for your R session type
  • library(Rcmdr)
  • Capitalization matters!
  • The first time you run it, it will ask you if it
    can download additional packages.

30
Data Structures
  • Vectors
  • A bunch of data in a single row or column
  • All of the same type
  • Matrix
  • A row and column arrangement of data
  • All of the same type
  • Data frame
  • A row and column arrangement of data
  • Columns are of different types
  • List
  • Very free form structure
  • A grouping of different types of data

Like a good spreadsheet or relational database
file
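A small sketch of the four structures side by side. The object names m, people, and stuff are made up here; the ages and names reuse values from later slides.

```r
# Vector: one type only
ages = c(9, 11, 40, 41)

# Matrix: rows and columns, still one type only
m = matrix(1:6, nrow = 2, ncol = 3)

# Data frame: columns of equal length that may differ in type
people = data.frame(dude = c("Larry", "Moe", "Curly", "Shemp"),
                    age = ages,
                    stringsAsFactors = FALSE)

# List: free form, elements may differ in type and length
stuff = list(ages = ages, grid = m, label = "anything goes")

str(people)
sapply(people, class)   # one class per column
```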
31
Types of Data Vectors
  • Numeric
  • Integer, real, and complex are different types
    but you will not need to pay attention to the
    details
  • NA means missing
  • NaN means not a number
  • String
  • Characters of the alphabet
  • Logical
  • TRUE, FALSE or NA

32
Making a Vector
  • Make a sequence
  • OneToThirty = seq(1, 30)
  • oneToThirty = seq(1, 30, by = 2)
  • x1to30 = 1:30
  • (x1to30 = 1:30)
  • c stands for concatenate
  • ages = c(9, 11, 40, 41)
  • dudes = c("Larry", "Moe", "Curly", "Shemp")

R is case sensitive.
Surround the expression with () to display the
result automatically.
33
Recycling and Vectorizing
  • You can add one to all four ages.
  • ages + c(1,1,1,1)
  • If you provide the scalar integer, R will
    temporarily vectorize the 1 by recycling that
    value to match the length of the ages vector.
  • ages + 1
  • It will recycle a series also.
  • ages + c(1,2)
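The recycling rule above can be checked directly. The names plusOneLong and plusOneShort are chosen here for illustration.

```r
ages = c(9, 11, 40, 41)

plusOneLong  = ages + c(1, 1, 1, 1)   # explicit vector of ones
plusOneShort = ages + 1               # the 1 is recycled four times
identical(plusOneLong, plusOneShort)

# A two-element series is recycled as 1, 2, 1, 2
plusSeries = ages + c(1, 2)
plusSeries
```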

34
Naming Parts of a Vector
  • You can assign names to the elements of a vector.
    This allows later access to the elements using
    the names instead of the position.
  • names(ages) = dudes
  • To erase them
  • names(ages) = NULL
  • Notice what happens when the lengths differ
  • dudes = c("Larry", "Moe", "Curly")
  • names(ages) = dudes
  • ages

35
Getting at Parts of a Vector
  • Specify the element number.
  • heyMoe = ages[2]
  • Specify to drop everything except the element
    number.
  • heyMoe = ages[c(-1, -3, -4)]
  • Specify the name.
  • heyMoe = ages["Moe"]
  • This only returns the first one if there are
    duplicates.
  • names(ages)[4] = "Moe"
  • heyMoe = names(ages) %in% "Moe"
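A sketch of the three access styles, plus the duplicate-name case. The result names (byPosition, firstMoe, allMoes, and so on) are invented here; using the %in% result as a logical index is one way, not necessarily the author's, to pull back every match.

```r
ages  = c(9, 11, 40, 41)
dudes = c("Larry", "Moe", "Curly", "Shemp")
names(ages) = dudes

byPosition = ages[2]                 # keep element 2
byDropping = ages[c(-1, -3, -4)]     # drop everything except element 2
byName     = ages["Moe"]             # look up by name

# With a duplicated name, ages["Moe"] returns only the first match,
# while a logical index built with %in% returns every match
names(ages)[4] = "Moe"
firstMoe = ages["Moe"]
allMoes  = ages[names(ages) %in% "Moe"]
```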

36
Getting Parts with Logic
  • You can use the logical values TRUE and FALSE to
    subset also.
  • The vector letters ships with R and includes the
    letters of the alphabet. You can get even
    numbered letters by using recycling
  • letters[c(FALSE, TRUE)]

37
Smarter Access to a Vector
  • You can use logic checks to find the record
    numbers in a vector which meet your criteria.
  • ages
  • which(ages
  • You can then subset down your data to the records
    of interest using the subset operator.
  • ageswhich(ages
  • agesages

38
Choosing Values
  • If you need specific values you can use the &
    (and) or the | (or) operators to get the ordered
    set of TRUE and FALSE values.
  • ages 21 ages
  • ! means not
  • !(ages 21 ages
  • Notice that it is applying the one logic check to
    the vector of ages. How does it do that?
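The logic checks on this slide and the previous one can be sketched as below. The thresholds are assumptions: cutoff = 21 and the upper bound of 41 are chosen only to make the example concrete, and the variable names are invented.

```r
ages = c(9, 11, 40, 41)
cutoff = 21                      # hypothetical threshold, for illustration only

ages < cutoff                    # one TRUE/FALSE per element
young = which(ages < cutoff)     # positions that pass the check
ages[young]                      # subset by position
ages[ages < cutoff]              # subset directly with the logical vector

# & is "and", | is "or", ! is "not"; the upper bound 41 is also hypothetical
inRange = ages > cutoff & ages < 41
outside = !(ages > cutoff & ages < 41)
ages[inRange]
```

The single comparison is applied to every element because the scalar on the right is recycled to the length of ages.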

39
Comparing Against Vectors
  • What happens when you try to compare a vector to
    a set of things?
  • gender = c(NA, "Male", "Female", "Blue")
  • gender == "Male" | gender == "Female"
  • gender == c("Male", "Female")
  • R recycles the shorter vector to be the longer
    length, then does the comparison. Use the %in%
    operator if you want to compare as if you wrote a
    series of or statements.
  • gender %in% c("Male", "Female")

This one uses recycling.
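Running the three comparisons side by side makes the difference visible. The result names are chosen here for illustration.

```r
gender = c(NA, "Male", "Female", "Blue")

orResult      = gender == "Male" | gender == "Female"
recycleResult = gender == c("Male", "Female")    # compared pairwise after recycling
inResult      = gender %in% c("Male", "Female")  # set membership, never NA

rbind(orResult, recycleResult, inResult)
```

Only the pairwise version misses the real matches, because "Male" in position 2 is compared against the recycled "Female".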
40
Recoding Values in a Vector
  • R has functions like if and ifelse to process
    values. Other packages like car have very useful
    functions like recode
  • newAge ifelse(ages
  • newAge ifelse(ages"Old", "Fossilized")
  • library(car)
  • newAge2 = recode(ages, ' 1:21 = "Young"; else =
    "Old" ')
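A base R sketch of recoding with ifelse, including a nested call for three groups. The cutoffs of 21 and 40 are assumptions made for this example, and newAge3 is an invented name.

```r
ages = c(9, 11, 40, 41)

# Two groups; the cutoff of 21 is an assumption made for illustration
newAge = ifelse(ages <= 21, "Young", "Old")

# Three groups by nesting ifelse; the cutoff of 40 is also an assumption
newAge3 = ifelse(ages <= 21, "Young",
                 ifelse(ages <= 40, "Old", "Fossilized"))

data.frame(ages, newAge, newAge3)
```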

41
Arrays and Matrices
  • If you add a dimension attribute to a vector, you
    get an array. If the array has exactly two
    dimensions, it is a matrix.
  • lets = letters[1:10]
  • typeof(lets); class(lets); attributes(lets)
  • letsA = array(lets)
  • typeof(letsA); class(letsA); attributes(letsA)
  • letsA2 = array(lets, dim = c(5,2))
  • typeof(letsA2); class(letsA2); attributes(letsA2)
  • letsM = matrix(lets, 5, 2)
  • typeof(letsM); class(letsM); attributes(letsM)

42
Out of a Matrix
  • Use the same subset operator logic to get
    information out of a matrix.
  • letsM
  • letsM[1, ]
  • letsM[ , 2]
  • numb = array(1:27, dim = c(3,3,3))
  • numb[ , , 2]
  • numb[c(2,3), c(1,2), 3]
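A compact sketch of the subsetting on this slide, using the objects from the previous one. The names layer2 and corner are added here to hold the results.

```r
lets  = letters[1:10]
letsM = matrix(lets, 5, 2)     # filled column by column

letsM[1, ]                     # first row
letsM[ , 2]                    # second column

numb   = array(1:27, dim = c(3, 3, 3))
layer2 = numb[ , , 2]                  # the whole second layer
corner = numb[c(2, 3), c(1, 2), 3]     # rows 2-3, columns 1-2 of the third layer

is.null(attributes(lets))      # a plain vector carries no dim attribute
attributes(letsM)$dim          # the matrix does
```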

43
Categorical Variables
  • R makes a distinction between variables holding a
    bunch of characters from the alphabet and
    variables holding categorical variables. If you
    have a classification/categorical variable, you
    want R to treat it as a factor or an ordered
    factor. Typical factors are treatment or gender.
  • dose = c("low", "placebo", "high", "low")
  • dose
  • typeof(dose)

44
Factors
  • To convert a character variable to a factor, use
    the as.factor function.
  • doseF = as.factor(dose)
  • typeof(doseF)
  • class(doseF)
  • Behind the scenes, the character variable is
    converted into numbers and the numbers are given
    character strings to display.
  • In R 2.7.1 the levels of the factor are ordered
    alphabetically and the first one is represented
    with the digit 1, the second is 2, etc.

There are is. or as. predicate functions to
check object types or convert between types of
objects.
45
Comparing Factors
  • You can compare a factor vs. a constant value.
  • doseF == "high"
  • as.integer(doseF) == 1
  • Or you can compare vs. vectors (CAREFULLY).
  • doseF == c("high", "low")
  • doseF %in% c("high", "low")
  • R will stop you from comparing factors that have
    different categories.
  • doseF2 = as.factor(c("blah", "placebo", "high",
    "low"))
  • doseF == doseF2

Notice wrong answer thanks to recycling.
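The comparisons above can be run as follows. The names pairwise, membership, and outcome are invented for this sketch, and tryCatch is used only so the script keeps running after the expected error.

```r
dose  = c("low", "placebo", "high", "low")
doseF = as.factor(dose)

doseF == "high"                           # one constant, recycled safely
pairwise   = doseF == c("high", "low")    # recycled pairwise: misleading
membership = doseF %in% c("high", "low")  # what was probably intended

# Factors with different level sets cannot be compared with ==
doseF2  = as.factor(c("blah", "placebo", "high", "low"))
outcome = tryCatch(doseF == doseF2, error = function(e) conditionMessage(e))
outcome
```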
46
Recoding Factors
  • Often you will want to regroup factor levels.
  • amount = as.factor(c("placebo", "10mg", "5mg",
    "10mg"))
  • levels(amount)
  • regroup = list(none = "placebo", some = c("5mg",
    "10mg"))
  • levels(amount) = regroup
  • amount

none = placebo; some = 5mg, 10mg
47
Numeric Factors
  • If you have numeric factor, be careful converting
    from factors back to numbers.
  • ID = c(1000, 1000, 1001, 2)
  • IDf = factor(ID)
  • as.integer(IDf)
  • levels(IDf)
  • numbersAgain = as.numeric(levels(IDf))[IDf]
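Running the slide's code shows why as.integer() alone is a trap: it returns the internal level codes, not the original numbers.

```r
ID  = c(1000, 1000, 1001, 2)
IDf = factor(ID)

levels(IDf)          # the labels, stored as character strings
as.integer(IDf)      # the internal codes, not the original numbers

# Index the numeric levels by the factor to recover the original values
numbersAgain = as.numeric(levels(IDf))[IDf]
numbersAgain
```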

48
Loading Text Data into R
  • Reading text files
  • fakeAlleles = read.table("c:\\blah\\fakeAlleles.txt",
    header = TRUE)
  • See if it worked
  • fakeAlleles
  • names(fakeAlleles)
  • summary(fakeAlleles)
  • fakeAlleles$dude = as.character(fakeAlleles$dude)
  • fakeAlleles
  • A better option
  • fakeAlleles = read.table("c:\\blah\\fakeAlleles.txt",
    header = TRUE, colClasses = c("character",
    "factor","factor"))

49
Other Text Formats
  • Other text reading methods
  • read.csv: comma separated values
  • read.csv2: semicolon delimited files
  • read.delim: tab delimited files
  • read.fwf: fixed width format files
  • Use same options as read.table
  • If the data has bad or no column headings you may
    also want to include
  • read.table( stuff, col.names = c("name1",
    "name2") )
  • To prevent characters from coming in as factors
  • options(stringsAsFactors = FALSE)
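A self-contained sketch of what colClasses buys you. The file contents are invented (the real fakeAlleles.txt is not shown in the slides), and the text argument of read.table is used in place of a file path so the example runs anywhere.

```r
# Hypothetical stand-in for a small whitespace delimited file
rawText = "dude allele1 allele2
001 A T
002 A A
006 G A"

fakeAlleles = read.table(text = rawText, header = TRUE,
                         colClasses = c("character", "factor", "factor"))
sapply(fakeAlleles, class)

# Without colClasses the IDs are read as numbers and lose their leading zeros
guessed = read.table(text = rawText, header = TRUE)
guessed$dude
```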

50
Data Frames
  • The data is imported into a data frame.
  • class(fakeAlleles)
  • A data frame is really a list of vectors where the
    vectors are all the same length.
  • as.list(fakeAlleles)
  • To select a column you specify the data frame
    variable name.
  • theDudes = fakeAlleles$dude
  • All the stuff you saw for logic checks on vectors
    can be used on the parts of a data frame.
  • fakeAlleles$allele1 == "A"

51
Subsetting Vectors (again)
  • Recall that you can subset using the [ ] operator
  • ages = c(9, 11, 40, 41)
  • heyMoe = ages[2]
  • ages
  • agesages
  • The same voodoo works on the vectors that make up
    data frames!
  • dudeTwo = fakeAlleles$dude[2]

52
Subsetting Data Frames
  • Parts (subsets) of data frames are referenced by
    "row numbers comma column numbers"
  • The first record: fakeAlleles[1, ]
  • The 2nd and 3rd columns: fakeAlleles[ , c(2,3)]
  • The genotype for record 6: fakeAlleles[6, c(2,3)]
  • or by names
  • fakeAlleles[ , c("allele1", "allele2")]

53
Named Rows
  • You can name your rows also
  • fakeAlleles = read.table("c:\\blah\\fakeAlleles.txt",
    header = TRUE, colClasses = c("character",
    "factor","factor"))
  • row.names(fakeAlleles) = fakeAlleles$dude
  • fakeAlleles = fakeAlleles[ , -1]
  • fakeAlleles["006", c("allele1", "allele2")]

54
Subsetting Using Logic
  • You can use logic checks to subset
  • fakeAlleles$allele1 == "A" & fakeAlleles$allele2
    == "A"
  • fakeAlleles[fakeAlleles$allele1 == "A" &
    fakeAlleles$allele2 == "A", ]

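The data frame subsetting from the last few slides can be tried on a small stand-in. The three rows below are invented (the real file is not shown), and bothA and homozygousA are names chosen for this sketch.

```r
# Hypothetical stand-in for the fakeAlleles file
fakeAlleles = data.frame(dude    = c("001", "002", "006"),
                         allele1 = c("A", "A", "G"),
                         allele2 = c("T", "A", "A"),
                         stringsAsFactors = FALSE)

fakeAlleles[1, ]                           # first record: rows go before the comma
fakeAlleles[ , c(2, 3)]                    # 2nd and 3rd columns
fakeAlleles[ , c("allele1", "allele2")]    # the same columns by name

# Logical subsetting: keep rows where both alleles are "A"
bothA = fakeAlleles$allele1 == "A" & fakeAlleles$allele2 == "A"
homozygousA = fakeAlleles[bothA, ]

# Named rows allow lookup by ID
row.names(fakeAlleles) = fakeAlleles$dude
fakeAlleles = fakeAlleles[ , -1]
fakeAlleles["006", c("allele1", "allele2")]
```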
55
Importing From Excel
  • If you have Perl on your machine, you can use the
    read.xls() function in the gdata package to
    easily get data out of Excel and into a data
    frame.
  • Mac has Perl
  • Windows
  • www.activestate.com/Products/activeperl/index.mhtml

56
Using read.xls
  • Windows
  • library(gdata)
  • diab = read.xls("c:\\blah\\walkerDiab.xls")
  • Mac
  • library(gdata)
  • read.xls("/users/balise/desktop/walkerDiab.xls")
  • It's that easy. Behind the scenes it is
    converting the xls file into a csv so you can use
    the text importing options.
  • Do summary() on the data frame and notice what
    happens to the missing value.

57
RODBC
  • ODBC is a language/convention for accessing
    databases. R allows you to use ODBC connections
    to burrow directly into databases and other data
    containers like Excel.
  • library(RODBC)
  • channels")
  • sqlQuery(channel, "select * from [Sheet1$]")
  • odbcCloseAll()

58
SQL
  • If you have to learn one programming language,
    learn SQL.
  • With it you can manipulate data stored in nearly
    every commercial database.
  • You can aggregate, subset and modify data.
  • It is well implemented inside of both R and SAS.
  • SQL with R is nicely documented in Spector's
    (2008) Data Manipulation with R. It is a must
    own for people who want to learn R.

59
Exporting Text Files
  • R can write objects full of data, including data
    frames, into text files.
  • By default it will quote the character string and
    fill in the letters NA where there were
    originally missing values.
  • This code exports back to the original
    appearance.
  • write.table(diab, file = "c:\\blah\\exported.tab",
    sep = "\t", quote = FALSE, na = "")
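A round-trip sketch of the export options. The two-row diab data frame is invented, a temporary file replaces the c:\\blah path, and row.names = FALSE is an addition not present on the slide (it keeps R's row numbers out of the file).

```r
# Hypothetical data frame with one missing value
diab = data.frame(ID = c("a", "b"), WT_KG = c(70.5, NA),
                  stringsAsFactors = FALSE)

outFile = tempfile(fileext = ".tab")   # temporary path instead of c:\\blah
write.table(diab, file = outFile, sep = "\t", quote = FALSE,
            na = "", row.names = FALSE)
exported = readLines(outFile)          # tab separated, no quotes, blank for NA
exported

# Read it back, treating empty fields as missing
roundTrip = read.table(outFile, header = TRUE, sep = "\t",
                       na.strings = "", stringsAsFactors = FALSE)
```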

60
Creating Programs
  • You can write line-by-line instructions in the R
    console, use the editors built into R, or use a
    third party editor (like Tinn-R for Windows).
  • Console
  • Type history() to see the lines you have
    submitted recently and then save to a file and
    re-run it later if needed.
  • Built-in Editor
  • Mac: Click the blank page at top of the console
  • Windows: File > New Script

61
Windows Tinn-R Editor
  • http://www.sciviews.org/Tinn-R/index.html

62
Calling Functions
  • R has a plethora of polymorphic functions to help
    you summarize and visualize your data.
  • mean(diab)
  • If run with a vector of numbers, it returns the
    mean.
  • If run with a data frame, it returns a set of
    means.
  • This is an old S3 method
  • isS4(mean)
  • methods(mean)

63
Polymorphism is Fun
  • Plot does different things depending on the
    function arguments
  • plot(diab)
  • plot(diab$WT_KG, diab$HT_CM)
  • plot(diab$WT_KG)
  • Take a look at how it works
  • isS4(plot)
  • methods(plot)
  • getAnywhere(plot.factor)

64
Arguments
  • Functions try to match arguments in the order
    they appear in the definitions in the help files.
  • ?plot
  • You can explicitly reference the arguments' full
    names, and they typically allow abbreviations (but
    don't obfuscate your code).
  • The ... means that other arguments are allowed and
    are passed along the class hierarchy to other
    methods. Look at plot to see this.
  • The ... makes it easy to write buggy code because
    if you misspell the argument name, it is silently
    passed along.

65
Writing Functions
  • You can easily write functions, but notice that
    the last thing calculated is returned
  • MandM = function(x) { mean(x); median(x) }
  • MandM(diab$WT_KG) returns only the median
  • Store the values you want into a list
  • MandM = function(x) {
  • blah = list(theMean = 0, theMedian = 0)
  • blah$theMean = mean(x)
  • blah$theMedian = median(x)
  • return (blah) }
  • MandM(diab$WT_KG)

66
Other Arguments
  • MandM(diab$HT_CM)
  • It points out that we need to deal with missing
    values. Look up mean and median and you will see
    they allow the na.rm parameter to determine if
    missing values are dropped.
  • Using MandM(diab$HT_CM, na.rm = TRUE) does not work
    because the parameter list does not allow it. We
    want to allow that parameter to be passed along.
    So rewrite the function.

67
M and M Again
  • Recall that ... in the argument list means
    "other stuff"
  • MandM = function(x, ...) {
  • blah = list(theMean = 0, theMedian = 0)
  • blah$theMean = mean(x, ...)
  • blah$theMedian = median(x, ...)
  • return (blah) }
  • MandM(diab$WT_KG)
  • MandM(diab$HT_CM, na.rm = TRUE)
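Written out with braces, the function above looks like this. The heights vector is a hypothetical stand-in for a column with a missing value, and withNA and withoutNA are names added here.

```r
MandM = function(x, ...) {
  blah = list(theMean = 0, theMedian = 0)
  blah$theMean   = mean(x, ...)     # ... forwards na.rm (or anything else)
  blah$theMedian = median(x, ...)
  return(blah)
}

heights = c(150, 160, NA, 180)      # hypothetical data with a missing value
withNA    = MandM(heights)                  # missing value propagates
withoutNA = MandM(heights, na.rm = TRUE)    # missing value dropped first
withoutNA
```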

68
Applying Your Function
  • R does allow you to write loops to iterate over
    records or variables but if you are not writing
    novel math functions, they can generally be
    avoided.
  • R will try to vectorize and process
  • MandM(c(diab$WT_KG, diab$WT_KG))
  • Use sapply to apply a function to a data frame
  • sapply(diab, MandM, rm.na = TRUE)
  • sapply(diab, MandM, na.rm = TRUE)

Notice the error handling.
69
Better M and M
  • MandM = function(x, ...) {
  • blah = list(theMean = 0, theMedian = 0)
  • if(is.numeric(x) == TRUE) {
  • blah$theMean = mean(x, ...)
  • blah$theMedian = median(x, ...) }
  • return (blah) }
  • sapply(diab, MandM, na.rm = TRUE)

70
Yummy M and M
  • MandM = function(x, ...) {
  • blah = list(theMean = NaN, theMedian = NaN)
  • if(is.numeric(x) == TRUE) {
  • blah$theMean = mean(x, ...)
  • blah$theMedian = median(x, ...) }
  • return (blah) }
  • MandM(diab$WT_KG)
  • MandM(diab$HT_CM, na.rm = TRUE)
  • sapply(diab, MandM, na.rm = TRUE)
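The final version, applied with sapply to a data frame that mixes column types. The three-row diab data frame is invented for this sketch; only the column names WT_KG and HT_CM come from the slides.

```r
MandM = function(x, ...) {
  blah = list(theMean = NaN, theMedian = NaN)   # defaults for non-numeric input
  if (is.numeric(x)) {
    blah$theMean   = mean(x, ...)
    blah$theMedian = median(x, ...)
  }
  return(blah)
}

# Hypothetical data frame mixing character and numeric columns
diab = data.frame(ID = c("a", "b", "c"),
                  WT_KG = c(60, 70, 95),
                  HT_CM = c(150, NA, 180),
                  stringsAsFactors = FALSE)

result = sapply(diab, MandM, na.rm = TRUE)   # one column of results per variable
result
```

Because every call returns a list of the same length, sapply arranges the results in a matrix whose cells can be read with result[["theMean", "WT_KG"]].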

71
Writing Novel Functions
  • Look hard on rseek.org before you reinvent the
    wheel.
  • R syntax is very similar to C.
  • Select/Case logic is different (R short-circuits).
  • The R Book by Crawley is too big to buy for just
    this topic but it is good for syntax. Get it
    from the library and read the early chapters.
  • The final chapter of Spector has a few wonderful
    pages.

72
Destroying Efficiency
  • A matrix of data is really a vector with row and
    column attributes added to it. This has profound
    speed issues if you add to the size of a matrix
    because the data has to be shifted all over the
    place.
  • If you plan on writing your own functions to
    manipulate matrices, build an empty matrix of the
    maximum size (or guess bigger) rather than using
    the functions to add rows or columns.
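A sketch contrasting the two approaches. The row contents (i and its square) and the size n are arbitrary choices for illustration; wrapping each loop in system.time() would show the cost difference on larger n.

```r
n = 1000

# Growing: each rbind copies everything accumulated so far
grown = NULL
for (i in 1:n) {
  grown = rbind(grown, c(i, i^2))
}

# Preallocating: build the full matrix once, then fill rows in place
filled = matrix(NA_real_, nrow = n, ncol = 2)
for (i in 1:n) {
  filled[i, ] = c(i, i^2)
}

all(grown == filled)   # same contents, different cost to build
```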

73
Writing Efficient Code
  • R has decent tools for profiling code.
  • The Rprof and summaryRprof functions will help
    you figure out what is bogging down your code
  • Rprof()
  • MandM(rnorm(1000000))
  • Rprof(NULL)
  • summaryRprof()

74
Debugging in R
  • See Chapter 9 in Gentleman's book.
  • The browser() function can be put inside a
    function to pause execution and see what is going
    on.
  • The codetools package is great for tweaking big
    functions
  • findLocals(), findGlobals(),
  • shows you if variables and functions originate
    inside of a function
  • checkUsage(), and checkUsagePackage()
  • shows you what variables are modified or not
    touched in a function

75
Creating Graphs
  • Basic plots are easy but tweaking them for
    publications can be rough because the
    documentation on the function arguments is
    appalling.
  • Data Analysis and Graphics Using R by John
    Maindonald and John Braun is extremely useful.
  • There are myriad graphics built into the core of
    R plus more in the packages.
  • addictedtor.free.fr/graphiques/thumbs.php

76
Test Scores
  • scores = read.table("c:\\blah\\walkerScores.txt",
    header = TRUE)
  • rapply(scores, class)
  • scores$CENTER = as.factor(scores$CENTER)
  • scores$PAT = as.character(scores$PAT)
  • rapply(scores, class)
  • scores$isSick = ifelse(scores$SCORE > 0, 1, 0)
  • (scores$SEV = with(scores, recode(SCORE, '0 =
    "None"; 1:30 = "Mild"; 31:69 = "Moderate"; 70:100 =
    "Severe"; else = "BAD DATA"')))
  • (scores$SEV = factor(scores$SEV, levels =
    c("None", "Mild", "Moderate", "Severe"), ordered =
    TRUE))
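A base R sketch of the same grouping for readers without the car package. It swaps in cut() for recode(), so it is not the slide's method: out-of-range scores become NA rather than "BAD DATA". The SCORE values are invented.

```r
scores = data.frame(SCORE = c(0, 12, 30, 31, 69, 70, 100))   # hypothetical values

scores$isSick = ifelse(scores$SCORE > 0, 1, 0)

# cut() stands in for car::recode here; intervals are (lower, upper]
scores$SEV = cut(scores$SCORE,
                 breaks = c(-Inf, 0, 30, 69, 100),
                 labels = c("None", "Mild", "Moderate", "Severe"),
                 ordered_result = TRUE)

table(scores$SEV)
scores$SEV[2] < scores$SEV[6]    # ordered factors support comparisons
```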

77
Common Plots are Easy
  • attach(scores) to avoid typing scores$
  • plot(SEV, main = "MainTitle", xlab = "xlab", ylab =
    "ylab")
  • plot(SCORE)
  • hist(SCORE)
  • boxplot(SCORE)
  • boxplot(SCORE ~ SEX, ylim = c(0,100))
  • detach(scores)

78
Use Rcmdr (R Commander)
  • Rcmdr has a LOT of great graphics built into the
    point and click interface.
  • library(Rcmdr)
  • Look up my short course (5 talks) covering basic
    statistics to see how to code many graphics.
  • www.stanford.edu/~balise/HowToDoBiostatistics.htm

79
You are Going to Need More Help
  • Data Manipulation with R (2008) by Spector.
  • www.springerlink.com/content/t19776/
  • A must-have book on how to read and write data
    with or without SQL, manipulate data with R,
    aggregate data, and reshape datasets easily.
  • R Programming For Bioinformatics (2008) by
    Gentleman.
  • A must-have intermediate level book on how R
    object-oriented programming really works.
  • Data Analysis and Graphics Using R (2007) by
    Maindonald and Braun
  • I constantly use this book to figure out how to
    do graphics.
  • Using R for Introductory Statistics (2005) by
    Verzani.
  • lmldb.stanford.edu/cgi-bin/Pwebrecon.cgi?DBlocal
    Search_Arg0359L234926Search_CodeCMDCNT10v1
    1
  • This one has fantastic coverage of the R needed to
    do the common statistics.
  • It also has a nearly useless index at the end of
    the book.
  • The R Book (2007) or Statistical Computing (2002)
    by Crawley.
  • lmldb.stanford.edu/cgi-bin/Pwebrecon.cgi?DBlocal
    Search_Arg0359L252112Search_CodeCMDCNT10v1
    1
  • http://jenson.stanford.edu/uhtbin/cgisirsi/Yl6krJcUG8/GREEN/136790128/9
  • These have nicely written intermediate level
    statistics.
  • But they are highly redundant across the two
    books.