Title: "R" Programming for Life Sciences
1"R" Programming for Life Sciences
- Raymond R. Balise, Ph.D.
- Health Research and Policy
- SPCTRM/SCCTER
2Roadmap
- What makes R different from the rest?
- Setting up R
- Types of data
- Working with collections of data
- Importing and exporting data
- Writing functions
- Graphics
3When to Use R
- Shoestring budget
- Cutting edge statistics
- Developing your own or fine-tuning existing methods
- Local expertise
4Programming Languages
- Procedural languages
- C, Fortran, Cobol, Basic
- use a model where the logic flows from the top of the page to the bottom with calls to goto subroutines as needed
- It is hard to encapsulate the code.
- Object oriented languages
- C++, Visual Basic, Java
- involve creating objects and then operating on
them
5R is Object Oriented (OO)
- You create objects
- vector of numbers, a graphic, etc.
- You call methods/functions to operate on the objects.
- Working with an OO language requires you to learn about special methods to create, access, modify, or destroy objects and their properties.
- R hides these processes.
- It helps a lot if you want to write new
statistics and methods and is required for making
new packages.
6OO Example
- With R you write code in the editor, which I will show you in a minute.
- You can create an object which holds a bunch of numbers (a vector, if you remember math)
- You can then use (aka call) a function (aka method) to operate on the object.
- The summary() function
- Create and display a numeric summary object
- The plot() function
- Create and display a graphic summary object
7Make the ages object
Call the summary function
Call the plot function
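The three steps on this slide can be sketched roughly as follows. The screenshot itself is not in the transcript, so the age values are an assumption borrowed from the "Making a Vector" slide, and `ageSummary` is a hypothetical name.

```r
# Make the ages object (values assumed, borrowed from a later slide)
ages <- c(9, 11, 40, 41)

# Call the summary function: builds a numeric summary object, then display it
ageSummary <- summary(ages)
print(ageSummary)

# Call the plot function: draws each value against its position in the vector
plot(ages)
```

The same two function names work on many other kinds of objects, which is the point of the OO discussion on the next slides.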
8OO Programming in R
- OO programming requires
- objects
- classes
- describe specific properties for groups of objects
- inheritance
- classes related to each other (derived from other classes) have related properties
- polymorphism
- the same function name applied to different classes does different things
- R vs. Java: R typically has separate classes for actions instead of bundling them with the data structures
- Java
- Animal - domesticated - dog (walks)
- R
- Animal - domesticated - dog
- Movement - Walks
9S3 and S4
- R (really S) predates object oriented programming
- There are two OO programming systems in the core of R
- S3 (Chambers and Hastie 1992)
- did not have rigid class structure
- easy to program
- less reliable
- fine for small-time development
- S4 (Chambers 1998)
- stronger support for classes
- harder to program
- more reliable
- good for industrial strength development
- Knowing they exist helps explain differences you
see in R code and lets you modify both types of
objects.
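A minimal sketch of the difference, reusing the animal/dog idea from the previous slide. Every class, generic, and object name here (`dog`, `move`, `Animal`, `Dog`, `moveS4`, `rex`) is made up for illustration.

```r
library(methods)  # S4 tools; loaded by default in most R sessions

# S3: the class is just an attribute and methods are ordinary functions
# named generic.class
dog <- structure(list(name = "Rex"), class = c("dog", "animal"))

move <- function(x, ...) UseMethod("move")        # the generic
move.animal <- function(x, ...) "moves somehow"   # fallback found by inheritance
move.dog <- function(x, ...) "walks"              # method specific to dogs

s3Result <- move(dog)

# S4: classes are declared formally and slot types are checked
setClass("Animal", representation(name = "character"))
setClass("Dog", contains = "Animal")
setGeneric("moveS4", function(x, ...) standardGeneric("moveS4"))
setMethod("moveS4", "Dog", function(x, ...) "walks")

rex <- new("Dog", name = "Rex")
s4Result <- moveS4(rex)

# A slot of the wrong type is rejected when the object is created
badDog <- tryCatch(new("Dog", name = 42), error = function(e) e)
```

Nothing stops you from putting any class attribute on any S3 object, which is the "less reliable" part; the S4 version refuses to build `badDog`.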
10Where to Get R
- R has two main websites. One describes the project
- http://www.r-project.org/
- The other has most of the stuff you want to download
- http://cran.r-project.org/
- Because the R project has people working all over
the globe, the software download site is
mirrored everywhere. The closest mirror is USA
CA1 (aka UC Berkeley).
11http://cran.cnr.berkeley.edu/
- There is an R installer for all the common operating systems
- cran.cnr.berkeley.edu/bin/windows/base/
- cran.cnr.berkeley.edu/bin/macosx/
- cran.cnr.berkeley.edu/bin/linux/
- Each is basically self explanatory.
13Installing on Windows
- Double click the installer and just push next
until you get to this screen.
Specify that you want to do customized
startup. This will let you set up R to work with
other programs nicely.
14Customize
- Use these options, then hit next a bunch.
15- Type help.start() and press Enter to start the help.
- Type q() and press Enter to quit, but don't yet.
16GUI
17GUI
18GUI
Shows the add on packages currently accessible
19Packages in R
- User-supplied packages are typically found at one of three places
- CRAN for all kinds of stuff
- Omegahat for web-based statistics
- Bioconductor for genomic analysis
- R packages update often.
- Your colleagues will recommend task-specific packages.
- Rcmdr is my favorite.
20GUI
21GUI
This is useful
22HTML help
This is useful but not Google.
23Rseek.org is Google-driven
24Mac Install
- Download and double click the dmg file.
Click customize and make sure Tcl/Tk is checked
on.
25X11
- Some packages for R on the Mac (like Rcmdr) require X11 to be installed.
- I think it is part of the standard Leopard installation but was an option with Tiger. If you need it, try to install it off of the DVD that came with your machine because people have reported problems using the dmg files from Apple.com.
26X11 and Add-on Packages
To get add on packages use this menu.
Help Search
You can click here to make sure X11 works.
27Getting or Updating Packages
- Click Get List, click the package name, be sure
install dependencies is checked on, then click
install.
28Your First Package
- I suggest you install the Rcmdr package first thing.
- Use the Install packages option on the package menu to download Rcmdr
- To make it available for your R session type
- library(Rcmdr)
- Capitalization matters!
- The first time you run it, it will ask you if it
can download additional packages.
30Data Structures
- Vectors
- A bunch of data in a single row or column
- All of the same type
- Matrix
- A row and column arrangement of data
- All of the same type
- Data frame
- A row and column arrangement of data
- Columns are of different types
- List
- Very free form structure
- A grouping of different types of data
Like a good spreadsheet or relational database
file
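A small sketch of the four structures side by side. The object names `m`, `people`, and `stuff` are invented for this example; the ages and names come from later slides.

```r
# Vector: one dimension, a single type
ages <- c(9, 11, 40, 41)

# Matrix: rows and columns, still a single type
m <- matrix(1:6, nrow = 2, ncol = 3)

# Data frame: columns can differ in type but all have the same length
people <- data.frame(
  dude = c("Larry", "Moe", "Curly", "Shemp"),
  age = ages,
  stringsAsFactors = FALSE
)

# List: free form, elements can be of any type and any length
stuff <- list(ages = ages, grid = m, note = "anything goes")

# Ask R what each one is
class(ages)
class(m)
class(people)
class(stuff)
```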
31Types of Data Vectors
- Numeric
- Integer, real, and complex are different types but you will not need to pay attention to the details
- NA means missing
- NaN means not a number
- String
- Characters of the alphabet
- Logical
- TRUE, FALSE or NA
32Making a Vector
- Make a sequence
- OneToThirty <- seq(1, 30)
- oneToThirty <- seq(1, 30, by = 2)
- x1to30 <- 1:30
- (x1to30 <- 1:30)
- c stands for concatenate
- ages <- c(9, 11, 40, 41)
- dudes <- c("Larry", "Moe", "Curly", "Shemp")
R is case sensitive.
Surround the expression with () to display the
result automatically.
33Recycling and Vectorizing
- You can add one to all four ages.
- ages + c(1,1,1,1)
- If you provide the scalar integer, R will temporarily vectorize the 1 by recycling that value to match the length of the ages vector.
- ages + 1
- It will recycle a series also.
- ages + c(1,2)
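The three additions on this slide, written out so the results can be compared. The result names (`explicit`, `recycled`, `paired`) are only for illustration.

```r
ages <- c(9, 11, 40, 41)

# Spell out the vector of ones
explicit <- ages + c(1, 1, 1, 1)   # 10 12 41 42

# Let R recycle the single 1 across all four elements
recycled <- ages + 1               # 10 12 41 42

# A two-element series is recycled as 1, 2, 1, 2
paired <- ages + c(1, 2)           # 10 13 41 43
```

If the longer length is not a multiple of the shorter one, R still recycles but issues a warning.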
34Naming Parts of a Vector
- You can assign names to the elements of a vector. This allows later access to the elements using the names instead of the position.
- names(ages) <- dudes
- To erase them
- names(ages) <- NULL
- Notice what happens when the lengths differ
- dudes <- c("Larry", "Moe", "Curly")
- names(ages) <- dudes
- ages
35Getting at Parts of a Vector
- Specify the element number.
- heyMoe <- ages[2]
- Specify to drop everything except the element number.
- heyMoe <- ages[c(-1, -3, -4)]
- Specify the name.
- heyMoe <- ages["Moe"]
- This only returns the first one if there are duplicates.
- names(ages)[4] <- "Moe"
- heyMoe <- names(ages) %in% "Moe"
36Getting Parts with Logic
- You can use the logical values TRUE and FALSE to subset also.
- The vector letters ships with R and includes the letters of the alphabet. You can get even numbered letters by using recycling
- letters[c(FALSE, TRUE)]
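The naming and subsetting ideas from the last three slides, pulled into one runnable sketch. The names `byPosition`, `byDropping`, `byName`, `firstMoe`, `allMoes`, and `evenLetters` are mine, not from the slides.

```r
ages <- c(9, 11, 40, 41)
dudes <- c("Larry", "Moe", "Curly", "Shemp")
names(ages) <- dudes

byPosition <- ages[2]               # element number
byDropping <- ages[c(-1, -3, -4)]   # drop everything except element 2
byName <- ages["Moe"]               # element name

# With duplicated names, subsetting by name returns only the first match,
# while a logical vector built with %in% keeps every match
names(ages)[4] <- "Moe"
firstMoe <- ages["Moe"]
allMoes <- ages[names(ages) %in% "Moe"]

# Logical subsetting with recycling: FALSE, TRUE repeats down the alphabet
evenLetters <- letters[c(FALSE, TRUE)]
```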
37Smarter Access to a Vector
- You can use logic checks to find the record numbers in a vector which meet your criteria.
- ages
- which(ages
- You can then subset down your data to the records of interest using the subset operator.
- ageswhich(ages
- agesages
38Choosing Values
- If you need specific values you can use the & (and) or the | (or) operators to get the ordered set of TRUE and FALSE values.
- ages 21 ages
- ! means not
- !(ages 21 ages
- Notice that it is applying the one logic check to the vector of ages. How does it do that?
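A sketch of logic checks on a vector. The comparison values on these two slides did not survive in the transcript, so the cutoff of 21 and the bounds of 10 and 41 below are assumptions chosen only to make the example run.

```r
ages <- c(9, 11, 40, 41)
cutoff <- 21                      # hypothetical threshold

# The single cutoff is recycled against every element of ages
isYoung <- ages < cutoff          # TRUE TRUE FALSE FALSE

# which() turns the TRUE/FALSE values into record numbers
youngRows <- which(isYoung)       # 1 2

# Either form can go inside the subset operator
viaWhich <- ages[which(ages < cutoff)]
viaLogic <- ages[ages < cutoff]

# Combine checks with & (and) or | (or), negate with !
inRange <- ages > 10 & ages < 41  # hypothetical bounds
outOfRange <- !(ages > 10 & ages < 41)
```

This also answers the question on the slide: the one comparison value is recycled to the length of `ages`, just like the `+ 1` example earlier.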
39Comparing Against Vectors
- What happens when you try to compare a vector to a set of things?
- gender <- c(NA, "Male", "Female", "Blue")
- gender == "Male" | gender == "Female"
- gender == c("Male", "Female")
- R recycles the shorter vector to be the longer length, then does the comparison. Use the %in% operator if you want to compare as if you wrote a series of or statements.
- gender %in% c("Male", "Female")
This one uses recycling.
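Running the three comparisons shows why the recycled version is a trap. The result names are invented for the example.

```r
gender <- c(NA, "Male", "Female", "Blue")

# Two explicit checks joined with or
orResult <- gender == "Male" | gender == "Female"   # NA TRUE TRUE FALSE

# == recycles c("Male", "Female") to Male, Female, Male, Female and
# compares position by position, so nothing lines up here
recycledResult <- gender == c("Male", "Female")     # NA FALSE FALSE FALSE

# %in% asks "is each value anywhere in the set?" and never returns NA
inResult <- gender %in% c("Male", "Female")         # FALSE TRUE TRUE FALSE
```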
40Recoding Values in a Vector
- R has functions like if and ifelse to process values. Other packages like car have very useful functions like recode
- newAge ifelse(ages
- newAge ifelse(ages"Old", "Fossilized")
- library(car)
- newAge2 <- recode(ages, '1:21="Young"; else="Old"')
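A base R sketch of recoding with ifelse. The ifelse calls on the slide are cut off in the transcript, so the cutoffs used here (21 and 41) and the name `newAge3` are assumptions for illustration; only the labels come from the slide.

```r
ages <- c(9, 11, 40, 41)

# Two groups: one test, one value for TRUE, one for FALSE
newAge <- ifelse(ages < 21, "Young", "Old")

# Three groups: nest a second ifelse in the FALSE branch
# (cutoffs are hypothetical)
newAge3 <- ifelse(ages < 21, "Young",
                  ifelse(ages < 41, "Old", "Fossilized"))
```

Nested ifelse calls get hard to read past three or four groups, which is where a range-based function like recode from the car package earns its keep.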
41Arrays and Matrices
- If you add a dimension attribute to a vector, you get an array. If the array is two-dimensional, it is a matrix.
- lets <- letters[1:10]
- typeof(lets); class(lets); attributes(lets)
- letsA <- array(lets)
- typeof(letsA); class(letsA); attributes(letsA)
- letsA2 <- array(lets, dim = c(5,2))
- typeof(letsA2); class(letsA2); attributes(letsA2)
- letsM <- matrix(lets, 5, 2)
- typeof(letsM); class(letsM); attributes(letsM)
42Out of a Matrix
- Use the same subset operator logic to get information out of a matrix.
- letsM
- letsM[1, ]
- letsM[ ,2]
- numb <- array(1:27, dim = c(3,3,3))
- numb[,,2]
- numb[c(2,3), c(1,2), 3]
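The matrix and array subsetting from these two slides as one runnable sketch. The result names (`firstRow`, `secondCol`, `slab`, `block`) are mine.

```r
lets <- letters[1:10]

# A matrix is the same ten letters plus a dim attribute, filled by column
letsM <- matrix(lets, 5, 2)
attributes(letsM)

firstRow <- letsM[1, ]    # "a" "f"
secondCol <- letsM[, 2]   # "f" "g" "h" "i" "j"

# A three-dimensional array takes three subscripts
numb <- array(1:27, dim = c(3, 3, 3))
slab <- numb[, , 2]                  # the middle 3 x 3 layer, values 10 to 18
block <- numb[c(2, 3), c(1, 2), 3]   # rows 2-3, columns 1-2 of the last layer
```

Leaving a subscript blank means "everything along this dimension".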
43Categorical Variables
- R makes a distinction between variables holding a bunch of characters from the alphabet and variables holding categorical variables. If you have a classification/categorical variable, you want R to treat it as a factor or an ordered factor. Typical factors are treatment or gender.
- dose <- c("low", "placebo", "high", "low")
- dose
- typeof(dose)
44Factors
- To convert a character variable to a factor, use the as.factor function.
- doseF <- as.factor(dose)
- typeof(doseF)
- class(doseF)
- Behind the scenes, the character variable is converted into numbers and the numbers are given character strings to display.
- In R 2.7.1 the levels of the factor are ordered alphabetically and the first one is represented with the digit 1, the second is 2, etc.
There are is.* predicate functions to check object types and as.* functions to convert between types of objects.
45Comparing Factors
- You can compare a factor vs. a constant value.
- doseF == "high"
- as.integer(doseF) == 1
- Or you can compare vs. vectors (CAREFULLY).
- doseF == c("high", "low")
- doseF %in% c("high", "low")
- R will stop you from comparing factors that have different categories.
- doseF2 <- as.factor(c("blah", "placebo", "high", "low"))
- doseF == doseF2
Notice wrong answer thanks to recycling.
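A runnable sketch of the factor comparisons. The result names and the tryCatch wrapper are additions so the failing comparison can be inspected instead of stopping the script.

```r
dose <- c("low", "placebo", "high", "low")
doseF <- as.factor(dose)

levels(doseF)                # "high" "low" "placebo" (alphabetical)
codes <- as.integer(doseF)   # 2 3 1 2, the numbers stored behind the scenes

# == recycles high, low, high, low against the four values
recycled <- doseF == c("high", "low")   # FALSE FALSE TRUE TRUE

# %in% checks membership, which is usually what you meant
member <- doseF %in% c("high", "low")   # TRUE FALSE TRUE TRUE

# Factors with different level sets cannot be compared with ==
doseF2 <- as.factor(c("blah", "placebo", "high", "low"))
attempt <- tryCatch(doseF == doseF2, error = function(e) e)
conditionMessage(attempt)
```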
46Recoding Factors
- Often you will want to regroup factor levels.
- amount <- as.factor(c("placebo", "10mg", "5mg", "10mg"))
- levels(amount)
- regroup <- list(none = "placebo", some = c("5mg", "10mg"))
- levels(amount) <- regroup
- amount
none = placebo; some = 5mg, 10mg
47Numeric Factors
- If you have a numeric factor, be careful converting from factors back to numbers.
- ID <- c(1000, 1000, 1001, 2)
- IDf <- factor(ID)
- as.integer(IDf)
- levels(IDf)
- numbersAgain <- as.numeric(levels(IDf))[IDf]
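The regrouping and the numeric round trip from the last two slides, with the intermediate values captured so you can see what goes wrong. The names `regrouped` and `wrong` are mine.

```r
# Regroup levels by assigning a named list to levels()
amount <- as.factor(c("placebo", "10mg", "5mg", "10mg"))
regroup <- list(none = "placebo", some = c("5mg", "10mg"))
levels(amount) <- regroup
regrouped <- as.character(amount)   # "none" "some" "some" "some"

# Numeric factors: as.integer() returns the internal codes, not the IDs
ID <- c(1000, 1000, 1001, 2)
IDf <- factor(ID)
wrong <- as.integer(IDf)            # 2 2 3 1

# Convert the level labels to numbers, then index them by the factor
numbersAgain <- as.numeric(levels(IDf))[IDf]   # 1000 1000 1001 2
```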
48 Loading Text Data into R
- Reading text files
- fakeAlleles <- read.table("c:\\blah\\fakeAlleles.txt", header = TRUE)
- See if it worked
- fakeAlleles
- names(fakeAlleles)
- summary(fakeAlleles)
- fakeAlleles$dude <- as.character(fakeAlleles$dude)
- fakeAlleles
- A better option
- fakeAlleles <- read.table("c:\\blah\\fakeAlleles.txt", header = TRUE, colClasses = c("character", "factor", "factor"))
49Other Text Formats
- Other text reading methods
- read.csv: comma separated values
- read.csv2: semicolon delimited files
- read.delim: read tab delimited files
- read.fwf: read fixed width format files
- Use same options as read.table
- If the data has bad or no column headings you may also want to include
- read.table( stuff, col.names = c("name1", "name2") )
- To prevent characters from coming in as factors
- options(stringsAsFactors = FALSE)
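A self-contained sketch of why colClasses is "a better option" when reading text. The real fakeAlleles.txt is not part of the transcript, so the three records written to a temporary file below are invented; only the column names come from the slides.

```r
# Write a small stand-in file (contents are hypothetical)
path <- tempfile(fileext = ".txt")
writeLines(c("dude allele1 allele2",
             "001 A A",
             "002 A G",
             "006 G G"), path)

# Default import: R guesses the types, so the IDs become integers
# and lose their leading zeros
guessed <- read.table(path, header = TRUE)
sapply(guessed, class)

# Stating the classes up front keeps the IDs as text and makes
# the alleles factors
typed <- read.table(path, header = TRUE,
                    colClasses = c("character", "factor", "factor"))
sapply(typed, class)
```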
50Data Frames
- The data is imported into a data frame.
- class(fakeAlleles)
- A data frame is really a list of vectors where the vectors are all the same length.
- as.list(fakeAlleles)
- To select a column you specify the data frame $ variable name.
- theDudes <- fakeAlleles$dude
- All the stuff you saw for logic checks on vectors can be used on the parts of a data frame.
- fakeAlleles$allele1 == "A"
51Subsetting Vectors (again)
- Recall that you can subset using the [ ] operator
- ages <- c(9, 11, 40, 41)
- heyMoe <- ages[2]
- ages
- agesages
- The same voodoo works on the vectors that make up data frames!
- dudeTwo <- fakeAlleles$dude[2]
52Subsetting Data Frames
- Parts (subsets) of data frames are referenced by "row numbers comma column numbers"
- The first record: fakeAlleles[1, ]
- The 2nd and 3rd columns: fakeAlleles[ , c(2,3)]
- The genotype for record 6: fakeAlleles[6, c(2,3)]
- or by names
- fakeAlleles[ , c("allele1", "allele2")]
53Named Rows
- You can name your rows also
- fakeAlleles <- read.table("c:\\blah\\fakeAlleles.txt", header = TRUE, colClasses = c("character", "factor", "factor"))
- row.names(fakeAlleles) <- fakeAlleles$dude
- fakeAlleles <- fakeAlleles[ , -1]
- fakeAlleles["006", c("allele1", "allele2")]
54Subsetting Using Logic
- You can use logic checks to subset
- fakeAlleles$allele1 == "A" & fakeAlleles$allele2 == "A"
- fakeAlleles[fakeAlleles$allele1 == "A" & fakeAlleles$allele2 == "A", ]
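The data frame subsetting from the last few slides on a tiny stand-in data set. The three records are invented (the real file is not in the transcript), and joining the two checks with & is an assumption, since the operator did not survive extraction.

```r
# Hypothetical stand-in for the imported fakeAlleles data
fakeAlleles <- data.frame(
  dude = c("001", "002", "006"),
  allele1 = c("A", "A", "G"),
  allele2 = c("A", "G", "G"),
  stringsAsFactors = FALSE
)

firstRecord <- fakeAlleles[1, ]           # rows go before the comma
genotypeCols <- fakeAlleles[, c(2, 3)]    # columns go after it

# Name the rows by ID, then drop the now-redundant ID column
row.names(fakeAlleles) <- fakeAlleles$dude
fakeAlleles <- fakeAlleles[, -1]
dude006 <- fakeAlleles["006", c("allele1", "allele2")]

# A logical vector in the row position keeps the rows where it is TRUE
bothA <- fakeAlleles[fakeAlleles$allele1 == "A" & fakeAlleles$allele2 == "A", ]
```

Note the trailing comma in the last line: without it R would treat the logical vector as a column selector.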
55Importing From Excel
- If you have Perl on your machine, you can use the read.xls() function in the gdata library to easily get data out of Excel and into a data frame.
- Mac has Perl
- Windows
- www.activestate.com/Products/activeperl/index.mhtml
56Using read.xls
- Windows
- library(gdata)
- diab <- read.xls("c:\\blah\\walkerDiab.xls")
- Mac
- library(gdata)
- read.xls("/users/balise/desktop/walkerDiab.xls")
- It's that easy! Behind the scenes it is converting the xls file into a csv so you can use the text importing options.
- Do summary() on the data frame and notice what happens to the missing value.
57RODBC
- ODBC is a language/convention for accessing databases. R allows you to use ODBC connections to burrow directly into databases and other data containers like Excel.
- library(RODBC)
- channels")
- sqlQuery(channel, "select * from [Sheet1$]")
- odbcCloseAll()
58SQL
- If you have to learn one programming language, learn SQL.
- With it you can manipulate data stored in nearly every commercial database.
- You can aggregate, subset and modify data.
- It is well implemented inside of both R and SAS.
- SQL with R is nicely documented in Spector's (2008) Data Manipulation with R. It is a must own for people who want to learn R.
59Exporting Text Files
- R can write objects full of data, including data frames, into text files.
- By default it will quote the character string and fill in the letters NA where there were originally missing values.
- This code exports back to the original appearance.
- write.table(diab, file = "c:\\blah\\exported.tab", sep = "\t", quote = FALSE, na = "")
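A sketch of the export using a temporary file so it runs anywhere. The two-row `diab` stand-in and its `ID` column are invented; `WT_KG` is a column name used later in the talk. The second call, with row.names = FALSE, is my addition and not on the slide.

```r
# Hypothetical stand-in for the diab data frame, with one missing weight
diab <- data.frame(ID = c("a1", "a2"), WT_KG = c(70.5, NA),
                   stringsAsFactors = FALSE)

# Same options as the slide: tab separated, no quotes, blanks for missing
out <- tempfile(fileext = ".tab")
write.table(diab, file = out, sep = "\t", quote = FALSE, na = "")
exported <- readLines(out)
exported   # row names are written as an extra first column

# Optional extra: suppress the row names if the original file had none
out2 <- tempfile(fileext = ".tab")
write.table(diab, file = out2, sep = "\t", quote = FALSE, na = "",
            row.names = FALSE)
plain <- readLines(out2)
```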
60Creating Programs
- You can write line-by-line instructions in the R console, use the editors built into R, or use a third party editor (like Tinn-R for Windows).
- Console
- Type history() to see the lines you have submitted recently and then save to a file and re-run it later if needed.
- Built-in Editor
- Mac: Click the blank page at top of the console
- Windows: File > New Script
61Windows Tinn-R Editor
- http://www.sciviews.org/Tinn-R/index.html
62Calling Functions
- R has a plethora of polymorphic functions to help you summarize and visualize your data.
- mean(diab)
- If run with a vector of numbers, it returns the mean.
- If run with a data frame, it returns a set of means.
- This is an old S3 method
- isS4(mean)
- methods(mean)
63Polymorphism is Fun
- Plot does different things depending on the function arguments
- plot(diab)
- plot(diab$WT_KG, diab$HT_CM)
- plot(diab$WT_KG)
- Take a look at how it works
- isS4(plot)
- methods(plot)
- getAnywhere(plot.factor)
64Arguments
- Functions try to match arguments in the order they appear in the definitions in the help files.
- ?plot
- You can explicitly reference the arguments' full names and they typically allow abbreviations (but don't obfuscate your code).
- The ... means that other arguments are allowed and are passed along the class hierarchy to other methods. Look at plot to see this.
- The ... makes it easy to write buggy code because if you misspell the argument name, it is silently passed along.
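A tiny demonstration of the silent-misspelling problem, using mean() and a made-up vector `x`.

```r
x <- c(1, 2, NA)

# Correct argument name: the missing value is dropped
good <- mean(x, na.rm = TRUE)   # 1.5

# Misspelled argument name: no error and no warning, because rm.na is
# swallowed by ... and the real na.rm keeps its default of FALSE
oops <- mean(x, rm.na = TRUE)   # NA
```

The only clue that something went wrong is the NA in the result.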
65 Writing Functions
- You can easily write functions, but notice that the last thing calculated is returned
- MandM <- function(x) {mean(x); median(x)}
- MandM(diab$WT_KG) returns only the median
- Store the values you want into a list
- MandM <- function(x) {
- blah <- list(theMean = 0, theMedian = 0)
- blah$theMean <- mean(x)
- blah$theMedian <- median(x)
- return (blah)
- }
- MandM(diab$WT_KG)
66Other Arguments
- MandM(diab$HT_CM)
- It points out that we need to deal with missing values. Look up mean and median and you will see they allow the na.rm parameter to determine if missing values are dropped.
- Using MandM(diab$HT_CM, na.rm = TRUE) does not work because the parameter list does not allow it. We want to allow that parameter to be passed along. So rewrite the function.
67M and M Again
- Recall that ... in the argument list means "other stuff"
- MandM <- function(x, ...) {
- blah <- list(theMean = 0, theMedian = 0)
- blah$theMean <- mean(x, ...)
- blah$theMedian <- median(x, ...)
- return (blah)
- }
- MandM(diab$WT_KG)
- MandM(diab$HT_CM, na.rm = TRUE)
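The same function laid out as a runnable block. The diab data is not in the transcript, so `heights` is a made-up vector standing in for the HT_CM column, and `withNA` and `dropped` are illustrative names.

```r
# Return the mean and median together, passing extra arguments along
MandM <- function(x, ...) {
  blah <- list(theMean = 0, theMedian = 0)
  blah$theMean <- mean(x, ...)
  blah$theMedian <- median(x, ...)
  return(blah)
}

heights <- c(150, 160, NA, 180)   # hypothetical values with one missing

withNA <- MandM(heights)                  # both results are NA
dropped <- MandM(heights, na.rm = TRUE)   # na.rm travels through ...
```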
68Applying Your Function
- R does allow you to write loops to iterate over records or variables but if you are not writing novel math functions, they can generally be avoided.
- R will try to vectorize and process
- MandM(c(diab$WT_KG, diab$WT_KG))
- Use sapply to apply a function to a data frame
- sapply(diab, MandM, rm.na = TRUE)
- sapply(diab, MandM, na.rm = TRUE)
Notice the error handling.
69Better M and M
- MandM <- function(x, ...) {
- blah <- list(theMean = 0, theMedian = 0)
- if(is.numeric(x) == TRUE) {
- blah$theMean <- mean(x, ...)
- blah$theMedian <- median(x, ...)
- }
- return (blah)
- }
- sapply(diab, MandM, na.rm = TRUE)
70Yummy M and M
- MandM <- function(x, ...) {
- blah <- list(theMean = NaN, theMedian = NaN)
- if(is.numeric(x) == TRUE) {
- blah$theMean <- mean(x, ...)
- blah$theMedian <- median(x, ...)
- }
- return (blah)
- }
- MandM(diab$WT_KG)
- MandM(diab$HT_CM, na.rm = TRUE)
- sapply(diab, MandM, na.rm = TRUE)
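The final version applied to a small stand-in data frame. The three rows of `diab` below are invented; only the column names appear in the talk.

```r
# Non-numeric columns keep the NaN placeholders instead of raising errors
MandM <- function(x, ...) {
  blah <- list(theMean = NaN, theMedian = NaN)
  if (is.numeric(x)) {
    blah$theMean <- mean(x, ...)
    blah$theMedian <- median(x, ...)
  }
  return(blah)
}

# Hypothetical data: one text column, two numeric, one missing height
diab <- data.frame(SEX = c("F", "M", "F"),
                   WT_KG = c(60, 80, 70),
                   HT_CM = c(160, NA, 170),
                   stringsAsFactors = FALSE)

# One column of results per variable, one row per statistic
result <- sapply(diab, MandM, na.rm = TRUE)
result
```

Because MandM returns a list, sapply hands back a matrix of list cells; pull a single number out with double brackets, for example `result[["theMean", "WT_KG"]]`.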
71Writing Novel Functions
- Look hard on rseek.org before you reinvent the wheel.
- R syntax is very similar to C.
- Select/Case logic is different (R short-circuits).
- The R Book by Crawley is too big to buy for just this topic but it is good for syntax. Get it from the library and read the early chapters.
- The final chapter of Spector has a few wonderful pages.
72Destroying Efficiency
- A matrix of data is really a vector with row and column attributes added to it. This has profound speed issues if you add to the size of a matrix because the data has to be shifted all over the place.
- If you plan on writing your own functions to manipulate matrices, build an empty matrix of the maximum size (or guess bigger) rather than using the functions to add rows or columns.
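A sketch of the two approaches, filling a two-column matrix with i and i squared. The size `n` and the names `grown` and `filled` are arbitrary choices for the example.

```r
n <- 1000

# Slow pattern: rbind() copies the whole matrix every time a row is added
grown <- NULL
for (i in 1:n) {
  grown <- rbind(grown, c(i, i^2))
}

# Faster pattern: build the full-size matrix once, then fill it in place
filled <- matrix(NA_real_, nrow = n, ncol = 2)
for (i in 1:n) {
  filled[i, ] <- c(i, i^2)
}
```

Wrapping each loop in system.time() shows the gap, and it widens quickly as n grows.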
73Writing Efficient Code
- R has decent tools for profiling code.
- The Rprof and summaryRprof functions will help you figure out what is bogging down your code
- Rprof()
- MandM(rnorm(1000000))
- Rprof(NULL)
- summaryRprof()
74Debugging in R
- See Chapter 9 in Gentleman's book.
- The browser() function can be put inside a function to pause execution and see what is going on.
- The codetools package is great for tweaking big functions
- findLocals(), findGlobals(),
- shows you if variables and functions originate inside of a function
- checkUsage(), and checkUsagePackage()
- shows you what variables are modified or not touched in a function
75Creating Graphs
- Basic plots are easy but tweaking them for publications can be rough because the documentation on the function arguments is appalling.
- Data Analysis and Graphics Using R by John Maindonald and John Braun is extremely useful.
- There are myriad graphics built into the core of R plus more in the packages.
- addictedtor.free.fr/graphiques/thumbs.php
76Test Scores
- scores <- read.table("c:\\blah\\walkerScores.txt", header = TRUE)
- rapply(scores, class)
- scores$CENTER <- as.factor(scores$CENTER)
- scores$PAT <- as.character(scores$PAT)
- rapply(scores, class)
- scores$isSick <- ifelse(scores$SCORE > 0, 1, 0)
- (scores$SEV <- with(scores, recode(SCORE, '0="None"; 1:30="Mild"; 31:69="Moderate"; 70:100="Severe"; else="BAD DATA"')))
- (scores$SEV <- factor(scores$SEV, levels = c("None", "Mild", "Moderate", "Severe"), ordered = TRUE))
77Common Plots are Easy
- attach(scores) to avoid typing scores$
- plot(SEV, main = "MainTitle", xlab = "xlab", ylab = "ylab")
- plot(SCORE)
- hist(SCORE)
- boxplot(SCORE)
- boxplot(SCORE ~ SEX, ylim = c(0,100))
- detach(scores)
78Use Rcmdr (R Commander)
- Rcmdr has a LOT of great graphics built into the point and click interface.
- library(Rcmdr)
- Look up my short course (5 talks) covering basic statistics to see how to code many graphics.
- www.stanford.edu/balise/HowToDoBiostatistics.htm
79You are Going to Need More Help
- Data Manipulation with R (2008) by Spector.
- www.springerlink.com/content/t19776/
- A must-have book on how to read and write data with or without SQL, manipulate data with R, aggregate data, and reshape datasets easily.
- R Programming For Bioinformatics (2008) by Gentleman.
- A must-have intermediate level book on how R object-oriented programming really works.
- Data Analysis and Graphics Using R (2007) by Maindonald and Braun
- I constantly use this book to figure out how to do graphics.
- Using R for Introductory Statistics (2005) by Verzani.
- lmldb.stanford.edu/cgi-bin/Pwebrecon.cgi?DBlocalSearch_Arg0359L234926Search_CodeCMDCNT10v11
- This one has fantastic coverage of the R needed to do the common statistics.
- It also has a nearly useless index at the end of the book.
- The R Book (2007) or Statistical Computing (2002) by Crawley.
- lmldb.stanford.edu/cgi-bin/Pwebrecon.cgi?DBlocalSearch_Arg0359L252112Search_CodeCMDCNT10v11
- http://jenson.stanford.edu/uhtbin/cgisirsi/Yl6krJcUG8/GREEN/136790128/9
- These have nicely written intermediate level statistics.
- But they are highly redundant across the two books.