Title: Introduction to R
1 Introduction to R
3.11.13
- Dror Hollander
- Gil Ast Lab
- Sackler Medical School
2 Lecture Overview
- What is R and why use it?
- Setting up R and RStudio for use
- Calculations, functions and variable classes
- File handling, plotting and graphic features
- Statistics
- Packages and writing functions
3 What is R?
- R is a freely available language and environment
for statistical computing and graphics
- Much like [logo], but better!
4 Why use R?
- SPSS and Excel users are limited in their ability
to change their environment. The way they
approach a problem is constrained by how Excel
and SPSS were programmed to approach it
- The users have to pay money to use the software
5 R's Strengths
- Data management and manipulation
- Statistics
- Graphics
- Programming language
- Active user community
- Free
6 R's Weaknesses
- Not very user-friendly at first
- No commercial support
- Substantially slower than other programming languages
(e.g. Perl, Java, C)
8 Installing R
- Go to the R homepage, http://www.r-project.org/,
and follow the installation instructions
9 Installing RStudio
- RStudio is an integrated development environment
(IDE) for R
- Install the desktop edition from this link:
http://www.rstudio.org/download/
10 Using RStudio
- Script editor
- View variables in workspace and history file
- View help, plots and files; manage packages
- R console
11 Set Up Your Workspace
- Create your working directory
- Open a new R script file
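The same setup can also be done from the R console. This is a minimal sketch; the folder name `r_intro` is a hypothetical choice, and `tempdir()` is used only so the example runs anywhere.

```r
# Hypothetical folder name; tempdir() keeps the example self-contained.
work_dir <- file.path(tempdir(), "r_intro")

dir.create(work_dir, showWarnings = FALSE)  # create the working directory
old_dir <- setwd(work_dir)                  # switch to it (the previous directory is returned)
getwd()                                     # confirm where R now reads and writes files
```

In practice you would replace `work_dir` with a folder of your own, such as the one holding your data files.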
13 R - Basic Calculations
- Operators take values (operands), operate on
them, and produce a new value
- Basic calculations (numeric operators):
+ , - , / , * , ^
- Let's try an example. Run this:
(17*0.35)^(1/3)
- Before you do:
- In the script editor, use # to write comments (script lines that are
ignored when run)
- Click Run or press Ctrl+Enter to run code in RStudio
- The result appears in the R console
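A short sketch of the numeric operators in use. It assumes the slide's example expression reads `(17*0.35)^(1/3)`; the operators were lost in extraction, so that reading is a reconstruction.

```r
# Each numeric operator applied to the same two operands
17 + 0.35   # addition
17 - 0.35   # subtraction
17 * 0.35   # multiplication
17 / 0.35   # division
17 ^ 2      # exponentiation

# The example expression: the cube root of 17 * 0.35.
# Parentheses matter, because ^ binds more tightly than * and /.
result <- (17 * 0.35)^(1/3)
result
```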
14 R - Basic Functions
- All R operations are performed by functions
- Calling a function:
> function_name(x)
- For example:
> sqrt(9)
[1] 3
- Reading a function's help file:
> ?sqrt
- Also, when in doubt, Google it!
15 R Variables
- A variable is a symbolic name given to stored
information
- Variables are assigned using either = or <-
> x <- 12.6
> x
[1] 12.6
16 R Variables - Numeric Vectors
- A vector is the simplest R data structure. A
numeric vector is a single entity consisting of a
collection of numbers.
- It may be created:
- Using the c() function (concatenate)
> x = c(3,7.6,9,11.1)
> x
[1]  3.0  7.6  9.0 11.1
- Using the rep(what,how_many_times) function (replicate)
> x = rep(10.2,3)
- Using the : operator, signifying a series of integers
> x = 4:15
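The three ways of building a numeric vector can be compared side by side. This is a small sketch; the variable names `y` and `z` are added here only to keep all three vectors available at once.

```r
x <- c(3, 7.6, 9, 11.1)   # c() concatenates values into one vector
y <- rep(10.2, 3)         # the value 10.2 repeated three times
z <- 4:15                 # the integers 4 through 15

length(x)   # 4
length(y)   # 3
length(z)   # 12

class(x)    # "numeric"
class(z)    # "integer" (the : operator produces integers)
```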
17 R Variables - Character Vectors
- Character strings are written in quotes (R always
prints them double quoted)
- Vectors made of character strings:
> y = c("I","want","to","go","home")
> y
[1] "I"    "want" "to"   "go"   "home"
- Using rep():
> rep("bye",2)
[1] "bye" "bye"
- Notice the difference using paste() (1 element):
> paste("I","want","to","go","home")
[1] "I want to go home"
18 R Variables - Boolean Vectors
- Logical: either FALSE or TRUE
> 5>3
[1] TRUE
> x = 1:5
> x
[1] 1 2 3 4 5
> x<3
[1]  TRUE  TRUE FALSE FALSE FALSE
- The result can be stored in a variable:
> z = x<3
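Character and logical vectors can be explored a little further with a few base R helpers. This is an illustrative sketch; the names `sentence` and `gene_labels` are not from the slides.

```r
# Character vectors: many elements versus one pasted string
y <- c("I", "want", "to", "go", "home")
sentence <- paste(y, collapse = " ")   # collapse joins the elements into one string
length(y)          # 5 elements
length(sentence)   # 1 element
gene_labels <- paste("gene", 1:3, sep = "_")   # paste() is vectorised

# Logical vectors: a comparison result can be stored and summarised
x <- 1:5
z <- x < 3
sum(z)     # number of TRUE values
any(z)     # is at least one element TRUE?
all(z)     # are all elements TRUE?
which(z)   # positions of the TRUE values
```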
19 RStudio Workspace and History
- Let's review the workspace and history tabs
in RStudio
20 Manipulation of Vectors
- Our vector: x = c(101,102,103,104)
- [ ] are used to access elements in x
- Extract 2nd element in x:
> x[2]
[1] 102
- Extract 3rd and 4th elements in x:
> x[3:4]   (or x[c(3,4)])
[1] 103 104
21 Manipulation of Vectors Cont.
> x
[1] 101 102 103 104
- Add 1 to all elements in x:
> x+1
[1] 102 103 104 105
- Multiply all elements in x by 2:
> x*2
[1] 202 204 206 208
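A compact sketch of indexing and element-wise arithmetic on the same vector. Negative indexing and the names `shifted` and `doubled` are additions for illustration.

```r
x <- c(101, 102, 103, 104)

x[2]         # second element
x[3:4]       # third and fourth elements
x[c(3, 4)]   # the same selection, written with c()
x[-1]        # a negative index drops that element

shifted <- x + 1   # arithmetic is applied to every element
doubled <- x * 2

x   # unchanged: the operations return new vectors unless you reassign x
```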
22 More Operators
- Comparison operators
- Equal ==
- Not equal !=
- Less / greater than < / >
- Less / greater than or equal <= / >=
- Boolean (either FALSE or TRUE)
- And &
- Or |
- Not !
23 Manipulation of Vectors Cont.
- Our vector: x = 100:150
- Elements of x higher than 145:
> x[x>145]
[1] 146 147 148 149 150
- Elements of x higher than 135 and lower than 140:
> x[ x>135 & x<140 ]
[1] 136 137 138 139
24 Manipulation of Vectors Cont.
- Our vector:
> x = c("I","want","to","go","home")
- Elements of x that do not equal "want":
> x[x != "want"]
[1] "I"    "to"   "go"   "home"
- Elements of x that equal "want" or "home":
> x[x %in% c("want","home")]
[1] "want" "home"
Note: use == for 1 element and %in% for
several elements
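Filtering works because the comparison inside the brackets produces a logical vector, and only the positions that are TRUE are kept. A minimal sketch; the names `keep` and `words` are illustrative.

```r
x <- 100:150
keep <- x > 135 & x < 140   # one TRUE/FALSE per element of x
x[keep]                     # only the elements where keep is TRUE
sum(x > 145)                # counting matches: TRUE counts as 1

words <- c("I", "want", "to", "go", "home")
words[words != "want"]                      # everything except "want"
words[words %in% c("want", "home")]         # membership in a set of values
words[words == "want" | words == "home"]    # the same selection, spelled out with |
```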
25 R Variables - Data Frames
- A data frame is simply a table
- Each column may be of a different class (e.g.
numeric, character, etc.)
- The number of elements in each
row must be identical
- Accessing elements in a data frame:
x[row,column]
- The age column:
> x$age   or   > x[,"age"]   or   > x[,1]
- All male rows:
> x[x$gender=="M",]
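The table from the slide is not available in the text, so the sketch below builds a small hypothetical data frame with `age` and `gender` columns (the values are made up) and applies the same access patterns.

```r
# Hypothetical data; only the column names follow the slide.
x <- data.frame(
  age    = c(25, 31, 47, 19),
  gender = c("M", "F", "M", "F"),
  stringsAsFactors = FALSE
)

x$age        # a column by name
x[, "age"]   # the same column, using [row, column]
x[, 1]       # the same column, by position

males <- x[x$gender == "M", ]   # rows where gender is "M", all columns
nrow(males)

x[x$age > 30, "gender"]   # row filter and column selection combined
```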
26 R Variables - Matrices
- A matrix is also a table, but of a different class
- All elements must be of the same class (e.g.
numeric, character, etc.)
- The number of elements in each
row must be identical
- Accessing elements in matrices:
x[row,column]
- The Height column:
> x[,"Height"]   or   > x[,2]
- Note: you cannot use > x$Weight
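A short sketch of matrix access with a hypothetical Weight/Height matrix (the values are made up). It also shows what happens when `$` is used on a matrix and when classes are mixed.

```r
# Hypothetical data; only the column names follow the slide.
x <- cbind(Weight = c(70, 82, 55), Height = c(175, 180, 160))
is.matrix(x)

x[, "Height"]   # a column by name
x[, 2]          # the same column, by position
x[1, ]          # the first row

# $ does not work on a matrix; try() captures the error instead of stopping.
dollar_attempt <- try(x$Weight, silent = TRUE)
inherits(dollar_attempt, "try-error")

# Mixing classes does not fail, but everything is converted to character.
m <- cbind(1:2, c("a", "b"))
is.character(m)
```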
27 Exercise
- Construct the character vector pplNames
containing 5 names: Srulik, Esti, Shimshon,
Shifra, Ezra
- Construct the numeric vector ages that includes
the following numbers: 21, 12 (twice), 35
(twice)
- Use the data.frame() function to construct the
pplAges table out of pplNames and ages
- Access the pplAges rows with ages values
greater than 19
29 Working With a File
- For example: analysis of a gene expression file
- Workflow:
- Save file in workspace directory
- Read / load file to R
- Analyze the gene expression table
- 305 gene expression reads in 48 tissues (log10
values compared to a mixed tissue pool)
- Values > 0: over-expressed genes
- Values < 0: under-expressed genes
- File includes 306 rows x 49 columns
30 File Handling - Read File
- Read file to R
- Use the read.table() function
- Note: each function receives input (arguments)
and produces output (return value)
- The function returns a data frame
- Run:
> geneExprss = read.table(file = "geneExprss.txt", sep = "\t", header = TRUE)
- Check table:
> dim(geneExprss)     # table dimensions
> geneExprss[1,]      # 1st line
> class(geneExprss)   # check variable class
- Or double click on variable name in workspace tab
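Since geneExprss.txt is not included here, the sketch below writes a tiny tab-delimited file with made-up values and reads it back the same way. The file name `toyExprss.txt` and its gene names are hypothetical.

```r
# Write a small, made-up tab-delimited file to a temporary location.
toy_file <- file.path(tempdir(), "toyExprss.txt")
writeLines(c(
  "Gene\tLung\tLiver",
  "GENE_A\t0.31\t-0.12",
  "GENE_B\t-0.45\t0.08",
  "GENE_C\t0.02\t-0.27"
), toy_file)

# header = TRUE uses the first line as column names; sep = "\t" splits on tabs.
toyExprss <- read.table(file = toy_file, sep = "\t", header = TRUE,
                        stringsAsFactors = FALSE)

dim(toyExprss)             # rows and columns (the header line is not counted as a row)
toyExprss[1, ]             # first line of the table
class(toyExprss)           # read.table() returns a data frame
sapply(toyExprss, class)   # class of each column
```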
31 Plotting - Pie Chart
- What fraction of lung genes are over-expressed?
- What about the under-expressed genes?
- A pie chart can illustrate our findings
32 Using the pie() Function
- Let's regard values > 0.2 as over-expressed
- Let's regard values < (-0.2) as under-expressed
- Let's use length(): retrieves the number of
elements in a vector
> up = length(geneExprss$Lung[geneExprss$Lung > 0.2])
> down = length(geneExprss$Lung[geneExprss$Lung < (-0.2)])
> mid = length(geneExprss$Lung[geneExprss$Lung <= 0.2 & geneExprss$Lung >= (-0.2)])
> pie(c(up,down,mid), labels = c("up","down","mid"))
- More on saving plots to files in a few slides
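A self-contained variant of the same counting, using a short made-up vector in place of the Lung column. It swaps `length()` for `sum()` with `na.rm = TRUE`, which is an alternative to the slide's approach: indexing with a comparison keeps missing values as NA entries, so `length()` would count them, while `sum()` on the logical vector can skip them.

```r
# Made-up log10 expression ratios standing in for the Lung column (one value missing).
lung <- c(0.31, -0.45, 0.02, 0.25, -0.10, NA, 0.60, -0.21)

up   <- sum(lung > 0.2, na.rm = TRUE)                  # over-expressed
down <- sum(lung < -0.2, na.rm = TRUE)                 # under-expressed
mid  <- sum(lung >= -0.2 & lung <= 0.2, na.rm = TRUE)  # in between

counts <- c(up = up, down = down, mid = mid)
counts

# Label each slice with its category and count.
pie(counts, labels = paste0(names(counts), " (", counts, ")"))
```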
33 Plotting - Scatter Plot
- How similar is the gene expression profile of the
Hippocampus (brain) to that of the
Thalamus (brain)?
- A scatter plot is ideal for the visualization of
the correlation between two variables
34 Using the plot() Function
- Plot the gene expression profile of
Hippocampus.brain against that of Thalamus.brain
> plot(geneExprss$Hippocampus.brain, geneExprss$Thalamus.brain, xlab = "Hippocampus", ylab = "Thalamus")
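To try the same call without the data file, the sketch below simulates two correlated profiles. The numbers are random and purely illustrative; they are not the real tissue data.

```r
set.seed(1)  # make the simulated values reproducible

# Simulated stand-ins for the two tissue columns.
hippocampus <- rnorm(50, sd = 0.3)
thalamus    <- hippocampus + rnorm(50, sd = 0.1)   # similar profile plus noise

plot(hippocampus, thalamus,
     xlab = "Hippocampus", ylab = "Thalamus",
     pch = 16)                   # pch = 16 draws filled circles
abline(a = 0, b = 1, lty = 2)    # dashed y = x line as a visual reference
```

Points close to the dashed line have near-identical values in both profiles.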
35 File Handling - Load File to R
- .RData files contain saved R environment data
- Load .RData file to R
- Use the load() function
- Note: each function receives input (arguments)
and produces output (return value)
- Run:
> load(file = "geneExprss.RData")
- Check table:
> dim(geneExprss)     # table dimensions
> geneExprss[1,]      # 1st line
> class(geneExprss)   # check variable class
- Or double click on variable name in workspace tab
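A round trip with `save()` and `load()` shows how .RData files behave. The object and file names below are hypothetical and the values are made up.

```r
# A small made-up matrix with gene names as row names.
toyExprss <- matrix(c(0.31, -0.45, 0.02, -0.12), nrow = 2,
                    dimnames = list(c("GENE_A", "GENE_B"), c("Lung", "Liver")))

rdata_file <- file.path(tempdir(), "toyExprss.RData")
save(toyExprss, file = rdata_file)   # write the object to disk
rm(toyExprss)                        # remove it from the workspace

# load() restores objects under their original names;
# its return value is the names of the restored objects, not the data.
loaded_names <- load(file = rdata_file)
loaded_names

dim(toyExprss)
class(toyExprss)
```

This is why `load()` is called without assigning its result to a variable: the data reappears under the name it was saved with.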
36 Plotting - Bar Plot
- How does the expression profile of NOVA1 differ
across several tissues?
- A bar plot can be used to compare two or more
categories
37 Using the barplot() Function
- Compare NOVA1 expression in Spinalcord, Kidney,
Heart and Skeletal.muscle by plotting a bar plot
- Sort the data before plotting using the sort()
function
- barplot() works on a numeric vector or matrix (not a data frame)
> tissues = c("Spinalcord", "Kidney", "Skeletal.muscle", "Heart")
> barplot(sort(geneExprss["NOVA1", tissues]))
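The same steps on a small made-up matrix, so the example runs without the data file. Only the gene name NOVA1 and the tissue names come from the slide; `GENE_B` and all values are hypothetical.

```r
tissues <- c("Spinalcord", "Kidney", "Skeletal.muscle", "Heart")

# Made-up expression values; rows are genes, columns are tissues.
toyExprss <- matrix(c(0.42, -0.15, -0.30, -0.05,
                      0.10,  0.05,  0.20, -0.10),
                    nrow = 2, byrow = TRUE,
                    dimnames = list(c("NOVA1", "GENE_B"), tissues))

# Selecting one row of a matrix gives a named numeric vector;
# sort() orders it from lowest to highest and keeps the names.
nova1 <- sort(toyExprss["NOVA1", tissues])
nova1

barplot(nova1, ylab = "log10 expression ratio", las = 2)  # las = 2 rotates the bar labels
```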
38 More Graphic Functions to Keep in Mind
- hist()
- boxplot()
- plotmeans() (gplots package)
- scatterplot() (car package)
39 Exercise
- Use barplot() to compare PTBP1 and PTBP2 gene
expression in Hypothalamus.brain
- Use barplot() to compare PTBP1 and PTBP2 gene
expression in Lung
- What are the differences between the two plots
indicative of?
40 Save Plot to File - RStudio
41 Save Plot to File in R
- Before running the visualizing function, redirect
all plots to a file of a certain type:
- jpeg(filename)
- png(filename)
- pdf(filename)
- postscript(filename)
- After running the visualization function, close the
graphic device using dev.off() or graphics.off()
- For example:
> load(file = "geneExprss.RData")
> tissues = c("Spinalcord", "Kidney", "Skeletal.muscle", "Heart")
> pdf("Nova1BarPlot.PDF")
> barplot(sort(geneExprss["NOVA1", tissues]))
> graphics.off()
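The open, draw, close sequence can be checked end to end with a temporary file. The output file name and the plotted values are hypothetical.

```r
out_file <- file.path(tempdir(), "toyBarPlot.pdf")

pdf(out_file, width = 6, height = 4)   # 1. open a PDF device (size in inches)
barplot(c(Spinalcord = 0.42, Kidney = -0.15),
        ylab = "log10 expression ratio")   # 2. draw; nothing appears on screen
dev.off()                              # 3. close the device so the file is written

file.exists(out_file)
```

If the file looks empty or cannot be opened, the usual cause is a missing `dev.off()`.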
43 R Statistics - cor.test()
- A few slides back we compared the expression
profiles of the Hippocampus.brain and the
Thalamus.brain
- But is that correlation statistically
significant?
- R can help with this sort of question as well
- To answer that specific question we'll use the
cor.test() function
> geneExprss = read.table(file = "geneExprss.txt", sep = "\t", header = TRUE)
> cor.test(geneExprss$Hippocampus.brain, geneExprss$Thalamus.brain, method = "pearson")
> cor.test(geneExprss$Hippocampus.brain, geneExprss$Thalamus.brain, method = "spearman")
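`cor.test()` returns a list, so individual results can be pulled out by name instead of being read off the printed report. The sketch uses simulated data in place of the two tissue columns; the values are random and purely illustrative.

```r
set.seed(1)  # reproducible simulated data

hippocampus <- rnorm(50, sd = 0.3)
thalamus    <- hippocampus + rnorm(50, sd = 0.1)

pearson  <- cor.test(hippocampus, thalamus, method = "pearson")
spearman <- cor.test(hippocampus, thalamus, method = "spearman")

pearson$estimate    # Pearson correlation coefficient
pearson$p.value     # p-value for the null hypothesis of zero correlation
spearman$estimate   # Spearman rank correlation (rho)
spearman$p.value
```

Pearson measures linear association, while Spearman works on ranks and is less sensitive to outliers.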
44 R Statistics - More Testing, FYI
- t.test(): Student's t-test
- wilcox.test(): Mann-Whitney test
- kruskal.test(): Kruskal-Wallis rank sum test
- chisq.test(): chi-squared test
- cor.test(): Pearson / Spearman correlations
- lm(), glm(): linear and generalized linear
models
- p.adjust(): adjustment of P-values for multiple
testing (multiple testing correction) using FDR,
Bonferroni, etc.
45 R Statistics - Examine the Distribution of
Your Data
- Use the summary() function
> geneExprss = read.table(file = "geneExprss.txt", sep = "\t", header = TRUE)
> summary(geneExprss$Liver)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
-1.84400 -0.17290 -0.05145 -0.08091  0.05299  0.63950
46 R Statistics - More Distribution Functions
- mean()
- median()
- var()
- min()
- max()
- When using most of these functions remember to
use the argument na.rm = TRUE, so that missing values (NA) are ignored
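The effect of `na.rm` is easiest to see on a vector with one missing value. The numbers below are made up.

```r
# Made-up expression values with one missing measurement.
liver <- c(-0.12, 0.08, NA, -0.27, 0.05)

mean(liver)                   # NA: a single missing value makes the result missing
mean(liver, na.rm = TRUE)     # missing values are dropped before averaging
median(liver, na.rm = TRUE)
var(liver, na.rm = TRUE)
min(liver, na.rm = TRUE)
max(liver, na.rm = TRUE)

summary(liver)                # summary() drops NAs itself and reports how many there were
```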
48 R Functions and Packages
- All operations are performed by functions
- All R functions are stored in packages
- Base packages are installed along with R
- Packages including additional functions can be
downloaded by the user
- Functions can also be written by the user
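As an illustration of a user-written function, the sketch below wraps the up/down/mid counting used for the pie chart. The function name `count_expression` and its `cutoff` argument are hypothetical, not part of any package.

```r
# Hypothetical helper: count over-, under- and mid-expressed values.
# 'cutoff' has a default value, so it can be omitted when calling the function.
count_expression <- function(values, cutoff = 0.2) {
  c(up   = sum(values >  cutoff, na.rm = TRUE),
    down = sum(values < -cutoff, na.rm = TRUE),
    mid  = sum(abs(values) <= cutoff, na.rm = TRUE))
}

toy_values <- c(0.31, -0.45, 0.02, NA)      # made-up values

default_counts <- count_expression(toy_values)                # uses cutoff = 0.2
strict_counts  <- count_expression(toy_values, cutoff = 0.4)  # overrides the default
default_counts
strict_counts
```

The value of the last expression in the function body is what the function returns.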
49 Install and Load Packages - RStudio
50 Install and Load Packages - R
- Use the functions:
- install.packages("package_name"): install a package
- update.packages(oldPkgs = "package_name"): update an installed package
- library(package_name): load a package
51 Final Tips
- Read the function's help file (> ?function_name)
- Run the help file examples
- Use http://www.rseek.org/
- Google what you're looking for
- Post on the R forum webpage
- And most importantly: play with it, get the hang
of it, and do NOT despair :)