Title: ETSU-Math Department Colloquium: Got R?
1 ETSU-Math Department Colloquium: Got R?
- Edith Seier
- 10/31/05
- If you wish, go to
- http://www.etsu.edu/math/seier/Rtalk.htm
2
- What is R?
- What can we do with R?
- How to work with R
- Who can use R?
- Who really needs R?
- Learning R.
- Should we teach R at ETSU?
3 What is R?
- Its developers (Robert Gentleman and Ross Ihaka, Statistics Department of the University of Auckland) define it as an environment in which statistical techniques are implemented.
- The programming language R started as a free teaching version of the S language (developed by Chambers, Becker and Wilks at Bell Laboratories). It is considered a dialect of S, and in 1995 it was released as free software under the GNU General Public License. (The commercial version of S used in academia is S-Plus.)
- New packages of functions are being created, and its development is an international effort. All the information about the R project is available from http://www.r-project.org
- Its use in advanced statistical topics, bioinformatics and biostatistics is constantly growing.
- It is free!
- Notes:
- There are versions for Mac, Windows and Linux.
- It can be linked to C or Fortran for computationally intensive tasks.
- The language has some affinities to Lisp (Lisp-Stat?) and APL.
4 Where can we get R?
- The web page shown at the right is
- http://www.r-project.org
- There we can find
- The basic program
- Additional packages of functions
- Manuals in several languages
5 What can we do with R?
- Basic statistical calculations and graphs.
- Write functions that are not included in commercial statistical software.
- Use it to teach upper division/graduate courses when we want the students to be aware of the steps needed to solve a problem instead of using a totally menu-driven (point and click) program, at least sometimes.
- Use packages written in R for specialized areas such as spatial statistics, survival analysis, microarray data analysis, generalized linear models, etc.
6 How to work with R
- Using R
- To start: data entry
- Operations, transformations
- Descriptive statistics, graphs
- More graphs and calculations
- Writing our own functions in R
- Interactive Statistics with R (Motoya Machida, TNTechU)
- Using packages in R
7 Starting and entering data
- 1) Once you have downloaded R from http://www.r-project.org, click on the icon.
- You can
- Use data sets that come with R, data()
- Type your own data
- Read ASCII data files
- Import data from other software
- Generate random numbers and sequences
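The options listed above can be sketched in a few lines. This is only an illustration: the data set `cars` ships with R, while the object names, the toy values and the temporary file are assumptions made here so that the example is self-contained.

```r
# Use a data set that comes with R (data() with no arguments lists them all)
data(cars)
head(cars)

# Type your own data (toy values, purely illustrative)
heights <- c(1.62, 1.75, 1.80)

# Read an ASCII data file: we first write one to a temporary location
# so that there is something to read back
tmp <- tempfile(fileext = ".txt")
write.table(cars, file = tmp, row.names = FALSE)
cars2 <- read.table(tmp, header = TRUE)

# Import data from other software: the 'foreign' package offers functions
# such as read.spss() and read.dta() (not run here)

# Generate sequences and random numbers
s <- seq(1, 10, by = 3)   # 1 4 7 10
set.seed(1)               # makes the random numbers reproducible
r <- rnorm(5)
```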
8 Typing data in R
- Decide the name of the object where you will store the data
- name <- c( , , , , )
- Example: cigarette consumption per capita in 1930 (Freeman et al. (1970), Statistics)
- Categorical data
- country <- c("Australia", "Canada", "Denmark", "Finland", "England", "Iceland", "Netherlands", "Norway", "Sweden", "Switzerland", "USA")
- Numerical data
- cigarette <- c(480, 500, 380, 1100, 1100, 230, 490, 250, 300, 510, 1300)
- Labels for the data
- If you want to attach labels to the data, use the command names.
- names(cigarette) <- country
- Now type cigarette to see the data; on the screen you will see
-   Australia      Canada     Denmark     Finland     England     Iceland Netherlands
-         480         500         380        1100        1100         230         490
-      Norway      Sweden Switzerland         USA
-         250         300         510        1300
9 Generating data
- a) Sequences
- i <- seq(0, 10, by = 2) will create the sequence
- 0, 2, 4, 6, 8, 10
- b) Evaluating functions. First we need to create the argument of the function
- t <- seq(0, 60, by = 0.01)
- Y <- cos(t)
- c) Generating random numbers
- x <- rnorm(500)
10 Operations, transformations
- The arithmetic operations are done with the usual symbols + - * /. For example, if we want to convert weights in pounds into kilos:
- weightk <- weightlb/2.2
- Some useful transformations are
- exp( ) for the exponential function
- log( ) for natural logarithms
- log10( ) for base 10 logarithms
- Trigonometric functions
- sin(), cos(), tan(), asin(), atan(), acos()
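A short sketch of how these operations act on a whole vector at once. The weights below are hypothetical values chosen only for illustration.

```r
# Hypothetical weights in pounds (illustrative values only)
weightlb <- c(110, 154, 198)

# Arithmetic is applied to every element of the vector at once
weightk <- weightlb / 2.2
weightk                  # prints 50 70 90

# Transformations are vectorized in the same way
log10(c(1, 10, 100))     # 0 1 2
exp(log(weightk))        # recovers weightk (up to rounding)
sin(pi / 2)              # 1
```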
11 Descriptive statistics
tasamort <- c(180, 150, 170, 350, 460, 60, 240, 90, 110, 250, 200)
- > length(tasamort)
- [1] 11
- > sum(tasamort)
- [1] 2260
- > mean(tasamort)
- [1] 205.4545
- > median(tasamort)
- [1] 180
- > sd(tasamort)
- [1] 117.2488
- > var(tasamort)
- [1] 13747.27
> quantile(tasamort, 0.5)
50%
180
> min(tasamort)
[1] 60
> max(tasamort)
[1] 460
> summary(tasamort)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   60.0   130.0   180.0   205.5   245.0   460.0
> cor(x, y)
12 Statistical graphs and plots of functions
hist(variable)
plot(cigarette, deathrate)
abline(67.56, 0.22844)
boxplot(variable)
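The commands above can be tried on the data typed earlier in the talk. This is a minimal sketch, assuming that `deathrate` holds the same values that were stored as `tasamort` on the descriptive statistics slide (the transcript does not define `deathrate` itself). Under that assumption the least-squares coefficients agree with the values passed to abline.

```r
cigarette <- c(480, 500, 380, 1100, 1100, 230, 490, 250, 300, 510, 1300)
# Assumption: deathrate holds the same values as tasamort
deathrate <- c(180, 150, 170, 350, 460, 60, 240, 90, 110, 250, 200)

hist(deathrate)
boxplot(deathrate)

# Scatter plot with the least-squares line
fit <- lm(deathrate ~ cigarette)
coef(fit)                    # intercept about 67.56, slope about 0.2284
plot(cigarette, deathrate)
abline(fit)                  # same line as abline(67.56, 0.22844)

# Correlation between the two variables
cor(cigarette, deathrate)
```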
13
- Among the graphs for which R has commands are
- Practically all the graphs from Multivariate Analysis
- Mosaic graphs (for categorical variables)
- Smoothing in regression
- Time series plots
- Plots from Microarray data analysis
- If in the menu of R we click on Help, then Search help, and write plot, a list of graphs in R will appear.
- In the next slide we have copied and pasted the plot displays from the R screenshots at http://www.r-project.org
14
- R can be used to do statistical calculations related to
- Tests of hypotheses
- Analysis of Variance and Covariance
- Probability Distributions
- Tests for two-way tables (Chi-square, McNemar, etc.)
- Multiple regression
- Calculation of sample size
- Logistic regression
- Survival analysis
- All these topics are explained in
- Dalgaard, P. (2002) Introductory Statistics with R. Springer Verlag.
- There are also several manuals and other books on specific topics such as Linear Models, Bioinformatics, etc. Also some general methods books have instructions in R. For example
- Heiberger and Holland (2004) Statistical Analysis and Data Display: An Intermediate Course with Examples in S-Plus, R, and SAS. Springer Verlag.
- At http://www.r-project.org you can find manuals and tutorials that can be downloaded for free (not only in English but in some other languages).
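A few of the calculations listed above, sketched with functions from the standard stats package. The data sets `sleep`, `PlantGrowth` and `mtcars` ship with R and are used here only as convenient examples; they are not part of the talk.

```r
# Test of hypotheses: two-sample t test
tt <- t.test(extra ~ group, data = sleep)
tt$p.value

# Analysis of variance with one factor
summary(aov(weight ~ group, data = PlantGrowth))

# Chi-square test for a two-way table
tab <- table(mtcars$am, mtcars$vs)
chisq.test(tab)

# Sample size: observations per group needed to detect a difference of
# one standard deviation with 90% power at the 5% level
ss <- power.t.test(delta = 1, sd = 1, power = 0.90)
ceiling(ss$n)

# Logistic regression
logit_fit <- glm(am ~ wt, data = mtcars, family = binomial)
coef(logit_fit)
```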
15 Writing our own functions in R. Example: calculating a confidence interval for the mean absolute deviation
- In a sample, MAD is calculated as the average of the distances of the values to the median.
- In Bonett and Seier (2003), Confidence Intervals for Mean Absolute Deviations, The American Statistician, Vol. 57, No. 4, the following formula for the confidence interval for the population mean absolute deviation was derived:
16 Once we have copied and pasted the function citau into R, we simply write citau(x, 1.96) to calculate the CI for the data in x.
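The function citau itself is not reproduced in this transcript. The sketch below is therefore only an assumption about its general shape: it builds the interval on the log scale from the same quantities (tau, del, gam and the factor n/(n-1)) that the COD program later in this talk uses for the mean absolute deviation. The name `citau_sketch` is hypothetical, and the code should not be read as the authors' original function.

```r
# Sketch only: reconstructed by analogy with the tau part of the COD program,
# not copied from the original slide. The name citau_sketch is hypothetical.
citau_sketch <- function(x, z) {
  n   <- length(x)
  md  <- median(x)
  tau <- mean(abs(x - md))             # sample mean absolute deviation
  del <- (mean(x) - md) / tau          # standardized distance mean - median
  gam <- var(x) / tau^2
  se  <- sqrt((del^2 + gam - 1) / n)   # approximate standard error of log(tau)
  cc  <- n / (n - 1)                   # small-sample adjustment
  lower <- exp(log(cc * tau) - z * se)
  upper <- exp(log(cc * tau) + z * se)
  c(lower = lower, tau = tau, upper = upper)
}

# Example with the values used earlier in the talk
x <- c(180, 150, 170, 350, 460, 60, 240, 90, 110, 250, 200)
ci <- citau_sketch(x, 1.96)
ci
```

The interval is symmetric on the log scale around log(n/(n-1) * tau), so it is not symmetric around tau itself.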
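17
- In Bonett, D.G. and Seier, E. (2005) Confidence Interval for a Coefficient of Dispersion in Non-normal Distributions. Biometrical Journal 47 (5) pp 1-5, we included the following program to calculate the confidence interval for the COD we were defining in the paper
    CODCI <- function(x, z) {
      md <- median(x); m <- mean(x); v <- var(x)
      tau <- mean(abs(x - md))                 # mean absolute deviation from the median
      del <- (m - md)/tau; n <- length(x)
      c <- n/(n - 1)                           # R still finds the function c() below
      gam <- v/(tau^2)
      cod <- tau/median(x)                     # point estimate of the COD
      a1 <- round(((n + 1)/2) - sqrt(n)); a2 <- n - a1 + 1
      sx <- sort(x); l1 <- log(sx[a1]); u1 <- log(sx[a2])
      se1 <- (u1 - l1)/4                       # standard error of log(median)
      se2 <- sqrt((gam + (del^2) - 1)/n)       # standard error of log(tau)
      fm <- sqrt(1/(4*n*(se1^2))); covtm <- (m - md)/(2*n*fm*tau)
      k <- sqrt(se1^2 + se2^2 - 2*covtm)/(se1 + se2)
      b1 <- round((n + 1)/2 - k*z*sqrt(n/4)); b2 <- n - b1 + 1
      l1 <- log(sx[b1]); u1 <- log(sx[b2])     # limits for log(median)
      l2 <- log(c*tau) - k*z*se2; u2 <- log(c*tau) + k*z*se2   # limits for log(tau)
      ll1 <- exp(l2 - u1); ul1 <- exp(u2 - l1)
      ci <- c(ll1, cod, ul1)
      ci
    }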
18 The output of a function could be a plot. Example: periodogram
- This function calculates the periodogram and displays its graph
    perioplot <- function(x) {
      adjx <- x - mean(x)                        # subtracts the mean of the series
      tf <- fft(adjx)                            # calculates the FFT
      nf <- length(tf); n2 <- nf/2 + 1           # decides the number of frequencies
      pritf <- tf[c(1:n2)]                       # takes the elements of the FFT
      intensity <- (abs(pritf^2))/nf             # calculates the ordinates of the periodogram
      nyquist <- 1/2; pfreq <- seq(0, nf/2, by = 1)   # preparation for frequencies
      freq <- pfreq/(length(pfreq) - 1)*nyquist  # calculates frequencies
      plot(freq, intensity, type = "l")
    }
- After reading a data set, for example sunspots, we type
    perioplot(sunspots)
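A quick way to check a periodogram routine is to feed it a series whose frequency is known. The sketch below repeats the same steps but returns the values instead of only plotting them. The helper name `periodogram_sketch` and the synthetic sinusoid are assumptions made for illustration. Frequencies are computed as k/n, which coincides with the scaling used in the talk when the series length is even.

```r
periodogram_sketch <- function(x) {
  adjx <- x - mean(x)                  # remove the mean of the series
  tf <- fft(adjx)                      # fast Fourier transform
  nf <- length(tf)
  n2 <- floor(nf / 2) + 1              # number of frequencies from 0 to 1/2
  intensity <- abs(tf[1:n2])^2 / nf    # ordinates of the periodogram
  freq <- (0:(n2 - 1)) / nf            # cycles per observation
  data.frame(freq = freq, intensity = intensity)
}

# Synthetic series: 200 observations with period 10, i.e. frequency 0.1
tt <- 1:200
x <- cos(2 * pi * tt / 10)
p <- periodogram_sketch(x)
peak <- p$freq[which.max(p$intensity)]
peak                                   # the peak sits at frequency 0.1
plot(p$freq, p$intensity, type = "l")
```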
19 Things I am still avoiding in R
- Loops (it can be done...), but there are ways to get around them sometimes by using matrices. Example:
    estspec <- function(au) {
      M <- length(au)                  # counts how many autocorrelations (M) we read
      j <- seq(1, M, by = 1)           # creates sub-indexes j = 1:M
      lam <- 0.5*(1 + cos(j*pi/M))     # calculates Tukey's weights
      w <- seq(0, pi, by = pi/50)      # calculates angular frequencies
      lac <- t(lam)*au                 # multiplies each weight by the corresponding autocorrelation
      f <- function(j, w) cos(j*w)
      z <- outer(j, w, f)              # calculates cos(j w) for all values of w and j
      sz <- lac %*% z                  # obtains the sum of weights * correlations * cos(j w)
      h <- (1/(2*pi))*(1 + 2*sz)       # calculates h(w)
      plot(w, h, type = "l", main = "Estimated spectral density")
    }
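The idea of replacing loops with a matrix operation can be checked on a small case. The sketch below fills the matrix of cos(j w) values twice, once with explicit loops and once with outer(), and confirms that the results agree. The choice of four lags is an arbitrary assumption for illustration.

```r
j <- 1:4                         # lags (arbitrary small example)
w <- seq(0, pi, by = pi / 50)    # angular frequencies

# Version with explicit loops
z_loop <- matrix(0, nrow = length(j), ncol = length(w))
for (a in seq_along(j)) {
  for (b in seq_along(w)) {
    z_loop[a, b] <- cos(j[a] * w[b])
  }
}

# Same matrix in one call, no loops
z_outer <- outer(j, w, function(j, w) cos(j * w))

all.equal(z_loop, z_outer)       # TRUE
```

Each row of the matrix corresponds to one lag and each column to one frequency, so a single matrix product with the weighted autocorrelations gives the sum over lags for every frequency at once.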
20 A user friendly version for teaching STATS without writing commands
- Interactive Statistics with R, developed by Motoya Machida, Math Department, Tennessee Technological University
- http://www.math.tntech.edu/ISR/index.html
21 Packages in R. Example: Microarray data analysis
- The packages DNAMR and DNAMRWeb, developed by J. Cabrera at Rutgers University, can be found at
- http://www.rci.rutgers.edu/cabrera/DNAMR/
- At the left there is a program.
- The graph at the right was obtained with DNAMRWeb by clustering the most important genes for the Kahn data (each row is a gene and each column is a microarray).
22 Other packages of functions written in R
- Packages in library 'C:/PROGRA~1/R/rw2001/library':
- base         The R Base Package
- boot         Bootstrap R (S-Plus) Functions (Canty)
- class        Functions for Classification
- cluster      Functions for clustering (by Rousseeuw et al.)
- datasets     The R Datasets Package
- foreign      Read Data Stored by Minitab, S, SAS, SPSS, Stata, Systat, ...
- graphics     The R Graphics Package
- grDevices    The R Graphics Devices and Support for Colours and Fonts
- grid         The Grid Graphics Package
- KernSmooth   Functions for kernel smoothing for Wand & Jones (1995)
- lattice      Lattice Graphics
- MASS         Main Package of Venables and Ripley's MASS
- methods      Formal Methods and Classes
- mgcv         GAMs with GCV smoothness estimation and GAMMs by REML/PQL
- nlme         Linear and nonlinear mixed effects models
- nnet         Feed-forward Neural Networks and Multinomial Log-Linear Models
- rpart        Recursive Partitioning
- spatial      Functions for Kriging and Point Pattern Analysis
- splines      Regression Spline Functions and Classes
- stats        The R Stats Package
- stats4       Statistical functions using S4 classes
- survival     Survival analysis, including penalised likelihood
- tcltk        Tcl/Tk Interface
- tools        Tools for Package Development
- utils        The R Utils Package
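A listing like the one above is what library() prints when called without arguments. A minimal sketch of how a package from the list is loaded and used follows. The spline fit to the built-in `cars` data is an illustrative assumption, not an example from the talk.

```r
# library()                      # prints the list of installed packages (not run here)

# Load a package that ships with R and use one of its functions
library(splines)
fit <- lm(dist ~ ns(speed, df = 3), data = cars)   # natural cubic spline fit
length(coef(fit))                # 4: intercept plus three spline terms

# Packages that are not installed yet can be fetched from CRAN, e.g.
# install.packages("survival")   # requires an internet connection
```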
23 Who can use R?
- Anybody who wants to have free statistical software at home or at the office
- Schools that cannot afford to buy licenses for statistical software
- Those of us who teach courses off campus, in places without the software we have on campus, for instance to teachers (I prefer R to Excel)
- Statisticians who write programs to do calculations that are not included in commercial software
24 Who really needs R?
- Those of us working in an area of Statistics that has been developed more in R than in commercial software (spatial statistics, etc.)
- People working in Bioinformatics, Microarray data analysis, etc.
25
- Will R push out commercial statistical software?
- Probably not in general, but its use will increase in the academic environment.
- Minitab is user friendly for classroom use in Intro STATS courses, but R could be an option for schools that cannot afford Minitab.
- People in social sciences are too used to SPSS.
- SAS has very good data management options for large data sets, so SAS will probably prevail in the big corporations, but in the academic environment R could be an option for advanced courses (SAS is quite expensive).
- At least for a few years, R will probably be used in environments with statistical sophistication and some programming knowledge (statistics departments of universities and research institutes) until more user friendly versions are developed. Maybe the commercial software will turn more toward the business world and R will take more of the academic/research world.
26 Should we teach R at ETSU?
- Yes! Maybe not in Math 1530, but we could introduce it in upper division courses. Why?
- Students planning for grad school in Statistics, Bioinformatics, Biostatistics or Biomath would benefit from it.
- They would benefit from being able to work at home without buying or renting a statistical program.
- They would become familiar with an object-oriented programming language.
- Programming is good for you! Why? Because you really have to understand a problem before writing a program to solve it.
27 Learning R
- There are several free manuals on the R project page, http://www.r-project.org
- Books about R
- Introductory Statistics with R, Peter Dalgaard
- Statistical Analysis and Data Display: An Intermediate Course with Examples in S-Plus, R and SAS
- A Handbook of Statistical Analyses Using R, Everitt and Hothorn (to appear in 2006)
- Bioinformatics for R, Wiley 2006
- There is a tutorial at
- http://www.etsu.edu/math/seier/R.htm
- In Module 1 we give the basic commands to do calculations and graphs; the reader can search for more information using the Help of R.
- In Module 2 we will learn how to write functions, i.e. we will learn how to program in R.
- You are welcome to use it.
28
- Now we will browse the manual. Then we can open R (it is already installed in Room 205),
- see some demos
- and try some of the commands in
- http://www.etsu.edu/math/seier/commandsRtalk.doc