Title: R basics
1R basics
2Ultra-short R introduction
- Most life scientists use spreadsheet programs for
analysis - like excel or statistica. Why? - Ease of use
- - click buttons, select data by hand
- you see the data in front of you
- you can do limited programming
3- Disadvantages
- Hard to handle large dataset (gt100 points)
- Inflexible, few analyses available
- Dumb-down click button effect
- Hard to repeat analyses systematically with new
data
4R
- R is a computational environment - somewhere
between a program and a programming language - No buttons, no wizards only a command line
interface - Is a professional statistics toolset - likely has
all the analyses you will ever need in your
career - Is also a programming language
- Can handle large datasets
- Very powerful graphics
- The state-of-the-art statistics program for
bioinformatics, especially array analysis - Free, and open source!
5R in this course
- In this course, we will use R for
- Exploring data
- Select subsets of data
- Plotting data
- Statistical tests on data
- etc
- almost everything.
- It is the most important tool in this course!
6R is hard
- R is challenging to learn
- No visual cues
- Very different from spreadsheets
- Hard to find the method to use
- Cryptic manual pages (at times)
- It requires a lot of effort to get inside the
system, but bear in mind that it is worth it -
you gain an edge to all spread-sheet users
7Reading material
- Keep the reference sheet handy
- Save code from lecture exercises in a text file
- Powerpoint slides are available after lectures at
the web page - Can also recommend the tutorials on the CRAN
website this is how I learned R - Dahlgaard book is very good, also for future
reference
8Expected
- You will have R open at all times during this
course - We will make MANY smaller exercises during class,
most often in pairs - Understanding these exercises is the key to
succeeding in the exam - We will have at least two teachers present at al
lectures - use us to get help when needed
9- NB It is YOUR responsibility to that you
understand the examples/exercises - if not, you have to ask for clarification during
the walkthrough of the problems
10First challenge of the day
- Start R
- Type
- demo(graphics)
- Hit enter a few times
11Starting with R
- We will now walk through the most basic R
concepts - We will add successive detail on this with many
exercises - and in a while get over to plotting,
basic statistics, and finally some programming
12Challenging?
- Those who know R or have a math background might
find most things for today no-so-challenging - We cover this to keep everyone on the same ground
13Vectors
- In statistics, we are almost always dealing with
several data points - A vector is an collection of numbers and/or
strings - (albin, sanne, jens)
- (1, 5.2, 10, 7, 2, 21)
- (3)
- The last example is a vector of length 1
14- In R, we make a vector by the c() command (for
concatenate) - gt c(1,5,10, 7, 2, 1)
- 1 1 5 10 7 2 1
- gt c('albin', 'sanne', 'jens')
- 1 "albin" "sanne" "jens"
- When we input strings or characters, we have to
surround them with or - If making vectors of size 1, we can skip c()
- gt 3
- 1 3
Method name
Method arguments
Method output is also called return value(s)
15- Challenge
- 1) Make the following vector
- 45,5,12,10
- 2) What happens with the following command?
- c(1100)
- c(502)
- A vector is a data structure, and the most
fundamental in R almost everything in R is some
kind of vector, although sometimes in several
dimensions vectors within vectors
16The reference sheet is your friend!
- You will get over-whelmed by different command
names fast - Use the reference sheet to remind yourself in all
exercises
17Assignment to memory
- The c() command is almost useless in itself - we
want to keep the vector for other analyses - The assignment concept
gt 45 add 4 and 5 1 9 gt alt-4
store 4 as a gt blt-5 store 5 as
b gt a just checking 1 4 gt b 1 5 gt ab
add ab (45) 1 9 yes,
works
Anything after a will be disregarded by R. Used
for making comments
18Expanding this to a whole vector my_vectorlt-
c(1,5,10, 7, 2) Note that there is no return
value now - this is caught by the my_vector.
What happens if you type my_vector after
this? my_vector is a variable, with the variable
name my_vector. Variable names are totally
arbitrary! The anatomy of the vector
Name my_vector
Values 1 5 10 7 2
Index 1 2 3 4 5
19We can access part of the vector like
thismy_vector5 will give the 5th item in the
vectorWhat happens if you do this?
my_vectorlt- c(1,5,10, 7, 2) define the
vectormy_vector c(1,3,5)my_vector14my_v
ector41
20gt my_vectorc(1,3,5) 1 1 10 2 gt
my_vector14 1 1 5 10 7
Making a series of integers AB will give a
vector of A,A1, A2B
gtmy_vector41 also works backwards 1 7 10
5 1
21Challenge
- Using the reference sheet, figure out at least
three ways of making R print your vector in the
other direction -
my_vectorlt- c(1,5,10, 7, 2) define the vector
22- gt my_vectorc(5,4,3,2,1)
- 1 2 7 10 5 1
- gt my_vectorc(51)
- 1 2 7 10 5 1
- gt rev(my_vector)
- 1 2 7 10 5 1
- gt
23Interlude naming rules and the danger of
over-writing
- Naming We can name vectors to almost anything.
The most basic rule is Never start a vector
name with a number - gt alt- c(1,5,4,2) OK
- gt 1alt- c(1,5,4,2) NOT OK
- Error syntax error
- gt a1lt- c(1,5,4,2) OK
- Over-writing
- my_vectorlt- c(1,5,10, 7, 2)
- my_vectorlt- c(10,5,2, 3, 1)
- what does my_vector contain now?
24Analyzing vectors
- Many functions work directly on vectors - most
have logical names - For instance, length(my_vector) gives the number
of items in the vector ( 5) - Challenge
- Make a vector big_vector with values 1 to 10000
using the ab construct - What is the
- Length of the vector
- Sum of all items in vector the sum() function
- Mean( average) of all items in the vector the
mean() function
25gt big_vectorlt-(110000) gt length(big_vector) 1
10000 gt sum(big_vector) 1 50005000 gt
mean(big_vector) 1 5000.5
26Interlude the R help system
- R has hundreds of functions where most have a
non-trivial syntax The easiest way of retrieving
documentation is by saying - ?function_name OR
- help(function_name) OR
- help.search(something) for fuzzy searches
- Look at the help for sample() and sort() and then
try them out on big_vector - Also try out RSiteSearch(sample) what happens
27Adding and multiplying a number to a vector
- Sometimes we want to add a number, like 10, to
each element in the vector - gt big_vector 10
- Test this
- big_vector2lt-big_vector 10
- Also test
- min(big_vector)
- max(big_vector)
- min(big_vector2)
- max(big_vector2)
- What happens?
28gt big_vector2lt-big_vector 10 gt
min(big_vector) 1 1 gt max(big_vector) 1
10000 gt min(big_vector2) 1 11 gt
max(big_vector2) 1 10010
29Adding vectors
- We can also add one vector to another vector
- Say that we have the three vectors
- Alt-c(10, 20, 30, 50)
- Blt-c(1,4,2,3)
- Clt-c(2.5, 3.5)
- Test what happens and explain the outcome
- AB
- AC
30gt AB1 11 24 32 53gt AC1 12.5 23.5 32.5
53.5
Alt-c(10, 20, 30, 50) Blt-c(1, 4, 2, 3
) Clt-c(2.5,3.5 )
- AB is easy to understand A1B1 , etc
- AC is trickier - the C vector is just of length
2. It is re-used! So - A1C112.5
- A2C223.5
- A3C132.5
- A4C253.5
- Actually, this is what is happening also with
A10 the 10 is used many times.
31Plotting vectors
- Lets make up some semi-random data
- datlt-rnorm (100) draw 100 random, normal
distributed data points - Test the following
- plot(dat)
- plot(dat,typel)
- barplot(dat)
- hist(dat)
- What is the difference?
32Why are your three first plots different than
mine? Why is your last plot more similar to mine?
33Graph options
- Generally, you can give extra options to
graphical commands like this - plot (some_vector, colblue, typel)
- In plot try to vary the following options - note
that you can use several at once (and figure out
what they do) - typel, b
- col hotpink
- mainimportant plot
- typeh
- typeS
- These options are really arguments to the plot()
function
34More about functions
- In almost all cases, a function needs some input,
like plot(dat). - dat here is an unnamed argument, and this works
because plot() assumes we mean x values
dat. - We could also say plot(xdat) - a named
argument. If you have many arguments, most of
them are named - such as - plot (some_vector, colblue, types)
- The help pages will tell you what type of
arguments you can use
35The danger of unnamed arguments
- is that the order of them will make a big
difference. Try this out - what is the difference
between the plot commands? - alt-rnorm(100)
- blt-rnorm(100)2
- plot(a,b)
- plot(b,a)
- plot(xb, ya)
36b
plot(a,b) plot(b,a) plot(xb, ya)
a
a
b
a
b
37Some generic R arguments to plots - the par()
function
- The par() function is used to set general plot
properties. It has hundreds of possible arguments
- see ?par - Two very handy par() arguments is mfrow() and
mfcol() - these will allow many plots in one page - You give these functions a vector of length 2 -
this gives the number of cells in the page (see
example)
38Example
- gt par( mfrowc(3,1) )
- gt plot (a,b)
- gt plot (b,a)
- gt plot(xb, ya)
- gt par( mfrowc(2,2) )
- gt plot (a,b)
- gt plot (b,a)
- gt plot(xb, ya)
39Challenge - can you get the three plots in a row
using mfrow?
40gt par( mfrowc(1,3) ) gt plot (a,b) gt plot (b,a) gt
plot(xb, ya)
41Overlaying plots
- Sometimes we want to put may data sets within one
graph, on top of each other - This is often made by the lines() or points()
command, like this
42gt plot(b, type"l", col"blue") gt lines(a,
col"red")
Why did I start with plotting b? What would have
happed if using points() instead of lines()?
43Sizing graphs
- Simple concept, but awkward to write
- Change X scale xlimc(start_value, end_value)
- Change Y scale ylimc(start_value, end_value)
44gt plot(a, type"l", col"blue") gt plot(a,
type"l", col"blue", ylimc(-5,5))
45Saving graphs
- Different on different systems!
- All systems can use the original tricky way with
device() see introduction PDFs if you have an
interest. We wont use this.
46OSX
- Click on the graph, and just copy it. Will become
a pdf or a bitmap when pasting. - On my mac you have top open a new file in Preview
and paste there first. On others you can paste it
into other programs directly - Or, just have the graphics window active and do
file-gtsave
47Windows
- Right-click on graph, copy as metafile or bitmap,
paste
48Linux
- Use awkward device() system, like pdf() or
- Use GIMP and just make a screen capture from
there by acquire
49Statistics part 1
- Describing the data descriptive statistics
- What is the center point of the data?
- What is the spread of the data?
- How does the distribution look?
50Summary statistics
- hist() ( Histogram) is a graphical way of
summarizing distributions it creates a number
of bins and calculates how many of the data
points that fall into each bin. - We can also summarize by the center points in
the data - mean()
51- median()
- Sort all the data, pick the number in the center.
If the number of data points is even, take the
mean of the two center points - Challenge
- We make another vector
- dat2lt-rnorm(10)
- And add a few extra points to it
- dat2lt-c(dat2, 10, 10.5, 30 )
- Test mean() and median() on dat2. Are they the
same? Can you explain the differences by plotting
a histogram? What is the advantage/disadvantage
of each measure?
52gt dat2lt-rnorm(10) gt dat2lt-c(dat2, 10, 10.5, 30
) gt median(dat2) 1 0.11125 gt mean(dat2) 1
3.636139 gt hist(dat2)
Means are sensitive to outliers! Very common
situation in genomics
53Percentiles
- An extension of the median concept
- Best explained by example
- the 20th percentile is the value (or score) below
which 20 percent of the observations may be
found. - The median is the same as the 50th percentile
- The first quartile is the 25th percentile, the
third is the 75th - Try summary(dat) and summary(dat2)
54 summary(dat) Min. 1st Qu. Median Mean
3rd Qu. Max. -2.20600 -0.60480 -0.06930
-0.01333 0.68030 3.49600 gt summary(dat2)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-2.4010 -0.5385 0.1113 3.6360 0.6771 30.0000
The command ecdf() (empirical cumulative
distribution) calculates all percentiles in
your data and also understands plot() Try plot
(ecdf(dat2))
55What fraction of the data that has been covered
at point X
Sorted values of dat2
56Boxplots
- An easier representation of ECDFs. Is based on
both making boxes that tell us about both center
point and spread of the data - First, calculate the first quartile, the median
and the third quartile - Calculate the inter-quartile range (IQR) 3rd
quartile -1st quartile - These will be used to draw a box
- gtboxplot(dat)
- gt rug(dat, side2)
57Interquartile range (IQR)
3rd quartile
Median
1st quartile
58continued
- Sounds more complicated than it is
- Any data observation which lies more than
1.5IQR lower than the first quartile is
considered an outlier. - Indicate where the smallest value that is not an
outlier is by a vertical tic mark or "whisker",
and connect the whisker to the box via a
horizontal line. - Do the same for higher values
59The data point that is highest but within 1.5 IQR
from the center
More extreme data - outliers
3rd quartile
Median
1st quartile
60(No Transcript)
61Variance, standard deviation and data spread
- What is the difference between these
distributions?
62- Same mean and median, but different spread over
the x axis - This can be measured by the variance of the data
- It is basically the difference between each point
and the mean, squared
63- Standard deviation is simply variance squared.
- This gives nice statistical features that you
will get to know if you study statistics ?
64Why is variance and standard deviation important?
- Variance tells you something about the quality of
measurements - The higher the variance, the harder it is to with
certainty say that two measurements are different - (whiteboard demo)
65- What are the R functions for variance and
standard deviation? - If we make some random data
- smallsetlt-rnorm(100)
- largesetlt-rnorm(10000)
- What is the variance and standard deviation for
these? - Is the standard deviation really the square root
of the variance (what is the function for square
root?)
66- gtsmallsetlt-rnorm(100)
- gtlargesetlt-rnorm(10000)
- gtvar(smallset)
- 1 0.9504884
- gtvar(largeset)
- 1 0.98416
- gtsd(largeset)
- 1 0.9920484
- gtsd(smallset)
- 1 0.97493
- gtsqrt(var(smallset))
- 1 0.97493
Why do we get about the same variance?
67(No Transcript)
68Dealing with real data
- involves interacting with messy data from other
people - Almost all interesting data is real
- We need methods to get the data into R to start
with.
69Reading data from files
- Typically, you have data in a .txt or .xls file -
how do you get it into R for analysis? - Four steps
- 1 Format it to tab-separated text file
- 2 Tell R what directory to work in
- 3 Read in the file
- 4 Attach data to memory (optional)
70What R likes
- A data frame(table) with columns. The first
row can be column names, like this
Lecturer height age Poker_games_won
Albin 181 34 14
Jens 189 31 10
Sanne 170 NA 21
Name row, header
First data row
String data
Numerical data
Code for missing data Do not ask ladies for age
71Reading data(1)
- General rules
- Use tabs as separators
- Each row has to have the same number of columns
- Missing data is NA, not empty
- As .txt The ideal format!
- As .xls save as tabbed text
72(No Transcript)
73Get some data to play withUpdate old file is
wrong! Please download the data again
- See course web page!
- http//people.binf.ku.dk/albin/teaching/htbinf/R/
- Download the zipped file save this at a good
place that you know where it is - I will refer to this place on your hard drive as
the data directory
74Reading data(2)
- Tell R where to work.
- In R, you always work in a given directory,
called working directory. - You can set this graphically in Windows or MacOS
(next pages) - Or use setwd(some directory)
- In Linux, it pays to start R in the same
directory - It pays to have one directory per larger
analysis, and call the data files informative
things (not Data.txt)
75In OSX
Change working directory
Current working directory
76In Windows
77Reading in data (3)
- my_datalt- read.table(filename, headerT)
Only if you have column names I always have
that
Data frame object holding all the data
The actual command
78Lets try
- In the data directory, there is a file
flowers.txt. Have a look at it (it is a normal
text file can be opened in almost anything).
Now - my_datalt-read.table(flowers.txt, headerT)
- names(my_data)
- What happens what does names do?
79- gt my_datalt-read.table("flowers.txt", hT)
- note to self show name completion feature
- gt names(my_data)
- 1 "Sepal.Length" "Sepal.Width" "Petal.Length"
"Petal.Width" "Species"
Sepal.Length Sepal.Width Petal.Length Petal.Width
Species 5.1 3.5 1.4 0.2 setosa 4.9 3.0 1.4 0.2 set
osa 4.7 3.2 1.3 0.2 setosa 4.6 3.1 1.5 0.2 setosa
5.0 3.6 1.4 0.2 setosa Etc
80Accessing the columns
- There are three ways of getting the data in a
data frame - Get the column by saying
- dataframe_namecolumn_name
- Get the columns as separate vectors, named as
their header - the attach command - More complex queries, using index operations
(more in a few slides)
81Columns by names
- The attach(some_data_frame) function will put all
the columns of some_data_frame as vectors in
working memory - So, in our case
- attach(my_data)
- Will give the 5 vectors
- "Sepal.Length" "Sepal.Width" "Petal.Length"
"Petal.Width" and "Species" - (which we then can analyze)
82Dangers of the attach() function
- Attach will make a vector out each column, named
after the column name - What happens if we already have a vector with
that name? - What does ls() and detach() do? Why would we
want to do this?
83Challenge
- Using the flowers file,
- Plot a histogram of the length of sepals
(Sepal.Length) - Plot the correlation between sepal length and
sepal width (an x-y plot) - What is the mean of Sepal.Length and Sepal.Width?
- Why do you get strange results on Sepal.Length?
- Can you solve the problem by using the
appropriate help file?
84- gt hist(Sepal.Length)
- gt plot(Sepal.Length, Sepal.Width)
85gt mean(Sepal.Width) 1 3.057333 gt
mean(Sepal.Length) 1 NA gt
strange!
If we look at the file Sepal.Length Sepal.Width P
etal.Length Petal.Width Species 5.1 3.5 1.4 0.2 se
tosa 4.9 3.0 1.4 0.2 setosa NA 3.2 1.3 0.2 setosa
4.6 3.1 1.5 0.2 setosa 5.0 3.6 1.4 0.2 setosa 5.4
3.9 1.7 0.4 setosa 4.6 3.4 1.4 0.3 setosa
86Missing values
- In real data, we often have missing values. In R,
we label these NA - Any examples of missing data?
- Some functions can by default deal with these -
for instance hist() - Some functions will need to be told what to do
with them, like mean()
87gt mean(Sepal.Length) 1 NA na.rm is the option
for mean to remove all NAs first gt
mean(Sepal.Length, na.rmT) 1 5.851007 With
new files, you should check whether there are any
NA values first.
88What if we wanted just the 5 first rows? Or just
all values from the species versicolor?
- Sepal.Length Sepal.Width Petal.Length Petal.Width
Species - 5.1 3.5 1.4 0.2 setosa
- 4.9 3.0 1.4 0.2 setosa
- 4.7 3.2 1.3 0.2 setosa
- 4.6 3.1 1.5 0.2 setosa
- 5.0 3.6 1.4 0.2 setosa
- 6.2 2.9 4.3 1.3 versicolor
- 5.1 2.5 3.0 1.1 versicolor
- 5.7 2.8 4.1 1.3 versicolor
- 6.3 3.3 6.0 2.5 virginica
- 5.8 2.7 5.1 1.9 virginica
- 7.1 3.0 5.9 2.1 virginica
- 6.3 2.9 5.6 1.8 virginica
89The index concept
- Recap
- In a vector, we can select individual values by
saying - Sepal.Widthsome position
- gt Sepal.Width5
- 1 3.6
- Sepal.Widthsome_positions
- gt Sepal.Width210
- 1 3.0 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1
- Sepal.Widthc(2,4,6,8)
- 1 3.0 3.1 3.9 3.4
- The first position will always have the number 1,
the last will have the length of the vector. - Challenge what is the mean of the first 100
values of Sepal.Width?
90Albins code
- gt alt-Sepal.Width1100
- gt mean(a)
- 1 3.099
- or, in just one line
- gt mean(Sepal.Width1100)
- 1 3.099
91Selecting parts of vectors depending on values
- Often, we want parts of vector due to the values
of the vector -for instance all values lt10, or
all strings equaling versicolor. - This is made by a slightly odd syntax(it will
make sense later on when we have many vectors) - To get all values over 10 in Sepal.Length
- Sepal.Length Sepal.Length gt10
- So, we are using a criterion instead of the
index
92Try out
- Make a boxplot of all values in Sepal.Length that
are higher than 5 - How many vales are there that satisfy this
criterion?
93Albins code
- alt-Sepal.LengthSepal.Length gt5
- boxplot (a)
- length(a)
- 1 119
94criteria in R
- lt Smaller than
- gt Bigger than
- gt Bigger or equal to
- lt smaller than or equal to
- equal to (not )
- ! not equal to
- Challenge
- how many rows for the virginica species?
95Albins try
- gt length(SpeciesSpeciesvirginica )
- Error in NextMethod("") object "virginica" not
found - what happens here is that R thinks virginica is
an object - a vector or something else - gt length(SpeciesSpecies"virginica" )
- 1 50
For string comparisons, we need to enclose the
string with or
96Selection using two vectors
- Often we have two vectors from a file, and we
want to use the value of one vector to select
values from the other. - For instance, sepal length broken up by species
Sepal.Length Sepal.Width Petal.Length Petal.Width
Species 5.1 3.5 1.4 0.2 setosa 4.9 3.0 1.4 0.2 set
osa 4.7 3.2 1.3 0.2 setosa 4.6 3.1 1.5 0.2 setosa
5.0 3.6 1.4 0.2 setosa 6.2 2.9 4.3 1.3 versicolor
5.1 2.5 3.0 1.1 versicolor 5.7 2.8 4.1 1.3 versico
lor 6.3 3.3 6.0 2.5 virginica 5.8 2.7 5.1 1.9 virg
inica 7.1 3.0 5.9 2.1 virginica 6.3 2.9 5.6 1.8 vi
rginica
97- Use the same syntax as before - but change the
vetor names! - some_vector other_vector something
- Challenge
- select the Sepal.Width values that correspond to
the setosa species - What is the mean of these values?
- What is the distribution of these values?
98Albins code
gt alt-Sepal.WidthSpecies"setosa" gt mean(a) 1
3.428 gt hist(a)
99Fancy way of breaking up data (only works with
some functions)
- boxplot (Sepal.WidthSpecies)
- Try it out
- This is a formula construction - break up width
in the levels of species
100Even more complex multiple criteria
- and combine two criteria
- or either this or that has to be true
- So,
- Species Sepal.Width gt4 Sepal.Length lt5
- will give the species rows where width gt4 AND
Length is lt5. - and, Species Sepal.Width gt4 Sepal.Length lt5
will give? - Challenge Make a boxplot of Sepal.Width where
Sepal.Length gt3 AND the species is virginica
101- gtboxplot (Sepal.WidthSepal.Length gt3
Species"virginica" )
102More complex data selections
- The dataframe is really a two-dimensional vector
- a vector that contains vectors
103- We can use almost the same commands to select
sub-sets of the data frame, without using the
attach() function - This is very handy at times
104- The data frame is really a big table - a 2D
vector. - Just as some_vector5 gives use the fifth value
in a vector, - some_dataframe10, 2 gives us the cell in the
10th row and the 2nd column - gtmy_data10, 2
- 1 3.1
Sepal.Length Sepal.Width Petal.Length Petal.Width
Species 5.1 3.5 1.4 0.2 setosa 4.9 3.0 1.4 0.2 set
osa 4.7 3.2 1.3 0.2 setosa 4.6 3.1 1.5 0.2 setosa
5.0 3.6 1.4 0.2 setosa 5.4 3.9 1.7 0.4 setosa 4.6
3.4 1.4 0.3 setosa 5.0 3.4 1.5 0.2 setosa 4.4 2.9
1.4 0.2 setosa 4.9 3.1 1.5 0.1 setosa 5.4 3.7 1.5
0.2 setosa 4.8 3.4 1.6 0.2 setosa 4.8 3.0 1.4 0.1
setosa
105Selecting rows
- If we instead say
- my_data10,
- We will get the WHOLE 10th row back
- One can think of it as saying
- my_data10, give me everything
- So, what will happen if we say
- my_data15,
- Or
- my_data51,
- Try it out
106Indexing exercise
- In the data directory, there is a file called
Rindexing.pdf it is a set of small exercises
designed to help you understand indexing - It works with a very small data file, just so
that you can check your results by eye - Indexing is very important for the rest of the
course
107Nota Bene 1)Students on the reserve list
- Please talk to me to get signed on to the course
(I need your student number or similar) - Otherwise you cannot go to the exam gt no credit
- 2) A watch was found in the last lecture will
be given back if you can describe it well
108Sorting one vector by another
- Very common situation with experimental data
what are the genes with the highest expression in
my data set? - Recap what happened if you said
- my_data15,
- Or
- my_data51,
- ?
109 - get some data to play with
- dlt-read.table("mtcar.txt", hT)
- attach(d)
Just by looking at the data, d what car has
the highest and lowest horsepower (hp)?
110- car_name mpg cyldisp hp drat wt qsecvs am
gear carb - 1 Mazda RX4 21.0 6 160.0 110 3.90
2.620 16.46 0 1 4 4 - 2 Mazda RX4 Wag 21.0 6 160.0 110 3.90
2.875 17.02 0 1 4 4 - 3 Datsun 710 22.8 4 108.0 93 3.85
2.320 18.61 1 1 4 1 - 4 Hornet 4 Drive 21.4 6 258.0 110 3.08
3.215 19.44 1 0 3 1 - 5 Hornet Sportabout 18.7 8 360.0 175 3.15
3.440 17.02 0 0 3 2 - 6 Valiant 18.1 6 225.0 105 2.76
3.460 20.22 1 0 3 1 - 7 Duster 360 14.3 8 360.0 245 3.21
3.570 15.84 0 0 3 4 - 8 Merc 240D 24.4 4 146.7 62 3.69
3.190 20.00 1 0 4 2 - 9 Merc 230 22.8 4 140.8 95 3.92
3.150 22.90 1 0 4 2 - 10 Merc 280 19.2 6 167.6 123 3.92
3.440 18.30 1 0 4 4 - 11 Merc 280C 17.8 6 167.6 123 3.92
3.440 18.90 1 0 4 4 - 12 Merc 450SE 16.4 8 275.8 180 3.07
4.070 17.40 0 0 3 3 - 13 Merc 450SL 17.3 8 275.8 180 3.07
3.730 17.60 0 0 3 3 - 14 Merc 450SLC 15.2 8 275.8 180 3.07
3.780 18.00 0 0 3 3 - 15 Cadillac Fleetwood 10.4 8 472.0 205 2.93
5.250 17.98 0 0 3 4 - 16 Lincoln Continental 10.4 8 460.0 215 3.00
5.424 17.82 0 0 3 4 - 17 Chrysler Imperial 14.7 8 440.0 230 3.23
5.345 17.42 0 0 3 4 - 18 Fiat 128 32.4 4 78.7 66 4.08
2.200 19.47 1 1 4 1
111Permutations
- sort.list(x) will give the permutation of x that
sorts x - Is much easier to explain with a small example
- gt alt-c(3,1,10)
- gt sort.list(a)
- 1 2 1 3
112- gt alt-c(3,1,10)
- gt sort.list(a)
- 1 2 1 3
- gt blt-sort.list(a)
- gt ab
- 1 1 3 10
- sort.list(hp)
- 1 19 8 20 18 26 27 3 9 21 6 32 1 2 4 28
10 11 22 23 5 25 30 12 13 14 15 16 17 7 24 29
31
113- car_name mpg cyl disp hp drat wt qsec vs
am gear carb - 1 Mazda RX4 21.0 6 160.0 110 3.90
2.620 16.46 0 1 4 4 - 2 Mazda RX4 Wag 21.0 6 160.0 110 3.90
2.875 17.02 0 1 4 4 - 3 Datsun 710 22.8 4 108.0 93 3.85
2.320 18.61 1 1 4 1 - 4 Hornet 4 Drive 21.4 6 258.0 110 3.08
3.215 19.44 1 0 3 1 - 5 Hornet Sportabout 18.7 8 360.0 175 3.15
3.440 17.02 0 0 3 2 - 6 Valiant 18.1 6 225.0 105 2.76
3.460 20.22 1 0 3 1 - 7 Duster 360 14.3 8 360.0 245 3.21
3.570 15.84 0 0 3 4 - 8 Merc 240D 24.4 4 146.7 62 3.69
3.190 20.00 1 0 4 2 - 9 Merc 230 22.8 4 140.8 95 3.92
3.150 22.90 1 0 4 2 - 10 Merc 280 19.2 6 167.6 123 3.92
3.440 18.30 1 0 4 4 - 11 Merc 280C 17.8 6 167.6 123 3.92
3.440 18.90 1 0 4 4 - 12 Merc 450SE 16.4 8 275.8 180 3.07
4.070 17.40 0 0 3 3 - 13 Merc 450SL 17.3 8 275.8 180 3.07
3.730 17.60 0 0 3 3 - 14 Merc 450SLC 15.2 8 275.8 180 3.07
3.780 18.00 0 0 3 3 - 15 Cadillac Fleetwood 10.4 8 472.0 205 2.93
5.250 17.98 0 0 3 4 - 16 Lincoln Continental 10.4 8 460.0 215 3.00
5.424 17.82 0 0 3 4 - 17 Chrysler Imperial 14.7 8 440.0 230 3.23
5.345 17.42 0 0 3 4 - 18 Fiat 128 32.4 4 78.7 66 4.08
2.200 19.47 1 1 4 1
19 8 20 18 26 27 3 9 21 6 32 1 2 4 28 10
11 22 23 5 25 30 12 13 14 15 16 17 7 24 29 31
114Thought exercise
- What happens if we now say
- hpc(19 , 8 ,20 ,18, 26, 27, 3, 9, 21, 6, 32,
1, 2, 4, 28, 10, 11, 22, 23, 5, 25, 30, 12,
13, 14 ,15 ,16 ,17 , 7 ,24 ,29, 31) - ?
115We get hp sorted
- hpc(19 , 8 ,20 ,18, 26, 27, 3, 9, 21, 6, 32,
1, 2, 4, 28, 10, 11, 22, 23, 5, 25, 30, 12,
13, 14 ,15 ,16 ,17 , 7 ,24 ,29, 31) - 1 52 62 65 66 66 91 93 95 97 105 109
110 110 110 113 123 123 150 150 175 175 175 180
180 180 205 215 230 245 245 264 335
116So, sort.list really gives an instruction how to
sort
- So, we can also say
- sort_orderlt-sort.list(hp)
- hpsort_order
- What will happen if we say
- car_namesort_order
- ?
117- car_namesort_order
- 1 Honda Civic Merc 240D
Toyota Corolla Fiat 128 Fiat X1-9
Porsche 914-2 Datsun 710 - 8 Merc 230 Toyota Corona
Valiant Volvo 142E Mazda RX4
Mazda RX4 Wag Hornet 4 Drive - 15 Lotus EuropaMerc 280 Merc 280C
Dodge Challenger AMC Javelin
Hornet Sportabout Pontiac Firebird - 22 Ferrari Dino Merc 450SE Merc
450SL Merc 450SLC Cadillac
Fleetwood Lincoln Continental Chrysler Imperial
- 29 Duster 360 Camaro Z28 Ford
Pantera L Maserati Bora - 32 Levels AMC Javelin Cadillac Fleetwood Camaro
Z28 Chrysler Imperial Datsun 710 Dodge Challenger
Duster 360 Ferrari Dino Fiat 128 ... Volvo 142E - So, we use the hp permutation to reorder the
cars! - We can also do this with the whole dataframe, in
the same way - gtdsort_order,
118gt dsort_order, car_name mpg cyl
disp hp drat wt qsec vs am gear carb 19
Honda Civic 30.4 4 75.7 52 4.93 1.615
18.52 1 1 4 2 8 Merc 240D 24.4
4 146.7 62 3.69 3.190 20.00 1 0 4 2 20
Toyota Corolla 33.9 4 71.1 65 4.22 1.835
19.90 1 1 4 1 18 Fiat 128 32.4
4 78.7 66 4.08 2.200 19.47 1 1 4 1 26
Fiat X1-9 27.3 4 79.0 66 4.08 1.935
18.90 1 1 4 1 27 Porsche 914-2 26.0
4 120.3 91 4.43 2.140 16.70 0 1 5 2 3
Datsun 710 22.8 4 108.0 93 3.85 2.320
18.61 1 1 4 1 9 Merc 230 22.8
4 140.8 95 3.92 3.150 22.90 1 0 4 2 21
Toyota Corona 21.5 4 120.1 97 3.70 2.465
20.01 1 0 3 1 6 Valiant 18.1
6 225.0 105 2.76 3.460 20.22 1 0 3 1 32
Volvo 142E 21.4 4 121.0 109 4.11 2.780
18.60 1 1 4 2 1 Mazda RX4 21.0
6 160.0 110 3.90 2.620 16.46 0 1 4 4 2
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875
17.02 0 1 4 4 4 Hornet 4 Drive 21.4
6 258.0 110 3.08 3.215 19.44 1 0 3 1 28
Lotus Europa 30.4 4 95.1 113 3.77 1.513
16.90 1 1 5 2 10 Merc 280 19.2
6 167.6 123 3.92 3.440 18.30 1 0 4 4 11
Merc 280C 17.8 6 167.6 123 3.92 3.440
18.90 1 0 4 4 22 Dodge Challenger 15.5
8 318.0 150 2.76 3.520 16.87 0 0 3 2 23
AMC Javelin 15.2 8 304.0 150 3.15 3.435
17.30 0 0 3 2 5 Hornet Sportabout 18.7
8 360.0 175 3.15 3.440 17.02 0 0 3 2 25
Pontiac Firebird 19.2 8 400.0 175 3.08 3.845
17.05 0 0 3 2 30 Ferrari Dino 19.7
6 145.0 175 3.62 2.770 15.50 0 1 5 6 12
Merc 450SE 16.4 8 275.8 180 3.07 4.070
17.40 0 0 3 3 13 Merc 450SL 17.3
8 275.8 180 3.07 3.730 17.60 0 0 3 3 14
Merc 450SLC 15.2 8 275.8 180 3.07 3.780
18.00 0 0 3 3 15 Cadillac Fleetwood 10.4
8 472.0 205 2.93 5.250 17.98 0 0 3 4 16
Lincoln Continental 10.4 8 460.0 215 3.00 5.424
17.82 0 0 3 4 17 Chrysler Imperial 14.7
8 440.0 230 3.23 5.345 17.42 0 0 3 4 7
Duster 360 14.3 8 360.0 245 3.21 3.570
15.84 0 0 3 4 24 Camaro Z28 13.3
8 350.0 245 3.73 3.840 15.41 0 0 3 4 29
Ford Pantera L 15.8 8 351.0 264 4.22 3.170
14.50 0 1 5 4 31 Maserati Bora 15.0
8 301.0 335 3.54 3.570 14.60 0 1 5 8
- gt dsort_order,
- car_name mpg cyl disp hp drat
wt qsec vs am gear carb - 19 Honda Civic 30.4 4 75.7 52 4.93
1.615 18.52 1 1 4 2 - 8 Merc 240D 24.4 4 146.7 62 3.69
3.190 20.00 1 0 4 2 - 20 Toyota Corolla 33.9 4 71.1 65 4.22
1.835 19.90 1 1 4 1 - 18 Fiat 128 32.4 4 78.7 66 4.08
2.200 19.47 1 1 4 1 - 26 Fiat X1-9 27.3 4 79.0 66 4.08
1.935 18.90 1 1 4 1 - 27 Porsche 914-2 26.0 4 120.3 91 4.43
2.140 16.70 0 1 5 2 - 3 Datsun 710 22.8 4 108.0 93 3.85
2.320 18.61 1 1 4 1 - 9 Merc 230 22.8 4 140.8 95 3.92
3.150 22.90 1 0 4 2 - 21 Toyota Corona 21.5 4 120.1 97 3.70
2.465 20.01 1 0 3 1 - 6 Valiant 18.1 6 225.0 105 2.76
3.460 20.22 1 0 3 1 - 32 Volvo 142E 21.4 4 121.0 109 4.11
2.780 18.60 1 1 4 2 - 1 Mazda RX4 21.0 6 160.0 110 3.90
2.620 16.46 0 1 4 4 - 2 Mazda RX4 Wag 21.0 6 160.0 110 3.90
2.875 17.02 0 1 4 4 - 4 Hornet 4 Drive 21.4 6 258.0 110 3.08
3.215 19.44 1 0 3 1 - 28 Lotus Europa 30.4 4 95.1 113 3.77
1.513 16.90 1 1 5 2 - 10 Merc 280 19.2 6 167.6 123 3.92
3.440 18.30 1 0 4 4 - 11 Merc 280C 17.8 6 167.6 123 3.92
3.440 18.90 1 0 4 4
119What type of data are we looking at?
- Type here refers to what kind of data
representation - for example strings or
numbers - R usually does a good job at interpreting this
for us when we read data - This becomes important later on, as some
functions only take vectors of a certain data
type - for instance, you cannot take the mean()
of a vector with words!
120Simple types
- Numbers
- num_vectorlt- c(1,5,2,10)
- Strings(or characters)
- String_vectorlt-c( a, trial1)
- How can we know?
- class(num_vector)
- 1 "numeric"
121Factors/groups
- Factors slightly more tricky. A factor is often
a grouping variable - such as the gender or
experiment group, like - genderlt-c(f, f, m, m, f)
- experiment_numberlt-c(1,2,4,5,5,6)
- Problem here is that R will think that the first
vector is a string vector, and the last a
numerical vector. - gt class(gender)
- 1 "character"
- gt class(experiment_number)
- 1 "numeric"
122- This is resolved by
- gt experiment_numberlt-as.factor( c(1,2,4,5,5,6))
- gt class(experiment_number)
- 1 "factor"
123Over to analysis of data!
- boxplot (Sepal.WidthSpecies)
Is there a real difference?
124We now move over from pure R to using R to
make statistical tests
- Focus is what typical tests are there and how to
use them - We skip almost all statistical theory(!) -
covered in the course in fall
125Population and sample
- Population refers to a set of potential
measurements or values, including not only cases
actually observed but those that are potentially
observable. Really depends on the question at
hand! - Sample is the data we actually have - generally
this is very small compared to the whole
population, but we use it to draw inferences
about the population - So, the sample is a small part of a much larger
population, in most cases - Can samplepopulation? Examples?
126Significance tests
- In real life, we always get some difference
between datasets, because we sample limited data,
and there is always noise - So, is the difference in sepal width we see
between species something that is just due to
luck - or is it real? - Challenge What three features in the dataset
will be important for finding this out
(significant difference or just difference due to
random noise) ? Discuss with your sideman for a
minute
127Albins take
- The observed difference
- Obviously, the bigger observed difference, the
less random it is - The data depth
- The more data you have, the less likely it is
that strange things happens due to random effects - Data spread (variance)
- Slightly less intuitive. The less spread, the
less it can be explained by random events.
128The concept of null hypothesis and P-values
- Always in statistics, we set a null-hypothesis.
This is basically what we want to disprove! - Usually, this is something like there is no
difference in mean in these two samples. The
alternative hypothesis is then, by default,
There is a difference - Then, we test how likely this is, with certain
assumptions (depending on what test)
129- The P-value of a test can be interpreted as the
chance that the observed difference occurs if the
null hypothesis is true - This only makes sense if your test is relevant!
- So, a small P-value means that the null
hypothesis is not a good model - which means
that there is a difference - We often want P-values to be as small as possible
- the standard cutoff for significance is 5,
or 0.05 - However, this is just tradition it is
completely arbitrary
130Two-sample t-test
- deliberate skip of math details Jens will
cover some later on - The t-test tests if two distributions come from
distributions with the same mean. - The assumption is that both distributions are
normal-shaped - we should always check this
first
131Mean
Standard deviation a measure of spread
132Normality - brief check
- There are many tests for normality. Now we will
do the simplest by eye. Make a histogram and see
if the data looks reasonably bell-shaped - Challenge Test to make histograms for the sepal
length of each species
133Albins take
- warning wrong approach
- hist(Sepal.Length)
- not very normal-shaped!
- why is this the wrong approach?
134Albins take, again
- now for each species
- par (mfrowc(1,3))
- hist(Sepal.LengthSpecies"setosa")
- hist(Sepal.LengthSpecies"versicolor")
- hist(Sepal.LengthSpecies"virginica")
- IMHO, these are not very far away from
normal-shaped given the amount of data
135Slightly more advanced way the
quartile-quartile plot
- If two distributions have the same shape, their
cumulative distributions should match well
136 qqnorm(Sepal.Length)
137 qqnorm(Sepal.LengthSpecies"setosa")
138So, are they different?
gt setosalt-Sepal.LengthSpecies"setosa" gt
versicolorlt-Sepal.LengthSpecies"versicolor" gt
t.test(setosa, versicolor) Welch Two Sample
t-test data setosa and versicolor t
-10.521, df 86.538, p-value lt 2.2e-16
Very! alternative hypothesis true difference in
means is not equal to 0 95 percent confidence
interval -1.1057074 -0.7542926 sample
estimates mean of x mean of y 5.006
5.936
139Lets test it with some real data
- In your data directory, there are two files with
binding sites lengths, from estrogen receptor
alfa and beta ERA_widths.txt, ERB_widths.txt - 1) Read these into two different vectors, called
era, erb - 2) Have they significantly different means?
140Albins code
- gt dlt-read.table("ERA_widths.txt", hT)
- gt attach(d)
- gt dlt-read.table("ERB_widths.txt", hT)
- gt attach(d)
- gt
gt t.test(ERA_width, ERB_width) Welch Two Sample
t-test data ERA_width and ERB_width t
2.7288, df 1066.975, p-value
0.00646 alternative hypothesis true difference
in means is not equal to 0 95 percent confidence
interval 11.18740 68.45565 sample
estimates mean of x mean of y 514.1370
474.3155
141Hey, we forgot to check for normality!
gt par(mfrowc(1,2)) gt hist (ERA_width) gt hist
(ERB_width) NOT really normal shaped! t.test is
not valid!
142The two-sample Wilcoxon test
- The t-test is assuming that samples are
normal-distributed, which is often not true with
biological data - The wilcoxon test wilcox.test() is a less
powerful test which does not make such
assumptions. - Tests of this kind are called non-parametric
- Use when the normality assumption is doubtful
143Challenge
- Repeat the analysis of the ERA/beta sites using
wilcox.test. Is the difference significant?
144gt wilcox.test(ERA_width, ERB_width) Wilcoxon
rank sum test with continuity correction data
ERA_width and ERB_width W 151243.5, p-value
0.05551 alternative hypothesis true location
shift is not equal to 0
The AHH, so close to significance situation
145Different types of alternative hypothesis
- Null hypothesisthere is no difference
- Default alternative hypothesis there is
difference. This is called two-sided, as the
differences can be in either direction - Optional We can make the alternative hypothesis
one-sided - like There is a difference -
distribution A has a greater mean than
distribution B. This increases the statistical
power 2
146In R
gt wilcox.test(ERA_width, ERB_width) Wilcoxon
rank sum test with continuity correction data
ERA_width and ERB_width W 151243.5, p-value
0.05551 alternative hypothesis true location
shift is not equal to 0 gt wilcox.test(ERA_width,
ERB_width, alternative"greater") Wilcoxon
rank sum test with continuity correction data
ERA_width and ERB_width W 151243.5, p-value
0.02776 alternative hypothesis true location
shift is greater than 0 what happens if
alternativeless?
147gt wilcox.test(ERA_width, ERB_width,
alternative"less") Wilcoxon rank sum test with
continuity correction data ERA_width and
ERB_width W 151243.5, p-value
0.9723 alternative hypothesis true location
shift is less than 0
148Dangers with hypothesis fiddling
- This should never be used to get twice as good
P-value without some justification - Homework for Tuesday read BMJ 1994309248
(linked at the web page) for discussion on when
this is appropriate
149Testing many samples at once
- Sometimes, we have a table of groups, and we want
to see if any of the groups are different in
means - treatment1 treatment2 treatment3
- plant1
- plant2
- plant3
- etc
- This is the typical one-way Anova test situation
150boxplot (Sepal.WidthSpecies)
Is there a real difference?
Last time we checked this, we just tested setosa
vs versicolor
151- In R, oneway anova tests can be made easy by the
oneway.test() command - The oneway.test() works by giving it a formula,
in our case - Sepal.WidthSpecies.
- This will basically give the function a table
with three columns of values, one for each
species. - It will then test if the means of any of these
columns are different from the others
152Test flowers set
gt dlt-read.table("flowers.txt", hT) gt
attach(d) oneway.test(Sepal.WidthSpecies) One-w
ay analysis of means (not assuming equal
variances) data Sepal.Width and Species F
45.012, num df 2.000, denom df 97.402,
p-value 1.433e-14
As expected, very, very significant. Is the
difference in length of sepals (not widths) even
more significant? Test it!
153Non-parametric alternative to Anova
- As with t-tests, Anova tests assume that each of
the columns has a normal-like distribution. This
is often not the case - The Kruskal-Wallis test is analogous to the
wilcoxon test - a non-paramatric variant.
Practically, it works exactly in the same way as
the oneway.test().
1542 data sets gt2 data sets
Normal distribution t.test() oneway.test()
Not normal distribution (non-parametric tests) wilcox.test() kruskal.test()
155Larger exercise
- In the data folder, there is file
gene_lengths.txt, showing gene lengths from three
different chromosomes. - Is there a significant difference in the center
points (means or medians) of gene lengths for the
different chromosomes? What is the method of
choice? What two chromosomes have the most/least
significant difference?
156gt dlt-read.table("gene_lengths.txt", hT) gt
attach(d) gt names(d) 1 "id" "chr"
"gene_length first - is the gene length
distribution normal-shaped? gt hist(gene_length)
nope! gt nonparametric tests needed
157 many more than two groups, and
non-parametric test needed gtkruskal wallis
test gt kruskal.test(gene_lengthchr) Kruskal-Wal
lis rank sum test data gene_length by chr
Kruskal-Wallis chi-squared 131.7681, df 2,
p-value lt 2.2e-16 very significant difference!
158are all chromosome pairs significant? test
case use wilcox.test on two chromosomes gt
wilcox.test(gene_lengthchr"chr2" ,
gene_lengthchr"chr3") Wilcoxon rank sum
test with continuity correction data
gene_lengthchr "chr2" and gene_lengthchr
"chr3" W 1068584, p-value
0.1822 alternative hypothesis true location
shift is not equal to 0 so, all pairs are not
significantly different!
159chr1 chr2 chr3
chr1 1 2.2E-16 2.2E-16
chr2 1 0.1822
chr3 1
160for us who are lazy - we can do all vs all tests
like this gt pairwise.wilcox.test(gene_length,
gchr ) Pairwise comparisons using Wilcoxon
rank sum test data gene_length and chr
chr1 chr2 chr2 lt2e-16 - chr3 lt2e-16 0.18 P
value adjustment method holm
161Small homework I Reading about one-and two-sided
tests What is horribly wrong in this (published)
study?
162Small homework (II)
- What is the statistical problem with making these
pairwise tests en masse, for instance making 1000
t-tests and reporting all cases where Plt0.05? - Hint Go back to the slides about what the 5 P
value cutoff really means
163Homework and Examination
Big homework for examination goes online this
evening!
164Examination
- Three home works during the course
- These will together give 50 of the final points
- Main focus here is technical skill (how to do
analysis x) - Theoretically optional. Very strict deadlines.
- One large home work at the end of the course
- Will give 50 of final points
- Overlaps substantially with the three first, but
is more high-level - Main focus here is to merge technical skill with
biologically motivated analysis - real science - Do the maththe three smaller home works pays off
- Note High level of activity needed from the
start of the course. Will be demanding! The home
works are the examination!
165- The points with homework
- Encourage study during the course
- It is easy to think that you understood
something, but when meeting the real thing - Only realistic way of training
- Evaluation
166Rules for homework
- Working together in 2-3 perons teams is allowed
and encouraged, BUT - Each person hands in a separate report
- which must be unique. This is usually not a
problem if you write it yourself - If you work with others, write who you work with
on the report ( I worked with X and Y in this
homework!) - Cases of copying between students WILL be
reported. The BEST thing that can happen is that
you will get lower scores
167(No Transcript)
168(No Transcript)
169Using material from others
- Copying text/images from other sources
including fellow students is considered
cheating - You can quote sources, given that you cite them,
but this has a limit
170Questions?
171Correlations
- Sometimes, we do not want to show that two things
are different, but that they are correlated. - For instance, weight and height for individuals
are clearly different, but also highly correlated - This is a positive correlation - but we could
also have a negative correlation - Correlations are dangerous things - it depends on
how we ask questions .Easily the most misused and
misunderstood part of statistics - Correlation ! Causation
172Consider Why is it that
- Truth Compared to the general population, there
is a strong correlation between being an American
football player and risk for testicle cancer - Interpretation American football leads to
cancer!
173Consider Why is it that
- Truth There is an inverse correlation between
yearly income and running speed - Interpretation carrying money makes you slow!
174Time to explore
- Open the height.txt file in R, and look at the
different data types. Try to plot various columns
against each other, like - plot (mean_height, Year)
- What columns correlate? Can we break up the data
even more? - Try it out!
175gtplot (mean_weight, year ) gt plot (mean_height,
year ) gt plot (mean_height, mean_weight )
Hmmmobviously there are TWO datasets here! How
can we break this up?
176Visualize the groups with color
plot (mean_weight, year ) lines
(mean_weightgender"F", yeargender"F",
col"red" ) lines (mean_weightgender"M",
yeargender"M", col"blue" ) a typical case
of OVER-plotting
177Assessing correlations
- First plot a standard x-y plot
- Second, make a correlation test
- cor(x, y) gives the Pearson correlation
coefficient between z an y, which goes from
-1(perfect negative correlation) to 1(prefect
correlation). 0no correlation
178Test it
- gt cor(mean_weight, year)
- 1 0.1688499
- BAD correlation - close to zero
- What happens if we break it up by gender as in
the last plot? Test it!
179- gt cor (mean_weightgender"M",
yeargender"M" ) - 1 0.9527568
- gt cor (mean_weightgender"F",
yeargender"F" ) - 1 0.931249
- very, very correlated!
180Important Pearson correlation gives a score,
not a significance value. Obviously, it is easier
to get a good correlation if you only have a few
values! Often, we want to also want to TEST if
the correlation is there. The null hypothesis is
then 0 correlation This is done with
the cor.test() command
181gt cor.test (mean_weight, year ) Pearson's
product-moment correlation data mean_weight
and year t 0.5934, df 12, p-value
0.5639 alternative hypothesis true correlation
is not equal to 0 95 percent confidence
interval -0.3973253 0.6419208 sample
estimates cor 0.1688499
182gt cor.test (mean_weightgender"F",
yeargender"F" ) Pearson's product-moment
correlation data mean_weightgender "F"
and yeargender "F" t 5.7147, df 5,
p