R basics - PowerPoint PPT Presentation

1 / 244
About This Presentation
Title:

R basics

Description:

Fishers Exact Test Tests proportions in a 2x2 table by ... while chi-squre is an approximation Chi-square has problems with 0 count bins Chi-square is ... – PowerPoint PPT presentation

Number of Views:345
Avg rating:3.0/5.0
Slides: 245
Provided by: AlbinSa4
Category:
Tags: basics | squre | test

less

Transcript and Presenter's Notes

Title: R basics


1
R basics
  • Albin Sandelin

2
Ultra-short R introduction
  • Most life scientists use spreadsheet programs for
    analysis - like excel or statistica. Why?
  • Ease of use
  • - click buttons, select data by hand
  • you see the data in front of you
  • you can do limited programming

3
  • Disadvantages
  • Hard to handle large dataset (gt100 points)
  • Inflexible, few analyses available
  • Dumb-down click button effect
  • Hard to repeat analyses systematically with new
    data

4
R
  • R is a computational environment - somewhere
    between a program and a programming language
  • No buttons, no wizards only a command line
    interface
  • Is a professional statistics toolset - likely has
    all the analyses you will ever need in your
    career
  • Is also a programming language
  • Can handle large datasets
  • Very powerful graphics
  • The state-of-the-art statistics program for
    bioinformatics, especially array analysis
  • Free, and open source!

5
R in this course
  • In this course, we will use R for
  • Exploring data
  • Select subsets of data
  • Plotting data
  • Statistical tests on data
  • etc
  • almost everything.
  • It is the most important tool in this course!

6
R is hard
  • R is challenging to learn
  • No visual cues
  • Very different from spreadsheets
  • Hard to find the method to use
  • Cryptic manual pages (at times)
  • It requires a lot of effort to get inside the
    system, but bear in mind that it is worth it -
    you gain an edge to all spread-sheet users

7
Reading material
  • Keep the reference sheet handy
  • Save code from lecture exercises in a text file
  • Powerpoint slides are available after lectures at
    the web page
  • Can also recommend the tutorials on the CRAN
    website this is how I learned R
  • Dahlgaard book is very good, also for future
    reference

8
Expected
  • You will have R open at all times during this
    course
  • We will make MANY smaller exercises during class,
    most often in pairs
  • Understanding these exercises is the key to
    succeeding in the exam
  • We will have at least two teachers present at al
    lectures - use us to get help when needed

9
  • NB It is YOUR responsibility to that you
    understand the examples/exercises
  • if not, you have to ask for clarification during
    the walkthrough of the problems

10
First challenge of the day
  • Start R
  • Type
  • demo(graphics)
  • Hit enter a few times

11
Starting with R
  • We will now walk through the most basic R
    concepts
  • We will add successive detail on this with many
    exercises - and in a while get over to plotting,
    basic statistics, and finally some programming

12
Challenging?
  • Those who know R or have a math background might
    find most things for today no-so-challenging
  • We cover this to keep everyone on the same ground

13
Vectors
  • In statistics, we are almost always dealing with
    several data points
  • A vector is an collection of numbers and/or
    strings
  • (albin, sanne, jens)
  • (1, 5.2, 10, 7, 2, 21)
  • (3)
  • The last example is a vector of length 1

14
  • In R, we make a vector by the c() command (for
    concatenate)
  • gt c(1,5,10, 7, 2, 1)
  • 1 1 5 10 7 2 1
  • gt c('albin', 'sanne', 'jens')
  • 1 "albin" "sanne" "jens"
  • When we input strings or characters, we have to
    surround them with or
  • If making vectors of size 1, we can skip c()
  • gt 3
  • 1 3

Method name
Method arguments
Method output is also called return value(s)
15
  • Challenge
  • 1) Make the following vector
  • 45,5,12,10
  • 2) What happens with the following command?
  • c(1100)
  • c(502)
  • A vector is a data structure, and the most
    fundamental in R almost everything in R is some
    kind of vector, although sometimes in several
    dimensions vectors within vectors

16
The reference sheet is your friend!
  • You will get over-whelmed by different command
    names fast
  • Use the reference sheet to remind yourself in all
    exercises

17
Assignment to memory
  • The c() command is almost useless in itself - we
    want to keep the vector for other analyses
  • The assignment concept

gt 45 add 4 and 5 1 9 gt alt-4
store 4 as a gt blt-5 store 5 as
b gt a just checking 1 4 gt b 1 5 gt ab
add ab (45) 1 9 yes,
works
Anything after a will be disregarded by R. Used
for making comments
18
Expanding this to a whole vector my_vectorlt-
c(1,5,10, 7, 2) Note that there is no return
value now - this is caught by the my_vector.
What happens if you type my_vector after
this? my_vector is a variable, with the variable
name my_vector. Variable names are totally
arbitrary! The anatomy of the vector
Name my_vector
Values 1 5 10 7 2
Index 1 2 3 4 5
19
We can access part of the vector like
thismy_vector5 will give the 5th item in the
vectorWhat happens if you do this?
my_vectorlt- c(1,5,10, 7, 2) define the
vectormy_vector c(1,3,5)my_vector14my_v
ector41
20
gt my_vectorc(1,3,5) 1 1 10 2 gt
my_vector14 1 1 5 10 7
Making a series of integers AB will give a
vector of A,A1, A2B
gtmy_vector41 also works backwards 1 7 10
5 1
21
Challenge
  • Using the reference sheet, figure out at least
    three ways of making R print your vector in the
    other direction -

my_vectorlt- c(1,5,10, 7, 2) define the vector
22
  • gt my_vectorc(5,4,3,2,1)
  • 1 2 7 10 5 1
  • gt my_vectorc(51)
  • 1 2 7 10 5 1
  • gt rev(my_vector)
  • 1 2 7 10 5 1
  • gt

23
Interlude naming rules and the danger of
over-writing
  • Naming We can name vectors to almost anything.
    The most basic rule is Never start a vector
    name with a number
  • gt alt- c(1,5,4,2) OK
  • gt 1alt- c(1,5,4,2) NOT OK
  • Error syntax error
  • gt a1lt- c(1,5,4,2) OK
  • Over-writing
  • my_vectorlt- c(1,5,10, 7, 2)
  • my_vectorlt- c(10,5,2, 3, 1)
  • what does my_vector contain now?

24
Analyzing vectors
  • Many functions work directly on vectors - most
    have logical names
  • For instance, length(my_vector) gives the number
    of items in the vector ( 5)
  • Challenge
  • Make a vector big_vector with values 1 to 10000
    using the ab construct
  • What is the
  • Length of the vector
  • Sum of all items in vector the sum() function
  • Mean( average) of all items in the vector the
    mean() function

25
gt big_vectorlt-(110000) gt length(big_vector) 1
10000 gt sum(big_vector) 1 50005000 gt
mean(big_vector) 1 5000.5
26
Interlude the R help system
  • R has hundreds of functions where most have a
    non-trivial syntax The easiest way of retrieving
    documentation is by saying
  • ?function_name OR
  • help(function_name) OR
  • help.search(something) for fuzzy searches
  • Look at the help for sample() and sort() and then
    try them out on big_vector
  • Also try out RSiteSearch(sample) what happens

27
Adding and multiplying a number to a vector
  • Sometimes we want to add a number, like 10, to
    each element in the vector
  • gt big_vector 10
  • Test this
  • big_vector2lt-big_vector 10
  • Also test
  • min(big_vector)
  • max(big_vector)
  • min(big_vector2)
  • max(big_vector2)
  • What happens?

28
gt big_vector2lt-big_vector 10 gt
min(big_vector) 1 1 gt max(big_vector) 1
10000 gt min(big_vector2) 1 11 gt
max(big_vector2) 1 10010
29
Adding vectors
  • We can also add one vector to another vector
  • Say that we have the three vectors
  • Alt-c(10, 20, 30, 50)
  • Blt-c(1,4,2,3)
  • Clt-c(2.5, 3.5)
  • Test what happens and explain the outcome
  • AB
  • AC

30
gt AB1 11 24 32 53gt AC1 12.5 23.5 32.5
53.5
Alt-c(10, 20, 30, 50) Blt-c(1, 4, 2, 3
) Clt-c(2.5,3.5 )
  • AB is easy to understand A1B1 , etc
  • AC is trickier - the C vector is just of length
    2. It is re-used! So
  • A1C112.5
  • A2C223.5
  • A3C132.5
  • A4C253.5
  • Actually, this is what is happening also with
    A10 the 10 is used many times.

31
Plotting vectors
  • Lets make up some semi-random data
  • datlt-rnorm (100) draw 100 random, normal
    distributed data points
  • Test the following
  • plot(dat)
  • plot(dat,typel)
  • barplot(dat)
  • hist(dat)
  • What is the difference?

32
Why are your three first plots different than
mine? Why is your last plot more similar to mine?
33
Graph options
  • Generally, you can give extra options to
    graphical commands like this
  • plot (some_vector, colblue, typel)
  • In plot try to vary the following options - note
    that you can use several at once (and figure out
    what they do)
  • typel, b
  • col hotpink
  • mainimportant plot
  • typeh
  • typeS
  • These options are really arguments to the plot()
    function

34
More about functions
  • In almost all cases, a function needs some input,
    like plot(dat).
  • dat here is an unnamed argument, and this works
    because plot() assumes we mean x values
    dat.
  • We could also say plot(xdat) - a named
    argument. If you have many arguments, most of
    them are named - such as
  • plot (some_vector, colblue, types)
  • The help pages will tell you what type of
    arguments you can use

35
The danger of unnamed arguments
  • is that the order of them will make a big
    difference. Try this out - what is the difference
    between the plot commands?
  • alt-rnorm(100)
  • blt-rnorm(100)2
  • plot(a,b)
  • plot(b,a)
  • plot(xb, ya)

36
b
plot(a,b) plot(b,a) plot(xb, ya)
a
a
b
a
b
37
Some generic R arguments to plots - the par()
function
  • The par() function is used to set general plot
    properties. It has hundreds of possible arguments
    - see ?par
  • Two very handy par() arguments is mfrow() and
    mfcol() - these will allow many plots in one page
  • You give these functions a vector of length 2 -
    this gives the number of cells in the page (see
    example)

38
Example
  • gt par( mfrowc(3,1) )
  • gt plot (a,b)
  • gt plot (b,a)
  • gt plot(xb, ya)
  • gt par( mfrowc(2,2) )
  • gt plot (a,b)
  • gt plot (b,a)
  • gt plot(xb, ya)

39
Challenge - can you get the three plots in a row
using mfrow?
40
gt par( mfrowc(1,3) ) gt plot (a,b) gt plot (b,a) gt
plot(xb, ya)
41
Overlaying plots
  • Sometimes we want to put may data sets within one
    graph, on top of each other
  • This is often made by the lines() or points()
    command, like this

42
gt plot(b, type"l", col"blue") gt lines(a,
col"red")
Why did I start with plotting b? What would have
happed if using points() instead of lines()?
43
Sizing graphs
  • Simple concept, but awkward to write
  • Change X scale xlimc(start_value, end_value)
  • Change Y scale ylimc(start_value, end_value)

44
gt plot(a, type"l", col"blue") gt plot(a,
type"l", col"blue", ylimc(-5,5))
45
Saving graphs
  • Different on different systems!
  • All systems can use the original tricky way with
    device() see introduction PDFs if you have an
    interest. We wont use this.

46
OSX
  • Click on the graph, and just copy it. Will become
    a pdf or a bitmap when pasting.
  • On my mac you have top open a new file in Preview
    and paste there first. On others you can paste it
    into other programs directly
  • Or, just have the graphics window active and do
    file-gtsave

47
Windows
  • Right-click on graph, copy as metafile or bitmap,
    paste

48
Linux
  • Use awkward device() system, like pdf() or
  • Use GIMP and just make a screen capture from
    there by acquire

49
Statistics part 1
  • Describing the data descriptive statistics
  • What is the center point of the data?
  • What is the spread of the data?
  • How does the distribution look?

50
Summary statistics
  • hist() ( Histogram) is a graphical way of
    summarizing distributions it creates a number
    of bins and calculates how many of the data
    points that fall into each bin.
  • We can also summarize by the center points in
    the data
  • mean()

51
  • median()
  • Sort all the data, pick the number in the center.
    If the number of data points is even, take the
    mean of the two center points
  • Challenge
  • We make another vector
  • dat2lt-rnorm(10)
  • And add a few extra points to it
  • dat2lt-c(dat2, 10, 10.5, 30 )
  • Test mean() and median() on dat2. Are they the
    same? Can you explain the differences by plotting
    a histogram? What is the advantage/disadvantage
    of each measure?

52
gt dat2lt-rnorm(10) gt dat2lt-c(dat2, 10, 10.5, 30
) gt median(dat2) 1 0.11125 gt mean(dat2) 1
3.636139 gt hist(dat2)
Means are sensitive to outliers! Very common
situation in genomics
53
Percentiles
  • An extension of the median concept
  • Best explained by example
  • the 20th percentile is the value (or score) below
    which 20 percent of the observations may be
    found.
  • The median is the same as the 50th percentile
  • The first quartile is the 25th percentile, the
    third is the 75th
  • Try summary(dat) and summary(dat2)

54
summary(dat) Min. 1st Qu. Median Mean
3rd Qu. Max. -2.20600 -0.60480 -0.06930
-0.01333 0.68030 3.49600 gt summary(dat2)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-2.4010 -0.5385 0.1113 3.6360 0.6771 30.0000
The command ecdf() (empirical cumulative
distribution) calculates all percentiles in
your data and also understands plot() Try plot
(ecdf(dat2))
55
What fraction of the data that has been covered
at point X
Sorted values of dat2
56
Boxplots
  • An easier representation of ECDFs. Is based on
    both making boxes that tell us about both center
    point and spread of the data
  • First, calculate the first quartile, the median
    and the third quartile
  • Calculate the inter-quartile range (IQR) 3rd
    quartile -1st quartile
  • These will be used to draw a box
  • gtboxplot(dat)
  • gt rug(dat, side2)

57
Interquartile range (IQR)
3rd quartile
Median
1st quartile
58
continued
  • Sounds more complicated than it is
  • Any data observation which lies more than
    1.5IQR lower than the first quartile is
    considered an outlier.
  • Indicate where the smallest value that is not an
    outlier is by a vertical tic mark or "whisker",
    and connect the whisker to the box via a
    horizontal line.
  • Do the same for higher values

59
The data point that is highest but within 1.5 IQR
from the center
More extreme data - outliers
3rd quartile
Median
1st quartile
60
(No Transcript)
61
Variance, standard deviation and data spread
  • What is the difference between these
    distributions?

62
  • Same mean and median, but different spread over
    the x axis
  • This can be measured by the variance of the data
  • It is basically the difference between each point
    and the mean, squared

63
  • Standard deviation is simply variance squared.
  • This gives nice statistical features that you
    will get to know if you study statistics ?

64
Why is variance and standard deviation important?
  • Variance tells you something about the quality of
    measurements
  • The higher the variance, the harder it is to with
    certainty say that two measurements are different
  • (whiteboard demo)

65
  • What are the R functions for variance and
    standard deviation?
  • If we make some random data
  • smallsetlt-rnorm(100)
  • largesetlt-rnorm(10000)
  • What is the variance and standard deviation for
    these?
  • Is the standard deviation really the square root
    of the variance (what is the function for square
    root?)

66
  • gtsmallsetlt-rnorm(100)
  • gtlargesetlt-rnorm(10000)
  • gtvar(smallset)
  • 1 0.9504884
  • gtvar(largeset)
  • 1 0.98416
  • gtsd(largeset)
  • 1 0.9920484
  • gtsd(smallset)
  • 1 0.97493
  • gtsqrt(var(smallset))
  • 1 0.97493

Why do we get about the same variance?
67
(No Transcript)
68
Dealing with real data
  • involves interacting with messy data from other
    people
  • Almost all interesting data is real
  • We need methods to get the data into R to start
    with.

69
Reading data from files
  • Typically, you have data in a .txt or .xls file -
    how do you get it into R for analysis?
  • Four steps
  • 1 Format it to tab-separated text file
  • 2 Tell R what directory to work in
  • 3 Read in the file
  • 4 Attach data to memory (optional)

70
What R likes
  • A data frame(table) with columns. The first
    row can be column names, like this

Lecturer height age Poker_games_won
Albin 181 34 14
Jens 189 31 10
Sanne 170 NA 21
Name row, header
First data row
String data
Numerical data
Code for missing data Do not ask ladies for age
71
Reading data(1)
  • General rules
  • Use tabs as separators
  • Each row has to have the same number of columns
  • Missing data is NA, not empty
  • As .txt The ideal format!
  • As .xls save as tabbed text

72
(No Transcript)
73
Get some data to play withUpdate old file is
wrong! Please download the data again
  • See course web page!
  • http//people.binf.ku.dk/albin/teaching/htbinf/R/
  • Download the zipped file save this at a good
    place that you know where it is
  • I will refer to this place on your hard drive as
    the data directory

74
Reading data(2)
  • Tell R where to work.
  • In R, you always work in a given directory,
    called working directory.
  • You can set this graphically in Windows or MacOS
    (next pages)
  • Or use setwd(some directory)
  • In Linux, it pays to start R in the same
    directory
  • It pays to have one directory per larger
    analysis, and call the data files informative
    things (not Data.txt)

75
In OSX
Change working directory
Current working directory
76
In Windows
77
Reading in data (3)
  • my_datalt- read.table(filename, headerT)

Only if you have column names I always have
that
Data frame object holding all the data
The actual command
78
Lets try
  • In the data directory, there is a file
    flowers.txt. Have a look at it (it is a normal
    text file can be opened in almost anything).
    Now
  • my_datalt-read.table(flowers.txt, headerT)
  • names(my_data)
  • What happens what does names do?

79
  • gt my_datalt-read.table("flowers.txt", hT)
  • note to self show name completion feature
  • gt names(my_data)
  • 1 "Sepal.Length" "Sepal.Width" "Petal.Length"
    "Petal.Width" "Species"

Sepal.Length Sepal.Width Petal.Length Petal.Width
Species 5.1 3.5 1.4 0.2 setosa 4.9 3.0 1.4 0.2 set
osa 4.7 3.2 1.3 0.2 setosa 4.6 3.1 1.5 0.2 setosa
5.0 3.6 1.4 0.2 setosa Etc
80
Accessing the columns
  • There are three ways of getting the data in a
    data frame
  • Get the column by saying
  • dataframe_namecolumn_name
  • Get the columns as separate vectors, named as
    their header - the attach command
  • More complex queries, using index operations
    (more in a few slides)

81
Columns by names
  • The attach(some_data_frame) function will put all
    the columns of some_data_frame as vectors in
    working memory
  • So, in our case
  • attach(my_data)
  • Will give the 5 vectors
  • "Sepal.Length" "Sepal.Width" "Petal.Length"
    "Petal.Width" and "Species"
  • (which we then can analyze)

82
Dangers of the attach() function
  • Attach will make a vector out each column, named
    after the column name
  • What happens if we already have a vector with
    that name?
  • What does ls() and detach() do? Why would we
    want to do this?

83
Challenge
  • Using the flowers file,
  • Plot a histogram of the length of sepals
    (Sepal.Length)
  • Plot the correlation between sepal length and
    sepal width (an x-y plot)
  • What is the mean of Sepal.Length and Sepal.Width?
  • Why do you get strange results on Sepal.Length?
  • Can you solve the problem by using the
    appropriate help file?

84
  • gt hist(Sepal.Length)
  • gt plot(Sepal.Length, Sepal.Width)

85
gt mean(Sepal.Width) 1 3.057333 gt
mean(Sepal.Length) 1 NA gt
strange!
If we look at the file Sepal.Length Sepal.Width P
etal.Length Petal.Width Species 5.1 3.5 1.4 0.2 se
tosa 4.9 3.0 1.4 0.2 setosa NA 3.2 1.3 0.2 setosa
4.6 3.1 1.5 0.2 setosa 5.0 3.6 1.4 0.2 setosa 5.4
3.9 1.7 0.4 setosa 4.6 3.4 1.4 0.3 setosa
86
Missing values
  • In real data, we often have missing values. In R,
    we label these NA
  • Any examples of missing data?
  • Some functions can by default deal with these -
    for instance hist()
  • Some functions will need to be told what to do
    with them, like mean()

87
gt mean(Sepal.Length) 1 NA na.rm is the option
for mean to remove all NAs first gt
mean(Sepal.Length, na.rmT) 1 5.851007 With
new files, you should check whether there are any
NA values first.
88
What if we wanted just the 5 first rows? Or just
all values from the species versicolor?
  • Sepal.Length Sepal.Width Petal.Length Petal.Width
    Species
  • 5.1 3.5 1.4 0.2 setosa
  • 4.9 3.0 1.4 0.2 setosa
  • 4.7 3.2 1.3 0.2 setosa
  • 4.6 3.1 1.5 0.2 setosa
  • 5.0 3.6 1.4 0.2 setosa
  • 6.2 2.9 4.3 1.3 versicolor
  • 5.1 2.5 3.0 1.1 versicolor
  • 5.7 2.8 4.1 1.3 versicolor
  • 6.3 3.3 6.0 2.5 virginica
  • 5.8 2.7 5.1 1.9 virginica
  • 7.1 3.0 5.9 2.1 virginica
  • 6.3 2.9 5.6 1.8 virginica

89
The index concept
  • Recap
  • In a vector, we can select individual values by
    saying
  • Sepal.Widthsome position
  • gt Sepal.Width5
  • 1 3.6
  • Sepal.Widthsome_positions
  • gt Sepal.Width210
  • 1 3.0 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1
  • Sepal.Widthc(2,4,6,8)
  • 1 3.0 3.1 3.9 3.4
  • The first position will always have the number 1,
    the last will have the length of the vector.
  • Challenge what is the mean of the first 100
    values of Sepal.Width?

90
Albins code
  • gt alt-Sepal.Width1100
  • gt mean(a)
  • 1 3.099
  • or, in just one line
  • gt mean(Sepal.Width1100)
  • 1 3.099

91
Selecting parts of vectors depending on values
  • Often, we want parts of vector due to the values
    of the vector -for instance all values lt10, or
    all strings equaling versicolor.
  • This is made by a slightly odd syntax(it will
    make sense later on when we have many vectors)
  • To get all values over 10 in Sepal.Length
  • Sepal.Length Sepal.Length gt10
  • So, we are using a criterion instead of the
    index

92
Try out
  • Make a boxplot of all values in Sepal.Length that
    are higher than 5
  • How many vales are there that satisfy this
    criterion?

93
Albins code
  • alt-Sepal.LengthSepal.Length gt5
  • boxplot (a)
  • length(a)
  • 1 119

94
criteria in R
  • lt Smaller than
  • gt Bigger than
  • gt Bigger or equal to
  • lt smaller than or equal to
  • equal to (not )
  • ! not equal to
  • Challenge
  • how many rows for the virginica species?

95
Albins try
  • gt length(SpeciesSpeciesvirginica )
  • Error in NextMethod("") object "virginica" not
    found
  • what happens here is that R thinks virginica is
    an object - a vector or something else
  • gt length(SpeciesSpecies"virginica" )
  • 1 50

For string comparisons, we need to enclose the
string with or
96
Selection using two vectors
  • Often we have two vectors from a file, and we
    want to use the value of one vector to select
    values from the other.
  • For instance, sepal length broken up by species

Sepal.Length Sepal.Width Petal.Length Petal.Width
Species 5.1 3.5 1.4 0.2 setosa 4.9 3.0 1.4 0.2 set
osa 4.7 3.2 1.3 0.2 setosa 4.6 3.1 1.5 0.2 setosa
5.0 3.6 1.4 0.2 setosa 6.2 2.9 4.3 1.3 versicolor
5.1 2.5 3.0 1.1 versicolor 5.7 2.8 4.1 1.3 versico
lor 6.3 3.3 6.0 2.5 virginica 5.8 2.7 5.1 1.9 virg
inica 7.1 3.0 5.9 2.1 virginica 6.3 2.9 5.6 1.8 vi
rginica
97
  • Use the same syntax as before - but change the
    vetor names!
  • some_vector other_vector something
  • Challenge
  • select the Sepal.Width values that correspond to
    the setosa species
  • What is the mean of these values?
  • What is the distribution of these values?

98
Albins code
gt alt-Sepal.WidthSpecies"setosa" gt mean(a) 1
3.428 gt hist(a)
99
Fancy way of breaking up data (only works with
some functions)
  • boxplot (Sepal.WidthSpecies)
  • Try it out
  • This is a formula construction - break up width
    in the levels of species

100
Even more complex multiple criteria
  • and combine two criteria
  • or either this or that has to be true
  • So,
  • Species Sepal.Width gt4 Sepal.Length lt5
  • will give the species rows where width gt4 AND
    Length is lt5.
  • and, Species Sepal.Width gt4 Sepal.Length lt5
    will give?
  • Challenge Make a boxplot of Sepal.Width where
    Sepal.Length gt3 AND the species is virginica

101
  • gtboxplot (Sepal.WidthSepal.Length gt3
    Species"virginica" )

102
More complex data selections
  • The dataframe is really a two-dimensional vector
    - a vector that contains vectors

103
  • We can use almost the same commands to select
    sub-sets of the data frame, without using the
    attach() function
  • This is very handy at times

104
  • The data frame is really a big table - a 2D
    vector.
  • Just as some_vector5 gives use the fifth value
    in a vector,
  • some_dataframe10, 2 gives us the cell in the
    10th row and the 2nd column
  • gtmy_data10, 2
  • 1 3.1

Sepal.Length Sepal.Width Petal.Length Petal.Width
Species 5.1 3.5 1.4 0.2 setosa 4.9 3.0 1.4 0.2 set
osa 4.7 3.2 1.3 0.2 setosa 4.6 3.1 1.5 0.2 setosa
5.0 3.6 1.4 0.2 setosa 5.4 3.9 1.7 0.4 setosa 4.6
3.4 1.4 0.3 setosa 5.0 3.4 1.5 0.2 setosa 4.4 2.9
1.4 0.2 setosa 4.9 3.1 1.5 0.1 setosa 5.4 3.7 1.5
0.2 setosa 4.8 3.4 1.6 0.2 setosa 4.8 3.0 1.4 0.1
setosa
105
Selecting rows
  • If we instead say
  • my_data10,
  • We will get the WHOLE 10th row back
  • One can think of it as saying
  • my_data10, give me everything
  • So, what will happen if we say
  • my_data15,
  • Or
  • my_data51,
  • Try it out

106
Indexing exercise
  • In the data directory, there is a file called
    Rindexing.pdf it is a set of small exercises
    designed to help you understand indexing
  • It works with a very small data file, just so
    that you can check your results by eye
  • Indexing is very important for the rest of the
    course

107
Nota Bene 1)Students on the reserve list
  • Please talk to me to get signed on to the course
    (I need your student number or similar)
  • Otherwise you cannot go to the exam gt no credit
  • 2) A watch was found in the last lecture will
    be given back if you can describe it well

108
Sorting one vector by another
  • Very common situation with experimental data
    what are the genes with the highest expression in
    my data set?
  • Recap what happened if you said
  • my_data15,
  • Or
  • my_data51,
  • ?

109
  • get some data to play with
  • dlt-read.table("mtcar.txt", hT)
  • attach(d)

Just by looking at the data, d what car has
the highest and lowest horsepower (hp)?
110
  • car_name mpg cyldisp hp drat wt qsecvs am
    gear carb
  • 1 Mazda RX4 21.0 6 160.0 110 3.90
    2.620 16.46 0 1 4 4
  • 2 Mazda RX4 Wag 21.0 6 160.0 110 3.90
    2.875 17.02 0 1 4 4
  • 3 Datsun 710 22.8 4 108.0 93 3.85
    2.320 18.61 1 1 4 1
  • 4 Hornet 4 Drive 21.4 6 258.0 110 3.08
    3.215 19.44 1 0 3 1
  • 5 Hornet Sportabout 18.7 8 360.0 175 3.15
    3.440 17.02 0 0 3 2
  • 6 Valiant 18.1 6 225.0 105 2.76
    3.460 20.22 1 0 3 1
  • 7 Duster 360 14.3 8 360.0 245 3.21
    3.570 15.84 0 0 3 4
  • 8 Merc 240D 24.4 4 146.7 62 3.69
    3.190 20.00 1 0 4 2
  • 9 Merc 230 22.8 4 140.8 95 3.92
    3.150 22.90 1 0 4 2
  • 10 Merc 280 19.2 6 167.6 123 3.92
    3.440 18.30 1 0 4 4
  • 11 Merc 280C 17.8 6 167.6 123 3.92
    3.440 18.90 1 0 4 4
  • 12 Merc 450SE 16.4 8 275.8 180 3.07
    4.070 17.40 0 0 3 3
  • 13 Merc 450SL 17.3 8 275.8 180 3.07
    3.730 17.60 0 0 3 3
  • 14 Merc 450SLC 15.2 8 275.8 180 3.07
    3.780 18.00 0 0 3 3
  • 15 Cadillac Fleetwood 10.4 8 472.0 205 2.93
    5.250 17.98 0 0 3 4
  • 16 Lincoln Continental 10.4 8 460.0 215 3.00
    5.424 17.82 0 0 3 4
  • 17 Chrysler Imperial 14.7 8 440.0 230 3.23
    5.345 17.42 0 0 3 4
  • 18 Fiat 128 32.4 4 78.7 66 4.08
    2.200 19.47 1 1 4 1

111
Permutations
  • sort.list(x) will give the permutation of x that
    sorts x
  • Is much easier to explain with a small example
  • gt alt-c(3,1,10)
  • gt sort.list(a)
  • 1 2 1 3

112
  • gt alt-c(3,1,10)
  • gt sort.list(a)
  • 1 2 1 3
  • gt blt-sort.list(a)
  • gt ab
  • 1 1 3 10
  • sort.list(hp)
  • 1 19 8 20 18 26 27 3 9 21 6 32 1 2 4 28
    10 11 22 23 5 25 30 12 13 14 15 16 17 7 24 29
    31

113
  • car_name mpg cyl disp hp drat wt qsec vs
    am gear carb
  • 1 Mazda RX4 21.0 6 160.0 110 3.90
    2.620 16.46 0 1 4 4
  • 2 Mazda RX4 Wag 21.0 6 160.0 110 3.90
    2.875 17.02 0 1 4 4
  • 3 Datsun 710 22.8 4 108.0 93 3.85
    2.320 18.61 1 1 4 1
  • 4 Hornet 4 Drive 21.4 6 258.0 110 3.08
    3.215 19.44 1 0 3 1
  • 5 Hornet Sportabout 18.7 8 360.0 175 3.15
    3.440 17.02 0 0 3 2
  • 6 Valiant 18.1 6 225.0 105 2.76
    3.460 20.22 1 0 3 1
  • 7 Duster 360 14.3 8 360.0 245 3.21
    3.570 15.84 0 0 3 4
  • 8 Merc 240D 24.4 4 146.7 62 3.69
    3.190 20.00 1 0 4 2
  • 9 Merc 230 22.8 4 140.8 95 3.92
    3.150 22.90 1 0 4 2
  • 10 Merc 280 19.2 6 167.6 123 3.92
    3.440 18.30 1 0 4 4
  • 11 Merc 280C 17.8 6 167.6 123 3.92
    3.440 18.90 1 0 4 4
  • 12 Merc 450SE 16.4 8 275.8 180 3.07
    4.070 17.40 0 0 3 3
  • 13 Merc 450SL 17.3 8 275.8 180 3.07
    3.730 17.60 0 0 3 3
  • 14 Merc 450SLC 15.2 8 275.8 180 3.07
    3.780 18.00 0 0 3 3
  • 15 Cadillac Fleetwood 10.4 8 472.0 205 2.93
    5.250 17.98 0 0 3 4
  • 16 Lincoln Continental 10.4 8 460.0 215 3.00
    5.424 17.82 0 0 3 4
  • 17 Chrysler Imperial 14.7 8 440.0 230 3.23
    5.345 17.42 0 0 3 4
  • 18 Fiat 128 32.4 4 78.7 66 4.08
    2.200 19.47 1 1 4 1

19 8 20 18 26 27 3 9 21 6 32 1 2 4 28 10
11 22 23 5 25 30 12 13 14 15 16 17 7 24 29 31
114
Thought exercise
  • What happens if we now say
  • hpc(19 , 8 ,20 ,18, 26, 27, 3, 9, 21, 6, 32,
    1, 2, 4, 28, 10, 11, 22, 23, 5, 25, 30, 12,
    13, 14 ,15 ,16 ,17 , 7 ,24 ,29, 31)
  • ?

115
We get hp sorted
  • hpc(19 , 8 ,20 ,18, 26, 27, 3, 9, 21, 6, 32,
    1, 2, 4, 28, 10, 11, 22, 23, 5, 25, 30, 12,
    13, 14 ,15 ,16 ,17 , 7 ,24 ,29, 31)
  • 1 52 62 65 66 66 91 93 95 97 105 109
    110 110 110 113 123 123 150 150 175 175 175 180
    180 180 205 215 230 245 245 264 335

116
So, sort.list really gives an instruction how to
sort
  • So, we can also say
  • sort_orderlt-sort.list(hp)
  • hpsort_order
  • What will happen if we say
  • car_namesort_order
  • ?

117
  • car_namesort_order
  • 1 Honda Civic Merc 240D
    Toyota Corolla Fiat 128 Fiat X1-9
    Porsche 914-2 Datsun 710
  • 8 Merc 230 Toyota Corona
    Valiant Volvo 142E Mazda RX4
    Mazda RX4 Wag Hornet 4 Drive
  • 15 Lotus EuropaMerc 280 Merc 280C
    Dodge Challenger AMC Javelin
    Hornet Sportabout Pontiac Firebird
  • 22 Ferrari Dino Merc 450SE Merc
    450SL Merc 450SLC Cadillac
    Fleetwood Lincoln Continental Chrysler Imperial
  • 29 Duster 360 Camaro Z28 Ford
    Pantera L Maserati Bora
  • 32 Levels AMC Javelin Cadillac Fleetwood Camaro
    Z28 Chrysler Imperial Datsun 710 Dodge Challenger
    Duster 360 Ferrari Dino Fiat 128 ... Volvo 142E
  • So, we use the hp permutation to reorder the
    cars!
  • We can also do this with the whole dataframe, in
    the same way
  • gtdsort_order,

118
gt dsort_order, car_name mpg cyl
disp hp drat wt qsec vs am gear carb 19
Honda Civic 30.4 4 75.7 52 4.93 1.615
18.52 1 1 4 2 8 Merc 240D 24.4
4 146.7 62 3.69 3.190 20.00 1 0 4 2 20
Toyota Corolla 33.9 4 71.1 65 4.22 1.835
19.90 1 1 4 1 18 Fiat 128 32.4
4 78.7 66 4.08 2.200 19.47 1 1 4 1 26
Fiat X1-9 27.3 4 79.0 66 4.08 1.935
18.90 1 1 4 1 27 Porsche 914-2 26.0
4 120.3 91 4.43 2.140 16.70 0 1 5 2 3
Datsun 710 22.8 4 108.0 93 3.85 2.320
18.61 1 1 4 1 9 Merc 230 22.8
4 140.8 95 3.92 3.150 22.90 1 0 4 2 21
Toyota Corona 21.5 4 120.1 97 3.70 2.465
20.01 1 0 3 1 6 Valiant 18.1
6 225.0 105 2.76 3.460 20.22 1 0 3 1 32
Volvo 142E 21.4 4 121.0 109 4.11 2.780
18.60 1 1 4 2 1 Mazda RX4 21.0
6 160.0 110 3.90 2.620 16.46 0 1 4 4 2
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875
17.02 0 1 4 4 4 Hornet 4 Drive 21.4
6 258.0 110 3.08 3.215 19.44 1 0 3 1 28
Lotus Europa 30.4 4 95.1 113 3.77 1.513
16.90 1 1 5 2 10 Merc 280 19.2
6 167.6 123 3.92 3.440 18.30 1 0 4 4 11
Merc 280C 17.8 6 167.6 123 3.92 3.440
18.90 1 0 4 4 22 Dodge Challenger 15.5
8 318.0 150 2.76 3.520 16.87 0 0 3 2 23
AMC Javelin 15.2 8 304.0 150 3.15 3.435
17.30 0 0 3 2 5 Hornet Sportabout 18.7
8 360.0 175 3.15 3.440 17.02 0 0 3 2 25
Pontiac Firebird 19.2 8 400.0 175 3.08 3.845
17.05 0 0 3 2 30 Ferrari Dino 19.7
6 145.0 175 3.62 2.770 15.50 0 1 5 6 12
Merc 450SE 16.4 8 275.8 180 3.07 4.070
17.40 0 0 3 3 13 Merc 450SL 17.3
8 275.8 180 3.07 3.730 17.60 0 0 3 3 14
Merc 450SLC 15.2 8 275.8 180 3.07 3.780
18.00 0 0 3 3 15 Cadillac Fleetwood 10.4
8 472.0 205 2.93 5.250 17.98 0 0 3 4 16
Lincoln Continental 10.4 8 460.0 215 3.00 5.424
17.82 0 0 3 4 17 Chrysler Imperial 14.7
8 440.0 230 3.23 5.345 17.42 0 0 3 4 7
Duster 360 14.3 8 360.0 245 3.21 3.570
15.84 0 0 3 4 24 Camaro Z28 13.3
8 350.0 245 3.73 3.840 15.41 0 0 3 4 29
Ford Pantera L 15.8 8 351.0 264 4.22 3.170
14.50 0 1 5 4 31 Maserati Bora 15.0
8 301.0 335 3.54 3.570 14.60 0 1 5 8
  • gt dsort_order,
  • car_name mpg cyl disp hp drat
    wt qsec vs am gear carb
  • 19 Honda Civic 30.4 4 75.7 52 4.93
    1.615 18.52 1 1 4 2
  • 8 Merc 240D 24.4 4 146.7 62 3.69
    3.190 20.00 1 0 4 2
  • 20 Toyota Corolla 33.9 4 71.1 65 4.22
    1.835 19.90 1 1 4 1
  • 18 Fiat 128 32.4 4 78.7 66 4.08
    2.200 19.47 1 1 4 1
  • 26 Fiat X1-9 27.3 4 79.0 66 4.08
    1.935 18.90 1 1 4 1
  • 27 Porsche 914-2 26.0 4 120.3 91 4.43
    2.140 16.70 0 1 5 2
  • 3 Datsun 710 22.8 4 108.0 93 3.85
    2.320 18.61 1 1 4 1
  • 9 Merc 230 22.8 4 140.8 95 3.92
    3.150 22.90 1 0 4 2
  • 21 Toyota Corona 21.5 4 120.1 97 3.70
    2.465 20.01 1 0 3 1
  • 6 Valiant 18.1 6 225.0 105 2.76
    3.460 20.22 1 0 3 1
  • 32 Volvo 142E 21.4 4 121.0 109 4.11
    2.780 18.60 1 1 4 2
  • 1 Mazda RX4 21.0 6 160.0 110 3.90
    2.620 16.46 0 1 4 4
  • 2 Mazda RX4 Wag 21.0 6 160.0 110 3.90
    2.875 17.02 0 1 4 4
  • 4 Hornet 4 Drive 21.4 6 258.0 110 3.08
    3.215 19.44 1 0 3 1
  • 28 Lotus Europa 30.4 4 95.1 113 3.77
    1.513 16.90 1 1 5 2
  • 10 Merc 280 19.2 6 167.6 123 3.92
    3.440 18.30 1 0 4 4
  • 11 Merc 280C 17.8 6 167.6 123 3.92
    3.440 18.90 1 0 4 4

119
What type of data are we looking at?
  • Type here refers to what kind of data
    representation - for example strings or
    numbers
  • R usually does a good job at interpreting this
    for us when we read data
  • This becomes important later on, as some
    functions only take vectors of a certain data
    type - for instance, you cannot take the mean()
    of a vector with words!

120
Simple types
  • Numbers
  • num_vectorlt- c(1,5,2,10)
  • Strings(or characters)
  • String_vectorlt-c( a, trial1)
  • How can we know?
  • class(num_vector)
  • 1 "numeric"

121
Factors/groups
  • Factors slightly more tricky. A factor is often
    a grouping variable - such as the gender or
    experiment group, like
  • genderlt-c(f, f, m, m, f)
  • experiment_numberlt-c(1,2,4,5,5,6)
  • Problem here is that R will think that the first
    vector is a string vector, and the last a
    numerical vector.
  • gt class(gender)
  • 1 "character"
  • gt class(experiment_number)
  • 1 "numeric"

122
  • This is resolved by
  • gt experiment_numberlt-as.factor( c(1,2,4,5,5,6))
  • gt class(experiment_number)
  • 1 "factor"

123
Over to analysis of data!
  • boxplot (Sepal.WidthSpecies)

Is there a real difference?
124
We now move over from pure R to using R to
make statistical tests
  • Focus is what typical tests are there and how to
    use them
  • We skip almost all statistical theory(!) -
    covered in the course in fall

125
Population and sample
  • Population refers to a set of potential
    measurements or values, including not only cases
    actually observed but those that are potentially
    observable. Really depends on the question at
    hand!
  • Sample is the data we actually have - generally
    this is very small compared to the whole
    population, but we use it to draw inferences
    about the population
  • So, the sample is a small part of a much larger
    population, in most cases
  • Can samplepopulation? Examples?

126
Significance tests
  • In real life, we always get some difference
    between datasets, because we sample limited data,
    and there is always noise
  • So, is the difference in sepal width we see
    between species something that is just due to
    luck - or is it real?
  • Challenge What three features in the dataset
    will be important for finding this out
    (significant difference or just difference due to
    random noise) ? Discuss with your sideman for a
    minute

127
Albins take
  • The observed difference
  • Obviously, the bigger observed difference, the
    less random it is
  • The data depth
  • The more data you have, the less likely it is
    that strange things happens due to random effects
  • Data spread (variance)
  • Slightly less intuitive. The less spread, the
    less it can be explained by random events.

128
The concept of null hypothesis and P-values
  • Always in statistics, we set a null-hypothesis.
    This is basically what we want to disprove!
  • Usually, this is something like there is no
    difference in mean in these two samples. The
    alternative hypothesis is then, by default,
    There is a difference
  • Then, we test how likely this is, with certain
    assumptions (depending on what test)

129
  • The P-value of a test can be interpreted as the
    chance that the observed difference occurs if the
    null hypothesis is true
  • This only makes sense if your test is relevant!
  • So, a small P-value means that the null
    hypothesis is not a good model - which means
    that there is a difference
  • We often want P-values to be as small as possible
    - the standard cutoff for significance is 5,
    or 0.05
  • However, this is just tradition it is
    completely arbitrary

130
Two-sample t-test
  • deliberate skip of math details Jens will
    cover some later on
  • The t-test tests if two distributions come from
    distributions with the same mean.
  • The assumption is that both distributions are
    normal-shaped - we should always check this
    first

131
Mean
Standard deviation a measure of spread
132
Normality - brief check
  • There are many tests for normality. Now we will
    do the simplest by eye. Make a histogram and see
    if the data looks reasonably bell-shaped
  • Challenge Test to make histograms for the sepal
    length of each species

133
Albins take
  • warning wrong approach
  • hist(Sepal.Length)
  • not very normal-shaped!
  • why is this the wrong approach?

134
Albins take, again
  • now for each species
  • par (mfrowc(1,3))
  • hist(Sepal.LengthSpecies"setosa")
  • hist(Sepal.LengthSpecies"versicolor")
  • hist(Sepal.LengthSpecies"virginica")
  • IMHO, these are not very far away from
    normal-shaped given the amount of data

135
Slightly more advanced way the
quartile-quartile plot
  • If two distributions have the same shape, their
    cumulative distributions should match well

136
qqnorm(Sepal.Length)
137
qqnorm(Sepal.LengthSpecies"setosa")
138
So, are they different?
gt setosalt-Sepal.LengthSpecies"setosa" gt
versicolorlt-Sepal.LengthSpecies"versicolor" gt
t.test(setosa, versicolor) Welch Two Sample
t-test data setosa and versicolor t
-10.521, df 86.538, p-value lt 2.2e-16
Very! alternative hypothesis true difference in
means is not equal to 0 95 percent confidence
interval -1.1057074 -0.7542926 sample
estimates mean of x mean of y 5.006
5.936
139
Lets test it with some real data
  • In your data directory, there are two files with
    binding sites lengths, from estrogen receptor
    alfa and beta ERA_widths.txt, ERB_widths.txt
  • 1) Read these into two different vectors, called
    era, erb
  • 2) Have they significantly different means?

140
Albins code
  • gt dlt-read.table("ERA_widths.txt", hT)
  • gt attach(d)
  • gt dlt-read.table("ERB_widths.txt", hT)
  • gt attach(d)
  • gt

gt t.test(ERA_width, ERB_width) Welch Two Sample
t-test data ERA_width and ERB_width t
2.7288, df 1066.975, p-value
0.00646 alternative hypothesis true difference
in means is not equal to 0 95 percent confidence
interval 11.18740 68.45565 sample
estimates mean of x mean of y 514.1370
474.3155
141
Hey, we forgot to check for normality!
gt par(mfrowc(1,2)) gt hist (ERA_width) gt hist
(ERB_width) NOT really normal shaped! t.test is
not valid!
142
The two-sample Wilcoxon test
  • The t-test is assuming that samples are
    normal-distributed, which is often not true with
    biological data
  • The wilcoxon test wilcox.test() is a less
    powerful test which does not make such
    assumptions.
  • Tests of this kind are called non-parametric
  • Use when the normality assumption is doubtful

143
Challenge
  • Repeat the analysis of the ERA/beta sites using
    wilcox.test. Is the difference significant?

144
gt wilcox.test(ERA_width, ERB_width) Wilcoxon
rank sum test with continuity correction data
ERA_width and ERB_width W 151243.5, p-value
0.05551 alternative hypothesis true location
shift is not equal to 0
The AHH, so close to significance situation
145
Different types of alternative hypothesis
  • Null hypothesisthere is no difference
  • Default alternative hypothesis there is
    difference. This is called two-sided, as the
    differences can be in either direction
  • Optional We can make the alternative hypothesis
    one-sided - like There is a difference -
    distribution A has a greater mean than
    distribution B. This increases the statistical
    power 2

146
In R
gt wilcox.test(ERA_width, ERB_width) Wilcoxon
rank sum test with continuity correction data
ERA_width and ERB_width W 151243.5, p-value
0.05551 alternative hypothesis true location
shift is not equal to 0 gt wilcox.test(ERA_width,
ERB_width, alternative"greater") Wilcoxon
rank sum test with continuity correction data
ERA_width and ERB_width W 151243.5, p-value
0.02776 alternative hypothesis true location
shift is greater than 0 what happens if
alternativeless?
147
gt wilcox.test(ERA_width, ERB_width,
alternative"less") Wilcoxon rank sum test with
continuity correction data ERA_width and
ERB_width W 151243.5, p-value
0.9723 alternative hypothesis true location
shift is less than 0
148
Dangers with hypothesis fiddling
  • This should never be used to get twice as good
    P-value without some justification
  • Homework for Tuesday read BMJ 1994309248
    (linked at the web page) for discussion on when
    this is appropriate

149
Testing many samples at once
  • Sometimes, we have a table of groups, and we want
    to see if any of the groups are different in
    means
  • treatment1 treatment2 treatment3
  • plant1
  • plant2
  • plant3
  • etc
  • This is the typical one-way Anova test situation

150
boxplot (Sepal.WidthSpecies)
Is there a real difference?
Last time we checked this, we just tested setosa
vs versicolor
151
  • In R, oneway anova tests can be made easy by the
    oneway.test() command
  • The oneway.test() works by giving it a formula,
    in our case
  • Sepal.WidthSpecies.
  • This will basically give the function a table
    with three columns of values, one for each
    species.
  • It will then test if the means of any of these
    columns are different from the others

152
Test flowers set
gt dlt-read.table("flowers.txt", hT) gt
attach(d) oneway.test(Sepal.WidthSpecies) One-w
ay analysis of means (not assuming equal
variances) data Sepal.Width and Species F
45.012, num df 2.000, denom df 97.402,
p-value 1.433e-14
As expected, very, very significant. Is the
difference in length of sepals (not widths) even
more significant? Test it!
153
Non-parametric alternative to Anova
  • As with t-tests, Anova tests assume that each of
    the columns has a normal-like distribution. This
    is often not the case
  • The Kruskal-Wallis test is analogous to the
    wilcoxon test - a non-paramatric variant.
    Practically, it works exactly in the same way as
    the oneway.test().

154
2 data sets gt2 data sets
Normal distribution t.test() oneway.test()
Not normal distribution (non-parametric tests) wilcox.test() kruskal.test()
155
Larger exercise
  • In the data folder, there is file
    gene_lengths.txt, showing gene lengths from three
    different chromosomes.
  • Is there a significant difference in the center
    points (means or medians) of gene lengths for the
    different chromosomes? What is the method of
    choice? What two chromosomes have the most/least
    significant difference?

156
gt dlt-read.table("gene_lengths.txt", hT) gt
attach(d) gt names(d) 1 "id" "chr"
"gene_length first - is the gene length
distribution normal-shaped? gt hist(gene_length)
nope! gt nonparametric tests needed
157
many more than two groups, and
non-parametric test needed gtkruskal wallis
test gt kruskal.test(gene_lengthchr) Kruskal-Wal
lis rank sum test data gene_length by chr
Kruskal-Wallis chi-squared 131.7681, df 2,
p-value lt 2.2e-16 very significant difference!
158
are all chromosome pairs significant? test
case use wilcox.test on two chromosomes gt
wilcox.test(gene_lengthchr"chr2" ,
gene_lengthchr"chr3") Wilcoxon rank sum
test with continuity correction data
gene_lengthchr "chr2" and gene_lengthchr
"chr3" W 1068584, p-value
0.1822 alternative hypothesis true location
shift is not equal to 0 so, all pairs are not
significantly different!
159
chr1 chr2 chr3
chr1 1 2.2E-16 2.2E-16
chr2 1 0.1822
chr3 1
160
for us who are lazy - we can do all vs all tests
like this gt pairwise.wilcox.test(gene_length,
gchr ) Pairwise comparisons using Wilcoxon
rank sum test data gene_length and chr
chr1 chr2 chr2 lt2e-16 - chr3 lt2e-16 0.18 P
value adjustment method holm
161
Small homework I Reading about one-and two-sided
tests What is horribly wrong in this (published)
study?
162
Small homework (II)
  • What is the statistical problem with making these
    pairwise tests en masse, for instance making 1000
    t-tests and reporting all cases where Plt0.05?
  • Hint Go back to the slides about what the 5 P
    value cutoff really means

163
Homework and Examination
Big homework for examination goes online this
evening!
164
Examination
  • Three home works during the course
  • These will together give 50 of the final points
  • Main focus here is technical skill (how to do
    analysis x)
  • Theoretically optional. Very strict deadlines.
  • One large home work at the end of the course
  • Will give 50 of final points
  • Overlaps substantially with the three first, but
    is more high-level
  • Main focus here is to merge technical skill with
    biologically motivated analysis - real science
  • Do the maththe three smaller home works pays off
  • Note High level of activity needed from the
    start of the course. Will be demanding! The home
    works are the examination!

165
  • The points with homework
  • Encourage study during the course
  • It is easy to think that you understood
    something, but when meeting the real thing
  • Only realistic way of training
  • Evaluation

166
Rules for homework
  • Working together in 2-3 perons teams is allowed
    and encouraged, BUT
  • Each person hands in a separate report
  • which must be unique. This is usually not a
    problem if you write it yourself
  • If you work with others, write who you work with
    on the report ( I worked with X and Y in this
    homework!)
  • Cases of copying between students WILL be
    reported. The BEST thing that can happen is that
    you will get lower scores

167
(No Transcript)
168
(No Transcript)
169
Using material from others
  • Copying text/images from other sources
    including fellow students is considered
    cheating
  • You can quote sources, given that you cite them,
    but this has a limit

170
Questions?
171
Correlations
  • Sometimes, we do not want to show that two things
    are different, but that they are correlated.
  • For instance, weight and height for individuals
    are clearly different, but also highly correlated
  • This is a positive correlation - but we could
    also have a negative correlation
  • Correlations are dangerous things - it depends on
    how we ask questions .Easily the most misused and
    misunderstood part of statistics
  • Correlation ! Causation

172
Consider Why is it that
  • Truth Compared to the general population, there
    is a strong correlation between being an American
    football player and risk for testicle cancer
  • Interpretation American football leads to
    cancer!

173
Consider Why is it that
  • Truth There is an inverse correlation between
    yearly income and running speed
  • Interpretation carrying money makes you slow!

174
Time to explore
  • Open the height.txt file in R, and look at the
    different data types. Try to plot various columns
    against each other, like
  • plot (mean_height, Year)
  • What columns correlate? Can we break up the data
    even more?
  • Try it out!

175
gtplot (mean_weight, year ) gt plot (mean_height,
year ) gt plot (mean_height, mean_weight )
Hmmmobviously there are TWO datasets here! How
can we break this up?
176
Visualize the groups with color
plot (mean_weight, year ) lines
(mean_weightgender"F", yeargender"F",
col"red" ) lines (mean_weightgender"M",
yeargender"M", col"blue" ) a typical case
of OVER-plotting
177
Assessing correlations
  • First plot a standard x-y plot
  • Second, make a correlation test
  • cor(x, y) gives the Pearson correlation
    coefficient between z an y, which goes from
    -1(perfect negative correlation) to 1(prefect
    correlation). 0no correlation

178
Test it
  • gt cor(mean_weight, year)
  • 1 0.1688499
  • BAD correlation - close to zero
  • What happens if we break it up by gender as in
    the last plot? Test it!

179
  • gt cor (mean_weightgender"M",
    yeargender"M" )
  • 1 0.9527568
  • gt cor (mean_weightgender"F",
    yeargender"F" )
  • 1 0.931249
  • very, very correlated!

180
Important Pearson correlation gives a score,
not a significance value. Obviously, it is easier
to get a good correlation if you only have a few
values! Often, we want to also want to TEST if
the correlation is there. The null hypothesis is
then 0 correlation This is done with
the cor.test() command
181
gt cor.test (mean_weight, year ) Pearson's
product-moment correlation data mean_weight
and year t 0.5934, df 12, p-value
0.5639 alternative hypothesis true correlation
is not equal to 0 95 percent confidence
interval -0.3973253 0.6419208 sample
estimates cor 0.1688499
182
gt cor.test (mean_weightgender"F",
yeargender"F" ) Pearson's product-moment
correlation data mean_weightgender "F"
and yeargender "F" t 5.7147, df 5,
p
Write a Comment
User Comments (0)
About PowerShow.com