Introduction to R - Lecture 5: More loops

Slides: 55

Provided by: andrewj

Learn more at: https://www.biostat.jhsph.edu

1
Introduction to R - Lecture 5: More loops

Andrew Jaffe
10/4/2010

2
Overview

Review: For Loop
Lists
Aside: Patterns
Application

3
Review: For Loop

The syntax is: for(var in seq) { code }
The seq determines what values var will take in
the loop
The loop is performed length(seq) times
On the nth iteration of the loop, var takes the
value seq[n]
var is a completely new variable and is not
directly related to any other variable
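
As a small illustration of these points, here is a minimal sketch. The vector vals and the result sq are made-up names and values, not part of the lecture data. It shows that the loop runs length(seq) times and that i takes the value seq[n] on the nth pass.

```r
# Hypothetical example values; not part of the lecture data
vals = c(10, 20, 30)
sq = rep(0, length(vals))      # pre-allocate the output

for(i in 1:length(vals)) {     # seq is 1:3, so the loop runs 3 times
  sq[i] = vals[i]^2            # on the nth pass, i equals n
}
sq                             # 100 400 900
```
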

4
Review: For Loop

Setting up your loop requires determining the
correct seq to loop over, which is usually easy
The real challenge of looping is relating the
values of seq to the dimensions/indices of your
data

5
Review: For Loop

From last lecture: we're relating seq to the
columns of the data
var is only indirectly related to the data: it
links to the column indices, but it only takes
the values 1-12

Index = 4:15
mean_wt <- rep(0, length(Index))
for(i in 1:length(Index)) {
  ind = Index[i] # column index
  mean_wt[i] = mean(dog_dat[,ind])
}

6
Overview

Review: For Loop
Lists
Aside: Patterns
Application

7
Lists

"An R list is an object consisting of an ordered
collection of objects known as its components."
"Components are always numbered and may always be
referred to as such": double brackets [[ ]] can
subset lists

Source: CRAN, An Introduction to R

8
Lists
> L = list() # empty list
> L[[1]] = 1:4
> L[[2]] = 2:7
> L[[3]] = c("a","b","c")
> L[[4]] = matrix(rnorm(4), nrow = 2)
> L
[[1]]
[1] 1 2 3 4

[[2]]
[1] 2 3 4 5 6 7

[[3]]
[1] "a" "b" "c"

[[4]]
            [,1]       [,2]
[1,] -1.43944849 -0.4801696
[2,]  0.09923108  1.0783053

9
Lists
> names(L) = c("seq1","seq2","letters","mat")
> L
$seq1
[1] 1 2 3 4

$seq2
[1] 2 3 4 5 6 7

$letters
[1] "a" "b" "c"

$mat
          [,1]      [,2]
[1,]  1.824487 0.3431034
[2,] -0.533006 0.9406285

10
Lists
> L[[1]]
[1] 1 2 3 4
> str(L)
List of 4
 $ seq1   : int [1:4] 1 2 3 4
 $ seq2   : int [1:6] 2 3 4 5 6 7
 $ letters: chr [1:3] "a" "b" "c"
 $ mat    : num [1:2, 1:2] 1.824 -0.533 0.343 0.941

11
Lists

Why know lists?
Can store data of different lengths and types
Some functions return lists
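
Both points can be seen in a short sketch. The list info below is a made-up example, and strsplit() is used only as one base R function that happens to return a list.

```r
# A list can hold components of different lengths and types
info = list(id = 1:3, label = "dogs", flag = c(TRUE, FALSE))
length(info)        # 3 components
info[["label"]]     # double brackets return the component itself
info["label"]       # single brackets return a list of length 1

# strsplit() is one base R function that returns a list
parts = strsplit("month_1", "_")
class(parts)        # "list"
parts[[1]]          # "month" "1"
```
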

12
Lists

Load back in the lecture 4 data
We still have one problem to solve - the averages
of weight, length, and food for each dog type at
each visit

13
Lists

First we can create a list containing each group
we care about

Indexes = list()
Indexes[[1]] = 4:15 # weight
Indexes[[2]] = 16:27 # length
Indexes[[3]] = 28:39 # food
names(Indexes) = c("weight", "length", "food")

14
Lists
> Indexes
$weight
 [1]  4  5  6  7  8  9 10 11 12 13 14 15

$length
 [1] 16 17 18 19 20 21 22 23 24 25 26 27

$food
 [1] 28 29 30 31 32 33 34 35 36 37 38 39

15
Lists

Next, we can create an output list for our
results, and recreate the unique dog list for our
loop

out <- list()
dogs = unique(dog_dat$dog_type)

16
Lists

We want to loop over the different covariates
(wt, len, food) and within each, the different
dog types
For looping over the groups, either works

> seq(along = Indexes)
[1] 1 2 3
> 1:length(Indexes)
[1] 1 2 3

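
Both forms give the same sequence for this list. One difference that may be worth knowing, sketched here with a made-up empty list called empty, appears when the object has length 0: 1:length() counts down from 1 to 0, while seq(along = ) returns an empty sequence, so the loop body is skipped.

```r
empty = list()            # hypothetical empty list, for illustration only

1:length(empty)           # 1 0  -> a loop over this would run twice
seq(along = empty)        # integer(0) -> a loop over this would not run

n_runs = 0
for(i in seq(along = empty)) n_runs = n_runs + 1
n_runs                    # still 0
```
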
17
Lists
for(i in seq(along = Indexes)) { # 1:3
  # take the i'th index from the list
  Index = Indexes[[i]]
  # for that variable, create a new matrix
  tmp = matrix(nrow = length(dogs), ncol = length(Index))
  ...

18
Lists

We can then fill in that temporary matrix with an
inner 'for' loop
Note that this is the exact same loop as last
week (note the j's)

# Index comes from the outer loop
for(j in 1:length(dogs)) {
  hold = dog_dat[dog_dat$dog_type == dogs[j], Index]
  tmp[j,] = colMeans(hold)
}
rownames(tmp) = dogs
colnames(tmp) = paste("month",1:12,sep="_")

19
Lists

Lastly, we save that tmp matrix in our output
list

out[[i]] = tmp

20
for(i in seq(along = Indexes)) { # groups
  Index = Indexes[[i]]
  tmp = matrix(nrow = length(dogs), ncol = length(Index))
  for(j in 1:length(dogs)) { # dogs
    hold = dog_dat[dog_dat$dog_type == dogs[j], Index]
    tmp[j,] = colMeans(hold)
  }
  rownames(tmp) = dogs
  colnames(tmp) = paste("month",1:12,sep="_")
  out[[i]] = tmp
}
names(out) <- c("weight","length","food")

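
Since dog_dat is not included here, the same nested loop can be tried on a small made-up data frame. Everything in toy (its column names, its values, and the column positions in Indexes) is an assumption for illustration only.

```r
# Made-up toy data standing in for dog_dat (2 dog types, 2 months)
toy = data.frame(dog_type = c("lab", "lab", "poodle", "poodle"),
                 wt_1  = c(50, 52, 20, 22),
                 wt_2  = c(51, 53, 21, 23),
                 len_1 = c(30, 32, 15, 17),
                 len_2 = c(31, 33, 16, 18),
                 stringsAsFactors = FALSE)

Indexes = list(weight = 2:3, length = 4:5)   # column positions per group
dogs = unique(toy$dog_type)
out = list()

for(i in seq(along = Indexes)) {             # outer loop: groups
  Index = Indexes[[i]]
  tmp = matrix(nrow = length(dogs), ncol = length(Index))
  for(j in 1:length(dogs)) {                 # inner loop: dog types
    hold = toy[toy$dog_type == dogs[j], Index]
    tmp[j,] = colMeans(hold)
  }
  rownames(tmp) = dogs
  colnames(tmp) = paste("month", 1:length(Index), sep = "_")
  out[[i]] = tmp
}
names(out) = names(Indexes)
out$weight    # lab: 51 52, poodle: 21 22
```
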
21
> out
$weight
           month_1  month_2  month_3  month_4  month_5  month_6  month_7
lab       49.81840 48.69200 49.03360 50.26560 50.17600 49.67280 48.41600
poodle    49.40090 48.27297 48.61892 49.84414 49.76126 49.25856 47.99820
husky     49.26372 48.13097 48.48142 49.70088 49.61858 49.11327 47.86195
retriever 50.19474 49.06466 49.40602 50.62632 50.54361 50.04135 48.79248
           month_8  month_9 month_10 month_11 month_12
lab       46.54640 44.68640 45.15040 44.30640 45.88240
poodle    46.12613 44.26577 44.73243 43.89009 45.46306
husky     45.98761 44.12832 44.59469 43.75221 45.31858
retriever 46.91278 45.05263 45.51654 44.68496 46.24586

$length
           month_1  month_2  month_3  month_4  month_5  month_6  month_7
lab       19.91840 20.16800 20.28720 20.49600 20.57840 20.86400 20.96800
poodle    20.63964 20.88198 21.00090 21.20991 21.29189 21.58108 21.68198
husky     20.29115 20.54159 20.65575 20.86195 20.94867 21.23805 21.34071
retriever 20.47068 20.71955 20.83233 21.04135 21.12556 21.41729 21.51880
           month_8  month_9 month_10 month_11 month_12
lab       21.10400 21.20880 21.40720 21.57440 21.87440
poodle    21.82072 21.92432 22.12342 22.29009 22.58919
husky     21.47699 21.58142 21.77876 21.94779 22.24779
retriever 21.64962 21.75414 21.95263 22.12406 22.42406

$food
           month_1  month_2  month_3  month_4  month_5  month_6  month_7
lab       30.04000 29.77200 28.77680 28.20880 29.52240 30.24960 30.90160
poodle    30.03063 29.76306 28.77117 28.20631 29.51892 30.23874 30.89910
husky     30.12301 29.85221 28.85841 28.29646 29.60973 30.33363 30.98584
retriever 29.89248 29.62556 28.63008 28.06617 29.37744 30.10075 30.75564
           month_8  month_9 month_10 month_11 month_12
lab       29.20880 30.03200 29.89120 29.54240 30.89520
poodle    29.20631 30.02613 29.88739 29.53243 30.89550
husky     29.29646 30.11770 29.97345 29.62389 30.98053
retriever 29.06617 29.88722 29.74887 29.39248 30.75338

22
Overview

Review: For Loop
Lists
Aside: Patterns
Application

23
Aside

This step is potentially dangerous, because the
column positions are typed in by hand:
Indexes[[1]] = 4:15 # weight
Indexes[[2]] = 16:27 # length
Indexes[[3]] = 28:39 # food
Is there a better way? YES! Each group shares a
common term in its column names:
wt, len, food

24
Aside

grep(pattern, x) returns the positions of the
elements of vector x that match "pattern"

> grep("wt", names(dog_dat))
 [1]  4  5  6  7  8  9 10 11 12 13 14 15
> grep("len", names(dog_dat))
 [1] 16 17 18 19 20 21 22 23 24 25 26 27
> grep("food", names(dog_dat))
 [1] 28 29 30 31 32 33 34 35 36 37 38 39

25
Aside
> Indexes = list()
> Indexes[[1]] = grep("wt", names(dog_dat))
> Indexes[[2]] = grep("len", names(dog_dat))
> Indexes[[3]] = grep("food", names(dog_dat))
> Indexes
[[1]]
 [1]  4  5  6  7  8  9 10 11 12 13 14 15

[[2]]
 [1] 16 17 18 19 20 21 22 23 24 25 26 27

[[3]]
 [1] 28 29 30 31 32 33 34 35 36 37 38 39

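
The same idea can be tried without dog_dat. In the sketch below the column names in nms are made up to mimic the layout described above, and the three grep() calls are wrapped in a loop over a hypothetical patterns vector.

```r
# Made-up column names mimicking the layout of dog_dat
nms = c("id", "dog_type", "age",
        paste("wt", 1:3, sep = "_"),
        paste("len", 1:3, sep = "_"),
        paste("food", 1:3, sep = "_"))

patterns = c("wt", "len", "food")
Indexes = list()
for(i in 1:length(patterns)) {
  Indexes[[i]] = grep(patterns[i], nms)   # positions of matching names
}
names(Indexes) = c("weight", "length", "food")
Indexes$weight    # 4 5 6
```

Note that grep() matches the pattern anywhere in a name, so a pattern should be specific enough not to hit unrelated columns.
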
26
Aside

grep can be a lot more powerful when combined
with 'regular expressions', but we're not going
to get into that

27
Aside

Opposite of paste: strsplit(x, split) splits
term 'x' on the 'split' character or pattern
Returns a list

> x = paste("month",1:12,sep="_")
> head(strsplit(x,"_"),3)
[[1]]
[1] "month" "1"

[[2]]
[1] "month" "2"

[[3]]
[1] "month" "3"

28
Aside

If you want one element (in this case, the
number), it is easiest to just use a 'for' loop
If you split each element separately, the output
list only has 1 element: [[1]]
You then need to pick the slot you want from
that element using the single bracket

29
Aside
x = paste("month",1:12,sep="_")
num = rep(0,length(x))
for(i in 1:length(x)) {
  num[i] = strsplit(x[i],"_")[[1]][2]
}

> i = 1
> strsplit(x[i],"_") # list
[[1]]
[1] "month" "1"

> strsplit(x[i],"_")[[1]] # vector
[1] "month" "1"
> strsplit(x[i],"_")[[1]][2] # element
[1] "1"

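
One detail to be aware of: strsplit() returns character strings, so storing them in num turns num into a character vector. The sketch below is a variation on the loop, not the lecture's own code. It converts each piece with as.numeric() and also shows a loop-free alternative using sapply(), which this lecture does not cover.

```r
x = paste("month", 1:12, sep = "_")

# Same loop, converting each piece to a number as it is stored
num = rep(0, length(x))
for(i in 1:length(x)) {
  num[i] = as.numeric(strsplit(x[i], "_")[[1]][2])
}
num

# Alternative without a loop: take the 2nd piece of every list element
num2 = as.numeric(sapply(strsplit(x, "_"), function(p) p[2]))
```
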
30
Overview

Review: For Loop
Lists
Aside: Patterns
Application

31
Applied Example

Load in "lec5_data.rda" from the course website
These are the people from "lec2_data.rda" that
did not have a dog at baseline
During monthly follow-up, some of these people
reported borrowing a dog over the preceding month

32
Applied Example

dog_0: baseline dog ownership; all of these
people should have "no"
dog_1 - dog_12: did you borrow a dog over the
past month?

33
Applied Example

Determine person-time at risk for dog borrowing
Create a "survival" dataset from this data with
columns ID, start, end
Note that there is missing data

34
Applied Example

We want to convert each person's wide data into
two numbers: start and end
Because of missing data, some people might have
more than 1 row: people aren't at risk for dog
borrowing if they did not report (i.e., are missing)

35
Applied Example

Take person 1

> dat[1,]
  id age sex height weight dog_0 dog_1 dog_2
1  1  40   F   63.5  134.5    no    no   yes
  dog_3 dog_4 dog_5 dog_6 dog_7 dog_8 dog_9
1   yes    no    no   yes   yes    no   yes
  dog_10 dog_11 dog_12
1   <NA>     no     no

36
Applied Example

Person 1 in the new dataset should be
ID start end
1 0 9
1 11 12

37
Applied Example

Basic premise: write a for-loop that passes over
each person and determines their non-missing
follow-up time
Caveat: how many rows should our output matrix
have? We don't know in advance
Perfect opportunity for using rbind()

38
Applied Example

Create a matrix with 0 rows and 3 columns
Within the body of the loop, use rbind to
append new rows (this is slow though)

> out = matrix(nr = 0, nc = 3)
> dim(out)
[1] 0 3
> p1 = c(1,0,9)
> out = rbind(out, p1)
> out
   [,1] [,2] [,3]
p1    1    0    9

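
To see the growing-matrix pattern inside a loop, here is a minimal sketch. The values placed in new_row are made up and simply stand in for one person's ID, start and end.

```r
out = matrix(nrow = 0, ncol = 3)        # start with 0 rows
for(i in 1:4) {
  new_row = c(i, 0, i * 2)              # made-up ID, start, end values
  out = rbind(out, new_row)             # out gains one row per pass
}
colnames(out) = c("ID", "start", "end")
rownames(out) = NULL                    # drop the "new_row" labels
dim(out)                                # 4 3
```
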
39
Applied Example
out = matrix(nrow = 0, ncol = 3)
cols = grep("dog", names(dat))
for(i in 1:nrow(dat)) {
  hold = as.numeric(dat[i,cols])
  ...

40
Applied Example

Here, the follow-up results are factors, which
have numerical values

> dat[i,cols]
  dog_0 dog_1 dog_2 dog_3 dog_4 dog_5 dog_6
1    no    no   yes   yes    no    no   yes
  dog_7 dog_8 dog_9 dog_10 dog_11 dog_12
1   yes    no   yes   <NA>     no     no
> as.numeric(dat[i,cols])
 [1]  1  1  2  2  1  1  2  2  1  2 NA  1  1

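
The mapping from "no"/"yes" to 1/2 comes from the order of the factor levels. A minimal sketch with a made-up factor called ans (not taken from the course data):

```r
# Made-up answers for one person, stored as a factor
ans = factor(c("no", "yes", NA, "no"), levels = c("no", "yes"))

levels(ans)        # "no" is level 1, "yes" is level 2
as.numeric(ans)    # 1 2 NA 1
is.na(ans)         # FALSE FALSE TRUE FALSE
```

Missing answers stay NA after the conversion, which is what the next step relies on.
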
41
Applied Example

Now a cool little trick: rle(), run length
encoding
Compute the lengths and values of runs of equal
values in a vector
We're going to combine this with is.na()

42
Applied Example

This says that there are 10 FALSE in a row, then
1 TRUE, then 2 FALSE
We need to get this in a better format

> x = rle(is.na(hold))
> x
Run Length Encoding
  lengths: int [1:3] 10 1 2
  values : logi [1:3] FALSE TRUE FALSE

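
This step can be reproduced without the data file by typing in person 1's recoded answers from the earlier slide. The sketch below also uses inverse.rle(), a base R function not mentioned in the lecture, to confirm that the encoding loses no information.

```r
# Person 1's recoded answers (13 visits, 1 missing)
hold = c(1, 1, 2, 2, 1, 1, 2, 2, 1, 2, NA, 1, 1)

x = rle(is.na(hold))
x$lengths          # 10 1 2
x$values           # FALSE TRUE FALSE
inverse.rle(x)     # rebuilds is.na(hold)
```
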
43
Applied Example
> x = data.frame(cbind(x$values, x$length))
> names(x) <- c("missing", "length")
> x
  missing length
1       0     10
2       1      1
3       0      2

44
Applied Example

cumsum() returns the cumulative sum of a vector

> x$end <- cumsum(x$length)
> x$start <- x$end - x$length + 1
> x
  missing length end start
1       0     10  10     1
2       1      1  11    11
3       0      2  13    12

45
Applied Example

Note that we actually want all of the values to
be one less, since our time starts at 0

> x$end <- cumsum(x$length) - 1
> x$start <- x$end - x$length + 1
> x
  missing length end start
1       0     10   9     0
2       1      1  10    10
3       0      2  12    11

46
Applied Example

Quick rearrangement

> x <- x[,c(1,2,4,3)]
> x
  missing length start end
1       0     10     0   9
2       1      1    10  10
3       0      2    11  12

47
Applied Example

We want the last two columns of the non-missing
visits

> tmp = x[which(x$missing == 0),3:4]
> tmp
  start end
1     0   9
3    11  12

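
The arithmetic of the last few slides can be checked by hand with a short sketch. The vectors len and missing below are typed in from person 1's run lengths rather than computed from the data.

```r
len = c(10, 1, 2)               # run lengths for person 1
missing = c(0, 1, 0)            # 1 = the run is missing data

end = cumsum(len) - 1           # 9 10 12 (time starts at 0)
start = end - len + 1           # 0 10 11

x = data.frame(missing, length = len, start, end)
x[x$missing == 0, c("start", "end")]   # keeps rows 1 and 3
```
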
48
Applied Example

We then want to add a column of the individual ID
to the front

id = dat[i,1]
tmp = cbind(rep(id,nrow(tmp)), tmp)
names(tmp)[1] = "ID"
> tmp
  ID start end
1  1     0   9
3  1    11  12

49
Applied Example

Lastly, bind the tmp matrix to the growing out
matrix
This finishes off our loop body

out = rbind(out,tmp)

50
for(i in 1:nrow(dat)) {
  hold = as.numeric(dat[i,cols])
  x = rle(is.na(hold))
  x = data.frame(cbind(x$values, x$length))
  names(x) <- c("missing", "length")
  x$end <- cumsum(x$length) - 1
  x$start <- x$end - x$length + 1
  x <- x[,c(1,2,4,3)]
  tmp = x[which(x$missing == 0),3:4]
  id = dat[i,1]
  tmp = cbind(rep(id,nrow(tmp)), tmp)
  names(tmp)[1] = "ID"
  out = rbind(out,tmp)
}
rownames(out) = 1:nrow(out) # cleaning

51
> head(out,10)
   ID start end
1   1     0   9
2   1    11  12
3   2     0   5
4   2     7  12
5   3     0   2
6   3     5  12
7   4     0   3
8   4     6  12
9   5     0   0
10  5     3   8
> dim(out)
[1] 1414    3

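
For readers without lec5_data.rda, the whole procedure can be tried on a made-up two-person data set. This sketch is a variation, not the lecture's exact code: it applies is.na() directly to the answers instead of converting them with as.numeric() first, and it collects one data frame per person in a list (pieces, a hypothetical name) and binds them once at the end instead of growing out inside the loop.

```r
# Made-up toy data: 2 people, baseline plus 4 follow-up visits
lv = c("no", "yes")
dat = data.frame(id = c(1, 2),
                 dog_0 = factor(c("no", "no"),  levels = lv),
                 dog_1 = factor(c("yes", NA),   levels = lv),
                 dog_2 = factor(c(NA, "no"),    levels = lv),
                 dog_3 = factor(c("no", "yes"), levels = lv),
                 dog_4 = factor(c("no", "no"),  levels = lv))
cols = grep("dog", names(dat))

pieces = list()                              # one data frame per person
for(i in 1:nrow(dat)) {
  miss = is.na(unlist(dat[i, cols]))         # TRUE where the visit is missing
  r = rle(unname(miss))
  end = cumsum(r$lengths) - 1                # time starts at 0
  start = end - r$lengths + 1
  keep = which(!r$values)                    # runs of non-missing visits
  pieces[[i]] = data.frame(ID = rep(dat$id[i], length(keep)),
                           start = start[keep], end = end[keep])
}
out = do.call(rbind, pieces)
out
```

Person 1 here contributes the intervals 0-1 and 3-4, and person 2 contributes 0-0 and 2-4.
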
52
Applied Example

One last adjustment needed, since we asked about
borrowing a dog in the previous month
The non-zero starts must be reduced by 1, since
they are currently visit indices, not time at risk

Before:
  ID start end
1  1     0   9
2  1    11  12
After:
  ID start end
1  1     0   9
2  1    10  12

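
The code for this adjustment is not shown in this part of the transcript. One possible way to do it, sketched on person 1's two rows only (the name nz is made up), is to subtract 1 from every start that is not 0:

```r
# The two rows for person 1, typed in from the output above
out = data.frame(ID = c(1, 1), start = c(0, 11), end = c(9, 12))

nz = which(out$start != 0)             # rows whose start is not 0
out$start[nz] = out$start[nz] - 1      # shift those starts back by 1
out
```
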
53
Applied Example