Title: Introduction to R - Lecture 5: More loops
1Introduction to R - Lecture 5: More loops
2Overview
- Review For Loop
- Lists
- Aside: Patterns
- Application
3Review For Loop
- The syntax is: for(var in seq) { code }
- The seq determines what values var will take in the loop
- The loop is performed length(seq) times
- On the nth iteration of the loop, var takes the value seq[n]
- var is a completely new variable and is not directly related to any other variable
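A minimal sketch of these points, using a made-up vector. The names `vals`, `visited`, and `n` are illustrative and do not come from the lecture data.

```r
# Hypothetical sequence to loop over (not from the lecture data)
vals <- c(10, 20, 30)
visited <- rep(0, length(vals))
n <- 0
for (v in vals) {
  n <- n + 1        # iteration counter
  visited[n] <- v   # on the nth pass, v equals vals[n]
}
print(n)        # number of iterations equals length(vals)
print(visited)  # same values as vals, in the same order
```

The loop variable `v` only ever holds the values in `vals`; the counter `n` is what ties each pass back to a position.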
4Review For Loop
- Setting up your loop requires determining the correct seq to loop over (usually easy)
- The real challenge of looping is relating the values of seq to the dimensions/indices of your data
5Review For Loop
- From last lecture: we're relating seq to the columns of the data
- var is indirectly related to the data, as it links/relates to the column indices, but it only has the values 1-12
Index = 4:15
mean_wt <- rep(0, length(Index))
for(i in 1:length(Index)) {
  ind = Index[i]   # column index
  mean_wt[i] = mean(dog_dat[,ind])
}
6Overview
- Review For Loop
- Lists
- Aside: Patterns
- Application
7Lists
- "An R list is an object consisting of an ordered collection of objects known as its components."
- "Components are always numbered and may always be referred to as such": [[double brackets]] can subset lists
CRAN. Intro to R
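A short sketch of the difference between double and single brackets on a list. The list `L` here is made up for illustration.

```r
# Illustrative list (contents are made up for this sketch)
L <- list(1:4, c("a", "b", "c"))
one_bracket <- L[1]     # a list of length 1 that contains the vector
two_bracket <- L[[1]]   # the component itself: an integer vector
print(class(one_bracket))
print(class(two_bracket))
print(sum(L[[1]]))      # arithmetic works on the extracted component
```

Single brackets keep the list wrapper, so `sum(L[1])` would fail where `sum(L[[1]])` works.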
8Lists
> L = list()   # empty list
> L[[1]] = 1:4
> L[[2]] = 2:7
> L[[3]] = c("a","b","c")
> L[[4]] = matrix(rnorm(4), nrow = 2)
> L
[[1]]
[1] 1 2 3 4

[[2]]
[1] 2 3 4 5 6 7

[[3]]
[1] "a" "b" "c"

[[4]]
            [,1]       [,2]
[1,] -1.43944849 -0.4801696
[2,]  0.09923108  1.0783053
9Lists
> names(L) = c("seq1","seq2","letters","mat")
> L
$seq1
[1] 1 2 3 4

$seq2
[1] 2 3 4 5 6 7

$letters
[1] "a" "b" "c"

$mat
          [,1]      [,2]
[1,]  1.824487 0.3431034
[2,] -0.533006 0.9406285
10Lists
> L[[1]]
[1] 1 2 3 4
> str(L)
List of 4
 $ seq1   : int [1:4] 1 2 3 4
 $ seq2   : int [1:6] 2 3 4 5 6 7
 $ letters: chr [1:3] "a" "b" "c"
 $ mat    : num [1:2, 1:2] 1.824 -0.533 0.343 0.941
11Lists
- Why know lists?
- Can store data of different lengths and types
- Some functions return lists
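A brief sketch of both reasons, with made-up values. strsplit() and rle() are base R functions that return lists and both are used later in this lecture; the object names are illustrative.

```r
# Components of different lengths and types can live in one list
mixed <- list(id = 1:3, label = "dogs", flags = c(TRUE, FALSE))
print(lengths(mixed))

# strsplit() returns a plain list; rle() returns a list with class "rle"
parts <- strsplit("month_1", "_")
runs <- rle(c(1, 1, 2))
print(is.list(parts))
print(names(unclass(runs)))
```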
12Lists
- Load back in the lecture 4 data
- We still have one problem to solve - the averages
of weight, length, and food for each dog type at
each visit
13Lists
- First we can create a list containing each group
we care about
Indexes = list()
Indexes[[1]] = 4:15    # weight
Indexes[[2]] = 16:27   # length
Indexes[[3]] = 28:39   # food
names(Indexes) = c("weight", "length", "food")
14Lists
> Indexes
$weight
 [1]  4  5  6  7  8  9 10 11 12 13 14 15

$length
 [1] 16 17 18 19 20 21 22 23 24 25 26 27

$food
 [1] 28 29 30 31 32 33 34 35 36 37 38 39
15Lists
- Next, we can create an output list for our
results, and recreate the unique dog list for our
loop
out <- list()
dogs = unique(dog_dat$dog_type)
16Lists
- We want to loop over the different covariates (wt, len, food) and, within each, the different dog types
- For looping over the groups, either works:
> seq(along = Indexes)
[1] 1 2 3
> 1:length(Indexes)
[1] 1 2 3
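The two forms agree whenever the list is non-empty. As a hedged side note, they differ for an empty list: `1:length()` counts down from 1 to 0, while `seq(along = ...)` (equivalent to `seq_along()`) gives an empty sequence. The lists below are made up for illustration.

```r
groups <- list(a = 4:15, b = 16:27, c = 28:39)  # hypothetical index list
empty <- list()

print(seq(along = groups))   # same result as seq_along(groups)
print(1:length(groups))

# Edge case: an empty list
print(length(seq(along = empty)))  # 0, so a loop body would never run
print(1:length(empty))             # 1:0 counts down, so a loop would run twice
```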
17Lists
for(i in seq(along = Indexes)) {   # 1:3
  # take the i'th index from the list
  Index = Indexes[[i]]
  # for that variable, create a new matrix
  tmp = matrix(nrow = length(dogs), ncol = length(Index))
  ...
18Lists
- We can then fill in that temporary matrix with an inner 'for' loop
- Note that this is the exact same loop as last week (note the j's)
  # Index from the outer loop
  for(j in 1:length(dogs)) {
    hold = dog_dat[dog_dat$dog_type == dogs[j], Index]
    tmp[j,] = colMeans(hold)
  }
  rownames(tmp) = dogs
  colnames(tmp) = paste("month", 1:12, sep = "_")
19Lists
- Lastly, we save that tmp matrix in our output
list
  out[[i]] = tmp
}
20
for(i in seq(along = Indexes)) {   # groups
  Index = Indexes[[i]]
  tmp = matrix(nrow = length(dogs), ncol = length(Index))
  for(j in 1:length(dogs)) {   # dogs
    hold = dog_dat[dog_dat$dog_type == dogs[j], Index]
    tmp[j,] = colMeans(hold)
  }
  rownames(tmp) = dogs
  colnames(tmp) = paste("month", 1:12, sep = "_")
  out[[i]] = tmp
}
names(out) <- c("weight","length","food")
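To try the nested loop without the lecture 4 file, here is a sketch on simulated data. The data frame below is a made-up stand-in for dog_dat: its column layout (id, dog_type, then 12 columns each for wt, len, food) and its values are assumptions, so the index ranges differ from the 4:15, 16:27, 28:39 used above.

```r
# Simulated stand-in for dog_dat; layout and values are assumptions
set.seed(1)
n <- 40
dog_dat <- data.frame(id = 1:n,
                      dog_type = sample(c("lab", "poodle"), n, replace = TRUE),
                      stringsAsFactors = FALSE)
for (v in c("wt", "len", "food")) {
  for (m in 1:12) dog_dat[[paste(v, m, sep = "_")]] <- rnorm(n, mean = 30)
}

# In this toy layout the variables sit in columns 3:14, 15:26, 27:38
Indexes <- list(weight = 3:14, length = 15:26, food = 27:38)
dogs <- unique(dog_dat$dog_type)
out <- list()

for (i in seq(along = Indexes)) {          # groups
  Index <- Indexes[[i]]
  tmp <- matrix(nrow = length(dogs), ncol = length(Index))
  for (j in 1:length(dogs)) {              # dog types
    hold <- dog_dat[dog_dat$dog_type == dogs[j], Index]
    tmp[j, ] <- colMeans(hold)
  }
  rownames(tmp) <- dogs
  colnames(tmp) <- paste("month", 1:12, sep = "_")
  out[[i]] <- tmp
}
names(out) <- names(Indexes)
print(dim(out$weight))   # one row per dog type, one column per month
```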
21
> out
$weight
           month_1  month_2  month_3  month_4  month_5  month_6  month_7
lab       49.81840 48.69200 49.03360 50.26560 50.17600 49.67280 48.41600
poodle    49.40090 48.27297 48.61892 49.84414 49.76126 49.25856 47.99820
husky     49.26372 48.13097 48.48142 49.70088 49.61858 49.11327 47.86195
retriever 50.19474 49.06466 49.40602 50.62632 50.54361 50.04135 48.79248
           month_8  month_9 month_10 month_11 month_12
lab       46.54640 44.68640 45.15040 44.30640 45.88240
poodle    46.12613 44.26577 44.73243 43.89009 45.46306
husky     45.98761 44.12832 44.59469 43.75221 45.31858
retriever 46.91278 45.05263 45.51654 44.68496 46.24586

$length
           month_1  month_2  month_3  month_4  month_5  month_6  month_7
lab       19.91840 20.16800 20.28720 20.49600 20.57840 20.86400 20.96800
poodle    20.63964 20.88198 21.00090 21.20991 21.29189 21.58108 21.68198
husky     20.29115 20.54159 20.65575 20.86195 20.94867 21.23805 21.34071
retriever 20.47068 20.71955 20.83233 21.04135 21.12556 21.41729 21.51880
           month_8  month_9 month_10 month_11 month_12
lab       21.10400 21.20880 21.40720 21.57440 21.87440
poodle    21.82072 21.92432 22.12342 22.29009 22.58919
husky     21.47699 21.58142 21.77876 21.94779 22.24779
retriever 21.64962 21.75414 21.95263 22.12406 22.42406

$food
           month_1  month_2  month_3  month_4  month_5  month_6  month_7
lab       30.04000 29.77200 28.77680 28.20880 29.52240 30.24960 30.90160
poodle    30.03063 29.76306 28.77117 28.20631 29.51892 30.23874 30.89910
husky     30.12301 29.85221 28.85841 28.29646 29.60973 30.33363 30.98584
retriever 29.89248 29.62556 28.63008 28.06617 29.37744 30.10075 30.75564
           month_8  month_9 month_10 month_11 month_12
lab       29.20880 30.03200 29.89120 29.54240 30.89520
poodle    29.20631 30.02613 29.88739 29.53243 30.89550
husky     29.29646 30.11770 29.97345 29.62389 30.98053
retriever 29.06617 29.88722 29.74887 29.39248 30.75338
22Overview
- Review For Loop
- Lists
- Aside: Patterns
- Application
23Aside
- This step is potentially dangerous:
- Indexes[[1]] = 4:15    # weight
- Indexes[[2]] = 16:27   # length
- Indexes[[3]] = 28:39   # food
- Is there a better way? YES! Each group shares a common term in the name: wt, len, food
24Aside
- grep(pattern, x) returns the indices of the elements of vector x that match "pattern"
> grep("wt", names(dog_dat))
 [1]  4  5  6  7  8  9 10 11 12 13 14 15
> grep("len", names(dog_dat))
 [1] 16 17 18 19 20 21 22 23 24 25 26 27
> grep("food", names(dog_dat))
 [1] 28 29 30 31 32 33 34 35 36 37 38 39
25Aside
> Indexes = list()
> Indexes[[1]] = grep("wt", names(dog_dat))
> Indexes[[2]] = grep("len", names(dog_dat))
> Indexes[[3]] = grep("food", names(dog_dat))
> Indexes
[[1]]
 [1]  4  5  6  7  8  9 10 11 12 13 14 15

[[2]]
 [1] 16 17 18 19 20 21 22 23 24 25 26 27

[[3]]
 [1] 28 29 30 31 32 33 34 35 36 37 38 39
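The three grep() calls can also be written as a loop over the shared terms. This is a sketch on made-up column names: `col_names` and `patterns` are illustrative, the first three column names are assumed, and naming the list components by pattern is a choice made here rather than part of the lecture.

```r
# Hypothetical column names laid out like dog_dat (an assumption)
col_names <- c("id", "dog_type", "age",
               paste("wt", 1:12, sep = "_"),
               paste("len", 1:12, sep = "_"),
               paste("food", 1:12, sep = "_"))
patterns <- c("wt", "len", "food")

Indexes <- list()
for (p in patterns) {
  Indexes[[p]] <- grep(p, col_names)  # positions of the names containing p
}
print(Indexes$wt)
```

If columns are added or reordered, the indices follow the names automatically, which is the point of avoiding hard-coded ranges.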
26Aside
- grep can be a lot more powerful when combined with regular expressions, but we're not going to get into that
27Aside
- Opposite of paste: strsplit(x, split) splits term 'x' on the 'split' character or pattern
- Returns a list
> x = paste("month", 1:12, sep = "_")
> head(strsplit(x, "_"), 3)
[[1]]
[1] "month" "1"

[[2]]
[1] "month" "2"

[[3]]
[1] "month" "3"
28Aside
- If you want one element (in this case, the number), easiest to just use a 'for' loop
- If you split each element separately, the output list only has 1 element: [[1]]
- You then need to figure out which slot you want using the single bracket
29Aside
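x = paste("month", 1:12, sep = "_")
num = rep(0, length(x))
# note: assigning strings below turns num into a character vector
for(i in 1:length(x)) num[i] = strsplit(x[i], "_")[[1]][2]
> i = 1
> strsplit(x[i], "_")           # list
[[1]]
[1] "month" "1"

> strsplit(x[i], "_")[[1]]      # vector
[1] "month" "1"
> strsplit(x[i], "_")[[1]][2]   # element
[1] "1"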
30Overview
- Review For Loop
- Lists
- Aside: Patterns
- Application
31Applied Example
- Load in "lec5_data.rda" from the course website
- These are the people from "lec2_data.rda" that did not have a dog at baseline
- Over monthly follow-up, some of these people borrowed dogs over the past month
32Applied Example
- dog_0: baseline dog ownership; all of these people should have "no"
- dog_1 - dog_12: did you borrow a dog over the past month?
33Applied Example
- Determine person-time at risk for dog borrowing
- Create a "survival" dataset from this data with columns: ID, start, end
- Note that there is missing data
34Applied Example
- We want to convert each person's wide data into two numbers: start and end
- Because of missing data, some people might have more than 1 row: people aren't at risk for dog borrowing during months they did not report (i.e., are missing)
35Applied Example
> dat[1,]
  id age sex height weight dog_0 dog_1 dog_2
1  1  40   F   63.5  134.5    no    no   yes
  dog_3 dog_4 dog_5 dog_6 dog_7 dog_8 dog_9
1   yes    no    no   yes   yes    no   yes
  dog_10 dog_11 dog_12
1   <NA>     no     no
36Applied Example
- Person 1 in the new dataset should be:
  ID start end
   1     0   9
   1    11  12
37Applied Example
- Basic premise: write a for-loop that passes over each person and determines their non-missing follow-up time
- Caveat: how many rows do we make our output matrix?
- Perfect opportunity for using rbind()
38Applied Example
- Create a matrix with 0 rows and 3 columns
- Within the body of the loop, use rbind() to append new rows (this is slow though)
> out = matrix(nrow = 0, ncol = 3)
> dim(out)
[1] 0 3
> p1 = c(1,0,9)
> out = rbind(out, p1)
> out
   [,1] [,2] [,3]
p1    1    0    9
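Growing a matrix with rbind() is slow because the matrix is copied on every pass. One alternative, shown here only as a sketch and not as the lecture's method, is to keep each person's rows in a list and bind them once at the end with do.call(). The rows below are made up.

```r
# Made-up rows (ID, start, end) for two hypothetical people
rows_by_person <- list(rbind(c(1, 0, 9), c(1, 11, 12)),
                       rbind(c(2, 0, 5)))

# Approach from the slide: grow the matrix one rbind() at a time
out_grow <- matrix(nrow = 0, ncol = 3)
for (i in seq(along = rows_by_person)) {
  out_grow <- rbind(out_grow, rows_by_person[[i]])
}

# Alternative: store the pieces in a list, then bind them once
pieces <- list()
for (i in seq(along = rows_by_person)) {
  pieces[[i]] <- rows_by_person[[i]]   # in a real loop this would be tmp
}
out_once <- do.call(rbind, pieces)

print(all(out_grow == out_once))
```

For a dataset of this size either approach is fine; the difference only matters when the number of appended pieces gets large.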
39Applied Example
out = matrix(nrow = 0, ncol = 3)
cols = grep("dog", names(dat))
for(i in 1:nrow(dat)) {
  hold = as.numeric(dat[i,cols])
  ...
40Applied Example
- Here, the follow-up results are factors, which
have numerical values
> dat[i,cols]
  dog_0 dog_1 dog_2 dog_3 dog_4 dog_5 dog_6
1    no    no   yes   yes    no    no   yes
  dog_7 dog_8 dog_9 dog_10 dog_11 dog_12
1   yes    no   yes   <NA>     no     no
> as.numeric(dat[i,cols])
 [1]  1  1  2  2  1  1  2  2  1  2 NA  1  1
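A small sketch of why "no" becomes 1 and "yes" becomes 2: a factor stores integer codes that point into its sorted levels. The vector `resp` is made up for illustration.

```r
# Made-up responses; levels sort alphabetically, so "no" = 1 and "yes" = 2
resp <- factor(c("no", "yes", NA, "no"))
print(levels(resp))
print(as.numeric(resp))   # integer codes of the levels; NA stays NA
print(is.na(resp))        # missingness is visible without converting
```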
41Applied Example
- Now a cool little trick: rle(), run length encoding
- Compute the lengths and values of runs of equal values in a vector
- We're going to combine this with is.na()
42Applied Example
- This says that there are 10 FALSE in a row, then 1 TRUE, then 2 FALSE
- We need to get this in a better format
> x = rle(is.na(hold))
> x
Run Length Encoding
  lengths: int [1:3] 10 1 2
  values : logi [1:3] FALSE TRUE FALSE
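The same idea on a shorter, made-up vector, where the runs are easy to count by eye. inverse.rle() is a base R function, not used in the lecture, that rebuilds the original vector from the runs.

```r
# Made-up visit vector: NA marks a missed visit
hold <- c(1, 1, 2, NA, NA, 1)
runs <- rle(is.na(hold))
print(runs$lengths)  # 3 2 1
print(runs$values)   # FALSE TRUE FALSE

# inverse.rle() rebuilds the original logical vector from the runs
print(identical(inverse.rle(runs), is.na(hold)))
```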
43Applied Example
> x = data.frame(cbind(x$values, x$lengths))
> names(x) <- c("missing", "length")
> x
  missing length
1       0     10
2       1      1
3       0      2
44Applied Example
- cumsum() returns the cumulative sum of a vector
> x$end <- cumsum(x$length)
> x$start <- x$end - x$length + 1
> x
  missing length end start
1       0     10  10     1
2       1      1  11    11
3       0      2  13    12
45Applied Example
- Note that we actually want all of the values to be one less, since our time starts at 0
> x$end <- cumsum(x$length) - 1
> x$start <- x$end - x$length + 1
> x
  missing length end start
1       0     10   9     0
2       1      1  10    10
3       0      2  12    11
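The arithmetic can be checked on its own, using the run lengths from the example (10 observed, 1 missing, 2 observed). The vector names `len`, `end`, and `start` are illustrative.

```r
len <- c(10, 1, 2)
end <- cumsum(len) - 1    # last visit index of each run, counting from 0
start <- end - len + 1    # first visit index of each run
print(rbind(start, end))
```

Each run starts right after the previous one ends, so `start[k]` equals `end[k - 1] + 1`.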
46Applied Example
> x <- x[,c(1,2,4,3)]
> x
  missing length start end
1       0     10     0   9
2       1      1    10  10
3       0      2    11  12
47Applied Example
- We want the last two columns of the non-missing
visits
> tmp = x[which(x$missing == 0),3:4]
> tmp
  start end
1     0   9
3    11  12
48Applied Example
- We then want to add a column of the individual ID
to the front
id = dat[i,1]
tmp = cbind(rep(id,nrow(tmp)), tmp)
names(tmp)[1] = "ID"
> tmp
  ID start end
1  1     0   9
3  1    11  12
49Applied Example
- Lastly, bind the tmp matrix to the growing out matrix
- This finishes off our loop body
  out = rbind(out,tmp)
}
50
for(i in 1:nrow(dat)) {
  hold = as.numeric(dat[i,cols])
  x = rle(is.na(hold))
  x = data.frame(cbind(x$values, x$lengths))
  names(x) <- c("missing", "length")
  x$end <- cumsum(x$length) - 1
  x$start <- x$end - x$length + 1
  x <- x[,c(1,2,4,3)]
  tmp = x[which(x$missing == 0),3:4]
  id = dat[i,1]
  tmp = cbind(rep(id,nrow(tmp)), tmp)
  names(tmp)[1] = "ID"
  out = rbind(out,tmp)
}
rownames(out) = 1:nrow(out)   # cleaning
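To run the whole loop without "lec5_data.rda", here is a sketch on a toy data frame. The data are made up (2 people, visits dog_0 to dog_4). Two details differ from the slides and are choices made for this sketch: missingness is read with is.na(unlist(...)) instead of converting the row with as.numeric(), and `out` starts as an empty data frame with named columns.

```r
# Toy stand-in for dat: 2 people, visits dog_0..dog_4 (made-up values)
dat <- data.frame(id = c(1, 2),
                  dog_0 = c("no", "no"),
                  dog_1 = c("yes", NA),
                  dog_2 = c(NA, "no"),
                  dog_3 = c("no", "no"),
                  dog_4 = c("no", NA),
                  stringsAsFactors = TRUE)
cols <- grep("dog", names(dat))
out <- data.frame(ID = numeric(0), start = numeric(0), end = numeric(0))

for (i in 1:nrow(dat)) {
  miss <- is.na(unlist(dat[i, cols]))   # TRUE where the visit is missing
  x <- rle(unname(miss))
  x <- data.frame(missing = as.numeric(x$values), length = x$lengths)
  x$end <- cumsum(x$length) - 1
  x$start <- x$end - x$length + 1
  tmp <- x[x$missing == 0, c("start", "end")]   # keep the observed runs
  tmp <- cbind(ID = rep(dat$id[i], nrow(tmp)), tmp)
  out <- rbind(out, tmp)
}
rownames(out) <- 1:nrow(out)   # cleaning
print(out)
```

Person 1 is observed at visits 0-1 and 3-4, and person 2 at visit 0 and visits 2-3, so `out` has two rows per person.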
51
> head(out,10)
   ID start end
1   1     0   9
2   1    11  12
3   2     0   5
4   2     7  12
5   3     0   2
6   3     5  12
7   4     0   3
8   4     6  12
9   5     0   0
10  5     3   8
> dim(out)
[1] 1414    3
52Applied Example
- One last adjustment needed, since we asked about borrowing a dog in the previous month
- The non-0 starts must be reduced by 1, since these are currently indices of the visit, not time at risk
Before:
  ID start end
1  1     0   9
2  1    11  12
After:
  ID start end
1  1     0   9
2  1    10  12
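One way to make this adjustment is with a logical index, sketched here on the two rows for person 1. The lecture does not show its own code for this step, so treat the approach and the name `adj` as assumptions.

```r
# The two rows for person 1 before the adjustment
out <- data.frame(ID = c(1, 1), start = c(0, 11), end = c(9, 12))

# Subtract 1 from every start that is not 0
adj <- out$start != 0
out$start[adj] <- out$start[adj] - 1
print(out)
```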
53Applied Example
- What is the total time at risk of this
population?
> time = out$end - out$start
> sum(time)
[1] 4988 (person-months)
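The same calculation on a few made-up intervals, small enough to check by hand. The per-person breakdown with tapply() is an extra that the lecture does not use.

```r
# Made-up adjusted intervals (ID, start, end), in months
out <- data.frame(ID = c(1, 1, 2), start = c(0, 10, 0), end = c(9, 12, 5))
time <- out$end - out$start
print(sum(time))                   # total person-months: 9 + 2 + 5
print(tapply(time, out$ID, sum))   # person-months per ID
```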
54Applied Example
- Save the 'out' matrix as an rda so it can be used
next week
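A sketch of saving and reloading with save() and load(). The file name "lec5_out.rda" and the temporary directory are placeholders chosen for this example, and the small data frame stands in for the real `out`.

```r
out <- data.frame(ID = c(1, 1), start = c(0, 10), end = c(9, 12))  # stand-in
path <- file.path(tempdir(), "lec5_out.rda")   # hypothetical file name
save(out, file = path)

rm(out)
load(path)            # restores the object under its original name, out
print(exists("out"))
```

load() brings the object back under the name it was saved with, so there is no need to assign its result.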