6.3 Advanced Control Flow
In this section, we’ll discuss more ways to control the flow of your code.
Specifically, we’ll talk about the apply family of functions, starting with sapply.
To show what sapply does, let’s look at the following function:
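A minimal definition consistent with the description that follows, assuming the name square_plus_one used in the later examples, might look like this:

```r
# Assumed definition: the name square_plus_one is taken from the examples
# later in this section; the body follows the description in the text
square_plus_one <- function(x) {
  # Square the input and add 1
  x^2 + 1
}

# Quick check on a single value
square_plus_one(3)
```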
This function returns the result of squaring the input argument and adding 1. Suppose we wanted to apply this function to every element of a vector. One option would be to write a for-loop:
# The vector we want to apply our function to
x <- 1:10
# Create an empty vector to hold the result
y <- numeric(length(x))
# Loop through x and apply the function
for (i in seq_along(x)) {
  y[i] <- square_plus_one(x[i])
}
y
[1] 2 5 10 17 26 37 50 65 82 101
Here’s another way to do the same thing using the sapply function:
# Apply the function square_plus_one to every element in x
z <- sapply(x, square_plus_one)
# z agrees with y!
print(z == y)
# z is a vector
mode(z)
[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[1] "numeric"
In words, the sapply function says “loop through every element of this vector, and apply this function to it.”
All you have to specify is which vector to loop through, and which function to apply.
The “s” in sapply stands for “simplify”, meaning the result will be simplified to a vector, matrix, or higher dimensional array.
In some cases, you may want the result to be returned as a list, instead of a vector. In this case, the lapply function can be used:
# Apply the function square_plus_one to every element in x
w <- lapply(x, square_plus_one)
# w is a list
mode(w)
[1] "list"
lapply and sapply can also be used if the function doesn’t always return the same type or length of data. Here’s an example:
make_vector <- function(n){
  # Make a vector of length n numbered from 1 to n
  1:n
}
# Apply make_vector to each element of x:
x <- c(2, 2, 3, 1, 6)
lapply(x, make_vector)
[[1]]
[1] 1 2
[[2]]
[1] 1 2
[[3]]
[1] 1 2 3
[[4]]
[1] 1
[[5]]
[1] 1 2 3 4 5 6
The differences between lapply and sapply (and other functions in the apply family) can be subtle and very confusing at times, even after reading the documentation using ?lapply, but here is one way they are different:
x <- list(1, 2, 3)
l <- lapply(x, square_plus_one)
print(mode(l))
s <- sapply(x, square_plus_one)
print(mode(s))
[1] "list"
[1] "numeric"
lapply always returns a list, whether the input x is a list (as here) or a vector.
However, sapply simplified the result to a vector.
If you know the type of data that will be returned, you can use the vapply function, which can be faster than sapply in some cases.
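A minimal sketch of such a call, assuming x is the list from the previous example, that square_plus_one is defined as described earlier, and that the template value passed as the third argument is 1:

```r
# Assumed definition, matching the description earlier in this section
square_plus_one <- function(x) x^2 + 1

x <- list(1, 2, 3)

# The third argument is a template for each result: a single numeric value
v <- vapply(x, square_plus_one, 1)
print(v)
print(mode(v))
```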
Here we’ve supplied an extra argument (1) as a template, to tell R that each result will be a single numeric value.
The differences between lapply, sapply, and vapply are not easy to understand, especially because they give the same results sometimes. A simple yet imperfect rule is that sapply will work in most cases, but you can use lapply if you don’t want lists to be simplified, and use vapply if you know what type the result will be.
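To make the rule above concrete, here is a small sketch that reuses make_vector from earlier in this section. The input vectors and the variable names (same_len, diff_len, as_list, checked) are chosen only for illustration:

```r
# make_vector as defined earlier in this section
make_vector <- function(n) 1:n

# Equal-length results: sapply simplifies them to a matrix
same_len <- sapply(c(2, 2), make_vector)

# Unequal-length results: sapply cannot simplify and returns a list
diff_len <- sapply(c(2, 3), make_vector)

# lapply returns a list even when simplification would be possible
as_list <- lapply(c(2, 2), make_vector)

# vapply stops with an error when a result does not match the template
# (here the template asks for one number, but make_vector(2) returns two)
checked <- tryCatch(
  vapply(c(2, 3), make_vector, numeric(1)),
  error = function(e) "vapply raised an error"
)

print(is.matrix(same_len))
print(is.list(diff_len))
print(is.list(as_list))
print(checked)
```

The strictness of vapply is the point: a mismatch between what you expect and what the function returns shows up immediately instead of silently changing the shape of the result.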
6.3.1 Applying Over Multiple Dimensions
Suppose we want to find the variance of each row of a matrix.
lapply and sapply only operate on each element of a matrix, not on each row.
For this, we need the apply function:
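A sketch of such a call follows. The contents of m are an assumption: this hypothetical matrix was chosen so that its row variances match the output shown below.

```r
# Hypothetical matrix (an assumption for illustration)
m <- matrix(c(1, 1, 1,
              1, 2, 3,
              1, 3, 5), nrow = 3, byrow = TRUE)

# 1 means "iterate over rows"; 2 would iterate over columns
row_vars <- apply(m, 1, var)
row_vars
```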
[1] 0 1 4
The arguments to this function are:
m: The matrix to be iterated over
1: The axis to iterate over
var: The function to apply over the axis
6.3.2 Applying Over Data Frame Groups
Lastly, we’ll show how you can apply a function to each group specified by a grouping variable.
Suppose we wanted to add the COVID deaths across all Age groups for each state.
Then we could use the tapply function, with COVID.19.Deaths as the vector to iterate over, State as the grouping variable, and sum as the function to use.
Here’s the result:
              Alaska               Hawaii              Montana
                   0                    0                    0
             Wyoming          Puerto Rico              Vermont
                   0                   13                   26
        South Dakota         North Dakota        West Virginia
                  68                   91                   93
               Idaho                Maine                 Utah
                 107                  107                  183
              Oregon             Nebraska               Kansas
                 232                  270                  302
            Arkansas        New Hampshire             Oklahoma
                 332                  367                  398
          New Mexico             Delaware               Nevada
                 502                  503                  567
District of Columbia             Kentucky            Tennessee
                 633                  644                  656
                Iowa            Wisconsin         Rhode Island
                 762                  810                  912
      South Carolina             Missouri          Mississippi
                 957                 1001                 1193
      North Carolina           Washington              Alabama
                1205                 1223                 1244
           Minnesota             Colorado             Virginia
                1469                 1623                 2062
             Arizona              Georgia                 Ohio
                2430                 2534                 2686
             Indiana            Louisiana             Maryland
                2719                 3082                 3607
               Texas          Connecticut              Florida
                3693                 4007                 4331
            Michigan             Illinois           California
                5588                 6637                 7089
        Pennsylvania        Massachusetts             New York
                7216                 7741                11238
          New Jersey        New York City        United States
               13806                20450               260495
Here we also had to specify the na.rm=T parameter so that NA values are removed when computing the sum.
Remember that these totals may not match the All Ages number from each state, because some counts are suppressed.
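By analogy with the next example, the per-state totals above would come from a call along the lines of tapply(covid$COVID.19.Deaths, covid$State, sum, na.rm=T). The sketch below shows the same pattern on a small made-up data frame; the name toy and all of its values are invented for illustration and are not taken from the real dataset.

```r
# Toy data (made-up values), using the same column names as the covid data
toy <- data.frame(
  State = c("Vermont", "Vermont", "Maine", "Maine", "Maine"),
  Age.group = c("65-74 years", "75-84 years",
                "65-74 years", "75-84 years", "85 years and over"),
  COVID.19.Deaths = c(10, NA, 20, 30, NA)
)

# Without na.rm, any group that contains an NA sums to NA
with_na <- tapply(toy$COVID.19.Deaths, toy$State, sum)

# With na.rm = TRUE, NA values are dropped before summing
state_deaths <- tapply(toy$COVID.19.Deaths, toy$State, sum, na.rm = TRUE)
sort(state_deaths)
```

Extra arguments after the function name, such as na.rm here, are passed along by tapply to the function being applied.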
Here’s another example, where we sum the deaths across all states for each Age group:
# Remove US total numbers first
covid <- covid[covid$State != "United States",]
agegroup_deaths <- tapply(covid$COVID.19.Deaths, covid$Age.group, sum, na.rm=T)
sort(agegroup_deaths)
        1-4 years        5-14 years      Under 1 year       15-24 years
                0                 0                 0                54
      25-34 years       35-44 years       45-54 years       55-64 years
              730              2246              6451             15821
      65-74 years       75-84 years 85 years and over
            27117             34351             42639
Remember that these totals may not match the United States values from the dataset because some counts are suppressed.