6.3 Advanced Control Flow

In this section, we’ll discuss more ways to control the flow of your code. Specifically, we’ll talk about the apply family of functions, starting with sapply. To show what sapply does, let’s look at the following function:

square_plus_one <- function(x){
  return(x^2 + 1)
}

This function returns the result of squaring the input argument and adding 1. Suppose we wanted to run this function on every element of a vector. One option would be to write a for-loop:

# The vector we want to apply our function to
x <- 1:10

# Create an empty vector to hold the result
y <- numeric(length(x))

# Loop through x and apply the function
for(i in seq_along(x)){
  y[i] <- square_plus_one(x[i])
}

y
 [1]   2   5  10  17  26  37  50  65  82 101

Here’s another way to do the same thing using the sapply function:

# Apply the function square_plus_one to every element in x
z <- sapply(x, square_plus_one)

# z agrees with y!
print(z == y)
 [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

# z is a vector
mode(z)
[1] "numeric"

In words, the sapply function says “loop through every element of this vector, and apply this function to it.” All you have to specify is which vector to loop through and which function to apply. The “s” in sapply stands for “simplify”: when every result has the same length, sapply simplifies the output to a vector (or to a matrix, if each result has more than one element); otherwise it returns a list. In some cases, you may want the result to always be returned as a list instead of a vector. In this case, the lapply function can be used:

# Apply the function square_plus_one to every element in x
w <- lapply(x, square_plus_one)

# w is a list
mode(w)
[1] "list"
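
To make the simplification step more concrete, here is a minimal sketch comparing the two functions when each call returns two values instead of one. The helper square_and_cube is a hypothetical function introduced only for this illustration.

```r
# Hypothetical helper (not defined elsewhere in this section):
# returns two named values for each input
square_and_cube <- function(x){
  c(square = x^2, cube = x^3)
}

# sapply simplifies the equal-length results into a matrix,
# with one column per element of the input
s <- sapply(1:3, square_and_cube)
is.matrix(s)
dim(s)

# lapply keeps the results as a list, with one entry per element
l <- lapply(1:3, square_and_cube)
is.list(l)
length(l)
```

The computed values are the same in both cases; only the container differs.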

lapply and sapply can also be used if the function doesn’t always return the same type or length of data. When the lengths differ, sapply cannot simplify the result and returns a list, just like lapply. Here’s an example:

make_vector <- function(n){
  # Make a vector of length n numbered from 1 to n
  1:n
}

# Apply make_vector to each element of x:
x <- c(2, 2, 3, 1, 6)
lapply(x, make_vector)
[[1]]
[1] 1 2

[[2]]
[1] 1 2

[[3]]
[1] 1 2 3

[[4]]
[1] 1

[[5]]
[1] 1 2 3 4 5 6
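
As a rough sketch of how sapply behaves with the same function, the example below calls it once with inputs that produce results of different lengths and once with inputs that produce results of equal length. The input vectors are made up for illustration.

```r
make_vector <- function(n){
  # Make a vector of length n numbered from 1 to n
  1:n
}

# Results have different lengths, so sapply cannot simplify
# and falls back to returning a list
a <- sapply(c(2, 3, 1), make_vector)
is.list(a)

# Results all have length 3, so sapply simplifies them
# into a matrix with one column per input
b <- sapply(c(3, 3), make_vector)
is.matrix(b)
dim(b)
```

Because the shape of the output depends on the data, lapply is the safer choice whenever the result lengths can vary.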

The differences between lapply and sapply (and other functions in the apply family) can be subtle and confusing, even after reading the documentation with ?lapply. Here is one way they differ:

x <- list(1, 2, 3)

l <- lapply(x, square_plus_one)
print(mode(l))
[1] "list"

s <- sapply(x, square_plus_one)
print(mode(s))
[1] "numeric"

lapply always returns a list, whether the input is a list or a vector. However, sapply simplified the result to a vector. If you know the type and length of each result in advance, you can use the vapply function, which checks every result against that expectation and can be faster than sapply in some cases.

ch <- vapply(x, square_plus_one, numeric(1))

Here we’ve supplied an extra argument, numeric(1), as a template that tells R each result will be a single numeric value.
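
The extra argument also acts as a check. The sketch below, which redefines square_plus_one and x so it can run on its own, shows what happens when the declared type does not match what the function returns. The tryCatch wrapper and its message are illustrative choices so the example keeps running after the error.

```r
square_plus_one <- function(x){
  return(x^2 + 1)
}
x <- list(1, 2, 3)

# Matching template: each result is a single number
ok <- vapply(x, square_plus_one, numeric(1))
ok

# Mismatched template: we claim each result is a single character string,
# so vapply stops with an error instead of quietly returning numbers
res <- tryCatch(
  vapply(x, square_plus_one, character(1)),
  error = function(e) "vapply raised an error"
)
res
```

This strictness is the main reason to prefer vapply inside functions, where an unexpected result type should fail early.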

The differences between lapply, sapply, and vapply are not easy to understand, especially because they sometimes give the same results. A simple yet imperfect rule is that sapply will work in most cases, but you can use lapply if you always want a list back, and use vapply if you know what type and length each result will be.

6.3.1 Applying Over Multiple Dimensions

Suppose we want to find the variance of each row of a matrix. The lapply and sapply functions operate on each element of a matrix, not on each row. For this, we need the apply function:

m <- matrix(c(1, 1, 1, 1, 2, 3, 2, 4, 6), 3, 3, byrow = TRUE)

apply(m, 1, var)
[1] 0 1 4

The arguments to this function are:

  • m: The matrix to be iterated over
  • 1: The margin to iterate over (1 for rows, 2 for columns)
  • var: The function to apply over the axis
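
As a small sketch of how the second argument changes the result, the example below reuses the same matrix but iterates over columns, and then applies an anonymous function to each row. The names col_vars and row_ranges are introduced here only for illustration.

```r
m <- matrix(c(1, 1, 1, 1, 2, 3, 2, 4, 6), 3, 3, byrow = TRUE)

# Margin 2: compute the variance of each column instead of each row
col_vars <- apply(m, 2, var)
col_vars

# Any function that takes a vector works, including an anonymous one.
# Here: the spread (max minus min) of each row
row_ranges <- apply(m, 1, function(r) max(r) - min(r))
row_ranges
```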

6.3.2 Applying Over Data Frame Groups

Lastly, we’ll show how you can apply a function to each group specified by a grouping variable. Suppose we wanted to add the COVID deaths across all Age groups for each state. Then we could use the tapply function, with COVID.19.Deaths as the vector to iterate over, State as the grouping variable, and sum as the function to use. Here’s the result:

state_deaths <- tapply(covid$COVID.19.Deaths, covid$State, sum, na.rm = TRUE)
sort(state_deaths)
              Alaska               Hawaii              Montana 
                   0                    0                    0 
             Wyoming          Puerto Rico              Vermont 
                   0                   13                   26 
        South Dakota         North Dakota        West Virginia 
                  68                   91                   93 
               Idaho                Maine                 Utah 
                 107                  107                  183 
              Oregon             Nebraska               Kansas 
                 232                  270                  302 
            Arkansas        New Hampshire             Oklahoma 
                 332                  367                  398 
          New Mexico             Delaware               Nevada 
                 502                  503                  567 
District of Columbia             Kentucky            Tennessee 
                 633                  644                  656 
                Iowa            Wisconsin         Rhode Island 
                 762                  810                  912 
      South Carolina             Missouri          Mississippi 
                 957                 1001                 1193 
      North Carolina           Washington              Alabama 
                1205                 1223                 1244 
           Minnesota             Colorado             Virginia 
                1469                 1623                 2062 
             Arizona              Georgia                 Ohio 
                2430                 2534                 2686 
             Indiana            Louisiana             Maryland 
                2719                 3082                 3607 
               Texas          Connecticut              Florida 
                3693                 4007                 4331 
            Michigan             Illinois           California 
                5588                 6637                 7089 
        Pennsylvania        Massachusetts             New York 
                7216                 7741                11238 
          New Jersey        New York City        United States 
               13806                20450               390745 

Here we also had to set the na.rm argument so that NA values are removed when computing the sum. Remember that these totals may not match the All Ages number from each state, because some counts are suppressed. Here’s another example, where we sum the deaths across all states for each Age group:

# Remove US total numbers first
covid <- covid[covid$State != "United States",]

agegroup_deaths <- tapply(covid$COVID.19.Deaths, covid$Age.group, sum, na.rm = TRUE)
sort(agegroup_deaths)
        1-4 years        5-14 years      Under 1 year       15-24 years 
                0                 0                 0                54 
      25-34 years       35-44 years       45-54 years       55-64 years 
              730              2246              6451             15821 
      65-74 years       75-84 years 85 years and over 
            27117             34351             42639 

Remember that these totals may not match the United States values from the dataset because some counts are suppressed.
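
To see how tapply and na.rm interact without loading the full dataset, here is a minimal sketch on a made-up data frame. The data frame toy, its columns, and its values are hypothetical and are not taken from the covid data.

```r
# Hypothetical toy data, not the real covid dataset
toy <- data.frame(
  State  = c("A", "A", "B", "B", "C"),
  Deaths = c(10, NA, 5, 7, NA)
)

# Without na.rm, any group that contains an NA sums to NA
with_na <- tapply(toy$Deaths, toy$State, sum)
with_na

# With na.rm = TRUE, NAs are dropped before summing.
# A group whose values are all NA (state C) sums to 0, not NA
totals <- tapply(toy$Deaths, toy$State, sum, na.rm = TRUE)
totals
```

Note that a total of 0 from tapply with na.rm = TRUE can mean either that every count in the group was 0 or that every count was missing.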