6.1 Basic Control Flow

When you write R code, you are creating commands that R will execute one at a time in order, from top to bottom. This is the basic flow of an R program, but R also provides ways that you can control the flow, using basic logic. In this section, we’ll introduce a few ways of controlling the flow of an R program, but first, we need a data set to work with.

Our working example for this chapter will be the latest (as of this book’s writing) provisional estimates of COVID-19 Deaths in the United States, available from the Centers for Disease Control at this link. We’ve downloaded the data and saved it in the data_raw folder, and you should do the same (the data are also available here). First, let’s load the data and do some minor cleaning:

# Load the data
# The "Footnote" column has hyphens,
# which only display correctly if we specify "UTF-8" encoding
covid <- read.csv("data_raw/Provisional_COVID-19_Death_Counts_by_Sex__Age__and_State.csv", fileEncoding = "UTF-8")
# Remove rows with state totals, which would distort our summary statistics later
covid <- covid[!grepl("Total", covid$State),]
# Remove all ages category
covid <- covid[covid$Age.group != "All ages",]

Download the covid data into your ‘data_raw’ folder, and load/clean it using the code above.
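
If you’d like to see what the two cleaning steps do before running them on the full data, here is a minimal sketch on a made-up data frame. The toy data frame and its values (including the "Colorado Total" row) are assumptions for illustration only, not rows copied from the CDC file:

```r
# Toy data (hypothetical values, for illustration only)
toy <- data.frame(
  State     = c("Colorado", "Colorado", "Colorado Total", "Wyoming"),
  Age.group = c("All ages", "45-54 years", "45-54 years", "45-54 years"),
  stringsAsFactors = FALSE
)

# grepl() returns TRUE wherever the pattern "Total" appears in State
is_total <- grepl("Total", toy$State)
print(is_total)

# Keep the rows that are NOT totals, then drop the "All ages" category
toy_clean <- toy[!is_total, ]
toy_clean <- toy_clean[toy_clean$Age.group != "All ages", ]
print(toy_clean)
```

Only the two age-specific rows for Colorado and Wyoming should remain in toy_clean.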

6.1.1 Loops

One of the first things we might like to do with our data set is create visualizations. This data set contains death counts for different states, age groups, and sexes. Let’s pick a state and sex, create a bar chart for deaths in different age groups, and save the image to the output directory:

# Select only Females from Colorado
covid_co_f <- covid[(covid$State == "Colorado") & (covid$Sex == "Female"),]

# Save a barplot of the deaths by age group
png("output/covid_deaths_by_agegroup_colorado_female.png")
par(mar = c(9, 4, 2, 2))

# The COVID.19.Deaths vector doesn't have row names,
# so we specify the bar labels with names.arg 
barplot(covid_co_f$COVID.19.Deaths, names.arg = covid_co_f$Age.group, 
        las = 2, 
        main = "Deaths By Age Group")
dev.off()

Here’s the plot we just created:

covid_co_f <- covid[(covid$State == "Colorado") & (covid$Sex == "Female"),]
par(mar = c(9, 4, 2, 2))
barplot(covid_co_f$COVID.19.Deaths, names.arg = covid_co_f$Age.group, 
        las = 2, 
        main = "Deaths By Age Group")

Note that three age groups have between 1 and 9 deaths, so the counts have been omitted from the data set to maintain the confidentiality of the victims.

Let’s repeat this process for two other states:

# Deaths by age group for Females in Wyoming
covid_wy_f <- covid[(covid$State == "Wyoming") & (covid$Sex == "Female"),]
png("output/covid_deaths_by_agegroup_wyoming_female.png")
par(mar = c(9, 4, 2, 2))
barplot(covid_wy_f$COVID.19.Deaths, names.arg = covid_wy_f$Age.group, 
        las = 2, 
        main = "Deaths By Age Group")
dev.off()


# Deaths by age group for Females in New Mexico
covid_nm_f <- covid[(covid$State == "New Mexico") & (covid$Sex == "Female"),]
png("output/covid_deaths_by_agegroup_newmexico_female.png")
par(mar = c(9, 4, 2, 2))
barplot(covid_nm_f$COVID.19.Deaths, names.arg = covid_nm_f$Age.group, 
        las = 2, 
        main = "Deaths By Age Group")
dev.off()

Here are these plots, too:

Now, if we wanted to do this for all states in our dataset, this would take a lot of code. But, did you notice that the code we wrote in each case was very similar? This is a perfect opportunity to use looping. Looping involves running the same R commands multiple times, usually making small changes in between. The most common form of loop is called a for-loop. Here’s a simple example:

vec <- c("a", "b", "c")  # Create a vector
for(i in vec){           # Loop through the elements of the vector
  print(i)               # Print out the current element
}                        # End of the loop's code block
[1] "a"
[1] "b"
[1] "c"

This for-loop printed out each element of the vec variable, one at a time. Here’s the way this works:

  • for tells R that we want to repeat code multiple times. When R “sees” the for command, it knows that the code that follows will be repeated.
  • The parentheses (( and )) specify a vector that will be looped over (vec in this example), and a variable name to use while looping (i in this example). On each iteration of the loop, the variable (i) will have a different value. In this example, the first time through the loop, i will have the value of the first element of vec ("a"), the second time through the loop, i will have the value of the second element of vec ("b"), etc. The name for-loop, which is common to many programming languages, reflects the fact that R runs the loop once for each element of the vector.
  • The braces ({ and }) specify which code should be run each time through the loop. In this example, we’re just printing out the value of i, so the result is that each element of vec is printed in order. Recall that braces are a way of specifying a block of code, and R knows that everything inside the block should be run while looping.
  • After it finishes looping, R proceeds to run any code below the for-loop.

Here’s another example of a for-loop:

for(j in 1:10){
  print(j^2)
}
print(j+1)
[1] 1
[1] 4
[1] 9
[1] 16
[1] 25
[1] 36
[1] 49
[1] 64
[1] 81
[1] 100
[1] 11

There are a few things to learn from this second example:

  • The variable used in the loop doesn’t have to be i. It can be any name you like.
  • You can create vectors in the for-loop. Here we use 1:10 to generate a sequence of numbers from 1 to 10 (remember this?).
  • The value of j (or whatever your looping variable is called) still exists after the for-loop is over. Here the last value of j was 10, so printing j+1 produced 11.

Don’t forget to include the curly braces ({, }) after your for-loop, or else R may not include your code in the loop.

In some languages, white space such as tabs and spaces is significant, meaning it implies something about what should happen when the code is run. In R, spaces and tabs don’t change anything about how code is run, and are usually used to make code more readable. For example, it’s common to indent for-loops, for clarity, but it’s not strictly necessary for the code to run. This is another example of coding style.

Technically, you don’t need to include braces after the for-loop, but if you leave them out, then R will only run the first command it finds after the for(...).
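
To see the difference the braces make, here is a minimal sketch. The counter names are made up for this illustration. Both loops are indented the same way, but only the first one uses braces:

```r
# With braces: both lines run on every pass through the loop
count_with <- 0
other_with <- 0
for (k in 1:3) {
  count_with <- count_with + 1
  other_with <- other_with + 1
}

# Without braces: only the first line after for(...) is part of the loop
count_without <- 0
other_without <- 0
for (k in 1:3)
  count_without <- count_without + 1
  other_without <- other_without + 1  # Runs once, after the loop has finished

print(c(count_with, other_with))        # 3 3
print(c(count_without, other_without))  # 3 1
```

The indentation of the last assignment is misleading: R ignores it, so that line is not part of the second loop.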

Now, let’s gradually change the first example into a loop that runs visualizations for each state in our data set. First, instead of looping over c("a", "b", "c"), let’s loop over state names:

for(i in c("Colorado", "Wyoming", "New Mexico")){
  print(i)
}
[1] "Colorado"
[1] "Wyoming"
[1] "New Mexico"

Now, instead of just printing the state name, let’s create a data frame of just that state, for females:

for(i in c("Colorado", "Wyoming", "New Mexico")){
  covid_state_f <- covid[(covid$State == i) & (covid$Sex == "Female"),]
}

Remember, each time through the loop, the value of i matches one of the state names in the vector. So covid$State == i will produce a logical vector that is TRUE for the rows belonging to whichever state name we’re on. Notice that each time through the loop, the covid_state_f data frame will also change, containing only the rows for the state we’re on. Now that we are selecting only the state of interest, let’s produce a bar plot of deaths, split by age group:
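
If the logical vector is hard to picture, here is a minimal sketch on a tiny made-up data frame. The values are hypothetical, and only the column names mirror the real data:

```r
# Toy data (hypothetical values, for illustration only)
toy <- data.frame(
  State  = c("Colorado", "Wyoming", "Colorado"),
  Sex    = c("Female", "Female", "Male"),
  Deaths = c(10, 20, 30),
  stringsAsFactors = FALSE
)

i <- "Colorado"  # Pretend we are on the "Colorado" pass through the loop

# One TRUE/FALSE value per row: TRUE only where both comparisons hold
keep <- (toy$State == i) & (toy$Sex == "Female")
print(keep)  # TRUE FALSE FALSE

# Using the logical vector as a row index keeps only the TRUE rows
toy_state_f <- toy[keep, ]
print(toy_state_f)
```

Changing i to "Wyoming" and rerunning the last two steps selects the second row instead, which is exactly what happens on the next pass through the loop.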

# Loop through three states
for(i in c("Colorado", "Wyoming", "New Mexico")){
  
  # Select only the rows from the state
  covid_state_f <- covid[(covid$State == i) & (covid$Sex == "Female"),]
  # Create the file name using the state's name
  file_name <- paste("output/covid_deaths_by_agegroup_", i, "_female.png", sep="")
  # Produce the plot
  png(file_name)
  par(mar = c(9, 4, 2, 2))
  barplot(covid_state_f$COVID.19.Deaths, names.arg = covid_state_f$Age.group, 
          las = 2, 
          main = paste("Deaths By Age Group, ", i, sep=""))
  dev.off()
}

We’ve used the paste function a few times in this loop. Remember that it combines multiple strings using a separator, which we’ve set to an empty string (so there is no separator between the strings being combined).
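
Here is a minimal sketch of how the separator changes the result. The file names are shortened for illustration. The last step, which uses the base R functions tolower and gsub, is an optional extra in case you want names without capitals or spaces (like the "newmexico" file name we typed by hand earlier):

```r
state <- "New Mexico"

# Default separator is a single space, which we usually don't want in file names
with_default <- paste("output/deaths_", state, ".png")
print(with_default)  # "output/deaths_ New Mexico .png"

# sep = "" glues the pieces together directly
no_sep <- paste("output/deaths_", state, ".png", sep = "")
print(no_sep)        # "output/deaths_New Mexico.png"

# paste0() is a shortcut for paste(..., sep = "")
same <- paste0("output/deaths_", state, ".png")

# Optional: lower-case the name and remove spaces before building the file name
clean_state <- gsub(" ", "", tolower(state))
print(clean_state)   # "newmexico"
```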

This is some of the longest and most complex code that we’ve discussed so far! It’s important that you fully understand what each line is doing, so take your time and review the code chunk above until you’re comfortable with it.

Here comes the real power of this method. So far, we’ve just produced plots for three states, but with one small change, we can produce plots for each state in the data frame:

for(i in unique(covid$State)){  # <<------ Here's the one change we made!
  covid_state_f <- covid[(covid$State == i) & (covid$Sex == "Female"),]
  file_name <- paste("output/covid_deaths_by_agegroup_", i, "_female.png", sep="")
  png(file_name)
  par(mar = c(9, 4, 2, 2))
  barplot(covid_state_f$COVID.19.Deaths, names.arg = covid_state_f$Age.group, 
          las = 2, 
          main = paste("Deaths By Age Group, ", i, sep=""))
  dev.off()
}

This code will now loop through every unique value in the State column and produce the same kind of visualization for each state!

Write a for-loop which loops through each age group category, and prints the total number of COVID-19 deaths across all states (Hint: each time through the loop, subset based on the age group, then find the sum of the deaths column, then print the result).

R has other constructs for looping as well, but for-loops are by far the most common. Another option is while, which, rather than looping through a vector, keeps looping for as long as some condition is true. Try ?Control for more info.
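
As a quick illustration of while, here is a minimal sketch (the variable n and the cutoff of 100 are arbitrary choices for this example). The loop keeps doubling n until the condition is no longer true:

```r
n <- 1
while (n < 100) {  # Checked before every pass through the loop
  n <- n * 2       # Without this change the condition would stay TRUE forever
}
print(n)  # 128
```

Notice that the code inside the loop must eventually make the condition FALSE, otherwise the loop never ends.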

6.1.1.1 Nested Loops

Sometimes, it becomes necessary to loop over multiple vectors at once. This is possible by nesting the for-loops (putting one inside the other) like so:

for(i in c(10, 50)){
  for(j in c(1, 2)){
    print(i + j)
  }
}
[1] 11
[1] 12
[1] 51
[1] 52

Look carefully at the output, and notice that j is changing “faster” than i: First i is 10, and j cycles through 1 and 2, then i is 50, and j cycles through 1 and 2 again.

Notice that when nesting for-loops, each for-loop has its own set of braces ({, }). Don’t forget to put the second ending brace }! Another reason to use indenting is to catch mistakes like a missing ending brace.

Let’s apply this concept to our COVID-19 data. So far, we’ve been generating plots for the females only, but we can include another loop which cycles through each Sex for each state (changes to the code are marked with comments):

for(i in unique(covid$State)){  
  for(j in unique(covid$Sex)){  # Add a nested loop for sex
    covid_state_sex <- covid[(covid$State == i) & (covid$Sex == j),]  # Compare covid$Sex to j
    # Add j to the file name
    file_name <- paste("output/covid_deaths_by_agegroup_", i, "_", j, ".png", sep="")
    png(file_name)
    par(mar = c(9, 4, 2, 2))
    barplot(covid_state_sex$COVID.19.Deaths, names.arg = covid_state_sex$Age.group,
            las = 2,
            main = paste("Deaths By Age Group,", i, j))  # Add sex to the title
    dev.off()
  }
}

Nested for-loops can be useful and even necessary, but nesting can sometimes take a very long time to run. If two nested for-loops each run through 1,000 vector elements, that means a total of 1,000,000 iterations through the inner loop’s code! It’s possible to have a set of three nested for-loops or even more, but generally this is not wise practice, and in most cases there is a way to accomplish the same goal without so much looping.
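
For instance, the small nested loop from earlier in this section added every i to every j. A minimal sketch of a loop-free alternative uses the base R function outer. This is just one possible approach, and nothing else in this chapter relies on it:

```r
i_vals <- c(10, 50)
j_vals <- c(1, 2)

# Nested-loop version: collect the sums in a vector as we go
loop_sums <- c()
for (i in i_vals) {
  for (j in j_vals) {
    loop_sums <- c(loop_sums, i + j)
  }
}
print(loop_sums)  # 11 12 51 52

# Loop-free version: outer() applies "+" to every pair of elements,
# giving a matrix with one row per i and one column per j
sums <- outer(i_vals, j_vals, "+")
print(sums)
```

Both versions compute the same four sums; the second simply leaves the looping to R.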

6.1.1.2 Breaking Out of For-Loops

Sometimes it’s necessary to stop a loop earlier than expected. This can be done with break, but this is best explained after discussing if/else statements.

6.1.2 If Statements

So far, you’ve seen how to control the flow of a program by having R run the same chunk of code multiple times. Another common way of controlling flow is to change the code that runs based on some condition. Let’s return to the COVID example for motivation.

Suppose we wanted to create a visualization of the data in each column of the data frame. Remember that the choice of visualization is affected by the type of variable being visualized (quantitative or categorical). If the column is quantitative, we’d like to produce a histogram, perhaps, and if the column is categorical, we’d like to produce a bar graph. Remember that looping runs the same code each time through the loop, so how are we supposed to change the plot method to suit the variable type? The answer is to use if statements.

Before going further, here’s a quick example:

if("cat" == "dog"){
  print("Something doesn't make sense!")
}

This code produces precisely no output. Even though there is a print command, R does not print anything! The reason is that the print command is inside of an if statement, and R only runs that code if the specified condition is met. Here’s how it works:

  • The if indicates the start of the if statement.
  • R expects the parentheses to contain a logical statement that produces either TRUE or FALSE. In this example, we are comparing the character strings "cat" and "dog" (which are not the same, so the result is FALSE).
  • If the condition is TRUE, then the code block in curly braces is run (not true for this example).
  • If the condition is FALSE, then the code block is not run (which is why the above example did not print).
  • Whether the code block is run or not, R will then proceed to run any code below the if statement.

In the simple example above, the logical condition ("cat" == "dog") is obviously false, so no matter how many times we run the code, the print statement will never run. If the code never runs, then why go through the trouble of including it? The answer is that this simple example isn’t realistic: in practice, the condition usually depends on a value that changes, as the next example shows.

Let’s see how if statements can be used inside of a for-loop.

for(i in 1:5){
  if(i == 4){
    print(i)
  }
}
[1] 4

Here we have a for-loop which loops through the vector 1:5. Remember that the value of i is changing each time through the loop to a different element of the vector. Each time through the loop, R evaluates the condition i == 4. If it is true, then the value of i is printed. Otherwise, nothing happens because there is no other code in the for-loop. i takes on the value 4 exactly once, in which case the print statement runs and we see the value of i. To summarize, the for-loop code ran five times, four of these times the if condition was FALSE and nothing happened, but one time the if condition was TRUE.

6.1.2.1 Else

If statements can also be written with an else block, which specifies code to run if the logical condition is FALSE:

for(i in 1:5){
  if(i == 4){  # Condition to test
    print(i)   # Code to run if condition is TRUE
  } else {
    print("Not 4")  # Code to run if condition is FALSE
  }
}
[1] "Not 4"
[1] "Not 4"
[1] "Not 4"
[1] 4
[1] "Not 4"

Here you can see that rather than doing nothing when the condition is not true, the second code block (after else) is run instead.

Returning to the COVID example, let’s loop through each of the columns of the data frame, and use an if statement to determine if it is a character or numeric mode (remember mode?). Then let’s choose an appropriate visualization based on the mode:

for(col_name in names(covid)){
  col <- covid[[col_name]]
  if(mode(col) == "numeric"){           # Check if mode is numeric
    hist(col, main = col_name)            # Plot histogram
  } else {                              # Not numeric, assume character
    barplot(table(col), main = col_name)  # Plot barplot
  }
}

Admittedly, these plots are a little crude, but the point is that the code is able to create the appropriate plot depending on the type of variable. This is another way in which you can control the flow of an R program.
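
To check what mode reports before plotting, here is a minimal sketch on a made-up data frame (the column names and values are hypothetical). It also shows one caveat worth knowing: a factor is stored as integer codes, so its mode is "numeric" even though it holds categories:

```r
# Toy data (hypothetical values, for illustration only)
toy <- data.frame(
  state  = c("Colorado", "Wyoming"),
  deaths = c(120, 45),
  stringsAsFactors = FALSE
)

# Apply mode() to every column; the result is a named character vector
col_modes <- sapply(toy, mode)
print(col_modes)

# Caveat: factors report a "numeric" mode, but is.numeric() returns FALSE for them
f <- factor(c("Female", "Male"))
print(mode(f))        # "numeric"
print(is.numeric(f))  # FALSE
```

If your data frame contains factor columns, testing with is.numeric(col) instead of mode(col) == "numeric" is one way to send them to the bar plot branch.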

Looking at visualizations like this for variables in a data set can be a useful way to identify potential problems. Look at the barplot for State, and notice that one category has more observations than the others. Which state is this (hint: the table and sort functions might be useful)? Create a new data frame by subsetting on the outlier state, and examine it. Is there cause for concern? Why or why not?

There are other variations on if statements, including using else if to test a second condition if the first is not met, and the switch function which matches an argument to one of several possibilities, and runs different code for each match. These are more advanced topics that will not be covered here.
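
Just so you can recognize it when you see it, here is a minimal sketch of else if. The labels are made up for this example. R tests the conditions from top to bottom and runs only the first block whose condition is TRUE:

```r
labels <- character(0)  # Empty character vector to collect results
for (i in 1:5) {
  if (i < 3) {
    labels <- c(labels, "small")
  } else if (i == 3) {   # Only tested when i < 3 is FALSE
    labels <- c(labels, "three")
  } else {               # Runs when neither condition is TRUE
    labels <- c(labels, "large")
  }
}
print(labels)  # "small" "small" "three" "large" "large"
```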

Create a for-loop which loops through each state. In the for-loop, determine whether there are more male or female deaths in the 45-54 age group. If there are more females, print “There are more female deaths in <State>”, where <State> is the state name for that iteration in the loop. If there are more males or the deaths are the same, then print “There are not more female deaths in <State>”.

6.1.2.2 Breaking Out of For-Loops

Sometimes it’s useful to be able to stop a for-loop before it has finished looping through the whole vector. This can be done with the break statement, which is usually placed within an if statement. Here’s an example:

for(i in 1:10){
  if(i == 6){
    break
  }
  print(i)
}
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5

The if condition is FALSE until i is 6, at which point the if condition is TRUE, so the break statement is run. The break statement causes R to exit from the loop before 6 is printed and before i is updated to 7. Hence we see the numbers 1-5 printed but not the numbers 6-10.
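
A common use of break is to stop searching as soon as you’ve found what you’re looking for. Here is a minimal sketch with made-up numbers, where the vector values and the cutoff of 10 are assumptions for illustration:

```r
values <- c(3, 8, 15, 4, 22)  # Hypothetical data
first_big <- NA               # Stays NA if nothing exceeds the cutoff

for (v in values) {
  if (v > 10) {
    first_big <- v  # Remember the first value above 10...
    break           # ...and stop looking
  }
}
print(first_big)  # 15
```

Even though 22 is also greater than 10, the loop never reaches it, because break ended the loop at 15.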

6.1.3 Formatting Conventions

Since R is a programming language, it is not immune to the common debates between programmers regarding proper formatting. For example, the following for-loops are all equivalent:

for(i in 1:3){
  print(i)
}
[1] 1
[1] 2
[1] 3

for(i in 1:3)
{
  print(i)
}
[1] 1
[1] 2
[1] 3

for(i in 1:3) {print(i)}
[1] 1
[1] 2
[1] 3

for(i in 1:3) print(i)
[1] 1
[1] 2
[1] 3

But different programmers (and sometimes programming communities) will have different ideas about what is best. For this book, we’ll use the first convention, but you could reasonably choose the second as well. The third and fourth conventions should probably only be used if the code block is very short (e.g. a single command). Mainly, we raise these differences because you may have to read code written by someone whose conventions differ from your own.

This is the last section you should include in Progress Check 4. Knit your output document and submit on Canvas.