5.7 Vignettes

In this section, you’ll gain more experience working with data by following along with some more data analysis examples.

5.7.1 Flood analysis example

In this example, we will analyze flood data to better understand the history of flooding over the last ten years in the Cache La Poudre River, which runs through Fort Collins. This vignette will help you learn three key ideas:

  1. Data can be read into R directly from online data services using packages which you will learn about later in this book.

  2. We can use this ‘live’ data to understand both past and present conditions in the river.

  3. We can use R to look at changes over time in river flow and water height.

5.7.1.1 Map

The river monitoring location for the Cache La Poudre River is right at Lincoln Bridge near Odell Brewing.

5.7.1.2 Installing and using packages

In order to download river flow and height data, we will first need to load a package called dataRetrieval. This package is maintained by the United States Geological Survey (USGS), and it provides access to data from over 8000 river monitoring stations in the United States and millions of water quantity and quality records. You can learn more about the data from the USGS here. To use a package, we first have to install it using the command install.packages and then load it using the command library.

# Install the package if it's not already installed by uncommenting the line 
# below

#install.packages('dataRetrieval')

# Load the package
library(dataRetrieval)
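A common variation on the install-then-load pattern is to install a package only when it is missing. The sketch below shows one way to do that in base R. The helper name needs_install is made up for this illustration and is not part of any package.

```r
# Hypothetical helper (not part of any package): TRUE when a package
# is not yet installed on this machine
needs_install <- function(pkg) {
  !requireNamespace(pkg, quietly = TRUE)
}

# The 'stats' package ships with R, so it never needs installing
needs_install('stats')

# The same check applied to dataRetrieval; left commented out because
# installing requires an internet connection
# if (needs_install('dataRetrieval')) install.packages('dataRetrieval')
# library(dataRetrieval)
```

With this pattern, the same script can be rerun without reinstalling the package each time.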

5.7.1.3 Downloading the data

Once we have loaded the package, we can use the special functions that it brings to the table. In this case, dataRetrieval provides a function called readNWISdv, which can download daily data (or daily values, hence readNWISdv) for specific monitoring locations. But how do we use this function? Try ?readNWISdv to get more details.

The readNWISdv command requires a few key arguments. The first is siteNumbers: these are simply the site identifiers that the USGS uses to track different monitoring stations. In our case, the number for the Cache La Poudre is 06752260, which you can find here.

The second argument is parameterCd, a cryptic code that indicates what kind of data you want to download. In this case we want to download river flow data. River flow tells us how much water is moving past a given point and is correlated with the height of the river water. The code for discharge, which means river flow, is 00060.

Lastly, we need to tell the readNWISdv command the time period for which we want data. This is set with startDate, which we will set to October 1, 2010, and endDate, which we will set to the current day using the command Sys.Date(). These arguments should be in the YYYY-MM-DD format. With all this knowledge of how the command works, we can finally download some data, directly into R!

poudre <- readNWISdv(siteNumbers = '06752260',
                     parameterCd = c('00060'),
                     startDate = '2010-10-01',
                     endDate = Sys.Date())
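Because the date arguments need to be in the YYYY-MM-DD format, it can help to check a date string before sending a request. The following is a minimal base R sketch. The helper name is_iso_date is hypothetical and is not part of dataRetrieval.

```r
# Hypothetical helper: TRUE when a string looks like YYYY-MM-DD
is_iso_date <- function(x) {
  grepl('^[0-9]{4}-[0-9]{2}-[0-9]{2}$', x)
}

is_iso_date('2010-10-01')   # the start date used above
is_iso_date('10/01/2010')   # a common US format that does not match

# Sys.Date() returns a Date object; as.character() gives its
# YYYY-MM-DD text form
is_iso_date(as.character(Sys.Date()))
```

This only checks the shape of the string, not whether the date itself is valid.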

5.7.1.4 Explore the data structure

Now that we have our data frame called poudre, we can explore its properties using commands we have already learned. First let’s see what the structure of the data is using the head command, which will print the first 6 rows of data.

head(poudre)

It looks like we have a data frame that is 5 columns wide. The first column, agency_cd, is just the agency (USGS), and site_no is the site ID we already supplied. Since we only asked for data from one site, we don’t really need this column. The Date column tells us… the date! The last two columns have odd-looking names. X_00060_00003 is the column that actually holds the values of river flow in cubic feet per second (cfs), that is, the volume of water flowing past a location each second (1 cubic foot ~ 7.5 gallons). Its name combines the parameter code 00060 with the statistic code 00003, which denotes a daily mean. Finally, X_00060_00003_cd tells us something about the quality of the data. For now we will ignore this final column, but if you were doing an analysis of this data for a project, you would want to investigate what codes are acceptable for high quality analyses. To make working with this data a little easier, let’s rename and remove some of our columns in a new, simpler data frame.

# Remove the agency, site number, and quality-code columns (1, 2, and 5)
pq <- poudre[,-c(1,2,5)]

# Rename the remaining columns
names(pq) <- c('Date','q_cfs')
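Dropping columns by position works, but it can pick the wrong columns if their order ever changes. One alternative is to keep columns by name. The sketch below uses a small made-up data frame: only the column names match the download, and the flow values and quality codes are invented for illustration.

```r
# Made-up stand-in for the downloaded data (values are invented)
demo <- data.frame(
  agency_cd = 'USGS',
  site_no = '06752260',
  Date = as.Date('2010-10-01') + 0:2,
  X_00060_00003 = c(10, 12, 11),
  X_00060_00003_cd = 'A',
  stringsAsFactors = FALSE
)

# Keep the two columns we need, by name instead of by position
pq_demo <- demo[, c('Date', 'X_00060_00003')]

# Rename the remaining columns
names(pq_demo) <- c('Date', 'q_cfs')
pq_demo
```

Selecting by name also documents which columns the rest of the analysis depends on.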

5.7.1.5 Explore the data

Now that we have cleaned up our data frame a little, let’s explore the data. First we can use the summary function to just quickly look at the variables we have.

summary(pq)
      Date                q_cfs        
 Min.   :2010-10-01   Min.   :   1.31  
 1st Qu.:2014-03-11   1st Qu.:  23.00  
 Median :2017-08-19   Median :  63.00  
 Mean   :2017-08-19   Mean   : 222.68  
 3rd Qu.:2021-01-27   3rd Qu.: 148.00  
 Max.   :2024-07-07   Max.   :7150.00  

It looks like we have data from 2010-10-01 to 2024-07-07 and a range in river flow (q_cfs) from 1.31 cfs all the way up to 7150 cfs. If you’re a hydrologist, hopefully these flow numbers look right, but another way to check is to simply plot the data as we do below.

plot(pq$q_cfs ~ pq$Date, type = 'l', ylab = 'River Flow (cfs)', 
     xlab = 'Date', col = 'blue3')

The above plot is called a “hydrograph” or a plot of how river flow has changed over time. In the Cache La Poudre, what might explain the peaks and valleys of the data? What controls river flow in Colorado rivers?
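To connect the peaks in a hydrograph to specific events, it helps to find when the largest flows occurred. Here is a minimal sketch using a made-up data frame with the same two columns as pq. The dates and flow values are invented, and the 1000 cfs threshold is an arbitrary choice for illustration.

```r
# Made-up daily flows (values are invented for illustration)
pq_demo <- data.frame(
  Date = as.Date('2020-06-01') + 0:4,
  q_cfs = c(150, 900, 4200, 2100, 800)
)

# Row with the single largest flow
peak <- pq_demo[which.max(pq_demo$q_cfs), ]
peak

# Dates of all days above an arbitrary threshold
high <- pq_demo[pq_demo$q_cfs > 1000, ]
high$Date
```

Applied to the real data, the same two steps would give the dates to look up when asking what caused a given peak.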

5.7.1.6 Analyze the data

Now that we’ve plotted the data, we can see some interesting patterns that we might want to explore. For example, how has the average flow changed in the last ten years? To analyze this, we need to use the concept of a Water Year. Simply put, a water year is a way to analyze yearly variation in flow, which doesn’t map well to a calendar year. Water years in the USA typically run from October 1 to September 30 and are named for the calendar year in which they end. Luckily for us, the dataRetrieval package has a function that calculates water year for us. It’s simply called addWaterYear.

pq_wy <- addWaterYear(pq)
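To see what a water-year calculation involves, here is a minimal base R sketch of the rule. Under the USGS convention, a water year is named for the calendar year in which it ends, so dates from October through December belong to the following year's water year. The function name water_year is made up for this example, and the sketch is not meant to reproduce the package's own implementation.

```r
# Hypothetical helper: water year for a vector of Date values
water_year <- function(dates) {
  year <- as.integer(format(dates, '%Y'))
  month <- as.integer(format(dates, '%m'))
  # October, November, and December count toward the next year
  ifelse(month >= 10, year + 1L, year)
}

d <- as.Date(c('2010-09-30', '2010-10-01', '2011-09-30'))
water_year(d)
# 2010 2011 2011
```

Note that October 1, 2010 and September 30, 2011 fall in the same water year even though they are in different calendar years.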

To look at variation per year, we can use the function tapply, which can take the mean, max, or any summary function of groups of data (more on this in the next chapter, but you can type ?tapply for more info). In this case, we want to look at the mean river flow for each water year. To use tapply, we use the following code:

annual <- tapply(pq_wy$q_cfs, pq_wy$waterYear, mean)
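To make the grouping behavior of tapply concrete, here is a small sketch on made-up numbers (the flows and years are invented). The first argument holds the values, the second defines the groups, and the third is the summary function applied to each group.

```r
# Made-up flows for two water years (values are invented)
demo <- data.frame(
  waterYear = c(2011, 2011, 2011, 2012, 2012),
  q_cfs = c(10, 20, 30, 100, 200)
)

# One summary value per water year
annual_mean <- tapply(demo$q_cfs, demo$waterYear, mean)
annual_max <- tapply(demo$q_cfs, demo$waterYear, max)

annual_mean   # 20 for 2011, 150 for 2012
annual_max    # 30 for 2011, 200 for 2012

# The group labels are stored as the names of the result
names(annual_mean)
```

Swapping mean for max, min, or median changes the summary without changing the grouping.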

Now let’s plot the data, where the values (y) are the annual average flow and the years (x) are the names of the annual vector from the tapply function.

# names(annual) are text labels, so convert them to numbers for the x axis
plot(as.numeric(names(annual)), annual, xlab = "Water Year", ylab = "Annual Average Flow (cfs)")

Has annual mean river flow declined over the past ten years? What about the last 6?
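One simple way to approach these questions, offered here as a sketch and not as the only or best method, is to subset the most recent years with tail and look at the sign of the slope of a straight-line fit. The annual values below are invented for illustration, so the result says nothing about the actual river.

```r
# Made-up annual mean flows, named by water year (values are invented)
annual_demo <- setNames(
  c(300, 250, 280, 200, 220, 180, 210, 160, 190, 150),
  2015:2024
)
years <- as.numeric(names(annual_demo))

# Keep only the six most recent water years
recent <- tail(annual_demo, 6)
names(recent)

# Slope of a straight-line fit in cfs per year; a negative value
# means flow fell on average over the period
slope_all <- coef(lm(annual_demo ~ years))[['years']]
slope_all
```

With only a handful of years, a slope like this is a rough description of the pattern and not strong evidence of a trend.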