3.2 Functional Programming with purrr

In the previous subsection, we explored how to reshape and wrangle data with the tidyr and dplyr packages. We broke the functions up into sections for easier cross-referencing, but in practice they can (and should) be combined using pipes to avoid intermediate assignments. The following pipeline performs every step needed to scrape and clean the rushing and receiving statistics for the 2020 Denver Broncos.

"https://www.pro-football-reference.com/teams/den/2020.htm" %>%
  # scrape team stats
  get_team_stats(.) %>%
  # clean results of scrape
  mutate(., age = as.numeric(age),
         pos = pos %>% 
           str_to_upper() %>% 
           na_if(., '') %>% 
           as_factor(),
         receiving_ctch_percent = str_remove_all(receiving_ctch_percent, '%')) %>%
  mutate_at(., vars(starts_with('games_'), 
                    starts_with('rushing_'), 
                    starts_with('receiving_'), 
                    starts_with('yds_')), 
            ~as.numeric(.x)) %>%
  mutate_all(., ~na_if(.x, '')) %>%
  filter(., !(player %in% c('Team Total', 'Opp Total')), !is.na(pos), pos != 'P') %>%
  select_at(., vars(team, no, player, age, pos, 
                    starts_with('games_'),
                    starts_with('rushing_'), 
                    starts_with('receiving_')))
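
The scoped verbs mutate_at, mutate_all, and select_at used above still work, but since dplyr 1.0.0 they are superseded by across() and plain select(). As a minimal sketch of the same cleaning idea in that newer style, the block below uses a small hypothetical tibble, toy_stats, whose players and values are made up for illustration and are not scraped data.

```r
library(dplyr)
library(stringr)
library(forcats)

# hypothetical stand-in for a scraped table: every column arrives as character
toy_stats <- tibble(
  player = c("Player A", "Player B", "Team Total"),
  age = c("26", "23", "27.1"),
  pos = c("rb", "", ""),
  rushing_att = c("239", "133", "479"),
  receiving_ctch_percent = c("80.6%", "", "")
)

toy_clean <- toy_stats %>%
  # standardize position labels and strip the percent sign
  mutate(pos = pos %>% str_to_upper() %>% na_if("") %>% as_factor(),
         receiving_ctch_percent = str_remove_all(receiving_ctch_percent, "%")) %>%
  # across() replaces mutate_at(): convert the selected columns to numeric
  mutate(across(c(age, starts_with("rushing_"), starts_with("receiving_")),
                as.numeric)) %>%
  # drop summary rows and players without a listed position
  filter(!(player %in% c("Team Total", "Opp Total")), !is.na(pos)) %>%
  # plain select() accepts selection helpers directly, replacing select_at()
  select(player, age, pos, starts_with("rushing_"), starts_with("receiving_"))
```

Only the row for the hypothetical Player A survives the filter, and its statistics are stored as numbers rather than text.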

We’ve outlined how to get rushing and receiving statistics for the 2020 Denver Broncos, but now we would like to obtain the same statistics for every NFL team. Since Pro Football Reference reports the same statistics for each team in the same format, we should be able to scrape and clean each team’s statistics using the code above, changing only the web address.

When we need to iterate over an object, we can reach for the purrr package, which contains functions for mapping over the elements of a vector, data frame, or list. In our case, we have a vector of web addresses, and for each address we would like to scrape the rushing and receiving statistics before cleaning the result. First, we use the map function to iterate over the vector of web addresses, invoking get_team_stats on each one. The result is a list of data frames in which each element holds one team’s scraped rushing and receiving statistics. The helper function set_names allows us to specify the name of each element of the resulting list; the stringr function str_sub can supply those names by extracting the three-letter team code from each web address.
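
As a minimal sketch of the naming step, the block below builds a small hypothetical vector, example_urls, following the address pattern of the Broncos page, names its elements with set_names and str_sub, and then maps over it. Because get_team_stats requires a live connection, str_length stands in for it here; the character positions 46 to 48 are an assumption that holds only for addresses of exactly this form.

```r
library(purrr)
library(stringr)

# hypothetical subset of team codes; the full vector would hold all 32 teams
team_codes <- c("crd", "atl", "den")
example_urls <- str_c("https://www.pro-football-reference.com/teams/",
                      team_codes, "/2020.htm")

named_urls <- example_urls %>%
  # name each address by the three-letter code at characters 46 to 48
  set_names(., str_sub(., 46, 48))

url_lengths <- named_urls %>%
  # str_length is a stand-in for get_team_stats; map() carries the names along
  map(., ~str_length(.x))

names(url_lengths)
```

Since map preserves names, the elements of the result can be retrieved by team code, for example url_lengths$den.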

team_stats <- team_urls %>%
  # create list where each element is the result of the function get_team_stats
  # applied to the corresponding web address
  map(., ~get_team_stats(.x))

class(team_stats)

# print only the first two elements of the list
team_stats[1]
team_stats[2]
[1] "list"
[[1]]
# A tibble: 20 x 28
   team  no    player       age   pos   games_g games_gs rushing_att rushing_yds
   <chr> <chr> <chr>        <chr> <chr> <chr>   <chr>    <chr>       <chr>      
 1 crd   "41"  Kenyan Drake "26"  "RB"  15      "13"     239         955        
 2 crd   "1"   Kyler Murra~ "23"  "QB"  16      "16"     133         819        
 3 crd   "29"  Chase Edmon~ "24"  ""    16      "2"      97          448        
 4 crd   "15"  Chris Strev~ "25"  ""    5       "0"      4           15         
 5 crd   "13"  Christian K~ "24"  "WR"  14      "10"     2           3          
 6 crd   "37"  D.J. Foster  "27"  ""    10      "0"      2           2          
 7 crd   "10"  DeAndre Hop~ "28"  "WR"  16      "16"     1           1          
 8 crd   "17"  Andy Isabel~ "24"  ""    13      "2"      1           -6         
 9 crd   "11"  Larry Fitzg~ "37"  "WR"  13      "13"     0           0          
10 crd   "85"  Dan Arnold   "25"  "te"  16      "5"      0           0          
11 crd   "19"  KeeSean Joh~ "24"  ""    8       "1"      0           0          
12 crd   "81"  Darrell Dan~ "26"  "te"  12      "8"      0           0          
13 crd   "87"  Maxx Willia~ "26"  "TE"  9       "8"      0           0          
14 crd   "16"  Trent Sherf~ "24"  ""    15      "1"      0           0          
15 crd   "80"  Jordan Thom~ "24"  ""    4       "0"      0           0          
16 crd   "47"  Zeke Turner  "24"  ""    16      "0"      0           0          
17 crd   "38"  Jonathan Wa~ "23"  ""    14      "0"      0           0          
18 crd   "86"  Seth Devalve "27"  ""    4       "0"      0           0          
19 crd   ""    Team Total   "27.~ ""    16      ""       479         2237       
20 crd   ""    Opp Total    ""    ""    16      ""       436         2008       
# ... with 19 more variables: rushing_td <chr>, rushing_lng <chr>,
#   rushing_y_a <chr>, rushing_y_g <chr>, rushing_a_g <chr>, rushing_fmb <chr>,
#   receiving_tgt <chr>, receiving_rec <chr>, receiving_yds <chr>,
#   receiving_y_r <chr>, receiving_td <chr>, receiving_lng <chr>,
#   receiving_r_g <chr>, receiving_y_g <chr>, receiving_ctch_percent <chr>,
#   receiving_y_tgt <chr>, yds_touch <chr>, yds_y_tch <chr>, yds_y_scm <chr>

[[1]]
# A tibble: 21 x 28
   team  no    player       age   pos   games_g games_gs rushing_att rushing_yds
   <chr> <chr> <chr>        <chr> <chr> <chr>   <chr>    <chr>       <chr>      
 1 atl   21    Todd Gurley  26    "RB"  15      15       195         678        
 2 atl   23    Brian Hill   25    ""    16      1        100         465        
 3 atl   25    Ito Smith    25    ""    14      0        63          268        
 4 atl   2     Matt Ryan    35    "QB"  16      16       29          92         
 5 atl   18    Calvin Ridl~ 26    "WR"  15      15       5           1          
 6 atl   40    Keith Smith  28    "FB"  16      7        4           7          
 7 atl   36    Tony Brooks~ 26    ""    1       0        3           4          
 8 atl   8     Matt Schaub  39    ""    1       0        3           -4         
 9 atl   83    Russell Gage 24    "wr"  16      8        2           9          
10 atl   15    Brandon Pow~ 25    ""    15      1        2           7          
# ... with 11 more rows, and 19 more variables: rushing_td <chr>,
#   rushing_lng <chr>, rushing_y_a <chr>, rushing_y_g <chr>, rushing_a_g <chr>,
#   rushing_fmb <chr>, receiving_tgt <chr>, receiving_rec <chr>,
#   receiving_yds <chr>, receiving_y_r <chr>, receiving_td <chr>,
#   receiving_lng <chr>, receiving_r_g <chr>, receiving_y_g <chr>,
#   receiving_ctch_percent <chr>, receiving_y_tgt <chr>, yds_touch <chr>,
#   yds_y_tch <chr>, yds_y_scm <chr>

The first two elements of the resulting list are the scraped rushing and receiving statistics for the 2020 Arizona Cardinals and the 2020 Atlanta Falcons. These data sets, along with the other thirty teams’ statistics, need to be cleaned in the same manner as the 2020 Denver Broncos. While we originally used the map function to iterate over a vector of web addresses, we can also use it to iterate over the list of each team’s scraped statistics, applying the same cleaning procedure to each element. Rather than use a lambda expression, we define an anonymous function with the function keyword. Lambda expressions are a convenient and concise way to specify a transformation that consists of a single function call. Previously, the entire act of scraping was described by one function: get_team_stats(). Since the cleaning procedure chains several functions, such as mutate, filter, and select, it is more readable to specify the transformation as a full function rather than a lambda expression.

team_stats_clean <- team_stats %>%
  # clean
  map(., function(x){
    x %>%
      mutate(., age = as.numeric(age),
             pos = pos %>% 
               str_to_upper() %>% 
               na_if(., '') %>% 
               as_factor(),
             receiving_ctch_percent = str_remove_all(receiving_ctch_percent, '%')) %>%
      mutate_at(., vars(starts_with('games_'), 
                        starts_with('rushing_'), 
                        starts_with('receiving_'), 
                        starts_with('yds_')), 
                ~as.numeric(.x)) %>%
      mutate_all(., ~na_if(.x, '')) %>%
      filter(., !(player %in% c('Team Total', 'Opp Total')), !is.na(pos), pos != 'P') %>%
      select_at(., vars(team, no, player, age, pos, 
                        starts_with('games_'),
                        starts_with('rushing_'), 
                        starts_with('receiving_')))
    }
  )

Any function passed to map can also be written as a lambda expression: we first define the desired transformation as a named function and then call that function inside a lambda expression in the map.

# define function with desired transformation
clean_team_stats <- function(x){
  x %>%
    mutate(., age = as.numeric(age),
           pos = pos %>% 
             str_to_upper() %>% 
             na_if(., '') %>% 
             as_factor(),
           receiving_ctch_percent = str_remove_all(receiving_ctch_percent, '%')) %>%
    mutate_at(., vars(starts_with('games_'), 
                      starts_with('rushing_'), 
                      starts_with('receiving_'), 
                      starts_with('yds_')), 
              ~as.numeric(.x)) %>%
    mutate_all(., ~na_if(.x, '')) %>%
    filter(., !(player %in% c('Team Total', 'Opp Total')), !is.na(pos), pos != 'P') %>%
    select_at(., vars(team, no, player, age, pos, 
                      starts_with('games_'),
                      starts_with('rushing_'), 
                      starts_with('receiving_')))
}

# map desired transformation to each team's scraped statistics using lambda
# expression
team_stats_clean_lambda <- team_stats %>%
  map(., ~clean_team_stats(.x))
  
# show that using a function in map and a lambda expression in map give the same result
identical(team_stats_clean, team_stats_clean_lambda)
[1] TRUE

After mapping the cleaning procedure over each team’s scraped statistics, the result is a list in which each element is a tidy data frame of one team’s rushing and receiving statistics. As discussed in Chapter 2, we can rectangularize this list of data frames by binding their rows together; the team column in each data frame indicates which team each row belongs to.

team_stats_clean %>%
  bind_rows(.)
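
When the data frames in a list do not already carry an identifying column, bind_rows can create one from the names of the list through its .id argument. The sketch below assumes a hypothetical named list, toy_list, with made-up values; it is not the scraped data.

```r
library(dplyr)

# hypothetical named list of small data frames, one per team
toy_list <- list(
  crd = tibble(player = c("Player A", "Player B"), rushing_yds = c(10, 20)),
  atl = tibble(player = "Player C", rushing_yds = 5)
)

# .id adds a column, here called team, filled with the name of each list element
combined <- bind_rows(toy_list, .id = "team")

combined
```

The result has three rows, and the new team column records which list element each row came from.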

Since the purrr package is a member of the Tidyverse, it was designed with the principles of tidy data in mind. Transforming a list into a vector or, in our case, a data frame is a common last step after mapping a function. To accommodate this need, the purrr package provides typed variants of map, such as map_dbl and map_chr, which return a vector of type double or character, respectively, instead of a list. There are also map_dfr and map_dfc, which bind the list elements together row-wise or column-wise, respectively, and return a data frame. If the result cannot be converted to the requested type, an error is thrown. The mapping illustrated in this example can be written concisely as

rush_receive <- team_urls %>%
  map(., ~get_team_stats(.x)) %>%
  map_dfr(., ~clean_team_stats(.x))
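
To make the typed variants concrete, here is a minimal sketch on a hypothetical named list, yards, holding a few illustrative rushing totals per team. It shows map_dbl and map_chr returning named vectors, the error raised when a result is not a single value, and map_dfr building a data frame with an identifying column.

```r
library(purrr)
library(tibble)

# hypothetical rushing yards for three players on each of two teams
yards <- list(crd = c(955, 819, 448), atl = c(678, 465, 268))

# map_dbl returns a named double vector with one value per list element
total_yards <- map_dbl(yards, sum)

# map_chr returns a named character vector
top_label <- map_chr(yards, ~as.character(max(.x)))

# each element must reduce to a single value; returning all three yardages
# per team cannot be stored in one slot, so map_dbl signals an error
conversion_failed <- tryCatch({
  map_dbl(yards, ~.x)
  FALSE
}, error = function(e) TRUE)

# map_dfr row-binds one small data frame per team; .id stores the list names
team_totals <- map_dfr(yards, ~tibble(total = sum(.x)), .id = "team")
```

In purrr 1.0.0 and later, map_dfr and map_dfc are marked as superseded in favor of calling map and then list_rbind or list_cbind, although they remain available.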