3.2 Functional Programming with purrr
In the previous subsection, we explored how to reshape and wrangle data with the tidyr and dplyr packages. We broke the functions up into sections for easier cross-reference, but in practice they can (and should) be combined using pipes to avoid intermediate assignments. The following code outlines all the steps necessary to clean the rushing and receiving statistics for the 2020 Denver Broncos.
"https://www.pro-football-reference.com/teams/den/2020.htm" %>%
  # scrape team stats
  get_team_stats(.) %>%
  # clean results of scrape
  # convert empty strings to NA first, while every column is still character;
  # recent dplyr versions are strict about na_if() argument types, so a
  # numeric column can no longer be compared with ''
  mutate_all(., ~na_if(.x, '')) %>%
  mutate(., age = as.numeric(age),
         pos = pos %>%
           str_to_upper() %>%
           as_factor(),
         receiving_ctch_percent = str_remove_all(receiving_ctch_percent, '%')) %>%
  mutate_at(., vars(starts_with('games_'),
                    starts_with('rushing_'),
                    starts_with('receiving_'),
                    starts_with('yds_')),
            ~as.numeric(.x)) %>%
  filter(., !(player %in% c('Team Total', 'Opp Total')), !is.na(pos), pos != 'P') %>%
  select_at(., vars(team, no, player, age, pos,
                    starts_with('games_'),
                    starts_with('rushing_'),
                    starts_with('receiving_')))
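The scoped verbs used above (mutate_at, mutate_all, select_at) still work, but since dplyr 1.0.0 they have been superseded by across() and the ordinary select(). The following is a minimal sketch of the same cleaning steps in that style, run on a made-up tibble. The object toy_stats and all of its values are hypothetical stand-ins for the output of get_team_stats, so this is an illustration rather than a replacement for the listing above.

```r
library(dplyr)
library(stringr)
library(forcats)

# hypothetical stand-in for one team's scraped table (every column is character)
toy_stats <- tibble(
  team = c("den", "den", "den", "den"),
  player = c("Player A", "Player B", "Player C", "Team Total"),
  age = c("25", "31", "28", ""),
  pos = c("rb", "QB", "", ""),
  rushing_att = c("245", "44", "", "442"),
  receiving_ctch_percent = c("75.0%", "", "50.0%", "")
)

toy_clean <- toy_stats %>%
  # turn empty strings into NA while all columns are still character
  mutate(across(everything(), ~na_if(.x, ""))) %>%
  mutate(
    age = as.numeric(age),
    pos = as_factor(str_to_upper(pos)),
    receiving_ctch_percent = str_remove_all(receiving_ctch_percent, "%")
  ) %>%
  # convert the statistic columns to numeric
  mutate(across(c(starts_with("rushing_"), starts_with("receiving_")), as.numeric)) %>%
  # drop summary rows, players without a position, and punters
  filter(!(player %in% c("Team Total", "Opp Total")), !is.na(pos), pos != "P") %>%
  select(team, player, age, pos, starts_with("rushing_"), starts_with("receiving_"))

toy_clean
```

Only the two rows with a recorded position survive the filter, and the statistic columns come back as numeric.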
We’ve outlined how to get rushing and receiving statistics for the 2020 Denver Broncos, but now we would like to obtain the same statistics for every NFL team. Since Pro-Football Reference reports the same statistics for each team in the same manner, we hope to scrape and clean each team’s statistics using the code above, changing only the web address.
When we need to iterate over an object, we should reach for the purrr package, which contains functions allowing us to map over the elements of a vector, data frame, or list. In our case, we have a vector of webpages, and for each webpage, we would like to scrape the rushing and receiving statistics before cleaning the result. First, we will use the map function to iterate over the vector of webpages. For each webpage, we will invoke get_team_stats. The result is a list of data frames where each element of the list corresponds to a specific team’s scraped rushing and receiving statistics. The helper function set_names allows us to specify the name of each element of the resulting list. The function str_sub extracts the three-letter code in the team web address that identifies the team.
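The naming step can be sketched without scraping anything. The block below is a minimal illustration, assuming every address follows the same layout as the Broncos address above. The vector example_urls and the character positions 46 through 48 are assumptions made for this sketch, and nchar() merely stands in for get_team_stats.

```r
library(purrr)
library(stringr)

# hypothetical addresses that follow the layout of the Broncos address above
example_urls <- c(
  "https://www.pro-football-reference.com/teams/crd/2020.htm",
  "https://www.pro-football-reference.com/teams/atl/2020.htm",
  "https://www.pro-football-reference.com/teams/den/2020.htm"
)

# under this layout, the team code occupies characters 46 through 48
team_codes <- str_sub(example_urls, 46, 48)

# nchar() is only a placeholder for get_team_stats(), so nothing is downloaded
example_list <- example_urls %>%
  map(., ~nchar(.x)) %>%
  set_names(., team_codes)

names(example_list)
```

With names in place, a single team can be pulled out by its code, for example example_list$den or example_list[["den"]].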
team_stats <- team_urls %>%
  # create a list where each element is the result of the function get_team_stats
  # applied to the corresponding web address
  map(., ~get_team_stats(.x))
class(team_stats)
# print only the first two elements of the list
team_stats[1]
team_stats[2]
[1] "list"
[[1]]
# A tibble: 20 x 28
team no player age pos games_g games_gs rushing_att rushing_yds
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 crd "41" Kenyan Drake "26" "RB" 15 "13" 239 955
2 crd "1" Kyler Murra~ "23" "QB" 16 "16" 133 819
3 crd "29" Chase Edmon~ "24" "" 16 "2" 97 448
4 crd "15" Chris Strev~ "25" "" 5 "0" 4 15
5 crd "13" Christian K~ "24" "WR" 14 "10" 2 3
6 crd "37" D.J. Foster "27" "" 10 "0" 2 2
7 crd "10" DeAndre Hop~ "28" "WR" 16 "16" 1 1
8 crd "17" Andy Isabel~ "24" "" 13 "2" 1 -6
9 crd "11" Larry Fitzg~ "37" "WR" 13 "13" 0 0
10 crd "85" Dan Arnold "25" "te" 16 "5" 0 0
11 crd "19" KeeSean Joh~ "24" "" 8 "1" 0 0
12 crd "81" Darrell Dan~ "26" "te" 12 "8" 0 0
13 crd "87" Maxx Willia~ "26" "TE" 9 "8" 0 0
14 crd "16" Trent Sherf~ "24" "" 15 "1" 0 0
15 crd "80" Jordan Thom~ "24" "" 4 "0" 0 0
16 crd "47" Zeke Turner "24" "" 16 "0" 0 0
17 crd "38" Jonathan Wa~ "23" "" 14 "0" 0 0
18 crd "86" Seth Devalve "27" "" 4 "0" 0 0
19 crd "" Team Total "27.~ "" 16 "" 479 2237
20 crd "" Opp Total "" "" 16 "" 436 2008
# ... with 19 more variables: rushing_td <chr>, rushing_lng <chr>,
# rushing_y_a <chr>, rushing_y_g <chr>, rushing_a_g <chr>, rushing_fmb <chr>,
# receiving_tgt <chr>, receiving_rec <chr>, receiving_yds <chr>,
# receiving_y_r <chr>, receiving_td <chr>, receiving_lng <chr>,
# receiving_r_g <chr>, receiving_y_g <chr>, receiving_ctch_percent <chr>,
# receiving_y_tgt <chr>, yds_touch <chr>, yds_y_tch <chr>, yds_y_scm <chr>
[[1]]
# A tibble: 21 x 28
team no player age pos games_g games_gs rushing_att rushing_yds
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 atl 21 Todd Gurley 26 "RB" 15 15 195 678
2 atl 23 Brian Hill 25 "" 16 1 100 465
3 atl 25 Ito Smith 25 "" 14 0 63 268
4 atl 2 Matt Ryan 35 "QB" 16 16 29 92
5 atl 18 Calvin Ridl~ 26 "WR" 15 15 5 1
6 atl 40 Keith Smith 28 "FB" 16 7 4 7
7 atl 36 Tony Brooks~ 26 "" 1 0 3 4
8 atl 8 Matt Schaub 39 "" 1 0 3 -4
9 atl 83 Russell Gage 24 "wr" 16 8 2 9
10 atl 15 Brandon Pow~ 25 "" 15 1 2 7
# ... with 11 more rows, and 19 more variables: rushing_td <chr>,
# rushing_lng <chr>, rushing_y_a <chr>, rushing_y_g <chr>, rushing_a_g <chr>,
# rushing_fmb <chr>, receiving_tgt <chr>, receiving_rec <chr>,
# receiving_yds <chr>, receiving_y_r <chr>, receiving_td <chr>,
# receiving_lng <chr>, receiving_r_g <chr>, receiving_y_g <chr>,
# receiving_ctch_percent <chr>, receiving_y_tgt <chr>, yds_touch <chr>,
# yds_y_tch <chr>, yds_y_scm <chr>
The first two elements of the resulting list are the scraped rushing and receiving statistics for the 2020 Arizona Cardinals and the 2020 Atlanta Falcons. These data sets, along with the other thirty teams’ statistics, need to be cleaned in the same manner as the 2020 Denver Broncos. While we originally used the map function to iterate over a vector of web addresses, we can also use it to iterate over the list of each team’s scraped statistics, applying the same cleaning procedure to each element. Rather than use a lambda expression, we define a function. Lambda expressions are a convenient and concise way to specify a transformation if the transformation is defined by a single function. Previously, the entire act of scraping was described by one function: get_team_stats(). Since the cleaning procedure consists of multiple functions such as mutate, filter, and select, it is more convenient to specify the transformation in a function rather than in a lambda expression.
team_stats_clean <- team_stats %>%
  # clean
  map(., function(x){
    x %>%
      # convert empty strings to NA first, while every column is still character;
      # recent dplyr versions are strict about na_if() argument types
      mutate_all(., ~na_if(.x, '')) %>%
      mutate(., age = as.numeric(age),
             pos = pos %>%
               str_to_upper() %>%
               as_factor(),
             receiving_ctch_percent = str_remove_all(receiving_ctch_percent, '%')) %>%
      mutate_at(., vars(starts_with('games_'),
                        starts_with('rushing_'),
                        starts_with('receiving_'),
                        starts_with('yds_')),
                ~as.numeric(.x)) %>%
      filter(., !(player %in% c('Team Total', 'Opp Total')), !is.na(pos), pos != 'P') %>%
      select_at(., vars(team, no, player, age, pos,
                        starts_with('games_'),
                        starts_with('rushing_'),
                        starts_with('receiving_')))
  }
  )
A function defined inside a map can always be rewritten as a lambda expression: first define the desired transformation as a named function, then call that function in a lambda expression within the map.
# define function with desired transformation
clean_team_stats <- function(x){
  x %>%
    # convert empty strings to NA first, while every column is still character;
    # recent dplyr versions are strict about na_if() argument types
    mutate_all(., ~na_if(.x, '')) %>%
    mutate(., age = as.numeric(age),
           pos = pos %>%
             str_to_upper() %>%
             as_factor(),
           receiving_ctch_percent = str_remove_all(receiving_ctch_percent, '%')) %>%
    mutate_at(., vars(starts_with('games_'),
                      starts_with('rushing_'),
                      starts_with('receiving_'),
                      starts_with('yds_')),
              ~as.numeric(.x)) %>%
    filter(., !(player %in% c('Team Total', 'Opp Total')), !is.na(pos), pos != 'P') %>%
    select_at(., vars(team, no, player, age, pos,
                      starts_with('games_'),
                      starts_with('rushing_'),
                      starts_with('receiving_')))
}
# map desired transformation to each team's scraped statistics using lambda
# expression
team_stats_clean_lambda <- team_stats %>%
  map(., ~clean_team_stats(.x))
# show that the results of using a function in map and a lambda expression in
# map are the same
identical(team_stats_clean, team_stats_clean_lambda)
[1] TRUE
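The forms compared above are interchangeable with a third one: passing the function to map by name. The following minimal sketch uses toy data, where double_it and toy_list are hypothetical names that stand in for clean_team_stats and team_stats.

```r
library(purrr)

# hypothetical stand-ins for clean_team_stats and team_stats
double_it <- function(x) x * 2
toy_list <- list(a = 1:3, b = 4:6)

by_name   <- map(toy_list, double_it)                  # function passed by name
by_lambda <- map(toy_list, ~double_it(.x))             # formula (lambda) shorthand
by_anon   <- map(toy_list, function(x) double_it(x))   # anonymous function

identical(by_name, by_lambda)
identical(by_name, by_anon)
```

Which form to use is largely a matter of readability: a named function keeps the map call short and lets the transformation be reused and tested on a single element.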
After mapping the cleaning procedure to each team’s scraped statistics, the result is a list where each element is a tidy data frame consisting of a team’s rushing and receiving statistics. As discussed in Chapter 2, we can rectangularize this list of data frames by binding the rows of each of these data frames, creating a column to indicate which team each row belongs to.
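A minimal sketch of this binding step, assuming the list elements are named by team code. The list toy_clean_list, the column name team_code, and all values are made up for illustration.

```r
library(dplyr)

# hypothetical cleaned list with one small data frame per team
toy_clean_list <- list(
  crd = tibble(player = c("Player A", "Player B"), rushing_yds = c(100, 50)),
  atl = tibble(player = "Player C", rushing_yds = 75)
)

# .id creates a column holding the name of the list element each row came from
all_teams <- bind_rows(toy_clean_list, .id = "team_code")
all_teams
```

If the list is unnamed, bind_rows fills the .id column with the position of each element instead, which is one reason to name the list elements beforehand.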
Since the purrr package is a member of the Tidyverse, it was designed with the principles of tidy data in mind. Transforming a list into a vector or, in our case, a data frame is a common last step after mapping a function. To accommodate this need, the purrr package provides specific maps such as map_dbl and map_chr, which transform the resulting list into a vector of type double or character, respectively. There are also map_dfr and map_dfc, which bind the list elements together row-wise or column-wise, respectively, to return a data frame. If the result cannot be transformed into the requested type, an error is returned. The mapping illustrated in this example can be written concisely as