3.3 Parallelization with furrr
When we map the function to scrape team statistics over the vector of team web addresses, we are specifying that the function inside the map call (e.g. get_team_stats) should be applied to each element of the vector (e.g. team_urls) sequentially. Computational tasks that consist of many separate, independently executable jobs are good candidates to be run in parallel. When jobs are run in parallel, they run at the same time rather than one after another. If the computational burden of each step is larger than the overhead of setting up the instructions for parallelization, then running the code in parallel will save time.
While there are different types of parallelization, we will focus on only one: multi-core parallelization, which allows us to make use of the whole computer rather than relying on a single processor. furrr, a relatively new package written in the tidyverse style, attempts to make mapping in parallel easy and pain-free by combining the functionality of the purrr package and the future package. By pairing purrr's mapping capabilities with future's parallel processing capabilities, furrr allows us to parallelize our code with nearly identical syntax.
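The syntactic parity is easiest to see side by side. In this minimal sketch (toy inputs, not the scraping functions from this chapter), swapping map_dbl for future_map_dbl after declaring a plan is the only change required:

```r
library(purrr)
library(furrr)

plan(multisession)  # declare how parallel futures should be resolved

squares_seq <- map_dbl(1:5, ~ .x^2)         # purrr: sequential
squares_par <- future_map_dbl(1:5, ~ .x^2)  # furrr: same syntax, parallel

identical(squares_seq, squares_par)  # returns TRUE

plan(sequential)
```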
In the previous subsection, we learned how to apply mapping functions to scrape and clean the statistics for each team, sequentially. Using furrr
, we can parallelize this process. We will juxtapose each approach for comparison.
# sequential computation using purrr
rush_receive <- team_urls %>%
  map(~get_team_stats(.x)) %>%
  map_dfr(~clean_team_stats(.x))
# parallel computation using furrr
library(furrr)
plan(multisession)  # multiprocess is deprecated in recent versions of future
rush_receive_parallel <- team_urls %>%
  future_map(~get_team_stats(.x)) %>%
  future_map_dfr(~clean_team_stats(.x))
# compare output
identical(rush_receive, rush_receive_parallel)
[1] TRUE
Let’s compare the speed of the operations.
# sequentially
system.time(
  rush_receive <- team_urls %>%
    map(~get_team_stats(.x)) %>%
    map_dfr(~clean_team_stats(.x))
)
   user  system elapsed
   2.65    0.09    3.65
# parallel
system.time(
  rush_receive_parallel <- team_urls %>%
    future_map(~get_team_stats(.x)) %>%
    future_map_dfr(~clean_team_stats(.x))
)
   user  system elapsed
   0.28    0.03    1.09
We can see that parallelizing this process is more than three times faster (1.09 versus 3.65 seconds elapsed) than applying these functions sequentially. This effect is only amplified when we increase the number of sources to scrape.
player_urls <- team_urls %>%
  map(~get_players(.x)) %>%
  flatten_chr()
# sequentially
system.time(
  player_stats <- player_urls %>%
    map(~get_player_stats(.x))
)
   user  system elapsed
  35.20    1.63  153.31
# parallel
system.time(
  player_stats_par <- player_urls %>%
    future_map(~get_player_stats(.x))
)
   user  system elapsed
   0.16    0.01    9.42
As we can see, parallelizing the scraping of all 627 players who recorded rushing and receiving statistics in 2020 is roughly sixteen times faster (9.42 versus 153.31 seconds elapsed) than scraping the same data sequentially.
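One housekeeping detail worth noting (not shown in the output above): the plan persists for the rest of the R session, so it is good practice to switch back to sequential execution once the parallel work is done. For long scrapes, the future_map family also accepts a .progress argument that displays a progress bar while the workers run. A sketch, reusing names from the chapter's earlier code:

```r
library(furrr)

# The commented call below sketches how a progress bar would be requested;
# player_urls and get_player_stats come from the earlier scraping code.
# player_stats_par <- future_map(player_urls, get_player_stats, .progress = TRUE)

# Return to sequential execution and shut down the background workers.
plan(sequential)
```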