3.3 Parallelization with furrr

When we map the function to scrape team statistics over the vector of team web addresses, we are specifying that the function inside the map (e.g. get_team_stats) should be applied to each element of the vector (e.g. team_urls) sequentially. Computational tasks that consist of many separate, independently executable jobs are good candidates for parallelization. When jobs are run in parallel, they run at the same time rather than one after another. If the computational burden of each job outweighs the overhead of setting up the parallel workers, running the code in parallel will save time.
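
The overhead caveat matters in practice: for very cheap jobs, the cost of dispatching work to parallel workers can exceed the work itself. A toy sketch to illustrate (the function and inputs are made up for demonstration):

```r
library(purrr)
library(furrr)
plan(multisession)

cheap_job <- function(x) x^2   # trivially fast; not worth parallelizing

system.time(map(1:1000, cheap_job))         # near-instant sequentially
system.time(future_map(1:1000, cheap_job))  # often slower: dispatch overhead dominates
```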

While there are different types of parallelization, we will focus on only one: multi-core parallelization, which lets us use several of the computer's processor cores rather than a single core. furrr, a newer package in the tidyverse ecosystem, aims to make mapping in parallel easy and pain-free by combining the functionality of the purrr package and the future package. By pairing purrr's mapping capabilities with future's parallel processing capabilities, furrr allows for parallelization with syntax nearly identical to purrr's.
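
Before any furrr call runs in parallel, the future package needs to be told which execution strategy to use. A minimal sketch of the setup (the worker count shown is illustrative, not a recommendation):

```r
library(future)

# check how many cores this machine reports
availableCores()

# multisession launches background R sessions and works on
# Windows, macOS, and Linux alike; leave one core free for the OS
plan(multisession, workers = availableCores() - 1)
```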

In the previous subsection, we learned how to apply mapping functions to scrape and clean the statistics for each team sequentially. Using furrr, we can parallelize this process. We will show both approaches side by side.

# sequential computation using purrr
rush_receive <- team_urls %>%
  map(., ~get_team_stats(.x)) %>%
  map_dfr(., ~clean_team_stats(.x))

# parallel computation using furrr
# (multisession works on all platforms; the older multiprocess
# strategy is deprecated in current versions of future)
library(furrr)
future::plan(multisession)

rush_receive_parallel <- team_urls %>%
  future_map(., ~get_team_stats(.x)) %>%
  future_map_dfr(., ~clean_team_stats(.x))

# compare output
identical(rush_receive, rush_receive_parallel)
[1] TRUE
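
One caveat worth flagging: if the mapped function involves random number generation, furrr should be told to use parallel-safe seeds, or results may not be reproducible across runs. The scraping functions here are deterministic, so this is not needed above, but a sketch for stochastic work might look like:

```r
library(furrr)
plan(multisession)

# request statistically sound, reproducible parallel RNG streams
set.seed(2020)
draws <- future_map(1:4, ~rnorm(2, mean = .x),
                    .options = furrr_options(seed = TRUE))
```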

Let’s compare the speed of the operations.

# sequentially
system.time(
  rush_receive <- team_urls %>%
    map(., ~get_team_stats(.x)) %>%
    map_dfr(., ~clean_team_stats(.x))
)

# parallel
system.time(
  rush_receive_parallel <- team_urls %>%
    future_map(., ~get_team_stats(.x)) %>%
    future_map_dfr(., ~clean_team_stats(.x))
)
   user  system elapsed 
   2.65    0.09    3.65 
   user  system elapsed 
   0.28    0.03    1.09 

We can see that parallelizing this process is more than three times faster (1.09 versus 3.65 seconds elapsed) than applying these functions sequentially. This effect is only amplified when we increase the number of sources to scrape.

player_urls <- team_urls %>% 
  map(., ~get_players(.x)) %>% 
  flatten_chr()

# sequentially
system.time(
  player_stats <- player_urls %>%
    map(., ~get_player_stats(.x))
)

# parallel
system.time(
  player_stats_par <- player_urls %>% 
    future_map(., ~get_player_stats(.x))
)
   user  system elapsed 
  35.20    1.63  153.31 
   user  system elapsed 
   0.16    0.01    9.42 

As we can see, parallelizing the scraping of all 627 players who recorded rushing and receiving statistics in 2020 is roughly sixteen times faster (9.42 versus 153.31 seconds elapsed) than scraping the same data sequentially.
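
Once the parallel work is finished, it is good practice to return to sequential processing so that later code does not keep the background worker sessions alive. A minimal sketch:

```r
library(future)

# shut down the background workers and return to sequential evaluation
plan(sequential)
```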