The tidyverse is built on the idea that problems are most easily solved within the structure of a data frame.
The purrr package supports functional programming that plays well with data frames through its map functions, and it captures errors through safely and related functions. However, combining the two in purrr returns a nested list, which does not work well with data frames.
purrrplus adds functionality to purrr that allows the combination of functional programming and capturing errors to be kept within the structure of a data frame. This is useful for many data analysis tasks and is particularly useful for conducting simulations.
Imagine you have a function (which returns a named list or a named vector and might throw an error):
calculate_if_positive <- function(a, b) {
  # Stop with a specific message if either input is negative.
  if (a < 0 && b < 0) {
    stop("Both numbers are negative.")
  } else if (a < 0) {
    stop("Just the first number is negative")
  } else if (b < 0) {
    stop("Just the second number is negative")
  }
  # Otherwise return a named list of results.
  list(add = a + b,
       subtract = a - b,
       multiply = a * b,
       divide = a / b)
}
And you want to apply this function to each row of a data frame (which might contain irrelevant variables):
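The definition of the data frame is not shown in this text. A minimal reconstruction, assuming the values printed in the pmap_safely output later in this document, might look like this:

```r
library(tibble)

# Reconstructed from the printed output further down; the original
# definition may have been written differently.
numbers <- tibble(
  a = c(-1, 0, 1, 2),
  b = c(2, 1, 0, -1),
  irrelevant = c("minneapolis", "st_paul", "minneapolis", "st_paul")
)
```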
The irrelevant variable causes pmap to throw an error:
output <- pmap(numbers, calculate_if_positive)
## Error in .f(a = .l[[c(1L, i)]], b = .l[[c(2L, i)]], irrelevant = .l[[c(3L, : unused argument (irrelevant = .l[[c(3, i)]])
One way around this is to remove the irrelevant variables:
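The code for this step is not shown in this text. A minimal sketch, assuming dplyr's select is used to keep only the columns the function takes as arguments (the name numbers2 comes from the call that follows; the contents of numbers are reconstructed from output printed later):

```r
library(tibble)
library(dplyr)

# Assumed contents of `numbers`, reconstructed from later output.
numbers <- tibble(
  a = c(-1, 0, 1, 2),
  b = c(2, 1, 0, -1),
  irrelevant = c("minneapolis", "st_paul", "minneapolis", "st_paul")
)

# Keep only the columns that match the function's arguments.
numbers2 <- select(numbers, a, b)
```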
However, the calculate_if_positive function now throws an error that stops everything if any row contains a negative number:
output <- pmap(numbers2, calculate_if_positive)
## Error in .f(a = .l[[c(1L, i)]], b = .l[[c(2L, i)]], ...): Just the first number is negative
We are applying the function calculate_if_positive 4 times (once for each row in the data frame numbers). It should work for rows 2 and 3 and throw an error for rows 1 and 4.
The purrr function safely lets us capture the result when the function works and the error when it doesn’t:
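The call that produces the output inspected next is not shown in this text. A minimal sketch, assuming the function is wrapped in safely and passed to pmap (the contents of numbers2 are reconstructed from output printed elsewhere in this document):

```r
library(purrr)
library(tibble)

calculate_if_positive <- function(a, b) {
  if (a < 0 && b < 0) {
    stop("Both numbers are negative.")
  } else if (a < 0) {
    stop("Just the first number is negative")
  } else if (b < 0) {
    stop("Just the second number is negative")
  }
  list(add = a + b, subtract = a - b, multiply = a * b, divide = a / b)
}

# Assumed contents of `numbers2` (the data frame without the extra column).
numbers2 <- tibble(a = c(-1, 0, 1, 2), b = c(2, 1, 0, -1))

# safely() returns a wrapped function that never throws; each call
# yields a list with a `result` element and an `error` element.
output <- pmap(numbers2, safely(calculate_if_positive))
```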
However, a function wrapped in safely returns a nested list, which is difficult to work with:
str(output)
## List of 4
## $ :List of 2
## ..$ result: NULL
## ..$ error :List of 2
## .. ..$ message: chr "Just the first number is negative"
## .. ..$ call : language .f(...)
## .. ..- attr(*, "class")= chr [1:3] "simpleError" "error" "condition"
## $ :List of 2
## ..$ result:List of 4
## .. ..$ add : num 1
## .. ..$ subtract: num -1
## .. ..$ multiply: num 0
## .. ..$ divide : num 0
## ..$ error : NULL
## $ :List of 2
## ..$ result:List of 4
## .. ..$ add : num 1
## .. ..$ subtract: num 1
## .. ..$ multiply: num 0
## .. ..$ divide : num Inf
## ..$ error : NULL
## $ :List of 2
## ..$ result: NULL
## ..$ error :List of 2
## .. ..$ message: chr "Just the second number is negative"
## .. ..$ call : language .f(...)
## .. ..- attr(*, "class")= chr [1:3] "simpleError" "error" "condition"
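To see why this list is awkward, here is a rough sketch of pulling the errors and results back into a data frame by hand with purrr and dplyr. This is only an illustration of the manual work involved, not how purrrplus is implemented; the name by_hand is hypothetical, and the contents of numbers2 are reconstructed from output printed elsewhere in this document.

```r
library(purrr)
library(dplyr)
library(tibble)

calculate_if_positive <- function(a, b) {
  if (a < 0 && b < 0) {
    stop("Both numbers are negative.")
  } else if (a < 0) {
    stop("Just the first number is negative")
  } else if (b < 0) {
    stop("Just the second number is negative")
  }
  list(add = a + b, subtract = a - b, multiply = a * b, divide = a / b)
}

numbers2 <- tibble(a = c(-1, 0, 1, 2), b = c(2, 1, 0, -1))
output <- pmap(numbers2, safely(calculate_if_positive))

# Hypothetical manual approach: one column for the error message
# (NA when the call succeeded) and one list column for the result.
by_hand <- numbers2 %>%
  mutate(
    error = map_chr(output, function(x) {
      if (is.null(x$error)) NA_character_ else conditionMessage(x$error)
    }),
    result = map(output, "result")
  )
```

Every analysis that uses safely would need some version of this bookkeeping, and the irrelevant column would still have to be dropped before the call and joined back afterwards.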
pmap_safely is the key function in purrrplus. pmap_safely takes a data frame (which might contain irrelevant variables) and applies a function (which returns a named list or a named vector and might throw an error) to each row.
pmap_safely adds an error column and a result column (both produced by applying the function) to the input data frame.
(output <- pmap_safely(numbers, calculate_if_positive))
## Note that the function does not use the following variables: irrelevant
## # A tibble: 4 x 5
## a b irrelevant error result
## <dbl> <dbl> <chr> <chr> <list>
## 1 -1 2 minneapolis Just the first number is negative <NULL>
## 2 0 1 st_paul <NA> <list [4]>
## 3 1 0 minneapolis <NA> <list [4]>
## 4 2 -1 st_paul Just the second number is negative <NULL>
get_errors allows for quick analysis of errors:
get_errors(output)
## # A tibble: 10 x 5
## variable value n_errors count error_rate
## <chr> <chr> <int> <int> <dbl>
## 1 a -1 1 1 1
## 2 a 2 1 1 1
## 3 a 0 0 1 0
## 4 a 1 0 1 0
## 5 b -1 1 1 1
## 6 b 2 1 1 1
## 7 b 0 0 1 0
## 8 b 1 0 1 0
## 9 irrelevant minneapolis 1 2 0.5
## 10 irrelevant st_paul 1 2 0.5
get_errors with specific = TRUE breaks down the analysis by the specific error:
get_errors(output, specific = TRUE)
## # A tibble: 12 x 6
## variable value error n count rate
## <chr> <chr> <chr> <int> <int> <dbl>
## 1 a -1 Just the first number is nega… 1 1 1
## 2 a 0 <NA> 1 1 1
## 3 a 1 <NA> 1 1 1
## 4 a 2 Just the second number is neg… 1 1 1
## 5 b -1 Just the second number is neg… 1 1 1
## 6 b 0 <NA> 1 1 1
## 7 b 1 <NA> 1 1 1
## 8 b 2 Just the first number is nega… 1 1 1
## 9 irrelevant minneapolis <NA> 1 2 0.5
## 10 irrelevant minneapolis Just the first number is nega… 1 2 0.5
## 11 irrelevant st_paul <NA> 1 2 0.5
## 12 irrelevant st_paul Just the second number is neg… 1 2 0.5
get_results filters out rows with errors and unnests the results so that each item in the list the function returns gets its own column:
get_results(output)
## Removed 2 errors out of 4 rows.
## # A tibble: 2 x 7
## a b irrelevant add_result subtract_result multiply_result
## <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
## 1 0 1 st_paul 1 -1 0
## 2 1 0 minneapolis 1 1 0
## # ... with 1 more variable: divide_result <dbl>
Notice that "_result" is appended to the name of each of these columns, so any subsequent analysis can easily differentiate between variables the function produced and variables it didn’t.
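As a small illustration of how the suffix helps, the sketch below rebuilds the printed get_results output by hand (so that it runs without purrrplus) and then uses dplyr's ends_with helper to pick out only the columns the function produced. The names results and produced are hypothetical.

```r
library(tibble)
library(dplyr)

# Hand-built copy of the table printed above, for illustration only.
results <- tibble(
  a = c(0, 1),
  b = c(1, 0),
  irrelevant = c("st_paul", "minneapolis"),
  add_result = c(1, 1),
  subtract_result = c(-1, 1),
  multiply_result = c(0, 0),
  divide_result = c(0, Inf)
)

# Select only the variables that came from the function.
produced <- select(results, ends_with("_result"))
```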