pick(), reframe(), and arrange()

dplyr 1.1.0

dplyr 1.1.0 is much faster at sorting character vectors, and introduces pick() and reframe() as clearer alternatives to across()-based column selection and multi-row summarise() calls.

Published: January 29, 2023

Install dplyr 1.1.0 with:

pak::pak("cran/dplyr@1.1.0")
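
If it is unclear which version ended up installed, a quick check along these lines can help. This is only a sketch built on base R's packageVersion(); the message wording is illustrative.

```r
# Compare the installed dplyr version against 1.1.0
installed <- utils::packageVersion("dplyr")

if (installed < "1.1.0") {
  message("dplyr ", as.character(installed),
          " found; pick() and reframe() need 1.1.0 or later.")
} else {
  message("dplyr ", as.character(installed), " is ready to use.")
}
```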

Load the package with:

library(dplyr)

pick()

You may have used across() for column selection while working inside a data-masking function like mutate() or summarize().

df <- tibble(
  x_1 = c(1, 3, 2, 1, 2), 
  x_2 = 6:10, 
  w_4 = 11:15, 
  y_2 = c(5, 2, 4, 0, 6)
)

df
# A tibble: 5 × 4
    x_1   x_2   w_4   y_2
  <dbl> <int> <int> <dbl>
1     1     6    11     5
2     3     7    12     2
3     2     8    13     4
4     1     9    14     0
5     2    10    15     6

df |>
  summarise(
    n_x = ncol(across(starts_with("x"))),
    n_y = ncol(across(starts_with("y")))
  )
# A tibble: 1 × 2
    n_x   n_y
  <int> <int>
1     2     1

But across() is meant to apply functions to columns, not to select them. dplyr 1.1.0 provides a dedicated function for selection, called pick():

df |>
  summarise(
    n_x = ncol(pick(starts_with("x"))),
    n_y = ncol(pick(starts_with("y")))
  )
# A tibble: 1 × 2
    n_x   n_y
  <int> <int>
1     2     1

Calling across() without any functions still works for now, but the tidyverse team plans to deprecate that usage in the future.
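
Because pick() returns a data frame of the selected columns, it can be handed to any function that accepts a data frame. As a minimal sketch (the x_total column name is only an illustration), row totals across the x columns of the df from above could be computed like this:

```r
library(dplyr)

df <- tibble(
  x_1 = c(1, 3, 2, 1, 2),
  x_2 = 6:10,
  w_4 = 11:15,
  y_2 = c(5, 2, 4, 0, 6)
)

# pick() hands rowSums() a tibble holding only x_1 and x_2
out <- df |>
  mutate(x_total = rowSums(pick(starts_with("x"))))

# Row totals of x_1 + x_2: 7, 10, 10, 10, 12
out$x_total
```

The same pattern should work inside summarise() and other data-masking verbs, since pick() uses tidyselect syntax just like select().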

reframe()

dplyr 1.0.0 introduced a powerful new feature: summarise() could return per-group results of any length:

table <- c("a", "b", "d", "f")

df <- tibble(
  g = c(1, 1, 1, 2, 2, 2, 2),
  x = c("e", "a", "b", "c", "f", "d", "a")
)

df
# A tibble: 7 × 2
      g x    
  <dbl> <chr>
1     1 e    
2     1 a    
3     1 b    
4     2 c    
5     2 f    
6     2 d    
7     2 a    

df |>
  summarise(x = intersect(x, table), .by = g)
Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
dplyr 1.1.0.
ℹ Please use `reframe()` instead.
ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
  always returns an ungrouped data frame and adjust accordingly.
# A tibble: 5 × 2
      g x    
  <dbl> <chr>
1     1 a    
2     1 b    
3     2 f    
4     2 d    
5     2 a    

However, this flexibility raised some concerns:

  • It increases the chance of accidental bugs
  • It goes against the spirit of a “summary,” which implies 1 row per group
  • It makes translation to dbplyr very difficult

The feature has been walked back in dplyr 1.1.0: summarise() now throws a deprecation warning when a group returns 0 rows or more than 1 row.

As its replacement, welcome the new function reframe()!

Think of reframe() as: “do something to each group”.

df |>
  reframe(x = intersect(x, table), .by = g)
# A tibble: 5 × 2
      g x    
  <dbl> <chr>
1     1 a    
2     1 b    
3     2 f    
4     2 d    
5     2 a    
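
reframe() is not limited to a single vector per group: the expression can also return a data frame, whose columns are unpacked into the result. A minimal sketch, assuming a hypothetical helper quantile_df() and made-up data (neither is part of dplyr):

```r
library(dplyr)

# Hypothetical helper: returns one row per requested quantile
quantile_df <- function(x, probs = c(0.25, 0.5, 0.75)) {
  tibble(prob = probs, value = quantile(x, probs, names = FALSE))
}

# Made-up example data
scores <- tibble(
  g = c("a", "a", "a", "b", "b", "b"),
  x = c(1, 2, 3, 10, 20, 30)
)

# Three rows per group: one for each quantile
quantiles <- scores |>
  reframe(quantile_df(x), .by = g)

quantiles
```

Each group contributes three rows here, which is exactly the kind of result summarise() now warns about.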

reframe() always returns an ungrouped data frame, even if the input was grouped with group_by().
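
To see the ungrouping in action, the sketch below repeats the intersect() example with group_by() instead of .by and inspects the result with is_grouped_df(). The result and regrouped names are only illustrative.

```r
library(dplyr)

df <- tibble(
  g = c(1, 1, 1, 2, 2, 2, 2),
  x = c("e", "a", "b", "c", "f", "d", "a")
)
table <- c("a", "b", "d", "f")

# Group with group_by() instead of the .by argument
result <- df |>
  group_by(g) |>
  reframe(x = intersect(x, table))

# The grouping is dropped from the result
is_grouped_df(result)

# Re-group explicitly if later steps depend on the grouping
regrouped <- result |>
  group_by(g)
```

This differs from summarise(), which by default only peels off the last grouping level, so code migrated from summarise() may need an explicit group_by() afterwards.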

arrange()

When sorting character vectors, arrange() now defaults to the C locale rather than the system locale. This makes dplyr 1.1.0 much faster at sorting character variables: roughly 10x in the benchmark below.

library(withr)
library(dplyr)

df <- tibble(x = stringi::stri_rand_strings(n = 5e5, length = 15))
df
# A tibble: 500,000 × 1
   x              
   <chr>          
 1 h0myzPRtu57XbQT
 2 aaYu8q2bRepCcq1
 3 DVFhH1yGIMLUedf
 4 Esf49mkgK2Oz5rs
 5 p4KYioo2nx5fuIn
 6 CoTjxgZB6MdWcMM
 7 Xag5GvaJXXNY60G
 8 Cz2Jvn9aFySJm6r
 9 zyWGJPkSqXm6VB1
10 gciv0cIZLOfGvr8
# … with 499,990 more rows

# Restore the pre-1.1.0 behaviour (system locale) for comparison
withr::with_options(
  list(dplyr.legacy_locale = TRUE),
  bench::system_time(df %>% arrange(x))
)
process    real 
  4.18s   4.36s 

# Default in dplyr 1.1.0: C locale
bench::system_time(df %>% arrange(x))
process    real 
  365ms   426ms 

There is a new .locale argument (note the leading dot) that lets you explicitly request an alternative locale using a stringi locale identifier (like “en” for English, or “fr” for French). Using a stringi locale requires the stringi package to be installed.

bench::system_time(df %>% arrange(x, .locale = "fr"))
process    real 
  415ms   450ms 

Warning

Be aware: the C locale orders character vectors slightly differently from most system locales. For example, it sorts all uppercase letters before lowercase ones.
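
A small sketch of the ordering difference, using a made-up five-letter vector and the .locale argument of arrange() (the "en" locale needs the stringi package):

```r
library(dplyr)

# Made-up data mixing upper and lower case
letters_df <- tibble(x = c("a", "b", "C", "B", "c"))

# Default C locale: uppercase letters sort before lowercase ones
c_order <- arrange(letters_df, x)$x
c_order
# "B" "C" "a" "b" "c"

# English locale via stringi: letters are grouped regardless of case
en_order <- arrange(letters_df, x, .locale = "en")$x
en_order
# "a" "b" "B" "c" "C"
```

If the exact order matters for presentation, passing .locale explicitly is probably safer than relying on the default.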
