pick(), reframe(), and arrange()
dplyr 1.1.0
Install dplyr 1.1.0 with:
pak::pak("cran/dplyr@1.1.0")
Load the package with:
library(dplyr)
pick()
You may have used across() for column selection while working inside a data-masking function like mutate() or summarize().
# A tibble: 5 × 4
x_1 x_2 w_4 y_2
<dbl> <int> <int> <dbl>
1 1 6 11 5
2 3 7 12 2
3 2 8 13 4
4 1 9 14 0
5 2 10 15 6
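The code that built df is not shown, so here is a minimal sketch that reconstructs it from the printed tibble above. The column types follow the <dbl> and <int> tags in the printout; the construction itself is an assumption, not the original code.

```r
library(dplyr)

# Reconstructed from the printed output; the original code is not shown.
df <- tibble(
  x_1 = c(1, 3, 2, 1, 2),  # <dbl>
  x_2 = 6:10,              # <int>
  w_4 = 11:15,             # <int>
  y_2 = c(5, 2, 4, 0, 6)   # <dbl>
)

df
```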
df |>
  summarise(
    n_x = ncol(across(starts_with("x"))),
    n_y = ncol(across(starts_with("y")))
  )
# A tibble: 1 × 2
n_x n_y
<int> <int>
1 2 1
However, across() is meant to apply functions to columns, not to select them. dplyr 1.1.0 provides a dedicated function for selection, called pick():
df |>
  summarise(
    n_x = ncol(pick(starts_with("x"))),
    n_y = ncol(pick(starts_with("y")))
  )
# A tibble: 1 × 2
n_x n_y
<int> <int>
1 2 1
across() still works without functions for now, but the tidyverse team plans to deprecate that usage in the future.
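Since pick() returns a data frame of the selected columns, it also slots into mutate(). The sketch below is illustrative only: it rebuilds the small example tibble from its printed output, and the column name x_total is a hypothetical name chosen for this example.

```r
library(dplyr)

# Example data rebuilt from the printed tibble (assumed construction).
df <- tibble(
  x_1 = c(1, 3, 2, 1, 2),
  x_2 = 6:10,
  w_4 = 11:15,
  y_2 = c(5, 2, 4, 0, 6)
)

# pick() selects the "x" columns as a data frame; rowSums() adds them row-wise.
out <- df |>
  mutate(x_total = rowSums(pick(starts_with("x"))))

out$x_total
```

Here pick() does the selecting and rowSums() does the computing, which keeps the two jobs separate.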
reframe()
dplyr 1.0.0 introduced a powerful new feature: summarise() could return per-group results of any length:
table <- c("a", "b", "d", "f")
df <- tibble(
  g = c(1, 1, 1, 2, 2, 2, 2),
  x = c("e", "a", "b", "c", "f", "d", "a")
)
df
# A tibble: 7 × 2
g x
<dbl> <chr>
1 1 e
2 1 a
3 1 b
4 2 c
5 2 f
6 2 d
7 2 a
Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
dplyr 1.1.0.
ℹ Please use `reframe()` instead.
ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
always returns an ungrouped data frame and adjust accordingly.
# A tibble: 5 × 2
g x
<dbl> <chr>
1 1 a
2 1 b
3 2 f
4 2 d
5 2 a
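The call that produced the warning and output above is not shown. One call that is consistent with that output, assuming dplyr 1.1.x, keeps the values of x that also appear in table within each group. Treat it as a plausible reconstruction rather than the original code.

```r
library(dplyr)

table <- c("a", "b", "d", "f")
df <- tibble(
  g = c(1, 1, 1, 2, 2, 2, 2),
  x = c("e", "a", "b", "c", "f", "d", "a")
)

# Assumed call: per group, keep the values of x that are also in `table`.
# Group 1 yields 2 rows and group 2 yields 3 rows, so dplyr 1.1.x emits the
# deprecation warning about returning more (or less) than 1 row per group.
res <- df %>%
  summarise(x = intersect(x, table), .by = g)

res
```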
However, this feature raised some concerns:
- It increases the chance of accidental bugs
- It goes against the spirit of a “summary,” which implies 1 row per group
- It makes translation to dbplyr very difficult
This feature has been walked back: summarize() now throws a warning when a group returns 0 rows or more than 1 row.
Its replacement is a new function, reframe()!
Think of reframe() as: “do something to each group”.
# A tibble: 5 × 2
g x
<dbl> <chr>
1 1 a
2 1 b
3 2 f
4 2 d
5 2 a
reframe() always returns an ungrouped data frame, even if the input was grouped.
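The reframe() call behind the output above is not shown either. A minimal sketch, assuming the same per-group intersect() computation, also checks that the result is ungrouped whether the grouping comes from .by or from group_by().

```r
library(dplyr)

table <- c("a", "b", "d", "f")
df <- tibble(
  g = c(1, 1, 1, 2, 2, 2, 2),
  x = c("e", "a", "b", "c", "f", "d", "a")
)

# Assumed call: any number of rows per group is fine, and there is no warning.
res_by <- df %>%
  reframe(x = intersect(x, table), .by = g)

# Same computation on an already grouped input.
res_grouped <- df %>%
  group_by(g) %>%
  reframe(x = intersect(x, table))

# Neither result carries grouping.
is_grouped_df(res_by)
is_grouped_df(res_grouped)
```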
arrange()
When sorting character vectors, arrange() now defaults to the C locale rather than the system locale. This makes dplyr 1.1.0 much faster at sorting character variables, roughly 10x in the timings below.
# A tibble: 500,000 × 1
x
<chr>
1 h0myzPRtu57XbQT
2 aaYu8q2bRepCcq1
3 DVFhH1yGIMLUedf
4 Esf49mkgK2Oz5rs
5 p4KYioo2nx5fuIn
6 CoTjxgZB6MdWcMM
7 Xag5GvaJXXNY60G
8 Cz2Jvn9aFySJm6r
9 zyWGJPkSqXm6VB1
10 gciv0cIZLOfGvr8
# … with 499,990 more rows
withr::with_options(list(dplyr.legacy_locale = TRUE), {
  bench::system_time(df %>% arrange(x))
})
process real
4.18s 4.36s
bench::system_time(df %>% arrange(x))
process real
365ms 426ms
There is a new .locale argument that lets you explicitly request an alternative locale using a stringi locale identifier (like “en” for English, or “fr” for French).
bench::system_time(df %>% arrange(x, .locale = "fr"))
process real
415ms 450ms
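Speed is not the only change: the C locale also orders strings differently from a language-aware locale. The small sketch below uses made-up data to show the difference. Note that the argument in arrange() is spelled .locale, and that requesting a non-C locale needs the stringi package installed, hence the guard.

```r
library(dplyr)

# Tiny made-up example mixing upper and lower case.
df_small <- tibble(x = c("a", "b", "C", "B", "c"))

# Default in dplyr 1.1.0: C locale, so uppercase letters sort before lowercase.
c_sorted <- arrange(df_small, x)$x
c_sorted
# "B" "C" "a" "b" "c"

# Language-aware ordering; only run when stringi is available.
if (requireNamespace("stringi", quietly = TRUE)) {
  en_sorted <- arrange(df_small, x, .locale = "en")$x
  print(en_sorted)
  # "a" "b" "B" "c" "C"
}
```

If you rely on dictionary-style ordering for display, passing .locale explicitly is the safer choice; for grouping or joining work, the faster C default is usually enough.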