Per-operation grouping
dplyr 1.1.0
Install dplyr 1.1.0 with:
pak::pak("cran/dplyr@1.1.0")
Load the package with:
library(dplyr)
Per-operation grouping
by/.by is an experimental grouping alternative to group_by().
group_by()
group_by() is a function that groups a data frame by one or more variables.
transactions <- tibble::tribble(
  ~company, ~year, ~revenue,
  "A", 2019L, 20L,
  "A", 2019L, 50L,
  "A", 2020L,  4L,
  "B", 2021L, 10L,
  "B", 2023L, 12L,
  "B", 2023L, 18L
)
Let’s say you want revenue by company and year:
# A tibble: 6 × 4
# Groups: company, year [4]
company year revenue total
<chr> <int> <int> <int>
1 A 2019 20 70
2 A 2019 50 70
3 A 2020 4 4
4 B 2021 10 10
5 B 2023 12 30
6 B 2023 18 30
Notice the header line that says Groups: company, year [4]. group_by() provides persistent grouping: it lasts for more than one operation.
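The call that produced the output above did not survive on this page. A minimal sketch that gives the same shape of result, assuming the total column was added with mutate() after group_by():

```r
library(dplyr)

transactions <- tibble::tribble(
  ~company, ~year, ~revenue,
  "A", 2019L, 20L,
  "A", 2019L, 50L,
  "A", 2020L,  4L,
  "B", 2021L, 10L,
  "B", 2023L, 12L,
  "B", 2023L, 18L
)

# Group by company and year, then add a per-group total to every row.
totals <- transactions |>
  group_by(company, year) |>
  mutate(total = sum(revenue))

# The grouping is still attached to the result.
group_vars(totals)  # "company" "year"
```

Any later verb applied to totals would still operate per company and year until the grouping is removed.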
If you want only the total yearly revenue of each company, you can use summarize(), which peels off a layer of grouping by default:
# A tibble: 4 × 3
# Groups: company [2]
company year revenue
<chr> <int> <int>
1 A 2019 70
2 A 2020 4
3 B 2021 10
4 B 2023 30
(Year is removed as a group).
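The code for this step is also missing here. A sketch of what it probably looked like, assuming the same group_by(company, year) setup:

```r
library(dplyr)

transactions <- tibble::tribble(
  ~company, ~year, ~revenue,
  "A", 2019L, 20L,
  "A", 2019L, 50L,
  "A", 2020L,  4L,
  "B", 2021L, 10L,
  "B", 2023L, 12L,
  "B", 2023L, 18L
)

# summarize() collapses each group to one row and drops the last
# grouping variable (year), so the result is grouped by company only.
yearly <- transactions |>
  group_by(company, year) |>
  summarize(revenue = sum(revenue))

group_vars(yearly)  # "company"
```

Running this also prints a message that the output has been regrouped by company.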
What if you didn’t want groups anymore?
Before: ungroup()
Before: .groups = "drop"
Now: by/.by
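For comparison, here is a sketch of the two "before" options. Both are existing dplyr features; the object names are only illustrative.

```r
library(dplyr)

transactions <- tibble::tribble(
  ~company, ~year, ~revenue,
  "A", 2019L, 20L,
  "A", 2019L, 50L,
  "A", 2020L,  4L,
  "B", 2021L, 10L,
  "B", 2023L, 12L,
  "B", 2023L, 18L
)

# Option 1: remove the leftover grouping explicitly with ungroup().
with_ungroup <- transactions |>
  group_by(company, year) |>
  summarize(revenue = sum(revenue)) |>
  ungroup()

# Option 2: ask summarize() to drop all grouping itself.
with_drop <- transactions |>
  group_by(company, year) |>
  summarize(revenue = sum(revenue), .groups = "drop")

is_grouped_df(with_ungroup)  # FALSE
is_grouped_df(with_drop)     # FALSE
```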
by/.by introduces the idea of per-operation grouping:
Notice the result is no longer grouped by company on the way out. The grouping applies to that one operation and is then dropped.
Flowchart (persistent grouping): Bare tibble → Transaction → Grouped data frame
Flowchart (per-operation grouping): Bare tibble → Transaction → Bare tibble
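A minimal sketch of per-operation grouping, assuming the lost example summarized revenue by company and year:

```r
library(dplyr)

transactions <- tibble::tribble(
  ~company, ~year, ~revenue,
  "A", 2019L, 20L,
  "A", 2019L, 50L,
  "A", 2020L,  4L,
  "B", 2021L, 10L,
  "B", 2023L, 12L,
  "B", 2023L, 18L
)

# The grouping is declared inside summarize() and only applies to this call.
per_op <- transactions |>
  summarize(revenue = sum(revenue), .by = c(company, year))

# A bare tibble goes in and a bare tibble comes out.
is_grouped_df(per_op)  # FALSE
```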
Advantages:
- summarise() doesn't emit a message about regrouping.
- You never have to remember to ungroup().
- Order doesn't matter (because you're not peeling off layers).
- You can place the grouping specification alongside the code that uses it, rather than in a separate group_by() line.
- You can use tidyselect for multiple columns, including unquoted column names or tidyselections like .by = all_of(c("")).
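To make the tidyselect point concrete, here is a small sketch. The group_cols vector is a hypothetical name introduced for illustration.

```r
library(dplyr)

transactions <- tibble::tribble(
  ~company, ~year, ~revenue,
  "A", 2019L, 20L,
  "A", 2019L, 50L,
  "A", 2020L,  4L,
  "B", 2021L, 10L,
  "B", 2023L, 12L,
  "B", 2023L, 18L
)

# Unquoted column names, combined with c().
unquoted <- transactions |>
  summarize(revenue = sum(revenue), .by = c(company, year))

# Column names stored in a character vector, selected with all_of().
group_cols <- c("company", "year")  # hypothetical helper vector
from_vector <- transactions |>
  summarize(revenue = sum(revenue), .by = all_of(group_cols))

# Both forms select the same grouping columns.
identical(unquoted, from_vector)
```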
Things to note:
- by/.by is only for selection; it does not create columns.
- by/.by always returns an ungrouped data frame (so take note if you depend on grouped data frames with group_by()).
- With by/.by, you must create your grouping columns ahead of time.
- .by doesn't sort grouping keys; it keeps them in order of first appearance. group_by() always sorts keys in ascending order, which affects the results of verbs like summarize().
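Two of these notes are easier to see in code. The sketch below uses a small made-up tibble (sales) whose keys are deliberately out of order, plus a hypothetical recent column created before grouping.

```r
library(dplyr)

# Hypothetical data with company "B" appearing before "A".
sales <- tibble::tibble(
  company = c("B", "A", "B"),
  revenue = c(1L, 2L, 3L)
)

# .by keeps keys in order of first appearance: B comes first, then A.
by_appearance <- sales |>
  summarize(revenue = sum(revenue), .by = company)

# group_by() sorts keys: A comes first, then B.
by_sorted <- sales |>
  group_by(company) |>
  summarize(revenue = sum(revenue))

# .by only selects existing columns, so a derived grouping column
# has to be created with mutate() first.
transactions <- tibble::tribble(
  ~company, ~year, ~revenue,
  "A", 2019L, 20L,
  "A", 2019L, 50L,
  "A", 2020L,  4L,
  "B", 2021L, 10L,
  "B", 2023L, 12L,
  "B", 2023L, 18L
)
by_recent <- transactions |>
  mutate(recent = year >= 2021L) |>
  summarize(revenue = sum(revenue), .by = recent)
```

If the sorted order matters downstream, adding arrange() after the .by call is one way to get it back.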
Where did this come from?
by/.by was inspired by data.table!
- by is specified alongside what you want to group.
- You start with a bare data table, do the operation, and end up with a bare data table, rather than a grouped data frame as in dplyr.
transactions[, .(revenue = sum(revenue)), by = .(company, year)]
This raised the question: what if you could put the grouping in line with your summarize() call?
Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
dplyr 1.1.0.
ℹ Please use `reframe()` instead.
ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
always returns an ungrouped data frame and adjust accordingly.
# A tibble: 12 × 2
revenue by
<int> <chr>
1 114 A
2 114 A
3 114 A
4 114 B
5 114 B
6 114 B
7 114 2019
8 114 2019
9 114 2020
10 114 2021
11 114 2023
12 114 2023
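The call behind this output is missing from the page. One plausible reading, and it is only an assumption, is that by was passed without the dot, so it was treated as a new column named by instead of a grouping specification. The sketch below reproduces that effect with reframe(), as the warning suggests, and contrasts it with .by.

```r
library(dplyr)

transactions <- tibble::tribble(
  ~company, ~year, ~revenue,
  "A", 2019L, 20L,
  "A", 2019L, 50L,
  "A", 2020L,  4L,
  "B", 2021L, 10L,
  "B", 2023L, 12L,
  "B", 2023L, 18L
)

# Without the dot, `by` is just another name-value pair: it creates a
# character column holding company and year values stacked together,
# and the overall revenue total is recycled across all rows.
no_dot <- transactions |>
  reframe(revenue = sum(revenue), by = c(company, year))

# With the dot, `.by` is the grouping specification.
with_dot <- transactions |>
  reframe(revenue = sum(revenue), .by = c(company, year))

nrow(no_dot)    # 12
nrow(with_dot)  # 4
```

This would explain why the argument carries a dot prefix in verbs that create columns from named arguments.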
In summary
- by/.by is per-operation grouping
- group_by() is persistent grouping
dplyr verbs that support by/.by:
- mutate()
- summarize()
- reframe()
- filter()
- slice()
- slice_head() and slice_tail()
- slice_min() and slice_max()
- slice_sample()
by or .by?
Some verbs use a . prefix for their arguments and some don't. If you use the incorrect one, you will get an informative error:
transactions |>
slice_max(revenue, n = 2, .by = company)
Error in `slice_max()`:
! Can't specify an argument named `.by` in this verb.
ℹ Did you mean to use `by` instead?
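Following the hint in the error, here is a sketch of the working spellings: the slice_*() helpers take by, while verbs such as filter() take .by.

```r
library(dplyr)

transactions <- tibble::tribble(
  ~company, ~year, ~revenue,
  "A", 2019L, 20L,
  "A", 2019L, 50L,
  "A", 2020L,  4L,
  "B", 2021L, 10L,
  "B", 2023L, 12L,
  "B", 2023L, 18L
)

# slice_max() uses `by` (no dot): top two revenues per company.
top_two <- transactions |>
  slice_max(revenue, n = 2, by = company)

# filter() uses `.by` (with dot): the single largest revenue per company.
largest <- transactions |>
  filter(revenue == max(revenue), .by = company)
```

Both results come back as ungrouped tibbles.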
What happens to group_by()?
It's not going away! It is not deprecated or even superseded. Don't feel pressure to use by/.by.