Per-operation grouping
dplyr 1.1.0
Install dplyr 1.1.0 with:
pak::pak("cran/dplyr@1.1.0")
Load the package with:
library(dplyr)
Per-operation grouping
by/.by is an experimental grouping alternative to group_by().
group_by()
group_by() groups a data frame by one or more variables.
transactions <- tibble::tribble(
  ~company, ~year, ~revenue,
  "A", 2019L, 20L,
  "A", 2019L, 50L,
  "A", 2020L, 4L,
  "B", 2021L, 10L,
  "B", 2023L, 12L,
  "B", 2023L, 18L
)
Let’s say you want revenue by company and year:
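A minimal sketch of one way to get a result like the one below, assuming the grouping is done with group_by() and the total is added with mutate() (the column name total is taken from the output):

```r
library(dplyr)

transactions <- tibble::tribble(
  ~company, ~year, ~revenue,
  "A", 2019L, 20L,
  "A", 2019L, 50L,
  "A", 2020L, 4L,
  "B", 2021L, 10L,
  "B", 2023L, 12L,
  "B", 2023L, 18L
)

# Group by company and year, then add the per-group total to every row.
with_total <- transactions |>
  group_by(company, year) |>
  mutate(total = sum(revenue))

with_total
```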
# A tibble: 6 × 4
# Groups: company, year [4]
company year revenue total
<chr> <int> <int> <int>
1 A 2019 20 70
2 A 2019 50 70
3 A 2020 4 4
4 B 2021 10 10
5 B 2023 12 30
6 B 2023 18 30
Notice the header line that says Groups: company, year [4]. group_by() provides persistent grouping: the grouping lasts for more than one operation.
If you want only the total yearly revenue of each company, you can use summarize() which peels off a layer of grouping by default:
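A sketch of what that could look like, assuming the same group_by() call as before followed by summarize():

```r
library(dplyr)

transactions <- tibble::tribble(
  ~company, ~year, ~revenue,
  "A", 2019L, 20L,
  "A", 2019L, 50L,
  "A", 2020L, 4L,
  "B", 2021L, 10L,
  "B", 2023L, 12L,
  "B", 2023L, 18L
)

# summarize() collapses each (company, year) group to one row and
# drops the last grouping variable (year) from the result.
yearly <- transactions |>
  group_by(company, year) |>
  summarize(revenue = sum(revenue))

group_vars(yearly)
yearly
```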
# A tibble: 4 × 3
# Groups: company [2]
company year revenue
<chr> <int> <int>
1 A 2019 70
2 A 2020 4
3 B 2021 10
4 B 2023 30
(year has been removed as a grouping variable.)
What if you didn’t want groups anymore?
Before: ungroup()
Before: .groups = "drop"
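Both of these older approaches can be sketched as follows; the object names with_ungroup and with_drop are only illustrative:

```r
library(dplyr)

transactions <- tibble::tribble(
  ~company, ~year, ~revenue,
  "A", 2019L, 20L,
  "A", 2019L, 50L,
  "A", 2020L, 4L,
  "B", 2021L, 10L,
  "B", 2023L, 12L,
  "B", 2023L, 18L
)

# Option 1: summarize, then remove the remaining grouping explicitly.
with_ungroup <- transactions |>
  group_by(company, year) |>
  summarize(revenue = sum(revenue)) |>
  ungroup()

# Option 2: ask summarize() to drop all grouping itself.
with_drop <- transactions |>
  group_by(company, year) |>
  summarize(revenue = sum(revenue), .groups = "drop")

is_grouped_df(with_ungroup)
is_grouped_df(with_drop)
```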
Now: by/.by
by/.by introduces the idea of per-operation grouping:
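A minimal sketch, assuming the same yearly revenue summary as before but with the grouping passed through .by:

```r
library(dplyr)

transactions <- tibble::tribble(
  ~company, ~year, ~revenue,
  "A", 2019L, 20L,
  "A", 2019L, 50L,
  "A", 2020L, 4L,
  "B", 2021L, 10L,
  "B", 2023L, 12L,
  "B", 2023L, 18L
)

# The grouping applies to this one summarize() call only.
per_op <- transactions |>
  summarize(revenue = sum(revenue), .by = c(company, year))

is_grouped_df(per_op)
per_op
```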
Notice the result is no longer grouped by company on the way out. It performs the one operation, then drops the grouping.
Flow with group_by(): Bare tibble --> Transaction --> Grouped data frame
Flow with by/.by: Bare tibble --> Transaction --> Bare tibble
Advantages:
- summarise() didn’t emit a message about regrouping.
- You never have to remember to ungroup().
- Order doesn’t matter (because you’re not peeling off layers).
- You can place the grouping specification alongside the code that uses it, rather than in a separate group_by() line.
- You can use tidyselect for multiple columns, including unquoted column names or tidyselections like .by = all_of(c("")).
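The tidyselect point can be sketched like this. The variable group_cols is a hypothetical name introduced only for illustration:

```r
library(dplyr)

transactions <- tibble::tribble(
  ~company, ~year, ~revenue,
  "A", 2019L, 20L,
  "A", 2019L, 50L,
  "A", 2020L, 4L,
  "B", 2021L, 10L,
  "B", 2023L, 12L,
  "B", 2023L, 18L
)

# Unquoted column names, combined with c().
by_bare <- transactions |>
  summarize(revenue = sum(revenue), .by = c(company, year))

# Column names stored in a character vector, selected with all_of().
group_cols <- c("company", "year")  # hypothetical helper variable
by_chr <- transactions |>
  summarize(revenue = sum(revenue), .by = all_of(group_cols))
```

Both calls should group on the same two columns and return the same four rows.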
Things to note:
- by/.by is only for selection, it does not create columns.
- by/.by always returns an ungrouped data frame (so take note if you depend on grouped data frames with group_by()).
- With by/.by, you must create your grouping columns ahead of time.
- .by doesn’t sort grouping keys. group_by() always sorts keys in ascending order, which affects the results of verbs like summarize().
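Two of these notes are easier to see on data. The sketch below reorders the rows first so that the key order differs (the names rev_tx and recent are illustrative, not from the original):

```r
library(dplyr)

transactions <- tibble::tribble(
  ~company, ~year, ~revenue,
  "A", 2019L, 20L,
  "A", 2019L, 50L,
  "A", 2020L, 4L,
  "B", 2021L, 10L,
  "B", 2023L, 12L,
  "B", 2023L, 18L
)

# Put company "B" first so first-appearance order differs from sorted order.
rev_tx <- transactions |> arrange(desc(company))

# .by keeps keys in order of first appearance: B, then A.
first_seen <- rev_tx |>
  summarize(revenue = sum(revenue), .by = company)

# group_by() sorts the keys: A, then B.
sorted <- rev_tx |>
  group_by(company) |>
  summarize(revenue = sum(revenue))

# .by only selects existing columns, so a computed key is created first.
by_recent <- transactions |>
  mutate(recent = year >= 2021L) |>
  summarize(revenue = sum(revenue), .by = recent)
```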
Where did this come from?
by/.by was inspired by data.table!
- by is specified alongside what you want to group
- You start with a bare data table, do the operation, and end up with a bare data table, rather than having a grouped data frame like in dplyr.
transactions[, .(revenue = sum(revenue)), by = .(company, year)]
This raised the question: what if you could put it inline with your summarize() call?
Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
dplyr 1.1.0.
ℹ Please use `reframe()` instead.
ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
always returns an ungrouped data frame and adjust accordingly.
# A tibble: 12 × 2
revenue by
<int> <chr>
1 114 A
2 114 A
3 114 A
4 114 B
5 114 B
6 114 B
7 114 2019
8 114 2019
9 114 2020
10 114 2021
11 114 2023
12 114 2023
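The output above is what you would expect if by is read as an ordinary new column rather than as a grouping specification, which suggests why the dplyr argument needs a distinguishing . prefix. A minimal sketch of the same effect, using mutate() instead of the original call (which is an assumption made here to keep the example free of deprecation warnings):

```r
library(dplyr)

transactions <- tibble::tribble(
  ~company, ~year, ~revenue,
  "A", 2019L, 20L,
  "A", 2019L, 50L,
  "A", 2020L, 4L,
  "B", 2021L, 10L,
  "B", 2023L, 12L,
  "B", 2023L, 18L
)

# `by` is not an argument of mutate(), so this creates a column named `by`
# and computes the total over the whole, ungrouped data frame.
wrong <- transactions |>
  mutate(total = sum(revenue), by = company)

# `.by` is the grouping argument, so the total is computed per company.
right <- transactions |>
  mutate(total = sum(revenue), .by = company)
```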
In summary
- by/.by is per-operation grouping
- group_by() is persistent grouping
dplyr verbs that support by/.by:
- mutate()
- summarize()
- reframe()
- filter()
- slice()
- slice_head() and slice_tail()
- slice_min() and slice_max()
- slice_sample()
by or .by?
Some verbs use a . prefix for this argument (.by) and some don’t (by). If you use the incorrect one, you will get an informative error:
transactions |>
  slice_max(revenue, n = 2, .by = company)
Error in `slice_max()`:
! Can't specify an argument named `.by` in this verb.
ℹ Did you mean to use `by` instead?
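In dplyr 1.1.0, mutate(), summarize(), reframe(), filter(), and slice() take .by, while the slice_*() helpers take by. A sketch of the corrected call, following the hint in the error message:

```r
library(dplyr)

transactions <- tibble::tribble(
  ~company, ~year, ~revenue,
  "A", 2019L, 20L,
  "A", 2019L, 50L,
  "A", 2020L, 4L,
  "B", 2021L, 10L,
  "B", 2023L, 12L,
  "B", 2023L, 18L
)

# The two highest-revenue transactions per company, using `by` (no dot).
top_two <- transactions |>
  slice_max(revenue, n = 2, by = company)

top_two
```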
What happens to group_by()?
It’s not going away! It is not deprecated or even superseded. Don’t feel pressure to use by/.by.