Per-operation grouping

dplyr 1.1.0

dplyr

Introducing by/.by, an experimental grouping alternative to group_by().

Published

January 29, 2023

Install dplyr 1.1.0 with:

pak::pak("cran/dplyr@1.1.0")

Load the package with:

library(dplyr)

Per-operation grouping

by/.by is an experimental grouping alternative to group_by().

`group_by()`

group_by() groups a data frame by one or more variables.

transactions <-
  tibble::tribble(
    ~company, ~year, ~revenue,
         "A", 2019L,      20L,
         "A", 2019L,      50L,
         "A", 2020L,       4L,
         "B", 2021L,      10L,
         "B", 2023L,      12L,
         "B", 2023L,      18L
  )

Let’s say you want to add the total revenue for each company and year:

transactions |>
  group_by(company, year) |>
  mutate(total = sum(revenue))

# A tibble: 6 × 4
# Groups:   company, year [4]
  company  year revenue total
  <chr>   <int>   <int> <int>
1 A        2019      20    70
2 A        2019      50    70
3 A        2020       4     4
4 B        2021      10    10
5 B        2023      12    30
6 B        2023      18    30

Notice the header line that says Groups: company, year [4]. group_by() provides persistent grouping: it lasts for more than one operation, until you change or remove it.
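As a minimal sketch of what "persistent" means here, the snippet below rebuilds the same data and checks the grouping with group_vars() after each step. The object names `grouped` and `shares` are only illustrative.

```r
library(dplyr)

# Same data as the `transactions` tibble above, rebuilt so the sketch runs on its own.
transactions <- tibble::tibble(
  company = c("A", "A", "A", "B", "B", "B"),
  year    = c(2019L, 2019L, 2020L, 2021L, 2023L, 2023L),
  revenue = c(20L, 50L, 4L, 10L, 12L, 18L)
)

grouped <- transactions |>
  group_by(company, year) |>
  mutate(total = sum(revenue))

# The grouping is still attached after mutate() ...
group_vars(grouped)
#> [1] "company" "year"

# ... so the next verb is also computed within each company-year group.
shares <- grouped |>
  mutate(share = revenue / total)
group_vars(shares)
#> [1] "company" "year"
```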

If you want only the total yearly revenue of each company, you can use summarize(), which peels off one layer of grouping by default:

transactions %>% 
  group_by(company, year) %>% 
  summarize(revenue = sum(revenue))

# A tibble: 4 × 3
# Groups:   company [2]
  company  year revenue
  <chr>   <int>   <int>
1 A        2019      70
2 A        2020       4
3 B        2021      10
4 B        2023      30

(year is removed as a grouping variable; company remains.)
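The peeling can be checked directly. This sketch, with illustrative object names, inspects the grouping variables before and after each summarize() call; a second summarize() peels off the remaining layer.

```r
library(dplyr)

# Same data as the `transactions` tibble above, rebuilt so the sketch runs on its own.
transactions <- tibble::tibble(
  company = c("A", "A", "A", "B", "B", "B"),
  year    = c(2019L, 2019L, 2020L, 2021L, 2023L, 2023L),
  revenue = c(20L, 50L, 4L, 10L, 12L, 18L)
)

by_company_year <- transactions |> group_by(company, year)
group_vars(by_company_year)
#> [1] "company" "year"

# The first summarize() drops the last grouping variable (year).
yearly <- by_company_year |> summarize(revenue = sum(revenue))
group_vars(yearly)
#> [1] "company"

# A second summarize() drops company too, leaving an ungrouped tibble.
totals <- yearly |> summarize(revenue = sum(revenue))
group_vars(totals)
#> character(0)
```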

What if you didn’t want groups anymore?

Before: `ungroup()`

transactions %>% 
  group_by(company, year) %>% 
  summarize(revenue = sum(revenue)) %>% 
  ungroup()

`summarise()` has grouped output by 'company'. You can override using the
`.groups` argument.

# A tibble: 4 × 3
  company  year revenue
  <chr>   <int>   <int>
1 A        2019      70
2 A        2020       4
3 B        2021      10
4 B        2023      30

Before: `.groups = "drop"`

transactions %>% 
  group_by(company, year) %>% 
  summarize(revenue = sum(revenue),
            .groups = "drop")

# A tibble: 4 × 3
  company  year revenue
  <chr>   <int>   <int>
1 A        2019      70
2 A        2020       4
3 B        2021      10
4 B        2023      30

Now: `by/.by`

by/.by introduces the idea of per-operation grouping:

transactions |>
  mutate(total = sum(revenue), .by = c(company, year))

# A tibble: 6 × 4
  company  year revenue total
  <chr>   <int>   <int> <int>
1 A        2019      20    70
2 A        2019      50    70
3 A        2020       4     4
4 B        2021      10    10
5 B        2023      12    30
6 B        2023      18    30

Notice the result is no longer grouped on the way out. The grouping applies to that one operation and is then dropped.
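A small sketch to confirm this, comparing the two approaches with is_grouped_df(). The object names are illustrative.

```r
library(dplyr)

# Same data as the `transactions` tibble above, rebuilt so the sketch runs on its own.
transactions <- tibble::tibble(
  company = c("A", "A", "A", "B", "B", "B"),
  year    = c(2019L, 2019L, 2020L, 2021L, 2023L, 2023L),
  revenue = c(20L, 50L, 4L, 10L, 12L, 18L)
)

with_by <- transactions |>
  mutate(total = sum(revenue), .by = c(company, year))

with_group_by <- transactions |>
  group_by(company, year) |>
  mutate(total = sum(revenue))

# Per-operation grouping leaves a bare tibble; group_by() leaves a grouped one.
is_grouped_df(with_by)
#> [1] FALSE
is_grouped_df(with_group_by)
#> [1] TRUE

# The computed column is the same either way.
identical(with_by$total, with_group_by$total)
#> [1] TRUE
```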

group_by():

flowchart LR
  A[Bare tibble] --> B(Transaction)
  B --> C{Grouped data frame}

by/.by:

flowchart LR
  A[Bare tibble] --> B(Transaction)
  B --> C[Bare tibble]

Advantages:

summarise() doesn’t emit a message about regrouping.
You never have to remember to ungroup().
Order doesn’t matter (because you’re not peeling off layers).
You can place the grouping specification alongside the code that uses it, rather than in a separate group_by() line.
You can use tidyselect for multiple columns, including unquoted column names or tidyselections like .by = all_of(c("")).
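To illustrate the tidyselect point, here is a sketch that passes the grouping columns as a character vector through all_of(). The variable name `group_cols` is hypothetical.

```r
library(dplyr)

# Same data as the `transactions` tibble above, rebuilt so the sketch runs on its own.
transactions <- tibble::tibble(
  company = c("A", "A", "A", "B", "B", "B"),
  year    = c(2019L, 2019L, 2020L, 2021L, 2023L, 2023L),
  revenue = c(20L, 50L, 4L, 10L, 12L, 18L)
)

# Grouping columns held as strings, e.g. coming from a function argument.
group_cols <- c("company", "year")

by_strings <- transactions |>
  summarize(revenue = sum(revenue), .by = all_of(group_cols))

# Equivalent call with unquoted column names.
by_names <- transactions |>
  summarize(revenue = sum(revenue), .by = c(company, year))

by_strings$revenue
#> [1] 70  4 10 30
```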

Things to note:

by/.by only selects existing columns; it does not create new ones.
by/.by always returns an ungrouped data frame (so take note if you depend on grouped data frames with group_by()).
With by/.by, you must create your grouping columns ahead of time, for example with mutate().
.by doesn’t sort grouping keys; it keeps them in order of first appearance. group_by() always sorts keys in ascending order, which affects the row order of results from verbs like summarize().
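The last two notes can be sketched as follows. The tibble `unsorted` and the column `after_2020` are made up for illustration and are not part of the original example.

```r
library(dplyr)

# Hypothetical data where company "B" appears before "A".
unsorted <- tibble::tibble(
  company = c("B", "A", "B", "A"),
  revenue = c(10L, 20L, 12L, 50L)
)

# .by keeps keys in order of first appearance: B, then A.
first_seen <- unsorted |>
  summarize(revenue = sum(revenue), .by = company)
first_seen$company
#> [1] "B" "A"

# group_by() sorts keys: A, then B.
sorted <- unsorted |>
  group_by(company) |>
  summarize(revenue = sum(revenue))
sorted$company
#> [1] "A" "B"

# Add arrange() if you want sorted output with .by.
rearranged <- first_seen |> arrange(company)

# .by cannot compute a grouping column on the fly, so create it first.
flagged <- tibble::tibble(
  year    = c(2019L, 2019L, 2020L, 2021L, 2023L, 2023L),
  revenue = c(20L, 50L, 4L, 10L, 12L, 18L)
) |>
  mutate(after_2020 = year > 2020L) |>
  summarize(revenue = sum(revenue), .by = after_2020)
flagged$revenue
#> [1] 74 40
```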

Where did this come from?

by/.by was inspired by data.table!

In data.table, by is specified alongside the operation you want to group.
You start with a bare data table, apply the grouped operation, and end up with a bare data table again, rather than carrying around a grouped data frame as in dplyr.

# Assumes `transactions` is a data.table, e.g. via data.table::as.data.table(transactions)
transactions[, .(revenue = sum(revenue)), by = .(company, year)]

This raised the question: what if you could put the grouping in line with your summarize() call?

transactions %>%
  summarize(revenue = sum(revenue),
            .by = c(company, year))

# A tibble: 4 × 3
  company  year revenue
  <chr>   <int>   <int>
1 A        2019      70
2 A        2020       4
3 B        2021      10
4 B        2023      30

In summary

by/.by is per-operation grouping
group_by() is persistent grouping

dplyr verbs that support by/.by:

mutate()
summarize()
reframe()
filter()
slice()
slice_head() and slice_tail()
slice_min() and slice_max()
slice_sample()

`by` or `.by`?

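Some verbs prefix the argument with a dot and some don’t: mutate(), summarize(), reframe(), filter(), and slice() take .by, while the slice_*() helpers take by. Be careful in mutate() and summarize(), where by = ... is treated as an expression that creates a column named by. If you pass .by to a slice_*() helper, you get an informative error: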

transactions |>
  slice_max(revenue, n = 2, .by = company)

Error in `slice_max()`:
! Can't specify an argument named `.by` in this verb.
ℹ Did you mean to use `by` instead?

transactions %>% 
  slice_max(revenue, n = 2, by = company)

# A tibble: 4 × 3
  company  year revenue
  <chr>   <int>   <int>
1 A        2019      50
2 A        2019      20
3 B        2023      18
4 B        2023      12
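For a side-by-side view of the two spellings, this sketch uses filter(), which takes .by, and slice_head(), which takes by. The object names are illustrative.

```r
library(dplyr)

# Same data as the `transactions` tibble above, rebuilt so the sketch runs on its own.
transactions <- tibble::tibble(
  company = c("A", "A", "A", "B", "B", "B"),
  year    = c(2019L, 2019L, 2020L, 2021L, 2023L, 2023L),
  revenue = c(20L, 50L, 4L, 10L, 12L, 18L)
)

# filter() takes `.by`: keep each company's largest transaction.
largest <- transactions |>
  filter(revenue == max(revenue), .by = company)
largest$revenue
#> [1] 50 18

# slice_head() takes `by`: keep each company's first row.
first_rows <- transactions |>
  slice_head(n = 1, by = company)
first_rows$revenue
#> [1] 20 10
```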

What happens to `group_by()`?

It’s not going away! It is not deprecated or even superseded. Don’t feel pressure to use by/.by.
