Per-operation grouping

dplyr 1.1.0

dplyr
Introducing by/.by, an experimental grouping alternative to group_by().
Published

January 29, 2023

Install dplyr 1.1.0 with:

pak::pak("cran/dplyr@1.1.0")

Load the package with:

library(dplyr)

Per-operation grouping

by/.by is an experimental grouping alternative to group_by().

group_by()

group_by() groups a data frame by one or more variables.

transactions <-
  tibble::tribble(
    ~company, ~year, ~revenue,
         "A", 2019L,      20L,
         "A", 2019L,      50L,
         "A", 2020L,       4L,
         "B", 2021L,      10L,
         "B", 2023L,      12L,
         "B", 2023L,      18L
    )

Let’s say you want revenue by company and year:

transactions |>
  group_by(company, year) |>
  mutate(total = sum(revenue))
# A tibble: 6 × 4
# Groups:   company, year [4]
  company  year revenue total
  <chr>   <int>   <int> <int>
1 A        2019      20    70
2 A        2019      50    70
3 A        2020       4     4
4 B        2021      10    10
5 B        2023      12    30
6 B        2023      18    30

Notice the header line Groups: company, year [4] in the output. group_by() provides persistent grouping: the grouping stays attached to the data frame and applies to every later operation until it is removed.
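As a minimal sketch of what persistent means here (assuming dplyr is attached; the object names grouped and largest are only illustrative), the grouping set by group_by() is still present after mutate(), so a later verb such as filter() is also evaluated per group:

```r
library(dplyr)

transactions <- tibble::tribble(
  ~company, ~year, ~revenue,
  "A", 2019L, 20L,
  "A", 2019L, 50L,
  "A", 2020L,  4L,
  "B", 2021L, 10L,
  "B", 2023L, 12L,
  "B", 2023L, 18L
)

grouped <- transactions |>
  group_by(company, year) |>
  mutate(total = sum(revenue))

# The grouping survives the mutate() call
group_vars(grouped)
#> [1] "company" "year"

# so max(revenue) below is computed within each company/year group,
# keeping one row per group rather than one row overall
largest <- grouped |>
  filter(revenue == max(revenue))
```

The result of filter() is itself still grouped, which is easy to forget several steps further down a pipeline.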

If you want only the total yearly revenue of each company, you can use summarize(), which peels off one layer of grouping by default:

transactions %>% 
  group_by(company, year) %>% 
  summarize(revenue = sum(revenue))
# A tibble: 4 × 3
# Groups:   company [2]
  company  year revenue
  <chr>   <int>   <int>
1 A        2019      70
2 A        2020       4
3 B        2021      10
4 B        2023      30

(year is no longer a grouping variable; only company remains.)
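The peeling can be checked directly with group_vars(). This is a small sketch, assuming dplyr is attached; .groups = "drop_last" spells out the default behaviour explicitly, which also silences the informational message, and the object names are only illustrative:

```r
library(dplyr)

transactions <- tibble::tribble(
  ~company, ~year, ~revenue,
  "A", 2019L, 20L,
  "A", 2019L, 50L,
  "A", 2020L,  4L,
  "B", 2021L, 10L,
  "B", 2023L, 12L,
  "B", 2023L, 18L
)

by_company_year <- transactions |>
  group_by(company, year)

yearly <- by_company_year |>
  summarize(revenue = sum(revenue), .groups = "drop_last")

# Two grouping variables going in
group_vars(by_company_year)
#> [1] "company" "year"

# One left coming out: the last one (year) was peeled off
group_vars(yearly)
#> [1] "company"
```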

What if you didn’t want groups anymore?

Before: ungroup()

transactions %>% 
  group_by(company, year) %>% 
  summarize(revenue = sum(revenue)) %>% 
  ungroup()
`summarise()` has grouped output by 'company'. You can override using the
`.groups` argument.
# A tibble: 4 × 3
  company  year revenue
  <chr>   <int>   <int>
1 A        2019      70
2 A        2020       4
3 B        2021      10
4 B        2023      30

Before: .groups = "drop"

transactions %>% 
  group_by(company, year) %>% 
  summarize(revenue = sum(revenue),
            .groups = "drop")
# A tibble: 4 × 3
  company  year revenue
  <chr>   <int>   <int>
1 A        2019      70
2 A        2020       4
3 B        2021      10
4 B        2023      30

Now: by/.by

by/.by introduces the idea of per-operation grouping:

transactions |>
  mutate(total = sum(revenue), .by = c(company, year))
# A tibble: 6 × 4
  company  year revenue total
  <chr>   <int>   <int> <int>
1 A        2019      20    70
2 A        2019      50    70
3 A        2020       4     4
4 B        2021      10    10
5 B        2023      12    30
6 B        2023      18    30

Notice the result is no longer grouped on the way out. by/.by applies the grouping for that one operation and then drops it.
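A quick way to confirm this, sketched under the assumption that dplyr 1.1.0 or later is attached (the object names with_total and overall are only illustrative):

```r
library(dplyr)

transactions <- tibble::tribble(
  ~company, ~year, ~revenue,
  "A", 2019L, 20L,
  "A", 2019L, 50L,
  "A", 2020L,  4L,
  "B", 2021L, 10L,
  "B", 2023L, 12L,
  "B", 2023L, 18L
)

with_total <- transactions |>
  mutate(total = sum(revenue), .by = c(company, year))

# The grouping was used for mutate() only and is gone afterwards
is_grouped_df(with_total)
#> [1] FALSE

# A following verb therefore sees the whole table as a single group
overall <- with_total |>
  summarize(n = n())
```

Here overall has a single row counting all six transactions, because no grouping leaked out of the mutate() call.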

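With group_by(), an operation on a bare tibble hands back a grouped data frame:

flowchart LR
  A[Bare tibble] --> B(Transaction)
  B --> C{Grouped data frame}

With by/.by, an operation on a bare tibble hands back a bare tibble:

flowchart LR
  A[Bare tibble] --> B(Transaction)
  B --> C[Bare tibble]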

Advantages:

Things to note:

Where did this come from?

by/.by was inspired by data.table!

  • by is specified alongside the operation you want to apply per group
  • You start with a bare data table, run the grouped operation, and end up with a bare data table again, rather than getting a grouped data frame as with group_by() in dplyr.

transactions[, .(revenue = sum(revenue)), by = .(company, year)]
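The data.table call above only works if transactions is a data.table. A self-contained sketch, assuming the data.table package is installed and using the hypothetical name transactions_dt to keep it apart from the tibble used elsewhere:

```r
library(data.table)

transactions_dt <- data.table(
  company = c("A", "A", "A", "B", "B", "B"),
  year    = c(2019L, 2019L, 2020L, 2021L, 2023L, 2023L),
  revenue = c(20L, 50L, 4L, 10L, 12L, 18L)
)

# Grouping is given inline via `by`; nothing about it is stored on the result
totals <- transactions_dt[, .(revenue = sum(revenue)), by = .(company, year)]

# The result is a plain data.table with one row per company/year pair
class(totals)
#> [1] "data.table" "data.frame"
```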

This raised the question: what if you could put the grouping in line with your summarize() call?

transactions %>%
  summarize(revenue = sum(revenue),
            .by = c(company, year))
# A tibble: 4 × 3
  company  year revenue
  <chr>   <int>   <int>
1 A        2019      70
2 A        2020       4
3 B        2021      10
4 B        2023      30

In summary

  1. by/.by is per-operation grouping

  2. group_by() is persistent grouping

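dplyr verbs that support by/.by:

  • mutate(.by = )
  • summarize(.by = )
  • reframe(.by = )
  • filter(.by = )
  • slice(.by = )
  • slice_head(by = ) and slice_tail(by = )
  • slice_min(by = ) and slice_max(by = )
  • slice_sample(by = )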

by or .by?

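The argument is spelled .by in verbs whose other arguments already use a . prefix, such as mutate() and summarize(), and by in the slice_*() helpers such as slice_max(). If you use the wrong one, you get an informative error: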

transactions |>
  slice_max(revenue, n = 2, .by = company)
Error in `slice_max()`:
! Can't specify an argument named `.by` in this verb.
ℹ Did you mean to use `by` instead?

Switching to by works:

transactions %>% 
  slice_max(revenue, n = 2, by = company)
# A tibble: 4 × 3
  company  year revenue
  <chr>   <int>   <int>
1 A        2019      50
2 A        2019      20
3 B        2023      18
4 B        2023      12
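To see the two spellings side by side, here is a small sketch (assuming dplyr 1.1.0 or later; the object names are only illustrative) that picks each company's largest transaction once with slice_max(), which takes by, and once with filter(), which takes .by:

```r
library(dplyr)

transactions <- tibble::tribble(
  ~company, ~year, ~revenue,
  "A", 2019L, 20L,
  "A", 2019L, 50L,
  "A", 2020L,  4L,
  "B", 2021L, 10L,
  "B", 2023L, 12L,
  "B", 2023L, 18L
)

# slice_max() spells the argument `by`
top_slice <- transactions |>
  slice_max(revenue, n = 1, by = company)

# filter() spells it `.by`
top_filter <- transactions |>
  filter(revenue == max(revenue), .by = company)

# Both select the same rows for this data: A's 50 and B's 18
top_slice$revenue
top_filter$revenue
```

Neither result carries any grouping, so both can be passed on to further verbs without an ungroup() step.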

What happens to group_by()?

It’s not going away! It is not deprecated or even superseded. Don’t feel pressure to use by/.by.

Learn more