`*_join()`

dplyr 1.1.0

dplyr

dplyr 1.1.0 has relaxed assumptions for non-equi joins.

Published

January 29, 2023

Install dplyr 1.1.0 with:

pak::pak("cran/dplyr@1.1.0")

Load the package with:

library(dplyr)

Let’s create some data. In transactions, we have company IDs, years, and revenue. In companies, we have the company IDs and full company names.

transactions <-
  tibble::tribble(
    ~company, ~year, ~revenue,
         "A", 2019L,      20L,
         "A", 2019L,      50L,
         "A", 2020L,       4L,
         "B", 2021L,      10L,
         "B", 2023L,      12L,
         "B", 2023L,      18L
    )

companies <- 
  tibble::tribble(
    ~id, ~name,
    "A", "Patagonia",
    "B", "RStudio"
    )

`join_by()`

Say you want to join these two tables. You’ve been able to do this in dplyr:

transactions |> 
  inner_join(companies, by = c(company = "id"))

# A tibble: 6 × 4
  company  year revenue name     
  <chr>   <int>   <int> <chr>    
1 A        2019      20 Patagonia
2 A        2019      50 Patagonia
3 A        2020       4 Patagonia
4 B        2021      10 RStudio  
5 B        2023      12 RStudio  
6 B        2023      18 RStudio

This is a bit odd:

Using = not ==
Using c()
"id" has to be in quotes

Welcome join_by()!

transactions |> 
  inner_join(companies, by = join_by(company == id))

# A tibble: 6 × 4
  company  year revenue name     
  <chr>   <int>   <int> <chr>    
1 A        2019      20 Patagonia
2 A        2019      50 Patagonia
3 A        2020       4 Patagonia
4 B        2021      10 RStudio  
5 B        2023      12 RStudio  
6 B        2023      18 RStudio

This is a much more natural way of expressing this join.

Multiple matches

RStudio became Posit in 2023. Now, let’s add a column to track the change:

companies <-
  tibble::tribble(
    ~id, ~since, ~name,
    "A", 1973, "Patagonia",
    "B", 2009, "RStudio",
    "B", 2022, "Posit"
    )

What happens when we try to join the tables together?

transactions |> 
  inner_join(companies, by = join_by(company == id))

Warning in inner_join(transactions, companies, by = join_by(company == id)): Each row in `x` is expected to match at most 1 row in `y`.
ℹ Row 4 of `x` matches multiple rows.
ℹ If multiple matches are expected, set `multiple = "all"` to silence this
  warning.

# A tibble: 9 × 5
  company  year revenue since name     
  <chr>   <int>   <int> <dbl> <chr>    
1 A        2019      20  1973 Patagonia
2 A        2019      50  1973 Patagonia
3 A        2020       4  1973 Patagonia
4 B        2021      10  2009 RStudio  
5 B        2021      10  2022 Posit    
6 B        2023      12  2009 RStudio  
7 B        2023      12  2022 Posit    
8 B        2023      18  2009 RStudio  
9 B        2023      18  2022 Posit

In 2021, it joined B with both RStudio and Posit, creating multiple matches. We want to match with RStudio, but not with Posit (because the name hasn’t changed yet).

You see a warning:

Warning in inner_join(transactions, companies, by = join_by(company == id)): Each row in `x` is expected to match at most 1 row in `y`.
ℹ Row 3 of `x` matches multiple rows.
ℹ If multiple matches are expected, set `multiple = "all"` to silence this warning.

2023-03-22 Update

On March 22nd, Davis Vaughn announced dplyr 1.1.1. The error message that let you know when performed a join where a row from x matched more than one row from y is now removed. The tidyverse team adjusted it in such a way that “it now only catches the most dangerous type of join (a many-to-many join), meaning that you should see the warning much less often.”

Inequality joins

dplyr 1.1.0 has inequality joins: join expressions containing one of the following the inequality conditions >=, >, <=, or <.

You can think through the logic of what you would like with the year and since columns:

# `year[i] >= since`?
2021 >= 2009

[1] TRUE

2021 >= 2022

[1] FALSE

Now, you can add an inequality condition to join_by():

join_by(company == id, year >= since)

Running it with the inequality join:

transactions |>
  inner_join(companies, join_by(company == id, year >= since))

# A tibble: 8 × 5
  company  year revenue since name     
  <chr>   <int>   <int> <dbl> <chr>    
1 A        2019      20  1973 Patagonia
2 A        2019      50  1973 Patagonia
3 A        2020       4  1973 Patagonia
4 B        2021      10  2009 RStudio  
5 B        2023      12  2009 RStudio  
6 B        2023      12  2022 Posit    
7 B        2023      18  2009 RStudio  
8 B        2023      18  2022 Posit

You’re down to five! But, in 2023, you still have two matches. This is because the logic is true but not complete:

2023 >= 2009

[1] TRUE

2023 >= 2022

[1] TRUE

Rolling joins

You can use rolling joins to find the ‘closest’ inequality match.

You prefer the Posit match over the RStudio match because 2022 is closer to the transaction year of 2023 than 2009 is. Wrap something in closest() to express this:

transactions |>
  inner_join(companies, join_by(company == id, closest(year >= since)))

# A tibble: 6 × 5
  company  year revenue since name     
  <chr>   <int>   <int> <dbl> <chr>    
1 A        2019      20  1973 Patagonia
2 A        2019      50  1973 Patagonia
3 A        2020       4  1973 Patagonia
4 B        2021      10  2009 RStudio  
5 B        2023      12  2022 Posit    
6 B        2023      18  2022 Posit

closest(year >= since) finds all of the matches in since for a particular year, and then filters them down to only the closest match to that year.

Unmatched rows

Say you add a new company to transactions but forget to add them to companies:

transactions <-
  transactions |>
  tibble::add_row(company = "C",
                  year = 2023,
                  revenue = 15)

transactions

# A tibble: 7 × 3
  company  year revenue
  <chr>   <dbl>   <dbl>
1 A        2019      20
2 A        2019      50
3 A        2020       4
4 B        2021      10
5 B        2023      12
6 B        2023      18
7 C        2023      15

When you run your join, company C will disappear:

transactions |>
  inner_join(companies,
             join_by(company == id, closest(year >= since)))

# A tibble: 6 × 5
  company  year revenue since name     
  <chr>   <dbl>   <dbl> <dbl> <chr>    
1 A        2019      20  1973 Patagonia
2 A        2019      50  1973 Patagonia
3 A        2020       4  1973 Patagonia
4 B        2021      10  2009 RStudio  
5 B        2023      12  2022 Posit    
6 B        2023      18  2022 Posit

You can now catch this problem automatically by using a new quality control argument, unmatched:

transactions |>
  inner_join(
    companies, 
    join_by(company == id, closest(year >= since)),
    unmatched = "error"
  )

Error in `inner_join()`:
! Each row of `x` must have a match in `y`.
ℹ Row 7 of `x` does not have a match.

Tip

Have you been wondering why Davis is using an inner_join() instead of a left_join()? You’d use a left_join() is to ensure that rows from x are always retained, so it wouldn’t make sense to error when rows from x are also unmatched. In an inner join, both inputs can potentially drop rows, so unmatched = "error" checks for unmatched rows in both inputs.

Overlap joins

These are special cases of inequality joins popular in time series and genomics.

join_by(id, between(date, y_lower, y_upper)): You have a date in one column in one table and a range of upper/lower bounds in the other table and want to match when values are in the range in the other table.
join_by(id, overlaps(x_lower, x_upper, y_lower, y_upper)): Now, you have two sets of ranges and if they overlap at all, they match.
join_by(id, within(x_lower, x_upper, y_lower, y_upper)) or has to be completely inside the lower and upper of the other table

Cross joins

cross_join() match each row in x to every row in y, giving you the full Cartesian production.

characters <-
  tibble::tribble(
    ~person,         ~title,
     "Shiv",    "President",
  "Kendall",  "Interim CEO",
    "Logan",          "CEO",
      "Tom", "Head of News",
    "Roman",          "COO"
  )

alliances <-
  tibble::tribble(
      ~person, ~allies,
       "Shiv",   "Tom",
    "Kendall",  "Greg",
      "Roman",  "Geri"
    )

cross_join(characters, alliances)

# A tibble: 15 × 4
   person.x title        person.y allies
   <chr>    <chr>        <chr>    <chr> 
 1 Shiv     President    Shiv     Tom   
 2 Shiv     President    Kendall  Greg  
 3 Shiv     President    Roman    Geri  
 4 Kendall  Interim CEO  Shiv     Tom   
 5 Kendall  Interim CEO  Kendall  Greg  
 6 Kendall  Interim CEO  Roman    Geri  
 7 Logan    CEO          Shiv     Tom   
 8 Logan    CEO          Kendall  Greg  
 9 Logan    CEO          Roman    Geri  
10 Tom      Head of News Shiv     Tom   
11 Tom      Head of News Kendall  Greg  
12 Tom      Head of News Roman    Geri  
13 Roman    COO          Shiv     Tom   
14 Roman    COO          Kendall  Greg  
15 Roman    COO          Roman    Geri

Tip

cross_join() achieves the same effect as tidyr::crossing() but dplyr::cross_join() is always about data frames whereas tidyr::crossing() is more for individual data.