pak::pak("cran/dplyr@1.1.0")
*_join()
dplyr 1.1.0
Install dplyr 1.1.0 with:
Load the package with:
Let’s create some data. In transactions
, we have company IDs, years, and revenue. In companies
, we have the company IDs and full company names.
join_by()
Say you want to join these two tables. You’ve been able to do this in dplyr:
transactions |>
inner_join(companies, by = c(company = "id"))
# A tibble: 6 × 4
company year revenue name
<chr> <int> <int> <chr>
1 A 2019 20 Patagonia
2 A 2019 50 Patagonia
3 A 2020 4 Patagonia
4 B 2021 10 RStudio
5 B 2023 12 RStudio
6 B 2023 18 RStudio
This is a bit odd:
- Using
=
not==
- Using
c()
-
"id"
has to be in quotes
Welcome join_by()
!
transactions |>
inner_join(companies, by = join_by(company == id))
# A tibble: 6 × 4
company year revenue name
<chr> <int> <int> <chr>
1 A 2019 20 Patagonia
2 A 2019 50 Patagonia
3 A 2020 4 Patagonia
4 B 2021 10 RStudio
5 B 2023 12 RStudio
6 B 2023 18 RStudio
This is a much more natural way of expressing this join.
Multiple matches
RStudio became Posit in 2023. Now, let’s add a column to track the change:
companies <-
tibble::tribble(
~id, ~since, ~name,
"A", 1973, "Patagonia",
"B", 2009, "RStudio",
"B", 2022, "Posit"
)
What happens when we try to join the tables together?
transactions |>
inner_join(companies, by = join_by(company == id))
Warning in inner_join(transactions, companies, by = join_by(company == id)): Each row in `x` is expected to match at most 1 row in `y`.
ℹ Row 4 of `x` matches multiple rows.
ℹ If multiple matches are expected, set `multiple = "all"` to silence this
warning.
# A tibble: 9 × 5
company year revenue since name
<chr> <int> <int> <dbl> <chr>
1 A 2019 20 1973 Patagonia
2 A 2019 50 1973 Patagonia
3 A 2020 4 1973 Patagonia
4 B 2021 10 2009 RStudio
5 B 2021 10 2022 Posit
6 B 2023 12 2009 RStudio
7 B 2023 12 2022 Posit
8 B 2023 18 2009 RStudio
9 B 2023 18 2022 Posit
In 2021, it joined B
with both RStudio
and Posit
, creating multiple matches. We want to match with RStudio, but not with Posit (because the name hasn’t changed yet).
You see a warning:
Warning in inner_join(transactions, companies, by = join_by(company == id)): Each row in `x` is expected to match at most 1 row in `y`.
ℹ Row 3 of `x` matches multiple rows.
ℹ If multiple matches are expected, set `multiple = "all"` to silence this warning.
Inequality joins
dplyr 1.1.0 has inequality joins: join expressions containing one of the following the inequality conditions >=
, >
, <=
, or <
.
You can think through the logic of what you would like with the year
and since
columns:
# `year[i] >= since`?
2021 >= 2009
[1] TRUE
2021 >= 2022
[1] FALSE
Now, you can add an inequality condition to join_by()
:
join_by(company == id, year >= since)
Running it with the inequality join:
transactions |>
inner_join(companies, join_by(company == id, year >= since))
# A tibble: 8 × 5
company year revenue since name
<chr> <int> <int> <dbl> <chr>
1 A 2019 20 1973 Patagonia
2 A 2019 50 1973 Patagonia
3 A 2020 4 1973 Patagonia
4 B 2021 10 2009 RStudio
5 B 2023 12 2009 RStudio
6 B 2023 12 2022 Posit
7 B 2023 18 2009 RStudio
8 B 2023 18 2022 Posit
You’re down to five! But, in 2023, you still have two matches. This is because the logic is true but not complete:
2023 >= 2009
[1] TRUE
2023 >= 2022
[1] TRUE
Rolling joins
You can use rolling joins to find the ‘closest’ inequality match.
You prefer the Posit match over the RStudio match because 2022 is closer to the transaction year of 2023 than 2009 is. Wrap something in closest()
to express this:
transactions |>
inner_join(companies, join_by(company == id, closest(year >= since)))
# A tibble: 6 × 5
company year revenue since name
<chr> <int> <int> <dbl> <chr>
1 A 2019 20 1973 Patagonia
2 A 2019 50 1973 Patagonia
3 A 2020 4 1973 Patagonia
4 B 2021 10 2009 RStudio
5 B 2023 12 2022 Posit
6 B 2023 18 2022 Posit
closest(year >= since)
finds all of the matches insince
for a particular year, and then filters them down to only the closest match to that year.
Unmatched rows
Say you add a new company to transactions
but forget to add them to companies
:
transactions <-
transactions |>
tibble::add_row(company = "C",
year = 2023,
revenue = 15)
transactions
# A tibble: 7 × 3
company year revenue
<chr> <dbl> <dbl>
1 A 2019 20
2 A 2019 50
3 A 2020 4
4 B 2021 10
5 B 2023 12
6 B 2023 18
7 C 2023 15
When you run your join, company C will disappear:
transactions |>
inner_join(companies,
join_by(company == id, closest(year >= since)))
# A tibble: 6 × 5
company year revenue since name
<chr> <dbl> <dbl> <dbl> <chr>
1 A 2019 20 1973 Patagonia
2 A 2019 50 1973 Patagonia
3 A 2020 4 1973 Patagonia
4 B 2021 10 2009 RStudio
5 B 2023 12 2022 Posit
6 B 2023 18 2022 Posit
You can now catch this problem automatically by using a new quality control argument, unmatched
:
transactions |>
inner_join(
companies,
join_by(company == id, closest(year >= since)),
unmatched = "error"
)
Error in `inner_join()`:
! Each row of `x` must have a match in `y`.
ℹ Row 7 of `x` does not have a match.
Overlap joins
These are special cases of inequality joins popular in time series and genomics.
-
join_by(id, between(date, y_lower, y_upper))
: You have a date in one column in one table and a range of upper/lower bounds in the other table and want to match when values are in the range in the other table. -
join_by(id, overlaps(x_lower, x_upper, y_lower, y_upper))
: Now, you have two sets of ranges and if they overlap at all, they match. -
join_by(id, within(x_lower, x_upper, y_lower, y_upper))
or has to be completely inside the lower and upper of the other table
Cross joins
cross_join()
match each row in x to every row in y, giving you the full Cartesian production.
characters <-
tibble::tribble(
~person, ~title,
"Shiv", "President",
"Kendall", "Interim CEO",
"Logan", "CEO",
"Tom", "Head of News",
"Roman", "COO"
)
alliances <-
tibble::tribble(
~person, ~allies,
"Shiv", "Tom",
"Kendall", "Greg",
"Roman", "Geri"
)
cross_join(characters, alliances)
# A tibble: 15 × 4
person.x title person.y allies
<chr> <chr> <chr> <chr>
1 Shiv President Shiv Tom
2 Shiv President Kendall Greg
3 Shiv President Roman Geri
4 Kendall Interim CEO Shiv Tom
5 Kendall Interim CEO Kendall Greg
6 Kendall Interim CEO Roman Geri
7 Logan CEO Shiv Tom
8 Logan CEO Kendall Greg
9 Logan CEO Roman Geri
10 Tom Head of News Shiv Tom
11 Tom Head of News Kendall Greg
12 Tom Head of News Roman Geri
13 Roman COO Shiv Tom
14 Roman COO Kendall Greg
15 Roman COO Roman Geri