New separate_* functions

tidyr 1.3.0

tidyr
A new family of separate_*() functions supersedes separate(), separate_rows(), and extract().
Published

January 24, 2023

Install tidyr 1.3.0 with:

pak::pak("cran/tidyr@1.3.0")

separate_*() family of functions

A new family of separate_*() functions supersedes separate(), separate_rows(), and extract().

Make columns Make rows
Separate with delimiter separate(sep = string) separate_rows()
Separate by position separate(sep = integer_vector N/A
Separate with regular expression extract()
Make columns Make rows
Separate with delimiter separate_wider_delim() separate_longer_delim()
Separate by position separate_wider_position() separate_longer_position()
Separate with regular expression separate_wider_regex()

Let’s grab some data from the tidyhydat package. The Date column is made up of a date in YYYY-MM-DD format and time in HH:MM:SS format:

library(tidyhydat)
library(tidyr)
library(dplyr)

station <-
  realtime_dd(station_number = c("01CD005", "08MF005"))

station
  Queried on: 2023-03-22 03:31:40 (UTC)
  Date range: 2023-02-20 to 2023-03-22 
# A tibble: 34,144 × 8
   STATION_NUMBER PROV_TE…¹ Date                Param…² Value Grade Symbol Code 
   <chr>          <chr>     <dttm>              <chr>   <dbl> <chr> <chr>  <chr>
 1 01CD005        PE        2023-02-20 04:00:00 Flow    0.31  <NA>  <NA>   1    
 2 01CD005        PE        2023-02-20 04:05:00 Flow    0.31  <NA>  <NA>   1    
 3 01CD005        PE        2023-02-20 04:10:00 Flow    0.313 <NA>  <NA>   1    
 4 01CD005        PE        2023-02-20 04:15:00 Flow    0.31  <NA>  <NA>   1    
 5 01CD005        PE        2023-02-20 04:20:00 Flow    0.31  <NA>  <NA>   1    
 6 01CD005        PE        2023-02-20 04:25:00 Flow    0.31  <NA>  <NA>   1    
 7 01CD005        PE        2023-02-20 04:30:00 Flow    0.31  <NA>  <NA>   1    
 8 01CD005        PE        2023-02-20 04:35:00 Flow    0.31  <NA>  <NA>   1    
 9 01CD005        PE        2023-02-20 04:40:00 Flow    0.31  <NA>  <NA>   1    
10 01CD005        PE        2023-02-20 04:45:00 Flow    0.31  <NA>  <NA>   1    
# … with 34,134 more rows, and abbreviated variable names ¹​PROV_TERR_STATE_LOC,
#   ²​Parameter

Use separate_wider_position() to move these into their own columns:

station |>
  select(Date) |>
  separate_wider_position(Date,
                          widths = c(ymd = 10, space = 1, hms = 8))
# A tibble: 34,144 × 3
   ymd        space hms     
   <chr>      <chr> <chr>   
 1 2023-02-20 " "   04:00:00
 2 2023-02-20 " "   04:05:00
 3 2023-02-20 " "   04:10:00
 4 2023-02-20 " "   04:15:00
 5 2023-02-20 " "   04:20:00
 6 2023-02-20 " "   04:25:00
 7 2023-02-20 " "   04:30:00
 8 2023-02-20 " "   04:35:00
 9 2023-02-20 " "   04:40:00
10 2023-02-20 " "   04:45:00
# … with 34,134 more rows

What if we don’t want the space column?

station |>
  select(Date) |>
  separate_wider_position(Date,
                          widths = c(ymd = 10, hms = 8))
Error in `separate_wider_position()`:
! Expected 18 characters in each element of `Date`.
! 34144 values were too long.
ℹ Use `too_many = "debug"` to diagnose the problem.
ℹ Use `too_many = "drop"` to silence this message.
station |>
  select(Date) |>
  separate_wider_position(Date,
                          widths = c(ymd = 10, hms = 8),
                          too_many = "debug")
Warning: Debug mode activated: adding variables `Date_ok`, `Date_width`, and
`Date_remainder`.
# A tibble: 34,144 × 6
   ymd        hms        Date                Date_width Date_remainder Date_ok
   <chr>      <chr>      <dttm>                   <int> <chr>          <lgl>  
 1 2023-02-20 " 04:00:0" 2023-02-20 04:00:00         19 0              FALSE  
 2 2023-02-20 " 04:05:0" 2023-02-20 04:05:00         19 0              FALSE  
 3 2023-02-20 " 04:10:0" 2023-02-20 04:10:00         19 0              FALSE  
 4 2023-02-20 " 04:15:0" 2023-02-20 04:15:00         19 0              FALSE  
 5 2023-02-20 " 04:20:0" 2023-02-20 04:20:00         19 0              FALSE  
 6 2023-02-20 " 04:25:0" 2023-02-20 04:25:00         19 0              FALSE  
 7 2023-02-20 " 04:30:0" 2023-02-20 04:30:00         19 0              FALSE  
 8 2023-02-20 " 04:35:0" 2023-02-20 04:35:00         19 0              FALSE  
 9 2023-02-20 " 04:40:0" 2023-02-20 04:40:00         19 0              FALSE  
10 2023-02-20 " 04:45:0" 2023-02-20 04:45:00         19 0              FALSE  
# … with 34,134 more rows

Use NA if there are components that you don’t want to appear in the output:

station_split <-
  station |>
  select(Date) |>
  separate_wider_position(Date,
                          widths = c(ymd = 10, 1, hms = 8))

station_split
# A tibble: 34,144 × 2
   ymd        hms     
   <chr>      <chr>   
 1 2023-02-20 04:00:00
 2 2023-02-20 04:05:00
 3 2023-02-20 04:10:00
 4 2023-02-20 04:15:00
 5 2023-02-20 04:20:00
 6 2023-02-20 04:25:00
 7 2023-02-20 04:30:00
 8 2023-02-20 04:35:00
 9 2023-02-20 04:40:00
10 2023-02-20 04:45:00
# … with 34,134 more rows

Use separate_wider_delim() to break things further down:

station_split |>
  separate_wider_delim(ymd,
                       delim = "-",
                       names = c("year", "month", "day")) |>
  separate_wider_delim(hms,
                       delim = ":",
                       names = c("hour", "minute", "second"))
# A tibble: 34,144 × 6
   year  month day   hour  minute second
   <chr> <chr> <chr> <chr> <chr>  <chr> 
 1 2023  02    20    04    00     00    
 2 2023  02    20    04    05     00    
 3 2023  02    20    04    10     00    
 4 2023  02    20    04    15     00    
 5 2023  02    20    04    20     00    
 6 2023  02    20    04    25     00    
 7 2023  02    20    04    30     00    
 8 2023  02    20    04    35     00    
 9 2023  02    20    04    40     00    
10 2023  02    20    04    45     00    
# … with 34,134 more rows
stations <- 
  tidyhydat::hy_monthly_levels()