It’s common for a lot of numerical information to be encoded in strings, particularly in file names. Consider a series of microscope images of cells from different patients detailing the patient number, the cell number and the number of hours after biopsy that the image was taken. They might be named like:
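# One possible way to build the example vector (hours vary fastest, then cell, then patient).
grid <- expand.grid(hrs = c(0, 2.5), cell = 1:3, patient = 1:2)
img_names <- paste0("patient", grid$patient, "-cell", grid$cell, "-", grid$hrs, "hours-after-biopsy.tif")
img_names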
#>  [1] "patient1-cell1-0hours-after-biopsy.tif"  
#>  [2] "patient1-cell1-2.5hours-after-biopsy.tif"
#>  [3] "patient1-cell2-0hours-after-biopsy.tif"  
#>  [4] "patient1-cell2-2.5hours-after-biopsy.tif"
#>  [5] "patient1-cell3-0hours-after-biopsy.tif"  
#>  [6] "patient1-cell3-2.5hours-after-biopsy.tif"
#>  [7] "patient2-cell1-0hours-after-biopsy.tif"  
#>  [8] "patient2-cell1-2.5hours-after-biopsy.tif"
#>  [9] "patient2-cell2-0hours-after-biopsy.tif"  
#> [10] "patient2-cell2-2.5hours-after-biopsy.tif"
#> [11] "patient2-cell3-0hours-after-biopsy.tif"  
#> [12] "patient2-cell3-2.5hours-after-biopsy.tif"
As a crude first pass, you might just want all of the numbers:
library(strex)
str_extract_numbers(img_names)
#> [[1]]
#> [1] 1 1 0
#> 
#> [[2]]
#> [1] 1 1 2 5
#> 
#> [[3]]
#> [1] 1 2 0
#> 
#> [[4]]
#> [1] 1 2 2 5
#> 
#> [[5]]
#> [1] 1 3 0
#> 
#> [[6]]
#> [1] 1 3 2 5
#> 
#> [[7]]
#> [1] 2 1 0
#> 
#> [[8]]
#> [1] 2 1 2 5
#> 
#> [[9]]
#> [1] 2 2 0
#> 
#> [[10]]
#> [1] 2 2 2 5
#> 
#> [[11]]
#> [1] 2 3 0
#> 
#> [[12]]
#> [1] 2 3 2 5
This treats 2.5 as two separate numbers, 2 and 5, because the default is decimals = FALSE. To recognise decimals, set decimals = TRUE. There is also an option to recognise scientific notation (sci = TRUE), covered further down.
str_extract_numbers(img_names, decimals = TRUE)
#> [[1]]
#> [1] 1 1 0
#> 
#> [[2]]
#> [1] 1.0 1.0 2.5
#> 
#> [[3]]
#> [1] 1 2 0
#> 
#> [[4]]
#> [1] 1.0 2.0 2.5
#> 
#> [[5]]
#> [1] 1 3 0
#> 
#> [[6]]
#> [1] 1.0 3.0 2.5
#> 
#> [[7]]
#> [1] 2 1 0
#> 
#> [[8]]
#> [1] 2.0 1.0 2.5
#> 
#> [[9]]
#> [1] 2 2 0
#> 
#> [[10]]
#> [1] 2.0 2.0 2.5
#> 
#> [[11]]
#> [1] 2 3 0
#> 
#> [[12]]
#> [1] 2.0 3.0 2.5
It’s also possible to extract the non-numeric parts of the strings:
str_extract_non_numerics(img_names, decimals = TRUE)
#> [[1]]
#> [1] "patient"                "-cell"                  "-"                     
#> [4] "hours-after-biopsy.tif"
#> 
#> [[2]]
#> [1] "patient"                "-cell"                  "-"                     
#> [4] "hours-after-biopsy.tif"
#> 
#> [[3]]
#> [1] "patient"                "-cell"                  "-"                     
#> [4] "hours-after-biopsy.tif"
#> 
#> [[4]]
#> [1] "patient"                "-cell"                  "-"                     
#> [4] "hours-after-biopsy.tif"
#> 
#> [[5]]
#> [1] "patient"                "-cell"                  "-"                     
#> [4] "hours-after-biopsy.tif"
#> 
#> [[6]]
#> [1] "patient"                "-cell"                  "-"                     
#> [4] "hours-after-biopsy.tif"
#> 
#> [[7]]
#> [1] "patient"                "-cell"                  "-"                     
#> [4] "hours-after-biopsy.tif"
#> 
#> [[8]]
#> [1] "patient"                "-cell"                  "-"                     
#> [4] "hours-after-biopsy.tif"
#> 
#> [[9]]
#> [1] "patient"                "-cell"                  "-"                     
#> [4] "hours-after-biopsy.tif"
#> 
#> [[10]]
#> [1] "patient"                "-cell"                  "-"                     
#> [4] "hours-after-biopsy.tif"
#> 
#> [[11]]
#> [1] "patient"                "-cell"                  "-"                     
#> [4] "hours-after-biopsy.tif"
#> 
#> [[12]]
#> [1] "patient"                "-cell"                  "-"                     
#> [4] "hours-after-biopsy.tif"
What if we just want the cell number from each image?

nth number

We know the cell number is always the second number, so we can use the str_nth_number() function with n = 2.
str_nth_number(img_names, n = 2)
#>  [1] 1 1 2 2 3 3 1 1 2 2 3 3
To be more specific, you could say the cell number is the first number after the first instance of the word “cell”. For this, strex provides str_nth_number_after_mth(), which gives the nth number after the mth appearance of a given pattern:
str_nth_number_after_mth(img_names, "cell", n = 1, m = 1)
#>  [1] 1 1 2 2 3 3 1 1 2 2 3 3
There’s also a convenient wrapper for getting the first number after the first appearance of a pattern:
str_first_number_after_first(img_names, "cell")
#>  [1] 1 1 2 2 3 3 1 1 2 2 3 3
Now what if we want the number of hours after biopsy for each image? Looking at the image file names, we’d need the last number before the first occurrence of the word “biopsy”.
str_last_number_before_first(img_names, "biopsy", decimals = TRUE)
#>  [1] 0.0 2.5 0.0 2.5 0.0 2.5 0.0 2.5 0.0 2.5 0.0 2.5
To extract all of this information tidily, use a data frame:
data.frame(img_names,
  patient = str_first_number_after_first(img_names, "patient"),
  cell = str_first_number_after_first(img_names, "cell"),
  hrs_after_biop = str_last_number_before_first(img_names, "biop",
    decimals = TRUE
  )
)
#>                                   img_names patient cell hrs_after_biop
#> 1    patient1-cell1-0hours-after-biopsy.tif       1    1            0.0
#> 2  patient1-cell1-2.5hours-after-biopsy.tif       1    1            2.5
#> 3    patient1-cell2-0hours-after-biopsy.tif       1    2            0.0
#> 4  patient1-cell2-2.5hours-after-biopsy.tif       1    2            2.5
#> 5    patient1-cell3-0hours-after-biopsy.tif       1    3            0.0
#> 6  patient1-cell3-2.5hours-after-biopsy.tif       1    3            2.5
#> 7    patient2-cell1-0hours-after-biopsy.tif       2    1            0.0
#> 8  patient2-cell1-2.5hours-after-biopsy.tif       2    1            2.5
#> 9    patient2-cell2-0hours-after-biopsy.tif       2    2            0.0
#> 10 patient2-cell2-2.5hours-after-biopsy.tif       2    2            2.5
#> 11   patient2-cell3-0hours-after-biopsy.tif       2    3            0.0
#> 12 patient2-cell3-2.5hours-after-biopsy.tif       2    3            2.5
strex can also handle numbers written in scientific notation or with comma separators.
string <- c("$1,000", "$1e6")
str_first_number(string, commas = TRUE, sci = TRUE)
#> [1] 1e+03 1e+06
There is a whole host of functions for extracting numbers from strings in the strex package:
str_subset(ls("package:strex"), "number")
#>  [1] "str_extract_numbers"           "str_first_number"             
#>  [3] "str_first_number_after_first"  "str_first_number_after_last"  
#>  [5] "str_first_number_after_mth"    "str_first_number_before_first"
#>  [7] "str_first_number_before_last"  "str_first_number_before_mth"  
#>  [9] "str_last_number"               "str_last_number_after_first"  
#> [11] "str_last_number_after_last"    "str_last_number_after_mth"    
#> [13] "str_last_number_before_first"  "str_last_number_before_last"  
#> [15] "str_last_number_before_mth"    "str_nth_number"               
#> [17] "str_nth_number_after_first"    "str_nth_number_after_last"    
#> [19] "str_nth_number_after_mth"      "str_nth_number_before_first"  
#> [21] "str_nth_number_before_last"    "str_nth_number_before_mth"    
#> [23] "str_split_by_numbers"
Of course, all of the above is possible with regular expressions using stringr; it’s just more difficult and less expressive:
data.frame(img_names,
  patient = str_match(img_names, "patient(\\d+)")[, 2],
  cell = str_match(img_names, "cell(\\d+)")[, 2],
  hrs_after_biop = str_match(img_names, "(\\d+(?:\\.\\d+)?)hour")[, 2]
)
#>                                   img_names patient cell hrs_after_biop
#> 1    patient1-cell1-0hours-after-biopsy.tif       1    1              0
#> 2  patient1-cell1-2.5hours-after-biopsy.tif       1    1            2.5
#> 3    patient1-cell2-0hours-after-biopsy.tif       1    2              0
#> 4  patient1-cell2-2.5hours-after-biopsy.tif       1    2            2.5
#> 5    patient1-cell3-0hours-after-biopsy.tif       1    3              0
#> 6  patient1-cell3-2.5hours-after-biopsy.tif       1    3            2.5
#> 7    patient2-cell1-0hours-after-biopsy.tif       2    1              0
#> 8  patient2-cell1-2.5hours-after-biopsy.tif       2    1            2.5
#> 9    patient2-cell2-0hours-after-biopsy.tif       2    2              0
#> 10 patient2-cell2-2.5hours-after-biopsy.tif       2    2            2.5
#> 11   patient2-cell3-0hours-after-biopsy.tif       2    3              0
#> 12 patient2-cell3-2.5hours-after-biopsy.tif       2    3            2.5
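Note that str_match() returns a character matrix, so the columns extracted this way are strings rather than numbers, which is why 0 prints as 0 here instead of 0.0. The following is a minimal sketch of one way to get numeric columns from the stringr approach. The two-element img_names is a shortened stand-in used only for illustration, and the regex (digits, then an optional single decimal part) is a choice made for this sketch.

```r
library(stringr)

# Shortened stand-in for the img_names vector (illustrative only).
img_names <- c(
  "patient1-cell1-0hours-after-biopsy.tif",
  "patient2-cell3-2.5hours-after-biopsy.tif"
)

# str_match() returns a character matrix; column 2 holds the first capture group.
hrs_chr <- str_match(img_names, "(\\d+(?:\\.\\d+)?)hours")[, 2]

# Convert explicitly so downstream arithmetic works on numbers, not strings.
df <- data.frame(
  img_names,
  patient = as.integer(str_match(img_names, "patient(\\d+)")[, 2]),
  cell = as.integer(str_match(img_names, "cell(\\d+)")[, 2]),
  hrs_after_biop = as.numeric(hrs_chr)
)
str(df)
```

The strex functions return numeric vectors directly, so this extra conversion step is only needed on the stringr route.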