A common setup is the following: x is a dataset of names with mistakes and abbreviations, while y is a dataset of ids and their possible names (the “true” or “master” dataset). The goal is to find the id of each observation in x.
The package includes two functions that can be applied to the “true” dataset y before using fuzzy_join. They help control both type I and type II errors when matching on names.
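For concreteness, a minimal sketch of such a setup (the values below are illustrative and not drawn from the package):

library(data.table)
# x: messy names to be matched (hypothetical data)
x <- data.table(name = c("coca cola co.", "aple inc"))
# y: master dataset mapping each id to its known names (hypothetical data)
y <- data.table(id = c(1, 1, 2, 2),
                name = c("coca cola company", "coca cola incorporated",
                         "apple incorporated", "apple corp"))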
count_combinations returns a data.frame with four columns. Within each pair (id, name), it computes all permutations of length lower than n and returns the number of occurrences of each permutation within and across groups:

id <- c(1, 1, 2, 2)
name <- c("coca cola company", "coca cola incorporated", "apple incorporated", "apple corp")
count_combinations(name, id = id)
#> id name count_within count_across
#> 1 1 coca 2 1
#> 2 1 cola 2 1
#> 3 1 company 1 1
#> 4 1 incorporated 1 2
#> 5 2 apple 2 1
#> 6 2 corp 1 1
#> 7 2 incorporated 1 2

Words with a high count_within and a low count_across are good identifiers, since they are specific to one id. Conversely, words with a low count_within and a high count_across are poor identifiers, and one may want to remove these words from x and y.
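As an illustration (a sketch in base R, not part of the package API), the output above could be used to strip words that appear across several ids before matching:

counts <- count_combinations(name, id = id)
# words appearing under more than one id (here, "incorporated") are weak identifiers
common <- unique(counts$name[counts$count_across > 1])
# remove them from the name vector; the same cleaning could be applied to x
name_cleaned <- trimws(gsub(paste0("\\b(", paste(common, collapse = "|"), ")\\b"), "", name))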
The function compute_distance returns a data.frame with three columns. Within each pair (id, name), it computes all permutations of length lower than n and returns the minimum string distance between the name and the names associated with other ids.
id <- c(1, 1, 2, 2)
name <- c("coca cola company", "coca cola incorporated", "apple incorporated", "apple corp")
compute_distance(name, id = id, n = 0)
#> id name distance
#> 1 1 coca cola company 0.5087146
#> 2 1 coca cola incorporated 0.2727273
#> 3 2 apple corp 0.4701299
#> 4 2 apple incorporated 0.2727273
compute_distance(name, id = id, n = 1)
#> id name distance
#> 1 1 coca 0.2666667
#> 2 1 cola 0.2666667
#> 3 1 company 0.2190476
#> 4 1 incorporated 0.0000000
#> 5 2 apple 0.5166667
#> 6 2 corp 0.2190476
#> 7 2 incorporated 0.0000000
Again, words with a low distance should be discarded, while words with a high distance could be added to the “master” dataset y.
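As a sketch of that workflow (the 0.2 threshold and the shape of y are assumptions for illustration), one could keep only sufficiently distinctive words and add them as extra admissible names for each id:

dist_words <- compute_distance(name, id = id, n = 1)
# keep words whose minimum distance to other ids exceeds an arbitrary threshold
distinctive <- dist_words[dist_words$distance > 0.2, c("id", "name")]
# append them to the master dataset as additional names for each id
y_augmented <- rbind(data.frame(id = id, name = name), distinctive)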
For each row in x, fuzzy_join finds the closest row(s) in y according to a specific metric. The distance is a weighted average of string distances computed over multiple columns. Both the weights and the string distance can be specified by the user. By default, fuzzy_join uses the Jaro-Winkler distance with a Winkler adjustment of 0.1 (which lowers the distance between strings sharing a common prefix).
x <- data.table(a = c("france", "franc"), b = c("arras", "dijon"))
y <- data.table(a = c("franc", "france"), b = c("arvars", "dijjon"))
fuzzy_join(x, y, fuzzy = c("a", "b"), w = c(0.1, 0.9))
#> distance a.x b.x a.y b.y
#> 1: 0.09133333 france arras franc arvars
#> 2: 0.03833333 franc dijon france dijjon
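This distance can be reproduced by hand with stringdist, assuming fuzzy_join uses the same Jaro-Winkler parameters; a minimal sketch for the first row above:

library(stringdist)
# Jaro-Winkler distances with a Winkler prefix adjustment of 0.1
d_a <- stringdist("france", "franc", method = "jw", p = 0.1)  # about 0.033
d_b <- stringdist("arras", "arvars", method = "jw", p = 0.1)  # about 0.098
# weighted average with w = c(0.1, 0.9), about 0.0913 as in the first row above
0.1 * d_a + 0.9 * d_b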
fuzzy_join(x, y, exact = "a", fuzzy = "b")
#> distance a.x b.x a.y b.y
#> 0 franc dijon franc arvars
#> 0 france arras france dijjon
The function corresponds roughly to the Stata command reclink. The type of string distance can be arbitrarily specified thanks to the stringdist package.
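For reference, a quick sketch of a few alternative metrics available in stringdist (how a different metric is passed to fuzzy_join depends on its arguments and is not shown here):

library(stringdist)
stringdist("france", "franc", method = "jw", p = 0.1)  # Jaro-Winkler, as used above
stringdist("france", "franc", method = "lv")           # Levenshtein edit distance
stringdist("france", "franc", method = "osa")          # optimal string alignment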