fuzzy_rbind: fuzzy_rbind
In hkarp1/messy.cats: Employs String Distance Tools to Help Clean Categorical Data

fuzzy_rbind

Description

fuzzy_rbind() row-binds two dataframes whose column names differ slightly, matching columns by the string distance between their names.

Usage

fuzzy_rbind(
  df1,
  df2,
  threshold,
  method = "jw",
  q = 1,
  p = 0,
  bt = 0,
  useBytes = FALSE,
  weight = c(d = 1, i = 1, t = 1)
)

Arguments

`df1`	The first dataframe to be bound.
`df2`	The second dataframe to be bound.
`threshold`	The maximum string distance allowed between column names. If the distance between two column names is greater than this threshold, the columns will not be bound.
`method`	The type of string distance calculation to use. Possible methods are osa, lv, dl, hamming, lcs, qgram, cosine, jaccard, jw, and soundex. See the stringdist package for more information. Default: 'jw'
`q`	Size of the q-gram used in string distance calculation. Default: 1
`p`	Only used with method "jw": the Jaro-Winkler penalty size. Default: 0
`bt`	Only used with method "jw" with p > 0, Winkler's boost threshold. Default: 0
`useBytes`	Whether or not to perform byte-wise comparison. Default: FALSE
`weight`	Only used with methods "osa" or "dl", a vector representing the penalty for deletion, insertion, substitution, and transposition, in that order. Default: c(d = 1, i = 1, t = 1)

Details

Column names often differ slightly between datasets that hold the same variables. fuzzy_rbind() binds such dataframes by fuzzy matching their column names.
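To make the idea concrete, here is a minimal sketch of fuzzy column matching followed by a row bind. It is not the messy.cats implementation: `sketch_fuzzy_rbind` is a hypothetical helper, it uses base R's `utils::adist()` (Levenshtein distance, scaled by the longer name's length) rather than the stringdist methods listed above, and it simply drops columns that have no match within the threshold, which may differ from what fuzzy_rbind() does.

```r
# Hypothetical helper for illustration only; NOT the messy.cats implementation.
sketch_fuzzy_rbind <- function(df1, df2, threshold) {
  n1 <- colnames(df1)
  n2 <- colnames(df2)

  # Levenshtein distance between every pair of names (rows: df1, columns: df2),
  # scaled to [0, 1] by the length of the longer name in each pair.
  d <- utils::adist(n1, n2) / outer(nchar(n1), nchar(n2), pmax)

  # For each df1 column, find the closest df2 column and its distance.
  best <- apply(d, 1, which.min)
  best_dist <- d[cbind(seq_along(n1), best)]

  # Assumption: columns without a match within the threshold are dropped.
  keep <- best_dist <= threshold

  # Rename the matched df2 columns to the df1 names, then bind the rows.
  matched <- df2[, best[keep], drop = FALSE]
  colnames(matched) <- n1[keep]
  rbind(df1[, keep, drop = FALSE], matched)
}

df_a <- data.frame(name = c("a", "b"), score = c(1, 2), stringsAsFactors = FALSE)
df_b <- data.frame(Name = "c", scores = 3, extra = TRUE, stringsAsFactors = FALSE)

both   <- sketch_fuzzy_rbind(df_a, df_b, threshold = 0.3)  # name and score both match
strict <- sketch_fuzzy_rbind(df_a, df_b, threshold = 0.2)  # only score matches
```

With the lenient threshold both `Name` and `scores` are close enough to be bound under the df1 names. With the stricter one, `name` versus `Name` (scaled distance 0.25) no longer qualifies, so only `score` is bound.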

Value

fuzzy_rbind() returns a dataframe that binds the two input dataframes based on the closest matching columns. Column names from the first dataframe (df1) are preserved.

Examples

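if (interactive()) {
  # Copy mtcars and add suffixes so its column names no longer match exactly
  mtcars_colnames_messy <- mtcars
  colnames(mtcars_colnames_messy)[1:5] <- paste0(colnames(mtcars)[1:5], "_17")
  colnames(mtcars_colnames_messy)[6:11] <- paste0(colnames(mtcars)[6:11], "_2017")
  # Lenient threshold: column names up to a distance of 0.5 apart can be bound
  x <- fuzzy_rbind(mtcars, mtcars_colnames_messy, .5)
  # Stricter threshold: only column names within a distance of 0.2 are bound
  x <- fuzzy_rbind(mtcars, mtcars_colnames_messy, .2)
}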

hkarp1/messy.cats documentation built on Feb. 9, 2023, 10:42 a.m.