
dfuzz

The goal of {dfuzz} is to help you clean up a messy column of character strings in your tibble or data.frame.

This package is highly experimental and is not yet ready for real applications.

It is built around two dependencies, which themselves have no dependencies:

{rlang}
{stringdist}, an excellent package whose function stringdist() you can use to its full power.
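
To give a feel for what stringdist() measures, here is a minimal sketch that calls {stringdist} directly on the fruit names used in the toy example further down. It only illustrates the underlying distance computation; how {dfuzz} forwards arguments to stringdist() is not shown here, and the choice of the "osa" and "jw" methods is an assumption made for illustration.

```r
library(stringdist)

# Same strings as in the toy example of this README
fruits <- c("banana", "blueberry", "limon", "pinapple",
            "aple", "apple", "ApplE", "bonana")

# Default method is "osa" (optimal string alignment)
stringdist("aple", "apple")    # 1: one missing letter
stringdist("apple", "ApplE")   # 2: the comparison is case sensitive

# Other metrics are selected with `method`, e.g. Jaro-Winkler
stringdist("aple", "apple", method = "jw", p = 0.1)

# All pairwise distances at once; small values flag likely typos
d <- stringdistmatrix(fruits, fruits, method = "osa")
dimnames(d) <- list(fruits, fruits)
d["banana", "bonana"]          # 1
```

Pairs at a small distance, such as "aple"/"apple" or "banana"/"bonana", are the kind of near-duplicates the workflows below try to reconcile.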

{dfuzz} aims to be compatible with both tidyverse and base R dialects.

You can install this package using {remotes} (or {devtools}):

remotes::install_github("courtiol/dfuzz")

library(dfuzz)

## a toy example:
test_df <- data.frame(fruit = c("banana", "blueberry", "limon", "pinapple",
                                "aple", "apple", "ApplE", "bonana"))
test_df
#>       fruit
#> 1    banana
#> 2 blueberry
#> 3     limon
#> 4  pinapple
#> 5      aple
#> 6     apple
#> 7     ApplE
#> 8    bonana

## quick and dirty workflow:
clean_df1 <- fuzzy_tidy(test_df, fruit)
clean_df1
#>       fruit fruit.clean fruit.cleaned fruit.tidy
#> 1    banana        <NA>        banana     banana
#> 2 blueberry   blueberry          <NA>  blueberry
#> 3     limon       limon          <NA>      limon
#> 4  pinapple    pinapple          <NA>   pinapple
#> 5      aple        <NA>          aple       aple
#> 6     apple        <NA>          aple       aple
#> 7     ApplE       ApplE          <NA>      ApplE
#> 8    bonana        <NA>        banana     banana

## more subtle workflow:
template_fruit <- fuzzy_match(test_df, fruit)
template_fruit
#>   selected  syn_1  syn_2
#> 1     aple   aple  apple
#> 2   banana banana bonana
template_fruit$selected[1] <- "apple"
clean_df2 <- fuzzy_tidy(test_df, fruit, template_fruit)
clean_df2
#>       fruit fruit.clean fruit.cleaned fruit.tidy
#> 1    banana        <NA>        banana     banana
#> 2 blueberry   blueberry          <NA>  blueberry
#> 3     limon       limon          <NA>      limon
#> 4  pinapple    pinapple          <NA>   pinapple
#> 5      aple        <NA>         apple      apple
#> 6     apple        <NA>         apple      apple
#> 7     ApplE       ApplE          <NA>      ApplE
#> 8    bonana        <NA>        banana     banana

## quick and dirty workflow with {tidyverse}:
library(tidyverse)
#> ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
#> ✓ ggplot2 3.3.2     ✓ purrr   0.3.4
#> ✓ tibble  3.0.4     ✓ dplyr   1.0.2
#> ✓ tidyr   1.1.2     ✓ stringr 1.4.0
#> ✓ readr   1.4.0     ✓ forcats 0.5.0
#> ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
#> x dplyr::filter() masks stats::filter()
#> x dplyr::lag()    masks stats::lag()
test_df %>%
  fuzzy_tidy(fruit) %>%
  mutate(fruit = fruit.tidy) %>%
  select(-contains("fruit."))
#> # A tibble: 8 x 1
#>   fruit    
#>   <chr>    
#> 1 banana   
#> 2 blueberry
#> 3 limon    
#> 4 pinapple 
#> 5 aple     
#> 6 aple     
#> 7 ApplE    
#> 8 banana

## more subtle workflow with {tidyverse}:
test_df %>%
  mutate(fruit = str_to_title(fruit)) %>%
  fuzzy_match(fruit) -> template_fruit
template_fruit
#> # A tibble: 2 x 3
#>   selected syn_1  syn_2 
#>   <chr>    <chr>  <chr> 
#> 1 Aple     Aple   Apple 
#> 2 Banana   Banana Bonana

template_fruit %>%
  mutate(selected = fct_recode(selected, Apple = "Aple")) -> better_template_fruit

better_template_fruit
#> # A tibble: 2 x 3
#>   selected syn_1  syn_2 
#>   <fct>    <chr>  <chr> 
#> 1 Apple    Aple   Apple 
#> 2 Banana   Banana Bonana

test_df %>%
  mutate(fruit = str_to_title(fruit)) %>%
  fuzzy_tidy(fruit, better_template_fruit) -> clean_df3
clean_df3
#> # A tibble: 8 x 4
#>   fruit     fruit.clean fruit.cleaned fruit.tidy
#>   <chr>     <chr>       <chr>         <chr>     
#> 1 Banana    <NA>        Banana        Banana    
#> 2 Blueberry Blueberry   <NA>          Blueberry 
#> 3 Limon     Limon       <NA>          Limon     
#> 4 Pinapple  Pinapple    <NA>          Pinapple  
#> 5 Aple      <NA>        Apple         Apple     
#> 6 Apple     <NA>        Apple         Apple     
#> 7 Apple     <NA>        Apple         Apple     
#> 8 Bonana    <NA>        Banana        Banana

clean_df3 %>%
  mutate(fruit = fruit.tidy) %>%
  select(-contains("fruit."))
#> # A tibble: 8 x 1
#>   fruit    
#>   <chr>    
#> 1 Banana   
#> 2 Blueberry
#> 3 Limon    
#> 4 Pinapple 
#> 5 Apple    
#> 6 Apple    
#> 7 Apple    
#> 8 Banana
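
For readers who prefer base R, the final clean-up step (overwrite the original column with the tidy one, then drop the helper columns) can be sketched without {tidyverse}. The data frame below is a hypothetical stand-in, typed by hand from the printed output of the "more subtle workflow" above, so that the sketch runs without {dfuzz}; the names clean_df, helper_cols and result are illustrative only.

```r
# Hypothetical stand-in for the output of fuzzy_tidy(test_df, fruit, template_fruit)
clean_df <- data.frame(
  fruit         = c("banana", "blueberry", "limon", "pinapple",
                    "aple", "apple", "ApplE", "bonana"),
  fruit.clean   = c(NA, "blueberry", "limon", "pinapple",
                    NA, NA, "ApplE", NA),
  fruit.cleaned = c("banana", NA, NA, NA,
                    "apple", "apple", NA, "banana"),
  fruit.tidy    = c("banana", "blueberry", "limon", "pinapple",
                    "apple", "apple", "ApplE", "banana"),
  stringsAsFactors = FALSE
)

# Overwrite the messy column with the tidy one
clean_df$fruit <- clean_df$fruit.tidy

# Drop every column whose name contains the literal text "fruit."
helper_cols <- grepl("fruit.", names(clean_df), fixed = TRUE)
result <- clean_df[, !helper_cols, drop = FALSE]  # drop = FALSE keeps a data.frame
result
```

Using fixed = TRUE makes grepl() match "fruit." literally, which mirrors what contains("fruit.") does in the pipeline above.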

If you think this package is an idea worth pursuing, please let me know. Development is always more fun when it is collaborative, so please email me (or open an issue) if you want to get involved!

courtiol/dfuzz documentation built on Oct. 28, 2020, 6 a.m.
