In chainsawriot/textsdc: Statistical Data Cleaning For Text Data

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
devtools::load_all()

textsdc

The goal of textsdc (text statistical data cleaning) is to clean text data statistically. The current version supports:

text deduplication using a very simple similarity-based algorithm.

Future versions should support:

removal of boilerplate text.

Related packages:

quanteda - for text analysis
textclean - for normalization of text data

Installation

You can install the experimental version of textsdc from GitHub:

devtools::install_github("chainsawriot/textsdc")

Example

Deduplication

Calculate the possible duplicates in your input text.

library(textsdc)

lyrics <- c("He drinks a Whiskey drink",
            "he drinks a Vodka drink",
            "He drinks a Lager drink",
            "he drinks a Cider drink",
            "He sings the songs that remind him of the good times",
            "He sings the songs that remind him of the best times",
            "Oh Danny Boy",
            "Danny Boy",
            "Danny Boy",
            "I get knocked down, but I get up again",
            "You are never gonna keep me down",
            "I get knocked down, but I get up again",
            "You are never gonna keep me down",
            "I get knocked down, but I get up again",
            "You are never gonna keep me down",
            "I get knocked down, but I get up again",
            "You are never gonna keep me down")
dups <- calculate_textsdc(lyrics)
dups

dups$dist_matrix
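To make the idea of similarity-based deduplication concrete, here is a minimal base R sketch. It is not textsdc's implementation: the similarity measure (edit distance from `utils::adist`, scaled by string length) and the rule of keeping the earlier text are assumptions chosen only for illustration.

```r
# Illustrative only: normalized edit-distance similarity, NOT textsdc's actual measure.
texts <- c("Danny Boy", "Oh Danny Boy", "Danny Boy",
           "You are never gonna keep me down")

# Generalized Levenshtein distance between every pair of texts (base R).
edit_dist <- utils::adist(texts)

# Scale by the longer string of each pair so that similarity falls in [0, 1].
lens <- nchar(texts)
max_len <- outer(lens, lens, pmax)
sim <- 1 - edit_dist / max_len

# Flag a text as a duplicate if it is similar enough to any EARLIER text.
threshold <- 0.7
earlier <- upper.tri(sim)  # TRUE where row index < column index
is_dup <- colSums((sim >= threshold) & earlier) > 0

kept <- texts[!is_dup]
kept
```

Raising `threshold` in this sketch makes it stricter, so fewer texts are flagged as duplicates.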

Extract the deduplicated version.

clean_textsdc(dups)

Adjust the threshold for duplication.

dups2 <- calculate_textsdc(lyrics, threshold = 0.9)
dups2

clean_textsdc(dups2)

You can also use a percentile-based threshold, e.g. assuming that 70% of the articles are not duplicates.

dups3 <- calculate_textsdc(lyrics, threshold = 0.7, percentile = TRUE)
dups3

clean_textsdc(dups3)
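One way to think about a percentile-based threshold is sketched below. This is an assumption about the general technique, not a description of how textsdc computes it: the scores in `max_sim` are made-up values standing for each text's highest similarity to any other text, and the cut-off is taken as the 70th percentile of those scores.

```r
# Hypothetical scores: each text's highest similarity to any other text.
max_sim <- c(0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.95, 0.97, 1.00)

# Assuming 70% of the texts are NOT duplicates, use the 70th percentile
# of the scores as the cut-off instead of a fixed similarity value.
cutoff <- unname(quantile(max_sim, probs = 0.7))

# Texts scoring above the cut-off are treated as possible duplicates.
possible_dup <- max_sim > cutoff
sum(possible_dup)
```

The appeal of this approach is that the cut-off adapts to the distribution of scores in the corpus, so you state an expected share of duplicates rather than a similarity value.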

CJK language

demands2 <- c("徹底撤回修例",
              "收回暴動定義",
              "撤銷對至今為止所有反送中抗爭者控罪",
              "徹底追究警隊濫權情況",
              "以行政命令解散立法會，立即實行雙真普選",
              "撤銷對至今為止所有反送中抗爭者控罪",
              "解散立法會，立即實行雙真普選")
dups4 <- calculate_textsdc(demands2, threshold = 0.7, percentile = TRUE)
dups4

clean_textsdc(dups4)

There are four precedence options for choosing which text to keep when duplicates are found.

Default: earlier

metallica <- c("The Unforgiven",
               "The Unforgiven II",
               "The Unforgiven III",
               "Fight Fire With Fire",
               "Master of Puppets",
               "For Whom The Bell Tolls",
               "For Whom The Bell Toll",
               "Master of Puppets")
metallica_dups <- calculate_textsdc(metallica, threshold = 0.7)
clean_textsdc(metallica_dups)

Longer

clean_textsdc(metallica_dups, precedence = "longer")

Shorter

clean_textsdc(metallica_dups, precedence = "shorter")

Random

clean_textsdc(metallica_dups, precedence = "random")
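The four precedence rules can be pictured as different ways of picking one representative from a group of texts judged to be duplicates of each other. The helper `pick_one()` below is hypothetical and not part of textsdc; the grouping itself is assumed to have been done already.

```r
# Hypothetical helper (not part of textsdc): pick one text from a duplicate group.
pick_one <- function(group,
                     precedence = c("earlier", "longer", "shorter", "random")) {
  precedence <- match.arg(precedence)
  switch(precedence,
         earlier = group[1],                         # first in input order
         longer  = group[which.max(nchar(group))],   # most characters
         shorter = group[which.min(nchar(group))],   # fewest characters
         random  = sample(group, 1))                 # any member, at random
}

# Suppose these three titles were grouped together as duplicates.
group <- c("The Unforgiven", "The Unforgiven II", "The Unforgiven III")

pick_one(group)             # default: "earlier"
pick_one(group, "longer")
pick_one(group, "shorter")
```

With `"random"`, call `set.seed()` first if you need the choice to be reproducible.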

chainsawriot/textsdc documentation built on Dec. 31, 2021, 9:54 a.m.
