wersim: Introducing additional Word Error into a corpus

In jenswaeckerle/wersim: Simulating and Measuring Word Error

View source: R/wersim.R

wersim

R Documentation

Introducing additional Word Error into a corpus

Description

This function introduces additional word error into a corpus. Word error consists of three components: deletions (words that are missing from the ASR transcription), insertions (words in the ASR transcription that are not in the reference text) and substitutions (words that the ASR system transcribed inaccurately). You specify the ratio between deletions, insertions and substitutions through the corresponding arguments.

For deletions (D), we randomly draw a unique token from the corpus and delete it from a randomly chosen text in which it occurs, repeating this until the required number of deletions ND is reached. For insertions (I), we randomly select a token from the corpus and insert it into a text, repeating this NI times. For substitutions (S), we randomly select a unique token and replace it with the token from the corpus that has the smallest Levenshtein distance (Levenshtein, 1966) to the selected token, repeating this NS times. The Levenshtein distance measures the similarity between two strings as the number of single-character edits needed to turn one into the other; for example, "bad" may be replaced by "bat".

The random draw of the token for all three operations is weighted by the number of times the word occurred in a specific text. This means that more common words, and words in longer texts, are more likely to be chosen.
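The three operations can be sketched on a toy count matrix in base R. This is a minimal illustration under stated assumptions, not the package's implementation: the matrix `m` and the helpers `draw_cell`, `nearest_token`, `delete_one`, `insert_one` and `substitute_one` are hypothetical names, and the sketch assumes that an inserted token is added to the same text it was drawn from, which the description above does not specify.

```r
# Illustrative sketch only; this is NOT the wersim source code.
set.seed(42)

# Toy document-feature matrix (rows = texts, columns = tokens); hypothetical data.
m <- rbind(
  t1 = c(the = 2, bad = 1, bat = 0, cat = 1, house = 0),
  t2 = c(the = 1, bad = 0, bat = 1, cat = 0, house = 1)
)

# Draw one (text, token) cell with probability proportional to its count.
# Cells with a count of zero can never be drawn.
draw_cell <- function(m) {
  idx <- sample.int(length(m), 1, prob = as.vector(m))
  arrayInd(idx, dim(m))[1, ]  # c(row, column)
}

# Closest *other* token by Levenshtein distance (ties: first candidate wins).
nearest_token <- function(token, vocab) {
  candidates <- setdiff(vocab, token)
  candidates[which.min(utils::adist(token, candidates)[1, ])]
}

# Deletion: remove one occurrence of the drawn token from the drawn text.
delete_one <- function(m) {
  cell <- draw_cell(m)
  m[cell[1], cell[2]] <- m[cell[1], cell[2]] - 1
  m
}

# Insertion: add one occurrence of the drawn token.
# Assumption: the receiving text is the text the token was drawn from.
insert_one <- function(m) {
  cell <- draw_cell(m)
  m[cell[1], cell[2]] <- m[cell[1], cell[2]] + 1
  m
}

# Substitution: move one occurrence to the nearest token by edit distance.
substitute_one <- function(m) {
  cell <- draw_cell(m)
  new <- nearest_token(colnames(m)[cell[2]], colnames(m))
  m[cell[1], cell[2]] <- m[cell[1], cell[2]] - 1
  m[cell[1], new] <- m[cell[1], new] + 1
  m
}

nearest_token("bad", colnames(m))  # "bat", as in the example above
```

A deletion lowers the total token count by one, an insertion raises it by one, and a substitution leaves it unchanged while shifting one count to a similar-looking token.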

Usage

wersim(x, target_wer = 0.05, deletions = 0.13, insertions = 0.22,
  substitutions = 0.65, groupingvar)

Arguments

`x`	A quanteda corpus to be modified
`target_wer`	The additional word error to introduce into the corpus (default 0.05)
`deletions`	The share of the added word error that should be introduced through deletions (default 0.13)
`insertions`	The share of the added word error that should be introduced through insertions (default 0.22)
`substitutions`	The share of the added word error that should be introduced through substitutions (default 0.65)
`groupingvar`	The variable that groups the corpus
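To see how these arguments might interact, the following sketch splits a target error into the operation counts ND, NI and NS. It assumes that the target is a fraction of the total token count and that the three shares are proportions summing to 1, as the defaults do. The helper `error_counts` is hypothetical, and the rounding used inside `wersim` may differ.

```r
# Illustrative arithmetic only; NOT taken from the wersim source code.
error_counts <- function(n_tokens, target_wer = 0.05, deletions = 0.13,
                         insertions = 0.22, substitutions = 0.65) {
  shares <- c(deletions = deletions, insertions = insertions,
              substitutions = substitutions)
  # Assumption: the shares are proportions of the added error and sum to 1.
  stopifnot(isTRUE(all.equal(sum(shares), 1)))
  # Assumption: target_wer is measured against the total number of tokens.
  round(n_tokens * target_wer * shares)
}

error_counts(10000)
#     deletions    insertions substitutions
#            65           110           325
```

With the default arguments and a corpus of 10,000 tokens, this yields 500 errors in total, most of them substitutions.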

Value

Returns a quanteda dfm of the corpus with added error as specified.

Examples

# For an example, please see the documentation of the wersimtext function

jenswaeckerle/wersim documentation built on Dec. 7, 2022, 9:31 a.m.
