txt_clean_word2vec: Text cleaning specific for input to word2vec


View source: R/utils.R

Description

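Standardise text by converting it from UTF-8 to ASCII, keeping only alphanumeric characters, lowercasing, and trimming leading/trailing white space.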

Usage

txt_clean_word2vec(x, ascii = TRUE, alpha = TRUE, tolower = TRUE, trim = TRUE)

Arguments

x

a character vector in UTF-8 encoding

ascii

logical indicating whether to use iconv to convert the input from UTF-8 to ASCII. Defaults to TRUE.

alpha

logical indicating whether to keep only alphanumeric characters. Defaults to TRUE.

tolower

logical indicating whether to lowercase x. Defaults to TRUE.

trim

logical indicating whether to trim leading/trailing white space. Defaults to TRUE.

Value

a character vector of the same length as x, standardised by converting the encoding to ASCII, lowercasing, and keeping only alphanumeric characters

Examples

x <- c("  Just some.texts,  ok?", "123.456 and\tsome MORE!  ")
txt_clean_word2vec(x)

Example output

[1] "just some texts ok"    "123 456 and some more"
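
To make the effect of each argument concrete, the cleaning steps can be approximated in base R. This is only a sketch: `clean_sketch` is a hypothetical name, the regular expression and the `ASCII//TRANSLIT` target are assumptions, and the package's actual implementation in R/utils.R may differ in order and detail.

```r
# Hypothetical re-implementation for illustration only; not the package source.
clean_sketch <- function(x, ascii = TRUE, alpha = TRUE, tolower = TRUE, trim = TRUE) {
  if (ascii) {
    # Assumption: transliterate non-ASCII characters where the platform allows it
    x <- iconv(x, from = "UTF-8", to = "ASCII//TRANSLIT")
  }
  if (alpha) {
    # Replace each run of non-alphanumeric characters with a single space
    x <- gsub("[^[:alnum:]]+", " ", x)
  }
  if (tolower) {
    # base:: prefix avoids confusion with the argument of the same name
    x <- base::tolower(x)
  }
  if (trim) {
    # Drop leading and trailing white space
    x <- trimws(x)
  }
  x
}

x <- c("  Just some.texts,  ok?", "123.456 and\tsome MORE!  ")
clean_sketch(x)
```

On the two example strings above, this sketch gives the same result as the example output. Behaviour on non-ASCII input depends on the platform's iconv and is not shown here.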

word2vec documentation built on July 2, 2021, 5:07 p.m.