DataCleaner: Provides data cleaning functionality
In wordpredictor: Develop Text Prediction Models Based on N-Grams

DataCleaner

R Documentation

Provides data cleaning functionality

Description

It provides a memory-efficient method for removing unneeded characters from text files, which makes it suitable for cleaning large text files.

Details

It cleans text files by removing bad words, stop words, non-dictionary words, extra spaces, punctuation and non-alphabet characters. It can also convert text to lower case. Files are read and cleaned a fixed number of lines at a time, so large text files are supported.
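The cleaning steps described above can be illustrated with a minimal base-R sketch. It is not the package's implementation: the function name `clean_sketch` and the regular expressions are assumptions chosen for illustration, and it covers only lower-casing, punctuation, non-alphabet characters and extra spaces (no stop word, bad word or dictionary filtering).

```r
# Illustrative base-R sketch; NOT the wordpredictor implementation.
# clean_sketch is a hypothetical helper name.
clean_sketch <- function(lines) {
    # Convert the text to lower case
    lines <- tolower(lines)
    # Remove punctuation symbols
    lines <- gsub("[[:punct:]]", "", lines)
    # Remove everything that is not a lower case letter or a space
    lines <- gsub("[^a-z ]", "", lines)
    # Collapse repeated spaces, then trim leading and trailing spaces
    lines <- gsub(" +", " ", lines)
    lines <- trimws(lines)
    lines
}

cleaned <- clean_sketch(c("  Hello,   World! ", "We're about 90percent done"))
print(cleaned)
```

Each step maps onto one of the options listed for the `opts` argument of `new()`, which is how the class lets individual steps be switched on or off.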

Super class

wordpredictor::Base -> DataCleaner

Methods

Method `new()`

It initializes the current object. It sets the file name, the data cleaning options and the verbosity level.

Usage

DataCleaner$new(fn = NULL, opts = list(), ve = 0)

Arguments

fn

The path to the file to clean.

opts

The options for data cleaning, given as a list. The recognized elements are:

min_words. The minimum number of words per sentence.
line_count. The number of lines to read and clean at a time.
save_data. Whether the combined processed lines should be saved.
output_file. The name of the output file used to store the data.
sw_file. The stop words file path.
dict_file. The dictionary file path.
bad_file. The bad words file path.
to_lower. Whether the words should be converted to lower case.
remove_stop. Whether stop words should be removed.
remove_punct. Whether punctuation symbols should be removed.
remove_non_dict. Whether non-dictionary words should be removed.
remove_non_alpha. Whether non-alphabet symbols should be removed.
remove_extra_space. Whether leading, trailing and double spaces should be removed.
remove_bad. Whether bad words should be removed.

ve

The level of detail in the information messages.
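As a rough illustration of the `opts` argument, the sketch below builds an options list and checks that every name is one of the documented option names. The option names come from the list above. The values and their types are assumptions made for illustration, and the variable `documented` is a hypothetical helper, not part of the package.

```r
# Option names are taken from the documented list; the values and their
# types are illustrative assumptions, not the package defaults.
dc_opts <- list(
    min_words = 2,
    line_count = 1000,
    save_data = TRUE,
    output_file = "test-clean.txt",
    to_lower = TRUE,
    remove_stop = FALSE,
    remove_punct = TRUE,
    remove_non_dict = FALSE,
    remove_non_alpha = TRUE,
    remove_extra_space = TRUE,
    remove_bad = FALSE
)

# The documented option names (hypothetical helper vector)
documented <- c(
    "min_words", "line_count", "save_data", "output_file", "sw_file",
    "dict_file", "bad_file", "to_lower", "remove_stop", "remove_punct",
    "remove_non_dict", "remove_non_alpha", "remove_extra_space", "remove_bad"
)

# Misspelled option names are detected before the list is used
unknown <- setdiff(names(dc_opts), documented)
stopifnot(length(unknown) == 0)
```

A list built this way would be passed as the `opts` argument of `DataCleaner$new()`.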

Method `clean_file()`

It provides an efficient method for cleaning text files. It removes unneeded characters from the given text file with several options.

It allows removing punctuation, bad words, stop words, non-alphabetical symbols and non-dictionary words. It reads line_count lines at a time from the given text file, removes unneeded characters from them and then saves the cleaned lines to the output text file.

File cleaning progress is displayed if the verbose option was set in the class constructor. It is suitable for cleaning large text files.
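The chunked read, clean and save cycle described above can be sketched in base R as follows. This is a sketch under stated assumptions, not the package's code: `clean_file_sketch` and its arguments are hypothetical names, and `tolower` stands in for the full cleaning step.

```r
# Illustrative base-R sketch of chunked file cleaning; NOT the package's code.
# clean_file_sketch, input, output and clean are hypothetical names.
clean_file_sketch <- function(input, output, line_count = 2, clean = tolower) {
    # The input and output connections are opened and closed on exit
    con_in <- file(input, open = "r")
    on.exit(close(con_in), add = TRUE)
    con_out <- file(output, open = "w")
    on.exit(close(con_out), add = TRUE)
    total <- 0
    repeat {
        # At most line_count lines are held in memory at a time
        chunk <- readLines(con_in, n = line_count, warn = FALSE)
        if (length(chunk) == 0) break
        # The cleaned chunk is appended to the output file
        writeLines(clean(chunk), con_out)
        total <- total + length(chunk)
    }
    # The number of processed lines is returned invisibly
    invisible(total)
}

# A small temporary file is cleaned two lines at a time
input <- tempfile(fileext = ".txt")
output <- tempfile(fileext = ".txt")
writeLines(c("First LINE", "Second Line", "THIRD line"), input)
n <- clean_file_sketch(input, output, line_count = 2)
print(readLines(output))
```

Because only one chunk is in memory at any point, memory use depends on the chunk size rather than on the file size.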

Usage

DataCleaner$clean_file()

Examples

# Start of environment setup code
# The level of detail in the information messages
ve <- 0
# The name of the folder that will contain all the files. It will be
# created in the current directory. NULL implies tempdir will be used
fn <- NULL
# The required files. They are default files that are part of the
# package
rf <- c("test.txt")
# An object of class EnvManager is created
em <- EnvManager$new(ve = ve, rp = "./")
# The required files are downloaded
ed <- em$setup_env(rf, fn)
# End of environment setup code

# The cleaned test file name
cfn <- paste0(ed, "/test-clean.txt")
# The test file name
fn <- paste0(ed, "/test.txt")
# The data cleaning options
dc_opts <- list("output_file" = cfn)
# The data cleaner object is created
dc <- DataCleaner$new(fn, dc_opts, ve = ve)
# The sample file is cleaned
dc$clean_file()

# The test environment is removed. Comment out the line below to keep
# the files generated by the function for inspection
em$td_env()

Method `clean_lines()`

It cleans the given lines of text using the options passed to the current object.

Usage

DataCleaner$clean_lines(lines)

Arguments

lines

The input sentences.

Returns

The cleaned lines of text.

Examples

# The level of detail in the information messages
ve <- 0
# The test data is defined
l <- c(
    "If you think I'm wrong, send me a link to where it's happened",
    "We're about 90percent done with this room",
    "This isn't how I wanted it between us.",
    "Almost any cute breed can become ornamental",
    "Once upon a time there was a kingdom with a castle",
    "That's not a thing any of us are granted'",
    "Why are you being so difficult? she asks."
)
# The expected results
res <- c(
    "if you think wrong send me a link to where its happened",
    "were about percent done with this room",
    "this how i wanted it between us",
    "almost any cute breed can become ornamental",
    "once upon a time there was a kingdom with a castle",
    "thats not a thing any of us are granted",
    "why are you being so difficult she asks"
)
# The DataCleaner object is created
dc <- DataCleaner$new(ve = ve)
# The lines are cleaned
cl <- dc$clean_lines(l)
# The cleaned lines are printed
print(cl)

Method `clone()`

The objects of this class are cloneable with this method.

Usage

DataCleaner$clone(deep = FALSE)

Arguments

deep

Whether to make a deep clone.


wordpredictor documentation built on Oct. 8, 2024, 5:10 p.m.
