DataCleaner: Provides data cleaning functionality

Description Details Super class Methods Examples

Description

It provides a memory efficient method for removing unneeded characters from text files. It is suitable for cleaning large text files.

Details

It provides a method for cleaning text files. It allows removing bad words, stop words, non dictionary words, extra space, punctuation and non-alphabet characters. It also allows conversion to lower case. It supports large text files.

Super class

wordpredictor::Base -> DataCleaner

Methods

Public methods

Inherited methods

Method new()

It initializes the current object. It is used to set the file name and verbose options.

Usage
DataCleaner$new(fn = NULL, opts = list(), ve = 0)
Arguments
fn

The path to the file to clean.

opts

The options for data cleaning.

  • min_words. The minimum number of words per sentence.

  • line_count. The number of lines to read and clean at a time.

  • save_data. If the combined processed lines should be saved.

  • output_file. Name of the output file used to store the data.

  • sw_file. The stop words file path.

  • dict_file. The dictionary file path.

  • bad_file. The bad words file path.

  • to_lower. If the words should be converted to lower case.

  • remove_stop. If stop words should be removed.

  • remove_punct. If punctuation symbols should be removed.

  • remove_non_dict. If non dictionary words should be removed.

  • remove_non_alpha. -> If non alphabet symbols should be removed.

  • remove_extra_space. -> If leading, trailing and double spaces should be removed.

  • remove_bad. If bad words should be removed

ve

The level of detail in the information messages.


Method clean_file()

It provides an efficient method for cleaning text files. It removes unneeded characters from the given text file with several options.

It allows removing punctuation, bad words, stop words, non-alphabetical symbols and non-dictionary words. It reads a certain number of lines from the given text file. It removes unneeded characters from the lines and then saves the lines to an output text file.

File cleaning progress is displayed if the verbose option was set in the class constructor. It is suitable for cleaning large text files.

Usage
DataCleaner$clean_file()
Examples
# Start of environment setup code
# The level of detail in the information messages
ve <- 0
# The name of the folder that will contain all the files. It will be
# created in the current directory. NULL implies tempdir will be used
fn <- NULL
# The required files. They are default files that are part of the
# package
rf <- c("test.txt")
# An object of class EnvManager is created
em <- EnvManager$new(ve = ve, rp = "./")
# The required files are downloaded
ed <- em$setup_env(rf, fn)
# End of environment setup code

# The cleaned test file name
cfn <- paste0(ed, "/test-clean.txt")
# The test file name
fn <- paste0(ed, "/test.txt")
# The data cleaning options
dc_opts <- list("output_file" = cfn)
# The data cleaner object is created
dc <- DataCleaner$new(fn, dc_opts, ve = ve)
# The sample file is cleaned
dc$clean_file()

# The test environment is removed. Comment the below line, so the
# files generated by the function can be viewed
em$td_env()

Method clean_lines()

It cleans the given lines of text using the options passed to the current object.

Usage
DataCleaner$clean_lines(lines)
Arguments
lines

The input sentences.

Returns

The cleaned lines of text.

Examples
# The level of detail in the information messages
ve <- 0
# Test data is read
l <- c(
    "If you think I'm wrong, send me a link to where it's happened",
    "We're about 90percent done with this room",
    "This isn't how I wanted it between us.",
    "Almost any cute breed can become ornamental",
    "Once upon a time there was a kingdom with a castle",
    "That's not a thing any of us are granted'",
    "Why are you being so difficult? she asks."
)
# The expected results
res <- c(
    "if you think wrong send me a link to where its happened",
    "were about percent done with this room",
    "this how i wanted it between us",
    "almost any cute breed can become ornamental",
    "once upon a time there was a kingdom with a castle",
    "thats not a thing any of us are granted",
    "why are you being so difficult she asks"
)
# The DataCleaner object is created
dc <- DataCleaner$new(ve = ve)
# The line is cleaned
cl <- dc$clean_lines(l)
# The cleaned lines are printed
print(cl)

Method clone()

The objects of this class are cloneable with this method.

Usage
DataCleaner$clone(deep = FALSE)
Arguments
deep

Whether to make a deep clone.

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
## ------------------------------------------------
## Method `DataCleaner$clean_file`
## ------------------------------------------------

# Start of environment setup code
# The level of detail in the information messages
ve <- 0
# The name of the folder that will contain all the files. It will be
# created in the current directory. NULL implies tempdir will be used
fn <- NULL
# The required files. They are default files that are part of the
# package
rf <- c("test.txt")
# An object of class EnvManager is created
em <- EnvManager$new(ve = ve, rp = "./")
# The required files are downloaded
ed <- em$setup_env(rf, fn)
# End of environment setup code

# The cleaned test file name
cfn <- paste0(ed, "/test-clean.txt")
# The test file name
fn <- paste0(ed, "/test.txt")
# The data cleaning options
dc_opts <- list("output_file" = cfn)
# The data cleaner object is created
dc <- DataCleaner$new(fn, dc_opts, ve = ve)
# The sample file is cleaned
dc$clean_file()

# The test environment is removed. Comment the below line, so the
# files generated by the function can be viewed
em$td_env()

## ------------------------------------------------
## Method `DataCleaner$clean_lines`
## ------------------------------------------------

# The level of detail in the information messages
ve <- 0
# Test data is read
l <- c(
    "If you think I'm wrong, send me a link to where it's happened",
    "We're about 90percent done with this room",
    "This isn't how I wanted it between us.",
    "Almost any cute breed can become ornamental",
    "Once upon a time there was a kingdom with a castle",
    "That's not a thing any of us are granted'",
    "Why are you being so difficult? she asks."
)
# The expected results
res <- c(
    "if you think wrong send me a link to where its happened",
    "were about percent done with this room",
    "this how i wanted it between us",
    "almost any cute breed can become ornamental",
    "once upon a time there was a kingdom with a castle",
    "thats not a thing any of us are granted",
    "why are you being so difficult she asks"
)
# The DataCleaner object is created
dc <- DataCleaner$new(ve = ve)
# The line is cleaned
cl <- dc$clean_lines(l)
# The cleaned lines are printed
print(cl)

wordpredictor documentation built on Jan. 4, 2022, 5:07 p.m.