DataSampler | R Documentation
It provides a method for generating training, testing and validation data sets from a given input text file.
It also provides a method for generating a sample file of given size or number of lines from an input text file. The contents of the sample file may be cleaned or randomized.
wordpredictor::Base -> DataSampler
new()
It initializes the current object. It sets the directory used for the input and output files and the verbose option.
DataSampler$new(dir = ".", ve = 0)
dir
The directory for storing the input and output files.
ve
The level of detail in the information messages.
generate_sample()
Generates a sample file of given size from the given input file. The file is saved to the directory given by the dir object attribute. Once the file has been generated, its contents may be cleaned or randomized.
DataSampler$generate_sample(fn, ss, ic, ir, ofn, is, dc_opts = NULL)
fn
The input file name. It is the short file name relative to the dir attribute.
ss
The number of lines or proportion of lines to sample.
ic
If the sample file should be cleaned.
ir
If the sample file contents should be randomized.
ofn
The output file name. It will be saved to the dir.
is
If the sampled data should be saved to a file.
dc_opts
The options for cleaning the data.
# Start of environment setup code
# The level of detail in the information messages
ve <- 0
# The name of the folder that will contain all the files. It will be
# created in the current directory. NULL implies tempdir will be used
fn <- NULL
# The required files. They are default files that are part of the
# package
rf <- c("input.txt")
# An object of class EnvManager is created
em <- EnvManager$new(ve = ve, rp = "./")
# The required files are downloaded
ed <- em$setup_env(rf, fn)
# End of environment setup code
# The sample file name
sfn <- paste0(ed, "/sample.txt")
# An object of class DataSampler is created
ds <- DataSampler$new(dir = ed, ve = ve)
# The sample file is generated
ds$generate_sample(
    fn = "input.txt",
    ss = 0.5,
    ic = FALSE,
    ir = FALSE,
    ofn = "sample.txt",
    is = TRUE
)
# The test environment is removed. Comment the below line, so the
# files generated by the function can be viewed
em$td_env()
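The sampling step can be pictured with a short base R sketch. It is not the package's implementation. The helper `sample_lines` and its `randomize` argument are hypothetical, and the rule that a value of `ss` below 1 is a proportion while any other value is a line count is an assumption made only for this illustration.

```r
# Hypothetical helper, not part of wordpredictor: draws a sample of
# lines from a character vector.
sample_lines <- function(lines, ss, randomize = FALSE) {
    n <- length(lines)
    # Assumption: ss < 1 is a proportion, otherwise it is a line count
    k <- if (ss < 1) round(n * ss) else min(as.integer(ss), n)
    # Pick k distinct line positions
    idx <- sample.int(n, k)
    # Keep the original line order unless randomization is requested
    if (!randomize) idx <- sort(idx)
    lines[idx]
}

set.seed(1)
# A small input file is written to the temporary directory
input <- tempfile(fileext = ".txt")
writeLines(paste("line", 1:10), input)
lines <- readLines(input)
# Half of the lines are sampled
half <- sample_lines(lines, ss = 0.5)
# Exactly three lines are sampled
three <- sample_lines(lines, ss = 3)
```

With a ten-line input, `half` holds five lines and `three` holds three, each drawn without replacement.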
generate_data()
It generates training, testing and validation data sets from the given input file. It first reads the input file given by the fn parameter. It then partitions the data into training, testing and validation sets, according to the percs parameter. The files are named train.txt, test.txt and va.txt and are saved to the given output folder.
DataSampler$generate_data(fn, percs)
fn
The input file name. It should be relative to the dir attribute.
percs
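The size of the training, testing and validation sets, given as a named list of proportions, for example list("train" = 0.8, "test" = 0.1, "validate" = 0.1).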
# Start of environment setup code
# The level of detail in the information messages
ve <- 0
# The name of the folder that will contain all the files. It will be
# created in the current directory. NULL implies tempdir will be
# used
fn <- NULL
# The required files. They are default files that are part of the
# package
rf <- c("input.txt")
# An object of class EnvManager is created
em <- EnvManager$new(ve = ve)
# The required files are downloaded
ed <- em$setup_env(rf, fn)
# End of environment setup code
# The files to clean
fns <- c("train", "test", "validate")
# An object of class DataSampler is created
ds <- DataSampler$new(dir = ed, ve = ve)
# The train, test and validation files are generated
ds$generate_data(
    fn = "input.txt",
    percs = list(
        "train" = 0.8,
        "test" = 0.1,
        "validate" = 0.1
    )
)
# The test environment is removed. Comment the below line, so the
# files generated by the function can be viewed
em$td_env()
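To illustrate how a list of proportions maps to data sets, here is a minimal base R sketch of a three-way partition. It is not the package's implementation. The helper `split_lines` is hypothetical, and the choice to split the lines sequentially without shuffling is an assumption; the package may partition the data differently.

```r
# Hypothetical helper, not part of wordpredictor: partitions a
# character vector according to a named list of proportions.
split_lines <- function(lines, percs) {
    p <- unlist(percs)
    # The proportions are expected to add up to 1
    stopifnot(abs(sum(p) - 1) < 1e-8)
    n <- length(lines)
    # The last line position of each partition
    ends <- unname(round(cumsum(p) * n))
    # The first line position of each partition
    starts <- c(1, head(ends, -1) + 1)
    # Assumption: the lines are split sequentially, without shuffling
    parts <- Map(
        function(s, e) lines[seq_len(e - s + 1) + s - 1],
        starts, ends
    )
    names(parts) <- names(p)
    parts
}

lines <- paste("line", 1:20)
parts <- split_lines(
    lines,
    list("train" = 0.8, "test" = 0.1, "validate" = 0.1)
)
# The number of lines in each partition
sizes <- lengths(parts)
```

For a 20-line input, the sketch assigns 16 lines to train and 2 lines each to test and validate. Each line lands in exactly one partition.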
clone()
The objects of this class are cloneable with this method.
DataSampler$clone(deep = FALSE)
deep
Whether to make a deep clone.
## ------------------------------------------------
## Method `DataSampler$generate_sample`
## ------------------------------------------------
# Start of environment setup code
# The level of detail in the information messages
ve <- 0
# The name of the folder that will contain all the files. It will be
# created in the current directory. NULL implies tempdir will be used
fn <- NULL
# The required files. They are default files that are part of the
# package
rf <- c("input.txt")
# An object of class EnvManager is created
em <- EnvManager$new(ve = ve, rp = "./")
# The required files are downloaded
ed <- em$setup_env(rf, fn)
# End of environment setup code
# The sample file name
sfn <- paste0(ed, "/sample.txt")
# An object of class DataSampler is created
ds <- DataSampler$new(dir = ed, ve = ve)
# The sample file is generated
ds$generate_sample(
    fn = "input.txt",
    ss = 0.5,
    ic = FALSE,
    ir = FALSE,
    ofn = "sample.txt",
    is = TRUE
)
# The test environment is removed. Comment the below line, so the
# files generated by the function can be viewed
em$td_env()
## ------------------------------------------------
## Method `DataSampler$generate_data`
## ------------------------------------------------
# Start of environment setup code
# The level of detail in the information messages
ve <- 0
# The name of the folder that will contain all the files. It will be
# created in the current directory. NULL implies tempdir will be
# used
fn <- NULL
# The required files. They are default files that are part of the
# package
rf <- c("input.txt")
# An object of class EnvManager is created
em <- EnvManager$new(ve = ve)
# The required files are downloaded
ed <- em$setup_env(rf, fn)
# End of environment setup code
# The files to clean
fns <- c("train", "test", "validate")
# An object of class DataSampler is created
ds <- DataSampler$new(dir = ed, ve = ve)
# The train, test and validation files are generated
ds$generate_data(
    fn = "input.txt",
    percs = list(
        "train" = 0.8,
        "test" = 0.1,
        "validate" = 0.1
    )
)
# The test environment is removed. Comment the below line, so the
# files generated by the function can be viewed
em$td_env()