LargeDataSetForText: Abstract class for large data sets containing raw texts
In aifeducation: Artificial Intelligence for Education

LargeDataSetForText

R Documentation

Abstract class for large data sets containing raw texts

Description

This object stores raw texts. The data of this object are not held in memory directly. Through memory mapping, objects of this class can work with data sets that do not fit into memory (RAM).

Value

Returns a new object of this class.

Super class

aifeducation::LargeDataSetBase -> LargeDataSetForText

Methods

Public methods

LargeDataSetForText$new()
LargeDataSetForText$add_from_files_txt()
LargeDataSetForText$add_from_files_pdf()
LargeDataSetForText$add_from_files_xlsx()
LargeDataSetForText$add_from_data.frame()
LargeDataSetForText$get_private()
LargeDataSetForText$clone()

Method `new()`

Method for creating a LargeDataSetForText instance. If init_data is passed, the new object is initialized with it (add_from_data.frame() is used if init_data is a data.frame).

Usage

LargeDataSetForText$new(init_data = NULL)

Arguments

init_data: Initial data.frame for the data set.

Returns

A new instance of this class initialized with init_data if passed.
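
As a minimal sketch of the constructor, the following builds a small data.frame and passes it as init_data. It assumes that aifeducation and its Python backend are installed; create_text_dataset and initial_texts are hypothetical names used only for illustration, and the call is wrapped in a function so that nothing runs until it is called.

```r
# Hypothetical example data: only "id" and "text" are mandatory.
initial_texts <- data.frame(
  id = c("doc_1", "doc_2"),
  text = c("First raw text.", "Second raw text."),
  stringsAsFactors = FALSE
)

# Hypothetical helper: creates the data set, optionally with initial data.
# Assumes aifeducation and its Python backend are available.
create_text_dataset <- function(init_data = NULL) {
  aifeducation::LargeDataSetForText$new(init_data = init_data)
}

# Usage (not run here):
# text_data <- create_text_dataset(initial_texts)
```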

Method `add_from_files_txt()`

Method for adding raw texts saved in .txt files to the data set. Please note that the directory should contain one folder for each .txt file. To create an informative data set, every folder can contain the following additional files:

bib_entry.txt: containing a text version of the bibliographic information of the raw text.
license.txt: containing a statement about the license to use the raw text, such as "CC BY".
url_license.txt: containing the URL/link to the license on the internet.
text_license.txt: containing the license as raw text.
url_source.txt: containing the URL/link to the source on the internet.

The id of every .txt file is the file name without the file extension. Make sure that all file names are unique. Id and raw texts are mandatory; bibliographic and license information are optional.
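
The folder layout described above can be sketched with base R as follows. All directory names, file contents, and the source URL are placeholders chosen for illustration; only the metadata file names come from the description above.

```r
# One folder per text file inside a root directory (names are placeholders).
root_dir <- file.path(tempdir(), "raw_texts")
doc_dir <- file.path(root_dir, "doc_1")
dir.create(doc_dir, recursive = TRUE, showWarnings = FALSE)

# Mandatory: the raw text. The file name without extension ("doc_1")
# serves as the id.
writeLines("This is the raw text of the first document.",
           file.path(doc_dir, "doc_1.txt"))

# Optional metadata files with placeholder contents.
writeLines("Placeholder bibliographic entry.",
           file.path(doc_dir, "bib_entry.txt"))
writeLines("CC BY", file.path(doc_dir, "license.txt"))
writeLines("https://creativecommons.org/licenses/by/4.0/",
           file.path(doc_dir, "url_license.txt"))
writeLines("https://example.org/doc_1",
           file.path(doc_dir, "url_source.txt"))

# Inspect the resulting layout.
list.files(root_dir, recursive = TRUE)
```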

Usage

LargeDataSetForText$add_from_files_txt(
  dir_path,
  batch_size = 500L,
  log_file = NULL,
  log_write_interval = 2L,
  log_top_value = 0L,
  log_top_total = 1L,
  log_top_message = NA,
  clean_text = TRUE,
  trace = TRUE
)

Arguments

dir_path

Path to the directory where the files are stored.

batch_size

int determining the number of files to process at once.

log_file

string Path to the file where the log should be saved. If no logging is desired set this argument to NULL.

log_write_interval

int Time in seconds determining the interval in which the logger should try to update the log files. Only relevant if log_file is not NULL.

log_top_value

int indicating the current iteration of the process.

log_top_total

int determining the maximal number of iterations.

log_top_message

string providing additional information about the process.

clean_text

bool If TRUE, the text is modified to improve the quality of the subsequent analysis:

Some special symbols are removed.
All spaces at the beginning and the end of a row are removed.
Multiple spaces are reduced to a single space.
All rows with a number from 1 to 999 at the beginning or at the end are removed (header and footer).
The table of contents is removed.
Hyphenation is undone.
Line breaks within a paragraph are removed.
Multiple line breaks are reduced to a single line break.

trace

bool If TRUE information on the progress is printed to the console.
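
To give a feeling for what the clean_text steps listed above do, here is a rough base-R approximation of a few of them. This is not the package's implementation: the function name clean_text_sketch and all regular expressions are assumptions, and the table-of-contents and paragraph steps are left out.

```r
# Rough approximation of some clean_text steps (assumed regular expressions).
clean_text_sketch <- function(text) {
  rows <- strsplit(text, "\n", fixed = TRUE)[[1]]
  # Remove spaces at the beginning and the end of each row.
  rows <- trimws(rows)
  # Reduce multiple spaces to a single space.
  rows <- gsub(" {2,}", " ", rows)
  # Drop rows that begin or end with a number from 1 to 999
  # (typical page headers and footers).
  is_page_row <- grepl("^[1-9][0-9]{0,2}\\b|\\b[1-9][0-9]{0,2}$", rows,
                       perl = TRUE)
  rows <- rows[!is_page_row]
  text <- paste(rows, collapse = "\n")
  # Undo hyphenation across a line break.
  text <- gsub("([[:alpha:]])-\n([[:alpha:]])", "\\1\\2", text)
  # Reduce multiple line breaks to a single line break.
  gsub("\n{2,}", "\n", text)
}

clean_text_sketch("  An exam-\nple   text.  \n\n\n12\nNext paragraph.")
```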

Returns

The method does not return anything. It adds new raw texts to the data set.
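
A call might look like the following sketch. It assumes that aifeducation and its Python backend are installed and that dir_path points to a directory with one folder per .txt file; import_txt_folder is a hypothetical helper name, and the argument values simply repeat the documented defaults.

```r
# Hypothetical helper: creates an empty data set and adds all .txt files
# found under dir_path. Nothing runs until the function is called.
import_txt_folder <- function(dir_path, log_file = NULL) {
  text_data <- aifeducation::LargeDataSetForText$new()
  text_data$add_from_files_txt(
    dir_path = dir_path,
    batch_size = 500L,   # number of files processed at once
    log_file = log_file, # NULL disables logging
    clean_text = TRUE,
    trace = TRUE
  )
  # The method itself returns nothing, so return the modified object.
  text_data
}

# Usage (not run here):
# text_data <- import_txt_folder("path/to/raw_texts")
```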

Method `add_from_files_pdf()`

Method for adding raw texts saved in .pdf files to the data set. Please note that the directory should contain one folder for each .pdf file. To create an informative data set, every folder can contain the following additional files:

bib_entry.txt: containing a text version of the bibliographic information of the raw text.
license.txt: containing a statement about the license to use the raw text, such as "CC BY".
url_license.txt: containing the URL/link to the license on the internet.
text_license.txt: containing the license as raw text.
url_source.txt: containing the URL/link to the source on the internet.

The id of every .pdf file is the file name without the file extension. Make sure that all file names are unique. Id and raw texts are mandatory; bibliographic and license information are optional.

Usage

LargeDataSetForText$add_from_files_pdf(
  dir_path,
  batch_size = 500L,
  log_file = NULL,
  log_write_interval = 2L,
  log_top_value = 0L,
  log_top_total = 1L,
  log_top_message = NA,
  clean_text = TRUE,
  trace = TRUE
)

Arguments

dir_path

Path to the directory where the files are stored.

batch_size

int determining the number of files to process at once.

log_file

string Path to the file where the log should be saved. If no logging is desired set this argument to NULL.

log_write_interval

int Time in seconds determining the interval in which the logger should try to update the log files. Only relevant if log_file is not NULL.

log_top_value

int indicating the current iteration of the process.

log_top_total

int determining the maximal number of iterations.

log_top_message

string providing additional information about the process.

clean_text

bool If TRUE, the text is modified to improve the quality of the subsequent analysis:

Some special symbols are removed.
All spaces at the beginning and the end of a row are removed.
Multiple spaces are reduced to a single space.
All rows with a number from 1 to 999 at the beginning or at the end are removed (header and footer).
The table of contents is removed.
Hyphenation is undone.
Line breaks within a paragraph are removed.
Multiple line breaks are reduced to a single line break.

trace

bool If TRUE information on the progress is printed to the console.

Returns

The method does not return anything. It adds new raw texts to the data set.
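
The following sketch shows a possible call with logging enabled. It assumes that aifeducation and its Python backend are installed; import_pdf_folder and the default log file path are hypothetical and chosen only for illustration.

```r
# Hypothetical helper: adds all .pdf files found under dir_path and writes
# progress to a log file (the path is a placeholder).
import_pdf_folder <- function(dir_path,
                              log_file = file.path(tempdir(), "import_log")) {
  text_data <- aifeducation::LargeDataSetForText$new()
  text_data$add_from_files_pdf(
    dir_path = dir_path,
    batch_size = 100L,       # smaller batches than the default of 500L
    log_file = log_file,
    log_write_interval = 2L, # try to update the log every 2 seconds
    clean_text = TRUE,
    trace = FALSE            # rely on the log instead of console output
  )
  text_data
}

# Usage (not run here):
# text_data <- import_pdf_folder("path/to/pdf_folders")
```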

Method `add_from_files_xlsx()`

Method for adding raw texts saved in .xlsx files to the data set. The method assumes that each row holds one text and that the id and the raw text are stored in separate columns. In addition, columns for the bibliographic information and the license can be added. The names of these columns must be specified with the following arguments. They must be the same for all .xlsx files in the chosen directory. Id and raw texts are mandatory; bibliographic information, license, license URL, license text, and source URL are optional. Additional columns are dropped.

Usage

LargeDataSetForText$add_from_files_xlsx(
  dir_path,
  trace = TRUE,
  id_column = "id",
  text_column = "text",
  bib_entry_column = "bib_entry",
  license_column = "license",
  url_license_column = "url_license",
  text_license_column = "text_license",
  url_source_column = "url_source",
  log_file = NULL,
  log_write_interval = 2L,
  log_top_value = 0L,
  log_top_total = 1L,
  log_top_message = NA
)

Arguments

dir_path: Path to the directory where the files are stored.
trace: bool If TRUE prints information on the progress to the console.
id_column: string Name of the column storing the ids for the texts.
text_column: string Name of the column storing the raw text.
bib_entry_column: string Name of the column storing the bibliographic information of the texts.
license_column: string Name of the column storing information about the licenses.
url_license_column: string Name of the column storing the URL of the license on the internet.
text_license_column: string Name of the column storing the license as text.
url_source_column: string Name of the column storing the URL of the source on the internet.
log_file: string Path to the file where the log should be saved. If no logging is desired set this argument to NULL.
log_write_interval: int Time in seconds determining the interval in which the logger should try to update the log files. Only relevant if log_file is not NULL.
log_top_value: int indicating the current iteration of the process.
log_top_total: int determining the maximal number of iterations.
log_top_message: string providing additional information of the process.

Returns

The method does not return anything. It adds new raw texts to the data set.
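
As a sketch of how custom column names map to the arguments, the example below prepares a small table and defines a wrapper around the method. The column names doc_id and content, the helper import_xlsx_folder, and the use of the writexl package for writing the file are assumptions made for illustration; the import itself requires aifeducation and its Python backend.

```r
# Placeholder table: one text per row, with non-default column names.
sheet <- data.frame(
  doc_id = c("doc_1", "doc_2"),
  content = c("First raw text.", "Second raw text."),
  license = c("CC BY", NA),
  stringsAsFactors = FALSE
)

# Write the table to an .xlsx file if writexl happens to be installed.
xlsx_dir <- file.path(tempdir(), "xlsx_texts")
dir.create(xlsx_dir, showWarnings = FALSE)
if (requireNamespace("writexl", quietly = TRUE)) {
  writexl::write_xlsx(sheet, file.path(xlsx_dir, "texts.xlsx"))
}

# Hypothetical helper: tells the method which columns hold id and text.
import_xlsx_folder <- function(dir_path) {
  text_data <- aifeducation::LargeDataSetForText$new()
  text_data$add_from_files_xlsx(
    dir_path = dir_path,
    id_column = "doc_id",
    text_column = "content",
    license_column = "license",
    trace = TRUE
  )
  text_data
}

# Usage (not run here):
# text_data <- import_xlsx_folder(xlsx_dir)
```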

Method `add_from_data.frame()`

Method for adding raw texts from a data.frame.

Usage

LargeDataSetForText$add_from_data.frame(data_frame)

Arguments

data_frame: Object of class data.frame with the following columns: "id", "text", "bib_entry", "license", "url_license", "text_license", and "url_source". If "id" and/or "text" is missing, an error occurs. If the other columns are not present in the data.frame, they are added with empty values (NA). Additional columns are dropped.

Returns

The method does not return anything. It adds new raw texts to the data set.
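
The column handling described for data_frame can be mirrored in base R to check a data.frame before passing it on. The helper prepare_text_frame below is hypothetical and only imitates the documented behavior (error on missing "id" or "text", NA for missing optional columns, extra columns dropped); it is not the package's own code.

```r
# Columns named in the description of the data_frame argument.
expected_columns <- c("id", "text", "bib_entry", "license",
                      "url_license", "text_license", "url_source")

# Hypothetical helper imitating the documented column handling.
prepare_text_frame <- function(data_frame) {
  missing_required <- setdiff(c("id", "text"), names(data_frame))
  if (length(missing_required) > 0) {
    stop("Missing mandatory column(s): ",
         paste(missing_required, collapse = ", "))
  }
  # Add absent optional columns filled with NA.
  for (column in setdiff(expected_columns, names(data_frame))) {
    data_frame[[column]] <- NA_character_
  }
  # Keep only the expected columns, in a fixed order.
  data_frame[, expected_columns, drop = FALSE]
}

raw <- data.frame(id = "doc_1", text = "Some raw text.", note = "dropped",
                  stringsAsFactors = FALSE)
prepared <- prepare_text_frame(raw)
names(prepared)

# Usage with the class (not run here, requires aifeducation):
# text_data <- aifeducation::LargeDataSetForText$new()
# text_data$add_from_data.frame(prepared)
```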

Method `get_private()`

Method for requesting all private fields and methods. Used for loading and updating an object.

Usage

LargeDataSetForText$get_private()

Returns

Returns a list with all private fields and methods.

Method `clone()`

The objects of this class are cloneable with this method.

Usage

LargeDataSetForText$clone(deep = FALSE)

Arguments

deep: Whether to make a deep clone.
