| LargeDataSetForText | R Documentation |
This object stores raw texts. The data of this object is not stored in memory directly. By using memory mapping, these objects make it possible to work with data sets that do not fit into memory (RAM).
Returns a new object of this class.
aifeducation::LargeDataSetBase -> LargeDataSetForText
aifeducation::LargeDataSetBase$get_all_fields()
aifeducation::LargeDataSetBase$get_colnames()
aifeducation::LargeDataSetBase$get_dataset()
aifeducation::LargeDataSetBase$get_ids()
aifeducation::LargeDataSetBase$load()
aifeducation::LargeDataSetBase$load_from_disk()
aifeducation::LargeDataSetBase$n_cols()
aifeducation::LargeDataSetBase$n_rows()
aifeducation::LargeDataSetBase$reduce_to_unique_ids()
aifeducation::LargeDataSetBase$save()
aifeducation::LargeDataSetBase$select()
new()
Method for creating a LargeDataSetForText instance. If the init_data parameter is passed, the instance is initialized with it (using the add_from_data.frame() method if init_data is a data.frame).
LargeDataSetForText$new(init_data = NULL)
init_data: Initial data.frame for the data set.
A new instance of this class initialized with init_data if passed.
add_from_files_txt()
Method for adding raw texts saved within .txt files to the data set. Please note that the directory should contain one folder for each .txt file. To create an informative data set, every folder can contain the following additional files:
bib_entry.txt: containing a text version of the bibliographic information of the raw text.
license.txt: containing a statement about the license to use the raw text such as "CC BY".
url_license.txt: containing the url/link to the license on the internet.
text_license.txt: containing the license in raw text.
url_source.txt: containing the url/link to the source on the internet.
The id of every .txt file is the file name without the file extension. Make sure to provide unique file names. Id and raw texts are mandatory; bibliographic and license information are optional.
LargeDataSetForText$add_from_files_txt( dir_path, batch_size = 500, log_file = NULL, log_write_interval = 2, log_top_value = 0, log_top_total = 1, log_top_message = NA, trace = TRUE )
dir_path: Path to the directory where the files are stored.
batch_size: int determining the number of files to process at once.
log_file: string Path to the file where the log should be saved. If no logging is desired, set this argument to NULL.
log_write_interval: int Time in seconds determining the interval in which the logger should try to update the log files. Only relevant if log_file is not NULL.
log_top_value: int indicating the current iteration of the process.
log_top_total: int determining the maximal number of iterations.
log_top_message: string providing additional information on the process.
trace: bool If TRUE, information on the progress is printed to the console.
The method does not return anything. It adds new raw texts to the data set.
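The directory layout described for add_from_files_txt() can be sketched in base R as follows. The folder names (doc_a, doc_b), the temporary directory, and the uniqueness check are illustrative assumptions and not part of the package. The call to add_from_files_txt() itself is left commented out because it requires an installed and configured aifeducation package.

```r
# Build a small example directory: one folder per .txt file.
# All names below (doc_a, doc_b, the temporary directory) are hypothetical.
dir_path <- file.path(tempdir(), "raw_texts_example")
dir.create(dir_path, recursive = TRUE, showWarnings = FALSE)

docs <- list(
  doc_a = "This is the raw text of the first document.",
  doc_b = "This is the raw text of the second document."
)

for (id in names(docs)) {
  doc_dir <- file.path(dir_path, id)
  dir.create(doc_dir, showWarnings = FALSE)
  # The file name without extension serves as the id of the text.
  writeLines(docs[[id]], file.path(doc_dir, paste0(id, ".txt")))
  # Optional metadata file, using one of the documented file names.
  writeLines("CC BY", file.path(doc_dir, "license.txt"))
}

# Our own sanity check (an assumption, not package behavior):
# collect the ids implied by the file names and make sure they are unique.
metadata_names <- c("bib_entry", "license", "url_license",
                    "text_license", "url_source")
txt_files <- list.files(dir_path, pattern = "\\.txt$", recursive = TRUE)
ids <- tools::file_path_sans_ext(basename(txt_files))
ids <- ids[!ids %in% metadata_names]
stopifnot(anyDuplicated(ids) == 0)

# With aifeducation installed and set up, the directory could then be added:
# dataset <- aifeducation::LargeDataSetForText$new()
# dataset$add_from_files_txt(dir_path = dir_path, batch_size = 500, trace = TRUE)
```

The same layout applies to add_from_files_pdf(), with a .pdf file in place of the .txt file in each folder.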
add_from_files_pdf()
Method for adding raw texts saved within .pdf files to the data set. Please note that the directory should contain one folder for each .pdf file. To create an informative data set, every folder can contain the following additional files:
bib_entry.txt: containing a text version of the bibliographic information of the raw text.
license.txt: containing a statement about the license to use the raw text such as "CC BY".
url_license.txt: containing the url/link to the license on the internet.
text_license.txt: containing the license in raw text.
url_source.txt: containing the url/link to the source on the internet.
The id of every .pdf file is the file name without the file extension. Make sure to provide unique file names. Id and raw texts are mandatory; bibliographic and license information are optional.
LargeDataSetForText$add_from_files_pdf( dir_path, batch_size = 500, log_file = NULL, log_write_interval = 2, log_top_value = 0, log_top_total = 1, log_top_message = NA, trace = TRUE )
dir_path: Path to the directory where the files are stored.
batch_size: int determining the number of files to process at once.
log_file: string Path to the file where the log should be saved. If no logging is desired, set this argument to NULL.
log_write_interval: int Time in seconds determining the interval in which the logger should try to update the log files. Only relevant if log_file is not NULL.
log_top_value: int indicating the current iteration of the process.
log_top_total: int determining the maximal number of iterations.
log_top_message: string providing additional information on the process.
trace: bool If TRUE, information on the progress is printed to the console.
The method does not return anything. It adds new raw texts to the data set.
add_from_files_xlsx()
Method for adding raw texts saved within .xlsx files to the data set. The method assumes that each row stores one text and that the columns store the id and the raw text. In addition, columns for the bibliographic information and the license can be added. The names of these columns must be specified with the following arguments. They must be the same for all .xlsx files in the chosen directory. Id and raw texts are mandatory; bibliographic information, license, license's url, license's text, and source's url are optional. Additional columns are dropped.
LargeDataSetForText$add_from_files_xlsx( dir_path, trace = TRUE, id_column = "id", text_column = "text", bib_entry_column = "bib_entry", license_column = "license", url_license_column = "url_license", text_license_column = "text_license", url_source_column = "url_source", log_file = NULL, log_write_interval = 2, log_top_value = 0, log_top_total = 1, log_top_message = NA )
dir_path: Path to the directory where the files are stored.
trace: bool If TRUE, prints information on the progress to the console.
id_column: string Name of the column storing the ids for the texts.
text_column: string Name of the column storing the raw text.
bib_entry_column: string Name of the column storing the bibliographic information of the texts.
license_column: string Name of the column storing information about the licenses.
url_license_column: string Name of the column storing the url to the license on the internet.
text_license_column: string Name of the column storing the license as text.
url_source_column: string Name of the column storing the url to the source on the internet.
log_file: string Path to the file where the log should be saved. If no logging is desired, set this argument to NULL.
log_write_interval: int Time in seconds determining the interval in which the logger should try to update the log files. Only relevant if log_file is not NULL.
log_top_value: int indicating the current iteration of the process.
log_top_total: int determining the maximal number of iterations.
log_top_message: string providing additional information on the process.
The method does not return anything. It adds new raw texts to the data set.
add_from_data.frame()
Method for adding raw texts from a data.frame.
LargeDataSetForText$add_from_data.frame(data_frame)
data_frame: Object of class data.frame with the columns "id", "text", "bib_entry", "license", "url_license", "text_license", and "url_source". If "id" and/or "text" is missing, an error occurs. If the other columns are not present in the data.frame, they are added with empty values (NA). Additional columns are dropped.
The method does not return anything. It adds new raw texts to the data set.
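A minimal sketch of preparing a data.frame for add_from_data.frame(), assuming made-up example texts. The object name example_texts and the pre-check are illustrative and not part of the package. The calls into aifeducation are commented out because they require an installed and configured package.

```r
# Hypothetical example data; only "id" and "text" are mandatory.
example_texts <- data.frame(
  id = c("doc_a", "doc_b"),
  text = c("First raw text.", "Second raw text."),
  license = c("CC BY", NA),
  stringsAsFactors = FALSE
)

# Column names as listed in the documentation of data_frame.
mandatory_cols <- c("id", "text")
optional_cols <- c("bib_entry", "license", "url_license",
                   "text_license", "url_source")

# Our own pre-check: the mandatory columns must be present,
# since the method raises an error otherwise.
stopifnot(all(mandatory_cols %in% colnames(example_texts)))

# Optional columns that are absent here; according to the documentation
# they are added with NA values by the method.
missing_optional <- setdiff(optional_cols, colnames(example_texts))
print(missing_optional)

# With aifeducation installed and set up:
# dataset <- aifeducation::LargeDataSetForText$new(init_data = example_texts)
# or, for an existing object:
# dataset$add_from_data.frame(example_texts)
```

Passing the data.frame as init_data to new() and calling add_from_data.frame() on an existing object should lead to the same result, since new() is documented to use add_from_data.frame() for data.frame input.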
get_private()
Method for requesting all private fields and methods. Used for loading and updating an object.
LargeDataSetForText$get_private()
Returns a list with all private fields and methods.
clone()
The objects of this class are cloneable with this method.
LargeDataSetForText$clone(deep = FALSE)
deep: Whether to make a deep clone.
Other Data Management:
DataManagerClassifier,
EmbeddedText,
LargeDataSetForTextEmbeddings