prepare_data | R Documentation |
Convert data frame or character vector to a ‘corporaexplorerobject’ for subsequent exploration.
prepare_data(dataset, ...)
## S3 method for class 'data.frame'
prepare_data(
  dataset,
  date_based_corpus = TRUE,
  text_column = "Text",
  grouping_variable = NULL,
  within_group_identifier = "sequential",
  columns_doc_info = c("Date", "Title", "URL"),
  corpus_name = NULL,
  use_matrix = TRUE,
  matrix_without_punctuation = TRUE,
  tile_length_range = c(1, 10),
  columns_for_ui_checkboxes = NULL,
  ...
)
## S3 method for class 'character'
prepare_data(
  dataset,
  corpus_name = NULL,
  use_matrix = TRUE,
  matrix_without_punctuation = TRUE,
  ...
)
dataset |
Object to convert to corporaexplorerobject:
|
... |
Other arguments to be passed to |
date_based_corpus |
Logical. Set to |
text_column |
Character. Default: "Text".
The column in |
grouping_variable |
Character string indicating column name in dataset. If date_based_corpus is TRUE, this argument is ignored. If date_based_corpus is FALSE, this argument is used to group the documents, e.g., if dataset is organised by chapters belonging to different books. The order of groups in the app is determined as follows:
|
within_group_identifier |
Character string indicating column name in |
columns_doc_info |
Character vector. The columns from |
corpus_name |
Character string with name of corpus. |
use_matrix |
Logical. Should the function create a document term matrix
for fast searching? If |
matrix_without_punctuation |
Should punctuation and digits be stripped
from the text before constructing the document term matrix?
If |
tile_length_range |
Numeric vector of length two.
Fine-tune the tile lengths in document wall
and day corpus view. Tile length is calculated by
|
columns_for_ui_checkboxes |
Character. Character or factor column(s) in dataset.
Include sets of checkboxes in the app sidebar for
convenient filtering of corpus.
Typically useful for columns with a small set of unique
(and short) values.
Checkboxes will be arranged by |
For data.frame: Each row in dataset
is treated as a base differentiating unit in the corpus,
typically chapters in books, or a single document in document collections.
The following column names are reserved and cannot be used in dataset:
"Date_",
"cx_ID",
"Text_original_case",
"Text_column_",
"Tile_length",
"Year_",
"cx_Seq",
"Weekday_n",
"Day_without_docs",
"Invisible_fake_date".
A character vector will be converted to a simple corporaexplorerobject with no metadata.
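Because a data frame must not contain any of the reserved column names listed above, it can be convenient to check for clashes before calling prepare_data. The following is a minimal sketch in base R; the helper find_reserved_columns is hypothetical and not part of corporaexplorer, and the two small data frames are made up for illustration.

```r
# Reserved names, copied from the list above.
reserved_names <- c(
  "Date_", "cx_ID", "Text_original_case", "Text_column_", "Tile_length",
  "Year_", "cx_Seq", "Weekday_n", "Day_without_docs", "Invisible_fake_date"
)

# Hypothetical helper (not part of corporaexplorer): returns the column
# names in `dataset` that collide with the reserved names.
find_reserved_columns <- function(dataset) {
  intersect(names(dataset), reserved_names)
}

# Made-up example data frames.
clean_df <- data.frame(Date = as.Date("2020-01-01"), Text = "A document.")
clashing_df <- data.frame(Text = "A document.", cx_ID = 1L)

find_reserved_columns(clean_df)     # character(0)
find_reserved_columns(clashing_df)  # "cx_ID"
```

Any column reported by such a check would need to be renamed before the data frame is passed to prepare_data.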
A corporaexplorer object to be passed as argument to explore and run_document_extractor.
## From data.frame
# Constructing test data frame:
dates <- as.Date(paste(2011:2020, 1:10, 21:30, sep = "-"))
texts <- paste0(
"This is a document about ", month.name[1:10], ". ",
"This is not a document about ", rev(month.name[1:10]), "."
)
titles <- paste("Text", 1:10)
test_df <- tibble::tibble(Date = dates, Text = texts, Title = titles)
# Converting to corporaexplorerobject:
corpus <- prepare_data(test_df, corpus_name = "Test corpus")
if (interactive()) {
  # Running exploration app:
  explore(corpus)
  # Running app to extract documents:
  run_document_extractor(corpus)
}
## From character vector
alphabet_corpus <- prepare_data(LETTERS)
if (interactive()) {
  # Running exploration app:
  explore(alphabet_corpus)
}
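The examples above cover a date-based data frame and a character vector. As a complement, here is a sketch of the non-date-based case described under grouping_variable, where documents are grouped by a column rather than by date. The data frame, its column names (Book, Title) and its contents are assumptions made up for illustration, and the call is only attempted when corporaexplorer is installed.

```r
## From data.frame, grouped by a column instead of by date
# Made-up example data: four chapters belonging to two books.
books_df <- data.frame(
  Book = c("Book A", "Book A", "Book B", "Book B"),
  Title = paste("Chapter", c(1, 2, 1, 2)),
  Text = c(
    "First chapter of the first book.",
    "Second chapter of the first book.",
    "First chapter of the second book.",
    "Second chapter of the second book."
  ),
  stringsAsFactors = FALSE
)

# Only build the corpus if the package is available.
if (requireNamespace("corporaexplorer", quietly = TRUE)) {
  grouped_corpus <- corporaexplorer::prepare_data(
    books_df,
    date_based_corpus = FALSE,      # rows are not organised by date
    grouping_variable = "Book",     # group chapters by book
    columns_doc_info = c("Book", "Title"),
    corpus_name = "Grouped test corpus"
  )
}
```

With date_based_corpus = FALSE, the grouping_variable argument is used rather than ignored, so each book should appear as its own group in the app.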