rancors_builder: Build Multiple Random Corpora

View source: R/utils-perm.R

rancors_builder    R Documentation

Build Multiple Random Corpora

Description

rancors_builder() generates multiple random corpora (rancors) based on user-defined term probabilities and a vocabulary. Users can set the number of corpora and the number of documents, as well as the mean, standard deviation, minimum, and maximum of the document lengths (i.e., the number of tokens per document). The output is a list of document-term matrices, one per corpus. To produce a single random corpus, use rancor_builder() (note the singular).
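To make the idea concrete, the following base R sketch builds one toy random document-term matrix from a vocabulary and term probabilities. It is an illustration only and not the text2map implementation: the toy vocabulary, the use of rnorm() with clamping for the document lengths, and the use of rmultinom() for the token counts are all assumptions made for this example.

```r
# Illustrative sketch in base R; NOT how text2map builds its corpora.
set.seed(59801)

# Hypothetical vocabulary and term probabilities (probabilities sum to 1)
vocab <- c("wonderful", "world", "think", "myself")
probs <- c(0.4, 0.3, 0.2, 0.1)

n_docs   <- 5
len_mean <- 50
len_sd   <- 5
len_min  <- 20
len_max  <- 1000

# Assumption: draw lengths from a normal distribution, then clamp them
# to the interval [len_min, len_max]
doc_lens <- round(rnorm(n_docs, mean = len_mean, sd = len_sd))
doc_lens <- pmin(pmax(doc_lens, len_min), len_max)

# One row per document: token counts drawn from the term probabilities
toy_dtm <- t(vapply(
  doc_lens,
  function(n) as.integer(rmultinom(1, size = n, prob = probs)),
  integer(length(vocab))
))
colnames(toy_dtm) <- vocab

dim(toy_dtm)                        # documents by vocabulary terms
all(rowSums(toy_dtm) == doc_lens)   # each row sums to its document length
```

Repeating this procedure n_cors times and collecting the matrices in a list gives the general shape of what rancors_builder() returns.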

Usage

rancors_builder(
  data,
  vocab,
  probs,
  n_cors,
  n_docs,
  len_mean,
  len_var,
  len_min,
  len_max,
  seed = NULL
)

Arguments

data

Data.frame containing vocabulary and probabilities

vocab

Name of the column containing vocabulary

probs

Name of the column containing probabilities

n_cors

Integer indicating the number of corpora to build

n_docs

Integer(s) indicating the number of documents to be returned. If two numbers are provided, a number will be randomly sampled within that range for each corpus.

len_mean

Integer(s) indicating the mean of the document lengths. If two numbers are provided, a number will be randomly sampled within that range for each corpus.

len_var

Integer(s) indicating the standard deviation of the document lengths. If two numbers are provided, a number will be randomly sampled within that range for each corpus.

len_min

Integer(s) indicating the minimum of the document lengths. If two numbers are provided, a number will be randomly sampled within that range for each corpus.

len_max

Integer(s) indicating the maximum of the document lengths. If two numbers are provided, a number will be randomly sampled within that range for each corpus.

seed

Optional seed for reproducibility
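The range behavior described for n_docs, len_mean, len_var, len_min, and len_max can be pictured with a small base R sketch. The helper pick_value() below is hypothetical and not part of text2map; it assumes that a value is drawn uniformly from the integers in the range, which the documentation above does not specify.

```r
# Hypothetical helper for illustration only; not a text2map function.
pick_value <- function(x) {
  if (length(x) == 2) {
    lo <- min(x)
    hi <- max(x)
    # Assumption: uniform draw over the integers lo, lo + 1, ..., hi
    lo + sample.int(hi - lo + 1, 1) - 1
  } else {
    # A single number is used as-is for every corpus
    x
  }
}

set.seed(59801)
n_cors <- 20

# One sampled mean length per corpus when a range is given
len_means <- vapply(
  seq_len(n_cors),
  function(i) pick_value(c(50, 200)),
  numeric(1)
)

# A single value stays fixed across corpora
fixed_sd <- vapply(
  seq_len(n_cors),
  function(i) pick_value(5),
  numeric(1)
)

range(len_means)
unique(fixed_sd)
```

Under this reading, passing len_mean = c(50, 200) gives each corpus its own mean document length, while len_var = 5 applies the same value to all of them.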

Author(s)

Dustin Stoltz and Marshall Taylor

Examples

# create corpus and DTM
my_corpus <- data.frame(
  text = c(
    "I hear babies crying I watch them grow",
    "They'll learn much more than I'll ever know",
    "And I think to myself",
    "What a wonderful world",
    "Yes I think to myself",
    "What a wonderful world"
  ),
  line_id = paste0("line", seq_len(6))
)
## some text preprocessing
my_corpus$clean_text <- tolower(gsub("'", "", my_corpus$text))

dtm <- dtm_builder(
  data = my_corpus,
  text = clean_text,
  doc_id = line_id
)

# use colSums to get term frequencies
df <- data.frame(
  vocab = colnames(dtm),
  freqs = colSums(dtm)
)
# convert to probabilities
df$probs <- df$freqs / sum(df$freqs)

# create a list of random DTMs (one per corpus)
ls_dtms <- df |>
  rancors_builder(
    vocab,
    probs,
    n_cors = 20,
    n_docs = 100,
    len_mean = c(50, 200),
    len_var = 5,
    len_min = 20,
    len_max = 1000,
    seed = 59801
  )
length(ls_dtms)


text2map documentation built on July 9, 2023, 6:35 p.m.