scale_text: Scale text using pivoted text scaling


View source: R/scale_text.R

Description

scale_text runs pivoted text scaling.

Usage

scale_text(tdm, meta = NULL, tdm_vocab = NULL, embeddings = NULL,
  embeddings_vocab = NULL, compress_fast = TRUE,
  n_dimension_compression = NULL, pivot = 2, verbose = TRUE,
  constrain_outliers = FALSE, simple = TRUE, holdout = NULL)

Arguments

tdm

A sparseMatrix. Rows are documents and columns are vocabulary.
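The tdm can be produced with doc_to_tdm (see Examples) or assembled directly. A minimal sketch using the Matrix package, with two toy documents and an invented three-word vocabulary for illustration:

```r
library(Matrix)

# Two toy documents over a three-word vocabulary (illustrative only)
tdm <- sparseMatrix(
    i = c(1, 1, 2, 2),  # document (row) indices
    j = c(1, 2, 2, 3),  # vocabulary (column) indices
    x = c(2, 1, 1, 3),  # word counts
    dims = c(2, 3),
    dimnames = list(NULL, c("economy", "jobs", "taxes"))
    )

dim(tdm)       # 2 documents x 3 words
colnames(tdm)  # vocabulary carried in the column names
```

Because the vocabulary is carried in colnames(tdm), tdm_vocab can be left NULL in this case.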

meta

A data.frame. Its rows must line up with the rows of tdm. It is included only to keep track of any accompanying variables and is returned unaltered by the function.

tdm_vocab

A character vector. Provides the vocabulary for the columns of tdm if it is missing from the column names.

embeddings

A numeric matrix of word embedding values. Rows are words and columns are embedding dimensions.
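For illustration, an embeddings matrix is simply a numeric matrix whose rows are named by words; a hypothetical sketch with random values and a made-up vocabulary:

```r
# Illustrative 3-word x 4-dimension embedding matrix (values are random,
# vocabulary is invented; real embeddings come from read_word_embeddings)
set.seed(1)
embeddings <- matrix(
    rnorm(12), nrow = 3, ncol = 4,
    dimnames = list(c("economy", "jobs", "taxes"), NULL)
    )

rownames(embeddings)  # vocabulary carried in the row names
```

When the row names carry the vocabulary like this, embeddings_vocab can be left NULL.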

embeddings_vocab

A character vector. Provides the vocabulary for the rows of the chosen embeddings if it is missing from the row names.

compress_fast

A logical scalar. If TRUE, compress with RSpectra; if FALSE, use base R.

n_dimension_compression

An integer scalar. How many dimensions of PCA to use. The algorithm will not work if this is set too high. If left NULL, a recommended number of dimensions will be calculated automatically.

pivot

An integer scalar. This is the power of the pivot. It should be set as high as possible as long as the algorithm still works; 2 or 4 is a good bet. If using out-of-sample embeddings, it can be set lower (e.g. 1/2).

verbose

A logical scalar. If TRUE, print the function's progress.

constrain_outliers

A logical scalar. If TRUE, requires in-sample words and the embedding scores for documents to have approximately unit norms. Recommended for online surveys (it reduces the influence of bad data), focused survey questions, and online social media data.

simple

A logical scalar. If TRUE, pivot only once.

holdout

A logical or numeric vector indicating which rows to exclude from training.
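A sketch of building a holdout indicator, assuming a corpus of 100 documents (the document count and the 20-document holdout size are arbitrary; the scale_text call is commented out because it depends on a real tdm):

```r
set.seed(123)
n_docs <- 100  # assumed number of rows in tdm

# Logical vector marking 20 randomly chosen documents as held out
holdout <- seq_len(n_docs) %in% sample(n_docs, 20)

## scores <- scale_text(tdm = tdm, holdout = holdout)
```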

See Also

read_word_embeddings, get_keywords, plot_keywords, score_documents, doc_to_tdm

Examples

## Not run: 
library(stm)
library(parrot)

processed <- textProcessor(
    input_data$text,
    data.frame(input_data),
    removestopwords = TRUE, lowercase = TRUE, stem = FALSE
    )
out <- prepDocuments(
    processed$documents, processed$vocab, processed$meta
    )

tdm <- doc_to_tdm(out)

# download and extract embeddings data first

embeddings <- read_word_embeddings(
    in_vocab = out$vocab,
    ovefile = "O2M_overlap.txt" # replace with the location on your computer, e.g. "path/to/O2M_overlap.txt"
    ## ovefile2 = "O2M_oov.txt", # very rare words and misspellings
    ## available here http://www.cis.uni-muenchen.de/~wenpeng/renamed-meta-emb.tar.gz
    ## must unpack and replace "path/to/" with location on your computer
    )

scores <- scale_text(
    meta = out$meta,
    tdm = tdm,
##    embeddings = embeddings[["meta"]], ## limited effects on output
    compress_fast = TRUE,
    constrain_outliers = TRUE
    )

document_scores <- score_documents(
    scores = scores, n_dimensions = 10
    )

get_keywords(scores, n_dimensions = 3, n_words = 15)

with(document_scores, cor(sqrt(n_words), X0, use = "complete"))

plot_keywords(
    scores, x_dimension = 1, y_dimension = 2, q_cutoff = 0.9
    )

## End(Not run)

wilryh/parrot documentation built on Jan. 9, 2020, 2:16 p.m.