scale_text: Scale text using pivoted text scaling


View source: R/scale_text.R

Description

scale_text runs pivoted text scaling.

Usage

scale_text(tdm, meta = NULL, tdm_vocab = NULL, embeddings = NULL,
  embeddings_vocab = NULL, compress_fast = TRUE,
  n_dimension_compression = NULL, pivot = 2, verbose = TRUE,
  constrain_outliers = FALSE, simple = TRUE, holdout = NULL)

Arguments

tdm

A sparseMatrix. Rows are documents and columns are vocabulary.
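The tdm can be produced with doc_to_tdm (see Examples) or assembled directly. A minimal sketch using the Matrix package, with two toy documents and an invented three-word vocabulary for illustration:

```r
library(Matrix)

# Two toy documents over a three-word vocabulary (illustrative only)
tdm <- sparseMatrix(
    i = c(1, 1, 2, 2),  # document (row) indices
    j = c(1, 2, 2, 3),  # vocabulary (column) indices
    x = c(2, 1, 1, 3),  # word counts
    dims = c(2, 3),
    dimnames = list(NULL, c("economy", "jobs", "taxes"))
    )

dim(tdm)       # 2 documents x 3 words
colnames(tdm)  # vocabulary carried in the column names
```

Because the vocabulary is carried in colnames(tdm), tdm_vocab can be left NULL in this case.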

meta

A data.frame. Its rows must line up with the rows of tdm. It is included only to keep track of any accompanying variables and is returned unaltered by the function.

tdm_vocab

A character vector. Provides the vocabulary for the columns of tdm if it is missing from the column names.

embeddings

A numeric matrix of word embedding values. Rows are words and columns are embedding dimensions.
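For illustration, an embeddings matrix is simply a numeric matrix whose rows are named by words; a hypothetical sketch with random values and a made-up vocabulary:

```r
# Illustrative 3-word x 4-dimension embedding matrix (values are random,
# vocabulary is invented; real embeddings come from read_word_embeddings)
set.seed(1)
embeddings <- matrix(
    rnorm(12), nrow = 3, ncol = 4,
    dimnames = list(c("economy", "jobs", "taxes"), NULL)
    )

rownames(embeddings)  # vocabulary carried in the row names
```

When the row names carry the vocabulary like this, embeddings_vocab can be left NULL.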

embeddings_vocab

A character vector. Provides the vocabulary for the rows of the chosen embeddings if it is missing from the row names.

compress_fast

A logical scalar. If TRUE, compress with RSpectra; if FALSE, use base R.

n_dimension_compression

An integer scalar. How many dimensions of PCA to use. The algorithm will not work if this is set too high. If left NULL, a recommended number of dimensions will be calculated automatically.

pivot

An integer scalar. This is the power of the pivot. It should be set as high as possible as long as the algorithm still works; 2 or 4 is a good bet. If using out-of-sample embeddings, it can be set lower (e.g. 1/2).

verbose

A logical scalar. If TRUE, print the function's progress.

constrain_outliers

A logical scalar. If TRUE, requires in-sample words and the embedding scores for documents to have approximately unit norms. Recommended for online surveys (it reduces the influence of bad data), focused survey questions, and online social media data.

simple

A logical scalar. If TRUE, pivot only once.

holdout

A logical or numeric vector indicating which rows to exclude from training.
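A sketch of building a holdout indicator, assuming a corpus of 100 documents (the document count and the 20-document holdout size are arbitrary; the scale_text call is commented out because it depends on a real tdm):

```r
set.seed(123)
n_docs <- 100  # assumed number of rows in tdm

# Logical vector marking 20 randomly chosen documents as held out
holdout <- seq_len(n_docs) %in% sample(n_docs, 20)

## scores <- scale_text(tdm = tdm, holdout = holdout)
```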

See Also

read_word_embeddings, get_keywords, plot_keywords, score_documents, doc_to_tdm

Examples

## Not run: 
library(stm)
library(parrot)

processed <- textProcessor(
    input_data$text,
    data.frame(input_data),
    removestopwords = TRUE, lowercase = TRUE, stem = FALSE
    )
out <- prepDocuments(
    processed$documents, processed$vocab, processed$meta
    )

tdm <- doc_to_tdm(out)

# download and extract embeddings data first

embeddings <- read_word_embeddings(
    in_vocab = out$vocab,
    ovefile = "O2M_overlap.txt" # replace with the location on your computer, e.g. "path/to/O2M_overlap.txt"
    ## ovefile2 = "O2M_oov.txt", # very rare words and misspellings
    ## available here http://www.cis.uni-muenchen.de/~wenpeng/renamed-meta-emb.tar.gz
    ## must unpack and replace "path/to/" with location on your computer
    )

scores <- scale_text(
    meta = out$meta,
    tdm = tdm,
##    embeddings = embeddings[["meta"]], ## limited effects on output
    compress_fast = TRUE,
    constrain_outliers = TRUE
    )

document_scores <- score_documents(
    scores = scores, n_dimensions = 10
    )

get_keywords(scores, n_dimensions = 3, n_words = 15)

with(document_scores, cor(sqrt(n_words), X0, use = "complete"))

plot_keywords(
    scores, x_dimension = 1, y_dimension = 2, q_cutoff = 0.9
    )

## End(Not run)

wilryh/parrot documentation built on Jan. 9, 2020, 2:16 p.m.