textVectors: Tokenize a vector of text and convert to a sparse matrix....


View source: R/text_to_sparse_matrix.R

Description

This function takes a vector of text, cleans it up, tokenizes it, spellchecks it, removes stopwords, stems it, finds n-grams, creates a bag of words, and converts the result to a sparse matrix. Optionally, it also applies tf-idf weighting to the matrix. This function can be slow. Note that freqCutoff and absCutoff are relative to the number of documents a term appears in and ignore its frequency within documents.
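To make the document-frequency note concrete, here is a minimal sketch of counting how many documents contain each term and filtering on that count. It is not the package's implementation: the toy corpus, the variable names, and the choice of "at least" as the comparison are assumptions for illustration, and it relies only on the Matrix package.

```r
library(Matrix)  # sparse matrix classes; ships with standard R distributions

# Hypothetical toy corpus, tokenized on single spaces
docs <- c("a b b b", "a c", "a b d")
tokens <- strsplit(docs, " ", fixed = TRUE)

# Vocabulary plus (document, term) index pairs, one pair per token
vocab <- sort(unique(unlist(tokens)))
i <- rep(seq_along(tokens), lengths(tokens))
j <- match(unlist(tokens), vocab)

# sparseMatrix() adds up duplicated (i, j) pairs, which yields
# within-document term counts
dtm <- sparseMatrix(i = i, j = j, x = rep(1, length(i)),
                    dims = c(length(docs), length(vocab)),
                    dimnames = list(NULL, vocab))

# Document frequency: number of documents containing each term,
# no matter how often the term repeats inside a document
doc_freq <- colSums(dtm > 0)

# Assumed cutoffs: keep terms found in at least 2 documents (absolute)
# or in at least half of the documents (relative)
keep_abs <- doc_freq >= 2
keep_rel <- doc_freq / nrow(dtm) >= 0.5
dtm_filtered <- dtm[, keep_abs, drop = FALSE]
```

Here the term "b" occurs three times in the first document but counts only once toward its document frequency, so it is treated the same as a term that appears once in each of two documents.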

Usage

textVectors(x, normalize = FALSE, split_token = " ", verbose = FALSE,
  freqCutoff = 0, absCutoff = 0, tfidf = FALSE, idf = NULL,
  bagofwords = NULL, spellcheck = FALSE, remove_stopwords = FALSE,
  stem = FALSE, ngrams = 1, skips = 0, stops = NULL, pca = FALSE,
  pca_comp = 5, pca_rotation = NULL, tsne = FALSE, tsne_dims = 2,
  tsne_perplexity = 30)

Arguments

x

a character vector

normalize

whether to normalize the character vector by converting it to lowercase, removing accents, converting punctuation and spaces to single spaces, and trimming the string.

split_token

token to use to split the text data. If NULL, text will not be tokenized and the bagofwords will be detected via regular expressions.

verbose

whether to print a log while performing the operations

freqCutoff

columns whose document frequency, as a percentage of documents, is below this value will be removed from the final object

absCutoff

columns whose absolute document frequency (the number of documents containing the term) is below this value will be removed from the final object

tfidf

whether to apply tfidf weighting. NOTE THAT THIS WILL CREATE A DENSE MATRIX, WHICH IN MANY CASES IS BAD.

idf

Pre-computed inverse document frequencies (perhaps from another, larger dataset)

bagofwords

input bagofwords to use to construct the final matrix

spellcheck

if TRUE tokens will be spellchecked before they are stemmed

remove_stopwords

if TRUE, english stopwords will be removed from the tokens

stem

if TRUE the tokens will be stemmed, after tokenizing and before creating a matrix

ngrams

If greater than 1, n-grams of this degree will be added to the word bag

skips

If greater than 0, skips of this degree will be added to the word bag

stops

Optional list of stopwords, otherwise a default list will be used.

pca

Apply PCA after transforming text to sparse matrix?

pca_comp

Number of components to use for PCA

pca_rotation

Rotation matrix to use for PCA. If NULL, will be computed by irlba.

tsne

Apply the tsne transformation after the PCA rotation?

tsne_dims

Dimension of the final TSNE embedding. Should be smaller than pca_comp.

tsne_perplexity

Perplexity for the tsne transformation.
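The tfidf and idf arguments can be illustrated with a small base R sketch. It uses the textbook log(N / df) form of inverse document frequency, which is an assumption: textVectors may smooth or scale its weights differently. The count matrices and variable names are hypothetical.

```r
# Hypothetical term counts: 3 documents (rows) by 3 terms (columns)
counts <- matrix(c(1, 3, 0,
                   1, 0, 1,
                   1, 1, 0),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(NULL, c("a", "b", "c")))

# Inverse document frequency in the textbook log(N / df) form
# (an assumption; the package may compute it differently)
doc_freq <- colSums(counts > 0)
idf <- log(nrow(counts) / doc_freq)

# Weight each term column by its idf
tfidf <- sweep(counts, 2, idf, `*`)

# A precomputed idf (for example from a larger corpus) can be applied
# to new counts as long as it is aligned to the same columns
new_counts <- matrix(c(0, 2, 1), nrow = 1,
                     dimnames = list(NULL, c("a", "b", "c")))
new_tfidf <- sweep(new_counts, 2, idf[colnames(new_counts)], `*`)
```

Under this definition a term that appears in every document, such as "a" here, receives a weight of zero, while rarer terms are weighted up.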

Value

a textVectors object

References

http://stackoverflow.com/questions/4942361/how-to-turn-a-list-of-lists-to-a-sparse-matrix-in-r-without-using-lapply

http://en.wikipedia.org/wiki/Tf-idf

Examples

x <- c(
  'i like this package written by zach mayer',
  'this package is so much fun',
  'thanks zach for writing it',
  'this package is the best package',
  'i want to give zach mayer a million dollars')
textVectors(
  x,
  absCutoff=1, ngrams=2, stem=TRUE, verbose=TRUE)
textVectors(
  x,
  absCutoff=1, ngrams=2, skips=1, stem=TRUE, verbose=TRUE, tfidf=TRUE)
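
For orientation on what ngrams=2 adds to the word bag, the following base R sketch builds contiguous word bigrams for one of the example strings. The helper word_ngrams is hypothetical and not part of r2vec, and joining tokens with a space is an assumption; the package may build or label its n-grams differently.

```r
# Hypothetical helper (not part of r2vec): contiguous word n-grams
word_ngrams <- function(tokens, n) {
  # Too few tokens to form even one n-gram
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(s) paste(tokens[s:(s + n - 1)], collapse = " "),
         character(1))
}

tokens <- strsplit("this package is so much fun", " ", fixed = TRUE)[[1]]
bigrams <- word_ngrams(tokens, 2)
print(bigrams)
```

Six tokens yield five bigrams, starting with "this package" and ending with "much fun".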

zachmayer/r2vec documentation built on May 4, 2019, 9:05 p.m.