textVectors: Tokenize a vector of text and convert to a sparse matrix
In zachmayer/r2vec: Text data to numeric vectors


View source: R/text_to_sparse_matrix.R

This function takes a vector of text, cleans it up, tokenizes it, spellchecks it, removes stopwords, stems it, finds n-grams, and converts the result to a sparse matrix using a bag-of-words model. Optionally, it also applies tf-idf weighting to the matrix. This function can be slow. Note that freqCutoff and absCutoff are relative to the number of documents a term appears in, and ignore its frequency within documents.
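To make the distinction between document frequency and within-document frequency concrete, here is a minimal base R sketch. It does not use the package's internals; the variable names (`docs`, `term_freq`, `doc_freq`, `keep`) and the cutoff of 2 are illustrative assumptions.

```r
docs <- c(
  "this package is so much fun",
  "this package is the best package",
  "thanks zach for writing it")

# Tokenize on single spaces (the same separator as the default split_token)
tokens <- strsplit(docs, " ", fixed = TRUE)

# Total occurrences of each term across all documents
term_freq <- table(unlist(tokens))

# Number of documents containing each term:
# deduplicate within each document before counting
doc_freq <- table(unlist(lapply(tokens, unique)))

# "package" occurs three times in total but in only two documents
term_freq["package"]
doc_freq["package"]

# Keep only terms that appear in at least 2 documents (hypothetical cutoff)
keep <- names(doc_freq)[doc_freq >= 2]
keep
```

A cutoff defined on document frequency treats "package" as appearing twice here, even though the word occurs three times in the corpus.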

textVectors(x, normalize = FALSE, split_token = " ", verbose = FALSE,
  freqCutoff = 0, absCutoff = 0, tfidf = FALSE, idf = NULL,
  bagofwords = NULL, spellcheck = FALSE, remove_stopwords = FALSE,
  stem = FALSE, ngrams = 1, skips = 0, stops = NULL, pca = FALSE,
  pca_comp = 5, pca_rotation = NULL, tsne = FALSE, tsne_dims = 2,
  tsne_perplexity = 30)

`x`	a character vector
`normalize`	normalize the character vector by converting to lowercase, removing accents, and converting punctuation and spaces to single spaces and then trimming the string.
`split_token`	token to use to split the text data. If NULL, text will not be tokenized and the bagofwords will be detected via regular expressions.
`verbose`	whether to print a log while performing the operations
`freqCutoff`	columns (terms) that appear in less than this percentage of documents will be removed from the final object
`absCutoff`	columns (terms) that appear in fewer than this absolute number of documents will be removed from the final object
`tfidf`	whether to apply tfidf weighting. NOTE THAT THIS WILL CREATE A DENSE MATRIX, WHICH IN MANY CASES IS BAD.
`idf`	Pre-computed inverse document frequencies (perhaps from another, larger dataset)
`bagofwords`	input bagofwords to use to construct the final matrix
`spellcheck`	if TRUE tokens will be spellchecked before they are stemmed
`remove_stopwords`	if TRUE, english stopwords will be removed from the tokens
`stem`	if TRUE the tokens will be stemmed, after tokenizing and before creating a matrix
`ngrams`	If greater than 1, n-grams of this degree will be added to the word bag
`skips`	If greater than 0, skips of this degree will be added to the word bag
`stops`	Optional list of stopwords, otherwise a default list will be used.
`pca`	Apply PCA after transforming text to sparse matrix?
`pca_comp`	Number of components to use for PCA
`pca_rotation`	Rotation matrix to use for PCA. If NULL, will be computed by irlba.
`tsne`	Apply the tsne transformation after the PCA rotation?
`tsne_dims`	Dimension of the final TSNE embedding. Should be smaller than pca_comp.
`tsne_perplexity`	Perplexity for the tsne transformation.
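
As a rough illustration of what the `ngrams` and `skips` arguments add to the word bag, the following base R sketch builds bigrams and skip-grams from one tokenized document. The helper names `make_ngrams` and `make_skipgrams`, the `"_"` separator, and the skip-gram definition (token pairs separated by exactly `k` intervening tokens) are assumptions made for this example; the package may define and label these terms differently.

```r
# All n-grams of one tokenized document, joined with a separator
make_ngrams <- function(tokens, n, sep = "_") {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(s) paste(tokens[s:(s + n - 1)], collapse = sep),
         character(1))
}

# Skip-grams under one common definition (an assumption here):
# pairs of tokens with exactly k tokens skipped between them
make_skipgrams <- function(tokens, k, sep = "_") {
  if (length(tokens) < k + 2) return(character(0))
  left <- seq_len(length(tokens) - k - 1)
  paste(tokens[left], tokens[left + k + 1], sep = sep)
}

tokens <- strsplit("thanks zach for writing it", " ", fixed = TRUE)[[1]]

bigrams <- make_ngrams(tokens, 2)
skipgrams <- make_skipgrams(tokens, 1)

bigrams    # "thanks_zach" "zach_for" "for_writing" "writing_it"
skipgrams  # "thanks_for" "zach_writing" "for_it"
```

Both kinds of term grow the vocabulary quickly, which is one reason a document-frequency cutoff is useful alongside them.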

a textVectors object

http://stackoverflow.com/questions/4942361/how-to-turn-a-list-of-lists-to-a-sparse-matrix-in-r-without-using-lapply
http://en.wikipedia.org/wiki/Tf-idf

# Five short example documents
x <- c(
  'i like this package written by zach mayer',
  'this package is so much fun',
  'thanks zach for writing it',
  'this package is the best package',
  'i want to give zach mayer a million dollars')
# Stemmed unigrams and bigrams, with the absolute cutoff set to 1
textVectors(
  x,
  absCutoff=1, ngrams=2, stem=TRUE, verbose=TRUE)
# Same as above, plus skips of degree 1 and tf-idf weighting
textVectors(
  x,
  absCutoff=1, ngrams=2, skips=1, stem=TRUE, verbose=TRUE, tfidf=TRUE)
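
The last call enables tf-idf weighting. To show what a bag-of-words sparse matrix and a tf-idf weighting look like outside the package, here is a sketch using the Matrix package. It is not the package's implementation: the variable names, the idf formula `log(n_docs / doc_freq)`, and the diagonal-multiply approach are assumptions chosen for this example.

```r
library(Matrix)

docs <- c(
  "this package is so much fun",
  "this package is the best package",
  "thanks zach for writing it")
tokens <- strsplit(docs, " ", fixed = TRUE)

# Vocabulary: sorted unique terms across all documents
vocab <- sort(unique(unlist(tokens)))

# Triplet form: row index = document, column index = term
i <- rep(seq_along(tokens), lengths(tokens))
j <- match(unlist(tokens), vocab)

# sparseMatrix() sums duplicated (i, j) pairs, which yields term counts
counts <- sparseMatrix(i = i, j = j, x = rep(1, length(i)),
                       dims = c(length(docs), length(vocab)),
                       dimnames = list(NULL, vocab))

# Document frequency: number of documents in which each term occurs
doc_freq <- colSums(counts > 0)

# One common idf definition (assumed here): log(N / document frequency)
idf <- log(length(docs) / doc_freq)

# Scale each column by its idf; multiplying by a diagonal matrix
# keeps the result sparse
tfidf <- counts %*% Diagonal(x = idf)
colnames(tfidf) <- vocab

counts[2, "package"]  # term count of "package" in document 2
tfidf[2, "package"]   # the same count scaled by log(3 / 2)
```

Precomputed weights, such as those passed through the `idf` argument, would play the role of the `idf` vector in this sketch.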