text2vec: Modern Text Mining Framework for R
Version 0.4.0

Fast and memory-friendly tools for text vectorization, topic modeling (LDA, LSA), word embeddings (GloVe), and similarity measures. The package provides a source-agnostic streaming API, which allows researchers to analyze collections of documents larger than available RAM. All core functions are parallelized to benefit from multicore machines.
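The streaming API can be sketched as follows (a minimal sketch: the throw-away corpus written to a temp directory is purely illustrative, and argument names follow version 0.4.0 and may differ in other releases):

```r
library(text2vec)

# Write a tiny throw-away corpus to disk so the example is self-contained
dir <- file.path(tempdir(), "corpus_dir")
dir.create(dir, showWarnings = FALSE)
writeLines("the quick brown fox", file.path(dir, "doc1.txt"))
writeLines("the lazy dog", file.path(dir, "doc2.txt"))
files <- list.files(dir, full.names = TRUE)

# ifiles() yields files lazily, so the full corpus never has to fit in RAM
it <- itoken(ifiles(files),
             preprocessor = tolower,
             tokenizer = word_tokenizer)

# The vocabulary is accumulated in a single streaming pass over the files
vocab <- create_vocabulary(it)
```

The same iterator can then be fed to `create_dtm()` or `create_tcm()`, so arbitrarily large corpora are processed file by file.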

Author: Dmitriy Selivanov [aut, cre], Lincoln Mullen [ctb]
Date of publication: 2016-10-04 17:48:07
Maintainer: Dmitriy Selivanov <selivanov.dmitriy@gmail.com>
License: GPL (>= 2) | file LICENSE
Version: 0.4.0
URL: http://text2vec.org
Installation: Install the latest version of this package by entering the following in R:
install.packages("text2vec")
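A minimal end-to-end example of the vocabulary-based vectorization pipeline, using the bundled movie_review dataset (a sketch following the package vignettes; argument names are as in version 0.4.0 and may differ in other releases):

```r
library(text2vec)
data("movie_review")

# Iterator over lowercased, word-tokenized reviews
it <- itoken(movie_review$review,
             preprocessor = tolower,
             tokenizer = word_tokenizer,
             ids = movie_review$id)

# Build the vocabulary, then drop rare terms
vocab <- create_vocabulary(it)
vocab <- prune_vocabulary(vocab, term_count_min = 5L)

# Map terms to columns and construct a sparse document-term matrix
vectorizer <- vocab_vectorizer(vocab)
dtm <- create_dtm(it, vectorizer)

dim(dtm)  # one row per review, one column per retained term
```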

Getting started

README.md
Analyzing Texts with the text2vec package
GloVe Word Embeddings


Man pages

as.lda_c: Converts a sparse document-term matrix to 'lda_c' format
check_analogy_accuracy: Checks accuracy of word embeddings on the analogy task
create_corpus: Create a corpus
create_dtm: Document-term matrix construction
create_tcm: Term-co-occurrence matrix construction
create_vocabulary: Creates a vocabulary of unique terms
distances: Pairwise Distance Matrix Computation
fit: Fits model to data
fit_transform: Fit model to data, then transform it
get_dtm: Extract document-term matrix
get_idf: Inverse document-frequency scaling matrix
get_tcm: Extract term-co-occurrence matrix
get_tf: Term-frequency scaling matrix
GlobalVectors: Creates Global Vectors word-embeddings model.
glove: Fit a GloVe word embeddings model
ifiles: Creates iterator over text files from the disk
itoken: Iterators over input objects
LatentDirichletAllocation: Creates Latent Dirichlet Allocation model.
LatentSemanticAnalysis: Latent Semantic Analysis model
movie_review: IMDB movie reviews
normalize: Matrix normalization
prepare_analogy_questions: Prepares list of analogy questions
prune_vocabulary: Prune vocabulary
reexports: Objects exported from other packages
RelaxedWordMoversDistance: Creates model which can be used for calculation of "relaxed word movers distance"
similarities: Pairwise Similarity Matrix Computation
split_into: Split a vector for parallel processing
text2vec: text2vec
TfIdf: TfIdf
tokenizers: Simple tokenization functions that perform string splitting
transform: Transforms Matrix-like object using 'model'
transform_filter_commons: Remove terms from a document-term matrix
transform_tf: Scale a document-term matrix
vectorizers: Vocabulary and hash vectorizers
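The create_tcm, GlobalVectors, and fit entries above combine into the GloVe workflow roughly as follows (a sketch based on the package's GloVe vignette; in 0.4.0 the skip-gram window is configured via vocab_vectorizer, and later releases moved it to create_tcm, so argument names may differ in other versions):

```r
library(text2vec)
data("movie_review")

it <- itoken(movie_review$review,
             preprocessor = tolower,
             tokenizer = word_tokenizer)
vocab <- prune_vocabulary(create_vocabulary(it), term_count_min = 5L)

# Term-co-occurrence matrix over a 5-word skip-gram window
vectorizer <- vocab_vectorizer(vocab,
                               grow_dtm = FALSE,
                               skip_grams_window = 5L)
tcm <- create_tcm(it, vectorizer)

# Fit 50-dimensional GloVe word embeddings on the co-occurrence counts
glove <- GlobalVectors$new(word_vectors_size = 50,
                           vocabulary = vocab,
                           x_max = 10)
glove$fit(tcm, n_iter = 10)
word_vectors <- glove$get_word_vectors()
```

The resulting matrix (one row per vocabulary term) can then be evaluated with prepare_analogy_questions and check_analogy_accuracy.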

Functions

GloVe Man page
GlobalVectors Man page
LDA Man page
LSA Man page
LatentDirichletAllocation Man page
LatentSemanticAnalysis Man page
RWMD Man page
RelaxedWordMoversDistance Man page
StopIteration Source code
TfIdf Man page
%>% Man page
as.lda_c Man page Source code
char_tokenizer Man page Source code
check_analogy_accuracy Man page Source code
coerce_matrix Source code
colMaxs Source code
colMins Source code
collapsedGibbsSampler Source code
combine_vocabulary Source code
corpus_insert Source code
cosine_dist_internal Source code
create_corpus Man page Source code
create_dtm Man page Source code
create_dtm.itoken Man page Source code
create_dtm.list Man page Source code
create_tcm Man page Source code
create_tcm.itoken Man page Source code
create_tcm.list Man page Source code
create_vocabulary Man page Source code
create_vocabulary.character Man page Source code
create_vocabulary.itoken Man page Source code
create_vocabulary.list Man page Source code
detect_ngrams Source code
dist2 Man page Source code
dist_internal Source code
distances Man page
euclidean_dist Source code
euclidean_dist_internal Source code
fit Man page Source code
fit.Matrix Man page Source code
fit.matrix Man page Source code
fit_transform Man page Source code
fit_transform.Matrix Man page Source code
fit_transform.matrix Man page Source code
get_dtm Man page Source code
get_idf Man page Source code
get_tcm Man page Source code
get_tf Man page Source code
glove Man page Source code
hash_vectorizer Man page Source code
hasher Source code
idir Man page Source code
ifiles Man page Source code
itoken Man page Source code
itoken.character Man page Source code
itoken.iterator Man page Source code
itoken.list Man page Source code
jaccard_sim Source code
mc_reduce Source code
mc_triplet_rds_sum Source code
movie_review Man page
normalize Man page Source code
onAttach Source code
onUnload Source code
pdist2 Man page Source code
prepare_analogy_questions Man page Source code
print.text2vec_vocabulary Source code
prune_vocabulary Man page Source code
psim2 Man page Source code
rbind_dgTMatrix Source code
reexports Man page
regexp_tokenizer Man page Source code
rowMaxs Source code
rowMins Source code
sim2 Man page Source code
similarities Man page
space_tokenizer Man page Source code
split_into Man page Source code
split_vector Source code
text2vec Man page
text2vec-package Man page
to_lda_c Source code
tokenizers Man page
total_likelihood Source code
transform Man page
transform.Matrix Man page Source code
transform.matrix Man page Source code
transform_binary Man page Source code
transform_filter_commons Man page Source code
transform_tf Man page Source code
transform_tfidf Man page Source code
vectorizers Man page
vocab_vectorizer Man page Source code
vocabulary Man page Source code
word_tokenizer Man page Source code
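The TfIdf, fit_transform, and sim2 entries indexed above compose in the package's model-transform style (a sketch on a toy three-document corpus invented for illustration; argument names follow version 0.4.0):

```r
library(text2vec)

# Tiny toy corpus so the example is self-contained
docs <- c("apples and oranges", "apples and pears", "cars and trucks")
it <- itoken(docs, preprocessor = tolower, tokenizer = word_tokenizer)
vectorizer <- vocab_vectorizer(create_vocabulary(it))
dtm <- create_dtm(it, vectorizer)

# Reweight the matrix by TF-IDF using the model-transform interface
tfidf <- TfIdf$new()
dtm_tfidf <- fit_transform(dtm, tfidf)

# Pairwise cosine similarity between the three documents
doc_sim <- sim2(dtm_tfidf, method = "cosine", norm = "l2")
```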

Files

inst
inst/doc
inst/doc/glove.html
inst/doc/glove.R
inst/doc/text-vectorization.Rmd
inst/doc/text-vectorization.R
inst/doc/files-multicore.html
inst/doc/files-multicore.R
inst/doc/text-vectorization.html
inst/doc/files-multicore.Rmd
inst/doc/glove.Rmd
tests
tests/testthat.R
tests/testthat
tests/testthat/test-utils.R
tests/testthat/test-tcm.R
tests/testthat/utf8.r
tests/testthat/test-distances.R
tests/testthat/test-lsa.R
tests/testthat/test-hash-corpus.R
tests/testthat/test-iterators.R
tests/testthat/not-test-doc2vec.R
tests/testthat/test-s3-interface.R
tests/testthat/test-vocab-high-level.R
tests/testthat/test-vocab-corpus.R
src
src/Makevars
src/Vocabulary.h
src/matrix_utils.cpp
src/VocabCorpus.h
src/utils.cpp
src/GloveFitter.cpp
src/LDA_gibbs.cpp
src/SparseTripletMatrix.h
src/GloveFit.h
src/Vocabulary.cpp
src/text2vec.h
src/HashCorpus.cpp
src/uint_hash.cpp
src/VocabCorpus.cpp
src/Makevars.win
src/RcppExports.cpp
src/Corpus.h
src/HashCorpus.h
NAMESPACE
NEWS.md
data
data/movie_review.RData
data/datalist
R
R/utils.R
R/vocabulary.R
R/distance_RWMD.R
R/model_GloVe.R
R/model_LDA.R
R/text2vec.R
R/data.R
R/vectorizers.R
R/model_LSA.R
R/RcppExports.R
R/models_S3.R
R/analogies.R
R/models_R6.R
R/tcm.R
R/dtm.R
R/tokenizers.R
R/transformers.R
R/iterators.R
R/zzz.R
R/model_tfidf.R
R/distance.R
vignettes
vignettes/text-vectorization.Rmd
vignettes/files-multicore.Rmd
vignettes/glove.Rmd
README.md
MD5
build
build/vignette.rds
DESCRIPTION
man
man/create_dtm.Rd
man/transform.Rd
man/transform_filter_commons.Rd
man/fit.Rd
man/distances.Rd
man/vectorizers.Rd
man/ifiles.Rd
man/get_tf.Rd
man/tokenizers.Rd
man/split_into.Rd
man/get_idf.Rd
man/get_tcm.Rd
man/create_vocabulary.Rd
man/check_analogy_accuracy.Rd
man/text2vec.Rd
man/prepare_analogy_questions.Rd
man/transform_tf.Rd
man/TfIdf.Rd
man/LatentSemanticAnalysis.Rd
man/as.lda_c.Rd
man/reexports.Rd
man/GlobalVectors.Rd
man/fit_transform.Rd
man/create_corpus.Rd
man/get_dtm.Rd
man/itoken.Rd
man/glove.Rd
man/movie_review.Rd
man/LatentDirichletAllocation.Rd
man/create_tcm.Rd
man/normalize.Rd
man/similarities.Rd
man/RelaxedWordMoversDistance.Rd
man/prune_vocabulary.Rd
LICENSE
text2vec documentation built on May 19, 2017, 6:54 a.m.


Please suggest features or report bugs in the GitHub issue tracker.
