text2vec: Modern Text Mining Framework for R

Fast and memory-friendly tools for text vectorization, topic modeling (LDA, LSA), word embeddings (GloVe), and similarities. The package provides a source-agnostic streaming API that lets researchers analyze collections of documents larger than available RAM. All core functions are parallelized to take advantage of multicore machines.

Install the latest version of this package by entering the following in R:
install.packages("text2vec")
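The typical vectorization workflow combines the building blocks listed below: an itoken iterator over documents, a vocabulary, a vectorizer, and create_dtm. A minimal sketch, assuming text2vec is installed (movie_review is the sample dataset shipped with the package; the exact pruning threshold is an illustrative choice):

```r
library(text2vec)
data("movie_review")

# First pass: build the vocabulary from a token iterator
it <- itoken(movie_review$review,
             preprocessor = tolower,
             tokenizer = word_tokenizer)
v <- create_vocabulary(it)
v <- prune_vocabulary(v, term_count_min = 5)

# Iterators are consumed; recreate one for the second pass
it <- itoken(movie_review$review,
             preprocessor = tolower,
             tokenizer = word_tokenizer)
dtm <- create_dtm(it, vocab_vectorizer(v))
dim(dtm)  # rows = documents, columns = vocabulary terms
```

Because the iterator abstraction is source-agnostic, the same pipeline works with ifiles() over on-disk text files, which is how collections larger than RAM are processed.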
Author: Dmitriy Selivanov [aut, cre], Lincoln Mullen [ctb]
Date of publication: 2016-10-04 17:48:07
Maintainer: Dmitriy Selivanov <selivanov.dmitriy@gmail.com>
License: GPL (>= 2) | file LICENSE
Version: 0.4.0
http://text2vec.org

View on CRAN

Man pages

as.lda_c: Converts a sparse document-term matrix to 'lda_c' format

check_analogy_accuracy: Checks accuracy of word embeddings on the analogy task

create_corpus: Create a corpus

create_dtm: Document-term matrix construction

create_tcm: Term co-occurrence matrix construction

create_vocabulary: Creates a vocabulary of unique terms

distances: Pairwise Distance Matrix Computation

fit: Fits model to data

fit_transform: Fit model to data, then transform it

get_dtm: Extract document-term matrix

get_idf: Inverse document-frequency scaling matrix

get_tcm: Extract term co-occurrence matrix

get_tf: Term-frequency scaling matrix

GlobalVectors: Creates Global Vectors word-embeddings model.

glove: Fit a GloVe word embedding model

ifiles: Creates an iterator over text files on disk

itoken: Iterators over input objects

LatentDirichletAllocation: Creates Latent Dirichlet Allocation model.

LatentSemanticAnalysis: Latent Semantic Analysis model

movie_review: IMDB movie reviews

normalize: Matrix normalization

prepare_analogy_questions: Prepares list of analogy questions

prune_vocabulary: Prune vocabulary

reexports: Objects exported from other packages

RelaxedWordMoversDistance: Creates model which can be used for calculation of "relaxed...

similarities: Pairwise Similarity Matrix Computation

split_into: Split a vector for parallel processing

text2vec: text2vec

TfIdf: TfIdf

tokenizers: Simple tokenization functions, which perform string...

transform: Transforms Matrix-like object using 'model'

transform_filter_commons: Remove terms from a document-term matrix

transform_tf: Scale a document-term matrix

vectorizers: Vocabulary and hash vectorizers

Functions

\%>\% Man page
as.lda_c Man page
char_tokenizer Man page
check_analogy_accuracy Man page
create_corpus Man page
create_dtm Man page
create_dtm.itoken Man page
create_dtm.list Man page
create_tcm Man page
create_tcm.itoken Man page
create_tcm.list Man page
create_vocabulary Man page
create_vocabulary.character Man page
create_vocabulary.itoken Man page
create_vocabulary.list Man page
dist2 Man page
distances Man page
fit Man page
fit.matrix Man page
fit.Matrix Man page
fit_transform Man page
fit_transform.matrix Man page
fit_transform.Matrix Man page
get_dtm Man page
get_idf Man page
get_tcm Man page
get_tf Man page
GlobalVectors Man page
glove Man page
GloVe Man page
hash_vectorizer Man page
idir Man page
ifiles Man page
itoken Man page
itoken.character Man page
itoken.iterator Man page
itoken.list Man page
LatentDirichletAllocation Man page
LatentSemanticAnalysis Man page
LDA Man page
LSA Man page
movie_review Man page
normalize Man page
pdist2 Man page
prepare_analogy_questions Man page
prune_vocabulary Man page
psim2 Man page
reexports Man page
regexp_tokenizer Man page
RelaxedWordMoversDistance Man page
RWMD Man page
sim2 Man page
similarities Man page
space_tokenizer Man page
split_into Man page
text2vec Man page
text2vec-package Man page
TfIdf Man page
tokenizers Man page
transform Man page
transform_binary Man page
transform_filter_commons Man page
transform.matrix Man page
transform.Matrix Man page
transform_tf Man page
transform_tfidf Man page
vectorizers Man page
vocabulary Man page
vocab_vectorizer Man page
word_tokenizer Man page

Files

inst
inst/doc
inst/doc/glove.html
inst/doc/glove.R
inst/doc/text-vectorization.Rmd
inst/doc/text-vectorization.R
inst/doc/files-multicore.html
inst/doc/files-multicore.R
inst/doc/text-vectorization.html
inst/doc/files-multicore.Rmd
inst/doc/glove.Rmd
tests
tests/testthat.R
tests/testthat
tests/testthat/test-utils.R tests/testthat/test-tcm.R tests/testthat/utf8.r tests/testthat/test-distances.R tests/testthat/test-lsa.R tests/testthat/test-hash-corpus.R tests/testthat/test-iterators.R tests/testthat/not-test-doc2vec.R tests/testthat/test-s3-interface.R tests/testthat/test-vocab-high-level.R tests/testthat/test-vocab-corpus.R
src
src/Makevars
src/Vocabulary.h
src/matrix_utils.cpp
src/VocabCorpus.h
src/utils.cpp
src/GloveFitter.cpp
src/LDA_gibbs.cpp
src/SparseTripletMatrix.h
src/GloveFit.h
src/Vocabulary.cpp
src/text2vec.h
src/HashCorpus.cpp
src/uint_hash.cpp
src/VocabCorpus.cpp
src/Makevars.win
src/RcppExports.cpp
src/Corpus.h
src/HashCorpus.h
NAMESPACE
NEWS.md
data
data/movie_review.RData
data/datalist
R
R/utils.R R/vocabulary.R R/distance_RWMD.R R/model_GloVe.R R/model_LDA.R R/text2vec.R R/data.R R/vectorizers.R R/model_LSA.R R/RcppExports.R R/models_S3.R R/analogies.R R/models_R6.R R/tcm.R R/dtm.R R/tokenizers.R R/transformers.R R/iterators.R R/zzz.R R/model_tfidf.R R/distance.R
vignettes
vignettes/text-vectorization.Rmd
vignettes/files-multicore.Rmd
vignettes/glove.Rmd
README.md
MD5
build
build/vignette.rds
DESCRIPTION
man
man/create_dtm.Rd man/transform.Rd man/transform_filter_commons.Rd man/fit.Rd man/distances.Rd man/vectorizers.Rd man/ifiles.Rd man/get_tf.Rd man/tokenizers.Rd man/split_into.Rd man/get_idf.Rd man/get_tcm.Rd man/create_vocabulary.Rd man/check_analogy_accuracy.Rd man/text2vec.Rd man/prepare_analogy_questions.Rd man/transform_tf.Rd man/TfIdf.Rd man/LatentSemanticAnalysis.Rd man/as.lda_c.Rd man/reexports.Rd man/GlobalVectors.Rd man/fit_transform.Rd man/create_corpus.Rd man/get_dtm.Rd man/itoken.Rd man/glove.Rd man/movie_review.Rd man/LatentDirichletAllocation.Rd man/create_tcm.Rd man/normalize.Rd man/similarities.Rd man/RelaxedWordMoversDistance.Rd man/prune_vocabulary.Rd
LICENSE


Please suggest features or report bugs with the GitHub issue tracker.
