cleanNLP: A Tidy Data Model for Natural Language Processing
Version 1.5.2

Provides a set of fast tools for converting a textual corpus into a set of normalized tables. Users may use a Python back end with 'spaCy' or the Java back end 'CoreNLP'. A minimal back end with no external dependencies is also provided. Exposed annotation tasks include tokenization, part of speech tagging, named entity recognition, entity linking, sentiment analysis, dependency parsing, coreference resolution, and word embeddings. Summary statistics on token unigram, part of speech tag, and dependency type frequencies are also included to assist with analyses.

Author: Taylor B. Arnold [aut, cre]
Date of publication: 2017-04-12 22:28:43 UTC
Maintainer: Taylor B. Arnold <taylor.arnold@acm.org>
License: GPL-3
URL: https://statsmaths.github.io/cleanNLP/
Package repository: CRAN
Installation: Install the latest version of this package by entering the following in R:
install.packages("cleanNLP")
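
As a rough illustration of the workflow described above (corpus in, normalized tables out), the sketch below runs the minimal back end on two made-up sentences. It assumes the 1.5.2 function names listed in the man pages on this page (`init_tokenizers`, `annotate`, `get_token`, `get_document`). The `as_strings = TRUE` argument and the reliance on the separate 'tokenizers' R package are assumptions about that release, and newer releases of cleanNLP may name these functions differently, so the calls are guarded.

```r
# Toy corpus: two made-up sentences, not data shipped with the package.
docs <- c("The quick brown fox jumps over the lazy dog.",
          "Tidy tables make text analysis easier.")

# Only run when a cleanNLP release exposing the 1.5.2-era API is installed,
# together with the 'tokenizers' package the minimal back end is assumed to use.
has_v1_api <- requireNamespace("cleanNLP", quietly = TRUE) &&
  requireNamespace("tokenizers", quietly = TRUE) &&
  exists("init_tokenizers", envir = asNamespace("cleanNLP"), inherits = FALSE)

if (has_v1_api) {
  # Initialise the minimal back end (no Python or Java needed).
  cleanNLP::init_tokenizers()

  # Assumption: as_strings = TRUE marks `docs` as raw text, not file paths.
  anno <- cleanNLP::annotate(docs, as_strings = TRUE)

  # Extract two of the normalized tables from the annotation object.
  tokens <- cleanNLP::get_token(anno)
  documents <- cleanNLP::get_document(anno)

  print(head(tokens))
  print(documents)
}
```

The spaCy and CoreNLP back ends would replace the `init_tokenizers()` call with `init_spaCy()` or `init_coreNLP()`; their arguments are not shown on this page, so they are left out here.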

Getting started

Package overview
README.md
A Data Model for the NLP Pipeline
Exploring the State of the Union Addresses: A Case Study with cleanNLP
Introduction to the cleanNLP package

Man pages

annotate: Run the annotation pipeline on a set of documents
cleanNLP-package: cleanNLP: A Tidy Data Model for Natural Language Processing
combine_documents: Combine a set of annotations
dep_frequency: Universal Dependency Frequencies
doc_id_reset: Reset document ids
download_core_nlp: Download java files needed for CoreNLP
extract_documents: Extract documents from an annotation object
from_CoNLLU: Reads a CoNLL-U or CoNLL-X File
get_coreference: Access coreferences from an annotation object
get_dependency: Access dependencies from an annotation object
get_document: Access document meta data from an annotation object
get_entity: Access named entities from an annotation object
get_sentence: Access sentence-level annotations
get_tfidf: Construct the TF-IDF Matrix from Annotation or Data Frame
get_token: Access tokens from an annotation object
get_vector: Access word embedding vector from an annotation object
init_coreNLP: Interface for initializing the coreNLP backend
init_spaCy: Interface for initializing the spaCy backend
init_tokenizers: Interface for initializing the tokenizers backend
obama: Annotation of Barack Obama's State of the Union Addresses
pos_frequency: Universal Part of Speech Code Frequencies
print.annotation: Print a summary of an annotation object
read_annotation: Read annotation files from disk
tidy_pca: Compute Principal Components and store as a Data Frame
word_frequency: Most frequent English words
write_annotation: Write annotation files to disk
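
To make the accessor entries above concrete, here is a small sketch that loads the bundled `obama` annotation and pulls several tables from it. It assumes the 1.5.2 function names in this list and that each accessor takes the annotation object as its first argument. The helper `table_sizes` is hypothetical, introduced only for illustration.

```r
# Guard: only call the accessors when a release exposing them is installed.
has_v1_api <- requireNamespace("cleanNLP", quietly = TRUE) &&
  exists("get_token", envir = asNamespace("cleanNLP"), inherits = FALSE)

# Hypothetical helper: row counts for a named list of data frames.
table_sizes <- function(tables) vapply(tables, nrow, integer(1))

if (has_v1_api) {
  # Load the pre-computed annotation of the State of the Union addresses.
  data("obama", package = "cleanNLP")

  # One normalized table per accessor.
  tables <- list(
    document   = cleanNLP::get_document(obama),
    token      = cleanNLP::get_token(obama),
    entity     = cleanNLP::get_entity(obama),
    dependency = cleanNLP::get_dependency(obama)
  )

  print(table_sizes(tables))
}
```

Because each accessor returns a plain table, the results can be filtered and joined with ordinary data frame tools.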

Functions

annotate Man page Source code
annotate_with_r Source code
cleanNLP Man page
cleanNLP-package Man page
combine_documents Man page Source code
dep_frequency Man page
doc_id_reset Man page Source code
download_core_nlp Man page Source code
extract_documents Man page Source code
from_CoNLLU Man page Source code
get_coreference Man page Source code
get_dependency Man page Source code
get_document Man page Source code
get_entity Man page Source code
get_sentence Man page Source code
get_tfidf Man page Source code
get_token Man page Source code
get_vector Man page Source code
init_coreNLP Man page Source code
init_spaCy Man page Source code
init_tokenizers Man page Source code
.init_coreNLP_backend Source code
.init_spaCy_backend Source code
.init_tokenizers_backend Source code
obama Man page
.onLoad Source code
pos_frequency Man page
print.annotation Man page Source code
read_annotation Man page Source code
setup_coreNLP_backend_raw Source code
tidy_pca Man page Source code
word_frequency Man page
write_annotation Man page Source code

Files

inst
inst/txt_files
inst/txt_files/obama.txt
inst/txt_files/bush.txt
inst/txt_files/clinton.txt
inst/py
inst/py/load_spacy.py
inst/extdata
inst/extdata/StanfordCoreNLP-arabic.properties
inst/extdata/StanfordCoreNLP-french.properties
inst/extdata/StanfordCoreNLP-chinese.properties
inst/extdata/StanfordCoreNLP-german.properties
inst/extdata/StanfordCoreNLP-spanish.properties
inst/extdata/cleanNLP-0.1.jar
inst/extdata/StanfordCoreNLP.properties
inst/extdata/CoreNLP-to-HTML.xsl
inst/extdata/StanfordCoreNLP-english-all.properties
inst/extdata/StanfordCoreNLP-english-fast.properties
inst/doc
inst/doc/schema.Rmd
inst/doc/schema.html
inst/doc/introduction.R
inst/doc/introduction.html
inst/doc/case_study.html
inst/doc/schema.R
inst/doc/introduction.Rmd
inst/doc/case_study.R
inst/doc/case_study.Rmd
tests
tests/testthat.R
tests/testthat
tests/testthat/test-tokenizers.R
tests/testthat/test-java.R
tests/testthat/test-python.R
tests/testthat/test-tools.R
tests/testthat/test-annotation_utils.R
NAMESPACE
data
data/dep_frequency.rda
data/word_frequency.rda
data/obama.rda
data/datalist
data/pos_frequency.rda
R
R/backend_tokenizers.R
R/accessors.R
R/onLoad.R
R/pkg.R
R/backend_coreNLP.R
R/data.R
R/backend_spaCy.R
R/tools.R
R/download.R
R/annotate.R
vignettes
vignettes/schema.Rmd
vignettes/figs
vignettes/figs/pca_plot.png
vignettes/figs/pca_plot.pdf
vignettes/figs/glmnet_plot.pdf
vignettes/figs/num_tokens.pdf
vignettes/figs/num_tokens.png
vignettes/figs/glmnet_plot.png
vignettes/figs/tm_sotu.png
vignettes/figs/tm_sotu.pdf
vignettes/introduction.Rmd
vignettes/case_study.Rmd
README.md
MD5
java
java/src
java/src/main
java/src/main/scripts
java/src/main/scripts/store.sh
java/src/main/java
java/src/main/java/edu
java/src/main/java/edu/richmond
java/src/main/java/edu/richmond/nlp
java/src/main/java/edu/richmond/nlp/AnnotationProcessor.java
java/src/main/java/edu/richmond/nlp/ConsoleOutputCapturer.java
java/src/main/java/edu/richmond/nlp/Writer
java/src/main/java/edu/richmond/nlp/Writer/CSVDependencyDocumentWriter.java
java/src/main/java/edu/richmond/nlp/Writer/CSVDocumentDocumentWriter.java
java/src/main/java/edu/richmond/nlp/Writer/CSVCoreferenceDocumentWriter.java
java/src/main/java/edu/richmond/nlp/Writer/CSVNamedEntityDocumentWriter.java
java/src/main/java/edu/richmond/nlp/Writer/CSVSentenceDocumentWriter.java
java/src/main/java/edu/richmond/nlp/Writer/CSVTokenDocumentWriter.java
java/src/main/java/edu/richmond/nlp/Outputter
java/src/main/java/edu/richmond/nlp/Outputter/CSVCoreferenceOutputter.java
java/src/main/java/edu/richmond/nlp/Outputter/CSVTokenOutputter.java
java/src/main/java/edu/richmond/nlp/Outputter/CSVSentenceOutputter.java
java/src/main/java/edu/richmond/nlp/Outputter/CSVDocumentOutputter.java
java/src/main/java/edu/richmond/nlp/Outputter/CSVNamedEntityOutputter.java
java/src/main/java/edu/richmond/nlp/Outputter/CSVDependencyOutputter.java
java/pom.xml
build
build/vignette.rds
DESCRIPTION
man
man/download_core_nlp.Rd
man/print.annotation.Rd
man/annotate.Rd
man/pos_frequency.Rd
man/read_annotation.Rd
man/tidy_pca.Rd
man/get_token.Rd
man/write_annotation.Rd
man/combine_documents.Rd
man/obama.Rd
man/get_sentence.Rd
man/init_coreNLP.Rd
man/init_tokenizers.Rd
man/extract_documents.Rd
man/get_document.Rd
man/word_frequency.Rd
man/from_CoNLLU.Rd
man/cleanNLP-package.Rd
man/get_dependency.Rd
man/get_entity.Rd
man/dep_frequency.Rd
man/get_coreference.Rd
man/get_tfidf.Rd
man/get_vector.Rd
man/doc_id_reset.Rd
man/init_spaCy.Rd
cleanNLP documentation built on May 19, 2017, 12:11 p.m.
