README.md

The textfeats package

This is an R package for computing text features for single words or longer documents.

Example Usage

Installing the package

install.packages("devtools")
library(devtools)

install_github("cookm346/textfeats")
library(textfeats)

Using the package

Here is some basic text we can use to demonstrate the package. The elements of this array can be single words or longer text including full documents (e.g., books).

x <- c("i am the WALRUS", "Jerk!!", "very very good")

Computing frequency based features

The package will compute fequency based features for each string, including:

count(x)

##                 n_words n_unique_words n_chars n_unique_chars n_periods n_commas n_question n_exclamation
## i am the WALRUS       4              4      15             13         0        0          0             0
## Jerk!!                1              1       6              5         0        0          0             2
## very very good        3              2      14              8         0        0          0             0

Computing sentiment based features

The package will also produce a set of sentiment based features. The valence, arousal, and dominance of individual words or longer documents can be summarized by utilizing the norms from Warriner et al (2013).

warriner(x)

##                 valence arousal dominance
## i am the WALRUS    5.79    3.95      5.23
## Jerk!!             2.43    6.45      4.84
## very very good     7.89    3.66      6.41

For strings with more than one word, the mean valence, arousal, and dominance is computed.

Computing parts of speech based features

The package will produce a set of features based on the parts of speech of the text (e.g., noun, verb, adjective). These parts of speech classes are extracted from the Moby parts of Speech database. Fifteen different classes exist in the database. The outout below is trucated.

pos(x)

##                 Noun Plural Noun_Phrase Verb_participle Verb_transitive   ...
## i am the WALRUS    2      0           0               1               0   ...
## Jerk!!             1      0           0               1               1   ...
## very very good     1      0           0               0               0   ...

For strings with more than one word, the sum for each parts of speech class is computed.

Computing concretness norms

The package will fetch concreteness norms for Brysbaert et al. (2014). For strings with more than one word, the mean concreteness for each word will be returned.

concreteness(x)

##                     Concreteness
## i am the WALRUS     2.796667
## Jerk!!              3.260000
## very very good      1.500000

Computing word embeddings

Finally, the package will generate a set of semantic vectors (i.e., word embeddings). Any set of word embeddings can be used. For strings with more than one word, the function will compute the mean of the word vectors in the string (i.e., the semantic gist of the document). The outout below is trucated.


# load(url("http://www.lingexp.uni-tuebingen.de/z2/LSAspaces/TASA.rda"))

semantics(x, TASA)

##                                [,1]        [,2]          [,3]          [,4]         [,5]          [,6]   ...
## i am the WALRUS    [1,] 0.049237273 0.064432684 -0.0106181763  0.0215646167 -0.000233423  0.0045926339   ...
## Jerk!!             [2,] 0.001102812 0.001143806 -0.0008592891 -0.0005428566  0.001024854 -0.0007934065   ...
## very very good     [3,] 0.049006431 0.009857605 -0.0135557641 -0.0073958977 -0.022383744  0.0004875048   ...



cookm346/textfeats documentation built on April 24, 2020, 9:50 p.m.