tCorpus-cash-feature_subset: Filter features
In corpustools: Managing, Querying and Analyzing Tokenized Text

tCorpus$feature_subset

R Documentation

Filter features

Description

Similar to tCorpus$subset, but instead of deleting rows it sets the values of a specified feature to NA for the rows that do not match the filter. This is convenient because an analysis (e.g. a topic model) can use only a selection of features while the context of the full article is maintained, so that results can be viewed in this context (e.g. in a topic browser).

Just as in subset, it is easy to use objects and functions in the filter, including the special functions for using term frequency statistics (see documentation for tCorpus$subset).
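
To make the difference from row deletion concrete, here is a minimal base R sketch that mimics the behaviour on a plain data.frame. It only illustrates the semantics described above and is not the package's implementation; the `tokens` data.frame and the `token_subset` column are hypothetical names made up for this example.

```r
# Hypothetical token table standing in for tc$tokens (not created by corpustools)
tokens <- data.frame(
  token_id = 1:9,
  token = c("a", "a", "a", "a", "b", "b", "b", "c", "c"),
  stringsAsFactors = FALSE
)

# Logical filter: TRUE for rows whose feature value should be kept
keep <- tokens$token_id < 5

# subset-style filtering deletes the rows that do not match
deleted <- tokens[keep, ]

# feature_subset-style filtering keeps every row and masks the feature with NA
tokens$token_subset <- ifelse(keep, tokens$token, NA_character_)

nrow(deleted)  # 4 rows remain
nrow(tokens)   # still 9 rows; rows 5 to 9 have NA in token_subset
```

Because all rows are retained, the original token column remains available for displaying the full text next to the filtered feature.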

Usage

## R6 method for class tCorpus. Use as tc$method (where tc is a tCorpus object).

feature_subset(column, new_column, subset)

Arguments

`column`	the column containing the feature to be used as the input
`subset`	logical expression indicating the rows for which the feature value is kept. Rows for which the expression is FALSE are set to NA in the feature column; no rows are deleted from the tokens data.
`new_column`	the column to save the filtered feature. Can be a new column or overwrite an existing one.
`min_freq`	an integer, specifying minimum token frequency.
`min_docfreq`	an integer, specifying minimum document frequency.
`max_freq`	an integer, specifying maximum token frequency.
`max_docfreq`	an integer, specifying maximum document frequency.
`min_char`	an integer, specifying minimum characters in a token
`max_char`	an integer, specifying maximum characters in a token
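
As an illustration of what the frequency thresholds refer to, the following base R sketch computes token frequency and document frequency by hand and uses them to mask a feature. The data, the `token_filtered` column and the threshold values are hypothetical, and treating the thresholds as inclusive is an assumption; this is a sketch of the idea, not how corpustools computes these statistics.

```r
# Hypothetical two-document token table (not produced by corpustools)
tokens <- data.frame(
  doc_id = c(1, 1, 1, 2, 2, 2),
  token = c("a", "a", "b", "a", "c", "c"),
  stringsAsFactors = FALSE
)

# Token frequency: how often each token occurs in the whole corpus
freq <- table(tokens$token)

# Document frequency: in how many distinct documents each token occurs
docfreq <- table(unique(tokens[, c("doc_id", "token")])$token)

# Assumed inclusive thresholds, in the spirit of min_freq and min_docfreq
keep <- as.vector(freq[tokens$token] >= 2 & docfreq[tokens$token] >= 2)

# Mask tokens that fail the filter instead of dropping their rows
tokens$token_filtered <- ifelse(keep, tokens$token, NA_character_)
```

In this toy corpus only "a" passes both thresholds: "c" occurs twice but in a single document, and "b" occurs only once.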

Examples

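library(corpustools)

tc = create_tcorpus('a a a a b b b c c')

## keep the token only for rows with token_id below 5; other rows become NA
tc$feature_subset('token', 'tokens_subset1', subset = token_id < 5)
## keep the token only if it has a minimum frequency of 3; other rows become NA
tc$feature_subset('token', 'tokens_subset2', subset = freq_filter(token, min = 3))

## inspect the tokens data: all rows are still present
tc$tokens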

corpustools documentation built on May 31, 2023, 8:45 p.m.