term_union: Combine terms in a dtm

View source: R/feature_preparation.r

term_union    R Documentation

Combine terms in a dtm

Description

Given a dtm and a similarity (adjacency) matrix, group clusters of similar terms (simmat > 0) into a single column. The column names of the grouped terms are concatenated with a "|" separator (read as OR).
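To make the grouping concrete, here is a minimal base R sketch of the idea. It is not the RNewsflow implementation: it uses small dense matrices instead of a CsparseMatrix, treats a cluster as a connected component of the similarity matrix, and assumes that the counts of grouped terms are summed. The toy values and the order of names within a combined label are also assumptions made for illustration.

```r
# Toy document-term matrix: 3 documents, 4 terms (made-up counts).
terms <- c("gadaffi", "kadaffi", "guy", "gadaffel")
dtm <- matrix(c(1, 0, 1, 0,
                0, 1, 0, 0,
                0, 0, 0, 1),
              nrow = 3, byrow = TRUE, dimnames = list(NULL, terms))

# Toy similarity (adjacency) matrix: values > 0 mark similar terms.
simmat <- matrix(0, 4, 4, dimnames = list(terms, terms))
simmat["gadaffi", "kadaffi"] <- 1
simmat["gadaffi", "gadaffel"] <- 1

# Treat similarity as an undirected graph; every term is linked to itself.
adj <- (simmat + t(simmat)) > 0
diag(adj) <- TRUE

# Label propagation: each term repeatedly takes the smallest label among
# itself and its neighbours until nothing changes. Terms that end up with
# the same label form one cluster (connected component).
cluster <- seq_along(terms)
repeat {
  new_cluster <- unname(apply(adj, 1, function(nb) min(cluster[nb])))
  if (all(new_cluster == cluster)) break
  cluster <- new_cluster
}

# Sum the columns of each cluster (assumption: counts are added) and
# paste the term names together with "|".
groups <- split(seq_along(terms), cluster)
combined <- vapply(groups, function(idx) rowSums(dtm[, idx, drop = FALSE]),
                   numeric(nrow(dtm)))
colnames(combined) <- vapply(groups,
                             function(idx) paste(terms[idx], collapse = "|"),
                             character(1), USE.NAMES = FALSE)
print(combined)
```

In this sketch the three spelling variants end up in one column, while "guy", which has no similar terms, keeps its own column.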

Usage

term_union(dtm, simmat, as_dfm = T, verbose = F, sep = "|", par = NA)

Arguments

dtm

A quanteda dfm or a CsparseMatrix.

simmat

A similarity matrix in CsparseMatrix format, for instance one created with term_char_sim.

as_dfm

If TRUE, return the result as a quanteda dfm.

verbose

If TRUE, report progress.

sep

The separator used for pasting the terms together.

par

If TRUE, add parentheses to colnames before combining. This is mainly for internal use, as it keeps the order of operations explicit when OR (term_union) and AND (term_intersect) operations are combined. If NA, this is based on whether parentheses are present.
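The role of par is easiest to see on the labels themselves. The sketch below uses a hypothetical helper, combine_labels, that is not part of RNewsflow. The "&" separator for AND and the handling of par = NA (wrap only when some label already contains a parenthesis) are assumptions based on one reading of the description above, not confirmed package behavior.

```r
# Hypothetical helper (not an RNewsflow function): paste labels together,
# optionally wrapping each label in parentheses first.
combine_labels <- function(labels, sep = "|", par = NA) {
  if (is.na(par)) {
    # Assumed reading of par = NA: wrap only if parentheses already occur.
    par <- any(grepl("(", labels, fixed = TRUE))
  }
  if (par) labels <- paste0("(", labels, ")")
  paste(labels, collapse = sep)
}

# Without parentheses, mixing OR and AND gives an ambiguous label.
ambiguous <- combine_labels(c("a|b", "c"), sep = "&", par = FALSE)
print(ambiguous)

# With parentheses, the grouping of the earlier OR step stays visible.
explicit <- combine_labels(c("a|b", "c"), sep = "&", par = TRUE)
print(explicit)
```

The first label could be read as either "a OR (b AND c)" or "(a OR b) AND c"; the second leaves only one reading.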

Value

A CsparseMatrix or quanteda dfm

Examples

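library(RNewsflow)
# recent quanteda versions build a dfm from tokens rather than raw text
toks = quanteda::tokens(c('That guy Gadaffi','Do you mean Kadaffi?',
                          'Nah more like Gadaffel','Not Kadaffel?'))
dfm = quanteda::dfm(toks)
simmat = term_char_sim(colnames(dfm), same_start=0)
# merge similar terms into single "|"-separated columns
term_union(dfm, simmat, verbose = FALSE)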

RNewsflow documentation built on May 31, 2023, 6:53 p.m.