top_feature_matrix: Create a top features per entity matrix from a numeric...
In manuelbickel/textility: Utility functions for text mining

top_feature_matrix

R Documentation

Create a top features per entity matrix from a numeric feature per entity distribution

Description

The function orders the elements of a matrix per row and returns the corresponding features, i.e., column items, that fall within the top n rank range for each row. For a document-entity term-feature matrix, it achieves the same as, e.g., topicmodels::terms or text2vec::get_top_words, but takes a raw matrix as input instead of a native model object of a specific package. Hence, apart from getting the top n term-features per topic from a "raw" word-probabilities-per-topic object of an LDA model (words as columns, topics as rows, entries are word probabilities), it may also help to find the top n word-features per document given a document term matrix, or the top topic-features per document given a document topic matrix.
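The core idea, ordering each row and reading off the leading column names, can be sketched in base R. This is a minimal illustration under stated assumptions, not the package's implementation: `top_n_per_row` is a hypothetical helper, it requires column names, and it ignores tie handling (ties are simply broken by column order).

```r
# Hypothetical helper, not part of textility: top n column names per row.
# Ties are NOT treated specially; they are broken by column order.
top_n_per_row <- function(m, n = 2) {
  stopifnot(is.matrix(m), is.numeric(m), !is.null(colnames(m)))
  n <- min(n, ncol(m))
  # apply() over rows yields one column per row, so entities end up as columns
  out <- apply(m, 1, function(x) {
    colnames(m)[order(x, decreasing = TRUE)[seq_len(n)]]
  })
  # apply() collapses to a vector when n == 1; restore the matrix shape
  matrix(out, nrow = n, dimnames = list(NULL, rownames(m)))
}

# Word probabilities per topic: topics as rows, words as columns
beta <- rbind(T1 = c(0.3, 0.3, 0.3, 0.1),
              T2 = c(0.19, 0.3, 0.5, 0.01),
              T3 = c(0.3, 0.5, 0.19, 0.01))
colnames(beta) <- c("A", "B", "C", "D")

res <- top_n_per_row(beta, n = 2)
print(res)
```

For this input the sketch yields "A", "B" for T1, "C", "B" for T2 and "B", "A" for T3, with topics as columns.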

Usage

top_feature_matrix(entity_feature_matrix, n = 10, terms = NULL,
  include_all_ties = TRUE)

Arguments

`entity_feature_matrix`	A numeric matrix object. Each row represents an entity, e.g., document, each column a feature, e.g., term.
`n`	The highest rank to consider when selecting features per row. By default 10.
`terms`	Optional character vector of feature names, one per column of `entity_feature_matrix`. By default `NULL`. If no terms are specified and the matrix has no column names, column indices are used as feature names and a warning is issued.
`include_all_ties`	By default `TRUE`. The output includes all ties for each rank, so the number of features returned per entity may be higher than `n`. If set to `FALSE`, the number of rows of the output is limited to `n`. The highest rank that appears in the output may then differ between entities.
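The effect of `include_all_ties` can be illustrated on a single row. The sketch below assumes dense ranking (tied values share a rank and no rank numbers are skipped), which the documentation does not state explicitly; all variable names are illustrative only.

```r
# One entity (row) with a three-way tie for the highest value
x <- c(A = 0.3, B = 0.3, C = 0.3, D = 0.1)
n <- 2

# Dense ranks (assumption): position of each value among the sorted unique values
dense_rank <- match(x, sort(unique(x), decreasing = TRUE))
names(dense_rank) <- names(x)
print(dense_rank)
# A B C D
# 1 1 1 2

# Features ordered from best to worst rank
ord <- order(dense_rank)
ranked <- names(x)[ord]

# Analogous to include_all_ties = TRUE: keep every feature within the top n ranks
with_ties <- ranked[dense_rank[ord] <= n]

# Analogous to include_all_ties = FALSE: cut off after n features
without_ties <- head(ranked, n)
```

Here `with_ties` holds all four features, because ranks 1 and 2 together cover every column, whereas `without_ties` holds only "A" and "B".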

Value

A character matrix with the top n features per row, hence a top-feature-entity matrix. For better readability, the output is "transposed" so that the entities appear as columns and the features as rows. If include_all_ties = TRUE, the trailing elements of a column are set to NA when another entity has more top features (due to ties) than that entity. The number of top features per entity therefore depends on how many features share the top n ranks.
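The NA padding described above can be reproduced with a short base R sketch. It assumes the per-entity top features are already available as a list of character vectors of unequal length; the list `top` is made-up illustration data, not output of the package.

```r
# Per-entity top features of unequal length (illustrative data)
top <- list(T1 = c("1", "2", "3", "4"),
            T2 = c("3", "2"),
            T3 = c("2", "1"))

# Indexing beyond the end of a vector returns NA, which pads shorter columns
max_len <- max(lengths(top))
padded <- sapply(top, `[`, seq_len(max_len))
print(padded)
```

The result is a character matrix with one column per entity, in which the columns for T2 and T3 end in NA.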

Examples


# example for a word-topic distribution as output by an LDA model
beta <- rbind(T1 = c(0.3,0.3,0.3, 0.1), T2 = c(0.19,0.3,0.5, 0.01),  T3 = c(0.3,0.5,0.19, 0.01))
top_feature_matrix(entity_feature_matrix = beta, n = 2, terms = c("A", "B", "C", "D"), include_all_ties = FALSE)
#      T1  T2  T3
# [1,] "A" "C" "B"
# [2,] "B" "B" "A"
# case where no terms are specified and all ties are to be considered
top_feature_matrix(entity_feature_matrix = beta, n = 2, include_all_ties = TRUE)
#      T1  T2  T3
# [1,] "1" "3" "2"
# [2,] "2" "2" "1"
# [3,] "3" NA  NA
# [4,] "4" NA  NA
# Warning message:
#   In top_feature_matrix(entity_feature_matrix = beta, n = 2, include_all_ties = TRUE) :
#   Input entity_feature_matrix has no colnames and no colnames to be used have been specified. Column indices were used as feature names.

manuelbickel/textility documentation built on Nov. 25, 2022, 9:07 p.m.