View source: R/feature_selection.R
Implements several feature selection methods for finding the top words that distinguish between two classes.
feature_selection(contingency_table, rows_to_compare = NULL, alpha = 1,
  method = c("informed Dirichlet", "TF-IDF", "TF-IDF-log(tf)",
    "TF-IDF-augmented(tf)"), maximum_top_words = 5000,
  document_term_matrix = NULL, subsume_ngrams = FALSE,
  ngram_subsumption_correlation_threshold = 0.9, rank_by_log_odds = FALSE)
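The three TF-IDF options in the signature differ only in how the term frequency is weighted. The following is a minimal base R sketch of those weightings, not the package's own code. It assumes the usual Manning and Schütze forms (raw tf, 1 + log(tf), and 0.5 + 0.5 * tf / max(tf)) combined with idf = log(N / df). The helper name `tf_idf`, its `tf_weighting` argument, the toy matrix, and the choice to give absent terms a weight of zero are all assumptions made for illustration.

```r
# Toy document-term matrix: rows are documents, columns are terms.
dtm <- matrix(
  c(2, 0, 1,
    0, 3, 1,
    4, 0, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("doc1", "doc2", "doc3"), c("tax", "health", "the"))
)

# Hypothetical helper (not part of the package): TF-IDF with three TF weightings.
tf_idf <- function(dtm, tf_weighting = c("raw", "log", "augmented")) {
  tf_weighting <- match.arg(tf_weighting)
  tf <- switch(tf_weighting,
    # Raw counts.
    raw = dtm,
    # Logged TF: 1 + log(tf) for terms that occur, 0 otherwise.
    log = ifelse(dtm > 0, 1 + log(dtm), 0),
    # Augmented TF: scaled by the largest count in the same document (row).
    augmented = ifelse(dtm > 0, 0.5 + 0.5 * dtm / apply(dtm, 1, max), 0)
  )
  # Inverse document frequency: log(number of documents / documents containing the term).
  idf <- log(nrow(dtm) / colSums(dtm > 0))
  # Multiply every column of tf by that term's idf.
  sweep(tf, 2, idf, `*`)
}

tf_idf(dtm, "raw")
tf_idf(dtm, "log")
tf_idf(dtm, "augmented")
```

In this toy matrix "the" occurs in every document, so its idf is log(3 / 3) = 0 and it receives zero weight under all three weightings.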
contingency_table
A contingency table generated by the 'contingency_table()' function.
rows_to_compare
A numeric vector containing the indices of the rows in the contingency table to compare against each other. Defaults to NULL, in which case all rows are compared against each other.
alpha
The Dirichlet hyperparameter used if method = "informed Dirichlet". The suggested value is the average number of terms that appear in a document. A small value allows more (globally) common terms to be selected as top words; increasing the value selects for less globally common words. Defaults to 1, which is usually not a good choice.
method
Defaults to "informed Dirichlet", which implements the model described in section 3.5.1 of Monroe et al. "Fightin' Words...". Can also be "TF-IDF", in which case canonical TF-IDF ranking is used. The user may also select "TF-IDF-log(tf)", in which case the TF term is logged following Manning and Schütze (1999, p. 544), or "TF-IDF-augmented(tf)", in which case the TF term is augmented, also following Manning and Schütze (1999, p. 544).
maximum_top_words
The maximum number of top words returned in each category. Defaults to 5000.
document_term_matrix
The document term matrix used to construct the contingency_table. Necessary if the user selects method = "TF-IDF". Defaults to NULL.
subsume_ngrams
Optional argument allowing the user to combine highly correlated n-grams in the resulting output. Only useful if terms in the document term matrix can overlap. Defaults to FALSE.
ngram_subsumption_correlation_threshold
The correlation threshold at which n-grams are subsumed. Defaults to 0.9; can be set higher or lower as needed.
rank_by_log_odds
Only applicable for the "informed Dirichlet" method. Defaults to FALSE. If TRUE, terms are ranked by log odds instead of z-score.
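To make the roles of alpha and rank_by_log_odds more concrete, here is a minimal base R sketch of the log-odds-ratio z-score with a Dirichlet prior, in the form given by Monroe et al. It assumes a symmetric prior in which every term receives the same pseudo-count alpha; the package may construct its prior differently. The helper name `log_odds_z` and the toy counts are hypothetical.

```r
# Hypothetical helper (not part of the package): log-odds ratio and z-score
# for each term, given term counts in two classes and a symmetric prior.
log_odds_z <- function(counts_a, counts_b, alpha = 1) {
  stopifnot(length(counts_a) == length(counts_b))
  alpha_w <- rep(alpha, length(counts_a))  # prior pseudo-count per term
  alpha_0 <- sum(alpha_w)                  # total prior mass
  n_a <- sum(counts_a)                     # total tokens in class A
  n_b <- sum(counts_b)                     # total tokens in class B

  # Smoothed log odds of each term within each class.
  log_odds_a <- log((counts_a + alpha_w) / (n_a + alpha_0 - counts_a - alpha_w))
  log_odds_b <- log((counts_b + alpha_w) / (n_b + alpha_0 - counts_b - alpha_w))

  # Difference in log odds and its approximate variance.
  delta <- log_odds_a - log_odds_b
  variance <- 1 / (counts_a + alpha_w) + 1 / (counts_b + alpha_w)

  data.frame(log_odds = delta, z_score = delta / sqrt(variance))
}

counts_a <- c(tax = 40, health = 5, the = 100)
counts_b <- c(tax = 5, health = 40, the = 100)
scores <- log_odds_z(counts_a, counts_b, alpha = 1)
scores$term <- names(counts_a)

# Rank by z-score, or by raw log odds as rank_by_log_odds = TRUE would.
scores[order(-scores$z_score), ]
scores[order(-scores$log_odds), ]
```

Terms used equally by both classes, such as "the" in this toy example, get a log odds difference of zero. Ranking by z-score discounts differences that rest on few observations, while ranking by log odds does not.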
Returns a list containing two data frames (one for each comparison category) of ranked top words. Every word included in each data frame has a z-score greater than 1.96 in magnitude.
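The shape of that return value can be mimicked in a few lines of base R. This is only a sketch of the filtering and ranking described above, assuming a data frame with hypothetical columns `term` and `z_score` and made-up scores; the actual column names and ordering used by the package may differ.

```r
# Made-up scores: positive z favors category 1, negative z favors category 2.
scores <- data.frame(
  term = c("tax", "health", "the", "budget", "care"),
  z_score = c(5.0, -4.2, 0.1, 2.3, -1.5),
  stringsAsFactors = FALSE
)
maximum_top_words <- 2

# Keep only terms whose z-score exceeds 1.96 in magnitude.
significant <- scores[abs(scores$z_score) > 1.96, ]

# Split by sign and rank each side from strongest to weakest.
first <- significant[significant$z_score > 0, ]
first <- first[order(-first$z_score), ]
second <- significant[significant$z_score < 0, ]
second <- second[order(second$z_score), ]

# One data frame per category, capped at maximum_top_words rows.
top_words <- list(
  category_1 = head(first, maximum_top_words),
  category_2 = head(second, maximum_top_words)
)
top_words
```

Here "the" and "care" are dropped because their scores fall inside the 1.96 cutoff.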