feature_selection: A function that implements a number of feature selection...
In matthewjdenny/SpeedReader: High Performance Text Analysis

A function that implements a number of feature selection methods for finding top words which distinguish between two classes.

feature_selection(contingency_table, rows_to_compare = NULL, alpha = 1,
  method = c("informed Dirichlet", "TF-IDF", "TF-IDF-log(tf)",
  "TF-IDF-augmented(tf)"), maximum_top_words = 5000,
  document_term_matrix = NULL, subsume_ngrams = FALSE,
  ngram_subsumption_correlation_threshold = 0.9, rank_by_log_odds = FALSE)

`contingency_table`	A contingency table generated by the 'contingency_table()' function.
`rows_to_compare`	A numeric vector containing the indices of the rows in the contingency table to compare against each other. Defaults to NULL, in which case all rows are compared against each other.
`alpha`	The Dirichlet hyperparameter used when method = "informed Dirichlet". The suggested value is the average number of terms that appear in a document. A small value allows more (globally) common terms to be selected as top words; increasing the value favors less globally common words. Defaults to 1, which is rarely a good choice.
`method`	Defaults to "informed Dirichlet", which implements the model described in section 3.5.1 of Monroe et al. "Fightin Words...". Can also be "TF-IDF", in which case canonical TF-IDF ranking is used. The user may also select "TF-IDF-log(tf)", in which case the TF term is logged following Manning and Schütze (1999, p.544), or "TF-IDF-augmented(tf)", in which case the TF term is augmented, also following Manning and Schütze (1999, p.544).
`maximum_top_words`	Controls the maximum number of top words returned in each category. Defaults to 5000.
`document_term_matrix`	The document term matrix used to construct the contingency_table. Necessary if the user selects method = "TF-IDF". Defaults to NULL.
`subsume_ngrams`	Optional argument allowing the user to combine highly correlated n-grams in the resulting output. Only useful if terms in the document term matrix can overlap. Defaults to FALSE.
`ngram_subsumption_correlation_threshold`	Defaults to 0.9. Can be set higher or lower depending on the correlation threshold at which the user would like to subsume n-grams.
`rank_by_log_odds`	Only applicable for the "informed Dirichlet" method. Defaults to FALSE. If TRUE, then terms are ranked by log odds instead of z-score.
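
The three TF-IDF options for `method` differ only in how the term frequency is weighted. The base R sketch below illustrates those weightings on a toy matrix. It is not SpeedReader's implementation: the matrix `dtm`, the variable names, the use of log(N / df) as the IDF term, and the choice to give absent terms a weight of 0 under the augmented scheme are all assumptions made for illustration.

```r
# Toy document-term matrix (hypothetical data): rows are documents, columns are terms.
dtm <- matrix(c(3, 0, 1,
                1, 2, 0,
                2, 4, 0),
              nrow = 3, byrow = TRUE,
              dimnames = list(paste0("doc", 1:3), c("tax", "health", "vote")))

# Inverse document frequency: log(N / df), where df is the number of
# documents that contain the term. A term found in every document gets 0.
idf <- log(nrow(dtm) / colSums(dtm > 0))

# Raw term frequency (canonical TF-IDF).
tf_raw <- dtm

# Logged term frequency: 1 + log(tf) for terms that occur, 0 otherwise.
tf_log <- ifelse(dtm > 0, 1 + log(dtm), 0)

# Augmented term frequency: 0.5 + 0.5 * tf / (largest tf in the same document).
# Assumption: terms absent from a document are given weight 0.
row_max <- apply(dtm, 1, max)
tf_aug <- ifelse(dtm > 0, 0.5 + 0.5 * dtm / row_max, 0)

# Multiply every column by its IDF weight.
tfidf_raw <- sweep(tf_raw, 2, idf, `*`)
tfidf_log <- sweep(tf_log, 2, idf, `*`)
tfidf_aug <- sweep(tf_aug, 2, idf, `*`)
```

In this toy example "tax" occurs in every document, so its IDF is zero and it drops out under all three weightings, while "vote", which occurs in a single document, receives the largest IDF.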

A list object containing two data frames (one for each comparison category) with ranked top words. Every word included in each data frame has a z-score greater than 1.96 in absolute value.
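
To make the z-score cutoff concrete, here is a minimal base R sketch of the log-odds ratio with a Dirichlet prior, in the spirit of Monroe et al. It is an illustration under stated assumptions, not the package's code: the helper `log_odds_z()`, the toy `counts` table, and in particular the way `alpha` is spread across terms (in proportion to overall term frequency) are hypothetical and may differ from how feature_selection() uses `alpha`.

```r
# Toy contingency table (hypothetical data): one row per category, one column per term.
counts <- matrix(c(30,  5, 10,
                    4, 28, 12),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("group_a", "group_b"),
                                 c("tax", "health", "vote")))

# Hypothetical helper: compare row i against row j of a contingency table.
log_odds_z <- function(counts, i, j, alpha = 1) {
  # Assumption: prior pseudo-counts are proportional to overall term
  # frequency and sum to alpha.
  prior <- alpha * colSums(counts) / sum(counts)
  a0 <- sum(prior)

  yi <- counts[i, ]
  yj <- counts[j, ]
  ni <- sum(yi)
  nj <- sum(yj)

  # Difference in smoothed log odds of each term between the two categories.
  delta <- log((yi + prior) / (ni + a0 - yi - prior)) -
    log((yj + prior) / (nj + a0 - yj - prior))

  # Approximate variance of that difference, then the z-score.
  variance <- 1 / (yi + prior) + 1 / (yj + prior)

  data.frame(term = names(yi),
             log_odds = unname(delta),
             z_score = unname(delta / sqrt(variance)),
             stringsAsFactors = FALSE)
}

scores <- log_odds_z(counts, i = 1, j = 2, alpha = 3)

# Keep only terms whose z-score exceeds 1.96 in absolute value,
# ranked from most distinctive of group_a to most distinctive of group_b.
top_words <- scores[abs(scores$z_score) > 1.96, ]
top_words <- top_words[order(-top_words$z_score), ]
```

Here "tax" ends up with a large positive z-score (distinctive of group_a) and "health" with a large negative one (distinctive of group_b), while "vote" is used about equally by both groups and falls below the cutoff. Sorting on `log_odds` instead of `z_score` would mimic the effect of rank_by_log_odds = TRUE.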

matthewjdenny/SpeedReader documentation built on March 25, 2020, 5:32 p.m.