stylest_select_vocab: Select vocabulary using cross-validated out-of-sample...
In stylest: Estimating Speaker Style Distinctiveness


Selects the optimal vocabulary quantile(s) for model fitting, based on cross-validated performance at predicting the speakers of out-of-sample texts.

stylest_select_vocab(
  x,
  speaker,
  filter = NULL,
  smooth = 0.5,
  nfold = 5,
  cutoff_pcts = c(50, 60, 70, 80, 90, 99),
  cutoffs_term_weights = NULL,
  fill_method = "value",
  fill_weight = 1,
  weight_varname = "mean_distance"
)

`x`	Corpus as text vector. May be a `corpus_frame` object
`speaker`	Vector of speaker labels. Should be the same length as `x`
`filter`	if not `NULL`, a `corpus` text_filter
`smooth`	value for smoothing. Defaults to 0.5
`nfold`	Number of folds for cross-validation. Defaults to 5
`cutoff_pcts`	Vector of cutoff percentages to test. Defaults to `c(50, 60, 70, 80, 90, 99)`
`cutoffs_term_weights`	Named list of dataframes of term weights, where the names correspond to the `cutoff_pcts`. Each dataframe should have one column $word and a second column $weight_varname containing the weight for the word. See the vignette for details.
`fill_method`	if `"value"` (default), `fill_weight` is used to fill any terms with `NA` weight. If `"mean"`, the mean term weight is used as the fill value
`fill_weight`	numeric value to fill in as weight for any term which does not have a weight specified in `term_weights`, default=`1.0`
`weight_varname`	Name of the column in each term_weights dataframe containing the weights, default=`"mean_distance"`
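
The shape of `cutoffs_term_weights` and the two `fill_method` rules can be sketched in base R. This is only an illustration: the words and weights are invented, `fill_na` is a hypothetical helper rather than a stylest function, and writing the list names as the cutoff percentages in string form is an assumption (see the vignette for the exact convention).

```r
# Hypothetical term weights for two cutoffs; words and values are invented.
weights_50 <- data.frame(
  word = c("the", "whale", "estate"),
  mean_distance = c(0.5, NA, 2.5),
  stringsAsFactors = FALSE
)
weights_90 <- data.frame(
  word = c("whale", "estate"),
  mean_distance = c(0.9, 2.5),
  stringsAsFactors = FALSE
)

# Assumption: list names are the cutoff percentages written as strings.
cutoffs_term_weights <- list("50" = weights_50, "90" = weights_90)

# Hypothetical helper mirroring the documented fill rules (not package code).
fill_na <- function(w, fill_method = "value", fill_weight = 1) {
  fill <- if (fill_method == "mean") mean(w, na.rm = TRUE) else fill_weight
  w[is.na(w)] <- fill
  w
}

filled_value <- fill_na(weights_50$mean_distance)                        # NA becomes 1
filled_mean  <- fill_na(weights_50$mean_distance, fill_method = "mean")  # NA becomes 1.5

# With stylest installed, the list would be passed along these lines:
# stylest_select_vocab(x, speaker, cutoff_pcts = c(50, 90),
#                      cutoffs_term_weights = cutoffs_term_weights)
```

Each data frame uses the default `weight_varname`, `"mean_distance"`, as its weight column, so no extra argument would be needed in the call.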

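A list of class `stylest_select_vocab` with four components: `cutoff_pct_best`, the cutoff percentage with the best speaker classification rate; `cutoff_pcts`, the cutoff percentages that were tested; `miss_pct`, a matrix of the mean percentage of incorrectly identified speakers, with one row per fold and one column per cutoff percentage; and `nfold`, the number of folds used for cross-validation.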

## Not run: 
library(stylest)
data(novels_excerpts)
# Compare two candidate cutoffs using the default 5-fold cross-validation
stylest_select_vocab(novels_excerpts$text, novels_excerpts$author, cutoff_pcts = c(50, 90))

## End(Not run)

$cutoff_pct_best
[1] 50

$cutoff_pcts
[1] 50 90

$miss_pct
         [,1]     [,2]
[1,] 60.00000 40.00000
[2,] 40.00000 80.00000
[3,] 75.00000 50.00000
[4,]  0.00000 50.00000
[5,] 66.66667 33.33333

$nfold
[1] 5

attr(,"class")
[1] "stylest_select_vocab"
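
To read the printed result, it can help to average `miss_pct` over folds. The sketch below rebuilds the matrix from the output above in base R, so it runs without the package. That `cutoff_pct_best` is the cutoff with the lowest mean miss rate is an assumption; it is consistent with this output but is not stated on this page.

```r
# miss_pct copied from the printed result: rows are folds, columns are cutoffs.
miss_pct <- matrix(
  c(60, 40, 75, 0, 200 / 3,    # cutoff 50
    40, 80, 50, 50, 100 / 3),  # cutoff 90
  nrow = 5
)
cutoff_pcts <- c(50, 90)

# Mean percentage of misclassified speakers per cutoff, averaged over folds.
mean_miss <- colMeans(miss_pct)
names(mean_miss) <- cutoff_pcts
round(mean_miss, 2)  # 48.33 for cutoff 50, 50.67 for cutoff 90

# Assumption: the best cutoff minimises the mean miss rate across folds.
best <- cutoff_pcts[which.min(mean_miss)]
best  # 50
```

The two means are close and the per-fold values vary widely (0 to 75 for cutoff 50), so with a small corpus the selected cutoff may change from run to run as the folds change.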