stylest_select_vocab: Select vocabulary using cross-validated out-of-sample...


View source: R/stylest_select_vocab.R

Description

Selects the optimal vocabulary quantile(s) for model fitting, based on performance in predicting the speakers of out-of-sample texts.

Usage

stylest_select_vocab(
  x,
  speaker,
  filter = NULL,
  smooth = 0.5,
  nfold = 5,
  cutoff_pcts = c(50, 60, 70, 80, 90, 99),
  cutoffs_term_weights = NULL,
  fill_method = "value",
  fill_weight = 1,
  weight_varname = "mean_distance"
)

Arguments

x

Corpus as text vector. May be a corpus_frame object

speaker

Vector of speaker labels. Should be the same length as x

filter

if not NULL, a corpus text_filter

smooth

value for smoothing. Defaults to 0.5

nfold

Number of folds for cross-validation. Defaults to 5

cutoff_pcts

Vector of cutoff percentages to test. Defaults to c(50, 60, 70, 80, 90, 99)

cutoffs_term_weights

Named list of dataframes of term weights, where the names correspond to the cutoff_pcts. Each dataframe should have one column, word, and a second column, named by weight_varname, containing the weight for each word. See the vignette for details.

fill_method

If "value" (default), fill_weight is used to fill any terms with NA weight. If "mean", the mean term weight is used as the fill value

fill_weight

numeric value to fill in as weight for any term which does not have a weight specified in term_weights, default=1.0

weight_varname

Name of the column in each term_weights dataframe containing the weights, default="mean_distance"
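
The shape expected for cutoffs_term_weights can be sketched in base R. This is only an illustration under stated assumptions: the words and weights are invented, the list names are assumed to be the cutoff percentages written as strings, and the fill_na helper is hypothetical. It mirrors the fill_method and fill_weight rules described above and is not the package's internal code.

```r
# Hypothetical term weights; the words and numbers are invented for illustration.
weight_varname <- "mean_distance"

weights_50 <- data.frame(
  word = c("the", "and", "whale"),
  mean_distance = c(0.2, NA, 1.4),
  stringsAsFactors = FALSE
)
weights_90 <- data.frame(
  word = c("the", "whale"),
  mean_distance = c(0.3, 1.1),
  stringsAsFactors = FALSE
)

# Assumption: list names are the cutoff percentages written as strings.
cutoffs_term_weights <- list("50" = weights_50, "90" = weights_90)

# Hypothetical helper illustrating the two fill rules (not the package's code).
fill_na <- function(w, fill_method = "value", fill_weight = 1) {
  fill <- if (fill_method == "mean") mean(w, na.rm = TRUE) else fill_weight
  w[is.na(w)] <- fill
  w
}

# "value" replaces NA with fill_weight; "mean" replaces NA with the mean weight.
filled_value <- fill_na(weights_50[[weight_varname]], "value", 1)
filled_mean <- fill_na(weights_50[[weight_varname]], "mean")
```

With these invented weights, the NA for "and" becomes 1 under "value" and 0.8 (the mean of 0.2 and 1.4) under "mean".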

Value

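A list of class stylest_select_vocab containing: cutoff_pct_best, the cutoff percentage with the best speaker classification rate; cutoff_pcts, the cutoff percentages that were tested; miss_pct, a matrix of the mean percentage of incorrectly identified speakers, with one row per fold and one column per cutoff percentage; and nfold, the number of folds used for cross-validation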

Examples

## Not run: 
data(novels_excerpts)
stylest_select_vocab(novels_excerpts$text, novels_excerpts$author, cutoff_pcts = c(50, 90))

## End(Not run)
  

Example output

$cutoff_pct_best
[1] 50

$cutoff_pcts
[1] 50 90

$miss_pct
         [,1]     [,2]
[1,] 60.00000 40.00000
[2,] 40.00000 80.00000
[3,] 75.00000 50.00000
[4,]  0.00000 50.00000
[5,] 66.66667 33.33333

$nfold
[1] 5

attr(,"class")
[1] "stylest_select_vocab"
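
One way to read this output is to average the miss percentages over folds for each cutoff. The sketch below rebuilds the miss_pct matrix from the values printed above and picks the cutoff with the lowest mean. That selection rule is an assumption about how cutoff_pct_best is chosen. It agrees with this particular output but is not confirmed by the documentation.

```r
# Values copied from the example output; rows are folds, columns are cutoffs.
cutoff_pcts <- c(50, 90)
miss_pct <- matrix(
  c(60, 40, 75, 0, 200 / 3,    # column 1: cutoff 50
    40, 80, 50, 50, 100 / 3),  # column 2: cutoff 90
  ncol = 2
)

# Mean percentage of misclassified speakers per cutoff, averaged over folds.
mean_miss <- colMeans(miss_pct)

# Assumed rule: the best cutoff is the one with the lowest mean miss percentage.
best <- cutoff_pcts[which.min(mean_miss)]
print(round(mean_miss, 2))
print(best)
```

Here the means are about 48.33 for the 50 cutoff and 50.67 for the 90 cutoff, so the sketch selects 50, matching $cutoff_pct_best above. Because fold assignment is typically random, rerunning the original example may give different numbers.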

stylest documentation built on March 5, 2021, 1:05 a.m.