size.penalize: Testing Minimal Sample Size for Text Classification




Description

This function tests how well a given input text (or texts) can be classified in a supervised machine-learning setup (e.g. Delta, SVM or NSC) when its length is limited. The procedure, introduced by Eder (2017), involves several iterations in which increasingly long samples are drawn from the text in question and tested against a training set. For very short samples, the classification accuracy is understandably quite low, but it usually increases with sample size until it reaches a point of saturation. The function size.penalize is aimed at identifying such a saturation point.


Usage

size.penalize(training.frequencies = NULL, test.frequencies = NULL,
              training.corpus = NULL, test.corpus = NULL,
              mfw = c(100, 200, 500), features = NULL,
              path = NULL, corpus.dir = "corpus",
              sample.size.coverage = seq(100, 10000, 100),
              sample.with.replacement = FALSE,
              iterations = 100, classification.method = "delta",
              list.cutoff = 1000, ...)



Arguments

training.frequencies: using this optional argument, one can load a custom table containing frequencies/counts for several variables, e.g. most frequent words, across a number of text samples (for the training set). It can be either an R object (matrix or data frame) or the name of a file containing tab-delimited data. If you use an R object, make sure that the rows contain samples and the columns contain variables (words). If you use an external file, the variables should go vertically (i.e. in rows), because vertically oriented tables are far more flexible and easier to edit in, say, Excel or any text editor. To flip your table horizontally/vertically, use the generic function t().
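As a minimal sketch of the two orientations expected by training.frequencies, the snippet below builds a small table as an R object (samples in rows, words in columns), flips it with t(), and saves it as a tab-delimited file. The sample names, the words, the numbers and the temporary file are made up for illustration; they are not data shipped with the package.

```r
# Hypothetical relative frequencies: rows are samples, columns are words
freqs <- matrix(c(4.1, 2.3, 1.9,
                  3.8, 2.9, 1.2),
                nrow = 2, byrow = TRUE,
                dimnames = list(c("AuthorA_Text1", "AuthorB_Text1"),
                                c("the", "and", "of")))

# Orientation for an external file: words in rows, samples in columns
freqs.for.file <- t(freqs)

# Save as tab-delimited text (a temporary file, used here only for illustration)
out.file <- tempfile(fileext = ".txt")
write.table(freqs.for.file, file = out.file, sep = "\t", quote = FALSE)

# Read the file back and flip it again to the samples-in-rows orientation
freqs.back <- t(as.matrix(read.table(out.file, sep = "\t", header = TRUE)))
```

After the round trip, freqs.back has the same shape and labels as freqs, which is the orientation described for an R object.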


test.frequencies: using this optional argument, one can load a custom table containing frequencies/counts for the test set. For further details, see the description of training.frequencies above.


training.corpus: another option is to pass a pre-processed corpus as an argument (here: the training set). It is assumed that this object is a list, each element of which is a vector containing one tokenized sample. The example shown below will give you some hints on how to prepare such a corpus. Also, refer to help(load.corpus.and.parse).
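The structure expected by training.corpus (a list whose elements are vectors of tokens) can be sketched in base R as follows. The two texts, their names and the helper tokenize are hypothetical, and the tokenization is deliberately naive; the package's own loading functions handle real corpora with more care.

```r
# Two made-up raw texts; the names follow a hypothetical Author_Title pattern
raw.texts <- c(AuthorA_Text1 = "The cat sat on the mat.",
               AuthorB_Text1 = "A dog barked, and the cat ran away!")

# Naive tokenization: lowercase, then split on anything that is not a letter
tokenize <- function(x) {
  tokens <- unlist(strsplit(tolower(x), "[^a-z]+"))
  tokens[tokens != ""]
}

# A named list; each element is a character vector with one tokenized sample
my.corpus <- lapply(raw.texts, tokenize)
```

Calling str(my.corpus) shows a list of two character vectors, one per sample.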


test.corpus: if training.corpus is used, then you should also prepare a similar R object containing the test set.


mfw: how many of the most frequent words (or other units) should be used as features to test the classifier? The default value is c(100, 200, 500), which assesses three different ranges of most frequent words (MFWs).


features: usually, a number of the most frequent features (words, word n-grams, character n-grams) are extracted automatically from the corpus and used as variables for further analysis. However, in some cases it makes sense to use a set of tailored features, e.g. the words that are associated with emotions or, say, a specific subset of function words. This optional argument allows you to pass either the name of a file containing your custom list of features, or a vector (R object) of features to be assessed.


path: if not specified, the current directory will be used for input/output procedures (reading files, outputting the results).


corpus.dir: the subdirectory (within the current working directory) that contains the corpus text files. If not specified, the default subdirectory corpus will be used. This option is immaterial when an external corpus and/or external tables with frequencies are loaded.


sample.size.coverage: the procedure iteratively tests classification accuracy for different sample sizes. Feel free to modify the default value seq(100, 10000, 100), which tests samples of 100, 200, 300, ..., 10,000 words.
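To see what the default grid of sample.size.coverage looks like, and how a custom one might be built, consider the short base R illustration below. The coarser grid is only an example of a possible setting, not a recommendation.

```r
# The default grid of sample sizes: 100, 200, ..., 10000
default.coverage <- seq(100, 10000, 100)
length(default.coverage)   # 100 different sample sizes
head(default.coverage, 3)  # 100 200 300

# A hypothetical coarser grid: 500, 1000, ..., 5000
custom.coverage <- seq(500, 5000, 500)
```

A coarser or shorter grid reduces the number of sample sizes to be tested, since each of them is assessed in many iterations.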


sample.with.replacement: if a tested sample size is bigger than the text to be tested, then the procedure stops. To prevent such a situation, you might decide to draw your samples (n words) with replacement, which means that particular words can be picked more than once (the default value is FALSE).


iterations: each sample size of a given text is tested by randomly extracting n words from the text in N iterations (the default being 100). Since the procedure is random, a large(ish) number of iterations, say 100, gives a reliable picture of how a given sample size actually behaves.
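The interplay of sample.with.replacement and iterations can be illustrated in base R. This is only a sketch of the sampling idea, with a made-up text and made-up settings; it is not the package's internal code.

```r
set.seed(1)  # only to make this illustration reproducible

# A made-up tokenized text of 50 words
text.tokens <- rep(c("the", "cat", "sat", "on", "a",
                     "mat", "and", "slept", "all", "day"), 5)

n <- 80            # requested sample size, larger than the text itself
iterations <- 100  # number of random draws for this sample size

# Without replacement, base R refuses to draw more words than the text has
draw.failed <- inherits(try(sample(text.tokens, n), silent = TRUE), "try-error")

# With replacement any n is feasible, because words can be picked repeatedly
samples <- replicate(iterations,
                     sample(text.tokens, n, replace = TRUE),
                     simplify = FALSE)
```

Here draw.failed is TRUE, whereas samples holds 100 random samples of 80 words each, ready to be classified one by one.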


classification.method: this option invokes one of the classification methods provided by the package stylo. Choose one of the following: "delta", "svm", "knn", "nsc", "naivebayes".


list.cutoff: when texts are loaded from files, tokenized, and counted, a table of frequencies is built. Since one is unlikely to analyze thousands of the most frequent words (rather than 100 or, say, 500), trimming the table of frequencies saves a lot of time. The default value is 1000 most frequent words.


...: any other argument, usually tokenization settings (via the parameters corpus.lang, features, ngram.size etc.), or hyperparameters of different classification methods, such as a distance measure (for Delta), a cost function (for SVM), and so forth.


Details

If no additional argument is passed, then the function tries to load text files from the default subdirectory corpus. The resulting object will then contain accuracy and diversity scores for all the texts.


Value

The function returns an object of the class stylo.results: a list of variables, including classification accuracy scores for each tested text and each assessed sample size, Simpson's diversity index scores, and the names of the texts analyzed. Use the generic function summary to see the contents of the object, and the generic function plot to generate a tailored plot.
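To give an intuition for the two kinds of scores, the sketch below computes an accuracy and a Simpson's index from made-up classifier verdicts for one text at one sample size. The index is computed here as the sum of squared class proportions, which is an assumption: this page does not state which variant of the index the function uses, nor how the returned object stores the scores.

```r
# Made-up classifier verdicts for one text at one sample size (10 iterations)
predicted <- c("AuthorA", "AuthorA", "AuthorB", "AuthorA", "AuthorC",
               "AuthorA", "AuthorA", "AuthorB", "AuthorA", "AuthorA")
expected <- "AuthorA"

# Accuracy: share of iterations in which the expected class was predicted
accuracy <- mean(predicted == expected)

# Simpson's index (assumed variant): sum of squared class proportions;
# 1 means that all iterations pointed to the same class
proportions <- table(predicted) / length(predicted)
simpson <- sum(proportions^2)
```

With these verdicts, accuracy is 0.7 and simpson is 0.54; the more the verdicts scatter across different classes, the lower this variant of the index becomes.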


Author(s)

Maciej Eder


References

Eder, M. (2017). Short samples in authorship attribution: A new approach. "Digital Humanities 2017: Conference Abstracts". Montreal: McGill University, pp. 221–24.

See Also

plot.sample.size, classify, imposters


Examples

## Not run:

# standard usage (it builds a corpus from a set of text files):
results = size.penalize()

## End(Not run)

stylo documentation built on Dec. 6, 2020, 5:06 p.m.
