corpora_package: corpora: Statistical Inference from Corpus Frequency Data

corpora-package

R Documentation

Description

The corpora package provides a collection of functions for statistical inference from corpus frequency data, as well as some convenience functions and example data sets.

It is a companion package to the open-source course Statistical Inference: a Gentle Introduction for Linguists and similar creatures (SIGIL), originally developed by Marco Baroni and Stephanie Evert. The statistical methods implemented in the package are described and illustrated in the units of this course.

Starting with version 0.6, the package also includes best-practice implementations of various corpus-linguistic analysis techniques.

Details

This section gives an overview of some important functions and data sets included in the corpora package. See the package index for a complete listing.

Analysis functions

keyness() provides reference implementations for best-practice keyness measures, including the recommended LRC measure (Evert 2022)
am.score() computes various standard association measures for collocation analysis (Evert 2004, 2008) as well as user-defined formulae
binom.pval() is a vectorised function that computes p-values of the binomial test more efficiently than binom.test (using central p-values in the two-sided case)
fisher.pval() is a vectorised function that efficiently computes p-values of Fisher's exact test on 2×2 contingency tables for large samples (using central p-values in the two-sided case)
prop.cint() is a vectorised function that computes multiple binomial confidence intervals much more efficiently than binom.test
z.score() and z.score.pval() can be used to carry out a z-test for a single proportion (as an approximation to binom.test)
chisq() and chisq.pval() are vectorised functions that compute the test statistic and p-value of a chi-squared test for 2×2 contingency tables more efficiently than chisq.test
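
The following is a minimal sketch of how the vectorised test functions listed above might be called. All frequency counts are invented for illustration, and the argument names (k, n, p, conf.level, k1, n1, k2, n2) are assumed to match the help pages of the individual functions, which should be consulted for the authoritative signatures.

```r
library(corpora)

# Hypothetical counts: k occurrences of a word in three samples of n tokens each
k <- c(19, 25, 40)
n <- c(100, 100, 100)

# Binomial test against the null hypothesis p = 0.25,
# returning one p-value per (k, n) pair
p.binom <- binom.pval(k, n, p = 0.25)

# 95% binomial confidence intervals, one row per (k, n) pair
# (assumed here to be returned as a data frame with columns lower and upper)
ci <- prop.cint(k, n, conf.level = 0.95)

# Frequency comparison of two samples: k1 out of n1 vs. k2 out of n2
p.fisher <- fisher.pval(k1 = 10, n1 = 1000, k2 = 30, n2 = 1000)
p.chisq  <- chisq.pval(k1 = 10, n1 = 1000, k2 = 30, n2 = 1000)

print(p.binom)
print(ci)
print(c(fisher = p.fisher, chisq = p.chisq))
```

Because the functions are vectorised, the same calls work unchanged when k and n hold thousands of items, for example one entry per word type in a frequency list.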

Utility functions

cont.table() creates 2×2 contingency tables for frequency comparison tests that can be passed to chisq.test and fisher.test
sample.df() extracts random samples of rows from a data frame
qw() splits a string on whitespace or a user-specified regular expression (similar to Perl's qw// construct)
corpora.palette() provides some nice colour palettes (better than R's default colours)
rowVector() and colVector() convert a vector into a single-row or single-column matrix
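
A short sketch of some of the utility functions in use. The example strings, counts, and the data frame df are made up for illustration, and the calls rely on positional arguments that are assumed to follow the order given in the respective help pages.

```r
library(corpora)

# qw() splits a string on whitespace by default
words <- qw("alpha beta gamma")
print(words)

# cont.table() builds a 2x2 table from k1/n1 vs. k2/n2 (hypothetical counts),
# which can then be handed to the base R tests
ct <- cont.table(10, 1000, 30, 1000)
print(ct)
print(fisher.test(ct)$p.value)
print(chisq.test(ct)$p.value)

# sample.df() draws a random sample of rows from a data frame
df <- data.frame(id = 1:20, value = (1:20)^2)  # illustrative data only
smp <- sample.df(df, 5)
print(smp)

# rowVector() and colVector() turn a plain vector into a 1-row or 1-column matrix
rv <- rowVector(1:3)
cv <- colVector(1:3)
print(dim(rv))
print(dim(cv))
```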

Data sets

Several data sets based on the British National Corpus, including complete metadata for all 4048 text files (BNCmeta), per-text frequency counts for a number of linguistic corpus queries (BNCqueries), and relative frequencies of 65 lexico-grammatical features for each text (BNCbiber)
Frequency counts of passive constructions in all texts of the Brown and LOB corpora (BrownLOBPassives) for frequency comparison with regression models, complemented by distributional features (DistFeatBrownFam) as additional predictors
A small text corpus of Very Short Stories in the form of a data frame (VSS), with one row for each token in the corpus
Small example tables to illustrate frequency comparison of lexical items (BNCcomparison) and collocation analysis (BNCInChargeOf)
KrennPPV is a data set of German verb-preposition-noun collocation candidates with manual annotation of true positives and pre-computed association scores
Three functions for generating large synthetic data sets used in the SIGIL course: simulated.census(), simulated.language.course() and simulated.wikipedia()
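
As a minimal sketch, the example data sets can be loaded and inspected with standard R tools. The snippet makes no assumptions about column names and simply prints the structure of two of the smaller data sets; see the individual help pages (e.g. ?BNCcomparison and ?VSS) for a description of the variables.

```r
library(corpora)

# Load two of the small example data sets shipped with the package
data(BNCcomparison)
data(VSS)

# Inspect their structure without assuming particular column names
str(BNCcomparison)
head(VSS)

# Number of rows in VSS (one row per token, according to the description above)
print(nrow(VSS))
```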