This data set provides complete metadata for all 4048 texts of the British National Corpus (XML edition). See Aston & Burnard (1998) for more information about the BNC, or go to http://www.natcorp.ox.ac.uk/.
The data were extracted automatically from the original BNC source files. Some transformations were applied so that all attribute names and values are given in human-readable form. The Perl scripts used in the extraction procedure are available from http://cwb.sourceforge.net/download.php#import.
A data frame with 4048 rows and the columns listed below. Unless specified otherwise, columns are coded as factors.
BNC document ID; character vector
Title of the document; character vector
Number of words in the document; integer vector
Total number of tokens (including punctuation and deleted material); integer vector
Number of w-units (words); integer vector
Number of c-units (punctuation); integer vector
Number of s-units (sentences); integer vector
Age-group of respondent
Social class of respondent (NRS social grades)
Sex of respondent
Domicile of author
Sex of author
Estimated circulation size
Text mode (written/spoken)
David Lee's genre classification
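The mix of column types described above (character IDs, integer counts, factor-coded categories) can be illustrated with a minimal sketch. It uses a small made-up data frame rather than the real data: the column names (id, n_w, n_c, mode) and all values are hypothetical and chosen only for illustration.

```r
# Toy stand-in for the metadata table. Column names and values are
# hypothetical; they are NOT taken from the actual data set.
meta <- data.frame(
  id   = c("T01", "T02", "T03"),                       # character vector
  n_w  = c(6000L, 5500L, 12000L),                      # integer: w-units
  n_c  = c(800L, 700L, 1500L),                         # integer: c-units
  mode = factor(c("written", "written", "spoken")),    # factor
  stringsAsFactors = FALSE
)

# Factor columns can be tabulated directly.
mode_counts <- table(meta$mode)

# Sum an integer column within each level of a factor.
words_by_mode <- tapply(meta$n_w, meta$mode, sum)

# Check which columns are coded as factors.
is_factor <- vapply(meta, is.factor, logical(1))

print(mode_counts)
print(words_by_mode)
print(names(meta)[is_factor])
```

Keeping the ID and title columns as character vectors rather than factors avoids creating one factor level per text, while the categorical columns remain convenient to tabulate and group by.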
Stefan Evert <firstname.lastname@example.org>
Aston, Guy and Burnard, Lou (1998). The BNC Handbook. Edinburgh University Press, Edinburgh. See also the BNC homepage at http://www.natcorp.ox.ac.uk/.