Use read.table to get COCA word frequency table.
file Sent to read.table.
sep The CoCA lexical frequency file is tab delimited. Value sent to read.table.
na.strings Sent to read.table.
quote Some fields in the CoCA file contain "'". So remove that character from the
header The CoCA file includes a header. Value sent to read.table.
fill Override the default value because the end of the header row in the CoCA frequency file has a stray tab, at least in my copy.
skip Skip 2 comment rows at the top of the file.
simpleWC If TRUE (the default), add a vector of simplified word classes to the data.frame. See
... Additional arguments will be passed to read.table.
Mostly a convenience wrapper around read.table with reasonable defaults for reading the
Corpus of Contemporary American English word frequency file (corpus.byu.edu). The file
contains tab-delimited text, with some idiosyncrasies.
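The defaults described above can be sketched with a direct read.table call. This is a minimal illustration, not the wrapper's actual source: the temporary file and its contents are made up to mimic the layout described here (two comment rows, a header row ending in a stray tab, and a field containing "'"), and the argument values are inferred from the argument descriptions.

```r
# Made-up stand-in for the CoCA file (NOT real CoCA data): two comment rows,
# a header row that ends in a stray tab, and tab-delimited data rows.
tmp <- tempfile(fileext = ".txt")
writeLines(c(
  "comment row 1",
  "comment row 2",
  "ID\tw1\tL1\tc1\t",
  "1\tthe\tthe\ta",
  "2\tn't\tnot\tx",
  "3\tbooks\tbook\tn"
), tmp)

coca <- read.table(tmp,
                   sep = "\t",       # file is tab delimited
                   quote = "\"",     # "'" is not treated as a quote character
                   header = TRUE,    # first row after the comments holds names
                   fill = TRUE,      # pad data rows; header has one extra field
                   skip = 2,         # skip the two comment rows
                   stringsAsFactors = FALSE)

# The stray tab in the header produces one extra, all-NA column.
str(coca)
```

With fill = TRUE the data rows, which have one field fewer than the header, are padded with NA instead of triggering a "did not have N elements" error.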
Contents of data.frame as documented in CoCA itself.
The following information is adapted from the spreadsheet version of the lexical frequency table that is distributed with CoCA itself.
This spreadsheet contains the 100,000 word list (http://www.wordfrequency.info/100k.asp) that is based on the Corpus of Contemporary American English (COCA; http://corpus.byu.edu/coca/) and other corpora (http://corpus.byu.edu).
This copy of the data cannot be shared with others. Note also that a small change has been made to the data in this spreadsheet to identify you as the source of the spreadsheet.
The file includes a great deal of data from several different corpora. Column contents are listed below, by column name.
Column
WC Simplified word class, if requested. See simpleWC argument to this function.
ID Numerical word ID (rank order), 1-100,000
w1 Word form
L1 Lemma/stem (e.g. go for the words gone or went, or book for the word books, or quick for the word quicker)
c1 Part of speech. This is the first letter from the codes at http://ucrel.lancs.ac.uk/claws7tags.html
pc Percent of tokens that are capitalized. This lets you see whether the word occurs mainly in proper noun-like contexts, like Ravens (the Baltimore Ravens team vs. the actual animal), March (the month vs. a walk), Brown (the surname vs. the adjective), Beach (in place names like Daytona Beach), AIDS (the disease vs. e.g. visual aids), or Rice (the university or surname vs. the food). Note that some words have a high degree of capitalization simply because they occur primarily at the beginning of sentences, e.g. Hello or Unfortunately.
spelling Whether the word is an American or British spelling
coca Raw frequency (# tokens) in the 450 million word Corpus of Contemporary American English (http://corpus.byu.edu/coca)
pcoca Frequency (per million words) in the 450 million word Corpus of Contemporary American English (http://corpus.byu.edu/coca)
pbnc Frequency (per million words) in the 100 million word British National Corpus (http://corpus.byu.edu/bnc)
psoap Frequency (per million words) in the 100 million word Corpus of American Soap Operas (http://corpus2.byu.edu/soap)
ph3-ph1 Frequency (per million words) in the Corpus of Historical American English (http://corpus.byu.edu/coha): 1950-1989, 1900-1949, and 1810-1899
pc1-pc5 Frequency (per million words) in COCA genres: spoken, fiction, popular magazines, newspapers, and academic journals
pb1-pb7 Frequency (per million words) in BNC genres: spoken, fiction, popular magazines, newspapers, non-academic journals, academic journals, and miscellaneous
tpcoca Percentage of COCA texts (0.00-1.00) that contain the word at least once.
tpbnc Percentage of BNC texts (0.00-1.00) that contain the word at least once.
tpsoap Percentage of SOAP texts (0.00-1.00) that contain the word at least once.
tph3-tpb7 Percentage of texts (0.00-1.00) that contain the word at least once: 1) COHA time periods 3) COCA genres 4) BNC genres
bnc-fb7 Raw token frequency in BNC, SOAP, COHA, COCA genres and BNC genres: the basis for Columns pcoca through pb7
tcoca-tb7 Raw number of texts in COCA, BNC, SOAP, COHA, COCA genres and BNC genres: the basis for Columns tpcoca through tpb7
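The relationship between a raw count column and its per-million counterpart (for example coca and pcoca) can be illustrated as follows. This is a sketch under stated assumptions: the rows are made up, the corpus size is taken as exactly 450 million tokens as quoted above, and the names freq, corpus_size_millions and per_million are hypothetical. The values in the distributed file may have been computed from a slightly different total.

```r
# Made-up rows for illustration only; these are NOT real CoCA counts.
freq <- data.frame(
  w1   = c("alpha", "beta", "gamma"),
  coca = c(900000, 45000, 450),      # raw token counts (hypothetical)
  stringsAsFactors = FALSE
)

# Assumption: corpus size of exactly 450 million tokens.
corpus_size_millions <- 450

# Frequency per million words = raw count / corpus size in millions.
freq$per_million <- freq$coca / corpus_size_millions
print(freq)
```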
a data.frame
Dave Braze davebraze@gmail.com