Use read.table to get COCA word frequency table.
file |
Sent to |
sep |
The CoCA lexical frequency file is tab delimited. Value sent to |
na.strings |
Sent to |
quote |
Some fields in CoCA file contain "'". So remove that character from the
|
header |
The CoCA file includes a header. Value sent to |
fill |
Override the default value because the header row in the CoCA frequency file ends with a stray tab, at least in my copy. |
skip |
Skip 2 comment rows at the top of the file. |
simpleWC |
If TRUE (the default) then add vector of simplified wordclasses to data.frame. See
|
... |
additional arguments will be passed to |
Mostly a convenience wrapper around read.table
with reasonable defaults for reading the
Corpus of Contemporary American English word frequency file (corpus.byu.edu). The file
contains tab-delimited text, with some idiosyncrasies.
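The defaults described in the arguments above can be sketched as a direct read.table call. This is a minimal illustration, not the function's actual source: the mock file contents are made up, and the specific values sep = "\t", quote = "\"", header = TRUE, fill = TRUE and skip = 2 are assumptions inferred from the argument descriptions.

```r
# Build a tiny stand-in for the frequency file (made-up rows, not real CoCA data).
mock_lines <- c(
  "Comment row one",
  "Comment row two",
  "ID\tw1\tL1\tc1\tcoca\t",      # header row ends with a stray tab
  "1\tthe\tthe\ta\t22500000",
  "2\tdon't\tdo\tv\t100000"      # field containing an apostrophe
)
mock_file <- tempfile(fileext = ".txt")
writeLines(mock_lines, mock_file)

# Assumed defaults, inferred from the argument descriptions.
freq <- read.table(
  mock_file,
  sep = "\t",                # file is tab delimited
  quote = "\"",              # treat only the double quote as a quoting character
  header = TRUE,             # file has a header row
  fill = TRUE,               # pad short rows caused by the stray tab in the header
  skip = 2,                  # skip the two comment rows
  stringsAsFactors = FALSE
)
str(freq)
```

With fill = FALSE the stray tab would make the header one field longer than the data rows, so the read would be expected to fail; padding the rows avoids that at the cost of an extra, empty trailing column.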
Contents of data.frame as documented in CoCA itself.
The following information is adapted from the spreadsheet version of the lexical frequency table that is distributed with CoCA itself.
This spreadsheet contains the 100,000 word list (http://www.wordfrequency.info/100k.asp) that is based on the Corpus of Contemporary American English (COCA; http://corpus.byu.edu/coca/) and other corpora (http://corpus.byu.edu).
This copy of the data cannot be shared with others. Note also that a small change has been made to the data in this spreadsheet to identify you as the source of the spreadsheet.
The file includes a great deal of data from several different corpora. Column contents are listed below, by column name.
Column
WC
Simplified word class, if requested. See simpleWC
argument to this function.
ID
Numerical word ID (rank order), 1-100,000
w1
Word form
L1
Lemma/stem (e.g. go for the words gone or went, or book for the word books, or quick for the word quicker)
c1
Part of speech. This is the first letter from the codes at http://ucrel.lancs.ac.uk/claws7tags.html
pc
Percent of tokens that are capitalized. This lets you see whether the word occurs mainly in proper noun-like contexts, like Ravens (the Baltimore Ravens team vs. the actual animal), March (the month vs. a walk), Brown (the surname vs. the adjective), Beach (in place names like Daytona Beach), AIDS (the disease vs. e.g. visual aids), or Rice (the university or surname vs. the food). Note that some words have a high degree of capitalization simply because they occur primarily at the beginning of sentences, e.g. Hello or Unfortunately.
spelling
Whether the word is an American or British spelling
coca
Raw frequency (# tokens) in the 450 million word Corpus of Contemporary American English (http://corpus.byu.edu/coca)
pcoca
Frequency (per million words) in the 450 million word Corpus of Contemporary American English (http://corpus.byu.edu/coca)
pbnc
Frequency (per million words) in the 100 million word British National Corpus (http://corpus.byu.edu/bnc)
psoap
Frequency (per million words) in the 100 million word Corpus of American Soap Operas (http://corpus2.byu.edu/soap)
ph3-ph1
Frequency (per million words) in the Corpus of Historical American English (http://corpus.byu.edu/coha): 1950-1989, 1900-1949, and 1810-1899
pc1-pc5
Frequency (per million words) in COCA genres: spoken, fiction, popular magazines, newspapers, and academic journals
pb1-pb7
Frequency (per million words) in BNC genres: spoken, fiction, popular magazines, newspapers, non-academic journals, academic journals, and miscellaneous
tpcoca
Percentage of COCA texts (0.00-1.00) that contain the word at least once.
tpbnc
Percentage of BNC texts (0.00-1.00) that contain the word at least once.
tpsoap
Percentage of SOAP texts (0.00-1.00) that contain the word at least once.
tph3-tpb7
Percentage of texts (0.00-1.00) that contain the word at least once, for: 1) COHA time periods, 2) COCA genres, 3) BNC genres
bnc-fb7
Raw token frequency in BNC, SOAP, COHA, COCA genres and BNC genres: the basis for Columns pcoca through pb7
tcoca-tb7
Raw number of texts in COCA, BNC, SOAP, COHA, COCA genres and BNC genres: the basis for Columns tpcoca through tpb7
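As a rough illustration of how the raw and per-million columns described above relate (for example coca and pcoca), the sketch below divides raw token counts by the corpus size in millions. The counts are hypothetical, and the round figure of 450 million is taken from the column descriptions; the values in the real file may have been computed against an exact corpus size, so small differences should be expected.

```r
# Hypothetical raw token counts; not taken from the real CoCA file.
coca_raw <- c(the = 22500000, ravens = 4500)

# Corpus size in millions of words, as stated in the column descriptions.
corpus_size_millions <- 450

# Frequency per million words = raw tokens / corpus size in millions.
pcoca <- coca_raw / corpus_size_millions
print(pcoca)
```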
a data.frame
Dave Braze davebraze@gmail.com