This data set contains bigrams of adjacent word forms from the Brown corpus of written American English (Francis & Kucera 1964). Co-occurrence frequencies are specified in the form of an observed contingency table, using the notation suggested by Evert (2008).
Only bigrams that occur at least 5 times in the corpus are included.
A data frame with 24167 rows and the following columns:
unique ID of the bigram entry
the first word form in the bigram (character)
part-of-speech category of the first word (factor)
the second word form in the bigram (character)
part-of-speech category of the second word (factor)
co-occurrence frequency of the bigram (numeric)
occurrences of the first word without the second (numeric)
occurrences of the second word without the first (numeric)
number of bigram tokens containing neither the first nor the second word (numeric)
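The four frequency columns above form a 2×2 observed contingency table for each bigram. As a minimal sketch (not part of the data set or its package), the following Python snippet shows how row and column totals, the sample size and the expected co-occurrence frequency can be derived from the four cells. The names O11, O12, O21, O22, R1, C1, N and E11 follow the notation of Evert (2008); whether the data frame columns carry exactly these names is an assumption, and the numeric values are invented for illustration only.

```python
import math

def contingency_summary(o11, o12, o21, o22):
    """Derive marginals, sample size and expected frequency from the four cells.

    o11: co-occurrences of first and second word
    o12: first word without the second
    o21: second word without the first
    o22: bigram tokens containing neither word
    """
    r1 = o11 + o12              # row total: bigrams whose first element is the first word
    c1 = o11 + o21              # column total: bigrams whose second element is the second word
    n = o11 + o12 + o21 + o22   # sample size: all bigram tokens
    e11 = r1 * c1 / n           # expected co-occurrence frequency under independence
    # Pointwise mutual information as one simple association score;
    # it requires o11 > 0, which holds here because only bigrams with
    # at least 5 occurrences are included in the data set.
    mi = math.log2(o11 / e11)
    return {"R1": r1, "C1": c1, "N": n, "E11": e11, "MI": mi}

# Hypothetical cell counts, not taken from the Brown corpus.
summary = contingency_summary(o11=10, o12=90, o21=190, o22=9710)
print(summary)
```

Any other association measure based on observed and expected frequencies can be computed from the same four cells in the same way.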
Part-of-speech categories are identified by single-letter codes, each corresponding to the first character of the tag in the Penn tagset.
Some important POS codes include R (adverb or particle) and W (wh-word).
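The single-letter coding described above can be illustrated with a short Python sketch. The helper `pos_code` is hypothetical and only mirrors the stated rule (take the first character of a Penn tag); it is not a function of the package, and the example tags are standard Penn Treebank tags chosen for illustration.

```python
def pos_code(penn_tag):
    """Return the single-letter POS code: the first character of a Penn tag."""
    return penn_tag[0].upper()

# RB (adverb) and RP (particle) collapse to R; WDT, WP and WRB (wh-words) collapse to W.
examples = {tag: pos_code(tag) for tag in ["RB", "RP", "WDT", "WP", "WRB"]}
print(examples)
```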
Stefan Evert <email@example.com>
Evert, Stefan (2008). Corpora and collocations. In A. Lüdeling and M. Kytö (eds.), Corpus Linguistics. An International Handbook, chapter 58, pages 1212–1248. Mouton de Gruyter, Berlin, New York.
Francis, W. N. and Kucera, H. (1964). Manual of information to accompany a standard sample of present-day edited American English, for use with digital computers. Technical report, Department of Linguistics, Brown University, Providence, RI.