BrownBigrams: Bigrams of adjacent words from the Brown corpus

Description

This data set contains bigrams of adjacent word forms from the Brown corpus of written American English (Francis & Kucera 1964). Co-occurrence frequencies are specified in the form of an observed contingency table, using the notation suggested by Evert (2008).

Only bigrams that occur at least 5 times in the corpus are included.

Usage

BrownBigrams

Format

A data frame with 24167 rows and the following columns:

id:

unique ID of the bigram entry

word1:

the first word form in the bigram (character)

pos1:

part-of-speech category of the first word (factor)

word2:

the second word form in the bigram (character)

pos2:

part-of-speech category of the second word (factor)

O11:

co-occurrence frequency of the bigram (numeric)

O12:

occurrences of the first word without the second (numeric)

O21:

occurrences of the second word without the first (numeric)

O22:

number of bigram tokens containing neither the first nor the second word (numeric)
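
The four observed cells can be combined into marginal frequencies and an expected co-occurrence frequency. The following is a minimal sketch under stated assumptions: it uses a small made-up data frame with the same columns as described above (the values are not taken from the Brown corpus), and the names R1, C1, N and E11 are illustrative variables in the spirit of Evert (2008), not columns of the data set.

```r
# Toy stand-in with the same column layout as BrownBigrams.
# All values are made up for illustration, NOT real Brown corpus counts.
# With the corpora package installed, one could instead use:
#   library(corpora); bigrams <- BrownBigrams
bigrams <- data.frame(
  id    = 1:3,
  word1 = c("of", "new", "can"),
  pos1  = factor(c("I", "J", "M")),
  word2 = c("the", "york", "be"),
  pos2  = factor(c("D", "N", "V")),
  O11   = c(900, 30, 20),
  O12   = c(2100, 120, 180),
  O21   = c(5100, 10, 580),
  O22   = c(91900, 99840, 99220),
  stringsAsFactors = FALSE
)

# Marginal frequencies and sample size derived from the observed cells
R1 <- bigrams$O11 + bigrams$O12  # first word, with or without the second
C1 <- bigrams$O11 + bigrams$O21  # second word, with or without the first
N  <- bigrams$O11 + bigrams$O12 + bigrams$O21 + bigrams$O22

# Expected co-occurrence frequency if the two words were independent
E11 <- R1 * C1 / N

# 2x2 observed contingency table for the first bigram entry
tab <- with(bigrams[1, ], matrix(
  c(O11, O12, O21, O22), nrow = 2, byrow = TRUE,
  dimnames = list(c("w1", "not w1"), c("w2", "not w2"))
))
print(tab)
```

Comparing O11 with E11 row by row is the starting point for most association measures; which measure to apply is left open here.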

Details

Part-of-speech categories are identified by single-letter codes, each corresponding to the first character of a Penn tagset tag.

Some important POS codes are N (noun), V (verb), J (adjective), R (adverb or particle), I (preposition), D (determiner), W (wh-word) and M (modal).
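
The single-letter codes make it easy to select bigrams by part-of-speech pattern. The sketch below assumes a toy data frame with made-up values and only the columns needed for the example; the variable names bigrams and adj_noun are illustrative. It picks out adjective-noun pairs (J followed by N) and sorts them by co-occurrence frequency.

```r
# Toy stand-in with a subset of the BrownBigrams columns.
# All values are made up for illustration, NOT real Brown corpus counts.
bigrams <- data.frame(
  word1 = c("of", "new", "old", "can"),
  pos1  = factor(c("I", "J", "J", "M")),
  word2 = c("the", "york", "man", "be"),
  pos2  = factor(c("D", "N", "N", "V")),
  O11   = c(900, 30, 45, 20),
  stringsAsFactors = FALSE
)

# Keep adjective-noun bigrams only (pos1 is J, pos2 is N)
adj_noun <- subset(bigrams, pos1 == "J" & pos2 == "N")

# Sort by co-occurrence frequency, most frequent first
adj_noun <- adj_noun[order(adj_noun$O11, decreasing = TRUE), ]
print(adj_noun[, c("word1", "word2", "O11")])

# Count how many bigram entries start with each POS category
pos1_counts <- table(bigrams$pos1)
print(pos1_counts)
```

The same pattern applies to other combinations, for example pos1 == "M" & pos2 == "V" for modal-verb pairs.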

Author(s)

Stefan Evert <stefan.evert@fau.de>

References

Evert, Stefan (2008). Corpora and collocations. In A. Lüdeling and M. Kytö (eds.), Corpus Linguistics. An International Handbook, chapter 58, pages 1212–1248. Mouton de Gruyter, Berlin, New York.

Francis, W. N. and Kucera, H. (1964). Manual of information to accompany a standard sample of present-day edited American English, for use with digital computers. Technical report, Department of Linguistics, Brown University, Providence, RI.


corpora documentation built on May 2, 2019, 4:56 p.m.