This data frame provides unsupervised distributional features for each text in the extended Brown Family of corpora (Brown, LOB, Frown, FLOB, BLOB), covering edited written American and British English from the 1930s, 1960s and 1990s (see Xiao 2008, 395–397).
Latent topic dimensions were obtained by a method similar to Latent Semantic Indexing (Deerwester et al. 1990), applying singular value decomposition to bag-of-words vectors for the 2500 texts in the extended Brown Family. Register dimensions were obtained with the same methodology, using vectors of part-of-speech frequencies (separately for all verb-related tags and all other tags).
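The general idea of deriving latent dimension scores by singular value decomposition can be sketched as follows. This is a minimal illustration, not the procedure used to build this data set: the toy count matrix, the helper name `latent_dimensions`, and the conversion to relative frequencies are all assumptions made for the example, and the actual preprocessing and weighting are not documented here.

```python
import numpy as np


def latent_dimensions(counts, n_dims):
    """Return per-text scores on the first n_dims latent dimensions.

    counts: texts x features matrix of raw frequencies (hypothetical input).
    """
    X = np.asarray(counts, dtype=float)
    # Assumption: convert to relative frequencies so that longer texts
    # do not dominate; the original weighting scheme is not documented.
    X = X / X.sum(axis=1, keepdims=True)
    # Thin SVD: X = U * diag(s) * Vt, singular values in descending order.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    # Scores of each text on the leading dimensions (equivalently X @ Vt.T).
    return U[:, :n_dims] * s[:n_dims]


# Toy data: 6 texts x 5 features (word or part-of-speech tag counts).
counts = np.array([
    [12, 3, 0, 5, 1],
    [10, 4, 1, 6, 0],
    [1, 0, 9, 2, 8],
    [0, 1, 11, 1, 7],
    [5, 5, 5, 5, 5],
    [7, 2, 3, 4, 2],
])

scores = latent_dimensions(counts, n_dims=2)
print(scores.shape)  # (6, 2)
```

Applied to a matrix with one row per text, taking the first 9 columns of such a score matrix would correspond to keeping 9 latent dimensions; running the same function on a matrix of part-of-speech tag frequencies instead of word counts mirrors the distinction between topic and register dimensions described above.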
A data frame with 2500 rows and the following 23 columns:
A unique ID for each text (also used as row name)
Latent dimension scores for the first 9 topic dimensions
Latent dimension scores for the first 9 register dimensions (excluding verb-related tags)
Latent dimension scores for the first 4 register dimensions based only on verb-related tags
Stefan Evert (http://purl.org/stefan.evert)
Deerwester, Scott; Dumais, Susan T.; Furnas, George W.; Landauer, Thomas K.; Harshman, Richard (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6), 391–407.
Xiao, Richard (2008). Well-known and influential corpora. In A. Lüdeling and M. Kytö (eds.), Corpus Linguistics. An International Handbook, chapter 20, pages 383–457. Mouton de Gruyter, Berlin.