DSM_Vectors | R Documentation |
A matrix of 50-dimensional pre-compiled DSM vectors for selected English content words, covering most of the words needed for several basic evaluation tasks included in the package.
Targets are given as disambiguated lemmas in the form <headword>_<pos>, e.g. walk_V and walk_N.
DSM_Vectors
A numeric matrix with 1667 rows and 50 columns.
Row labels are disambiguated lemmas of the form <headword>_<pos>, where the part-of-speech code is one of N (noun), V (verb), J (adjective) or R (adverb).
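The label format can be taken apart with base R string functions. The following is a minimal sketch on a small hypothetical sample of labels (the vector labels and the lookup table pos_names are illustrative and not part of the package); it splits at the last underscore, on the assumption that the part-of-speech code always follows the final underscore.

```r
# Hypothetical sample of row labels in the <headword>_<pos> format
labels <- c("walk_V", "walk_N", "white_J", "kindness_N", "well_R")

# Split at the last underscore so that a headword containing "_" stays intact
headword <- sub("_[^_]*$", "", labels)
pos      <- sub("^.*_", "", labels)

# Map the one-letter part-of-speech codes to readable names
pos_names <- c(N = "noun", V = "verb", J = "adjective", R = "adverb")
parsed <- data.frame(
  headword = headword,
  pos      = pos,
  pos_name = unname(pos_names[pos]),
  stringsAsFactors = FALSE
)
print(parsed)

# Count lemmas per part of speech
pos_counts <- table(parsed$pos)
print(pos_counts)
```

With the package loaded, the same steps should apply to the real labels by setting labels <- rownames(DSM_Vectors).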
Attribute "sigma" contains singular values that can be used for post-hoc power scaling of the latent dimensions (see dsm.projection).
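To illustrate what post-hoc power scaling does, here is a base R sketch on a toy matrix rather than the real vectors. It assumes that a projection with power p has coordinates U Σ^p, so that a power-1 projection can be converted by multiplying column i by sigma_i^(p - 1). The helper power_scale is hypothetical and not a package function.

```r
set.seed(42)
X <- matrix(rnorm(20 * 6), nrow = 20)  # toy data, not the real DSM

# Rank-k SVD projection with power = 1: coordinates are U %*% diag(sigma)
k     <- 3
dec   <- svd(X, nu = k, nv = k)
sigma <- dec$d[1:k]
M1    <- dec$u %*% diag(sigma)

# Hypothetical helper: rescale column i by sigma[i]^(p - 1)
# to move from the power = 1 projection to a power = p projection
power_scale <- function(M, sigma, p) {
  sweep(M, 2, sigma^(p - 1), `*`)
}

M0 <- power_scale(M1, sigma, 0)    # whitening: all dimensions weighted equally
Mh <- power_scale(M1, sigma, 0.5)  # intermediate setting

# After whitening, the columns are the left singular vectors,
# so each latent dimension has unit Euclidean norm
col_norms <- sqrt(colSums(M0^2))
print(round(col_norms, 6))
```

Because the rows of DSM_Vectors were renormalized after the projection, dividing its columns by sigma only approximates a projection computed with power=0 from scratch.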
The vocabulary of this DSM covers several basic evaluation tasks, including RG65, WordSim353 and ESSLLI08_Nouns, as well as the target nouns bank and vessel from SemCorWSD. In addition, 40 nearest neighbours each of the words white_J, apple_N, kindness_N and walk_V are included.
Co-occurrence frequency data were extracted from a collection of Web corpora with a total size of ca. 9 billion words, using an L4/R4 surface window and 30,000 lexical words as feature terms. They were scored with sparse simple log-likelihood with an additional log transformation, normalized to Euclidean unit length, and projected into 1000 latent dimensions using randomized SVD (see rsvd). For size reasons, the vectors have been compressed into 50 latent dimensions and renormalized.
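The final compression step can be sketched in base R on a toy matrix. This assumes that compression means keeping the leading latent dimensions and then rescaling every row to Euclidean unit length; the text above does not spell out the exact procedure. The helper normalize_rows and the row labels are hypothetical.

```r
set.seed(1)
M <- matrix(rnorm(5 * 8), nrow = 5)    # toy stand-in for 1000-dimensional vectors
rownames(M) <- paste0("w", 1:5, "_N")  # hypothetical row labels

# Hypothetical helper: rescale each row to Euclidean unit length
# (dividing a matrix by a vector of length nrow(M) divides row i by element i)
normalize_rows <- function(M) M / sqrt(rowSums(M^2))

M <- normalize_rows(M)

# Assumed compression: keep the first 3 latent dimensions, then renormalize
M_small   <- normalize_rows(M[, 1:3, drop = FALSE])
row_norms <- sqrt(rowSums(M_small^2))

# For unit-length rows, cosine similarity is just the inner product
cos_sim <- M_small %*% t(M_small)
print(round(cos_sim, 3))
```

One consequence of the renormalization is that cosine similarities between rows can be read directly from inner products, without dividing by vector lengths.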
nearest.neighbours(DSM_Vectors, "walk_V", 25)

eval.similarity.correlation(RG65, DSM_Vectors) # fairly good

# post-hoc power scaling: whitening (corresponds to power=0 in dsm.projection)
sigma <- attr(DSM_Vectors, "sigma")
M <- scaleMargins(DSM_Vectors, cols=1 / sigma)
eval.similarity.correlation(RG65, M) # very good