select.lspace: Select Latent Semantic Spaces

select.lspace    R Documentation

Select Latent Semantic Spaces

Description

Retrieve information and links to latent semantic spaces (sets of word vectors/embeddings) available at osf.io/489he, and optionally download their term mappings (osf.io/xr7jv).

Usage

select.lspace(query = NULL, dir = getOption("lingmatch.lspace.dir"),
  terms = NULL, get.map = FALSE, check.md5 = TRUE, mode = "wb")

Arguments

query

A character vector used to select spaces by name or other features. If its length is greater than 1, get.map is set to TRUE. To select spaces by term coverage instead, leave query unspecified and supply terms.

dir

Path to a directory containing lma_term_map.rda and downloaded spaces; by default, the function looks in getOption('lingmatch.lspace.dir') and '~/Latent Semantic Spaces'.
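As a minimal sketch of setting the default directory through the option named above, the following uses only base R. The temporary path is a placeholder for illustration, not a location the package requires.

```r
# Point the lingmatch.lspace.dir option at a directory of your choice;
# a temporary directory stands in for a real location here.
space_dir <- file.path(tempdir(), "Latent Semantic Spaces")
dir.create(space_dir, showWarnings = FALSE, recursive = TRUE)
old_options <- options(lingmatch.lspace.dir = space_dir)

# This is the value the `dir` argument takes by default.
default_dir <- getOption("lingmatch.lspace.dir")

# Check whether a term map is already present in that directory.
has_map <- file.exists(file.path(default_dir, "lma_term_map.rda"))

# Restore the previous option value.
options(old_options)
```

Setting the option once per session avoids passing dir to every call.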

terms

A character vector of terms to look up in the downloaded term map, either to calculate each space's coverage of those terms or, if query is not specified, to select spaces by coverage.
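To make the idea of coverage concrete, here is a small sketch on a made-up term map with terms as row names, spaces as column names, and 0 marking an absent term. Treating coverage as the proportion of requested terms found in each space is an assumption for illustration; the function's own calculation may differ.

```r
# Toy term map (made-up values): rows are terms, columns are spaces,
# and 0 means the term is absent from that space.
term_map <- matrix(
  c(
    1L, 0L, 3L,
    2L, 0L, 0L,
    0L, 5L, 4L
  ),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    c("brexit", "debuffs", "i/o"),
    c("space_a", "space_b", "space_c")
  )
)
terms <- c("brexit", "i/o", "part-time")

# Only terms that appear in the map can be looked up.
found <- terms[terms %in% rownames(term_map)]

# Assumed definition of coverage: proportion of the requested terms
# that are present (non-zero) in each space.
coverage <- colSums(term_map[found, , drop = FALSE] != 0) / length(terms)

# Spaces ordered from highest to lowest coverage.
ranked <- sort(coverage, decreasing = TRUE)
```

The space names and indices here are hypothetical and do not correspond to real spaces.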

get.map

Logical; if TRUE and lma_term_map.rda is not found in dir, the term map is downloaded and decompressed.

check.md5

Logical; if TRUE (default), retrieves the MD5 checksum from OSF, and compares it with that calculated from the downloaded file to check its integrity.
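The kind of integrity check described here can be sketched with tools::md5sum from base R. The reference checksum below is recomputed from the same local file only so the example runs offline; in a real check it would come from the host, and how the function itself retrieves it is not shown here.

```r
# Write a small file to stand in for a downloaded term map.
path <- tempfile(fileext = ".txt")
writeLines("example content", path)

# Checksum computed from the local file.
local_md5 <- unname(tools::md5sum(path))

# Hypothetical reference checksum; recomputed from the same file here
# purely so the sketch is self-contained.
reference_md5 <- unname(tools::md5sum(path))

# A mismatch would suggest a corrupted or incomplete download.
md5_ok <- identical(local_md5, reference_md5)
if (!md5_ok) {
  warning("MD5 mismatch: the file may be corrupted or incomplete")
}
```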

mode

Passed to download.file when downloading the term map.

Value

A list with varying entries:

  • info: The version of osf.io/9yzca stored internally; a data.frame with spaces as row names, and information about each space in columns:

    • terms: number of terms in the space

    • corpus: corpus(es) on which the space was trained

    • model: model used to train the space

    • dimensions: number of dimensions in the model (columns of the space)

    • model_info: some parameter details about the model

    • original_max: maximum value used to normalize the space; the original space would be (vectors * original_max) / 100

    • osf_dat: OSF id for the .dat files; the URL is https://osf.io/ followed by this id

    • osf_terms: OSF id for the _terms.txt files; the URL is https://osf.io/ followed by this id

    • wiki: link to the wiki for the space

    • downloaded: path to the .dat file if downloaded, and '' otherwise.

  • selected: A subset of info selected by query.

  • term_map: If get.map is TRUE or lma_term_map.rda is found in dir, a copy of osf.io/xr7jv, which has space names as column names, terms as row names, and indices as values, with 0 indicating the term is not present in the associated space.
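As a rough illustration of how the original_max, osf_dat, and osf_terms columns might be used, the sketch below rescales a made-up set of vectors and builds URLs from placeholder ids. The space name, ids, and values are all hypothetical.

```r
# Made-up row of metadata in the shape described above.
info <- data.frame(
  original_max = 2.5,
  osf_dat = "abc12", # placeholder id, not a real OSF id
  osf_terms = "def34", # placeholder id, not a real OSF id
  row.names = "toy_space"
)

# Made-up normalized vectors (terms in rows, dimensions in columns).
vectors <- matrix(c(100, -50, 20, 0), nrow = 2)

# Recover the original scale: (vectors * original_max) / 100.
original <- (vectors * info["toy_space", "original_max"]) / 100

# Build download URLs by appending the ids to the OSF base URL.
dat_url <- paste0("https://osf.io/", info["toy_space", "osf_dat"])
terms_url <- paste0("https://osf.io/", info["toy_space", "osf_terms"])
```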

See Also

Other Latent Semantic Space functions: download.lspace(), lma_lspace(), standardize.lspace()

Examples

# just retrieve information about available spaces
spaces <- select.lspace()
spaces$info[1:10, c("terms", "dimensions", "original_max")]

# retrieve all spaces that used word2vec
w2v_spaces <- select.lspace("word2vec")$selected
w2v_spaces[, c("terms", "dimensions", "original_max")]

## Not run: 

# select spaces by terms
select.lspace(terms = c(
  "part-time", "i/o", "'cause", "brexit", "debuffs"
))$selected[, c("terms", "coverage")]

## End(Not run)

miserman/lingmatch documentation built on Jan. 19, 2024, 4:44 p.m.