lma_lspace: Latent Semantic Space (Embeddings) Operations
In lingmatch: Linguistic Matching and Accommodation

Description

Map a document-term matrix onto a latent semantic space, extract terms from a latent semantic space (if `dtm` is a character vector, or `map.space = FALSE`), or perform a singular value decomposition of a document-term matrix (if `dtm` is a matrix and `space` is missing).

Usage

lma_lspace(dtm = "", space, map.space = TRUE, fill.missing = FALSE,
  term.map = NULL, dim.cutoff = 0.5, keep.dim = FALSE,
  use.scan = FALSE, dir = getOption("lingmatch.lspace.dir"))

Arguments

`dtm`	A matrix with terms as column names, or a character vector of terms to be extracted from a specified space. If this is of length 1 and `space` is missing, it will be treated as `space`.
`space`	A matrix with terms as rownames. If missing, this will be the right singular vectors of a singular value decomposition of `dtm`. If a character, a file matching the character will be searched for in `dir` (e.g., `space = 'google'`). If a file is not found and the character matches one of the available spaces, you will be given the option to download it, as handled by `download.lspace`. If `dtm` is missing, the entire space will be loaded and returned.
`map.space`	Logical: if `FALSE`, the original vectors of `space` for terms found in `dtm` are returned. Otherwise `dtm %*% space` is returned, excluding columns of `dtm` and rows of `space` for terms that do not appear in both.
`fill.missing`	Logical: if `TRUE` and terms are being extracted from a space, includes terms not found in the space as rows of 0s, such that the returned matrix will have a row for every requested term.
`term.map`	A matrix with `space` as a column name, terms as row names, and indices of the terms in the given space as values, or a numeric vector of indices with terms as names, or a character vector of terms corresponding to rows of the space. This is used instead of reading in a "_terms.txt" file corresponding to a `space` entered as a character (the name of a space file).
`dim.cutoff`	If a `space` is calculated, this will be used to decide on the number of dimensions to be retained: `cumsum(d) / sum(d) < dim.cutoff`, where `d` is a vector of singular values of `dtm` (i.e., `svd(dtm)$d`). The default is `.5`; lower cutoffs result in fewer dimensions.
`keep.dim`	Logical: if `TRUE`, and a space is being calculated from the input, a matrix with the same dimensions as `dtm` is returned. Otherwise, a matrix with terms as rows and dimensions as columns is returned.
`use.scan`	Logical: if `TRUE`, reads in the rows of `space` with `scan`.
`dir`	Path to a folder containing spaces. Set a session default with `options(lingmatch.lspace.dir = 'desired/path')`.
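
As a rough illustration of the `dim.cutoff` rule, the base R sketch below applies `cumsum(d) / sum(d) < dim.cutoff` to made-up singular values. It is not the package's internal code: the values in `d`, the helper `n_under`, and the choice to always keep at least one dimension are assumptions made for the example, and how `lma_lspace` treats the dimension that crosses the cutoff is not shown here.

```r
# hypothetical singular values, standing in for svd(dtm)$d
d <- c(4, 3, 2, 1)

# cumulative proportion of the summed singular values
prop <- cumsum(d) / sum(d)

# hypothetical helper: count dimensions under a cutoff,
# keeping at least one dimension (an assumption of this sketch)
n_under <- function(cutoff) max(1, sum(prop < cutoff))

n_under(0.5)
n_under(0.8)
```

Under these assumptions, a lower cutoff never yields more dimensions than a higher one, which matches the direction described for `dim.cutoff`.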

Value

A matrix or sparse matrix with either (a) a row per term and column per latent dimension (a latent space, either calculated from the input, or retrieved when `map.space = FALSE`), (b) a row per document and column per latent dimension (when a dtm is mapped to a space), or (c) a row per document and column per term (when a space is calculated and `keep.dim = TRUE`).

Note

A traditional latent semantic space is a selection of right singular vectors from the singular value decomposition of a dtm (`svd(dtm)$v[, 1:k]`, where `k` is the selected number of dimensions, decided here by `dim.cutoff`).
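
The following base R sketch shows what selecting right singular vectors looks like on a toy matrix. The counts, term names, and the hand-picked `k` are all made up for illustration; in `lma_lspace`, the number of dimensions is decided by `dim.cutoff` rather than set directly.

```r
# toy document-term matrix: 3 documents by 4 terms (made-up counts)
dtm <- matrix(
  c(
    2, 0, 1, 0,
    0, 1, 0, 2,
    1, 1, 1, 1
  ),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("doc", 1:3), c("cat", "car", "pet", "wheel"))
)

s <- svd(dtm)

# keep the first k right singular vectors; k is fixed by hand here
k <- 2
space <- s$v[, seq_len(k), drop = FALSE]
rownames(space) <- colnames(dtm)

# terms as rows, latent dimensions as columns
dim(space)
```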

Mapping a new dtm into a latent semantic space consists of multiplying the entries for common terms: `dtm[, ct] %*% space[ct, ]`, where `ct = colnames(dtm)[colnames(dtm) %in% rownames(space)]` is the set of terms common to the dtm and the space. The result is a matrix with documents as rows and dimensions as columns, in place of terms.
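
A minimal base R sketch of this mapping, using a made-up dtm and a made-up two-dimensional space (neither comes from the package or from any pretrained space):

```r
# toy document-term matrix with made-up counts
dtm <- matrix(
  c(
    1, 2, 0,
    0, 1, 3
  ),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("doc1", "doc2"), c("cat", "dog", "turbo"))
)

# toy space with terms as row names;
# "turbo" is absent from it, and "fish" is not in the dtm
space <- matrix(
  c(
    1, 0,
    0, 1,
    1, 1
  ),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("cat", "dog", "fish"), c("dim1", "dim2"))
)

# terms shared by the dtm and the space
ct <- colnames(dtm)[colnames(dtm) %in% rownames(space)]

# documents as rows, latent dimensions as columns
mapped <- dtm[, ct, drop = FALSE] %*% space[ct, , drop = FALSE]
mapped
```

Terms that appear in only one of the two matrices ("turbo" and "fish" here) contribute nothing to the result, since only the shared terms in `ct` enter the product.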

Examples

text <- c(
  paste(
    "Hey, I like kittens. I think all kinds of cats really are just the",
    "best pet ever."
  ),
  paste(
    "Oh yeah? Well I really like cars. All the wheels and the turbos...",
    "I think that's the best ever."
  ),
  paste(
    "You know what? Poo on you. Cats, dogs, rabbits -- you know, living",
    "creatures... to think you'd care about anything else!"
  ),
  paste(
    "You can stick to your opinion. You can be wrong if you want. You know",
    "what life's about? Supercharging, diesel guzzling, exhaust spewing,",
    "piston moving ignitions."
  )
)

dtm <- lma_dtm(text)

# calculate a latent semantic space from the example text
lss <- lma_lspace(dtm)

# show that document similarities between the truncated and full space are the same
spaces <- list(
  full = lma_lspace(dtm, keep.dim = TRUE),
  truncated = lma_lspace(dtm, lss)
)
sapply(spaces, lma_simets, metric = "cosine")

## Not run: 

# specify a directory containing spaces,
# or where you would like to download spaces
space_dir <- "~/Latent Semantic Spaces"

# map to a pretrained space
ddm <- lma_lspace(dtm, "100k", dir = space_dir)

# load the matching subset of the space
# without mapping
lss_100k_part <- lma_lspace(colnames(dtm), "100k", dir = space_dir)

## or
lss_100k_part <- lma_lspace(dtm, "100k", map.space = FALSE, dir = space_dir)

# load the full space
lss_100k <- lma_lspace("100k", dir = space_dir)

## or
lss_100k <- lma_lspace(space = "100k", dir = space_dir)

## End(Not run)

lingmatch documentation built on May 29, 2024, 11:48 a.m.