subset_dsm: Subsetting Distributional Semantic Models (wordspace)

subset.dsm

R Documentation

Subsetting Distributional Semantic Models (wordspace)

Description

Filter the rows and/or columns of a DSM object according to user-specified conditions.

Usage


## S3 method for class 'dsm'
subset(x, subset, select, recursive = FALSE, drop.zeroes = FALSE,
       matrix.only = FALSE, envir = parent.frame(), run.gc = FALSE, ...)

Arguments

`x`	an object of class `dsm`
`subset`	Boolean expression or index vector selecting a subset of the rows; the expression can use variables `term` and `f` to access target terms and their marginal frequencies, `nnzero` for the number of nonzero elements in each row, further optional variables from the row information table, as well as global variables such as the sample size `N`
`select`	Boolean expression or index vector selecting a subset of the columns; the expression can use variables `term` and `f` to access feature terms and their marginal frequencies, `nnzero` for the number of nonzero elements in each column, further optional variables from the column information table, as well as global variables such as the sample size `N`
`recursive`	if `TRUE` and both `subset` and `select` conditions are specified, the subsetting is applied repeatedly until the DSM no longer changes. This is typically needed when the conditions involve nonzero counts or row/column norms, which may themselves be changed by the subsetting procedure.
`drop.zeroes`	if `TRUE`, all rows and columns without any nonzero entries after subsetting are removed from the model (nonzero counts are based on the score matrix S if available, raw cooccurrence frequencies M otherwise)
`matrix.only`	if `TRUE`, return only the selected subset of the score matrix S (if available) or frequency matrix M, not a full DSM object. This may conserve a substantial amount of memory when processing very large DSMs.
`envir`	environment in which the `subset` and `select` conditions are evaluated. Defaults to the context of the function call, so all variables visible there can be used in the expressions.
`run.gc`	whether to run the garbage collector after each iteration of a recursive subset (`recursive=TRUE`) in order to keep memory overhead as low as possible. This option should only be specified if memory is very tight, since garbage collector runs can be expensive (e.g. when there are many distinct strings in the workspace).
`...`	any further arguments are silently ignored
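
As a minimal sketch of how the `subset` and `select` arguments described above can be written, the following uses the `DSM_TermContext` sample model that ships with the package. It assumes that wordspace is installed and that the row table of this model carries marginal frequencies `f`; the threshold `max_len` is a hypothetical variable introduced only for illustration, and it is found because conditions are evaluated in the calling environment by default.

```r
library(wordspace)  # provides subset.dsm and the DSM_TermContext sample model

model <- DSM_TermContext

# Boolean expression: `term` refers to the target terms (rows).
# `max_len` is a hypothetical threshold taken from the calling environment.
max_len <- 4
short <- subset(model, nchar(term) <= max_len)

# Index vectors: keep the first three rows and the first two columns
corner <- subset(model, subset = 1:3, select = 1:2)

# Row condition on marginal frequencies via `f` (assumed to be present)
frequent <- subset(model, f >= median(f))

dim(short$M)
dim(corner$M)
dim(frequent$M)
```

A variable in the workspace that shares a name with a model variable such as `term`, `f` or `N` would be hidden by it inside the expression, so distinct names like `max_len` are the safer choice.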

Value

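An object of class `dsm` containing the specified subset of the model `x`. If `matrix.only=TRUE`, only the corresponding subset of the score matrix S (or frequency matrix M) is returned instead.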

If necessary, counts of nonzero elements for each row and/or column are updated automatically.
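
The difference between the two kinds of return value can be illustrated with a short sketch. It again assumes that wordspace is installed and uses the bundled `DSM_TermContext` model; the variable names `full` and `mat` are chosen here for illustration only.

```r
library(wordspace)  # provides subset.dsm and the DSM_TermContext sample model

model <- DSM_TermContext

# Default: a full DSM object whose row table matches the reduced matrix
full <- subset(model, nchar(term) <= 4)
inherits(full, "dsm")
nrow(full$rows) == nrow(full$M)

# matrix.only=TRUE: only the selected part of the matrix is returned
mat <- subset(model, nchar(term) <= 4, matrix.only = TRUE)
inherits(mat, "dsm")
all(dim(mat) == dim(full$M))

# Column nonzero counts of the reduced model, for inspection
full$cols$nnzero
```

Both calls select the same rows, so the matrix returned with `matrix.only=TRUE` should have the same dimensions as the `M` component of the full result.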

Author(s)

Stephanie Evert (https://purl.org/stephanie.evert)

Examples


print(DSM_TermContext$M)
model <- DSM_TermContext

subset(model, nchar(term) <= 4)$M     # short target terms
subset(model, select=(nnzero <= 3))$M # columns with <= 3 nonzero cells

subset(model, nchar(term) <= 4, nnzero <= 3)$M # combine both conditions

subset(model, nchar(term) <= 4, nnzero >= 2)$M # still three columns with nnzero < 2
subset(model, nchar(term) <= 4, nnzero >= 2, recursive=TRUE)$M
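
One way to check the effect of `recursive=TRUE` in the last example is to test that the result is a fixed point of the two conditions. This is only a sketch under the assumption that wordspace is installed and that the updated nonzero counts are stored in the `nnzero` column of the column table; the names `rec` and `again` are illustrative.

```r
library(wordspace)  # provides subset.dsm and the DSM_TermContext sample model

model <- DSM_TermContext

rec <- subset(model, nchar(term) <= 4, nnzero >= 2, recursive = TRUE)

# Every remaining column should now satisfy the column condition
all(rec$cols$nnzero >= 2)

# Applying the same conditions once more should not change the dimensions
again <- subset(rec, nchar(term) <= 4, nnzero >= 2)
all(dim(again$M) == dim(rec$M))
```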

wordspace documentation built on Aug. 23, 2022, 1:06 a.m.