eval_clustering: Evaluate DSM on Clustering Task (wordspace)
In wordspace: Distributional Semantic Models in R

eval.clustering

R Documentation

Evaluate DSM on Clustering Task (wordspace)

Description

Performs evaluation on a word clustering task by comparing a flat clustering solution based on semantic distances with a gold classification.

Usage


eval.clustering(task, M, dist.fnc = pair.distances, ...,
                details = FALSE, format = NA, taskname = NA,
                scale.entropy = FALSE, n.clusters = NA,
                word.name = "word", class.name = "class")

Arguments

`task`	a data frame listing words and their classes, usually in columns named `word` and `class`
`M`	a scored DSM matrix, passed to `dist.fnc`
`dist.fnc`	a callback function used to compute distances between word pairs. It will be invoked with character vectors containing the components of the word pairs as first and second argument, the DSM matrix `M` as third argument, plus any additional arguments (`...`) passed to `eval.clustering`. The return value must be a numeric vector of appropriate length. If one of the words in a pair is not represented in the DSM, the corresponding distance value should be set to `Inf`.
`...`	any further arguments are passed to `dist.fnc` and can be used e.g. to select a distance measure
`details`	if `TRUE`, a detailed report with information on each task item is returned (see “Value” below for details)
`format`	if the task definition specifies POS-disambiguated lemmas in CWB/Penn format, they can automatically be transformed into a different notation convention; see `convert.lemma` for details
`taskname`	optional row label for the short report (`details=FALSE`)
`scale.entropy`	whether to scale cluster entropy values to the range [0, 1]
`n.clusters`	number of clusters. The (very sensible) default is to generate as many clusters as there are classes in the gold standard.
`word.name`	the name of the column of `task` containing words
`class.name`	the name of the column of `task` containing gold standard classes
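The calling convention described for `dist.fnc` can be illustrated with a minimal sketch in base R. The function name `euclid.pairs` and the toy matrix are hypothetical and not part of wordspace; the sketch only mirrors the documented contract (two character vectors, the matrix, a numeric vector in return, `Inf` for unknown words).

```r
# Hypothetical callback: Euclidean distance between row vectors of M.
# w1, w2 are parallel character vectors holding the two words of each pair.
euclid.pairs <- function(w1, w2, M, ...) {
  stopifnot(length(w1) == length(w2))
  known <- w1 %in% rownames(M) & w2 %in% rownames(M)
  res <- rep(Inf, length(w1))  # Inf marks pairs with a word missing from M
  if (any(known)) {
    delta <- M[w1[known], , drop = FALSE] - M[w2[known], , drop = FALSE]
    res[known] <- sqrt(rowSums(delta^2))
  }
  res
}

# Toy matrix with three known words; "unicorn" is deliberately absent
M.toy <- rbind(cat = c(1, 0), dog = c(0, 1), car = c(3, 4))
d <- euclid.pairs(c("cat", "cat", "cat"), c("dog", "car", "unicorn"), M.toy)
print(d)
```

Euclidean distance is symmetric, so a callback of this kind satisfies the symmetry requirement on the distance measure.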

Details

The test words are clustered using the “partitioning around medoids” (PAM) algorithm (Kaufman & Rousseeuw 1990, Ch. 2) based on their semantic distances. The PAM algorithm is used because it works with arbitrary distance measures (including neighbour rank), produces a stable solution (unlike most iterative algorithms) and has been shown to be on par with state-of-the-art spherical k-means clustering (CLUTO) in evaluation studies.
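As a rough illustration of PAM on precomputed distances, the sketch below uses `pam()` from the `cluster` package on a toy distance matrix. It assumes the `cluster` package is installed; the data are invented, and this is not necessarily how `eval.clustering` invokes PAM internally.

```r
library(cluster)

# Four toy items forming two well-separated groups (invented data)
pts <- rbind(a = c(0, 0), b = c(0, 1), c = c(10, 10), d = c(10, 11))
dm <- dist(pts)        # any symmetric distance could be used here

# PAM works directly on the dissimilarities; k is the number of clusters
fit <- pam(dm, k = 2)
print(fit$clustering)  # cluster id for each item
print(fit$medoids)     # the items chosen as cluster representatives
```

Because PAM only needs pairwise dissimilarities, the same call works for measures that have no underlying vector representation.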

Each cluster is automatically assigned a majority label, i.e. the gold standard class occurring most frequently in the cluster. This represents the best possible classification that can be derived from the clustering.
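The majority labelling can be sketched in base R as follows. The cluster assignments and gold classes are invented, and ties are broken in favour of the first class, which is an assumption of this sketch rather than documented behaviour.

```r
# Invented cluster assignment and gold classes for six words
cluster <- c(1, 1, 1, 2, 2, 2)
gold <- factor(c("animal", "animal", "tool", "tool", "tool", "animal"))

# Contingency table: rows are clusters, columns are gold classes
tab <- table(cluster, gold)

# Majority label per cluster (which.max picks the first class on ties)
majority <- colnames(tab)[apply(tab, 1, which.max)]
names(majority) <- rownames(tab)

# Propagate each cluster's label to its members and compare with gold
label <- factor(unname(majority[as.character(cluster)]), levels = levels(gold))
correct <- label == gold
print(data.frame(cluster, label, gold, correct))
```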

As evaluation metrics, clustering purity (accuracy of the majority classification) and entropy are computed. The latter is defined as a weighted average over the entropy of the class distribution within each cluster, expressed in bits. If scale.entropy=TRUE, the value is divided by the overall entropy of the class distribution in the gold standard, scaling it to the range [0, 1].
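Both metrics can be reproduced by hand on a toy example. The sketch below uses base R and invented data; the helper `H` is a hypothetical name for the entropy of a count vector in bits, and the code follows the definitions above rather than the package's internal implementation.

```r
# Invented cluster assignment and gold classes for six words
cluster <- c(1, 1, 1, 2, 2, 2)
gold <- factor(c("animal", "animal", "tool", "tool", "tool", "animal"))
tab <- table(cluster, gold)

# Purity: share of words whose gold class equals their cluster's majority class
purity <- 100 * sum(apply(tab, 1, max)) / sum(tab)

# Entropy (in bits) of a distribution given as a vector of counts
H <- function(counts) {
  p <- counts[counts > 0] / sum(counts)
  -sum(p * log2(p))
}

# Weighted average of per-cluster entropies, weights = relative cluster sizes
weights <- rowSums(tab) / sum(tab)
entropy <- sum(weights * apply(tab, 1, H))

# Scaled variant: divide by the entropy of the gold class distribution
scaled <- entropy / H(colSums(tab))

print(c(purity = purity, entropy = entropy, scaled = scaled))
```

The weighted average is the conditional entropy of the class given the cluster, which can never exceed the entropy of the class distribution itself; this is why the scaled value falls in the range [0, 1].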

NB: The semantic distance measure selected with the extra arguments (...) should be symmetric. In particular, it is not very sensible to specify rank="fwd" or rank="bwd".

NB: Similarity measures are not supported by the current clustering algorithm. Make sure not to call dist.matrix (from dist.fnc) with convert=FALSE!

Value

The default short report (details=FALSE) is a data frame with a single row and the columns purity (clustering purity as a percentage), entropy (scaled or unscaled clustering entropy) and missing (number of words not found in the DSM).

The detailed report (details=TRUE) is a data frame with one row for each test word and the following columns:

`word`	the test word (character)
`cluster`	cluster to which the word has been assigned; all unknown words are collected in an additional cluster `"n/a"`
`label`	majority label of this cluster (factor with same levels as `gold`)
`gold`	gold standard class of the test word (factor)
`correct`	whether majority class assignment is correct (logical)
`missing`	whether word was not found in the DSM (logical)

Author(s)

Stephanie Evert (https://purl.org/stephanie.evert)

Examples


library(wordspace)  # provides ESSLLI08_Nouns and DSM_Vectors
eval.clustering(ESSLLI08_Nouns, DSM_Vectors, class.name="class2")
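A slightly longer sketch shows how the detailed report might be inspected. It assumes the wordspace package is installed and uses only the arguments and report columns documented on this page; the cross-tabulation is an illustrative addition, not part of the package's own examples.

```r
library(wordspace)

# Detailed report: one row per test word
res <- eval.clustering(ESSLLI08_Nouns, DSM_Vectors,
                       class.name = "class2", details = TRUE)
print(head(res))

# Cross-tabulate clusters against gold classes to see where they disagree
print(table(cluster = res$cluster, gold = res$gold))

# Share of words whose majority label matches the gold class, in percent
print(100 * mean(res$correct))
```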

wordspace documentation built on Aug. 23, 2022, 1:06 a.m.
