eval.clustering | R Documentation |
Performs evaluation on a word clustering task by comparing a flat clustering solution based on semantic distances with a gold classification.
eval.clustering(task, M, dist.fnc = pair.distances, ..., details = FALSE, format = NA, taskname = NA, scale.entropy = FALSE, n.clusters = NA, word.name = "word", class.name = "class")
task |
a data frame listing words and their classes, usually in columns named word and class |
M |
a scored DSM matrix, passed to dist.fnc |
dist.fnc |
a callback function used to compute distances between word pairs. It will be invoked with character vectors containing the components of the word pairs as first and second arguments, the DSM matrix M as third argument, and any further arguments (...) |
... |
any further arguments are passed to dist.fnc, e.g. to select a distance measure |
details |
if TRUE, a detailed report with one row for each test word is returned instead of the short report |
format |
if the task definition specifies POS-disambiguated lemmas in CWB/Penn format, they can automatically be transformed into a different notation convention; see convert.lemma |
taskname |
optional row label for the short report (details=FALSE) |
scale.entropy |
whether to scale cluster entropy values to the range [0, 1] |
n.clusters |
number of clusters. The (very sensible) default is to generate as many clusters as there are classes in the gold standard. |
word.name |
the name of the column of task containing the words |
class.name |
the name of the column of task containing the gold standard classes |
The test words are clustered using the “partitioning around medoids” (PAM) algorithm (Kaufman & Rousseeuw 1990, Ch. 2) based on their semantic distances. The PAM algorithm is used because it works with arbitrary distance measures (including neighbour rank), produces a stable solution (unlike most iterative algorithms) and has been shown to be on par with state-of-the-art spherical k-means clustering (CLUTO) in evaluation studies.
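As a rough illustration of PAM clustering on a precomputed distance matrix, the following sketch uses pam() from the cluster package. That choice of implementation is an assumption made for this example, and the words and distance values are made up.

```r
library(cluster)  # provides pam(); assumed here as the PAM implementation

# Hypothetical test words placed on a line so that two groups are well separated
words <- c("cat", "dog", "horse", "hammer", "saw", "drill")
pts <- c(0.0, 0.2, 0.4, 3.0, 3.2, 3.5)

# Symmetric distance matrix with the words as row and column labels
D <- as.matrix(dist(pts))
dimnames(D) <- list(words, words)

# PAM works directly on dissimilarities, so any symmetric distance measure can be used
fit <- pam(as.dist(D), k = 2, diss = TRUE)

fit$clustering  # named integer vector: cluster number for each word
fit$medoids     # labels of the words chosen as cluster medoids
```

Because the medoids are actual test words, the solution can be inspected directly, which is one practical advantage over centroid-based methods.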
Each cluster is automatically assigned a majority label, i.e. the gold standard class occurring most frequently in the cluster. This represents the best possible classification that can be derived from the clustering.
As evaluation metrics, clustering purity (accuracy of the majority classification) and entropy are computed.
The latter is defined as a weighted average over the entropy of the class distribution within each cluster, expressed in bits.
If scale.entropy=TRUE, the value is divided by the overall entropy of the class distribution in the gold standard, scaling it to the range [0, 1].
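The metrics described above can be sketched in base R as follows. This is a minimal illustration of the definitions, not the package's own code; the cluster assignments, class labels and the helper entropy.bits are made up for the example.

```r
# Toy data (made up): cluster assignments and gold standard classes for eight words
cluster <- c(1, 1, 1, 2, 2, 2, 2, 3)
gold <- factor(c("animal", "animal", "tool", "tool", "tool", "tool", "animal", "fruit"))

# Contingency table: rows = clusters, columns = gold classes
tbl <- table(cluster, gold)

# Majority label of each cluster = most frequent gold class in that cluster
majority <- colnames(tbl)[apply(tbl, 1, which.max)]

# Purity: percentage of words whose gold class matches the majority label of their cluster
purity <- 100 * sum(apply(tbl, 1, max)) / sum(tbl)

# Hypothetical helper: entropy (in bits) of a vector of counts, skipping empty cells
entropy.bits <- function(counts) {
  p <- counts[counts > 0] / sum(counts)
  -sum(p * log2(p))
}

# Clustering entropy: per-cluster entropies averaged with weights proportional to cluster size
cluster.sizes <- rowSums(tbl)
entropy <- sum(cluster.sizes / sum(tbl) * apply(tbl, 1, entropy.bits))

# Scaled variant: divide by the entropy of the overall gold class distribution
entropy.scaled <- entropy / entropy.bits(colSums(tbl))
```

In this toy example six of the eight words carry the majority label of their cluster, so purity is 75. The scaled entropy cannot exceed 1 because the class entropy within clusters is never larger on average than the entropy of the overall class distribution.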
NB: The semantic distance measure selected with the extra arguments (...) should be symmetric. In particular, it is not very sensible to specify rank="fwd" or rank="bwd".
NB: Similarity measures are not supported by the current clustering algorithm. Make sure not to call dist.matrix (from dist.fnc) with convert=FALSE!
The default short report (details=FALSE) is a data frame with a single row and the columns purity (clustering purity as a percentage), entropy (scaled or unscaled clustering entropy) and missing (number of words not found in the DSM).
The detailed report (details=TRUE) is a data frame with one row for each test word and the following columns:
word |
the test word (character) |
cluster |
cluster to which the word has been assigned; all unknown words are collected in an additional cluster |
label |
majority label of this cluster (factor with same levels as gold) |
gold |
gold standard class of the test word (factor) |
correct |
whether majority class assignment is correct (logical) |
missing |
whether word was not found in the DSM (logical) |
Stephanie Evert (https://purl.org/stephanie.evert)
Suitable gold standard data sets in this package: ESSLLI08_Nouns
Support functions: pair.distances, convert.lemma
eval.clustering(ESSLLI08_Nouns, DSM_Vectors, class.name="class2")
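To inspect individual clustering decisions, the same call can be made with details=TRUE. The sketch below assumes the wordspace package is installed and uses only the arguments and report columns documented on this page; the variable name res is arbitrary.

```r
library(wordspace)

# Detailed report: one row per test word instead of the single-row summary
res <- eval.clustering(ESSLLI08_Nouns, DSM_Vectors, class.name = "class2",
                       details = TRUE)
head(res)

# Cross-tabulate gold standard classes against the majority labels of the clusters
table(res$gold, res$label)

# Test words that ended up in a cluster with a different majority class
res$word[!res$correct]
```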