clusterMatch | R Documentation |
Creates properly sized clusters for matching, using either
alphabetical or word embedding clustering. If using word embedding,
the function first creates a word embedding out of the provided
vectors, and then runs PCA on the resulting matrix. It then takes the
first k dimensions (where k is provided by the user) and runs k-means
on that reduced matrix to get the clusters.
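The word embedding route described above can be sketched in base R. This is an illustrative sketch only, not the package's internal code: the letter-count embedding, the sample names, and the variable names (`names_vec`, `embed`, `k`) are all assumptions made for the example.

```r
# Illustrative sketch: embedding -> PCA -> first k dimensions -> k-means.
# The letter-count "embedding" below is an assumption, not the package's method.
set.seed(1)
names_vec <- c("anna", "anne", "hannah", "bob", "bobby",
               "rob", "robert", "zoe", "zoey", "joe")

# One row per string, one column per letter; entries are letter counts.
embed <- t(vapply(names_vec, function(s) {
  as.integer(table(factor(strsplit(s, "")[[1]], levels = letters)))
}, integer(length(letters))))
colnames(embed) <- letters

# Principal components of the embedding matrix.
pca <- prcomp(embed)

# Keep the first k principal component scores (k chosen by the user here).
k <- 2
scores <- pca$x[, seq_len(k), drop = FALSE]

# Cluster the reduced matrix with k-means.
km <- kmeans(scores, centers = 3, iter.max = 100, nstart = 5)
clusters <- km$cluster
```

Each entry of `clusters` is the cluster label for the corresponding string in `names_vec`.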
clusterMatch(vecA, vecB, nclusters, max.n, word.embed, min.var, iter.max)
vecA |
The character vector from dataset A |
vecB |
The character vector from dataset B |
nclusters |
The number of clusters to create from the provided data. Specify either nclusters or max.n and leave the other as NULL. |
max.n |
The maximum size of either dataset A or dataset B in the largest cluster. Specify either nclusters or max.n and leave the other as NULL. |
word.embed |
Whether to use word embedding clustering. Default is FALSE. |
min.var |
The minimum proportion of explained variance (between 0 and 1) a PCA dimension must provide in order to be included in k-means clustering when using word embedding. Default is 0.20. |
iter.max |
Maximum number of iterations for the k-means algorithm. |
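To make the role of min.var concrete, the sketch below keeps only those principal components whose share of explained variance is at least the threshold. It assumes the threshold is applied per component to `sdev^2 / sum(sdev^2)`; that reading, the simulated matrix `x`, and the fallback to one dimension are assumptions for illustration, not a description of the package's internals.

```r
# Illustrative sketch of a min.var-style rule for choosing PCA dimensions.
set.seed(42)

# Hypothetical numeric matrix standing in for an embedding (rows = strings).
x <- cbind(rnorm(50, sd = 5), rnorm(50, sd = 3),
           rnorm(50, sd = 1), rnorm(50, sd = 0.5))

pca <- prcomp(x)

# Proportion of total variance explained by each principal component.
prop_var <- pca$sdev^2 / sum(pca$sdev^2)

# Keep components that each explain at least min.var of the variance.
min.var <- 0.20
keep <- which(prop_var >= min.var)

# Assumed fallback: always retain at least one dimension.
dims_pca <- max(1, length(keep))
scores <- pca$x[, seq_len(dims_pca), drop = FALSE]
```

Because `prcomp` orders components by decreasing variance, the retained components are always the leading ones, so `scores` can be passed directly to `kmeans`.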
clusterMatch returns a list with the following elements:
clusterA |
The cluster assignments for dataset A |
clusterB |
The cluster assignments for dataset B |
n.clusters |
The number of clusters created |
kmeans |
The k-means object output. |
pca |
The PCA object output. |
dims.pca |
The number of dimensions from PCA used for the k-means clustering. |
Ben Fifield <benfifield@gmail.com>
data(samplematch)
cl <- clusterMatch(dfA$firstname, dfB$firstname, nclusters = 3)
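Once cluster assignments are available, a common next step is to split both datasets by cluster so that matching runs within each block. The sketch below uses a hand-made stand-in for the returned list and two toy data frames so that it runs on its own; the assignments and data are invented, and it assumes clusterA and clusterB are vectors aligned with the rows of dfA and dfB.

```r
# Stand-in for the value returned by clusterMatch(); assignments are made up.
cl <- list(clusterA = c(1, 2, 1, 3, 2),
           clusterB = c(2, 2, 3, 1),
           n.clusters = 3)

# Toy data frames standing in for dfA and dfB.
dfA <- data.frame(firstname = c("anna", "bob", "amy", "zoe", "carl"),
                  stringsAsFactors = FALSE)
dfB <- data.frame(firstname = c("ben", "bill", "zed", "al"),
                  stringsAsFactors = FALSE)

# One block per cluster, holding the matching rows from each dataset.
blocks <- lapply(seq_len(cl$n.clusters), function(i) {
  list(dfA = dfA[cl$clusterA == i, , drop = FALSE],
       dfB = dfB[cl$clusterB == i, , drop = FALSE])
})

# Rows per dataset in each block, useful for checking block sizes.
sizes <- sapply(blocks, function(b) c(nA = nrow(b$dfA), nB = nrow(b$dfB)))
```

Each element of `blocks` can then be passed to a matching routine separately, which keeps the number of pairwise comparisons per run bounded by the block sizes in `sizes`.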