dAllocate: Allocation of observations to pre-established cluster...


View source: R/dAllocate.R

Description

Observations of a dataset are allocated to a set of pre-established cluster centers. This is intended for the test set in train-test situations, where the cluster centers come from a depeche run on the training set.
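To illustrate the general idea of allocating observations to fixed centers, here is a minimal base R sketch of nearest-center assignment by squared Euclidean distance. The helper `allocateToNearest` and the toy data are hypothetical and not part of DepecheR; dAllocate itself may differ in its details, for example in how it treats sparsed-away variables or the log2 transformation.

```r
# Hypothetical helper, not part of DepecheR: assign each row of `x`
# to the nearest row of `centers` by squared Euclidean distance.
allocateToNearest <- function(x, centers) {
    x <- as.matrix(x)
    centers <- as.matrix(centers)
    # Compute one column of squared distances per center
    distMat <- vapply(
        seq_len(nrow(centers)),
        function(i) rowSums(sweep(x, 2, centers[i, ])^2),
        numeric(nrow(x))
    )
    # vapply returns a plain vector when x has a single row,
    # so force the result back into a matrix
    distMat <- matrix(distMat, nrow = nrow(x))
    # Index of the smallest distance in each row
    max.col(-distMat, ties.method = "first")
}

# Toy data: two centers and three observations in two variables
centers <- rbind(c(0, 0), c(10, 10))
obs <- rbind(c(1, -1), c(9, 11), c(0.5, 0.2))
allocation <- allocateToNearest(obs, centers)
```

The result has one entry per observation, holding the row index of the closest center.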

Usage

dAllocate(inDataFrame, clusterCenters, log2Off = FALSE, noZeroNum = TRUE)

Arguments

inDataFrame

A dataframe or matrix with the data that is to be allocated to the cluster centers. This data should be scaled in the same way as the data was scaled when it entered the original depeche run, i.e. in the normal case, not at all.

clusterCenters

A matrix that needs to be inherited from a depeche run. It contains information about which clusters and variables have been sparsed away, and where the cluster centers are located for the remaining clusters and variables.

log2Off

Whether the automatic detection of high kurtosis, and the log2 transformation that follows from it, should be turned off. Defaults to FALSE.

noZeroNum

For internal use. Controls whether the internal algorithm returns a cluster with number 0.
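For the clusterCenters argument, it can be useful to check which variables and clusters have been sparsed away before allocating new data. The sketch below assumes that sparsed-away entries are stored as exact zeros and that rows are clusters and columns are variables; both are assumptions about the matrix layout, and the matrix itself is a made-up example rather than real depeche output.

```r
# Hypothetical center matrix: rows are clusters, columns are variables.
# Zeros are assumed to mark entries that were sparsed away.
centers <- matrix(
    c(1.2, 0, 0,
      0,   0, 3.4,
      0,   0, 0),
    nrow = 3, byrow = TRUE,
    dimnames = list(c("1", "2", "3"), c("CD3", "CD4", "CD8"))
)

# Variables that are zero for every cluster
droppedVars <- colnames(centers)[colSums(centers != 0) == 0]

# Clusters that are zero in every variable
droppedClusters <- rownames(centers)[rowSums(centers != 0) == 0]
```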

Value

A vector with the same length as the number of rows in inDataFrame, giving the cluster identity of each observation.
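Since the return value is a plain vector of cluster identities, it can be summarized with standard base R tools. A brief sketch, using a made-up vector as a stand-in for real dAllocate output:

```r
# Hypothetical allocation vector, standing in for dAllocate output
clusterVector <- c(1, 2, 1, 3, 1, 2)

# Number of observations per cluster
clusterCounts <- table(clusterVector)

# Fraction of observations per cluster
clusterFractions <- prop.table(clusterCounts)
```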

See Also

depeche

Examples

# Retrieve some example data
data(testData)
## Not run: 
# Now arbitrarily (for the sake of the example) divide the data into a
# training- and a test set.
testDataSample <- sample(1:nrow(testData), size = 10000)
testDataTrain <- testData[testDataSample, ]
testDataTest <- testData[-testDataSample, ]

# Run the depeche function for the train set

x_depeche_train <- depeche(testDataTrain[, 2:15],
    maxIter = 20,
    sampleSize = 1000
)

# Allocate the test dataset to the centers of the train dataset
x_depeche_test <- dAllocate(testDataTest[, 2:15],
    clusterCenters = x_depeche_train$clusterCenters
)

# And finally plot the two groups to see how great the overlap was:
trainTablePerId <- apply(as.matrix(table(
    testDataTrain$ids,
    x_depeche_train$clusterVector
)), 1, function(x) x / sum(x))
trainTableCollapsed <- apply(trainTablePerId, 1, sum)
trainTableFraction <- trainTableCollapsed / sum(trainTableCollapsed)
testTablePerId <- apply(
    as.matrix(table(testDataTest$ids, x_depeche_test)),
    1, function(x) x / sum(x)
)
testTableCollapsed <- apply(testTablePerId, 1, sum)
testTableFraction <- testTableCollapsed / sum(testTableCollapsed)
xmatrix <- t(cbind(trainTableFraction, testTableFraction))
library(gplots)
barplot2(xmatrix, beside = TRUE, legend = rownames(xmatrix))
title(main = "Difference between train and test set")
title(xlab = "Clusters")
title(ylab = "Fraction")

## End(Not run)

DepecheR documentation built on Nov. 8, 2020, 5:44 p.m.