labeling: labeling for clusters

View source: R/labeling.R

labeling R Documentation

labeling for clusters

Description

This function is used to label the clusters obtained from cd.cluster.

Usage

labeling(Y, Q, cd.cluster.object, method = c("2b","2a","1","3"),perm=NULL)

Arguments

Y

A required N \times J response matrix with binary elements (1=correct, 0=incorrect), where N is the number of examinees and J is the number of items.

Q

A required J \times K binary item-by-attribute association matrix (Q-matrix), where K is the number of attributes. The j^{th} row of the matrix is an indicator vector, with 1 indicating that an attribute is required to master item j and 0 indicating that it is not.

cd.cluster.object

An object returned by cd.cluster.

method

The algorithm used for labeling. It should be one of "1", "2a", "2b" or "3", corresponding to the four labeling methods in Chiu and Ma (2013). The default is "2b". See Details for more information.

perm

The data matrix of the partial orders of the attribute patterns. The default is NULL; it should be provided when method="1" (see Details).

Details

Because cluster analysis such as K-means or HACA can only classify examinees into unlabeled clusters, labeling algorithms are needed to identify the underlying attribute patterns of each latent cluster. Four labeling algorithms proposed in Chiu and Ma (2013) can be implemented using this function.

The first method is the Inconsistency Index method (method="1"). The Inconsistency Index, IC, quantifies how far an ordering of clusters induced by a specific \bm{W} (see Details in cd.cluster) deviates from the arrangement of clusters suggested by simple assumptions about the (possible) underlying model. Among all feasible assignments of attribute patterns to clusters, the one that minimizes IC is chosen. Refer to Chiu and Ma (2013) for details. Because this method tends to be time-consuming when K is large, only the cases K=3 and K=4 are implemented in the current function. To use this algorithm, the partial order matrix of the attribute patterns should be provided. See perm for details.

For method="2a" and method="2b", the label of a latent class is obtained by minimizing the average distance between observed responses and ideal responses. Specifically, let \bm{y}=(y_1, y_2, \ldots, y_J) be the observed response pattern for a particular examinee and \bm{\eta}=(\eta_1,\eta_2,\ldots,\eta_J) be the ideal response pattern corresponding to a particular attribute pattern \bm{\alpha}. The weighted Hamming distance d between \bm{y} and \bm{\eta} is given by

d(\bm{y}, \bm{\eta})=\sum_{j=1}^J\frac{1}{\bar{p}_j(1-\bar{p}_j)}|y_j-\eta_j|,

where \bar{p}_j denotes the proportion correct on the j^{th} item. Then the best label or attribute pattern (\hat{\bm{\alpha}}) can be obtained through

\hat{\bm{\alpha}}=\arg\min_{\bm{\alpha} \in \Omega}D,

where D is the average weighted Hamming distance within each cluster and \Omega is the set of candidate attribute patterns \bm{\alpha}. In practice, the largest cluster is labeled first and the smallest cluster is labeled last.
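The two formulas above can be traced with a minimal base-R sketch. It is not the ACTCD implementation: the toy data, the helper names ideal_response and weighted_hamming, and the conjunctive rule for the ideal response (an item is answered correctly only when every attribute it requires is mastered) are assumptions made for illustration.

```r
# Toy Q-matrix: J = 3 items, K = 2 attributes (hypothetical data)
Q <- matrix(c(1, 0,
              0, 1,
              1, 1), ncol = 2, byrow = TRUE)

# Toy responses of N = 4 examinees (hypothetical data)
Y <- matrix(c(1, 0, 0,
              1, 0, 0,
              1, 1, 1,
              0, 1, 0), ncol = 3, byrow = TRUE)

# All 2^K candidate attribute patterns (the set Omega), one per row
K <- ncol(Q)
Omega <- as.matrix(expand.grid(rep(list(0:1), K)))
dimnames(Omega) <- NULL

# Assumed conjunctive ideal response:
# eta_j = 1 only if every attribute required by item j is mastered
ideal_response <- function(alpha, Q) {
  as.numeric(Q %*% alpha == rowSums(Q))
}

# Weighted Hamming distance with item weights 1 / (p_j * (1 - p_j))
weighted_hamming <- function(y, eta, p_bar) {
  sum(abs(y - eta) / (p_bar * (1 - p_bar)))
}

p_bar <- colMeans(Y)  # proportion correct on each item

# Average distance D between one cluster (examinees 1 and 2)
# and the ideal response of each candidate pattern
cluster_rows <- c(1, 2)
D <- apply(Omega, 1, function(alpha) {
  eta <- ideal_response(alpha, Q)
  mean(apply(Y[cluster_rows, , drop = FALSE], 1,
             weighted_hamming, eta = eta, p_bar = p_bar))
})

# The label is the candidate pattern with the smallest average distance
alpha_hat <- Omega[which.min(D), ]
```

In this toy cluster both examinees answered only item 1 correctly, so the pattern (1, 0) reproduces their responses exactly, has D = 0, and is selected. Note that the weights are undefined for an item that every examinee answers correctly or incorrectly, since \bar{p}_j(1-\bar{p}_j) is then 0.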

For method="2a", the selected label \bm{\alpha} will be eliminated from \Omega after each labeling iteration, implying that different clusters will obtain different labels.

For method="2b", the selected label \bm{\alpha} will not be eliminated from \Omega after each labeling iteration, implying that different clusters may obtain the same label.
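The difference between the two variants can be sketched in base R as a loop that labels clusters from largest to smallest, with or without removing a label once it has been used. This is an illustration only, not the package code: the distance matrix D, the cluster sizes, and the helper label_clusters are hypothetical.

```r
# Hypothetical average distances D: rows are clusters, columns are candidate patterns
D <- matrix(c(0.5, 2.0, 3.0, 4.0,
              0.8, 1.0, 3.5, 2.5,
              3.0, 2.2, 0.4, 1.9),
            nrow = 3, byrow = TRUE,
            dimnames = list(paste0("cluster", 1:3), c("00", "10", "01", "11")))

# Hypothetical cluster sizes
sizes <- c(cluster1 = 40, cluster2 = 25, cluster3 = 35)

label_clusters <- function(D, sizes, eliminate) {
  labels <- setNames(rep(NA_character_, nrow(D)), rownames(D))
  available <- colnames(D)
  # Label the largest cluster first and the smallest last
  for (cl in names(sort(sizes, decreasing = TRUE))) {
    best <- available[which.min(D[cl, available])]
    labels[cl] <- best
    # "2a"-style: a label that has been used is no longer available
    if (eliminate) available <- setdiff(available, best)
  }
  labels
}

labels_2a <- label_clusters(D, sizes, eliminate = TRUE)   # distinct labels
labels_2b <- label_clusters(D, sizes, eliminate = FALSE)  # labels may repeat
```

With these made-up distances, cluster2 is closest to "00" but is labeled last. With elimination it falls back to "10" because "00" was already taken by cluster1, whereas without elimination it shares "00" with cluster1.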

method="3" combines the partial-order technique with method "2a", so that some labels can be eliminated from \Omega before each labeling iteration. Refer to Chiu and Ma (2013) for details.

It should be noted that methods "1", "2a" and "3" all assume that different latent clusters are distinct in nature, which means different clusters will be given different labels using these methods. Method "2b" relaxes this assumption and allows the same label for different clusters. In addition, methods "1" and "3" may be used only when the number of clusters is 2^K. If that is not the case, method "2a" or method "2b" should be used.

Value

att.pattern

An N \times K binary matrix of attribute patterns, where N is the number of examinees and K is the number of attributes.

att.dist

A 2^K \times 2 data frame, where the first column is the attribute pattern and the second column is its frequency.

References

Chiu, C. Y., Douglas, J. A., & Li, X. (2009). Cluster analysis for cognitive diagnosis: theory and applications. Psychometrika, 74(4), 633-665.

Chiu, C. Y., & Ma, W. (2013). Assignment of clusters to attribute profiles for cognitive diagnosis. Manuscript in preparation.

See Also

print.labeling, cd.cluster, npar.CDM

Examples

#Labeling based on simulated data and Q matrix
data(sim.dat)
data(sim.Q)

# Information about the dataset
N <- nrow(sim.dat) #number of examinees
J <- nrow(sim.Q) #number of items
K <- ncol(sim.Q) #number of attributes

# Assume 2^K latent clusters
cluster.obj <- cd.cluster(sim.dat, sim.Q)
# Different clusters may have the same attribute patterns
labeled.obj.2b <- labeling(sim.dat, sim.Q, cluster.obj, method="2b")
# Different clusters must have different attribute patterns
labeled.obj.2a <- labeling(sim.dat, sim.Q, cluster.obj, method="2a")
# labeling using method 1
data(perm3)  #since the number of attributes in this example is 3, perm3 is used here
labeled.obj.1 <- labeling(sim.dat, sim.Q, cluster.obj, method="1",perm=perm3)
remove(perm3) #remove perm3 

# Assume 5 attribute patterns exist
M <- 5
cluster.obj <- cd.cluster(sim.dat, sim.Q, method="HACA", HACA.cut=M) 
labeled.obj <- labeling(sim.dat, sim.Q, cluster.obj, method="2b")



ACTCD documentation built on Nov. 10, 2023, 1:12 a.m.