This article is a brief introduction to kcmeans.
library(kcmeans)
set.seed(51944)
To illustrate kcmeans, consider simulating a small dataset with a continuous outcome variable y, two observed predictors -- a categorical variable Z and a continuous variable X -- and an (unobserved) Gaussian error. As in Wiemann (2023), the reduced form has an unobserved lower-dimensional representation dependent on the latent categorical variable Z0.
# Sample parameters
nobs <- 800 # sample size
# Sample data
X <- rnorm(nobs)
Z <- sample(1:20, nobs, replace = TRUE)
Z0 <- Z %% 4 # lower-dimensional latent categorical variable
y <- Z0 + X + rnorm(nobs)
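Written out as equations, the simulation above corresponds to the following data generating process. This is only a restatement of the code; the subscript i and the symbol for the error term are notation introduced here for illustration.

```latex
\begin{aligned}
Z_i &\sim \text{Uniform}\{1, \dots, 20\}, \qquad
X_i \sim N(0, 1), \qquad
\varepsilon_i \sim N(0, 1), \\
Z_{0,i} &= Z_i \bmod 4 \in \{0, 1, 2, 3\}, \\
y_i &= Z_{0,i} + X_i + \varepsilon_i .
\end{aligned}
```

The 20 observed categories of Z therefore collapse to only 4 distinct latent categories, which is the structure kcmeans is meant to recover.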
The kcmeans estimator is then computed by passing the categorical feature together with the continuous feature as a single matrix. By default, the categorical feature is taken to be the first column. Alternatively, the column corresponding to the categorical feature can be set via the which_is_cat argument. Computation is very quick -- indeed, the dynamic programming algorithm of the underlying Ckmeans.1d.dp package is polynomial in the number of values taken by the categorical feature Z. See also ?kcmeans for details.
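To build intuition for what grouping the values of a categorical feature means, the following base-R sketch averages the outcome within each category of Z and then clusters those 20 averages into 4 groups. It is a simplified illustration, not the kcmeans implementation: stats::kmeans is a heuristic swapped in for the exact dynamic programming solver, the continuous feature is ignored (reasonable here only because X is simulated independently of Z), and the names cat_means, cluster_of_cat, and Z0_sketch are hypothetical.

```r
# Illustrative sketch only: NOT the kcmeans implementation.
set.seed(1)
nobs <- 800
X <- rnorm(nobs)
Z <- sample(1:20, nobs, replace = TRUE)
y <- (Z %% 4) + X + rnorm(nobs)

# Step 1: average the outcome within each observed category of Z.
cat_means <- tapply(y, Z, mean)

# Step 2: group the category means into K = 4 clusters.
# stats::kmeans is a heuristic stand-in for an exact 1-d solver.
km <- kmeans(as.numeric(cat_means), centers = 4, nstart = 25)

# Step 3: map each observation to the cluster of its category.
cluster_of_cat <- setNames(km$cluster, names(cat_means))
Z0_sketch <- cluster_of_cat[as.character(Z)]

# Cross-tabulate against the true latent category.
# Cluster labels from kmeans are arbitrary, so compare up to relabeling.
table(truth = Z %% 4, sketch = Z0_sketch)
```

Because the clustering operates on one value per category rather than on all observations, the cost of this step depends on the number of categories and not on the sample size.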
system.time({
  kcmeans_fit <- kcmeans(y = y, X = cbind(Z, X), K = 4)
})
##    user  system elapsed
##   0.784   0.027   0.668
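When the categorical feature is not the first column, the which_is_cat argument mentioned above can point to it. The sketch below assumes that which_is_cat accepts an integer column index; X_alt and fit_alt are illustrative names, and the fit is only attempted when the kcmeans package is installed.

```r
# Sketch: same simulated design, with the columns of the feature matrix swapped.
set.seed(51944)
nobs <- 800
X <- rnorm(nobs)
Z <- sample(1:20, nobs, replace = TRUE)
y <- (Z %% 4) + X + rnorm(nobs)

# Continuous feature first, categorical feature second.
X_alt <- cbind(X, Z)

# Assumption: which_is_cat takes the column index of the categorical feature.
if (requireNamespace("kcmeans", quietly = TRUE)) {
  fit_alt <- kcmeans::kcmeans(y = y, X = X_alt, which_is_cat = 2, K = 4)
}
```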
We may now use the predict.kcmeans method to construct fitted values and/or compute predictions of the lower-dimensional latent categorical feature Z0. See also ?predict.kcmeans for details.
# Predicted values for the outcome + R^2
y_hat <- predict(kcmeans_fit, cbind(Z, X))
round(1 - mean((y - y_hat)^2) / mean((y - mean(y))^2), 3)
## [1] 0.695
# Predicted values for the latent categorical feature + misclassification rate
Z0_hat <- predict(kcmeans_fit, cbind(Z, X), clusters = TRUE) - 1
mean((Z0 - Z0_hat) != 0)
## [1] 0
Finally, it is also straightforward to compute standard errors for the final coefficients, e.g., using summary.lm:
# Compute the linear regression object and call summary.lm
lm_fit <- lm(y ~ as.factor(Z0_hat) + X)
summary(lm_fit)
##
## Call:
## lm(formula = y ~ as.factor(Z0_hat) + X)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -3.1205 -0.6916  0.0544  0.6700  3.4201
##
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)
## (Intercept)         0.03897    0.07434   0.524      0.6
## as.factor(Z0_hat)1  0.88393    0.10265   8.611   <2e-16 ***
## as.factor(Z0_hat)2  1.88314    0.10271  18.334   <2e-16 ***
## as.factor(Z0_hat)3  3.01094    0.10636  28.310   <2e-16 ***
## X                   1.04636    0.03541  29.549   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.03 on 795 degrees of freedom
## Multiple R-squared:  0.6954, Adjusted R-squared:  0.6939
## F-statistic: 453.7 on 4 and 795 DF,  p-value: < 2.2e-16
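If only the estimates and standard errors are needed, rather than the full printed summary, they can be pulled from the coefficient matrix of the summary object. The sketch below uses a small self-contained simulation in place of the objects above; g, fit, and coef_table are hypothetical names, and g stands in for an already estimated latent category.

```r
# Self-contained stand-in for the regression above.
set.seed(1)
n <- 200
g <- sample(0:3, n, replace = TRUE) # stand-in for an estimated latent category
x <- rnorm(n)
y <- g + x + rnorm(n)
fit <- lm(y ~ as.factor(g) + x)

# Coefficient matrix with columns Estimate, Std. Error, t value, Pr(>|t|).
coef_table <- coef(summary(fit))
coef_table[, c("Estimate", "Std. Error")]

# Conventional 95% confidence intervals for the same coefficients.
confint(fit, level = 0.95)
```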
Wiemann T (2023). "Optimal Categorical Instruments." https://arxiv.org/abs/2311.17021