```r
knitr::opts_chunk$set(echo = TRUE)
library(NNS)
library(data.table)
data.table::setDTthreads(1L)
options(mc.cores = 1)
RcppParallel::setThreadOptions(numThreads = 1)
Sys.setenv("OMP_THREAD_LIMIT" = 1)
```
```r
library(NNS)
library(data.table)
require(knitr)
require(rgl)
```
Below are some examples demonstrating unsupervised learning with NNS clustering and nonlinear regression using the resulting clusters. As always, for a more thorough description and definition, please view the References.
# NNS Partitioning `NNS.part()`

**`NNS.part`** is both a partitional and hierarchical clustering method. NNS iteratively partitions the joint distribution into partial moment quadrants, and then assigns a quadrant identification (1:4) at each partition.
NNS.part returns a data.table of observations along with their final quadrant identification. It also returns the regression points, which are the quadrant means used in NNS.reg.
```r
x = seq(-5, 5, .05); y = x ^ 3

for(i in 1 : 4){NNS.part(x, y, order = i, Voronoi = TRUE, obs.req = 0)}
```
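The returned values can also be inspected directly. A minimal sketch follows; the element names `$dt` and `$regression.points` are assumed here, per the `NNS.part` documentation:

```r
# Inspect the partitioning output: observations with their quadrant
# identifications, and the regression points (quadrant means) used by NNS.reg.
# The list element names $dt and $regression.points are assumed.
part <- NNS.part(x, y, order = 2, obs.req = 0)

head(part$dt)           # observations with final quadrant identification
part$regression.points  # quadrant means used as regression points
```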
`NNS.part` offers a partitioning based on $x$ values only, `NNS.part(x, y, type = "XONLY", ...)`, using the entire bandwidth in its regression point derivation, and shares the same limit condition as partitioning via both $x$ and $y$ values.
```r
for(i in 1 : 4){NNS.part(x, y, order = i, type = "XONLY", Voronoi = TRUE)}
```
Note the partition identifications are limited to 1's and 2's (left and right of the partition, respectively), rather than the four values produced by the $x$ and $y$ partitioning.
```r
NNS.part(x, y, order = 4, type = "XONLY")
```
The right column of plots shows the corresponding regression (plus endpoints and central point) for the order of NNS partitioning.
```r
for(i in 1 : 3){
  NNS.part(x, y, order = i, obs.req = 0, Voronoi = TRUE, type = "XONLY")
  NNS.reg(x, y, order = i, ncores = 1)
}
```
# NNS Regression `NNS.reg()`

**`NNS.reg`** can fit any $f(x)$, for both uni- and multivariate cases. **`NNS.reg`** returns a self-evident list of values provided below.

## Univariate:

```r
NNS.reg(x, y, ncores = 1)
```
## Multivariate:

Multivariate regressions return a plot of $y$ and $\hat{y}$, as well as the regression points (`$RPM`) and partitions (`$rhs.partitions`) for each regressor.
```r
f = function(x, y) x ^ 3 + 3 * y - y ^ 3 - 3 * x
y = x ; z <- expand.grid(x, y)
g = f(z[ , 1], z[ , 2])
NNS.reg(z, g, order = "max", plot = FALSE, ncores = 1)
```
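The `$RPM` and `$rhs.partitions` elements named above can be inspected by storing the fit; a minimal sketch (the object name `mv.fit` is introduced here for illustration):

```r
# Store the multivariate fit and inspect its regression points and partitions
mv.fit <- NNS.reg(z, g, order = "max", plot = FALSE, ncores = 1)

mv.fit$RPM             # regression points for each regressor
mv.fit$rhs.partitions  # partitions for each regressor
```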
`NNS.reg` can inter- or extrapolate any point of interest. The `NNS.reg(x, y, point.est = ...)` parameter accepts data of any size sharing the dimensions of $x$; the estimates are called specifically with `NNS.reg(...)$Point.est`.
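A minimal sketch using the multivariate fit above; the two evaluation points are illustrative values chosen outside the training grid:

```r
# Estimate the surface at arbitrary points of interest (illustrative values)
NNS.reg(z, g, order = "max",
        point.est = rbind(c(-6, 6), c(6, -6)),
        plot = FALSE, ncores = 1)$Point.est
```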
`NNS.reg` also provides a dimension reduction regression by including the parameter `NNS.reg(x, y, dim.red.method = "cor", ...)`, which reduces all regressors to a single dimension using the returned equation `NNS.reg(..., dim.red.method = "cor", ...)$equation`.
```r
NNS.reg(iris[ , 1 : 4], iris[ , 5], dim.red.method = "cor", location = "topleft", ncores = 1)$equation

a = NNS.reg(iris[ , 1 : 4], iris[ , 5], dim.red.method = "cor", location = "topleft", ncores = 1, plot = FALSE)$equation
```
Thus, our model for this regression would be:
$$Species = \frac{`r round(a$Coefficient[1],3)` \, Sepal.Length \; `r round(a$Coefficient[2],3)` \, Sepal.Width + `r round(a$Coefficient[3],3)` \, Petal.Length + `r round(a$Coefficient[4],3)` \, Petal.Width}{4}$$
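As a minimal sketch of applying this equation to the data (assuming the `Coefficient` column of `a` is ordered as the four iris regressors above):

```r
# Form the single reduced regressor implied by the equation above:
# the sum of coefficient-scaled regressors divided by 4
reduced.regressor <- as.matrix(iris[ , 1 : 4]) %*% a$Coefficient[1 : 4] / 4
head(reduced.regressor)
```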
`NNS.reg(x, y, dim.red.method = "cor", threshold = ...)` offers a method of reducing regressors further by controlling the absolute value of the required correlation.
NNS.reg(iris[ , 1 : 4], iris[ , 5], dim.red.method = "cor", threshold = .75, location = "topleft", ncores = 1)$equation
a = NNS.reg(iris[ , 1 : 4], iris[ , 5], dim.red.method = "cor", threshold = .75, location = "topleft", ncores = 1, plot = FALSE)$equation
Thus, our model for this further reduced dimension regression would be:
$$Species = \frac{`r round(a$Coefficient[1],3)` \, Sepal.Length + `r round(a$Coefficient[2],3)` \, Sepal.Width + `r round(a$Coefficient[3],3)` \, Petal.Length + `r round(a$Coefficient[4],3)` \, Petal.Width}{3}$$
The `point.est = (...)` parameter operates in the same manner as in the full regression above, and is again called with `NNS.reg(...)$Point.est`.
NNS.reg(iris[ , 1 : 4], iris[ , 5], dim.red.method = "cor", threshold = .75, point.est = iris[1 : 10, 1 : 4], location = "topleft", ncores = 1)$Point.est
For a classification problem, we simply set `NNS.reg(x, y, type = "CLASS", ...)`.

NOTE: The base category of the response variable should be 1, not 0, for classification problems.
NNS.reg(iris[ , 1 : 4], iris[ , 5], type = "CLASS", point.est = iris[1 : 10, 1 : 4], location = "topleft", ncores = 1)$Point.est
# NNS Stacking `NNS.stack()`

The `NNS.stack` routine cross-validates, for a given objective function, the `n.best` parameter of the multivariate `NNS.reg` as well as the `threshold` parameter of the dimension-reduction `NNS.reg` version. `NNS.stack` can be used for classification:
NNS.stack(..., type = "CLASS", ...)
or continuous dependent variables:
`NNS.stack(..., type = NULL, ...)`.
Any objective function `obj.fn` can be called using `expression()` with the terms `predicted` and `actual`, even from external packages such as `Metrics`:

`NNS.stack(..., obj.fn = expression(Metrics::mape(actual, predicted)), objective = "min")`.
```r
NNS.stack(IVs.train = iris[ , 1 : 4],
          DV.train = iris[ , 5],
          IVs.test = iris[1 : 10, 1 : 4],
          dim.red.method = "cor",
          obj.fn = expression( mean(round(predicted) == actual) ),
          objective = "max",
          type = "CLASS",
          folds = 1,
          ncores = 1)
```
```
Folds Remaining = 0 
Current NNS.reg(... , threshold = 0.80 ) MAX Iterations Remaining = 1 
Current NNS.reg(... , threshold = 0.40 ) MAX Iterations Remaining = 0 
Current NNS.reg(. , n.best = 1 ) MAX Iterations Remaining = 12 
Current NNS.reg(. , n.best = 2 ) MAX Iterations Remaining = 11 
Current NNS.reg(. , n.best = 3 ) MAX Iterations Remaining = 10 
Current NNS.reg(. , n.best = 4 ) MAX Iterations Remaining = 9 
Current NNS.reg(. , n.best = 5 ) MAX Iterations Remaining = 8 

$OBJfn.reg
[1] 0.9733333

$NNS.reg.n.best
[1] 1

$probability.threshold
[1] 0.547

$OBJfn.dim.red
[1] 0.9666667

$NNS.dim.red.threshold
[1] 0.8

$reg
 [1] 1 1 1 1 1 1 1 1 1 1

$reg.pred.int
NULL

$dim.red
 [1] 1 1 1 1 1 1 1 1 1 1

$dim.red.pred.int
NULL

$stack
 [1] 1 1 1 1 1 1 1 1 1 1

$pred.int
NULL
```
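For a continuous dependent variable, a comparable sketch follows; this particular setup (predicting `Petal.Width` from the other three iris measurements, with `Metrics::mape` as the objective) is illustrative rather than taken from the text above:

```r
# Continuous DV: cross-validate NNS.stack while minimizing MAPE from the Metrics package
NNS.stack(IVs.train = iris[ , 1 : 3],
          DV.train  = iris[ , 4],
          IVs.test  = iris[1 : 10, 1 : 3],
          obj.fn    = expression(Metrics::mape(actual, predicted)),
          objective = "min",
          type      = NULL,
          folds     = 1,
          ncores    = 1)
```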
Given that multicollinearity is not an issue for nonparametric regressions as it is for OLS, in the case of an ill-fitting univariate model a better option may be to increase the dimensionality of the regressor with a copy of itself and cross-validate the number of clusters `n.best` via:
`NNS.stack(IVs.train = cbind(x, x), DV.train = y, method = 1, ...)`.
```r
set.seed(123)
x = rnorm(100); y = rnorm(100)

nns.params = NNS.stack(IVs.train = cbind(x, x),
                       DV.train = y,
                       method = 1,
                       ncores = 1)
```
```r
set.seed(123)
x = rnorm(100); y = rnorm(100)

nns.params = list()
nns.params$NNS.reg.n.best = 100
```
```r
NNS.reg(cbind(x, x), y,
        n.best = nns.params$NNS.reg.n.best,
        point.est = cbind(x, x),
        residual.plot = TRUE,
        ncores = 1,
        confidence.interval = .95)
```
Smoothness is not required for curve fitting, but the NNS.reg function offers an optional smoothed fit. This feature applies a smoothing spline to regression points generated internally using the partitioning method described earlier.
```r
NNS.reg(x, y, smooth = TRUE)
```
Imputation in NNS is a direct application of nearest neighbor regression. When values of $y$ are missing, we use the observed $(X,y)$ pairs as the training set and the predictors of the missing rows as point.est.
A key insight is that even in univariate regressions, NNS.reg benefits from the increasing dimensions trick: by duplicating the predictor into a multivariate form, e.g. cbind(x, x), the distance function underlying NNS.reg operates in a 2-D space. This sharpened distance metric allows a more robust donor selection, effectively turning univariate imputation into a special case of multivariate nearest neighbor regression.
For multivariate predictors, the same form applies directly — supply the full set of observed predictors in $x$, the observed responses in $y$, and the incomplete rows in point.est. With order = "max", n.best = 1, the imputation is always 1-NN donor-based: each missing $y$ is filled in by the response of its closest donor under the NNS hybrid distance. This ensures imputations remain strictly within the support of the observed data.
```r
set.seed(123)

# Univariate predictor with nonlinear signal
n <- 400
x <- sort(runif(n, -3, 3))
y <- sin(x) + 0.2 * x^2 + rnorm(n, 0, 0.25)

# Induce ~25% MCAR missingness in y
miss <- rbinom(n, 1, 0.25) == 1
y_mis <- y
y_mis[miss] <- NA

# ---- Increasing dimensions trick ----
# Duplicate x so the distance operates in a 2D space: cbind(x, x).
# This sharpens nearest-neighbor selection even in a nominally univariate setting.
x2_train <- cbind(x[!miss], x[!miss])
x2_miss  <- cbind(x[miss], x[miss])

# 1-NN donor imputation with NNS.reg
y_hat_uni <- NNS::NNS.reg(
  x = x2_train,         # predictors (duplicated x)
  y = y[!miss],         # observed responses
  point.est = x2_miss,  # rows to impute
  order = "max",        # dependence-maximizing order
  n.best = 1,           # 1-NN donor
  plot = FALSE
)$Point.est

# Fill back
y_completed_uni <- y_mis
y_completed_uni[miss] <- y_hat_uni

# Plot observed vs imputed (NNS 1-NN)
plot(x, y, pch = 1, col = "steelblue", cex = 1.5, lwd = 2,
     xlab = "x", ylab = "y", main = "NNS 1-NN Imputation")
points(x[miss], y_hat_uni, col = "red", pch = 15, cex = 1.3)
legend("topleft", legend = c("Observed", "Imputed (NNS 1-NN)"),
       col = c("steelblue", "red"), pch = c(1, 15),
       pt.lwd = c(2, NA), bty = "n")
```
{width="600" height="600"}
```r
set.seed(123)

# Multivariate predictors with nonlinear & interaction structure
n <- 800
X <- cbind(
  x1 = rnorm(n),
  x2 = runif(n, -2, 2),
  x3 = rnorm(n, 0, 1)
)

f <- function(x1, x2, x3) 1.1*x1 - 0.8*x2 + 0.5*x3 + 0.6*x1*x2 - 0.4*x2*x3 + 0.3*sin(1.3*x1)
y <- f(X[,1], X[,2], X[,3]) + rnorm(n, 0, 0.4)

# Induce ~30% MCAR missingness in y
miss <- rbinom(n, 1, 0.30) == 1
y_mis <- y
y_mis[miss] <- NA

# Training (observed) vs rows to impute
X_obs <- X[!miss, , drop = FALSE]
y_obs <- y[!miss]
X_mis <- X[ miss, , drop = FALSE]

# 1-NN donor imputation with NNS.reg
y_hat_mv <- NNS::NNS.reg(
  x = X_obs,          # all observed predictors
  y = y_obs,          # observed responses
  point.est = X_mis,  # rows to impute
  order = "max",      # dependence-maximizing order
  n.best = 1,         # 1-NN donor
  plot = FALSE
)$Point.est

# Completed vector
y_completed_mv <- y_mis
y_completed_mv[miss] <- y_hat_mv

# Plot observed vs imputed (multivariate, NNS 1-NN)
plot(seq_along(y), y, pch = 1, col = "steelblue", cex = 1.5, lwd = 2,
     xlab = "Observation index", ylab = "y",
     main = "NNS 1-NN Multivariate Imputation")

# Overlay imputed values
points(which(miss), y_hat_mv, pch = 15, col = "red", cex = 1.2)

# Legend
legend("topleft", legend = c("Observed", "Imputed (NNS 1-NN)"),
       col = c("steelblue", "red"), pch = c(1, 15),
       pt.lwd = c(2, NA), bty = "n")
```
{width="600" height="600"}
If the user is so motivated, detailed arguments and further examples are provided within the References.