cknn: Find comparable properties using clustered nearest neighbors
In OllieS8/bearing:


Sales comparables are recent sales which have the same (or very similar) characteristics as a target (unsold) property. They are frequently used by assessors, real estate agents, and appraisers to determine the fair market value of a home. However, finding comparable properties at scale can be difficult.

This function can be used to quickly find comparables for any number of unsold properties. It can also be used more generally to find similar properties that are near each other, regardless of whether they have sold.

See the documentation site for example usage.

cknn(
  data,
  lon,
  lat,
  m = 5,
  k = 10,
  l = 0.5,
  var_weights = NULL,
  keep_data = TRUE,
  ...
)

`data`	A data frame containing the variables to cluster on. Should contain both numerics and factors. Numerics should be unscaled. Lat/lon should NOT be included.
`lon`	A numeric vector of longitude values, reprojected into planar coordinates specific to the target area. See here for details on reprojection using R.
`lat`	A numeric vector of latitude values, reprojected into planar coordinates specific to the target area. See here for details on reprojection using R.
`m`	The number of clusters to create using the `kproto` function.
`k`	The number of nearest neighbors to return for each row of input data.
`l`	Hyperparameter representing the trade-off between distance and characteristics in kNN matching. Must be >= 0 and <= 1. A value of 1 matches on distance only, while a value of 0 disregards distance and matches on characteristics only. Default 0.5 (equal weight).
`var_weights`	Value(s) passed to `lambda` input of `kproto`. See details.
`keep_data`	Logical for whether original data should be included in the returned object.
`...`	Arguments passed on to `kproto`, most commonly `iter.max`.
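
A minimal sketch of a call with the arguments above, using synthetic data. The column names (`sqft`, `age`, `class`) and coordinate ranges are made up for illustration, the coordinates are assumed to already be in a planar projection, and the package name `bearing` is an assumption taken from the repository name.

```r
set.seed(42)
n <- 200

# Hypothetical property characteristics: unscaled numerics plus a factor,
# with no lat/lon columns
chars <- data.frame(
  sqft  = round(runif(n, 600, 4000)),
  age   = round(runif(n, 0, 100)),
  class = factor(sample(c("A", "B", "C"), n, replace = TRUE))
)

# Synthetic coordinates, assumed to already be reprojected to planar units
lon <- runif(n, 1150000, 1180000)
lat <- runif(n, 1890000, 1920000)

# Only runs if the package is installed; "bearing" as the package name is
# an assumption based on the repository name
if (requireNamespace("bearing", quietly = TRUE)) {
  fit <- bearing::cknn(chars, lon, lat, m = 4, k = 5, l = 0.5)
  str(fit$knn[1:3])  # neighbors for the first three input rows
}
```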

The cknn algorithm works in two stages:

Divide the full set of sales into m clusters according to each property's characteristics. This mimics the process of market segmentation or separating properties into different classes. This clustering is done using the k-prototypes function kproto from the clustMixType library. See the clustMixType whitepaper for more information.
For each property i, find the k nearest neighbors within i's cluster, minimizing a combination of the distance over planar coordinates and the Euclidean distance to all cluster centers. This is accomplished with the fast kNN function from the dbscan package.
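
The two stages can be sketched roughly as follows. This is not the package's implementation: the way `l` weights the scaled coordinates against the scaled center distances is an assumption for illustration, `dbscan::kNN` is assumed to be the kNN function referred to above, and the data are synthetic.

```r
library(clustMixType)  # kproto()
library(dbscan)        # kNN()

set.seed(1)
n <- 120
chars <- data.frame(
  sqft  = c(rnorm(n / 2, 1000, 100), rnorm(n / 2, 3000, 200)),
  class = factor(rep(c("A", "B"), each = n / 2))
)
lon <- runif(n, 0, 10000)
lat <- runif(n, 0, 10000)
m <- 2
k <- 3
l <- 0.5

# Stage 1: segment properties by characteristics with k-prototypes
kp <- kproto(chars, k = m, verbose = FALSE)

# Stage 2: combine scaled coordinates (weight l) with scaled distances to
# all cluster centers (weight 1 - l). This weighting is an assumption.
coords <- scale(cbind(lon, lat)) * l
center_dists <- scale(kp$dists) * (1 - l)
features <- cbind(coords, center_dists)

# Run kNN separately inside each cluster
neighbors <- lapply(sort(unique(kp$cluster)), function(cl) {
  rows <- which(kp$cluster == cl)
  k_use <- min(k, length(rows) - 1)  # kNN needs k below the cluster size
  nn <- kNN(features[rows, , drop = FALSE], k = k_use)
  # Translate in-cluster positions back to row indices of the input data
  matrix(rows[nn$id], nrow = length(rows), dimnames = list(rows, NULL))
})
```

Each element of `neighbors` is a matrix with one row per property in that cluster and one column per neighbor, holding row indices into the original data.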

Options for inputs to var_weights include:

A named list with names corresponding to column names in the input data. Names not included in the list are assumed to have a value of 1. These named values are multiplied by the variance estimates created by lambdaest. Higher values will weight variables more heavily during clustering.
An unnamed vector of length p, where p is the number of columns in the input data. These weights are not multiplied by the variance estimates created by lambdaest.
A single unnamed numeric value. This value trades off the relative importance of numeric versus categorical variables. Higher values will more heavily weight categorical variables, while a value of 0 replicates standard k-means (numerics only).
A NULL value. This uses the default estimates produced by lambdaest. All variables are weighted equally.
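
As an illustration, the four forms of var_weights listed above might look like this for a hypothetical three-column input. The column names and weight values are made up; only the shapes of the objects matter.

```r
# Hypothetical input data with three columns
chars <- data.frame(
  sqft  = c(900, 1500, 2200, 3100),
  age   = c(80, 45, 20, 5),
  class = factor(c("A", "A", "B", "B"))
)

# 1. Named list: columns not listed (here, age) default to 1
w_named <- list(sqft = 2, class = 0.5)

# 2. Unnamed vector of length p, one weight per column, in column order
w_vector <- c(2, 1, 0.5)

# 3. Single number: numeric-versus-categorical trade-off
w_single <- 0.8

# 4. NULL: fall back to the lambdaest defaults
w_null <- NULL

# Basic shape checks before passing any of these as var_weights
stopifnot(all(names(w_named) %in% names(chars)))
stopifnot(length(w_vector) == ncol(chars))
```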

Object of class cknn containing:

`kproto`	`kproto` object containing clusters, centroids, etc.
`knn`	List of `k` nearest neighbors for each row in the input data.
`knn_idx`	Lookup for translating in-cluster index positions to row indices from the input data. Used by predict method.
`lon`	Unaltered input longitude vector. Used by predict method for scaling new input data.
`lat`	Unaltered input latitude vector. Used by predict method for scaling new input data.
`var_weights`	Unaltered variable weights used to construct the cknn model.
`m`	Number of clusters created by `kproto`.
`k`	Number of nearest neighbors returned by `kNN`.
`l`	Hyperparameter used for distance/characteristics trade-off.
`data`	Unaltered input data frame. Used by predict method for scaling new input data. Only returned if `keep_data` is `TRUE`.

Input data should be thoroughly cleaned. Outliers in numeric vectors and factors with rare levels can both affect clustering performance. Outlier values should be removed. Rare factor levels should be collapsed into a single level or removed.
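
One possible way to do this cleaning in base R is sketched below. The helper `clean_inputs`, the IQR-based outlier rule, and the thresholds are all assumptions chosen for illustration, not part of the package.

```r
# Hypothetical helper: drop numeric outliers, then collapse rare factor levels
clean_inputs <- function(df, num_col, fac_col, min_count = 5, iqr_mult = 3) {
  # Drop rows whose value lies more than iqr_mult * IQR outside the quartiles
  q <- quantile(df[[num_col]], c(0.25, 0.75), na.rm = TRUE)
  fence <- iqr_mult * (q[2] - q[1])
  keep <- df[[num_col]] >= q[1] - fence & df[[num_col]] <= q[2] + fence
  df <- df[keep, , drop = FALSE]

  # Collapse factor levels with fewer than min_count rows into "Other"
  counts <- table(df[[fac_col]])
  rare <- names(counts)[counts < min_count]
  f <- as.character(df[[fac_col]])
  f[f %in% rare] <- "Other"
  df[[fac_col]] <- factor(f)
  df
}

set.seed(7)
raw <- data.frame(
  sqft  = c(rnorm(50, 1500, 200), 50000),               # one extreme value
  class = factor(c(rep("A", 30), rep("B", 19), "Z", "A"))  # "Z" is rare
)
cleaned <- clean_inputs(raw, "sqft", "class")
```

Whether to collapse or drop rare levels, and how aggressive the outlier rule should be, depends on the data.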

OllieS8/bearing documentation built on Dec. 31, 2020, 3:23 p.m.