knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

The dspace (divide space) package is designed for spatial segmentation/clustering/regionalization of point or polygon data of class 'sf'. It uses igraph to build a topological network/graph connecting points or polygons centroids. The nodes of the graph are connected with their nearest neighbors based on defined criteria (eg. nearest neighbors or queen/rook neighborhood - see...) and the weight of the edge is calculated as similarity between two nodes. To find network communities which represent regions/spatial clusters/submarkets, it uses fast greedy algorithm (add citation) .

This package has been developed mainly for real estate market segmentation (submarket delimitation) but can be easily adapted to any variety of spatial data.

The package uses following functions:

Main functions and data flow

Where to start

library(dspace)
data(realEstate)
modularity <- find_no_clusters(realEstate)
plot(modularity)
realEstate$class <- regionalize(realEstate, k = 5, explain = FALSE)

Overview of the algorithm

Dspace works with point or polygon datasets to regionalize/divide space into spatially coherent regions/clusters. The mechanism behind the algorithm uses network communities for data fixed in geographic 2-dimensional space. If the input dataset contains points, then, based on their position in 2-dimensional space, a graph network is being build that connects points based on their number of nearest neighbors default value n.neigh = 8 (in future development other means of building network will be implemented like dnearestneigh). If polygons are the input data then graph can be built based on the queen/rook neighborhood of the polygons (queen = TRUE).

When a graph connecting all of the neighboring points/polygons is created a similarity metric is calculated between the neighboring objects. This similarity can be calculated based on different distance metrics: "mahalanobis" or "euclidean", "maximum", "manhattan", "canberra", "binary" or "minkowski". If method is one of "euclidean", "maximum", "manhattan", "canberra", "binary" or "minkowski", see ?dist for details, because this function is used to compute the distance. If method is set to "mahalanobis", the mahalanobis distance is computed between neighboring points. If method is a function, this function is used to compute the distance.

Based on a graph with weights assigned to the edges between the objects (points and polygons) a fast greedy algorithm is used to determine spatial clusters (communities in a graph).

Finding number of clusters

To find appropriate number of clusters a find_no_clusters() function has been created. It creates a rapid division of data into a defined number of regions (by default from 2 to 30 regions) and calculates for each division their modularity metric (needs reference for modularity). By default the function returns a named vector which can be applied to the function plot_modularity() that uses ggplot2 to visualize changes in modularity for each division/split of the data.

Assessing the division into clusters

For assessing the goodness of division/split while using regionalize() function (or point_ds()/polygon_ds()) an argument explain can be set to TRUE. While it is set to TRUE, a randomForest model is being trained to establish how much the division of the regions could be implied based on the variables from the dataset itself without taking into account their spatial distribution. this absolutely needs refinement



dabrowskia/dspace documentation built on July 3, 2020, 8:47 p.m.