In this work, we provide the framework to analyze multiresolution partitions (e.g. country, provinces, subdistrict) where each individual data point belongs to only one partition in each layer (e.g. i belongs to subdistrict A, province P, and country Q).
We assume that a partition in a higher layer subsumes lower-layer partitions (e.g. a nation is at the 1st layer subsumes all provinces at the 2nd layer).
Given N individuals that have a pair of real values (x,y) that generated from independent variable X and dependent variable Y. Each individual i belongs to one partition per layer.
Our goal is to find which partitions at which highest level that all individuals in the these partitions share the same linear model Y=f(X) where f is a linear function.
The framework deploys the Minimum Description Length principle (MDL) to infer solutions.
For the newest version on github, please call the following command in R terminal.
remotes::install_github("DarkEyes/MRReg")
This requires a user to install the "remotes" package before installing MRReg.
In the first step, we generate a simulation dataset.
All simulation types have three layers except the type 4 has four layers.
The type-1 simulation has all individuals belong to the same homogeneous partition in the first layer.
The type-2 simulation has four homogeneous partitions in a second layer. Each partition has its own models.
The type-3 simulation has eight homogeneous partitions in a third layer. Each partition has its own models
The type-4 simulation has one homogeneous partition in a second layer, four homogeneous partitions in a third layer, and eight homogeneous partitions in a fourth layer. Each partition has its own model.
In this example, we use type-4 simulation.
library(MRReg)
# Generate simulation data type 4 by having 100 individuals per homogeneous partition.
DataT<-SimpleSimulation(100,type=4)
gamma <- 0.05 # Gamma parameter
out<-FindMaxHomoOptimalPartitions(DataT,gamma)
Then we plot the optimal homogeneous tree.
plotOptimalClustersTree(out)
The red nodes are homogeneous partitions. All children of a homogeneous partition node share the same linear model.
Lastly, we can print the result in text form.
PrintOptimalClustersResult(out, selFeature = TRUE)
The result is below.
[1] "========== List of Optimal Clusters =========="
[1] "Layer2,ClS-C1:clustInfoRecRatio=0.08,modelInfoRecRatio=0.72, eta(C)cv=1.00"
[1] "Selected features"
[1] 2
[1] "Layer3,ClS-C11:clustInfoRecRatio=0.10,modelInfoRecRatio=0.63, eta(C)cv=1.00"
[1] "Selected features"
[1] 2
[1] "Layer3,ClS-C12:clustInfoRecRatio=0.10,modelInfoRecRatio=0.70, eta(C)cv=1.00"
[1] "Selected features"
[1] 3
[1] "Layer3,ClS-C13:clustInfoRecRatio=0.10,modelInfoRecRatio=0.68, eta(C)cv=1.00"
[1] "Selected features"
[1] 4
[1] "Layer3,ClS-C14:clustInfoRecRatio=0.09,modelInfoRecRatio=0.61, eta(C)cv=1.00"
[1] "Selected features"
[1] 5
[1] "Layer4,ClS-C21:clustInfoRecRatio=NA,modelInfoRecRatio=0.61, eta(C)cv=1.00"
[1] "Selected features"
[1] 2
[1] "Layer4,ClS-C22:clustInfoRecRatio=NA,modelInfoRecRatio=0.58, eta(C)cv=1.00"
[1] "Selected features"
[1] 3
[1] "Layer4,ClS-C23:clustInfoRecRatio=NA,modelInfoRecRatio=0.61, eta(C)cv=1.00"
[1] "Selected features"
[1] 4
[1] "Layer4,ClS-C24:clustInfoRecRatio=NA,modelInfoRecRatio=0.46, eta(C)cv=1.00"
[1] "Selected features"
[1] 5
[1] "Layer4,ClS-C25:clustInfoRecRatio=NA,modelInfoRecRatio=0.55, eta(C)cv=1.00"
[1] "Selected features"
[1] 6
[1] "Layer4,ClS-C26:clustInfoRecRatio=NA,modelInfoRecRatio=0.60, eta(C)cv=1.00"
[1] "Selected features"
[1] 7
[1] "Layer4,ClS-C27:clustInfoRecRatio=NA,modelInfoRecRatio=0.63, eta(C)cv=1.00"
[1] "Selected features"
[1] 8
[1] "Layer4,ClS-C28:clustInfoRecRatio=NA,modelInfoRecRatio=0.61, eta(C)cv=1.00"
[1] "Selected features"
[1] 9
[1] "min eta(C)cv:1.000000"
Note for selected features: 1 is reserved for an intercept, and d is a selected feature if Y[i] ~ X[i,d-1] in linear model. Note that the clustInfoRecRatio values are always NA for last-layer partitions.
INPUT: DataT$clsLayer[i,k] is the cluster label of ith individual in kth cluster layer.
OUTPUT: out$Copt[p,1] is equal to k implies that a cluster that is a pth member of the maximal homogeneous partition is at kth layer and the cluster name in kth layer is Copt[p,2]
Chainarong Amornbunchornvej, Navaporn Surasvadi, Anon Plangprasopchok, and Suttipong Thajchayapong (2021). Identifying Linear Models in Multi-Resolution Population Data using Minimum Description Length Principle to Predict Household Income. ACM Transactions on Knowledge Discovery from Data (TKDD), 15(2), 1-30. https://doi.org/10.1145/3424670
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.