Hierarchical Climate Regionalization
Description
HiClimR is a tool for Hierarchical Climate Regionalization applicable to any correlationbased clustering. Climate regionalization is the process of dividing an area into smaller regions that are homogeneous with respect to a specified climatic metric. Several features are added to facilitate the applications of climate regionalization (or spatiotemporal analysis in general) and to implement cluster validation with an objective tree cutting to find an optimal number of clusters for a userspecified confidence level. These include options for preprocessing and postprocessing as well as efficient code execution for large datasets and options for splitting big data and computing only the uppertriangular half of the correlation/dissimilarity matrix to overcome memory limitations. Hybrid hierarchical clustering reconstructs the upper part of the tree above a cut to get the best of the available methods. Multivariate clustering (MVC) provides options for filtering all variables before preprocessing, detrending and standardization of each variable, and applying weights for the preprocessed variables. The correlation distance for MVC represents the (weighted) average of distances between all variables.
HiClimR
is the main function that calls all helper functions. It adds
several features and a new clustering method (called, regional linkage) to
hierarchical clustering in R (hclust
function in stats library):
data regridding (grid2D
function), coarsening spatial resolution
(coarseR
function), geographic masking (geogMask
function),
data filtering by mean and/or variance thresholds, data preprocessing (detrending,
standardization, and PCA), faster correlation function with preliminary big data support
(fastCor
function), hybrid hierarchical clustering, multivariate clustering
(MVC), cluster validation (validClimR
and minSigCor
functions),
and visualization of region maps. Badr et. al (2015) describes the regionalization
algorithms, features, and data processing tools included in the package and presents a
demonstration application in which the package is used to regionalize Africa on the
basis of interannual precipitation variability.
HiClimR is applicable to any correlationbased clustering.
Usage
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34  HiClimR(
# Input data matrix (N spatial elements x M observations)
x = list(),
# Coarsening spatial resolution
lon=NULL, lat=NULL, lonStep=1, latStep=1,
# Geographic masking:
geogMask=FALSE, gMask=NULL, continent=NULL, region=NULL, country=NULL,
# Data thresholds:
meanThresh = if(class(x) == "list") vector("list", length(x)) else list(NULL),
varThresh = if(class(x) == "list") as.list(rep(0, length(x))) else list(0),
# Data preprocessing:
detrend = if(class(x) == "list") as.list(rep(FALSE, length(x))) else list(FALSE),
standardize = if(class(x) == "list") as.list(rep(FALSE, length(x))) else list(FALSE),
weightedVar = if(class(x) == "list") as.list(rep(1, length(x))) else list(1),
nPC=NULL,
# Clustering options:
method="ward", hybrid=FALSE, kH=NULL, members=NULL,
# Big data support:
nSplit = 1, upperTri = TRUE, verbose = TRUE,
# Cluster validation:
validClimR=TRUE, rawStats=TRUE, k=NULL, minSize=1, alpha=0.05,
# Graphical options:
plot=TRUE, dendrogram = TRUE, colPalette=NULL, hang=1, labels=FALSE, pch = 15, cex = 1
)

Arguments
x 
an ( 
lon 
a vector of longitudes with length 
lat 
a vector of latitudes with length 
lonStep 
an integer greater than or equal to 
latStep 
an integer greater than or equal to 
geogMask 
a logical: if 
gMask 
A vector of indices for the spatial elements to be masked,
as required by 
continent 

region 

country 

meanThresh 

varThresh 
zero or a threshold for the temporal variance: This is
used with 
detrend 
a logical: should the data be detrended before clustering?
Detrending (removing the linear trend) is important when variations from
temporal point to another is of interest (e.g., interannual variability).
The columns of the data matrix 
standardize 
a logical: should the data be standardized before
clustering? The standardized data makes use of the mean of equallyweighted
objects within each cluster (cluster mean = mean of standardized variables
within the cluster). Otherwise, the mean of raw data will be used (cluster
mean = mean of raw variables within the cluster). The variance of the mean
is updated at each agglomeration step.
For MultiVariate Clustering (MVC), 
weightedVar 
a list of positive wights ( 
nPC 

method 
the agglomeration method to be used. This should be (an
unambiguous abbreviation of) one of 
hybrid 
a logical: should the upper part of the tree be reconstructed
using 
kH 

members 

nSplit 
integer number greater than or equal to one, to split the data matrix into

upperTri 
logical to compute only the uppertriangular half of the correlation
matrix if 
verbose 
logical to print processing information if 
validClimR 
a logical: If 
rawStats 
a logical: should validation indices be computed based on the raw data or PCAfiltered data? 
k 

minSize 
minimum cluster size. The 
alpha 
confidence level: the default is 
plot 
logical to call the plotting method if 
dendrogram 
logical to enable or disable dendrogram plotting. 
colPalette 
a color palette or a list of colors such as that generated
by 
hang 
The fraction of the plot height by which labels should hang below the rest of the plot. A negative value will cause the labels to hang down from 0. 
labels 
A character vector of labels for the leaves of the
tree. By default the row names or row numbers of the original data are
used. If 
pch 
Either an integer specifying a symbol or a single character to
be used as the default in plotting points. See 
cex 
A numerical value giving the amount by which plotting symbols should
be magnified relative to the 
Details
HiClimR
function is based on hclust
, which now uses an
optimized algorithm to deal with only the upper/lower triangularhalf of the symmetric
dissimilarity matrix instead of the old algorithm that uses the full matrix in the
merging steps. It performs a hierarchical cluster analysis using Pearson correlation
distance dissimilarity for the N objects being clustered. Initially, each object
is assigned to its own cluster and then the algorithm proceeds iteratively, at each
stage joining the two most similar clusters, continuing until there is just a single
cluster. At each stage distances between clusters are recomputed by a dissimilarity
update formula according to the particular clustering method being used.
All clustering methods in hclust
are included. The regional
linkage method mainimizes intercluster correlations between cluster means
(see Badr et al. 2015
). Ward's minimum variance method aims at finding
compact, spherical clusters. The complete linkage method finds similar clusters.
The single linkage method (which is closely related to the minimal spanning tree)
adopts a ‘friends of friends’ clustering strategy. The other methods can be
regarded as aiming for clusters with characteristics somewhere between the single and
complete link methods. Note however, that methods "median"
and "centroid"
are not leading to a monotone distance measure, or equivalently the
resulting dendrograms can have so called inversions (which are hard to interpret).
The regional
linkage method is explained in the context of a spatiotemporal
problem, in which N
spatial elements (e.g., weather stations) are divided
into k
regions, given that each element has a time series of length M
.
It is based on interregional correlation distance between the temporal means of
different regions (or elements at the first merging step). It modifies the update
formulae of average
linkage method by incorporating the standard deviation
of the merged region timeseries, which is a function of the correlation between the
individual regions, and their standard deviations before merging. It is equal to the
average of their standard deviations if and only if the correlation between the two
merged regions is 100%
. In this special case, the regional
linkage
method is reduced to the classic average
linkage clustering method.
If members != NULL
, then d
is taken to be a
dissimilarity matrix between clusters instead of dissimilarities
between singletons and members
gives the number of observations
per cluster. This way the hierarchical cluster algorithm can be
‘started in the middle of the dendrogram’, e.g., in order to
reconstruct the part of the tree above a cut (see examples).
Dissimilarities between clusters can be efficiently computed (i.e.,
without hclust
itself) only for a limited number of
distance/linkage combinations, the simplest one being squared
Euclidean distance and centroid linkage. In this case the
dissimilarities between the clusters are the squared Euclidean
distances between cluster means.
In hierarchical cluster displays, a decision is needed at each merge to
specify which subtree should go on the left and which on the right.
Since, for n observations there are n1 merges,
there are 2^{(n1)} possible orderings for the leaves
in a cluster tree, or dendrogram.
The algorithm used in hclust
is to order the subtree so that
the tighter cluster is on the left (the last, i.e., most recent,
merge of the left subtree is at a lower value than the last
merge of the right subtree).
Single observations are the tightest clusters possible,
and merges involving two observations place them in order by their
observation sequence number.
Value
An object of class HiClimR and hclust, which describes the tree produced by the clustering process. The object is a list with the following components:
merge 
an n1 by 2 matrix.
Row i of 
height 
a set of n1 real values (nondecreasing for
ultrametric trees).
The clustering height: that is, the value of
the criterion associated with the clustering

order 
a vector giving the permutation of the original
observations suitable for plotting, in the sense that a cluster
plot using this ordering and matrix 
labels 
labels for each of the objects being clustered. 
method 
the cluster method that has been used. 
call 
the call which produced the result. 
dist.method 
the distance that has been used to create 
skip 
a vector of 
PCA 
if 
coords 
an ( 
nvars 
number of variables used for multivariate clustering (MVC). 
ncols 
number of columns for each variable. 
data 
the preprocessed data used for clustering will be stored here.
If 
mask 
a vector of the indices of masked spatial points by both geographic masking and data thresholds. 
treeH 
An object of class hclust, which describes the upper part of
the tree reconstructed by the hybrid clustering process if 
If validClimR = TRUE
, an object of class HiClimR, which produces
indices for validating the tree produced by the clustering process, will be merged
in the dendrogram tree above. This object is a list with the following components:
cutLevel 
the minimum significant correlation used for objective tree cut together with the corresponding confidence level. 
clustMean 
the cluster means which are the region's mean timeseries for all selected regions. 
clustSize 
cluster sizes for all selected regions. 
clustFlag 
a flag 
interCor 
intercluster correlations for all selected regions. It is the intercluster correlations between cluster means. The maximum intercluster correlation is a measure for separation or contiguity, and it is used for objective tree cut (to find the "optimal" number of clusters). 
intraCor 
intracluster correlations for all selected regions. It is the intracluster correlations between the mean of each cluster and its members. The average intracluster correlation is a weighted average for all clusters, and it is a measure for homogeneity. 
diffCor 
difference between intracluster correlation and maximum intercluster correlation for all selected regions. 
statSum 
overall statistical summary for i 
region 
ordered regions vector of size 
regionID 
ordered regions ID vector of length equals the selected number
of clusters, after excluding the small clusters defined by 
There are print
, plot
and identify
(see identify.hclust
) methods and
rect.hclust()
functions for hclust
objects.
Author(s)
Hamada Badr <badr@jhu.edu>, Ben Zaitchik <zaitchik@jhu.edu>, and
Amin Dezfuli <dez@jhu.edu>. The HiClimR
is a modification of
hclust
function, which is based on Fortran code
contributed to STATLIB by F. Murtagh.
References
Hamada S. Badr, Zaitchik, B. F. and Dezfuli, A. K. (2015): A Tool for Hierarchical Climate Regionalization, Earth Science Informatics, 110, http://dx.doi.org/10.1007/s1214501502217.
Hamada S. Badr, Zaitchik, B. F. and Dezfuli, A. K. (2014): Hierarchical Climate Regionalization, CRAN, http://cran.rproject.org/package=HiClimR.
Wilks, D. S. (2011): Statistical methods in the atmospheric sciences, Academic press.
Gordon, A. D. (1999): Classification. Second Edition. London: Chapman and Hall / CRC
Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988): The New S Language. Wadsworth & Brooks/Cole. (S version.)
Murtagh, F. (1985): “Multidimensional Clustering Algorithms”, in COMPSTAT Lectures 4. Wuerzburg: PhysicaVerlag (for algorithmic details of algorithms used).
Hartigan, J. A. (1975): Clustering Algorithms. New York: Wiley.
Everitt, B. (1974): Cluster Analysis. London: Heinemann Educ. Books.
Anderberg, M. R. (1973): Cluster Analysis for Applications. Academic Press: New York.
Sneath, P. H. A. and R. R. Sokal (1973): Numerical Taxonomy. San Francisco: Freeman.
McQuitty, L.L. (1966): Similarity Analysis by Reciprocal Pairs for Discrete and Continuous Data. Educational and Psychological Measurement, 26, 825831.
Source Code: https://github.com/hsbadr/HiClimR
See Also
HiClimR
, validClimR
, geogMask
,
coarseR
, fastCor
, grid2D
,
minSigCor
. identify.hclust
, rect.hclust
,
cutree
, dendrogram
, and kmeans
.
Examples
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177  require(HiClimR)
##
# Typical use of HiClimR for singlevariate clustering: #
##
## Load the test data included/loaded in the package (1 degree resolution)
x < TestCase$x
lon < TestCase$lon
lat < TestCase$lat
## Generate/check longitude and latitude mesh vectors for gridded data
xGrid < grid2D(lon = unique(TestCase$lon), lat = unique(TestCase$lat))
lon < c(xGrid$lon)
lat < c(xGrid$lat)
## SingleVariate Hierarchical Climate Regionalization
y < HiClimR(x, lon = lon, lat = lat, lonStep = 1, latStep = 1, geogMask = FALSE,
continent = "Africa", meanThresh = 10, varThresh = 0, detrend = TRUE,
standardize = TRUE, nPC = NULL, method = "regional", hybrid = FALSE, kH = NULL,
members = NULL, nSplit = 1, upperTri = TRUE, verbose = TRUE,
validClimR = TRUE, k = NULL, minSize = 1, alpha = 0.01,
plot = TRUE, colPalette = NULL, hang = 1, labels = FALSE)
## For more examples: https://github.com/hsbadr/HiClimR#examples
## Not run:
##
# Additional Examples: #
##
## Use Ward's method
y < HiClimR(x, lon = lon, lat = lat, lonStep = 1, latStep = 1, geogMask = FALSE,
continent = "Africa", meanThresh = 10, varThresh = 0, detrend = TRUE,
standardize = TRUE, nPC = NULL, method = "ward", hybrid = FALSE, kH = NULL,
members = NULL, nSplit = 1, upperTri = TRUE, verbose = TRUE,
validClimR = TRUE, k = NULL, minSize = 1, alpha = 0.01,
plot = TRUE, colPalette = NULL, hang = 1, labels = FALSE)
## Use data splitting for big data
y < HiClimR(x, lon = lon, lat = lat, lonStep = 1, latStep = 1, geogMask = FALSE,
continent = "Africa", meanThresh = 10, varThresh = 0, detrend = TRUE,
standardize = TRUE, nPC = NULL, method = "ward", hybrid = TRUE, kH = NULL,
members = NULL, nSplit = 10, upperTri = TRUE, verbose = TRUE,
validClimR = TRUE, k = NULL, minSize = 1, alpha = 0.01,
plot = TRUE, colPalette = NULL, hang = 1, labels = FALSE)
## Use hybrid WardRegional method
y < HiClimR(x, lon = lon, lat = lat, lonStep = 1, latStep = 1, geogMask = FALSE,
continent = "Africa", meanThresh = 10, varThresh = 0, detrend = TRUE,
standardize = TRUE, nPC = NULL, method = "ward", hybrid = TRUE, kH = NULL,
members = NULL, nSplit = 1, upperTri = TRUE, verbose = TRUE,
validClimR = TRUE, k = NULL, minSize = 1, alpha = 0.01,
plot = TRUE, colPalette = NULL, hang = 1, labels = FALSE)
## Check senitivity to kH for the hybrid method above
##
# Typical use of HiClimR for multivariate clustering: #
##
## Load the test data included/loaded in the package (1 degree resolution)
x1 < TestCase$x
lon < TestCase$lon
lat < TestCase$lat
## Generate/check longitude and latitude mesh vectors for gridded data
xGrid < grid2D(lon = unique(TestCase$lon), lat = unique(TestCase$lat))
lon < c(xGrid$lon)
lat < c(xGrid$lat)
## Test if we can replicate singlevariate region map with repeated variable
y < HiClimR(x=list(x1, x1), lon = lon, lat = lat, lonStep = 1, latStep = 1,
geogMask = FALSE, continent = "Africa", meanThresh = list(10, 10),
varThresh = list(0, 0), detrend = list(TRUE, TRUE), standardize = list(TRUE, TRUE),
nPC = NULL, method = "regional", hybrid = FALSE, kH = NULL,
members = NULL, nSplit = 1, upperTri = TRUE, verbose = TRUE,
validClimR = TRUE, k = NULL, minSize = 1, alpha = 0.01,
plot = TRUE, colPalette = NULL, hang = 1, labels = FALSE)
## Generate a random matrix with the same number of rows
x2 < matrix(rnorm(nrow(x1) * 100, mean=0, sd=1), nrow(x1), 100)
## MultiVariate Hierarchical Climate Regionalization
y < HiClimR(x=list(x1, x2), lon = lon, lat = lat, lonStep = 1, latStep = 1,
geogMask = FALSE, continent = "Africa", meanThresh = list(10, NULL),
varThresh = list(0, 0), detrend = list(TRUE, FALSE), standardize = list(TRUE, TRUE),
weightedVar = list(1, 1), nPC = NULL, method = "regional", hybrid = FALSE, kH = NULL,
members = NULL, nSplit = 1, upperTri = TRUE, verbose = TRUE,
validClimR = TRUE, k = NULL, minSize = 1, alpha = 0.01,
plot = TRUE, colPalette = NULL, hang = 1, labels = FALSE)
## You can apply all clustering methods and options
##
# Miscellaneous examples to provide more information about functionality and usage #
# of the helper functions that can be used separately or for other applications. #
##
## Load test case data
x < TestCase$x
## Generate longitude and latitude mesh vectors
xGrid < grid2D(lon = unique(TestCase$lon), lat = unique(TestCase$lat))
lon < c(xGrid$lon)
lat < c(xGrid$lat)
## Coarsening spatial resolution
xc < coarseR(x = x, lon = lon, lat = lat, lonStep = 2, latStep = 2)
lon < xc$lon
lat < xc$lat
x < xc$x
## Use fastCor function to compute the correlation matrix
t0 < proc.time(); xcor < fastCor(t(x)); proc.time()  t0
## compare with cor function
t0 < proc.time(); xcor0 < cor(t(x)); proc.time()  t0
## Check the valid options for geographic masking
geogMask()
## geographic mask for Africa
gMask < geogMask(continent = "Africa", lon = lon, lat = lat, plot = TRUE,
colPalette = NULL)
## Hierarchical Climate Regionalization Without geographic masking
y < HiClimR(x, lon = lon, lat = lat, lonStep = 1, latStep = 1, geogMask = FALSE,
continent = "Africa", meanThresh = 10, varThresh = 0, detrend = TRUE,
standardize = TRUE, nPC = NULL, method = "regional", hybrid = FALSE, kH = NULL,
members = NULL, nSplit = 1, upperTri = TRUE, verbose = TRUE,
validClimR = TRUE, k = NULL, minSize = 1, alpha = 0.01,
plot = TRUE, colPalette = NULL, hang = 1, labels = FALSE)
## With geographic masking (specify the mask produced bove to save time)
y < HiClimR(x, lon = lon, lat = lat, lonStep = 1, latStep = 1, geogMask = TRUE,
continent = "Africa", meanThresh = 10, varThresh = 0, detrend = TRUE,
standardize = TRUE, nPC = NULL, method = "regional", hybrid = FALSE, kH = NULL,
members = NULL, nSplit = 1, upperTri = TRUE, verbose = TRUE,
validClimR = TRUE, k = NULL, minSize = 1, alpha = 0.01,
plot = TRUE, colPalette = NULL, hang = 1, labels = FALSE)
## Find minimum significant correlation at 95
rMin < minSigCor(n = nrow(x), alpha = 0.05, r = seq(0, 1, by = 1e06))
## Validtion of Hierarchical Climate Regionalization
z < validClimR(y, k = NULL, minSize = 1, alpha = 0.01,
plot = TRUE, colPalette = NULL)
## Apply minimum cluster size (minSize = 25)
z < validClimR(y, k = NULL, minSize = 25, alpha = 0.01,
plot = TRUE, colPalette = NULL)
## The optimal number of clusters, including small clusters
k < length(z$clustFlag)
## The selected number of clusters, after excluding small clusters (if minSize > 1)
ks < sum(z$clustFlag)
## Dendrogram plot
plot(y, hang = 1, labels = FALSE)
## Tree cut
cutTree < cutree(y, k = k)
table(cutTree)
## Visualization for gridded data
RegionsMap < matrix(y$region, nrow = length(unique(y$coords[, 1])), byrow = TRUE)
colPalette < colorRampPalette(c("#00007F", "blue", "#007FFF", "cyan",
"#7FFF7F", "yellow", "#FF7F00", "red", "#7F0000"))
image(unique(y$coords[, 1]), unique(y$coords[, 2]), RegionsMap, col = colPalette(ks))
## Visualization for gridded or ungridded data
plot(y$coords[, 1], y$coords[, 2], col = colPalette(max(Regions, na.rm = TRUE))[y$region],
pch = 15, cex = 1)
## End(Not run)
