knitr::opts_chunk$set( eval = TRUE, collapse = TRUE, comment = "#>", dpi = 150, fig.show = 'hold', fig.width = 5 ) options(scipen = 9999) options(conflicts.policy = list(warn = FALSE))
This article showcases different spatial representations of the outputs of distantia()
and momentum()
.
if (!requireNamespace("sf", quietly = TRUE)){ install.packages("sf") } if (!requireNamespace("mapview", quietly = TRUE)){ install.packages("mapview") } if (!requireNamespace("dplyr", quietly = TRUE)){ install.packages("dplyr") }
library(distantia, warn.conflicts = FALSE) library(dplyr, warn.conflicts = FALSE) library(mapview, warn.conflicts = FALSE)
distantia()
FunctionsThe functions with the prefix distantia
quantify time series dissimilarity and provide results that can be represented as maps.
The examples in this article illustrate how to map dissimilarity scores using the datasets covid_prevalence
and covid_polygons
included with distantia
. The dataset comprises time series of weekly Covid-19 prevalence in several California counties.
tsl <- distantia::tsl_initialize( x = distantia::covid_prevalence, name_column = "name", time_column = "time" ) distantia::tsl_plot( tsl = tsl[1:10], columns = 2, guide = FALSE )
The map below displays the polygons in distantia::covid_counties
using mapview()
.
mapview::mapview( distantia::covid_counties, label = "name" )
The code in this section prepares several datasets to use in the different mapping examples:
df_psi
: a lock-step dissimilarity data frame with psi scores and p-values from restricted permutation tests.df_stats
: dissimilarity stats of each time series against all others.df_cluster
: a hierarchical clustering based on the dissimilarity scores in df_psi
.The lock-step dissimilarity analysis shown below includes p-values from a restricted permutation test. These p-values will be useful as criteria to select relevant mapping features.
#parallelization setup future::plan( future::multisession, workers = parallelly::availableCores() - 1 ) #lock-step dissimilarity analysis df_psi <- distantia::distantia( tsl = tsl, distance = "euclidean", lock_step = TRUE, repetitions = 1000, permutation = "restricted", block_size = 12 #weeks ) #disable parallelization future::plan( future::sequential ) #check resulting data frame df_psi |> dplyr::select(x, y, psi, p_value) |> dplyr::glimpse()
The code below aggregates psi scores in the data frame df_psi
to summarize the overall dissimilarity of each time series with all others.
df_stats <- distantia::distantia_stats( df = df_psi ) df_stats |> dplyr::select( name, mean ) |> dplyr::glimpse()
The function distantia_cluster_hclust()
runs a hierarchical clustering using the psi scores in df_psi
as clustering criteria.
df_cluster <- distantia::distantia_cluster_hclust( df = df_psi )$df df_cluster |> dplyr::select(name, cluster) |> dplyr::glimpse()
The function distantia_spatial()
transforms the result of distantia()
to an sf data frame with edges connecting time series coordinates or polygons. The result can be interpreted as a dissimilarity network.
sf_network <- distantia::distantia_spatial( df = df_psi, sf = distantia::covid_counties |> dplyr::select( name, geometry ), network = TRUE ) dplyr::glimpse(sf_network)
The resulting sf data frame has a field with the edge name, the columns x
and y
with the names of the connected time series, the psi scores and p-values of the dissimilarity data frame, a geometry column of type LINESTRING defining the network edges, and the length
of the edges.
This sf data frame can be mapped right away, but in this case there are too many pairs of counties to achieve a meaningful map.
mapview::mapview( sf_network, layer.name = "Psi", label = "edge_name", zcol = "psi", color = distantia::color_continuous() )
Focusing on particular aspects of the data at hand may help untangle this mess. For example, the code below subsets edges connecting with San Francisco and its most similar counties in terms of Covid19 prevalence.
#select pairs of counties counties <- c( "Los_Angeles", "San_Francisco", "Fresno", "San_Joaquin", "Monterey" ) sf_network_subset <- sf_network[ which( sf_network$x %in% counties & sf_network$y %in% counties ), ] #map country polygons and dissimilarity edges mapview::mapview( covid_counties, col.regions = NA, alpha.regions = 0, color = "black", label = "name", legend = FALSE, map.type = "OpenStreetMap" ) + mapview::mapview( sf_network_subset, layer.name = "Psi", label = "edge_name", zcol = "psi", lwd = 5, color = distantia::color_continuous( rev = TRUE ) )
The function distantia_spatial()
also has a one-to-many mode designed to help map the dissimilarity of one time series against all others.
sf_network <- distantia::distantia_spatial( df = df_psi, sf = distantia::covid_counties |> dplyr::select( name, geometry ), network = FALSE ) dplyr::glimpse(sf_network)
This option does not create edges, and instead uses the geometry introduced in the sf
argument. This sf data frame is mapped as follows: when one site in the "x" column is selected, the result shows the dissimilarity scores of all other sites against the selected one.
#subset one county sf_network_subset <- sf_network[ sf_network$x == "San_Francisco", ] #one-to-many visualization mapview::mapview( sf_network_subset, layer.name = "Psi", label = "y", zcol = "psi", col.regions = distantia::color_continuous( rev = TRUE ) )
Mapping the dissimilarity stats of each time series may help identify places that are somewhat special because they show a high dissimilarity with all others, or places that are average and have no distinctive features.
Merging the dissimilarity stats with the sf data frame containing the county polygons generates the spatial data required for the map.
sf_stats <- merge( x = distantia::covid_counties, y = df_stats ) |> dplyr::select( mean, name )
The map below uses warm colors to highlight places with a high dissimilarity with all others, such as Humboldt county in the north west of the state.
mapview::mapview( sf_stats, layer.name = "Psi mean", zcol = "mean", label = "name", col.regions = distantia::color_continuous( rev = TRUE ), alpha.regions = 1 )
This section shows how to map the similarity groups resulting from distantia::distantia_cluster_hclust()
or distantia::distantia_cluster_kmeans()
.
The first block in the code below joins the county polygons in distantia::covid_polygons
with the clustering data frame df_cluster
. This block also generates the variable "alpha" from the column "silhouette_width", which represents the strength of membership to the assigned clustering group.
The second block generates the map, using colors for cluster membership, and the variable alpha to code the strenght of membership. This setup allows identifying groups of similar counties while highlighting counties that somehow do not fully belong to their given group. This is the case of Butte, which is in the group 5, but with a very low silhouette score.
#join county polygons with clustering groups sf_cluster <- distantia::covid_counties |> dplyr::select(name, geometry) |> dplyr::inner_join( y = df_cluster, by = "name" ) |> #remap silhouette score to transparency values dplyr::mutate( alpha = f_rescale_local( x = silhouette_width, new_min = 0.1 ) ) mapview::mapview( sf_cluster, layer.name = "Group", zcol = "cluster", label = "name", col.regions = distantia::color_continuous( n = max(sf_cluster$cluster) ), alpha.regions = sf_cluster$alpha )
momentum()
The momentum
functions quantify the contribution to dissimilarity of individual variables in multivariate time series. The results of momentum
functions are slightly harder to map due to a high cardinality (number of time series times number of variables), but distantia
provides some tools to face this challenge.
This example uses the data eemian_pollen
and eemian_coordinates
, which comprises nine irregular time series of pollen counts in Central Uurope from the Last Interglacial (Eemian period). This data requires a Hellinger transformation to facilitate dissimilarity analyses.
tsl <- distantia::tsl_initialize( x = distantia::eemian_pollen, name_column = "name", time_column = "time" ) |> distantia::tsl_transform( f = distantia::f_hellinger ) distantia::tsl_plot( tsl = tsl, guide_columns = 4 )
The function momentum_dtw()
computes variable importance using a jackknife approach. The column "importance" represents contribution to similarity and dissimilarity with negative and positive values, respectively.
df_importance <- momentum_dtw( tsl = tsl ) head(df_importance, n = 20)
The function momentum_spatial()
, just like distantia_spatial()
, transforms the results of momentum
functions to an sf data frame with edges.
sf_network <- distantia::momentum_spatial( df = df_importance, sf = distantia::eemian_coordinates |> dplyr::select( name, geometry ), network = TRUE ) dplyr::glimpse(sf_network)
Notice that it separates the importance of each variable to a separate column, and also has two character columns named "most_similarity" and "most_dissimilarity" with the names of the variables with the highest and lowest importance scores.
The map below shows the pairs of most similar time series connected with edges coded after the name of the variable that contributes the most to their similarity.
mapview::mapview( distantia::eemian_coordinates, layer.name = "Eemian sites - m.a.s.l", label = "name", zcol = "elevation", col.regions = distantia::color_continuous() ) + mapview::mapview( sf_network |> dplyr::filter( psi < mean(psi) ), layer.name = "Contributes to Similarity", label = "most_similarity", zcol = "most_similarity", color = distantia::color_discrete( n = length(unique(sf_network$most_similarity)) ) )
Filtering the input even further allows to focus on specific features of the dataset at hand. For example, the map below highlights pairs of time series with high similarity driven mostly by the taxa "Frangula".
mapview::mapview( distantia::eemian_coordinates, layer.name = "Eemian sites - m.a.s.l", label = "name", zcol = "elevation", col.regions = distantia::color_continuous() ) + mapview::mapview( sf_network |> dplyr::filter( psi < mean(psi), most_similarity == "Frangula" ), layer.name = "Contributes to Similarity", label = "most_similarity", zcol = "most_similarity", color = distantia::color_discrete( n = 1 ) )
Conversely, we can also map what variable makes these sites more different.
mapview::mapview( distantia::eemian_coordinates, layer.name = "Eemian sites - m.a.s.l", label = "name", zcol = "elevation", col.regions = distantia::color_continuous() ) + mapview::mapview( sf_network |> dplyr::filter( psi < mean(psi), most_similarity == "Frangula" ), layer.name = "Contributes to Dissimilarity", label = "most_dissimilarity", zcol = "most_dissimilarity", color = distantia::color_discrete( n = length(unique(sf_network$most_dissimilarity)) ) )
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.