knitr::opts_chunk$set(eval = FALSE,
                      echo = TRUE
                      )

The code, written in R, applies a spatial clustering process to a spatial point data set.

The percolate() function processes this data set to generate the clusters for a range of percolation radius values. It also generates and stores a set of data tables for use by the other functions, as well as optional external processing, as described below.

mapClusters() prints maps of the generated clusters, overlaid on a given shape file.

plotClustFreq() plots analysis results for the clusters showing the key characteristics of the clusters generated for a given point data set.

A working directory path needs to be defined before running the functions. The functions will create the following sub-directories off this path:

We suggest that you also create /source_data and /shape_files directories.

Graphical outputs are generated as png files with an A4 format.

Percolate

percolate(data, distance_table = NULL, limit, radius_unit, upper_radius, lower_radius, step_value)

The data needs to be provided as a .csv file in the format below:

PlcIndex,Easting,Northing

The PlcIndex field is a provided identity for each point, which is specified by the coordinates. The resolution of the coordinates is assumed to be in metres.

In addition to the spatial data, parameters are passed to the percolate function as below:

distance_table is normally NULL. This file is provided if you compute your own inter-point distances, for example to be weighted. See later for more details of this option.

upper_radius is the upper value of the percolation radius range to be used, scale is m x radius unit

lower_radius radius is the lower value of the radius range to be used, scale is m x radius unit

step_value is the step value to be used between these two values, scale is m x radius unit (NOTE: this value can be decimal, e.g. 0.2)

limit is the value above which distances between sites will not be stored, scale is m x radius unit. This is typically above the value at which all points are within a single cluster. This is used to avoid creating unnecessarily large files with redundant data.

radius_unit is the scale for computing the radius values. The assumption is that coordinates are in metres. A value of 1000 computes the radius in km, a value of 1 in metres etc.

The first function to be called, with the above example parameters, is:

“percolate(data, , 40, 2, 1, 50, 1000)"

The percolate program first computes an inter-point distance matrix. Given the dataset of points with XY coordinates, it computes a partial matrix of inter-point Euclidean distances using Pythagoras’ theorem. The limit parameter sets the maximum value stored, in this case to 50km. For example, with a spatial dataset covering the whole of Britain, there is little point in storing the distance between points at the extremes of the country. This reduces the data file sizes and speeds up data file handling. The distance for each pair of points is computed and stored only once.

Duplicate and null entries are identified, and if present these data points are stored in space delimited text files, ‘duplicate_entries.txt’ and ‘null_entries.txt’.

The inter-point distance table comprises a data frame consisting of: first point identity; second point identity; the distance between them, rounded to two decimal places of the unit value. This is stored as a space delimited text file called ‘nodes_list_d.txt’. The format is space separated with columns: ID1 ID2 d12 – the identifiers for point 1, point 2 and the distance between them.

This data is then used to generate the clusters. However, if so desired, a distance matrix may be provided to the program. So, for example, weighted values may be used rather than the Euclidean distance between points, or the computed Euclidean distances may be modified in some way (e.g. applying a factor or non-linear scaling). The distances should be rounded to a maximum of 2 decimal places. In this case, the program will skip the computation, and use the given data file, passed as a parameter, as for example:

“percolate(data, distance_table, 40, 2, 1, , 1000)”

Note that the various other parameters are still needed for the cluster computation, with the exception of the limit value, which is irrelevant.

The program now generates the clusters for each radius value required, starting at the largest radius value. The radius values to be used are computed from the upper and lower radius values, decremented by the given step value. In this example, the values start at 40 and decrement to 2, with a step value of 1; all are in km. This data is stored for use by the mapping function, in the working directory. For each radius value, data is extracted from the distance table, including only those point-pairs where the inter-point distance <= the current radius value, creating a sub-matrix of the original point data set. This sub-matrix is then considered as an edge list for a graph.

The following R functions from library ‘igraph’ are then applied to this sub-matrix as follows:

graph.edgelist() – creates a graph from the sub-matrix, with directed=TRUE to count the edges only once.

clusters() – identifies the clusters in the graph and generates a unique identity for each. This identity is then bound to each point (i.e. graph vertex) using the V() function.

A table is created, indexed by the given point identity (PlcIndex), with a column for each computed radius value containing the cluster identity for that radius value (assuming it is a member of a cluster, else it is NULL). As the computation progresses, columns are added to this table. This table is stored as “member_cluster_by_radius.csv”.

This table is then processed to generate analytics, so that for each radius value, the number of clusters, the maximum, mean and median number of points per cluster are all computed and stored in a file called “analysis_by_radius.csv”.

These tables are stored in the working data directory, for use by the other routines as well as for any desired additional processing that the user may undertake.

Mapping Clusters

If required the clusters can be mapped using the mapClusters() function.

This function will map the clusters created by percolation. It needs the following input:

mapClusters(shape, map_name, source_file_name, dpi=300)

shape – an imported shape file, read using the readOGR() function from the rgdal library, for example “Namibia.shp” (SpatialPolygonDataFrame), which will be used for plotting as a background map, and determining the coordinate system.

map_name – a string to be printed at the head of the maps, e.g. “Archaeological sites in Namibia”

source_file_name – a string to be printed at the bottom of the map as a reference to the source data used to generate the cluster maps, e.g. “point_data_Namibia.v4”

dpi - an integer, the resolution for the maps, defaults to 300. Higher resolution may be required for publications, for example, or lower values for exploratory plots.

The call for this example would be: mapClusters(shape, “Archaeological sites in Namibia”, “point_data_Namibia.v4”, 400)

The radius values are taken from the file generated by the percolate() function, to ensure consistency.

The map projection is extracted from the shape file and applied to the maps generated. This means the input point data needs to be in the same coordinate and projection system as the background map. A base map is first generated with all the points overlaying the shape file, as a basic check. Then, for each radius value a cluster map is generated; all points are plotted as a small cross in pale grey, and then overlaid with the points as dots coloured according to the rank of their cluster. The largest 15 clusters are coloured, red for the largest, blue for the next and so on. For clusters outside this grouping a mid-grey is used. Noise, that is points that are not within a cluster, are not overlaid and remain as a small grey cross. All the maps are stored in the /map directory. png is used as the map format, being highly portable and does not apply compression. Code for generating multi-page pdf files as an alternative is included in the function but is commented out.

Cluster Analysis

The final step, if required, is the plot the statistics for the data.

plotClutsFreq(source_file_name)

For example: plotClustFreq(“point_data_Namibia.v4”)

All of the data required is taken from the file “analysis_by_radius.csv” in the /working_data directory. The source file name is passed as a parameter for inclusion in the plots as a subtitle. These plots, are generated as png files and stored in the /analysis_results directory.

Three png-files are generated:

a) radius to maximum cluster size, b) radius to mean cluster size and c) radius to normalized max. cluster size

The three functions are provided separately, to more quickly identify issues with data, maps and formats, without having to run the entire suite each time.

Example for the whole process with example data set:

The example point coordinate data (named "hillforts") is taken from the:

Atlas of Hillforts of Britain and Ireland

Lock, G. and Ralston, I. 2017. Atlas of Hillforts of Britain and Ireland. [ONLINE] Available at: https://hillforts.arch.ox.ac.uk

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License

The point coordinate data was extracted from the database for hillforts in Britain (England, Scotland, Wales and the Isle of Man) which have status of interpretation as "confirmed". This was extracted on: 25/07/2017

The example shape file (named "GB_shape") is an outline of Britain, taken from:

Ordnance Survey OS Open Data, uner a UK Open Government Licence: https://www.ordnancesurvey.co.uk/opendatadownload/products.html#OVERGB

This file Contains OS data © Crown Copyright (2020)

Included is as well a distance table, which has been derived by calculating the square root of the distance between the hillfort point data and adding it to the original value (called “hillfort_distances”). It will be used to show how to implement percolation analysis with a given distance table.

The three datasets will be loaded by

data(data)

It is recommended you set a working directory and a string naming your source file, which will be used to label maps and plots

#setwd()

# the source file name is a text string for inclusion in maps and plots
source_file_name <- "example_coords.csv"

For the percolate function several parameters as indicated above should be set. For clarity they are defined here beforehand.

# set up radius parameters
# edit the values to suit. All value units as defined by the radius unit
# In this case the radius unit is 1000 == km

limit <- 50
# no values larger than limit will be stored in the distance matrix
radius_unit <- 1000
# the base unit is metres, so a value of 1000 defines the unit as km, for example
upper_radius <- 40
# upper or maximum radius value used
lower_radius <- 2
# lower or minimum value radius unit used
step_value <- 1
# the step value for working between upper and lower radius values. 
# This may be a decimal value or an integer

With these parameters defined, call the percolate function. Note that the second parameter defaults to "points". Any other value will interpret the data as a distance table (see above for details). This function call will use the point data set to compute a distance table, then generate a cluster table for the radius and step values given.

percolate(hillforts,,limit,radius_unit,upper_radius,lower_radius,step_value)

Use this call INSTEAD if you pass a distance table:

percolate(hillforts,hillfort_distances,limit,radius_unit,upper_radius,lower_radius,step_value)

Now for mapping the clusters.

At first the parameters need to be set. For plotting the maps with an outline or background, a shape file is needed. In the data set, it is included with the name "GB_shape". If you wish to use another one, follow the steps detailed here. Note: the shape file needs to have the same coordinate system as for the point data! The coordinate system is extracted from the given shape file.

# Edit for the required shape file for plot background/outline.
# We recommend creating a sub-directory for shape files
shape_file_path <- "example\data.shp"
shape_file_name <- "example.shp"

# Load shape file
# Truncate shape file name to remove extension
layer_name <- substr(shape_file_name,1,(nchar(shape_file_name)-4))
file_name <- paste(shape_file_path,"/",shape_file_name,sep="")
shape <- readOGR(dsn=file_name,layer=layer_name)

Now set the other parameters needed:

# map name for plot title
map_name <- "Example archaeological data percolation plot"
# Set the resolution for the output files. 300 is usually 
#  adequate but higher may be needed for publications
dpi <- 300

Now all parameters are set, call the mapping function. This will plot a map of all points, followed by a cluster plot for each stepped value of the percolation radius. These will be put into a subdirectory "maps" off the root directory.

# Here source_file_name is used as indicated above
mapClusters(GB_shape,map_name,source_file_name,dpi)

Now you can plot the cluster frequency. This function creates the cluster frequency plots and places them in the cluster_analysis sub directory. The name parameter is used to title the plots.

# source_file_name is used to title the plots
plotClustFreq(source_file_name)

CONTACT DETAILS:

Simon Maddison simon.maddison@btinternet.com

Sophie C. Schmidt s.c.schmidt@uni-koeln.de



SCSchmidt/percopackage documentation built on May 2, 2020, 12:14 p.m.