source("code/before_script.R")
This chapter has minimal software prerequisites, as it primarily uses base R. Only, the sf\index{sf} package is used to check the results of an algorithm we will develop to calculate the area of polygons. In terms of prior knowledge, this chapter assumes you have an understanding of the geographic classes introduced in Chapter \@ref(spatial-class) and how they can be used to represent a wide range of input file formats (see Chapter \@ref(read-write)).
Chapter \@ref(intro) established that geocomputation is not only about using existing tools, but developing new ones, "in the form of shareable R scripts and functions". This chapter teaches these building blocks of reproducible code. It also introduces low-level geometric algorithms, of the type used in Chapter \@ref(gis). Reading it should help you to understand how such algorithms work and to write code that can be used many times, by many people, on multiple datasets. The chapter cannot, by itself, make you a skilled programmer. Programming is hard and requires plenty of practice [@abelson_structure_1996]:
To appreciate programming as an intellectual activity in its own right you must turn to computer programming; you must read and write computer programs --- many of them.
There are strong reasons for learning to program. Although this chapter does not teach programming itself --- see resources such as @wickham_advanced_2019, @gillespie_efficient_2016, and @xiao_gis_2016 which teach programming in R and other languages --- it does provide some starting points, focused on geometry data, that could form a good foundation for developing programming skills.
The chapter also demonstrates and highlights the importance of reproducibility\index{reproducibility}. The advantages of reproducibility go beyond allowing others to replicate your work: reproducible code is often better in every way than code written to be run only once, including in terms of computational efficiency, 'scalability' (the capability of code to run on large datasets) and ease of adapting and maintaining it.
Scripts are the basis of reproducible R code, a topic covered in Section \@ref(scripts).
Algorithms are recipes for modifying inputs using a series of steps, resulting in an output, as described in Section \@ref(geometric-algorithms).
To ease sharing and reproducibility, algorithms can be placed into functions.
That is the topic of Section \@ref(functions).
The example of finding the centroid\index{centroid} of a polygon will be used to tie these concepts together.
Chapter \@ref(geometry-operations) already introduced a centroid\index{centroid} function st_centroid()
, but this example highlights how seemingly simple operations are the result of comparatively complex code, affirming the following observation [@wise_gis_2001]:
One of the most intriguing things about spatial data problems is that things which appear to be trivially easy to a human being can be surprisingly difficult on a computer.
The example also reflects a secondary aim of the chapter which, following @xiao_gis_2016, is "not to duplicate what is available out there, but to show how things out there work".
If functions distributed in packages are the building blocks of R code, scripts are the glue that holds them together.
Scripts should be stored and executed in a logical order to create reproducible workflows\index{reproducibility}, manually or with workflow automation tools such as targets [@landau_targets_2021].
If you are new to programming, scripts may seem intimidating when you first encounter them, but they are simply plain text files.
Scripts are usually saved as a file with an extension representing the language they contain, such as .py
for scripts written in Python or .rs
for scripts written in Rust.
R scripts should be saved with a .R
extension and named to reflect what they do.
An example is 11-hello.R
, a script file stored in the code
folder of the book's repository.
11-hello.R
is a simple script containing only two lines of code, one of which is a comment:
# Aim: provide a minimal R script print("Hello geocompr")
The contents of this script are not particularly exciting, but they demonstrate the point: scripts do not need to be complicated.
Saved scripts can be called and executed in their entirety from the R command line with the source()
function, as demonstrated below.
The output of this command shows that the comment is ignored but print()
command is executed:
source("code/11-hello.R")
You can also call R scripts from system shells such as bash
and PowerShell
as follows:
Rscript code/11-hello.R
If your RScript
executable is configured so it is available, the above command will print Hello geocompr
in the system shell.
There are no strict rules on what can and cannot go into script files and nothing to prevent you from saving broken, non-reproducible code, so testing is important.
Lines of code that do not contain valid R should be commented out, by adding a #
to the start of the line, to prevent errors, as shown in line 1 of the 11-hello.R
script.
It is worth following some basic rules:
Ctrl+Shift+R
, which creates 'foldable' code section headingsif (!exists("x")) source("x.R")
(which would run the script file x.R
if the object x
is missing).
]Unless you organise your code into a package, it is hard to enforce reproducibility in R scripts, but there are tools that can help. By default, RStudio \index{RStudio} 'code checks' R scripts and underlines faulty code with a red wavy line, as shown in Figure \@ref(fig:codecheck). The repex package is another tool that can help with reproducibility.
knitr::include_graphics("images/codecheck.png")
``{block2 spellcheck, type='rmdnote'}
A useful tool for reproducibility is the **reprex** package.
Its main function
reprex()` tests lines of R code to check if they are reproducible, and it generates markdown output to facilitate communication on sites such as GitHub.
See the webpage reprex.tidyverse.org for details.
\index{reproducibility} The contents of this section apply to any type of R script. A particular consideration with scripts for geocomputation is that they tend to have external dependencies, such as the GDAL dependency needed for core R packages for working with geographic data, which we made heavy use of in Chapter \@ref(read-write) for data import and export. GIS software dependencies may be needed to run more specialist geoalgorithms, as outlined in Chapter \@ref(gis). Scripts for working with geographic data also often require input datasets to be available in specific formats. Such dependencies should be mentioned as comments or suitable place in the project of which it is a part, or described as dependencies with tools such as the **renv** package or Docker. 'Defensive' programming techniques and good error messages can save time by checking for dependencies and communicating with users if certain requirements are not met. If statements, implemented with `if ()` in R, can be used to send messages or run lines of code if, and only if, certain conditions are met. The following lines of code, for example, send a message to users if a certain file is missing: ```r if (!file.exists("required_geo_data.gpkg")) { message("No file, required_geo_data.gpkg is missing!") }
The work undertaken by the 11-centroid-alg.R
script is demonstrated in the reproducible example below, which creates a prerequisite object named poly_mat
, representing a square with sides 9 units in length.
This example shows that source()
works with URLs, assuming you have an internet connection.
If you do not, the same script can be called with source("code/11-centroid-alg.R")
, assuming that you have previously downloaded the github.com/geocompx/geocompr repository and that you are running R from the geocompr
folder.
poly_mat = cbind( x = c(0, 9, 9, 0, 0), y = c(0, 0, 9, 9, 0) ) # Short URL to code/11-centroid-alg.R in the geocompr repo source("https://t.ly/0nzj")
poly_mat = cbind( x = c(0, 9, 9, 0, 0), y = c(0, 0, 9, 9, 0) ) if (curl::has_internet()) { source("https://raw.githubusercontent.com/geocompx/geocompr/main/code/11-centroid-alg.R") } else { source("code/11-centroid-setup.R") }
Algorithms\index{algorithm} can be understood as the computing equivalent of a baking recipe. They are a complete set of instructions which, when undertaken on the inputs result in useful/tasty outcomes. Inputs are ingredients such as flour and sugar in the case of baking, data and input parameters in the case of algorithms. And while tasty cakes may result from a baking recipe, successful algorithms should have computational outcomes with environmental/social/other benefits. Before diving into a reproducible example, the brief history below shows how algorithms relate to scripts (covered in Section \@ref(scripts)) and functions (which can be used to generalize algorithms and make them more portable and easy-to-use, as we'll see in Section \@ref(functions)).
The word "algorithm"\index{algorithm} originated with the publication of an early math textbook in Baghdad in the year 825. The book was translated into Latin and became so popular that the author's last name, al-Khwārizmī, "was immortalized as a scientific term: Al-Khwarizmi became Alchoarismi, Algorismi and, eventually, algorithm" [@bellos_alex_2011]. In the computing age, algorithm\index{algorithm} refers to a series of steps that solves a problem, resulting in a predefined output. Inputs must be formally defined in a suitable data structure [@wise_gis_2001]. Algorithms often start as flow charts or pseudocode\index{pseudocode} showing the aim of the process before being implemented in code. To ease usability, common algorithms are often packaged inside functions, which may hide some or all of the steps taken (unless you look at the function's source code, see Section \@ref(functions)).
Geoalgorithms\index{geoalgorithm}, such as those we encountered in Chapter \@ref(gis), are algorithms that take geographic data in and, generally, return geographic results (alternative terms for the same thing include GIS algorithms and geometric algorithms). That may sound simple, but it is a deep subject with an entire academic field, Computational Geometry, dedicated to their study [@berg_computational_2008] and numerous books on the subject. @orourke_computational_1998, for example, introduces the subject with a range of progressively harder geometric algorithms using reproducible and freely available C code.
An example of a geometric algorithm is one that finds the centroid\index{centroid} of a polygon. There are many approaches to centroid\index{centroid} calculation, some of which work only on specific types of spatial data. For the purposes of this section, we choose an approach that is easy to visualize: breaking the polygon into many triangles and finding the centroid\index{centroid} of each of these, an approach discussed by @kaiser_algorithms_1993 alongside other centroid algorithms [and mentioned briefly in @orourke_computational_1998]. It helps to further break down this approach into discrete tasks before writing any code (subsequently referred to as step 1 to step 4, these could also be presented as a schematic diagram or pseudocode\index{pseudocode}):
These steps may sound straightforward, but converting words into working code requires some work and plenty of trial-and-error, even when the inputs are constrained: The algorithm will only work for convex polygons, which contain no internal angles greater than 180°, no star shapes allowed (packages decido and sfdct can triangulate non-convex polygons using external libraries, as shown in the algorithm vignette hosted at geocompx.org).
The simplest data structure representing a polygon is a matrix of x and y coordinates in which each row represents a vertex tracing the polygon's border in order where the first and last rows are identical [@wise_gis_2001]. In this case, we will create a polygon with five vertices in base R, building on an example from GIS Algorithms [@xiao_gis_2016 see github.com/gisalgs for Python code], as illustrated in Figure \@ref(fig:polymat):
# show where the data came from: source("code/11-centroid-setup.R")
# generate a simple matrix representation of a polygon: x_coords = c(10, 20, 12, 0, 0, 10) y_coords = c(0, 15, 20, 10, 0, 0) poly_mat = cbind(x_coords, y_coords)
Now that we have an example dataset, we are ready to undertake step 1 outlined above.
The code below shows how this can be done by creating a single triangle (T1
), that demonstrates the method; it also demonstrates step 2 by calculating its centroid\index{centroid} based on the formula $1/3(a + b + c)$ where $a$ to $c$ are coordinates representing the triangle's vertices:
# create a point representing the origin: Origin = poly_mat[1, ] # create 'triangle matrix': T1 = rbind(Origin, poly_mat[2:3, ], Origin) C1 = (T1[1,] + T1[2,] + T1[3,]) / 3
# (Note: drop = FALSE preserves classes, resulting in a matrix): C1_alternative = (T1[1, , drop = FALSE] + T1[2, , drop = FALSE] + T1[3, , drop = FALSE]) / 3
# initial plot: can probably delete this: old_par = par(pty = "s") plot(poly_mat, cex = 3) lines(poly_mat, lwd = 7) lines(T1, col = "#fa8072", lwd = 2) text(x = C1[1], y = C1[2], "C1", col = "#fa8072") par(old_par)
Step 3 is to find the area of each triangle, so a weighted mean accounting for the disproportionate impact of large triangles is accounted for. The formula to calculate the area of a triangle is as follows [@kaiser_algorithms_1993]:
$$ \frac{A_x ( B_y − C_y ) + B_x ( C_y − A_y ) + C_x ( A_y − B_y )}{ 2 } $$
Where $A$ to $C$ are the triangle's three points and $x$ and $y$ refer to the x and y dimensions.
A translation of this formula into R code that works with the data in the matrix representation of a triangle T1
is as follows (the function abs()
ensures a positive result):
# calculate the area of the triangle represented by matrix T1: abs(T1[1, 1] * (T1[2, 2] - T1[3, 2]) + T1[2, 1] * (T1[3, 2] - T1[1, 2]) + T1[3, 1] * (T1[1, 2] - T1[2, 2])) / 2
This code chunk outputs the correct result.^[ The result can be verified with the following formula (which assumes a horizontal base): area is half of the base width times height, $A = B * H / 2$. In this case $10 * 10 / 2 = 50$. ] The problem is that code is clunky and must by retyped if we want to run it on another triangle matrix. To make the code more generalizable, we will see how it can be converted into a function in Section \@ref(functions).
Step 4 requires steps 2 and 3 to be undertaken not just on one triangle (as demonstrated above) but on all triangles.
This requires iteration to create all triangles representing the polygon, as illustrated in Figure \@ref(fig:polycent).
lapply()
\index{loop!lapply} and vapply()
\index{loop!vapply} are used to iterate over each triangle here because they provide a concise solution in base R:^[
See ?lapply
for documentation and Chapter \@ref(location) for more on iteration.
]
i = 2:(nrow(poly_mat) - 2) T_all = lapply(i, function(x) { rbind(Origin, poly_mat[x:(x + 1), ], Origin) }) C_list = lapply(T_all, function(x) (x[1, ] + x[2, ] + x[3, ]) / 3) C = do.call(rbind, C_list) A = vapply(T_all, function(x) { abs(x[1, 1] * (x[2, 2] - x[3, 2]) + x[2, 1] * (x[3, 2] - x[1, 2]) + x[3, 1] * (x[1, 2] - x[2, 2]) ) / 2 }, FUN.VALUE = double(1))
# idea: show animated version on web version source("code/11-polycent.R")
We are now in a position to complete step 4 to calculate the total area with sum(A)
and the centroid\index{centroid} coordinates of the polygon with weighted.mean(C[, 1], A)
and weighted.mean(C[, 2], A)
(exercise for alert readers: verify these commands work).
To demonstrate the link between algorithms\index{algorithm} and scripts, the contents of this section have been condensed into 11-centroid-alg.R
.
We saw at the end of Section \@ref(scripts) how this script can calculate the centroid\index{centroid} of a square.
The great thing about scripting the algorithm is that it works on the new poly_mat
object (see exercises below to verify these results with reference to st_centroid()
):
source("code/11-centroid-alg.R")
The example above shows that low-level geographic operations can be developed from first principles with base R.
It also shows that if a tried-and-tested solution already exists, it may not be worth reinventing the wheel:
if we aimed only to find the centroid\index{centroid} of a polygon, it would have been quicker to represent poly_mat
as an sf object and use the preexisting sf::st_centroid()
function instead.
However, the great benefit of writing algorithms from first principles is that you will understand every step of the process, something that cannot be guaranteed when using other peoples' code.
A further consideration is performance: R may be slow compared with low-level languages such as C++\index{C++} for number crunching (see Section \@ref(software-for-geocomputation)) and optimization is difficult.
If the aim is to develop new methods, computational efficiency should not be prioritized.
This is captured in the saying "premature optimization is the root of all evil (or at least most of it) in programming" [@knuth_computer_1974].
Algorithm\index{algorithm} development is hard.
This should be apparent from the amount of work that has gone into developing a centroid\index{centroid} algorithm\index{algorithm} in base R\index{R} that is just one, rather inefficient, approach to the problem with limited real-world applications (convex polygons are uncommon in practice).
The experience should lead to an appreciation of low-level geographic libraries such as GEOS\index{GEOS} and CGAL\index{CGAL} (Computational Geometry Algorithms Library) which not only run fast but work on a wide range of input geometry types.
A great advantage of the open source nature of such libraries is that their source code\index{source code} is readily available for study, comprehension and (for those with the skills and confidence) modification.^[
The CGAL\index{CGAL} function CGAL::centroid()
is in fact composed of seven sub-functions as described at https://doc.cgal.org/latest/Kernel_23/groupcentroidgrp.html allowing it to work on a wide range of input data types, whereas the solution we created works only on a very specific input data type.
The source code underlying GEOS\index{GEOS} function Centroid::getCentroid()
can be found at https://github.com/libgeos/geos/search?q=getCentroid.
]
Like algorithms\index{algorithm}, functions take an input and return an output. Functions\index{function}, however, refer to the implementation in a particular programming language, rather than the 'recipe' itself. In R, functions\index{function} are objects in their own right, that can be created and joined together in a modular fashion. We can, for example, create a function that undertakes step 2 of our centroid\index{centroid} generation algorithm\index{algorithm} as follows:
t_centroid = function(x) { (x[1, ] + x[2, ] + x[3, ]) / 3 }
The above example demonstrates two key components of functions: (1) the function body, the code inside the curly brackets that define what the function does with the inputs; and (2) the arguments, the list of arguments the function works with --- x
in this case (the third key component, the environment, is beyond the scope of this section).
By default, functions return the last object that has been calculated (the coordinates of the centroid\index{centroid} in the case of t_centroid()
).^[
You can also explicitly set the output of a function by adding return(output)
into the body of the function, where output
is the result to be returned.
]
body(t_centroid) formals(t_centroid) environment(t_centroid)
The function now works on any inputs you pass it, as illustrated in the below command which calculates the area of the first triangle from the example polygon in the previous section (see Figure \@ref(fig:polycent)).
t_centroid(T1)
We can also create a function\index{function} to calculate a triangle's area, which we will name t_area()
:
t_area = function(x) { abs( x[1, 1] * (x[2, 2] - x[3, 2]) + x[2, 1] * (x[3, 2] - x[1, 2]) + x[3, 1] * (x[1, 2] - x[2, 2]) ) / 2 }
Note that after the function's creation, a triangle's area can be calculated in a single line of code, avoiding duplication of verbose code:
functions are a mechanism for generalizing code.
The newly created function\index{function} t_area()
takes any object x
, assumed to have the same dimensions as the 'triangle matrix' data structure we've been using, and returns its area, as illustrated on T1
as follows:
t_area(T1)
We can test the generalizability of the function\index{function} by using it to find the area of a new triangle matrix, which has a height of 1 and a base of 3:
t_new = cbind(x = c(0, 3, 3, 0), y = c(0, 0, 1, 0)) t_area(t_new)
A useful feature of functions is that they are modular.
Provided that you know what the output will be, one function can be used as the building block of another.
Thus, the functions t_centroid()
and t_area()
can be used as sub-components of a larger function\index{function} to do the work of the script 11-centroid-alg.R
: calculate the area of any convex polygon.
The code chunk below creates the function poly_centroid()
to mimic the behavior of sf::st_centroid()
for convex polygons.^[
Note that the functions we created are called iteratively in lapply()
\index{loop!lapply} and vapply()
\index{loop!vapply} function calls.
]
poly_centroid = function(poly_mat) { Origin = poly_mat[1, ] # create a point representing the origin i = 2:(nrow(poly_mat) - 2) T_all = lapply(i, function(x) {rbind(Origin, poly_mat[x:(x + 1), ], Origin)}) C_list = lapply(T_all, t_centroid) C = do.call(rbind, C_list) A = vapply(T_all, t_area, FUN.VALUE = double(1)) c(weighted.mean(C[, 1], A), weighted.mean(C[, 2], A)) }
# a slightly more complex version of the function with output set poly_centroid = function(poly_mat, output = "matrix") { Origin = poly_mat[1, ] # create a point representing the origin i = 2:(nrow(poly_mat) - 2) T_all = T_all = lapply(i, function(x) { rbind(Origin, poly_mat[x:(x + 1), ], Origin) }) C_list = lapply(T_all, t_centroid) C = do.call(rbind, C_list) A = vapply(T_all, t_area, FUN.VALUE = double(1)) centroid_coords = c(weighted.mean(C[, 1], A), weighted.mean(C[, 2], A)) if (output == "matrix") { return(centroid_coords) } else if (output == "area") return(sum(A)) }
poly_centroid(poly_mat)
Functions\index{function}, such as poly_centroid()
, can further be extended to provide different types of output.
To return the result as an object of class sfg
, for example, a 'wrapper' function can be used to modify the output of poly_centroid()
before returning the result:
poly_centroid_sfg = function(x) { centroid_coords = poly_centroid(x) sf::st_point(centroid_coords) }
We can verify that the output is the same as the output from sf::st_centroid()
as follows:
poly_sfc = sf::st_polygon(list(poly_mat)) identical(poly_centroid_sfg(poly_mat), sf::st_centroid(poly_sfc))
In this chapter we have moved quickly, from scripts to functions via the tricky topic of algorithms\index{algorithm}. Not only have we discussed them in the abstract, but we have also created working examples of each to solve a specific problem:
11-centroid-alg.R
was introduced and demonstrated on a 'polygon matrix'poly_centroid()
in the previous sectionEach of these may seem straightforward. However, skillful programming is complex and involves combining each element --- scripts, algorithms and functions --- into a system, with efficiency and style. The outcome should be robust and user-friendly tools that other people can use. If you are new to programming, as we expect most people reading this book will be, being able to follow and reproduce the results in the preceding sections is a major achievement. Programming takes many hours of dedicated study and practice before you become proficient.
The challenge facing developers aiming to implement new algorithms\index{algorithm} in an efficient way is put in perspective by considering the amount of work that has gone into creating a simple function that is not intended for use in production: in its current state, poly_centroid()
fails on most (non-convex) polygons!
This raises the question: how to generalize the function?
Two options are (1) to find ways to triangulate non-convex polygons (a topic covered in the online Algorithms Extended article hosted at geocompx.github.io/geocompkg/articles/) and (2) to explore other centroid algorithms that do not rely on triangular meshes.
A wider question is: is it worth programming a solution at all when high performance algorithms have already been implemented and packaged in functions such as st_centroid()
?
The reductionist answer in this specific case is 'no'.
In the wider context, and considering the benefits of learning to program, the answer is 'it depends'.
With programming, it's easy to waste hours trying to implement a method, only to find that someone has already done the hard work.
You can understand this chapter as a stepping stone towards geometric algorithm programming wizardry.
However, it can also be seen as a lesson in when to try to program a generalized solution, and when to use existing higher-level solutions.
There will surely be occasions when writing new functions is the best way forward, but there will also be times when using functions that already exist is the best way forward.
"Do not reinvent the wheel" applies as much, if not more, to programming than to other walks of life. A bit of research and thinking at the outset of a project can help decide where programming time is best spent. Three principles can also help maximize use of your effort when writing code, whether it's a simple script or package that is composed of hundreds of functions:
We cannot guarantee that this chapter will instantly enable you to create perfectly formed functions for your work. We are, however, confident that its contents will help you decide when is an appropriate time to try (when no other existing functions solve the problem, when the programming task is within your capabilities and when the benefits of the solution are likely to outweigh the time costs of developing it). By using the principles above, in combination with the practical experience of working through the examples above, you will build your scripting, package-writing and programming skills. First steps towards programming can be slow (the exercises below should not be rushed), but the long-term rewards can be large.
res = knitr::knit_child('_11-ex.Rmd', quiet = TRUE, options = list(include = FALSE, eval = FALSE)) cat(res, sep = '\n')
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.