CAWaR - CAWa project R package

``` {r}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

``` {r, message=FALSE}
# load packages
library(fieldRS)
library(raster)
library(ggplot2)
library(knitr)
library(kableExtra)
library(CAWaR)
```

### CAWaR - Supporting Sustainable Water Usage Through Open-source Software

CAWaR^[CAWaR was developed at the University of Würzburg, Germany, as part of the project Research Network Water in Central Asia (CAWa) funded by the German Federal Foreign Office. It is part of the German Water Initiative for Central Asia (the so-called “Berlin process”), which the German Federal Foreign Office launched on 1 April 2008 in order to support peaceful, sustainable and mutually beneficial management of transboundary water resources in Central Asia.] was built for scientists and public authorities in Central Asia to guide crop type classification and mapping in support of water use efficiency. In the context of climate change, the risk of water scarcity is a lingering issue within arid regions such as Central Asia. Here, agricultural production is highly dependent on irrigation and sensitive to periods of water scarcity, which can potentially lead to food insecurity. As a result, understanding the spatial and temporal distribution of crops is essential, offering an understanding of local water requirements and helping build more efficient water management plans.


### Example data

CAWaR is closely related to fieldRS. As a result, it makes use of some of its example data, namely:

* `ndvi` - a `RasterStack` with 5 time steps of Normalized Difference Vegetation Index (NDVI) images.
* `fieldData` - ground-truth data on crop types collected in the field.
* `referenceProfiles` - NDVI profiles of the target crop types.

Most datasets can be accessed with the `data()` function, with the exception of the raster data. This is because raster data is stored in a temporary location when loaded and cannot be saved as an R object. Below we can see how to load each dataset into R.

``` {r message=FALSE}
ndvi.ts <- brick(system.file("extdata", "ndvi.tif", package="fieldRS")) # NDVI raster time series
data(fieldData) # ground truth data
data(referenceProfiles) # target crop types NDVI profiles
```

```r
ndvi.ts <- extend(ndvi.ts, 30) # enlarge the raster extent
```


### Can I use my ground-truth data?

To perform a cropland classification we require samples of different crop types that help us describe their temporal and spatial variability. To do so, we often travel to the field to collect ground-truth data using Global Positioning Systems (GPS) and visual interpretation. This data is valuable and provides us with precise information that is useful to train a classifier and validate its results. However, this data often comes with certain inconsistencies. When digitizing ground-truth data, errors such as the misspelling of classes and the overlapping of samples can occur and subsequently hinder future processing steps, leading to time-consuming re-runs. To aid in the identification of errors, we developed a series of tools:

Let's first start with `checkSamples()`. This function checks if a shapefile of ground-truth data contains the relevant information fields: the sample label, the sampling date and the name of the sampler.

While knowing the label is a clear requirement, knowing the date and the name of the sampler can also be useful. First, knowing the sampling date can clear doubts when checking samples visually. In areas where intra-annual crop rotation occurs, samples with date information can help us better decipher complex temporal profiles created by multiple growth cycles related to different crops. Second, knowing the name of the sampler can help us clear additional doubts. Often, different people assume responsibility for collecting and for analyzing the ground-truth data. Consequently, the analyst may lack knowledge of the sampling sites that only the sampler can provide. As shown below, calling `checkSamples()` will return a `data.frame` reporting on the existence of the nominated fields and the consistency of their format.

``` {r}
sampleTest <- checkSamples(fieldData)
```

``` {r, out.width="98%", fig.height=5, fig.width=10, dpi=600, fig.align="center", fig.show='hold', echo=FALSE}
kable_styling(kable(head(sampleTest, 1), format="html", align="c"), full_width=TRUE, bootstrap_options=c("striped", "responsive"))
```

<p align="justify" style="line-height:200%;">
Once we clear this step, we can check for other common issues. First, let us check for geometric errors. For example, if we conduct multiple field campaigns in the same area, we will sometimes find that we sampled a crop field more than once, leading to duplicated samples. In order to address these issues, we can use `geCheck()`. The function will test all the polygons in a shapefile against each other, report on any existing overlaps and return their spatial extent. Let us test the function.
</p>

``` {r}
sampleTest <- geCheck(fieldData)
```

``` {r, out.width="98%", fig.height=5, fig.width=10, dpi=600, fig.align="center", fig.show='hold', echo=FALSE}
kable_styling(kable(head(sampleTest, 1), format="html", align="c"), full_width=TRUE, bootstrap_options=c("striped", "responsive"))
```

<p align="justify" style="line-height:200%;">
The output shows us an empty `data.frame`. This is because `fieldData` has no geometric errors. However, what if two polygons overlap? To test this we will create a shapefile with two overlapping polygons and test `geCheck()` again.
</p>

``` {r}
# build polygons
p1 <- Polygons(list(Polygon(data.frame(x=c(1, 5, 10, 2, 1), y=c(10, 9, 8, 7, 10)))), ID=1)
p2 <- Polygons(list(Polygon(data.frame(x=c(2, 6, 5, 4, 2), y=c(10, 9, 7, 4, 10)))), ID=2)
p <- SpatialPolygons(list(p1, p2))

# check overlap
sampleTest <- geCheck(p)
```

``` {r, out.width="98%", fig.height=5, fig.width=10, dpi=600, fig.align="center", fig.show='hold', echo=FALSE}
kable_styling(kable(head(sampleTest$overlap.df, 1), format="html", align="c"), full_width=TRUE, bootstrap_options=c("striped", "responsive"))
plot(p)
plot(sampleTest$overlap.shp, col="red", add=TRUE)
```

<p align="justify" style="line-height:200%;">
The function returns a `data.frame` and a `SpatialPolygons` object. The first shows the polygon indices that overlap with each other. The second shows the overlapping area between them - shown in red in the example figure.
</p>
<p align="justify" style="line-height:200%;">
After checking the content and geometry of the ground-truth data, we need to check the sample labels, as misspellings can occur when digitizing ground-truth data. However, this task can be tiring and can perpetuate mistakes due to human error: when looking at hundreds (or even thousands) of samples, we often lose track of the existing labels and of our own corrections. To anticipate these issues, we can use `labelCheck()`. Given a vector of labels - in the case of `fieldData`, contained in the field `crop` - the function returns the unique labels as a single vector, a `data.frame` with the distribution of samples per unique label and a plot of that distribution built with `ggplot2`, meaning the plot can be edited and customized for e.g. a user’s scientific reports. Looking at the output, we can check if there are any misspellings.
</p>

```r
sampleCorrect <- labelCheck(fieldData$crop)
sampleCorrect$labels # unique labels
```

``` {r, out.width="98%", fig.height=5, fig.width=10, dpi=600, fig.align="center", fig.show='hold', echo=FALSE}
kable_styling(kable(head(sampleCorrect$label.count, 3), format="html", align="c"), full_width=TRUE, bootstrap_options=c("striped", "responsive"))
sampleCorrect$label.count.plot
```

<p align="justify" style="line-height:200%;">
In this case, we do not seem to have any of these issues. However, for the purpose of this analysis, let us assume we wish to classify "wheat" and "not-wheat". This means we need to rename the classes "cotton" and "bare land". To achieve this, we can provide `labelCheck()` with a corrected set of labels matching the unique labels in length. We will assign the output to a new field in `fieldData` called `crop_2`. In this case, the output `labels` will report the full set of corrected labels instead of the unique occurrences. Additionally, as we saw before, we will obtain a plot with the count of samples per class.
</p>

```r
sampleCorrect <- labelCheck(fieldData$crop, sampleCorrect$labels, c("wheat", "not-wheat", "not-wheat"))
fieldData@data$crop_2 <- sampleCorrect$labels
```

``` {r, out.width="98%", fig.height=5, fig.width=10, dpi=600, fig.align="center", fig.show='hold', echo=FALSE}
kable_styling(kable(head(sampleCorrect$label.count, 3), format="html", align="c"), full_width=TRUE, bootstrap_options=c("striped", "responsive"))
sampleCorrect$label.count.plot
```

<p align="justify" style="line-height:200%;">
The case we reported assumes that we do not know which classes are expected and that we need to check the output manually. However, when we already know which classes to expect, we can set the `auto` argument to `TRUE`. When doing so, the function will compare a reference set of labels against the original ones using the `stringdist()` function of the R package with the same name. This function computes pairwise string distances between the target and reference labels, and `labelCheck()` returns, for each sample, the reference label with the smallest distance.
</p>
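
To illustrate the matching logic - not the `labelCheck()` interface itself - the following minimal sketch uses the `stringdist` package on a few hypothetical, misspelled labels and picks the closest reference label for each of them:

```r
# illustrative sketch only (hypothetical labels, not the labelCheck() API):
# fuzzy-match observed labels against a reference set via string distances
library(stringdist)

observed  <- c("wheatt", "coton", "bare land")  # hypothetical misspelled labels
reference <- c("wheat", "cotton", "bare land")  # labels we expect to find

d <- stringdistmatrix(observed, reference)      # pairwise string distances
reference[apply(d, 1, which.min)]               # closest reference label per sample
```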

</br>

### Extracting temporal profiles
<p align="justify" style="line-height:200%;">
The crop classification algorithm built in the scope of the CAWa project distinguishes crop types based on their unique phenological behaviour. To characterize the growth cycle of a particular crop, polygon-based ground-truth data is quite informative. Since single pixels can be misleading due to e.g. uneven growth patterns within a field, we use all pixels within a polygon and summarize them into a single time series. However, when dealing with polygons, not all overlapping pixels are useful. Along the borders of a sampled field, we might note - as depicted in the image below - that some pixels are only partially covered and potentially shared by fields of different crops. Therefore, a simple averaging of the pixels might not be sufficient to summarize the temporal variability of the crop within the polygon.
</p>

</br>

``` {r, out.width="98%", fig.height=5, fig.width=10, dpi=600, fig.align="center", fig.show='hold', echo=FALSE}
include_graphics("percentCover.jpg")
```

Example figure showing the change in the percent overlap between a polygon and a reference raster grid.
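
As a toy illustration of how this percent cover is used further below - hypothetical values, not output of the package - a cover-weighted mean reduces the influence of partially covered border pixels compared to a simple mean:

```r
# toy example (hypothetical values): percent cover as weights for one time step
pixel.ndvi <- c(0.62, 0.58, 0.35, 0.20)  # NDVI of the pixels overlapping a polygon
pixel.cover <- c(100, 95, 40, 10)        # percent overlap with the polygon

mean(pixel.ndvi)                          # simple mean, border pixels fully counted
weighted.mean(pixel.ndvi, w=pixel.cover)  # cover-weighted mean, border pixels down-weighted
```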


To address this issue, we need to use three functions. First, we are going to use `poly2sample()` from the `fieldRS` R package. This function loops through each polygon in a `SpatialPolygons` object and returns all overlapping pixels as a `SpatialPointsDataFrame`. Additionally, for each sample, the function will report the percent overlap between the corresponding pixel and the reference polygon. Remember to set the `preserve.id` argument to `TRUE` to make sure the polygon IDs are kept and reported.

``` {r}
fieldData2 <- poly2sample(fieldData, ndvi.ts, preserve.id=TRUE)
data(fieldData2)
```

The `SpatialPointsDataFrame` object can then be provided to `extractTS()`. This function will iterate through each polygon in a shapefile, extract the values of a series of `RasterLayer` objects and derive a mean time series for each polygon, weighted by the percent cover of each corresponding pixel. If the raster objects are provided as a `RasterStack`, the function will use the `extract()` function of the `raster` package to extract its values. However, if your satellite data has different extents and/or projections (e.g. Landsat), reprojecting and cropping each raster layer can be time-consuming. To avoid this, you can provide a list of pre-read `RasterLayer` objects instead. In this case, `extractTS()` will use the `extract2()` function, which iterates through each element in the list and applies the `extract()` function. Internally, the `extract()` function reprojects the reference samples to match the `RasterLayer` when necessary. Once this data is collected, `extractTS()` will estimate a weighted mean profile for each unique identifier. In this process, border pixels contribute less to the final time series than the central ones, minimizing the influence of mixed pixels. Let's apply this function to `fieldData2`. As environmental predictors, we will use the `ndvi.ts` `RasterBrick` provided through `fieldRS`. The function returns a weighted-mean time series as well as the original pixel data - with coordinates and percent cover values - and a summary of each original polygon, reporting the minimum, maximum and mean cover and the count of pixels.

``` {r}
fieldDataTS <- extractTS(fieldData2, ndvi.ts, fieldData2$cover, fieldData2$id)
```

``` {r, out.width="98%", fig.height=5, fig.width=10, dpi=600, fig.align="center", fig.show='hold', echo=FALSE}
data(fieldDataTS)
kable_styling(kable(head(fieldDataTS$pixel.info, 5), format="html", align="c"), full_width=TRUE, bootstrap_options=c("striped", "responsive"))
kable_styling(kable(head(fieldDataTS$polygon.info, 5), format="html", align="c"), full_width=TRUE, bootstrap_options=c("striped", "responsive"))
kable_styling(kable(head(fieldDataTS$weighted.mean, 5), format="html", align="c"), full_width=TRUE, bootstrap_options=c("striped", "responsive"))
```

</br>

### Building reference profiles: What should each crop type look like?
<p align="justify" style="line-height:200%;">
Before, we learned how to deal with misspellings. Now, we will learn how to deal with mislabeling. This is a frequent issue. When in the field, some crops can look very similar - especially in early stages of growth - and become hard to distinguish. Therefore, the sampler may report the wrong label. When using mislabeled samples for classification, we might face poor validation results, and it can be hard - if not impossible - to improve the classification results as long as such mistakes remain uncorrected. Thus, it is important that we check each polygon a priori to assure its veracity. `analyseTS()` can help us in this task. The function will derive statistics for each unique class in `fieldData` and use the median of each time step to build a plot. While we expect some samples to be mislabeled, we assume most of them are correct and that the median offers an accurate depiction of a reference profile.
</p>
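
As a minimal illustration of why the per-time-step median is robust to the occasional mislabeled sample (hypothetical NDVI values, not package output):

```r
# hypothetical NDVI profiles of three samples labeled "wheat";
# the third one is a mislabeled sample with a very different profile
wheat.profiles <- rbind(c(0.15, 0.40, 0.65, 0.50, 0.28),
                        c(0.18, 0.38, 0.70, 0.48, 0.25),
                        c(0.90, 0.20, 0.10, 0.15, 0.12))

apply(wheat.profiles, 2, median)  # per-time-step median, barely affected by the outlier
apply(wheat.profiles, 2, mean)    # per-time-step mean, pulled towards the outlier
```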

```r
checkTS1 <- analyseTS(as.data.frame(fieldDataTS$weighted.mean), fieldData$crop_2)
```

``` {r, out.width="98%", fig.height=5, fig.width=10, dpi=600, fig.align="center", fig.show='hold', echo=FALSE, message=FALSE}
checkTS1$plots[[1]]
checkTS1$plots[[2]]
```

<p align="justify" style="line-height:200%;">
Additionally, the function will correlate the profile of every sample with the median profiles of each unique class. This spares us from inspecting every sample individually. Assuming that correct samples will be highly correlated with the median profiles, poor correlations will likely point to samples whose labels need correction. If this happens, we can again use `labelCheck()` for the correction.
</p>
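
The idea behind this check can be sketched as follows - hypothetical values, not the function's implementation:

```r
# rough sketch (hypothetical values, not the analyseTS() implementation):
# correlate one sample profile against the class median profiles
sample.profile <- c(0.15, 0.35, 0.62, 0.55, 0.30)             # one polygon, 5 time steps
median.profiles <- list("wheat"     = c(0.18, 0.40, 0.65, 0.50, 0.28),
                        "not-wheat" = c(0.20, 0.22, 0.25, 0.30, 0.33))

r <- sapply(median.profiles, function(m) cor(sample.profile, m))  # Pearson correlations
r                    # a low value for the labeled class hints at a mislabel
names(which.max(r))  # class with the most similar median profile
```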

``` {r, out.width="98%", fig.height=5, fig.width=10, dpi=600, fig.align="center", fig.show='hold', echo=FALSE}
kable_styling(kable(head(checkTS1$r2, 5), digits=c(2,2), format="html", align="c"), full_width=TRUE, bootstrap_options=c("striped", "responsive"))
```

If the user wishes to check the profiles of every sample, `analyseTS()` can still be useful. Instead of providing classes, the user can provide unique IDs for each polygon. This way, the function will build a plot for each sample, which can be saved separately using `ggsave()`, as shown below.

```r
checkTS2 <- analyseTS(as.data.frame(fieldDataTS$weighted.mean), as.character(1:length(fieldData)))

# save one plot per sample, named after its label
for (p in 1:length(fieldData)) {
  ggsave(paste0("./", checkTS2$labels[p], ".png"), checkTS2$plots[[p]], width=10, height=10, units="cm")
}
```

Double-checking the quality of the NDVI temporal profiles is indeed necessary. In our experience, apart from the occasional misspelling, we often find mislabeling. Such mistakes hinder the quality of any predictive model and can be difficult to track. However, looking through thousands of plots can be exhausting, leading us to commit additional errors. To minimize this workload, we can use `compareLabel()`. This function will compare each row in a `data.frame` (e.g. weighted mean time series derived with `extractTS()`) against a set of reference profiles (e.g. median profiles derived with `analyseTS()`). The function correlates the target and reference profiles and compares the labels of the profiles with the highest correlation. If they are not the same, the function will return NA. Additionally, the function reports on the distribution of NA values, helping the user decide which profiles remain unclear. Below we see an example.

``` {r}
# retrieve reference profiles as a data.frame
reference.profiles <- as.data.frame(do.call(rbind, lapply(checkTS1$y.statistics, function(i) {i$median})))

# compare original data and its statistical reference
cl <- compareLabel(as.data.frame(fieldDataTS$weighted.mean), reference.profiles, fieldData$crop_2, checkTS1$labels)
```

``` {r, out.width="98%", fig.height=5, fig.width=10, dpi=600, fig.align="center", fig.show='hold', echo=FALSE}
kable_styling(kable(head(cl$label.compare, 5), digits=c(2,2), format="html", align="c"), full_width=TRUE, bootstrap_options=c("striped", "responsive"))
```

<p align="justify" style="line-height:200%;">
The output shows us the original label (`x.label`), the most likely label based on the reference data (`y.label`), the Pearson correlation between the target and the reference (`pearson`) and a logical variable (`compare`) showing whether the original and predicted labels match.
</p>


</br>

### PhenoCropVal: A new concept of land cover validation!
<p align="justify" style="line-height:200%;">
Up to this point, we have completed the pre-processing of `fieldData` and derived reference profiles for the classes that should be distinguished. However, before we build a map from this, we might want to check how good we can expect the classification to be. This avoids time-consuming raster processing and helps anticipate the need for further pre-processing. `phenoCropVal()` does just that. In practice, this algorithm will use `analyseTS()` to build reference profiles. The main difference lies in how it separates training and validation data. The function requires a vector assigning each sample to a group. It then holds out each group for validation while the remaining samples are used to build reference profiles for each class with `analyseTS()`. The function will then call `phenoCropClass()` to classify the validation profiles and will identify the samples that passed and failed the test. This process is repeated for each group in each class. For each class, the function derives an F1-score as an accuracy measure, estimated from the total number of true and false positives collected at each iteration.
</p>
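
As a reminder of how such an F1-score is computed (hypothetical counts, not output of `phenoCropVal()`):

```r
# hypothetical counts for one class, summed over all validation iterations
tp <- 42  # true positives
fp <- 6   # false positives
fn <- 9   # false negatives

precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
2 * precision * recall / (precision + recall)  # F1-score
```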

``` {r, out.width="98%", fig.height=5, fig.width=10, dpi=600, fig.align="center", fig.show='hold', echo=FALSE}
include_graphics("spatialValidation.png")
```

Depiction of the validation process for one class. One group is kept for validation while the remaining ones are used for training.

To assure that neighboring - and therefore spatially auto-correlated - samples are not kept together during validation, we will first use `splitSamples()`. This function performs a spatial clustering approach, labeling samples within a predefined spatial distance of each other as part of the same group. The function provides a vector with the region label for each sample and an account of the pixel frequency per sample.
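
Before applying the function, the grouping idea can be sketched as follows - a minimal, hypothetical example using single-linkage clustering on the polygon centroids, not the `splitSamples()` implementation, and assuming projected coordinates in metres:

```r
# rough sketch (not the splitSamples() implementation): chain together samples
# whose centroids lie within a given distance, so neighbors share a region label
xy <- coordinates(fieldData)             # polygon centroids (sp)
hc <- hclust(dist(xy), method="single")  # single linkage chains nearby samples
region.id <- cutree(hc, h=60)            # cut at a 60 m aggregation radius
table(region.id)                         # number of samples per region
```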

``` {r}
fieldDataCluster <- splitSamples(fieldData, ndvi.ts, fieldData$crop_2, agg.radius=60)
data(fieldDataCluster)
```

``` {r, out.width="98%", fig.height=5, fig.width=10, dpi=600, fig.align="center", fig.show='hold', echo=FALSE}
kable_styling(kable(head(fieldDataCluster$region.frequency, 5), format="html", align="c"), full_width=TRUE, bootstrap_options=c("striped", "responsive"))
```

<p align="justify" style="line-height:200%;">
Now we can use `phenoCropVal()` to check the accuracy of our predictions. The function will return a `data.frame` with class-wise F1-scores as well as a plot based on them.
</p>

```r
cropVal <- phenoCropVal(as.data.frame(fieldDataTS$weighted.mean), fieldData$crop_2, fieldDataCluster$region.id)
```

``` {r, out.width="98%", fig.height=5, fig.width=10, dpi=600, fig.align="center", fig.show='hold', echo=FALSE}
kable_styling(kable(head(cropVal$class.accuracy, 5), digits=c(2,2), format="html", align="c"), full_width=TRUE, bootstrap_options=c("striped", "responsive"))
cropVal$accuracy.plot
```

<p align="justify" style="line-height:200%;">
Additionally, the function will identify which samples were correctly and incorrectly classified through a logical vector named `sample.validation`. In combination with `fieldData`, this result helps us perceive the spatial distribution of errors and, in some cases, might even help us identify poor-quality samples that were missed by the initial checks.
</p>

```r
fieldData$validation <- as.factor(cropVal$sample.validation)
spplot(fieldData["validation"])
```

``` {r, out.width="98%", fig.height=5, fig.width=10, dpi=600, fig.align="center", fig.show='hold', echo=FALSE}
fieldData$validation <- as.factor(cropVal$sample.validation)
spplot(fieldData["validation"])
```



