knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
message("No package ggplot2 available. Code chunks using that package will not be evaluated.")
Usually, yield data comes with many noisy observations. This vignette will show
how to preprocess yield data to remove both, spatial and global outliers. The
protocol for error removal follows the protocol proposed by @Vega2019. Functions
from this package are used in FastMapping software [@Paccioretti2020]. For the
tutorial we will use the barley dataset that comes with the paar package.
The barley data contains barley grain yield which were obtained using
calibrated commercial yield monitors, mounted on combines equipped with DGPS.
The data is not a sf object format. We will convert it to an sf object first.
First, we will load the paar package, the sf package for spatial data
manipulation, ggplot2 for plotting, and the barley dataset that comes
with the paar package.
library(paar) library(sf) require(ggplot2) data("barley", package = 'paar')
The barley dataset is a data.frame object. We will convert it to a sf
object using the st_as_sf function. The coords argument specifies the
columns that contain the coordinates. The crs argument specifies the
coordinate reference system. The barley dataset is in UTM zone 20S.
barley_sf <- st_as_sf(barley, coords = c("X", "Y"), crs = 32720)
The barley_sf object is now an sf object. We can plot the data to visualize the yield data.
plot function can be used to plot the data.plot(barley_sf["Yield"])
ggplot2 package can be used to plot the data.ggplot(barley_sf) + geom_sf(aes(color = Yield)) + scale_color_viridis_c() + theme_minimal()
Let's see the yield values distribution.
hist function can be used to plot the histogram.hist(barley_sf$Yield, main = 'Yield values distribution')
ggplot2 package can be used to plot the histogram.ggplot(barley_sf) + geom_histogram(aes(x = Yield)) + theme_minimal()
The protocol proposed by [@Vega2019], is implemented in the function depurate
and consists of three steps:
1. Remove border observations (edges).
2. Remove global outliers (outliers).
3. Remove spatial outliers (inliers).
The depurate function takes an sf object as input and returns an object
of class paar. Any combination of the three steps can be done using
the depurate function. The argument to_remove specifies which steps to
perform. The argument y specifies the column name of the variable to be
cleaned. A field boundary is necessary to remove the edges observations.
If a polygon is not provided in the poly_border argument, the function will
make a hull, around the data and remove the observation that are 10m from the
hull. The hull is made using concaveman::concaveman function if the package
is installed, otherwise, the sf::st_convex_hull function is used.
barley_clean_paar <- depurate(barley_sf, y = 'Yield', toremove = c("edges", "outlier", "inlier"))
The depurate function returns an object of class paar. The paar object
contains the cleaned data ($depurated_data), and the condition of each
observation ($condition). If the condition is NA means that the observation
was not removed.
barley_clean_paar
The summary function can be used to get a summary of the percentage of
considered outlier and the number of observations removed. The summary
function returns a data.frame object.
summary_table <- summary(barley_clean_paar) summary_table
Filtered dataset can be extracted from the paar object using the $depurated_data
barley_clean <- barley_clean_paar$depurated_data
Final Yield values distribution can be plotted.
plot function can be used to plot yield values.plot(barley_clean["Yield"])
ggplot2 package can be used to plot yield values.ggplot(barley_clean) + geom_sf(aes(color = Yield)) + scale_color_viridis_c() + theme_minimal()
A comparison can be made between the original data and the cleaned data.
message('Package ggplot2 is not available.')
ggplot(barley_sf) + geom_sf(aes(color = Yield)) + scale_color_viridis_c() + theme_minimal()
ggplot(barley_clean) + geom_sf(aes(color = Yield)) + scale_color_viridis_c() + theme_minimal()
Also, the distribution of the yield values can be compared.
ggplot(barley_sf, aes(x = Yield)) + geom_histogram()
ggplot(barley_clean, aes(x = Yield)) + geom_histogram()
The condition of each observation can be combined to the original data using the
cbind function. The paar object must be used as first argument in the
cbind function.
barley_sf <- cbind(barley_clean_paar, barley_sf)
The barley_sf object now contains the condition of each observation.
The condition column contains the condition of each observation. The
condition can be NA if the observation was not removed, edges if the
observation was removed in the edges step, outlier if the observation
was removed in the outliers step, and inlier if the observation was
removed in the inliers step. Results can be plotted to visualize the
observations.
plot function can be used to plot the condition of each observation.plot(barley_sf[,'condition'], col = as.numeric(as.factor(barley_sf$condition))) legend("topright", legend = levels(as.factor(barley_sf$condition)), fill = 1:4)
ggplot2 package can be used to plot the condition of each observation.ggplot(barley_sf) + geom_sf(aes(color = condition)) + scale_fill_viridis_d() + scale_color_discrete( labels = function(k) {k[is.na(k)] <- "normal"; k}, na.value = "#44214234") + theme_minimal()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.