knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  message("No package ggplot2 available. Code chunks using that package will not be evaluated.")
}
Yield data usually contain many noisy observations. This vignette shows how to
preprocess yield data to remove both spatial and global outliers. The
error-removal procedure follows the protocol proposed by @Vega2019. Functions
from this package are used in the FastMapping software [@Paccioretti2020]. For
the tutorial we will use the `barley` dataset that comes with the `paar`
package.
The `barley` data contain barley grain yield measurements obtained using
calibrated commercial yield monitors mounted on combines equipped with DGPS.
The data are not stored as an `sf` object, so we will convert them to one first.
First, we will load the `paar` package, the `sf` package for spatial data
manipulation, `ggplot2` for plotting, and the `barley` dataset that comes
with the `paar` package.
library(paar)
library(sf)
require(ggplot2)
data("barley", package = 'paar')
The `barley` dataset is a `data.frame` object. We will convert it to an `sf`
object using the `st_as_sf` function. The `coords` argument specifies the
columns that contain the coordinates. The `crs` argument specifies the
coordinate reference system. The `barley` dataset is in UTM zone 20S
(EPSG:32720).
barley_sf <- st_as_sf(barley, coords = c("X", "Y"), crs = 32720)
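As a self-contained illustration of this conversion, the sketch below builds a small hypothetical data frame (the coordinates and yield values are invented and are not taken from `barley`) and checks that the result is an `sf` object with the expected coordinate reference system.

```r
library(sf)

# Hypothetical points in UTM zone 20S; column names mirror those used for barley
toy <- data.frame(
  X = c(300000, 300010, 300020),
  Y = c(6300000, 6300010, 6300020),
  Yield = c(3.1, 2.9, 3.4)
)

# Convert to sf: X and Y become the geometry, Yield stays as an attribute
toy_sf <- st_as_sf(toy, coords = c("X", "Y"), crs = 32720)

class(toy_sf)        # includes "sf" and "data.frame"
st_crs(toy_sf)$epsg  # EPSG code assigned through the crs argument
```

After the conversion, the coordinate columns are replaced by a single `geometry` column, so only `Yield` remains as a regular attribute.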
The `barley_sf` object is now an `sf` object. We can plot it to visualize the
yield data.
The `plot` function can be used to plot the data.
plot(barley_sf["Yield"])
The `ggplot2` package can be used to plot the data.
ggplot(barley_sf) +
  geom_sf(aes(color = Yield)) +
  scale_color_viridis_c() +
  theme_minimal()
Let's look at the distribution of the yield values.
The `hist` function can be used to plot the histogram.
hist(barley_sf$Yield, main = 'Yield values distribution')
The `ggplot2` package can be used to plot the histogram.
ggplot(barley_sf) +
  geom_histogram(aes(x = Yield)) +
  theme_minimal()
The protocol proposed by @Vega2019 is implemented in the function `depurate`
and consists of three steps:
1. Remove border observations (edges).
2. Remove global outliers (outliers).
3. Remove spatial outliers (inliers).
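To give a feel for step 2, here is a minimal base R sketch of a global outlier filter. It assumes a mean plus or minus 3 standard deviations rule and uses made-up yield values; the threshold and the variable names are illustrative assumptions, not a description of what `depurate` does internally.

```r
# Made-up yield values: 20 plausible readings plus one implausible spike
yield <- c(rep(c(2.8, 3.0, 3.2, 3.4), times = 5), 25)

# Assumed rule: flag values further than n_sd standard deviations from the mean
n_sd <- 3
lower <- mean(yield) - n_sd * sd(yield)
upper <- mean(yield) + n_sd * sd(yield)
is_outlier <- yield < lower | yield > upper

# Keep only the observations inside the limits
yield_clean <- yield[!is_outlier]
sum(is_outlier)
```

Spatial outliers (step 3) cannot be found this way, because they are only unusual relative to their neighbours and may sit well inside the global limits.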
The `depurate` function takes an `sf` object as input and returns an object
of class `paar`. Any combination of the three steps can be done using
the `depurate` function. The argument `toremove` specifies which steps to
perform. The argument `y` specifies the column name of the variable to be
cleaned. A field boundary is necessary to remove the edge observations.
If a polygon is not provided in the `poly_border` argument, the function will
build a hull around the data and remove the observations that lie within 10 m
of the hull. The hull is built using the `concaveman::concaveman` function if
that package is installed; otherwise, the `sf::st_convex_hull` function is used.
barley_clean_paar <- depurate(barley_sf, y = 'Yield', toremove = c("edges", "outlier", "inlier"))
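The hull-and-buffer idea behind the edges step can be sketched with `sf` alone. This is an illustration under stated assumptions rather than the code `depurate` runs: the point grid is hypothetical, a convex hull is used, and the 10 m margin is applied as a negative buffer.

```r
library(sf)

# Hypothetical 26 x 26 grid of points spaced 4 m apart, in UTM zone 20S
grid <- expand.grid(X = seq(0, 100, by = 4), Y = seq(0, 100, by = 4))
pts <- st_as_sf(grid, coords = c("X", "Y"), crs = 32720)

# Convex hull around all points, standing in for a missing field boundary
hull <- st_convex_hull(st_union(pts))

# Shrink the hull by 10 m; points outside the shrunken polygon count as edges
inner <- st_buffer(hull, dist = -10)
keep <- lengths(st_intersects(pts, inner)) > 0

pts_no_edges <- pts[keep, ]
nrow(pts_no_edges)
```

A convex hull overestimates the field area when the boundary is irregular, which is presumably why a concave hull is preferred when `concaveman` is available.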
The `depurate` function returns an object of class `paar`. The `paar` object
contains the cleaned data (`$depurated_data`) and the condition of each
observation (`$condition`). A condition of `NA` means that the observation
was not removed.
barley_clean_paar
The `summary` function can be used to get a summary of the percentage of
observations considered outliers and the number of observations removed. The
`summary` function returns a `data.frame` object.
summary_table <- summary(barley_clean_paar)
summary_table
The filtered dataset can be extracted from the `paar` object using the
`$depurated_data` element.
barley_clean <- barley_clean_paar$depurated_data
The spatial distribution of the final yield values can be plotted.
The `plot` function can be used to plot yield values.
plot(barley_clean["Yield"])
The `ggplot2` package can be used to plot yield values.
ggplot(barley_clean) +
  geom_sf(aes(color = Yield)) +
  scale_color_viridis_c() +
  theme_minimal()
A comparison can be made between the original data and the cleaned data.
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  message('Package ggplot2 is not available.')
}
ggplot(barley_sf) + geom_sf(aes(color = Yield)) + scale_color_viridis_c() + theme_minimal()
ggplot(barley_clean) + geom_sf(aes(color = Yield)) + scale_color_viridis_c() + theme_minimal()
Also, the distribution of the yield values can be compared.
ggplot(barley_sf, aes(x = Yield)) + geom_histogram()
ggplot(barley_clean, aes(x = Yield)) + geom_histogram()
The condition of each observation can be combined with the original data using
the `cbind` function. The `paar` object must be used as the first argument in
the `cbind` function.
barley_sf <- cbind(barley_clean_paar, barley_sf)
The `barley_sf` object now contains a `condition` column with the condition of
each observation. The condition can be `NA` if the observation was not removed,
`edges` if the observation was removed in the edges step, `outlier` if the
observation was removed in the outliers step, and `inlier` if the observation
was removed in the inliers step. Results can be plotted to visualize the
observations.
The `plot` function can be used to plot the condition of each observation.
plot(barley_sf[,'condition'],
     col = as.numeric(as.factor(barley_sf$condition)))
legend("topright",
       legend = levels(as.factor(barley_sf$condition)),
       fill = 1:4)
The `ggplot2` package can be used to plot the condition of each observation.
ggplot(barley_sf) +
  geom_sf(aes(color = condition)) +
  scale_fill_viridis_d() +
  scale_color_discrete(
    labels = function(k) {k[is.na(k)] <- "normal"; k},
    na.value = "#44214234") +
  theme_minimal()
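Besides plotting, the condition values can be tabulated. The sketch below uses a hypothetical `condition` vector with the same coding described above (`NA`, `edges`, `outlier`, `inlier`) instead of the real `barley_sf$condition` column, and relabels `NA` as "normal" as in the ggplot2 example.

```r
# Hypothetical condition vector using the coding described in the text
condition <- c(NA, "edges", NA, "outlier", NA, NA, "inlier", "edges", NA, NA)

# Relabel retained observations (NA) so they appear in the table
condition_label <- ifelse(is.na(condition), "normal", condition)

# Counts and percentages per condition
counts <- table(condition_label)
percentages <- round(100 * prop.table(counts), 1)
counts
percentages
```

The same three lines should work on the real data by replacing the toy vector with `barley_sf$condition`, assuming that column is a character vector or is converted with `as.character` first.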