The spectralR package aims to obtain, process, and visualize spectral reflectance data for user-defined earth surface classes, so that one can visually explore in which wavelengths the classes differ. User-defined classes might represent different habitat or vegetation types, crops, land uses, landscapes, or other territories. The input is a shapefile with polygons of surface classes. So far, the single source of spectral data is pixel data from the optical bands of the Sentinel-2 Level-2A satellite mission, obtained through the Google Earth Engine service.
The initial purpose of spectralR development was to quickly and visually explore differences in spectral reflectance patterns for supervised vegetation (and, more broadly, habitat) mapping. While machine learning classification methods, such as RandomForest, typically utilize the full set of available spectral data (satellite bands), we observed that highly similar habitat types (e.g., different types of grasslands, floodplain habitats) might have similar reflectance values in one season and different values in another, so time-series reflectance data are also needed. Without a preliminary selection of the most discriminating satellite bands for given habitat types in given conditions, the models become overfitted and require extensive machine resources for calculation.
The first version of our product was based on three code snippets: two written in R (shapefile preparation and visualization of the results) and one for Google Earth Engine (satellite image preparation and the spectral data sampling itself). That approach, though, required a number of manual operations and downloading big files to a local machine. So we moved to the recent rgee R package, which provides a bridge between R and the Python API for Google Earth Engine. All the operations with satellite images now run in the cloud, and the obtained pixel data are visualized locally afterward. Therefore, the most resource-hungry functions do not overload your local machine despite the large amount of input data. However, a stable Internet connection is required for using the API.
To use rgee, you should have a Google Earth Engine account. If you don't, first register using your Google account.
Depending on the operating system you use and your current Python configuration, running rgee may require the installation of additional R and Python packages. See the rgee Quick Start User's Guide, the official documentation, and the source code of rgee for solving issues that arise during installation and setup.
We strongly encourage you to follow the official rgee installation guide and pay attention to the messages that appear during installation.
The overall workflow is the following:

prepare vector data (a shapefile with polygons of surface classes);
use rgee functionality to retrieve multi-band pixel data for the class polygons from the Google Earth Engine service;
visualize the retrieved pixel data locally, using the ggplot2 approach.

Essential requirements:
Stable Internet connection (for using API)
Installed and correctly pre-configured Python environment (v. 3.5 or above)
Active Google Earth Engine account
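If you are unsure about your current Python setup, you can inspect it from R via the reticulate package, the R-Python bridge that rgee relies on (a quick diagnostic sketch):

```r
# Inspect the Python configuration visible to R.
library(reticulate)
py_available(initialize = TRUE)  # can a Python interpreter be loaded at all?
py_config()                      # which interpreter, version, and paths are in use
```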
Install rgee:

install.packages('rgee') # install regular version from CRAN, more stable
# remotes::install_github("r-spatial/rgee") # to install development version, more recent
Load the library:
library(rgee)
The following is necessary just once, to complete the installation of the dependencies:
ee_install()
If something went wrong at this step, see rgee's installation guide.
Check the non-R dependencies:
ee_check()
The rgee developers recommend installing the version of the Earth Engine Python API that rgee was tested with, using the following command. Although it is called "upgrade", it might actually downgrade your version of earthengine_api. You may up(down)grade or not; in our tests, newer versions of earthengine_api worked flawlessly.
ee_install_upgrade()
Initialize the Google Earth Engine API for the current session:
ee_Initialize()
At this step, you will be prompted to link your Google Earth Engine account with rgee. Follow the instructions in the R console: you will be redirected to a web browser to log into your Google Earth Engine account and confirm access rights. To finalize authentication, paste the generated authorization code into the R console.
If everything went well at the last step and you see a console message confirming successful initialization of the API with your GEE username, congratulations: you have managed to install and configure rgee!
Note that rgee uses the earthengine_api package, which depends on Python. That is why we need to set up a local Python environment to use rgee.
Unfortunately, you have to repeat the environment setup and re-authorization each time Python gets updated. For actively updated operating systems, like regular releases of Ubuntu, that's quite annoying. On the other hand, this built-in repair system protects you from accidentally breaking the earthengine Python environment while installing or using other Python-related tools, which is convenient, at least if you are a weak Python user like me. If you get the error message

"The current Python PATH: /home/user/.virtualenvs/rgee/bin/python does not have the Python package "earthengine-api" installed. Are you restarted/terminated your R session after install miniconda or run ee_install()?"

then follow the instructions that appear in the R console; they will help you delete your previous configuration and set it up again.
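If the environment does end up in such a broken state, a typical repair cycle with rgee's own maintenance helpers might look like this (a sketch; restart your R session between the steps as prompted):

```r
library(rgee)
ee_clean_pyenv()   # remove the broken rgee Python virtual environment
# ...restart the R session here...
ee_install()       # recreate the Python environment
ee_Initialize()    # re-authenticate with your Google Earth Engine account
```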
spectralR was designed on top of the rgee, sf, and ggplot2 R packages, which you should install and set up before using spectralR. While the other dependencies should be installed automatically during the spectralR installation process, it is strongly recommended to also check that the packages geojsonio, dplyr, reshape2, tibble, and tidyr are installed and work properly.
So far, spectralR can be installed from GitHub sources, although we are planning to land it on CRAN soon.

library(remotes)
install_github("olehprylutskyi/spectralR")
We offer two datasets to illustrate the functionality of spectralR through selected use cases. The first dataset, 'Homilsha Forests', illustrates "small data": a small area and a small number of polygons (8) belonging to a few categories (5 land-cover classes). The area belongs to the Homilsha Forests National Park, situated in the Kharkiv region (Eastern Ukraine). The polygons were created manually in GIS software. Habitat borders and assignments were based on a field survey and adjusted according to the satellite imagery. The land cover categories correspond to the CORINE classification, level 3 (available via https://land.copernicus.eu/), but some of them have shortened names: cropland = 2.1.1 Non-irrigated arable land; coniferous forest = 3.1.2 Coniferous forest; meadow = 3.2.1 Natural grasslands; reed = 4.1.1 Inland marshes; water = 5.1.2 Water bodies.
The second dataset, 'Buzkyi Gard', represents "large data": a large area with complicated topography, hundreds of polygons, and dozens of classes. End users should not notice large differences in the workflow, but under the hood we implemented two different algorithms for "small" and "large" spatial data, which will be explained later.
# Reset R's brain before the new analysis session starts. It will erase all the
# objects stored in R memory, while keeping loaded libraries.
rm(list = ls())

# Load required packages
library(tidyverse)
library(rgee)
library(sf)
library(geojsonio)
library(reshape2)
library(spectralR)
Function prepare.vector.data takes a shapefile with polygons of different classes of surface (habitats, crops, vegetation, etc.) and returns a ready-for-sampling sf object. One should specify the shapefile name (the file should be within the working directory; absolute paths have not been tested) and the name of the field that contains the class labels. The function extracts the geometries of all polygons, marks them with your custom labels ('label'), and automatically assigns an integer class ID ('class') to each entry. The latter variable is required because the Google Earth Engine sampler accepts only numerical class IDs, so don't delete any field! Don't worry: the resulting dataframe will be marked with your custom labels, not hard-to-memorize numbers.
While preparing a shapefile with custom polygons in desktop GIS software (e.g., QGIS), it is highly recommended to follow these recommendations:
if possible, draw polygons in homogeneous landscapes and avoid class mixture;
keep geometries simple, if possible. Avoid multipolygons, holes inside polygons, etc.;
tiny polygons (same size as satellite imagery resolution or lesser) tend to produce a stronger "edge effect";
few big polygons are easier to process than a lot of small ones;
GEE has its own memory limitation, which may result in extended processing time.
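Most of the geometry recommendations above can be checked, and partly enforced, with plain sf tools before the shapefile reaches prepare.vector.data. A sketch (the file name and the 100 m² threshold are illustrative placeholders):

```r
library(sf)
polys <- st_read("your_polygons.shp")  # hypothetical input shapefile
polys <- st_make_valid(polys)          # repair invalid geometries (self-intersections, etc.)
polys <- st_cast(polys, "POLYGON")     # split multipolygons into simple polygons
# Flag tiny polygons comparable to a single 10 m Sentinel-2 pixel
tiny <- st_area(polys) < units::set_units(100, m^2)
summary(tiny)
```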
# Extract polygons from shapefile and prepare sf object with proper structure
sf_df <- prepare.vector.data(
  system.file("extdata/test_shapefile.shp", package = "spectralR"),
  "veget_type"
)
Explore the resulting spatial object:
head(sf_df)
The example above uses an internal test shapefile. To use your own data, put all the shapefile's files into your working directory and use the following syntax (uncomment the row and change the file and field names before executing):
# sf_df <- prepare.vector.data("your-shapefile-within-working-directory-name.shp", "name-of-the-field-with-class-labels")
Function get.pixel.data is the heart of spectralR. It takes the sf polygon object obtained at the previous step and retrieves a data frame with reflectance values for each pixel intersecting the polygons, for each optical band of the Sentinel-2 sensor, marked with the surface class label of the corresponding polygon. get.pixel.data starts with the conversion of the sf object into a GEE feature collection. Then it prepares a satellite image for pixel sampling, performs the sampling, and finally exports the resulting object back into an R data frame (non-spatial now).
One of the trickiest and most issue-prone steps of this procedure is the conversion between a GEE feature collection and an R sf object. rgee implements three ways to do so:

through the JSON translator (getInfo, the default);
using Google Drive (getInfo_to_asset);
using Google Cloud Storage (gcs_to_asset).
The first one is the most straightforward and quick, but usable only for small data (fewer than 15,000 entries, or local files under 1.5 MB). The other two require transient storage for your data between conversions (Earth Engine is a Google service and is seamlessly integrated with both Google Drive and Google Cloud Storage).
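To anticipate which pathway your data will take, you can check a couple of rough proxies yourself; this is an illustration, not the exact internal test used by get.pixel.data:

```r
library(sf)
library(geojsonio)
# Size of the GeoJSON representation that would travel through the JSON translator
gj <- geojson_json(sf_df)
print(object.size(gj), units = "MB")
# The number of features is another useful proxy for the 15,000-entry limit
nrow(sf_df)
```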
In spectralR, we use the first two methods. get.pixel.data assesses the size of your sf object or feature collection, then activates either the getInfo or the getInfo_to_asset pathway. If your data is considered "small", getInfo will be used, and you will notice nothing special. But if your data is larger than our arbitrary threshold, the getInfo_to_asset method will be activated, and you may be prompted to authorize in Google Drive and allow rgee to access GDrive files and folders. In such a case, please follow the instructions in the R console and then in your web browser. You only need to do this once, when you perform your first large query: your credentials will be stored in your local earthengine environment.
After authorization and first use, a folder rgee_backup will be created in your Google Drive storage, where all the intermediate files will be stored. If you run out of storage, get.pixel.data won't be able to use your Google Drive as intermediate storage, and you will receive an error. Although get.pixel.data overwrites objects each time it is launched, we strongly recommend cleaning the rgee_backup folder in your Drive from time to time, either manually or using the following command from the rgee package:
ee_clean_container(name = "rgee_backup", type = "drive", quiet = FALSE)
To use the get.pixel.data function, we need to specify several arguments:

polygons of surface classes as an sf object, prepared at the previous step;
the starting day of the Sentinel image collection, in "YYYY-MM-DD" format (see Note 1 below);
the final day of the Sentinel image collection, in "YYYY-MM-DD" format;
the cloud threshold (maximum percentage of cloud-covered pixels per image) by which individual satellite images will be filtered;
the scale of the resulting satellite images in meters (pixel size) (see Note 2 below).
The resulting pixel data will be saved within a working directory and can be loaded during the following sessions.
Note 1. A particular satellite image is usually not ready for instant sampling: it contains clouds, cloud shadows, and aerosols, and may not cover the whole territory of interest. Another issue is that each particular pixel differs slightly in reflectance between images taken on different days, due to differences in atmospheric conditions and the angle of sunlight at the moments the images were taken. Google Earth Engine has built-in algorithms for image pre-processing, atmospheric correction, and mosaicing, which allow one to obtain a ready-to-use, rectified image. The approach used in spectralR is to find the median value for each pixel across several images within each of the 10 optical bands, thereby making a composite image. To define the set of images over which the median is calculated, we should set a timespan for the image collection. The Sentinel-2 apparatus takes a picture once every five days, so if you set up a month-long timespan, you can expect each pixel value to be calculated from 5 to 6 values (remember, some images might be unacceptably cloudy).
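For the curious, the median-compositing idea can be sketched directly in rgee syntax; this illustrates the approach, not the exact code inside get.pixel.data:

```r
library(rgee)
ee_Initialize()
# Cloud-filtered Sentinel-2 Level-2A collection reduced to a per-pixel median composite
composite <- ee$ImageCollection("COPERNICUS/S2_SR")$
  filterDate("2019-05-15", "2019-06-30")$
  filter(ee$Filter$lt("CLOUDY_PIXEL_PERCENTAGE", 10))$
  median()
```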
Note 2. The finest resolution for Sentinel data is 10 m, while using a larger scale value decreases the required computational resources and the size of the resulting data frame. Although sampling satellite data is performed in the cloud, there are some limitations on geocalculations imposed by GEE itself. If you are about to sample large areas, consider setting a larger 'scale' value (e.g., 100, 1000). Read more in the GEE best practices.
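The effect of the scale argument on data volume is simple arithmetic: the number of sampled pixels shrinks with the square of the pixel size. A back-of-the-envelope sketch:

```r
area_m2 <- 10000 * 10000  # a hypothetical 10 x 10 km area of interest
pixels_at <- function(scale) area_m2 / scale^2
pixels_at(10)   # 1e+06 pixels at the native 10 m resolution
pixels_at(100)  # 1e+04 pixels at 100 m, i.e. 100 times fewer
```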
# Get pixel data
reflectance <- get.pixel.data(sf_df, "2019-05-15", "2019-06-30", 10, 100)

# Save pixel data for further sessions
save(reflectance, file = "reflectance_test_data")
Here we chose a 100 m scale (the pixel size of the resulting imagery is 100x100 m), which resulted in a sampling dataset of 2060 rows. A finer pixel size would result in a larger sampling dataset, which would require using intermediate storage (see use case 2).
load(file = "./reflectance_test_data") # restore previously saved pixel data
Let's have a look at the resulting data:
head(reflectance)
We have a data frame whose number of rows equals the number of sampled "pixels" of the satellite image, with 10 variables containing reflectance values for each optical band of Sentinel-2. Each sampled pixel carries the surface class label of the user's polygon it intersected.
Note: spectralR returns true reflectance values (between 0 and 1), transforming GEE's values (which are multiplied by 10000 for calculation convenience).
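So, should you ever compare these values with raw GEE band values, the relationship is a plain division by the scale factor (a sketch with made-up numbers):

```r
raw_dn <- c(459, 1287, 3050)        # hypothetical raw Sentinel-2 L2A band values
true_reflectance <- raw_dn / 10000  # 0.0459 0.1287 0.3050, as returned by spectralR
```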
First of all, one should explore the quality and comprehensiveness of obtained pixel data.
# load(file = "./reflectance_test_data") # restore previously saved pixel data
summary(factor(reflectance$label)) # how many pixels in each class?
For reliable results, it is recommended to keep the surface classes of similar size. Classes represented by only a few sampled pixels should be excluded from further analysis.
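Excluding under-sampled classes is a one-liner with dplyr (a sketch; the 50-pixel threshold is an arbitrary example):

```r
library(dplyr)
reflectance_filtered <- reflectance %>%
  group_by(label) %>%
  filter(n() >= 50) %>%  # keep only classes with at least 50 sampled pixels
  ungroup()
```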
Visual overview of pixel data
# Number of spectral values for different classes
ggplot(reflectance, aes(x = label)) +
  geom_bar() +
  labs(x = 'Classes of surface', y = 'Number of pixels',
       title = "Total pixel data for different classes of surface",
       caption = 'Data: Sentinel-2 Level-2A') +
  theme_minimal()
Function spectral.curves.plot transforms the data and plots smoothed curves for each surface class, using ggplot2's geom_smooth(). For large data of thousands of rows (and we need large data for reliable conclusions!), ggplot2 uses the GAM method for drawing the trendline. Depending on the data size, the processing may take some time.
# Make a ggplot object
p1 <- spectral.curves.plot(reflectance)

# Default plot
p1
Since the output of spectral.curves.plot is a ggplot object, we can apply any tools provided by the ggplot2 package to make it more visually pleasing.
# Customized plot
p1 +
  labs(x = 'Wavelength, nm', y = 'Reflectance',
       colour = "Surface classes", fill = "Surface classes",
       title = "Spectral reflectance curves for different surface classes",
       caption = 'Data: Sentinel-2 Level-2A') +
  theme_minimal()
You can save the plot as a *.png file (or in another standard graphical format) using the ggsave or png() functions.
ggsave("Spectral_curves_usecase1.png", width=16, height=12, unit="cm", dpi=300)
Function stat.summary.plot makes a plot with a statistical summary of reflectance values (mean, mean - standard deviation, mean + standard deviation) for the defined surface classes. Given reflectance data as input, the function returns a ggplot2 object with basic visual aesthetics. The default aesthetics are lines with a statistical summary for each satellite band (geom_line + geom_pointrange).
Wavelength values (nm) are acquired from the mean known value for each optical band of the Sentinel 2 sensor.
# Make a ggplot object
p2 <- stat.summary.plot(reflectance)

# Default plot
p2
Add a touch of customization.
# Customized plot
p2 +
  labs(x = 'Sentinel-2 bands', y = 'Reflectance',
       colour = "Surface classes",
       title = "Reflectance for different surface classes",
       caption = 'Data: Sentinel-2 Level-2A\nmean ± standard deviation') +
  theme_minimal()
Save the plot as a *.png file
ggsave("Statsummary_usecase1.png", width=16, height=12, unit="cm", dpi=300)
Function violin.plot helps to visualize reflectance as violin plots for each surface class per satellite band. It takes reflectance data as input and returns a ggplot2 object with basic visual aesthetics. The default aesthetic is geom_violin.
# Make a ggplot object
p3 <- violin.plot(reflectance)

# Customized plot
p3 +
  labs(x = 'Surface class', y = 'Reflectance',
       fill = "Surface classes",
       title = "Reflectance for different surface classes",
       caption = 'Data: Sentinel-2 Level-2A') +
  theme_minimal()

# Save the plot as a *.png file
ggsave("Violins_usecase1.png", width = 22, height = 16, unit = "cm", dpi = 300)
Buzkyi Gard National Park is located in the Southern Bug valley (south-western Ukraine, 31.08325, 47.90720). The National Park is characterized by unique steppe-zone landscapes with diverse vegetation and flora (Shyriaieva et al., 2022). Natural habitats of high conservation value, i.e., a wide range of dry grasslands, rocky outcrops, and steppe forests, are concentrated in the river valleys and surrounded by croplands. The habitat types were defined using precise vegetation data (phytosociological relevés with GPS locations) and field mapping data for the artificial habitats. The classification followed the new revised version of the EUNIS expert system (version 2021-06-01), with additions from the old EUNIS 2007 for non-revised habitat groups and the National Habitat Catalogue of Ukraine. As a result, the dataset contains 380 hand-drawn polygons of 26 different EUNIS habitat types.
Environment preparation
# Reset R's brain before the new analysis session starts. It will erase all the
# objects stored in R memory, while keeping loaded libraries.
rm(list = ls())

# Load required packages
library(tidyverse)
library(rgee)
library(sf)
library(geojsonio)
library(reshape2)
library(spectralR)
Load and process the vector data.
# Extract polygons from shapefile and prepare sf object with proper structure
sf_df <- prepare.vector.data(
  system.file("extdata/SouthernBuh-habitats_shapefile.shp", package = "spectralR"),
  "eunis_2020"
)
Explore the resulting spatial object:
head(sf_df)
Get reflectance values
# Get pixel data
reflectance <- get.pixel.data(sf_df, "2019-05-15", "2019-06-30", 10, 10)

# Save pixel data for further sessions
save(reflectance, file = "reflectance_BG_data")
The habitats of the Southern Buh River valley are often represented by small areas (especially rocky outcrops, grasslands, etc.). Therefore, many polygons are relatively small, up to 1 ha, and have an irregular shape. We recommend decreasing the scale value (pixel size) in such cases, approaching the minimum value (10 m). On the other hand, that will require more computational power and processing time.
Quantitative overview of pixel data
summary(factor(reflectance$label)) # how many pixels in each class?
With "big data", spectralR uses your Google Drive as intermediate storage between R and GEE. Please make sure you have enough space on the Drive, and purge the rgee_backup directory manually from time to time.
Spectral reflectance curves for different habitat types
# Create basic ggplot object
p1 <- spectral.curves.plot(reflectance)

# Plotting
p1 +
  labs(x = 'Wavelength, nm', y = 'Reflectance',
       colour = "Habitat types", fill = "Habitat types",
       title = "Spectral reflectance curves for different habitat types\nSouthern Bug National park, Ukraine",
       caption = 'Data: Sentinel-2 Level-2A') +
  theme_minimal()
The plot above, unlike the one from use case 1, looks messy and is hard to read because of the large number of surface classes (habitat types). To make the resulting plot easier to comprehend, both spectral.curves.plot and stat.summary.plot have an argument target_classes, which takes a list of the surface classes to be highlighted, while the others are turned gray as a background. The default value is NULL, in which case all classes are colourized, as above. Let's give it a try.
p1_1 <- spectral.curves.plot(
  data = reflectance,
  target_classes = list("C2.2", "C2.3", "J5")
)

# Plotting
p1_1 +
  labs(x = 'Wavelength, nm', y = 'Reflectance',
       colour = "Habitat types", fill = "Habitat types",
       title = "Spectral reflectance curves for different habitat types\nSouthern Bug National park, Ukraine",
       caption = 'Data: Sentinel-2 Level-2A') +
  theme_minimal()
Needless to say, the user can also filter the input data to obtain exactly what they want. Suppose we need only grasslands in our reflectance plot (habitats whose codes start with R) and want to highlight the differences/similarities in the spectral reflectance of continental dry grasslands (true steppe, type R1B).
# Filter reflectance data (note %in%, not ==, for matching several labels)
grasslands <- reflectance %>%
  filter(label %in% c("R12", "R1A", "R1B", "R21", "R36"))

# Construct a ggplot object with basic aesthetics
p1_2 <- spectral.curves.plot(
  data = grasslands,
  target_classes = list("R1B")
)

# Plotting
p1_2 +
  labs(x = 'Wavelength, nm', y = 'Reflectance',
       colour = "Habitat types", fill = "Habitat types",
       title = "Spectral reflectance curves for different types of\ngrassland habitats, Southern Bug National park, Ukraine",
       caption = 'Data: Sentinel-2 Level-2A') +
  theme_minimal()
Statistical summary for each habitat type
# Create basic ggplot object
p2 <- stat.summary.plot(reflectance)

# Plotting
p2 +
  labs(x = 'Sentinel-2 bands', y = 'Reflectance',
       colour = "Habitat types",
       title = "Reflectance for different habitat types\nSouthern Bug National park, Ukraine",
       caption = 'Data: Sentinel-2 Level-2A\nmean ± standard deviation') +
  theme_minimal()
In the same way, create a plot that highlights only tall-helophyte beds (type Q51), compared with two other common floodplain habitat types (R21 Mesic lowland pasture, T11 Riparian forest).
# Filter reflectance data (note %in%, not ==, for matching several labels)
floodplain <- reflectance %>%
  filter(label %in% c("Q51", "R21", "T11"))

# Create basic ggplot object
p2_1 <- stat.summary.plot(
  data = floodplain,
  target_classes = list("Q51")
)

# Plotting
p2_1 +
  labs(x = 'Sentinel-2 bands', y = 'Reflectance',
       colour = "Habitat types",
       title = "Reflectance of tall-helophyte beds,\nSouthern Bug National park, Ukraine",
       caption = 'Data: Sentinel-2 Level-2A\nmean ± standard deviation') +
  theme_minimal()
Create violin plots for the given habitat types
# Create basic ggplot object
p3 <- violin.plot(reflectance)

# Plotting
p3 +
  labs(x = "Habitat type", y = "Reflectance",
       fill = "Habitat types",
       title = "Reflectance for different habitat types\nSouthern Bug National park, Ukraine",
       caption = 'Data: Sentinel-2 Level-2A') +
  theme_minimal()
You can also save and/or transform the resulting ggplot objects as you wish, using ggplot2 and tidyverse syntax.
If you have any feedback, please let us know via GitHub Issues or e-mail.
spectralR
development team. GPL-3.0 License. Updated 2022-06