In e-sensing/sits-docs: Documentation for the SITS Package Sensing Data Cubes

devtools::load_all(".")
library(sits)
library(sitsdata)

Introduction

This vignette complements the main sits vignette and provides additional information about handling time series data in SITS.

Data structures for satellite time series

The sits package requires a set of time series data, describing properties in spatio-temporal locations of interest. For land use classification, this set consists of samples provided by experts who take in-situ field observations or recognize land classes using high resolution images. The package can also be used for any type of classification, provided that the timeline and bands of the time series (used for training) match those of the data cubes.

For handling time series, the package uses a sits tibble to organize time series data with associated spatial information. A tibble is a generalization of a data.frame, the usual way in R to organise data in tables. Tibbles are part of the tidyverse, a collection of R packages designed to work together in data manipulation [@Wickham2017]. As an example of how the sits tibble works, the following code shows the first three rows of a tibble containing $1,882$ labelled samples of land cover in Mato Grosso state of Brazil. The samples contain time series extracted from the MODIS MOD13Q1 product from 2000 to 2016, provided every $16$ days at $250$-meter spatial resolution in the Sinusoidal projection. Based on ground surveys and high resolution imagery, the data set includes samples of nine classes: "Forest", "Cerrado", "Pasture", "Soybean-fallow", "Fallow-Cotton", "Soybean-Cotton", "Soybean-Corn", "Soybean-Millet", and "Soybean-Sunflower".

# data set of samples
data(samples_matogrosso_mod13q1)
samples_matogrosso_mod13q1[1:3,]

A sits tibble contains data and metadata. The first six columns contain the metadata: spatial and temporal information, the label assigned to the sample, and the data cube from which the data has been extracted. The spatial location is given in longitude and latitude coordinates for the "WGS84" ellipsoid. For example, the first sample has been labelled "Cerrado", at location ($-58.5631$, $-13.8844$), and is considered valid for the period (2007-09-14, 2008-08-28). Informing the dates where the label is valid is crucial for correct classification. In this case, the researchers involved in labeling the samples chose to use the agricultural calendar in Brazil, where the spring crop is planted in the months of September and October, and the autumn crop is planted in the months of February and March. For other applications and other countries, the relevant dates will most likely be different from those used in the example. The time_series column contains the time series data for each spatiotemporal location. This data is also organized as a tibble, with one column for the dates and one further column for the values of each spectral band.
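The nested layout described above can be reproduced with a small hand-built tibble. The sketch below is not produced by sits: the column names (longitude, latitude, start_date, end_date, label, cube, time_series, and Index for the dates) and the cube name are assumptions about the layout, and the band values are made-up placeholders. Only the location, dates and label of the first sample come from the text.

```r
library(tibble)

# Time series for one location: a date column plus one column per band.
# The values below are placeholders, not real MODIS observations.
ts <- tibble(
    Index = as.Date(c("2007-09-14", "2007-09-30", "2007-10-16")),
    NDVI  = c(0.50, 0.55, 0.60),
    EVI   = c(0.30, 0.33, 0.36)
)

# One row per sample: six metadata columns plus the nested time series.
sample_row <- tibble(
    longitude   = -58.5631,
    latitude    = -13.8844,
    start_date  = as.Date("2007-09-14"),
    end_date    = as.Date("2008-08-28"),
    label       = "Cerrado",
    cube        = "MOD13Q1",        # hypothetical cube name
    time_series = list(ts)
)

# The nested tibble is reached through the list column.
first_series <- sample_row$time_series[[1]]
band_names <- setdiff(names(first_series), "Index")
print(band_names)
```

Storing the series in a list column keeps one row per sample, so the metadata can be filtered with ordinary table operations while the series travel along with each row.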

Utilities for handling time series

The sits package provides functions for data manipulation and displaying information for sits tibbles. For example, sits_labels_summary() shows the labels of the sample set and their frequencies.

sits_labels_summary(samples_matogrosso_mod13q1)

In many cases, it is useful to relabel the data set. For example, there may be situations when one wants to use a smaller set of labels, since samples in one label on the original set may not be distinguishable from samples with other labels. We then could use sits_relabel(), which requires a conversion list (for details, see ?sits_relabel).
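As a rough illustration of what a conversion list expresses, the sketch below maps original labels to a smaller set using base R only. It does not call sits_relabel(), whose exact arguments should be checked in ?sits_relabel; the helper relabel() and the grouping of two crop classes into a single "Cropland" label are hypothetical choices made for this example.

```r
# Original labels of a few samples (taken from the class list above).
labels <- c("Forest", "Soybean-Corn", "Pasture", "Soybean-Millet", "Cerrado")

# Conversion list: original label -> new label.
# Labels missing from the list keep their original value.
conv_lst <- list(
    "Soybean-Corn"   = "Cropland",
    "Soybean-Millet" = "Cropland"
)

# Hypothetical helper: replace every label found in the conversion list.
relabel <- function(x, conv) {
    hit <- x %in% names(conv)
    x[hit] <- unlist(conv[x[hit]], use.names = FALSE)
    x
}

new_labels <- relabel(labels, conv_lst)
print(table(new_labels))
```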

Given that we have used the tibble data format for the metadata and the embedded time series, one can use the functions from the dplyr, tidyr and purrr packages of the tidyverse [@Wickham2017] to process the data. For example, the following code uses sits_select() to get a subset of the sample data set with a single band (NDVI) and then uses dplyr::filter() to select the samples labelled as "Cerrado".

# select NDVI band
samples_ndvi <- sits_select(samples_matogrosso_mod13q1, 
                            bands = "NDVI")

# select only samples with Cerrado label
samples_cerrado <-
    dplyr::filter(samples_ndvi, 
                  label == "Cerrado")

Time series visualisation

Given a small number of samples to display, plot() tries to group as many spatial locations together as possible. In the following example, the first 15 samples of the "Cerrado" class refer to the same spatial location in consecutive time periods. For this reason, these samples are plotted together.

# plot the first 15 samples
plot(samples_cerrado[1:15,])

For a large number of samples, where the number of individual plots would be substantial, the default visualization combines all samples together in a single temporal interval (even if they belong to different years). All samples with the same band and label are aligned to a common time interval. This plot is useful to show the spread of values for the time series of each band. The strong red line in the plot shows the median of the values, while the two orange lines are the first and third quartiles. The documentation of plot.sits() has more details about the different ways it can display data.

# plot all cerrado samples together
plot(samples_cerrado)
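The summary lines of such a plot can be reproduced outside sits. The sketch below assumes the samples have already been aligned to a common timeline and stored as a matrix with one row per sample and one column per date; the values are simulated, not taken from the Mato Grosso data, and this is not necessarily how plot() computes them internally.

```r
set.seed(42)

# Simulated NDVI values: 100 samples (rows) at 23 dates (columns).
n_samples <- 100
n_dates   <- 23
ndvi <- matrix(runif(n_samples * n_dates, min = 0.2, max = 0.9),
               nrow = n_samples, ncol = n_dates)

# For each date, compute the first quartile, median and third quartile
# across all samples.
summary_lines <- apply(ndvi, 2, quantile, probs = c(0.25, 0.5, 0.75))
rownames(summary_lines) <- c("q1", "median", "q3")

# Each column now holds the three summary values for one date.
print(dim(summary_lines))
```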

Obtaining time series data from data cubes

To get a time series in SITS, one has to create a data cube first. Users can then request one or more time series from the data cube by using sits_get_data(). This function provides a general means of access to image time series. Given a data cube, the user provides the latitude and longitude of the desired location, the bands, and the start date and end date of the time series. If the start and end dates are not provided, the function retrieves the entire available period. The result is a tibble that can be visualized using plot().

The SITS package enables users to create data cubes based on files. In this case, the files should be organized as a raster stack, that is, as a collection of single-layer rasters in which each file refers to one time instance and one spectral band. To allow users to create data cubes based on files, SITS needs to know the names of the satellite and sensor, and the name of the directory that contains the files.

# Obtain a raster cube with 23 instances for one year
# Select the band "ndvi", "evi" from images available in the "sitsdata" package
data_dir <- system.file("extdata/sinop", package = "sitsdata")

# create a raster metadata file based on the information about the files
raster_cube <- sits_cube(
    source     = "LOCAL",
    satellite  = "TERRA",
    sensor     = "MODIS",
    name       = "Sinop",
    data_dir   = data_dir,
    parse_info = c("X1", "X2", "band", "date")
)
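The parse_info argument tells sits_cube() which part of each file name holds the band and which holds the date. The sketch below mimics that idea in base R. It assumes file names whose parts are separated by underscores, and the file name itself is hypothetical rather than copied from the sitsdata package.

```r
# Hypothetical file name with four underscore-separated parts.
file_name <- "TERRA_MODIS_NDVI_2013-09-14.tif"

# Names given to each part, in order, as in the sits_cube() call above.
parse_info <- c("X1", "X2", "band", "date")

# Drop the extension, split on "_" and label the parts.
base  <- sub("\\.tif$", "", file_name)
parts <- strsplit(base, "_", fixed = TRUE)[[1]]
names(parts) <- parse_info

# Only the "band" and "date" parts are used; "X1" and "X2" are ignored.
band <- parts[["band"]]
date <- as.Date(parts[["date"]])
print(band)
print(date)
```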

# a point in the transition forest to pasture in Northern MT
# obtain a time series from the raster cube for this point
series.tb <- sits_get_data(cube      = raster_cube,
                           longitude = -55.57320, 
                           latitude  = -11.50566,
                           bands     = c("NDVI", "EVI"))
plot(series.tb)

A useful case is when a set of labelled samples is available to be used as a training data set. In this case, one usually has trusted observations which are labelled and commonly stored in plain text CSV files. The function sits_get_data() accepts the path to a CSV file as an argument. The CSV file must provide, for each time series, its latitude and longitude, the start and end dates, and a label associated with a ground sample. The following example retrieves the samples described in such a CSV file:

# retrieve a list of samples described by a CSV file
samples.csv <- system.file("extdata/samples/samples_sinop_crop.csv",
                           package = "sits")
# get the points from a data cube in raster brick format
points <- sits_get_data(raster_cube, file = samples.csv)


# show the tibble with the points
points
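A CSV file of this kind can be assembled in plain R. In the sketch below the column names (longitude, latitude, start_date, end_date, label) follow the fields listed above but are an assumption about the exact header sits expects; the second coordinate pair and both labels are illustrative placeholders, not real ground samples.

```r
# Two illustrative samples; values are placeholders, not field data.
samples <- data.frame(
    longitude  = c(-55.57320, -55.60000),
    latitude   = c(-11.50566, -11.52000),
    start_date = c("2013-09-14", "2013-09-14"),
    end_date   = c("2014-08-29", "2014-08-29"),
    label      = c("Pasture", "Forest"),
    stringsAsFactors = FALSE
)

# Write the samples to a temporary CSV file and read them back.
csv_file <- tempfile(fileext = ".csv")
write.csv(samples, csv_file, row.names = FALSE)
samples_read <- read.csv(csv_file, stringsAsFactors = FALSE)
print(samples_read)
```

Comparing the file header against the CSV shipped with the package before passing the path to sits_get_data() is a cheap way to catch a misnamed column.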

A common situation is when users have samples available as shapefiles in point or polygon format. Since the geometries in a shapefile carry no temporal information, we need to provide the start and end dates for which each label is valid. In this case, one should use the function sits_get_data() to retrieve data from a data cube based on the contents of the shapefile. The parameter shp_attr (optional) indicates the name of the column in the shapefile which contains the label to be associated with each time series; the parameter .n_shp_pol (defaults to 20) determines the number of samples to be extracted from each polygon.

# define the input shapefile (consisting of POLYGONS)
shp_file <- system.file("extdata/shapefiles/agriculture/parcel_agriculture.shp", 
                        package = "sits")

# set the start and end dates 
start_date <- "2013-09-14"
end_date   <- "2014-08-29"

# define the name of attribute of the shapefile that contains the label
shp_attr <- "ext_na"

# define the number of samples to extract from each polygon
.n_shp_pol <- 10

# extract time series for points sampled from the polygons in the shapefile
data <- sits_get_data(cube       = raster_cube, 
                      file       = shp_file, 
                      start_date = start_date, 
                      end_date   = end_date, 
                      shp_attr   = shp_attr, 
                      .n_shp_pol = .n_shp_pol)
data