devtools::load_all(".")
library(sits)
library(sitsdata)
This vignette complements the main sits vignette and provides additional information about handling time series data in SITS.
The sits package requires a set of time series data, describing properties in spatio-temporal locations of interest. For land use classification, this set consists of samples provided by experts who take in-situ field observations or recognize land classes using high-resolution images. The package can also be used for any type of classification, provided that the timeline and bands of the time series (used for training) match those of the data cubes.
For handling time series, the package uses a sits tibble to organize time series data with associated spatial information. A tibble is a generalization of a data.frame, the usual way in R to organize data in tables. Tibbles are part of the tidyverse, a collection of R packages designed to work together in data manipulation [@Wickham2017]. As an example of how the sits tibble works, the following code shows the first three lines of a tibble containing $1,882$ labelled samples of land cover in the Mato Grosso state of Brazil. The samples contain time series extracted from the MODIS MOD13Q1 product from 2000 to 2016, provided every $16$ days at $250$-meter spatial resolution in the Sinusoidal projection. Based on ground surveys and high-resolution imagery, it includes samples of nine classes: "Forest", "Cerrado", "Pasture", "Soybean-fallow", "Fallow-Cotton", "Soybean-Cotton", "Soybean-Corn", "Soybean-Millet", and "Soybean-Sunflower".
# data set of samples
data(samples_matogrosso_mod13q1)
samples_matogrosso_mod13q1[1:3,]
A sits tibble contains data and metadata. The first six columns contain the metadata: spatial and temporal information, the label assigned to the sample, and the data cube from which the data has been extracted. The spatial location is given in longitude and latitude coordinates for the "WGS84" ellipsoid. For example, the first sample has been labelled "Cerrado", at location ($-58.5631$, $-13.8844$), and is considered valid for the period (2007-09-14, 2008-08-28). Specifying the dates for which the label is valid is crucial for correct classification. In this case, the researchers involved in labeling the samples chose to use the agricultural calendar in Brazil, where the spring crop is planted in the months of September and October, and the autumn crop is planted in the months of February and March. For other applications and other countries, the relevant dates will most likely be different from those used in the example. The time_series column contains the time series data for each spatiotemporal location. This data is also organized as a tibble, with one column for the dates and one column for the values of each spectral band.
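The nested layout can be mimicked with a small hand-built tibble. This is only a sketch: the band values, the cube name and the object names (toy_series, toy_sample) are invented for illustration, and the column names (including Index for the dates) are assumed to follow the layout that sits prints.

```r
library(tibble)

# a toy time series: one date column plus one column per band (values invented)
toy_series <- tibble(
  Index = as.Date(c("2007-09-14", "2007-09-30", "2007-10-16")),
  NDVI  = c(0.50, 0.48, 0.62),
  EVI   = c(0.31, 0.30, 0.41)
)

# six metadata columns plus a list column holding the nested time series
toy_sample <- tibble(
  longitude   = -58.5631,
  latitude    = -13.8844,
  start_date  = as.Date("2007-09-14"),
  end_date    = as.Date("2008-08-28"),
  label       = "Cerrado",
  cube        = "toy_cube",          # hypothetical cube name
  time_series = list(toy_series)
)

# the nested tibble is reached through the list column
first_series <- toy_sample$time_series[[1]]
first_series$NDVI
```

The same indexing pattern, applied to the time_series column of a real sample set, should give access to the values of each sample.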
The sits package provides functions for data manipulation and displaying information for sits tibbles. For example, sits_labels_summary() shows the labels of the sample set and their frequencies.
sits_labels_summary(samples_matogrosso_mod13q1)
In many cases, it is useful to relabel the data set. For example, one may want to use a smaller set of labels when samples with one label in the original set are not distinguishable from samples with other labels. We could then use sits_relabel(), which requires a conversion list (for details, see ?sits_relabel).
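The idea of a conversion list can be illustrated in base R on a plain character vector. This sketch does not call sits_relabel() and is not its implementation; the labels vector and the conversion list are hypothetical and only show how old labels map to new ones.

```r
# labels of a toy sample set (invented for illustration)
labels <- c("Soybean-Corn", "Soybean-Millet", "Forest",
            "Soybean-Cotton", "Pasture")

# hypothetical conversion list: old label -> new label
conversion <- list(
  "Soybean-Corn"   = "Soybean",
  "Soybean-Millet" = "Soybean",
  "Soybean-Cotton" = "Soybean"
)

# replace labels found in the conversion list, keep the others unchanged
to_convert <- labels %in% names(conversion)
new_labels <- labels
new_labels[to_convert] <- unlist(conversion[labels[to_convert]],
                                 use.names = FALSE)

# frequency of each label after the conversion
table(new_labels)
```

Labels that do not appear in the conversion list are left as they are, so only the classes that need merging have to be listed.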
Given that we have used the tibble data format for the metadata and the embedded time series, one can use the functions from the dplyr, tidyr and purrr packages of the tidyverse [@Wickham2017] to process the data. For example, the following code uses sits_select() to get a subset of the sample data set with the NDVI band and then uses dplyr::filter() to select the samples labelled as "Cerrado".
# select NDVI band
samples_ndvi <- sits_select(samples_matogrosso_mod13q1, bands = "NDVI")
# select only samples with Cerrado label
samples_cerrado <- dplyr::filter(samples_ndvi, label == "Cerrado")
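To keep samples of more than one label, the same dplyr::filter() call can test membership with %in%. The sketch below runs on a toy table; toy_samples and its id column are invented and merely stand in for a sits tibble.

```r
library(dplyr)

# toy table standing in for a sits tibble (only the label column matters here)
toy_samples <- tibble::tibble(
  id    = 1:5,
  label = c("Cerrado", "Pasture", "Forest", "Cerrado", "Soybean-Corn")
)

# keep the rows labelled either "Cerrado" or "Pasture"
samples_cer_pas <- dplyr::filter(toy_samples,
                                 label %in% c("Cerrado", "Pasture"))

# number of samples kept per label
dplyr::count(samples_cer_pas, label)
```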
Given a small number of samples to display, plot() tries to group as many spatial locations together as possible. In the following example, the first 15 samples of the "Cerrado" class refer to the same spatial location in consecutive time periods. For this reason, these samples are plotted together.
# plot the first 15 samples
plot(samples_cerrado[1:15,])
For a large number of samples, where the number of individual plots would be substantial, the default visualization combines all samples together in a single temporal interval (even if they belong to different years). All samples with the same band and label are aligned to a common time interval. This plot is useful to show the spread of values for the time series of each band. The strong red line in the plot shows the median of the values, while the two orange lines show the first and third quartiles. The documentation of plot.sits() has more details about the different ways it can display data.
# plot all cerrado samples together
plot(samples_cerrado)
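The summary lines of the combined plot correspond to per-date statistics over all aligned samples. As a rough sketch in base R, assuming the aligned values are arranged with one row per sample and one column per time step (the ndvi matrix below is invented, and this is not the code used by plot()):

```r
# toy NDVI values: one row per sample, one column per time step (invented)
ndvi <- rbind(
  c(0.40, 0.55, 0.70),
  c(0.50, 0.60, 0.80),
  c(0.45, 0.65, 0.75),
  c(0.35, 0.50, 0.85),
  c(0.55, 0.70, 0.90)
)

# first quartile, median and third quartile at each time step
stats <- apply(ndvi, 2, quantile, probs = c(0.25, 0.50, 0.75))
stats
```

Each column of stats holds the three values that would be drawn at one time step: the median in the middle row, with the first and third quartiles around it.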
To get a time series in SITS, one has to create a data cube first, as described above. Users can request one or more time series points from a data cube by using sits_get_data(). This function provides a general means of access to image time series. Given a data cube, the user provides the latitude and longitude of the desired location, the bands, and the start date and end date of the time series. If the start and end dates are not provided, the function retrieves the entire available period. The result is a tibble that can be visualized using plot().
The SITS package enables users to create data cubes based on files. In this case, these files should be organized as raster stacks, where each file is a single-layer raster that refers to one time instance and one spectral band. To allow users to create data cubes based on files, SITS needs to know the names of the satellite and sensor, and the name of the directory that contains the files.
# Obtain a raster cube with 23 instances for one year
# Use the "NDVI" and "EVI" bands from images available in the "sitsdata" package
data_dir <- system.file("extdata/sinop", package = "sitsdata")
# create a raster metadata file based on the information about the files
raster_cube <- sits_cube(
    source = "LOCAL",
    satellite = "TERRA",
    sensor = "MODIS",
    name = "Sinop",
    data_dir = data_dir,
    parse_info = c("X1", "X2", "band", "date")
)
# a point in the transition from forest to pasture in northern Mato Grosso (MT)
# obtain a time series from the raster cube for this point
series.tb <- sits_get_data(cube = raster_cube,
                           longitude = -55.57320,
                           latitude = -11.50566,
                           bands = c("NDVI", "EVI"))
plot(series.tb)
A useful case is when a set of labelled samples is available to be used as a training data set. In this case, one usually has trusted observations which are labelled and commonly stored in plain text CSV files. The function sits_get_data() accepts a CSV file path as an argument. The CSV file must provide, for each time series, its latitude and longitude, the start and end dates, and a label associated with a ground sample. The following code retrieves the time series for the samples listed in a CSV file:
# retrieve a list of samples described by a CSV file
samples.csv <- system.file("extdata/samples/samples_sinop_crop.csv", package = "sits")
# get the time series for the points from the raster cube
points <- sits_get_data(raster_cube, file = samples.csv)
# show the tibble with the points
points
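To make the expected file layout concrete, the sketch below reads a small CSV from text and checks that the required fields are present. The column names (id, longitude, latitude, start_date, end_date, label) are an assumption based on the fields listed above, and the rows are invented; the file shipped with the package may differ.

```r
# toy CSV content; column names are assumed and values are invented
csv_text <- "id,longitude,latitude,start_date,end_date,label
1,-55.65931,-11.76267,2013-09-14,2014-08-29,Pasture
2,-55.64833,-11.76385,2013-09-14,2014-08-29,Pasture
3,-55.66738,-11.78032,2013-09-14,2014-08-29,Forest"

samples_df <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# check that every field needed to locate and label a time series is present
required <- c("longitude", "latitude", "start_date", "end_date", "label")
missing_cols <- setdiff(required, names(samples_df))
if (length(missing_cols) > 0) {
  stop("missing columns: ", paste(missing_cols, collapse = ", "))
}

# convert the dates and make sure each validity period is well ordered
samples_df$start_date <- as.Date(samples_df$start_date)
samples_df$end_date   <- as.Date(samples_df$end_date)
stopifnot(all(samples_df$start_date <= samples_df$end_date))

samples_df
```

A check of this kind can be run on a real CSV file before passing it to sits_get_data(), so that formatting problems show up early.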
A common situation is when users have samples available as shapefiles in point or polygon format. Since shapefiles contain only geometries, we need to provide information about the start and end times for which each label is valid. In this case, one should use the function sits_get_data() to retrieve data from a data cube based on the contents of the shapefile. The parameter shp_attr (optional) indicates the name of the column in the shapefile which contains the label to be associated with each time series; the parameter .n_shp_pol (defaults to 20) determines the number of samples to be extracted from each polygon.
# define the input shapefile (consisting of POLYGONS)
shp_file <- system.file("extdata/shapefiles/agriculture/parcel_agriculture.shp", package = "sits")
# set the start and end dates
start_date <- "2013-09-14"
end_date <- "2014-08-29"
# define the name of the attribute of the shapefile that contains the label
shp_attr <- "ext_na"
# define the number of samples to extract from each polygon
.n_shp_pol <- 10
# retrieve the time series for points sampled from the polygons in the shapefile
data <- sits_get_data(cube = raster_cube,
                      file = shp_file,
                      start_date = start_date,
                      end_date = end_date,
                      shp_attr = shp_attr,
                      .n_shp_pol = .n_shp_pol)
data