In r-tidy-remote-sensing/tidyrgee: 'tidyverse' Methods for 'Earth Engine'

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

library(tidyrgee)
library(rgee)
library(dplyr)
ee_Initialize()

Intro

In addition to building nice dplyr-esque/tidy syntax wrappers around rgee/GEE functions, we have decided to explore the possibility of introducing a new framework built around a new class of object: "tidyee".

To use this framework, your ImageCollection or Image has to be converted to the tidyee class using the new as_tidyee function, as shown below:

modis_ic <- ee$ImageCollection("MODIS/006/MOD13Q1")
modis_ic_tidy <- as_tidyee(modis_ic)

As you can see below, the new object (modis_ic_tidy) is a named list (of class "tidyee") containing:

ee_ob: the original ee_object (in this case ImageCollection)
vrt: virtual table holding key properties of the original ee_object

modis_ic_tidy$ee_ob
modis_ic_tidy$vrt

The virtual table (vrt) is a data.frame, which lets us leverage the power and functionality of dplyr to filter, mutate, group, etc. An S3 method, filter.tidyee, has been written which essentially first filters the vrt based on the conditions supplied to filter and then uses the filtered data.frame (vrt) to filter/subset the ImageCollection. The vrt comes with several pre-defined columns useful for filtering (date, year, month), but mutate can be used to add any new columns/categories to filter on.

Below is an example that filters by month:

# library(tidyverse)
# library(tidyrgee)
# filter |> debugonce()

modis_march_april <- modis_ic_tidy |> 
  filter(month %in% c(3,4))
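The two-step idea described above (filter the table first, then use the surviving rows to subset the collection) can be sketched without Earth Engine. The snippet below is only an illustration under stated assumptions: vrt_mock is a made-up data.frame, and image_id is a hypothetical stand-in for whatever key tidyrgee uses to link rows to images. It is not the package's implementation.

```r
library(dplyr)

# Mock virtual table: one row per image, mimicking the date/year/month
# columns described above. `image_id` is a hypothetical image key.
dates <- seq(as.Date("2020-01-01"), as.Date("2020-12-01"), by = "month")
vrt_mock <- data.frame(
  image_id = seq_along(dates),
  date = dates,
  year = as.integer(format(dates, "%Y")),
  month = as.integer(format(dates, "%m"))
)

# Step 1: filter the table with ordinary dplyr syntax.
vrt_filtered <- vrt_mock |>
  filter(month %in% c(3, 4))

# Step 2: the surviving keys identify which images to keep in the collection.
keep_ids <- vrt_filtered$image_id
keep_ids
```

Because the filtering happens on a local data.frame, any column added with mutate is immediately available as a filter condition.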

Demo of slice: slice respects R's 1-based indexing rather than GEE's 0-based indexing.

modis_sliced <- modis_ic_tidy |> 
  slice(1:2)
# modis_sliced$ee_ob$size()$getInfo()
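The difference between the two indexing conventions amounts to an offset of one. As a minimal sketch (to_zero_based is a hypothetical helper written for this illustration, not a tidyrgee function, and the source does not show how slice handles this internally):

```r
# Convert R-style (1-based) positions to 0-based offsets.
# `to_zero_based` is a hypothetical helper, not part of tidyrgee.
to_zero_based <- function(r_index) {
  stopifnot(all(r_index >= 1))
  as.integer(r_index) - 1L
}

# R positions 1 and 2 correspond to offsets 0 and 1.
to_zero_based(1:2)
```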

Now we show an example of creating a new category with mutate and then filtering the tidyee by that column:

modis_filt_growing_season <- modis_ic_tidy |> 
  mutate(crop_cycle= case_when( 
    month %in% c(4,5)~ "land prep",
    month %in% c(6,7)~ "planting",
    month %in% c(8,9,10)~"growing",
    month ==11~ "harvesting",
    TRUE ~ "other"
      )
           ) |> 
  filter(crop_cycle=="planting")


modis_split_crop_cycle <- modis_ic_tidy |> 
  mutate(crop_cycle= case_when( 
    month %in% c(4,5)~ "land prep",
    month %in% c(6,7)~ "planting",
    month %in% c(8,9,10)~"growing",
    month ==11~ "harvesting",
    TRUE ~ "other"
      )
           ) |> 
  group_by(crop_cycle) |> 
  group_split()

modis_split_crop_cycle[[2]]$ee_ob$aggregate_array("tidyee_index")$getInfo()
modis_split_crop_cycle[[2]]$ee_ob$aggregate_array("system:time_start")$getInfo()
modis_split_crop_cycle[[2]]$ee_ob$size()$getInfo()


external_group_output <- modis_ic_tidy |> 
  mutate(crop_cycle= case_when( 
    month %in% c(4,5)~ "land prep",
    month %in% c(6,7)~ "planting",
    month %in% c(8,9,10)~"growing",
    month ==11~ "harvesting",
    TRUE ~ "other"
      )
           ) |> 
  group_by(crop_cycle) |> 
  summarise(
    stat = "mean", join_bands = FALSE
  )


external_group_output$ee_ob$size()$getInfo()

external_group_output_multi <- modis_ic_tidy |> 
  mutate(crop_cycle= case_when( 
    month %in% c(4,5)~ "land prep",
    month %in% c(6,7)~ "planting",
    month %in% c(8,9,10)~"growing",
    month ==11~ "harvesting",
    TRUE ~ "other"
      )
           ) |> 
  group_by(crop_cycle) |> 
  summarise(
    stat = list("mean", "sd", "min"), join_bands = FALSE
  )


external_group_output_multi_joined <- modis_ic_tidy |> 
  mutate(crop_cycle= case_when( 
    month %in% c(4,5)~ "land prep",
    month %in% c(6,7)~ "planting",
    month %in% c(8,9,10)~"growing",
    month ==11~ "harvesting",
    TRUE ~ "other"
      )
           ) |> 
  group_by(crop_cycle) |> 
  summarise(
    stat=list("mean","sd")
  )

modis_ic_tidy |> 
  mutate(crop_cycle= case_when( 
    month %in% c(4,5)~ "land prep",
    month %in% c(6,7)~ "planting",
    month %in% c(8,9,10)~"growing",
    month ==11~ "harvesting",
    TRUE ~ "other"
      )
           ) |> 
  group_by(year,crop_cycle) |> 
  summarise(
    stat=list("mean","sd")
  )

Here we show how you can compute pixel-level summary statistics with the summarise function. This is typically referred to as compositing in the GEE documentation as well as in other remote sensing literature.

modis_median_and_sd_by_yrmo <- modis_ic_tidy |>
  group_by(year,month) |> 
  summarise(stat = list("median","sd"))


modis_mean_by_yrmo <- modis_ic_tidy |> 
  select("NDVI","EVI") |> 
  group_by(year,month) |> 
  summarise(stat = "mean")


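# Replace the trailing "_mean" in each band name with "_m" on every image,
# then print the band names of the first image to check the result
modis_mean_by_yrmo$ee_ob$map(
  function(img) {
    ex_bnames <- img$bandNames()
    ex_bnames_renamed <- ex_bnames$map(
      rgee::ee_utils_pyfunc(function(bname) {
        ee$String(bname)$replace("_mean$", "_m")
      })
    )
    img$select(ex_bnames, ex_bnames_renamed)
  }
)$first()$bandNames()$getInfo()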


modis_mean_by_yrmo$ee_ob$first()
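To make the idea of pixel-level summarising (compositing) concrete, here is a small base-R sketch on a toy array. It illustrates the concept only and does not show how summarise is implemented. The array, its values and the object names are all made up for this example.

```r
# Toy image stack: 2 x 2 pixels observed at 3 time steps (made-up values).
stack <- array(
  c(1, 2, 3, 4,    # time 1
    3, 4, 5, 6,    # time 2
    5, 6, 7, 8),   # time 3
  dim = c(2, 2, 3)
)

# Reduce over the time dimension (the third one), pixel by pixel.
composite_mean <- apply(stack, c(1, 2), mean)
composite_sd <- apply(stack, c(1, 2), sd)

composite_mean
composite_sd
```

Each output matrix has the same 2 x 2 pixel footprint as the input, with one value per pixel summarising its time series.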

It's nice that you can also summarise by several statistics at once if you want.

modis_mean_and_sd_by_yrmo <- modis_ic_tidy |> 
  select("NDVI") |> 
  group_by(year,month) |> 
  summarise(stat = list("mean","sd"))

modis_mean_and_sd_by_yrmo

Next we will show how you can mutate a new category and then group by and summarise by that category. This is nice and seems to be working well. However, we still need to work out how to deal with attributes disappearing after running dplyr verbs like group_split. This does not seem to affect the results, only the print methods. There could be a potential solution in the vctrs package. However, it might make more sense to just store band_names as a column/list-column instead of an attribute.

modis_ic_tidy |> 
  # select("NDVI") |> 
  mutate(crop_cycle= case_when( 
    month %in% c(4,5)~ "land prep",
    month %in% c(6,7)~ "planting",
    month %in% c(8,9,10)~"growing",
    month ==11~ "harvesting",
    TRUE ~ "other"
      )
           ) |> 
  group_by(crop_cycle) |> 
  summarise(
    stat="mean"
  )

select & inner_join example

modis_monthly_baseline_mean <- modis_ic_tidy |> 
  select("NDVI") |> 
  filter(year %in% 2000:2015) |> 
  group_by(month) |> 
  summarise(stat="mean")

modis_monthly_baseline_sd <- modis_ic_tidy |> 
  select("NDVI") |> 
  filter(year %in% 2000:2015) |> 
  group_by(month) |> 
  summarise(stat="sd")

modis_monthly_baseline <- modis_monthly_baseline_mean |> 
  inner_join(modis_monthly_baseline_sd, by="month")

modis_monthly_baseline
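As a loose analogy for what joining by month produces, the same operation on plain data.frames is sketched below. The values are made up, and the column names NDVI_mean and NDVI_sd are assumed for illustration. The tidyee join operates on image bands rather than on table columns, so this shows only the shape of the result.

```r
library(dplyr)

# Made-up monthly values standing in for the per-month composites.
baseline_mean <- data.frame(month = 1:3, NDVI_mean = c(0.30, 0.35, 0.42))
baseline_sd <- data.frame(month = 1:3, NDVI_sd = c(0.05, 0.04, 0.06))

# One row per month, carrying both statistics side by side.
baseline <- inner_join(baseline_mean, baseline_sd, by = "month")
names(baseline)
```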

point_sample_buffered <- tidyrgee::bgd_msna |> 
  dplyr::sample_n(3) |> 
  sf::st_as_sf(coords=c("_gps_reading_longitude",
                        "_gps_reading_latitude"), crs=4326) |>
  sf::st_transform(crs=32646) |> 
  sf::st_buffer(dist = 500) |> 
  dplyr::select(`_uuid`) 


ndvi_monthly_mean_at_pt<- modis_monthly_baseline_mean |> 
  ee_extract_tidy(y = point_sample_buffered, 
             fun="mean",
             scale = 500)

# just to show that it also works on imageCollection
modis_monthly_baseline_ic<- modis_monthly_baseline_mean |> 
  as_ee()

modis_monthly_baseline_ic |> 
  ee_extract_tidy(y = point_sample_buffered, 
             fun="mean",
             scale = 500)

# and image
modis_monthly_baseline_img_first <-  modis_monthly_baseline_ic$first()

modis_monthly_baseline_img_first |> 
  ee_extract_tidy(y = point_sample_buffered, 
             fun="mean",
             scale = 500)

Limitations/Next Steps

Below I list properties of this approach that could be considered potential downsides and list potential ways to circumvent or minimize these downsides.

1. The new tidyee object reduces interoperability with rgee

Some potential ideas to improve:

a. Maybe very simple functions to switch the resulting tidyee object back to ee$ImageCollection or ee$Image (maybe as_ic, as_img, as_ee).

b. Add an option to make the tidyee on the fly from ee$Image/ee$ImageCollection, and also include something like return_ic as a logical switch which will just return the ee$Image or ee$ImageCollection instead of the tidyee class.

So far I lean towards option a, and would just create an as_ee function to implement it.

modis_ic_tidy |> 
  as_ee()

2. as_tidyee makes the process take slightly longer

a. Since as_tidyee relies on client-side operations (primarily rgee::ee_get_date_ic), this function requires some start-up time investment. However, I am thinking that this one-time investment will actually save time, since the tidyee object carries a data.frame that is updated essentially instantaneously as we filter and process the ImageCollection. This could allow nice print methods and querying without having to repeatedly run rgee/client-side functions like rgee::ee_print and getInfo in work-flows, each of which takes just as much time as as_tidyee every time it is run.

b. To make sure these perceived benefits are actual benefits we should: 1) include checks/assertions at the end of each process to ensure the ee_ob and vrt are in perfect agreement, and 2) think about including more information (bands, properties) in the print methods for tidyee.
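One possible shape for the agreement check mentioned above is sketched here. Everything in it is an assumption made for illustration: assert_vrt_in_sync is a hypothetical helper, and the keys are passed in as plain vectors. In practice the collection-side keys would have to be fetched from Earth Engine, for example the "tidyee_index" values read with aggregate_array in the grouped example earlier.

```r
# Hypothetical helper: compare the keys held by the table with the keys
# reported by the collection. Both are passed in as plain vectors here.
assert_vrt_in_sync <- function(vrt_keys, ee_keys) {
  if (length(vrt_keys) != length(ee_keys)) {
    stop("vrt and ee_ob differ in size: ",
         length(vrt_keys), " vs ", length(ee_keys))
  }
  if (!all(vrt_keys == ee_keys)) {
    stop("vrt and ee_ob keys do not match")
  }
  invisible(TRUE)
}

# Matching keys pass silently.
assert_vrt_in_sync(c(3, 4), c(3, 4))

# Mismatched keys raise an error, caught here and turned into FALSE.
in_sync <- tryCatch(
  assert_vrt_in_sync(c(3, 4), c(3, 5)),
  error = function(e) FALSE
)
in_sync
```

Fetching the keys from the server costs a round trip, so a check like this trades some speed for safety. Whether it should run after every verb is a design choice the source leaves open.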

Might be worth prefixing dplyr functions with ee_ to avoid conflicts?

modis_ndvi_baseline <- modis_ic_tidy |>
  select("NDVI") |>
  filter(year %in% c(2000:2015)) |> 
  group_by(month) |>
  summarise(stat = list("mean","sd"))

modis_ndvi_current <- modis_ic_tidy |>
  select("NDVI") |>
  filter(year %in% c(2022)) |> 
  group_by(month) |>
  summarise(stat = "mean")

modis_mean_and_sd_by_yrmo


modis_monthly_baseline_current <- modis_ndvi_baseline |> 
  inner_join(modis_ndvi_current |> select(NDVI_mean_current="NDVI"), by="month")

modis_current_with_baseline <- modis_ndvi_current |>
  select(NDVI_mean_current="NDVI") |> 
  inner_join(modis_ndvi_baseline, by="month")