knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
library(tsibble, warn.conflicts = FALSE) library(pins) library(dplyr, warn.conflicts = FALSE) library(imputeTS, warn.conflicts = FALSE)
The collected values from the vignette Local Data Collection may be nicely visualized, but require a bit of pre-processing and formatting.
After some time of background data collection, it is now time to recall the collected data and bring them to the light.
board <- board_local() inverter_df <- board |> pin_read("inverter_data") head(inverter_df) #> # A tibble: 6 × 6 #> date device_id inverter output_power today_energy lifetime_energy #> <dttm> <chr> <chr> [W] [kW/h] [kW/h] #> 1 2024-09-22 19:35:41 E07000011776 1 0 0.646 252. #> 2 2024-09-22 19:35:41 E07000011776 2 0 0.654 266. #> 3 2024-09-22 19:43:29 E07000011776 1 0 0.647 252. #> 4 2024-09-22 19:43:29 E07000011776 2 0 0.654 266. #> 5 2024-09-24 19:09:59 E07000011776 1 6 1.66 255. #> 6 2024-09-24 19:09:59 E07000011776 2 6 1.64 269.
We now turn the tibble into a tsibble time-series, in order to be able to use time-series specific functions.
The time-series is chosen to be regular on the 30 minutes interval, and we temporarily remove the units that are not yet {tsibble} friendly. We use fill_gaps()
to make an explicit index entry in the time series, whenever data exists.
inverter_ts <- inverter_df |> mutate(time_index = date |> lubridate::floor_date(unit = "30 minutes"), device_inverter = stringr::str_c(device_id, "_", inverter)) |> summarize(.by = c(time_index, device_id, inverter, device_inverter), lifetime_energy = max(lifetime_energy), today_energy = max(today_energy), output_power = mean(output_power)) |> units::drop_units() |> # convert to time-series as_tsibble(key = c(device_id, inverter, device_inverter), index = time_index ) |> # explicit gaps fill_gaps() head(inverter_ts) #> # A tsibble: 6 x 7 [30m] <?> #> # Key: device_id, inverter, device_inverter [1] #> time_index device_id inverter device_inverter lifetime_energy today_energy output_power #> <dttm> <chr> <chr> <chr> <dbl> <dbl> <dbl> #> 1 2024-09-24 19:00:00 E07000011433 1 E07000011433_1 47.0 1.59 6 #> 2 2024-09-24 19:30:00 E07000011433 1 E07000011433_1 47.0 1.59 0 #> 3 2024-09-24 20:00:00 E07000011433 1 E07000011433_1 NA NA NA #> 4 2024-09-24 20:30:00 E07000011433 1 E07000011433_1 NA NA NA #> 5 2024-09-24 21:00:00 E07000011433 1 E07000011433_1 NA NA NA #> 6 2024-09-24 21:30:00 E07000011433 1 E07000011433_1 NA NA NA
Oh, the size of the filled gap tsibble is large, and missing values comes very early in the data. Now we realize we have only around 8% of completeness in the data. This will require us a strong effort in imputation.
lifetime_energy
is a monotonic series, and thus is correctly imputed with a linear interpolation.\
We start with a naive imputation of today_energy
series being the delta between two steps of lifetime_energy
.
Despite being globally correct, it is far from being representative of the daily seasonality of photo-voltaic energy production.
full_inverter <- inverter_ts |> # impute 'lifetime_energy' gaps with a linear interpolation mutate( lifetime_energy = na_interpolation(lifetime_energy, option = "linear"), today_energy = if_else( inverter == lag(inverter), coalesce(today_energy, lifetime_energy - lag(lifetime_energy)), today_energy ) )
{imputeTS} provides a good toolset to visualize the imputed values in the time-series :
ggplot_na_imputations(x_with_na = inverter_ts |> filter(device_id == "E07000011433") |> pull(lifetime_energy), x_with_imputations = full_inverter |> filter(device_id == "E07000011433") |> pull(lifetime_energy), ylab = "Total energy [kWh]", title = "Imputed Values, first device", size_imputations = 0.2, size_points = 0.6 ) ggplot_na_imputations(x_with_na = inverter_ts |> filter(device_id == "E07000011776") |> pull(lifetime_energy), x_with_imputations = full_inverter |> filter(device_id == "E07000011776") |> pull(lifetime_energy), ylab = "Total energy [kWh]", title = "Imputed Values, second device", size_imputations = 0.2, size_points = 0.6 )
ggplot_na_imputations(x_with_na = inverter_ts |> filter(device_id == "E07000011433") |> pull(today_energy), x_with_imputations = full_inverter |> filter(device_id == "E07000011433") |> pull(today_energy), ylab = "Daily energy [kWh]", title = "Imputed Values, first device", size_imputations = 0.2, size_points = 0.6) ggplot_na_imputations(x_with_na = inverter_ts |> filter(device_id == "E07000011776") |> pull(today_energy), x_with_imputations = full_inverter |> filter(device_id == "E07000011776") |> pull(today_energy), ylab = "Daily energy [kWh]", title = "Imputed Values, second device", size_imputations = 0.2, size_points = 0.6)
Much better imputation can be done of course form this very challenging dataset. As everybody knows, we have the intuition of a daily seasonality being modulated by the local solar radiation. Up to you now to dig into it !
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.