knitr::opts_chunk$set(echo = FALSE) # by default turn off code echo. Warnings may still be shown.
# Set start time ----
startTime <- proc.time()

# Set grid spy data paths etc from file ----
source(paste0(ggrParams$repoLoc, "/dataProcessing/gridSpy/gSpyParams.R"))

# Packages needed in this .Rmd file ----
rmdLibs <- c("stringr" # for str_wrap for long labels on plots
)
# load them
loadLibraries(rmdLibs)

# Local parameters ----

b2Kb <- 1024 #http://whatsabyte.com/P1/byteconverter.htm
b2Mb <- 1048576
plotLoc <- paste0(ggrParams$repoLoc, "/docs/plots/")

# Local functions ----

\newpage

About

Report circulation:

License


Citation

If you wish to use any of the material from this report please cite as:

This work is (c) r lubridate::year(today()) the University of Southampton.

History


Support


\newpage

Introduction


This report provides summary data quality statistics for the original GREEN Grid GridSpy household power demand monitoring data. This data was used to create a derived 'safe' dataset using the code in the r ggrParams$repo repository.

Original Data: Quality checks

The original data files files are stored on r ggrParams$otagoHCS.

# get file list 
fListCompleteDT <- data.table::as.data.table(readr::read_csv(gSpyParams$fListAll))

# for use below
nFiles <- nrow(fListCompleteDT)
nFilesNotLoaded <- nrow(fListCompleteDT[dateColName %like% "ignore"])

Data collection is ongoing and this section reports on the availability of data files collected up to the time at which the most recent safe file was created (r file.mtime(gSpyParams$fListAll)). To date we have r tidyNum(nFiles) files from r tidyNum(uniqueN(fListCompleteDT$hhID)) unique GridSpy IDs.

However a large number of files (r tidyNum(nFilesNotLoaded) or r round(100*(nFilesNotLoaded/nFiles))%) have 1 of two file sizes (43 or 2751 bytes) and we have determined that they contain no data as the monitoring devices have either been removed (households have moved or withdrawn from the study) or data transfer has failed. We therefore flag these files as 'to be ignored'.

In addition two of the GridSpy units were re-used in new households following withdrawal of the original participants. The GridSpy IDs (rf_XX) remained unchanged despite allocation to different households. The original input data does not therefore distinguish between these households and we discuss how this is resolved in the clean safe data in Section \@ref(reallocation) below.

Input data file quality checks

Figure \@ref(fig:allFileSizesPlot) shows the distribution of the file sizes of all files over time by GridSpy ID. Note that white indicates the presence of small files which may not contain observations.

myCaption <- paste0("Data source: ", gSpyParams$gSpyInPath,
                    "\nUsing data received up to ", max(fListCompleteDT$fMDate))

plotDT <- fListCompleteDT[, .(nFiles = .N,
                              meanfSize = mean(fSize)), 
                          keyby = .(hhID, date = as.Date(fMDate))]

ggplot2::ggplot(plotDT, aes( x = date, y = hhID, 
                             fill = log(meanfSize))) +
  geom_tile() +
  scale_fill_gradient(low = "white", high = "black") + 
  scale_x_date(date_labels = "%Y %b", date_breaks = "1 month") +
  theme(axis.text.x = element_text(angle = 90, 
                                   vjust = 0.5, hjust = 0.5)) + 
  labs(title = "Mean file size of all GridSpy data files received per day",
       y = "GridSpy ID",
       caption = paste0(myCaption, 
                        "\nLog file size used as some files are full year data and so extremely large")

  )
ggplot2::ggsave(paste0(plotLoc, "gridSpyAllFileListSizeTilePlot.png"))

As we can see, relatively large files were downloaded (manually) in June and October 2016 before an automated download process was implemented from January 2017. A final manual download appears to have taken place in early December 2017.

Figure \@ref(fig:loadedFileSizesPlot) plots the same results but excludes files which do not meet the file size threshold and which we therefore assume do not contain data.

plotDT <- fListCompleteDT[!is.na(dateFormat), .(nFiles = .N,
                              meanfSize = mean(fSize)), 
                          keyby = .(hhID, date = as.Date(fMDate))]

ggplot2::ggplot(plotDT, aes( x = date, y = hhID, fill = log(meanfSize))) +
  geom_tile() +
  scale_fill_gradient(low = "white", high = "black") + 
  scale_x_date(date_labels = "%Y %b", date_breaks = "1 month") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 0.5)) + 
  labs(title = "Mean file size of loaded GridSpy data files received per day",
       y = "GridSpy ID",
       caption = paste0(myCaption, 
                        "\nLog file size used as some files are full year data",
                        "\nFiles loaded if fsize > ", gSpyParams$gSpyFileThreshold, " bytes")

  )
ggplot2::ggsave(paste0(plotLoc, "/gridSpyLoadedFileListSizeTilePlot.png"))

As we can see this removes a large number of the automatically downloaded files.

Input date format checks {#getItRightFirstTime}

As noted above, the original data was downloaded in two ways:

Resolving and cleaning these variations and uncertainties have required substantial effort and in some cases the date format (and thus time when timezones are set) has had to be inferred from the file names. A key lesson for future projects is always to ensure that files are named so that meta data is easily parsed and that there can be one and only one:

Table \@ref(tab:listDefaultDateFilesNZT) lists up to 10 of the 'date NZ' files which are set by default - do they look OK to assume the default dateFormat? Compare the file names with the dateExample...

# list default files with NZ time
aList <- fListCompleteDT[dateColName == "date NZ" & dateFormat %like% "default", 
                         .(file, fSize, dateColName, dateExample, dateFormat)]

cap <- paste0("First 10 (max) of ", nrow(aList), 
              " files with dateColName = 'date NZ' and default dateFormat")

kableExtra::kable(caption = cap, head(aList, 10), digits = 2) %>%
  kable_styling()

Table \@ref(tab:listDefaultDateFilesUTC) lists up to 10 of the 'date UTC' files which are set by default - do they look OK to assume the default dateFormat? Compare the file names with the dateExample...

# list default files with UTC time
aList <- fListCompleteDT[dateColName == "date UTC" & dateFormat %like% "default", 
                         .(file, fSize, dateColName, dateExample, dateFormat)]

cap <- paste0("First 10 (max) of ", nrow(aList), 
              " files with dateColName = 'date UTC' and default dateFormat")

kableExtra::kable(caption = cap, head(aList, 10), digits = 2) %>%
  kable_styling()

After final cleaning, the final date formats are shown in Table \@ref(tab:finalDateFormatTable).

# See what the date formats look like now
t <- fListCompleteDT[, .(nFiles = .N, 
                         meanFSizeKb = tidyNum(mean(fSize/b2Kb)),
                         minFSizeKb = tidyNum(min(fSize/b2Kb)),
                         maxFSizeKb = tidyNum(max(fSize/b2Kb)),
                         minFDate = min(fMDate), # may not make much sense
                         maxFDate = max(fMDate)), 
                     keyby = .(dateColName, dateFormat)]

kableExtra::kable(t,
             caption = "Number of files & min/max dates (as char) with given date column names by final imputed date format", digits = 2) %>%
  kable_styling()

Results to note:

Processed Data: Quality checks {#cleanData}

In this section we analyse the data files that have a file size > r gSpyParams$gSpyFileThreshold bytes and which have been used to create the safe data. Things to note:

Table \@ref(tab:filesToLoadTable) shows the number of files per GridSpy ID that are actually processed to make the safe version together with the min/max file save dates (not the observed data dates).

# check files to load
t <- fListCompleteDT[dateColName %like% "date", .(nFiles = .N,
                       meanSize = mean(fSize),
                       minFileDate = min(fMDate),
                       maxFileDate = max(fMDate)), keyby = .(gridSpyID = hhID)]

kableExtra::kable(t, caption = "Summary of household files to load", digits = 2) %>%
  kable_styling()

Recoding re-allocated GridSpy units {#reallocation}

As noted in the introduction, two units were re-allocated to new households during the study. These were:

To avoid confusion the data for each of these units has been split in to rf_XXa/rf_XXb files on the appropriate dates during data processing. In principle therefore the clean data should contain data files for:

However rf_15a did not collect usable data before the unit was re-allocated so only files for rf_15b, rf_17a and rf_17b exist in the archive.

Each cleaned safe data file contains both the original hhID (i.e. the GridSpy ID) and a new linkID which has the same value as hhID except in the case of these three files. The linkID variable should always be used to link the GridSpy data to the survey or other household level data in the data package.

In all subsequent analysis we use linkID to give results for each household.

Observations

The following plots show the number of observations per day per household. In theory we should not see:

If present both of the latter may have been implied by the table above and would have evaded the de-duplication filter which simply checks each complete row against all others within its consolidated household dataset (a within household absolute duplicate check).

Note that rf_15a is not present as no usable data was obtained from this household.

hhStatDT <- data.table::as.data.table(readr::read_csv(gSpyParams$hhStatsByDate))

Figure \@ref(fig:obsPlotTile) uses a tile plot which is useful for visualising data gaps. Note that there are indications of missing observations in April possibly caused by DST clock-changes when clocks go back 1 hour in NZ.

myCaption <- paste0(myCaption,
                        "\nOnly files of size > ", 
                        gSpyParams$gSpyFileThreshold, " bytes loaded")

# tile plot ----
obsPlot1 <- ggplot2::ggplot(hhStatDT, aes( x = date, y = linkID, 
                               fill = nObs/nCircuits)) + # divide by the number of circuits
  geom_tile() +
  scale_fill_gradient(low = "red", high = "green") +
  scale_x_date(date_labels = "%Y %b", date_breaks = "6 months") +
  theme(axis.text.x = element_text(angle = 90, 
                                   vjust = 0.5,
                                   hjust = 0.5)) + 
  labs(title = "N observations per household per circuit per day for all loaded GridSpy data",
       caption = myCaption

  )
obsPlot1

Figure \@ref(fig:obsPlotPoint) uses a point plot which is useful for visualising days where there was partial or duplicate data. Note that there are indications of duplicate observations in late September (and April 2015) possibly caused by DST clock-changes when clocks go forward 1 hour in NZ.

# point plot ----
obsPlot2 <-ggplot2::ggplot(hhStatDT, aes( x = date, 
                               y = nObs/nCircuits, 
                               colour = linkID)) +
  geom_point() +
  scale_x_date(date_labels = "%Y %b", date_breaks = "6 months") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, 
                                   hjust = 0.5)) + 
  labs(title = "N observations per household per circuit per day for all loaded GridSpy data",
       caption = myCaption

  )
obsPlot2
ggplot2::ggsave(paste0(plotLoc, "gridSpyLoadedFileNobsTilePlot.png"), obsPlot1)
ggplot2::ggsave(paste0(plotLoc, "gridSpyLoadedFileNobsPointPlot.png"), obsPlot2)
# Stats table (so we can pick out the dateTime errors)
t <- hhStatDT[, .(minObs = min(nObs),
             maxObs = max(nObs), # should not be more than 1440, if so suggests duplicates
             meanN_Circuits =mean(nCircuits), #i.e. n circuits
             minDate = min(date),
             maxDate = max(date)),
         keyby = .(linkID)]
kableExtra::kable(t,caption = "Summary observation stats by hhID", digits = 2) %>%
  kable_styling()

Table \@ref(tab:summaryTable) shows the min/max number of observations per day and min/max dates for each household. As above, we should not see:

If we do see any of these then we still have data cleaning work to do!

Finally Figure \@ref(fig:attritionPlot) plots the total number of households for whom we have any data on a given date. This gives an indication of the attrition rate.

plotDT <- hhStatDT[, .(nHH = uniqueN(linkID)), keyby = .(date)]

# point plot ----
ggplot2::ggplot(plotDT, aes( x = date, y = nHH)) +
  geom_point() +
  scale_x_date(date_labels = "%Y %b", date_breaks = "6 months") +
  theme(axis.text.x = element_text(angle = 90, 
                                   vjust = 0.5, 
                                   hjust = 0.5)) + 
  labs(title = "N live households per day for all loaded grid spy data",
       caption = myCaption

  )
# own chunk to hide warning
ggplot2::ggsave(paste0(plotLoc, "gridSpyLiveHouseholdsToDate.png"))

Date and time checks {#dateTimeChecks}

As we noted above the original data had a variety of date formats. The data processing code does as good a job as it can of parsing non-UTC dateTimes (i.e. the observations with time as NZT) to force the r_dateTime variable to always record as UTC.

Any duplicate observations were then removed by checking for exact repeats of the linkID <-> r_dateTime <-> circuit <-> powerW tuple. Note that this only checks for duplicates in terms of UTC...

This has consequences for Daylight Savings Time changes as follows:

To add even more confusion it is possible that attempts were made to 'correct' the time stamps in the DST breaks in the original data before it was downloaded by the research team. As it is almost impossible for us to determine what was, or should be done in your research context we:

dstBreaks <- fread(ggrParams$dstNZDates)
kableExtra::kable(dstBreaks, caption = "NZ DST breaks", digits = 2) %>%
  kable_styling()

We do not like timezones but we like DST even less. As an example, consider what happens in the following R code:

First set a dateTime and tell lubridate (and R) it is NZT:

dateTimeNZT1 <- lubridate::ymd_hm("2014-09-28 01:50", tz = "Pacific/Auckland")

Did it work?

dateTimeNZT1

Yes.

Is it DST?

lubridate::dst(dateTimeNZT1)

No.

Now set a dateTime that should be DST and tell lubridate (and R) it is NZT:

dateTimeNZT2 <- lubridate::ymd_hm("2014-01-28 01:50", tz = "Pacific/Auckland")

Did it work?

dateTimeNZT2

Yes.

Is it DST?

lubridate::dst(dateTimeNZT2)

Yes

Now set a dateTime that does not exist as it lies inside the DST break (see Table \@ref(tab:dstBreaks)):

dateTimeNZT2 <- lubridate::ymd_hm("2014-09-28 02:01", tz = "Pacific/Auckland")

Boom. Lubridate knows it does not exist in local (civil) time. But of course it does exist as UTC:

dateTimeUTC <- lubridate::ymd_hm("2014-09-28 02:01", tz = "UTC")
dateTimeUTC

Yes, we love timezones and DST.

So:

If in doubt load the data without any auto-parsing and have a good look at it!

Circuit label checks

The following table (\@ref(tab:circuitLabelCheck)) shows the number of files for each household with different circuit labels. In theory each GridSpy ID should only have one set of unique circuit labels. If not:

Some or all of these may be true at any given time.

dt <- hhStatDT[!is.na(circuitLabels), 
                     .(nFiles = .N,
                       nObs = sum(nObs),
                       meanDailyPowerkW = round(mean(meanPower)/1000,2),
                       minDailyPowerkW = round(min(minPowerW)/1000,2),
                       maxDailyPowerkW = round(max(maxPowerW)/1000,2)),
                     keyby = .(linkID,circuitLabels)] # ignore NA - it is files not loaded due to size thresholds

kableExtra::kable(dt, caption = "Circuit labels list by number of files per household", digits = 2) %>%
  kable_styling()

Things to note:

Errors are easier to spot in the following plot where a household spans 2 or more circuit label sets (see Figure \@ref(fig:plotCircuitLabelIssuesAsTile)).

dt$newx = stringr::str_wrap(dt$circuitLabels, width = 40) #https://stackoverflow.com/questions/21878974/auto-wrapping-of-labels-via-labeller-label-wrap-in-ggplot2

ggplot2::ggplot(dt, aes(y = newx, x = linkID, 
                       fill = nObs)) +
  geom_tile() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 0.5)) + 
  theme(axis.text.y = element_text(size = 3)) + 
  theme(legend.position="bottom") +
  scale_fill_gradient(low="green", high = "red") +
  labs(title = "Circuit label counts for all loaded grid spy data",
       y = "Circuit label sets (as strings)",
       caption = paste0(myCaption)

  )
ggplot2::ggsave(paste0(plotLoc, "gridSpyLoadedFileCircuitLabelsPlot.png"))

If the above plot and table flag errors then further re-naming of the circuit labels may be necessary.

Calculating total household power demand {#totPower}

Unfortunately this is not as straightforward as one would wish because many households have seperately controlled (and thus monitored) hot water circuits which do not feed from the 'Incomer'. We have provided some example code to attempt to correctly impute the sum of the relevant circuits in each house. This make use of a circuits-to-sum file which specifies the circuits to use in each case:

#YMMV

Dealing with circuit level outliers and negative power {#outliers}

There are a number of observations that have recorded negative power. There are at least two potential reasons for this:

We have conducted a seperate analysis of the incidence of negative values and outliers at the circuit level which makes recommendations on actions to take.

Loading the cleaned data files {#loadClean}

See the code examples for suggestions on how to do this.

Runtime

t <- proc.time() - startTime

elapsed <- t[[3]]

Analysis completed in r round(elapsed,2) seconds ( r round(elapsed/60,2) minutes) using knitr in RStudio with r R.version.string running on r R.version$platform.

R environment

R packages used

Session info

sessionInfo()

References



CfSOtago/GREENGridData documentation built on June 10, 2022, 8:03 p.m.