```{r setup}
knitr::opts_chunk$set(echo = FALSE) # by default turn off code echo. Warnings may still be shown.

# Set start time ----
startTime <- proc.time()

# Set grid spy data paths etc from file ----
source(paste0(ggrParams$repoLoc, "/dataProcessing/gridSpy/gSpyParams.R"))

# Packages needed in this .Rmd file ----
rmdLibs <- c("stringr") # for str_wrap for long labels on plots
# load them
loadLibraries(rmdLibs)

# Local parameters ----
b2Kb <- 1024 # http://whatsabyte.com/P1/byteconverter.htm
b2Mb <- 1048576
plotLoc <- paste0(ggrParams$repoLoc, "/docs/plots/")

# Local functions ----
```
\newpage
If you wish to use any of the material from this report please cite as:
`r ggrParams$Authors` (`r 1900 + as.POSIXlt(Sys.Date())$year`) `r params$title`: `r params$subtitle`, `r ggrParams$pubLoc`.

This work is (c) `r lubridate::year(lubridate::today())` the University of Southampton.
\newpage
This report provides summary data quality statistics for the original GREEN Grid GridSpy household power demand monitoring data. This data was used to create a derived 'safe' dataset using the code in the `r ggrParams$repo` repository.

The original data files are stored on `r ggrParams$otagoHCS`.
```{r}
# get file list
fListCompleteDT <- data.table::as.data.table(readr::read_csv(gSpyParams$fListAll))

# for use below
nFiles <- nrow(fListCompleteDT)
nFilesNotLoaded <- nrow(fListCompleteDT[dateColName %like% "ignore"])
```
Data collection is ongoing and this section reports on the availability of data files collected up to the time at which the most recent safe file was created (`r file.mtime(gSpyParams$fListAll)`). To date we have `r tidyNum(nFiles)` files from `r tidyNum(uniqueN(fListCompleteDT$hhID))` unique GridSpy IDs.
However a large number of files (`r tidyNum(nFilesNotLoaded)` or `r round(100*(nFilesNotLoaded/nFiles))`%) have one of two file sizes (43 or 2,751 bytes) and we have determined that they contain no data as the monitoring devices have either been removed (households have moved or withdrawn from the study) or data transfer has failed. We therefore flag these files as 'to be ignored'.
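In code terms, a minimal sketch of this kind of flag might look as follows; the exact label text is a hypothetical example (the real flagging lives in the repository's processing code), but note that the file list loaded above does use a `dateColName` value containing "ignore" for these files:

```{r, eval=FALSE, echo=TRUE}
# A minimal sketch, not the project's actual code: flag files at the two
# known 'empty' sizes so later processing skips them.
# The label text used here is a hypothetical example.
fListCompleteDT[fSize == 43 | fSize == 2751,
                dateColName := "ignore - file too small"]
```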
In addition two of the GridSpy units were re-used in new households following withdrawal of the original participants. The GridSpy IDs (rf_XX) remained unchanged despite allocation to different households. The original input data does not therefore distinguish between these households and we discuss how this is resolved in the clean safe data in Section \@ref(reallocation) below.
Figure \@ref(fig:allFileSizesPlot) shows the distribution of the file sizes of all files over time by GridSpy ID. Note that white indicates the presence of small files which may not contain observations.
```{r allFileSizesPlot, fig.cap="Mean file size of all GridSpy data files received per day"}
myCaption <- paste0("Data source: ", gSpyParams$gSpyInPath,
                    "\nUsing data received up to ", max(fListCompleteDT$fMDate))

plotDT <- fListCompleteDT[, .(nFiles = .N,
                              meanfSize = mean(fSize)),
                          keyby = .(hhID, date = as.Date(fMDate))]

ggplot2::ggplot(plotDT, aes(x = date, y = hhID, fill = log(meanfSize))) +
  geom_tile() +
  scale_fill_gradient(low = "white", high = "black") +
  scale_x_date(date_labels = "%Y %b", date_breaks = "1 month") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 0.5)) +
  labs(title = "Mean file size of all GridSpy data files received per day",
       y = "GridSpy ID",
       caption = paste0(myCaption,
                        "\nLog file size used as some files are full year data and so extremely large"))
```
ggplot2::ggsave(paste0(plotLoc, "gridSpyAllFileListSizeTilePlot.png"))
As we can see, relatively large files were downloaded (manually) in June and October 2016 before an automated download process was implemented from January 2017. A final manual download appears to have taken place in early December 2017.
Figure \@ref(fig:loadedFileSizesPlot) plots the same results but excludes files which do not meet the file size threshold and which we therefore assume do not contain data.
```{r loadedFileSizesPlot, fig.cap="Mean file size of loaded GridSpy data files received per day"}
plotDT <- fListCompleteDT[!is.na(dateFormat),
                          .(nFiles = .N,
                            meanfSize = mean(fSize)),
                          keyby = .(hhID, date = as.Date(fMDate))]

ggplot2::ggplot(plotDT, aes(x = date, y = hhID, fill = log(meanfSize))) +
  geom_tile() +
  scale_fill_gradient(low = "white", high = "black") +
  scale_x_date(date_labels = "%Y %b", date_breaks = "1 month") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 0.5)) +
  labs(title = "Mean file size of loaded GridSpy data files received per day",
       y = "GridSpy ID",
       caption = paste0(myCaption,
                        "\nLog file size used as some files are full year data",
                        "\nFiles loaded if fsize > ", gSpyParams$gSpyFileThreshold, " bytes"))
```
ggplot2::ggsave(paste0(plotLoc, "/gridSpyLoadedFileListSizeTilePlot.png"))
As we can see this removes a large number of the automatically downloaded files.
As noted above, the original data was downloaded in two ways:
Resolving and cleaning these variations and uncertainties has required substantial effort and in some cases the date format (and thus the time when timezones are set) has had to be inferred from the file names. A key lesson for future projects is to always ensure that files are named so that metadata is easily parsed and that there can be one and only one:
Table \@ref(tab:listDefaultDateFilesNZT) lists up to 10 of the 'date NZ' files which have been set to the default dateFormat - does it look safe to assume the default? Compare the file names with the dateExample...
```{r listDefaultDateFilesNZT}
# list default files with NZ time
aList <- fListCompleteDT[dateColName == "date NZ" & dateFormat %like% "default",
                         .(file, fSize, dateColName, dateExample, dateFormat)]
cap <- paste0("First 10 (max) of ", nrow(aList),
              " files with dateColName = 'date NZ' and default dateFormat")
kableExtra::kable(head(aList, 10), caption = cap, digits = 2) %>%
  kable_styling()
```
Table \@ref(tab:listDefaultDateFilesUTC) lists up to 10 of the 'date UTC' files which have been set to the default dateFormat - does it look safe to assume the default? Compare the file names with the dateExample...
```{r listDefaultDateFilesUTC}
# list default files with UTC time
aList <- fListCompleteDT[dateColName == "date UTC" & dateFormat %like% "default",
                         .(file, fSize, dateColName, dateExample, dateFormat)]
cap <- paste0("First 10 (max) of ", nrow(aList),
              " files with dateColName = 'date UTC' and default dateFormat")
kableExtra::kable(head(aList, 10), caption = cap, digits = 2) %>%
  kable_styling()
```
After final cleaning, the resulting date formats are shown in Table \@ref(tab:finalDateFormatTable).
```{r finalDateFormatTable}
# See what the date formats look like now
t <- fListCompleteDT[, .(nFiles = .N,
                         meanFSizeKb = tidyNum(mean(fSize/b2Kb)),
                         minFSizeKb = tidyNum(min(fSize/b2Kb)),
                         maxFSizeKb = tidyNum(max(fSize/b2Kb)),
                         minFDate = min(fMDate), # may not make much sense
                         maxFDate = max(fMDate)),
                     keyby = .(dateColName, dateFormat)]
kableExtra::kable(t,
                  caption = "Number of files & min/max dates (as char) with given date column names by final imputed date format",
                  digits = 2) %>%
  kable_styling()
```
Results to note:
* `r nrow(fListCompleteDT[dateFormat == "ambiguous"])` files are still labelled as having ambiguous dates.

In this section we analyse the data files that have a file size > `r gSpyParams$gSpyFileThreshold` bytes and which have been used to create the safe data. Things to note:
Table \@ref(tab:filesToLoadTable) shows the number of files per GridSpy ID that are actually processed to make the safe version together with the min/max file save dates (not the observed data dates).
```{r filesToLoadTable}
# check files to load
t <- fListCompleteDT[dateColName %like% "date",
                     .(nFiles = .N,
                       meanSize = mean(fSize),
                       minFileDate = min(fMDate),
                       maxFileDate = max(fMDate)),
                     keyby = .(gridSpyID = hhID)]
kableExtra::kable(t, caption = "Summary of household files to load", digits = 2) %>%
  kable_styling()
```
As noted in the introduction, two units were re-allocated to new households during the study. These were:
To avoid confusion the data for each of these units has been split into rf_XXa/rf_XXb files on the appropriate dates during data processing. In principle therefore the clean data should contain data files for:
However rf_15a did not collect usable data before the unit was re-allocated so only files for rf_15b, rf_17a and rf_17b exist in the archive.
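As a minimal sketch of the idea (the reallocation date and the consolidated data table `gsDT` are hypothetical; the real processing code lives in the repository):

```{r, eval=FALSE, echo=TRUE}
# A minimal sketch, not the actual processing code: split a re-used unit's
# data into 'a'/'b' pseudo-households at a (hypothetical) reallocation date
reallocDate <- as.Date("2016-06-01") # hypothetical example date
gsDT[hhID == "rf_17",
     linkID := ifelse(as.Date(r_dateTime) < reallocDate, "rf_17a", "rf_17b")]
```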
Each cleaned safe data file contains both the original `hhID` (i.e. the GridSpy ID) and a new `linkID` which has the same value as `hhID` except in the case of these three files. The `linkID` variable should always be used to link the GridSpy data to the survey or other household level data in the data package.

In all subsequent analysis we use `linkID` to give results for each household.
The following plots show the number of observations per day per household. In theory we should not see:
If present, both of the latter may have been implied by the table above and would have evaded the de-duplication filter, which simply checks each complete row against all others within its consolidated household dataset (a within-household absolute duplicate check).
Note that rf_15a is not present as no usable data was obtained from this household.
```{r}
hhStatDT <- data.table::as.data.table(readr::read_csv(gSpyParams$hhStatsByDate))
```
Figure \@ref(fig:obsPlotTile) uses a tile plot which is useful for visualising data gaps. Note that there are indications of missing observations in April possibly caused by DST clock-changes when clocks go back 1 hour in NZ.
```{r obsPlotTile, fig.cap="N observations per household per circuit per day for all loaded GridSpy data (tile plot)"}
myCaption <- paste0(myCaption,
                    "\nOnly files of size > ", gSpyParams$gSpyFileThreshold, " bytes loaded")

# tile plot ----
obsPlot1 <- ggplot2::ggplot(hhStatDT, aes(x = date, y = linkID,
                                          fill = nObs/nCircuits)) + # divide by the number of circuits
  geom_tile() +
  scale_fill_gradient(low = "red", high = "green") +
  scale_x_date(date_labels = "%Y %b", date_breaks = "6 months") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 0.5)) +
  labs(title = "N observations per household per circuit per day for all loaded GridSpy data",
       caption = myCaption)

obsPlot1
```
Figure \@ref(fig:obsPlotPoint) uses a point plot which is useful for visualising days where there was partial or duplicate data. Note that there are indications of duplicate observations in late September (and April 2015) possibly caused by DST clock-changes when clocks go forward 1 hour in NZ.
```{r obsPlotPoint, fig.cap="N observations per household per circuit per day for all loaded GridSpy data (point plot)"}
# point plot ----
obsPlot2 <- ggplot2::ggplot(hhStatDT, aes(x = date, y = nObs/nCircuits,
                                          colour = linkID)) +
  geom_point() +
  scale_x_date(date_labels = "%Y %b", date_breaks = "6 months") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 0.5)) +
  labs(title = "N observations per household per circuit per day for all loaded GridSpy data",
       caption = myCaption)

obsPlot2
```
ggplot2::ggsave(paste0(plotLoc, "gridSpyLoadedFileNobsTilePlot.png"), obsPlot1) ggplot2::ggsave(paste0(plotLoc, "gridSpyLoadedFileNobsPointPlot.png"), obsPlot2)
```{r summaryTable}
# Stats table (so we can pick out the dateTime errors)
t <- hhStatDT[, .(minObs = min(nObs),
                  maxObs = max(nObs), # should not be more than 1440, if so suggests duplicates
                  meanN_Circuits = mean(nCircuits), # i.e. n circuits
                  minDate = min(date),
                  maxDate = max(date)),
              keyby = .(linkID)]

kableExtra::kable(t, caption = "Summary observation stats by linkID", digits = 2) %>%
  kable_styling()
```
Table \@ref(tab:summaryTable) shows the min/max number of observations per day and min/max dates for each household. As above, we should not see:
If we do see any of these then we still have data cleaning work to do!
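For example, a quick check for suspect household-days (using the 1440 observations-per-day ceiling noted in the summary table code) might look like this minimal sketch:

```{r, eval=FALSE, echo=TRUE}
# A minimal sketch: list household-days where the per-circuit observation
# count exceeds one per minute (1440/day), which suggests duplicates
hhStatDT[nObs/nCircuits > 1440, .(linkID, date, nObs, nCircuits)]
```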
Finally Figure \@ref(fig:attritionPlot) plots the total number of households for whom we have any data on a given date. This gives an indication of the attrition rate.
```{r attritionPlot, fig.cap="N live households per day for all loaded GridSpy data"}
plotDT <- hhStatDT[, .(nHH = uniqueN(linkID)), keyby = .(date)]

# point plot ----
ggplot2::ggplot(plotDT, aes(x = date, y = nHH)) +
  geom_point() +
  scale_x_date(date_labels = "%Y %b", date_breaks = "6 months") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 0.5)) +
  labs(title = "N live households per day for all loaded GridSpy data",
       caption = myCaption)
```
```{r, warning=FALSE}
# own chunk to hide warning
ggplot2::ggsave(paste0(plotLoc, "gridSpyLiveHouseholdsToDate.png"))
```
As we noted above the original data had a variety of date formats. The data processing code does as good a job as it can of parsing non-UTC dateTimes (i.e. the observations with time as NZT) to force the `r_dateTime` variable to always record as UTC.
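As a minimal illustration of the principle (not the project's actual parsing code), `lubridate` can re-express an NZT timestamp as the same instant in UTC:

```{r, eval=FALSE, echo=TRUE}
# A minimal sketch: the same instant expressed first in NZT, then in UTC
nzt <- lubridate::ymd_hm("2015-01-10 14:00", tz = "Pacific/Auckland")
utc <- lubridate::with_tz(nzt, tzone = "UTC") # same moment, UTC clock time
```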
Any duplicate observations were then removed by checking for exact repeats of the `linkID` <-> `r_dateTime` <-> `circuit` <-> `powerW` tuple. Note that this only checks for duplicates in terms of UTC...
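In data.table terms the check amounts to something like this minimal sketch (`gsDT` is a hypothetical consolidated household dataset with the columns named above):

```{r, eval=FALSE, echo=TRUE}
# A minimal sketch of the within-household absolute duplicate check:
# keep only the first instance of each exact tuple
gsDT <- unique(gsDT, by = c("linkID", "r_dateTime", "circuit", "powerW"))
```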
This has consequences for Daylight Savings Time changes as follows:
* Data which was originally recorded as UTC (i.e. `dateTime_orig` is UTC so `TZ_orig` == "date UTC") is still recorded as UTC. If you load the data using a function which auto-parses dateTimes into your local time (e.g. `readr::read_csv()`) you will find the parser will (correctly) produce duplicate (or missing) time values during the relevant DST break. The underlying UTC dateTime will not have duplicates or missing observations (unless the data really is missing!), only the super-imposed local time representation used for printing, charts etc. This may cause confusion.
* Data which was originally recorded as NZT (i.e. `dateTime_orig` is NZT so `TZ_orig` == "date NZ") will already have had duplicate (or missing) times during the DST breaks. The data processing code will have attempted to convert the duplicates to identical UTC moments in time and any exact duplicates will have been 'accidentally' removed during the duplicate checking process described above. This may also cause confusion.

To add even more confusion, it is possible that attempts were made to 'correct' the time stamps in the DST breaks in the original data before it was downloaded by the research team. As it is almost impossible for us to determine what was, or should have been, done in your research context we:
* have retained the original `dateTime_orig` column;
* have retained the original `TZ_orig` column;
* recommend that users check the `r_dateTime` column if they see strange errors around the DST breaks.

```{r dstBreaks}
dstBreaks <- data.table::fread(ggrParams$dstNZDates)
kableExtra::kable(dstBreaks, caption = "NZ DST breaks", digits = 2) %>%
  kable_styling()
```
We do not like timezones but we like DST even less. As an example, consider what happens in the following R code:
First set a dateTime and tell lubridate (and R) it is NZT:
```{r, echo=TRUE}
dateTimeNZT1 <- lubridate::ymd_hm("2014-09-28 01:50", tz = "Pacific/Auckland")
```
Did it work?
```{r, echo=TRUE}
dateTimeNZT1
```
Yes.
Is it DST?
```{r, echo=TRUE}
lubridate::dst(dateTimeNZT1)
```
No.
Now set a dateTime that should be DST and tell lubridate (and R) it is NZT:
```{r, echo=TRUE}
dateTimeNZT2 <- lubridate::ymd_hm("2014-01-28 01:50", tz = "Pacific/Auckland")
```
Did it work?
```{r, echo=TRUE}
dateTimeNZT2
```
Yes.
Is it DST?
```{r, echo=TRUE}
lubridate::dst(dateTimeNZT2)
```
Yes.
Now set a dateTime that does not exist as it lies inside the DST break (see Table \@ref(tab:dstBreaks)):
```{r, echo=TRUE}
dateTimeNZT2 <- lubridate::ymd_hm("2014-09-28 02:01", tz = "Pacific/Auckland")
```
Boom. Lubridate knows it does not exist in local (civil) time. But of course it does exist as UTC:
```{r, echo=TRUE}
dateTimeUTC <- lubridate::ymd_hm("2014-09-28 02:01", tz = "UTC")
dateTimeUTC
```
Yes, we love timezones and DST.
So:
If in doubt load the data without any auto-parsing and have a good look at it!
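As a minimal sketch of what we mean (the file name is a hypothetical example), `readr` can be told to leave every column as character so nothing is auto-parsed:

```{r, eval=FALSE, echo=TRUE}
# A minimal sketch: load with no dateTime auto-parsing so the raw
# date strings can be inspected first (file name is hypothetical)
raw <- readr::read_csv("someGridSpyFile.csv",
                       col_types = readr::cols(.default = readr::col_character()))
head(raw)
```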
The following table (\@ref(tab:circuitLabelCheck)) shows the number of files for each household with different circuit labels. In theory each GridSpy ID should only have one set of unique circuit labels. If not:
Some or all of these may be true at any given time.
```{r circuitLabelCheck}
dt <- hhStatDT[!is.na(circuitLabels), # ignore NA - it is files not loaded due to size thresholds
               .(nFiles = .N,
                 nObs = sum(nObs),
                 meanDailyPowerkW = round(mean(meanPower)/1000, 2),
                 minDailyPowerkW = round(min(minPowerW)/1000, 2),
                 maxDailyPowerkW = round(max(maxPowerW)/1000, 2)),
               keyby = .(linkID, circuitLabels)]

kableExtra::kable(dt, caption = "Circuit labels list by number of files per household", digits = 2) %>%
  kable_styling()
```
Things to note:
Errors are easier to spot in the following plot where a household spans 2 or more circuit label sets (see Figure \@ref(fig:plotCircuitLabelIssuesAsTile)).
```{r plotCircuitLabelIssuesAsTile, fig.cap="Circuit label counts for all loaded GridSpy data"}
# wrap long circuit label strings for the y axis
# https://stackoverflow.com/questions/21878974/auto-wrapping-of-labels-via-labeller-label-wrap-in-ggplot2
dt$newx <- stringr::str_wrap(dt$circuitLabels, width = 40)

ggplot2::ggplot(dt, aes(y = newx, x = linkID, fill = nObs)) +
  geom_tile() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 0.5)) +
  theme(axis.text.y = element_text(size = 3)) +
  theme(legend.position = "bottom") +
  scale_fill_gradient(low = "green", high = "red") +
  labs(title = "Circuit label counts for all loaded GridSpy data",
       y = "Circuit label sets (as strings)",
       caption = myCaption)
```
ggplot2::ggsave(paste0(plotLoc, "gridSpyLoadedFileCircuitLabelsPlot.png"))
If the above plot and table flag errors then further re-naming of the circuit labels may be necessary.
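Where a mis-labelled set is identified, the fix can be as simple as this minimal sketch (the label values and the data table `gsDT` are hypothetical examples, not labels from the actual data):

```{r, eval=FALSE, echo=TRUE}
# A minimal sketch: harmonise a hypothetical mis-spelled circuit label
gsDT[circuit == "HotWater", circuit := "Hot Water"]
```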
Unfortunately this is not as straightforward as one would wish because many households have separately controlled (and thus monitored) hot water circuits which do not feed from the 'Incomer'. We have provided some example code to attempt to correctly impute the sum of the relevant circuits in each house. This makes use of a circuits-to-sum file which specifies the circuits to use in each case:
```{r, eval=FALSE}
# YMMV
```
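A minimal sketch of the idea follows, assuming a hypothetical `circuitsToSum.csv` with columns `linkID` and `circuitsToSum` (a comma-separated list of circuit labels) and a loaded GridSpy data table `gsDT` with the columns listed earlier; the repository's example code is the authoritative version:

```{r, eval=FALSE, echo=TRUE}
# A minimal sketch, not the project's actual code: sum the circuits named
# in a (hypothetical) circuits-to-sum file for each household & time point
circuitsDT <- data.table::fread("circuitsToSum.csv")

sumHousehold <- function(hh, circuitList) {
  wanted <- trimws(strsplit(circuitList, ",")[[1]]) # circuits to sum for this household
  gsDT[linkID == hh & circuit %in% wanted,
       .(sumPowerW = sum(powerW)), # total power across the chosen circuits
       keyby = .(linkID, r_dateTime)]
}

# apply to each household listed in the circuits-to-sum file
totalDT <- data.table::rbindlist(
  mapply(sumHousehold, circuitsDT$linkID, circuitsDT$circuitsToSum,
         SIMPLIFY = FALSE)
)
```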
There are a number of observations that have recorded negative power. There are at least two potential reasons for this:
We have conducted a separate analysis of the incidence of negative values and outliers at the circuit level which makes recommendations on actions to take.
See the code examples for suggestions on how to do this.
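For instance, a minimal sketch for locating negative observations (`gsDT` as before) might be:

```{r, eval=FALSE, echo=TRUE}
# A minimal sketch: count negative power observations per household & circuit
negDT <- gsDT[powerW < 0, .N, keyby = .(linkID, circuit)]
```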
```{r}
t <- proc.time() - startTime
elapsed <- t[[3]]
```
Analysis completed in `r round(elapsed, 2)` seconds (`r round(elapsed/60, 2)` minutes) using knitr in RStudio with `r R.version.string` running on `r R.version$platform`.
```{r}
sessionInfo()
```