```{r setup}
knitr::opts_chunk$set(echo = FALSE) # by default turn off code echo. Warnings may still be shown.

# Set start time ----
startTime <- proc.time()

# Set grid spy data paths etc from file ----
source(paste0(ggrParams$repoLoc, "/dataProcessing/gridSpy/gSpyParams.R"))

# Packages needed in this .Rmd file ----
rmdLibs <- c("stringr") # for str_wrap for long labels on plots
# load them
loadLibraries(rmdLibs)

# Local parameters ----
b2Kb <- 1024 # http://whatsabyte.com/P1/byteconverter.htm
b2Mb <- 1048576
plotLoc <- paste0(ggrParams$repoLoc, "/docs/plots/")

# Local functions ----
```
\newpage
If you wish to use any of the material from this report please cite as:
`r ggrParams$Authors` (`r 1900 + as.POSIXlt(Sys.Date())$year`) `r params$title`: `r params$subtitle`, `r ggrParams$pubLoc`.

This work is (c) `r lubridate::year(lubridate::today())` the University of Southampton.
\newpage
This report provides summary data quality statistics for the original GREEN Grid GridSpy household power demand monitoring data. This data was used to create a derived 'safe' dataset using the code in the `r ggrParams$repo` repository.

The original data files are stored on `r ggrParams$otagoHCS`.
```{r}
# get file list
fListCompleteDT <- data.table::as.data.table(readr::read_csv(gSpyParams$fListAll))

# for use below
nFiles <- nrow(fListCompleteDT)
nFilesNotLoaded <- nrow(fListCompleteDT[dateColName %like% "ignore"])
```
Data collection is ongoing and this section reports on the availability of data files collected up to the time at which the most recent safe file was created (`r file.mtime(gSpyParams$fListAll)`). To date we have `r tidyNum(nFiles)` files from `r tidyNum(uniqueN(fListCompleteDT$hhID))` unique GridSpy IDs.
However a large number of files (`r tidyNum(nFilesNotLoaded)` or `r round(100*(nFilesNotLoaded/nFiles))`%) have one of two file sizes (43 or 2,751 bytes) and we have determined that they contain no data as the monitoring devices have either been removed (households have moved or withdrawn from the study) or data transfer has failed. We therefore flag these files as 'to be ignored'.
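In code terms, a minimal sketch of this kind of flag might look as follows; the exact label text is a hypothetical example (the real flagging lives in the repository's processing code), but note that the file list loaded above does use a `dateColName` value containing "ignore" for these files:

```{r, eval=FALSE, echo=TRUE}
# A minimal sketch, not the project's actual code: flag files at the two
# known 'empty' sizes so later processing skips them.
# The label text used here is a hypothetical example.
fListCompleteDT[fSize == 43 | fSize == 2751,
                dateColName := "ignore - file too small"]
```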
In addition two of the GridSpy units were re-used in new households following withdrawal of the original participants. The GridSpy IDs (rf_XX) remained unchanged despite allocation to different households. The original input data does not therefore distinguish between these households and we discuss how this is resolved in the clean safe data in Section \@ref(reallocation) below.
Figure \@ref(fig:allFileSizesPlot) shows the distribution of the file sizes of all files over time by GridSpy ID. Note that white indicates the presence of small files which may not contain observations.
```{r allFileSizesPlot, fig.cap="Mean file size of all GridSpy data files received per day"}
myCaption <- paste0("Data source: ", gSpyParams$gSpyInPath,
                    "\nUsing data received up to ", max(fListCompleteDT$fMDate))

plotDT <- fListCompleteDT[, .(nFiles = .N,
                              meanfSize = mean(fSize)),
                          keyby = .(hhID, date = as.Date(fMDate))]

ggplot2::ggplot(plotDT, aes(x = date, y = hhID, fill = log(meanfSize))) +
  geom_tile() +
  scale_fill_gradient(low = "white", high = "black") +
  scale_x_date(date_labels = "%Y %b", date_breaks = "1 month") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 0.5)) +
  labs(title = "Mean file size of all GridSpy data files received per day",
       y = "GridSpy ID",
       caption = paste0(myCaption,
                        "\nLog file size used as some files are full year data and so extremely large"))
```
ggplot2::ggsave(paste0(plotLoc, "gridSpyAllFileListSizeTilePlot.png"))
As we can see, relatively large files were downloaded (manually) in June and October 2016 before an automated download process was implemented from January 2017. A final manual download appears to have taken place in early December 2017.
Figure \@ref(fig:loadedFileSizesPlot) plots the same results but excludes files which do not meet the file size threshold and which we therefore assume do not contain data.
```{r loadedFileSizesPlot, fig.cap="Mean file size of loaded GridSpy data files received per day"}
plotDT <- fListCompleteDT[!is.na(dateFormat),
                          .(nFiles = .N,
                            meanfSize = mean(fSize)),
                          keyby = .(hhID, date = as.Date(fMDate))]

ggplot2::ggplot(plotDT, aes(x = date, y = hhID, fill = log(meanfSize))) +
  geom_tile() +
  scale_fill_gradient(low = "white", high = "black") +
  scale_x_date(date_labels = "%Y %b", date_breaks = "1 month") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 0.5)) +
  labs(title = "Mean file size of loaded GridSpy data files received per day",
       y = "GridSpy ID",
       caption = paste0(myCaption,
                        "\nLog file size used as some files are full year data",
                        "\nFiles loaded if fsize > ", gSpyParams$gSpyFileThreshold, " bytes"))
```
ggplot2::ggsave(paste0(plotLoc, "/gridSpyLoadedFileListSizeTilePlot.png"))
As we can see this removes a large number of the automatically downloaded files.
As noted above, the original data was downloaded in two ways:
Resolving and cleaning these variations and uncertainties has required substantial effort and in some cases the date format (and thus the time when timezones are set) has had to be inferred from the file names. A key lesson for future projects is to always ensure that files are named so that metadata is easily parsed and that there can be one and only one:
Table \@ref(tab:listDefaultDateFilesNZT) lists up to 10 of the 'date NZ' files which have been set to the default dateFormat - does it look safe to assume the default? Compare the file names with the dateExample...
```{r listDefaultDateFilesNZT}
# list default files with NZ time
aList <- fListCompleteDT[dateColName == "date NZ" & dateFormat %like% "default",
                         .(file, fSize, dateColName, dateExample, dateFormat)]
cap <- paste0("First 10 (max) of ", nrow(aList),
              " files with dateColName = 'date NZ' and default dateFormat")
kableExtra::kable(head(aList, 10), caption = cap, digits = 2) %>%
  kable_styling()
```
Table \@ref(tab:listDefaultDateFilesUTC) lists up to 10 of the 'date UTC' files which have been set to the default dateFormat - does it look safe to assume the default? Compare the file names with the dateExample...
```{r listDefaultDateFilesUTC}
# list default files with UTC time
aList <- fListCompleteDT[dateColName == "date UTC" & dateFormat %like% "default",
                         .(file, fSize, dateColName, dateExample, dateFormat)]
cap <- paste0("First 10 (max) of ", nrow(aList),
              " files with dateColName = 'date UTC' and default dateFormat")
kableExtra::kable(head(aList, 10), caption = cap, digits = 2) %>%
  kable_styling()
```
After final cleaning, the resulting date formats are shown in Table \@ref(tab:finalDateFormatTable).
```{r finalDateFormatTable}
# See what the date formats look like now
t <- fListCompleteDT[, .(nFiles = .N,
                         meanFSizeKb = tidyNum(mean(fSize/b2Kb)),
                         minFSizeKb = tidyNum(min(fSize/b2Kb)),
                         maxFSizeKb = tidyNum(max(fSize/b2Kb)),
                         minFDate = min(fMDate), # may not make much sense
                         maxFDate = max(fMDate)),
                     keyby = .(dateColName, dateFormat)]
kableExtra::kable(t,
                  caption = "Number of files & min/max dates (as char) with given date column names by final imputed date format",
                  digits = 2) %>%
  kable_styling()
```
Results to note:
* `r nrow(fListCompleteDT[dateFormat == "ambiguous"])` files are still labelled as having ambiguous dates.

In this section we analyse the data files that have a file size > `r gSpyParams$gSpyFileThreshold` bytes and which have been used to create the safe data. Things to note:
Table \@ref(tab:filesToLoadTable) shows the number of files per GridSpy ID that are actually processed to make the safe version together with the min/max file save dates (not the observed data dates).
```{r filesToLoadTable}
# check files to load
t <- fListCompleteDT[dateColName %like% "date",
                     .(nFiles = .N,
                       meanSize = mean(fSize),
                       minFileDate = min(fMDate),
                       maxFileDate = max(fMDate)),
                     keyby = .(gridSpyID = hhID)]
kableExtra::kable(t, caption = "Summary of household files to load", digits = 2) %>%
  kable_styling()
```
As noted in the introduction, two units were re-allocated to new households during the study. These were:
To avoid confusion the data for each of these units has been split into rf_XXa/rf_XXb files on the appropriate dates during data processing. In principle therefore the clean data should contain data files for:
However rf_15a did not collect usable data before the unit was re-allocated so only files for rf_15b, rf_17a and rf_17b exist in the archive.
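As a minimal sketch of the idea (the reallocation date and the consolidated data table `gsDT` are hypothetical; the real processing code lives in the repository):

```{r, eval=FALSE, echo=TRUE}
# A minimal sketch, not the actual processing code: split a re-used unit's
# data into 'a'/'b' pseudo-households at a (hypothetical) reallocation date
reallocDate <- as.Date("2016-06-01") # hypothetical example date
gsDT[hhID == "rf_17",
     linkID := ifelse(as.Date(r_dateTime) < reallocDate, "rf_17a", "rf_17b")]
```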
Each cleaned safe data file contains both the original `hhID` (i.e. the GridSpy ID) and a new `linkID` which has the same value as `hhID` except in the case of these three files. The `linkID` variable should always be used to link the GridSpy data to the survey or other household level data in the data package.

In all subsequent analysis we use `linkID` to give results for each household.
The following plots show the number of observations per day per household. In theory we should not see:
If present, both of the latter may have been implied by the table above and would have evaded the de-duplication filter, which simply checks each complete row against all others within its consolidated household dataset (a within-household absolute duplicate check).
Note that rf_15a is not present as no usable data was obtained from this household.
```{r}
hhStatDT <- data.table::as.data.table(readr::read_csv(gSpyParams$hhStatsByDate))
```
Figure \@ref(fig:obsPlotTile) uses a tile plot which is useful for visualising data gaps. Note that there are indications of missing observations in April possibly caused by DST clock-changes when clocks go back 1 hour in NZ.
```{r obsPlotTile, fig.cap="N observations per household per circuit per day for all loaded GridSpy data (tile plot)"}
myCaption <- paste0(myCaption,
                    "\nOnly files of size > ", gSpyParams$gSpyFileThreshold, " bytes loaded")

# tile plot ----
obsPlot1 <- ggplot2::ggplot(hhStatDT, aes(x = date, y = linkID,
                                          fill = nObs/nCircuits)) + # divide by the number of circuits
  geom_tile() +
  scale_fill_gradient(low = "red", high = "green") +
  scale_x_date(date_labels = "%Y %b", date_breaks = "6 months") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 0.5)) +
  labs(title = "N observations per household per circuit per day for all loaded GridSpy data",
       caption = myCaption)

obsPlot1
```
Figure \@ref(fig:obsPlotPoint) uses a point plot which is useful for visualising days where there was partial or duplicate data. Note that there are indications of duplicate observations in late September (and April 2015) possibly caused by DST clock-changes when clocks go forward 1 hour in NZ.
```{r obsPlotPoint, fig.cap="N observations per household per circuit per day for all loaded GridSpy data (point plot)"}
# point plot ----
obsPlot2 <- ggplot2::ggplot(hhStatDT, aes(x = date, y = nObs/nCircuits,
                                          colour = linkID)) +
  geom_point() +
  scale_x_date(date_labels = "%Y %b", date_breaks = "6 months") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 0.5)) +
  labs(title = "N observations per household per circuit per day for all loaded GridSpy data",
       caption = myCaption)

obsPlot2
```
ggplot2::ggsave(paste0(plotLoc, "gridSpyLoadedFileNobsTilePlot.png"), obsPlot1) ggplot2::ggsave(paste0(plotLoc, "gridSpyLoadedFileNobsPointPlot.png"), obsPlot2)
```{r summaryTable}
# Stats table (so we can pick out the dateTime errors)
t <- hhStatDT[, .(minObs = min(nObs),
                  maxObs = max(nObs), # should not be more than 1440, if so suggests duplicates
                  meanN_Circuits = mean(nCircuits), # i.e. n circuits
                  minDate = min(date),
                  maxDate = max(date)),
              keyby = .(linkID)]

kableExtra::kable(t, caption = "Summary observation stats by linkID", digits = 2) %>%
  kable_styling()
```
Table \@ref(tab:summaryTable) shows the min/max number of observations per day and min/max dates for each household. As above, we should not see:
If we do see any of these then we still have data cleaning work to do!
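For example, a quick check for suspect household-days (using the 1440 observations-per-day ceiling noted in the summary table code) might look like this minimal sketch:

```{r, eval=FALSE, echo=TRUE}
# A minimal sketch: list household-days where the per-circuit observation
# count exceeds one per minute (1440/day), which suggests duplicates
hhStatDT[nObs/nCircuits > 1440, .(linkID, date, nObs, nCircuits)]
```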
Finally Figure \@ref(fig:attritionPlot) plots the total number of households for whom we have any data on a given date. This gives an indication of the attrition rate.
```{r attritionPlot, fig.cap="N live households per day for all loaded GridSpy data"}
plotDT <- hhStatDT[, .(nHH = uniqueN(linkID)), keyby = .(date)]

# point plot ----
ggplot2::ggplot(plotDT, aes(x = date, y = nHH)) +
  geom_point() +
  scale_x_date(date_labels = "%Y %b", date_breaks = "6 months") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 0.5)) +
  labs(title = "N live households per day for all loaded GridSpy data",
       caption = myCaption)
```
```{r, warning=FALSE}
# own chunk to hide warning
ggplot2::ggsave(paste0(plotLoc, "gridSpyLiveHouseholdsToDate.png"))
```
As we noted above the original data had a variety of date formats. The data processing code does as good a job as it can of parsing non-UTC dateTimes (i.e. the observations with time as NZT) to force the `r_dateTime` variable to always record as UTC.
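As a minimal illustration of the principle (not the project's actual parsing code), `lubridate` can re-express an NZT timestamp as the same instant in UTC:

```{r, eval=FALSE, echo=TRUE}
# A minimal sketch: the same instant expressed first in NZT, then in UTC
nzt <- lubridate::ymd_hm("2015-01-10 14:00", tz = "Pacific/Auckland")
utc <- lubridate::with_tz(nzt, tzone = "UTC") # same moment, UTC clock time
```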
Any duplicate observations were then removed by checking for exact repeats of the `linkID` <-> `r_dateTime` <-> `circuit` <-> `powerW` tuple. Note that this only checks for duplicates in terms of UTC...
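In data.table terms the check amounts to something like this minimal sketch (`gsDT` is a hypothetical consolidated household dataset with the columns named above):

```{r, eval=FALSE, echo=TRUE}
# A minimal sketch of the within-household absolute duplicate check:
# keep only the first instance of each exact tuple
gsDT <- unique(gsDT, by = c("linkID", "r_dateTime", "circuit", "powerW"))
```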
This has consequences for Daylight Savings Time changes as follows:
* Data which was originally recorded as UTC (i.e. `dateTime_orig` is UTC so `TZ_orig` == "date UTC") is still recorded as UTC. If you load the data using a function which auto-parses dateTimes into your local time (e.g. `readr::read_csv()`) you will find the parser will (correctly) produce duplicate (or missing) time values during the relevant DST break. The underlying UTC dateTime will not have duplicates or missing observations (unless the data really is missing!), only the super-imposed local time representation used for printing, charts etc. This may cause confusion.
* Data which was originally recorded as NZT (i.e. `dateTime_orig` is NZT so `TZ_orig` == "date NZ") will already have had duplicate (or missing) times during the DST breaks. The data processing code will have attempted to convert the duplicates to identical UTC moments in time and any exact duplicates will have been 'accidentally' removed during the duplicate checking process described above. This may also cause confusion.

To add even more confusion, it is possible that attempts were made to 'correct' the time stamps in the DST breaks in the original data before it was downloaded by the research team. As it is almost impossible for us to determine what was, or should have been, done in your research context we:
* have retained the original `dateTime_orig` column;
* have retained the original `TZ_orig` column;
* recommend that users check the `r_dateTime` column if they see strange errors around the DST breaks.

```{r dstBreaks}
dstBreaks <- data.table::fread(ggrParams$dstNZDates)
kableExtra::kable(dstBreaks, caption = "NZ DST breaks", digits = 2) %>%
  kable_styling()
```
We do not like timezones but we like DST even less. As an example, consider what happens in the following R code:
First set a dateTime and tell lubridate (and R) it is NZT:
```{r, echo=TRUE}
dateTimeNZT1 <- lubridate::ymd_hm("2014-09-28 01:50", tz = "Pacific/Auckland")
```
Did it work?
```{r, echo=TRUE}
dateTimeNZT1
```
Yes.
Is it DST?
```{r, echo=TRUE}
lubridate::dst(dateTimeNZT1)
```
No.
Now set a dateTime that should be DST and tell lubridate (and R) it is NZT:
```{r, echo=TRUE}
dateTimeNZT2 <- lubridate::ymd_hm("2014-01-28 01:50", tz = "Pacific/Auckland")
```
Did it work?
```{r, echo=TRUE}
dateTimeNZT2
```
Yes.
Is it DST?
```{r, echo=TRUE}
lubridate::dst(dateTimeNZT2)
```
Yes.
Now set a dateTime that does not exist as it lies inside the DST break (see Table \@ref(tab:dstBreaks)):
```{r, echo=TRUE}
dateTimeNZT2 <- lubridate::ymd_hm("2014-09-28 02:01", tz = "Pacific/Auckland")
```
Boom. Lubridate knows it does not exist in local (civil) time. But of course it does exist as UTC:
```{r, echo=TRUE}
dateTimeUTC <- lubridate::ymd_hm("2014-09-28 02:01", tz = "UTC")
dateTimeUTC
```
Yes, we love timezones and DST.
So:
If in doubt load the data without any auto-parsing and have a good look at it!
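As a minimal sketch of what we mean (the file name is a hypothetical example), `readr` can be told to leave every column as character so nothing is auto-parsed:

```{r, eval=FALSE, echo=TRUE}
# A minimal sketch: load with no dateTime auto-parsing so the raw
# date strings can be inspected first (file name is hypothetical)
raw <- readr::read_csv("someGridSpyFile.csv",
                       col_types = readr::cols(.default = readr::col_character()))
head(raw)
```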
The following table (\@ref(tab:circuitLabelCheck)) shows the number of files for each household with different circuit labels. In theory each GridSpy ID should only have one set of unique circuit labels. If not:
Some or all of these may be true at any given time.
```{r circuitLabelCheck}
dt <- hhStatDT[!is.na(circuitLabels), # ignore NA - it is files not loaded due to size thresholds
               .(nFiles = .N,
                 nObs = sum(nObs),
                 meanDailyPowerkW = round(mean(meanPower)/1000, 2),
                 minDailyPowerkW = round(min(minPowerW)/1000, 2),
                 maxDailyPowerkW = round(max(maxPowerW)/1000, 2)),
               keyby = .(linkID, circuitLabels)]

kableExtra::kable(dt, caption = "Circuit labels list by number of files per household", digits = 2) %>%
  kable_styling()
```
Things to note:
Errors are easier to spot in the following plot where a household spans 2 or more circuit label sets (see Figure \@ref(fig:plotCircuitLabelIssuesAsTile)).
```{r plotCircuitLabelIssuesAsTile, fig.cap="Circuit label counts for all loaded GridSpy data"}
# wrap long circuit label strings for the y axis
# https://stackoverflow.com/questions/21878974/auto-wrapping-of-labels-via-labeller-label-wrap-in-ggplot2
dt$newx <- stringr::str_wrap(dt$circuitLabels, width = 40)

ggplot2::ggplot(dt, aes(y = newx, x = linkID, fill = nObs)) +
  geom_tile() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 0.5)) +
  theme(axis.text.y = element_text(size = 3)) +
  theme(legend.position = "bottom") +
  scale_fill_gradient(low = "green", high = "red") +
  labs(title = "Circuit label counts for all loaded GridSpy data",
       y = "Circuit label sets (as strings)",
       caption = myCaption)
```
ggplot2::ggsave(paste0(plotLoc, "gridSpyLoadedFileCircuitLabelsPlot.png"))
If the above plot and table flag errors then further re-naming of the circuit labels may be necessary.
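Where a mis-labelled set is identified, the fix can be as simple as this minimal sketch (the label values and the data table `gsDT` are hypothetical examples, not labels from the actual data):

```{r, eval=FALSE, echo=TRUE}
# A minimal sketch: harmonise a hypothetical mis-spelled circuit label
gsDT[circuit == "HotWater", circuit := "Hot Water"]
```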
Unfortunately this is not as straightforward as one would wish because many households have separately controlled (and thus monitored) hot water circuits which do not feed from the 'Incomer'. We have provided some example code to attempt to correctly impute the sum of the relevant circuits in each house. This makes use of a circuits-to-sum file which specifies the circuits to use in each case:
```{r, eval=FALSE}
# YMMV
```
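A minimal sketch of the idea follows, assuming a hypothetical `circuitsToSum.csv` with columns `linkID` and `circuitsToSum` (a comma-separated list of circuit labels) and a loaded GridSpy data table `gsDT` with the columns listed earlier; the repository's example code is the authoritative version:

```{r, eval=FALSE, echo=TRUE}
# A minimal sketch, not the project's actual code: sum the circuits named
# in a (hypothetical) circuits-to-sum file for each household & time point
circuitsDT <- data.table::fread("circuitsToSum.csv")

sumHousehold <- function(hh, circuitList) {
  wanted <- trimws(strsplit(circuitList, ",")[[1]]) # circuits to sum for this household
  gsDT[linkID == hh & circuit %in% wanted,
       .(sumPowerW = sum(powerW)), # total power across the chosen circuits
       keyby = .(linkID, r_dateTime)]
}

# apply to each household listed in the circuits-to-sum file
totalDT <- data.table::rbindlist(
  mapply(sumHousehold, circuitsDT$linkID, circuitsDT$circuitsToSum,
         SIMPLIFY = FALSE)
)
```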
There are a number of observations that have recorded negative power. There are at least two potential reasons for this:
We have conducted a separate analysis of the incidence of negative values and outliers at the circuit level which makes recommendations on actions to take.
See the code examples for suggestions on how to do this.
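For instance, a minimal sketch for locating negative observations (`gsDT` as before) might be:

```{r, eval=FALSE, echo=TRUE}
# A minimal sketch: count negative power observations per household & circuit
negDT <- gsDT[powerW < 0, .N, keyby = .(linkID, circuit)]
```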
```{r}
t <- proc.time() - startTime
elapsed <- t[[3]]
```
Analysis completed in `r round(elapsed, 2)` seconds (`r round(elapsed/60, 2)` minutes) using knitr in RStudio with `r R.version.string` running on `r R.version$platform`.
```{r}
sessionInfo()
```