In CfSOtago/GREENGridEECA: Supports Analysis of NZ GREEN Grid data for EECA

knitr::opts_chunk$set(echo = FALSE) # by default turn off code echo

# Set start time ----
startTime <- proc.time()
# Local parameters ----
b2Kb <- 1024 #http://whatsabyte.com/P1/byteconverter.htm
b2Mb <- 1048576
plotLoc <- paste0(repoParams$repoLoc, "/docs/plots/") # where to put the plots

# Packages used in the report ----
rmdLibs <- c("ggplot2", # plots
          "kableExtra", # fancy tables
          "lubridate"
          )
GREENGridEECA::loadLibraries(rmdLibs)

# Local parameters

doSkim <- 1 # set to 1 to get skimmed tables in Data Annex (takes time)

# Local functions ----

\newpage

About

Citation

Report circulation

Public – this report is intended for publication following EECA approval.

Code

All code used to create this report is available from:

https://github.com/CfSOtago/GREENGridEECA

License {#license}

History

this report's edit history

Support

The European Union via SPATIALEC, a Marie Skłodowska-Curie Global Fellowship based at the University of Otago’s Centre for Sustainability (2017-2019) & the University of Southampton’s Sustainable Energy Research Group (2019-2020).

\newpage

Introduction

Previous work conducted as Part B of this project has calculated the % contribution of different appliances to peak power load (kW) and to overall annual consumption (kWh) for the ~ 40 New Zealand GREEN Grid household electricity demand study dwellings [@stephenson_smart_2017].

EECA have expressed an interest in understanding how these results and/or this sample of dwellings can be scaled up to provide estimates of the same values for the overall New Zealand residential dwelling stock/household population.

Unfortunately, due to their recruitment method the New Zealand GREEN Grid household electricity demand study sample dwellings are unlikely to be representative of either the two regions from which they were drawn (Taranaki and Hawke's Bay) or New Zealand as a whole as we discuss below. In addition these households form an extremely small sample so that the resulting confidence intervals and margins of error around derived estimates are large (see Section \@ref(ggSample) below). In combination, these two issues mean that upscaling the sample to represent either the regions from which they were drawn and/or the national household population will produce estimates that are neither representative nor precise.

If we assume that the GREEN Grid sample adequately represents the regional or national distribution of electricity-using devices and appliances and that the way the occupants of the GREEN Grid dwellings use these devices and appliances represents the usage habits and patterns of all New Zealand households then we could make generalisable inferences. However we can do very little about the lack of precision caused by the small sample size.

Whilst these assumptions may be plausible in the case of, say, lighting, it is likely to be less plausible in the case of space or hot water heating due to the absence of reticulated gas and coal/wood as main energy sources from the GREEN Grid sample. Further, the very small size of the GREEN Grid sample means that unusual patterns may be unrealistically inflated in the upscaling process.

Nevertheless it is still useful to review potential upscaling methods should we wish to make assumptions of this kind or to guide future work when a larger and representative sample of New Zealand households and their energy consumption is available.

The remainder of this report therefore introduces three potential upscaling methods both of which have been used with the GREEN Grid household electricity demand data in different ways and for different purposes. The first is a simple multiplication method used to estimate overall residential electricity consumption for comparison to the GXP level data used in Part B. The second is a simple proportionate upscaling method used to estimate energy demand for heat pumps in New Zealand [@dortans_estimating_2019]. The third is a more complex re-weighting approach that uses iterative proportional fitting [@simpson_combining_2005; @anderson_estimating_2012] to re-weight households on multiple dimensions to match known Census distributions [@anderson_small_2019].

Data

A number of data sources are used in this report.

Census 2013

# 2013
# 1,549,890
# http://archive.stats.govt.nz/Census/2013-census/profile-and-summary-reports/qstats-families-households/households.aspx
# basically matches sum(ipfCensusDT$npeople_totalHouseholds, na.rm = TRUE)
nzHhPop2013 <- sum(ipfCensusDT$npeople_totalHouseholds, na.rm = TRUE)

# 2015
# http://archive.stats.govt.nz/browse_for_stats/population/estimates_and_projections/DwellingHouseholdEstimates_HOTPJun17qtr.aspx
# 30 June 2017
nzHhPop2015 <- 1796000

# 
hhGrowth <- (100*(nzHhPop2015/nzHhPop2013)) - 100

This data comprises family, household or dwelling counts (depending on the variables of interest) at area unit level. The data was downloaded from the Stats NZ census table download tool and processed using code developed as part of the EU H2020 Marie Skłodowska-Curie scheme-funded SPATIALEC Global Fellowship hosted by the University of Otago.

Grid exit point kWh export (GXP)

gxpDataDT <- drake::readd(gxpData) # gxp data
gxpDataDT <- gxpDataDT[, hms := hms::as_hms(rDateTime)]
gxpDataDT <- GREENGridEECA::addNZSeason(gxpDataDT)
gxpDataDT <- gxpDataDT[, year := lubridate::year(date)]
gxpDataDT <- gxpDataDT[!is.na(kWh)]

This data comprises the half-hourly electricity export (kWh) data for each New Zealand Grid Exit Point for 2015 downloaded from the Electricity Authority EMI grid export database. We use this data to provide total electricity consumption levels against which we can compare our various residential consumption estimates.

NZ GREEN Grid Household Electricity Demand {#ggSample}

fix <- function(dt){
  dt[, r_dateTimeHalfHour := lubridate::as_datetime(r_dateTimeHalfHour, tz = "Pacific/Auckland")]
  dt[, date := lubridate::date(r_dateTimeHalfHour)]
  dt[, hms := hms::as_hms(r_dateTimeHalfHour)]
  dt <- GREENGridEECA::addNZSeason(dt, date = date)
  dt[, meankWh := (meanPowerW/1000)/2] # convert to kW & divide by 2 as half-hours
}
# Total load
hhTotalLoadDT <- fix(drake::readd(ggHHData))

annualKwhDT <- hhTotalLoadDT[, .(annualkWh = sum(meankWh),
                                 annualkW = sum(meanPowerW)/1000,
                                 nObs = .N), keyby = .(linkID)]


# HP
hhHeatPumpLoadDT <- fix(drake::readd(ggHPData))

This sample comprises 44 households who were recruited via the local electricity distribution businesses (EDBs) in two areas: Taranaki starting in May 2014 and Hawkes Bay starting in November 2014 [@GREENGridData].

Recruitment was via a non-random sampling method and a number of households were intentionally selected for their ‘complex’ electricity consumption (and embedded generation) patterns and appliances [@dianaThesis2015; @stephenson_smart_2017; @jack_minimal_2018; @suomalainen_detailed_2019].

The EDBs invited their own employees and those of other local companies to participate in the research and ~80 interested potential participants completed short or long forms of the Energy Cultures 2 household survey [@ec2Survey2015]. Households were then selected from this pool by the project team based on selection criteria relevant to the GREEN Grid project. These included:

having the majority of their energy supply from electricity (i.e. not gas heating);
presence of children;
family and dwelling size;
types of appliances owned.

As a result of this process the sample cannot be assumed to represent the population of customers (or employees) of any of the companies involved, nor the populations in each location or New Zealand dwellings as a whole [@stephenson_smart_2017].

Annual electricity consumption

Table \@ref(tab:ggAnnualkWh) shows the mean annualised net electricity consumption for all dwellings in the data by presence of photovoltaic panel (PV) inverter. For those that do not have PV the mean annual consumption of ~ 8,000 kWh is ~ 14% higher than the ~ 7,000 kWh reported for 2018 [@electricityInNZ_2018]. As we discuss below, this is likely to be due to the social composition of these dwelling.

kWhDT <- hhTotalLoadDT[, .(sumkWh = sum(meankWh), # add up energy consumption
                           nDays = uniqueN(date)), keyby = linkID]
kWhDT[, annualkWh := 365*(sumkWh/nDays)] # divide by number of days of data and * by 365 to give anualised
setkey(kWhDT, linkID)

setkey(hhAttributesDT, linkID)
kWhDT <- hhAttributesDT[kWhDT]

t <- kWhDT[, .(min = min(annualkWh),
               mean = mean(annualkWh),
               max = max(annualkWh),
               sd = sd(annualkWh),
               nHouseholds = .N), keyby = `PV Inverter`]

# lower = estimate - (s.e.*1.96) 
# upper = estimate + (s.e.*1.96)

# plotDT <- plotDT[, ci_upper := meanWh + qnorm(0.975)*(sdWh/sqrt(nObs))]

t[, ci_lower := mean - (sd/sqrt(nHouseholds))*qnorm(0.975)]
t[, ci_upper := mean + (sd/sqrt(nHouseholds))*qnorm(0.975)]

kableExtra::kable(caption = "Summary statistics for annualised net electricity consumption by presence of PV Inverter (NA indicates unknown)", t, digits = 2) %>%
  kable_styling()

dodge <- position_dodge(width=0.9)
# https://ggplot2.tidyverse.org/reference/geom_linerange.html
p <- ggplot2::ggplot(t, aes(x = `PV Inverter` , y = mean, fill = `PV Inverter`)) +
  geom_col(position = "dodge") +
  scale_fill_discrete(name = "Presence of PV inverter") +
  geom_errorbar(aes(ymin = ci_lower, ymax = ci_upper), position = dodge, width = 0.25) +
  theme(legend.position = "bottom") +
  labs(x = "Presence of PV inverter", y = "Mean annualised kWh",
       caption = "Error bars = 95% CI, \nTranspower 2018 estimate shown as black horizontal dashed line")

p + geom_hline(yintercept = 7000, color = "black", 
               linetype = "dashed",
               size = 1, alpha = 0.5)

Figure \@ref(fig:ggAnnualkWh) plots these values and the width of the 95% confidence intervals illustrates the degree of uncertainty in the mean estimates which is largely driven by the small sample sizes reported in Table \@ref(tab:ggAnnualkWh) together with inter-household heterogeneity. It is worth noting however that the ~ 7,000 kWh reported for 2018 [@electricityInNZ_2018] lies just within the lower 95% confidence interval for the dwellings with no PV inverter.

Heating {#ggHeating}

As an example of the sample bias, Table \@ref(tab:ggHeatSource) shows the main heat source used for space heating in the GREEN Grid sample. As we can see heat pumps were the main heat source in 55% of dwellings and wood burners in 25% with very little use of gas in any form and no record of coal use.

t <- with(hhAttributesDT, table(Appliance = Q20_coded, useNA = "always"))
pc_t <- 100*prop.table(t)
pc_dt <- setnames(as.data.table(pc_t), "N", "%")
setkey(pc_dt, Appliance)
dtf <- as.data.table(t)
setkey(dtf, Appliance)
dt <- dtf[pc_dt]
kableExtra::kable(caption = paste0("Main method of heating, n dwellings = ", sum(dt$N)), 
                   dt, digits = 2) %>%
  kable_styling()

For comparison, Table \@ref(tab:compareHeatSourceHCS) and Figure \@ref(fig:compareHeatSourceHCS) compare the 'main method of heating' for the GREEN Grid sample with the results from the BRANZ Housing Conditions Survey 2015 [@vicki_white_warm_2017] where comparable. As we can see, the GREEN Grid sample has a higher incidence of heat pumps compared to the HCS 2015 estimate of 40% heat pumps for owner occupied dwellings (the most similar to the GREEN Grid sample) and 25% for rentals. On the other hand, only 25% of the GREEN Grid sample used enclosed wood burners as a main method of heating compared to HCS 2015 estimates of 39% and 23% for owner-occupied and rental dwellings respectively. Relatively few GREEN Grid dwellings used portable electric and none used portable gas heaters compared to the HCS 2015 sample. These results suggest that the pattern of electricity use for space heating in the GREEN Grid sample is unlikely to be representative of all New Zealand dwellings.

t <- with(hhAttributesDT, table(Q20_coded, useNA = "always"))
p_t <- prop.table(t)
p_dt <- setnames(as.data.table(p_t), "N", "p")
setkey(p_dt, Q20_coded)
dtf <- as.data.table(t)
setkey(dtf, Q20_coded)
dt <- dtf[p_dt]
# kableExtra::kable(caption = paste0("Main method of heating, n dwellings = ", sum(dt$N)), 
#                   dt, digits = 2) %>%
#   kable_styling()

# compare to HCS 2015

setnames(dt, "Q20_coded", "Appliance")
dt[,type := "GREEN Grid sample"]
dt[, ci_95pc := 100 * (qnorm(0.975) * (sqrt(p*(1-p)/N)))]
dt[, pc:= 100*p]

# keep comparables for clarity
r_pc_dt <- dt[Appliance == "Heat pump" |
        Appliance == "Enclosed wood burner" |
        Appliance == "Portable electric heaters" |
        Appliance == "Portable gas heaters", .(Appliance, pc, type, ci_95pc)]

combined_dt <- rbind(r_pc_dt,hcs2015DT)
combined_dt[, ci_lower := pc - ci_95pc]
combined_dt[, ci_upper := pc + ci_95pc]

kableExtra::kable(combined_dt[, .(sample = type, Appliance, "%" = pc, 
                                  "Lower 95% CI" = ci_lower,
                                  "Upper 95% CI" = ci_upper)], digits = 1, caption = "Main heat source") %>%
  kable_styling()

dodge <- position_dodge(width=0.9)
# https://ggplot2.tidyverse.org/reference/geom_linerange.html
ggplot2::ggplot(combined_dt, aes(x = Appliance , y = pc, fill = type)) +
  geom_col(position = "dodge") +
  scale_fill_discrete(name = "Sample:") +
  geom_errorbar(aes(ymin = pc-ci_95pc, ymax = pc+ci_95pc), position = dodge, width = 0.25) +
  coord_flip() +
  theme(legend.position = "bottom") +
  labs(x = "Appliance", y = "%",
       caption = "Non-comparable appliances excluded, within-sample totals will not sum to 100%\nError bars = 95% CI for HCS 2015")

Insert footnote re HCS 2015 confidence intervals of +/-6.1% for owner-occupier homes and +/-10.8% for rentals at the 95% level [@vicki_white_warm_2017, p51]. This means that if we repeated the HCS 100 times we would expect that in 95 of them the % of owner occupier dwellings with heat pumps would be between 33.9% and 46.1% and the figure for rentals would be 14.2% to 35.8%. Clearly the small sample size of the HCS 2015 leads to substantial uncertainty in the estimates of heat pump prevalance (and all other estimates) in these groups. For a more detailed discussion of this problem see [@anderson_ensuring_2020].

Further, Table \@ref(tab:compareHeatSource) compares the 'Fuel type used to heat dwelling' for Census 2013 and the 'methods available for heating your house' in the GREEN Grid sample. Note that multiple responses were possible in both data sources and the responses have been coded to be as comparable as possible.

# census
ct <- ipfCensusDT[, .(Electricity = sum(heatSourceElectricity, na.rm = TRUE),
                     Gas = sum(heatSourceGas,na.rm = TRUE),
                     Wood = sum(heatSourceWood,na.rm = TRUE),
                     Coal = sum(heatSourceCoal,na.rm = TRUE),
                     Other = sum(heatSourceOther,na.rm = TRUE)
                     )
                  ]
pc_ct <- 100*prop.table(ct)

# get census household counts per type
crt <- ipfCensusDT[, .(ElectricityT = sum(heatSourceElectricity, na.rm = TRUE),
                       GasT = sum(heatSourceGas,na.rm = TRUE),
                       WoodT = sum(heatSourceWood,na.rm = TRUE),
                       CoalT = sum(heatSourceCoal,na.rm = TRUE),
                       OtherT = sum(heatSourceOther,na.rm = TRUE),
                       Total = sum(fuel_totalHouseholds, na.rm = TRUE )
                       ), 
                   keyby = .(Region = REGC2013_label)
                   ]
# % of total (will sum to > 100 as multiple responses possible)
crt_pc <- crt[, .(Region,
                  Electricity = 100*(ElectricityT/Total),
                     Gas = 100*(GasT/Total),
                     Wood = 100*(WoodT/Total),
                     Coal = 100*(CoalT/Total),
                     Other = 100*(OtherT/Total)
                  )
              ]
# get survey counts
hhAttributesDT[, Electricity := ifelse(Q19_1 == 1 | # Heat pump
                                         Q19_2 == 1 | # Night store
                                         Q19_3 == 1 | # Portable elec
                                         Q19_4 == 1 | # Fixed elec
                                         Q19_10 == 1 | # elec CH
                                         Q19_10.1 == 1 | # elec DVS - typo?
                                         Q19_12 == 1 | # elec HRV
                                         Q19_13 == 1 ,  # elec UFH
                                       1, 0)]
hhAttributesDT[, Gas := ifelse(Q19_7 == 1 | # Portable gas heater
                                         Q19_8 == 1 | # Fixed gas heater
                                         Q19_9 == 1 | # Gas CH
                                         Q19_14 == 1 , # Gas UFH
                                       1, 0)]
hhAttributesDT[, Wood := ifelse(Q19_6 == 1 | # wood burner
                                         Q19_17 == 1,  # pellets
                                       1, 0)]
hhAttributesDT[, Coal := ifelse(Q19_5 == 1 , # coal burner
                                       1, 0)]
hhAttributesDT[, Other := ifelse(Q19_15 == 1 | # open fire
                                         Q19_16 == 1 , # Fixed elec
                                       1, 0)]

ht <- hhAttributesDT[, .(Electricity = sum(Electricity, na.rm = TRUE),
                       Gas = sum(Gas,na.rm = TRUE),
                       Wood = sum(Wood,na.rm = TRUE),
                       Coal = sum(Coal,na.rm = TRUE),
                       Other = sum(Other,na.rm = TRUE)
                       )
                   ]

# ht <- ipfSurveyDT[, .(Electricity = sum(heatSourceElectricity, na.rm = TRUE),
#                      Gas = sum(heatSourceGas,na.rm = TRUE),
#                      Wood = sum(heatSourceWood,na.rm = TRUE),
#                      Coal = sum(heatSourceCoal,na.rm = TRUE),
#                      Other = sum(heatSourceOther,na.rm = TRUE)
#                      )
#                   ]

# % of total (will sum to > 100 as multiple responses possible)
pc_ht <- 100*prop.table(ht)
pc_ct$source <- "Census 2013 (all regions)"
pc_ht$source <-"GREEN Grid sample"
setnames(crt_pc, "Region", "source") # fix var names
extract <- crt_pc[source  %like% "Taranaki" | source %like% "Hawke", 
                  .(Electricity = mean(Electricity),
                    Gas = mean(Gas),
                    Coal = mean(Coal),
                    Wood = mean(Wood),
                    Other = mean(Other),
                    source = "Census 2013 (Taranaki & Hawke's Bay only)"
                    )]
t <- rbind(pc_ct, # total census data
           extract, # regions
           pc_ht) # GG

kableExtra::kable(t, digits = 2, 
                  caption = "Comparing heating fuel used/methods available (% of dwellings)") %>%
  kable_styling() %>%
  footnote(general = "Row totals may sum to more than 100% due to multiple responses\nOpen fire is coded as 'Other' in GREEN Grid - it is not clear how this is coded in the Census data")

Table \@ref(tab:compareHeatSource) shows that the GREEN Grid sample has a similar 'available fuel' profile to New Zealand dwellings as a whole but is not representative of the combined regions of Taranaki and Hawke's Bay from which it was drawn. In particular there are lower rates of Gas and Wood and much higher 'Other' although as noted the latter may be a coding artefact.

Number of rooms

Table \@ref(tab:nRooms) shows the distribution of the number of rooms per household and shows that the GREEN Grid sample has fewer smaller dwellings and more larger dwellings than both New Zealand as a whole and the combined regions from which the sample is drawn.

# census
ct <- ipfCensusDT[, .(nRooms1_4 = sum(nRooms1_4, na.rm = TRUE),
                     nRooms5_6 = sum(nRooms5_6,na.rm = TRUE),
                     nRooms7m = sum(nRooms7m,na.rm = TRUE)
                     )
                  ]
pc_ct <- 100*prop.table(ct)

# get census household counts per type
crt <- ipfCensusDT[, .(nRooms1_4t = sum(nRooms1_4, na.rm = TRUE),
                       nRooms5_6t = sum(nRooms5_6,na.rm = TRUE),
                       nRooms7mt = sum(nRooms7m,na.rm = TRUE),
                       Total = sum(nrooms_totalHouseholds, na.rm = TRUE )
                       ), 
                   keyby = .(Region = REGC2013_label)
                   ]
# % of total (will sum to > 100 as multiple responses possible)
crt_pc <- crt[, .(Region,
                  nRooms1_4 = 100*(nRooms1_4t/Total),
                     nRooms5_6 = 100*(nRooms5_6t/Total),
                     nRooms7m = 100*(nRooms7mt/Total)
                  )
              ]
# get survey counts
ht <- ipfSurveyDT[, .(nRooms1_4 = sum(nRooms1_4, na.rm = TRUE),
                       nRooms5_6 = sum(nRooms5_6,na.rm = TRUE),
                       nRooms7m = sum(nRooms7m,na.rm = TRUE)
                       )
                   ]

# % of total (will sum to > 100 as multiple responses possible)
pc_ht <- 100*prop.table(ht)
pc_ct$source <- "Census 2013 (all regions)"
pc_ht$source <-"GREEN Grid sample"
setnames(crt_pc, "Region", "source") # fix var names
extract <- crt_pc[source  %like% "Taranaki" | source %like% "Hawke", 
                  .(nRooms1_4 = mean(nRooms1_4),
                    nRooms5_6 = mean(nRooms5_6),
                    nRooms7m = mean(nRooms7m),
                    source = "Census 2013 (Taranaki & Hawke's Bay only)"
                    )]
t <- rbind(pc_ct, # total census data
           extract, # regions
           pc_ht) # GG

kableExtra::kable(t, digits = 2, 
                  caption = "Comparing number of rooms per dwelling (% of dwellings)") %>%
  kable_styling()

Number of people

Table \@ref(tab:nPeople) shows the distribution of number of people per household and shows that the GREEN Grid sample has far lower proportions of single or 2 person households than either the 2013 Census as a whole or the two regions but a much higher proportion of larger households. Since electricity demand is well known to correlate with the number of occupants [@huebner_explaining_2016] we would therefore expect average per dwelling/household GREEN Grid sample demand (or consumption) values to be overestimates of the true national values.

# census
ct <- ipfCensusDT[, .(nPeople_1 = sum(nPeople_1, na.rm = TRUE),
                     nPeople_2 = sum(nPeople_2,na.rm = TRUE),
                     nPeople_3 = sum(nPeople_3,na.rm = TRUE),
                     nPeople_4m = sum(nPeople_4m,na.rm = TRUE)
                     )
                  ]
pc_ct <- 100*prop.table(ct)

# get census household counts per type
crt <- ipfCensusDT[, .(nPeople_1 = sum(nPeople_1, na.rm = TRUE),
                     nPeople_2 = sum(nPeople_2,na.rm = TRUE),
                     nPeople_3 = sum(nPeople_3,na.rm = TRUE),
                     nPeople_4m = sum(nPeople_4m,na.rm = TRUE),
                     Total = sum(npeople_totalHouseholds, na.rm = TRUE )
                       ), 
                   keyby = .(Region = REGC2013_label)
                   ]
# % of total (will sum to > 100 as multiple responses possible)
crt_pc <- crt[, .(Region,
                  nPeople_1 = 100*(nPeople_1/Total),
                     nPeople_2 = 100*(nPeople_2/Total),
                     nPeople_3 = 100*(nPeople_3/Total),
                  nPeople_4m = 100*(nPeople_4m/Total)
                  )
              ]
# get survey counts
ht <- ipfSurveyDT[, .(nPeople_1 = sum(nPeople_1, na.rm = TRUE),
                     nPeople_2 = sum(nPeople_2,na.rm = TRUE),
                     nPeople_3 = sum(nPeople_3,na.rm = TRUE),
                     nPeople_4m = sum(nPeople_4m,na.rm = TRUE)
                       )
                   ]

# % of total (will sum to > 100 as multiple responses possible)
pc_ht <- 100*prop.table(ht)
pc_ct$source <- "Census 2013 (all regions)"
pc_ht$source <-"GREEN Grid sample"
setnames(crt_pc, "Region", "source") # fix var names
extract <- crt_pc[source  %like% "Taranaki" | source %like% "Hawke", 
                  .(nPeople_1 = mean(nPeople_1),
                    nPeople_2 = mean(nPeople_2),
                    nPeople_3 = mean(nPeople_3),
                    nPeople_4m = mean(nPeople_4m),
                    source = "Census 2013 (Taranaki & Hawke's Bay only)"
                    )]
t <- rbind(pc_ct, # total census data
           extract, # regions
           pc_ht) # GG

kableExtra::kable(t, digits = 2, 
                  caption = "Comparing number of people per dwelling (% of dwellings)") %>%
  kable_styling()

Number of children

Finally, Table \@ref(tab:nKids) shows the distribution of 0 vs 1+ children per household. As expected, the GREEN grid sample has fewer households with no children and far more households with children than the population as a whole and this is also true for the two regions. Again, we would therefore expect average per dwelling/household GREEn Grid sample demand (or consumption) values to be differ substantially from the true values.

# census
ct <- ipfCensusDT[, .(nKids_0 = sum(nKids_0, na.rm = TRUE),
                     nKids_1m = sum(nKids_1m,na.rm = TRUE)
                     )
                  ]
pc_ct <- 100*prop.table(ct)

# get census household counts per type
crt <- ipfCensusDT[, .(nKids_0 = sum(nKids_0_families, na.rm = TRUE),
                     nKids_1m = sum(nKids_1m,na.rm = TRUE),
                     Total = sum(nkids_totalFamilies, na.rm = TRUE )
                       ), 
                   keyby = .(Region = REGC2013_label)
                   ]
# % of total (will sum to > 100 as multiple responses possible)
crt_pc <- crt[, .(Region,
                  nKids_0 = 100*(nKids_0/Total),
                     nKids_1m = 100*(nKids_1m/Total)
                  )
              ]
# get survey counts
ht <- ipfSurveyDT[, .(nKids_0 = sum(nKids_0, na.rm = TRUE),
                     nKids_1m = sum(nKids_1m,na.rm = TRUE)
                       )
                   ]

# % of total (will sum to > 100 as multiple responses possible)
pc_ht <- 100*prop.table(ht)
pc_ct$source <- "Census 2013 (all regions)"
pc_ht$source <-"GREEN Grid sample"
setnames(crt_pc, "Region", "source") # fix var names
extract <- crt_pc[source  %like% "Taranaki" | source %like% "Hawke", 
                  .(nKids_0 = mean(nKids_0),
                    nKids_1m = mean(nKids_1m),
                    source = "Census 2013 (Taranaki & Hawke's Bay only)"
                    )]
t <- rbind(pc_ct, # total census data
           extract, # regions
           pc_ht) # GG

kableExtra::kable(t, digits = 2, 
                  caption = "Comparing number of households with 0 vs 1+ child per dwelling (% of dwellings)") %>%
  kable_styling()

Summary

In summary we conclude that the GREEN Grid sample comprises more larger dwellings with more heat pumps, more occupants and more children than we would expect for the population as a whole and indeed the regions from which they were drawn. As a result we cannot assume the sample to be representative of the New Zealand household population.

Scaling examples

Despite our concerns over the representativeness of the GREEN Grid sample it is still helpful to use it as a data source to explore a number of scaling approaches. These range from simple multiplication through proportional multiplication to complex re-weighting schemes.

Simple multiplication {#simpleMulti}

Under this method we assume the GREEN Grid sample to be representative and simply multiply an estimated mean (or total) household values derived from the GREEN Grid sample by the relevant national or regional household counts.

Total electricity consumption profiles - 2015 {#totalLoad2015}

As an example let us assume we wish to calculate half-hourly national total residential electricity consumption profiles by season for 2015 to estimate their contribution to overall electricity consumption in New Zealand. To do this, we use the half-hourly total kWh data produced in Part A and calculate mean energy consumption (kWh) per half-hour for each season in 2015.

We do this using all values even where negative as we assume that negative consumption is export to the grid by the few households with photovoltaic panels (see Table \@ref(tab:ggAnnualkWh)). Clearly we could:

remove these households although the negative power values are only likely to have an effect during the middle of the day (during off-peak periods) and mainly in summer;
convert the -ve PV circuit values to +ve demand and re-build the 'total household load' estimates to give 'gross load' prior to conducting these analyses.

For simplicity and to keep the work within resource scope we have left these approaches for future work.

# keep values <= 0 (see text)
totalLoadProfileDT <- hhTotalLoadDT[, .(meankWh = mean(meankWh),
                                        sdkWh = sd(meankWh), 
                                        nDwellings = uniqueN(linkID)), keyby = .(hms,season)]

totalLoadProfileDT[, ci_lower := meankWh - (sdkWh/sqrt(nDwellings))*qnorm(0.975)]
totalLoadProfileDT[, ci_upper := meankWh + (sdkWh/sqrt(nDwellings))*qnorm(0.975)]

ggplot(totalLoadProfileDT, aes(x = hms)) +
  #geom_line() +
  #geom_errorbar(aes(ymin = ci_lower, ymax = ci_upper)) +
  geom_ribbon(aes(ymin = ci_lower, ymax = ci_upper, fill = season)) +
  geom_line(aes(y = meankWh)) +
  scale_color_discrete(name = "Season") +
  facet_grid(season ~ .) +
  labs(y = "Mean kWh per half hour",
       x = "Time of Day",
       caption = paste0("Source: GREEN Grid sample \nN observed households: ", 
                        uniqueN(hhTotalLoadDT$linkID), "\nRibbons = 95% CI")
       )

winterMax <- round(max(totalLoadProfileDT[season == "Winter"]$meankWh),2)

Figure \@ref(fig:prepHHProfiles) shows the results of the initial sample-based profile calculation. The dark line is the calculated mean with no values excluded whilst the ribbons show the 95% confidence interval for the mean estimate. We estimate that the maximum mean kWh per half-hour in winter in 2015 was ~ r winterMax kWh. If we multiply this by the number of households estimated by the 2015 NZ Statistics household count projection (r tidyNum(nzHhPop2015)) then we get a rough estimate of maximum mean half-hourly residential kWh in winter of ~r round((winterMax * nzHhPop2015)/1000000,2) GWh.

Figure \@ref(fig:nationalProfiles2015) shows the simple upscaling of Figure \@ref(fig:prepHHProfiles) to the 2015 New Zealand household population by multiplying by the total number of households projected.

totalLoadProfileDT[, totalResMWh2015 := (meankWh * nzHhPop2015)/1000]
totalLoadProfileDT[, totalResMWh2015_upper := (ci_upper * nzHhPop2015)/1000]
totalLoadProfileDT[, totalResMWh2015_lower := (ci_lower * nzHhPop2015)/1000]

# test
#profileDT[, .(mean = mean(meankWh)), keyby = .(season)]

ggplot(totalLoadProfileDT, aes(x = hms)) +
  #geom_line() +
  #geom_errorbar(aes(ymin = ci_lower, ymax = ci_upper)) +
  geom_ribbon(aes(ymin = totalResMWh2015_lower, ymax = totalResMWh2015_upper, fill = season)) +
  geom_line(aes(y = totalResMWh2015)) +
  scale_color_discrete(name = "Season") +
  facet_grid(season ~ .) +
  labs(y = "Total mean MWh per half-hour",
       x = "Time of Day",
       caption = paste0("Observed households: ", uniqueN(hhTotalLoadDT$linkID),
                        "\nRescaled to New Zealand households (2015): ", GREENGridData::tidyNum(nzHhPop2015), "\nRibbons = 95% CI"
                        )
       )
lower <- round(totalLoadProfileDT[hms == hms::as_hms("18:00:00") & season == "Winter", totalResMWh2015_lower],-1)
mean <- round(totalLoadProfileDT[hms == hms::as_hms("18:00:00") & season == "Winter", totalResMWh2015],-1)
upper <- round(totalLoadProfileDT[hms == hms::as_hms("18:00:00") & season == "Winter", totalResMWh2015_upper],-1)

diff <- mean - lower
pc <- 100*(diff/mean)
# round to reduce spurious precision

As we can see, the uncertainty in the sample estimates are reflected in the upscaled estimates with, for example, the 95% CI for winter at 18:00 is +/- ~r round(pc,-1)% of the mean (r tidyNum(lower) to ~ r tidyNum(upper) MWh).

We can then compare the mean values with the overall electricity consumption captured by the 2015 Grid Exit Point (GXP) data used in Part B. For clarity we do not include the upweighted 95% CI.Figure \@ref(fig:compareGxp2015) shows the upweighted mean as a % of the total network electricity export by season for all regions in 2015. We can see that the estimated % contribution of residential dwellings to peak time consumption is ~ 40% in summer but close to 70% in winter with a substantial dip in summer reflecting the 'overweighting' of the sample PV dwellings.

In the absence of any other data sources we are unable to comment on the plausibility of these estimates but 70% is likely to be a considerable over-estimate given the non-representative nature of the GREEN Grid sample as reported in Section \@ref(ggSample) above. This approach illustrates the difficulty of producing robust national level estimates using simple multiplication of a small and biased sample.

myCaption <- paste0("Source: EA EMI Grid Exit Point (GXP) data",
                    ", ", min(gxpDataDT$date, na.rm = TRUE)
                    , " - ", max(gxpDataDT$date, na.rm = TRUE))

gxpDT <- gxpDataDT[year == 2015, .(sumkWh = sum(kWh), meankWh = mean(kWh), 
                        nGXPs = uniqueN(MXLOCATION), nDays = uniqueN(date)), 
                    keyby = .(hms, season)]



gxpDT[,totDailyMWh := (sumkWh/(nDays))/1000]
setkey(gxpDT, hms, season)
setkey(totalLoadProfileDT, hms, season)

plotDT <- totalLoadProfileDT[gxpDT]
plotDT[, pc := 100 * (totalResMWh2015/totDailyMWh)]

p <- ggplot2::ggplot(plotDT, aes(x = hms, y = pc, colour = season)) +
  geom_line() +
  scale_color_discrete(name = "Season") +
  labs(y = "%",
       x = "Time of Day",
       caption = myCaption
       )

gxpReducedDT <- gxpDataDT[year == 2015 & description != "Tiwai", 
                          .(sumkWh = sum(kWh), meankWh = mean(kWh), 
                        nGXPs = uniqueN(MXLOCATION), nDays = uniqueN(date)), 
                    keyby = .(hms, season)]

gxpReducedDT[,totDailyMWh := (sumkWh/(nDays))/1000]
setkey(gxpReducedDT, hms, season)
setkey(totalLoadProfileDT, hms, season)

plotDT <- totalLoadProfileDT[gxpReducedDT]
plotDT[, pc := 100 * (totalResMWh2015/totDailyMWh)]

p_red <- ggplot2::ggplot(plotDT, aes(x = hms, y = pc, colour = season)) +
  geom_line() +
  scale_color_discrete(name = "Season") +
  labs(y = "%",
       x = "Time of Day",
       caption = paste0(myCaption, "\nExcluded: Tiwai")
       )

# plot which?
p
#p_red

Clearly it would be possible to repeat this simple method using projected inter-Censual regional household counts for Taranaki and Hawke's Bay respectively although again, as shown in Section \@ref(ggSample), the GREEN Grid sample is not representative of these regions either.

Simple proportional scaling

In this method, rather than assume that the base (GREEN Grid) sample represents all New Zealand consumption profiles and mean values can simply be multiplied by the overall household count, we proportionately adjust the profiles using some other data source to attempt to correct for sample bias. We provide two examples of this below but for simplicity do not include confidence intervals in the presentation of results.

Example: Adjusting total consumption for different available heat sources

Table \@ref(tab:ggHeatSource) showed that there were differences between the GREEN Grid sample and the overall household population in terms of main heat source used. We might expect main space heat source to make a difference to the load profile of the households. In principle we can use this information to proportionately adjust the estimates by calculating overall kWh profiles for each main heat type in the GREEN Grid sample. We would then multiply these different profiles by the number of each household type we would expect in the population calculated by combining Census household counts with HCS 2015 estimates of heat source prevalence (see Table \@ref(tab:ggHeatSource)).

Unfortunately, as Table \@ref(tab:ggHeatSource) showed, the GREEN Grid sample has too few cases in all but the heat pump category to make this feasible but a larger sample, even if not representative, could be used for this purpose provided all relevant categories are present.

Example: Heat pump electricity consumption 2015

ggHP_pc <- combined_dt[type %like% "GREEN" & Appliance %like% "Heat"]$pc

Similarly, we know from the BRANZ HCS 2015 survey that 46% of owner-occupier dwellings have a heat pump compared to 25% for rentals but that in the 2015 GREEN Grid sample, r round(ggHP_pc,2)% have heat pumps. Were we to naively multiply the heat pump circuit profiles of the GREEN Grid dwellings which have heat pumps by the number of New Zealand households we would produce a 'r round(ggHP_pc,0)% heat pump uptake' estimate. Clearly this may be useful for scenario-based futures analysis but to produce estimates of current heat-pump derived consumption we would need to adjust for the uptake rate of heat pumps.

As in the previous example, in principle we should estimate heat pump electricity consumption profiles for the owner-occupier and rental dwellings in the GREEN Grid sample to allow for systematic differences. We would then multiply these two profiles by the proportion of dwellings of each type in New Zealand. We can calculate the latter by multiplying household counts for owner-occupied & rental dwellings by the BRANZ percentage estimates. In doing so we would have to assume that the proportions of owner-occupied to rental dwellings had remained constant from 2013 (Census) to 2015 (HCS 2015).

Unfortunately, as Table \@ref(tab:ggTenure) shows, over 90% of GREEN Grid dwellings were owner-occupied and so the approach of calculating different consumption profiles for each tenure type is not feasible. Again, a larger sample would make this possible.

hhAttributesDT[, hasHeatPump := ifelse(Q20_coded == "Heat pump", "Yes", "No")]
t <- table(hhAttributesDT$Q12_coded, hhAttributesDT$hasHeatPump)

kableExtra::kable(100*prop.table(t), digits = 1,
                  caption = "% dwellings in GREEN Grid sample with/without heat pumps by tenure") %>%
  kable_styling()

Nevertheless, if we assume there are no systematic differences between the heat pump usage patterns by each tenure type we can continue with the example. To do this we first calculate the overall mean electricity consumption profiles for heat pump circuits based on the GREEN Grid data (Figure \@ref(fig:prepHpProfiles)).

profileHpDT <- hhHeatPumpLoadDT[, .(meankWh = mean(meankWh)), keyby = .(hms,season)]

# test
#profileDT[, .(mean = mean(meankWh)), keyby = .(season)]

ggplot(profileHpDT, aes(x = hms, y = meankWh, colour = season)) +
  geom_line() +
  scale_color_discrete(name = "Season") +
  theme(legend.position="bottom") +
  labs(y = "Mean kWh per half-hour",
       x = "Time of Day",
       caption = paste0("Source: GREEN Grid sample \nN observed households: ", 
                        uniqueN(hhHeatPumpLoadDT$linkID))
       )

winterHpMax <- round(max(profileHpDT[season == "Winter"]$meankWh),2)

Figure \@ref(fig:prepHpProfiles) shows the results of these initial sample-based profile calculations and we can clearly see the increased electricity consumption of heat pumps in winter with a hint of limited use for late afternoon cooling in summer.

prHouseholdsRent <- 453135/nzHhPop2015 # http://archive.stats.govt.nz/Census/2013-census/profile-and-summary-reports/quickstats-about-housing/households-who-rent.aspx
prHouseholdsOwn <- 0.648 # 64.8% http://archive.stats.govt.nz/Census/2013-census/profile-and-summary-reports/quickstats-about-housing/home-ownership-households.aspx
prRest <- 1-(prHouseholdsRent + prHouseholdsOwn)

nOwnHPmid <- (prHouseholdsOwn*nzHhPop2015) * 0.4 # HCS 2015
nRentHPmid <- (prHouseholdsRent*nzHhPop2015) * 0.25 # HCS 2015
nOtherHPmid <- (prRest*nzHhPop2015) * 0.4 # HCS 2015

The Census 2013 tells us that r round(100*prHouseholdsRent,1)% of households were rentals and r round(100*prHouseholdsOwn,1)% were owner occupied. In the absence of other data, we assume that the remaining r round(100*prRest,1)% have a similar heat pump distribution to owner-occupier homes. This means that if we apply the HCS 2015 prevalence rates then we have:

~r tidyNum(round(prHouseholdsOwn*nzHhPop2015,-2)) owner occupied homes of which ~r tidyNum(round(nOwnHPmid,-2)) have heat pumps
~r tidyNum(round(prHouseholdsRent*nzHhPop2015,-2)) rental homes of which ~r tidyNum(round(nRentHPmid,-2)) have heat pumps
~r tidyNum(round(prRest*nzHhPop2015,-2)) 'other tenure' homes of which ~r tidyNum(round(nOtherHPmid,-2)) have heat pumps

Note that whilst this may appear to provide the ability to model the heat (pump) demand of a large proportion of New Zealand dwellings, we should remember that the demand profiles are based on a very small and non-representative sample.

We then multiply each of these 'heat pump owning' household counts by the electricity consumption profiles shown in Figure \@ref(fig:prepHpProfiles). This means that we are applying identical heat pump demand profiles to all household types even though we might expect that there would be some differences. It would be possible to apply a distribution to these profiles to better account for inter-household heterogeneity or to use a microsimulation approach to do so (see Section \@ref(complexWeight)) but this is left to future work.

Figure \@ref(fig:nationalProfiles2015) shows the results of this simple upscaling and is a similar method to that described in [@dortans_estimating_2019]. Note that because we have applied identical heat pump demand profiles to all household types, the figure is effectively a multiplication upscaling of Figure \@ref(fig:prepHpProfiles). If we had been able to apply different profiles for each tenure type (or number of occupants) then this would not have been the case and the upscaled estimates may well differ in shape from the sample-level profiles shown in Figure \@ref(fig:prepHpProfiles).

profileHpDT[, totalHpMWhOwn2015 := (meankWh * nOwnHPmid)/1000]
profileHpDT[, totalHpMWhRent2015 := (meankWh * nRentHPmid)/1000]
profileHpDT[, totalHpMWhOther2015 := (meankWh * nOtherHPmid)/1000]
profileHpDT[, totalHpMWh2015 := totalHpMWhOwn2015 + totalHpMWhRent2015 + totalHpMWhOther2015]
# test
#profileDT[, .(mean = mean(meankWh)), keyby = .(season)]

ggplot(profileHpDT, aes(x = hms, y = totalHpMWh2015, colour = season)) +
  geom_line() +
  theme(legend.position="bottom") +
  scale_color_discrete(name = "Season") +
  labs(y = "MWh per half-hour",
       x = "Time of Day",
       caption = paste0("Observed households: ", uniqueN(hhHeatPumpLoadDT$linkID),
                        "\nRescaled to New Zealand households (2015): ", GREENGridData::tidyNum(nzHhPop2015)
                        )
       )

Figure \@ref(fig:compareHpGxp2015) shows these values as a % of the total network electricity export by season for all regions in 2015. We can see that the estimated % contribution of residential dwellings to peak time consumption is small - < 2% in summer but up to just over 7% in winter. In the absence of any other data sources we are unable to comment on the plausibility of these estimates.

myCaption <- paste0("Source: EA EMI Grid Exit Point (GXP) data 2015",
                    ", ", min(gxpDataDT$date, na.rm = TRUE)
                    , " - ", max(gxpDataDT$date, na.rm = TRUE))

setkey(profileHpDT, hms, season)

plotDT <- profileHpDT[gxpDT] # we created gxpDT earlier
plotDT[, pc := 100 * (totalHpMWh2015/totDailyMWh)]

ggplot(plotDT, aes(x = hms, y = pc, colour = season)) +
  geom_line() +
  theme(legend.position="bottom") +
  scale_color_discrete(name = "Season")+
  labs(y = "%",
       x = "Time of Day",
       caption = myCaption
       )

Summary

The previous sections have shown that it is possible to conduct simple multiplicative upscaling of sample datasets to produce national or regional estimates. However this is only likely to be robust if:

samples are large enough to ensure that unusual patterns of demand in particular households do not cause the estimates to be skewed;
samples are nationally or regionally representative.

In the examples provided we have shown how the small sample size also leads to extremely imprecise estimates of national demand which are unlikely to be sufficient for detailed policy or planning purposes.

Further they have shown that where samples are not representative, they can be proportionally upscaled where other sources of data are available. However this can only be done if sufficient dwellings in the under-represented groups exist in the sample and, as we have seen, this is not the case for the GREEN Grid sample.

In addition it should be clear that proportional upscaling as described above can only 'fix' one dimension at a time. As noted in Section \@ref(ggSample), the GREEN Grid sample is not representative at the regional and national levels on a number of dimensions. In this instance more complex re-weighting mechanisms which can adjust for several dimensions at once may be required.

Complex re-weighting {#complex}

A number of multi-variate weighting schemes have been developed which allow the multiple re-weighting of survey data along several dimensions. Whilst some use forms of regression modelling [@battaglia_sampling_2016], others use a conceptually more simple iterative reweighting process. This seeks to calculate weights for each household that, when summed, add up to totals derived from a trusted 'true' source such as a high quality national survey or census.

Of the range of methods available to do this, the iterative proportional fitting (IPF) method is well established and appears in a multitude of guises, from balancing factors in spatial interaction modelling through survey raking (re-weighting) [@beaujouan_reweighting_2011; @casati_synthetic_2015] to the RAS method in economic accounting, and its behaviour is relatively well known [@wong_reliability_1992; @simpson_combining_2005].

Further, the IPF approach is well suited to our current purpose where we wish to preserve household level heterogeneity in demand whilst re-weighting to match known Census household distributions at national, regional or small area levels [@simpson_combining_2005; @birkin_spatial_2011; @anderson_estimating_2012; @anderson_small_2019]. It is important to note that these calculations involve multiplying individual household kWh values by a set of household level weights and then summing them. This is the inverse of the simple multiplication approach which started from the whole-sample mean and multiplied by a single number (household count). As a result not only do complex re-weighting results tend to preserve inter-household heterogeneity but they allow household level scenarios (e.g. switching from one heat source to another) to be easily implemented for a given proportion of the sample.

In this section we demonstrate the value of this approach in re-weighting the GREEN Grid sample along several dimensions to produce estimates of total electricity consumption for:

New Zealand
Taranaki
Hawke's Bay
a set of Census 2013 area units in the Hawke's Bay region

This is done using a household weights dataset developed using the IPF method which reweight the GREEN Grid dwellings so that they fit the appropriate area level distributions of the following 'constraint' variables:

number of people (1,2,3,4+)
number of rooms (1-4, 5-6 or 7+)
number of children (0, 1+)

Whilst it would have been preferable to also re-weight by heat source/space heating method (see Section \@ref(ggHeating)), the GREEN Grid sample does not contain cases with mainly coal or gas heating that would make this possible. The weights were originally developed under an EU funded MSCA Fellowship (SPATIALEC) and have already been used to estimate household electricity consumption patterns for New Zealand Census 2013 area units [@anderson_small_2019].

It should be noted that, as reported in Section \@ref(ggSample), the GREEN Grid sample is sufficiently small and sufficiently different from the overall Census 2013 household proportions on these dimensions that a number of the households are likely to be allocated extremely large weights because we are, in effect, forcing ~ 40 households to represent r tidyNum(nzHhPop2013). As a result, unusual patterns of consumption or, as we will see, generation may well become inflated where the dwellings exhibiting them are disproportionately upweighted due to the rarity of cases. This is described in some detail below.

Finally, we do not attempt to calculate indicators of uncertainty for these estimates. Since the process effectively creates a synthetic Census population, confidence intervals based on the usual statistical sampling theories do not apply. The development of alternative 'credible intervals' is the subject of ongoing methodological research [@whitworth_estimating_2017; @tanton_spatial_2018] and a number of alternatives are emerging that could be implemented in future work.

Example: Total residential electricity consumption 2015 {#complexWeight}

National estimates

In this example we use the household weights previously developed using the NZ Census 2013, the GREEN Grid sample and the IPF method to reweight the GREEN Grid sample so that it fits to the national 2013 NZ census distributions of the variables listed above.

# sum the IPF weights per household at regional level
ipfRegionalWDT <- ipfWeightsDT[, .(ipfWeight = sum(ipfWeight)), keyby = .(REGC2013_label, linkID)]
ipfNationalWDT <- ipfWeightsDT[, .(ipfWeight = sum(ipfWeight)), keyby = .(linkID)]

t <- summary(ipfNationalWDT)

kableExtra::kable(t, caption = "Summary of household level IPF weights required to re-weight the GREEN Grid sample to the national household population", digits = 2) %>%
  kable_styling()

As \@ref(tab:ipfWeightsNational) shows, the household level weights required range from ~r tidyNum(round(min(ipfNationalWDT$ipfWeight),-2)) for relatively common combinations of the constraint variables to ~r tidyNum(round(max(ipfNationalWDT$ipfWeight),-2)) for less common combinations. Clearly, this is a direct consequence of the extremely small size of the GREEN Grid sample.

Figure \@ref(fig:ipfTotalLoadNational) shows the results of calculating weighted seasonal electricity residential demand profiles using these household level weights while Table \@ref(fig:compareMethods) and Figure \@ref(fig:compareMethods) compare the simple multiplicative results from Section \@ref(totalLoad2015) and the IPF-derived results.

powerProfileDT <- hhTotalLoadDT[, .(meankWh = mean(meankWh)), keyby = .(linkID, season, hms)]
setkey(ipfNationalWDT, linkID)
setkey(powerProfileDT, linkID)
weightedProfileDT <- powerProfileDT[ipfNationalWDT]
weightedProfileDT[, weightedMWh := (meankWh*ipfWeight)]
plotDT <- weightedProfileDT[, .(ipfTotalkWh = sum(weightedMWh)), keyby = .(hms, season)]

ggplot2::ggplot(plotDT, aes(x = hms, y = ipfTotalkWh/1000, # MWh
                            colour = season)) +
  geom_line() +
  theme(legend.position="bottom") +
  scale_color_discrete(name = "Season") +
  labs(x = "Time of Day",
       y = "Mean total MWh")

As we can see the IPF method produces lower overall consumption estimates and this appears to be especially the case during the day in summer. This is likely to be caused by the upweighting of the four PV-equipped dwellings which we note tend to have larger weights (see Table \@ref(fig:weightsByPV) in the Data Annex) although we note that the IPF results are generally lower at all times of day and in all seasons.

# merge with total load from above

totalLoadProfileDT[, source := "Simple multiplication"]
totalLoadProfileDT[, totalMWh := totalResMWh2015]
plotDT[, source := "IPF re-weighting"]
plotDT[, totalMWh := ipfTotalkWh/1000]

dt <- rbind(plotDT[, .(hms,season,source, totalMWh)],totalLoadProfileDT[, .(hms,season,source,totalMWh)])

ggplot2::ggplot(dt[!is.na(season)], aes(x = hms, y = totalMWh, colour = source)) + 
  geom_line() +
  scale_color_discrete(name = "Source") +
  theme(legend.position="bottom") +
  facet_wrap(. ~ season) +
  labs(y = "Mean MWh per half hour",
       x = "Time of day")

t <- dt[, .("Min MWh" = min(totalMWh, na.rm = TRUE),
            "Mean MWh" = mean(totalMWh, na.rm = TRUE),
            "Max MWh" = max(totalMWh, na.rm = TRUE)),
        keyby = .(source)
        ]
kableExtra::kable(t, caption = "Summary of national level half-hourly MWh estimates by source") %>%
  kable_styling()

dt <- ipfNationalWDT[hhAttributesDT]
t <- dt[, .(nHouseholds = sum(ipfWeight, na.rm = TRUE)), keyby = .(`PV Inverter`)]
kableExtra::kable(t, digits = -2, caption = "Weighted number of GREEN Grid households with PV inverter (IPF results, rounded)") %>%
  kable_styling()

nPVHouseholds <- t[`PV Inverter` == "Yes", nHouseholds]

As before, in the absence of other data sources it is not possible to comment on the plausibility of these estimates. However we note that the IPF process combined with the non-representative GREEN Grid sample has, in effect, created a scenario where ~r tidyNum(round(nPVHouseholds,-2)) have PV installed (Table \@ref(tab:pvWeights)). This is roughly 10 times the total estimated 2018 New Zealand installation level for both residential and commercial sites combined. As a result Figure \@ref(fig:ipfTotalLoadNational) and Figure \@ref(fig:compareMethods) illustrate one possible future residential electricity demand scenario for New Zealand should PV installation rates continue to rise as they have [@ford_emerging_2017].

Whilst it would be possible to compare the IPF results with the GXP grid export data (c.f. Section \@ref(totalLoad2015)), it does not seem appropriate to do so given the increase in PV generation apparently embedded within the IPF-based results. This would reduce the GXP level electricity flows due to increased local generation and thus reduce overall GXP level 'consumption' below that observed for 2015. Modelling the effect of such local embedded generation on GXP level flows based on upscaled samples such as GREEN grid could therefore be the subject of future work.

Regional estimates {#regional}

In this example we use the IPF method to reweight the GREEN Grid sample so that it fits to the 2013 NZ census distributions of the variables listed above at the regional levels. It is usually considered best practice to use sample households from a given region to represent that region in this process. However the very small number of households in the GREEN Grid sample mean this is not feasible and must be left to future work using larger regionally representative samples.

Whilst this process adjusts the GREEN Grid households to fit each region separately according to the dimensions listed above, it cannot account for the regional spatial distribution of different energy infrastructures (e.g. reticulated gas etc - see Section \@ref(ggSample)) because there are not enough (or no) cases in the GREEN Grid sample representing these types of households.

# sum the IPF weights per household at regional level
ipfRegionalWDT <- ipfWeightsDT[, .(ipfWeight = sum(ipfWeight)), keyby = .(REGC2013_label, linkID)]

t <- summary(ipfRegionalWDT)

kableExtra::kable(t, caption = "Summary of household level IPF weights required to re-weight the GREEN Grid sample to the regional household populations", digits = 2) %>%
  kable_styling()

As \@ref(tab:ipfWeightsRegional) shows, in this case the household level weights required range from ~r round(min(ipfRegionalWDT$ipfWeight),-2) to ~r tidyNum(round(max(ipfRegionalWDT$ipfWeight),-2)).

# sum the IPF weights per household at regional level
setkey(ipfRegionalWDT, linkID)
setkey(powerProfileDT, linkID)
weightedProfileRegionalDT <- powerProfileDT[ipfRegionalWDT, allow.cartesian = TRUE]
weightedProfileRegionalDT[, weightedkWh := meankWh*ipfWeight]
plotDT <- weightedProfileRegionalDT[, .(totalkWh = sum(weightedkWh)), keyby = .(hms, season, REGC2013_label)]

# select the regions we are interested in
ggplot2::ggplot(plotDT[!is.na(season) & 
                         !is.na(REGC2013_label) & 
                         (REGC2013_label %like% "Hawke" |
                         REGC2013_label %like% "Taranaki" |
                         REGC2013_label %like% "Auckland" |
                         REGC2013_label %like% "Wellington")], 
                aes(x = hms, y = totalkWh/1000, # MWh
                    colour = season)) +
  geom_line() +
  theme(legend.position="bottom") +
  facet_wrap(. ~ REGC2013_label) +
  scale_color_discrete(name = "Season") +
  labs(y = "Mean total MWh per half hour", x = "Time of day")

Figure \@ref(fig:ipfTotalLoadRegional) shows the resulting total residential electricity consumption profiles by season for Hawke's Bay and Taranaki (the regions from which the GREEN Grid sample was drawn) and also for Wellington and Auckland for comparison. Clearly levels of consumption are driven by the population sizes and the shapes of the curves are similar largely due to:

our inability to adjust for different levels of heating and other appliances in each region;
the need to use households from the Taranaki and Hawke's Bay regions (GREEN Grid sample) to represent households from Auckland and Wellington who may experience very different environmental conditions or have very different housing stock performance.

Again we have no data that can be used to validate these results given the apparent increase in embedded generation that is included.

Census area unit estimates

Finally, in this section we use the same method to calculate total residential electricity consumption profiles by season for a subset of Hawke's Bay Census area units. Area units are the second smallest New Zealand census output zones with a mean household count of ~ 750 (See Table \@ref(tab:auCountsByregion)).

t <- ipfCensusDT[, .(nHouseholds = sum(npeople_totalHouseholds, na.rm = TRUE),
                 nAUs = .N,
                 meanHouseholdsPerAU = mean(npeople_totalHouseholds, na.rm = TRUE)), keyby = .(REGC2013_label)]

kableExtra::kable(t[order(-nHouseholds)], digits = 1, caption = "Number of households and number of area units by region (Census 2013)") %>%
  kable_styling()

As \@ref(tab:hawkesBayAU) shows, in the case of Hawke's Bay, the weights have smaller values as we are using the ~40 GREEN Grid households to represent on average ~ 680 households (see Table \@ref(tab:auCountsByregion)).

hbWeightsDT <- ipfWeightsDT[REGC2013_label %like% "Hawke"]
setkey(hbWeightsDT, linkID)
setkey(powerProfileDT, linkID)

dt <- powerProfileDT[hbWeightsDT, allow.cartesian = TRUE]

dt$nMBs <- NULL

t <- summary(dt[,.(linkID, meankWh, ipfWeight)])

kableExtra::kable(t, caption = "Summary of linked GREEN Grid household consumption data and area unit level weights") %>%
  kable_styling()

Rather than show the results for all area units in Hawke's Bay we select the two which have the highest and lowest proportions of 4+ occupants (see Table \@ref(tab:auMWh)). This should enable us to see differences in are unit level profiles which correlate with different area level household composition.

Figure \@ref(fig:auMWh) shows the total MWh seasonal residential electricity consumption profiles for these two areas and we can clearly see that they are both downscaled versions of the regional profiles (Figure \@ref(fig:ipfTotalLoadRegional)). We can also see that as we would expect, the area unit with the lowest proportion of 4+ occupants (and the lowest total number of households) has the lowest consumption profile.

dt <- dt[, weightedkWh := meankWh * ipfWeight]

auProfilesDT <- dt[!is.na(season), .(totalkWh = sum(weightedkWh, na.rm = TRUE)), keyby = .(hms, season, AU2013_code, AU2013_label)]

ipfCensusDT[, pc_nPeople_4m := 100*(nPeople_4m/npeople_totalHouseholds)]
top <- head(ipfCensusDT[REGC2013_label %like% "Hawke" & npeople_totalHouseholds > 500, 
                        .(AU2013_label, nPeople_1, nPeople_2, nPeople_3, nPeople_4m, npeople_totalHouseholds, pc_nPeople_4m)][order(-pc_nPeople_4m)],1)
bottom <- head(ipfCensusDT[REGC2013_label %like% "Hawke" & npeople_totalHouseholds > 500, 
                           .(AU2013_label, nPeople_1, nPeople_2, nPeople_3, nPeople_4m, npeople_totalHouseholds, pc_nPeople_4m)][order(pc_nPeople_4m)],1)

selected <- rbind(top, bottom)

kableExtra::kable(selected, digits = 1, caption = "Selected area units with highest and lowest % of large (4+ person) households (areas with more than 500 households)") %>%
  kable_styling()

setkey(selected, AU2013_label)
setkey(auProfilesDT, AU2013_label)

plotDT <- auProfilesDT[selected]

ggplot2::ggplot(plotDT, aes(x = hms, y = totalkWh/1000, # MWh
                            colour = AU2013_label)) +
  geom_line() +
  theme(legend.position="bottom") +
  facet_wrap(. ~ season) +
  scale_color_discrete(name = "Area unit:") +
  labs(x = "Time of Day",
       y = "Mean total MWh per half hour", x = "Time of day")

As before, it should be noted that this approach is applying the electricity consumption profiles of the GREEN Grid sample to these area units and adjusting only for the census constraints chosen (number of children, rooms and people). It is not adjusting for local area variation in energy infrastructures, micro-climate or any other factor that is not correlated with these constraints. Given this and also the earlier discussion of the prevalence of PV generation in this sample, the results should be viewed with extreme caution.

Clearly this method could be used to produce similar profiles for each area unit (or even mesh block) in New Zealand but this would mean assuming that the GREEN Grid electricity consumption profiles can be unproblematically applied to area units in Northland, Southland and all places in between. Given the known spatial heterogeneity of heat sources, winter temperatures and day length (for example) this assumption appears implausible.

On the other hand the method and the GREEN grid sample can be used to develop area unit level profiles for the regions from which the sample is drawn, although even in this case the non-representative and extremely small sample means the results could not be considered robust [@anderson_small_2019].

Summary

Overall we have shown that it is possible to use complex multi-variate re-weighting to upscale small samples to produce national, regional or local electricity demand profiles. However, as in previous sections the robustness of the results depends entirely on the representativeness of the sample and its ability to include (and represent) sub-groups of households of interest - such as those with different main heat sources. In addition the methods require that relevant 'constraints' are available from Census or similar data. While this is true of 'available heating fuel' in the 2013 Census, the inclusion of information on heating appliances in Census 2018 means that it may be possible to produce more robust local demand profiles in future.

Overall summary and recommendations

This report has outlined a number of methods that could be used to upscale the GREEN Grid (or similar) household energy demand datasets to develop national or regional level residential demand or consumption estimates. However we have noted that in all cases the GREEN grid data is insufficient for this purpose due to:

its small size - meaning very small numbers of households would be 'representing' very large numbers in the true population with a consequential risk of both skew and a lack of precision;
its recruitment bias - meaning that some household types and some crucial energy source types (e.g. coal, wood-only, reticulated gas) are not represented at all and so cannot be 're-weighted'.

We have demonstrated this through a number of worked examples and suggest that future work could apply these methods to more suitable larger scale data sources such as:

the BRANZ Housing Conditions Survey datasets - although detailed energy monitoring was not included;
the recent NZ Stats/MBIE Housing Quality Indicator survey - although detailed energy monitoring was not included;
the potential future population representative BRANZ Household Energy End-Use (HEEP2) study;
representative smart meter datasets linked to household attributes;
imputed load profiles generated using population representative New Zealand Time Use diary data (see [@mckenna_high-resolution_2016;@mckenna_simulating_2017] for a UK example)

Would these samples be large enough for these purposes? Unfortunately there is no simple answer to this question since it is likely that the size of the chosen sample will depend on the precision of the estimates required. For example, in Section \@ref(ggSample) we noted that the BRANZ Housing Condition Survey 2015 sample size of 560 dwellings enabled a precision (95% confidence interval) of +/- 5% for overall sample statistics but +/- 6% and +/- 11% for the smaller owned and rental subgroups [@vicki_white_warm_2017]. As shown in Figure \@ref(fig:compareHeatSourceHCS) this means that the percentage of owner-occupied dwellings with heat pumps (for example) can only be estimated to lie between 34% and 46% (lower and upper 95% intervals). If greater precision is required at the whole sample or sub-group levels then larger samples will be needed and there are a number of recent papers which can provide guidance in this regard [@sovacool_promoting_2018;@anderson_ensuring_2020].

However, in the case of multivariate upscaling (Section \@ref(complex)) it is clear that a representative sample would also need to be large enough to include sufficient numbers of households to represent key sub-groups of interest (i.e. reweighting constraints) at the required level of precision. In the work reported above this is evident in the inability to re-weight the GREEN Grid sample to fit national, regional or local distributions of space heating appliance or even main heating fuel source due to an absence of representative cases. As noted, no method can re-weight households types that are absent from the sample...

Nevertheless, we have also demonstrated that small and biased samples such as GREEN Grid can be used to create scenarios of future demand at national, regional or small area (neighbourhood) levels:

where the sample comprises a set of households specifically chosen to represent 'the future', methods such as IPF can be used to develop plausible estimates of their future demand profiles;
where it does not, specific household attributes can be changed in specific households in the sample and the consequences upscaled to local, regional or national estimates - an approach known as spatial microsimulation [@birkin_spatial_2011].

We would suggest that modelling a range of future scenarios using the GREEN Grid sample could take this approach and we have noted a number of potential avenues including the effect of large-scale PV installation on regional (or local area) electricity flows. This work could include adjusting the total load estimates to reflect gross rather than net load for those households with PV (see \@ref(totalLoad2015)) and the use of regionally representative households to ensure geographically accurate generation and consumption patterns (see \@ref(regional)).

Other possible avenues could include the effects of wide-scale EV uptake or energy efficiency interventions such as LED light-bulbs (building on [@dortans_estimating_2019]) or transition from radiant electric heating to heat pumps. Further, with the potential release of updated Census 2018 data on heating appliances, it may also be possible to produce substantially more precise (and robust) estimates of regional and local demand profiles for electric heating of various kinds. As part of this, it would also be possible to introduce microsimulation or statistical distribution-based methods to model inter-household heterogeneity and to implement emerging 'credible intervals' to give indicators of regional uncertainty in the estimates.

Data Annex

Census data summary

Descriptive statistics for census data:

if(doSkim){skimr::skim(ipfCensusDT)}

GREEN Grid half-hourly total load summary

Descriptive statistics for dwelling half hourly overall kWh data:

if(doSkim){skimr::skim(hhTotalLoadDT)}

Descriptive statistics for dwelling half hourly heat pump kWh data:

if(doSkim){skimr::skim(hhHeatPumpLoadDT)}

GREEN Grid household data summary

Descriptive statistics for household data:

if(doSkim){skimr::skim(hhAttributesDT)}

GREEN Grid/SPATIALEC ipf household weights

Descriptive statistics for ipf-derived weights:

if(doSkim){skimr::skim(ipfWeightsDT)}

dt <- ipfNationalWDT[hhAttributesDT]

t <- dt[, .(linkID, ipfWeight, `PV Inverter`, notes)][order(-ipfWeight)]

kableExtra::kable(t, caption = "Household level ipfWeights for national rescaling with example household attributes (oredered by ipfWeight)")

ggplot2::ggplot(dt, aes(x = `PV Inverter`, y = log(ipfWeight))) +
  geom_boxplot()

dt[, Electricity := ifelse(is.na(Electricity), "No", "Yes")]
ggplot2::ggplot(dt, aes(x = Electricity, y = log(ipfWeight))) +
  geom_boxplot()

dt[, Wood := ifelse(is.na(Wood), "No", "Yes")]
ggplot2::ggplot(dt, aes(x = Wood, y = log(ipfWeight))) +
  geom_boxplot()

Runtime

t <- proc.time() - startTime
elapsed <- t[[3]]

Analysis completed in r round(elapsed,2) seconds ( r round(elapsed/60,2) minutes) using knitr in RStudio with r R.version.string running on r R.version$platform.

R environment

R packages used

base R [@baseR]
bookdown [@bookdown]
data.table [@data.table]
ggplot2 [@ggplot2]
kableExtra [@kableExtra]
knitr [@knitr]
lubridate [@lubridate]
rmarkdown [@rmarkdown]

Session info

sessionInfo()

References

CfSOtago/GREENGridEECA documentation built on May 1, 2022, 8:08 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

CfSOtago/GREENGridEECA Supports Analysis of NZ GREEN Grid data for EECA

In CfSOtago/GREENGridEECA: Supports Analysis of NZ GREEN Grid data for EECA

About

Citation

Report circulation

Code

License {#license}

History

Support

Introduction

Data

Census 2013

Grid exit point kWh export (GXP)

NZ GREEN Grid Household Electricity Demand {#ggSample}

Annual electricity consumption

Heating {#ggHeating}

Number of rooms

Number of people

Number of children

Summary

Scaling examples

Simple multiplication {#simpleMulti}

Total electricity consumption profiles - 2015 {#totalLoad2015}

Simple proportional scaling

Example: Adjusting total consumption for different available heat sources

Example: Heat pump electricity consumption 2015

Summary

Complex re-weighting {#complex}

Example: Total residential electricity consumption 2015 {#complexWeight}

National estimates

Regional estimates {#regional}

Census area unit estimates

Summary

Overall summary and recommendations

Data Annex

Census data summary

GREEN Grid half-hourly total load summary

GREEN Grid household data summary

GREEN Grid/SPATIALEC ipf household weights

Runtime

R environment

R packages used

Session info

References

R Package Documentation

Browse R Packages

We want your feedback!

CfSOtago/GREENGridEECA
Supports Analysis of NZ GREEN Grid data for EECA