\doublespacing

```{=tex}
\thispagestyle{empty}
\clearpage
\pagebreak
\setcounter{page}{1}
\setcounter{tocdepth}{2}  % Set table of contents depth
\tableofcontents          % Print table of contents
\pagebreak                % Space between table of contents and content
```

```{r setup}

# Icons made by [Freepik](https://www.freepik.com) from [Flaticon](https://www.flaticon.com/).

# Load packages

library(britpol)
library(kableExtra)
library(tidyverse)
library(lubridate)
library(brms)
library(here)


# Load PollBasePro data

data("pollbase")
data("pollbasepro")
data("samplesizes")


# Load correlations

load(here("R", "sysdata.rda"))


# Load timeline data

timeline <-
  britpol:::load_timeline() %>%
  select(
    date = polldate,
    elecdate,
    country,
    party = partyid,
    vote = poll_
  ) %>%
  filter(country == "United Kingdom") %>% 
  na.omit()


# Create custom ggplot theme

theme_bailey <- function(){
  theme_minimal() +
    theme(legend.title = element_text(family = "Cabin", face = "bold", size = rel(1)),
                   text = element_text(family = "Cabin", color = "black", size = 8),
                   plot.title = element_text(family = "Cabin", face = "bold", size = rel(1.4), hjust = 0),
                   plot.subtitle = element_text(family = "Cabin", size = rel(1), hjust = 0, margin = margin(b = 10)),
                   axis.line = element_line(lineend = "round"),
                   axis.title.x = element_text(family = "Cabin", face = "bold", margin = margin(t = 10), size = rel(1)),
                   axis.text.x = element_text(color = "black", size = rel(1)),
                   axis.ticks.x = element_line(lineend = "round"),
                   axis.title.y = element_text(family = "Cabin", face = "bold", size = rel(1)),
                   axis.text.y = element_text(color = "black", size = rel(1)),
                   strip.text = element_text(family = "Cabin", face = "bold", size = rel(1)),
                   panel.spacing = unit(.3, "cm"),
                   panel.grid.major.y = element_line(size = .5, lineend = "round"),
                   panel.grid.minor.y = element_blank(),
                   panel.grid.major.x = element_blank(),
                   panel.grid.minor.x = element_blank()
    )
}


# Tell knitr to use Cairo PDF when rendering plots so that it uses nice fonts

knitr::opts_chunk$set(dev = "cairo_pdf")
```

# About britpol

britpol is an R package that makes analysing British political data quick and simple. It contains two pre-formatted datasets, plus a host of useful functions. The first dataset, pollbase, is a long-format version of Mark Pack's [-@pack2021] dataset of historic British public opinion polls, combined with more recent data from Wikipedia. The second dataset, pollbasepro, provides `r format(nrow(pollbasepro), big.mark = ",")` daily estimates of voting intention figures for each of Britain's three largest parties between `r str_remove(format(min(pollbasepro$date), "%d %B %Y"), "^0")` and `r str_remove(format(max(pollbasepro$date), "%d %B %Y"), "^0")`.

britpol will change as elections come and go, so users should use only the most recent version of the package in their research. As with any project of this kind, some minor mistakes might have crept into the code. If you think that you have found an error, please raise an issue on the britpol GitHub repository.

## Citing britpol and the PollBasePro Data

You may also use the britpol codebase for your own purposes in line with its license, but you must do so with attribution. That is, you may reproduce, reuse, and adapt the code as you see fit, but you must state in each case that you used britpol to produce your work. The relevant citation is as follows:

Likewise, if you use the pollbasepro dataset, you should cite it too. This project comprises two elements: the dataset itself and a companion paper. The citations for each item are as follows:

# Using britpol

Getting started with britpol requires only three short steps: first, install the package; second, load the package; and, third, load any data that you want to use. Once you have taken these three steps, you can begin your analysis. Note that britpol is not yet available on CRAN, the service that hosts most R packages. As such, it is not yet possible to install britpol with R's standard install.packages() function. Instead, you must install it directly from its GitHub repository. Thankfully, this is straightforward and requires only that you run the following code in your R console:

```r
# 1. Install the britpol package from GitHub
devtools::install_github("jackobailey/britpol")

# 2. Load the britpol package in R
library(britpol)

# 3. Load the pollbase and pollbasepro datasets
data("pollbase")
data("pollbasepro")

Though britpol is first and foremost an R data package, the data are available for Stata and SPSS users too. To download the pollbase and pollbasepro datasets in .dta and .sav format, click here.
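If you later want those files back in R, the haven package can read both formats. A minimal sketch, assuming the downloaded files sit in your working directory under these names:

```r
# Read the downloaded Stata and SPSS files back into R with haven.
# The file paths below are assumptions; substitute your download location.

library(haven)

pollbasepro_stata <- read_dta("pollbasepro.dta")
pollbasepro_spss  <- read_sav("pollbasepro.sav")
```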

# Functions

# Data

## pollbase (Historical British Election Polls from `r format(min(pollbase$start), "%Y")` to `r format(max(pollbase$end), "%Y")`)

### Variable List

```{r pollbase-table}
britpol:::make_table(pollbase)
```

\pagebreak

## pollbasepro (Daily British Voting Intention from `r format(min(pollbasepro$date), "%Y")` to `r format(max(pollbasepro$date), "%Y")`)

### Analysing pollbasepro

```{r error-plot}
ggplot() +
  geom_density_2d(
    data = pollbasepro %>% mutate(facet = "Conservative"),
    mapping = 
      aes(
        x = date,
        y = qnorm(.975)*con_err
        ),
    colour = "#0087dc",
    alpha = .4
  ) +
  stat_smooth(
    data = pollbasepro %>% mutate(facet = "Conservative"),
    mapping = aes(
      x = date,
      y = qnorm(.975)*con_err
    ),
    method = "lm",
    formula = y ~ x,
    se = TRUE,
    colour = "black"
  ) +
  geom_density_2d(
    data = pollbasepro %>% mutate(facet = "Labour"),
    mapping = 
      aes(
        x = date,
        y = qnorm(.975)*lab_err
        ),
    colour = "#d50000",
    alpha = .4
  ) +
  stat_smooth(
    data = pollbasepro %>% mutate(facet = "Labour"),
    mapping = aes(
      x = date,
      y = qnorm(.975)*lab_err
    ),
    method = "lm",
    formula = y ~ x,
    se = TRUE,
    colour = "black"
  ) +
  geom_density_2d(
    data = pollbasepro %>% mutate(facet = "Liberal"),
    mapping = 
      aes(
        x = date,
        y = qnorm(.975)*lib_err
        ),
    colour = "#fdbb30",
    alpha = .4
  ) +
  stat_smooth(
    data = pollbasepro %>% mutate(facet = "Liberal"),
    mapping = aes(
      x = date,
      y = qnorm(.975)*lib_err
    ),
    method = "lm",
    formula = y ~ x,
    se = TRUE,
    colour = "black"
  ) +
  facet_wrap(~facet) +
  scale_y_continuous(
    breaks = seq(0, 0.05, by = .01),
    labels = scales::percent_format(suffix = "pts", accuracy = .1)
  ) +
  scale_x_date(
    breaks = seq.Date(as.Date("1960-01-01"), as.Date("2020-01-01"), by = "10 years"),
    labels = year(seq.Date(as.Date("1960-01-01"), as.Date("2020-01-01"), by = "10 years"))
  ) +
  coord_cartesian(ylim = c(0, 0.05)) +
  theme_bailey() +
  theme(axis.title.x = element_blank()) +
  labs(y = "Posterior Uncertainty (95%)")

Note that the estimates that we include in the pollbasepro dataset are probabilistic. As such, we include in the dataset both an estimate of the posterior mean of aggregate voting intention for each party on each day and the posterior uncertainty in these estimates. As figure \@ref(fig:error-plot) makes clear, our uncertainty estimates do not vary at random. Instead, they are correlated with time. This occurs because polls have become more numerous and have tended to include larger sample sizes as time has passed. Thus, our estimates for more recent years are more certain than our estimates for years further in the past.

We advise all those who use the pollbasepro dataset to include this uncertainty in their analyses wherever possible. This is important both because propagating our uncertainty forward is good practice and because such uncertainty serves to reduce statistical power and to attenuate real relationships in the data. Accounting for it is possible using "errors-in-variables" models. These models work much like regular generalised linear models, though they account for measurement error in the dependent variable, the independent variables, or both. @mcelreath2020 provides a good introduction to the intuition behind errors-in-variables models. Similarly, Bürkner [-@burkner2017] provides an easy-to-use interface for fitting such models in R using the brms package [see also chapter 15.1 in @kurz2020 for an applied example].
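As a minimal sketch of what this looks like in practice, the code below uses brms' `me()` measurement-error term to regress a hypothetical outcome on our Conservative estimates. The outcome `y` and the data frame `my_data` are assumptions for illustration and not part of britpol:

```r
# A hypothetical errors-in-variables model: "my_data" is an assumed data
# frame that joins some outcome y to pollbasepro by date. The me() term
# tells brms that con_est is measured with known error con_err.

library(brms)

eiv_mod <-
  brm(
    formula = y ~ me(con_est, con_err),
    data = my_data,
    family = gaussian()
  )
```

The `me()` term handles error in a predictor; brms also supports known error in the outcome via response terms such as `se()`.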

\pagebreak

### Variable List

```{r pollbasepro-table}
britpol:::make_table(pollbasepro)
```

# Technical Details: Estimating Daily Voting Intention

We adapt Jackman's [-@jackman2005] method to derive our daily estimates. Still, there are issues specific to our case that we must first overcome. We elaborate on our choices below.

## Imputing Missing Sample Sizes

Our data do not include sample sizes before the 2010 general election. This is a problem, as our model requires that we know this information. To solve this problem, we use data from Jennings and Wlezien's [-@jennings2016a] "Timeline of Elections" dataset. Though less comprehensive than PollBase, these data do include information on sample sizes. What's more, they also include data from countries other than Britain. This lets us pool all available information to improve our estimates.

Sample sizes are count data. As such, we use the following multilevel Poisson regression model to impute likely sample sizes for all of our pre-2010 polling data:

\begin{align}
n_{i} &\sim \mathrm{Poisson}(\lambda_{i}) \srlab{Likelihood function} \\
\mathrm{log}(\lambda_{i}) &= \alpha_{Country[i]} + \beta_{Country[i]} T_{i} \srlab{Linear model on $\lambda$} \\
\begin{bmatrix} \alpha_{Country} \\ \beta_{Country} \end{bmatrix} &\sim \mathrm{MVNormal}\left(\begin{bmatrix} \alpha \\ \beta \end{bmatrix}, \textbf{S}\right) \srlab{Multivariate prior on varying effects} \\
\textbf{S} &= \begin{pmatrix} \sigma_{\alpha} & 0 \\ 0 & \sigma_{\beta} \end{pmatrix} \textbf{R} \begin{pmatrix} \sigma_{\alpha} & 0 \\ 0 & \sigma_{\beta} \end{pmatrix} \srlab{Covariance matrix on varying effects} \\
\alpha &\sim \mathrm{Normal}(7, 0.5) \srlab{Prior on average intercept, $\alpha$} \\
\beta &\sim \mathrm{Normal}(0, 0.1) \srlab{Prior on average slope, $\beta$} \\
\sigma_{\alpha} &\sim \mathrm{Exponential}(10) \srlab{Prior on uncertainty in the intercepts, $\sigma_{\alpha}$} \\
\sigma_{\beta} &\sim \mathrm{Exponential}(10) \srlab{Prior on uncertainty in the slopes, $\sigma_{\beta}$} \\
\textbf{R} &\sim \mathrm{LKJ}(2) \srlab{Prior on correlation matrix, $\textbf{\textrm{R}}$}
\end{align}
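The pipeline itself lives in the britpol repository, but a rough brms sketch of this specification, under assumed variable names (`n` for a poll's sample size, `years` for time, `country` for country), would look something like this:

```r
# A hedged sketch of the imputation model: a Poisson likelihood with
# correlated intercepts and slopes on time that vary by country. The data
# frame and column names here are assumptions for illustration.

library(brms)

n_mod <-
  brm(
    formula = n ~ 1 + years + (1 + years | country),
    family = poisson(),
    prior =
      prior(normal(7, 0.5), class = "Intercept") +
      prior(normal(0, 0.1), class = "b") +
      prior(exponential(10), class = "sd") +
      prior(lkj(2), class = "cor"),
    data = timeline_data
  )
```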

```{r n-plot}
# Create sample size plot

samplesizes %>%
  ggplot(
    aes(
      x = date,
      y = n_est,
      ymin = qpois(0.025, n_est),
      ymax = qpois(0.975, n_est)
    )
  ) +
  geom_ribbon(alpha = .3, colour = NA) +
  geom_line() +
  scale_y_continuous(breaks = seq(0, 2500, by = 500)) +
  scale_x_date(
    breaks = seq.Date(as.Date("1945-01-01"), as.Date("2020-01-01"), by = "5 years"),
    labels = year(seq.Date(as.Date("1945-01-01"), as.Date("2020-01-01"), by = "5 years"))
    ) +
  coord_cartesian(ylim = c(0, 2600)) +
  theme_bailey() +
  theme(axis.title.x = element_blank()) +
  labs(
    y = "Imputed Sample Size"
  )
```

We assume that the sample size associated with poll $i$ in the Timeline data, $n_{i}$, is distributed as Poisson according to some rate parameter, $\lambda_{i}$. We then model the logarithm of this parameter using a simple linear function that includes an intercept, $\alpha_{Country}$, and a slope on the effect of time, $\beta_{Country}$, which we allow to vary over countries. We then relate these two parameters to one another by modelling them as though they come from a multivariate normal distribution. In effect, this allows the parameters to be correlated and, thus, to share information.

While our data concern Britain alone, we use all available data in the Timeline dataset. This is for good reason. The dataset does not contain reliable sample size values for British polls conducted before the early 1960s. But it does contain reliable values for other countries as early as the mid-1940s. Pooling all available information for all countries across the entire time series thus allows us to impute reliable estimates of likely sample sizes in Britain across the full range of dates, drawing on persistent differences between British polls and those from other countries.

Figure \@ref(fig:n-plot) shows the model's best estimate of the likely sample size of the average British voting intention poll between `r year(min(samplesizes$date))` and `r year(max(samplesizes$date))`. These imputed values seem sensible and conform to our expectation that sample sizes should increase over time. The model estimates that the average British voting intention poll included around `r format(round(samplesizes$n_est[samplesizes$date == min(samplesizes$date)], 0), big.mark = ",")` respondents in `r year(min(samplesizes$date))`. By `r year(max(samplesizes$date))`, the model suggests that this value had increased by `r format(round(samplesizes$n_est[samplesizes$date == max(samplesizes$date)] - samplesizes$n_est[samplesizes$date == min(samplesizes$date)], 0), big.mark = ",")` to `r format(round(samplesizes$n_est[samplesizes$date == max(samplesizes$date)], 0), big.mark = ",")` respondents per poll, on average.

We use the model to produce a time series of estimated sample sizes between `r year(min(samplesizes$date))` and `r year(max(samplesizes$date))`. This includes all dates for which we intend to produce a voting intention estimate. Where our polling data come from before the 2010 general election, or are otherwise missing, we fill in the gaps with these imputed values. To do so, we match our polling data to the imputed values from the model based on their respective dates.
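In dplyr terms, this matching step amounts to a date-based join. The following sketch assumes hypothetical column names (`end` for a poll's fieldwork end date and `n` for its possibly missing sample size):

```r
# Fill missing sample sizes with the imputed value for each poll's date.
# Column names are assumptions for illustration.

library(dplyr)

polls_filled <-
  pollbase %>%
  left_join(samplesizes, by = c("end" = "date")) %>%
  mutate(n = coalesce(n, round(n_est)))
```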

## Estimating Daily Voting Intention Figures

As we mention above and in the accompanying paper, we adapt the model in @jackman2005 to compute our daily British voting intention estimates. The model is complex and has many moving parts, so we will build it up step by step.

We assume that each poll in our underlying data, $Poll_{i}$, is generated from some normal distribution. This distribution has two parameters. The first is some mean, $\mu_{i}$. The second is some error that leads the estimates to be higher or lower than the expected value, $\mu_{i}$. In many models with a normal likelihood function, this error parameter would measure only random residual error and be represented by the Greek letter $\sigma$. But, in our case, we have additional information that we can use. We know that each poll is a proportion and represents a draw from some random distribution. Thus, we can use the equation for the standard error of a proportion to calculate the uncertainty in each estimate, where $S_{i} = \sqrt{\frac{Poll_{i} (1 - Poll_{i})}{\nu_{i}}}$. Note that $\nu_{i}$ is the sample size of $Poll_{i}$, $n_{i}$, divided by the number of days that the poll spent in the field, $k_{i}$. In effect, this implies that we assume an equal number of people were polled on each day that the poll was in the field. We can then include both terms in our model to account for any known error, $S_{i}$, and any random residual error, $\sigma$. So far, our model is as follows:

\begin{align}
Poll_{i} &\sim \mathrm{Normal}(\mu_{i}, \sqrt{\sigma^2 + S_{i}^2}) \srlab{Likelihood function}
\end{align}
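As a quick numerical check of the known-error term (the poll figures below are made up for illustration), consider a poll that reports 42% support from 1,000 respondents over two days in the field:

```r
# Standard error of a proportion, per the definition above.

poll <- 0.42   # reported vote share
n    <- 1000   # total sample size
k    <- 2      # days in the field
nu   <- n / k  # assumed respondents per fieldwork day

s_i <- sqrt((poll * (1 - poll)) / nu)
s_i  # ~0.022, i.e. about 2.2 percentage points per daily reading
```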

The next step is to fit a model to $\mu_{i}$. This will be a measurement model, as it will allow us to produce an estimate of the electorate's latent voting intention on each day. We assume that $\mu_{i}$ is a linear function of two variables: $\alpha_{Day[i]}$, the electorate's latent voting intention for $Poll_{i}$ on the day that it was fielded, and $\delta_{Pollster[i]}$, the persistent "house effects" that arise due to the methodological and design choices that inform how the company that ran the poll collected its data. If we update our model specification to include these assumptions, we get the following:

\begin{align}
Poll_{i} &\sim \mathrm{Normal}(\mu_{i}, \sqrt{\sigma^2 + S_{i}^2}) \srlab{Likelihood function} \\
\mu_{i} &= \alpha_{Day[i]} + \delta_{Pollster[i]} \srlab{Measurement model on $\mu$}
\end{align}

At present, all values of $\alpha_{Day}$ are independent. This is a problem. First, we want estimates closer together in time to be more similar. Second, some days have no polling data to inform them. To address this problem, we constrain $\alpha_{1}$ to be equal to the vote share that a given party received at a given election. We also constrain $\alpha_{T}$ to be equal to the vote share that the same party received at the following election. Next, we fit a dynamic model to $\alpha_{t}$ for all days in our time series except for the first and last. This acts as a sort of "chain" that links together all values of $\alpha$. Because these values are now linked, they can then share information amongst themselves. This means that when the value of one estimate changes during the model estimation process, so too do the values of all others. The model assumes that $\alpha_{t}$ is equal to $\alpha_{t-1}$, plus any random shocks that take place between the two days, $\omega_{t-1}$. These random shock parameters are themselves scaled according to $\tau$, the scale of innovations parameter. In effect, $\tau$ governs how volatile the latent series can be: larger values allow voting intention to move further from one day to the next, while smaller values keep the series smooth. Updating our model specification again, we get:

\begin{align}
Poll_{i} &\sim \mathrm{Normal}(\mu_{i}, \sqrt{\sigma^2 + S_{i}^2}) \srlab{Likelihood function} \\
\mu_{i} &= \alpha_{Day[i]} + \delta_{Pollster[i]} \srlab{Measurement model on $\mu$} \\
\alpha_{t} &= \alpha_{t-1} + \tau \omega_{t-1} \text{ for } t \text{ in } 2, ..., T-1 \srlab{Dynamic model on $\alpha_{t}$}
\end{align}
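To build intuition for the dynamic model, the following toy simulation (illustrative values only, not the estimation code itself) draws one such random walk pinned at a starting value of 40%:

```r
# Simulate the dynamic model: a random walk in latent voting intention,
# with daily shocks omega scaled by tau. All values are illustrative.

set.seed(111)

T_days <- 100                          # days between two elections
tau    <- 0.05                         # scale of innovations
omega  <- rnorm(T_days - 1, 0, 0.025)  # daily random shocks

alpha    <- numeric(T_days)
alpha[1] <- 0.40                       # vote share at the preceding election

for (t in 2:T_days) {
  alpha[t] <- alpha[t - 1] + tau * omega[t - 1]
}

plot(alpha, type = "l", xlab = "Day", ylab = "Latent voting intention")
```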

As we rely on Bayesian methods, our final step is to provide the model with a set of prior distributions.

\begin{align}
Poll_{i} &\sim \mathrm{Normal}(\mu_{i}, \sqrt{\sigma^2 + S_{i}^2}) \srlab{Likelihood function} \\
\mu_{i} &= \alpha_{Day[i]} + \delta_{Pollster[i]} \srlab{Measurement model on $\mu$} \\
\alpha_{t} &= \alpha_{t-1} + \tau \omega_{t-1} \text{ for } t \text{ in } 2, ..., T-1 \srlab{Dynamic model on $\alpha_{t}$} \\
\alpha_{T} &\sim \mathrm{Normal}(\alpha_{T-1}, \tau) \srlab{Adaptive prior on $\alpha_{T}$} \\
\delta_{j} &\sim \mathrm{Normal}(0, 0.05) \text{ for } j \text{ in } 1, ..., J \srlab{Prior on house effects, $\delta$} \\
\omega_{t} &\sim \mathrm{Normal}(0, 0.025) \text{ for } t \text{ in } 1, ..., T-1 \srlab{Prior on random shocks, $\omega$} \\
\tau &\sim \mathrm{Normal}(0, 0.05)^{+} \srlab{Positive prior on scale of innovations, $\tau$} \\
\sigma &\sim \mathrm{Exponential}(20) \srlab{Prior on residual error, $\sigma$}
\end{align}

Note that the model is pinned at both ends and includes only one set of polling figures at a time. To produce the full series, we therefore loop over each pair of consecutive elections and each party between 1955 and the present day.
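Schematically, and with illustrative dates, that loop looks something like this:

```r
# Schematic of the estimation loop: one model run per party per pair of
# consecutive elections. The dates and the fitting step are illustrative.

elections <- as.Date(c("1955-05-26", "1959-10-08", "1964-10-15"))
parties   <- c("con", "lab", "lib")

for (e in seq_len(length(elections) - 1)) {
  for (party in parties) {
    message("Fitting ", party, ": ", elections[e], " to ", elections[e + 1])
    # ...subset the polls between the two elections, pin alpha_1 and
    # alpha_T to the party's two results, then estimate the model...
  }
}
```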

## Validating Our Estimates

```{r val-plot}
# Define party colours

pty_cols <-
  c(
    "Conservative Party" = "#0087dc",
    "Labour Party" = "#d50000",
    "Liberals (Various Forms)" = "#fdbb30"
  )


# Load Timeline data, filter to include only UK cases, and split by party

tl <-
  britpol:::load_timeline() %>%
  select(
    date = polldate,
    elecdate,
    country,
    party = partyid,
    vote = poll_
  ) %>%
  filter(country == "United Kingdom") %>%
  mutate(
    vote = vote/100,
    party = case_when(party == 1 ~ "con", party == 2 ~ "lab", party == 3 ~ "lib"),
    polldate = as_date(date)
  ) %>%
  na.omit()

con <-
  tl %>%
  filter(party == "con") %>%
  left_join(
    pollbasepro,
    c("polldate" = "date")
  ) %>%
  na.omit()

lab <-
  tl %>%
  filter(party == "lab") %>%
  left_join(
    pollbasepro,
    c("polldate" = "date")
  ) %>%
  na.omit()

lib <-
  tl %>%
  filter(party == "lib") %>%
  left_join(
    pollbasepro,
    c("polldate" = "date")
  ) %>%
  na.omit()


# Create correlation plot

ggplot() +
  geom_density_2d(
    data = con %>% mutate(facet = "Conservative"),
    mapping = aes(x = vote, y = con_est),
    colour = pty_cols[1],
    alpha = .4
  ) +
  stat_smooth(
    data = con %>% mutate(facet = "Conservative"),
    mapping = aes(x = vote, y = con_est),
    colour = "black",
    fill = pty_cols[1],
    method = "lm",
    formula = y ~ x
  ) +
  geom_density_2d(
    data = lab %>% mutate(facet = "Labour"),
    mapping = aes(x = vote, y = lab_est),
    colour = pty_cols[2],
    alpha = .4
  ) +
  stat_smooth(
    data = lab %>% mutate(facet = "Labour"),
    mapping = aes(x = vote, y = lab_est),
    colour = "black",
    fill = pty_cols[2],
    method = "lm",
    formula = y ~ x
  ) +
  geom_density_2d(
    data = lib %>% mutate(facet = "Liberal (Various Forms)"),
    mapping = aes(x = vote, y = lib_est),
    colour = pty_cols[3],
    alpha = .4
  ) +
  stat_smooth(
    data = lib %>% mutate(facet = "Liberal (Various Forms)"),
    mapping = aes(x = vote, y = lib_est),
    colour = "black",
    fill = pty_cols[3],
    method = "lm",
    formula = y ~ x
  ) +
  facet_wrap(~ facet) +
  scale_y_continuous(
    breaks = seq(0, .6, by = .2),
    labels = scales::percent_format(accuracy = 1)
  ) +
  scale_x_continuous(
    breaks = seq(0, .6, by = .2),
    labels = scales::percent_format(accuracy = 1)
  ) +
  coord_cartesian(
    ylim = c(0, 0.62),
    xlim = c(0, 0.62)
  ) +
  labs(x = "Jennings and Wlezien's 'Timeline of Elections' Data", y = "PollBasePro Estimates") +
  theme_bailey()
```

To validate our estimates, we compare them against raw polling data from Jennings and Wlezien's [-@jennings2016a] "Timeline of Elections" dataset. These data contain `r format(nrow(timeline[timeline$party == 1, ]), big.mark = ",")` polls from Britain between `r str_remove(format(min(as_date(timeline$date)), "%d %B %Y"), "^0")` and `r str_remove(format(max(as_date(timeline$date)), "%d %B %Y"), "^0")`. Given that the data we use to produce our estimates are so comprehensive, it is likely that most polls appear in both datasets. Still, the Timeline data provide a good test, as Jennings and Wlezien compiled them independently. As figure \@ref(fig:val-plot) shows, our estimates validate well. Correlations between the two series are strong and positive, as we would expect. Their mean absolute error (MAE) and root-mean-square error (RMSE) are also low in all cases. The Conservatives showed a correlation of `r britpol:::in_text(cor_mods$tl_con*100, suffix = "%", inside = F)`, an MAE of `r paste0(round(cor_mods$mae_con, 2), " percentage points")`, and an RMSE of `r round(cor_mods$rmse_con, 2)`; Labour, a correlation of `r britpol:::in_text(cor_mods$tl_lab*100, suffix = "%", inside = F)`, an MAE of `r paste0(round(cor_mods$mae_lab, 2), " points")`, and an RMSE of `r round(cor_mods$rmse_lab, 2)`; and the Liberals, a correlation of `r britpol:::in_text(cor_mods$tl_lib*100, suffix = "%", inside = F)`, an MAE of `r paste0(round(cor_mods$mae_lib, 2), " points")`, and an RMSE of `r round(cor_mods$rmse_lib, 2)`.
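For reference, the two accuracy statistics carry their usual definitions. A minimal sketch, assuming `est` and `obs` are numeric vectors holding our estimates and the corresponding Timeline poll shares:

```r
# Standard definitions of the accuracy statistics reported above. The
# vector names est and obs are assumptions for illustration.

mae  <- function(est, obs) mean(abs(est - obs), na.rm = TRUE)
rmse <- function(est, obs) sqrt(mean((est - obs)^2, na.rm = TRUE))
```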

# Open-Source Data Pipeline

We recognise that some users will find our modelling decisions easier to understand if they can see our code^[Note that we estimate all of our models using R `r str_remove(R.Version()$version.string, "^R*.")` and either version `r packageVersion("brms")` of the brms package [@burkner2017] or version `r cmdstanr::cmdstan_version()` of cmdstan, an interface to the Stan probabilistic programming language [@carpenter2017].]. This transparency also has other benefits: it allows our users to identify mistakes in our code. After all---and like any project of this nature---our data pipeline likely contains minor errors or inefficiencies that could affect the estimates that we obtain. To guard against this, and to give our users a more in-depth look at our modelling process, we host our entire data pipeline online for others to inspect. If our users find any errors in our code or wish to make recommendations for future updates, we invite them to raise an issue on the project's GitHub repository or to contact the authors directly.

\pagebreak

# Change Log

For the sake of openness and transparency, we provide a change log that lists all updates and changes made to britpol over time. If you think that you have found a problem with the data, code, or documentation, please raise an issue on the project's GitHub repository.

```{r change-log}
# Read change log

change_log <- readLines(here("documentation", "change-log.txt"), warn = FALSE)


# Output change log

cat(paste0(change_log, collapse = "\n"))
```

\pagebreak

# References

::: {#refs}

:::

```{r session-info}
# Save session information to the "sessions" folder

britpol:::save_info(path = here("sessions", "008_codebook.txt"))
```

