knitr::opts_chunk$set(echo = TRUE, fig.height = 5, fig.width = 8, message = TRUE, warning = FALSE)
#load libraries: #load strm library(strm) library(rmarkdown) #load package dependencies for the vignette library(tidyr) library(gt) library(dplyr) library(ggplot2) library(patchwork) library(purrr) library(rgdal) library(spdep)
strm
is an R
package that fits spatio-temporal regression model based on Chi & Zhu Spatial Regression Models for the Social Sciences (2019). The approach here fits a spatial error model while incorporating a temporally lagged response variable and temporally lagged explanatory variables.
This package builds on the errorsarlm()
function from the spatialreg
package.
This package is still under development. Please report bugs or constructive tips to issues here.
This data example will work through downloading census data using the tidycensus
package, creating spatial weights, and then using strm
to fit the spatio-temporal regression model. We will fit the same model using both the long and wide data formats.
In this example, we want to look at percent poverty as it relates to female employment rates at the Wisconsin county-level. We will use the American Community Survey (ACS) 5-year data. The variables we will extract are:
B17020_002
- Estimate: Total - Income in the past 12 months below poverty levelB17020_001
- Estimate: Total - Poverty Status in the past 12 months.B23022_026
- Estimate: Total Female by Work Status by weeks worked in the past 12 months for the population 16-64 years old.B23022_001
- Estimate: Total: status in the past 12 months by usual hours worked per week in the past 12 months by weeks worked in the past 12 months for the population 16-64 years old (Male and Female)We first load in the dataset wi_raw
that contains the raw 5-year estimates from two years: 2013-2017 and 2014-2018 ACS data at the county-level in Wisconsin using data(wi_raw)
:
data(wi_raw)
We first explore the raw Wisconsin data:
class(wi_raw) names(wi_raw) str(wi_raw) head(wi_raw)
Notice that wi_raw
is of class sf
and class data.frame
We first clean the data in order to calculate percent of each year-county population that is in poverty and percent female employment rate. We also take the natural log transformation of the poverty percent rate. We first use select()
from dplyr
to select out the moe
(margin of error) variable as we do not need it for this analysis. We then use spread
from tidyr
to reformat the key variables from long to wide format so that we can then use mutate()
from dplyr
to create new variables (pov_prct
, feemp_prct
, id
, logpov
). Note this dataset is still in long
format because there are multiple observations for each county due to multiple years.
wi <- wi_raw %>% dplyr::select(-moe) %>% spread(key = variable, value=estimate) %>% mutate(pov_prct = B17020_002/B17020_001, feemp_prct = B23022_026/B23022_001, year=id, logpov = log(pov_prct)) wi <- st_transform(wi, 4326) class(wi) str(wi)
Let us explore the summary statistics by year for pov_prct
using summary()
and the tapply()
function:
summary(wi$pov_prct) tapply(wi$pov_prct, wi$year, summary)
We also the summary statistics by year for feemp_prct
summary(wi$feemp_prct) tapply(wi$feemp_prct, wi$year, summary)
Finally, we look at correlation between percent poverty and percent female employment across all 2 years of ACS data:
cor(wi$pov_prct,wi$feemp_prct)
We next visually explore the data using the geom_histogram()
and geom_density()
geometries from the ggplot2
package.
We first explore percent poverty (pov_prct
) and the log-transformed percent poverty (logpov
):
ggplot(data=wi, aes(x=pov_prct)) + facet_grid(.~ year) + geom_histogram(binwidth=0.01,color="black", aes(fill=factor(year))) + geom_density(alpha=.2, aes(fill=factor(year))) + theme_minimal() + xlab("Percent Poverty") + labs(title="County-Level Percent Poverty in Wisconsin Across Time", fill="Year") + theme(plot.title = element_text(hjust = 0.5), legend.position = "none") + scale_fill_manual(values=c("chartreuse4", "deeppink4", "cornflowerblue","orange2","tomato3")) breaks <- quantile(wi$logpov,seq(0,1,by=0.1)) ggplot(data=wi, aes(x=logpov)) + geom_histogram(aes(y =..density.., fill=factor(year)),color="black", breaks=breaks) + facet_grid(.~ year) + geom_density(alpha=.2, aes(fill=factor(year))) + theme_minimal() + xlab("(Log) Percent Poverty") + labs(title="County-Level (Log) Percent Poverty in Wisconsin Across Time", fill="Year") + theme(plot.title = element_text(hjust = 0.5), legend.position = "none") + scale_fill_manual(values=c("chartreuse4", "deeppink4", "cornflowerblue","orange2","tomato3"))
Recall that wi
is of class sf
. We can use the geom_sf()
geometry to easily create maps of both pov_prct
and logpov
:
ggplot(wi) + geom_sf(aes(fill = pov_prct)) + scale_fill_viridis_c() + coord_sf(datum = NA) + facet_grid(. ~ year) + theme_minimal()+ theme(plot.title=element_text(hjust = 0.5)) + labs(title = "County-Level Percent Poverty in Wisconsin Across Time", fill="%Poverty") ggplot(wi) + geom_sf(aes(fill = logpov)) + scale_fill_viridis_c() + coord_sf(datum = NA) + facet_grid(. ~ year) + theme_minimal()+ theme(plot.title=element_text(hjust = 0.5)) + labs(title = "County-Level (Log) Percent Poverty in Wisconsin Across Time", fill="(Log) %Poverty")
Next we visually explore feemp_prct
:
ggplot(data=wi, aes(x=feemp_prct)) + facet_grid(.~ year) + geom_histogram(binwidth=0.01,color="black", aes(fill=factor(year))) + geom_density(alpha=.2, aes(fill=factor(year))) + theme_minimal() + xlab("Percent Female Employment") + labs(title="County-Level Percent Female Employment in Wisconsin Across Time", fill="Year") + theme(plot.title = element_text(hjust = 0.5), legend.position = "none") + scale_fill_manual(values=c("chartreuse4", "deeppink4", "cornflowerblue","orange2","tomato3"))
ggplot(wi) + geom_sf(aes(fill = feemp_prct)) + scale_fill_viridis_c() + coord_sf(datum = NA) + facet_grid(. ~ year) + theme_minimal()+ theme(plot.title=element_text(hjust = 0.5)) + labs(title = "County-Level Percent Female Employment in Wisconsin Across Time", fill="%Feemp")
In order to apply the spatio-temporal model using strm
, we need to first define county neighbors. We will define first-order contiguity-based neighborhood structure (Queen's adjacency) with row-standardized spatial weights.
We have 2 years of data, however the spatial dependence between counties remains the same. Therefore, we will select unique counties to create neighbors and spatial weights. Specifically, we will use the geometries from 2018 using filter()
from dplyr
and to simplify the data frame we will select just the GEOID
and NAME
variables using select()
from dplyr
:
wi_uniq <- wi %>% dplyr::filter(year==2018) %>% dplyr::select(GEOID, NAME) nrow(wi_uniq) class(wi_uniq)
We now have 72 unique county geometries which correspond to the 72 counties in Wisconsin.
We next create the neighborhood structure using poly2nb()
from the spdep
package. We set the row names of wi_uniq
dataframe to correspond to the county names and then use the argument row.names=row.names(wi_uniq)
in poly2nb()
so that we can easily identify each element of the neighborhood structure list with a county. Finally, we print the first 6 neighbors using str(head(wi_nb))
:
row.names(wi_uniq) <- wi_uniq$NAME head(row.names(wi_uniq)) wi_nb <- poly2nb(wi_uniq, row.names=row.names(wi_uniq)) str(head(wi_nb))
Now that we have the neighborhood structure, we create the list of spatial weights using nb2listw()
from the spdep
package and print the class and first 6 elements of the spatial weights list:
listw_w <- nb2listw(wi_nb, style = "W") class(listw_w) str(head(listw_w$weights))
Lastly, we can visualize these neighbors. We first extract the county polygons using st_geometry()
from the sf
package and save that as the object county_geoms
. Next we find the county centroids using st_centroid()
from sf
on county_geoms
and save that as the object cntrd
. Lastly, we find the coordinates of the centroids using st_coordinates()
on the cntrd
object:
county_geoms <- st_geometry(wi_uniq) cntrd <- st_centroid(county_geoms) coords <- st_coordinates(cntrd) head(coords)
We can visualize the county geometries, centroids, and neighborhood:
plot(county_geoms) plot(listw_w, coords, col="blue", add=TRUE)
strm
We use the strm()
function from the strm
package. The first argument is the model formula. Given that the data is in long format, strm
will create the lagged values for you as well as the lagged response in the right-hand side of the model. The next argument is id
, which we set to "NAME" as each observation is taken at the county-level. We then specify the name of the data set (data=wi
). We set the argument listw=listw_w
to the row-standardized list of spatial weights. We set time=2
because there are 2 time periods in the dataset. We set wide=FALSE
, thereby specifying that the data is in long format. Lastly, we set the argument returndf=TRUE
to return the modified dataframe along with the model output. The default to returndf
is FALSE
and strm
will only return the model output summary.
#create model formula formula <-as.formula(logpov ~ feemp_prct) #use strm out <- strm(formula, id="NAME", data=wi, listw = listw_w, time=2,wide=FALSE, returndf = TRUE) #explore model summary output summary(out$res) #explore modified dataframe used in strm computation summary(out$modframe)
We next explore the model diagnostics. We create a new dataframe called out_df
using cbind.data.frame()
, which has the model residuals, model standardized residuals, and fitted values.
out_df <- cbind.data.frame(resids = out$res$residuals, stdresids = out$res$residuals/sd(out$res$residuals), fit = out$res$fitted.values)
Using the model residuals and fitted values, we create a QQ plot and (standardized) residuals vs. fitted plot, and show them both using the +
operator from the patchwork
R package:
#QQ Plot p1 <- ggplot(out_df, aes(sample = stdresids)) + geom_qq(size = 1) + stat_qq_line() + theme_minimal() + xlab("Theoretical Quantiles") + ylab("Standardized Residuals") + ggtitle("QQ Plot") + theme(plot.title=element_text(hjust = 0.5)) #Residuals vs. Fitted Plot p2 <- ggplot(out_df, aes(x=fit, y=stdresids)) + geom_point(size = 0.5) + geom_hline(yintercept = 0, col="red", linetype="dashed") + theme_minimal() + xlab("Fitted Values") + ylab("Standardized Residuals") + ggtitle("Residuals vs. Fitted Plot") + theme(plot.title=element_text(hjust = 0.5)) p1 + p2
Next, we will perform the same analysis as above, but will use the data in wide format.
We first clean the dataframe and convert it to wide format using functions from both the dplyr
and tidyr
packages
wi_w <- as.data.frame(wi_raw) %>% dplyr::select(-moe) %>% spread(key = variable, value=estimate) %>% mutate(pov_prct = B17020_002/B17020_001, feemp_prct = B23022_026/B23022_001, year=id, logpov = log(pov_prct), logfeemp = log(pov_prct)) %>% dplyr::select(-id, -(B17020_001:B23022_026), - geometry) %>% relocate(year, .after=GEOID) %>% gather(variable, value, -(GEOID:NAME)) %>% unite(temp, variable, year) %>% spread(temp, value) str(wi_w) head(wi_w) nrow(wi_w)
Notice how in this wide format, each variable is one of our poverty percent or female employment percent variables plus an associated year, and now we only have r nrow(wi_w)
rows in this wide format compared to r nrow(wi)
(72 counties $\times$ 2 time periods) rows.
We will use the 2018 geometries for creating our neighborhood structure and list of spatial weights. To do this, we first create wi_2018geom
, which is a simplified sf
dataframe that only contains the GEOID
variable and the associated county geometries.
wi_2018geom <- wi %>% filter(year==2018) %>% dplyr::select(GEOID, geometry) str(wi_2018geom)
Next, we will merge wi_2018geom
(an sf
dataframe that contains the county geometries) with wi_w
(a regular dataframe) using the merge()
function from the sp
package:
#merge 2018 geometries back in wi_w.sf <- merge(wi_2018geom, wi_w, by="GEOID") nrow(wi_w.sf) class(wi_w.sf)
Since our data is in wide format, we can create a similar map to the one created above in the long format, by creating 5 separate maps for each time period and then putting them together using the +
operator from the patchwork
R package. To create a title for the combined plot, we use plot_annotation()
and to specify a single legend we use plot_layout(guides = 'collect')
, both from the patchwork
package.
p18 <- ggplot(wi_w.sf) + geom_sf(aes(fill = logpov_2018)) + scale_fill_viridis_c(limits = c(-3.2, -1)) + coord_sf(datum = NA) + theme_minimal() + theme(plot.title=element_text(hjust = 0.5)) + labs(title = "2018", fill="(Log) %Poverty") p17 <- ggplot(wi_w.sf) + geom_sf(aes(fill = logpov_2017)) + scale_fill_viridis_c(limits = c(-3.2, -1)) + coord_sf(datum = NA) + theme_minimal() + theme(plot.title=element_text(hjust = 0.5)) + labs(title = "2017", fill="(Log) %Poverty") #patchwork the plots together patch1 <- p17 + p18 patch1 + plot_annotation(title="County-Level (Log) Percent Poverty in Wisconsin Across Time") + plot_layout(guides = "collect")
Creating the neighborhood structure and list of spatial weights is the same as above in long format.
row.names(wi_w.sf) <- wi_w.sf$NAME head(row.names(wi_w.sf)) wi_nb_w <- poly2nb(wi_w.sf, row.names=row.names(wi_w.sf)) str(head(wi_nb_w))
listw_w <- nb2listw(wi_nb_w, style = "W") class(listw_w) str(head(listw_w$weights))
county_geoms <- st_geometry(wi_w.sf) cntrd <- st_centroid(county_geoms) coords <- st_coordinates(cntrd) head(coords)
To be sure that our neighbors look the same, we can again plot the county geometries, centroids, and neighborhood:
plot(county_geoms) plot(listw_w, coords, col="blue", add=TRUE)
strm
The first argument is the model formula. Given that the data is now in wide format, we must manually enter the response and all of the explanatory variables and their lagged terms. The next argument is id
, which we set to "NAME" as each observation is taken at the county-level. We then specify the name of the data set (data=wi_w.sf
). We set the argument listw=listw_w
to the row-standardized list of spatial weights. We set time=2
because there are 2 time periods in the dataset. We set wide=TRUE
, thereby specifying that the data is in wide format.
formula2 <-as.formula(logpov_2018 ~feemp_prct_2017 + feemp_prct_2018 + logpov_2017) out2 <- strm(formula2, id="NAME", data=wi_w.sf, listw= listw_w, time=2, wide=TRUE) summary(out2)
Compare the output from out$res
from the long format dataframe and out2
from the wide format dataframe and see that they are the same!
out_coefs <- c(as.vector(out$res$coefficients), out$res$lambda) out_ses <- c(as.vector(out$res$rest.se),out$res$lambda.se) out_names <- c(attributes(out$res$coefficients)[[1]], "lambda") out2_coefs <- c(as.vector(out2$coefficients), out2$lambda) out2_ses <- c(as.vector(out2$rest.se), out2$lambda.se) out2_names <- c(attributes(out2$coefficients)[[1]], "lambda") out_table <- rbind.data.frame(cbind.data.frame(Term=out_names, Estimate=out_coefs, SE=out_ses, format=rep("Long format",length(out_names))), cbind.data.frame(Term=out2_names, Estimate=out2_coefs, SE=out2_ses,format=rep("Wide format",length(out2_names)))) out_table %>% gt(groupname_col = c("format")) %>% tab_header(title="Comparison of Long and Wide format Output") %>% fmt_number(columns = c(Estimate, SE), decimals = 2) %>% cols_label(Term = md("**Term**"), Estimate=md("**Estimate**"), SE=md("**SE**"))
tidycensus
In this example, we have provided the data used. However the data can be downloaded directly using the tidycensus
R package. The code to download the data directly is below. To use tidycensus
, you will need to request an API key here.
#load tidycensus library(tidycensus) options(tigris_use_cache = TRUE)
tidycensus
From the tidycensus
package, we use the census_api_key()
function and specify install=TRUE
so that the API key is stored in the local R environment. To access the variables we want from the American Community Survey. It should be noted that using tidycensus, we can also access the 1990, 2000, and 2010 decennial US Census data as well.
census_api_key(Sys.getenv("CENSUS_API_KEY"), install=TRUE, overwrite = TRUE) readRenviron("~/.Renviron")
census_api_key("YOUR API KEY HERE", install=TRUE)
(You may need to either restart your R session or run readRenviron("~/.Renviron")
in your console to use the API key immediately.)
vars <- load_variables(2018, "acs5", cache=TRUE) years <- lst(2017,2018) multi_year <- map(years,~ get_acs(geography = "county", variables = c("B17020_001","B17020_002","B23022_026","B23022_001"), state = "WI", year = .x, geometry = TRUE)) %>% map2(years, ~ mutate(.x, id = .y)) wi_raw <- reduce(multi_year, rbind)
For more examples of working with the tidycensus
package, please see the vignettes for basic usage and working with spatial data. Full tidycensus
support can be found here.
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.