knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
The following example illustrates how the csranks package can be used for estimation and inference in rank-rank regressions. These are commonly used for studying intergenerational mobility.
In this example, we want to study intergenerational income mobility by estimating and performing inference on the rank correlation between parents' and their children's incomes. The csranks package contains an artificial dataset with data on children's and parents' household incomes, and the child's gender and race (black, hisp or neither).
First, load the package csranks. Second, load the data and take a quick look at it:
library(csranks)
data(parent_child_income)
head(parent_child_income)
In economics, it is common to estimate measures of mobility by running rank-rank regressions. For instance, the rank correlation between parents' and children's incomes can be estimated by running a regression of a child's income rank on the parent's income rank:
lmr_model <- lmranks(r(c_faminc) ~ r(p_faminc), data=parent_child_income)
summary(lmr_model)
This regression specification takes each child's income (c_faminc), computes its rank among all children's incomes, then takes each parent's income (p_faminc) and computes its rank among all parents' incomes. Then the child's rank is regressed on the parent's rank using OLS. The lmranks function computes standard errors, t-values and p-values according to the asymptotic theory developed in Chetverikov and Wilhelm (2023).
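The point-estimation step described above can be sketched in base R on simulated data. This is only an illustration: the variable names and the data-generating process are made up, and it assumes that the ranks are fractional ranks of the form "share of observations less than or equal to this one" (the base-R counterpart of calling frank with omega=1 and increasing=TRUE, as far as we understand that function). It reproduces point estimates only, not the inference performed by lmranks.

```r
# Illustrative simulated incomes (hypothetical data, not the package dataset)
set.seed(1)
n <- 200
p_inc <- rexp(n)                  # parents' incomes
c_inc <- 0.5 * p_inc + rexp(n)    # children's incomes, positively related

# Fractional ranks: share of observations less than or equal to each value
p_rank <- rank(p_inc, ties.method = "max") / n
c_rank <- rank(c_inc, ties.method = "max") / n

# OLS of the child's rank on the parent's rank
fit <- lm(c_rank ~ p_rank)
slope <- unname(coef(fit)[2])

# Without ties, the slope coincides with Spearman's rank correlation
spearman <- cor(p_inc, c_inc, method = "spearman")
all.equal(slope, spearman)
```

The equality holds here because, without ties, both rank variables have the same variance, so the OLS slope reduces to the correlation between the ranks.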
A naive approach, which does not lead to valid inference, would compute the children's and parents' ranks first and then run a standard OLS regression afterwards:
c_faminc_rank <- frank(parent_child_income$c_faminc, omega=1, increasing=TRUE)
p_faminc_rank <- frank(parent_child_income$p_faminc, omega=1, increasing=TRUE)
lm_model <- lm(c_faminc_rank ~ p_faminc_rank)
summary(lm_model)
Notice that the point estimates of the intercept and slope are the same as those of the lmranks function. However, the standard errors, t-values and p-values differ. This is because the usual OLS formulas for standard errors do not take into account the estimation uncertainty in the ranks.
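A compact way to see this comparison side by side is sketched below. It uses only calls that already appear in this example (lmranks, frank, lm, coef, confint); the object names lmr_fit, lm_fit and the helper ci_width are introduced here purely for illustration.

```r
library(csranks)
data(parent_child_income)

# Rank-rank regression whose inference accounts for the estimated ranks
lmr_fit <- lmranks(r(c_faminc) ~ r(p_faminc), data = parent_child_income)

# Naive version: compute the ranks first, then run plain OLS
c_rank <- frank(parent_child_income$c_faminc, omega = 1, increasing = TRUE)
p_rank <- frank(parent_child_income$p_faminc, omega = 1, increasing = TRUE)
lm_fit <- lm(c_rank ~ p_rank)

# The point estimates agree
all.equal(unname(coef(lmr_fit)), unname(coef(lm_fit)))

# The widths of the 95% confidence intervals generally do not
# (ci_width is a hypothetical helper defined only for this sketch)
ci_width <- function(fit) unname(confint(fit)[, 2] - confint(fit)[, 1])
data.frame(parameter = c("Intercept", "slope"),
           lmranks   = ci_width(lmr_fit),
           naive     = ci_width(lm_fit))
```

Comparing interval widths rather than the printed summaries keeps the check independent of how each summary method formats its output.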
One can also run the rank-rank regression with additional covariates, e.g.:
lmr_model_cov <- lmranks(r(c_faminc) ~ r(p_faminc) + gender + race, data=parent_child_income)
summary(lmr_model_cov)
In some economic applications, one wants to run rank-rank regressions separately in subgroups of the population, but compute the ranks in the whole population. For instance, we might want to estimate rank-rank regression slopes as measures of intergenerational mobility separately for males and females, while the ranking of children's incomes is formed among all children (rather than forming separate rankings for males and females).
Such regressions can easily be run using the lmranks function and interaction notation:
grouped_lmr_model_simple <- lmranks(r(c_faminc) ~ r(p_faminc):gender, data=parent_child_income)
summary(grouped_lmr_model_simple)
In this example, we have run separate OLS regressions of children's ranks on parents' ranks for female and male children. However, children's incomes are ranked among all children, and parents' incomes are ranked among all parents. The standard errors, t-values and p-values are computed according to the asymptotic theory developed in Chetverikov and Wilhelm (2023). That theory shows that the asymptotic distribution of the estimators must account not only for the fact that the ranks are estimated, but also for the fact that the estimators are correlated across gender subgroups because they use the same estimated ranking.
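To make the "separate regressions, common ranking" structure concrete, here is a base-R sketch on simulated data. The data and names are hypothetical, and the fractional ranks are again assumed to be of the "share less than or equal to" form. It checks that a single lm call with group-wise intercepts and slopes gives the same point estimates as running one regression per subgroup on ranks computed in the whole sample. It says nothing about standard errors.

```r
# Illustrative simulated data with two subgroups (hypothetical)
set.seed(2)
n <- 300
g <- factor(sample(c("female", "male"), n, replace = TRUE))
p_inc <- rexp(n)
c_inc <- ifelse(g == "female", 0.3, 0.7) * p_inc + rexp(n)

# Ranks are computed once, in the whole sample
p_rank <- rank(p_inc, ties.method = "max") / n
c_rank <- rank(c_inc, ties.method = "max") / n

# One regression with group-wise intercepts and group-wise slopes
pooled <- lm(c_rank ~ p_rank:g + g - 1)

# Separate regressions within each subgroup, using the same pooled ranks
by_group <- sapply(levels(g), function(lev) {
  coef(lm(c_rank ~ p_rank, subset = g == lev))
})

# Rows: intercept, slope; columns: female, male
pooled_matrix <- matrix(unname(coef(pooled)), nrow = 2, byrow = TRUE)
all.equal(unname(by_group), pooled_matrix)
```

The coefficient vector of the pooled fit lists the group intercepts first and the group slopes second, which is why it is reshaped row by row before the comparison.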
A naive application of the lm function would produce the same point estimates, but not the correct standard errors:
grouped_lm_model_simple <- lm(c_faminc_rank ~ p_faminc_rank:gender + gender - 1, #group-wise intercept
                              data=parent_child_income)
summary(grouped_lm_model_simple)
One can also create more granular subgroups by interacting several characteristics such as gender and race:
parent_child_income$subgroup <- interaction(parent_child_income$gender, parent_child_income$race)
grouped_lmr_model <- lmranks(r(c_faminc) ~ r(p_faminc):subgroup, data=parent_child_income)
summary(grouped_lmr_model)
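Because the coefficients of such a model are reported per subgroup, it helps to know how interaction orders its levels. The toy factors below are hypothetical and are not taken from the package data; in the actual dataset the order depends on how the levels of gender and race are defined there.

```r
# Toy factors (hypothetical); levels default to alphabetical order
gender <- factor(c("female", "male", "female", "male"))
race   <- factor(c("black", "black", "hisp", "neither"))

# The first factor varies fastest; unused combinations are kept as levels
subgroup <- interaction(gender, race)
levels(subgroup)
# Order: female.black, male.black, female.hisp, male.hisp,
#        female.neither, male.neither
```

Any hand-written group labels, such as those used for plotting, need to follow this level order to be attached to the right coefficients.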
Let's compare the confidence intervals for the regression coefficients produced by lmranks and by the naive approach.
grouped_lm_model <- lm(c_faminc_rank ~ p_faminc_rank:subgroup + subgroup - 1, #group-wise intercept
                       data=parent_child_income)
summary(grouped_lm_model)
library(ggplot2)
theme_set(theme_minimal())
ci_data <- data.frame(estimate=coef(lmr_model),
                      parameter=c("Intercept", "slope"),
                      group="Whole sample",
                      method="csranks",
                      lower=confint(lmr_model)[,1],
                      upper=confint(lmr_model)[,2])
ci_data <- rbind(ci_data, data.frame(
  estimate = coef(grouped_lmr_model),
  parameter = rep(c("Intercept", "slope"), each=6),
  group = rep(c("Hispanic female", "Hispanic male", "Black female",
                "Black male", "Other female", "Other male"), times=2),
  method="csranks",
  lower=confint(grouped_lmr_model)[,1],
  upper=confint(grouped_lmr_model)[,2]
))
ci_data <- rbind(ci_data, data.frame(
  estimate = coef(lm_model),
  parameter = c("Intercept", "slope"),
  group = "Whole sample",
  method="naive",
  lower=confint(lm_model)[,1],
  upper=confint(lm_model)[,2]
))
ci_data <- rbind(ci_data, data.frame(
  estimate = coef(grouped_lm_model),
  parameter = rep(c("Intercept", "slope"), each=6),
  group = rep(c("Hispanic female", "Hispanic male", "Black female",
                "Black male", "Other female", "Other male"), times=2),
  method="naive",
  lower=confint(grouped_lm_model)[,1],
  upper=confint(grouped_lm_model)[,2]
))
ggplot(ci_data, aes(y=estimate, x=group, ymin=lower, ymax=upper, col=method, fill=method)) +
  geom_point(position=position_dodge2(width = 0.9)) +
  geom_errorbar(position=position_dodge2(width = 0.9)) +
  geom_hline(aes(yintercept=estimate),
             data=subset(ci_data, group=="Whole sample"),
             linetype="dashed", col="gray") +
  coord_flip() +
  labs(title="95% confidence intervals of intercept and slope\nin rank-rank regression") +
  facet_wrap(~parameter)
The coefficients estimated on the whole sample have narrow confidence intervals, which is expected given the larger sample size. In this example, there are some differences between the correct (csranks) confidence intervals and the incorrect (naive) confidence intervals, but they are rather small. The paper by Chetverikov and Wilhelm (2023), however, provides empirical examples in which the differences can be quite large.
Check out the documentation of individual functions at the package's website and further examples in the package's GitHub repository.