Problemset Taxation in Scandinavia

Author: David Hertle

< ignore

library(restorepoint)
# facilitates error detection
# set.restore.point.options(display.restore.point=TRUE)

library(RTutor)
library(yaml)
#library(restorepoint)
setwd("C:/Users/David/Dropbox/Uni Ulm/Bachelor/David/ps")
ps.name = "Taxation_Scandinavia"; sol.file = paste0(ps.name,"_sol.Rmd")
libs = c("ggplot2","dplyr","ggrepel","magrittr","stargazer","yaml","lmtest","gridExtra","regtools","lfe","googleVis","faraway") # all packages you load in the problem set
#name.rmd.chunks(sol.file) # set auto chunk names in this file
create.ps(sol.file=sol.file, ps.name=ps.name, user.name=NULL,libs=libs, stop.when.finished=FALSE,var.txt.file ="variables.txt",addons="quiz")
show.shiny.ps(ps.name, load.sav=FALSE,  sample.solution=FALSE, is.solved=FALSE, catch.errors=TRUE, launch.browser=TRUE)
stop.without.error()

>

In his paper "How Can Scandinavians Tax So Much?" Henrik Jacobsen Kleven (2014) analyzes why the Scandinavian countries (defined as Denmark, Norway, and Sweden) are able to raise large amounts of tax revenue for redistribution and social insurance while maintaining some of the strongest economic outcomes in the world. He identifies three policies that can explain this anomaly.

In this interactive R Tutorial, we are going to successively replicate his study and discuss the results.

(The article and the corresponding public data are provided on the website of the American Economic Association. You can click on this link to download them.)

You do not need to solve the exercises in the given order but it is recommended to do so as it makes the most sense.

Exercise Content

  1. Overview of the data set

  2. Recap of aspects of taxation & policy implications

2.1 Third-party information reporting

2.2 Broad tax base

2.3 Subsidization of goods complementary to working

2.3.1 Graphical approach

2.3.2 Linear Regressions

  3. Evidence on social and cultural influences

  4. Conclusion

  5. References

Exercise 1 -- Overview of the data set

First of all, it is necessary to familiarize yourself with the existing data on an economic issue before working with it. This chapter will teach you some useful R commands to get a brief overview of the data at hand.

Since most of the data provided on the website of the American Economic Association is not used, I have reduced the data set to the variables that are actually needed. To do so, I read in the original dta file (the format used by the Stata data analysis software) with the read.dta() command from the foreign package and saved it as an rds file with the command saveRDS() for your convenience.

< info "saveRDS() / readRDS()"

These commands allow you to load a single R object from a file or save it to one. The advantage over saving data in txt or csv files is that the data structures, such as column data types, are preserved. However, saving rds files only makes sense if you do not want to open the files with analysis software other than R.

If you set the working directory correctly and save the data in it, it will suffice to use the character string specifying the name of a file.

dat = readRDS(file = "data.rds")  #reads the rds file and saves it into the variable dat

You can also set the full path if you did not store the data in the working directory.

dat = readRDS(file = "C:/mypath/data.rds")

To save an object to a file, the command looks as follows.

saveRDS(object, file = "C:/mypath/data.rds")

If you want to know more about the saveRDS() command, you can take a look at this summary

>
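As a minimal sketch of why rds files preserve data structures (see the info box above), the following round trip compares an rds file with a csv file. The toy data frame and its column names are made up for illustration and are not part of the problem set data.

```r
# Hypothetical toy data frame (not part of the problem set data)
toy <- data.frame(
  Code   = c("DNK", "NOR", "SWE"),
  survey = as.Date(c("2012-01-01", "2012-06-30", "2012-12-31")),
  stringsAsFactors = FALSE
)

# Round trip through an rds file stored in a temporary location
rds_file <- tempfile(fileext = ".rds")
saveRDS(toy, file = rds_file)
toy_rds <- readRDS(file = rds_file)

# Round trip through a csv file for comparison
csv_file <- tempfile(fileext = ".csv")
write.csv(toy, file = csv_file, row.names = FALSE)
toy_csv <- read.csv(file = csv_file)

class(toy_rds$survey)  # the rds copy keeps the Date class
class(toy_csv$survey)  # the csv copy no longer knows that this column held dates
```

The csv copy can of course be converted back with as.Date(), but this has to be done by hand for every affected column.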

Since this is your first task, most of the command is already given. But before you start entering any code you have to press the edit button or just click into the code box if the edit button is not displayed. This has to be done on every first task of each chapter.


Task: Use the command readRDS() to load the data set data_condensed.rds into the R workspace and assign it to the variable data. If you need help on how to use readRDS(), look at the info box above. When you are finished, click the check button to find out whether your solution is correct.

If you need further advice, click the hint button, which contains more detailed information on your task. If the hint is not helpful enough, you can always access the solution with the solution button. Here you just need to uncomment the code (remove the #) and fill the placeholder ... correctly.

#< task
# Adapt the following line of code
# ... = readRDS("...")
#>
data = readRDS("data_condensed.rds")
#< hint
display("Just write: data = readRDS(\"data_condensed.rds\") and press check afterwards.")
#>

< award "Starter"

Perfect, you have completed your first task.

>

The data set contains various macroeconomic variables of several countries such as employment rates, tax rates for the years 1950 to 2012 and survey results. If you want to take a look at the data, you can click the data button located above every task.

If you are interested in the original data sources Kleven (2014) used in his paper, have a look at the following info box.

< info "original data sources"

The variable ief_taxpergdp2012 was obtained from the Index of Economic Freedom - Heritage Foundation.

From the OECD Tax Revenue Statistics originate the following variables: oecd_t1000, oecd_t2000, oecd_t3000, oecd_propertytax (= oecd_t4000), oecd_consumptiontax (= oecd_t5000) and oecd_othertax (= oecd_t6000).

The pss_topmitr is from Piketty, Saez, and Stantcheva (2014)

To calculate the participation tax rates oecd_entrytax_emp20to59 and oecd_entrytax_emp20to59_notranf Kleven (2014) used the following data sources: OECD National Accounts, OECD Government Revenue Statistics, OECD Social Expenditure Statistics, Penn World Table 7.0. The exact formula is presented in the info box on the participation tax rate.

From the World Bank come the variables wb_employees_empselfemp, which is the fraction of employees in the workforce, and wb_employers_empselfemp, which is the fraction of employers in the workforce. The variables wb_laborforce_rate and wb_natresorents_pct_gdp also come from the World Bank.

The fraction of employees in evasive sectors ilo_evasivesectemp comes from the ILO (International Labour Organization).

From the OECD Labor Force Statistics, the variables on employment rates among 20 to 59-year-olds oecd_emp20to59 and oecd_f_emp20to59 were obtained. Also, oecd_childcarepresch, oecd_elderlycare and oecd_hoursworked originate from there.

From the World Values Survey, the following variables were used: wvs_trustpeople and wvs_inneed_lazy.

To calculate the social capital index sk_proxy_wo_relig, the following components were used:

- civic participation: weighted average of a binary indicator for active membership of an organization (the latest available year, source: WVS, various waves)
- average voter turnout in elections held after 2000, excluding the European Parliament elections (source: Voter Turnout Database, IDEA)
- the inverse of the homicide rate (the latest available year, source: UNODC)

The Polity2 score polity2 originates from Polity IV.

From the World Giving Index, Charities Aid Foundation, Kleven (2014) retrieved the share of people donating money to charitable organizations in 2012 wgi_donatemoney_2012.

The share of labor income in GDP pwt_labshr_gdp was obtained from the Penn World Tables.

>

For further insights on the data, the dplyr package provides commands for common data manipulation tasks to prepare the data. We are going to use the intuitive functions from this package in the following parts of the problem set.

For instance, you can use the n_distinct() command from this package to count the number of unique values in the specified columns of your data frame.

< info "n_distinct()"

The n_distinct() command helps to count the number of unique values in a vector. The first argument ... requires a vector or a data frame. By default the option na.rm is set to FALSE. This means that missing values represented by the symbol NA are also counted.

n_distinct( ... , na.rm = FALSE)

Note that objects within data frames can be made accessible with the attach() function in R. This means that variables can be referred to by their name with less typing. The disadvantage is that it can lead to confusion if there are identically named variables in other data frames. To reduce the potential for errors, you can use the extract operator $ to refer to any variable in a specific data frame.

n_distinct(dataframe$variable)

>
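To illustrate the effect of the na.rm option described in the info box, here is a small sketch on a made-up vector of country codes (the vector is hypothetical and not taken from the data set).

```r
library(dplyr)

# Hypothetical vector with one repeated code and one missing value
codes <- c("SWE", "NOR", "DNK", "SWE", NA)

n_distinct(codes)                # NA is counted as a value of its own: 4
n_distinct(codes, na.rm = TRUE)  # NA is ignored: 3
length(unique(codes))            # base R counterpart of the default behavior: 4
```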

Whenever you want to use functions from a package, you first have to load these packages with the command: library(package name). In this R problem set, the required packages have to be loaded again for every chapter to work properly.


Task: Use the command n_distinct() to find out how many different countries are included in the data set. The variable is called Country. If you need help on how to use the command, have a look at the info box above.

#< task
# loading the required package
library(dplyr)
# use the n_distinct() command
#>
n_distinct(data$Country)
#< hint
display("Take a look at the examples in the info box.")
#>


< quiz "Countries in the data set"

question: How many distinct countries are available in the data set?
sc:
    - 159
    - 209*
    - 50
success: Great, your answer is correct!
failure: Try again.

>


To get a further glimpse of the data, we use the command sample_n() from the dplyr package to see all available variables. This command is similar to the basic head() or tail() command. The main difference is that it does not return the first or last rows but selects rows from a table at random.


Task: Use the command sample_n() to print a sample containing 10 rows from the data frame called data.

#< task
# use the command sample_n(tbl, size) 
#>
sample_n(data,10)
#< hint
display("See the comment in this code box")
#>


As you may have noticed, you can get an additional description of each variable by hovering over their names on the displayed table. If you want to take a closer look at the data, you can click the data button, located above every task. This will take you to the Data Explorer.


To get some valuable descriptive statistics of the variables in our data set it is advisable to use the summary() function. It returns the minimum, maximum, mean, median, lower and upper quartiles for all columns at once. Additionally, it reports the number of missing values (NA's).

Task: Use the command summary() to get a summary for the data frame data.

#< task

#>
summary(data)
#< hint
display("Adapt the function summary(...)")
#>

The output from the summary function should give you a better picture of the distribution of the variables and their magnitude. This helps with the interpretation of the results in later tasks. Note that I created the variable style to distinguish the observations in the following tasks.


But now let's get back to the research question of the paper.

Reproducibility: Since the presidential candidate Bernie Sanders had a vision of Nordic-style policies for the United States, which he referred to as "democratic socialism" (Partanen, 2016), the question arises again how Scandinavia is able to achieve such results and whether the same could work in other countries as well.

Although Scandinavian countries redistribute large amounts of income through taxes and transfers, they rank among the leading countries in the world in terms of income per capita and other economic and social outcomes. This case challenges the thesis that large-scale redistribution has a harmful effect on economic growth and welfare (Kleven, 2014).

To get a rough overview, we want to compare the tax revenue (Tax per GDP) and tax rates in Scandinavia with those of other countries. For this purpose, we subset our data to a handful of observations with the filter function.

< info "filter()"

The filter() command returns the subset of rows that satisfy one or more conditions. Conditions can be combined with & or a comma (both mean "and") or with other Boolean operators like | ("or"). The command requires two arguments: a data frame and one or more conditions to filter the data by. Within the conditions, you can use comparison operators (e.g. >, <, >=, <=, ==, !=) and %in%.

For example, the following command returns all rows where the year is greater than or equal to 2010 and the variable Country equals Germany.

filter(data, Country == "Germany" & year >= 2010)

It is also possible to filter by a list with the %in% operator to write shorter and much clearer code.

The following example filters the data and returns only rows of the countries defined in the vector.

filter(data, Country %in% c("United Kingdom","Austria","Germany"))

To learn more about the filter command and other functions from the dplyr package you can read the comprehensible Introduction to dplyr.

>
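The following sketch runs the patterns from the info box on a small made-up data frame, so that the results can be checked by eye. The rows are hypothetical and not taken from the data set.

```r
library(dplyr)

# Hypothetical toy data frame
toy <- data.frame(
  Country = c("Germany", "Germany", "Austria", "Norway"),
  year    = c(2009, 2011, 2011, 2011),
  stringsAsFactors = FALSE
)

# Joining conditions with & or with a comma gives the same rows
with_and   <- filter(toy, Country == "Germany" & year >= 2010)
with_comma <- filter(toy, Country == "Germany", year >= 2010)

# %in% is a shorter way of writing several == conditions joined by |
with_in <- filter(toy, Country %in% c("Austria", "Germany"))
with_or <- filter(toy, Country == "Austria" | Country == "Germany")

with_and  # only the Germany row from 2011 remains
with_in   # the two Germany rows and the Austria row remain
```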


Task: Use the command filter() from the dplyr package to subset the data frame data and only show the countries where the variable Code is equal to: "SWE","NOR","DNK","USA","GBR","DEU".

#< task
# use the filter() command
# Hint: You can copy the Codes from above
#>
filter(data, Code %in% c("SWE","NOR","DNK","USA","GBR","DEU"))
#< hint
display("Use the info box on the filter function to see a similar example.")
#>

This command gives us too many observations. As we want to get a rough overview, we only want to show the most recent values of each country in our subset.

To accomplish this and return only the most recent values of the tax revenue per GDP and the tax rates of each country, we use the summarise_each_ command, which allows us to apply a function to one or more columns. In this case, we apply the tail() function, which returns the last item of a vector or the last row of a data frame.

< info "summarise_each_()"

Because we want to apply one function to many variables/columns, we can use summarise() or the summarise_each_ function. Here I give more information on the latter because it allows including or excluding specific variables. Note that you can use the American (summarize) as well as the British spelling (summarise).

summarise_each_(tbl, funs, vars)

The argument tbl requires the data frame, and funs expects the list of function calls you want to apply to the data frame. The argument vars allows you to include or exclude specific columns. If vars is missing, it defaults to all non-grouping variables.

If we wanted to calculate the mean of all non-grouping variables and just exclude the variable year from the calculation we could do so with the following command.

summarise_each_(data frame, funs(mean), -year)

For the argument vars you can use the same specifications as in the select function.

If you want to know more about aggregation with the dplyr package, take a look at the tutorial on milanor.net

>
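In more recent dplyr releases, summarise_each_() and funs() are marked as deprecated. As a hedged sketch (assuming dplyr 1.0.0 or later is installed, which is an assumption and not a requirement of this problem set), the same kind of aggregation can be written with summarise() and across(). The toy data frame below is made up for illustration.

```r
library(dplyr)

# Hypothetical toy data; column names and values are made up
toy <- data.frame(
  Country = c("A", "A", "B", "B"),
  year    = c(2011, 2012, 2011, 2012),
  tax     = c(0.40, 0.42, 0.25, 0.27),
  emp     = c(0.80, 0.82, 0.70, 0.72),
  stringsAsFactors = FALSE
)

# Mean of every non-grouping column except year, computed per country
means <- toy %>%
  group_by(Country) %>%
  summarise(across(-year, ~ mean(.x, na.rm = TRUE)))

means
```

The tasks in this problem set keep using summarise_each_(), so this sketch is only meant as a pointer for readers working with a newer dplyr version.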

Performing many operations at once, without saving the results of each step can be confusing. But dplyr allows writing elegant chained code with the help of the pipe operator %>% from the package magrittr. This operator pipes the output of one function to the input of another function. For an example on how to use the %>% operator click the info box below.

< info "Chaining with pipe operator %>%"

With the pipe operator, "nested" commands become more intuitive to write. Instead of reading long command lines from the inside out, you can read the code block from top to bottom and from left to right.

# loading the required package
library(dplyr)

data %>%
  group_by(Country) %>%     #groups data and pipes the result 
  select(GDP, tax_revenue) %>%   #selects the columns "GDP" and "tax_revenue"
  # calculates the means and saves it to new variables 
  summarise(
    GDP.mean = mean(GDP, na.rm = TRUE),
    tax.mean = mean(tax_revenue, na.rm = TRUE)
  )

>
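To make the comparison between nested and chained code concrete, the following sketch computes the same group means in both styles. The toy data frame is hypothetical.

```r
library(dplyr)

# Hypothetical toy data
toy <- data.frame(
  Country = c("A", "A", "B"),
  GDP     = c(1, 3, 10),
  stringsAsFactors = FALSE
)

# Nested style: has to be read from the inside out
nested <- summarise(group_by(toy, Country), GDP.mean = mean(GDP, na.rm = TRUE))

# Chained style: reads from top to bottom
piped <- toy %>%
  group_by(Country) %>%
  summarise(GDP.mean = mean(GDP, na.rm = TRUE))

piped  # both versions contain the means 2 (country A) and 10 (country B)
```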


In the following, I provide the complete chained command to subset the data and summarize each variable of interest. To ensure that the function tail() gives us the most recent observations, it is crucial to sort the table by year with the arrange() function before collapsing it. The following task replicates Table 1 from the paper. Note that the figures are from 2012.

Task: Just click check to save the subset into tax_revenue and show the resulting table.

#< task

tax_revenue <- data %>%
  filter(Code %in% c("SWE","NOR","DNK","USA","GBR","DEU"))   %>%
  # sort rows ascending by Country, Code then year
  arrange(Country, Code, year) %>%
  group_by(Country) %>%
  summarise_each_(.,funs(tail(na.omit(.),n=1)), vars(ief_taxpergdp2012, oecd_incometax, oecd_propertytax, oecd_consumptiontax, pss_topmitr, oecd_entrytax_emp20to59)) %>%
  # sort rows descending by tax/gdp ratio
  arrange(desc(ief_taxpergdp2012))

# print resulting data frame
tax_revenue
#>
#< hint
display("The solution is already given. Just click check.")
#>
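The role of sorting before applying tail(na.omit(.), n = 1) can be seen in a small base R sketch. The years and values below are made up; the point is only that the last non-missing element is the most recent one only if the vector is ordered by year.

```r
# Hypothetical observations for one country, deliberately not sorted by year
year <- c(2011, 2012, 2010)
tax  <- c(0.43, NA, 0.40)

# Without sorting, the last non-missing element is the value from 2010
unsorted_last <- as.numeric(tail(na.omit(tax), n = 1))

# After sorting by year, the last non-missing element is the value from 2011,
# which is the most recent year with an observation
sorted_last <- as.numeric(tail(na.omit(tax[order(year)]), n = 1))

unsorted_last  # 0.40
sorted_last    # 0.43
```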

This table clearly shows that, in our subset, the Scandinavian countries lead in terms of tax revenue per GDP, with values ranging from 42.8 to 48.2 %, whereas the US, at the opposite end of the spectrum, collects only 24.8 % of GDP.

When considering the "participation tax rate" (PTR) in the last column called oecd_entrytax_emp20to59 which is the effective average tax rate that captures the implicit tax on working, the contrast is even more impressive.

< info "participation tax rate"

The participation tax rate accounts for all labor taxes, consumption taxes, and means-tested or work-tested transfers. The net-of-participation-tax-rate is calculated as follows.

$$1 - \tau \equiv \frac{\Delta c}{W_f} = \frac{1 - \tau_i - \tau_{pw} - b }{\left(1 + \tau_{pf}\right)\left(1 + \tau_c\right)}$$

$W_f \equiv W * \left(1 + \tau_{pf}\right) : total~labor~cost~of~firms$

$W : before-tax~earnings~of~workers$

$\Delta c : extra~consumption~induced~by~labor$

$\tau_i : income~tax~rate$

$\tau_{pw} : payroll~tax~rate~(employee)$

$\tau_{pf} : payroll~tax~rate~(employer)$

$\tau_c : consumption~tax~rate$

$b : benefit~rate$

Kleven (2014) shows that his calculated participation tax rates based on macro data are closely correlated to the measures of tax distortions using micro data by comparing his tax rate estimates with the micro data from Immervoll et al. (2007).

For his estimates, he used different tax and benefit rates from OECD Revenue Statistics, OECD Social Expenditure Statistics, and OECD National Accounts. To show how the PTR is derived, he refers to the OECD tax classification numbers and shows how each part of the formula is calculated.

$$consumption~tax~rate = \frac{5110+5121+5123+5126+5128+5211}{C-GW-5110-5121-5123-5126-5128-5211}$$

C = national consumption (household and government)
GW = government wage outlays
5110 = general consumption taxes
5121 = excise taxes
5123 = customs and imports
5126 = taxes on specific goods
5128 = other taxes on specific goods and services
5211 = household motor vehicle taxes

$$income~tax~rate = (1110)/W$$

W = aggregate labor income
1110 = taxes on income and profits of individuals

$$payroll~tax~rate~on~employees~(workers) = (2100+2300+2400)/W$$

2100 = social security contributions by employees
2300 = social security contributions by self-employed or non-employed
2400 = unallocable social security contributions

$$payroll~tax~rate~on~employers~(firms) = (2200+3000)/W$$

2200 = social security contributions by employers
3000 = taxes on payroll and workforce

$$benefit~rate = (B/(1-P))/(W/P)$$

B = aggregate expenditures on means-tested and work-tested transfers (all social assistance benefits (in cash and in kind), housing assistance, unemployment insurance, and disability insurance)
P = employment rate among 20 to 59-year-olds

>
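The formula for the net-of-participation-tax rate from the info box can be turned into a few lines of R. This is only a sketch: the function name net_of_ptr and all rates used below are hypothetical and not taken from Kleven (2014).

```r
# Net-of-participation-tax rate: (1 - tau_i - tau_pw - b) / ((1 + tau_pf) * (1 + tau_c))
# net_of_ptr is a hypothetical helper name chosen for this illustration
net_of_ptr <- function(tau_i, tau_pw, tau_pf, tau_c, b) {
  (1 - tau_i - tau_pw - b) / ((1 + tau_pf) * (1 + tau_c))
}

# Made-up rates for illustration only
keep <- net_of_ptr(tau_i = 0.30, tau_pw = 0.05, tau_pf = 0.10, tau_c = 0.20, b = 0.25)
ptr  <- 1 - keep

round(keep, 3)  # share of the total labor cost that ends up as extra consumption: 0.303
round(ptr, 3)   # participation tax rate: 0.697
```

With these made-up rates, the numerator is 1 - 0.30 - 0.05 - 0.25 = 0.40 and the denominator is 1.10 * 1.20 = 1.32, so the worker keeps about 30 % and the PTR is about 70 %.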

The PTR is around 80% in Scandinavia. This means that an average worker in Scandinavia entering employment will be able to increase his consumption by only 20% of his earned income.

< quiz "PTR"

question: How much of her earnings does an average worker in the US get to keep when accounting for the full impact of taxes and transfers according to the concept of the PTR?
sc:
    - 63.4 %*
    - 36.6 %
    - 30.1 %
success: Great, your answer is correct!
failure: Try again.

>

< award "Data Manipulation"

Congratulations, you've solved Exercise 1 and now know some basic dplyr commands.

>


What is also striking is the much higher top marginal tax rate (= pss_topmitr) in Scandinavia. It is the fraction of tax paid on an additional unit of income by the top income earners. This raises the question of whether governments should tax high earners more in order to address large deficits and widening inequality.

One reason for lowering top tax rates in many countries was to increase work effort and boost business creation, thereby generating more economic growth. However, by comparing the changes in top tax rates with the average annual growth of real GDP per capita in 18 OECD countries, Piketty et al. (2011) show that there is no correlation between the two.

The preceding table only gives a brief insight into the differences between countries. Among other things, these differences lead Kleven (2014) to ask how Scandinavians can collect so much tax and still feature high levels of real economic activity. If the explanation lies in specific features of policy design, this has policy implications for other countries. If, however, it lies in a special culture or social behavior in Scandinavia, the outcomes would be hard to replicate elsewhere.


Before we go to the next exercise, I want to give you the opportunity to explore some relationships on your own. The following task creates a motion chart with the package googleVis. By default, it depicts the participation tax rate on the x-axis and the employment rate among 20 to 59-year-olds on the y-axis. The size of each bubble shows GDP per capita and the color visualizes the tax/GDP ratio. But you can play around with the settings. If you click on the play button, you can see how each country develops over time. Note that the chart shows only the OECD members to keep it clear.

If you just want to see how specific variables changed over time, you can switch to the line chart.

Task: Click check to create a motion chart. Note that it takes a while to show.

#< task
# loading the needed package
library(googleVis)

motion_plot = gvisMotionChart(subset(data, OECD == 1), idvar = "Country",
                     timevar = "year", xvar = "oecd_entrytax_emp20to59", yvar = "oecd_emp20to59",
                     colorvar = "ief_taxpergdp2012", sizevar = "gdppc_ppp")

# plot the motion chart
plot(motion_plot, tag = "chart")
#>

Exercise 2 -- Recap of aspects of taxation & policy implications

Kleven (2014) identifies three policies in his paper that can help to explain the positive economic and social outcomes in Scandinavia. But before we get to these policies, I want to briefly recap fundamental aspects of taxation. Needless to say, every state has to levy taxes to sustain its political order.

Tax revenues are used to finance public goods (education, infrastructure, healthcare, internal and external security, etc.) and to ensure the enforcement of the law (police and courts). Additional objectives are to steer people's behavior and to redistribute wealth with the aim of increasing social justice. At the same time, tax policy is intended to minimize the distortions to economic decisions induced by taxes and subsidies.

Over the years, several principles of sound tax policy have been developed. They cannot all be attained fully because they compete with each other to some extent.

Complete neutrality, for example, is impossible to achieve and even undesirable to some extent if it contradicts other goals like the redistribution of wealth or discouraging harmful behavior (smoking or excessive alcohol consumption). For this reason, a certain level of distortion to behavior is inevitable and even desired.

Exercise 2.1 -- Third-party information reporting

The first element of the policy design that helps to answer this question is the Scandinavian tax system, which has a wide coverage of third-party information reporting and well-developed information trails. This should ensure a low level of tax evasion.

< info "tax evasion"

The term tax evasion refers to an illegal practice in which someone intentionally avoids paying their tax liability. This includes nonpayment and underpayment of taxes.

For example, the IRS (2016) estimates a gross tax gap of USD 458 billion annually for the years 2008 - 2010. The tax gap is the difference between the tax owed to the government and the amount it actually receives. Underreporting alone accounts for USD 387 billion of the tax gap.

A report by the Tax Justice Network (2011) estimates that governments worldwide (the research covered 98.2% of world GDP) lose more than US$ 3.1 trillion in annual revenue because of tax evasion.

>

Because employers and financial institutions report the taxable income of their employees or clients directly to the revenue department, the taxpayer has very little opportunity to evade taxes.

This is backed by the results of the tax compliance study conducted by the IRS (2012), which found that the evasion rate is 56% for income with little or no information reporting, 8% for income with substantial reporting, and only 1% when there is substantial reporting and withholding.

Given that the self-employed are more likely to evade taxes because self-reporting gives them more opportunity to do so, it is natural to investigate the relationship between tax revenues and the share of self-employed workers across countries.

Indeed, the share of self-employed workers is a plausible proxy for the degree of self-reporting in tax systems, so we use this macro data to plot it against the tax-to-GDP ratio. But before we can plot anything, we have to adapt the data to our needs.

Specifically, we want to exclude

- non-OECD countries whose natural resource rents amount to 20 % of GDP or more
- countries where GDP per capita based on purchasing power parity is less than USD 5000 (in 2005 PPP terms)


"The total natural resources rents are the sum of oil rents, natural gas rents, coal rents (hard and soft), mineral rents, and forest rents." (World Bank)

Task: Just click the check button to filter the data and save it into tax_take. Note that we only obtain the latest observation of each country with the summarise_each_ function.

#< task
# loading the data
data = readRDS("data_condensed.rds")
# loading the required package
library(dplyr)
#
tax_take <- data %>%
  filter(OECD == 1 | wb_natresorents_pct_gdp < 0.20) %>%                                               
  filter(gdppc_ppp >= 5000 & !is.na(gdppc_ppp)) %>%
  # calculating fraction of self-employed
  mutate(wb_selfemp_empselfemp = 1 - wb_employees_empselfemp) %>%
  # excluding rows with missing values in the defined columns
  filter(!is.na(wb_selfemp_empselfemp) & !is.na(ief_taxpergdp2012)) %>%
  arrange(Country, Code, year) %>%
  group_by(Country,Code) %>%
  summarise_each_(.,funs(tail(.,n=1)), vars(ief_taxpergdp2012, wb_selfemp_empselfemp, wb_employees_empselfemp, wb_employers_empselfemp, style))
#> 
#< hint
display("Just click check.")
#>

If you are interested in the exact calculation of the self-employed measure, take a look at the following info box.

< info "variable of interest 1"

The share of self-employed workers is calculated as follows wb_selfemp_empselfemp = 1 - wb_employees_empselfemp.

wb_employees_empselfemp is the share of employees calculated as follows: $\frac{employees}{employees~+~self-employed}$

>
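To see which rows survive the two filter conditions used in the task above and how the self-employment share is derived, consider the following sketch. It reuses the variable names from the data set, but the four rows are made up for illustration.

```r
library(dplyr)

# Hypothetical toy rows; only the column names come from the data set
toy <- data.frame(
  Country                 = c("A", "B", "C", "D"),
  OECD                    = c(1, 0, 0, 1),
  wb_natresorents_pct_gdp = c(0.30, 0.05, 0.35, 0.01),
  gdppc_ppp               = c(40000, 12000, 20000, 3000),
  wb_employees_empselfemp = c(0.90, 0.60, 0.70, 0.80),
  stringsAsFactors = FALSE
)

kept <- toy %>%
  # keeps OECD members and all countries with resource rents below 20 % of GDP
  filter(OECD == 1 | wb_natresorents_pct_gdp < 0.20) %>%
  # drops countries with GDP per capita below USD 5000
  filter(gdppc_ppp >= 5000 & !is.na(gdppc_ppp)) %>%
  # share of self-employed = 1 - share of employees
  mutate(wb_selfemp_empselfemp = 1 - wb_employees_empselfemp)

kept
```

In this toy example, country C is dropped because it is not an OECD member and has high resource rents, and country D is dropped because of its low GDP per capita. Country A stays in the sample despite its high resource rents because it is an OECD member.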

The next step is to visualize the data at hand. For this, we use the ggplot2 package, which makes it easy to create complex multi-layered graphs.

< info "ggplot2"

To create a new plot you can use the command ggplot. The option data requires a data frame. The aes option manages the aesthetic mappings, that is, how variables in the data are mapped to visual properties of geometric objects.

If you want to use the same aesthetics for all layers you can supply it to the ggplot command as follows.

ggplot(data= df, aes(x=taxes, y=GDP)) + geom_point() + geom_line()

Or you can use different aesthetics for each layer as the following command illustrates.

ggplot(data) + geom_point(aes(x=taxes, y=GDP)) + geom_line(aes(x=taxes, y=employment))

The geometric object geom_point creates a scatter plot and geom_line gives you a line plot by connecting the observations, ordered by x value. There are many more ways to visualize data, e.g. geom_boxplot, geom_histogram, geom_density, etc.

If you want to know more about ggplot2, have a look at the documentation on docs.ggplot2.org.

>

To get a basic understanding of the ggplot2 commands and structure we first create a basic plot and add additional layers step by step later on.

Before looking at the plot, I'd like you to take the following quiz on the empirical relationship between the fraction of self-employed and the tax share collected by the government.

< quiz "self-employment and tax revenue"

question: In your opinion, how is self-employment correlated with tax revenue?
sc:
    - positively
    - negatively*
    - it is uncorrelated
success: Great, your answer is correct!
failure: Try again.

>


Task: Create a simple scatter plot using the geom_point object. Pass tax_take to the data argument of ggplot. Plot the tax revenue per GDP ief_taxpergdp2012 on the y-axis and the measure for self-employed workers wb_selfemp_empselfemp on the x-axis and save the plot into the variable plot1. If you need more help take a look at the info box above for examples. After saving the plot, display it by calling the object plot1.

#< task
# loading the required package
library(ggplot2)
# Use the commented code as template
# variable = ggplot(data=..., aes(x=...,y=...)) + geom_point()

#>
plot1 = ggplot(data = tax_take, aes(x=wb_selfemp_empselfemp, y=ief_taxpergdp2012)) + geom_point()  
plot1
#< hint
display("Just uncomment the code above and replace the ... correctly.")
#>

< award "Plotting Starter"

Congrats! You created your first plot using the ggplot2 package.

>


Apart from the relation between the variables, this very basic plot provides too little information for well-founded statements. Thus, we want to add layers to the saved plot by simply using the + symbol. To add axis labels, we use the layer command labs, and to label some observations, we use geom_text_repel from the ggrepel package. As the name suggests, ggrepel implements functions that repel overlapping labels from each other. A linear regression line is added with the command geom_smooth.


Note: The following plot replicates Figure 2A from the paper.

Task: Click check to extend the previous graph named plot1 with the above-mentioned layers and save the new plot into plot1b.

#< task_notest
#loading the required package
library(ggrepel)

plot1b = plot1 + 
  # adding a linear regression line without confidence interval around smooth
  geom_smooth(method=lm, se=FALSE) +
  # adding axis labeling
  labs(x = "Fraction self-employed", y="Tax/GDP ratio") + 
  # label points according to the variable 'style'
  geom_text_repel(aes(label=ifelse(style == 1 | style == 2 ,Country,''))) + 
  # adding points with different style options
  geom_point(colour = ifelse(tax_take$style == 1 | tax_take$style == 2, "black", "grey")) +
  # Limit the axes
  coord_cartesian(xlim = c(0, 0.8), ylim = c(0,0.5), expand = FALSE) +
  # using a theme with a white background
  theme_bw()

#show the plot
plot1b
#>
#< hint
display("Just click check")
#>
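The line drawn by geom_smooth(method = lm) is an ordinary least squares fit of the y variable on the x variable. As a sketch, the same kind of fit can be computed directly with lm(). The six observations below are hypothetical and only mimic a negative relationship; they are not the values from tax_take.

```r
# Hypothetical data: fraction self-employed and tax/GDP ratio
toy <- data.frame(
  selfemp = c(0.05, 0.10, 0.20, 0.35, 0.50, 0.65),
  tax     = c(0.45, 0.40, 0.33, 0.25, 0.18, 0.12)
)

# Ordinary least squares fit of the tax/GDP ratio on the self-employment share
fit <- lm(tax ~ selfemp, data = toy)

coef(fit)                   # intercept and slope of the fitted line
cor(toy$selfemp, toy$tax)   # correlation coefficient of the two variables
```

A negative slope corresponds to a downward-sloping regression line in the plot. To inspect the fit behind plot1b in the same way, tax_take, wb_selfemp_empselfemp and ief_taxpergdp2012 would take the place of the toy names.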

Now the plot has more explanatory power and shows that the Scandinavian countries are clear outliers because their tax takes are much larger compared to countries with similar levels of self-employment. This suggests that the tax revenue of the Scandinavian countries can be explained partly by the wide coverage of third-party reporting.

The strong negative relationship between the fraction of self-employed workers and the Tax/GDP ratio suggests that third-party information reporting is a key factor influencing a country's tax revenue.

But as with most economic interrelations, one has to consider possible reverse causality. In this particular case, a higher tax/GDP ratio could result in less self-employment if, for example, the tax rates for entrepreneurs are so high that they discourage them from taking business risks.

However, it is also possible that the Scandinavians are outliers because the self-employment measure is an incomplete proxy for self-reporting. To address this issue, it is advisable to take evasive jobs into account as well and revise the measure of self-reporting.

< info "evasive jobs"

Evasive jobs are considered to be labor-intensive services which can easily be provided by a single worker in return for cash outside the third-party reporting system to escape taxation.

"These evasive sectors are defined according to ISIC (International Standard Industrial Classification) codes 4F: construction, 4G: retail, wholesale, and repair of motor vehicles, motorcycles and personal and household goods, 4I: hotels and restaurants, 4S: other service activities, and 4T: employees of private households (nannies, cooks, gardeners, etc.)" (Kleven, 2014)

>

Accordingly, we combine the fraction of self-employed workers and the fraction of workers in evasive jobs into a new proxy for self-reporting. If you want to know how the variables are calculated, take a look at the following info box.

< info "variable of interest 2"

We create a new proxy variable for self-reporting called selfempevasive as follows.

selfempevasive = wb_selfemp_empselfemp + ilo_evasivesectemp

The share of self-employed workers is calculated as wb_selfemp_empselfemp = 1 - wb_employees_empselfemp, as mentioned in the info box "variable of interest 1".

ilo_evasivesectemp: fraction of the workforce that (in part) provides labor-intensive consumer services. (source: ILO)

>

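As before, we exclude non-OECD countries with natural resource rents (wb_natresorents_pct_gdp) of 0.20 or more as well as observations with a GDP per capita (gdppc_ppp) below 5000 or with missing GDP data.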


Task: Click check to perform the data manipulation and assign the data frame to the variable evasive. Again, we are only interested in the latest observations.

#< task

evasive <- data %>%
  filter(OECD == 1 | wb_natresorents_pct_gdp < 0.20) %>%                                               
  filter(gdppc_ppp >= 5000 & !is.na(gdppc_ppp)) %>%
  # create new variable: fraction of self-employed / (self-employed + employees)
  mutate(wb_selfemp_empselfemp = 1 - wb_employees_empselfemp) %>%
  # create new variable: add the fraction of the workforce providing labor-intensive consumer services
  mutate(selfempevasive = wb_selfemp_empselfemp + ilo_evasivesectemp) %>%
  # exclude rows with missing values
  filter(!is.na(selfempevasive) & !is.na(ief_taxpergdp2012)) %>%
  arrange(Country, Code, year) %>%
  group_by(Country,Code) %>%
  # collapse the data frame, latest available observation
  summarise_each_(.,funs(tail(.,n=1)), vars(ief_taxpergdp2012, selfempevasive, wb_selfemp_empselfemp, wb_employees_empselfemp, wb_employers_empselfemp,style)) 

#>
#< hint
display("Just click check")
#>
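
The chunk above relies on summarise_each_() and funs(), which are deprecated in more recent dplyr releases. As a minimal sketch, assuming dplyr 1.0.0 or later is installed, the same "latest available observation per country" logic can be written with slice_tail(). The toy data frame and its values are invented purely for illustration and are not part of the problem set data.

```r
library(dplyr)

# Hypothetical toy panel: two countries observed in several years
toy <- data.frame(
  Country = c("A", "A", "A", "B", "B"),
  year    = c(2010, 2012, 2011, 2009, 2010),
  value   = c(0.30, 0.34, NA, 0.20, 0.22)
)

latest <- toy %>%
  filter(!is.na(value)) %>%     # drop rows with missing values first
  arrange(Country, year) %>%    # sort so that the last row is the latest year
  group_by(Country) %>%
  slice_tail(n = 1) %>%         # keep the last row of each group
  ungroup()

latest
```

Unlike the summarise_each_() call, slice_tail() keeps all columns, so a select() step would be needed to restrict the result to the variables of interest.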

The next step is to visualize the relationship between the new proxy for third-party information reporting and the Tax/GDP ratio to check whether this new measure yields different results than before.

< info "geom_smooth()"

With the geometrical object geom_smooth it is possible to add a smoothed conditional mean to any ggplot.

For example adding a linear smoothed line without showing the confidence interval is done by the following command.

# lm = linear model, se = standard error 
geom_smooth(method=lm, se=FALSE)

If you want to know all possible arguments or see more examples, visit the documentation page about geom_smooth.

>


Task: Create a scatter plot (use the geometrical object geom_point) and save it into plot2. But this time plot the variable selfempevasive on the x-axis and ief_taxpergdp2012 on the y-axis. Add a linear regression line with the command geom_smooth like we did with plot1b. This time use the table evasive as the data source. Show the plot afterward.

#< task
# Just edit the following command 
# variable = ggplot(data = ..., aes(x=...,y=...)) + ... + ...

# don't forget to show the plot afterward
#>
plot2 = ggplot(data = evasive, aes(x=selfempevasive, y=ief_taxpergdp2012)) +
  geom_point() + 
  geom_smooth(method=lm, se=FALSE) 

plot2
#< hint
display("You can use the example in the comments. Just adapt it.")
#>

< award "Plotting advanced"

Great! You created a multi-layered plot.

>

Again, this basic plot only gives a general overview of the correlation. Hence, we need to add more objects, such as labels, to the graph to make it more informative.


Note: The following plot replicates Figure 2B from the paper.

Task: Click check to change the color of some points and to add labels to plot2.

#< task_notest

plot2b = plot2 + 
  # changing the color of some points
  geom_point(colour = ifelse(evasive$style == 1 | evasive$style == 2,"black","grey")) + 
  # adding axis labels
  labs(x = "Fraction of self-employed and employees in evasive jobs", y="Tax/GDP ratio") +
  # labeling points conditional to the variable style
  geom_text_repel(aes(label=ifelse(style == 1 | style == 2 ,Country,''))) + 
  # Setting limits on the coordinate system to make it comparable to plot1
  coord_cartesian(xlim = c(0, 0.8), ylim = c(0,0.5), expand = FALSE) +
  theme_bw()

#plot output
plot2b
#>
#< hint
display("Just click check")
#>

You can see that Mexico and Brazil have a similar fraction of people who are self-employed or work in evasive sectors. It is noteworthy that, although the two fractions are almost the same, the tax revenue of Brazil (relative to GDP) is three times larger than the Mexican one.

The relation between the new self-reporting measure and the tax revenue looks similar to the one in the first plot. To compare the two plots more closely, we put plot1b and plot2b side by side with the grid.arrange() function from the gridExtra package.

< info "grid.arrange()"

The command allows setting up a layout for multiple plots. For example, you can choose the plot objects you want to combine and specify the layout (number of columns and rows).

grid.arrange(plotA,plotB,plotC,ncol=2,nrow=2) # This command arranges the three plots in a 2x2 grid.

>


Task: Arrange plot1b and plot2b side by side (= 2 columns) using the grid.arrange() command. You can see an example in the info box above.

#< task
# loading the needed package
library(gridExtra)
#Type code here

#>
grid.arrange(plot1b, plot2b, ncol=2)
#< hint
display("Take a look at the info box")
#>

Comparing plot1b with plot2b, we can see that the negative relationship strengthens, which suggests that the availability of third-party information has a strong impact on a country's tax take and the tax compliance of its citizens.

This seems to support the results of James Alm et al. (2006), which indicate that tax compliance rates decline with a higher share of non-matched income.

But the exemplary comparison of Brazil and Mexico also suggests that third-party information reporting can explain the Tax/GDP ratio only partly.

To sum up, well-developed third-party reporting, not only for labor income but also for capital gains, is necessary to keep tax evasion rates low. It could additionally be supported by verifiable information trails generated through market transactions. These would include payments by credit card or bank transfer but also contracts with business partners, so that tax authorities could easily obtain information about secret income. (Kleven, 2014)

To establish such information trails, a partial abolishment of cash could help. On this subject, the Harvard economist Kenneth Rogoff argues for a "less-cash society" in his book "The Curse of Cash". To accomplish this, he proposes to phase out large-denomination bills, which facilitate crime, corruption and tax evasion. (Swanson, 2016)

Exercise 2.2 -- Broad tax base

A broad tax base reduces economic distortions and may therefore lead to a more efficient allocation. It also helps to keep tax avoidance low, because it limits households' ability to avoid taxes through income-shifting or income-timing.

For every change in tax policy, it is important to consider the elasticity of taxable income (ETI) which can be used to calculate the revenue effects of tax rate changes and to further assess economic effects of changes in tax policy.

< info "ETI and Laffer Curve"

In contrast to the standard labor supply model, the ETI takes into account that individuals can respond to taxation on margins other than hours of work, such as the form and timing of compensation, tax avoidance or tax evasion. (Saez, Slemrod, and Giertz, 2009)

The elasticity of taxable income with respect to the net-of-tax rate is the percentage change in reported taxable income when the net-of-tax rate increases by 1%. For example, an elasticity of 0.4 means that for each 1% fall in the net-of-tax rate, reported taxable income decreases by 0.4%.

The formula for calculating the ETI with respect to the net-of-tax rate is:

$$e = \frac{1 - \tau}{z} \frac{\partial z}{\partial \left(1 - \tau\right)}$$

$\tau : tax~rate$

$e : elasticity~of~taxable~income$

$z : taxable~income$
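
As a rough numerical illustration of this definition, the elasticity can be approximated from a discrete change in the net-of-tax rate. The income levels and tax rates below are invented for the example and are chosen so that the result matches the elasticity of 0.4 mentioned above.

```r
# Hypothetical numbers, chosen only to illustrate the formula
tau_old <- 0.40
tau_new <- 0.406   # the tax rate rises, so 1 - tau falls by 1%
z_old   <- 50000   # reported taxable income before the change
z_new   <- 49800   # reported taxable income after the change

# relative changes of the net-of-tax rate and of taxable income
rel_change_net_of_tax <- ((1 - tau_new) - (1 - tau_old)) / (1 - tau_old)
rel_change_income     <- (z_new - z_old) / z_old

# discrete approximation of e = ((1 - tau) / z) * dz / d(1 - tau)
eti <- rel_change_income / rel_change_net_of_tax
eti
```

Here a 1% fall in the net-of-tax rate goes along with a 0.4% fall in reported income, which gives an elasticity of about 0.4.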

The higher the ETI the more distortionary are the changes in tax policy. If the ETI with respect to the marginal tax rate equals 1 then the top of the Laffer Curve has been reached and the marginal rate is at the optimum, maximizing the tax revenue. (Fieldhouse, 2013)

Basically, the Laffer Curve illustrates the tradeoff between tax rates and tax revenues. Arthur Laffer argued that raising taxes has not only an arithmetic but also an economic effect on tax revenues. The economic impact of low tax rates on work, output, and employment is positive, which provides incentives for businesses, private households, and investors to increase their taxable activities. High tax rates, in contrast, can be seen as surcharges on these economic activities.

This means that the more taxes have to be paid, the fewer investments businesses can make, and workers lose incentives to work harder if they have to give up a larger portion of their paychecks. Furthermore, taxpayers' motivation to protect income from taxation rises with increasing tax rates. This is especially feasible for capital gains, as they can easily be moved abroad, where lower tax rates are levied. These circumstances lead to a reduction in the revenue the government receives.

Have a look at the following graph, which depicts an exemplary Laffer Curve with the possible tax revenues on the y-axis and tax rates ranging from 0 to 100% on the x-axis. The exact shape of the curve is unknown. A parabolic shape is often assumed, but there is no reason why this should necessarily be the case.

# the following code creates an exemplary Laffer Curve
tax_rates = seq(0,1,0.1) # create a sequence from 0 to 1 in steps of 0.1
revenue = -(tax_rates^2 - tax_rates)
plot(tax_rates,revenue,type = "b")

As you can see, the tax revenue is zero for tax rates of both 0% and 100%. In this case, the revenue-maximizing tax rate (= $t^*$) is 50%. From the lower levels up to the optimal tax rate, the arithmetic effects on tax revenues outweigh the negative behavioral effects. On the right side of the curve (tax rates higher than $t^*$ = 0.5), households and businesses decrease their economic activity or are more tempted to lower their tax liability through some combination of tax avoidance and tax evasion. This means that the government could increase tax revenue by lowering taxes until it reaches $t^*$.

The changing slope illustrates the change in behavior and taxable activity: raising the tax rate from 10% to 20% increases revenue by about 0.07, while raising it from 40% to 50% only yields 0.01 more in tax revenue. That's because the mechanical effect is partly offset by the behavioral response of taxpayers.
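
The numbers quoted above can be reproduced from the stylized curve. The following sketch recomputes the curve so that it runs on its own; the quadratic form is the same illustrative assumption as in the plot and not an empirical Laffer Curve.

```r
tax_rates <- seq(0, 1, 0.1)
revenue   <- tax_rates * (1 - tax_rates)  # same as -(tax_rates^2 - tax_rates)

# revenue gained by each 10-percentage-point increase in the tax rate
gain <- diff(revenue)
gain_10_to_20 <- gain[2]   # from 10% to 20%
gain_40_to_50 <- gain[5]   # from 40% to 50%
round(c(gain_10_to_20, gain_40_to_50), 2)

# revenue-maximizing rate on this grid
t_star <- tax_rates[which.max(revenue)]
t_star
```

The gains shrink as the tax rate approaches the revenue-maximizing rate and turn negative beyond it.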

To sum up: How much tax can be raised also depends on the behavioral responses to tax rates. The Laffer Curve can be used to estimate a revenue-maximizing rate, but the ETI with respect to the marginal rate must be taken into account. The higher the ETI, the larger the losses in economic efficiency. This means that lower levels of ETI allow higher taxation.

>

For example, Gruber and Saez (2002) found an ETI of 0.57 for taxable income (after deductions) but an elasticity of only 0.17 for broad income (before deductions) in the US, which implies that reported taxable income is less affected by tax rate changes when the tax base is broader. These findings lead Fieldhouse (2013) to call for a broader tax base combined with higher top rates.

To broaden the tax base, Fieldhouse (2013) suggests eliminating avoidance strategies through stricter tax enforcement, fewer deductions, exclusions, credits and exemptions, and less preferential treatment of capital income over labor income.

Such reforms would not only raise tax revenues but also respect the principle of horizontal equity, meaning that people with the same income do not pay significantly different effective tax rates. (Fieldhouse, 2013)

Interim result: Low levels of ETI and a near-absence of tax evasion can be achieved through wide coverage of third-party information reporting, and tax avoidance can be kept low through broad tax bases that offer few options to minimize tax liability.


We have now covered the terms "tax evasion" and "tax avoidance", which are often used interchangeably. Answer the following quiz question.

< quiz "evasion vs avoidance"

question: What is the difference between "tax evasion" and "tax avoidance"?
sc:
    - There is no difference
    - Tax evasion is an illegal practice and tax avoidance is the legitimate minimizing of taxes*
    - Tax evasion is a legal way to minimize the tax liability whereas tax avoidance is illegal
success: Great, your answer is correct!
failure: Try again.

>

< award "Quizmaster Lvl. 1"

>

Exercise 2.3.1 -- Subsidization of goods complementary to working (graphics)

To assess the efficiency of a tax system, it is indispensable to consider how the revenue is spent. The Scandinavian countries spend large amounts on means-tested transfer programs, which create additional implicit taxes on working. But at the same time, they spend huge amounts on the public provision and subsidization of goods that are complementary to working, including child care and elderly care, which reduce the costs of market work. (Kleven, 2014)

In this problem set, we want to focus on the extensive margin of labor supply (a measure of how many people work), which is a key measure for understanding aggregate labor supply.

First of all, we look at the distortions of labor participation induced by taxes and transfers. To do so, we plot the total employment rate against the net-of-tax rate using the participation tax rate (see info box in chapter 1).


Task: Click check to filter the data set and get the most recent observation of each country and assign the resulting table to the variable labor.

#< task
# loading the data
data = readRDS("data_condensed.rds")
# loading the package
library(dplyr)

labor <- data %>%
  # exclude non-OECD members and observations with missing values:
  filter(OECD == 1 & !is.na(oecd_emp20to59) & !is.na(oecd_entrytax_emp20to59)) %>%
  # sort ascending:
  arrange(Country, Code, year) %>%
  group_by(Country,Code) %>%
  # summarise: get the last available row of each Country:
  summarise_each_(.,funs(tail(.,n=1)), vars(oecd_emp20to59, oecd_entrytax_emp20to59, year,style)) %>%
  # calculate the net-of-tax rate:
  mutate(oecd_entrytax_emp20to59 = 1 - oecd_entrytax_emp20to59)

#>
#< hint
display("Just click check")
#>


But now let's see what story the data tells. To this end, we plot the total employment rate among 20 to 59-year-olds against the net-of-tax rate.


Note: The following plot replicates Figure 4A from the paper.

Task: Click check to create and show the plot.

#< task
library(ggplot2)
library(ggrepel)

plot3 = ggplot(data = labor, aes(x=oecd_entrytax_emp20to59, y=oecd_emp20to59)) +
  geom_point(colour = ifelse(labor$style == 1 | labor$style == 2,"black","grey")) + 
  geom_smooth(method=lm, se=FALSE) +
  labs(x = "1 - participation tax rate", y="Employment rate") +
  geom_text_repel(aes(label=ifelse(style == 1 | style == 2 ,Country,''))) +
  coord_cartesian(xlim = c(0, 0.75), ylim = c(0.5,0.87), expand = FALSE) +
  theme_bw()

#show the plot
plot3
#>
#< hint
display("Just click check")
#>

Contrary to what one might expect, the two variables are negatively correlated across countries. The Scandinavian countries in particular impose large participation tax rates through taxes and social contributions and nevertheless feature high employment. The graph certainly does not allow a statement about causality, but it could point to other factors that confound the extensive-margin labor supply responses, and it may suggest that the structure of spending can either alleviate or reinforce tax distortions.


< quiz "Graph 4 b"

question: What will the slope of the regression line look like if we just consider female employment?
sc:
    - it will be flatter
    - it will be steeper than the one of the total employment rate vs. net-of-tax-rate*
success: Great, your answer is correct!
failure: Try again.

>

< award "Quizmaster Lvl. 2"

>

In the next graph, we want to examine the relation between the net-of-tax rate and the female employment rate. Therefore, we have to subset the data once again.


Task: Click check to create a subset for the female employment data and assign it to labor_f.

#< task
labor_f <- data %>%
  # excluding observations with missing values
  filter(!is.na(oecd_elderlycare) & !is.na(wb_laborforce_rate) & !is.na(oecd_entrytax_emp20to59) & !is.na(oecd_f_emp20to59)) %>%
  # exclude non-OECD countries
  filter(OECD == 1) %>%
  # calculating the net-of-tax rate
  mutate(oecd_entrytax_emp20to59 = 1 - oecd_entrytax_emp20to59) %>%
  # express subsidies as a share of aggregate labor income (divide by the labor share of GDP)
  mutate(laborsubsidy_share = laborsubsidy_share / pwt_labshr_gdp) %>%
  mutate(oecd_elderlycare = oecd_elderlycare / pwt_labshr_gdp) %>%
  mutate(oecd_childcarepresch = oecd_childcarepresch / pwt_labshr_gdp) %>%
  arrange(Country, Code, year) %>%
  group_by(Country,Code) %>%
  summarise_each_(.,funs(tail(.,n=1)), vars(oecd_f_emp20to59, oecd_entrytax_emp20to59, laborsubsidy_share, oecd_elderlycare, oecd_childcarepresch, pwt_labshr_gdp,style, year))
#>
#< hint
display("Just click check")
#>

This time we plot the female employment rate among 20 to 59-year-olds against the net-of-tax rate on participation to check whether the slope for the female employment rate shows a different picture.


Note: The following plot replicates Figure 4B from the paper.

Task: Click check to create and show the plot.

#< task

plot4 = ggplot(data = labor_f, aes(x=oecd_entrytax_emp20to59, y=oecd_f_emp20to59)) +
  geom_point(colour = ifelse(labor_f$style == 1 | labor_f$style == 2,"black","grey")) + 
  geom_smooth(method=lm, se=FALSE) +
  labs(x = "1 - participation tax rate", y="Female employment rate") +
  geom_text_repel(aes(label=ifelse(style == 1 | style == 2 ,Country,''))) + 
  coord_cartesian(xlim = c(0, 0.75), ylim = c(0.5,0.87), expand = FALSE) +
  theme_bw()

#the next line prints plot4
plot4
#>
#< hint
display("Just click check")
#>

Both plots show a similar relationship between the net-of-tax rate and the employment rate. For a better comparison, we want to put the plots next to each other.

Task: Use the grid.arrange() function to create a two-column layout for plot3 and plot4.

#< task
# loading the needed package
library(gridExtra)
#Type code here
#grid.arrange(... , ... , ncol = ...)
#>
grid.arrange(plot3, plot4, ncol=2)
#< hint
display("Adapt the code in comments")
#>

For female employment, the negative correlation between the tax-transfer incentives and employment is even stronger, although women are considered to be more responsive to such incentives. It is also noticeable that the right graph features a higher dispersion around the blue line.

It is important to note that many factors affect the employment rate. Comparing Italy with the UK, you can see that, although they have almost the same PTR, the difference in employment is substantial. This suggests that the impact of taxes on employment is confounded by other influences. In this case, the figures are from the year 2009 and could indicate that the UK came through the 2007/2008 crisis more quickly than Italy because of a more robust economy.

The relationships in the two graphs above stand in contrast to most of the macro literature, which claims that labor supply is positively correlated with net-of-tax rates. Kleven's data, however, would imply a strongly negative elasticity of labor supply (at the extensive margin). The contrasting results can be explained by the fact that previous macro studies neglected the effect of means-tested transfers on the effective distortion of labor supply and used different time periods. This is demonstrated by Kleven (2014) in Figure A2.


To get a better understanding of the tax-transfer distortions, we want to analyze the interaction of employment and the participation subsidies (non-tax incentives) which consist of provision of childcare, preschool and elderly care. These subsidies should lower prices of goods that are complementary to working and therefore positively influence the labor supply.


Task: Click check to subset the data and save it into empl_subs.

#< task
empl_subs <- data %>%
  # exclude rows with missing values in the following columns
  filter(!is.na(laborsubsidy_share) & !is.na(oecd_emp20to59) & !is.na(oecd_f_emp20to59)) %>%
  # filter for OECD countries
  filter(OECD == 1) %>%
  # calculate labor subsidy share as a fraction of labor income share
  mutate(laborsubsidy_share = laborsubsidy_share / pwt_labshr_gdp) %>%
  arrange(Country, Code, year) %>%
  group_by(Country,Code) %>%
  summarise_each_(.,funs(tail(.,n=1)), vars(Country, Code, year, laborsubsidy_share, oecd_emp20to59, pwt_labshr_gdp, oecd_childcarepresch, oecd_elderlycare, style))

#>
#< hint
display("Just click check")
#>

< info "variable of interest 3"

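The share of labor subsidies in aggregate labor income is calculated by dividing the original variable laborsubsidy_share by the labor share of GDP (pwt_labshr_gdp): $laborsubsidy~share_{labor~income} = \frac{laborsubsidy~share}{labor~share~of~GDP}$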

Labor subsidy share is defined as public expenditures on child-care, preschool and elderly care.

>

Now, we translate the data about employment and subsidies into an expressive graph. As before, we first consider the total employment rate and compare it with the female employment later.


< quiz "Graph 5"

question: How are the labor subsidies correlated with the employment rates?
sc:
    - negatively
    - positively*
    - they are uncorrelated
success: Great, your answer is correct!
failure: Try again.

>

< award "Quizmaster Lvl. 3"

Great Job!

>

Note: The following plot replicates Figure 5A from the paper.

Task: Click check to create the plot.

#< task

plot5 = ggplot(data = empl_subs, aes(x=laborsubsidy_share, y=oecd_emp20to59)) +
  geom_point(colour = ifelse(empl_subs$style == 1 | empl_subs$style == 2,"black","grey")) + 
  geom_smooth(method=lm, se=FALSE) +
  labs(x = "Participation subsidies (share of labor income)", y="Employment rate") +
  geom_text_repel(aes(label=ifelse(style == 1 | style == 2 ,Country,''))) + 
  theme_bw()

#the next line prints plot5
plot5
#>
#< hint
display("Just click check")
#>

This figure shows the correlation between participation subsidies and the total employment rate. As expected, the two variables are positively correlated meaning that higher public support for childcare, preschool and elderly care is associated with a higher employment rate.

Again, the Scandinavian countries are outliers, as they spend more on participation subsidies than any other country. Denmark spends about 5% of aggregate labor income, while Norway and Sweden are located at about 6%.

If you bear in mind that Italy and the UK showed similar levels of PTR, it is particularly striking that their spending on subsidies differs strongly. This could indicate that it makes a difference how the tax revenues are spent.


In the next figure, we want to take a look at the cross-country relationship between female employment rate and the participation subsidies, because it can be speculated that women have a greater demand for these subsidies. Therefore, we create a new subset and assign the data to empl_subs_f.

Task: Click check to subset the data and save it into empl_subs_f.

#< task

empl_subs_f <- data %>%
  # exclude rows with missing values in the following columns
  filter(!is.na(laborsubsidy_share) & !is.na(oecd_emp20to59) & !is.na(oecd_f_emp20to59)) %>%
  # filter for OECD countries
  filter(OECD == 1) %>%
  mutate(laborsubsidy_share = laborsubsidy_share / pwt_labshr_gdp) %>%
  arrange(Country, Code, year) %>%
  group_by(Country,Code) %>%
  summarise_each_(.,funs(tail(.,n=1)), vars(Country, Code, year, laborsubsidy_share, oecd_f_emp20to59, pwt_labshr_gdp, oecd_childcarepresch, oecd_elderlycare, style))

#>
#< hint
display("Just click check")
#>


Note: The following plot replicates Figure 5B from the paper.

Task: Click check to create the plot.

#< task

plot6 = ggplot(data = empl_subs_f, aes(x=laborsubsidy_share, y=oecd_f_emp20to59)) +
  geom_point(colour = ifelse(empl_subs_f$style == 1 | empl_subs_f$style == 2,"black","grey")) + 
  geom_smooth(method=lm, se=FALSE) +
  labs(x = "Participation subsidies (share of labor income)", y="Female employment rate") +
  geom_text_repel(aes(label=ifelse(style == 1 | style == 2 ,Country,''))) + 
  theme_bw()

#the next line prints plot6
plot6
#>
#< hint
display("Just click check")
#>

The figure considering only the female employment rate tells a similar story as plot5. Higher participation subsidies seem to encourage labor supply. This applies especially to women, as the slope of the line is greater than the one for total employment. This is in line with the finding of a research study that women provide more elder care to aging parents than men. (Grigoryeva, 2014)

Overall, it seems that the effects of taxation on employment are not independent of how the money is spent. This means that non-tax incentives like social programs boost labor force participation and therefore help to reduce tax distortions.

Since working families have a greater need for these programs the positive influence of the labor subsidy share on the employment rate is not surprising.

Exercise 2.3.2 -- Subsidization of goods complementary to working (regressions)

Note: Regressions are not part of the paper this problem set is based on. I added some to demonstrate how you can run regressions in R.

To further investigate the relationship between the participation tax rate and the employment rate we want to perform linear regressions on the whole time series of our data set. But as we are interested in the responses of taxpayers to changes in the net-of-tax rate we use logarithmic values in order to obtain elasticity estimates.

Thus, our first regression model looks as follows.

$$\log(employment~rate)~=~\alpha~+~e~\cdot~\log(1~-~PTR)~+~\epsilon$$

where $\alpha$ is the intercept, $e$ the elasticity estimate and $\epsilon$ the error or disturbance term.

Before we perform any regression, we need to prepare our data. First of all, we calculate the net-of-tax rate and exclude all observations with missing values in the columns needed for the regressions.


Task: Click check to calculate the net-of-PTR, filter the data and assign it to reg_data1.

#< task
# loading the data
data = readRDS("data_condensed.rds")

#loading the package
library(dplyr)

reg_data1 <- data %>%
  select(Country,year,oecd_emp20to59,oecd_f_emp20to59,oecd_entrytax_emp20to59,oecd_entrytax_emp20to59_notransf)  %>%
  mutate(oecd_entrytax_emp20to59 = 1 - oecd_entrytax_emp20to59) %>%
  # excluding rows with missing values in the following columns
  filter(!is.na(oecd_entrytax_emp20to59) & !is.na(oecd_emp20to59) & !is.na(oecd_f_emp20to59) & !is.na(oecd_entrytax_emp20to59_notransf))

#>
#< hint
display("Just click check")
#>

< info "Linear Regression with lm()"

The assumption of the linear regression model is that the dependent variable (regressand) is linearly related to one or more independent variables (regressors). The typical procedure to fit the regression line is called the ordinary least squares (OLS) method.

A simple linear regression model has the form: $Y = \beta_0 + \beta_1 X + \varepsilon$

We then choose $\beta_0$ and $\beta_1$ such that the sum of squared errors is minimized: $\sum \limits_{i=1}^n \varepsilon^2_i \overset{!}{=} \min$

As stated by Wooldridge (2012), the OLS estimator in cross-sectional analysis is unbiased under the following assumptions (note that (a) refers to the simple linear regression and (b) to the multiple linear regression):

  1. The regression model is linear in the parameters $\beta_i$

  2. We have a random sample

3a. We have sample variation in the explanatory variable, which means that the values of x are not all the same

3b. None of the independent variables is constant and there are no exact linear relationships among the independent variables.

  4. The error has an expected value of zero given any values of the explanatory variables: $E(u|x_1, x_2, ..., x_k) = 0$

  5. The error u has the same variance given any values of the explanatory variables

Under assumptions 1 through 5, we can additionally conclude that the OLS estimators have the smallest variance among all linear unbiased estimators. But note that the 5th assumption has no effect on unbiasedness, which only requires assumptions 1 through 4.

The function lm() can be used to carry out a linear regression. As default, it uses the OLS method. The required arguments to lm are data and formula where the regression model is specified.

An example of a simple linear regression model:

lm(formula = y ~ x, data = dat)
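
To see that lm() indeed returns the OLS estimates, the following sketch compares its output with the closed-form solution of the simple linear regression. The data frame dat is simulated here and only serves as an illustration.

```r
set.seed(1)                        # simulated data, for illustration only
dat <- data.frame(x = runif(50))
dat$y <- 2 + 3 * dat$x + rnorm(50, sd = 0.5)

fit <- lm(formula = y ~ x, data = dat)

# closed-form OLS estimates for the simple linear regression
b1 <- cov(dat$x, dat$y) / var(dat$x)      # slope
b0 <- mean(dat$y) - b1 * mean(dat$x)      # intercept

# both rows should contain the same intercept and slope
rbind(lm = coef(fit), by_hand = c(b0, b1))
```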

>

As this is the first regression, the commands are already given. As you can see, we use a pooled OLS regression, which treats all data points as separate observations, even though we have panel data that consist of many countries over multiple time periods, so that we cannot assume that the observations are independently distributed across time. According to Wooldridge (2012), this most likely causes serial correlation of the residuals ($corr(\epsilon_t,\epsilon_{t-1}) \neq 0$), which could have the following consequences:

  1. smaller estimated variance for $\hat{\beta}$
  2. decrease in $se(\hat{\beta})$ and increase of the t-statistics
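
What serial correlation of the residuals looks like can be sketched with simulated data. The example below is not based on the problem set data: it generates a single time series with AR(1) errors (the autoregressive coefficient of 0.8 is an arbitrary assumption) and computes the correlation between each residual and its predecessor.

```r
set.seed(42)                       # simulated data, not the problem set data
n   <- 200
x   <- rnorm(n)
eps <- as.numeric(arima.sim(model = list(ar = 0.8), n = n))  # AR(1) errors
y   <- 1 + 0.5 * x + eps

fit <- lm(y ~ x)
res <- resid(fit)

# correlation between each residual and the residual of the previous period
lag1_cor <- cor(res[-1], res[-n])
lag1_cor
```

A value clearly different from zero indicates serial correlation. With panel data, such a check would have to be done within each country rather than across the stacked observations.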


Task: Click check to perform two log-log regressions on the response variables oecd_emp20to59 and oecd_f_emp20to59 using the explanatory variable oecd_entrytax_emp20to59, which is the net-of-tax rate on participation $1 - \tau$.

#< task
reg_PTR1 = lm(formula = log(oecd_emp20to59) ~ log(oecd_entrytax_emp20to59), data = reg_data1)
reg_PTR2 = lm(formula = log(oecd_f_emp20to59) ~ log(oecd_entrytax_emp20to59), data = reg_data1)
#>
#< hint
display("Just click check")
#>

To interpret the results correctly, it is necessary to be aware of how the elasticity is calculated:

$$e = \dfrac{\dfrac{\Delta~employment~rate}{employment~rate}}{\dfrac{\Delta~(1-\tau)}{(1-\tau)}}$$

Here $\tau$ denotes the participation tax rate (PTR), so that $1 - \tau$ is the net-of-tax rate on participation.

Now we want to generate an output of the regression results but instead of using the summary() command we want to utilize the showreg() function from the package regtools to show both regression results side by side.


Task: Click check to use the function showreg() to display the regression result of reg_PTR1 and reg_PTR2

#< task
# loading the package
library(regtools)

showreg(list(reg_PTR1, reg_PTR2), 
        custom.model.names = c("total employment", "female employment"),
        digits = 7, robust = FALSE)
#>
#< hint
display("Just click check")
#>

Looking at the output of regressing the employment rate on the predictor (1 - PTR), we see statistically significant evidence (three stars) of a linear association between the log of the employment rate and the log of the net-of-tax rate in both regressions. But one has to be careful not to confuse correlation with causation.

In this case, we don't want to pay much attention to the intercept. It corresponds to log(1 - PTR) = 0, i.e. to a participation tax rate of zero, which is not a relevant case here, so its interpretation is not meaningful.

For female employment, the estimated elasticity is -0.175 and for total employment rate, the estimated elasticity is -0.086. This would mean that if the net-of-tax rate increases by 1% the employment rate decreases by 0.175% and 0.086%, respectively.
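
To make this interpretation concrete, the following sketch applies the two estimates quoted above to a hypothetical 10% increase in the net-of-tax rate (the size of the change is an assumption chosen for illustration). For such a larger change, the simple "elasticity times percentage change" rule is only an approximation of what the log-log model implies.

```r
# elasticity estimates quoted in the text
e_total  <- -0.086
e_female <- -0.175

# hypothetical scenario: the net-of-tax rate rises by 10%
rel_change <- 0.10

# first-order approximation: relative change in employment = e * relative change
approx_total  <- e_total  * rel_change
approx_female <- e_female * rel_change

# exact change implied by the log-log model: (1 + rel_change)^e - 1
exact_total  <- (1 + rel_change)^e_total  - 1
exact_female <- (1 + rel_change)^e_female - 1

# results in percent
round(100 * c(approx_total = approx_total, exact_total = exact_total,
              approx_female = approx_female, exact_female = exact_female), 2)
```

For a 1% change both calculations practically coincide, which is why the coefficients of a log-log regression are usually read directly as elasticities.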

This seems counterintuitive because one would expect that letting employees keep more of their income has a positive influence on the extensive margin of labor supply. As we have not controlled for anything else, the result could also indicate omitted-variable bias caused by important factors missing from our model.

After interpreting the regression results, we want to check the assumption of homoscedasticity by examining the vertical range in the residual plot. In a well-behaved residual plot, the residuals are scattered randomly around the zero line, form a "horizontal band" and feature no outliers.


Task: Click check to plot the residual plot of our regression.

#< task
plot(reg_PTR1,1)
#>
#< hint
display("Just click check")
#>

Unfortunately, the assumption of homoscedasticity is violated, as the residuals are close to zero for larger values of oecd_entrytax_emp20to59 and more spread out for small values. Despite the presence of heteroskedasticity, the OLS estimators of the coefficients are unbiased, but the violation of the homoscedasticity assumption can invalidate inferences such as significance tests (Long and Ervin, 2000). To reduce these effects of heteroskedasticity on inference, we will later apply clustered standard errors to our regression output, which are also heteroskedasticity-consistent.

Note: The residual plot of the second regression reg_PTR2 shows an identical pattern suggesting that there is heteroskedasticity.

To check if the error terms are normally distributed we use the normal probability plot of the residuals.


Task: Click check to plot the normal probability plot of the residuals.

#< task
plot(reg_PTR1,2)

#>
#< hint
display("Just click check")
#>

Since the relationship between the theoretical quantiles and the sample quantiles is approximately linear except in the heavy tails, we can conclude that the error terms are only approximately normally distributed. Because of the large sample size, this slight deviation from normality is less of a concern: by the central limit theorem, the OLS estimators are approximately normally distributed in large samples even if the error terms are not. Further, Pallant (2007) states that with large sample sizes the violation of the normality assumption should not cause major problems.


But as we have panel data at hand, we should be aware of two possible problems: endogeneity (one or more explanatory variables are correlated with the error term, i.e., $\mathbb{E}[ \varepsilon | x ] \neq 0$) and autocorrelation in the errors. Therefore, we want to plot the residuals of only one country to investigate whether a trend is present.

< info "autocorrelation"

Serial independence: The assumption that the error terms $\varepsilon_t$ and $\varepsilon_s$ for different observations $t$ and $s$ are independently distributed is often violated in time series data.

Serial correlation: In time series, the error terms $\varepsilon_t$ and $\varepsilon_s$ for $t \neq s$ are frequently correlated.

>


In the next task, we save the residuals and the fitted values from the regression object reg_PTR1 into the data frame reg_data1 and create a subset containing only observations of Germany.

Task: Click check to add the residuals and fitted values to reg_data1 and to create a new data frame deu containing only observations of Germany.

#< task

# saving the regression results into reg_data1
reg_data1$resid = reg_PTR1$residuals
reg_data1$fit = reg_PTR1$fitted.values

# creating a subset using the filter command
deu = filter(reg_data1, Country == "Germany")
#>
#< hint
display("Just click check")
#>


As mentioned before we want to examine the time series of Germany for a trend. To plot the residuals against the time periods we use the basic plot() command.

Task: Create a plot of the residuals for Germany on the y-axis against the year on the x-axis. Use the data frame deu as the data source. Remember, the residuals are stored in the variable resid.

#< task
# plot(x= ... , y = ...)
#>
plot(x=deu$year,y=deu$resid)
#< hint
display("Adapt the commented code.")
#>

This graph suggests, as a trend is identifiable, that the residual of one year depends on its past values. Autocorrelated error terms violate an assumption of the Gauss-Markov theorem, so the OLS estimator is no longer efficient, i.e. it no longer has the smallest variance among the linear unbiased estimators, and the conventional standard errors are unreliable.
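
A visual trend can be complemented by a simple number. The following is a minimal sketch on simulated data (the series `eps` and the value `rho = 0.8` are assumptions for illustration, not the tutorial's residuals): it correlates each error with its predecessor to estimate the first-order autocorrelation.

```r
# Simulated AR(1) errors (hypothetical data, not the residuals of reg_PTR1)
set.seed(42)
n <- 200
rho <- 0.8                        # assumed true serial correlation
eps <- as.numeric(arima.sim(model = list(ar = rho), n = n))

# Lag-1 sample correlation: correlate each error with its predecessor
rho_hat <- cor(eps[-1], eps[-n])
rho_hat
```

Applied to the Germany subset, the same idea would be `cor(deu$resid[-1], deu$resid[-nrow(deu)])`, assuming the rows of deu are sorted by year. A value far from zero would support the impression from the plot.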

Further, we assume that unobservable factors are not time-invariant which means that these country-specific factors are likely to change over time. This seems plausible as we observe a period from 1990 to 2011.

Therefore, we want to use cluster-robust errors at the Country level to allow for the error terms to be correlated within a cluster, but still assume that they are not correlated between countries. Furthermore, we want to relax the homoskedasticity assumption and account for the fact that there might be unobservable characteristics between countries leading to heteroskedasticity as we have encountered in the residual vs. fitted values plot before.

Now we regress using the felm function from the lfe package. In the function, we use the clustered error option which is defined after the third vertical bar (|). We need the clustered standard errors to account for deviation from the assumptions (homoscedasticity and independence of residuals). The cluster-robust errors allow for heteroskedasticity and autocorrelation within an entity (here country) but treat the errors as uncorrelated across countries (Stock and Watson 2010, p. 364).
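
To make the idea of cluster-robust errors more concrete, here is a rough sketch of the underlying "sandwich" calculation in base R. It uses simulated data with hypothetical names (`country`, `x`, `y`) and omits the finite-sample corrections that felm applies, so it illustrates the principle rather than reproducing felm's exact numbers.

```r
# Simulated panel: 20 hypothetical countries with 10 years each
set.seed(1)
G <- 20; Tn <- 10
country <- rep(seq_len(G), each = Tn)
x <- rnorm(G * Tn) + rep(rnorm(G), each = Tn)   # regressor with a country component
u <- rnorm(G * Tn) + rep(rnorm(G), each = Tn)   # errors correlated within a country
y <- 1 + 0.5 * x + u

fit <- lm(y ~ x)
X <- model.matrix(fit)
e <- residuals(fit)
bread <- solve(crossprod(X))                    # (X'X)^-1

# "Meat": sum over clusters of (X_g' e_g)(X_g' e_g)'
scores <- rowsum(X * e, group = country)        # one row of X_g' e_g per country
meat <- crossprod(scores)

V_cluster <- bread %*% meat %*% bread           # no small-sample correction
se_cluster <- sqrt(diag(V_cluster))
se_ols <- sqrt(diag(vcov(fit)))
cbind(se_ols, se_cluster)
```

The point estimates of `fit` are untouched by this procedure; only the estimated variance of the coefficients changes.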

< info "formula specification in felm"

A formula in felm is specified e.g. y ~ x1 + x2 | f1 + f2 | (Q|W ~ x3+x4) | clu1 + clu2

The first part is the normal linear model y ~ x1 + x2 as in the lm function. In the second part, f1 + f2, you can specify the variables (fixed effects) you want to project out. The third part allows you to define instrumental variables. In the last part, you can set the cluster specification for the standard errors (clu1 + clu2).

>


Task: Click check to perform the regression with clustering on country level.

#< task
# loading the required package
library(lfe)

clu1 = felm(formula = log(oecd_emp20to59) ~ log(oecd_entrytax_emp20to59) | 0 | 0 | factor(reg_data1$Country), data = reg_data1)
#>
#< hint
display("Just click check")
#>

To compare the results with the previous regression on the total employment rate reg_PTR1, we use once again the showreg function.

Task: Click check to generate the comparing regression output utilizing showreg.

#< task
library(regtools)

showreg(list(reg_PTR1, clu1), custom.model.names = c("total employment", "clustered se"), 
        digits = 7, robust = FALSE)

#>
#< hint
display("Just click check")
#>

As you can see, the estimated coefficients do not change but the cluster robust errors are much larger than the standard errors from the regular OLS. Still, the coefficients seem to be significantly different from zero which is indicated by the three stars behind the estimates.


The previous regressions only considered the effect of the net-of-tax rate on the employment rates. But, as we have already discovered in Exercise 2.3.1, there are other influencing factors that should be taken into account. To investigate this further, we want to run a multiple linear regression.

First, we create a new subset of our data set.

Task: Click check to create the subset and save it into reg_new.

#< task
reg_new = data %>%
  select(Country,year,oecd_emp20to59,oecd_f_emp20to59,oecd_entrytax_emp20to59,pss_topmitr, laborsubsidy_share,pwt_labshr_gdp) %>%
  # exclude observations with missing values
  na.omit() %>%
  # laborsubsidy share as a fraction of labor income of gdp
  mutate(laborsubsidy_share = laborsubsidy_share / pwt_labshr_gdp) %>%
  # calculate the net-of-tax rate
  mutate(oecd_entrytax_emp20to59 = 1 - oecd_entrytax_emp20to59)

#>
#< hint
display("Just click check")
#>

Before regressing, it is a good idea to investigate the relationship among our variables (oecd_emp20to59, oecd_entrytax_emp20to59, pss_topmitr, laborsubsidy_share). Therefore, we use a scatter plot matrix, which contains a scatter plot of each pair of variables.

Task: Use the pairs function to generate a scatter plot matrix. Just replace the placeholder with the missing variables.

#< task
# pairs(~oecd_emp20to59+ ... + pss_topmitr + ... ,data=reg_new)


#>
pairs(~oecd_emp20to59+oecd_entrytax_emp20to59+pss_topmitr+laborsubsidy_share,data=reg_new)

#< hint
display("Adapt the code")
#>

As you can see, the scatter plots help to determine whether there is a linear correlation between variables. It seems that pss_topmitr and oecd_entrytax_emp20to59 are linearly correlated. Also, laborsubsidy_share is correlated with oecd_entrytax_emp20to59. To verify this, we can calculate the pairwise correlation coefficients. This is helpful for detecting multicollinearity, which is the condition where two or more predictor variables are highly correlated. The problem with multicollinearity is that the estimates can change drastically in response to modifications of the model (e.g. adding or excluding explanatory variables) or of the data (another sample of the population). Note that the standard errors of the coefficients of collinear predictors tend to be large.

Task: Use the cor function to calculate the correlation coefficients of the predictor variables oecd_entrytax_emp20to59, pss_topmitr and laborsubsidy_share. You just have to fill the placeholders correctly.

#< task
#cor(reg_new$oecd_entrytax_emp20to59,reg_new$pss_topmitr)
#cor(reg_new$oecd_entrytax_emp20to59, ...)
#cor(reg_new$pss_topmitr, ...)

#>
cor(reg_new$oecd_entrytax_emp20to59,reg_new$pss_topmitr)
cor(reg_new$oecd_entrytax_emp20to59,reg_new$laborsubsidy_share)
cor(reg_new$pss_topmitr,reg_new$laborsubsidy_share)
#< hint
display("Just click check")
#>

The correlation coefficients between the net-of-tax rate and the top marginal tax rate, and between the net-of-tax rate and the labor subsidy share, appear to be high. To check for multicollinearity further, it is advisable to test not only for pairwise correlation but also for correlation within sets of variables. We will get back to this topic after we have performed the regression.

In the next step, we want to run the multiple linear regression. The regression model we consider is: $Y_i = \beta_0 + \beta_1x_{1i} + \beta_2x_{2i} + \beta_3x_{3i} + \varepsilon_i$,

where $x_1$ is oecd_entrytax_emp20to59, $x_2$ is pss_topmitr and $x_3$ is laborsubsidy_share. Further, we assume that the error terms $\varepsilon_i$ are normally distributed with a mean of zero and constant variance $\sigma^2$, as in the simple linear regression before.

Note that "linear" means that the model is linear in the parameters $\beta_i$. This implies that the predictor variables can be transformed e.g. $x^2$ or multiplied together $x_1*x_2$.

Task: Perform a multiple linear regression with the three explanatory variables mentioned above. Use the lm function for this task. Show the results with the standard summary function afterward.

#< task

# reg_emp = lm(... ~ oecd_entrytax_emp20to59 + ... + laborsubsidy_share, data = reg_new)

# show the regression results afterward

#>
reg_emp = lm(oecd_emp20to59 ~ oecd_entrytax_emp20to59 + pss_topmitr + laborsubsidy_share, data = reg_new)
summary(reg_emp)
#< hint
display("Just click check")
#>

The $R^2$ tells us that 36.63% of the variation in the employment rate is explained by taking into account (1 - PTR), the top marginal tax rate and the labor subsidy share.

The stars behind each variable indicate the significance level at which we can reject the null hypothesis that the coefficient is equal to zero ($H_0:\beta=0$) and thus has no effect on the regressand. As you can see, the p-values of the t-tests suggest that only the slope for oecd_entrytax_emp20to59 and the intercept are significantly different from zero.

The global F-test, in contrast, has the null hypothesis that an intercept-only model fits the data as well as our model, i.e. that all slope coefficients are jointly zero. If the p-value of the F-test is below the significance level of 5%, we can reject the null hypothesis and conclude that our model provides a better fit than the intercept-only model. Here the p-value is 1.062e-06, suggesting that our model is more useful in predicting the employment rate than the intercept-only model.
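
For reference, the global F-statistic can be written in terms of the $R^2$ reported by summary(). This is the standard textbook form, with $n$ denoting the number of observations and $k$ the number of slope coefficients (notation assumed here):

$$F = \dfrac{R^2 / k}{(1-R^2)/(n-k-1)}$$

Under the null hypothesis and the classical assumptions, $F$ follows an F-distribution with $k$ and $n-k-1$ degrees of freedom, which is where the reported p-value comes from.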

$\beta_1$ represents the change in the mean response per unit increase in the net-of-tax rate. This means that the employment rate decreases by 0.23507 if (1 - PTR) increases by 1 while all other predictors are held constant. We do not interpret the other coefficients because they are not statistically significant.


To get back to the investigation of the degree of multicollinearity, we want to use the variance inflation factor (VIF).

< info "VIF"

The VIF is one of the most widely used diagnostics for multicollinearity. First, each predictor $x_j$ is regressed on all other predictors. Then the $R_j^2$ of each of these regressions is obtained and the $VIF_j$ is calculated as follows.

$$VIF_j = \frac{1}{(1-R_j^2)}$$

As the name suggests, it indicates how much the variance of a coefficient is inflated. Thus, a VIF of 1.7 means that the variance of a coefficient is 70% larger than it would be if the predictors were uncorrelated.

>
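
The definition in the info box can be sketched directly in base R. The example below uses simulated predictors with hypothetical names (`x1`, `x2`, `x3`); it regresses each predictor on the others and applies the formula $1/(1-R_j^2)$.

```r
# Simulated predictors (hypothetical data); x2 is correlated with x1 by construction
set.seed(7)
n <- 100
x1 <- rnorm(n)
x2 <- 0.7 * x1 + rnorm(n)
x3 <- rnorm(n)
X <- data.frame(x1, x2, x3)

vif_manual <- sapply(names(X), function(v) {
  # auxiliary regression of predictor v on all other predictors
  aux <- lm(reformulate(setdiff(names(X), v), response = v), data = X)
  1 / (1 - summary(aux)$r.squared)
})
vif_manual
```

As a cross-check, the same values appear on the diagonal of the inverse of the predictors' correlation matrix, `diag(solve(cor(X)))`.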

In practice, a value above 10 is often used to conclude that multicollinearity is a problem. But looking solely at the VIF is not very meaningful, because a high VIF does not necessarily mean that the standard deviation of $\hat{\beta}_j$ is too large to be useful (Wooldridge, 2012, p. 86).

Task: Use the vif function to get the variance inflation factor of each explanatory variable in reg_emp. You just have to pass the regression model to the function.

#< task
#loading package
library(faraway)

#>
vif(reg_emp)
#< hint
display("Just click check")
#>

As you can see, the VIFs range from 2 to 4, which suggests that multicollinearity is not a major problem in our regression model. If we had VIFs higher than 5, a possible remedy would be to remove highly correlated predictors from the model.


To further evaluate the regression analysis results we want to create an effect plot that visualizes the impact of each explanatory variable on the dependent variable.

Task: Use effectplot() from the package regtools to show the impact of each explanatory variable in reg_emp for the variation of the explanatory variables from the 10% quantile to the 90% quantile. Additionally, plot the confidence intervals by using the option show.ci = TRUE.

#< task
# Enter your command here.
#effectplot(lm object , further options)


#>
effectplot(reg_emp, show.ci = TRUE)
#< hint
display("Adapt the code")
#>

As you can see, the effect plot allows comparing the impact of the variables better. At first glance, it illustrates if the effect on the outcome is positive or negative by differently colored bars. The impact is calculated by varying the explanatory variable from the 10% to the 90% quantile if the option numeric.effect is set to its default value.

What we see is that the net-of-tax rate oecd_entrytax_emp20to59 has the greatest influence on employment. It decreases the employment rate by 0.081, ceteris paribus, when it is varied from its 10% to its 90% quantile.

You may have noticed that the explanatory variables are ordered descending by their influence on the dependent variable. Thus, pss_topmitr with only -0.014 has the least effect, closely followed by laborsubsidy_share.
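
For a linear model, an effect of this kind is presumably just the coefficient multiplied by the distance between the two quantiles. The sketch below illustrates this on simulated data (the variables `x1`, `x2`, `y` are hypothetical, and this is not the internal code of effectplot):

```r
# Simulated data (hypothetical), loosely mimicking a negative slope on x1
set.seed(3)
n <- 150
d <- data.frame(x1 = runif(n), x2 = rnorm(n))
d$y <- 0.8 - 0.25 * d$x1 + 0.05 * d$x2 + rnorm(n, sd = 0.05)
fit <- lm(y ~ x1 + x2, data = d)

# Effect of moving x1 from its 10% to its 90% quantile
q <- quantile(d$x1, probs = c(0.1, 0.9))
effect_x1 <- unname(coef(fit)["x1"] * (q[2] - q[1]))

# Same number via predictions, holding x2 fixed at its median
nd <- data.frame(x1 = q, x2 = median(d$x2))
effect_pred <- unname(diff(predict(fit, newdata = nd)))
c(effect_x1, effect_pred)
```

Both numbers coincide because the model is linear in x1, so the value at which the other predictors are held fixed does not matter.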

To get a better grasp on the results it may help to know the values of the above mentioned quantiles. In the effectplot you can see three figures beneath each variable name. The number in the middle is the calculated median and the figures to the left and the right are the predefined quantiles.

< quiz "effectplot quantiles 1"

question: What is the 10% quantile of oecd_entrytax_emp20to59 (three decimal places)?
answer: 0.215
roundto: 0.001

>


After looking at the regression results we want to check the model assumptions with the same methods as we did with the simple linear regression.


Task: Plot the residual plot for reg_emp using the function plot(). You just need to adapt the code.

#< task
# plot(reg object,1)
#>
plot(reg_emp,1)
#< hint
display("...")
#>

The changing vertical range of the residuals is suggestive of heteroskedasticity, i.e. non-constant variance: the spread of the residuals depends on the fitted values, so the residuals still contain systematic information that the model does not capture. The red line is a smoothed curve; because it is relatively flat and remains close to 0, it supports the linearity condition.

Additionally, we can test for heteroskedasticity with the Breusch-Pagan test. The test fits a linear regression model to the squared residuals with the same explanatory variables as in the previous regression model. Therefore, we can pass the "lm" object to the argument formula. By default, it runs a studentized version of the test, which is a robust modification.

< info "Breusch Pagan Test"

Assume we have a linear model: $y =\beta_0 + \beta_1x_1 + \beta_2x_2 +~...~+ \beta_kx_k + u$

First, we estimate the linear model by OLS and obtain the squared residuals $\hat{u}^2$.

Second, we run the regression $\hat{u}^2=\delta_0 + \delta_1x_1+\delta_2x_2+~...~+\delta_kx_k+error$ and keep the $R^2_{\hat{u}^2}$.

Then we form $LM = n \cdot R^2_{\hat{u}^2}$ and compute the p-value from the $\chi^2_k$ distribution. (Wooldridge, 2012, p. 251)

>
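
The three steps from the info box can be sketched by hand in base R. The data below are simulated with hypothetical names (`x1`, `x2`, `y`), and the error spread is made to grow with `x1` on purpose.

```r
# Simulated heteroskedastic data (hypothetical, not the tutorial data)
set.seed(11)
n <- 200
x1 <- runif(n, 1, 5)
x2 <- rnorm(n)
y <- 1 + 2 * x1 - x2 + rnorm(n, sd = x1)   # error spread grows with x1

# Step 1: OLS and squared residuals
fit <- lm(y ~ x1 + x2)
u2 <- residuals(fit)^2

# Step 2: auxiliary regression of the squared residuals on the regressors
aux <- lm(u2 ~ x1 + x2)

# Step 3: LM statistic and chi-squared p-value with k = 2 degrees of freedom
k <- 2
LM <- n * summary(aux)$r.squared
p_value <- pchisq(LM, df = k, lower.tail = FALSE)
c(LM = LM, p_value = p_value)
```

This $n \cdot R^2$ form corresponds to the studentized variant of the test, so it should be comparable to what bptest() reports by default for the same model.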


Task: Perform a Breusch-Pagan-Test on the model reg_emp using the function bptest() from the package lmtest.

#< task
#loading the required package
library(lmtest)
#>
bptest(reg_emp)
#< hint
display("...")
#>

The p-value is 0.008615, which is below the 5% significance level, so we reject the null hypothesis of homoskedasticity and assume heteroskedasticity.

To see whether the observed residuals are normally distributed, we plot the normal quantile plot, because one standard assumption in linear regression is that the error terms are independent and normally distributed.


Task: Plot the normal quantile plot for reg_emp using the plot() command again.

#< task
# plot(reg object,2)
#>
plot(reg_emp,2)
#< hint
display("...")
#>

We have a little deviation from the normal distribution in the tails. Some deviation is to be expected, as in reality a perfect normal distribution is rare. But as the sample size with only 64 observations is small, we should keep in mind that t-tests rely on the normality assumption. Thus, it is recommended to use more conservative p-values for conducting significance tests (Statistics Solutions, 2013).

As we have panel data at hand, it would generally be necessary to check whether the error terms are autocorrelated. But as there are only up to four observations per country, with gaps between them, autocorrelation is not a concern here. As mentioned above, the assumption of equal variance is violated, so it is appropriate to correct the standard errors to reduce the effects of heteroskedasticity on inference and to see whether the coefficients are still significant.


There are different versions of heteroskedasticity-consistent standard errors. But Long and Ervin (2000) performed Monte Carlo simulations and found that HC0 often results in incorrect inferences for small sample sizes $N \le 250$. They recommend HC3 as this version works even for small samples with $N$ as small as 25.
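
To illustrate how HC0 and HC3 differ, here is a minimal base R sketch on simulated data (hypothetical variables `x` and `y`; the sample size of 64 is only chosen to match the size of our regression sample). HC0 weights each observation by its squared residual, while HC3 additionally divides by $(1-h_i)^2$, where $h_i$ is the leverage of observation $i$.

```r
# Simulated heteroskedastic data (hypothetical, not the tutorial data)
set.seed(5)
n <- 64
x <- rnorm(n)
y <- 1 + 0.5 * x + rnorm(n, sd = exp(0.5 * x))
fit <- lm(y ~ x)

X <- model.matrix(fit)
e <- residuals(fit)
h <- hatvalues(fit)                 # leverage of each observation
bread <- solve(crossprod(X))        # (X'X)^-1

# Sandwich estimator for a given vector of observation weights w
sandwich_se <- function(w) sqrt(diag(bread %*% crossprod(X * sqrt(w)) %*% bread))

se_hc0 <- sandwich_se(e^2)
se_hc3 <- sandwich_se(e^2 / (1 - h)^2)
rbind(se_hc0, se_hc3)
```

Because $1/(1-h_i)^2 \ge 1$, the HC3 standard errors are never smaller than the HC0 ones, which makes HC3 the more conservative choice in small samples.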

Task: Click check to utilize the showreg function to show the regression results with a HC3 error.

#< task
showreg(list(reg_emp), custom.model.names = c("MLR with HC-error"), 
        digits = 5, robust = TRUE, robust.type = "HC3")
#>

#< hint
display("...")
#>

The output seems a bit surprising, as the robust standard error is even smaller than the conventional SE. But Jörn-Steffen Pischke (2010) mentions that the conventional SE can be biased either upward or downward. He states that the conventional SE is biased upward when covariate values located far from their mean are associated with lower-variance residuals.

Exercise 3 -- Evidence on social and cultural influences

Some argue that Scandinavians collect more tax because of their "tax morale", which consists of intrinsic and social motivation. And indeed, there is micro evidence that social incentives play a role in tax compliance (Dwenger, Kleven, Rasul, and Rincke, 2016). In this chapter, we are going to look at some descriptive evidence on the potential role of social and cultural aspects for tax revenue and the redistribution of wealth.

For the first descriptive evidence, we use the results on trust from the World Values Survey. The variable wvs_trustpeople is the fraction of people who answered "yes" to the question "Whether or not most people can be trusted".


To get the same results as Kleven (2014), we filter the data in this chapter according to the following rules. Exclude countries

- that are not OECD members and have natural resource rents of more than 20 % of their GDP
- where GDP per capita based on purchasing power parity is less than USD 5000 (in 2005 PPP terms)
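
The filter condition `OECD == 1 | wb_natresorents_pct_gdp < 0.20` used in the task chunks of this exercise is stated in terms of the rows to keep. The following toy example (four made-up countries A to D, not taken from the data set) sketches which rows survive such a condition:

```r
# Toy data with made-up countries, only to illustrate the logical condition
toy <- data.frame(
  Country = c("A", "B", "C", "D"),
  OECD = c(1, 1, 0, 0),
  wb_natresorents_pct_gdp = c(0.05, 0.30, 0.05, 0.30),
  stringsAsFactors = FALSE
)

# Keep a country if it is an OECD member OR has resource rents below 20 %
kept <- subset(toy, OECD == 1 | wb_natresorents_pct_gdp < 0.20)
kept$Country
```

Only country D, which is neither an OECD member nor below the 20 % threshold, is dropped. A resource-rich OECD member such as B stays in the sample.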

Task: Click check to create the subset and save it into tax_trust.

#< task
# loading the data
data = readRDS("data_condensed.rds")
#loading the required package
library(dplyr)

tax_trust <- data %>%
  filter(OECD == 1 | wb_natresorents_pct_gdp < 0.20) %>%                                               
  filter(gdppc_ppp >= 5000 & !is.na(gdppc_ppp)) %>%
  # excluding rows with missing values for the needed variables
  filter(!is.na(wvs_trustpeople) & !is.na(ief_taxpergdp2012)) %>%
  arrange(Country, Code, year) %>%
  group_by(Country,Code) %>%
  summarise_each_(.,funs(tail(.,n=1)), vars(cyear, ief_taxpergdp2012, wvs_trustpeople, OECD, wb_natresorents_pct_gdp, gdppc_ppp, style))


#>
#< hint
display("Just click check")
#>


Before we plot the relation of tax revenue against the survey results on trust, take a guess on the following quiz.

< quiz "tax trust"

parts:
  - question: 1. Do you think that a higher trust is correlated with a higher tax share of GDP?
    sc:
      - yes
      - no
    success: Great, your answer is correct!
    failure: Try again.
  - question: 2. Where do you think does Germany land? Lower end, middle or higher end when it comes to trust?
    sc:
      - lower end
      - middle
      - higher end
    success: Great, your answer is correct!
    failure: Try again.

>

< award "Intuition Lvl. 1"

Congrats! You answered correctly.

>


Okay, now let's create the corresponding visual to get a greater insight on how the trust people have for their fellows is associated with the tax revenue percentage of GDP.

Note: The following plot replicates Figure 6A from the paper.

Task: Click check to depict the data from tax_trust. On the x-axis, we plot the fraction of the people who answered with "yes" to the question and on the y-axis the tax/GDP ratio.

#< task
#loading the required packages
library(ggplot2)
library(ggrepel)

plot5 = ggplot(data = tax_trust, aes(x=wvs_trustpeople, y=ief_taxpergdp2012)) +
  geom_point(colour = ifelse(tax_trust$style == 1 | tax_trust$style == 2,"black","grey"), (aes(text = paste("Code:", tax_trust$Code)))) + 
  geom_smooth(method=lm, se=FALSE) +
  labs(x = "'Most people can be trusted'", y="tax/GDP ratio") +
  geom_text_repel(aes(label=ifelse(style == 1 | style == 2 ,Country,''))) + 
  theme_bw()

# print the plot
plot5
#>
#< hint
display("Just click check")
#>

What's striking about this graph is the positive correlation between the trust measure and the tax take. Also, the fact that the Scandinavian countries feature higher levels of trust than elsewhere is noteworthy. This result supports the notion that "social cohesion enables citizens to live in societies where they enjoy a sense of belonging as well as trust, which makes policies more effective through a virtuous circle between a widely accepted social contract, increased citizens' willingness to pay taxes and improved public services" (OECD, 2011).

But Kleven (2014) also notes the caveat that it is doubtful whether the trust measure represents cultural attitudes or is an endogenous outcome of deeper institutions.


Maybe there is a difference in the willingness to pay taxes because beliefs about the poor differ across countries. Therefore, we use a descriptive measure of the view on people in need. We use the variable wvs_inneed_lazy, which is the fraction of people who believe that social beneficiaries are in need because of laziness or lack of willpower. This survey data also stems from the World Values Survey.


Task: Click check to subset the data and assign it to laziness.

#< task
laziness <- data %>%
  filter(OECD == 1 | wb_natresorents_pct_gdp < 0.20) %>%                                               
  filter(gdppc_ppp >= 5000 & !is.na(gdppc_ppp)) %>%
  filter(!is.na(wvs_inneed_lazy) & !is.na(ief_taxpergdp2012)) %>%
  arrange(Country, Code, year) %>%
  group_by(Country,Code) %>%
  summarise_each_(.,funs(tail(.,n=1)), vars(cyear, ief_taxpergdp2012, wvs_inneed_lazy, OECD, wb_natresorents_pct_gdp, gdppc_ppp, style))

#>
#< hint
display("Just click check")
#>


Note: The following plot replicates Figure 6B from the paper.

Task: Click check to plot the tax/GDP ratio against the fraction of people who believe that people in need have brought their situation upon themselves.

#< task
plot6 = ggplot(data = laziness, aes(x=wvs_inneed_lazy, y=ief_taxpergdp2012)) +
  geom_point(colour = ifelse(laziness$style == 1 | laziness$style == 2,"black","grey")) + 
  geom_smooth(method=lm, se=FALSE) +
  # adding additional linear line for subset
  geom_smooth(data=subset(laziness, gdppc_ppp >= 19624), aes(x=wvs_inneed_lazy, y=ief_taxpergdp2012), colour="red", method=lm, se=FALSE) +
  labs(x = "'People in need because of laziness, lack of willpower'", y="tax/GDP ratio") +
  geom_text_repel(aes(label=ifelse(style == 1 | style == 2 ,Country,''))) + 
  theme_bw() 

#print the plot
plot6
#>
#< hint
display("Just click check")
#>

This graph shows only a weak negative relationship, which is distorted by the low-income countries. Kleven (2014) mentions that the slope of the blue line would be steeper if we controlled for income per capita or dropped the low-income countries. To visualize the difference, I added the red line for countries with a GDP per capita above the median of about USD 19624.

In Norway, Sweden and Denmark only approx. 15% think that poor people are lazy or lack willpower, whereas 60% of Americans think that people are poor because of their own shortcomings. Such deeply rooted public perceptions probably prevent more redistribution and make tax law adjustments modeled on Scandinavia impossible for some societies.

Perhaps the Scandinavian beliefs about poor people can help to promote the willingness to pay taxes.


In the next plot, we want to consider a behavioral measure. Kleven (2014) used a social index that combines civic participation, voter turnout and crime (as proxied by the homicide rate). Since the civic participation is only meaningful in democratic countries, all non-democratic countries are excluded.

< info "Social capital index"

According to Kleven (2014) the social capital index is obtained from a principal component analysis of the following variables:

1) civic participation: weighted average of a binary indicator for active membership of an organization (latest available year, source: World Values Survey)

2) average voter turnout in elections held after 2000, excluding the European Parliament elections (source: Voter Turnout Database, IDEA)

3) the inverse of the homicide rate (latest available year, source: UNODC, United Nations Office on Drugs and Crime).

>


Task: Click check to subset the data and assign it to social_cap.

#< task

social_cap <- data %>%
  filter(OECD == 1 | wb_natresorents_pct_gdp < 0.20) %>%                                               
  filter(gdppc_ppp >= 5000 & !is.na(gdppc_ppp)) %>%
  # excluding non-democratic countries
  filter(polity2 > 0 | is.na(polity2)) %>%
  filter(!is.na(sk_proxy_wo_relig) & !is.na(ief_taxpergdp2012)) %>%
  arrange(Country, Code, year) %>%
  group_by(Country,Code) %>%
  summarise_each_(.,funs(tail(.,n=1)), vars(cyear, ief_taxpergdp2012, sk_proxy_wo_relig,polity2, OECD, wb_natresorents_pct_gdp, gdppc_ppp, style))

#>
#< hint
display("Just click check")
#>

In our subset, the minimum of the social capital index is at -1.65 and the maximum is at 3.04.

In the next step, we want to plot the social capital index against the tax/GDP ratio to explore one of the behavioral measures of social motivation.


Note: The following plot replicates Figure 6C from the paper.

Task: Click check to create a scatter plot with the social capital index sk_proxy_wo_relig on the x-axis and the tax/GDP ratio on the vertical axis.

#< task

plot7 = ggplot(data = social_cap, aes(x=sk_proxy_wo_relig, y=ief_taxpergdp2012)) +
  geom_point(colour = ifelse(social_cap$style == 1 | social_cap$style == 2,"black","grey")) + 
  geom_smooth(method=lm, se=FALSE) +
  labs(x = "Social capital index", y="tax/GDP ratio") +
  geom_text_repel(aes(label=ifelse(style == 1 | style == 2 ,Country,''))) + 
  theme_bw() 

#print the plot
plot7
#>
#< hint
display("Just click check")
#>

As you can see in the plot above the social capital index is strongly positively related to the tax take and again the Scandinavian countries are scoring high.

The plot suggests that the more people of a society are involved in an organization and make use of their right to vote, the higher the tax take is, although this is a correlation and not necessarily a causal effect. The aforementioned components of the social capital index can be understood as a measure of how much people are invested in their society and local community. Additionally, the inverse of the homicide rate represents a measure of people's personal integrity.


Finally, we explore the hypothesis that mandatory contributions through tax payments crowd out voluntary contributions through donations. This would challenge the argument that countries with high tax take and generous social systems are more socially motivated than others.

Unfortunately, there is no information about the amount of charitable contributions across many countries. Therefore, Kleven (2014) considers the fraction of people who donated money using data from the World Giving Index.


Task: Click check to subset the data and assign it to donating.

#< task

donating <- data %>%
  filter(OECD == 1 | wb_natresorents_pct_gdp < 0.20) %>%                                               
  filter(gdppc_ppp >= 5000 & !is.na(gdppc_ppp)) %>%
  filter(year >= 2008 ) %>%
  # exclude rows with missing values
  filter(!is.na(wgi_donatemoney2012) & !is.na(ief_taxpergdp2012)) %>%
  arrange(Country, Code, year) %>%
  group_by(Country,Code) %>%
  summarise_each_(.,funs(tail(.,n=1)), vars(cyear, ief_taxpergdp2012, wgi_donatemoney2012, OECD, wb_natresorents_pct_gdp, gdppc_ppp, style))
#>
#< hint
display("Just click check")
#>


Again, take a guess on the quiz before we reveal the relationship of the fraction of people donating vs. the tax/GDP ratio.


< quiz "tax donations"

parts:
  - question: 1. Do you think that the willingness to donate money is higher in countries like the USA where the tax take is relatively low?
    sc:
      - no
      - yes
    success: You're right!
    failure: Try again.
  - question: 2. Can you guess the fraction of people donating money to a charity in the US?
    sc:
      - 70%
      - 62%
      - 40%
    success: Great, your answer is correct!
    failure: Sorry, this is not correct. Try again.

>

< award "Intuition Lvl. 2"

Great! You have good intuition

>

Note: The following plot replicates Figure 6D from the paper.

Task: Click check to plot the fraction of people donating to charity wgi_donatemoney2012 against the tax revenue ief_taxpergdp2012.

#< task
plot8 = ggplot(data = donating, aes(x=wgi_donatemoney2012, y=ief_taxpergdp2012)) +
  geom_point(colour = ifelse(donating$style == 1 | donating$style == 2,"black","grey")) + 
  geom_smooth(method=lm, se=FALSE) +
  labs(x = "Fraction of people donating money to charity", y="tax/GDP ratio") +
  geom_text_repel(aes(label=ifelse(style == 1 | style == 2 ,Country,''))) + 
  theme_bw()

# show the plot
plot8
#>
#< hint
display("Just click check")
#>

Unexpectedly, the relationship is positive, suggesting that there is no crowding out and that Scandinavia is almost as involved in charity as countries with smaller tax takes.

But, as mentioned before, a tax-charity crowd-out may still become visible when donation amounts are used instead of the fraction of people donating to charity. According to the Charities Aid Foundation (2006), Americans donate 1.67% of GDP, whereas Germans and the French donate 0.22% and 0.14%, respectively.
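
To get a feeling for the size of this gap, the three donation shares quoted above can be compared directly. This is only a small arithmetic sketch. The vector name `donations_pct_gdp` is chosen here for illustration.

```r
# Donation amounts as a percentage of GDP, as quoted in the text above
donations_pct_gdp <- c(USA = 1.67, Germany = 0.22, France = 0.14)

# How many times larger is the US share than the German and French shares?
us_share <- donations_pct_gdp[["USA"]]
ratio_to_us <- us_share / donations_pct_gdp[c("Germany", "France")]

round(ratio_to_us, 1)
```

Measured in donation amounts, the US share of GDP is roughly 7.6 times the German and 11.9 times the French share.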

Exercise 4 -- Conclusion

It is difficult to give a conclusive answer to the initial question of this problem set, namely how Scandinavian countries are able to achieve such strong economic outcomes, or more broadly: "Is it possible to design a tax system that raises large tax amounts while keeping tax evasion and tax distortions low?"

As the cross-country evidence in Exercise 3 suggests, social and cultural factors have an effect on tax compliance. Of course, such cultural characteristics cannot simply be transferred. In Exercise 2, however, we examined concrete policy implications discussed by Kleven (2014) that can easily be implemented in other countries. The first recommendation is to establish far-reaching information trails to promote tax compliance. Second, broadening the tax base effectively limits the scope for tax avoidance. Third, large public expenditures on goods complementary to working are advisable, because they make taxes less distortionary and promote high levels of employment.

But Kleven (2014) also notes that these factors are linked, in the sense that social and cultural characteristics facilitate the implementation of these policies while, at the same time, social and cultural norms may themselves be driven by the policies and institutions involved.

Because Scandinavian countries are small and homogeneous, with limited racial and religious diversity and high human capital, it is unclear how these policy implications apply to large, diverse, and unequal countries.

The conclusion is that countries around the world should think carefully about how to levy taxes and redistribute wealth with fewer distortions.


Task: Execute the following code to see all the awards you have collected.

#< task
awards()
#>

Exercise 5 -- References

Academic Papers and Books

Websites

R and R packages



dhertle/RTutorTaxationScandinavia documentation built on May 15, 2019, 8:22 a.m.