user.name = 'ENTER A USER NAME HERE'
Author: Daniel Klinke
Welcome to this interactive R tutor problem set that is part of my bachelor thesis at the University of Ulm. If you want to explore the diversity of economics and improve your R skills in addition to your economic understanding, you will have a lot of fun with this problem set.
Have you ever thought about the economic importance of creativity, how to measure it and what effect the state of mind has on creativity?
This problem set is based on the paper "How Are You, My Dearest Mozart? Well-Being and Creativity of Three Famous Composers Based on Their Letters" and the corresponding Replication Data as well as the Online Appendix by Karol Jan Borowiecki (2017). Using linguistic analysis software (LIWC), the author analyzes over 1400 letters of the world-famous composers Wolfgang Amadeus Mozart, Ludwig van Beethoven and Franz Liszt to relate negative emotions to their most creative works. In addition to the interactive econometric analysis of Borowiecki's results, I also consider a method of growing importance in economic analysis, automated text analysis, using the package quanteda. For simplicity, "the paper" in the remainder of the problem set refers to the above-mentioned study by Karol Jan Borowiecki.
Introduction
Exercise 1: Overview: Descriptive statistics
1.1 Examining emotions
1.2 Examining output
Exercise 2: Result analysis: Impact of emotions on creativity
2.1 Control variables and fixed effects
2.1.1 Omitted variable bias
2.1.2 Fixed effects
2.1.3 Comparison of the results
2.2 Instrumental variables regression
2.2.1 Theoretical introduction: Simple instrumental variable regression and two-stage least squares method
2.2.2 Applied instrumental variable regression
2.2.3 Results and discussion
Exercise 3: Main negative emotion
Exercise 4: Obtaining emotion indicators
4.1 Letter analysis using quanteda
4.1.1 Corpus creation
4.1.2 Dictionary creation
4.2 Dictionary comparison
4.2.1 Implementation of the NRC dictionary
4.2.2 Discussion
Exercise 5: Creativity with R
Exercise 6: Conclusion
Appendix
References
If this is your first problem set, the following section will provide you with a brief guide on how to solve it. Each exercise can be solved independently of the previous exercise. However, it is recommended to keep the given order, because the exercises build on each other didactically. Previous knowledge of the programming language R is helpful but not required. There are numerous excellent introductions to R in the form of freely available books or learning videos, for example "An Introduction to R" by Douglas et al. (2021) or "Learning R" by Poulson (2019). The problem set is designed to be intuitive to operate, and its operation will be explained in more detail at the appropriate point. For clarity, the most important elements you will encounter in the problem set are listed below.
Code Chunks: They will appear, whenever you have to enter R code. Within an exercise you have to solve the code chunk before you can proceed with the next one. The "Task" will always tell you what you have to do. Read the task carefully and follow it exactly, otherwise you will not be given the points for the correct solution.
edit: Normally you can type your code directly into the chunks. However, there are cases (for example, if you have not solved a previous optional task) where you have to press the edit button first to be able to edit the task.
check: Runs the chunk and checks if it is correct.
hint: Generates an automatic hint if you get stuck on a task.
solution: Shows you the sample solution of the task.
data: Shows you the data of the task.
run chunk: Runs the chunk but does not check if it is correct.
original code: In case you accidentally changed the original code when adding code parts, you can restore the original version using this button.
Info Blocks: Here you will find further information and more detailed explanations, which you can display according to your interest by pressing on the block.
Quizzes: Simple quizzes that are used to deepen your knowledge about the underlying data and regression results. Answering them is voluntary.
Awards: They are awarded to you when you successfully complete tasks. Sometimes additional interesting information is connected with them.
Enough explanations for now. Start the problem set by clicking on the button Go to next exercise. Have fun!
To get off to a good start, the first chapter focuses on obtaining an overview of the data. The data on which Borowiecki's results are based can be found here. For reasons discussed in the Appendix, we use a version of the replication dataset with fewer observations, but this does not affect the main results. The steps and associated R code for reducing the data can also be found in the Appendix.
In R, data is loaded by calling a function and saving its result to a variable. If the data is in table format, this is normally done in the form variable_name <- read.table("file_name"). Instead of <- you can also use =. The original replication data is available as a Stata file. I have saved the modified variant as an Rds file for simplicity. It can be loaded without much effort using the standard function readRDS().
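As a minimal sketch of this loading pattern, the following lines write a small made-up data frame to a temporary Rds file and read it back with readRDS(). The data frame toy, its values and the temporary file are assumptions for illustration only; they are not part of the replication data.

```r
# Toy data frame standing in for a real data set (illustrative values only)
toy <- data.frame(composer = c("Mozart", "Liszt"), letters = c(3, 5))

# Save the object to a temporary Rds file
toy_file <- tempfile(fileext = ".rds")
saveRDS(toy, toy_file)

# Load the file into a variable, using <- (or, equivalently, =)
toy_loaded <- readRDS(toy_file)

# Check that the loaded object matches the one that was saved
all.equal(toy, toy_loaded)
```

The same call works for any file that was written with saveRDS(); only the file name changes.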
Task: Use the function readRDS() to load the file "composer_data_reduced.rds". Load the file directly into the variable dat. The procedure is similar to the one described above.
# Enter your code here.
# Run for additional info in the Viewer pane
info("The Working Directory")
To get a better overview of the data we can use the function sample_n().
The sample_n() command shows a specified number of randomly selected rows of a data frame. It is especially useful if you want to have a first look at a large data set, but the first rows do not give you a good representation of its structure. If you want more detailed information on the syntax and usage of the command, call help(sample_n) in your R console or visit the R documentation.
The following example shows fifteen randomly selected rows from the data frame dat.
You don't have to solve the following chunk; instead, you can also click on the data button to view the data.
Task: You can click check to get a first overview of the data.
# You can use sample_n() to show 15 randomly selected rows of dat and colnames() to show just the variable names of dat. Try it out.
sample_n(dat, size = 15)
colnames(dat)
We now see the entire data set. It is a longitudinal data set covering the lifespans of three composers (Mozart, Liszt and Beethoven). Roughly speaking, the data can be divided into two blocks: on the one hand the background data of the composers (e.g. income, age, output, death_of_relative, decade) and on the other hand the letter data (e.g. numberofletter, negemo, posemo, anxiety, category_relationship). These two blocks are obtained via two different routes: the letter data via the LIWC algorithm and the background data mainly from Grove Music Online.
Basically, one line represents a written letter of an artist and contains information about the content of the letter as well as the related background information at the time the letter was written.
At this level, we cannot get a good overview of the data and cannot yet extract any information from it. In the next steps we will dive deeper into the data by looking at a reduced version of the data set. But first let's solve a quiz. Just click on the right answer and then press check:
Quiz: How many variables does our data set have?
# Run line to answer the quiz above
answer.quiz("variables")
As already noted, the dataset is still quite confusing this way. To manipulate data, the package collection tidyverse is excellent. We will come back to it several times in the further course and marvel at the wealth of functions. First we select the most important variables with select() from the dplyr package and then use the skimr package to get a good overview.
# Run for additional info in the Viewer pane
info("Packages")
# Run for additional info in the Viewer pane
info("dplyr package")
If you have skipped the last task, click edit before you go on and solve the chunk.
Task: Some variables in dat are redundant or we do not need them currently. Load the package dplyr, then use the select() function to select the variable columns of dat that are relevant for us and store them in a reduced version dat_red. This is a fill-in task, so you just have to replace the gaps ___ with the correct terms.
# 1. Load the package dplyr
library(dplyr)
# 2. Store a reduced version of dat in dat_red. Insert the correct term in the gap. It refers to the data being used.
dat_red <- select( ___ , composer, numberofletter, letters_annual, category_relationship, wc, posemo, negemo, anxiety, anger, sadness, social, sexual, death_concerns, financial_concerns, income, age_1, marriage_cohabitation, death_of_relative, tenure, touring, illness, output, theater)
# 3. Show all variable names of dat_red
colnames(___)
Task: Now generate a nice summary using the command skim of the skimr package. Follow the instructions in the chunk.
# 1. Load the package skimr
library(skimr)
# 2. To prepare the summary, group dat_red by the name of the composer
dat_redprep <- group_by(dat_red, ___)
# 3. Now use skim() and generate the summary table of dat_redprep
skim(___)
We now see an automatically generated summary table with the most important variables. Besides the number of rows and columns (variables) of the reduced and prepared dataset dat_redprep, we see a breakdown by the variable types "character" and "numeric". Take a closer look at the numeric variables. Next to the variable name you can see, for each composer, how many values are missing (marked with NA), the completion rate, the mean, the standard deviation, the percentiles and the frequency distribution represented by a small histogram.
# Run for additional info in the Viewer pane
info("Dummy variable")
Quiz: Just by looking at the distribution of the variables and the percentiles, which variables (skim_variable) are so-called "dummy variables"? Pick the 5 correct ones.
[9]: marriage_cohabitation
# Run line to answer the quiz above
answer.quiz("InterpretData")
This is quite difficult to see, but dummy variables take at most two different values. So for a single variable, at least one and at most two bars must be recognizable per composer. If you look at the percentiles (p0, p25, p50, p75, p100), they should also only take the values 0 or 1. Remember, p75 = 1 means that 75% of the values are one or smaller. A higher value makes no sense, since dummy variables by definition have a maximum value of 1. And a value between 0 and 1 makes little sense, since that would mean that the variable can take on more than two values, which also contradicts the definition.
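The percentile argument can be tried out on made-up numbers. The sketch below compares a hypothetical dummy variable with a hypothetical continuous variable; both vectors and the helper is_dummy() are assumptions for illustration. The setting type = 1 is also an assumption chosen here so that quantile() returns observed values only; the summary table of skimr may compute its percentiles differently.

```r
# Hypothetical dummy variable: 1 = married or cohabiting, 0 = otherwise
dummy <- c(0, 0, 0, 1, 1, 0, 1, 0)

# Hypothetical continuous variable for comparison
continuous <- c(0.2, 1.5, 0.7, 2.1, 0.9, 1.1, 0.4, 1.8)

# Percentiles p0, p25, p50, p75, p100 (type = 1 returns observed values only)
probs <- c(0, 0.25, 0.5, 0.75, 1)
q_dummy <- quantile(dummy, probs = probs, type = 1)
q_cont  <- quantile(continuous, probs = probs, type = 1)
q_dummy
q_cont

# Illustrative helper: a dummy variable only contains the values 0 and 1
is_dummy <- function(x) all(x %in% c(0, 1))
is_dummy(dummy)
is_dummy(continuous)
```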
Quiz: Which of the composers never entered into marriage or cohabitation according to the data?
# Run line to answer the quiz above
answer.quiz("PrivateLifeComposer1")
In fact, Beethoven once proposed marriage to a woman but was rejected. In the data summary, we can see that he was never married because the mean and standard deviation (sd) of the variable marriage_cohabitation are 0.
Quiz: Which artist do we have more information about than the other two?
# Run line to answer the quiz above
answer.quiz("PrivateLifeComposer2")
We have information about the amount of Mozart's income. The overview does not provide this information for the other two composers. The standard deviation furthermore shows that his income fluctuated considerably, as is the case for many freelance artists. This is also consistent with his biography. Mozart's life was characterized by surplus and scarcity. He did not build up reserves and spent money freely as soon as it was available. This is probably not how an economist would act.
Because it is appropriate at this point, you will get a detailed description of the variables in dat. The description and more detailed background of how the data was obtained can also be found in the replication file and the Online Appendix of Karol J. Borowiecki's paper.
# Run for additional info in the Viewer pane
info("Detailed variable description")
To get an impression of how the important variables emotion and output behave, we plot the data over the life of an artist. As an example we will use the data of Mozart. Note that the emotion data were generated in a special way. We will go into more detail regarding the generation process in a later part of the problem set. For now, it is sufficient to know that the emotion scores (e.g. negemo, posemo) are the proportion of words with the corresponding connotation in the total word count. The first step in plotting emotions is to prepare the data.
Besides select() and group_by() we now use another very helpful function of the dplyr package: filter(). We use our main dataset again, but this time we are only interested in the data for the composer Mozart and the columns age, negemo, posemo and output. The filter() function in combination with select() does exactly this. Instead of using intermediate variables as before, this is an opportunity to introduce the pipe operator. The pipe operator %>% is placed at the end of each line and chains the commands together: it passes the result of the previous line as the first argument to the function call of the next line.
Task: Prepare the data. Load all data relevant for Mozart into datMozart and select the columns age, negemo, posemo and output. The pipe operator is used, and at the end datMozart is shown. You just have to fill the gaps.
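To see that the pipe operator only changes how the code is written and not what it computes, here is a small sketch on made-up data. The data frame toy and its values are assumptions for illustration and are not taken from the replication data.

```r
library(dplyr)

# Made-up letter data (illustrative values only, not the replication data)
toy <- data.frame(
  composer = c("Mozart", "Mozart", "Liszt", "Beethoven"),
  age      = c(21, 25, 30, 40),
  negemo   = c(1.2, 0.8, 1.5, 2.0),
  wc       = c(300, 450, 200, 520)
)

# Variant 1: with an intermediate variable
toy_mozart   <- filter(toy, composer == "Mozart")
with_interim <- select(toy_mozart, age, negemo)

# Variant 2: with the pipe, each result becomes the first argument of the next call
with_pipe <- toy %>%
  filter(composer == "Mozart") %>%
  select(age, negemo)

# Both variants return the same rows and columns
all.equal(with_interim, with_pipe)
```

The piped version avoids the intermediate variable toy_mozart and reads from top to bottom in the order in which the steps are carried out.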
# As this is a new exercise we have to load the data again.
library(haven)  # provides read_dta()
library(dplyr)  # provides filter(), select() and the pipe operator %>%
dat <- read_dta("composer_data.dta")
# Load all data relevant for Mozart into datMozart and select the columns age, negemo, posemo, output. Use the pipe operator to implement the task. At the end show datMozart.
datMozart <- ___ %>%
  filter(___ == "Mozart") %>%
  select(age, ___, ___, ___)
datMozart
In datMozart each row still represents a letter, but now only contains Mozart's letters with the selected columns of interest. This simplifies the creation of the graph in the next step and enhances the understanding of the data being examined.
You may already know the plot() function in R to create graphs quickly and easily. However, there is a graphically nicer and more popular way to create graphs, which we will use in this problem set: ggplot. Once you have understood how it works, generating descriptive graphs is quite efficient.
# Run for additional info in the Viewer pane
info("ggplot")
In the next code chunk we refer to the corresponding data set with ggplot() and define the axes with aes(x = ..., y = ...). Afterwards we rename the axes with labs(), so that it is clear which variables are plotted.
With geom_smooth() we help the eye to recognize patterns. It draws a fitted line through the data cloud; in other words, the data is smoothed. There are different smoothing methods. In our case we use the LOESS (locally estimated scatterplot smoothing) method. Roughly speaking, this is a local regression: the surrounding data points are used to produce the corresponding point on the curve. Using the argument span we can set the sensitivity of the line. The smaller the number, the shakier the line; the larger the number, the smoother the line. If span is not specified, geom_smooth() uses its default value of 0.75. For more information see Jacoby (2000). Finally, we display the 95% confidence interval around the smoothed line by setting se = TRUE.
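The effect of span can be illustrated without any plot. The following sketch is only an illustration under stated assumptions: it uses base R's loess() directly instead of geom_smooth(), the data are simulated, and the two span values 0.2 and 0.9 are chosen arbitrarily.

```r
set.seed(1)

# Simulated noisy data (illustrative only)
x <- seq(0, 10, length.out = 100)
y <- sin(x) + rnorm(100, sd = 0.3)

# Fit LOESS curves with a small and a large span
fit_wiggly <- loess(y ~ x, span = 0.2)
fit_smooth <- loess(y ~ x, span = 0.9)

# Sum of squared residuals: the small span follows the data more closely
sse_wiggly <- sum(residuals(fit_wiggly)^2)
sse_smooth <- sum(residuals(fit_smooth)^2)
c(span_0.2 = sse_wiggly, span_0.9 = sse_smooth)
```

A smaller span uses fewer neighboring points for each local fit, so the curve tracks the observations more closely and looks shakier, while a larger span averages over more points and looks smoother.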
Task: Complete the gaps with the correct variables from datMozart. MoPos should plot the positive emotions over time and MoNeg the negative emotions.
# Plot posemo against age (MoPos) and negemo against age (MoNeg).
library(ggplot2)  # provides ggplot(), aes(), labs() and geom_smooth()
MoPos <- ggplot(datMozart, aes(x = ___ , y = ___ )) +
  labs(x = "Age", y = "Positive emotions") +
  geom_smooth(se = TRUE, method = "loess")
MoNeg <- ggplot(datMozart, aes(x = ___ , y = ___ )) +
  labs(x = "Age", y = "Negative emotions") +
  geom_smooth(se = TRUE, method = "loess")
Task: You have successfully created the graphs. Run the code below to view them in direct comparison to each other. Just click the check button.
# Run the code to view the graphs in direct comparison to each other.
library(gridExtra)
grid.arrange(MoPos, MoNeg, nrow = 2, top = "Wolfgang Amadeus Mozart Emotions (1756-1791)")
The plot shows the positive emotions at the top and the negative emotions at the bottom as a function of Mozart's age. The curves mirror each other quite clearly in the data for Mozart: if the positive emotions of the composer are high, the negative emotions are low and vice versa. This makes perfect sense, since one does not generally assume two opposing states of mind at the same time. For the whole data set, the correlation coefficient between the positive and negative emotions is $-0.13$. This supports the argument that high positive emotions are associated with low negative emotions. Comparing the graphs with Mozart's biography, one can also see that the emotion scores describe his actual emotional state quite well.
Looking at the graphs, two distinctive features stand out: a drop in positive emotions at the age of 17-23 and a turning point followed by a continuous increase in positive emotions at the age of 25-33.
After Mozart stopped touring Europe at the end of his childhood, he took up a permanent position at the court in Salzburg at the age of 17. The then newly elected Prince-Archbishop Hieronymus Colloredo made it increasingly difficult for local artists, including Mozart, to pursue their passion by closing the University Theater, limiting the opportunities for musicians to perform in cathedrals and at court, and giving preference to Italian musicians. In addition, Mozart's family had a hard time getting performances or permanent positions during this period because they were not in favor with the nobility. The death of his mother in 1778 and the resulting deterioration of family circumstances then led to a peak of negative emotions at the age of 23-24 (Halliwell, 1998).
At the age of 25, his life took a turn for the better. First he was offered a position as court organist in Salzburg; then, in 1781, after the successful premiere of his opera "Idomeneo" in Munich, his relationship with his father also improved. He went on to secure numerous commissions as a freelancer in Vienna, and his reputation among the nobility rose again considerably. Finally, in 1782, he married Constanze Weber and soon after had his first child with her (Eisen et al., 2013).
In summary, the charts created using emotion scores are useful. Comparison with biographical data also serves as a validity check. More on this will be discussed in the chapter on determining emotion scores.
We again use the data for Mozart to see how many outstanding pieces (output) he composed at which age. The graph is generated using the same principle as in the previous exercise, so just click on check.
Task: Click on check to display the output by year of life.
# Load data again because this is a new exercise
library(haven)    # provides read_dta()
library(dplyr)    # provides filter(), select() and %>%
library(ggplot2)  # provides ggplot() and geom_smooth()
dat <- read_dta("composer_data.dta")
datMozart <- dat %>%
  filter(composer == "Mozart") %>%
  select(age, negemo, posemo, output)
# unique() keeps each distinct row only once
datMozartOutput <- dat %>%
  filter(composer == "Mozart") %>%
  select(age, negemo, posemo, output) %>%
  unique()
# Create the graph
MoOut <- ggplot(datMozartOutput, aes(x = age, y = output)) +
  labs(x = "Age", y = "Output") +
  geom_smooth(se = TRUE, method = "loess")
# Plot the graph
MoOut + geom_point(size = 1) + ggtitle("Wolfgang Amadeus Mozart Output (1756-1791)")
The blue line again represents our data smoothed with the LOESS method, and the gray band the 95% confidence interval. Newly added are the observations, shown as black dots. We use the unique() function when preparing the data to remove duplicate rows. If we didn't do this, the blue line would be distorted, because years with several letters would get a stronger weight or, visually speaking, more black dots would be placed directly on top of already existing black dots. Since we are interested in the output independently of the number of letters, such a weighting would make no sense. Please note that the blue curve produces meaningless values for the first 6 years. Of course, a negative output is not possible.
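Why duplicate rows distort the picture can be seen in a small sketch with made-up numbers. The data frame letters_df is an assumption for illustration: it contains one row per letter, and output is constant within a year of age.

```r
# Made-up data: one row per letter, output is constant within a year of age
letters_df <- data.frame(
  age    = c(20, 20, 20, 21, 22, 22),
  output = c(2, 2, 2, 5, 1, 1)
)

# unique() keeps one row per distinct (age, output) combination
yearly_df <- unique(letters_df)
nrow(yearly_df)

# Years with many letters dominate the first mean but not the second
mean(letters_df$output)
mean(yearly_df$output)
```

In the letter-level data, the year with three letters counts three times, which pulls the average towards its output value. After unique() every year counts once.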
Task: Plot the negative emotions as a function of age and the output as a function of age by running the following code. Press check.
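library(gridExtra)  # provides grid.arrange()
MoNeg <- ggplot(datMozart, aes(x = age, y = negemo)) +
  labs(x = "Age", y = "Negative emotions") +
  geom_smooth(se = TRUE, method = "loess")
# xlim(14, NA) restricts the x-axis of the output plot to ages 14 and above
grid.arrange(MoNeg, MoOut + xlim(14, NA), top = "Wolfgang Amadeus Mozart Output and Emotions (1756-1791)")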
If we look at the graphs, it seems that the composer writes his most significant pieces when his negative emotions are high, and as soon as they decrease, the output decreases again with a short delay. If we additionally look at the correlation between the two variables negemo and output, the value of 0.13 seems to indicate a connection between them. But do the graph and the correlation already mean that negative emotions causally affect the output? No! The previous graphs were deliberately meant to suggest such a dependency, as is often the case in unobjective discussions. This informal fallacy is often referred to as the "post hoc ergo propter hoc" argument, which translates to "after this, therefore because of this". It means that event A is incorrectly assumed to cause event B only because event B followed event A. Furthermore, events are often explained monocausally, ignoring that one event usually has many influencing factors.
The following page presents this fallacy in a humorous way. The examples presented clearly do not make any sense, and most people can probably not be convinced that the number of math doctorates awarded causes an increase in the uranium stored at US nuclear power plants. However, there are situations in which post hoc ergo propter hoc reasoning leads to severe problems. For example, some people still directly associate vaccination of infants with the development of autism, although this has been widely studied and refuted, as shown by Hviid et al. (2019) in a nationwide cohort study. The result is that diseases that could be eradicated continue to circulate in the population because herd immunity cannot be established.
Why it could also be a fallacy in our case to interpret negative emotions as a cause of creativity, why we have to be very careful about the formulation of potential outcomes, and why we can still say at the end of our problem set that negative emotions lead to a higher output is covered in the next exercises.
So far, we have obtained an overview of the data and used descriptive statistics to analyze it. We also found a certain positive correlation between output and negative emotions for Mozart. However, this correlation does not necessarily have to be causal. Our goal in the following exercises is to show whether there really is a causal effect between negative emotions and the creativity of an artist. At the end of the problem set we provide a satisfactory answer to this question based on our multiple methods.
To open this part of the problem set, I provide a brief introduction to regression analysis with R. This problem set attempts to explain econometrics from a more intuitive perspective to sharpen economic understanding, drawing on the explanatory methods of Kennedy (2008) and Wooldridge (2016, 2020). It therefore contains relatively few mathematical formulas, proofs and extensive matrix notations. Nevertheless, reference is made to the relevant technical literature at the appropriate place. If you are more interested in exact mathematical explanations, I recommend the didactically excellent problem set R Tutor Water Pollution by Brigitte Peter.
If you know the system of regression analysis with R, you can skip the following chunks and go directly to the next exercise. However, if you want to refresh your knowledge or if the concept of regression analysis is completely unknown to you, the following summary is certainly very helpful for understanding the next exercises.
Regression analysis attempts to describe reality by quantitatively analyzing possible relationships between variables. To describe the association between variables, we express it in a model. For this purpose, we use a linear model, which generally looks like the following:
$$ y = X\beta + u $$
where $y$ denotes the dependent variable, $X$ a matrix containing a constant vector and the $k$ independent variables $x_1,…,x_k$, $\beta$ a vector of the true coefficients and $u$ the so-called error term (also called disturbance term). It can be thought of as measurement error, inherent randomness in human behavior or the omission of the influence of innumerable chance events (Kennedy, 2008, p. 3). Its sample counterpart is the vector of residuals, calculated as $\hat{u} = y - \hat{y}$, where $\hat{y} = X\hat{\beta}$. The ^ symbol above $\beta$, $y$ and $u$ means that these are estimates and not the true values.
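For readers who want to see the matrix notation at work, here is a minimal sketch on simulated data. The data and the true coefficients 1 and 2 are assumptions for illustration. The sketch uses the ordinary least squares (OLS) estimator, which is introduced further below, in its standard matrix form $\hat{\beta} = (X'X)^{-1}X'y$ and compares the result with lm().

```r
set.seed(42)

# Simulated data (illustrative only): the true coefficients are 1 and 2
n <- 200
x <- rnorm(n)
u <- rnorm(n)
y <- 1 + 2 * x + u

# Design matrix X with a constant column and the regressor x
X <- cbind(1, x)

# OLS estimator: beta_hat = (X'X)^{-1} X'y
beta_hat <- solve(t(X) %*% X, t(X) %*% y)

# Fitted values and residuals (the sample counterpart of the error term)
y_hat <- X %*% beta_hat
u_hat <- y - y_hat

# Compare with the coefficients from lm()
cbind(matrix_formula = as.vector(beta_hat), lm = unname(coef(lm(y ~ x))))
```

Both columns contain the same estimates, which are close to, but not exactly equal to, the true coefficients used in the simulation.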
For our example, a simple linear regression model written mathematically might look like this:
$$ output = \beta_0 + \beta_1 negemo + u $$
where $output$ is the dependent variable (also called explained variable or regressand), $\beta_0$ is the intercept parameter, $\beta_1$ is the slope parameter, $negemo$ is the independent variable (also called explanatory variable or regressor) and $u$ is the error term.
The main goal in regression analysis is to produce good estimators for $\beta$. Since there are literally an infinite number of them, the difficulty is not to produce an estimator, but to produce such an estimator that fits the estimation problem well and consequently gives good estimates. But what exactly is a good estimator? You will find the answer in the next info block.
# Run for additional info in the Viewer pane
info("A good estimator")
It is important to note that there is no super estimator that has the desirable properties in all situations for every estimation problem. It is therefore necessary to find out which estimator is the best estimator for the particular situation in which one finds oneself.
Often a standard estimation situation is used, called the classical linear regression model (CLR model). It has been shown, and can be mathematically proven (e.g. Wooldridge, 2016, pp. 89-90 and pp. 724-726), that in this model the ordinary least squares (OLS) estimator is the optimal estimator. It is also said that the OLS estimator is BLUE in the CLR model, that is, the best linear unbiased estimator (this result is often referred to as the Gauss-Markov theorem). The CLR model consists of five key assumptions. If one of these assumptions is changed or violated, the OLS estimator is often no longer the optimal estimator. The basic task is to determine whether one of the five assumptions is violated, whether the OLS estimator loses its desirable properties as a consequence and, if it does, to what extent other estimators can be used.
# Run for additional info in the Viewer pane
info("The CLR Model and its five assumptions")
In the next sections we will analyze if, in our scenario, assumptions of the linear regression model are violated. If this is the case, the model will be modified in order to detect the causal effect of emotions on creativity.
Now that we have refreshed the theoretical foundations, we can turn our attention to the concrete analysis. We have not yet introduced our final model, and reality is probably very complex. So, for the start of the regression analysis, let's go back to our question "Do (negative) emotions have an influence on creativity?" and run a simple linear regression by regressing output on negative emotions. The output of an artist is used as a measure of creativity, as mentioned before.
Our first short regression model looks like this:
$$ output_i = \beta_0 + \beta_1 negemo_i + u_i \tag{1} $$
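Before working with the composer data, the mechanics of such a simple regression can be sketched on simulated data. The variables x and y, the sample size and the coefficients 0.5 and 0.4 are assumptions for illustration; they are not estimates from the paper.

```r
set.seed(123)

# Simulated data (illustrative only, not the composer data)
n <- 100
x <- runif(n, min = 0, max = 3)
y <- 0.5 + 0.4 * x + rnorm(n, sd = 0.5)
sim <- data.frame(x = x, y = y)

# Regress y on x: dependent variable on the left of ~, regressor on the right
reg_sim <- lm(y ~ x, data = sim)

# Coefficient table with estimates, standard errors, t values and p-values
summary(reg_sim)$coefficients
```

The row (Intercept) corresponds to $\hat\beta_0$ and the row x to $\hat\beta_1$ of the simulated model.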
# Run for additional info in the Viewer pane
info("lm() and summary()")
# Run for additional info in the Viewer pane
info("Regression information of summary()")
Task: Perform the steps described in the chunk with lm() and summary() using the infoboxes above.
# Load the data
dat <- readRDS("composer_data_reduced.rds")
# Perform the regression with lm() as specified in Model (1) and save it in reg1
# Show a summary of the regression results
Looking at the table generated by summary() we see estimates for our coefficients $\beta_0$ and $\beta_1$. More precisely, we can observe a positive value for our estimator $\hat\beta_1$, which is also significantly larger than zero (at the 0.1% level).
But how can these results be interpreted correctly? Solve the two quizzes below for a first intuition.
# Run for additional info in the Viewer pane
info("Interpretation of Regression Coefficients")
A small reminder: as we have already seen in Exercise 1, negemo represents the proportion of words associated with negative emotions out of the total number of words (so the unit can be interpreted as percentage points). We discuss the generation process in more detail later. output denotes the number of outstanding pieces an artist has composed within one year.
Quiz: Which of these statements regarding the interpretation of the coefficient for negemo (i.e. $\hat\beta_1 = 0.4276$) in our first model is correct?
1: If the average proportion of negative emotions in the total word count increases by one percentage point, we predict that the annual output of outstanding compositions increases by 0.42 pieces in the following year.
2: If the average proportion of negative emotions in the total word count increases by 1 percent, we predict that the annual output of outstanding compositions increases by 0.42 pieces in the following year.
3: If the average proportion of negative emotions in the total word count increases by 0.42 percentage points, we predict that the annual output of outstanding compositions increases by 1 piece in the following year.
# Run line to answer the quiz above
answer.quiz("InterpretationRegressionCoefficients")
Quiz: Now consider the following interpretation. Is it also correct? "If the average proportion of negative emotions in the total word count increases by one percentage point, then the annual output of outstanding compositions increases by 0.42 pieces in the following year."
1: Yes. This is a similar interpretation and in general also possible. So it is correct.
2: This is a different statement. At this point it is implausible that this interpretation is also correct, therefore further investigation is necessary.
3: Yes. It is not a similar interpretation, but you can still use it. We can already say that this is also correct.
# Run line to answer the quiz above
answer.quiz("CausalityAndPrediction")
The statement in the second quiz suggests a causal link and is thus much stronger than the one before. Why is this the case? Because the word "predict" is missing and nothing indicates that 0.42 is merely the coefficient of a model, the statement no longer signals that 0.42 is only an estimate. As the exercise continues, we will see that the estimated effect in Model $(1)$ captures other effects as well. Always be careful in your choice of words when interpreting regression results.
The result of the previous quiz was that the estimate from Model $(1)$ cannot (yet) be interpreted causally. But why exactly is that the case? After all, both of our estimates are highly significant. Significant results are desirable but, taken in isolation, say nothing about causality. In other words, running a regression is not sufficient to prove causality; the results must be discussed in more detail. This is particularly the case when our data do not come from a controlled randomized experiment and may not satisfy the five assumptions discussed earlier.
# Run for additional info in the Viewer pane info("Controlled randomized experiment")
Previously, we showed graphically that for Mozart there is an obvious connection between output and negative emotions. The difference between correlation and causality is therefore discussed in more detail in the following paragraphs.
Correlation describes two variables moving in the same (or exactly the opposite) direction at the same time. According to Wooldridge (2020, p.799), correlation is "A measure of linear dependence between two random variables that does not depend on units of measurement and is bounded between -1 and 1". Note in particular that correlation only measures linear dependence between two variables. So, for example, it is possible that $u$ and $x$ are not correlated with each other, but $u$ and $x^2$ (as a function of $x$) are. For more information on this, see Wooldridge (2020, pp. 697-704).
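The remark about purely linear dependence can be made concrete with a small sketch. The vectors below are made up for illustration and are not part of the composer data: u is an exact function of x, yet the two are numerically uncorrelated, while u and x^2 are perfectly correlated.

```r
# Hypothetical illustration, not based on the composer data
x <- seq(-3, 3, by = 0.01)   # grid that is symmetric around zero
u <- x^2                     # u depends on x exactly, but not linearly

cor_linear    <- cor(x, u)     # linear dependence between x and u
cor_quadratic <- cor(x^2, u)   # linear dependence between x^2 and u

print(round(c(cor_linear = cor_linear, cor_quadratic = cor_quadratic), 6))
```

The symmetry of the grid around zero is what makes the linear correlation vanish here; with an asymmetric range of x the two variables would generally be correlated.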
Causality describes the connection between an event and the thing that causes it. In the words of Wooldridge (2020, p.798), causality is "A ceteris paribus change in one variable that has an effect on another variable". Ceteris paribus means that all other (relevant) factors remain the same. As you can imagine, it is often difficult or even impossible to keep all other factors constant. If they cannot be held constant, we cannot determine the causal effect.
Correlation therefore does not mean causality. Often it is said that correlation is necessary for a causal connection, but it could also be the case that we do not see any correlation and have a causal relation (e.g. if we have opposite bias effects that add up to zero together with the true effect).
So, returning to our example, the significant results for our estimator so far simply say that negative emotions are correlated with output.
Recall the assumptions for linear regression that we introduced in the previous exercise. Model $(1)$ violates assumption II, which states that, on average, the explanatory variables should not contain any information about the error term and vice versa. The error term $u$ of $output_i = \beta_0+\beta_1negemo_i+u_i$ contains other factors that influence $output$, for example whether the artist has much or little time and motivation to compose. A suitable indicator that captures this is whether he works freelance or currently holds a permanent position. The dummy variable $tenure$ provides exactly this information: it takes the value $1$ if the composer has a permanent job and $0$ otherwise.
For model $(1)$, assumption II, $\mathbb{E}(u|x_1,...,x_k)=0$, requires that the artist's employment relationship (tenure), which is part of $u$, is unrelated to the level of negative emotions (negemo). The assumption is therefore violated if the artist feels better or worse after taking a secure job. This is presumably the case.
By obviously excluding a relevant variable (tenure) from the regression in model $(1)$, we obtain a biased estimator for $\beta_1$. This problem is also referred to as the "Omitted Variable Bias". We can estimate the direction of the bias in different ways.
Let us just assume for the explanation of the omitted variable bias that we are in a world where model $(2)$ is the long "true" model and model $(1)$ is the short underspecified model. In reality, model $(2)$ is still underspecified and cannot be used to describe a causal effect, but we assume this for didactic reasons. An updated model by integrating tenure looks like this.
$$ output = \beta_0+\beta_1negemo+\beta_2tenure+u \tag{2} $$
We write model $(1)$ accordingly as
$$ output = \beta_0+\beta_1negemo+v \tag{1} $$
If, out of ignorance or for reasons of data availability, one now only estimates model $(1)$ although model $(2)$ is the true model, the error term of model $(1)$ contains the information about a tenured position.
$$ v = \beta_2tenure+u $$
It is easy to see that in the short model negemo is correlated with the error $v$, since $v$ includes tenure. Thus, in the short model $(1)$, the estimator for $\beta_1$ is biased because assumption II is violated. We denote this estimator by $\tilde{\beta_1}$ instead of $\hat{\beta_1}$ to emphasize that it comes from the short, underspecified model.
On the other hand, since we assume that model $(2)$ is the true model, it also satisfies all the assumptions we made in the previous chapter. Consequently, the estimators for $\beta_1$ (notated $\hat{\beta_1}$ to emphasize it is from the long model) and $\beta_2$ (notated $\hat{\beta_2}$) are unbiased. Given the algebraic connection of the short and long model $\tilde\beta_1=\hat\beta_1+\hat\beta_2\tilde\delta_1$ (proof can be found in Wooldridge 2020, pp. 114-115), where $\tilde\delta_1$ comes from the regression $tenure=\delta_0+\delta_1negemo+\epsilon$ (Wooldridge 2020, pp.84-86), it holds that
$$ \mathbb{E}(\tilde\beta_1)=\beta_1+\beta_2\tilde\delta_1 $$
and thus the bias of $\tilde\beta_1$ can be derived as follows
$$ Bias(\tilde\beta_1)=\mathbb{E}(\tilde\beta_1)-\beta_1 = \beta_2\tilde\delta_1 $$
Because $\tilde\delta_1=corr(negemo,tenure)\cdot\frac{sd(tenure)}{sd(negemo)}$ (proof can be found in Wooldridge 2020, pp.25-26), the direction (sign) of the bias depends only on $\beta_2$ and on the correlation of the two explanatory variables (the sample standard deviations can only take positive values). The following table based on Wooldridge (2020, p.85) gives a good overview of the bias in the estimator for $\beta_1$ when a second relevant explanatory variable is omitted:
# Run for additional info in the Viewer pane
info("Omitted Variable Bias")
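The algebraic link between the short and the long model can be checked numerically. The following is a minimal sketch on simulated data; the data generating process, the coefficient values and all object names are assumptions made for illustration and have nothing to do with the estimates in the paper.

```r
# Simulated data (hypothetical), mimicking the roles of negemo, tenure and output
set.seed(42)
n <- 500
negemo <- rnorm(n, mean = 2, sd = 1)
# Tenure dummy that becomes less likely when negemo is high
tenure <- as.numeric(runif(n) < plogis(1 - 0.8 * negemo))
output <- 1 + 0.2 * negemo - 3 * tenure + rnorm(n)

short <- lm(output ~ negemo)            # underspecified model (1)
long  <- lm(output ~ negemo + tenure)   # "true" model (2)
aux   <- lm(tenure ~ negemo)            # auxiliary regression for delta_1

beta1_tilde  <- coef(short)[["negemo"]]
beta1_hat    <- coef(long)[["negemo"]]
beta2_hat    <- coef(long)[["tenure"]]
delta1_tilde <- coef(aux)[["negemo"]]

# beta1_tilde = beta1_hat + beta2_hat * delta1_tilde holds exactly in every sample
print(all.equal(beta1_tilde, beta1_hat + beta2_hat * delta1_tilde))

# delta1_tilde = corr(negemo, tenure) * sd(tenure) / sd(negemo)
print(all.equal(delta1_tilde, cor(negemo, tenure) * sd(tenure) / sd(negemo)))
```

Both comparisons are algebraic identities of OLS, so they hold for any data set and not only for this simulated one.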
In our case, what are the signs of $\beta_2$ and $corr(tenure,negemo)$? Make an educated guess in the next two quizzes.
Quiz: Which sign do you think $\beta_2$ has in the long model?
3: $\beta_2=0$ and therefore has no sign.
# Run line to answer the quiz above
answer.quiz("Sign1")
In reality, we cannot be sure whether $\beta_2$ is positive or negative, since it is an unknown population parameter. Nevertheless, we often have a rough idea of how $x_2$ (tenure) affects $y$ (output). If someone takes a permanent job, there is probably less pressure to produce creative output. In addition, one has to fulfill certain duties and tasks, such as teaching students or conducting orchestras, which are very time-consuming. Thus, there is less time for composing new creative output when one assumes a tenured position, so the sign is negative. This seems plausible if one thinks, for example, of the publication productivity of a professor before and after obtaining a tenured position at a university. Indeed, Holley (1977) finds a negative correlation between tenured positions and the research productivity of academics.
Regarding the correlation between tenure and negemo, different perspectives are possible.
If the artist accepts a permanent position, he does not have to worry about whether he can afford his accommodation for the next month. This additional financial security has a positive effect on his state of mind. Under this view, tenure and negemo are negatively correlated.
If the artist accepts a permanent position, he probably does so because he has money problems or is otherwise indirectly forced to do so. Being in an employment relationship, he also has to pursue activities that give him little pleasure, whereas a freelancer is completely free in what he does and when he does it. A permanent position therefore has a rather negative effect on his state of mind. Under this view, tenure and negemo are positively correlated.
However, since we observe both tenure and negemo, we can simply calculate their correlation: cor(dat$tenure, dat$negemo, use="pairwise.complete.obs") returns -0.1204937, i.e. the two variables are slightly negatively correlated.
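The argument use="pairwise.complete.obs" matters because cor() returns NA by default as soon as one of the vectors contains a missing value. The following sketch uses two short made-up vectors (not the composer data) to show that this option is equivalent to dropping all pairs with a missing entry before computing the correlation.

```r
# Made-up vectors with missing values (hypothetical, for illustration only)
tenure_demo <- c(1, 0, NA, 1, 0, 0)
negemo_demo <- c(1.2, 2.5, 1.9, NA, 3.1, 2.2)

# Default behaviour: any NA makes the result NA
cor_default <- cor(tenure_demo, negemo_demo)

# Only pairs in which both values are observed are used
cor_pairwise <- cor(tenure_demo, negemo_demo, use = "pairwise.complete.obs")

# The same by hand: keep the complete pairs, then correlate
keep <- complete.cases(tenure_demo, negemo_demo)
cor_manual <- cor(tenure_demo[keep], negemo_demo[keep])

print(c(default = cor_default, pairwise = cor_pairwise, manual = cor_manual))
```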
In summary, $\tilde\beta_1$ has a positive bias and systematically overestimates the "true" causal relationship, since both $\beta_2$ and $corr(tenure,negemo)$ are less than zero. This statement is only valid if we imagine ourselves in a world where model $(2)$ is the true model, which is why "true" is written in quotation marks here.
As already mentioned in the infobox, the omitted variable bias can also be described for the case of a multiple linear regression model. However, the bias formula then differs considerably from that in the simple case. For a detailed mathematical explanation, see Wooldridge (2020, pp. 114-115).
Another way to find out in which direction the bias of our first model goes without regressing again for a new model in which we integrate tenure is by drawing causal graphs, which provide a good intuitive understanding. A detailed examination of graphical visualization methods for answering research questions and causal relationships is provided by Chen and Pearl (2014).
In the next section, the considerations we made in the section on the bias formula are graphically translated into a causal graph.
# Run for additional info in the Viewer pane
info("Causal graph")

The generated figure illustrates the simple world we assumed with the introduction of Model $(2)$. In this world, the output of a composer is systematically influenced only by his negative emotions and the employment relationship. Negative emotions influence output positively ($\beta_1$), and tenure influences output negatively ($\beta_2$). Moreover, tenure and negemo are negatively correlated with each other.
This graph makes clear that negemo is an endogenous explanatory variable if tenure is not included in the regression and therefore remains part of the error term.
# Run for additional info in the Viewer pane
info("Exogeneous and endogeneous explanatory variables")
For the correct interpretation of the causal graph, keep in mind that our goal is to measure the direct causal effect of the negative emotions negemo on the output, or in other words to find out whether negative emotions are one of the reasons for more creativity, i.e. additional output. So our arrow of interest connects negemo with output. In addition, however, the type of employment tenure influences both variables, as explained in detail before. Normally, of course, many other factors would exist that affect negemo and/or output. For the sake of clarity, and in line with the simplified world we assumed, these are left out of our causal graph.
Imagine you can walk along the graph. The only way to get from negemo to output should be the direct connection, because that is the connection we are interested in. If alternative connections exist, we must make sure that one cannot walk along an indirect path. In this case, this is the path from negemo via tenure to output. So you have to close this path in order to measure the path of interest.
How can you close a path? In this case, closing a path means nothing more than controlling for the variable tenure. By adding tenure in Model $(2)$, the endogeneity problem is solved and the only remaining path is the one from negemo to output, our path of interest.
So, in summary, in Model $(1)$ we measure the direct positive effect of negemo on output and the indirect positive effect of negemo via tenure on output. The estimator $\tilde\beta_1$ in the short model $(1)$ is thus too large, as it is the sum of these two effects, and does not answer our "research question" correctly.
How can we see from the diagram that the indirect effect of negemo on output is also positive? Because of the negative correlation between tenure and negemo, artists with strong negative emotions hold permanent positions less often. The negative connection between tenure and output in turn means that fewer permanent positions go along with a higher creative output, since the artists can devote all their time to composing their sonatas, symphonies and arias.
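The sum of the direct and the indirect path can also be illustrated by repeating the thought experiment many times. The sketch below simulates a world in which model $(2)$ is true; the coefficient values, the sample size and the number of repetitions are arbitrary assumptions chosen for illustration, not values from the paper.

```r
# Hypothetical simulation: model (2) is the true model by construction
set.seed(123)
beta1 <- 0.2    # assumed direct effect of negemo on output
beta2 <- -3     # assumed (negative) effect of tenure on output

one_draw <- function(n = 300) {
  negemo <- rnorm(n, mean = 2)
  # Tenure is less likely when negemo is high (negative correlation)
  tenure <- as.numeric(runif(n) < plogis(1 - 0.8 * negemo))
  output <- 1 + beta1 * negemo + beta2 * tenure + rnorm(n)
  c(short = coef(lm(output ~ negemo))[["negemo"]],
    long  = coef(lm(output ~ negemo + tenure))[["negemo"]])
}

# Repeat the estimation on 500 independent samples and average the estimates
draws <- replicate(500, one_draw())
means <- rowMeans(draws)
print(round(means, 3))
```

On average, the long model recovers the assumed direct effect, whereas the short model additionally picks up the positive indirect path via tenure and is therefore too large.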
As we will see in later parts of the problem set, there are also cases where we must use tools other than control variables. This is the case, for example, when we do not or cannot observe important variables.
Now we verify our results by running the longer regression $(2)$ and comparing it directly to the first short regression $(1)$.
Task: Run the regression and display the results by pressing check.
library(stargazer)
reg1 <- lm(output ~ negemo, data = dat)
reg2 <- lm(output ~ negemo + tenure, data = dat)
stargazer(reg1, reg2, type = "text", digits = 3,
          omit.table.layout = "s", model.numbers = FALSE,
          column.labels = c("Short Model (1)", "Long Model (2)"))
One can see that the estimate for $\beta_1$ in model $(1)$ with $0.428$ is larger than in model $(2)$ with $0.208$. Thus, $\tilde\beta_1$ indeed systematically overestimates the true effect in the short model. Also, the negative effect of tenure on output as already described becomes clear when looking at the estimate for $\beta_2$ with a value of $-3.633$, which moreover is highly significant. The significance of the estimate for $\beta_1$ decreases in the long model.
Even though we controlled for tenure it is still very unlikely that our estimator $\hat\beta_1$ in the long model measures the true causal effect $\beta_1$. So we end our thought experiment in which model $(2)$ is the true model and return to reality.
Other relevant variables for which we have not controlled are conceivable. In fact, the two previous models are oversimplified, and many more measurable variables exist for which we can control. The full model with which Borowiecki performed the OLS regression additionally includes the control variables touring, marriage_cohabitation, illness and letters_annual, and uses fixed effects for age, composer and addressee.
The models introduced and discussed so far, especially the simple linear regression, were not sufficient to describe the causal effect of negemo on output. To address the omitted variable bias, we learned about controlling for other relevant variables as a possible solution, for example by adding tenure, touring, marriage_cohabitation, illness and letters_annual to the model. However, this approach has its limitations, especially when one wants to control for unobserved characteristics. Therefore, another option that we will learn about in this subsection is fixed effects regression, which attempts to account for all time-constant factors, observed or not, that influence our dependent variable (Wooldridge, 2020, pp.439-440).
In summary, fixed effects for Age, Composer and Addressee are effectively a special variant of control variables, which we explain below using composer fixed effects as an example. Addressee fixed effects work similarly: they control for the relationship to the addressees of the letters and thereby eliminate the bias that could arise if the composers are more or less open about their state of mind depending on whom they write to. The author also controls for age using a nonlinear polynomial and refers to this as Age fixed effects. The motivation is to account for increased productivity at middle age and lower productivity at young and old ages. In our problem set, however, we refer to what the author calls Age fixed effects as "age polynomials" from now on, since they do not enter the regression like usual fixed effects. Rather, they are nonlinear control variables that are included as additional variables in the regression. Here, age_1 corresponds to unchanged age, age_2 to age squared divided by 100, age_3 to age cubed divided by 10000, and age_4 to age to the power of 4 divided by 1000000.
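The definition of the age polynomials translates directly into code. The following sketch builds the four variables for three made-up ages; the vector age and the data frame age_poly are hypothetical names, and the data set used in this problem set already contains the finished columns age_1 to age_4.

```r
# Hypothetical ages, used only to illustrate the scaling of the polynomial terms
age <- c(20, 35, 50)

age_poly <- data.frame(
  age_1 = age,               # unchanged age
  age_2 = age^2 / 100,       # age squared divided by 100
  age_3 = age^3 / 10000,     # age cubed divided by 10000
  age_4 = age^4 / 1000000    # age to the power of 4 divided by 1000000
)
print(age_poly)
```

A plausible reason for the division by powers of 100 is that it keeps the four terms on a similar scale, which makes the estimated coefficients easier to read; it does not change the fit of the regression.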
Let's try to understand the idea behind fixed effects with a concrete example. The idea behind composer fixed effects is to control for unobserved differences (characteristics) between composers. One conceivable difference is how strongly composers perceive and express a comparable event, similar to fund managers who express their views regarding an asset in a raw score. In our case this means that one composer considers a failed dress rehearsal before a performance to be bad, another categorizes it as catastrophic, disastrous, and the possible end of his career, and yet another classifies it as not worth mentioning.
If we now take model $(1)$ again and add composer fixed effects, this basically means that different artists have different intercepts $\beta_0$ and we control for this. We model this difference by allowing a separate intercept parameter for each person (Beethoven, Liszt, and Mozart) and call this model $(3)$.
$$ output_{i,t} = \beta_0 + \beta_1 \cdot negemo_{i,t}+\rho_1 \cdot composerBeethoven_i+\rho_2 \cdot composerLiszt_i+u_{i,t} \tag{3} $$
The dummy variable for the composer Mozart is missing. This is necessary because the dummies are interpreted relative to a reference category, which in this case is composerMozart; including all three dummies together with the intercept would make them perfectly collinear. Model $(3)$ thus has three different intercepts, one for each composer (Hanck et al. 2021, Section 10.3).
Task: Run a regression for Model $(3)$. The implementation in R can be done as usual with the command lm(). Remember that composer1 corresponds to Beethoven, composer2 corresponds to Liszt and composer3 corresponds to Mozart.
# Enter your code here.
Another, computationally less intensive option is to use the function felm from the lfe package. The structure of the felm function is felm(y ~ x1 + x2 | fixed_effects , data=dat). Note that the fixed-effects part after the | can contain either the dummy variables connected with a + or a single column of the data set in which the different categories are listed. Referring to a single column of the data set is also possible with the lm() function.
The column composer is a combination of the three previous variables and takes the values Beethoven, Mozart or Liszt.
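Why fixed effects can be estimated without explicitly including the dummies becomes clearer with a small sketch. It uses simulated data and base R only; all names and numbers are assumptions for illustration, and the demeaning step shown here is only meant to convey the general idea behind fixed-effects routines, not the exact algorithm that felm implements.

```r
# Simulated panel (hypothetical): three composers with different intercepts
set.seed(7)
composer <- factor(rep(c("Beethoven", "Liszt", "Mozart"), each = 100))
composer <- relevel(composer, ref = "Mozart")    # Mozart as reference category
alpha    <- c(Mozart = 5, Beethoven = 3, Liszt = 1)
effect   <- unname(alpha[as.character(composer)])  # composer-specific intercept

negemo <- rnorm(300, mean = 2) + 0.5 * effect    # regressor related to the effect
output <- effect + 0.3 * negemo + rnorm(300)

# Variant 1: dummy variables (the factor is expanded into dummies by lm)
dummy_fit <- lm(output ~ negemo + composer)

# Variant 2: subtract composer means from both variables, then regress
negemo_dm  <- negemo - ave(negemo, composer)
output_dm  <- output - ave(output, composer)
within_fit <- lm(output_dm ~ negemo_dm - 1)

print(coef(dummy_fit))
print(coef(within_fit))
```

Both variants return the same slope for negemo. The second one never has to estimate the composer dummies, which is what makes this kind of approach attractive when the number of categories is large.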
Task: Perform the fixed effects regression for model $(3)$ using the felm function and referring only to one column (composer).
library(lfe)
reg3 <- felm(output ~ ___ | ___, data=dat)
Now that we have learned about both adding control variables and fixed effects regression as ways to correct for a possible violation of assumption II, it is time to combine these methods and introduce the full OLS model used by Borowiecki. Writing out the additional 10 dummy variables for the fixed effects would make the equation very cluttered, so they are abbreviated by "..." below; keep in mind that Age Polynomials, Composer FE, and Addressee FE are part of Model $(4)$.
$$ output_t = \beta_0+\beta_1 \cdot negemo_t+\beta_2 \cdot tenure_t+\beta_3 \cdot touring+\beta_4 \cdot marriage+\beta_5 \cdot illness+\beta_6 \cdot lettersannual+...+u \tag{4} $$
Task: Run the regression for model $(4)$ and then compare models $(1)$, $(2)$, $(3)$, and $(4)$ with each other. Just click check to view the results.
library(lfe)
library(stargazer)
reg4 <- felm(output ~ negemo + age_1 + age_2 + age_3 + age_4 + tenure + touring +
               marriage_cohabitation + illness + letters_annual |
               composer + category_relationship, data = dat)
stargazer(reg1, reg2, reg3, reg4, type = "text", digits = 3, model.numbers = FALSE,
          column.labels = c("Short Model (1)", "Long Model (2)",
                            "Fixed Effects(3)", "Combined Model(4)"),
          omit = c("age_1", "age_2", "age_3", "age_4", "Constant"),
          add.lines = list(c("Age Polynomials", "NO", "NO", "NO", "YES"),
                           c("Composer FE", "NO", "NO", "YES", "YES"),
                           c("Addressee FE", "NO", "NO", "NO", "YES")))
It is noticeable that from left to right the apparent effect of negemo on output becomes smaller and less significant, until it is not even significant at the 10% level when the methods are combined. By adding the relevant explanatory variable tenure, we removed part of the error resulting from the violation of assumption II, which requires that the expected value of the error term given the explanatory variables is zero. In this way we mitigated the endogeneity problem and reduced the omitted variable bias. Another way to correct for correlation between the error term and an explanatory variable was to consider fixed effects. Model $(3)$ corrects Model $(1)$ for composer fixed effects; more specifically, it controls for unobserved composer characteristics.
Unfortunately, in many examples, it is not possible to control for all confounders, because the data is simply missing, or the variable is not measurable. Thus, unobserved confounders are a major obstacle in measuring the true causal effect. Sometimes there are methods to circumvent these problems. One of them, instrumental variable regression, we will learn about in the next exercise.
In our previous attempts to study the causal effect of emotions on creativity, we encountered endogeneity problems that we were able to solve by considering additional observed variables. In the previous exercise, we learned how control variables and fixed effects can help mitigate the effect of biased and inconsistent estimators. Let us recall the state of our most recent model $(4)$. We have estimated
$$ output = \beta_0+\beta_1 \cdot negemo+\beta_2 \cdot tenure+\beta_3 \cdot touring+\beta_4 \cdot marriage+\beta_5 \cdot illness+\beta_6 \cdot lettersannual+...+u \tag{4} $$
with fixed effects for composers and addressees, as well as age polynomials. We have already done our best to include all relevant observable explanatory variables and information in the model. But what do we do if there is still substantial correlation between an explanatory variable of interest and the error term? According to Kennedy (2008, pp. 139-140), reasons for this could be reverse causation or an unobserved omitted explanatory variable that is correlated with an included explanatory variable and therefore causes the latter to be correlated with the error term.
Thinking about our example, it could well be that the output itself influences the emotions. Borowiecki (2017, p.596) lists as possible reasons that the completion of a creative work could lead to a sudden drop in the tension and well-being that exists through the creative process, or the composer could get more public focus due to the increased notoriety that may result from the completion of a piece, and thus come under more criticism and pressure if the piece does not please the public.
Moreover, a relevant explanatory variable, income, which could be related to both negative emotions and output, is only observable for a ten-year period of Mozart's life. A higher income may create an additional incentive to generate more output or may reflect appreciation for creative mastery, as is often the case for managers with bonuses. The direction of the effect of higher income on negative emotions is not entirely clear. Higher income only increases positive emotions to some extent (Kahneman and Deaton, 2010) but could at least provide a more carefree life, as financial worries would no longer be part of negative emotions. On the other hand, higher income could also be associated with longer working hours, more responsibility and higher pressure, which increases negative emotions, as discussed by Hentschke et al. (2017). Thus, we potentially have an omitted variable bias that cannot be controlled for because the variable income is not available.
One possible solution to this dilemma is instrumental variable regression. It is a widely used and frequently applied method in econometrics to eliminate the bias in an OLS estimator due to the correlation between the explanatory variable and the error term of the regression. We introduce IV regression theoretically in the next section before discussing it in more detail using our example.
Assume again a simple regression model $(a)$
$$ y= \beta_0+\beta_1 \cdot x+u \tag{a} $$
in which $x$ and $u$ are correlated, i.e. $x$ is endogenous. An OLS regression would thus produce biased and inconsistent estimates. To estimate $\beta_1$ consistently, we need the information of another variable, called the instrumental variable, which in this example is denoted by $z$. Just as the choice of an appropriate instrument matters in music, the instrument in econometrics (as instrumental variables are often called for short) must meet two important criteria (Wooldridge 2020, p. 497):
Relevance condition: $z$ is correlated with the endogenous explanatory variable $x$. Thus, $Cor(z,x)\ne0$ must hold.
Exogeneity condition: $z$ is uncorrelated with the error term $u$. Therefore, $Cor(z,u)=0$ must hold.
If $z$ satisfies these two conditions, we can call $z$ an instrumental variable for $x$. Do you think it is easy to check these conditions? Trust your intuition and solve the quizzes below.
Quiz: Do you think the relevance condition, i.e. $Cor(z,x)\ne0$, can easily be tested mathematically?
# Run line to answer the quiz above
answer.quiz("Relevance")
As we normally observe both variables, tests exist to check whether this condition is fulfilled. The easiest way to do this is to run a simple regression of $x$ on $z$.
Quiz: Do you think the exogeneity condition, i.e. $Cor(z,u)=0$, can easily be tested mathematically?
# Run line to answer the quiz above
answer.quiz("Exogeneity")
We do not observe the error term $u$. Therefore, we cannot generally hope to test this assumption mathematically. However, with further information about the data generating process and with economic expertise, the plausibility of this condition can still be assessed.
Since $y$ follows the linear model $(a)$, taking the covariance of both sides with $z$ yields
$$ Cov(z,y)= \beta_1 \cdot Cov(z,x)+Cov(z,u) $$
If we now impose the relevance and exogeneity conditions, the exogeneity condition gives $Cov(z,u)=0$ (which is equivalent to $Cor(z,u)=0$), and after rearranging we obtain
$$ \beta_1=\frac{Cov(z,y)}{Cov(z,x)} \tag{b} $$
Using equation $(b)$, the relevance condition also becomes apparent in algebraic form, since the denominator must be non-zero.
Replacing the population covariances with their sample counterparts and cancelling the common normalizing factor in the numerator and denominator, we obtain the following instrumental variable estimators for $\beta_1$ and $\beta_0$:
$$ \hat\beta_1=\frac{\sum_{i=1}^{n}(z_i-\bar{z})(y_i-\bar{y})}{\sum_{i=1}^{n}(z_i-\bar{z})(x_i-\bar{x})} $$
$$ \hat\beta_0=\bar{y}-\hat\beta_1\bar{x} $$
It is also worth noting that for the case when $x$ is exogenous, it can serve as its own instrument, so $z=x$ holds and the IV estimator is then identical to the OLS estimator (Wooldridge 2020, p. 499).
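The estimators above can be tried out on simulated data in which the true $\beta_1$ is known. The sketch below is a hypothetical example: the variable c_unobs plays the role of an unobserved confounder hidden in the error term, and all numbers are arbitrary assumptions.

```r
# Hypothetical simulation with a known true slope of 2
set.seed(2024)
n <- 5000
z       <- rnorm(n)                        # instrument: relevant and exogenous
c_unobs <- rnorm(n)                        # unobserved confounder (part of u)
x <- 0.5 * z + c_unobs + rnorm(n)          # endogenous explanatory variable
y <- 1 + 2 * x - 3 * c_unobs + rnorm(n)    # u = -3 * c_unobs + noise

# OLS slope: inconsistent because x is correlated with u
beta1_ols <- cov(x, y) / var(x)

# IV estimators from the formulas above
beta1_iv <- cov(z, y) / cov(z, x)
beta0_iv <- mean(y) - beta1_iv * mean(x)

# Using x as its own instrument reproduces the OLS slope
beta1_self <- cov(x, y) / cov(x, x)

print(round(c(ols = beta1_ols, iv = beta1_iv, self = beta1_self), 3))
```

In this setup the OLS slope lies far below the true value because the confounder enters x and y with opposite signs, while the IV estimate is close to 2.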
An even better intuition for IV estimation can be obtained via the two-stage least squares (2SLS) method. Here, two OLS regressions are performed. The first OLS regression tries to remove from the variable $x$ the part that is correlated with the error term $u$. Then, the original regression is performed with the adjusted explanatory variable $\hat{x}$, which contains only the exogenous variation. Specifically, this means that instead of estimating $(a)$ directly, we first regress the endogenous explanatory variable on the instrument.
$$ x=\alpha_0 +\alpha_1z+\epsilon $$
We then calculate the predicted value $\hat{x}$ of the regression
$$ \hat{x}= \hat\alpha_0 +\hat\alpha_1z $$
and finally perform the second stage regression by replacing $x$ in the initial model $(a)$ with $\hat{x}$.
$$ y= \beta_0+\beta_1 \cdot \hat{x}+u \tag{c} $$
Provided $z$ again satisfies the relevance and exogeneity condition, $\hat\beta_1$ is again a consistent estimator.
In R, instrumental variable regression can be performed using the ivreg function from the AER package or manually using the 2SLS method.
However, if the 2SLS method is performed manually, the standard errors, t statistics and p-values reported by the second-stage regression are invalid, because they are not adjusted for the fact that predictions from the first-stage regression are used as regressors in the second stage. It is therefore recommended to use the existing econometric packages in R (Wooldridge 2020, pp. 509-511).
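The following sketch carries out 2SLS by hand on simulated data to show both points: the manual point estimate coincides with the IV estimator, but the standard error printed by the second-stage lm() is not the appropriate one. The data and all names are hypothetical, and the corrected standard error assumes homoskedastic errors; in practice one would rely on a package such as AER.

```r
# Hypothetical simulation, same structure as a simple IV setting
set.seed(99)
n <- 2000
z       <- rnorm(n)
c_unobs <- rnorm(n)
x <- 0.5 * z + c_unobs + rnorm(n)
y <- 1 + 2 * x - 3 * c_unobs + rnorm(n)

# Stage 1: regress x on z and keep the fitted values
x_hat <- fitted(lm(x ~ z))

# Stage 2: regress y on the fitted values
second_stage <- lm(y ~ x_hat)
b0_2sls <- coef(second_stage)[["(Intercept)"]]
b1_2sls <- coef(second_stage)[["x_hat"]]

# Naive standard error: based on residuals computed with x_hat
se_naive <- summary(second_stage)$coefficients["x_hat", "Std. Error"]

# Corrected standard error: residuals must be computed with the original x
res_iv       <- y - b0_2sls - b1_2sls * x
sigma2_iv    <- sum(res_iv^2) / (n - 2)
se_corrected <- sqrt(sigma2_iv / sum((x_hat - mean(x_hat))^2))

print(round(c(b1_2sls = b1_2sls, iv_ratio = cov(z, y) / cov(z, x)), 4))
print(round(c(se_naive = se_naive, se_corrected = se_corrected), 4))
```

The two standard errors differ because the second-stage residuals are built from $\hat{x}$ instead of $x$ and therefore do not estimate the variance of $u$.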
# Run for additional info in the Viewer pane
info("ivreg() function")
Wooldridge (2020, pp. 505-509 and 513) shows that instrumental variable regression can also be carried out in the multiple case with several instruments and control variables and still yields consistent estimators. For the 2SLS method, the endogenous explanatory variable is regressed on all exogenous explanatory variables and all instruments in the first step. The further procedure corresponds to the simple version.
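A minimal sketch of the multiple case with one endogenous variable, one instrument and one exogenous control variable could look as follows. The data are simulated and the names w, z and c_unobs are hypothetical; the matrix formula used for comparison is the standard IV estimator for the case with exactly as many instruments as endogenous variables.

```r
# Hypothetical simulation with one exogenous control variable w
set.seed(11)
n <- 3000
w       <- rnorm(n)                                  # exogenous control
z       <- rnorm(n)                                  # instrument
c_unobs <- rnorm(n)                                  # unobserved confounder
x <- 0.5 * z + 0.7 * w + c_unobs + rnorm(n)          # endogenous regressor
y <- 1 + 2 * x + 1.5 * w - 3 * c_unobs + rnorm(n)    # true effect of x is 2

# Stage 1: the endogenous variable on the instrument AND the control
x_hat <- fitted(lm(x ~ z + w))

# Stage 2: the outcome on the fitted values and the control
tsls <- coef(lm(y ~ x_hat + w))

# Comparison: IV estimator in matrix form, (Z'X)^(-1) Z'y
X <- cbind(1, x, w)   # regressors of the structural equation
Z <- cbind(1, z, w)   # instrument matrix: exogenous variables instrument themselves
iv_matrix <- solve(t(Z) %*% X, t(Z) %*% y)

print(round(cbind(two_stage = tsls, matrix_form = as.vector(iv_matrix)), 4))
```

Leaving the control w out of the first stage would in general break this equality, which is why all exogenous variables have to appear in both stages.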
Let us now return to our example and try to perform the instrumental variable regression for Model $(4)$. To start the analysis we have to load the data again.
Task: Press check in order to load the data.
# Load the data.
dat <- readRDS("composer_data_reduced.rds")
We have already clarified the question of how instrumental variable regression works. So far, however, the questions of how to find a suitable instrument, how to test whether this instrument is suitable, and how to subsequently interpret the results have remained unanswered.
Finding a suitable instrument is often very difficult, because the search always depends on the context of the problem. This process is therefore not discussed in more detail here. However, I would like to refer to a list by Kennedy (2008, pp. 142-143) in which he gives numerous examples of clever researchers finding suitable instrumental variables. In our case, we are looking for an instrument that meets the conditions described above and influences the creativity of an artist only through the path of negative emotions. Borowiecki identifies death_of_relative as a fitting instrument and introduces the following equations in the context of the 2SLS method.
First stage:
$$ negemo = \beta_0+\beta_1 \cdot death\:of\:relative+\beta_2 \cdot \vec{Z}+\mu \tag{5} $$
Second stage with predicted values for negemo:
$$ output = \beta_0+\beta_1 \cdot \widehat{negemo}+\beta_2 \cdot \vec{Z}+\nu \tag{6} $$
$\vec{Z}$ represents a vector containing the previously introduced control variables $tenure,touring,marriage,illness,lettersannual$.
Let us now check for ourselves whether this instrument fulfills the two required conditions of relevance and exogeneity.
The relevance condition in this case is $Cor(death\:of\:relative,negemo)\ne0$. We have already briefly mentioned that it is relatively easy to check this condition. We only have to regress the endogenous variable negemo on the instrument and see whether the instrument explains enough of its variation, as measured by the $R^2$ and the F-statistic. Note that this simple approach is only correct if we have no other control variables. Since in practice we usually have regressions with multiple control variables, the relevance condition is instead tested by requesting diagnostics in the summary of the ivreg fit. We will look at this in the task following this one.
Task: Regress the endogenous variable on the instrument by inserting the correct variable names in the gaps.
summary(lm( ___ ~ ___, data=dat))
While death_of_relative appears to be a highly significant variable, the $R^2$ of the regression returns a value of $0.009963$. This means that the model explains just under 1 percent of the variation. This does not seem like much. In fact, a bias problem may arise for the IV estimator if the instrument is too weak, i.e., has an $R^2$ that is too low.
But at what point is the association between instrument and endogenous variable large enough? As a rule of thumb for the case of one endogenous variable, Kennedy (2008, p.145) states that the F-value should be larger than 10. If this is the case, the IV bias is assumed to be smaller than 10% of the OLS bias and the relevance condition can be considered sufficiently fulfilled. Even though our instrument is not very strongly associated with the endogenous variable negemo, the relevance condition can be considered satisfied under the rule of thumb, since the F-value of our regression is $14.49$. However, for F-values that are too small, it is strongly advisable to avoid IV regression, since with a weak instrument even a mild violation of exogeneity may lead to the IV estimator being more biased than the OLS estimator.
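How a very small $R^2$ can go together with an F-value above 10 follows from the relation between the two statistics. For a regression with a single explanatory variable, $F = \frac{R^2}{1-R^2}\cdot(n-2)$, and F equals the squared t statistic of the slope. The sketch below first plugs the two values reported above into this relation and then verifies it on simulated data; the simulated variables are hypothetical and unrelated to the composer data.

```r
# Values reported in the text for the simple first-stage regression
r2     <- 0.009963
f_stat <- 14.49

# Solving F = R^2 / (1 - R^2) * (n - 2) for the residual degrees of freedom
df_implied <- f_stat * (1 - r2) / r2
print(round(df_implied))

# Simulated check (hypothetical data) of the relation between F, t and R^2
set.seed(5)
n <- 1442
z <- rbinom(n, 1, 0.1)            # rare binary instrument
x <- 1.5 + 0.3 * z + rnorm(n)     # variable that is only weakly related to z
fs <- summary(lm(x ~ z))
f_sim <- fs$fstatistic[["value"]]
t_sim <- fs$coefficients["z", "t value"]

print(c(F = f_sim, t_squared = t_sim^2,
        from_r2 = fs$r.squared / (1 - fs$r.squared) * (n - 2)))
```

The implied number of residual degrees of freedom is of the same order as the more than 1400 letters in the data set, which illustrates that a large sample can produce a sizeable F-value even when the instrument explains only a small share of the variation.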
Task: As already noted, there is also a simple reliable way to test the relevance condition directly. Complete the following code intuitively (insert TRUE or FALSE) for our final model $4$.
IVRelevance <- ivreg(
  output ~ negemo + age_1 + age_2 + age_3 + age_4 +
    composer1 + composer2 +
    relationship1 + relationship2 + relationship3 + relationship4 + relationship5 +
    tenure + touring + marriage_cohabitation + illness + letters_annual |
    death_of_relative + age_1 + age_2 + age_3 + age_4 +
    composer1 + composer2 +
    relationship1 + relationship2 + relationship3 + relationship4 + relationship5 +
    tenure + touring + marriage_cohabitation + illness + letters_annual,
  data = dat)
summary(IVRelevance, diagnostics = ___)$diagnostics
We will discuss the results of the IV regression later. Let's focus on the diagnostics first. If we set diagnostics to TRUE, the summary command displays the weak instrument test, the Wu-Hausman test, and the Sargan test in addition to calculating the coefficients. The weak instrument test is used to check the relevance condition. It performs an F test for the first stage regression, where the null hypothesis is that the instrument is weak (i.e. the relevance condition is not fulfilled). In our case we can reject the null hypothesis (p-value < 0.1%), so we do not have a weak instrument and the relevance condition is fulfilled.
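Conceptually, the weak instrument test with control variables compares the first-stage regression with and without the instrument. The following base R sketch shows this idea on simulated data. The variables w, z and x are hypothetical, and the summary of ivreg may compute the statistic differently in detail.

```r
# Simulated data; all names and values are hypothetical
set.seed(2)
n <- 500
w <- rnorm(n)                       # exogenous control variable
z <- rbinom(n, 1, 0.2)              # instrument
x <- 0.5 * w + 0.6 * z + rnorm(n)   # endogenous regressor

restricted   <- lm(x ~ w)           # first stage without the instrument
unrestricted <- lm(x ~ w + z)       # first stage with the instrument

# F test of H0: the coefficient on z is zero (instrument is irrelevant)
weak_test <- anova(restricted, unrestricted)
f_partial <- weak_test$F[2]
p_value   <- weak_test$`Pr(>F)`[2]
c(F = f_partial, p = p_value)
```

With a single instrument, this partial F-value is the squared t-statistic of the instrument in the full first-stage regression.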
A much more difficult condition to test is the exogeneity condition $Cor(death\:of\:relative,\nu)=0$. When there are more instruments than endogenous explanatory variables, some sort of test (e.g. the Sargan test) is possible. However, since in our specific case there is only one instrument for an endogenous variable, it is actually mathematically impossible to test the exogeneity condition (Kennedy 2008 p. 144).
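Why the condition cannot be tested with a single instrument can be illustrated with a small simulation. In the just-identified case the slope of the IV estimator is $\widehat{Cov}(z,y)/\widehat{Cov}(z,x)$, which forces the sample covariance between the instrument and the IV residuals to zero, even when the instrument is in truth correlated with the error term. The sketch below uses hypothetical simulated variables (z, x, y, nu) and deliberately violates exogeneity.

```r
# Simulated data with an invalid instrument; all names are hypothetical
set.seed(3)
n  <- 2000
nu <- rnorm(n)                          # structural error term
z  <- 0.5 * nu + rnorm(n)               # instrument correlated with nu
x  <- 0.8 * z + 0.5 * nu + rnorm(n)     # endogenous regressor
y  <- 1 + 2 * x + nu                    # true slope is 2

# Simple IV estimator with one instrument
b1_iv  <- cov(z, y) / cov(z, x)
b0_iv  <- mean(y) - b1_iv * mean(x)
res_iv <- y - b0_iv - b1_iv * x

cov(z, nu)      # non-zero by construction, but unobservable in practice
cov(z, res_iv)  # zero up to rounding, so the violation leaves no trace
```

Since the residuals are uncorrelated with the instrument by construction, they carry no information about a violation of exogeneity. Only with more instruments than endogenous variables does something remain to be tested.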
Therefore, it is of particular importance to discuss the fulfillment of the exogeneity condition logically. The use of economic theory and a detailed argumentation are often indispensable for the justification.
Ideally, the instrument should be part of a perfectly randomized experiment, because then we could be sure that whether a relative died was completely random and independent of any confounders. Unfortunately we have no perfectly randomized experiment here. However, if the death of a relative (death_of_relative) was random or unexpected, i.e. independent of certain characteristics, this variable is a candidate for an instrument.
Unexpected or sudden deaths used to be caused frequently by illness or to occur in childhood. Thus, it is important to assume that the probability of becoming ill is randomly distributed and independent of social status or spending on health. In relation to modern times, we can say with a fair degree of certainty that this assumption is not fulfilled. For example, Demakakos et al. (2008) showed that subjective social status is positively related to almost all health outcomes. In addition, Mirowsky and Ross (2003) identify, among other things, more control over one's own life and learned effectiveness as reasons why people with higher education and associated higher social status have healthier lifestyles. Considering those findings, death_of_relative would not be classified as a suitable instrument.
Borowiecki does not discuss such reasons in detail in his paper. Rather, he lists biographical information on sudden causes of death to argue for the independence between social status and catching an illness. As a further supporting argument that death is random, he argues that the coefficient for death_of_relative in the first stage regression does not change when more variables are added (Borowiecki 2017, p. 602).
Task: Run the regression to compare the results of the two First Stage regressions. Just click check.
FirstStageShort <- felm(negemo ~ death_of_relative + age_1 + age_2 + age_3 + age_4, data = dat)
FirstStageLong <- felm(
  negemo ~ death_of_relative + age_1 + age_2 + age_3 + age_4 +
    tenure + touring + marriage_cohabitation + illness + letters_annual |
    composer + category_relationship,
  data = dat)
stargazer(FirstStageShort, FirstStageLong,
  type = "text", digits = 3, model.numbers = FALSE,
  column.labels = c("FirstStageShort", "FirstStageLong"),
  omit = c("age_1", "age_2", "age_3", "age_4", "Constant",
           "composer1", "composer2", "composer3",
           "relationship1", "relationship2", "relationship3",
           "relationship4", "relationship5", "relationship6"),
  add.lines = list(c("Age Polynomials", "YES", "YES"),
                   c("Composer FE", "NO", "YES"),
                   c("Adressee FE", "NO", "YES")))
Based on the results, we see that the coefficient for death_of_relative remains almost identical, and that neither the standard error nor the significance level changes. This means that the additional control variables and fixed effects have only a small association with the variable death_of_relative. Since Borowiecki substantiates his argumentation with meaningful figures and since we are dealing specifically with the case of three famous composers, we can reasonably say that the deaths of relatives were randomly distributed and do not influence the productivity of a composer via any other observable variable.
In summary we could already conclude from the previous regression that the death of a family member does not seem to affect output via any of the known control variables. An associated causal graph illustrates this.

However, a violation of the exogeneity assumption can still exist if the death of a family member additionally influences the artist's output via a previously unobserved link. One conceivable problematic link is the one via the variable income. The death of the composer's parents could significantly change his financial situation in the form of an income shock. Possible effects are either an increase in disposable income due to inheritance or a reduction in income due to the loss of an income-generating person in the family community.
If income is furthermore correlated with output, the exogeneity condition is not fulfilled and death_of_relative would be an invalid instrument. We could not solve this by controlling for income because we do not have the data to do so.
The described problem can be represented graphically in the form of a causal graph.

For the case of an invalid instrument, the red colored part in the causal graph describes the newly added path via which death_of_relative could exert an influence on output.
Different possibilities are conceivable to test whether the death of a relative affects income. For example, one could regress income on death_of_relative. As noted earlier, we have information on the annual income of a composer only for Mozart in the period 1781-1791 (Baumol and Baumol, 1994). Although the results are insignificant, the regression is not very informative with a sample size of 11 and is therefore not reported here. Another consideration is as follows: if the death of a relative had an effect on income, one should be able to observe that the composer talks more often about money issues in his letters, whether in a positive or negative sense. A variable for the frequency of money topics in the letters is provided by the LIWC algorithm.
Task: Regress financial_concerns on death_of_relative.
summary(felm( ___ ~ ___, data=dat))
We see that the coefficient for death_of_relative is positive, but not significant at any appreciable level. Even with the addition of various fixed effects for composer, age, and addressee, no significant effect of death_of_relative on financial concerns is found.
If we now additionally consider that no evidence of inheritances could be found in the historical records, this supports the argument that there is probably no income shock of a positive nature.
Another argument that there is probably no negative income shock is the following: Assuming that in the prevailing family model of the 18th century the man was responsible for the main part of the family income, a negative income shock should be most pronounced for the death of the father. If one re-estimates the model using death_of_relative as an instrument without the deaths of fathers, and an income shock were present, the results should differ strongly from those of the basic model. However, since Borowiecki measured effects similar to those in the basic specification, with only a slightly smaller standard error, there is probably no substantial negative income shock.
Considering the abundance of arguments, we can conclude that the death of a relative affects output only via its direct effect on negative emotions and that the exogeneity condition is not violated by substantial correlations with the error term. We therefore use death_of_relative as a robust instrument for the instrumental variable estimation.
Now that we have discussed everything in detail, we can finally run the instrumental variable regression for Model $4$.
$$output = \beta_0+\beta_1 \cdot negemo+\beta_2 \cdot tenure+\beta_3 \cdot touring+\beta_4 \cdot marriage+\beta_5 \cdot illness+\beta_6 \cdot letters\_annual+u \tag{4}$$
Task: Fill in the gaps to run the IV regression. Do not be surprised by the large number of variables in the IV regression.
# Run the IV regression of Model 4
reg4IV <- ivreg(
  ___ ~ ___ + age_1 + age_2 + age_3 + age_4 +
    composer1 + composer2 +
    relationship1 + relationship2 + relationship3 + relationship4 + relationship5 +
    tenure + touring + marriage_cohabitation + illness + letters_annual |
    ___ + age_1 + age_2 + age_3 + age_4 +
    composer1 + composer2 +
    relationship1 + relationship2 + relationship3 + relationship4 + relationship5 +
    tenure + touring + marriage_cohabitation + illness + letters_annual,
  data = dat)
Task: Click check to show a comparison of the discussed models.
# Already estimated models and additional IV regression for Model 3
reg1 <- lm(output ~ negemo, data = dat)
reg2 <- lm(output ~ negemo + tenure, data = dat)
reg3 <- felm(output ~ negemo | composer, data = dat)
reg4 <- felm(
  output ~ negemo + age_1 + age_2 + age_3 + age_4 +
    tenure + touring + marriage_cohabitation + illness + letters_annual |
    composer + category_relationship,
  data = dat)
reg3IV <- ivreg(output ~ negemo + composer1 + composer2 |
                  death_of_relative + composer1 + composer2, data = dat)
stargazer(reg1, reg2, reg3, reg4, reg3IV, reg4IV,
  type = "text",
  title = "Creativity and negative emotions. Model summary",
  digits = 3, model.numbers = FALSE,
  column.labels = c("Short Model (1)", "Long Model (2)", "Short Model FE(3)",
                    "OLS Model(4)", "IV Model(3)", "IV Model(4)"),
  omit = c("age_1", "age_2", "age_3", "age_4", "Constant",
           "composer1", "composer2", "composer3",
           "relationship1", "relationship2", "relationship3",
           "relationship4", "relationship5", "relationship6"),
  add.lines = list(c("Age Polynomials", "NO", "NO", "NO", "YES", "NO", "YES"),
                   c("Composer FE", "NO", "NO", "YES", "YES", "YES", "YES"),
                   c("Adressee FE", "NO", "NO", "NO", "YES", "NO", "YES")))
If we look at the columns of the table from left to right, we see the regression results of the models in the order in which we introduced them in this problemset. First, we started with the simplest case in Model $(1)$ and regressed only output on negemo. Afterwards, we thought about a possible omitted variable bias and in Model $(2)$ added tenure as another explanatory variable to get a first understanding of this problem. We addressed possible remaining endogeneity problems by integrating fixed effects and adding additional control variables in Models $(3)$ and $(4)$. Until this point we have always used the OLS estimator. In columns 5 and 6, we estimate our models with the IV estimator, since we want to remove the remaining bias caused by unobservable confounders. For this we were able to identify death_of_relative as a valid instrument.
Looking at the table from left to right, the coefficient on our variable of interest negemo became smaller and less significant. That is, until we observed a sudden difference in both the magnitude and precision of the estimate with the introduction of the IV estimator. When we compared the short and long model in the previous chapter we assumed an overestimation of the true coefficient. However, when we compare the OLS results at the beginning with the IV results at the end, we see that we underestimated the true effect. This change in the direction of the bias clearly shows that there are individual effects that point in different directions. In general, any premature interpretation of effects should be treated with caution.
Looking at the Adjusted $R^2$, we see that our added control variables and fixed effects make a substantial contribution to increasing the explanatory power of the models. Note, however, that the interpretation of (Adjusted) $R^2$ in instrumental variables regressions is not the same as in OLS regressions. This is because one of the explanatory variables $x$ is correlated with the error term $u$ and we therefore can’t decompose the variance of the outcome $y$. “IV methods are intended to provide better estimates of the ceteris paribus effect of $x$ on $y$ when $x$ and $u$ are correlated; goodness of fit is not a factor” (Wooldridge 2020, p. 505)
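That goodness of fit is not the objective of IV estimation can be seen directly: OLS minimizes the residual sum of squares by construction, so any other estimator, including IV, produces an $R^2$ that is no larger and may even be negative. The following sketch uses simulated, hypothetical variables (u, z, x, y) and a hand-computed simple IV estimator rather than ivreg.

```r
# Simulated data; all names and values are hypothetical
set.seed(4)
n <- 1000
u <- rnorm(n)                       # error term
z <- rnorm(n)                       # valid instrument
x <- 0.3 * z + 0.8 * u + rnorm(n)   # regressor correlated with u
y <- 1 + 0.5 * x + u                # true slope is 0.5

# OLS fit and its R^2
ols    <- lm(y ~ x)
r2_ols <- summary(ols)$r.squared

# Simple IV estimator and R^2 computed from the IV residuals
b1_iv  <- cov(z, y) / cov(z, x)
b0_iv  <- mean(y) - b1_iv * mean(x)
ssr_iv <- sum((y - b0_iv - b1_iv * x)^2)
sst    <- sum((y - mean(y))^2)
r2_iv  <- 1 - ssr_iv / sst

c(OLS_slope = unname(coef(ols)["x"]), IV_slope = b1_iv,
  R2_OLS = r2_ols, R2_IV = r2_iv)
```

The IV fit trades explanatory power for an estimate of the slope that is not contaminated by the correlation between x and u.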
The final results, which Borowiecki refers to in his paper, can be examined in the last column. We get significant results for almost all explanatory variables. If we look closer at the explanatory variables, they also have the direction we expect. A fixed position (tenure) has a negative effect on the output, since the composer must pursue other tasks and has less time for creative work. Similarly, marriage_cohabitation has a negative effect on the output of artists, since the composer is likely to invest more time in the relationship and probably has a high amount of positive emotions. The number of letters (letters_annual) is positively related to the output of the composers, which is probably due to advertising reasons.
What is the interpretation of the IV estimator of our variable of interest negemo? Solve the following quiz to find out.
Quiz: Please choose the most suitable word that fills the gap. An increase in the average proportion of negative emotions in the total word count by one percentage point, _ an increase in the annual output of outstanding compositions by 2.5 (2.537) pieces in the following year.
1: leads to 2: is correlated with 3: is caused by
# Run line to answer the quiz above
answer.quiz("Causal Interpretation")
So now we can do what we've been waiting to do all exercise: Interpret the effect of negative emotions on creativity causally. "leads to" is a popular phrase for the causal interpretation and in this case also - in contrast to "is caused by" - gives the right direction. The causal interpretation is enabled by identification, discussion, and methodological remediation of bias problems.
We get a better sense of the size of the effect if we relate it to the average values. Note that, unlike in the paper, we use the average values of the actual observations used and not those of the entire replication dataset (for more information see the appendix). The percentages may therefore differ slightly.
Task: If the average proportion of negative emotions in the total word count increases by 0.1 percentage points (approximately 9.35 percent of its mean), by how much (in percent of the mean value) does the annual output of outstanding compositions increase in the following year?
___/mean(dat$output)*100
Task: By what percentage must negative emotions increase compared to the mean for the composer to create one additional piece in the next year?
(1/2.537)/mean(___)*100
So far we have talked about negative emotions in general, but they can be further subdivided. Borowiecki used the LIWC dictionary, which will be discussed in more detail in the next task, to collect data for the subgroups sadness, anger and anxiety. He then performed three IV regressions similar to the one used in Model $4$, in each case with the difference that negemo was replaced with anxiety, anger or sadness (Borowiecki 2017, p. 602). In the following chunk, the table of results is replicated to identify the negative emotion that contributes most. The previously introduced instrument death_of_relative remains identical for all regressions.
Task: Fill in the gaps with the appropriate subcategories by which negemo was replaced.
dat <- readRDS("composer_data_reduced.rds")
Stage1Anx <- felm(
  anxiety ~ death_of_relative + age_1 + age_2 + age_3 + age_4 +
    tenure + touring + marriage_cohabitation + illness + letters_annual |
    composer + category_relationship,
  data = dat)
AnxietyIV <- ivreg(
  output ~ ___ + age_1 + age_2 + age_3 + age_4 +
    composer1 + composer2 +
    relationship1 + relationship2 + relationship3 + relationship4 + relationship5 +
    tenure + touring + marriage_cohabitation + illness + letters_annual |
    death_of_relative + age_1 + age_2 + age_3 + age_4 +
    composer1 + composer2 +
    relationship1 + relationship2 + relationship3 + relationship4 + relationship5 +
    tenure + touring + marriage_cohabitation + illness + letters_annual,
  data = dat)
Stage1Anger <- felm(
  anger ~ death_of_relative + age_1 + age_2 + age_3 + age_4 +
    tenure + touring + marriage_cohabitation + illness + letters_annual |
    composer + category_relationship,
  data = dat)
AngerIV <- ivreg(
  output ~ ___ + age_1 + age_2 + age_3 + age_4 +
    composer1 + composer2 +
    relationship1 + relationship2 + relationship3 + relationship4 + relationship5 +
    tenure + touring + marriage_cohabitation + illness + letters_annual |
    death_of_relative + age_1 + age_2 + age_3 + age_4 +
    composer1 + composer2 +
    relationship1 + relationship2 + relationship3 + relationship4 + relationship5 +
    tenure + touring + marriage_cohabitation + illness + letters_annual,
  data = dat)
Stage1Sad <- felm(
  sadness ~ death_of_relative + age_1 + age_2 + age_3 + age_4 +
    tenure + touring + marriage_cohabitation + illness + letters_annual |
    composer + category_relationship,
  data = dat)
SadnessIV <- ivreg(
  output ~ ___ + age_1 + age_2 + age_3 + age_4 +
    composer1 + composer2 +
    relationship1 + relationship2 + relationship3 + relationship4 + relationship5 +
    tenure + touring + marriage_cohabitation + illness + letters_annual |
    death_of_relative + age_1 + age_2 + age_3 + age_4 +
    composer1 + composer2 +
    relationship1 + relationship2 + relationship3 + relationship4 + relationship5 +
    tenure + touring + marriage_cohabitation + illness + letters_annual,
  data = dat)
stargazer(Stage1Anx, AnxietyIV, Stage1Anger, AngerIV, Stage1Sad, SadnessIV,
  style = "default", type = "text",
  title = "Creativity gains by type of negative emotion",
  column.labels = c("Anxiety First Stage", "Output Anxiety",
                    "Anger First Stage", "Output Anger",
                    "Sadness First Stage", "Output Sadness"),
  omit = c("age_1", "age_2", "age_3", "age_4", "Constant",
           "composer1", "composer2", "composer3",
           "relationship1", "relationship2", "relationship3",
           "relationship4", "relationship5", "relationship6"),
  add.lines = list(c("Age Polynomials", "YES", "YES", "YES", "YES", "YES", "YES"),
                   c("Composer FE", "YES", "YES", "YES", "YES", "YES", "YES"),
                   c("Adressee FE", "YES", "YES", "YES", "YES", "YES", "YES")))
We see that the instrumental variable shows a positive significant effect in each of the first-stage regressions. The effect is strongest in the first-stage regression with sadness. Furthermore, only in the instrumental variable regression with sadness is a significant effect observable. The p-values for anxiety and anger are smaller than $0.12$, whereas the p-value for sadness equals $0.052$. As Borowiecki summarizes, his results are in line with psychological research by Monroe et al. (2009), who find a strong link between sadness and depression, and Andreasen (2005), who finds that depression leads to increased creativity. It is also consistent with psychological research that, in the first-stage regression for sadness, the number of letters written per year has a significant negative effect (i.e. fewer letters are associated with greater sadness), because Goleman (2012) lists isolation and withdrawal as coping mechanisms for sadness.
In summary, we find an effect for sadness. It is worth adding, however, that just because anxiety and anger are not significant at any of the usual levels does not mean that we find nothing. If we look at the point estimates, they are almost twice as large compared to sadness.
We have already successfully studied the relationship between emotions and creativity. But it is always of interest how the author has obtained the variables with which he performs his analysis. Since the emotion data were generated in a special way from the composers' letters, in the following section we pay special attention to the generation of these so-called LIWC variables. LIWC stands for Linguistic Inquiry and Word Count. With the help of LIWC, we now perform text mining, that is, we attempt to extract information from unstructured text data with computer-based methods.
It is an analysis method that is especially recognized in psychology for answering questions of personality and social psychology. It is also used in literary studies, for example, to examine the emotional dimension of texts. The name already suggests that it is an automated word-by-word analysis. The basis for this is a lexicon designed by experts (Flüh, 2019). In the following, the generation of LIWC variables is explained using the variable negemo as an example.
Basically, the variable is simply composed of the total number of words wc and the number of words associated with negative emotions. negemo expresses the occurrence of negative emotions as a percentage of the total number of words. Mathematically expressed:
$$negemo = \frac{number\:of\:negative\:words}{wc} \cdot 100$$
This means that as the relative occurrence of words reflecting negative emotions increases, a higher score on the negative emotion scale is observed. The procedure can be exemplified by a letter from Liszt by clicking on the next info box.
# Run for additional info in the Viewer pane
info("Example letter")
The number of words wc can be determined very easily. But how and according to what criteria does the program decide whether a word is counted as having a negative connotation? This is where the already mentioned dictionary comes into play.
The lexicon contains a predefined categorization of words. Words or whole word stems are thus assigned to one or more categories (so called LIWC dimensions) and the individual text, or in our case the composers' letters, are matched word for word with the implemented LIWC lexicon via an algorithm. Thus, if in the LIWC lexicon the word afraid is categorized as a negative emotion, for the previous example this means that the variable number of negative words increases by one unit.
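The matching procedure can be sketched in a few lines of base R. The sentence and the mini lexicon below are made up for illustration and are not taken from the letters or from the LIWC lexicon. The indicator is expressed here in percent.

```r
# Hypothetical sentence and mini lexicon, for illustration only
letter <- "I am afraid the journey has ruined my health, but I am glad to hear from you."
negative_words <- c("afraid", "ruined", "ache")

# Remove punctuation, convert to lower case and split into words
words <- strsplit(tolower(gsub("[[:punct:]]", "", letter)), "\\s+")[[1]]

wc         <- length(words)                   # total word count
n_negative <- sum(words %in% negative_words)  # words found in the lexicon
negemo     <- n_negative / wc * 100           # share of negative words in percent

c(wc = wc, n_negative = n_negative, negemo = negemo)
```

Each word that appears in the lexicon raises the numerator by one, while every word of the letter counts towards the denominator.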
The underlying lexicon can be changed, and a custom categorization can also be used. However, it is important to note that the choice of lexicon is crucial for the internal reliability and external validity of LIWC.
In the following exercises, we will model the analysis with the package quanteda using some letter examples.
We have already clarified how the emotion indicators are composed. In this exercise, we will go one step deeper and look at how exactly the words are assigned. For this we will use the content analysis tool quanteda. Since automated content analysis is a word-by-word analysis that cannot look at the "big picture", the analysis basically requires two components that are compared with each other. These are on the one hand the corpus and on the other hand the lexicon. A corpus is a collection of texts with the corresponding metadata. In our case these are the letters.
The creation of a corpus allows us to combine letters from different sources into one object and to provide it with metadata. In the further course of the analysis, we no longer work with an R data frame but with a quanteda corpus. The disadvantage is that many previously learned methods cannot simply be applied to the corpus class in the same way. However, especially when many texts are to be analyzed, the corpus class is an advantage. The lexicon gives us the definitional framework. Different lexicons exist, which are generated in different ways, are thematically specialized in different ways, and contain a different number of words. Hence, if you use a different lexicon, you will most likely get a different result.
Not only the lexicon but also the design of the corpus has a decisive influence on the results. In the further course of the application we encounter the key parameters that can be adjusted and also understand why automated content analysis must be discussed critically in certain points. Let us first start with the creation of the corpus.
For the creation of the following tasks, the detailed documentation on "Automated Content Analysis with R" by Cornelius Puschmann (2021) was very helpful in understanding automated text analysis. First we load the necessary packages and the data we want to work with. In this case, these are two example letters for each composer, which are available as txt files and also occurred in already analyzed form in our input dataset. The example letters are taken from the online appendix of the paper and have been stored in an Rds file for simplicity. The number of the letter in combination with the name of the composer serves as a unique identifier, which is why the two columns numberofletter and composer were added manually for this small dataset.
Task: Just press check and load the data.
# Just press check and load the data.
library(quanteda)
library(readtext)
library(tidyverse)
library(quanteda.textplots)
dat_letters <- readRDS("dat_letters.Rds")
dat_letters_named <- dat_letters %>%
  mutate(numberofletter = c(109, 15, 324, 348, 36, 8),
         composer = c("Beethoven", "Mozart", "Liszt", "Liszt", "Mozart", "Beethoven"))
dat_letters_named
Our data set contains six individual letters. By default, the file name is stored in doc_id and the file content in text for the imported data. The information from our two manually added columns numberofletter and composer can also be read off the file name, i.e. doc_id.
Now we can start with the creation of the corpus. The data type 'corpus' helps us to structure the letters clearly. It is very useful, because we usually analyze not only a handful of letters, but a multitude.
Task: Create a corpus object with the function corpus(). Set the document identifier docid_field to a variable that uniquely describes the letters. It is sufficient that the variable uniquely identifies the letters in our small sample. Do not use the filename. Replace the ___ with the correct variable name.
# Replace ___ with the correct variable name and press check
corpus <- corpus(dat_letters_named, docid_field = ___)
corpus.stats <- summary(corpus)
corpus.stats
The summary overview shows us the new variables Types, Tokens and Sentences. Types tells us the number of unique words in the letter. Tokens gives us the total number of words including repetitions, and Sentences tells us how many sentences the letter contains. Here the concept of a word is not yet applied strictly: a punctuation mark such as a comma also counts as a word.
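The difference between Types and Tokens can be made concrete with a hand-made token vector. This is only a sketch: the tokens are hypothetical, and the quanteda tokenizer handles many more cases than this example.

```r
# Hypothetical tokens; punctuation marks count as tokens here
tokens <- c("dear", "friend", ",", "my", "dear", "friend", ",", "write", "soon", "!")

n_tokens <- length(tokens)          # all tokens, including repetitions
n_types  <- length(unique(tokens))  # distinct tokens only

c(Tokens = n_tokens, Types = n_types)
```

The ten tokens contain only seven types, because dear, friend and the comma each occur twice.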
In the last step, tokenization, the previously created corpus is split up to the word level and cleaned up even further. Tokenization is necessary for the analysis with a lexicon. Through tokenization, syntax information is lost, but this is not a problem for the word-by-word analysis, since the words are analyzed individually and no sentence structure is examined.
Task: Insert the correct boolean values TRUE or FALSE into the ___. Our corpus should be as close as possible to the one used for the original data. Therefore remove only punctuation marks and symbols but no numbers. Additionally store the number of words per letter in the variable letterlength. The function ntoken() is helpful for this.
# Fill in the gaps.
corpus.tokens.removed <- corpus %>%
  tokens(verbose = TRUE, remove_numbers = ___, remove_punct = ___, remove_symbols = ___) %>%
  tokens_tolower()
# Assign the number of words in each letter to letterlength.
letterlength <- ntoken( ___ )
Besides removing numbers, dots and symbols, you can also remove filler words. It is worth noting at this point that the number of words per letter changes depending on how one determines which characters are to be included. This in turn affects the emotion indicator, as the denominator changes accordingly when more or fewer words are included in the corpus.
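How strongly the cleaning choices can move the indicator is easiest to see in a tiny example. The text, the one-word lexicon and the helper function negemo_share() below are hypothetical, and base R regular expressions stand in for the quanteda options.

```r
# Hypothetical text and lexicon, for illustration only
letter <- "On 3 May 1787 I felt sad , truly sad ."
raw <- strsplit(tolower(letter), "\\s+")[[1]]
negative_words <- c("sad")

is_punct  <- grepl("^[[:punct:]]+$", raw)   # tokens consisting of punctuation only
is_number <- grepl("^[0-9]+$", raw)         # tokens consisting of digits only

keep_numbers <- raw[!is_punct]               # remove punctuation only
drop_numbers <- raw[!is_punct & !is_number]  # remove punctuation and numbers

# Share of negative words in percent for a given token vector
negemo_share <- function(tok) sum(tok %in% negative_words) / length(tok) * 100

c(with_numbers = negemo_share(keep_numbers),
  without_numbers = negemo_share(drop_numbers))
```

The number of negative words is the same in both variants, but removing the two numbers shrinks the denominator from nine to seven words and thereby raises the indicator.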
There are two ways to create the dictionary. Either you fill it with your own selection of words directly in R or you load an already pre-categorized lexicon. The LIWC lexicon used in the analysis of the paper is unfortunately not publicly available. However, the German version of LIWC is available for academic use upon request.
Since this problemset is written in English and the goal is to make the analysis comprehensible to all users, it makes little sense to use the German dictionary to analyze the original German version of the letters. As an alternative, based on the letters of the paper provided in the online appendix, I created a reduced English version of the LIWC lexicon, which at least for our reduced version of the corpus provides the same emotion indicators as in dat. For this, I simply packed the appropriately marked words from the sample letters in the online appendix into two vectors. One contains all words associated with positive emotions and the other contains all words associated with negative emotions.
Which of the vectors sounds more positive to you? Of course, this question is subjective, but in this case it is probably quite clear.
Task: Add the terms positive and negative to the blank spaces ___ in the correct place.
# Add the terms positive and negative to the blank spaces in the correct place.
# Which emotion vector sounds more positive to you?
my.dictionary <- dictionary(list(
  ___ = c("well", "lively", "happy", "glad", "praised", "kiss", "dear", "friend",
          "pleasure", "good", "better", "brilliant", "advantage", "faithfull",
          "admirer", "friendly", "friends", "credited", "fine", "affectionate",
          "grateful"),
  ___ = c("ache", "afraid", "mad", "scarcely", "ruined", "bothers", "burdens",
          "cry", "disagreeable")))
my.dictionary
Congratulations, you have created your own lexicon. Theoretically, an even finer categorization is possible here. You could subdivide the negative emotions into terms that are associated with sadness, anger or anxiety. For didactic reasons, however, we will leave it at that for now.
In the last step, a Document Feature Matrix (DFM) is created. The DFM contains information about the absolute frequency of words in the individual texts and thus allows the texts to be compared with each other. If only the DFM of a corpus is formed, it indicates how often each word of the entire corpus occurs in the individual letters. If one builds the DFM in combination with a lexicon, as in the next code chunk, the lexicon is applied step by step to the corpus. All mentions of the terms occurring in the lexicon are replaced by their category. What remains is a table that returns the number of positively and negatively connotated terms for each of our letters.
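The idea of a DFM combined with a lexicon can be imitated in base R. The two tokenized documents, the mini lexicon and the helper count_category() below are hypothetical and only illustrate the resulting table structure, not the quanteda implementation.

```r
# Hypothetical tokenized documents and mini lexicon
docs <- list(
  letter_a = c("my", "dear", "friend", "i", "am", "glad", "and", "happy"),
  letter_b = c("i", "am", "afraid", "my", "dear", "the", "ache", "bothers", "me")
)
lexicon <- list(
  positive = c("dear", "friend", "glad", "happy"),
  negative = c("afraid", "ache", "bothers")
)

# Count, for one document, how many tokens fall into each lexicon category
count_category <- function(tok) {
  sapply(lexicon, function(entries) sum(tok %in% entries))
}

# One row per document, one column per lexicon category
dfm_sketch <- t(sapply(docs, count_category))
dfm_sketch
```

Words that do not occur in the lexicon, such as my or am, simply drop out of the table.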
Task: In the first step you build the Document Feature Matrix for the corpus with the dictionary my.dictionary and store it in the variable my.dfm.sentiment. This process can take a long time depending on the size of the dictionary.
# Form Document Feature Matrix and store it in the variable my.dfm.sentiment.
___ <- dfm(corpus.tokens.removed, dictionary = ___ )
Task: In the second step you use a dplyr chain to convert the DFM into a data frame. Then, with the custom sorting of the columns and the addition of the composer names, a few cosmetic operations are performed. We have already calculated the number of words per text with ntoken() above and stored it in the variable letterlength. In addition to the names of the composers of the respective letters, we add this column to the dataset under the name wordcount. Fill in the gaps appropriately.
# Add the commands as described in the Task.
my.dfm.sentiment.df <- ___ %>%
  convert(to = "data.frame") %>%
  mutate(composer = c("Beethoven", "Mozart", "Liszt", "Liszt", "Mozart", "Beethoven"),
         ___ = ___) %>%
  select(composer, doc_id, positive, negative, wordcount)
Task: In the last step the final result is generated. Do you remember how negemo and posemo are calculated? Fill in the gaps with the appropriate variable names and then display my.dfm.sentiment.df.
my.dfm.sentiment.df <- my.dfm.sentiment.df %>%
  mutate(negemo = ___ / ___ * 100,
         posemo = ___ / ___ * 100)
my.dfm.sentiment.df
The table provides the same results for negemo and posemo as specified in the replication data set dat. The first two columns are used to uniquely identify the letter. This is followed by the absolute number of positive and negative words and a column with the total number of words in the letter. The last two columns give the emotion indicator as the proportion of positive and negative emotions in the total word count, as described in detail earlier.
Besides creating a lexicon yourself, as presented above, you can also load a predefined lexicon. We use the freely available NRC Word-Emotion Association Lexicon (also called EmoLex). The NRC Emotion Lexicon was created by Peter Turney and Saif Mohammad with the help of crowdsourcing at the National Research Council Canada. The creation and development process can be traced in Mohammad and Turney (2010) and Mohammad and Turney (2013). For example, they were able to reduce malicious word input and comprehension difficulties in the crowdsourcing process by using word choice questions.
The NRC dictionary contains the basic spectrum of psychological states of mind: anger, fear, anticipation, trust, surprise, sadness, joy, and disgust. Thus, theoretically, more in-depth emotion analyses would also be possible, similar to the LIWC. For reasons of simplicity and comparison with the previous lexicon, we perform only a sentiment analysis here, i.e., we consider only how many words associated with positive or negative emotions occur in the individual letters. Another advantage of the NRC lexicon is that it is available in different languages and therefore could be used for multilingual texts. However, we only access the English version here.
The lexicon is originally available as an Excel file. In order to be able to load it faster, I have saved it as an RDS file. We simply load it with the command readRDS(). The dictionary basically consists of a column with words and further dummy columns or dummy variables (0 the word is not associated with the emotion and 1 the word is associated with the emotion), which contain the different basic emotions and sentiments (positive,negative). Similar to the manual reading in our self-created dictionary, we want to filter all words from NRCEmotion that have positive connotations and all words that have negative connotations. We are not interested in the basic emotions for now.
Task: Complete the commands to store all the words associated with positive and negative emotions in pos.emo and neg.emo respectively. After you have entered the correct commands, the process will take a short moment because the dictionary contains over 14,000 entries.
# First the data set and our variables of the previous exercise need to be loaded again.
my.dfm.sentiment.df <- readRDS("my.dfm.sentiment.df.rds")
letterlength <- readRDS("letterlength.rds")
NRCEmotion <- readRDS("NRCEmotionLExikon.rds")
# Now complete the gaps ___
pos.emo <- ___ %>% filter(Positive == ___ ) %>% select(`English (en)`)
neg.emo <- ___ %>% filter(Negative == ___ ) %>% select(`English (en)`)
# Thirdly, the dictionary is created
nrc.dictionary <- dictionary(list(positive = pos.emo, negative = neg.emo))
nrc.dictionary
The dictionary contains 2312 positively and 3324 negatively connoted words, significantly more than our self-created dictionary. At this point, a comparison of the two dictionaries is a good way to work out differences.
Task: Click on check to view a comparative treemap of the LIWC dictionary and the NRC dictionary. The information for the LIWC dictionary is taken from Table 6 of the Online Appendix in Borowiecki's paper. The information for the NRC dictionary is taken directly from the loaded RDS file.
library(treemapify)
library(ggplot2)
library(RColorBrewer)

groupLIWC <- c("Negative emotions", "Negative emotions", "Negative emotions", "Negative emotions", "Positive emotions")
subgroupLIWC <- c("Anxiety", "Anger", "Sadness", "Other", "")
valueLIWC <- c(91, 184, 101, 123, 406)
statsLIWC <- data.frame(groupLIWC, subgroupLIWC, valueLIWC)

LIWCgraph <- ggplot(statsLIWC, aes(area = valueLIWC, fill = valueLIWC, label = subgroupLIWC, subgroup = groupLIWC)) +
  geom_treemap() +
  geom_treemap_subgroup_border(colour = "white", size = 4) +
  geom_treemap_subgroup_text(place = "centre", grow = TRUE, alpha = 0.5, colour = "black", fontface = "italic") +
  geom_treemap_text(colour = "white", place = "centre", size = 0.7, grow = TRUE) +
  ggtitle("Words per category in LIWC dictionary") +
  theme(plot.title = element_text(hjust = 0.5, size = 20)) +
  labs(fill = "Number of Words")

groupNRC <- c("Positive", "Negative", "Anger", "Anticipation", "Disgust", "Fear", "Joy", "Sadness", "Surprise", "Trust")
valueNRC <- c(sum(NRCEmotion$Positive), sum(NRCEmotion$Negative), sum(NRCEmotion$Anger),
              sum(NRCEmotion$Anticipation), sum(NRCEmotion$Disgust), sum(NRCEmotion$Fear),
              sum(NRCEmotion$Joy), sum(NRCEmotion$Sadness), sum(NRCEmotion$Surprise),
              sum(NRCEmotion$Trust))
statsNRC <- data.frame(groupNRC, valueNRC)

NRCgraph <- ggplot(statsNRC, aes(area = valueNRC, fill = valueNRC, label = groupNRC)) +
  geom_treemap() +
  geom_treemap_text(colour = "white", place = "centre", size = 0.8, grow = TRUE) +
  ggtitle("Words per category in NRC dictionary") +
  theme(plot.title = element_text(hjust = 0.5, size = 20)) +
  labs(fill = "Number of Words")

library(gridExtra)
grid.arrange(LIWCgraph, NRCgraph, nrow = 2)
The treemap graphically shows us the proportions of words per category. The coloring provides us with a rough intuition about the absolute orders of magnitude of the dictionaries. First of all, the NRC dictionary contains significantly more words than the LIWC dictionary used for the analysis of the paper. NRC contains a far greater number of terms connoted with positive and negative emotions. This suggests that subsequently the respective emotion scores will also be larger than those for the LIWC dictionary. We also see that the two dictionaries differ in structure. The LIWC uses subgroups. So Anxiety, Sadness and Anger are subcategories of negative emotions. The NRC, on the other hand, does not use this subdivision. A word that is associated with Anger may also be associated with negative emotions, but it does not have to be.
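The missing nesting in the NRC lexicon can be made visible with a simple cross-tabulation of two dummy columns. The following sketch uses a tiny invented table; the words and their 0/1 values are placeholders and are not taken from the lexicon. On the real data the same check could presumably be run on the columns NRCEmotion$Anger and NRCEmotion$Negative used in the treemap chunk.

```r
# Invented toy rows that mimic the dummy coding of the NRC table.
# The 0/1 values are illustrative only and are NOT taken from the lexicon.
toy_nrc <- data.frame(
  word     = c("word_a", "word_b", "word_c", "word_d"),
  Anger    = c(1, 1, 0, 0),
  Negative = c(1, 0, 1, 0),
  stringsAsFactors = FALSE
)

# Cross-tabulate the two dummies: with strict nesting, the cell
# Anger = 1 / Negative = 0 would have to be empty.
overlap <- table(Anger = toy_nrc$Anger, Negative = toy_nrc$Negative)
overlap

# Words flagged as Anger but not as Negative.
anger_only <- toy_nrc$word[toy_nrc$Anger == 1 & toy_nrc$Negative == 0]
anger_only
```

In a strictly hierarchical dictionary such as the LIWC version described above, every Anger word would also count as a negative emotion word, so `anger_only` would be empty.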
It is important to note that the LIWC dictionary, like the NRC dictionary, has different versions. So, depending on the use case, a dictionary with more or fewer categories can be used. I have used the expressions in the present form because with the listed LIWC dictionary the emotion scores in dat were formed, and with the present NRC dictionary I would like to point out a suitable alternative for the analysis of our letter example.
In the following chunk, a Document Feature Matrix (DFM) is again created for the NRC Dictionary. However, you do not have to enter any code for this. It is identical to the previous procedure except for the renaming of two columns. In the task only the procedure is described again. However, you can press Check directly if you have already internalized the procedure and want to continue with the comparison of the results.
Task: In the first step, the Document Feature Matrix for the corpus is built with the dictionary nrc.dictionary and stored in the variable nrc.dfm.sentiment. This process usually takes a long time due to the size of the dictionary. You will find the command that would normally be executed commented out. So that you don't have to wait unnecessarily long, I have already executed it for you in advance and saved the result in an RDS file.
In the second step, a dplyr chain converts the DFM into a data frame. After that, some cosmetic operations are performed, e.g. renaming the columns and applying a different column order. Above, we had already calculated the number of words per text with ntoken() and stored it in the variable letterlength. Next to the names of the composers we add this column to the data set under the name wordcount.
In the last step, the final result is generated by calculating the values for negemo and posemo again.
# Step 1: Build DFM
# Do not remove the # sign if you want to continue quickly.
# nrc.dfm.sentiment <- dfm(corpus.tokens.removed, dictionary = nrc.dictionary)
nrc.dfm.sentiment <- readRDS("nrc_df_sentiment.rds")
# Step 2 + 3: dplyr chain to convert nrc.dfm.sentiment into a data frame and to calculate negemo and posemo
nrc.dfm.sentiment.df <- nrc.dfm.sentiment %>%
  convert(to = "data.frame") %>%
  rename(negative = `negative.English (en)`, positive = `positive.English (en)`) %>%
  mutate(composer = c("Beethoven", "Mozart", "Liszt", "Liszt", "Mozart", "Beethoven"),
         wordcount = letterlength) %>%
  select(composer, doc_id, positive, negative, wordcount) %>%
  mutate(negemo = negative / wordcount * 100,
         posemo = positive / wordcount * 100)
You have now successfully calculated your emotion scores. Let us compare the results with each other. As a reminder, the results of the self-produced dictionary in my.dfm.sentiment.df correspond for our small sample to those of the LIWC lexicon in the underlying paper. We use the terms identically here, but since it is only a very small sample, the results are not generalizable to the LIWC lexicon.
Run the next chunk to display both tables so that we can examine the differences resulting from the use of two different lexicons.
Task: Show both Document Feature Matrix data frames by just typing their names into the chunk.
We see that the use of the lexicon for our example has no effect on whether a text contains words associated with intensified positive or negative emotions. Thus, the ratio of positive and negative emotions to each other is not reversed by the use of the lexicon.
Because the NRC lexicon contains more than 5 times (5.69) as many positive words and more than 6 times (6.66) as many negative words as the LIWC dictionary (see Comparison of the Two Dictionaries), one would expect the number of identified positive and negative words, and thus the positive and negative emotion scores for each letter, to be correspondingly greater for the NRC data set. However, this is not the case. We do identify more positive and negative words overall with NRC, but the increase is relatively small. Comparing the means of negemo and posemo, the increase from using the NRC dictionary instead of the LIWC dictionary in our example is 26.27 percent for negemo and 11.05 percent for posemo. These figures apply only to our example and are not representative, since the corpus contains only 6 letters.
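The two size ratios quoted above can be retraced from numbers that already appear in this problem set: the LIWC category sizes from the treemap chunk and the NRC word counts reported for nrc.dictionary. The variable names in this short sketch are my own.

```r
# LIWC word counts per category (Table 6 of the Online Appendix, as used in the treemap chunk).
liwc_negative <- sum(c(Anxiety = 91, Anger = 184, Sadness = 101, Other = 123))
liwc_positive <- 406

# NRC word counts as reported for nrc.dictionary.
nrc_positive <- 2312
nrc_negative <- 3324

# How many times larger is the NRC dictionary in each category?
ratio_positive <- nrc_positive / liwc_positive
ratio_negative <- nrc_negative / liwc_negative

round(c(positive = ratio_positive, negative = ratio_negative), 2)
```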
So, in summary, how can we evaluate the analysis with emotion scores?
It is easy to see that the use of the dictionary makes a big difference. How many words a dictionary contains and how they are categorized is relevant. Besides the dictionaries discussed here, there are also dictionaries with differentiated weighting of words. For example, a word can be perceived particularly positively or negatively and take on a number between 0 and 5 or similar. But also the choice of the corpus makes a difference. If I include numbers or location information in my corpus, the total number of words is larger. The emotion scores are correspondingly smaller. This may not make much difference when analyzing large texts. However, the smaller the texts to be analyzed, the greater the difference caused by this simple definitional categorization.
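The effect of the corpus definition on the score is plain arithmetic, and a small sketch shows why it matters most for short texts. All counts below are invented; only the formula for negemo follows the problem set.

```r
# Emotion score as used throughout: share of negative words in percent.
negemo <- function(negative, wordcount) negative / wordcount * 100

# Hypothetical short letter: 3 negative words, 40 tokens after cleaning,
# plus 10 tokens that are numbers or place names (all values invented).
score_cleaned <- negemo(3, 40)       # numbers and places removed: 7.5
score_raw     <- negemo(3, 40 + 10)  # numbers and places kept: 6

# Hypothetical long letter with the same 10 extra tokens.
long_cleaned <- negemo(30, 4000)
long_raw     <- negemo(30, 4000 + 10)

# Drop in the score caused by keeping the extra tokens.
c(short = score_cleaned - score_raw, long = long_cleaned - long_raw)
```

The same ten additional tokens lower the score of the short letter by 1.5 percentage points, while the score of the long letter barely changes.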
In a more general sense, it can also be stated that the emotion score clearly simplifies the complexity of actual emotions. When a word is assigned to a category, its dominant sense usually prevails. A word-by-word analysis completely leaves out the context in which the words are used, although context plays an important role in reality. Moreover, whether the categorization is done by a panel of experts (LIWC) or by crowdsourcing (NRC), there may be unconscious errors in the assignment that cannot be eliminated. We are all socioculturally biased and differ in our perceptions of emotional association, as critically reflected by Mohammad (2020).
Even though in the previous two paragraphs there was increased criticism of automated text analysis, especially of the different dictionaries, there are also arguments for the use of different dictionaries. It is true that different dictionaries do not produce exactly the same results. However, this does not necessarily have to be a disadvantage. Maybe it is necessary to use different dictionaries for different research questions. Just think of the temporal perspective. Language is always changing. We use different terms than the generations before and after us. Depending on the time from which the texts originate, we therefore need updated dictionaries. In addition, the medium through which we communicate also plays an important role. Especially in the 21st century, the variety of communication media is staggering. Certainly, the style and variation of emotionality in your language differs when you communicate via mail, messenger service or letter. The use of application-specific dictionaries therefore represents an opportunity rather than an obstacle.
In summary, text analysis is probably significantly underestimated in economics, yet it offers a way to generate information from previously unused data. As we have already seen in the chapter Examining Emotions, matching with biographical data suggests that the emotion scores are very valid for our case. However, it is important to use a large number of texts to draw inferences. Strictly speaking, we should also always speak of "the use of negative words increased" or "the negative emotions score increases" instead of "negative emotion increases" when interpreting the results. Since automated text analysis methods cannot yet extract all the information from texts or capture the full complexity of language, but obviously have great potential, further improvement of the method, e.g. using neural networks, is worthwhile.
Creativity was the focus of this problem set and was the variable of interest. It therefore makes sense to look at ways of becoming creative in R in order to improve not only one's own creativity but also one's R knowledge.
This concluding exercise is inspired by the work of Antonio Sánchez Chinchón. On his website he shows many different ways to draw pictures with R.
I present one of them here. The basis is an arbitrary png file, which generates an abstract image in combination with a machine learning algorithm.
To create the graph, I only modified Chinchón's code slightly by changing the parameter settings. Feel free to unfold the info table to see the code I used.
As already mentioned, the author used a machine learning algorithm to draw those abstract paintings. For this, he loads the original image and binarizes it into a black and white image. Then a random sample of black points is drawn. Now the machine learning algorithm (more precisely, a hierarchical clustering algorithm) comes into play. It measures the Euclidean distance between each pair of points. The points are connected in a bottom-up process such that the Euclidean distance between the merged clusters is minimal. Initially, each point is treated as its own cluster; the clusters are then joined along the resulting dendrogram (a hierarchical tree diagram) until we arrive at the minimum number of clusters, namely a single cluster containing the entire sample. So, with N observations at the beginning, we have N clusters, which initially contain only one point each (N in this case is 6000). Now the stepwise combination begins. In each step, the two closest clusters are merged according to a given criterion, which in our case is the smallest Euclidean distance. Then the two closest clusters are searched for again and combined into one cluster. This process is repeated until we have only one large cluster of size N. Finally, the cluster is drawn with ggplot and geom_curve.
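The clustering step described above can be sketched with base R alone. This is not Chinchón's code: the random points stand in for the sampled black pixels, and the choice of single linkage is my assumption for "smallest Euclidean distance between clusters".

```r
set.seed(1)

# Stand-in for the sampled black pixels: 200 random points in the unit square.
# (The text uses 6000 points sampled from a binarized image.)
points_xy <- data.frame(x = runif(200), y = runif(200))

# Pairwise Euclidean distances between all points.
distances <- dist(points_xy, method = "euclidean")

# Agglomerative (bottom-up) clustering. Single linkage is an assumption here:
# it merges the two clusters whose closest members are nearest to each other.
tree <- hclust(distances, method = "single")

# One merge per step: N points need N - 1 merges to end in a single cluster.
n_merges <- nrow(tree$merge)
n_merges

# The dendrogram can be cut at any level, e.g. into 5 clusters.
membership <- cutree(tree, k = 5)
table(membership)
```

Each row of `tree$merge` records one combination step, which is the information a drawing routine would need to connect points or clusters with curves.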
Task: To see the result, just click edit and then check in the following chunk.
# Just click check to see a nice picture, generated by using the method described above.
picture <- load.image("PictureSol.png")
plot(picture, axes = FALSE)
Do you recognize who is represented here? Maybe it helps to squint your eyes and look at the picture from different angles. The original painting, drawn by Joseph Karl Stieler can be found here.
You can take a look at the underlying code by clicking on the info block.
# Run for additional info in the Viewer pane
info("License and code")
The aim of my thesis was to shed more light on interdisciplinary creativity research by presenting the analysis of the causal effect of negative emotions on the creativity of three world-renowned artists.
We started with a graphical analysis of the results to show that it is of interest to investigate a possible relationship between emotions and output. After explaining the necessary basics, we were then able to solve endogeneity problems with the help of different methods. After comparing the results of different models, we could see that the type of method has a great influence on the final result. In particular, the use of an appropriate instrumental variable estimator in our last model ensured that we measured a positive significant effect of negative emotions on creativity.
In numbers, we can now say that an increase of one percentage point in the average proportion of negative emotion words in the total word count leads to an increase in the annual output of outstanding compositions by 2.5 pieces in the following year. Putting the numbers in relation, an artist's negative emotion score would have to increase by 36.85 percent compared to the average for the artist to compose one additional piece in the following year.
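The link between the two figures is simple arithmetic. The sketch below retraces it; note that the average negemo shown at the end is only back-calculated from the two numbers quoted above and is not read from dat.

```r
# Additional pieces per one-percentage-point increase in negemo.
effect_per_pp <- 2.5

# Increase in negemo (in percentage points) needed for one additional piece.
needed_pp <- 1 / effect_per_pp   # 0.4

# The text states that this equals 36.85 percent of the average negemo.
relative_increase <- 0.3685

# Average negemo implied by the two figures (back-calculated, not taken from dat).
implied_mean_negemo <- needed_pp / relative_increase
round(implied_mean_negemo, 3)
```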
In addition to this interesting main result, we were able to confirm other effects that we had already suspected. For example, the number of letters written in a year and the fact that an artist is currently on tour have a positive effect on output. On the other hand, permanent employment and marriage have a negative effect on emotions, because the artists do not (can not) fully devote themselves to their passion, composing.
If we take a closer look at which negative emotion can best explain the increase in creativity, it is sadness. This is in line with psychological studies that hold depression responsible for increased creativity.
All results found are based on the application of text analysis methods, which are still underrepresented in economic research. They allow us to compute valid emotion scores for the composers. In order to further spread the awareness of this method, the approach using the R package quanteda was documented in detail and its chances and risks were discussed.
Although a certain degree of measurement error cannot be excluded, the application of this method has great potential and should be used to conduct further economic research in addition to research already done, e.g. on the influence of news on stock prices (Boudoukh et al., 2013). It is conceivable, for example, to use the method to analyze the dynamics of speeches in the Bundestag, to analyze expert interviews and newspaper articles in order to assess the economic situation, or to determine the effectiveness of selected advertising texts.
Likewise, the method itself has great potential for development. Using machine learning, not only the categorization within the dictionaries could be done automatically, but also the complexity of language itself could be better captured by extending the matching of corpus and dictionary with network analysis techniques.
This problem set combines parts from the fields of psychology, literary studies, statistics, and data science under the umbrella of economics. I hope that its didactic nature will help others to prepare research results in a similar fashion and will provide additional creative incentives to engage, across disciplines, with a major driver of our prosperity.
In a final task you can run the following code chunk to show all the awards you have collected during the problem set. In total, there were 7 awards to be earned.
awards()
While working with the original replication dataset, I noticed some points that I would like to discuss below. As mentioned in the problem set, the dataset basically consists of 2 parts: the letter data (generated via LIWC) and the life data (generated via Grove Music Online) of the composers. For various reasons, datasets may not be entirely complete, and this can be more or less problematic. In our case, we cannot use observations when one of the important groups of variables, the emotion data collected via the LIWC algorithm or the background information (more precisely the output variable), is missing. This results in a total of 3 possible cases: either the LIWC variable is missing, or the output variable is missing, or both variables are missing. Theoretically, the problem could also be reduced to two cases, but we keep the 3 cases for the sake of clarity.
Regardless, there also exists an observation that contains letter numbering but no values for the emotion data. So the letter seems to exist, but cannot be evaluated by the LIWC algorithm. Unfortunately the exact reasons cannot be reliably determined because the letters are not available in pure form. From a computational and practical point of view, the LIWC algorithm could only fail to evaluate the letters if they did not contain a single word. However, sending an empty letter seems very unlikely. Another plausible reason could be that the letter was written in a language or a special encoding that was not translatable and therefore the LIWC algorithm could not perform a meaningful evaluation. The letter in question was written by Mozart, who was known to make strange jokes, perhaps also in his letters.
The replication dataset is reduced by the above observations, since they are not relevant for the results and only unnecessarily inflate our dataset. Differences compared to the original replication dataset only appear in the descriptive summary tables. At this point, it is debatable whether data that are ignored in the final econometric analysis anyway should play a role in the summary table (e.g., in the calculation of averages and standard deviations). From my perspective, including these dispensable observations would have two crucial disadvantages. First, it artificially inflates the number of observations: at first glance, the reader is led to believe that the data set is almost 30% larger (in terms of the number of observations) than it actually is. Second, the inclusion does not add any value from a didactic perspective and possibly distracts from the main points.
Additionally, the column age is renamed to age_1. This has the background that with stargazer() no listing of the variable marriage_cohabitation would be possible due to the construction of the omit argument.
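The reason for the renaming becomes clear once the omit argument is read as a regular expression that is matched against the variable names, which is how I understand the stargazer documentation. The following sketch imitates that matching with grepl(); the variable list is shortened and partly illustrative.

```r
# Shortened, partly illustrative list of variable names.
variables <- c("age", "marriage_cohabitation", "age_1", "tour")

# As a regular expression, "age" also matches inside "marriage_cohabitation".
matches_age <- grepl("age", variables)

# After renaming the column, the pattern "age_1" matches only the renamed variable.
matches_age_1 <- grepl("age_1", variables)

data.frame(variables, matches_age, matches_age_1)
```

An alternative that avoids the renaming would be an anchored pattern such as "^age$", assuming the omit argument passes it through as an ordinary regular expression.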
A total of 428 rows are removed from the original replication dataset. The dataset thus shrinks from 1860 to 1432 observations. The manually removed rows could be due to problems in the join process. However, as mentioned above, this does not matter for the final results, since rows with NA values are automatically removed by the regression commands. The number of observations used to generate the main results in Table 4 of the paper is also 1432. The R code used to clean the dataset is shown below.
# Run for additional info in the Viewer pane
info("composer_data_reduced")
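The row counts mentioned above, and the statement that the original data set looks almost 30% larger than the data actually used, can be checked with a few lines of arithmetic. The variable names are my own.

```r
rows_original <- 1860
rows_reduced  <- 1432

# Number of rows dropped during cleaning.
rows_removed <- rows_original - rows_reduced

# How much larger the original data set appears, relative to the reduced one (in percent).
inflation_pct <- (rows_original / rows_reduced - 1) * 100

rows_removed
round(inflation_pct, 1)
```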
Another conspicuous feature of the original replication data set is that the letters of the composer Liszt are not consistently uniquely identified. Normally, the unique identifier is a combination of the composer's name 'composer' and the letter number 'numberofletter'. In Liszt's case, however, there are about 260 duplications (some of them triplications) in the column numberofletter. The following R command creates a table in which the problem becomes apparent when comparing rows 2 and 3 as well as rows 6 and 7 with the other rows.
composer_data_reduced <- readRDS("composer_data_reduced.rds")
Liszt <- composer_data_reduced %>%
  filter(is.na(numberofletter) == FALSE) %>%
  arrange(numberofletter) %>%
  select(composer, numberofletter, letters_annual, wc, posemo, negemo)
head(Liszt, 8)
However, this problem only affects the numberofletter column and thus the unique indexing of the letters. If you look at the duplicates, you will quickly notice that they are different letters even though they carry the same number. The corresponding records are neither modified nor removed, since the incorrect indexing has no further consequences. Only the assignment of a letter to its evaluation in the data set is no longer possible. Since we do not have the actual letters, except for a small excerpt, this is not a big problem. For the sake of completeness, however, the issue is pointed out here.
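Duplicated letter numbers of this kind can be located with duplicated(). The sketch below runs on a small invented data frame; only the column names follow the table above, and the values are placeholders.

```r
# Invented toy rows; only the column names follow the table above.
toy_letters <- data.frame(
  composer       = "Liszt",
  numberofletter = c(1, 2, 2, 3, 4, 4, 4),
  wc             = c(210, 95, 340, 120, 80, 150, 60),
  stringsAsFactors = FALSE
)

# Flag every row whose letter number occurs more than once
# (checking from both ends so the first occurrence is flagged too).
is_dup <- duplicated(toy_letters$numberofletter) |
  duplicated(toy_letters$numberofletter, fromLast = TRUE)

# How often each ambiguous letter number occurs.
dup_counts <- table(toy_letters$numberofletter[is_dup])

toy_letters[is_dup, ]
dup_counts
```

The differing word counts within the same letter number mirror the observation above that the duplicates are different letters that merely share a number.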
The reduced dataset composer_data_reduced.rds and the original replication dataset composer_data.dta provided by Borowiecki can be found in the corresponding folder on GitHub.
Andreasen, N. C. (2005): The creating brain: The neuroscience of genius. Dana Press.
Auer, B.; Rottmann, H. (2020): Statistik und Ökonometrie für Wirtschaftswissenschaftler. Eine anwendungsorientierte Einführung. 4th Edition. Wiesbaden: Springer-Gabler. ISBN: 978-3-658-30136-1
Borowiecki, K. J. (2017): How Are You, My Dearest Mozart? Well-Being and Creativity of Three Famous Composers Based on Their Letters. The Review of Economics and Statistics, 99 (4): 591–605. DOI: https://doi.org/10.1162/REST_a_00616
Boudoukh, J., Feldman, R., Kogan, S., & Richardson, M. (2013): Which news moves stock prices? A textual analysis (No. w18725). National Bureau of Economic Research. URL: https://www.nber.org/papers/w18725
Chen, B.; Pearl, J. (2014): Graphical Tools for Linear Structural Equation Modeling. Technical Report. URL: https://apps.dtic.mil/sti/pdfs/ADA609131.pdf
Demakakos, P., Nazroo, J., Breeze, E., & Marmot, M. (2008): Socioeconomic status and health: the role of subjective social status. Social science & medicine, 67(2): 330-340. DOI: https://doi.org/10.1016/j.socscimed.2008.03.038
Eisen, C., Rieger, E., Sadie S., Angermüller R., Oldman C. B., Stafford W. (2018): Mozart. Oxford University Press. Grove Music Online. DOI: https://doi.org/10.1093/gmo/9781561592630.article.40258
Flüh, M. (2019): LIWC. In: forTEXT: Literatur digital erforschen. URL: https://fortext.net/tools/tools/liwc
Galenson, D. W. (2010): Understanding Creativity. Journal of Applied Economics, 13(2): 351-362. DOI: https://doi.org/10.1016/S1514-0326(10)60016-5
Goleman, D. (2012): Emotional intelligence: Why it can matter more than IQ. Bantam. URL: http://dspace.vnbrims.org:13000/xmlui/bitstream/handle/123456789/4687/Emotional%20Intelligence.pdf?sequence=1
Halliwell, R. (1998): The Mozart family: four lives in a social context. Oxford University Press. ISBN: 0-19-816371-1
Hanck, C., Arnold, M., Gerber, A., Schmelzer, M. (2021): Introduction to Econometrics with R: Section 10.3. URL: https://www.econometrics-with-r.org/10-3-fixed-effects-regression.html
Heiss, F. (2016): Using R for Introductory Econometrics. First Edition. Self published via CreateSpace. ISBN: 978-1-523-28513-6
Hentschke, L., Kibbe, A., Otto, S. (2017): Geld in der Psychologie: Vom Homo oeconomicus zum Homo sufficiensis. In: Peters S. (eds) Geld. Wiesbaden: Springer-Gabler. DOI: https://doi.org/10.1007/978-3-658-15061-7_5
Holley, J.W. (1977): Tenure and research productivity. Res High Educ 6: 181–192. DOI: https://doi.org/10.1007/BF00991419
Hviid, A., Hansen, J. V., Frisch, M., & Melbye, M. (2019): Measles, mumps, rubella vaccination and autism: a nationwide cohort study. Annals of internal medicine: 170(8): 513-520. DOI: https://doi.org/10.7326/M18-2101
Jacoby, W. G. (2000): Loess: a nonparametric, graphical tool for depicting relationships between variables. Electoral Studies: 19(4), 577-613. DOI: https://doi.org/10.1016/S0261-3794(99)00028-1
Kahneman, D.; Deaton, A. (2010): High income improves evaluation of life but not emotional well-being. Proceedings of the National Academy of Sciences: 107 (38). DOI: https://doi.org/10.1073/pnas.1011492107
Kennedy, P. (2008): A guide to Econometrics. Sixth Edition. Malden: Blackwell Publishing Ltd. ISBN: 978-1-4051-8257-7
Mankiw, N.G. (2017): Makroökonomik. 7th edition. Stuttgart: Schäffer-Poeschel Verlag. ISBN: 978-3-7910-3783-7
Mirowsky, J., Ross, C.E. (2003): Education, Social Status, and Health. First Edition. Routledge. DOI: https://doi.org/10.4324/9781351328081
Monroe, S. M., Reid, M. W. (2009): Life stress and major depression. Current Directions in Psychological Science: 18(2), 68-72. DOI: https://doi.org/10.1111/j.1467-8721.2009.01611.x
Mohammad, S. (2020): Practical and Ethical Considerations in the Effective use of Emotion and Sentiment Lexicons. arXiv preprint. URL: https://arxiv.org/abs/2011.03492
Viereck, G. S. (1929): What Life Means to Einstein. Saturday Evening Post: Oct. 26 1929. URL: https://www.saturdayeveningpost.com/wp-content/uploads/satevepost/what_life_means_to_einstein.pdf
Wooldridge, J. M. (2016): Introductory Econometrics. A Modern Approach. Sixth Edition. Boston: Cengage Learning. ISBN: 978-1-305-27010-7
Wooldridge, J. M. (2020): Introductory Econometrics. A Modern Approach. Seventh Edition. Boston: Cengage Learning. ISBN: 978-1-337-55886-0
Auguie, B. (2017): gridExtra: Miscellaneous Functions for "Grid" Graphics. R package version 2.3. URL: https://cran.r-project.org/web/packages/gridExtra/index.html
Barthelme, S., Tschumperle, D., Wijffels J., Assemlal, H. E., Ochi S. (2021): imager: Image Processing Library Based on 'CImg'. R package version 0.42.11. URL: https://cran.r-project.org/web/packages/imager/index.html
Benoit, K., Obeng, A.(2021): readtext. R package version 0.81. URL: https://github.com/quanteda/readtext
Benoit, K., Watanabe, K., Wang, H., Nulty, P., Obeng, A., Müller, S., Matsuo, A. (2021): quanteda: An R package for the quantitative analysis of textual data. R package version 2.1.3. URL: https://cran.r-project.org/web/packages/quanteda/index.html
Gaure, S. (2021): lfe: Linear Group Fixed Effects. R package version 2.8-7.1. URL: https://cran.r-project.org/web/packages/lfe/index.html
Henry L., Wickham, H. (2020): purrr: Functional Programming Tools. R package version 1.0.7. URL: https://cloud.r-project.org/web/packages/purrr/index.html
Hlavac, M. (2018): stargazer: Well-Formatted Regression and Summary Statistics Tables. R package version 5.2.2. URL: https://cran.r-project.org/web/packages/stargazer/index.html
Iannone, R. (2020): DiagrammeR: Graph/Network Visualization. R package version 1.0.6.1. URL: https://cran.r-project.org/web/packages/DiagrammeR/
Kleiber, C., Zeileis, A. (2020): AER: Applied Econometrics with R. R package version 1.2-9. URL: https://cran.r-project.org/web/packages/AER/index.html
Kranz, S. (2020): RTutor: R Problem Sets with Automatic Test of Solution and Hints. R package version 2020.11.25. URL: https://github.com/skranz/RTutor
Murrell, P. (2014): gridBase: Integration of base and grid graphics. R package version 0.4-7. URL: https://cran.r-project.org/web/packages/gridBase/index.html
Neuwirth, E. (2014): RColorBrewer: ColorBrewer Palettes. R package version 1.1-2. URL: https://cran.r-project.org/web/packages/RColorBrewer/index.html
Waring, E., Quinn, M., McNamara, A., De la Rubia, E. A., Zhu, H. (2021): skimr: Compact and Flexible Summaries of Data. R package version 2.1.3. URL: https://cran.r-project.org/web/packages/skimr/index.html
Wickham, H. (2016): ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. R package version 3.3.5. ISBN: 978-3-319-24277-4. URL: https://ggplot2.tidyverse.org.
Wickham, H. (2021): tidyverse: Easily Install and Load the 'Tidyverse'. R package version 1.3.1. URL: https://cran.r-project.org/web/packages/tidyverse/index.html
Wickham, H. , Miller, E. (2021): haven: Import and Export 'SPSS', 'Stata' and 'SAS' Files. R package version 2.4.3. URL: https://cran.uni-muenster.de/web/packages/haven/index.html
Wickham, H., Francois, R., Henry, L. and Müller, K. (2021): dplyr. A Grammar of Data Manipulation. R package version 1.0.7. URL: https://cran.r-project.org/web/packages/dplyr/dplyr.pdf
Wilkins, D. (2021): treemapify: Draw Treemaps in 'ggplot2'. R package version 2.2.5. URL: https://cran.r-project.org/web/packages/treemapify/index.html
Borowiecki, K. J. (2016): Replication Data for: "How Are You, My Dearest Mozart? Well-Being and Creativity of Three Famous Composers Based on Their Letters". Harvard Dataverse: V1. DOI: https://doi.org/10.7910/DVN/CP3FYH
Mohammad, S., Turney,P. (2010): Emotions Evoked by Common Words and Phrases: Using Mechanical Turk to Create an Emotion Lexicon. In Proceedings of the NAACL-HLT 2010 Workshop on Computational Approaches to Analysis and Generation of Emotion in Text: 26-34. URL: https://aclanthology.org/W10-0204
Mohammad, S., Turney,P. (2013): Crowdsourcing a Word-Emotion Association Lexicon, Computational Intelligence: 29 (3): 436-465. DOI: https://doi.org/10.1111/j.1467-8640.2012.00460.x
Figure 1: Causal graph of model 1 (simple world); Source: own diagram.
Figure 2: Causal graph fulfilling exogeneity condition; Source: own diagram.
Figure 3: Causal graph violating the exogeneity condition; Source: own diagram.
All of the above links were accessible as of January 7, 2022.
Author: Daniel Klinke

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.