Problem Set Product Attribute Trade-Offs in the Automobile Sector

Author: Marius Breitmayer

< ignore

Run code below to generate and show the problem set

library(RTutor)
library(yaml)

setwd("C:/Users/MariusPC/Desktop/AttributeTradeoffs") #adapt path
ps.name = "Attribute Tradeoffs"
sol.file = paste0(ps.name, "_sol.Rmd")
libs = c("ggplot2", "broom", "dplyr", "foreign", "googleVis", "lfe", "stargazer")

# Generate problem set files in working directory
create.ps(sol.file = sol.file, ps.name = ps.name, user.name = NULL, libs = libs, var.txt.file = "Attribute_Tradeoffs_Var.txt", stop.when.finished = FALSE, addons = "quiz")

# Show in web browser. You can adapt the arguments below
show.shiny.ps(ps.name, load.sav = FALSE, launch.browser = TRUE, sample.solution = FALSE, is.solved = FALSE)

>

In his paper "Automobiles on Steroids: Product Attribute Trade-Offs and Technological Progress in the Automobile Sector", Christopher R. Knittel (2012) estimates technological progress since 1980 and the trade-offs faced when choosing between attributes such as weight, fuel economy, or engine power characteristics. In this interactive problem set, we are going to reproduce his study and discuss it.
(The public data as well as the article are provided on the website of the American Economic Association. You can simply click here to download it.)

Exercise Overview

Have you ever been at the gas station and wondered how much fuel your car would need if all the innovation achieved over the last 25 years had been used to improve fuel economy instead of engine power characteristics, weight, or other characteristics?

Manufacturers as well as consumers face technological trade-offs between fuel economy, engine power characteristics, and weight every time they want to produce or buy a car. The goal of this problem set is not only to better understand these trade-offs, but also to estimate the technological progress that has occurred over the observed timeframe.
Using data from 1980 to 2006, we will reproduce the estimates for technological trade-offs that manufacturers and consumers face when choosing between fuel economy, weight and engine power characteristics. We will also examine how the relationships between these different factors have changed over that time.

In this problem set we would like to find an answer to the central question of how fuel economy in 2006 would compare to fuel economy in 1980 if we had held size and power constant.

This problem set has the following structure:

Exercise 1: Descriptive statistics

a) Loading the required data

Before we can work with data, we need to load it into our workspace.

Usually in R this is as simple as assigning a name to the imported data: new_variable = read.table("data_name").

But because in our case the data comes from Stata, and is therefore saved as a .dta file, the read.table() command will not work.

In many cases R packages are a great way to save a lot of time, because they provide solutions to common problems. You will see that we use several different packages within this problem set.

For this situation I recommend the package foreign. Load it with the command library() and add the name of the package.

#< task
# Use library() to load the foreign package.

#>
library(foreign)

After loading the foreign package we can now use the command read.dta(). It works the same way as read.table(), but as the name suggests, it reads .dta files. Please use the read.dta() command to load Steroids_AER_data_post.dta and assign it to a variable called dat.

< info "read.dta()"

dat = read.dta("tablename") loads the data with the name tablename and assigns it to a data frame dat. You have to use quotation marks for the name of the data file.

>

#< task
# Use the `read.dta()` command to load the `Steroids_AER_data_post.dta` and assign it to a variable called `dat`

#>
dat = read.dta("Steroids_AER_data_post.dta")

The data contains model-level data on almost all vehicles sold in the United States between the years 1980 and 2006. In the next exercise, we will take a closer look at how the data is structured.

< award "Data Loader"

Congratulations, you've earned this award for loading the required data into your workspace!

>

b) Data Overview I

Since we are now able to work with our data, we will first take a look at which columns are in the data set. Use the colnames() command on our newly created dat to show the names of the columns.

#< task
# use `colnames()` on our loaded data "dat" to show the names of the columns.

#>
colnames(dat)

If you want to know what the different variables stand for, just click the "Info"-Button.

< info "Interesting Variables"

Source: Serway, R. A. and Jewett, Jr. J. W. (2003). Physics for Scientists and Engineers. page 232, 300

>

< info "Dummy variable"

Dummy variables are artificial variables that take either the value one or zero. Which of the two values a dummy takes depends on whether the given qualitative phenomenon occurs or not.

Let me give you an example from our data:

In our data d_turbo is a dummy variable. It takes the value 1 if the car is turbocharged, and 0 if the car is not turbocharged.
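As a quick illustration, we could tabulate this dummy (a one-line sketch, assuming dat is already loaded):

table(dat$d_turbo)   # counts how many cars are not turbocharged (0) and turbocharged (1)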

Dummies can be used in a classic linear regression just as any other explanatory variable yielding standard OLS results.

Source: Kennedy Peter (2008), A Guide to Econometrics, p.232

>

Now that we know which columns are in our data, we might as well take a look at the data itself. Simply click check to see a brief overview of the data.

#< task
# click check, to show the first couple of rows of the data. 
dat
#>

c) Data Overview II

Now that we know which variables are part of the data set, let's see if we can get a rough impression of their values. For this, summary() is a really great way to get a first impression. Because the data consists of cars as well as trucks, we first need to select the cars only. We are also going to remove outliers here for the first time. For more information on outliers, see the info box.

< info "Outlier"

An outlier is an observation that is far away from other observations. For example, in a small sample of (1, 2, 4, 2, 1, 4, 2, 3, 3, 98626), the value 98626 is clearly way off the other values. Therefore we could call it an outlier and not take it into account anymore. If we look at our data, there are a couple of outliers, for example the 2006 Bugatti Veyron EB 16.4, whose horsepower of 1001 is way above other cars, or the 1998 Cadillac DeVille with a torque of 5600. Some other observations are flagged as outliers because they have missing values.

Source: Stock, J. H., Watson, M. W. (2007): Introduction to Econometrics. Second Edition, Boston: Pearson Education Inc. page 27

>

To select only cars and get a first impression of the variable mpg (which will later also be referred to as fuel economy), click check.

#< task
# we first want to use cars only
cars = filter(dat, d_truck == 0 & outlier == 0)
# now we want a summary of the variable mpg
summary(cars$mpg)
#>

Min is the smallest value for mpg in the data set.

Max is the biggest value for mpg in the data set.

1st Qu. is the 25 percent quantile, and 3rd Qu. is the 75 percent quantile of our data set. This means that 50 percent of all values for mpg lie between these two values.

Mean is the average mpg in our data set, and Median is the value of mpg that separates the higher and lower halves of the data set.

In case you don't know what the median is, click the info box.

< info "Median"

If we look at our data as an ordered vector, the median is the value in the middle. 50% of the values are higher, and 50% of the values are lower than the median. We can therefore say that the median is the value that separates the higher and lower half of the data set.

The median can be displayed as:

$$ \operatorname{median}\ x = \begin{cases} x_{(n+1)/2} & n\text{ odd} \\ \frac{1}{2}\bigl(x_{(n/2)}+x_{(n/2+1)}\bigr) & n\text{ even} \end{cases} $$
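As a small illustration in R (a sketch using the hypothetical sample from the outlier info box, without the extreme value):

# an ordered vector with n = 9 (odd): the median is the 5th value
x = sort(c(1, 2, 4, 2, 1, 4, 2, 3, 3))
x[(9 + 1) / 2]      # 2
median(x)           # R's built-in median gives the same result
# with n = 8 (even), the median is the average of the 4th and 5th values
y = sort(c(1, 2, 4, 2, 1, 4, 2, 3))
mean(y[c(8 / 2, 8 / 2 + 1)])   # 2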

Source: Hogg, R. V. and Craig, A. T. Introduction to Mathematical Statistics, 5th ed. New York: Macmillan, 1995.

>

For now, we are mostly interested in the mean values, so we will use the mean() command. Please use the mean() command to compute the mean horsepower.

#< task
# use the `mean()` command on `cars` for horsepower `hp`.

#>
mean(cars$hp)

Great job!

If we look at the given values, we can say that, regarding the means, an average car has a fuel economy of 27.9 miles per gallon and 157 horsepower.

In order to make statements about the average car sold, the data should be weighted with, for example, sales data:

$$ \bar x = \dfrac{\sum_{i=1}^n w_{i} * x_{i}}{\sum_{i=1}^n w_{i}} $$

This way, cars that were sold more often would carry more weight, and we would get a better representation of "the average car sold".

Unfortunately we don't have any sales data in our data.

Therefore, in our data every car's values are weighted equally when computing means:

$$ \bar x = \dfrac{1}{n}\sum_{i=1}^n x_{i} $$

As a result, a car that was sold only once is represented the same way as a car that was sold a million times.
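To illustrate, here is a minimal sketch with made-up numbers (the sales figures are hypothetical, since our data contains none):

mpg_values = c(30, 20)                  # two hypothetical car models
sales      = c(1000000, 1)              # made-up sales figures
mean(mpg_values)                        # unweighted mean: 25
weighted.mean(mpg_values, w = sales)    # sales-weighted mean: almost 30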

Even though our mean values are not a proper representation of the average car sold, they still help us get a first impression of the data at hand.

Now that we have means over the whole observation period (1980 to 2006), we might be interested in how these car attributes differ over time. For now, let's keep looking at fuel economy (mpg) and horsepower (hp) only.

We will therefore plot the mean values of mpg and hp for each year.

To do so, click check:

#< task
# First we use the package 'dplyr'
library(dplyr)
# The idea is to plot the means of every year, with X-values being the year and Y-values being the mean in the given year.
# Therefore it is useful to have groups of years in our data. 
# We use group_by from the package dplyr to generate these groups. 
# Every group now contains all the cars produced in the given year. 
# If we now use the mean() command on mpg and horsepower, we can plot the data easily.
qdata = summarise(group_by(cars, year), mpg = mean(mpg), hp=mean(hp))
# we now simply plot the values for mpg using qplot from the package 'ggplot2'
# we take year as our x, and mpg as our y values, and use the recently created data `qdata`.
library(ggplot2)
q1 = qplot(x=year, y= mpg, data = qdata)
# then we just show the plot
q1
#>

As you can see, we grouped the cars by year and saved the means for mpg and hp into qdata; then we plotted every year's mean value for mpg to get an idea of how it changed over time. Your task now is to create the same kind of plot for horsepower. You don't have to group and summarize anymore, simply use qdata.

#< task
# save the plot for hp as q2 and display it.

#>
q2 = qplot(x=year, y= hp, data = qdata)
q2 

Let's take a look at the two graphs:

The difference between the two graphs is quite obvious. While horsepower has increased almost linearly every single year, with a slightly steeper increase in the later years, fuel economy increased drastically in the first five years but then started to fluctuate between 27 and 29 miles per gallon. Even though fuel economy was fairly constant after that, a small negative trend can be identified later in the sample. Overall, mpg increased by roughly 18 percent from 1980 to 2006. Especially in the later years, when horsepower increased drastically, a decline in fuel economy is recognizable. One possible reason for this development of mpg will be discussed in the next exercise.

< award "Overviewer!"

Congratulations, you've earned this award for getting an overview of the data we will be using in this problem set!

>

This exercise refers to page 3377 of the paper.

Exercise 2: Motivation: CAFE - Standards

After getting a rough overview of the data at hand, let us take a small step away from the data and talk about the motivation of our problem set.

In exercise 1, we concluded that horsepower and fuel economy developed differently between 1980 and 2006.

But why was there such a huge increase in horsepower, while fuel economy remained relatively static?

What could have possibly had an impact on that development?

In order to find an answer to this question, it might be interesting to see how policy makers could incentivize manufacturers to produce new cars with higher fuel economy.

The first fuel economy standards, called Corporate Average Fuel Economy (CAFE) standards, were established in 1975 with the Energy Policy and Conservation Act, in response to the 1973 oil embargo.

The purpose of the CAFE standards is to reduce the energy use by increasing the fuel economy of cars and light trucks.

Since 1978, CAFE standards have been sales-weighted fleet averages of fuel economy that each manufacturer's fleet has to achieve each year.

If a manufacturer does not comply, it is fined. The fine is $5.50 for every 0.1 mpg below the standard, multiplied by the number of cars in the manufacturer's new car fleet in that year.
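To make the fine concrete, here is a small sketch with made-up numbers, assuming a fleet that misses the standard by 0.5 mpg and a new car fleet of 100,000 vehicles:

5.50 * (0.5 / 0.1) * 100000   # 5 steps of 0.1 mpg times $5.50 times 100,000 cars = $2,750,000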

With these fines, policy makers try to incentivize manufacturers to increase their fleet's fuel economy. Between 1983 and 2003, for example, the penalties collected totaled slightly over $600 million and were mostly paid by small European manufacturers. (Source: Yacobucci, Bamberger, Automobile and Light Truck Fuel Economy: The CAFE Standards (2008) p. CRS-3 link)

Starting in 1978, the intention of the CAFE standards was to double the fuel economy of all new car models sold on the U.S. market. The value to be achieved by 1985 was 27.5 miles per gallon.

If we now take a look at our data, we can try to estimate which manufacturers met the required fuel economy standard and which did not. We have to keep in mind that our mean values are still not weighted by sales and might therefore differ from the officially measured values.

As usual, we will load our data, and filter the data to only have the cars produced in 1985.

#< task
# first we read the data. 
dat = read.dta("Steroids_AER_data_post.dta")
# then we only take the cars in year 1985.  
cafe1 = filter(dat, year == 1985 & outlier == 0 & d_truck == 0)
#>

If we now want to see what the mean fuel economy of each manufacturer looked like in 1985, we can use a command chain from the package dplyr.

We "chain" commands together using the %>% operator.

If you are familiar with UNIX, this can be compared to the "pipe" operator. By using the %>% operator, the output of one command becomes the input of the next command.
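As a small illustration, the two lines below are equivalent; the chained version reads from left to right instead of inside out:

# nested:  summarise(group_by(cafe1, mfr), mpg = mean(mpg))
# chained: cafe1 %>% group_by(mfr) %>% summarise(mpg = mean(mpg))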

#< task
# load dplyr
library(dplyr)
# after this, we are going to create a command chain using %>%
# we are first going to group our data by the manufacturer, since cafe Standards are manufacturer fleet averages
cafe1 = cafe1%>% 
  group_by(mfr) %>%
# after this we summarize by the mean fuel economy.
summarise(mpg = mean(mpg)) %>%
# lastly we keep those manufacturers whose mean fuel economy is lower than 27.5.
filter(mpg < 27.5)
# finally we then show the data frame
cafe1
#>

As we can see, in our case Audi, BMW, Ferrari, Fiat, Jaguar, Lotus, Maserati, Mercedes, Peugeot, Pininfari, Porsche, Renault, Saab and Volvo did not achieve the target fuel economy of 27.5 mpg by 1985.

a) Manufacturer fuel economy over the years

Now that we know what fuel economy across manufacturers looked like in 1985, wouldn't it be interesting to see how fuel economy changed across all manufacturers over the years?

To visualize this, the package googleVis gives us the opportunity to create a Motion Chart.

This allows us to see how specific values have changed over the course of time. Some values are represented on the axes, others by the color or size of the bubble.

For more info on googleVis, click the info box.

< info "googleVis"

The googleVis package offers the possibility to visualize R data frames with interactive Google Charts. The chart is generated as HTML code, which can easily be opened in a new browser window or embedded into the problem set, as I did here. Because it allows us to exceed the possibilities of normal plots by far, it is a great tool to analyse data. We are going to use perhaps the most popular kind of Google chart, the Motion Chart. To do so, we utilize the function gvisMotionChart().

To see the documentation of googleVis, click here.

>

Click check to see the Motion Chart. Then click the Play button to start the animation.

#< task 
# as before, we will use the pipe command
df = dat %>%
# first we filter outliers and trucks out
  filter(outlier==0 & d_truck==0) %>%
# then we group the data by year and manufacturer
  group_by(year,mfr) %>%
# we set the value for each mfr to the mean values of hp, mpg, accel 
# we also generate a new variable called models, which represents the number of different models in each year
  summarise(mpg=mean(mpg), hp=mean(hp), accel=mean(accel), models=n()) %>%
# lastly we create a new column called id, containing the manufacturer name
  mutate(id = mfr)

library(googleVis)
# Then we use the gvisMotionChart() command to generate the html code and save it as mp.
# within the command, we then select the generated data, and specify how every attribute should be represented in our plot.
# It is important to set idvar to id (which is equal to mfr), since we are interested in different manufacturers. 
# we set timevar to year, because 'year' is the variable that depicts time in our data set.
# As the variable shown on the x and y axis we assign hp and mpg
# different manufacturers will be represented in different colors and the amount of different models will determine the size.
mp = gvisMotionChart(df, idvar = "id",
                     timevar = "year", xvar = "hp", yvar = "mpg",
                     colorvar = "mfr", sizevar = "models")
plot(mp, tag = "chart")
#>

In our motion chart, each circle represents a different manufacturer.

Let us take a closer look at the motion chart above. You can always watch it again by clicking the play button. In the early years of our sample (1980 - 1985) a trend to increase fuel economy can be recognized: most of the circles are moving upwards in the coordinate system, and only a few manufacturers show an increase in horsepower. If we then take a look at the years after 1985, there is hardly any increase in fuel economy for most manufacturers. In contrast, almost all circles are moving further to the right side of the coordinate system, which represents an increase in horsepower. This could imply that incentives in the early years led manufacturers to focus on improving fuel economy as opposed to other characteristics such as horsepower. If we now take a look at the values of 2006, we can see that manufacturers with low fuel economy (<20 mpg) tend to have small circles, which suggests a small number of different models in that year. In our case, the manufacturers Ferrari, Bentley, Aston Martin and Maserati all have fewer than 5 models. This is consistent with Yacobucci and Bamberger, who report that most of the CAFE fines were paid by small European manufacturers.

b) Mean fuel economy vs. CAFE Standards

After getting an idea of what the manufacturers' (non-sales-weighted) fleet averages looked like over the years, we might now be interested in how the CAFE standards changed compared to the average fuel economy.

To visualize how the unweighted fuel economy changed compared to CAFE standards over the years, we can plot both values. The data on CAFE standards were taken from the "Summary of Fuel Economy Performance" by the National Highway Traffic Safety Administration link.

Here we will use the overall unweighted mean value for each year, not distinguishing between manufacturers, because we would like to get a general idea of how all cars have changed. At this point we are not interested in which manufacturers did not reach the target fuel economy.

I have already prepared the data for you. It contains three columns: year, the mean fuel economy in the given year and the CAFE standard requirements in that year. It is saved as cafecars.txt. To see the plot, please click check.

#< task
# we load the data first
cafe = read.table("cafecars.txt")
# then we create a plot with year on the x-axis
# afterwards we add two lines, one for the CAFE standards, the other for the mean fuel economy
cafecars_mpg = ggplot(data = cafe, aes(x = year)) + 
  geom_line(aes(y = cafe, color = "CAFE standard")) +
  geom_line(aes(y = mpg, color = "Mean fuel economy")) +
  labs(y = "fuel economy", color = "Legend")
# last, we display the plot
cafecars_mpg 
#>

As we can see, there was a substantial increase in CAFE standards for cars in the early years. The standards decreased slightly in 1986, but rose back to the 1985 level by 1990 and remained there for the rest of our sample. In relation to this, the average car fuel economy in our sample was higher than the requirement most of the time. We have to note again that our mean values are not sales weighted. This might result in biased values, as car models with lower sales are "overrepresented" while car models with higher sales are "underrepresented". Nevertheless we can identify a trend: while standards increased, fuel economy rose as well. After 1990, the year CAFE standards were modified for the last time in our sample, the mean fuel economy decreased almost every year. In 2006 it even fell below the requirement line. Therefore we might be able to find a correlation between fuel economy and CAFE standards.

Let's have a look at the correlation coefficient.

#< task
cor(cafe$cafe, cafe$mpg)
#> 

Our correlation coefficient is 0.8058689, and even though our sample is quite small, this suggests a strong positive linear relationship. It might indicate that increasing the CAFE standards would result in a higher average fuel economy (but see the info box below on correlation and causation). This would be one way to explain why there was an increase in fuel economy early on, which later gave way to increases in other characteristics such as horsepower.

< info "cum hoc ergo propter hoc"

cum hoc ergo propter hoc (Latin) stands for the fallacy of concluding causation from correlation: just because two factors are correlated, this does not automatically imply that there is a causal relationship between them.

One example might be the number of firefighters in correlation with fire damage. Obviously, the more firefighters are on duty at a certain fire, the bigger the fire tends to be, and as a result the fire damage is high as well. This leads to correlation, but does it imply causation? There might be an unobserved third factor (in our case the size of the fire) that influences both.

In case you are interested in more examples of high correlations that clearly don't indicate causation, you might as well take a look at Spurious Correlations.

>

What we know after this exercise is that, assuming a causal relationship, an increase in CAFE standards should result in an increase in fuel economy. As long as the CAFE standards are fulfilled, which means no fine has to be paid, manufacturers will instead improve characteristics that are important to the customer, for example acceleration or horsepower. Manufacturers who do not comply with the standards value consumer preferences higher than the resulting fine. But before we use this information to give advice, we have to get a better understanding of how different car attributes influence fuel economy.

This will be the topic of exercise 3.

Source: U.S. Department of Transportation (2014): Corporate Average Fuel Economy (CAFE) Standards link, 28.11.15

Source: U.S. Department of Transportation (2003): CAFE - Fuel Economy link , 28.11.15

Source: Union of Concerned Scientists, Fuel Economy Basics link , 28.11.15

Exercise 3: Graphical Evidence

After getting a first idea of our data in exercise one and of how policy might create incentives for manufacturers to increase their fuel economy in exercise two, in this exercise we will develop a deeper understanding of how different car attributes influence fuel economy.

a) Density Plots

First we will get a graphical view of the data at hand. The R package ggplot2 contains a nice array of tools for creating graphics. To begin, since we still want to know a little more about our data, we will create some density plots.

< info "Density"

A density plot shows the relative likelihood of a random variable taking on a given value.

Because a continuous random variable can take on a continuum of possible values, the probability distribution used for discrete variables, which lists the probability of each possible value, is not suitable for continuous variables. The probability is summarized by the probability density function. The area under the probability density function between any two points is the probability that the random variable falls between those two points.

Source: Stock, J. H., Watson, M. W. (2007): Introduction to Econometrics. Second Edition, Boston: Pearson Education Inc. page 21

>

Let me give you an example of how this is done: First, we need to load the needed data. I've already prepared this for you, so just click the check button to load the data.

#< task
# we first load our data again
dat = read.dta("Steroids_AER_data_post.dta")
# then, we need to filter our data 
# first off we only use data from the years 1980 or 2006 
# we take out the outliers, select only cars using gasoline as fuel, and cars with less than 50 mpg 
dens = filter(dat, year == 1980 | year == 2006, outlier == 0 & fuel == "G" & mpg<50 & d_truck ==0)
#> 

Then we will plot the density of Fuel Economy (mpg) for cars in the years 1980 and 2006. 1980 was the first year in our data and 2006 was the last one. By plotting these two years, we can see how the cars have changed over the course of our observation. Just click the check button to see the plot:

#< task
# as usual, we need to load a package again 
library(ggplot2)
# we add year as a factor to our data.
dens$Year = as.factor(dens$year)

# With 'ggplot(aes(...), data=dens)' we select the data that ggplot is going to use. 
# In our case it is the one we just created.
# x = mpg describes which values from dens are being used, 
# while fill is responsible to use different densities for the years. 
# geom_density adds a density to the ggplot object. 
# alpha = 0.5 just makes the fill semi-transparent.
p1 <- ggplot(aes(x=mpg,  fill=Year),data=dens) + geom_density(alpha=0.5)
p1

#>

We can see that the fuel economy density in 1980 is narrower than in 2006. This means that in 1980 fuel economy was more similar between cars than in 2006. Another point to mention is that the peak has shifted from around 18 miles per gallon in 1980 to roughly 26 miles per gallon in 2006. Because the density is wider in 2006, we can assume that customers have more options to express their preferences for fuel economy when buying a car. In addition, notice that the density at 23 miles per gallon is roughly the same in 1980 and 2006. This means that roughly the same share of cars achieves 23 miles per gallon in both years. The main difference, though, is that 23 miles per gallon was in the top half of fuel economy in 1980, while in 2006 it is in the bottom half. In total we can say that fuel economy has increased from 1980 to 2006.

Now please try to plot the density of accel with ggplot. You do not need to repeat the data preparation from the example. Use the same syntax as in the example above and save your object as p2. Don't forget to show your plot in the end.

#< task
# use p2 <- ... here for accel
# then show your plot

#>
p2 <-ggplot(aes(x=accel, fill=Year),data=dens) + geom_density(alpha=0.5)
p2

Let's take a look at the acceleration density. It is of course obvious that cars got faster over the years. Most cars in 1980 took around 13 seconds from 0-80 mph, whereas in 2006 the peak (meaning the time most cars take from 0-80 mph) is at roughly 8 seconds. Another interesting observation is that even one of the slower cars (accel > 12 sec) in 2006 accelerates a little faster than the average car in 1980.

< award "Master of Densities!"

Congratulations, you've earned this award for creating density plots!

>

b) Scatter Plots

So far we have only looked at single variables; in this exercise we will take a look at how two variables relate to each other. Since we are interested in how fuel economy, represented as mpg, changed over time, it makes sense to plot mpg against another variable. For this, scatter plots are a really nice way to visualize the relationship between two variables.

< info "Scatter Plot"

A scatter plot is a plot of n observations on $X_i$ and $Y_i$ in which each observation is represented by the point ($X_i$ , $Y_i$).

Source: Stock, J. H., Watson, M. W. (2007): Introduction to Econometrics. Second Edition, Boston: Pearson Education Inc. page 93

>

If we start thinking about which attributes might be important for fuel economy, weight is a natural starting point. So if we want to see how weight relates to fuel economy, we will use a scatter plot of fuel economy against curbwt. I will give you an example here:

#< task
# we take our data again
scat = dens 
scat$year = as.factor(scat$year)
# Now we use ggplot again. 
# we would like to plot curbwt on the X-Axis 
# and mpg on the Y-Axis
# We use scat as our data.
# geom_point adds the points, and geom_point(shape=1) changes the appearance of the points.
# geom_smooth() adds the smoothed line though the data for each year.
p3 <- ggplot(aes(x=curbwt,y=mpg, color=year),data=scat) + geom_point() + geom_point(shape=1) + geom_smooth()
# then we show the plot
p3 
#>

This figure suggests that a 3,000 pound passenger car gets roughly 10 more miles per gallon in 2006 compared to 1980. This increase is roughly constant over the weight distribution, which can be seen from the lowess smoothed line fitted through the data points. Put the other way around, a car with a fuel economy of 30 miles per gallon had a curb weight of about 2,000 pounds in 1980 and almost 3,000 pounds in 2006, an increase of almost 1,000 pounds over the given timeframe.

Your task now is, to create a scatter plot called p4 for horsepower hp and fuel economy mpg. As last time, it is sufficient if you start with p4 <- ggplot(...)+...

#< task 
# to do so, you just have to replace the ??? in the code below with the correct values.
# p4 <- ggplot(aes(x=???,y=???, color=year),data=scat) + geom_point() + geom_point(shape=1) + geom_smooth()
# p4
#>
p4 <- ggplot(aes(x=hp,y=mpg, color=year),data=scat) + geom_point() + geom_point(shape=1) + geom_smooth()
p4

Good job on that Scatter plot.

It is very interesting that in 1980 a car with more than 200 horsepower was almost a rarity; most cars had between 80 and 180 horsepower. In 2006, 200 horsepower can almost be considered a standard amount for a new car.
Our scatter plot suggests that a car with a fuel economy of 20 miles per gallon was able to have 280 more horsepower in 2006 than in 1980. Conversely, a car with 200 horsepower gets roughly 15 more miles per gallon in 2006 than in 1980.

c) googleVis Motion Chart

For this graphic, we need to use a new data set. I took data from the years 1980 to 2006 for six different cars in our dataset. In case there were multiple data rows for the same year, I took the one with the highest mpg. The six cars are: Honda Accord, Honda Civic, Toyota Corolla, GMC Grand Prix, Ford Mustang, and GMC Corvette.

I then saved the data into a file called gviscars.txt. We will now plot a Google Motion Chart again. This allows us to see how specific values have changed over the course of time. Some values are represented on the axes, others by color or by the size of the bubble. Click the check button, followed by the Play button, to see the animation.

#< task 
# We first load the prepared data
gviscars = read.table("gviscars.txt")
# Then we load the library googleVis, in order to have access to the needed commands
library(googleVis) 
# Then we use the gvisMotionChart() command to generate the html code and save it as mp. 
# within the command, we then select the loaded data, and specify how every attribute should be represented in our plot.
# It is important to set idvar to nameplate, since we are interested in different cars data. 
# we set timevar to year, because 'year' is the variable that depicts time in our data set.

mp = gvisMotionChart(gviscars, idvar = "nameplate",
                     timevar = "year", xvar = "hp", yvar = "mpg",
                     colorvar = "torque", sizevar = "curbwt")
plot(mp, tag = "chart")
#>

If we now take a look at this motion chart, we can see the six different cars each being represented by a circle.

We can see that the circle representing the Corvette is the one with the lowest miles per gallon over the course of the sample. But it is also the car with the highest horsepower, fastest acceleration (you can display acceleration by changing one of the values, for example mpg, to accel), highest weight and highest torque. In contrast, the car with the highest miles per gallon, the Honda Civic, has the lowest values for horsepower, torque and curb weight, and the slowest acceleration. Therefore we could assume that there might be a relationship between these values and fuel economy. We will take a look at this in later exercises.

< award "Dr.Plot!"

Congratulations, you've earned this award for creating scatter plots!

>

This exercise refers to page 3378 and 3379 of the paper.

Exercise 4: Theoretical Model

Before we start using our data to get some empirical results, we will think of a theoretical model.

As we already know, we don't have any sales data. Therefore we can't take sales into account.

What we can do, though, is take costs into account. If we denote the costs of producing a car $i$ with given attributes $mpg_{it}$, $w_{it}$, $hp_{it}$, $tq_{it}$ at a certain time $t$, they are represented by a marginal cost function:

$$ c_{it} = C(mpg_{it},w_{it},hp_{it},tq_{it},t) $$

$c_{it}$ are the costs that will arise from this vehicle.

$mpg_{it}$ is the fuel economy of the car to be produced.

$w_{it}$ is the curb weight of the car to be produced.

$hp_{it}$ is the horsepower of the car to be produced.

$tq_{it}$ is the torque of the car to be produced.

$t$ is the year at which the car is to be produced, and will later be used to represent technological progress $T_t$.

For more information on marginal cost, click the info box.

< info "marginal cost function"

In economics, marginal cost is the cost that arises when the amount of production is increased by one unit. Let's look at an example: if we produce a new car, the costs consist of variable costs, such as the materials and labor needed, as well as some fixed costs.

$$ C(x) = VC * x + FC $$

$C(x)$ are the total costs of producing $x$ cars

$VC$ are the variable costs per car (e.g. material, labor, ...)

$x$ is the number of cars produced

$FC$ are the fixed costs. These arise irrespective of how many cars are produced (e.g. rent for your factory building).

Mathematically, marginal cost is the first derivative of the cost function:

$$ \text{marginal cost} = \dfrac{\mathrm{d} C}{\mathrm{d} x} $$
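In our linear example above, this derivative is simply the variable cost per car:

$$ \dfrac{\mathrm{d} C}{\mathrm{d} x} = VC $$

so producing one additional car costs exactly $VC$.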

>

Since a normal car has more characteristics than $mpg_{it}$, $w_{it}$, $hp_{it}$ and $tq_{it}$, this is obviously not a very accurate representation of a car.

Therefore we will simply add some more characteristics.

If we differentiate between attributes that are related to fuel economy, represented as $X_{it}$, and attributes related to other aspects of the vehicle, represented as $Z_{it}$, this yields:

$$ c_{it} = C(mpg_{it},w_{it},hp_{it},tq_{it},X_{it},Z_{it}, t) $$

Attributes stored in $X_{it}$ might be a supercharger, a turbocharger or the kind of transmission used in the car.

Attributes stored in $Z_{it}$ are not related to fuel economy. These could be interior quality, a sun roof, a navigation system, a tow-bar and so on.

If we want to estimate technological progress in our current model, we can try to estimate how this function has changed over time. But there are two major problems:

(1): The dimension of $Z_{it}$, which would be needed to control for changes in vehicle attributes across other dimensions, is very large.

(2): We have no cost data available. An obvious proxy would be price data.

But there is also a problem with price data: given the numerous changes in the industrial structure of the automobile industry, a concern when using price data is that the estimates of technological progress would also capture changes in mark-ups over time. As a result, we will instead focus on the iso-cost curves (level sets) of the function.

Now we would like to get a more precise model. One of the problems is that we cannot control for the size of $Z_{it}$. If we assume that the attributes unrelated to fuel economy, $Z_{it}$, are additively separable, our function

$$ c_{it} = C(mpg_{it},w_{it},hp_{it},tq_{it},X_{it},Z_{it}, t) $$

changes to:

$$ c_{it} = C^{1}(mpg_{it},w_{it},hp_{it},tq_{it},X_{it},t) + C^{2}(Z_{it} ,t) $$

This allows us to split our marginal cost function into two separate components:

$C^{1}$ which contains all the fuel economy related attributes

$C^{2}$ which contains the components of the function that are not related to fuel economy.

Because we are interested in how fuel economy has changed, and $C^{2}$ does not contain any components related to fuel economy, we can ignore the $C^{2}$ part from here on.

Since we want to focus on the level sets of our function, we should transform it into such. This yields:

$$ mpg_{it} = f(w_{it},hp_{it},tq_{it},X_{it},t | C^{1} = \sigma) $$

The $C^{1} = \sigma$ part of the function expresses that costs are held constant over the years.

If we now assume that technological progress $T_t$ (represented as $t$ in the function before) is modeled as "input neutral", we can multiply our function by $T_t$, yielding

$$ mpg_{it} = T_t f(w_{it},hp_{it},tq_{it},X_{it},\epsilon_{it} | C^{1} = \sigma) $$

We can only obtain consistent estimates of our iso-cost curves, and of how they have changed because of $T_t$, if the value of $C^{1}$ neither changes over time nor within a year. In our empirical models, the value of $C^{1}$ will be put into the error term $\epsilon_{it}$. We will also not take expenditures on technology into account, which might lead to two different sources of bias.

< info "bias"

Bias is the difference between an estimator's expected value and the true value of the parameter being estimated.

The bias of an estimator $\hat\beta$ is represented as: $$ Bias(\hat\beta) = E \hat\beta - \beta $$

If $$Bias(\hat\beta) = 0 \Leftrightarrow E \hat\beta = \beta $$ the estimator is called unbiased.

Source: Stock, J. H., Watson, M. W. (2007): Introduction to Econometrics. Second Edition, Boston: Pearson Education Inc. page 68

>

First, if we estimate how our iso-cost curves have changed over time while implicitly holding investments in technology constant, our estimated iso-cost curves will be biased in an unknown direction: if companies have increased their spending on technology, our curves will reflect not only technological progress but also this increase; if companies have decreased their spending on technology, our curves will understate technological progress.

Another source of bias could arise from within-year variation in technology investments, if this variation is correlated with the observed characteristics of our cars. As a result, the estimated relationships between fuel economy, engine power and weight would be biased.

Because the observed increase in fuel economy captures both changes in the iso-cost curves due to technological progress and increases in how much firms devote to technology, the results should be interpreted in this light.

Besides the costs devoted to technologies, other factors make a difference in the relationship between fuel economy, engine characteristics and weight. For example, vehicles with a manual transmission can achieve a higher fuel economy than vehicles with an automatic transmission. This might change as technology evolves and more efficient automatic transmissions are invented. As far as our data allows, we will try to control for a number of these factors, labeled $X_{it}$.

Let's start with the empirical work in the next exercise.

This exercise refers to page 3371 and 3372 of the paper.

Exercise 5.1: Empirical Model: Cobb-Douglas

In this part of the Problem Set, we will focus on a Cobb-Douglas functional form to estimate the level sets.

a) Introducing Model

Before we work with the more complicated models in the paper, it makes sense to look at a simpler model first.

We assume there is a cost function representing the costs of producing a car with a given amount of fuel economy mpg, horsepower hp and torque torque. This is admittedly a very simple representation of a car, but it serves to give an idea of how the later models work. In 1928 Charles Cobb and Paul Douglas published their paper "A Theory of Production", in which they established a framework that has been widely accepted in empirical investigations.

A Cobb-Douglas production function is widely used to represent the relationship between two or more inputs and the amount of output generated by those inputs. If we assume that all manufacturers have the same production elasticities and that substitution elasticities equal 1, we can use the Cobb-Douglas form.

The formula then looks like this:

$$\tilde c_{it} = mpg_{it}^{\tilde\alpha} * hp_{it}^{\tilde\beta} * torque_{it}^{\tilde\gamma} * \tilde T_t $$

< info "Technoligcal Progress T"

In economics, technological progress is a measure of innovation. It covers the invention of new technologies as well as the improvement of already existing technologies. We could therefore say that technological progress mostly consists of more and better technology.

For further information, click this Wikipedia link

>

The problem with this formula is that we have no data on costs or prices.

One way of solving this is to express one variable in terms of the others. Since we are interested in how fuel economy has changed over time, we will express fuel economy in terms of horsepower and torque.

If we take the logarithm of our formula this results in:

$$ \ln \tilde c_{it} = \tilde\alpha * \ln mpg_{it} + \tilde\beta * \ln hp_{it} + \tilde\gamma * \ln torque_{it} + \ln \tilde T_t $$

Since we want to express fuel economy, we should bring mpg on one side of the equation:

$$ -\tilde\alpha * \ln mpg_{it} = \ln \tilde T_t + \tilde\beta * \ln hp_{it} + \tilde\gamma * \ln torque_{it} - \ln \tilde c_{it} $$

Now we multiply by $(-1)$:

$$ \tilde\alpha * \ln mpg_{it} = - \ln \tilde T_t - \tilde\beta * \ln hp_{it} - \tilde\gamma * \ln torque_{it} + \ln \tilde c_{it}$$

Since we want fuel economy separated on its own, we simply divide by $\tilde\alpha$:

$$ \ln mpg_{it} = - \frac{\ln \tilde T_t}{\tilde\alpha} - \frac{\tilde\beta}{\tilde\alpha} * \ln hp_{it} - \frac{\tilde\gamma}{\tilde\alpha} * \ln torque_{it} + \frac{\ln \tilde c_{it}}{\tilde\alpha}$$

Now we just have to move the costs into the error term $\tilde\epsilon_{it}$ (for more information see the info box):

$$ \ln mpg_{it} = - \frac{\ln \tilde T_t}{\tilde\alpha} - \frac{\tilde\beta}{\tilde\alpha} * \ln hp_{it} - \frac{\tilde\gamma}{\tilde\alpha} * \ln torque_{it} + \tilde\epsilon_{it}$$

< info "Error Term epsilon "

The random variable $\epsilon_{it}$ is also called disturbance.

$\epsilon_{it}$ has different properties.

  1. In many cases it's hard to explain all the variability in the model. Therefore $\epsilon$ captures omitted variables.

  2. Maybe the data wasn't collected 100% correctly. Even if the relationship still exists, $\epsilon$ collects some of the measurement error.

  3. A model helps us understand some relationships, but it cannot predict unpredictable effects. These effects are accounted for by the error term.

In our case $\tilde\epsilon_{it}$ equals:

$$\tilde\epsilon_{it} = \frac{\ln \tilde c_{it}}{\tilde\alpha}$$

Therefore our $\tilde\epsilon_{it}$ captures the unobserved costs that occur for a car.

>

With $$- \frac{\ln \tilde T_t}{\tilde\alpha} = T_t $$ $$ - \frac{\tilde\beta}{\tilde\alpha} * \ln hp_{it} = \beta * \ln hp_{it} $$ $$ - \frac{\tilde\gamma}{\tilde\alpha} * \ln torque_{it} = \gamma * \ln torque_{it}$$

we get:

$$ \ln mpg_{it} = T_t + \beta * \ln hp_{it} + \gamma * \ln torque_{it} + \tilde \epsilon_{it} $$

These results are level sets. For further information, click the info box "level sets".

< info "level sets"

Level sets are also known as iso-cost curves. In our case a level set, or iso-cost curve, represents all combinations of inputs that result in the same costs. In our problem set, it represents all possible combinations of characteristics that result in the same production costs.

>

< info "Logarithmic Transformation"

Please note that, an "Estimation is often facilitated by performing a logarithmic transformation of variables to create a linear estimation equation. A popular example of this is the Cobb-Douglas functional form, which requires a multiplicative disturbance if the logarithmic transformation is to create a linear estimating form in transformed variables. Now if, as is traditional, the nonlinear function without the disturbance is to represent the expected value of the dependent variable given the independent variables, the expected value of this multiplicative disturbance must be unity. The logarithm of this disturbance, which is the "disturbance" associated with the linear estimating form, does not have a zero expectation. This means that the OLS estimator of the constant in the linear estimating equation (the logarithm of the original Cobb-Douglas constant) is biased."

Source: Kennedy Peter (2008), A Guide to Econometrics, p. 111

>

For exercise 5.1 we will not take a look at technological progress $T_t$; it will be discussed in exercise 5.2.

b) Loading Data

For the next few exercises, we need a special subset of our data: we will use the filter() command on dat to select all observations that are cars (d_truck == 0) and not outliers (outlier == 0), and save them in a new variable called regdata. To do so, just click the check button.

#< task 
# We load the same data again.
dat = read.dta("Steroids_AER_data_post.dta")
# Then we kick the trucks and outliers out of our data.
regdata = filter(dat, d_truck==0 & outlier==0)
#>

With d_truck==0 we ensure that only data from cars are used, and outlier==0 makes sure that outliers are excluded from our regressions.
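Before we extend the model in part c), we can already sketch the simple introductory model from part a) with a plain OLS regression. This is only a rough sketch: it uses the log columns lmpg, lhp and ltorque that also appear in the regressions below, and factor(year) plays the role of $T_t$.

# simple Cobb-Douglas level set: ln(mpg) on ln(hp) and ln(torque), with a year intercept for T_t
simple = lm(lmpg ~ lhp + ltorque + factor(year), data = regdata)
# the estimated trade-offs (the coefficients on lhp and ltorque)
coef(simple)[c("lhp", "ltorque")]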

c) Model 1

After getting an overview of the data in earlier exercises and an idea of how the models work in part a), we will now use this knowledge to improve the introductory model.

Let's take our simple model and assume a "more complicated" car. Our car still consists of fuel economy ($mpg_{it}$), horsepower ($hp_{it}$) and torque ($tq_{it}$), but now we add curb weight ($curbwt_{it}$). Since this still represents very few car attributes (missing ones might be transmission, exhaust system, ...), we should take more into account. To keep it simple, we add a term $X_{it}$ in which we store other attributes related to fuel economy.

A formula for this might look like this: $$c_{it} = mpg_{it}^{\alpha} * hp_{it}^{\beta} * torque_{it}^{\gamma} * curbwt_{it}^{\delta} * X_{it}^{B} * T_t $$

The vector $B$ captures the estimated values for the characteristics represented in $X_{it}$.

After transforming in the same way as in our example, this yields:

Model 1: $$ \ln mpg_{it} = T_t + \delta \ln curbwt_{it} +\beta \ln hp_{it} + \gamma \ln tq_{it} + X_{it}B + \tilde \epsilon_{it} $$

We can now try to estimate the coefficients using a regression. In our data there are existing groups (mfr) within which values might be correlated: cars from the same manufacturer may share parts, technology, or a combination of both, and therefore their values might be correlated within these groups. As a consequence, regular OLS standard errors would be biased. As long as we can assume that the values are uncorrelated across groups, we can correct for this by using clustered standard errors. For more information, see the info box.

< info "clustered standard errors"

Standard errors under standard OLS assumptions are being calculated by:

$$ V_{OLS} = \sigma^2(X'X)^{-1} $$

with $\sigma^2$ being estimated by $s^2$

$$ s^2 = \frac{1}{N-K}\sum_{i=1}^N e_i^2 $$

$N$ is the number of observations

$K$ is the rank (number of variables in the regression)

$e_i$ are the residuals from the regression.

If the standard errors are clustered, the conventional confidence interval no longer has a coverage probability of $1-\alpha$. To fix this, we can apply a sandwich estimator like this: $$ V_{Cluster} = (X'X)^{-1} \sum_{j=1}^{n_c} (u_j'*u_j) (X'X)^{-1} $$

$n_c$ is the total number of clusters

$$u_j = \sum_{j_{cluster}}e_i*x_i$$

$x_i$ is the row vector of predictors (including the constant).

Source: William Sribney, StataCorp (1998): Comparison of standard errors for robust, cluster, and standard estimators link

>

Within the package lfe, the command felm() allows a relatively easy implementation of clustered standard errors. To see how the felm() command is structured, click the info box.

< info "felm()"

The command felm() is built in a very specific way. The first part requires the formula of our regression, written the same way as in the lm() command.

After the first $~|~$ come the factors added to the regression. Multiple factors are linked with "+".

The part behind the next $~|~$ is used for instrumental variables. We will not need this in our problem set, so it will always be 0 for us.

Behind the last $~|~$ we add the variables by which we would like to cluster our standard errors.

In general a felm command for us would look like this:

$$y \sim x_{1}+...+x_{n} ~|~ factor_{1}+...+factor_{n} ~|~ 0 ~|~ cluster_{1}+...+cluster_{n} $$
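Written as R code, a schematic call would look like this (all names are placeholders, not variables from our data):

# reg = felm(y ~ x1 + x2 | factor1 + factor2 | 0 | cluster1, data = mydata)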

>

Since we know from exercise 1 that dummy variables can be used in a classic linear regression just like any other explanatory variable, yielding standard OLS results, we can take the dummy variables d_manual+time_d_manual+d_diesel+d_turbo+d_super and interpret them as our additional vector $X_{it}$.

This results in the following regression:

#< task 
# In order to use the felm command, we first need to load the package lfe
library(lfe)
# The stargazer package is needed to show the results in a nice way
library(stargazer)

# we use the felm command to express lmpg with other variables, adding year as a factor, clustered by mfr. 
# as data we use the recently loaded data, regdata
reg1 = felm(lmpg ~ 
              lcurbwt+lhp+ltorque+
              d_manual+time_d_manual+d_diesel+d_turbo+d_super | year |0| mfr, data = regdata)
# Now we just need to show the values for reg1
# to show it in a nice html format, we use stargazer
stargazer(reg1, type = "html")
#>

If we take a look at the estimates from this regression, we can give a first interpretation of the values:

Ceteris paribus, a 10 percent increase in weight (curbwt) is associated with a 3.977 percent decrease in fuel economy.

The same interpretation is given for horsepower: All else equal, a 10 percent increase in horsepower is associated with a 3.241 percent decrease in fuel economy.

For torque the relationship is not precisely estimated, which we can tell from the significance codes, but a 10 percent increase in torque is associated with a 0.19 percent decrease in fuel economy.
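Note that this elasticity interpretation is a first-order approximation that works best for small changes. For a 10 percent change we can also compute the exact implied effect; the sketch below plugs in the curb weight coefficient implied by the interpretation above:

b = -0.3977          # coefficient of lcurbwt implied by the interpretation above
100 * (1.10^b - 1)   # exact change: roughly -3.72 percent for a 10 percent weight increase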

< info "interpretation log-log regression"

A Log-Log model is a regression model where the dependent variable (in our case $\ln mpg_{it}$) and the explanatory variables (in our case $\ln curbwt_{it}$,$\ln hp_{it}$, $\ln torque_{it}$) are in logarithmic form.

Based on Wooldridge, Jeffrey M. (2013), Introductory Econometrics: A Modern Approach (Fifth international ed.), Table 2.3 (Summary of Functional Forms Involving Logarithms), the interpretation of such a regression is as follows:

$$ \% \Delta y = \beta_1 \% \Delta x $$

In the log-log model, $\beta_1$ is the elasticity of y with respect to x. (In a log-level model, by contrast, 100 * $\beta_1$ is sometimes called the semi-elasticity of y with respect to x.)

Source: Wooldridge, Jeffrey M. (2013). Introductory Econometrics: A Modern Approach (Fifth international ed.). Australia: South-Western. p. 44 and 852

>

d) Endogeneity

A variable is called exogenous if it is not correlated with the error term. For example, if we assume that torque is an exogenous variable, then:

$$ Cor(\tilde\epsilon_{it}, torque_{it}) = 0 $$

If this is the case, the regression should show the real relationship.

In a statistical model, an endogenous variable is one that is correlated with the error term. In our case, $\tilde \epsilon_{it}$ captures the unobserved costs. Let's think of this scenario:

A Ferrari is typically a very expensive car with a lot of horsepower. Spending more money on a Ferrari buys the customer more horsepower. But in our model, spending more money can also buy more fuel economy. As a result, the correlation between horsepower and our error term looks as follows:

$$ Cor(\tilde\epsilon_{it}, hp_{it}) \neq 0 $$

This results in horsepower being an endogenous variable. If we have an endogenous variable, all OLS estimators will (typically) be inconsistent and biased.

Source: Wooldridge, Jeffrey M. (2013). Introductory Econometrics: A Modern Approach (Fifth international ed.). Australia: South-Western. pp. 92 and 303.

Source: Herbert Stocker: Methoden der Empirischen Wirtschaftsforschung Chapter 13. link

e) Model 2

As you've seen, we might have endogeneity in our model. One way to address it is to use panel data (see the info box) and add fixed effects (see the info box). If we think that the unobserved costs differ across manufacturers but are constant within a manufacturer over time, we can add manufacturer fixed effects to Model 1.

< info "Panel data"

Panel data, or longitudinal data, are data that observe many entities (in our case cars) over time. Each entity should be observed at least twice. There are two types of panel data: balanced, where all entities are observed in all periods of time, and unbalanced, where information about at least one period is missing for at least one entity.

(source: Stock and Watson (2007), Introduction to Econometrics. Second Edition, Boston: Pearson Education Inc. p.13, 350-351)

>

< info "Fixed Effects"

If we assume that there are other omitted variables, such as a manufacturer's expertise, which are correlated with the variables in our model, then fixed effects models provide a way for us to control for the bias created by those variables. The idea is that the effect the omitted variables have on the subject at a given time will be similar later on. As a result, this omitted effect is treated as constant over time.

To derive the transformation for fixed effects we suppose:

The $i$ th car in the $t$ th time period is written as:

$$ y_{it} = \alpha_i + \beta x_{it} + \epsilon_{it} \tag{a} $$

If we now average the observations on the $i$th car over the time periods for which we have data on this car, we get:

$$ \bar y_i = \alpha_i + \beta \bar x_{i} + \bar\epsilon_{i} \tag{b}$$

If we now subtract (b) from (a), we get

$$ y_{it} - \bar y_i = \beta (x_{it} - \bar x_{i}) + (\epsilon_{it} - \bar\epsilon_{i}) $$

and the intercept has been eliminated.

The fixed effects regression has n different intercepts, one for each entity. These intercepts can be represented by dummy variables. Said dummy variables absorb the influences of all omitted variables that differ from one entity to the next, but are constant over time.

Source: Kennedy Peter (2008), A Guide to Econometrics p.292 - 293

Source: Stock, J. H., Watson, M. W. (2007): Introduction to Econometrics. Second Edition, Boston: Pearson Education Inc. page 356

>
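To make the within transformation concrete, here is a minimal dplyr sketch that demeans two of our variables by manufacturer. This is purely illustrative: felm() performs the equivalent step internally when we add mfr as a factor.

within_data = regdata %>%
  group_by(mfr) %>%
  mutate(lmpg_dm = lmpg - mean(lmpg),   # deviation of lmpg from the manufacturer mean
         lhp_dm  = lhp - mean(lhp))     # deviation of lhp from the manufacturer mean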

Let's remember what we have learned in exercise 4:

We are using a marginal cost function which consists of variable as well as fixed costs, and these costs have been moved into the error term $\tilde \epsilon_{it}$.

Our costs in $\tilde \epsilon_{it}$ therefore consist of a manufacturer-specific factor $\bar c_i$ and a car-specific factor $k_{it}$.

$$\tilde \epsilon_{it} = \bar c_i + k_{it}$$

By adding manufacturer fixed effects, we eliminate the manufacturer specific component of our error term, yielding

$$\epsilon_{it} = k_{it}$$

By doing so, we try to reduce/eliminate endogeneity.

With 'felm' these fixed effects are relatively easy to implement. We simply add mfr to our factor part of the command.
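As a reminder, the felm formula consists of four parts separated by |; schematically (this is only the general pattern, not the solution to the task below):

# felm(outcome ~ regressors | fixed effects | instruments | cluster, data = ...)
#   part 1: the usual regression formula
#   part 2: factors to project out as fixed effects, e.g. year + mfr
#   part 3: instrumental variables (0 = none in our case)
#   part 4: the variable used to cluster the standard errors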

Now it's your turn: please use the felm command the same way as before to create reg2, but add mfr as a second factor (joined to year with a +), and use regdata as the data.

#< task
# use felm to create reg2. 
# express lmpg with lcurbwt+lhp+ltorque+d_manual+time_d_manual+d_diesel+d_turbo+d_super
# add year and mfr as factors (year + mfr)
# cluster by mfr
# use regdata.

#>
reg2 = felm(lmpg ~ 
              lcurbwt+lhp+ltorque+
              d_manual+time_d_manual+d_diesel+d_turbo+d_super | year+ mfr |0| mfr, data = regdata)

Great work on that regression.

To show your results next to the ones from the first regression, click check.

#< task
stargazer(reg1, reg2, column.labels=c("OLS","Fixed Effects"), type = "html")
#>

< award "Adding fixed effects!"

Congratulations, you've earned this award for adding manufacturer fixed effects to a model!

>

Let me ask you some questions on the results of regression 2:

Question 1:

< quiz "q3"

question: If we interpret our results, are the increases/decreases expressed in, for example, curbwt, or in ln curbwt?
sc:
- curb weight*
- ln curb weight
success: Great, your answer is correct!
failure: Wrong! See the info box for log-log regression.

>

Question 2:

< quiz "single"

question: A 10 percent increase in curb weight is associated with a 3.834 percent decrease in fuel economy. Correct?
sc:
- yes*
- no
success: Great, your answer is correct!
failure: Try the other answer.

>

Question 3:

Look at lhp. Add the correct parts to the sentence: "A 10 percent (answer1) in lhp is associated with a (answer2) percent increase in fuel economy"

< quiz "parts"

parts:
- question: 1. Add the word needed for answer1
  answer: decrease
  roundto: 0.01
- question: 2. Select the value for answer2
  choices:
  - 0.815
  - 3.14
  - 2.68*
  - 0.64
  multiple: FALSE
  success: Great, your answer is correct!
  failure: Try again.

>

< award "Quiz master"

Congratulations, you've earned this award for solving the questions correctly!

>

The coefficients associated with manual transmissions and diesel engines suggest fuel economy savings for these two attributes. For our Cobb-Douglas Models, the increase in fuel efficiency from diesel technology is between 19 and 21 percent.
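Where do these percentages come from? For a dummy variable in a log-level model, the exact percent effect is $e^{b} - 1$, where $b$ is the estimated coefficient. Assuming the fitted objects reg1 and reg2 from above are still in memory, you could verify this yourself:

#< task
# exact percent effect of the diesel dummy: exp(b) - 1, expressed in percent
(exp(coef(reg1)["d_diesel"]) - 1) * 100
(exp(coef(reg2)["d_diesel"]) - 1) * 100
#>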

The negative estimates for time_d_manual suggest that the gains from a manual transmission are estimated to fall over time. This might indicate that either more and more cars are equipped with an automatic transmission or the efficiency of those transmissions increased. A combination of both is possible too. Early in our sample, a manual transmission suggests savings between 8.7 and 10 percent.

Since the efficiency gains of automatic transmissions, in relation to manual transmissions, can also be represented as technological improvements specific to automatic transmission, we can try to think of it as some kind of technological progress.

We can also see that the estimated trade-offs (and, as you will see in further exercises, the technological progress too) change only little when manufacturer fixed effects are included. This suggests that any additional endogeneity concerns are likely to be small.

Now that we have an idea of how the trade-offs between fuel economy and other vehicle characteristics work, we will take a closer look at technological progress $T_t$ in the next exercise.

This exercise refers to pages 3372 and 3381 of the paper.

Exercise 5.2: Cobb-Douglas: Technological Progress

As you might have already noticed, we did not touch on the $T_t$ of our formula yet. $T_t$ is the estimator for Technological progress. It should capture the progress that occurred in a certain year $t$ and is, in our models, modeled nonparametrically as a set of year fixed effects.

Technological progress does not only represent advances in engine technology, but also improvements regarding, for example, transmissions, rolling resistance, aerodynamics or even fuel composition. As you can see, some of these effects cannot be influenced by manufacturers or customers.

Beginning in the 1980s, numerous technologies were established in newly produced cars. On the engine side, some of this progress came, for example, from replacing carburetors with fuel injection or from adding cylinder deactivation; both led to great improvements in fuel economy. Compared with an engine from around 1980, a modern engine has a camshaft (responsible for lifting the valves during its rotation) placed above the engine head, thus reducing friction. Many new cars have multiple valves per cylinder as well as variable valve timing: while multiple valves allow a smoother flow of the fuel/air mixture within the engine, variable valve timing allows the engine to adjust to driving conditions. Both a supercharger and a turbocharger use a turbine to force more air into the engine, resulting in an increase in efficiency.

In recent years, cylinder deactivation and hybrid technology have become more and more common. Hybrid technology combines a traditional engine with an electric motor, which allows a car to run on the engine alone, on the electric motor alone, or on both. The electric motor can be used as long as enough electricity is stored in a battery; the battery charges while the car is in motion and the electric motor is not in use. Obviously, being able to "generate" energy that later moves the vehicle without burning any fuel has an immense positive impact on fuel economy. Cylinder deactivation allows a car to shut down cylinders that are not needed, which again increases fuel economy.

Not all improvements are directly related to the engine, though. For example, advanced materials like carbon fiber, innovations by tire manufacturers, or better lubricants from suppliers can lead to efficiency improvements as well.

When estimating this, we would ideally like to see the value increase from year to year; that would imply that technological progress always had a positive effect on fuel economy.

If we now get back to our models, there is one question: How are we able to estimate all of these huge improvements over all these years?

Before we can do anything, we need to load the data as usual.

#< task
# first, we should load the data 
dat = read.dta("Steroids_AER_data_post.dta")
regdata = filter(dat, d_truck==0 & outlier==0)
#> 

a) Technological progress in model 1

After loading the data again, we will estimate technological progress using our model 1. Since technological progress is a set of year fixed effects, we will simply display the values for year in our model.
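To see what "a set of year fixed effects" looks like in terms of data, we can peek at the dummy encoding R creates internally (just an illustration; 1980 is the omitted base year):

#< task
# each year becomes its own 0/1 dummy column; we only show the first rows
# and columns to keep the output small
head(model.matrix(~ factor(year), data = regdata))[, 1:5]
#>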

#< task
# Let's look at our old regression first: 
# we had "lmpg ~ lcurbwt+lhp+ltorque+d_manual+time_d_manual+d_diesel+d_turbo+d_super | year |0| mfr" in our felm command. 
# if we now want to get a value for every year in this regression, we have to treat the `year` parameter accordingly. 
# because of this, we have to convert "year" to a factor in our data. If we skipped that step we would only get 1 coefficient for year, but we would like to see how technology has changed from year to year. 

regdata$year <- factor(regdata$year)

# Next we replace the `year` in the factor part of the felm command with a 0, and add year to our parameters.
reg1t = felm(lmpg ~ 
               lcurbwt+lhp+ltorque+
               d_manual+time_d_manual+d_diesel+d_turbo+d_super+year |0|0| mfr, data = regdata)
# As the last step, we display the results using stargazer
stargazer(reg1t, type = "html")
#>

As you can see, we now have a value for every year. This value represents the level of technology in the given year relative to the base year 1980. Because we already have the values for $curbwt$, $hp$, $tq$ and $X_{it}$, and they didn't change, we are for now only interested in the technology values. In order to keep only the values we need, we will extract them from the results. For this task the tidy() command from the package broom makes things more convenient.

< info "tidy()"

The command tidy() from the package broom is an easy way to convert statistical analysis objects such as coefficients of a regression from R into tidy data frames. Data frames are more easily processed, reshaped or combined with tools from other packages like for example 'dplyr', 'tidyr' or 'ggplot2'.

Source: David Robinson (2015). broom: Convert Statistical Analysis Objects into Tidy Data Frames. R package version 0.4.0 https://cran.r-project.org/web/packages/broom/index.html

>

I did this already for you. Just click the check button.

If you want to see what is stored in TECH_PROG_MOD1, simply uncomment the last line in this command.

#< task
# We use the `tidy` command from the "broom" package to create a nicer looking appearance.  
library(broom)
M1t <- tidy(reg1t)

# lastly, we need to extract the data regarding year from our data frame "M1t"
TECH_PROG_MOD1 = M1t[10:35, c('term', 'estimate')]

# in case you want to see how this looks like, just uncomment the next line
# TECH_PROG_MOD1
#>
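A note on the hard-coded rows 10:35: they silently break if the set of regressors changes. A more robust alternative (a sketch, relying on felm labeling the year dummies "year1981" to "year2006") selects the rows by their term names instead:

#< task
# same extraction as above, but selecting rows by term name
TECH_PROG_MOD1_alt = M1t[grepl("^year", M1t$term), c('term', 'estimate')]
# identical(TECH_PROG_MOD1, TECH_PROG_MOD1_alt) should return TRUE
#>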

We now have an estimate of technological progress in model 1, saved as TECH_PROG_MOD1 (Technological Progress Model 1).

b) Technological progress in model 2

But since estimates for just one model are hard to evaluate on their own, we will compute these values for model 2 too.

Your task now is to create the regression in order to get $T_t$ for model 2:

#< task
# change the felm command as I did in the example.
# You have to leave the mfr as a factor, because this represents the Manufacturer Fixed effects we already added.
# In case you don't remember the command for Model 2, here it is again:
# felm(lmpg ~ lcurbwt+lhp+ltorque+d_manual+time_d_manual+d_diesel+d_turbo+d_super | year+ mfr |0| mfr, data = regdata)
# change it accordingly, then save it as `reg2t`
#>
reg2t = felm(lmpg ~ 
               lcurbwt+lhp+ltorque+
               d_manual+time_d_manual+d_diesel+d_turbo+d_super+year | mfr |0| mfr, data = regdata)

Good job on this regression again.

To show both results from reg1t and reg2t side by side, click check.

#< task
stargazer(reg1t, reg2t, column.labels=c("OLS","Fixed Effects"), type = "html")
#>

Now that we have the values for each year in both models, we only have to extract them for model 2.

I have already prepared the needed code for you, so simply click check once more.

If you would like to see how TECH_PROG_MOD2 looks, just uncomment the last line of code.

#< task
# Use tidy to create M2t from reg2t
M2t <- tidy(reg2t)
# Take rows 9:34 from M2t and save it as TECH_PROG_MOD2
TECH_PROG_MOD2 = M2t[9:34, c('term', 'estimate')]
# in case you want to see how this looks like, just uncomment the next line
# TECH_PROG_MOD2
#>

Good job on that regression. We are now able to create a plot of the two estimates for $T_t$. This will be done in exercise c).

c) Comparison

Let's put these two estimates in relation to each other:

In order to plot both of the results, we will first create a data frame containing the values of TECH_PROG_MOD1 and TECH_PROG_MOD2.

As in previous exercises, if you would like to see how p12 looks, just uncomment the last line of code.

#< task
# first we create a vector containing the years
Year = 1981:2006
# then we create a vector for the estimates we got from our regressions and saved 
# as TECH_PROG_MOD1 and TECH_PROG_MOD2
Model1 = TECH_PROG_MOD1$estimate
Model2 = TECH_PROG_MOD2$estimate
# now we have 3 vectors: one for the years, and one for each model's technological progress estimates
# we now only have to save them into a data frame. 
# cbind takes a sequence of vector, matrix or data frame arguments and combines them by columns
# in our case the three vectors 
# to be able to plot them, we need to transform them into a data.frame with data.frame()
p12 = data.frame(cbind(Year,Model1,Model2))
# p12
#>

After this step we have a data frame p12 containing the values of TECH_PROG_MOD1 and TECH_PROG_MOD2 as well as an indicator for the year. As a result, we can now easily plot the values of $T_t$ according to model 1 and model 2.

#< task
# Then we use the ggplot command from `ggplot2` to create the plot
progress12 <- ggplot(aes(x=Year,y=Estimate,colour = "Model No"), data=p12) + 
# now we will just add the lines
  geom_line(aes(y = Model1, colour = "Model 1")) + 
  geom_line(aes(y = Model2, colour = "Model 2"))
# we simply show the plot
progress12
#>
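As an aside, with more models this layer-per-model approach gets tedious. Reshaping the data into long format lets a single geom_line() draw one line per model. A sketch, assuming the tidyr package is installed (it is not among the packages loaded by this problem set):

#< task
# alternative: reshape p12 into long format, then map the model name to colour
library(tidyr)
p12long = pivot_longer(p12, cols = c(Model1, Model2),
                       names_to = "Model", values_to = "Estimate")
ggplot(p12long, aes(x = Year, y = Estimate, colour = Model)) + geom_line()
#>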

As you can see, the two models result in very similar estimates for Technological Progress $T_t$. This also indicates that any additional endogeneity concerns are likely to be small.

We can see that early in the sample (year 1981 to 1986) the increase of progress was greatest. This is consistent with what we estimated in exercise 2. After these years, progress is still increasing considerably, but has slowed down.

d) Possible fuel economy in 2006

The really interesting part is that we are now able to estimate how fuel economy in year $t$ compares to fuel economy in 1980 if we hold size and power constant. To do so, we will hold the values of $\ln curbwt$, $\ln hp$, $\ln tq$ and $X$ at their 1980 levels, and only change $T_t$.

To calculate this, we denote the possible fuel economy in year $t$ by $\widetilde {mpg_{it}}$.

According to our model, this would yield: $$\widetilde {\ln mpg_{it}} = T_t + \delta \ln curbwt_{i1980} +\beta \ln hp_{i1980} + \gamma \ln tq_{i1980} + X_{i1980}B + \tilde \epsilon_{i1980}$$

and the fuel economy in year 1980 as

$$\ln mpg_{i1980} = T_0 + \delta \ln curbwt_{i1980} +\beta \ln hp_{i1980} + \gamma \ln tq_{i1980} + X_{i1980}B + \tilde \epsilon_{i1980}$$

If we now want to calculate the increase in log fuel economy possible by year $t$ (for example 2006), we take the difference

$$G = \widetilde {\ln mpg_{it}} - \ln mpg_{i1980} = (T_t + \delta \ln curbwt_{i1980} +\beta \ln hp_{i1980} + \gamma \ln tq_{i1980} + X_{i1980}B + \tilde \epsilon_{i1980}) - (T_0 + \delta \ln curbwt_{i1980} +\beta \ln hp_{i1980} + \gamma \ln tq_{i1980} + X_{i1980}B + \tilde \epsilon_{i1980}) $$

This would then equal:

$$ G = T_t - T_0 \overset{T_0 = 0}{=} T_t$$

This way we can say that our estimates for $T_t$ are the increase in log fuel economy by year $t$ compared to 1980.

Therefore we can say that, for Model 1, the log of fuel economy is 0.52174952 greater in 2006 than in 1980. The analogous interpretation for Model 2 is that the log of fuel economy is 0.51150664 greater in 2006 than in 1980.

We will now estimate what the fuel economy of a car with 1980 characteristics would look like in 2006 according to Model 1.

To do so we will calculate the possible fuel economy in 2006, using the characteristics of 1980. The idea is that if we keep all values except $T_t$ at their 1980 levels, we can use our estimates of $T_t$ to calculate the log of fuel economy in each year $t$.

Let's assume a fictional car with the mean values of our attributes in 1980: we therefore have to save the mean values of our attributes in the year 1980 into a vector.

#< task
# to get these values, we need to take all the cars from 1980: 

cars1980 = filter(dat, d_truck == 0, outlier == 0, year == 1980)


# since you already know how to calculate means from Exercise 1, 
# this time we will save all the means in a vector called means1980.
# it contains the mean values for the relevant attributes we used in the regression. 

means1980 = c(mean(cars1980$lcurbwt),
              mean(cars1980$lhp),
              mean(cars1980$ltorque),
              mean(cars1980$d_manual), 
              mean(cars1980$time_d_manual), 
              mean(cars1980$d_diesel),
              mean(cars1980$d_turbo), 
              mean(cars1980$d_super))
# In case you would like to see the saved attributes, uncomment the next line.
# means1980
#>

Now that we have the mean values, we still need the values of our estimators $\delta$, $\beta$, $\gamma$ and $B$. These are the coefficients we obtained from our regression reg1t.

#< task
# since we want to have the coefficients of our regression, we can use the command `coef()` to get them.
# This command saves the coefficients into a data frame called datreg1
datreg1 = data.frame(coef(reg1t))
show(datreg1)
#>

So far we have all the coefficient values; now we want to pick out the ones we need: the values of $\delta$, $\beta$, $\gamma$ and $B$, as well as the value of $T_{2006}$ and the constant:

#< task
# We can retrieve values by declaring the index inside a single square bracket "[row,column]" operator.
# get the constant
const = datreg1[1,1]
const
# get the coefficients of our regression
coef = datreg1[2:9,1]
coef
# get the value for Technological progress in Year 2006
T2006 = datreg1[35,1]
T2006
#>
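By the way, hard-coded row numbers like 35 silently break if the model changes. Since the coefficients returned by coef() are named, an equivalent and more robust way is to select them by name (a sketch, relying on felm labeling the year dummies "year1981" to "year2006"):

#< task
# the same values, selected by name instead of row position
const_byname = coef(reg1t)["(Intercept)"]
T2006_byname = coef(reg1t)["year2006"]
# both should match const and T2006 from above
c(const_byname, T2006_byname)
#>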

First we will try to get a rough check of how good our regression results are:

We will use the values from our regression to estimate the mean log fuel economy in 1980, and then compare it with the real value from the data.

To calculate the mean fuel economy in 1980 with our model, we need to multiply our coefficients (coef) with the mean values of 1980 (means1980), and add the constant (const). The 0 represents the value for $T_{1980}$.

#< task
# here we are estimating the mean fuel economy in 1980 with our model
# sum(coef*means1980) multiplies the two vectors element-wise and sums the result:
# one contains the coefficients, the other the mean values of 1980.
reglmpg1980 = 0 + sum(coef*means1980) + const
reglmpg1980
# this command provides us the real value from our data.
mean(cars1980$lmpg)
#>

The value we get from our regression is 3.106287, which equals the mean of lmpg in 1980. This is expected: because the regression contains a dummy for every year, the fitted values reproduce the yearly means of lmpg exactly, so our calculation appears to be correct.

If we now want to estimate how the same fictional car would, ceteris paribus, look in the year 2006, we have to use the estimated $T_{2006}$ instead of $T_{1980}$:

#< task
# we use the same calculations as before, but we change T to the value of 2006
tildelmpg2006 = T2006 + sum(coef*means1980) + const
tildelmpg2006
#>

Now that we have those results, we can see that ceteris paribus the log of fuel economy in 2006 would be 3.628037. This means that the log of fuel economy is 0.512 greater in 2006 compared to 1980. As you might have already realized, this is exactly $T_{2006}$.

If we want to express the increase in percent, we need the values of $\widetilde {mpg_{t}}$ (note that this is not $\ln \widetilde {mpg_{t}}$ anymore).

$$ \% increase = \dfrac{\widetilde {mpg_{2006}} -mpg_{1980}}{mpg_{1980}}$$

Since

$$\ln(\exp(x)) = x $$ $$\exp(\ln(x)) = x $$

we can say that: $$ \widetilde {mpg_{t}} = \exp(\ln \widetilde {mpg_{t}}) $$

Now that we have a value for fuel economy in 2006, we can compare $\widetilde {mpg_{2006}}$ with the mean value of $mpg_{1980}$:

#< task
# exp(tildelmpg2006) provides us with the value of possible fuel economy in 2006, by undoing the log.
# then we just calculate the percentage increase compared to 1980.
progressm1 = (exp(tildelmpg2006)-mean(cars1980$mpg))/mean(cars1980$mpg)
progressm1
#>

Taking the results from Model 1, we can say that an increase in fuel economy by ~64.4 percent could have been possible.

The increase for model 2 will be discussed in exercise 6.3.

This exercise refers to pages 3373, 3382, 3384 and 3385 of the paper.

Exercise 6.1: Robustness: Cobb-Douglas

Now that we have an idea of $T_t$, one might be concerned that superchargers and turbochargers are themselves a kind of "technological progress". If that is the case, they should not appear as regressors, because their effect would already be represented in $T_t$. So let's take a look at this.

a) Loading data

As usual, we load our data.

#< task 
# We load the same data again.
dat = read.dta("Steroids_AER_data_post.dta")
# Then we drop the trucks and outliers from our data.
regdata = filter(dat, d_truck==0 & outlier==0)
#>

b) Market penetration of superchargers & turbochargers

Before we think about changing our models, we will take a look at the market penetration of turbochargers and superchargers.

To get an idea, we are going to plot the market penetration. Because d_turbo and d_super are dummy variables, the mean value in a given year equals the share of cars equipped with the respective technology in that year.
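As a quick sanity check of this claim, consider a toy dummy vector (unrelated to our data):

#< task
# four cars, two of them with a turbocharger: the mean of the 0/1 dummy
# is 2/4 = 0.5, i.e. a market penetration of 50 percent
mean(c(0, 1, 1, 0))
#>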

Click check to see an example.

#< task
# we take the same data we already used for the past regressions.
# remember exercise 1c) we do the same thing here.
pen = summarise(group_by(regdata, year), d_super=mean(d_super), d_turbo=mean(d_turbo))
# now we need to plot the penetration. We save it as pen1
pen1 = ggplot(aes(x=year, y=d_super), data = pen) + geom_line() + ggtitle("Supercharger Penetration")
# then we need to show pen1
pen1
#>

We can see that the market penetration of superchargers started in 1988. Since then it has increased, with a small dip in 1996; in the years after 1996, more and more cars have been equipped with a supercharger.

Now it is your turn. Do the equivalent plot for turbocharger.

#< task
# plot the market penetration for turbochargers (d_turbo)
# Look at the graph for Supercharger Penetration: 
# change the y variable to the turbocharger variable (d_turbo), and the title to "Turbocharger Penetration"
# then display the plot
#>
pen2 = ggplot(aes(x=year, y=d_turbo), data = pen) + geom_line() +ggtitle("Turbocharger Penetration")
pen2

What we can see here is that the market penetration of turbochargers in new cars increased drastically early in the sample. After a steep decline between 1989 and 1996, the share of new cars equipped with a turbocharger increased drastically again.

Now, regarding both market penetrations, we can say that more and more cars are equipped with a supercharger or turbocharger, especially in the later years of our sample.

< quiz "m3"

question: Regarding this, should we take d_super and d_turbo into account?
sc:
- yes
- no*
success: Great, your answer is correct!
failure: Try the other answer.

>

If we don't take d_super and d_turbo into account, we allow our estimates of technological progress to reflect their increased penetration, as well as their effect on fuel economy.

This transforms our iso-cost curve to:

Model 3: $$ \ln mpg_{it} = T_t + \delta \ln curbwt_{it} +\beta \ln hp_{it} + \gamma \ln tq_{it} + X'_{it}B + \epsilon_{it} $$

The difference between $X_{it}B$ and $X'_{it}B$ is that d_super and d_turbo are not included in $X'_{it}B$.

For this regression, we will use the felm() command to estimate the relationship lmpg ~ lcurbwt+lhp+ltorque+d_manual+time_d_manual+d_diesel with the factors year and mfr, clustered by mfr, again using regdata as data. Compared to Model 2, only the independent variables change: we simply leave out d_super and d_turbo:

#< task
# we use the felm command again. In comparison to Model 2, we now leave out d_super and d_turbo so that $T_t$ can pick up their increasing penetration
reg3 = felm(lmpg ~ 
              lcurbwt+lhp+ltorque+
              d_manual+time_d_manual+d_diesel | year+ mfr |0| mfr, data = regdata)
# we show the results
stargazer(reg3, type = "html")
#>

The Cobb-Douglas results imply that, ceteris paribus, a 10 percent decrease in weight is associated with a 4.19 percent increase in fuel economy. Large fuel efficiency gains are also correlated with lowering horsepower; all else equal, a 10 percent decrease in horsepower is associated with a 2.62 percent increase in fuel economy. The relationship between fuel economy and torque is small and not precisely estimated; a 10 percent increase in torque is correlated with a 0.45 percent increase in fuel economy.

For the discussion of this model's technological progress, see exercise 6.3.

< award "Cobb-Douglas!"

Congratulations, you've earned this award for creating the Cobb-Douglas models!

>

This exercise refers to pages 3372 and 3381 of the paper.

Exercise 6.2: Robustness: Translog

The assumptions made by the Cobb-Douglas model, are very restrictive. Therefore we will use a more flexible model in this exercise: the Translog production function.

A general Translog function can look like this:

$$ \ln y = \alpha_0 + \sum_{i} \alpha_i \ln X_i + \dfrac{1}{2} \sum_{i} \sum_{j} \gamma_{ij} \ln X_i \ln X_j$$

< info "Translog"

The proposal made by J. Kmenta in 1967 may be considered the first form of a translog production function. When Griliches and Ringstad proposed a new form of production function in 1971, the production function became in fact a labor productivity function.

Probably the main advantage of a translog function is that, unlike the Cobb-Douglas form, it does not assume strict premises such as perfect or "smooth" substitution between production factors, or perfect competition on the market for production factors (J. Klacek, et al., 2007).

Also, the concept of the translog production function makes it possible to move from a linear relationship between the output and the production factors to a non-linear one.

Source: Pavelescu Florin-Marius: Some aspects of the translog production function estimation

>

The advantage a translog function has over a Cobb-Douglas model is its flexible functional form, which imposes fewer restrictions on production elasticities and substitution elasticities. The disadvantage is that the results are more difficult to interpret; therefore we will not interpret them in as much detail as we did for the first three models.

A translog function is a generalization of the Cobb-Douglas production function, and therefore we can transform the cost function into level sets the same way as we did in exercise 5.1 a).

This results in: $$ \ln mpg_{it} = T_t + f(curbwt,tq,hp) + X_{it}B + \epsilon_{it} $$

which is equal to:

$$ \ln mpg_{it} = T_t + \beta_1 \ln curbwt_{it} +\beta_2 \ln hp_{it} + \beta_3 \ln tq_{it} + \\ \gamma_1(\ln curbwt_{it})^2 +\gamma_2(\ln hp_{it})^2 + \gamma_3(\ln tq_{it})^2 +\\ \delta_1\ln curbwt_{it}\ln hp_{it} + \delta_2\ln curbwt_{it}\ln tq_{it} + \delta_3\ln hp_{it}\ln tq_{it} + X_{it}B + \epsilon_{it} $$

a) Loading Data

We load the same data as for Cobb-Douglas. Simply click check:

#< task
dat = read.dta("Steroids_AER_data_post.dta")
regdata = filter(dat, d_truck==0 & outlier==0)
#>

b) Model 4

For the first translog model we will take the same assumptions we made for the first Cobb-Douglas model.

< quiz "t1"

question: To give you an idea again, simply check the attributes we used for Cobb-Douglas Model 1.
mc:
- fuel economy
- other attributes not related to fuel economy
- horsepower
- torque
- curb weight
- acceleration
- other attributes related to fuel economy*
- cylinders
success: Good job, all answers are correct!
failure: Not all answers correct. Try again. Only 5 answers have to be ticked.

>

Since you now know again which characteristics are part of our Cobb-Douglas models, this is what the level sets of our translog model look like:

$$ \ln mpg_{it} = T_t + \beta_1 \ln curbwt_{it} +\beta_2 \ln hp_{it} + \beta_3 \ln tq_{it} +\\ \gamma_1(\ln curbwt_{it})^2 +\gamma_2(\ln hp_{it})^2 + \gamma_3(\ln tq_{it})^2 +\\ \delta_1\ln curbwt_{it}\ln hp_{it} + \delta_2\ln curbwt_{it}\ln tq_{it} + \delta_3\ln hp_{it}\ln tq_{it} + X_{it}B + \epsilon_{it} $$

The difference between this and the "old" Cobb-Douglas level sets is that the translog level set has the functional part

$$ ...+\gamma_1(\ln curbwt_{it})^2 +\gamma_2(\ln hp_{it})^2 + \gamma_3(\ln tq_{it})^2 + \delta_1\ln curbwt_{it}\ln hp_{it} + \delta_2\ln curbwt_{it}\ln tq_{it} + \delta_3\ln hp_{it}\ln tq_{it}+...$$

added.

This gives us fewer restrictions on production elasticities and substitution elasticities, but makes the results more difficult to interpret.

Our data has columns called lcurbwt2, lhp2 and ltorque2, which are equal to $(\ln curbwt_{it})^2$, $(\ln hp_{it})^2$ and $(\ln tq_{it})^2$. The same applies to lcurbwt_lhp, lcurbwt_ltorque and lhp_ltorque, which are equal to $\ln curbwt_{it}\ln hp_{it}$, $\ln curbwt_{it}\ln tq_{it}$ and $\ln hp_{it}\ln tq_{it}$.
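Our dataset already ships with these columns, but if it did not, they would be easy to build with dplyr's mutate() (purely illustrative; the *_check names are made up):

#< task
# how columns like lhp2 or lcurbwt_lhp could be constructed if they were missing
regdata_check = mutate(regdata,
                       lhp2_check = lhp^2,
                       lcurbwt_lhp_check = lcurbwt*lhp)
# all.equal(regdata_check$lhp2_check, regdata$lhp2) should return TRUE
#>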

$X_{it}$ is the same as in model 1, it contains the dummy variables of characteristics related to fuel economy d_manual+time_d_manual+d_diesel+d_turbo+d_super.

Now, we will use a regression again to calculate the coefficients under the Translog assumption:

#< task
# we load a package again
library(lfe)
# we use the felm command to save the regression as reg4  
reg4 = felm(lmpg~ 
              lcurbwt+ lhp+ ltorque+ 
              lhp2+ lcurbwt2+ ltorque2+ 
              lcurbwt_lhp+ lcurbwt_ltorque+lhp_ltorque+ 
              d_manual+ time_d_manual+ d_diesel+ d_turbo+ d_super | year |0| mfr, data = regdata)
# we use stargazer to show the results in a nice html format
stargazer(reg4, type="html")
#>

If we now take a look at the results, one of the first things we might notice is that the standard errors for lcurbwt, lhp and ltorque are considerably larger than in the Cobb-Douglas models.

Since there might be the same problem with endogeneity as in our Cobb-Douglas Model, let's see what happens if we add manufacturer fixed effects to our translog model.

c) Model 5

In this exercise we will add manufacturer fixed effects to our recently developed translog model, repeating the steps we took for Model 2:

It is now your task to create the felm() command for this regression. Simply do it the same way as you did before with the Cobb-Douglas model. In case you don't remember why and how, you can either go back to exercise 5.1 e) or follow the instructions given here.

#< task
# save reg5 as a felm command. 
# lmpg should be described as lcurbwt+ lhp+ ltorque+ 
# lhp2 +lcurbwt2 +ltorque2+ 
# lcurbwt_lhp+ lcurbwt_ltorque +lhp_ltorque 
# +d_manual +time_d_manual +d_diesel+ d_turbo +d_super
# add manufacturer and year fixed effects to our translog model by adding year + mfr as factors. 
# don't forget to cluster by manufacturer

#>
reg5 = felm(lmpg~ 
              lcurbwt+ lhp+ ltorque+ 
              lhp2 +lcurbwt2 +ltorque2+ 
              lcurbwt_lhp+ lcurbwt_ltorque +lhp_ltorque +
              d_manual +time_d_manual +d_diesel+ d_turbo +d_super | year + mfr |0| mfr, data = regdata)

To compare the results, click check

#< task
stargazer(reg4, reg5, column.labels=c("OLS","Fixed Effects"), type = "html")
#>

If we look at the Standard Errors of this model, how did they change?

Question 1:

< quiz "translog1"

question: Compare the results from reg4 with reg5. How did the standard errors change overall?
sc:
- decrease*
- increase
success: Great, your answer is correct!
failure: Try the other answer.

>

Question 2:

< quiz "translog2"

question: Do smaller standard errors mean that the estimated values are closer to the mean value?
sc:
- false
- true*
success: Great, your answer is correct!
failure: Try the other answer.

>

d) Model 6

If we now take the increased market penetration of turbochargers and superchargers (see exercise 6.1) into account, and therefore eliminate them from our regression, this yields the following:

We had: $$ \ln mpg_{it} = T_t + \beta_1 \ln curbwt_{it} +\beta_2 \ln hp_{it} + \beta_3 \ln tq_{it} +\\ \gamma_1(\ln curbwt_{it})^2 +\gamma_2(\ln hp_{it})^2 + \gamma_3(\ln tq_{it})^2 + \\ \delta_1\ln curbwt_{it}\ln hp_{it} + \delta_2\ln curbwt_{it}\ln tq_{it} + \delta_3\ln hp_{it}\ln tq_{it} + X_{it}B + \epsilon_{it} $$

If we now change $X_{it}B$, which consisted of d_manual, time_d_manual, d_diesel, d_turbo and d_super, to $X'_{it}B$, which only contains d_manual, time_d_manual and d_diesel, this results in our final translog form:

$$ \ln mpg_{it} = T_t + \beta_1 \ln curbwt_{it} +\beta_2 \ln hp_{it} + \beta_3 \ln tq_{it} +\\ \gamma_1(\ln curbwt_{it})^2 +\gamma_2(\ln hp_{it})^2 + \gamma_3(\ln tq_{it})^2 + \\ \delta_1\ln curbwt_{it}\ln hp_{it} + \delta_2\ln curbwt_{it}\ln tq_{it} + \delta_3\ln hp_{it}\ln tq_{it} + X'_{it}B + \epsilon_{it} $$

The regression for the translog model with manufacturer fixed effects, accounting for the market penetration of turbochargers and superchargers by omitting d_turbo and d_super, looks like this:

#< task
reg6 = felm(lmpg~
              lcurbwt+ lhp+ ltorque+ 
              lhp2 +lcurbwt2 +ltorque2 +
              lcurbwt_lhp+ lcurbwt_ltorque+ lhp_ltorque +
              d_manual +time_d_manual +d_diesel | year + mfr |0| mfr, data = regdata)

#>
#< task
stargazer(reg4, reg5, reg6, column.labels=c("OLS","Fixed Effects","Fixed Effects no turbo/super"), type = "html")
#>

Now that we have all our regression results, we can see that the standard errors of all translog models are larger than those of the Cobb-Douglas models.

It appears that the Translog assumption does overparameterize the iso-cost curve.

The coefficients associated with manual transmissions and diesel engines suggest fuel economy savings for these two attributes. For our translog models, the increase in fuel efficiency from diesel technology is between 24 and 27 percent.

The gains from a manual transmission are estimated to fall over time, since more and more cars are equipped with an automatic transmission. Early in our sample, a manual transmission suggests savings between 7.6 and 8.7 percent.

Since the efficiency gains of automatic transmissions, in relation to manual transmissions, can also be represented as technological improvements specific to automatic transmissions, we can try to think of it as some kind of technological progress.

< award "Translog!"

Congratulations, you've earned this award for completing the translog regressions!

>

This exercise refers to pages 3372 and 3381 of the paper.

Exercise 6.3: Robustness: Technological Progress

Since we are not only interested in the trade-offs between the different attributes, but also in how technology has changed over the course of our data, we will now compare the estimates of technological progress across all our models.

Because we will later need the values of cars in 1980, I already prepared the needed data for you. Simply click check.

#< task
dat = read.dta("Steroids_AER_data_post.dta")
cars1980 = filter(dat, d_truck == 0, outlier == 0, year == 1980)
#>

In order to make this exercise more convenient for you, I have already computed the technological progress estimators for all models beforehand, in the same way as we just did for models 1 and 2 in exercise 5.2. To load the required estimates, use read.table() to read the file "Progress.txt" and save it in a variable called progress16 (meaning: Progress Models 1-6).

#< task
# use read.table to save "Progress.txt" as progress16

#>
progress16 = read.table("Progress.txt")

Now that we have loaded the technological progress estimates, we surely want to take a look at them.

#< task
# Use the show() command on progress16 to view the estimates
#>
show(progress16)

Looking at these estimates, the first thing we might notice is that the values are pretty close to each other across all models. This corresponds to the results we had earlier when looking at just models 1 and 2. To get an idea of how close the estimates are across all models, click the check button to see a graph.

#< task
progressplot <- ggplot(progress16, aes(x=Year, y= Progress, colour = "Model No")) + 
  geom_line(aes(y = Model.1, colour = "Model.1")) + 
  geom_line(aes(y = Model.2, colour = "Model.2")) +
  geom_line(aes(y = Model.3, colour = "Model.3")) + 
  geom_line(aes(y = Model.4, colour = "Model.4")) + 
  geom_line(aes(y = Model.5, colour = "Model.5")) + 
  geom_line(aes(y = Model.6, colour = "Model.6")) 

progressplot
#>

As we already noticed, the technological progress estimates are very similar across models.

If we take a closer look at the graph, we can see that the increase in progress was greatest early in the sample (1981 to 1986). This is consistent with what we estimated in exercise 2. Another reason besides the CAFE standards might be that gasoline prices were high early in the sample, so the industry had to come up with ideas to increase fuel economy in order to sell its cars. After these years, technological progress still increases considerably, but at an obviously slower pace.

All results for the Cobb-Douglas models (1-3) are significant at the one percent level; the same applies to our translog models (4-6). Interestingly, even though the results are very similar, we can still see small differences between the models: our Cobb-Douglas models yield slightly higher estimates of progress over the years than the translog models, which might come from the additional functional part of the translog function. All of the models imply that, conditional on weight and power characteristics, the log of fuel economy is at least 0.485 greater in 2006 compared to 1980.

Since $T_t$ is the absolute increase of $lmpg$ in year $t$, we can estimate the percentage increase for every model quite easily.

This is a faster way of estimating the percentage increase than the one we used in exercise 5.2.

First off, we will extract the values for $T_{2006}$ from our estimates progress16.

#< task
T2006 = progress16[26,2:7]
T2006
#>

Because $T_{2006}$ is the absolute increase in lmpg, we can add it to the mean value of lmpg in 1980 to get $\ln \widetilde {mpg_{2006}}$ for every model. Your task is now to add the mean value of lmpg from cars1980 to our recently created T2006.

#< task
# create a new variable called lmpgtilde
# then add the mean value for lmpg from cars1980 to T2006
# display lmpgtilde
#>
lmpgtilde = T2006 + mean(cars1980$lmpg)
lmpgtilde 

As you can see, these are the different values of $\ln \widetilde {mpg_{2006}}$ across our models if we hold the other characteristics at their 1980 levels.

Now it is very easy to calculate the percentage increase for each model.

#< task
percentincrease = (exp(lmpgtilde) - mean(cars1980$mpg)) / mean(cars1980$mpg)
percentincrease
#> 

If we now take a look at the percentage increases, we can see similar differences between the models as before (which is logical, of course). The Cobb-Douglas models yield slightly higher increases than the translog models, which results from their already larger values of $T_{2006}$. Overall we can say that, conditional on weight and power characteristics, the log of fuel economy is over 0.485 greater in 2006 compared to 1980. At the mean fuel economy of 1980, our models imply that a 58 percent increase in fuel economy could have been possible. Compared to the 18 percent "real" increase we found in exercise 1, this is a pretty big difference.

< award "Modelwide Technological Progress!"

Congratulations, you've earned this award for completing the last exercise!

>

This exercise refers to pages 3382 and 3384 of the paper.

Exercise 7: Conclusion

Before we come to a conclusion, let's see what you have been awarded for in this problem set.

To see which awards you achieved during this problem set, click check for one last time. In case you got all the awards, there should be 10 awards shown.

#< task
awards(as.html=TRUE)
#>

< award "Problem Set Complete"

Congratulations, you've earned this award for completing the problem set!

>

In conclusion, we can say that after analyzing the given data we are able to estimate the trade-offs that consumers and manufacturers face when choosing between fuel economy, vehicle size, and vehicle power. The estimated trade-off between weight and fuel economy suggests that fuel economy increases by over 4 percent for every 10 percent reduction in weight. On average, fuel economy increases by 2.7 percent for every 10 percent reduction in horsepower; the effect of torque, however, is less precisely estimated. We are also able to estimate the technological advances that occurred along these dimensions from 1980 to 2006. Consequently, fuel economy would have been nearly 60 percent higher in 2006 compared to the 1980 level if we had kept vehicle size and power constant at their 1980 levels.

We could also use our results to potentially estimate how fuel economy could look in the future.

Let us look back at exercise 2. There we found a positive correlation between CAFE standards and fuel economy. But would we be able to achieve the increased fuel economy we estimated if CAFE standards had been tightened continuously?

In order to answer this, we should look at the question from another perspective. Let's assume we were customers who would like to buy a new car. Would you choose a new car that has the same characteristics as your old car, if there had only been improvements on the fuel economy side since you bought your old car? Since most customers buy new cars with the intention of getting more horsepower, better torque, better acceleration and so on, it is quite obvious that such a person would not buy a new car that lacks better engine characteristics.

Therefore the incentive to buy a new car is lost, and sales numbers would decrease considerably.

So what could be a solution?

One way of maintaining incentives for manufacturers to increase fuel economy would be to convince customers of the importance of fuel economy when purchasing a new car. If consumers value fuel economy over other characteristics such as horsepower, manufacturers will have to value fuel economy more strongly too. As we have seen, CAFE standards are a good way for policy makers to ensure minimum fuel economy requirements for vehicles. But if the standards are too strict while customers want other characteristics, more and more manufacturers will be willing to pay the fine in order to sell their cars.

We also have to note that all our results are based on an economic perspective. As a result, we regard the functional relationships within a car or an engine as a "black box". Approaching this question from an engineer's perspective, we would probably have to take other models into consideration.

Exercise 8: References

Bibliography

R and packages in R

Websites

Licence

Author: Marius Breitmayer

Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.


