Model

The final data set was created and cleaned with a number of predictor variables, including trailing beta, trailing volatility, and price-to-book value, along with the associated outcome: whether or not a stock was in the Min Vol index (1 if in, 0 if not in).

A snippet of the final data set is shown below:

head(trainingData)
tail(trainingData)
summary(trainingData)

Given the binary nature of the outcome, a logit (logistic) regression will be run. Using the historical data and various stock characteristics, this models the log odds of a stock being in the minimum volatility index as a linear combination of the predictors mentioned above. Several models will be run: one on the entire pool of monthly data, and others on subsets for the rebalancing months.
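In general form, using the predictor names that appear in the model code below, the specification is:

$$\ln\left[\frac{p}{1-p}\right] = \beta_0 + \beta_1 \times \text{volatility} + \beta_2 \times \text{beta} + \beta_3 \times \text{price\_to\_book} + \beta_4 \times \text{index\_before}$$

where $p$ is the probability that a stock is in the minimum volatility index in the current period.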

Model 1: Entire Data Set (Monthly)

The first logit model that will be run is for the entire pool of monthly data.

Data Cleaning - Checking for Class Bias

Ideally, the proportion of stocks in and out of the USMV index should be approximately the same. Checking this, we can see that this is not the case: only around 24% of the observations come from stocks that are currently in the index, so there is a class bias. As a result, we must sample the observations in approximately equal proportions to get better models.

table(monthly_final$index_now)

Create Training and Test Samples

One way to address the problem of class bias is to draw the 0's and 1's for trainingData (the development sample) in equal proportions, and to put the rest of the input data, not used for training, into testData (the validation sample). As a result, the development sample will be smaller than the validation sample, which is acceptable because there is a large number of observations.

# Create Training Data
input_ones <- monthly_final[which(monthly_final$index_now == 1), ]  # all 1's
input_zeros <- monthly_final[which(monthly_final$index_now == 0), ]  # all 0's
set.seed(100)  # for repeatability of samples
input_ones_training_rows <- sample(1:nrow(input_ones), 0.7*nrow(input_ones))  # 1's for training
input_zeros_training_rows <- sample(1:nrow(input_zeros), 0.7*nrow(input_ones))  # 0's for training. Pick as many 0's as 1's
training_ones <- input_ones[input_ones_training_rows, ]  
training_zeros <- input_zeros[input_zeros_training_rows, ]
trainingData <- rbind(training_ones, training_zeros)  # row bind the 1's and 0's 
# Create Test Data
test_ones <- input_ones[-input_ones_training_rows, ]
test_zeros <- input_zeros[-input_zeros_training_rows, ]
testData <- rbind(test_ones, test_zeros)  # row bind the 1's and 0's 
# Remove NA values in index_before
testData <- subset(testData, !is.na(index_before))
trainingData <- subset(trainingData, !is.na(index_before))

Now we can check class bias to see if it is more balanced. It is very close to being evenly weighted now.

table(trainingData$index_now)

Logistic Regression Model

Now the model can be run:

# Model 1
logit1 <- glm(index_now ~  volatility + beta + price_to_book + index_before, data=trainingData, family=binomial(link="logit"))

# Summary of Model 1
summary(logit1)

# Coefficient Interpretation
## Odds ratios (exponentiated coefficients)
exp(coef(logit1))
## Probability scale
(exp(coef(logit1))) / (1+(exp(coef(logit1))))

Looking at the monthly data does not give a true representation of the rebalancing decision, because the index is rebalanced once every six months, not once a month.

Interpretation of Model

The model can be interpreted as:

ln[$\frac{p}{1-p}$] = -1.86 - 0.0044 x vol - 0.31 x beta + 0.00039 x price_to_book + 6.16 x index_before

$\frac{p}{1-p}$ = exp(-1.86 - 0.0044 x vol - 0.31 x beta + 0.00039 x price_to_book + 6.16 x index_before)

The coefficients can be interpreted as:

- Volatility: the odds of being in the index are multiplied by 0.996 for a one-unit increase in volatility. This predictor is not statistically significant.
- Beta: the odds of being in the index are multiplied by 0.731 for a one-unit increase in beta. This predictor is statistically significant.
- Price to book: the odds of being in the index are multiplied by 1.0051 for a one-unit increase in the price-to-book ratio. This predictor is not statistically significant.
- Index before: the odds of being in the index are 410.261 times greater if the stock was in the index 6 months ago. This predictor is statistically significant.
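As a complementary view, the same odds ratios can be tabulated alongside Wald 95% confidence intervals. This is a sketch, not part of the original output, and assumes logit1 has been fit as above.

# Odds ratios with Wald 95% confidence intervals for Model 1 (assumes logit1 from above)
round(exp(cbind(odds_ratio = coef(logit1), confint.default(logit1))), 4)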

Sanity Check

To understand the model better, we can take a sample stock that was not in the USMV index on 12-30-2016 and see how accurate our model would be in predicting the probability of this stock being in the index. We can take AAL (American Airlines), which had a beta of 1.6312867, volatility of 0.8067945, a price-to-book ratio of 4.6943413, and was not in the USMV index 6 months ago. This stock ended up not being in the minimum volatility index on 12-30-2016, so we would expect the predicted probability to be relatively low.

- Odds: $\frac{p}{1-p}$ = exp(-3.094 - 0.0032 x 0.8067945 - 0.25 x 1.6312867 + 0.00051 x 4.6943413 + 6.017 x 0) = 0.03013677
- Probability: p = 0.03013677 / (1 + 0.03013677) = 0.02925511

The odds of AAL being in the index on 12-30-2016 are 0.03013677, which translates to a probability of 2.93%. As expected, already knowing that the stock was not in the index, this low probability seems reasonable.

To further understand the model, we can look at a stock that was in the USMV index on 12-30-2016 and see how accurate our model would be in predicting the probability of this stock being in the index. We can take AAPL (Apple), which had a beta of 1.0099644, volatility of 0.6118842, a price-to-book ratio of 4.7037726, and was in the USMV index 6 months ago. This stock ended up being in the minimum volatility index on 12-30-2016, so we would expect the predicted probability to be relatively high.

- Odds: $\frac{p}{1-p}$ = exp(-3.094 - 0.0032 x 0.6118842 - 0.25 x 1.0099644 + 0.00051 x 4.7037726 + 6.017 x 1) = 14.45369
- Probability: p = 14.45369 / (1 + 14.45369) = 0.9352905

The odds of AAPL being in the index on 12-30-2016 are 14.45369, which translates to a probability of 93.53%. As expected, already knowing that the stock was in the index, this high probability seems reasonable.
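The same check can be done programmatically. The sketch below is not part of the original analysis: it scores a single hypothetical observation with predict(), using the AAL figures quoted above and assuming index_before is stored as a numeric 0/1 indicator. The resulting probability can be compared with the hand calculation.

# Score one hypothetical observation (the AAL figures quoted above) with Model 1
aal_row <- data.frame(volatility = 0.8067945,
                      beta = 1.6312867,
                      price_to_book = 4.6943413,
                      index_before = 0)  # assumed numeric 0/1 indicator
predict(logit1, newdata = aal_row, type = "response")  # predicted probability of index membership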

Model Quality

To test the quality of the model, several diagnostic tests were run.

Predictive Power

The default cutoff prediction probability is 0.5, or the ratio of 1's to 0's in the training data. Sometimes, however, tuning the probability cutoff can improve accuracy in both the development and validation samples. The InformationValue::optimalCutoff function can find the cutoff that best improves the prediction of 1's, of 0's, or of both, or that minimizes the misclassification error. Here, the optimal cutoff is 0.74.

library(InformationValue)
# Predicted probabilities on the validation sample (also computed below for the misclassification error)
predicted <- plogis(predict(logit1, testData))
optCutOff <- optimalCutoff(testData$index_now, predicted)[1]

VIF

As in linear regression, we should check for multicollinearity in the model. As seen below, all of the predictors in the model have VIF values well below 4.

library(car)
vif(logit1)

Misclassification Error

Misclassification error is the percentage mismatch of predicted vs. actual values, irrespective of 1's or 0's. The lower the misclassification error, the better the model. Here it is 3.1%, which is quite low, and thus good.

predicted <- plogis(predict(logit1, testData)) 
misClassError(testData$index_now, predicted)
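Since the cutoff was tuned above, a variant worth checking (a sketch, not in the original output) is the misclassification error at the tuned cutoff rather than the default threshold of 0.5:

# Misclassification error at the tuned cutoff (the default threshold is 0.5)
misClassError(testData$index_now, predicted, threshold = optCutOff)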

ROC

The receiver operating characteristic (ROC) curve traces the percentage of true positives accurately predicted by a given logit model as the prediction probability cutoff is lowered from 1 to 0. For a good model, lowering the cutoff should mark more of the actual 1's as positives and fewer of the actual 0's as 1's, so the curve should rise steeply, indicating that the true positive rate (y-axis) increases faster than the false positive rate (x-axis) as the cutoff decreases. The greater the area under the ROC curve, the better the predictive ability of the model. Here, it is 96.3%.

plotROC(testData$index_now, predicted)

Concordance

Ideally, the model-calculated probability scores of all actual positives (the 1's) should be greater than the model-calculated probability scores of all actual negatives (the 0's). Such a model is said to be perfectly concordant and highly reliable. This phenomenon can be measured by concordance and discordance.

In simpler words, of all 1-0 pairs of actuals, concordance is the percentage of pairs in which the score of the actual positive is greater than the score of the actual negative. For a perfect model this would be 100%, so the higher the concordance, the better the quality of the model. With a concordance of 97.2%, this is a good quality model.

Concordance(testData$index_now, predicted)
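As a cross-check on the packaged function, concordance can also be computed directly from all 1-0 pairs. The sketch below is not part of the original analysis and ignores tied scores, so it should only come close to the Concordance() output above; it assumes predicted and testData from the chunks above.

# Manual concordance check: fraction of (actual 1, actual 0) pairs in which the
# actual 1's predicted score exceeds the actual 0's predicted score
scores_one  <- predicted[testData$index_now == 1]
scores_zero <- predicted[testData$index_now == 0]
mean(outer(scores_one, scores_zero, FUN = ">"))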

Specificity and Sensitivity

Sensitivity (the true positive rate) is the percentage of actual 1's correctly predicted by the model; in this model it was found to be 89.6%. Specificity is the percentage of actual 0's correctly predicted, and can also be calculated as 1 minus the false positive rate; in this model it was found to be 97.9%.

sensitivity(testData$index_now, predicted, threshold = optCutOff)
specificity(testData$index_now, predicted, threshold = optCutOff)

Confusion Matrix

In the confusion matrix below, the columns are the actual values, while the rows are the predicted values.

confusionMatrix(testData$index_now, predicted, threshold = optCutOff)

Model 2: November Model

Since the index is rebalanced twice a year (once in November and once in May), it makes sense to look at a model for each of these individual months. Thus, a subset of the data was taken for November, and the same procedure was applied as with Model 1.

# Subset data for dates from November only
november_final <- filter(monthly_final, date == "2011-11-30" | date == "2012-11-30"| date == "2013-11-29"| date == "2014-11-28" | date == "2015-11-30" | date == "2016-11-30")
# Remove NA values from set
november_final <- subset(november_final, !is.na(index_before))

Data Cleaning - Checking for Class Bias

Ideally, the proportion of stocks in and out of the USMV index should be approximately the same. Checking this, we can see that this is not the case: only around 26% of the observations come from stocks that are currently in the index, so there is a class bias. As a result, we must sample the observations in approximately equal proportions to get a better model.

table(november_final$index_now)

Create Training and Test Samples

One way to address the problem of class bias is to draw the 0's and 1's for trainingData2 (the development sample) in equal proportions, and to put the rest of the input data, not used for training, into testData2 (the validation sample). As a result, the development sample will be smaller than the validation sample, which is acceptable because there is a large number of observations.

# Create Training Data
input_ones2 <- november_final[which(november_final$index_now == 1), ]  # all 1's
input_zeros2 <- november_final[which(november_final$index_now == 0), ]  # all 0's
set.seed(100)  # for repeatability of samples
input_ones_training_rows2 <- sample(1:nrow(input_ones2), 0.7*nrow(input_ones2))  # 1's for training
input_zeros_training_rows2 <- sample(1:nrow(input_zeros2), 0.7*nrow(input_ones2))  # 0's for training. Pick as many 0's as 1's
training_ones2 <- input_ones2[input_ones_training_rows2, ]  
training_zeros2 <- input_zeros2[input_zeros_training_rows2, ]
trainingData2 <- rbind(training_ones2, training_zeros2)  # row bind the 1's and 0's 
# Create Test Data
test_ones2 <- input_ones2[-input_ones_training_rows2, ]
test_zeros2 <- input_zeros2[-input_zeros_training_rows2, ]
testData2 <- rbind(test_ones2, test_zeros2)  # row bind the 1's and 0's 

Now we can check class bias to see if it is more balanced. It is evenly weighted now, with each being represented by 525 observations.

table(trainingData2$index_now)

Logistic Regression Model

Now the model can be run:

# Model 2
logit2 <- glm(index_now ~  volatility + beta + price_to_book + index_before, data=trainingData2, family=binomial(link="logit"))

# Summary of Model 2
summary(logit2)

# Coefficient Interpretation
## Odds ratios (exponentiated coefficients)
exp(coef(logit2))
## Probability scale
(exp(coef(logit2))) / (1+(exp(coef(logit2))))

Looking at the November model will be helpful for someone looking to predict the November rebalance using data observed between June and October.
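For example, a prediction ahead of a November rebalance might look like the sketch below. The stocks and values are purely hypothetical placeholders, the column names match the model formula, and index_before is assumed to be a numeric 0/1 indicator.

# Hypothetical snapshot of candidate stocks observed ahead of a November rebalance
current_snapshot <- data.frame(ticker = c("AAA", "BBB"),  # placeholder tickers
                               volatility = c(0.45, 0.90),
                               beta = c(0.80, 1.40),
                               price_to_book = c(3.2, 5.1),
                               index_before = c(1, 0))  # assumed numeric 0/1 indicator
current_snapshot$prob_in_index <- predict(logit2, newdata = current_snapshot, type = "response")
current_snapshot[order(-current_snapshot$prob_in_index), ]  # rank by predicted probability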

Interpretation of Model

The model can be interpreted as:

ln[$\frac{p}{1-p}$] = -1.46 + 0.061 x vol - 0.49 x beta - 0.00013 x price_to_book + 5.08 x index_before

$\frac{p}{1-p}$ = exp(-1.46 + 0.061 x vol - 0.49 x beta - 0.00013 x price_to_book + 5.08 x index_before)

The coefficients can be interpreted as:

- Volatility: the odds of being in the index are multiplied by 1.063 for a one-unit increase in volatility. This predictor is statistically significant at an alpha level of 0.1.
- Beta: the odds of being in the index are multiplied by 0.61 for a one-unit increase in beta. This predictor is statistically significant.
- Price to book: the odds of being in the index are multiplied by 0.99 for a one-unit increase in the price-to-book ratio. This predictor is not statistically significant.
- Index before: the odds of being in the index are 160.88 times greater if the stock was in the index 6 months ago. This predictor is statistically significant.

Sanity Check

Will do later, if useful.

Model Quality

To test the quality of the model, several diagnostic tests were run.

Predictive Power

The default cutoff prediction probability is 0.5, or the ratio of 1's to 0's in the training data. Sometimes, however, tuning the probability cutoff can improve accuracy in both the development and validation samples. The InformationValue::optimalCutoff function can find the cutoff that best improves the prediction of 1's, of 0's, or of both, or that minimizes the misclassification error. Here, the optimal cutoff is 0.95.

library(InformationValue)
# Predicted probabilities on the validation sample (also computed below for the misclassification error)
predicted2 <- plogis(predict(logit2, testData2))
optCutOff2 <- optimalCutoff(testData2$index_now, predicted2)[1]

VIF

As in linear regression, we should check for multicollinearity in the model. As seen below, all of the predictors in the model have VIF values well below 4.

library(car)
vif(logit2)

Misclassification Error

Misclassification error is the percentage mismatch of predicted vs. actual values, irrespective of 1's or 0's. The lower the misclassification error, the better the model. Here it is 4.4%, which is quite low, and thus good.

predicted2 <- plogis(predict(logit2, testData2)) 
misClassError(testData2$index_now, predicted2)

ROC

The receiver operating characteristic (ROC) curve traces the percentage of true positives accurately predicted by a given logit model as the prediction probability cutoff is lowered from 1 to 0. For a good model, lowering the cutoff should mark more of the actual 1's as positives and fewer of the actual 0's as 1's, so the curve should rise steeply, indicating that the true positive rate (y-axis) increases faster than the false positive rate (x-axis) as the cutoff decreases. The greater the area under the ROC curve, the better the predictive ability of the model. Here, it is 95.2%.

plotROC(testData2$index_now, predicted2)

Concordance

Ideally, the model-calculated probability scores of all actual positives (the 1's) should be greater than the model-calculated probability scores of all actual negatives (the 0's). Such a model is said to be perfectly concordant and highly reliable. This phenomenon can be measured by concordance and discordance.

In simpler words, of all 1-0 pairs of actuals, concordance is the percentage of pairs in which the score of the actual positive is greater than the score of the actual negative. For a perfect model this would be 100%, so the higher the concordance, the better the quality of the model. With a concordance of 95.5%, this is a good quality model.

Concordance(testData2$index_now, predicted2)

Specificity and Sensitivity

Sensitivity (the true positive rate) is the percentage of actual 1's correctly predicted by the model; in this model it was found to be 85.8%. Specificity is the percentage of actual 0's correctly predicted, and can also be calculated as 1 minus the false positive rate; in this model it was found to be 97.3%.

sensitivity(testData2$index_now, predicted2, threshold = optCutOff2)
specificity(testData2$index_now, predicted2, threshold = optCutOff2)

Confusion Matrix

In the confusion matrix below, the columns are the actual values, while the rows are the predicted values.

confusionMatrix(testData2$index_now, predicted2, threshold = optCutOff2)

Model 3: May Model

Since the index is rebalanced twice a year (once in November and once in May), it makes sense to look at a model for each of these individual months. Thus, a subset of the data was taken for May, and the same procedure was applied as with Model 1.

# Subset data for dates from May only
may_final <- filter(monthly_final, date == "2012-05-31" |  date == "2013-05-31"| date == "2014-05-30" | date == "2015-05-29" | date == "2016-05-31")
# Remove NA values from set
may_final <- subset(may_final, !is.na(index_before))

Data Cleaning - Checking for Class Bias

Ideally, the proportion of stocks in and out of the USMV index should be approximately the same. Checking this, we can see that this is not the case: only around 24% of the observations come from stocks that are currently in the index, so there is a class bias. As a result, we must sample the observations in approximately equal proportions to get a better model.

table(may_final$index_now)

Create Training and Test Samples

One way to address the problem of class bias is to draw the 0's and 1's for trainingData3 (the development sample) in equal proportions, and to put the rest of the input data, not used for training, into testData3 (the validation sample). As a result, the development sample will be smaller than the validation sample, which is acceptable because there is a large number of observations.

# Create Training Data
input_ones3 <- may_final[which(may_final$index_now == 1), ]  # all 1's
input_zeros3 <- may_final[which(may_final$index_now == 0), ]  # all 0's
set.seed(100)  # for repeatability of samples
input_ones_training_rows3 <- sample(1:nrow(input_ones3), 0.7*nrow(input_ones3))  # 1's for training
input_zeros_training_rows3 <- sample(1:nrow(input_zeros3), 0.7*nrow(input_ones3))  # 0's for training. Pick as many 0's as 1's
training_ones3 <- input_ones3[input_ones_training_rows3, ]  
training_zeros3 <- input_zeros3[input_zeros_training_rows3, ]
trainingData3 <- rbind(training_ones3, training_zeros3)  # row bind the 1's and 0's 
# Create Test Data
test_ones3 <- input_ones3[-input_ones_training_rows3, ]
test_zeros3 <- input_zeros3[-input_zeros_training_rows3, ]
testData3 <- rbind(test_ones3, test_zeros3)  # row bind the 1's and 0's 

Now we can check class bias to see if it is more balanced. It is evenly weighted now, with each being represented by 493 observations.

table(trainingData3$index_now)

Logistic Regression Model

Now the model can be run:

# Model 3
logit3 <- glm(index_now ~  volatility + beta + price_to_book + index_before, data=trainingData3, family=binomial(link="logit"))

# Summary of Model 3
summary(logit3)

# Coefficient Interpretation
## Odds ratios (exponentiated coefficients)
exp(coef(logit3))
## Probability scale
(exp(coef(logit3))) / (1+(exp(coef(logit3))))

Looking at the May model will be helpful for someone looking to predict the May rebalance using data observed between December and April.

Interpretation of Model

The model can be interpreted as:

ln[$\frac{p}{1-p}$] = -1.74 - 0.04 x vol - 0.64 x beta - 0.012 x price_to_book + 7.014 x index_before

$\frac{p}{1-p}$ = exp(-1.74 - 0.04 x vol - 0.64 x beta - 0.012 x price_to_book + 7.014 x index_before)

The coefficients can be interpreted as:

- Volatility: the odds of being in the index are multiplied by 0.96 for a one-unit increase in volatility. This predictor is not statistically significant.
- Beta: the odds of being in the index are multiplied by 0.52 for a one-unit increase in beta. This predictor is statistically significant.
- Price to book: the odds of being in the index are multiplied by 0.99 for a one-unit increase in the price-to-book ratio. This predictor is not statistically significant.
- Index before: the odds of being in the index are 1112.01 times greater if the stock was in the index 6 months ago. This predictor is statistically significant.

Sanity Check

Will do later if useful.

Model Quality

To test the quality of the model, several diagnostic tests were run.

Predictive Power

The default cutoff prediction probability is 0.5, or the ratio of 1's to 0's in the training data. Sometimes, however, tuning the probability cutoff can improve accuracy in both the development and validation samples. The InformationValue::optimalCutoff function can find the cutoff that best improves the prediction of 1's, of 0's, or of both, or that minimizes the misclassification error. Here, the optimal cutoff is 0.98.

library(InformationValue)
# Predicted probabilities on the validation sample (also computed below for the misclassification error)
predicted3 <- plogis(predict(logit3, testData3))
optCutOff3 <- optimalCutoff(testData3$index_now, predicted3)[1]

VIF

As in linear regression, we should check for multicollinearity in the model. As seen below, all of the predictors in the model have VIF values well below 4.

library(car)
vif(logit3)

Misclassification Error

Misclassification error is the percentage mismatch of predicted vs. actual values, irrespective of 1's or 0's. The lower the misclassification error, the better the model. Here it is 2.6%, which is quite low, and thus good.

predicted3 <- plogis(predict(logit3, testData3)) 
misClassError(testData3$index_now, predicted3)

ROC

The receiver operating characteristic (ROC) curve traces the percentage of true positives accurately predicted by a given logit model as the prediction probability cutoff is lowered from 1 to 0. For a good model, lowering the cutoff should mark more of the actual 1's as positives and fewer of the actual 0's as 1's, so the curve should rise steeply, indicating that the true positive rate (y-axis) increases faster than the false positive rate (x-axis) as the cutoff decreases. The greater the area under the ROC curve, the better the predictive ability of the model. Here, it is 96.5%.

plotROC(testData3$index_now, predicted3)

Concordance

Ideally, the model-calculated probability scores of all actual positives (the 1's) should be greater than the model-calculated probability scores of all actual negatives (the 0's). Such a model is said to be perfectly concordant and highly reliable. This phenomenon can be measured by concordance and discordance.

In simpler words, of all 1-0 pairs of actuals, concordance is the percentage of pairs in which the score of the actual positive is greater than the score of the actual negative. For a perfect model this would be 100%, so the higher the concordance, the better the quality of the model. With a concordance of 97.3%, this is a good quality model.

Concordance(testData3$index_now, predicted3)

Specificity and Sensitivity

Sensitivity (the true positive rate) is the percentage of actual 1's correctly predicted by the model; in this model it was found to be 92.0%. Specificity is the percentage of actual 0's correctly predicted, and can also be calculated as 1 minus the false positive rate; in this model it was found to be 98.3%.

sensitivity(testData3$index_now, predicted3, threshold = optCutOff3)
specificity(testData3$index_now, predicted3, threshold = optCutOff3)

Confusion Matrix

In the confusion matrix below, the columns are the actual values, while the rows are the predicted values.

confusionMatrix(testData3$index_now, predicted3, threshold = optCutOff3)

Model 4: Total Rebalancing (November & May) Model

Since the index is rebalanced twice a year (once in November and once in May), it makes sense to look at a model for both of these months. Thus, a subset of the data was taken for May and November, by combining the data sets from Model 2 and Model 3.

# Combine the May and November subsets
both_final <- rbind(may_final, november_final)

Data Cleaning - Checking for Class Bias

Ideally, the proportion of stocks in and out of the USMV index should be approximately the same. Checking this, we can see that this is not the case: only around 25% of the observations come from stocks that are currently in the index, so there is a class bias. As a result, we must sample the observations in approximately equal proportions to get a better model.

table(both_final$index_now)

Create Training and Test Samples

One way to address the problem of class bias is to draw the 0's and 1's for trainingData4 (the development sample) in equal proportions, and to put the rest of the input data, not used for training, into testData4 (the validation sample). As a result, the development sample will be smaller than the validation sample, which is acceptable because there is a large number of observations.

# Create Training Data
input_ones4 <- both_final[which(both_final$index_now == 1), ]  # all 1's
input_zeros4 <- both_final[which(both_final$index_now == 0), ]  # all 0's
set.seed(100)  # for repeatability of samples
input_ones_training_rows4 <- sample(1:nrow(input_ones4), 0.7*nrow(input_ones4))  # 1's for training
input_zeros_training_rows4 <- sample(1:nrow(input_zeros4), 0.7*nrow(input_ones4))  # 0's for training. Pick as many 0's as 1's
training_ones4 <- input_ones4[input_ones_training_rows4, ]  
training_zeros4 <- input_zeros4[input_zeros_training_rows4, ]
trainingData4 <- rbind(training_ones4, training_zeros4)  # row bind the 1's and 0's 
# Create Test Data
test_ones4 <- input_ones4[-input_ones_training_rows4, ]
test_zeros4 <- input_zeros4[-input_zeros_training_rows4, ]
testData4 <- rbind(test_ones4, test_zeros4)  # row bind the 1's and 0's 

Now we can check class bias to see if it is more balanced. It is evenly weighted now, with each being represented by 1018 observations.

table(trainingData4$index_now)

Logistic Regression Model

Now the model can be run:

# Model 4
logit4 <- glm(index_now ~  volatility + beta + price_to_book + index_before, data=trainingData4, family=binomial(link="logit"))

# Summary of Model 4
summary(logit4)

# Coefficient Interpretation
## Odds ratios (exponentiated coefficients)
exp(coef(logit4))
## Probability scale
(exp(coef(logit4))) / (1+(exp(coef(logit4))))

Looking at this model will be helpful for someone looking to predict index rebalancing in general, across both rebalancing months.

Interpretation of Model

The model can be interpreted as:

ln[$\frac{p}{1-p}$] = -1.84 + 0.003 x vol - 0.31 x beta - 0.0019 x price_to_book + 5.89 x index_before

$\frac{p}{1-p}$ = exp(-1.84 + 0.003 x vol - 0.31 x beta - 0.0019 x price_to_book + 5.89 x index_before)

The coefficients can be interpreted as:

- Volatility: the odds of being in the index are multiplied by 1.0029 for a one-unit increase in volatility. This predictor is not statistically significant.
- Beta: the odds of being in the index are multiplied by 0.73 for a one-unit increase in beta. This predictor is statistically significant.
- Price to book: the odds of being in the index are multiplied by 0.99 for a one-unit increase in the price-to-book ratio. This predictor is not statistically significant.
- Index before: the odds of being in the index are 360.28 times greater if the stock was in the index 6 months ago. This predictor is statistically significant.

Sanity Check

Will do later if useful.

Model Quality

To test the quality of the model, several diagnostic tests were run.

Predictive Power

The default cutoff prediction probability is 0.5, or the ratio of 1's to 0's in the training data. Sometimes, however, tuning the probability cutoff can improve accuracy in both the development and validation samples. The InformationValue::optimalCutoff function can find the cutoff that best improves the prediction of 1's, of 0's, or of both, or that minimizes the misclassification error. Here, the optimal cutoff is 0.77.

library(InformationValue)
# Predicted probabilities on the validation sample (also computed below for the misclassification error)
predicted4 <- plogis(predict(logit4, testData4))
optCutOff4 <- optimalCutoff(testData4$index_now, predicted4)[1]

VIF

As in linear regression, we should check for multicollinearity in the model. As seen below, all of the predictors in the model have VIF values well below 4.

library(car)
vif(logit4)

Misclassification Error

Misclassification error is the percentage mismatch of predicted vs. actual values, irrespective of 1's or 0's. The lower the misclassification error, the better the model. Here it is 3.2%, which is quite low, and thus good.

predicted4 <- plogis(predict(logit4, testData4)) 
misClassError(testData4$index_now, predicted4)

ROC

The receiver operating characteristic (ROC) curve traces the percentage of true positives accurately predicted by a given logit model as the prediction probability cutoff is lowered from 1 to 0. For a good model, lowering the cutoff should mark more of the actual 1's as positives and fewer of the actual 0's as 1's, so the curve should rise steeply, indicating that the true positive rate (y-axis) increases faster than the false positive rate (x-axis) as the cutoff decreases. The greater the area under the ROC curve, the better the predictive ability of the model. Here, it is 96.2%.

plotROC(testData4$index_now, predicted4)

Concordance

Ideally, the model-calculated probability scores of all actual positives (the 1's) should be greater than the model-calculated probability scores of all actual negatives (the 0's). Such a model is said to be perfectly concordant and highly reliable. This phenomenon can be measured by concordance and discordance.

In simpler words, of all 1-0 pairs of actuals, concordance is the percentage of pairs in which the score of the actual positive is greater than the score of the actual negative. For a perfect model this would be 100%, so the higher the concordance, the better the quality of the model. With a concordance of 97.1%, this is a good quality model.

Concordance(testData4$index_now, predicted4)

Specificity and Sensitivity

Sensitivity (the true positive rate) is the percentage of actual 1's correctly predicted by the model; in this model it was found to be 89.9%. Specificity is the percentage of actual 0's correctly predicted, and can also be calculated as 1 minus the false positive rate; in this model it was found to be 97.8%.

sensitivity(testData4$index_now, predicted4, threshold = optCutOff4)
specificity(testData4$index_now, predicted4, threshold = optCutOff4)

Confusion Matrix

In the confusion matrix below, the columns are the actual values, while the rows are the predicted values.

confusionMatrix(testData4$index_now, predicted4, threshold = optCutOff4)

