Once the final data set was created and cleaned, with a number of response variables including trailing beta, trailing vol, and price to book value, and the associated outcome, which was measured by whether or not a stock was in the Min Vol index or not (1 if in, 0 if not in).
A snippet of the final data set is shown below
head(trainingData) tail(trainingData) summary(trainingData)
Given the nature of the data, a logit regression will be ran. Looking at all of the historical data and stock various characteristics, this would model the log odds of a stock being in the minimum volatility index as a combination of the linear predictors mentioned. Several models will be run in a panel, including one by certain months, and one by the entire pool of data.
The first logit model that will be run is for the entire pool of monthly data.
Ideally, the proportion of stocks in and out of the USMV index should approximately be the same. Checking this, we can see that this is not the case. However, just around 24% of the data is from stocks that ereurrently in the index, so there is a class bias. As a result, we must sample the observations in approximately equal proportions to get better models.
table(monthly_final$index_now)
One way to address the problem of class bias is to draw the 0’s and 1’s for the trainingData (development sample) in equal proportions. In doing so, we will put rest of the inputData not included for training into testData (validation sample). As a result, the size of development sample will be smaller that validation, which is okay, because, there are large number of observations.
# Create Training Data input_ones <- monthly_final[which(monthly_final$index_now == 1), ] # all 1's input_zeros <- monthly_final[which(monthly_final$index_now == 0), ] # all 0's set.seed(100) # for repeatability of samples input_ones_training_rows <- sample(1:nrow(input_ones), 0.7*nrow(input_ones)) # 1's for training input_zeros_training_rows <- sample(1:nrow(input_zeros), 0.7*nrow(input_ones)) # 0's for training. Pick as many 0's as 1's training_ones <- input_ones[input_ones_training_rows, ] training_zeros <- input_zeros[input_zeros_training_rows, ] trainingData <- rbind(training_ones, training_zeros) # row bind the 1's and 0's # Create Test Data test_ones <- input_ones[-input_ones_training_rows, ] test_zeros <- input_zeros[-input_zeros_training_rows, ] testData <- rbind(test_ones, test_zeros) # row bind the 1's and 0's # Remove NA values in index_before testData <- subset(testData, !is.na(index_before)) trainingData <- subset(trainingData, !is.na(index_before))
Now we can check class bias to see if it is more balanced. It is very close to being evenly weighted now.
table(trainingData$index_now)
Now the model can be run:
# Model 1 logit1 <- glm(index_now ~ volatility + beta + price_to_book + index_before, data=trainingData, family=binomial(link="logit")) # Summary of Model 1 summary(logit1) # Coefficient Interpretation ## Log Odds exp(coef(logit1)) ## Probability (exp(coef(logit1))) / (1+(exp(coef(logit1))))
Looking at the monthly data is not a true representation of the results, because the index is rebalanced once every six months - not once a month.
The model can be interpreted as: \hfill\break ln[$\frac{p}{1-p}$] = -1.86 - 0.0044 x vol - 0.31 x beta + 0.00039 x price_to_book + 6.16 x index_before \hfill\break $\frac{p}{1-p}$ = exp(-1.86 - 0.0044 x vol - 0.31 x beta + 0.00039 x price_to_book + 6.16 x index_before ) \hfill\break The coefficients can be interpreted as: \hfill\break - Volatility: The odds ratio of being added to the index is 0.996 times smaller, given a one unit increase in volatility. This response variable is not statistically significant. \hfill\break - Beta: The odds ratio of being added to the index is 0.731 times smaller, given a one unit increase in beta. This response variable is statistically significant. \hfill\break - Price to Book: The odds ratio of being added to the index is 1.0051 times greater, given a one unit increase in price to book ratio. This response variable is not statistically significant. \hfill\break - Index before: The odds ratio of being added to the index is 410.261 times greater if the stock was in the index 6 months ago. This response variable is statistically significant. \hfill\break \hfill\break
To take a sample stock to understand the model, we can look at a stock that was not in the USMV index on 12-30-2016, as see how accurate our model would be in predicting the probability of this stock being in the index. We can take AAL (American Airlines), which had a beta of 1.6312867, volatility of 0.8067945, price to book ratio of 4.6943413, and was not in the USMV index 6 months ago. This stock ended up not being in the minimum volatility index on 12-30-2016, so we would expect a probability to be relatively low. \hfill\break - Odds Ratio: \hfill\break $\frac{p}{1-p}$ = exp(-3.094 - 0.0032 x 0.8067945 - 0.25 x 1.6312867 + 0.00051 x 4.6943413 + 6.017 x 0) \hfill\break $\frac{p}{1-p}$ = 0.03013677 \hfill\break \hfill\break - Probability: \hfill\break p = (exp(-3.094 - 0.0032 x 0.8067945 - 0.25 x 1.6312867 + 0.00051 x 4.6943413 + 6.017 x 0) / (1+exp(-3.094 - 0.0032 x 0.8067945 - 0.25 x 1.6312867 + 0.00051 x 4.6943413 + 6.017 x 0))) \hfill\break p = 0.02925511 \hfill\break \hfill\break The odds of AAL being in the index on 12-30-2016 is 0.03013677, and this translates to a probability of 2.93%. As expected, already knowing that the stock was not in the index, this low probability seems reasonable. \hfill\break To further understand the model, we can look at a stock that was was in the USMV index on 12-30-2016, as see how accurate our model would be in predicting the probability of this stock being in the index. We can take AAPL (Apple), which had a beta of 1.0099644, volatility of 0.6118842, price to book ratio of 4.7037726, and it was in the USMV index 6 months ago. This stock ended up being in the minimum volatility index on 12-30-2016, so we would expect a probability to be relatively high \hfill\break - Odds Ratio: \hfill\break $\frac{p}{1-p}$ = exp(-3.094 - 0.0032 x 0.6118842 - 0.25 x 1.0099644 + 0.00051 x 4.7037726 + 6.017 x 1) \hfill\break $\frac{p}{1-p}$ = 14.45369 \hfill\break - Probability: \hfill\break p = (exp(-3.094 - 0.0032 x 0.6118842 - 0.25 x 1.0099644 + 0.00051 x 4.7037726 + 6.017 x 1) / (1+exp(-3.094 - 0.0032 x 0.6118842 - 0.25 x 1.0099644 + 0.00051 x 4.7037726 + 6.017 x 1))) \hfill\break p = 0.9352905 \hfill\break \hfill\break The odds of AAL being in the index on 12-30-2016 is 14.45369, and this translates to a probability of 93.53%. As expected, already knowing that the stock was in the index, this high probability seems reasonable.
To test the quality of the model, several tests were done: \hfill\break Predictive Power \hfill\break The default cutoff prediction probability score is 0.5 or the ratio of 1’s and 0’s in the training data. But sometimes, tuning the probability cutoff can improve the accuracy in both the development and validation samples. The InformationValue::optimalCutoff function provides ways to find the optimal cutoff to improve the prediction of 1’s, 0’s, both 1’s and 0’s and to reduce the misclassification error. Here, the optimal cut off is 0.74.
library(InformationValue) optCutOff <- optimalCutoff(testData$index_now, predicted)[1]
\hfill\break VIF**_ \hfill\break Like in case of linear regression, we should check for multicollinearity in the model. As seen below, all X variables in the model have VIF well below 4.
library(car) vif(logit1)
\hfill\break Misclassification Error \hfill\break Misclassification error is the percentage mismatch of predicted vs actuals, irrespective of 1’s or 0’s. The lower the misclassification error, the better the model. Here it is 3.1%, which is quite low, and thus good.
predicted <- plogis(predict(logit1, testData)) misClassError(testData$index_now, predicted)
\hfill\break ROC \hfill\break Receiver Operating Characteristics Curve traces the percentage of true positives accurately predicted by a given logit model as the prediction probability cutoff is lowered from 1 to 0. For a good model, as the cutoff is lowered, it should mark more of actual 1’s as positives and lesser of actual 0’s as 1’s. So for a good model, the curve should rise steeply, indicating that the TPR (Y-Axis) increases faster than the FPR (X-Axis) as the cutoff score decreases. Greater the area under the ROC curve, better the predictive ability of the model. Here, it is 96.3%.
plotROC(testData$index_now, predicted)
\hfill\break Concordance \hfill\break Ideally, the model-calculated-probability-scores of all actual Positive’s, (aka Ones) should be greater than the model-calculated-probability-scores of ALL the Negatives (aka Zeroes). Such a model is said to be perfectly concordant and a highly reliable one. This phenomenon can be measured by Concordance and Discordance.
In simpler words, of all combinations of 1-0 pairs (actuals), Concordance is the percentage of pairs, whose scores of actual positive’s are greater than the scores of actual negative’s. For a perfect model, this will be 100%. So, the higher the concordance, the better is the quality of model. This model with a concordance of 97.2% is a good quality model.
Concordance(testData$index_now, predicted)
\hfill\break Specificity and Sensitivity \hfill\break - Sensitivity (or True Positive Rate) is the percentage of 1’s (actuals) correctly predicted by the model, while, specificity is the percentage of 0’s (actuals) correctly predicted. In this model, it was found to be 89.6%. \hfill\break - Specificity can also be calculated as 1 - False Positive Rate. In this model, it was found to be 97.9%.
sensitivity(testData$index_now, predicted, threshold = optCutOff) specificity(testData$index_now, predicted, threshold = optCutOff)
\hfill\break Confusion Matrix \hfill\break In the confusion matrix, the columns are actuals, while rows are predicteds
confusionMatrix(testData$index_now, predicted, threshold = optCutOff)
Since the index is rebalanced twice a year (once in November and once in May), it makes sense to look at a model for each of these individual months. Thus, a subset of the data was taken for November, and the same procedures done at with Model 1.
# Subset data for dates from November only november_final <- filter(monthly_final, date == "2011-11-30" | date == "2012-11-30"| date == "2013-11-29"| date == "2014-11-28" | date == "2015-11-30" | date == "2016-11-30") # Remove NA values from set november_final <- subset(november_final, !is.na(index_before))
Ideally, the proportion of stocks in and out of the USMV index should approximately be the same. Checking this, we can see that this is not the case. However, just around 26% of the data is from stocks that ereurrently in the index, so there is a class bias. As a result, we must sample the observations in approximately equal proportions to get a better model.
table(november_final$index_now)
One way to address the problem of class bias is to draw the 0’s and 1’s for the trainingData (development sample) in equal proportions. In doing so, we will put rest of the inputData not included for training into testData (validation sample). As a result, the size of development sample will be smaller that validation, which is okay, because, there are large number of observations.
# Create Training Data input_ones2 <- november_final[which(november_final$index_now == 1), ] # all 1's input_zeros2 <- november_final[which(november_final$index_now == 0), ] # all 0's set.seed(100) # for repeatability of samples input_ones_training_rows2 <- sample(1:nrow(input_ones2), 0.7*nrow(input_ones2)) # 1's for training input_zeros_training_rows2 <- sample(1:nrow(input_zeros2), 0.7*nrow(input_ones2)) # 0's for training. Pick as many 0's as 1's training_ones2 <- input_ones2[input_ones_training_rows2, ] training_zeros2 <- input_zeros2[input_zeros_training_rows2, ] trainingData2 <- rbind(training_ones2, training_zeros2) # row bind the 1's and 0's # Create Test Data test_ones2 <- input_ones2[-input_ones_training_rows2, ] test_zeros2 <- input_zeros2[-input_zeros_training_rows2, ] testData2 <- rbind(test_ones2, test_zeros2) # row bind the 1's and 0's
Now we can check class bias to see if it is more balanced. It is evenly weighted now, with each being represented by 525 observations.
table(trainingData2$index_now)
Now the model can be run:
# Model 2 logit2 <- glm(index_now ~ volatility + beta + price_to_book + index_before, data=trainingData2, family=binomial(link="logit")) # Summary of Model 2 summary(logit2) # Coefficient Interpretation ## Log Odds exp(coef(logit2)) ## Probability (exp(coef(logit2))) / (1+(exp(coef(logit2))))
Looking at the November model will be helpful for someone looking to predict index rebalancing between June and October.
The model can be interpreted as: \hfill\break ln[$\frac{p}{1-p}$] = -1.46 + 0.061 x vol - 0.49 x beta - 0.00013 x price_to_book + 5.08 x index_before \hfill\break $\frac{p}{1-p}$ = exp(-1.46 + 0.061 x vol - 0.49 x beta - 0.00013 x price_to_book + 5.08 x index_before) \hfill\break The coefficients can be interpreted as: \hfill\break - Volatility: The odds ratio of being added to the index is 1.063 times greater, given a one unit increase in volatility. This response variable is statistically significant, at an alpha level of 0.1. \hfill\break - Beta: The odds ratio of being added to the index is 0.61 times smaller, given a one unit increase in beta. This response variable is statistically significant. \hfill\break - Price to Book: The odds ratio of being added to the index is 0.99 times smaller, given a one unit increase in price to book ratio. This response variable is not statistically significant. \hfill\break - Index before: The odds ratio of being added to the index is 160.88 times greater if the stock was in the index 6 months ago. This response variable is statistically significant. \hfill\break \hfill\break
Will do later, if useful.
To test the quality of the model, several tests were done: \hfill\break Predictive Power \hfill\break The default cutoff prediction probability score is 0.5 or the ratio of 1’s and 0’s in the training data. But sometimes, tuning the probability cutoff can improve the accuracy in both the development and validation samples. The InformationValue::optimalCutoff function provides ways to find the optimal cutoff to improve the prediction of 1’s, 0’s, both 1’s and 0’s and to reduce the misclassification error. Here, the optimal cut off is 0.95.
library(InformationValue) optCutOff2 <- optimalCutoff(testData2$index_now, predicted2)[1]
\hfill\break VIF**_ \hfill\break Like in case of linear regression, we should check for multicollinearity in the model. As seen below, all X variables in the model have VIF well below 4.
library(car) vif(logit2)
\hfill\break Misclassification Error \hfill\break Misclassification error is the percentage mismatch of predicted vs actuals, irrespective of 1’s or 0’s. The lower the misclassification error, the better the model. Here it is 4.4%, which is quite low, and good.
predicted2 <- plogis(predict(logit2, testData2)) misClassError(testData2$index_now, predicted2)
\hfill\break ROC \hfill\break Receiver Operating Characteristics Curve traces the percentage of true positives accurately predicted by a given logit model as the prediction probability cutoff is lowered from 1 to 0. For a good model, as the cutoff is lowered, it should mark more of actual 1’s as positives and lesser of actual 0’s as 1’s. So for a good model, the curve should rise steeply, indicating that the TPR (Y-Axis) increases faster than the FPR (X-Axis) as the cutoff score decreases. Greater the area under the ROC curve, better the predictive ability of the model. Here, it is 95.2%.
plotROC(testData2$index_now, predicted2)
\hfill\break Concordance \hfill\break Ideally, the model-calculated-probability-scores of all actual Positive’s, (aka Ones) should be greater than the model-calculated-probability-scores of ALL the Negatives (aka Zeroes). Such a model is said to be perfectly concordant and a highly reliable one. This phenomenon can be measured by Concordance and Discordance.
In simpler words, of all combinations of 1-0 pairs (actuals), Concordance is the percentage of pairs, whose scores of actual positive’s are greater than the scores of actual negative’s. For a perfect model, this will be 100%. So, the higher the concordance, the better is the quality of model. This model with a concordance of 95.5% is a good quality model.
Concordance(testData2$index_now, predicted2)
\hfill\break Specificity and Sensitivity \hfill\break - Sensitivity (or True Positive Rate) is the percentage of 1’s (actuals) correctly predicted by the model, while, specificity is the percentage of 0’s (actuals) correctly predicted. In this model, it was found to be 85.8%. \hfill\break - Specificity can also be calculated as 1 - False Positive Rate. In this model, it was found to be 97.3%.
sensitivity(testData2$index_now, predicted2, threshold = optCutOff2) specificity(testData2$index_now, predicted2, threshold = optCutOff2)
\hfill\break Confusion Matrix \hfill\break In the confusion matrix, the columns are actuals, while rows are predicteds
confusionMatrix(testData2$index_now, predicted2, threshold = optCutOff2)
Since the index is rebalanced twice a year (once in November and once in May), it makes sense to look at a model for each of these individual months. Thus, a subset of the data was taken for May, and the same procedures done at with Model 1.
# Subset data for dates from May only may_final <- filter(monthly_final, date == "2012-05-31" | date == "2013-05-31"| date == "2014-05-30" | date == "2015-05-29" | date == "2016-05-31") # Remove NA values from set may_final <- subset(may_final, !is.na(index_before))
Ideally, the proportion of stocks in and out of the USMV index should approximately be the same. Checking this, we can see that this is not the case. However, just around 24% of the data is from stocks that ereurrently in the index, so there is a class bias. As a result, we must sample the observations in approximately equal proportions to get a better model.
table(may_final$index_now)
One way to address the problem of class bias is to draw the 0’s and 1’s for the trainingData (development sample) in equal proportions. In doing so, we will put rest of the inputData not included for training into testData (validation sample). As a result, the size of development sample will be smaller that validation, which is okay, because, there are large number of observations.
# Create Training Data input_ones3 <- may_final[which(may_final$index_now == 1), ] # all 1's input_zeros3 <- may_final[which(may_final$index_now == 0), ] # all 0's set.seed(100) # for repeatability of samples input_ones_training_rows3 <- sample(1:nrow(input_ones3), 0.7*nrow(input_ones3)) # 1's for training input_zeros_training_rows3 <- sample(1:nrow(input_zeros3), 0.7*nrow(input_ones3)) # 0's for training. Pick as many 0's as 1's training_ones3 <- input_ones3[input_ones_training_rows3, ] training_zeros3 <- input_zeros3[input_zeros_training_rows3, ] trainingData3 <- rbind(training_ones3, training_zeros3) # row bind the 1's and 0's # Create Test Data test_ones3 <- input_ones3[-input_ones_training_rows3, ] test_zeros3 <- input_zeros3[-input_zeros_training_rows3, ] testData3 <- rbind(test_ones3, test_zeros3) # row bind the 1's and 0's
Now we can check class bias to see if it is more balanced. It is evenly weighted now, with each being represented by 493 observations.
table(trainingData3$index_now)
Now the model can be run:
# Model 3 logit3 <- glm(index_now ~ volatility + beta + price_to_book + index_before, data=trainingData3, family=binomial(link="logit")) # Summary of Model 3 summary(logit3) # Coefficient Interpretation ## Log Odds exp(coef(logit3)) ## Probability (exp(coef(logit3))) / (1+(exp(coef(logit3))))
Looking at the May model will be helpful for someone looking to predict index rebalancing between December and April.
The model can be interpreted as: \hfill\break ln[$\frac{p}{1-p}$] = -1.74 - 0.04 x vol - 0.64 x beta - 0.012 x price_to_book + 7.014 x index_before \hfill\break $\frac{p}{1-p}$ = exp(-1.74 - 0.04 x vol - 0.64 x beta - 0.012 x price_to_book + 7.014 x index_before ) \hfill\break The coefficients can be interpreted as: \hfill\break - Volatility: The odds ratio of being added to the index is 0.96 times smaller, given a one unit increase in volatility. This response variable is not statistically significant. \hfill\break - Beta: The odds ratio of being added to the index is 0.52 times smaller, given a one unit increase in beta. This response variable is statistically significant. \hfill\break - Price to Book: The odds ratio of being added to the index is 0.99 times smaller, given a one unit increase in price to book ratio. This response variable is not statistically significant. \hfill\break - Index before: The odds ratio of being added to the index is 1112.01 times greater if the stock was in the index 6 months ago. This response variable is statistically significant. \hfill\break \hfill\break
Will do later if useful.
To test the quality of the model, several tests were done: \hfill\break Predictive Power \hfill\break The default cutoff prediction probability score is 0.5 or the ratio of 1’s and 0’s in the training data. But sometimes, tuning the probability cutoff can improve the accuracy in both the development and validation samples. The InformationValue::optimalCutoff function provides ways to find the optimal cutoff to improve the prediction of 1’s, 0’s, both 1’s and 0’s and to reduce the misclassification error. Here, the optimal cut off is 0.98.
library(InformationValue) optCutOff3 <- optimalCutoff(testData3$index_now, predicted3)[1]
\hfill\break VIF**_ \hfill\break Like in case of linear regression, we should check for multicollinearity in the model. As seen below, all X variables in the model have VIF well below 4.
library(car) vif(logit3)
\hfill\break Misclassification Error \hfill\break Misclassification error is the percentage mismatch of predicted vs actuals, irrespective of 1’s or 0’s. The lower the misclassification error, the better the model. Here it is 2.6%, which is quite low, and good.
predicted3 <- plogis(predict(logit3, testData3)) misClassError(testData3$index_now, predicted3)
\hfill\break ROC \hfill\break Receiver Operating Characteristics Curve traces the percentage of true positives accurately predicted by a given logit model as the prediction probability cutoff is lowered from 1 to 0. For a good model, as the cutoff is lowered, it should mark more of actual 1’s as positives and lesser of actual 0’s as 1’s. So for a good model, the curve should rise steeply, indicating that the TPR (Y-Axis) increases faster than the FPR (X-Axis) as the cutoff score decreases. Greater the area under the ROC curve, better the predictive ability of the model. Here, it is 96.5%.
plotROC(testData3$index_now, predicted3)
\hfill\break Concordance \hfill\break Ideally, the model-calculated-probability-scores of all actual Positive’s, (aka Ones) should be greater than the model-calculated-probability-scores of ALL the Negatives (aka Zeroes). Such a model is said to be perfectly concordant and a highly reliable one. This phenomenon can be measured by Concordance and Discordance.
In simpler words, of all combinations of 1-0 pairs (actuals), Concordance is the percentage of pairs, whose scores of actual positive’s are greater than the scores of actual negative’s. For a perfect model, this will be 100%. So, the higher the concordance, the better is the quality of model. This model with a concordance of 97.3% is a good quality model.
Concordance(testData3$index_now, predicted3)
\hfill\break Specificity and Sensitivity \hfill\break - Sensitivity (or True Positive Rate) is the percentage of 1’s (actuals) correctly predicted by the model, while, specificity is the percentage of 0’s (actuals) correctly predicted. In this model, it was found to be 92.0%. \hfill\break - Specificity can also be calculated as 1 - False Positive Rate. In this model, it was found to be 98.3%.
sensitivity(testData3$index_now, predicted3, threshold = optCutOff3) specificity(testData3$index_now, predicted3, threshold = optCutOff3)
\hfill\break Confusion Matrix \hfill\break In the confusion matrix, the columns are actuals, while rows are predicteds
confusionMatrix(testData3$index_now, predicted3, threshold = optCutOff3)
Since the index is rebalanced twice a year (once in November and once in May), it makes sense to look at a model for both of these months. Thus, a subset of the data was taken for May and November, by combining the data sets from Model 2 and Model 3.
# Subset data for dates from May only both_final <- rbind(may_final, november_final)
Ideally, the proportion of stocks in and out of the USMV index should approximately be the same. Checking this, we can see that this is not the case. However, just around 25% of the data is from stocks that ereurrently in the index, so there is a class bias. As a result, we must sample the observations in approximately equal proportions to get a better model.
table(both_final$index_now)
One way to address the problem of class bias is to draw the 0’s and 1’s for the trainingData (development sample) in equal proportions. In doing so, we will put rest of the inputData not included for training into testData (validation sample). As a result, the size of development sample will be smaller that validation, which is okay, because, there are large number of observations.
# Create Training Data input_ones4 <- both_final[which(both_final$index_now == 1), ] # all 1's input_zeros4 <- both_final[which(both_final$index_now == 0), ] # all 0's set.seed(100) # for repeatability of samples input_ones_training_rows4 <- sample(1:nrow(input_ones4), 0.7*nrow(input_ones4)) # 1's for training input_zeros_training_rows4 <- sample(1:nrow(input_zeros4), 0.7*nrow(input_ones4)) # 0's for training. Pick as many 0's as 1's training_ones4 <- input_ones4[input_ones_training_rows4, ] training_zeros4 <- input_zeros4[input_zeros_training_rows4, ] trainingData4 <- rbind(training_ones4, training_zeros4) # row bind the 1's and 0's # Create Test Data test_ones4 <- input_ones4[-input_ones_training_rows4, ] test_zeros4 <- input_zeros4[-input_zeros_training_rows4, ] testData4 <- rbind(test_ones4, test_zeros4) # row bind the 1's and 0's
Now we can check class bias to see if it is more balanced. It is evenly weighted now, with each being represented by 1018 observations.
table(trainingData4$index_now)
Now the model can be run:
# Model 4 logit4 <- glm(index_now ~ volatility + beta + price_to_book + index_before, data=trainingData4, family=binomial(link="logit")) # Summary of Model 3 summary(logit4) # Coefficient Interpretation ## Log Odds exp(coef(logit4)) ## Probability (exp(coef(logit4))) / (1+(exp(coef(logit4))))
Looking at this model will be helpful for someone looking to predict index rebalancing, generally, for both months.
The model can be interpreted as: \hfill\break ln[$\frac{p}{1-p}$] = -1.84 + 0.003 x vol - 0.31 x beta - 0.0019 x price_to_book + 5.89 x index_before \hfill\break $\frac{p}{1-p}$ = exp(-1.84 + 0.003 x vol - 0.31 x beta - 0.0019 x price_to_book + 5.89 x index_before) \hfill\break The coefficients can be interpreted as: \hfill\break - Volatility: The odds ratio of being added to the index is 1.0029 times greater, given a one unit increase in volatility. This response variable is not statistically significant. \hfill\break - Beta: The odds ratio of being added to the index is 0.73 times smaller, given a one unit increase in beta. This response variable is statistically significant. \hfill\break - Price to Book: The odds ratio of being added to the index is 0.99 times smaller, given a one unit increase in price to book ratio. This response variable is not statistically significant. \hfill\break - Index before: The odds ratio of being added to the index is 360.28 times greater if the stock was in the index 6 months ago. This response variable is statistically significant. \hfill\break \hfill\break
Will do later if useful.
To test the quality of the model, several tests were done: \hfill\break Predictive Power \hfill\break The default cutoff prediction probability score is 0.5 or the ratio of 1’s and 0’s in the training data. But sometimes, tuning the probability cutoff can improve the accuracy in both the development and validation samples. The InformationValue::optimalCutoff function provides ways to find the optimal cutoff to improve the prediction of 1’s, 0’s, both 1’s and 0’s and to reduce the misclassification error. Here, the optimal cut off is 0.77.
library(InformationValue) optCutOff4 <- optimalCutoff(testData4$index_now, predicted4)[1]
\hfill\break VIF**_ \hfill\break Like in case of linear regression, we should check for multicollinearity in the model. As seen below, all X variables in the model have VIF well below 4.
library(car) vif(logit4)
\hfill\break Misclassification Error \hfill\break Misclassification error is the percentage mismatch of predicted vs actuals, irrespective of 1’s or 0’s. The lower the misclassification error, the better the model. Here it is 3.2%, which is quite low, and good.
predicted4 <- plogis(predict(logit4, testData4)) misClassError(testData4$index_now, predicted4)
\hfill\break ROC \hfill\break Receiver Operating Characteristics Curve traces the percentage of true positives accurately predicted by a given logit model as the prediction probability cutoff is lowered from 1 to 0. For a good model, as the cutoff is lowered, it should mark more of actual 1’s as positives and lesser of actual 0’s as 1’s. So for a good model, the curve should rise steeply, indicating that the TPR (Y-Axis) increases faster than the FPR (X-Axis) as the cutoff score decreases. Greater the area under the ROC curve, better the predictive ability of the model. Here, it is 96.2%.
plotROC(testData4$index_now, predicted4)
\hfill\break Concordance \hfill\break Ideally, the model-calculated-probability-scores of all actual Positive’s, (aka Ones) should be greater than the model-calculated-probability-scores of ALL the Negatives (aka Zeroes). Such a model is said to be perfectly concordant and a highly reliable one. This phenomenon can be measured by Concordance and Discordance.
In simpler words, of all combinations of 1-0 pairs (actuals), Concordance is the percentage of pairs, whose scores of actual positive’s are greater than the scores of actual negative’s. For a perfect model, this will be 100%. So, the higher the concordance, the better is the quality of model. This model with a concordance of 97.1% is a good quality model.
Concordance(testData4$index_now, predicted4)
\hfill\break Specificity and Sensitivity \hfill\break - Sensitivity (or True Positive Rate) is the percentage of 1’s (actuals) correctly predicted by the model, while, specificity is the percentage of 0’s (actuals) correctly predicted. In this model, it was found to be 89.9%. \hfill\break - Specificity can also be calculated as 1 - False Positive Rate. In this model, it was found to be 97.8%.
sensitivity(testData4$index_now, predicted4, threshold = optCutOff4) specificity(testData4$index_now, predicted4, threshold = optCutOff4)
\hfill\break Confusion Matrix \hfill\break In the confusion matrix, the columns are actuals, while rows are predicteds
confusionMatrix(testData4$index_now, predicted4, threshold = optCutOff4)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.