Heuristica

Also available on CRAN: https://cran.r-project.org/web/packages/heuristica/index.html

The heuristica R package implements heuristic decision models, such as Take The Best (TTB) and a unit-weighted linear model. The models are designed for two-alternative choice tasks, such as deciding which of two schools has a higher drop-out rate. The package also wraps better-known models, like linear regression and logistic regression, into the two-alternative choice framework so that all these models can be assessed side by side. It provides functions to measure accuracy, such as an overall percentCorrect and, for advanced users, some confusion matrix functions. These measures can be applied in-sample or out-of-sample.

The goal is to make it easy to explore the range of conditions in which simple heuristics are better than more complex models. Optimizing is not always better!

The Task

This package is focused on two-alternative choice tasks, e.g. given two schools, which has a higher drop-out rate? The output is categorical (which of the two is higher), not quantitative (the drop-out rate itself).

A Simple Example

Here is a subset of data on Chicago public high school drop-out rates. The criterion to predict is the Dropout_Rate, which is in column 2.

schools <- data.frame(
  Name = c("Bowen", "Collins", "Fenger", "Juarez", "Young"),
  Dropout_Rate = c(25.5, 11.8, 28.7, 21.6, 4.5),
  Low_Income_Students = c(82.5, 88.8, 63.2, 84.5, 30.3),
  Limited_English_Students = c(11.4, 0.1, 0, 28.3, 0.1)
)
schools
#>      Name Dropout_Rate Low_Income_Students Limited_English_Students
#> 1   Bowen         25.5                82.5                     11.4
#> 2 Collins         11.8                88.8                      0.1
#> 3  Fenger         28.7                63.2                      0.0
#> 4  Juarez         21.6                84.5                     28.3
#> 5   Young          4.5                30.3                      0.1

Fitting

To fit a model, we give it the data set and the columns to use. In this case, the 2nd column, Dropout_Rate, is the criterion to be predicted. The cues are the following columns, percent of Low_Income_Students and percent of Limited_English_Students. They are at indexes 3 and 4.
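Hard-coded column positions are easy to get wrong if the data frame changes. One option, sketched here in base R with the schools data from above, is to look the positions up by column name. The variable name cue_cols is only illustrative and is not part of heuristica.

```r
schools <- data.frame(
  Name = c("Bowen", "Collins", "Fenger", "Juarez", "Young"),
  Dropout_Rate = c(25.5, 11.8, 28.7, 21.6, 4.5),
  Low_Income_Students = c(82.5, 88.8, 63.2, 84.5, 30.3),
  Limited_English_Students = c(11.4, 0.1, 0, 28.3, 0.1)
)

# Look up column positions by name instead of typing the numbers.
criterion_col <- match("Dropout_Rate", names(schools))
cue_cols <- match(c("Low_Income_Students", "Limited_English_Students"),
                  names(schools))

criterion_col  # 2
cue_cols       # 3 4
```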

Let's fit two models: ttbModel, Take The Best, which uses the highest-validity cue that discriminates (more details below); and regModel, a version of R's "lm" function for linear regression wrapped to fit into heuristica's interface.

library(heuristica)
criterion_col <- 2
ttb <- ttbModel(schools, criterion_col, c(3:4))
reg <- regModel(schools, criterion_col, c(3:4))

What do the fits look like? We can examine Take The Best's cue validities and the regression coefficients.

ttb$cue_validities
coef(reg)

Both Take The Best and regression give a higher weight to Low_Income_Students than Limited_English_Students, although how they use the weights differs. Take The Best uses a lexicographic order: it bases its prediction solely on Low_Income_Students as long as the two schools have differing values on that cue, which is true of every pair among the 5 schools in this data set. That means it will ignore Limited_English_Students when predicting on this data set. In contrast, regression uses a weighted sum of both cues, with the more important cue weighted more heavily.
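The lexicographic rule can be illustrated without the package. The base-R sketch below is not heuristica's implementation; cue_validity and ttb_sketch are hypothetical helper names. It assumes that a cue's validity is the share of discriminating row pairs in which the higher cue value goes with the higher criterion, and that a cue whose validity falls below 0.5 is used in the reversed direction. Both assumptions were chosen so that the sketch agrees with the predictions described in this README.

```r
schools <- data.frame(
  Name = c("Bowen", "Collins", "Fenger", "Juarez", "Young"),
  Dropout_Rate = c(25.5, 11.8, 28.7, 21.6, 4.5),
  Low_Income_Students = c(82.5, 88.8, 63.2, 84.5, 30.3),
  Limited_English_Students = c(11.4, 0.1, 0, 28.3, 0.1)
)

# Hypothetical helper: validity of a single cue over all row pairs.
# Pairs where the cue values are equal do not count (no discrimination).
cue_validity <- function(cue, criterion) {
  pairs <- combn(length(criterion), 2)
  crit_sign <- sign(criterion[pairs[1, ]] - criterion[pairs[2, ]])
  cue_sign <- sign(cue[pairs[1, ]] - cue[pairs[2, ]])
  discriminates <- cue_sign != 0
  sum(cue_sign[discriminates] == crit_sign[discriminates]) / sum(discriminates)
}

raw <- sapply(schools[, 3:4], cue_validity, criterion = schools$Dropout_Rate)
raw  # 0.4 for Low_Income_Students, 4/9 for Limited_English_Students

# Assumption: cues below 0.5 are reversed, so validity becomes 1 - raw.
direction <- ifelse(raw < 0.5, -1, 1)
validity <- pmax(raw, 1 - raw)
cue_order <- order(validity, decreasing = TRUE)

# Hypothetical helper: decide with the first cue (in validity order)
# whose values differ; return 0 if no cue discriminates.
ttb_sketch <- function(row1, row2) {
  for (k in cue_order) {
    d <- sign(row1[[k]] - row2[[k]]) * direction[[k]]
    if (d != 0) return(d)
  }
  0
}

cues <- as.matrix(schools[, 3:4])
rownames(cues) <- schools$Name
ttb_sketch(cues["Bowen", ], cues["Collins", ])  # 1
ttb_sketch(cues["Bowen", ], cues["Fenger", ])   # -1
ttb_sketch(cues["Collins", ], cues["Bowen", ])  # -1 (reversed pair)
```

In this data Low_Income_Students ends up first in the order and separates every pair of schools, so the second cue is never consulted.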

Predicting the fitted data

To see a model's predictions, we use the predictPair function. It takes two rows of data, which together comprise a "row pair", and the fitted model. predictPair outputs one of three values: 1 if it predicts the first row has the higher criterion value, -1 if it predicts the second row does, and 0 if it does not favor either row.

In Bowen vs. Collins, it outputs 1, meaning it predicts Bowen has a higher dropout rate. In Bowen vs. Fenger, it outputs -1, meaning it predicts Fenger has a higher dropout rate.

predictPair(subset(schools, Name=="Bowen"), subset(schools, Name=="Collins"), ttb)
predictPair(subset(schools, Name=="Bowen"), subset(schools, Name=="Fenger"), ttb)

Note that the output depends on the order of the rows. In the reversed pair of Collins vs. Bowen, the output is -1. This is consistent because it still picks Bowen, regardless of order.

predictPair(subset(schools, Name=="Collins"), subset(schools, Name=="Bowen"), ttb)

All rows

It is tedious to predict one row pair at a time, so let's use heuristica's predictPairSummary function instead. We simply pass it the data and the heuristics whose predictions we are interested in. It produces a matrix with one row per row pair, which in this case is 10 rows (5 * 4 / 2).
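The count of 10 is the number of unordered pairs that can be drawn from 5 rows. As a quick check in base R, combn can enumerate them. The pair ordering shown here is simply what combn produces and is only assumed to resemble the package's ordering; pair_index and pair_names are illustrative names.

```r
school_names <- c("Bowen", "Collins", "Fenger", "Juarez", "Young")

# Each row of pair_index holds the two row numbers of one pair.
pair_index <- t(combn(length(school_names), 2))
colnames(pair_index) <- c("Row1", "Row2")

nrow(pair_index)  # 10, the same as choose(5, 2)

# Replace row numbers with school names for easier reading.
pair_names <- data.frame(
  Row1 = school_names[pair_index[, "Row1"]],
  Row2 = school_names[pair_index[, "Row2"]]
)
head(pair_names, 3)
```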

out <- predictPairSummary(schools, ttb, reg)
# See the first row: It has row indexes.
out[1,]
# Convert indexes to school names for easier interpretation
out_df <- data.frame(out)
out_df$Row1 <- schools$Name[out_df$Row1]
out_df$Row2 <- schools$Name[out_df$Row2]
out_df

The first row shows the Bowen vs. Collins example we considered above. CorrectGreater is 1, meaning Bowen really does have the higher drop-out rate, so TTB's prediction of 1 is correct. Regression predicted -1 for this row pair, which is incorrect.

predictPairSummary is for beginners. heuristica offers full flexibility in output with the rowPairApply function. After passing it the data, you can pass it any number of generators to make the columns you want. Some examples are below, where we print only the first row.

# Same as predictPairSummary.
out_same <- rowPairApply(schools, rowIndexes(), correctGreater(criterion_col), heuristics(ttb, reg))
out_same[1,]

# Show first the heuristic predictions, then CorrectGreater.  No row indexes.
out_simple <- rowPairApply(schools, heuristics(ttb, reg), correctGreater(criterion_col))
out_simple[1,]

Assessing Overall Performance

For an overall measure of performance, we can measure the percent of correct inferences for all pairs of schools in the data with percentCorrect, namely the number of correct predictions divided by the total number of predictions. We give the function the data to be predicted (in this case the same as what was fit) and the fitted models to assess.

percentCorrect(schools, ttb, reg)

Take The Best got 60% correct and regression got 50% correct, which is the same as chance.
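The 60% figure for Take The Best can be reproduced by hand. The base-R sketch below does not call heuristica. It assumes, consistent with the predictions described earlier, that the deciding cue is Low_Income_Students used in the reversed direction (a lower percentage predicts a higher dropout rate). It counts only exact matches and does not handle tied predictions.

```r
dropout <- c(Bowen = 25.5, Collins = 11.8, Fenger = 28.7,
             Juarez = 21.6, Young = 4.5)
low_income <- c(Bowen = 82.5, Collins = 88.8, Fenger = 63.2,
                Juarez = 84.5, Young = 30.3)

pairs <- combn(length(dropout), 2)

# 1 if the first school of the pair truly has the higher dropout rate,
# -1 if the second does.
correct_greater <- sign(dropout[pairs[1, ]] - dropout[pairs[2, ]])

# Assumed single-cue prediction: the school with FEWER low-income
# students is predicted to have the higher dropout rate.
predicted <- -sign(low_income[pairs[1, ]] - low_income[pairs[2, ]])

# Number of correct predictions divided by the number of predictions.
percent_correct <- 100 * mean(predicted == correct_greater)
percent_correct  # 60
```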

Regression is the best linear unbiased estimator for the data (under the usual least-squares assumptions). But this data set has a very small sample of just 5 schools, and good estimates require more data.

This is an unusual case where TTB actually beat regression in a fitting task. Usually TTB only wins in out-of-sample performance, e.g. when fitting 5 schools and then predicting on other schools not used in the fit.

For a more realistic example, see the vignette with cross-validated out-of-sample performance on a complete data set.

Package Installation

Uncomment and execute the line below to get the CRAN version:

# install.packages("heuristica") 

Uncomment and execute the line below to get the development version.

# Uncomment and execute the line below if you do not have devtools.
# install.packages("devtools") 
# devtools::install_github("jeanimal/heuristica")
# library("heuristica")

Models

The package comes with several models that you can call with predictPair, including the ttbModel and regModel used above.

You can add your own models by also implementing a function related to predictPair, as described in a vignette.

Data

The package comes with two data sets used by many heuristic researchers.

See also

The original C version of the code, used to produce the results in the book chapter, is also available on GitHub: https://github.com/jeanimal/legacy_code_for_SHTMUS/blob/master/README.md

Citations

Take The Best was first described in: Gigerenzer, G. & Goldstein, D. G. (1996). "Reasoning the fast and frugal way: Models of bounded rationality". Psychological Review, 103, 650-669.

All of these heuristics were run on many data sets and analyzed in: Gigerenzer, G., Todd, P. M., & the ABC Group (1999). Simple heuristics that make us smart. New York: Oxford University Press.

The research was also inspired by: Dawes, R. M. (1979). "The robust beauty of improper linear models in decision making". American Psychologist, 34, 571-582.

Acknowledgements

Thanks for coding advice and beta testing go to Marcus Buckmann, Daniel G. Goldstein, and Özgür Simsek.



jeanimal/heuristica documentation built on Feb. 3, 2024, 9:56 p.m.