compare_models: Compare Maxent OT models using a variety of methods

View source: R/compare.R


Compare Maxent OT models using a variety of methods

Description

Compares two or more models fit to the same data set to determine which provides the best fit, using a variety of methods.

Usage

compare_models(..., method = "lrt")

Arguments

...

Two or more model objects to be compared. These objects should be in the same format as the objects returned by the optimize_weights function. Note that the likelihood ratio test applies to exactly two models, while the other comparison methods can be applied to arbitrarily many models.

method

The method of comparison to use. This currently includes lrt (likelihood ratio test), aic (Akaike Information Criterion), aic_c (Akaike Information Criterion adjusted for small sample sizes), and bic (Bayesian Information Criterion).

Details

The available comparison methods are

  • lrt: The likelihood ratio test. This method applies to exactly two models, and the parameters of these models (i.e., their constraints) must be in a strict subset/superset relationship. If your models do not meet these requirements, you should use a different method.

    The likelihood ratio is calculated as follows:

    LR = 2(LL_2 - LL_1)

    where LL_2 is the log likelihood of the model with more parameters and LL_1 is the log likelihood of the model with fewer parameters. A p-value is calculated by conducting a chi-squared test with X^2 = LR and the degrees of freedom set to the difference in the number of parameters between the two models. This p-value tells us whether the difference in likelihood between the two models is significant (i.e., whether the extra parameters in the full model are justified by the increase in model fit). A sketch of this calculation appears after this list.

  • aic: The Akaike Information Criterion. This is calculated as follows for each model:

    AIC = 2k - 2LL

    where k is the number of model parameters (i.e., constraints) and LL is the model's log likelihood.

  • aic_c: The Akaike Information Criterion corrected for small sample sizes. This is calculated as follows:

    AIC_c = 2k - 2LL + \frac{2k^2 + 2k}{n - k - 1}

    where n is the number of samples and the other parameters are identical to those used in the AIC calculation. As n approaches infinity, the final term converges to 0, and so this equation becomes equivalent to AIC. Please see the note below for information about sample sizes.

  • bic: The Bayesian Information Criterion. This is calculated as follows:

    BIC = k\ln(n) - 2LL

    As with aic_c, this calculation relies on the number of samples. Please see the discussion on sample sizes below before using this method.
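
To make these calculations concrete, the following is a minimal sketch of each statistic computed by hand in base R. The log likelihoods, parameter counts, and sample size are hypothetical placeholders, not values produced by the package:

  # Hypothetical values for two nested models
  ll_small <- -110.5; k_small <- 2  # smaller model: log likelihood, parameters
  ll_large <- -105.0; k_large <- 3  # larger model
  n <- 200                          # sample size (sum of the frequency column)

  # Likelihood ratio test: chi-squared with df = difference in parameter counts
  lr <- 2 * (ll_large - ll_small)
  p_value <- pchisq(lr, df = k_large - k_small, lower.tail = FALSE)

  # Information criteria for a single model
  aic  <- function(k, ll) 2 * k - 2 * ll
  aicc <- function(k, ll, n) aic(k, ll) + (2 * k^2 + 2 * k) / (n - k - 1)
  bic  <- function(k, ll, n) k * log(n) - 2 * ll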

A few caveats for several of the comparison methods:

  • The likelihood ratio test (lrt) method applies to exactly two models, and assumes that the parameters of these models are nested: that is, the constraints in the smaller model are a strict subset of the constraints in the larger model. This function will verify this to some extent based on the number and names of constraints.

  • The Akaike Information Criterion adjusted for small sample sizes (aic_c) and the Bayesian Information Criterion (bic) rely on sample sizes in their calculations. The sample size for a data set is defined as the sum of the column of surface form frequencies. If you want to apply these methods, it is important that the values in the column are token counts, not relative frequencies. Applying these methods to relative frequencies, which effectively ignore sample size, will produce invalid results.
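
For instance, if a data frame tableaux stores token counts in a column Freq (both names are hypothetical; use whatever your data actually calls them), the sample size used by these methods is:

  n <- sum(tableaux$Freq)  # total token count across all surface forms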

The aic, aic_c, and bic comparison methods return raw AIC/AICc/BIC values as well as weights corresponding to these values. These weights are calculated similarly for each model:

W_i = \frac{\exp(-0.5 \delta_i)}{\sum_{j=1}^{m}{\exp(-0.5 \delta_j)}}

where \delta_i is the difference in score (AIC, AICc, BIC) between model i and the model with the best score, and m is the number of models being compared. These weights provide the relative likelihood or conditional probability of this model being the best model (by whatever definition of "best" is assumed by the measurement type) given the data and the selection of models it is being compared to.
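
For example, given raw scores for a set of models (hypothetical AIC values below; the same arithmetic applies to AICc and BIC), the weights can be reproduced as follows:

  scores <- c(210.3, 212.1, 215.8)  # hypothetical AIC values
  delta <- scores - min(scores)     # difference from the best score
  wts <- exp(-0.5 * delta) / sum(exp(-0.5 * delta))
  wts       # relative likelihood that each model is the best in the set
  sum(wts)  # the weights sum to 1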

Value

A data frame containing information about the comparison. The contents and size of this data frame vary depending on the method used.

  • lrt: A data frame with a single row and the following columns:

    • description: the names of the two models being compared. The name of the model with more parameters will be first.

    • chi_sq: the chi-squared value calculated during the test.

    • k_delta: the difference in parameters between the two models used as degrees of freedom in the chi-squared test.

    • p_value: the p-value calculated by the test.

  • aic: A data frame with as many rows as there were models passed in. The models are sorted in ascending order of AIC (i.e., best first). This data frame has the following columns:

    • model: The name of the model.

    • k: The number of parameters.

    • aic: The model's AIC value.

    • aic.delta: The difference between this model's AIC value and the smallest AIC value in the set.

    • aic.wt: The model's AIC weight: this reflects the relative likelihood (or conditional probability) that this model is the "best" model in the set.

    • cum.wt: The cumulative sum of AIC weights up to and including this model.

    • ll: The log likelihood of this model.

  • aic_c: The data frame returned here is analogous in structure to the AIC data frame, with AICc values replacing AIC values and column names modified accordingly. There is one additional column:

    • n: The number of samples in the data the model is fit to.

  • bic: The data frame returned here is analogous in structure to the AIC and AICc data frames. Like the AICc data frame, it contains the n column.

Examples

  # Get paths to toy data files
  # This file has two constraints
  data_file_small <- system.file(
      "extdata", "sample_data_frame.csv", package = "maxent.ot"
  )
  # This file has three constraints
  data_file_large <- system.file(
      "extdata", "sample_data_frame_large.csv", package = "maxent.ot"
  )

  # Fit weights to both data sets with no biases
  tableaux_small <- read.csv(data_file_small)
  small_model <- optimize_weights(tableaux_small)

  tableaux_large <- read.csv(data_file_large)
  large_model <- optimize_weights(tableaux_large)

  # Compare models using likelihood ratio test. This is appropriate here
  # because the constraints are nested.
  compare_models(small_model, large_model, method='lrt')

  # Compare models using AIC
  compare_models(small_model, large_model, method='aic')

  # Compare models using AICc
  compare_models(small_model, large_model, method='aic_c')

  # Compare models using BIC
  compare_models(small_model, large_model, method='bic')
