library(knitr)
knitr::opts_chunk$set(
  collapse = TRUE,
  error = FALSE,
  message = FALSE,
  comment = "#>"
)
knitr::opts_knit$set(
  root.dir = normalizePath('../'), # set working directory as root
  cache.path = "vignettes/basic_vadis_cache/") 

This vignette introduces the basics of the Variation-Based Distance & Similarity Modeling (VADIS) method and demonstrates how to use the core functions in the VADIS package. The VADIS method builds on techniques from comparative sociolinguistics and quantitative dialectometry for quantifying the similarity between varieties and dialects, as reflected in correspondences among the ways language users choose between different ways of saying the same thing. For details of the method and its theoretical motivation see @szmrecsanyi_variationbased_2019 and @grafmiller_mapping_2018.

These are the basic libraries you'll need for this vignette.

library(tidyverse) # for data wrangling
library(lme4) # for regression models
library(ranger) # for random forests
library(permimp) # for conditional permutation importance
library(phangorn) # for neighborNets

Install from GitHub with devtools and load the VADIS package.

devtools::install_github("jasongraf1/VADIS")
library(VADIS)

Setup

For this tutorial, we will examine how the constraints on the placement of postverbal particles, e.g. off in examples (a) and (b), vary across different varieties of English.

(a) She uses a pair of long metal scissors to cut off a length of it. [continuous order]
(b) She uses a pair of long metal scissors to cut a length of it off. [split order]

We'll use a version of the dataset of particle verb constructions used by @grafmiller_mapping_2018. Full details of the dataset can be found here: https://osf.io/x8vyw/.

# load the dataset included with the VADIS package
pv <- particle_verbs_short
names(pv)

This dataset contains approximately 11,000 tokens from nine varieties of English around the world.

ggplot(pv, aes(Variety, fill = Response)) +
  geom_bar(position = "dodge")

VADIS steps

The VADIS method is designed to measure the degree of (dis)similarity among "variable grammars" of different dialects or varieties, where a variable grammar is understood as the set of constraints (a.k.a. predictors or "conditioning factors") governing the choice between two or more linguistic variants. Variants can be individual lexical items (sneakers vs. trainers vs. tennis shoes), grammatical constructions (give me the book vs. give the book to me), or phonetic realizations of a particular phoneme (e.g. [ʊ] vs. [ʌ] pronunciations of the STRUT vowel).

The method takes inspiration from comparative sociolinguistics [see e.g. @tagliamonte_comparative_2013], which evaluates the relatedness between varieties and dialects based on how similar the conditioning of variation is in these varieties. Comparative sociolinguists rely on three lines of evidence to determine relatedness:

  1. Are the same constraints significant across varieties?
  2. Do the constraints have the same strength across varieties?
  3. Is the relative explanatory importance of the constraints similar?

Below we'll go through the steps for calculating and assessing each of these lines of evidence.

Splitting the dataset

The first thing we need to do is get the data into the necessary format. With VADIS we analyze the variable in each dataset separately, so the basic data object that the VADIS functions work with is a list of dataframes.

For this tutorial, we're modeling particle placement across nine varieties separately, so we first split the data up into nine individual datasets stored together as a list.

# create list of dataframes
data_list <- split(pv, pv$Variety, drop = TRUE) # drop unused levels
names(data_list)

Now we're ready to begin.

Step 1

In the first step we identify the most important constraints on our variable. We know from prior research [e.g. @grafmiller_mapping_2018; @gries_multifactorial_2003] that many different factors influence the choice of particle placement, including properties of the direct object (its length, definiteness, givenness, concreteness, and thematicity), the presence of a directional PP, priming, verb semantics, surprisal, and register.

These are the predictors we'll include in our variable grammar models. We'll define our model formula using this set. More information on these columns is available in help(particle_verbs_short).

f1 <- Response ~ DirObjWordLength + DirObjDefiniteness + DirObjGivenness + 
  DirObjConcreteness + DirObjThematicity + DirectionalPP + PrimeType + 
  Semantics + Surprisal.P + Surprisal.V + Register

Step 2

We now fit logistic regression models to the data (sub)sets. We'll use fixed-effects-only models for this vignette since they are faster to fit, but I give example code for a mixed-effects model below.

Next we simply loop over the datasets in data_list and fit a model to each one. Before fitting we'll center and standardize our model inputs [see @Gelman2008].

IMPORTANT: Standardizing inputs is essential for reliable calculation of the similarity coefficients in line 2 (Step 4). However, only the fixed-effects predictors should be standardized. There are a number of ways to do this, but the package includes a function stand() that takes care of it easily. See help(stand) for details.

glm_list <- vector("list") # empty list to store our models 
for (i in seq_along(data_list)){
  d <- data_list[[i]]
  # now standardize the model fixed effects inputs before fitting.
  d_std <- stand(d, cols = f1) # pass the model formula to select the columns to standardize
  # fit the model
  glm_list[[i]] <- glm(f1, data = d_std, family = binomial, x = TRUE) # note the x = TRUE
}
names(glm_list) <- names(data_list) # add names to the list of models

Notice the inclusion of the x = TRUE argument in the glm() call. This tells glm() to store the model (design) matrix in the fitted model object, which is necessary for the VADIS functions to work with glm models. It is not necessary for models fit with glmer().
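If you want to double-check that the design matrix was stored, you can inspect the $x component of a fitted model directly (a quick sanity check, not a required step).

# the design matrix stored via x = TRUE lives in the model's $x component
dim(glm_list[[1]]$x)
head(colnames(glm_list[[1]]$x))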

Mixed-effects models are fit in much the same fashion. We just have to define a new formula and use a different modeling function. Again, we won't use these models in this vignette, since mixed models can take much longer to fit.

# update formula with by-verb and by-particle random intercepts 
f2 <- update(f1, .~ (1|Verb) + (1|Particle) + .)
f2
glmer_list <- vector("list")
for (i in seq_along(data_list)){
  d <- data_list[[i]]
  # standardize the model inputs, excluding the response and random effects
  d_std <- stand(d, cols = f2) # pass the model formula to select the columns to standardize
  # fit the model
  glmer_list[[i]] <- glmer(f2, data = d_std, family = binomial, 
    # set optimizer controls to help convergence 
    control = glmerControl(optimizer = "bobyqa", optCtrl = list(maxfun = 1e7)))
  rm(d, d_std) # remove datasets
}
names(glmer_list) <- names(data_list)

Checking fits

Before moving on, we should check that the models fit the data well, and that we don't have any issues with collinearity or other signs of trouble. The summary_stats() function gives a number of statistics to consider.

summary_stats(glm_list, data_list) %>% 
  round(3)

The function reports several statistics for assessing how well each model discriminates between the two responses, along with measures of overall fit.

All the models appear to discriminate between the two responses fairly well, though some of them may have issues with overall fit. We'll set this aside for the time being.

Step 3

In step 3 we look at the first line of evidence: "Are the same constraints significant across varieties?"
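As a rough conceptual illustration (this is not the package's internal code), one simple way to quantify this question is to code each constraint as significant or not in each variety and measure how often two varieties disagree.

# toy example: binary significance codes for five constraints in two hypothetical varieties
sig_A <- c(1, 1, 0, 1, 0) # 1 = significant, 0 = not significant
sig_B <- c(1, 0, 0, 1, 1)
mean(sig_A != sig_B) # proportion of constraints on which the two varieties disagree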

We do this with the vadis_line1() function, which returns a list with three elements: a table of constraint significance across varieties (signif.table), a matrix of pairwise distances between varieties (distance.matrix), and a table of similarity scores (similarity.scores).

The path argument tells the function where to save the output. The default setting saves a time-stamped file, e.g. vadis_line1_output_2018-Oct-30_15.04.rds, to the current working directory. Since the output of the function is a list object, it is saved as an R data object file (.rds). Setting path = FALSE tells the function not to save anything.
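If you do let the function save its output (the default), the saved list can be read back into R later with readRDS(). The file name below is just an illustration of the time-stamped naming pattern.

# reload a previously saved VADIS output (file name is illustrative)
signif_line <- readRDS("vadis_line1_output_2018-Oct-30_15.04.rds")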

signif_line <- vadis_line1(glm_list, path = FALSE)

Examine each of the elements.

signif_line$signif.table
signif_line$distance.matrix %>% 
  round(3)
signif_line$similarity.scores %>% 
  arrange(desc(Similarity)) # sort by similarity

Individual components can be saved as .csv (or .txt) files like so.

write.csv(signif_line$signif.table, 
          file = "line1_significance_table.csv")

write.csv(as.matrix(signif_line$distance.matrix), 
          file = "line1_distance_matrix.csv")

write.csv(signif_line$similarity.scores, 
          file = "line1_similarity_scores.csv")

Step 4

In the next step we look at the second line of evidence: do the constraints have the same strength across varieties? We do this with the vadis_line2() function, which also returns a list with three elements: a table of coefficient estimates (coef.table), a distance matrix (distance.matrix), and a table of similarity scores (similarity.scores).

coef_line <- vadis_line2(glm_list, path = FALSE)
coef_line$coef.table %>% 
  round(3)
coef_line$distance.matrix %>% 
  round(3)
coef_line$similarity.scores %>% 
  arrange(desc(Similarity))

Note that both vadis_line1() and vadis_line2() work the same with glm and glmer models.
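For example, if you fitted the mixed-effects models above, the same call applies directly (shown here as a sketch only; we stick with the glm models for the rest of this vignette).

# same workflow with the mixed-effects models (not run here)
coef_line_glmer <- vadis_line2(glmer_list, path = FALSE)
coef_line_glmer$similarity.scores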

Calculating maximum distance in line 2

In order to calculate line 2 appropriately, we need a maximum distance measure with which to normalize the pairwise distances. We define this distance based on what we know about the range of values the data is likely to take: we use the distance between two hypothetical varieties whose constraints have exactly the opposite effects (in the default case, 1 and -1 on the logit scale). Such cases of constraint "flipping", i.e. a systematic reversal in the direction of constraint effects between two varieties, are very unlikely in real-world contexts, which is exactly the point. We additionally set the absolute size of all the constraints to a reasonable value, giving two hypothetical varieties that are about as different from one another as we could expect any two related language varieties to be. We are unlikely to come across a real-world situation where the variable grammars of two varieties are more dissimilar than this.

For a comparison of grammars with N constraints, we define the maximum reasonable distance by taking the distance between two hypothetical varieties whose constraints all have an absolute effect size of ±1 but whose effect directions are the opposite of one another. For example, in a study of a variable grammar with 10 constraints, we define the maximum reasonable distance as the Euclidean distance between two hypothetical varieties A and B that look like this:

d <- data.frame(
  A = rep(c(-1,1), each = 5),
  B = rep(c(1,-1), each = 5),
  row.names = paste0("constraint.", 1:10))
d

The maximum distance in this case is d(A, B) = sqrt(10 × 2^2^) = sqrt(40) ≈ 6.32. To normalize our distance matrix we divide the observed distances by this value, giving distances in the range 0 to 1. For the similarity scores we simply subtract the distances from 1, so that larger values represent greater average similarity.
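To make the normalization concrete, here is a minimal sketch of the calculation just described; the observed distance is a made-up value, and this is not the package's internal code.

# maximum reasonable distance for 10 constraints with effects of +/-1
max_dist <- as.numeric(dist(t(d), method = "euclidean")) # sqrt(10 * 2^2), approx. 6.32
obs_dist <- 2.5 # a hypothetical observed pairwise distance between two varieties
obs_dist / max_dist # normalized distance, between 0 and 1
1 - obs_dist / max_dist # corresponding similarity score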

We chose ±1 as the default value for our hypothetical constraints since a change from -1 to 1 on the log odds scale represents an increase in probability of approximately 46 percentage points: p = logit^-1^(-1) = 0.27; p = logit^-1^(1) = 0.73. It seems very unlikely that we'd ever see such a strong reversal in the effect of every constraint between two related varieties. Note that these values assume that inputs are standardized according to the procedure described in @Gelman2008. Nevertheless, if desired, the effect size used to calculate the weighting can be adjusted by setting the weight argument in the vadis_line2() function.
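You can verify these probabilities with R's built-in inverse logit function, plogis().

# inverse logit of -1 and +1 (approx. 0.27 and 0.73)
plogis(c(-1, 1))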

vadis_line2(glm_list, weight = 2, path = FALSE)$distance.matrix %>% 
  round(3)

With weight = 2 the absolute distances are now half what they were before, but the relative distances have not changed (this also means that higher weights will push similarity scores closer to 1).
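As a quick optional check of this claim (assuming, as the as.matrix() conversion above suggests, that distance.matrix is a dist object), the elementwise ratio of the default and reweighted distances should be constant.

# ratio of default-weight distances to weight = 2 distances; all values should be ~2
d_w2 <- vadis_line2(glm_list, weight = 2, path = FALSE)$distance.matrix
summary(as.numeric(coef_line$distance.matrix) / as.numeric(d_w2))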

Step 5

In this step we fit random forest models to each dataset. From these we'll calculate the variable importance rankings that make up the third line of evidence. We recommend either the party or ranger packages for fitting random forests. Note that for ranger forests, the importance argument must be specified when fitting the model.

Before fitting the models, we can tune the hyperparameter settings for each dataset. The tuneRanger package provides an efficient method for tuning forests fit with the ranger package.

library(tuneRanger)
library(mlr)

tune_df <- data.frame(matrix(NA, ncol = 5, nrow = 9))
names(tune_df) <- c("mtry", "min.node.size", "sample.fraction", "auc", "exec.time")

for (i in seq_along(data_list)){
  d <- data_list[[i]][, all.vars(f1)]
  pv_task <- makeClassifTask(data = d, target = "Response")

  # Tuning process (takes around 1 minute); Tuning measure is the Area Under the Curve
  result <- tuneRanger(pv_task, measure = list(auc), num.trees = 1000, 
                    num.threads = 4, iters = 80, show.info = F)

  tune_df[i,] <- result$recommended.pars
}
rownames(tune_df) <- names(data_list)
# alternatively, load previously saved tuning results rather than re-running the loop above
tune_df <- read.csv("data/tune_df.csv")

Now we loop over the datasets in data_list and apply the modeling function just like we did with the regression models.

library(ranger)
rf_list <- vector("list")
for (i in seq_along(data_list)){
  d <- data_list[[i]]
  # fit the random forest and add it to the list
  rf_list[[i]] <- ranger(
    f1, 
    data = d,
    seed = 1234,
    num.trees = 1000,
    mtry = tune_df[i, "mtry"],
    min.node.size = tune_df[i, "min.node.size"],
    sample.fraction = tune_df[i, "sample.fraction"],
    probability = TRUE,
    importance = "permutation"
    )
}
names(rf_list) <- names(data_list)

This gives us a list of random forest models similar to our list of regression models above.
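The individual ranger objects store their out-of-bag prediction error and permutation importance scores, which are worth a quick look before moving on (a small sanity check, not part of the VADIS workflow itself).

# out-of-bag prediction error for the first variety's forest
rf_list[[1]]$prediction.error
# permutation importance scores, sorted from largest to smallest
sort(rf_list[[1]]$variable.importance, decreasing = TRUE)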

WARNING: Random forests can take a long time to fit, and the resulting objects can be quite large, especially those generated with party. Processing many forests with large numbers of trees can tax your CPU and RAM and slow R down considerably. We recommend saving these objects to a file to be loaded as needed.
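A minimal sketch of doing so (the file name here is just an example):

# save the fitted forests to disk and reload them later as needed
saveRDS(rf_list, file = "pv_rf_list.rds")
rf_list <- readRDS("pv_rf_list.rds")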

Checking fits

Before moving on, we should again check that the models fit the data well and that there are no other signs of trouble. The summary_stats() function gives only a subset of the statistics we saw before, since random forest models don't provide the others. Additional arguments are also needed, because model objects from some packages don't contain all the necessary information.

summary_stats(rf_list, data_list, response = "Response") %>% 
  round(3)

This may also take some time with party models...

Step 6

In this next step we look at the third line of evidence: is the relative explanatory importance of the constraints similar across varieties? We do this with the vadis_line3() function, which returns a list with four elements: a table of variable importance scores (varimp.table), a table of importance rankings (rank.table), a distance matrix (distance.matrix), and a table of similarity scores (similarity.scores).

We then loop over the list of random forests and calculate the variable importance rankings. For ranger forests, the importance scores are taken directly from the model objects. For party forests the package uses AUC-based permutation importance, calculated with the permimp() function from the permimp package (with AUC = TRUE). With party forests you can also specify whether the rankings should be based on conditional importance scores, which are advised when there are many potentially correlated predictors [@strobl_conditional_2008]. Since conditional importance can take much longer to compute, the default is conditional = FALSE (the argument is ignored for ranger forests).

varimp_line <- vadis_line3(rf_list, path = FALSE, conditional = FALSE)

Look at the variable importance table.

varimp_line$varimp.table %>% 
  round(3)

Now view the importances as rankings, which makes it easier to compare varieties.

varimp_line$rank.table
varimp_line$distance.matrix %>% 
  round(3)
varimp_line$similarity.scores %>% 
  arrange(desc(Similarity))

Combining the 3 lines

We can merge the similarity scores across the three lines of evidence, and arrive at mean scores for each variety.

mean_sims <- data.frame(
  line1 = signif_line$similarity.scores[,1], # get the similarity values for line 1
  line2 = coef_line$similarity.scores[,1],
  line3 = varimp_line$similarity.scores[,1],
  row.names = names(data_list)
)
mean_sims$mean <- rowMeans(mean_sims)
round(mean_sims, 3)

From this we can calculate a mean of the mean similarity scores. This is our Core Grammar Coefficient for particle placement across these varieties [@szmrecsanyi_variationbased_2019].

mean(mean_sims$mean)

We can also create a combined distance matrix from the three lines of evidence. We use the fuse() function in the analogue package.

fused_dist <- analogue::fuse(signif_line$distance.matrix, 
                             coef_line$distance.matrix, 
                             varimp_line$distance.matrix)
round(fused_dist, 3)

Visualization

Using the distance matrices from the different lines of evidence (or the combined distances), we can visualize the similarities among varieties in a number of ways.

Clustering

One way we can visualize distances is through clustering methods. There are many different functions and packages for doing this, the simplest case probably being the hclust() function.

# Use the 2nd line of evidence
line2_clust <- hclust(coef_line$distance.matrix, method = "ward.D2")
plot(line2_clust, main = "Hierarchical clustering of line 2 distances")

We can look at the fused matrix combining the 3 lines of evidence, which gives us a slightly different picture.

hclust(fused_dist, method = "ward.D2") %>% 
  plot(main = "Hierarchical clustering of fused distances")

Other methods include divisive clustering (with cluster::diana()) and unrooted trees built with neighbor-joining (with ape::nj()). For example, the Line 2 distances cluster like so.

par(mfrow = c(1,2))
cluster::diana(coef_line$distance.matrix) %>% 
  cluster::pltree(main = "Divisive clustering")
ape::nj(coef_line$distance.matrix) %>% 
  plot(type = "u", main = "Unrooted clustering")

The three Lines together give a slightly different picture.

par(mfrow = c(1,2))
cluster::diana(fused_dist) %>% 
  cluster::pltree(main = "Divisive clustering")
ape::nj(fused_dist) %>% 
  plot(type = "u", main = "Unrooted clustering")

Multidimensional scaling

Distance matrices can also be fed into a multidimensional scaling analysis, which maps distances onto a 2 or 3 dimensional space. The closer the points are in this space, the more similar the varieties' grammars.

line2_mds <- cmdscale(coef_line$distance.matrix, k = 3, eig = T) 
line2_mds[[1]] %>%
  as.data.frame() %>% 
  mutate(Variety = rownames(.)) %>% 
  ggplot(aes(V1, V2, label = Variety)) +
  geom_point() +
  geom_text(nudge_y = .01, size = 4)

One nice thing about MDS maps is that they can be represented in 2 or 3 dimensions. There are a number of methods for generating 3D plots. The simplest is probably scatterplot3d.

dd <- line2_mds[[1]] %>%
  as.data.frame() 
library(scatterplot3d)
with(dd, {
  scttr <- scatterplot3d(x = V1, y = V2, z = V3, type = "h", pch = 18)
  scttr_coords <- scttr$xyz.convert(V1, V2, V3)
  text(scttr_coords$x, scttr_coords$y, labels = rownames(dd), pos = 3)
  })

The plotly package is very nice for generating all kinds of interactive graphics.

library(plotly)
dd %>% 
  rownames_to_column("Variety") %>% 
  plot_ly() %>%
  add_trace(x = ~V1, y = ~V2, z = ~V3,
            type = "scatter3d", inherit = F,
            marker = list(size = 4),
            mode = "markers") %>%
  add_text(x = ~V1, y = ~V2, z = ~V3,
           text = ~ Variety,
           type = "scatter3d",
           mode = "markers",
           showlegend = FALSE)

We can also run an MDS based on the fused distance matrix.

fused_mds <- cmdscale(fused_dist, k = 3, eig = T)

fused_mds[[1]] %>% # extract the coordinates
  as.data.frame() %>% 
  rownames_to_column("Variety") %>%
  plot_ly() %>%
  add_trace(x = ~V1, y = ~V2, z = ~V3,
            type = "scatter3d", inherit = F,
            marker = list(size = 4),
            mode = "markers") %>%
  add_text(x = ~V1, y = ~V2, z = ~V3,
           text = ~ Variety,
           type = "scatter3d",
           mode = "markers",
           showlegend = FALSE)

NeighborNets

NeighborNet is yet another technique for constructing and visualizing phylogenetic relationships. The method is similar to other agglomerative clustering methods, e.g. neighbor-joining algorithms, but allows for "conflicting" signals. These conflicting signals are realized as web-like reticulations which indicate that the data support several possible tree structures [see @dunn_structural_2008].

line2_NNet <- phangorn::neighborNet(coef_line$distance.matrix)
plot(line2_NNet, "2D")

Now look at the NeighborNet derived from the fused distance matrix.

phangorn::neighborNet(fused_dist) %>% 
  plot("2D")

References


