knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
In stylometry, cluster analysis groups the texts of a large corpus by their contents, making similarities and differences among them easier to see. The stylo package visualizes these groups with hierarchical tree diagrams, or dendrograms, which draw connections among texts. In these charts, texts are placed on branches that split from one another at points corresponding to difference, or distance. In the orientation shown below, the more stylistically similar two texts or groups of texts are, the nearer to the right side of the chart their branches join, that is, nearer to a distance of zero. The more dissimilar they are, the closer to the left side of the chart their branches connect.
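To make the link between branch height and distance concrete, here is a minimal base R sketch on invented data. It uses dist() and hclust() from the stats package rather than stylo or stylo2gg, and the text names and values are hypothetical.

```r
# Toy feature matrix: rows are texts, columns are hypothetical word frequencies.
toy <- rbind(
  text_a = c(1.0, 2.0, 3.0),
  text_b = c(1.1, 2.1, 2.9),
  text_c = c(5.0, 0.5, 1.0),
  text_d = c(5.2, 0.4, 1.1)
)

# Pairwise Euclidean distances, then agglomerative clustering.
toy_dist <- dist(toy, method = "euclidean")
toy_hc <- hclust(toy_dist, method = "complete")

# Each row of `merge` is one join; `height` is the distance at which it happens.
# Negative numbers are individual texts, positive numbers are earlier joins.
cbind(toy_hc$merge, height = round(toy_hc$height, 2))

# Draw the tree horizontally with base graphics.
plot(as.dendrogram(toy_hc), horiz = TRUE)
```

The two closest texts join first, at the smallest height, and the two pairs join each other last, at a much larger height.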
# Only run this code chunk interactively, to create the needed files
federalist_mfw_hc <- stylo(gui = FALSE,
                           corpus.dir = system.file("extdata/federalist", package = "stylo2gg"),
                           custom.graph.title = "Federalist Papers",
                           analysis.type = "CA",
                           analyzed.features = "w",
                           ngram.size = 1,
                           display.on.screen = TRUE,
                           sampling = "no.sampling",
                           culling.max = 75,
                           culling.min = 75,
                           mfw.min = 900,
                           mfw.max = 900,
                           write.rds.file = TRUE)
saveRDS(federalist_mfw_hc, "federalist_mfw_hc.rds")
#| label: stylo_hc
#| cache: false
library(stylo)
federalist_mfw_hc <- stylo(gui = FALSE,
                           corpus.dir = system.file("extdata/federalist", package = "stylo2gg"),
                           custom.graph.title = "Federalist Papers",
                           analysis.type = "CA",
                           analyzed.features = "w",
                           ngram.size = 1,
                           display.on.screen = TRUE,
                           sampling = "no.sampling",
                           culling.max = 75,
                           culling.min = 75,
                           mfw.min = 900,
                           mfw.max = 900)
#| fig.height: 10.0
# Actually run this code chunk, but don't show the code
federalist_mfw_hc <- readRDS("federalist_mfw_hc.rds")
readRDS("articles_CA_120_MFWs_Culled_75__Classic Delta_.rds")
Using this federalist_mfw_hc object, stylo2gg will create a similar cluster analysis:
#| fig.height: 10.0
library(stylo2gg)
federalist_mfw_hc |>
  stylo2gg()
Adding symbols to the chart keeps it interpretable when printed. They can be turned off with shapes=FALSE, as they can when showing PCA. In fact, many of the options that apply to a visualization of principal components also apply to cluster analysis:
#| fig.height: 10.0
library(ggplot2)
federalist_mfw_hc |>
  stylo2gg(viz = "CA", labeling = 2, highlight = 4, black = 3) +
  theme(legend.position = "bottom") +
  scale_shape_manual(values = 15:18)
When stylo's settings for measuring distances and finding groups are available, stylo2gg adopts them. Otherwise, it defaults to measuring distance with Burrows's Delta (distance.measure="delta") and grouping texts with the Ward1 method (linkage="ward.D"). Many other options native to stylo and to R can be set for these.
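As a rough illustration of what these two defaults compute, the sketch below works through Classic Delta and Ward linkage in base R. It is not stylo's or stylo2gg's implementation: the frequencies are invented, and the calculation assumes the usual definition of Classic Delta as the mean absolute difference of z-scored word frequencies.

```r
# Hypothetical relative frequencies (percent) of four common words in four texts.
freqs <- rbind(
  text_a = c(the = 6.1, of = 3.9, to = 3.0, and = 2.4),
  text_b = c(the = 6.3, of = 4.1, to = 2.9, and = 2.5),
  text_c = c(the = 5.2, of = 3.0, to = 3.6, and = 3.1),
  text_d = c(the = 5.0, of = 3.2, to = 3.5, and = 3.3)
)

# Step 1: z-score each word (column) across the corpus.
z <- scale(freqs)

# Step 2: Classic Delta as the mean absolute difference of z-scores,
# i.e. Manhattan distance divided by the number of words.
delta <- dist(z, method = "manhattan") / ncol(z)

# Step 3: group texts with the "ward.D" linkage from stats::hclust().
hc <- hclust(delta, method = "ward.D")

# Cut the tree into two groups to see which texts cluster together.
groups <- cutree(hc, k = 2)
groups
```

In stats::hclust(), "ward.D" and "ward.D2" differ in whether the dissimilarities are squared before the cluster update, which is why the two linkages can produce different trees from the same distances.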
When setting viz="CA" or importing directly from a stylo analysis run with analysis.type="CA", stylo2gg will try to approximate the aesthetics shown by stylo, displaying an axis of distance and an explanatory caption. These additions may not be wanted, since relative distances are already meaningful without the axis. To show a minimal dendrogram, choose the option viz="hc" (for hierarchical clustering):
#| fig.height: 10.0
federalist_mfw_hc |>
  stylo2gg(viz = "hc",
           shapes = FALSE,
           linkage = "ward.D2",
           distance = "euclidean") +
  expand_limits(y = 0)
As shown above, setting the highlight option to a category's number will draw a box around the branches containing this category. Unlike with principal components visualization, only one category can be highlighted at a time on a dendrogram; nevertheless, there are a few more options available to tweak this highlighting.
By default, stylo2gg will draw a single box, even if texts belonging to a category are distributed unevenly.
#| fig.height: 10.0
federalist_mfw_hc |>
  stylo2gg(viz = "hc",
           shapes = FALSE,
           highlight = 1,
           highlight.nudge = 0.4,
           linkage = "ward.D2",
           distance = "euclidean") +
  expand_limits(y = 0)
Setting highlight.single=FALSE will instead draw multiple boxes while consolidating any contiguous branches:
#| fig.height: 10.0
federalist_mfw_hc |>
  stylo2gg(viz = "hc",
           shapes = FALSE,
           highlight = 1,
           highlight.nudge = 0.4,
           highlight.single = FALSE,
           linkage = "ward.D2",
           distance = "euclidean") +
  expand_limits(y = 0)
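The idea of consolidating contiguous branches can be sketched in base R. This is only an illustration of the logic, not stylo2gg's internal code, and the leaf positions are hypothetical.

```r
# Hypothetical leaf positions along the dendrogram for one category.
positions <- c(2, 3, 4, 9, 10, 15)

# A new run starts wherever the gap to the previous position is not exactly 1.
run_id <- cumsum(c(TRUE, diff(positions) != 1))

# One box per run: record where each run starts and ends.
boxes <- do.call(rbind, lapply(split(positions, run_id), function(run) {
  data.frame(start = min(run), end = max(run))
}))
boxes

# A single box, by contrast, would span the full range of positions.
range(positions)
```

Here the six positions collapse into three boxes: one covering positions 2 through 4, one covering 9 through 10, and one around position 15 alone.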
As an alternative to categorical highlighting on a dendrogram, stylo2gg can draw boxes around texts specified by index number, counted from the bottom: the lowest text is number 1, the second-lowest is number 2, and so on:
#| fig.height: 10.0
federalist_mfw_hc |>
  stylo2gg(viz = "hc",
           shapes = FALSE,
           highlight.box = c(6:19, 23:24, 29:63, 84:85),
           highlight.nudge = 0.4,
           linkage = "ward.D2",
           distance = "euclidean",
           show.zero = FALSE)
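When choosing index numbers, it can help to list the leaves in their plotted order. The sketch below does this for a base R hclust object built from invented data. Whether stylo2gg counts its indices along this same ordering is an assumption to check against the plotted chart.

```r
# Toy data: rows are hypothetical texts, columns are hypothetical features.
toy <- rbind(
  text_a = c(1.0, 2.0),
  text_b = c(1.1, 2.1),
  text_c = c(5.0, 0.5),
  text_d = c(5.2, 0.4)
)

hc <- hclust(dist(toy), method = "ward.D2")

# `order` gives the sequence in which leaves are laid out along the tree.
leaves <- data.frame(
  index = seq_along(hc$order),
  text = hc$labels[hc$order]
)
leaves
```

Reading the index column against the labels in the chart is a quick way to confirm which numbers to pass before drawing boxes.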
Finally, set horiz=FALSE to display the cluster analysis vertically instead of horizontally:
#| fig.width: 11.0
federalist_mfw_hc |>
  stylo2gg(viz = "CA", horiz = FALSE) +
  theme(legend.position = "top")