Introduction to clustRcompaR

clustRcompaR

An R package to cluster and compare text data.

Background

Document clustering is a common technique for discovering topics in a corpus of texts. This package builds on functions from the quanteda R package to provide two functions, cluster() and compare(), that make it straightforward to cluster documents and to compare the resulting clusters across the levels of a factor.
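To make the idea of document clustering concrete, here is a minimal sketch in base R. It is not how cluster() is implemented: the toy sentences are invented for illustration, and the choice of whitespace tokenization, raw term counts, Euclidean distance, and hierarchical clustering is an assumption made to keep the example self-contained.

```r
# Toy corpus (made-up sentences, for illustration only)
docs <- c(
  d1 = "the cat sat on the mat",
  d2 = "the cat chased the mouse",
  d3 = "stocks fell as markets closed",
  d4 = "markets rallied and stocks rose"
)

# Tokenize on whitespace and collect the vocabulary
tokens <- strsplit(tolower(docs), "\\s+")
vocab <- sort(unique(unlist(tokens)))

# Document-term matrix: one row per document, one column per term
dtm <- t(vapply(
  tokens,
  function(tok) as.numeric(table(factor(tok, levels = vocab))),
  numeric(length(vocab))
))
colnames(dtm) <- vocab

# Hierarchical clustering on Euclidean distances, cut into two clusters
clusters <- cutree(hclust(dist(dtm)), k = 2)
clusters
```

Documents that share vocabulary end up close together in the document-term matrix, so the two sentences about a cat fall into one cluster and the two about markets into the other.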

Installation

This package is in development and is not yet available on CRAN. To install it, first install the devtools package with install.packages("devtools"), then run devtools::install_github("alishinski/clustRcompaR"). After installing the package, load it in each session with library(clustRcompaR).

Workflow

Example

Here is an example using the built-in inaugural_addresses dataset (from the quanteda package). This dataset consists of the inaugural addresses by every United States President.

First, we use cluster() to group the documents into three clusters. We include a new variable, century, which we will later use to compare cluster frequencies. Then we use extract_terms() to view the terms and term frequencies in the three clusters.

library(clustRcompaR)
library(dplyr)
library(quanteda)

d <- inaugural_addresses
# Label each address with the century in which it was delivered
d <- mutate(d, century = ifelse(Year < 1800, "18th",
                                ifelse(Year >= 1800 & Year < 1900, "19th",
                                       ifelse(Year >= 1900 & Year < 2000, "20th", "21st"))))

three_clusters <- cluster(d, century, n_clusters = 3)

extract_terms(three_clusters)
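The century labels in the example can also be derived arithmetically from the year instead of with nested ifelse() calls. The following is a minimal sketch in base R. label_century() is a hypothetical helper, not part of clustRcompaR, and it follows the example's convention that the years 1800 to 1899 belong to the 19th century.

```r
# Hypothetical helper: turn a year into an ordinal century label
label_century <- function(year) {
  # Integer division gives the century number (1789 -> 18, 1905 -> 20)
  n <- year %/% 100 + 1
  # Pick the English ordinal suffix; 11, 12 and 13 always take "th"
  suffix <- ifelse((n %% 100) %in% 11:13, "th",
            ifelse(n %% 10 == 1, "st",
            ifelse(n %% 10 == 2, "nd",
            ifelse(n %% 10 == 3, "rd", "th"))))
  paste0(n, suffix)
}

labels <- label_century(c(1789, 1801, 1905, 2001))
labels
# [1] "18th" "19th" "20th" "21st"
```

Assuming the data has a Year column as in the example, the helper could be used as mutate(d, century = label_century(Year)).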

Second, we use compare() to compare the frequency of clusters across the levels of a factor, in this case century. We can then examine the result with compare_plot() or with compare_test(), which uses a Chi-Square test.

three_clusters_comparison <- compare(three_clusters, "century")

compare_plot(three_clusters_comparison)
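To illustrate the kind of Chi-Square test mentioned above, here is a sketch using chisq.test() from base R on a table of cluster by century. The counts are invented for illustration and are not output from the package, and the internals of compare_test() are not shown here, so this should be read as an example of the general technique only.

```r
# Hypothetical counts of documents per cluster (rows) and century (columns);
# these numbers are invented for illustration, not output from the package
counts <- matrix(
  c(4, 10,  3, 1,
    3,  8,  9, 2,
    2,  6, 12, 3),
  nrow = 3, byrow = TRUE,
  dimnames = list(cluster = c("1", "2", "3"),
                  century = c("18th", "19th", "20th", "21st"))
)

# Chi-square test of independence between cluster and century;
# suppressWarnings() hides the small-expected-count warning for this toy table
result <- suppressWarnings(chisq.test(counts))

result$statistic  # test statistic
result$parameter  # degrees of freedom: (3 - 1) * (4 - 1) = 6
result$p.value    # p-value under the null hypothesis of independence
```

A small p-value would suggest that the distribution of documents over clusters differs between centuries. With counts as small as these, the chi-square approximation may be unreliable.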



clustRcompaR documentation built on May 1, 2019, 11:16 p.m.