Development has moved to the clustext package.
kmeanstext is a collection of optimized tools for clustering text data via kmeans clustering. There are many great R tools for locating topics within documents, and kmeans clustering is a popular method for topic extraction. This package builds upon my hclustext package, extending its framework to kmeans. One major difference between the two techniques is that with hierarchical clustering the number of topics is specified after the model has been fitted, whereas kmeans requires the k topics to be specified before the model is fit. Additionally, because kmeans uses a random starting seed, the results may vary each time a model is fit. Finally, Euclidean distance is typically used in a kmeans algorithm, whereas any distance metric may be passed to a hierarchical clustering fit.
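To make the first two differences concrete, here is a minimal base R sketch on toy data (plain stats::kmeans, not this package): k must be supplied up front, and without a fixed seed the assignments can change between runs.

set.seed(42)                        ## fix the random start for reproducibility
m <- matrix(rnorm(40), nrow = 10)   ## 10 toy "documents" x 4 features
fit <- kmeans(m, centers = 2)       ## k = 2 is required before fitting
fit$cluster                         ## rerun without set.seed() and labels may differ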
The general idea is that we turn the documents into a document-term matrix and then weight the terms by importance using tf-idf. This helps the more salient words rise to the top. The user then selects k clusters (topics) and runs the model. The model iteratively shuffles centers and assigns documents to clusters based on the minimal distance from a document to a center. Each run uses the recalculated mean centroids of the prior clusters as starting points for the current iteration's centroids. Once the centroids have stabilized, the model has converged at k topics. The user may then extract the clusters from the fit, providing a grouping for documents with similar important text features.
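That pipeline can be sketched in a few lines with the tm package plus base kmeans. This is only an illustration of the idea on made-up documents, not the package's internals:

library(tm)

docs <- c("economy jobs taxes growth",
          "jobs taxes budget economy",
          "war peace treaty allies")
dtm <- DocumentTermMatrix(
    VCorpus(VectorSource(docs)),
    control = list(weighting = weightTfIdf)   ## weight terms by importance
)
fit <- kmeans(as.matrix(dtm), centers = 2)    ## k = 2 topics, chosen up front
split(seq_along(docs), fit$cluster)           ## documents grouped by topic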
The main functions, task categories, and descriptions are summarized in the table below:
Function         Category         Description
---------------  ---------------  ---------------------------------------------------------------------
data_store       data structure   kmeanstext's data structure (list of dtm + text)
kmeans_cluster   cluster fit      Fits a kmeans cluster model
assign_cluster   assignment       Extracts clusters for document/text elements
get_text         extraction       Get text from various kmeanstext objects
get_dtm          extraction       Get tm::DocumentTermMatrix from various kmeanstext objects
get_removed      extraction       Get removed text elements from various kmeanstext objects
get_terms        extraction       Get clustered weighted important terms from an assign_cluster object
get_documents    extraction       Get clustered documents from an assign_cluster object
To download the development version of kmeanstext:

Download the zip ball or tar ball, decompress and run R CMD INSTALL on it, or use the pacman package to install the development version:
if (!require("pacman")) install.packages("pacman")
pacman::p_load_gh(
"trinker/textshape",
"trinker/gofastr",
"trinker/termco",
"trinker/hclusttext",
"trinker/kmeanstext"
)
You are welcome to:

- submit suggestions and bug reports at: https://github.com/trinker/kmeanstext/issues
- send a pull request on: https://github.com/trinker/kmeanstext/
- compose a friendly e-mail to: tyler.rinker@gmail.com
if (!require("pacman")) install.packages("pacman")
pacman::p_load(kmeanstext, dplyr, textshape, ggplot2, tidyr)
data(presidential_debates_2012)
The data structure for kmeanstext is very specific. The data_store function produces a DocumentTermMatrix which maps to the original text. The empty/removed documents are tracked within this data structure, making subsequent calls to cluster the original documents and produce weighted important terms more robust. Making the data_store object is the first step of an analysis.
We can give the DocumentTermMatrix rownames via the doc.names argument. If these names are not unique they will be combined into a single document, as seen below. Also, if you want to do stemming, minimum character length filtering, stopword removal, or the like, this is when/where it's done.
ds <- with(
presidential_debates_2012,
data_store(dialogue, doc.names = paste(person, time, sep = "_"))
)
ds
## Text Elements : 10
## Elements Removed : 0
## Documents : 10
## Terms : 3,369
## Non-/sparse entries: 7713/25977
## Sparsity : 77%
## Maximal term length: 16
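As an aside, the pieces of this structure can be pulled back out with the extraction functions from the table above. A small sketch, assuming (per that table) they accept a data_store object; output omitted:

get_text(ds)     ## the original text elements
get_dtm(ds)      ## the underlying tm::DocumentTermMatrix
get_removed(ds)  ## any empty text elements that were dropped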
Next we can fit a kmeans cluster model to the data_store object via kmeans_cluster. Note that, unlike hclustext's hierarchical_cluster, we must provide the k (number of topics) to the model.

By default kmeans_cluster uses an approximation of k based on Can & Ozkarahan's (1990) formula, (m * n) / t, where m and n are the dimensions of the matrix A and t is the number of non-zero elements in A.

There are other means of determining k as well. See Ben Marwick's StackOverflow post for a detailed exploration.
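Plugging the data_store printout from above into that formula shows where the default estimate would land for these data (we supply k explicitly below anyway):

## (m * n) / t with m = 10 documents, n = 3,369 terms, t = 7,713 non-zero entries
(10 * 3369) / 7713

## [1] 4.36795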
set.seed(100)
myfit <- kmeans_cluster(ds, k=6)
str(myfit)
## List of 9
## $ cluster : Named int [1:10] 4 3 2 1 1 6 2 1 1 5
## ..- attr(*, "names")= chr [1:10] "CROWLEY_time 2" "LEHRER_time 1" "OBAMA_time 1" "OBAMA_time 2" ...
## $ centers : num [1:6, 1:3369] 0 0 0 0 0 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:6] "1" "2" "3" "4" ...
## .. ..$ : chr [1:3369] "good" "evening" "from" "hofstra" ...
## $ totss : num 0.00897
## $ withinss : num [1:6] 0.00097 0.000458 0 0 0 ...
## $ tot.withinss: num 0.00143
## $ betweenss : num 0.00754
## $ size : int [1:6] 4 2 1 1 1 1
## $ iter : int 2
## $ ifault : int 0
## - attr(*, "class")= chr [1:2] "kmeans_cluster" "kmeans"
## - attr(*, "text_data_store")=<environment: 0x0000000048e6d838>
The assign_cluster function allows the user to extract the clusters and the documents they are assigned to. Unlike hclustext's assign_cluster, the kmeanstext version has no k argument and merely extracts the cluster assignments from the model.
ca <- assign_cluster(myfit)
ca
## CROWLEY_time 2 LEHRER_time 1 OBAMA_time 1 OBAMA_time 2
## 4 3 2 1
## OBAMA_time 3 QUESTION_time 2 ROMNEY_time 1 ROMNEY_time 2
## 1 6 2 1
## ROMNEY_time 3 SCHIEFFER_time 3
## 1 5
To check the number of documents loading on a cluster there is a summary method for assign_cluster which provides a descending data frame of clusters and counts. Additionally, a horizontal bar plot shows the document loadings on each cluster.
summary(ca)
## cluster count
## 1 1 4
## 2 2 2
## 3 3 1
## 4 4 1
## 5 5 1
## 6 6 1
The user can grab the texts from the original documents, grouped by cluster, using the get_text function. Here I demo a 40-character substring of the document texts.
get_text(ca) %>%
lapply(substring, 1, 40)
## $`1`
## [1] "Jeremy, first of all, your future is bri"
## [2] "Well, my first job as commander in chief"
## [3] "Thank you, Jeremy. I appreciate your you"
## [4] "Thank you, Bob. And thank you for agreei"
##
## $`2`
## [1] "Jim, if I if I can just respond very qui"
## [2] "What I support is no change for current "
##
## $`3`
## [1] "We'll talk about specifically about heal"
##
## $`4`
## [1] "Good evening from Hofstra University in "
##
## $`5`
## [1] "Good evening from the campus of Lynn Uni"
##
## $`6`
## [1] "Mister President, Governor Romney, as a "
As with many topic clustering techniques, it is useful to get the top salient terms from the model. The get_terms function uses the centers from the kmeans output. Notice the absence of clusters 1 & 2. This is a result of lower weights (more diverse term use) across these clusters.
get_terms(ca, .008)
## $`1`
## NULL
##
## $`2`
## NULL
##
## $`3`
## term weight
## 1 minutes 0.017257547
## 2 minute 0.015156189
## 3 segment 0.009093713
## 4 repealed 0.009093713
## 5 views 0.008673442
## 6 improve 0.008673442
##
## $`4`
## term weight
## 1 mister 0.008925013
##
## $`5`
## term weight
## 1 segment 0.01446184
##
## $`6`
## term weight
## 1 department 0.01709397
## 2 chu 0.01139598
## 3 stated 0.01139598
## 4 w 0.01139598
## 5 misperception 0.01139598
The min.weight hyperparameter sets the lower bound on the centers value to accept. If you don't get any terms you may want to lower this. Likewise, this parameter can be raised (and nrow lowered) to eliminate noise.
get_terms(ca, .002, nrow=10)
## $`1`
## term weight
## 1 sure 0.002350754
##
## $`2`
## term weight
## 1 insurance 0.005614675
## 2 health 0.003503115
## 3 medicare 0.003142618
## 4 regulation 0.003033859
## 5 care 0.002762083
## 6 banks 0.002694781
## 7 costs 0.002114392
## 8 dodd 0.002007181
## 9 frank 0.002007181
## 10 they 0.002006281
##
## $`3`
## term weight
## 1 minutes 0.017257547
## 2 minute 0.015156189
## 3 segment 0.009093713
## 4 repealed 0.009093713
## 5 views 0.008673442
## 6 improve 0.008673442
## 7 federal 0.007832898
## 8 yeah 0.006802737
## 9 specific 0.006062475
## 10 quality 0.006062475
##
## $`4`
## term weight
## 1 mister 0.008925013
## 2 along 0.006325015
## 3 sort 0.005960397
## 4 unemployed 0.005554852
## 5 thanks 0.005194275
## 6 romney 0.004924145
## 7 town 0.004166139
## 8 hall 0.004166139
## 9 short 0.004166139
## 10 questions 0.004155420
##
## $`5`
## term weight
## 1 segment 0.014461836
## 2 pakistan 0.007212314
## 3 segments 0.004597824
## 4 soviet 0.004597824
## 5 declare 0.004597824
## 6 minutes 0.003659317
## 7 iran 0.003659317
## 8 questions 0.003606157
## 9 gentlemen 0.003606157
## 10 war 0.003460208
##
## $`6`
## term weight
## 1 department 0.017093970
## 2 chu 0.011395980
## 3 stated 0.011395980
## 4 w 0.011395980
## 5 misperception 0.011395980
## 6 graduate 0.007965448
## 7 bush 0.006861063
## 8 oh 0.005958716
## 9 earn 0.005958716
## 10 george 0.005958716
Here I plot the clusters, terms, and documents (grouping variables) together as a combined heatmap. This can be useful for viewing & comparing what documents are clustering together in the context of the cluster's salient terms. This example also shows how to use the cluster terms as a lookup key to extract probable salient terms for a given document.
key <- data_frame(
cluster = 1:6,
labs = get_terms(ca, .002) %>%
bind_list("cluster") %>%
select(-weight) %>%
group_by(cluster) %>%
slice(1:10) %>%
na.omit() %>%
group_by(cluster) %>%
summarize(term=paste(term, collapse=", ")) %>%
apply(., 1, paste, collapse=": ")
)
ca %>%
bind_vector("id", "cluster") %>%
separate(id, c("person", "time"), sep="_") %>%
tbl_df() %>%
left_join(key) %>%
mutate(n = 1) %>%
mutate(labs = factor(labs, levels=rev(key[["labs"]]))) %>%
unite("time_person", time, person, sep="\n") %>%
select(-cluster) %>%
complete(time_person, labs) %>%
mutate(n = factor(ifelse(is.na(n), FALSE, TRUE))) %>%
ggplot(aes(time_person, labs, fill = n)) +
geom_tile() +
scale_fill_manual(values=c("grey90", "red"), guide=FALSE) +
labs(x=NULL, y=NULL)
## Joining by: "cluster"
The get_documents function grabs the documents associated with a particular cluster. This is most useful in cases where the number of documents is small and they have been given names.
get_documents(ca)
## $`1`
## [1] "OBAMA_time 2" "OBAMA_time 3" "ROMNEY_time 2" "ROMNEY_time 3"
##
## $`2`
## [1] "OBAMA_time 1" "ROMNEY_time 1"
##
## $`3`
## [1] "LEHRER_time 1"
##
## $`4`
## [1] "CROWLEY_time 2"
##
## $`5`
## [1] "SCHIEFFER_time 3"
##
## $`6`
## [1] "QUESTION_time 2"
I like working in a chain. In the setup below we work within a magrittr pipeline to fit a model, select clusters, and examine the results. In this example I do not condense the 2012 Presidential Debates data by speaker and time, rather leaving every sentence as a separate document. On my machine the initial data_store and model fit take ~35 seconds to run. Note that I do restrict the number of clusters (for texts and terms) to a random 5 clusters for the sake of space.
.tic <- Sys.time()
myfit2 <- presidential_debates_2012 %>%
with(data_store(dialogue)) %>%
kmeans_cluster(k=100)
difftime(Sys.time(), .tic)
## Time difference of 35.46239 secs
## View Document Loadings
ca2 <- assign_cluster(myfit2)
summary(ca2) %>%
head(12)
## cluster count
## 1 4 1520
## 2 15 134
## 3 97 124
## 4 74 87
## 5 92 82
## 6 89 73
## 7 11 71
## 8 43 51
## 9 70 47
## 10 90 42
## 11 30 34
## 12 81 32
## Split Text into Clusters
set.seed(3); inds <- sort(sample.int(100, 5))
get_text(ca2)[inds] %>%
lapply(head, 10)
## $`17`
## [1] "Questions remain."
##
## $`32`
## [1] "That's not going to happen."
## [2] "Many will lose it."
## [3] "I will not cut our commitment to our military."
## [4] "Their questions will drive the night."
## [5] "Will will you certainly will have lots of time here coming up."
## [6] "I will never know."
## [7] "I will not let that happen."
## [8] "I know how to make that happen."
## [9] "I know how to make that happen."
## [10] "I will."
##
## $`38`
## [1] "Is there too much?"
## [2] "Our party has been focused too long."
## [3] "And I suspect he'll keep those too."
## [4] "He said, Me too."
## [5] "He said, Me too."
## [6] "He said, Me too."
## [7] "It's too high."
##
## $`58`
## [1] "Can we can we stay on Medicare?"
## [2] "Let's get back to Medicare."
## [3] "Let's get back to Medicare."
## [4] "of you on Medicare?"
## [5] "We didn't cut Medicare."
## [6] "Of course, we don't have Medicare, but we didn't cut Medicare by dollar seven hundred sixteen billion."
## [7] "You'll see chronic unemployment."
## [8] "You'll have four million people who will lose Medicare Advantage."
## [9] "You'll have hospital and providers that'll no longer accept Medicare patients."
## [10] "I'll restore that dollar seven hundred sixteen billion to Medicare."
##
## $`80`
## [1] "Candy, Candy|" "Candy?" "Candy?" "Candy, I'm|"
## Get Associated Terms
get_terms(ca2, term.cutoff = .07)[inds]
## $`17`
## term weight
## 1 remain 4.753402
## 2 questions 4.253402
##
## $`32`
## term weight
## 1 happen 0.6435874
## 2 will 0.5188045
## 3 not 0.3571543
## 4 i 0.1601698
## 5 that 0.1290930
## 6 know 0.1279324
## 7 help 0.1268966
## 8 needs 0.1035649
##
## $`38`
## term weight
## 1 too 1.6618079
## 2 me 0.5322476
## 3 he 0.5077092
## 4 said 0.4417606
## 5 high 0.3772305
## 6 much 0.2277686
## 7 suspect 0.2144246
## 8 it's 0.2104448
## 9 he'll 0.2024865
## 10 focused 0.1874464
## 11 party 0.1820784
## 12 there 0.1625931
## 13 long 0.1346921
## 14 keep 0.1285173
## 15 those 0.1010656
##
## $`58`
## term weight
## 1 medicare 0.8723376
## 2 sales 0.2918557
## 3 you'll 0.2910676
## 4 lose 0.2860665
## 5 chronic 0.2397251
## 6 didn't 0.2151903
## 7 let's 0.1944793
## 8 we 0.1944361
## 9 back 0.1792507
## 10 cut 0.1564845
## 11 unemployment 0.1545696
## 12 get 0.1340996
## 13 see 0.1215495
## 14 on 0.1178543
## 15 billion 0.1083292
## 16 turning 0.1065445
## 17 sixteen 0.1065093
## 18 can 0.1059886
##
## $`80`
## term weight
## 1 candy 6.5684530
## 2 i'm 0.6042973