knitr::read_chunk(here::here("code", "chunk-options.R"))



Analyzing color frequency with TF-IDF

A few primary colors along with black, gray, and white make up the majority of the brick colors in the LEGO dataset. In text mining, one might remove a set of stop words that are frequent in all texts but add no meaning to a statistical analysis of the text. In our case there are just 125 unique colors, so I did not use a color stop list.

TF-IDF is one way to find the words that are most meaningful to each document. A word's frequency within a document is weighted inversely by the number of documents the word occurs in, so a high TF-IDF score marks a term that is distinctive of a particular document.
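As a worked illustration of the weighting described above, here is a minimal base R sketch on a toy dataset. The `bricks` data frame and its `set` and `color` columns are hypothetical and are not taken from the LEGO data. The sketch assumes the common definition in which TF is a term's share of its document and IDF is the natural log of the number of documents divided by the number of documents containing the term.

```r
# Toy data (hypothetical): one row per brick, giving its set and color.
bricks <- data.frame(
  set   = c("A", "A", "A", "A", "B", "B", "B", "C", "C", "C"),
  color = c("Black", "Black", "Red", "Pink",
            "Black", "Red", "Red",
            "Black", "Black", "Red"),
  stringsAsFactors = FALSE
)

# n: how often each color occurs in each set (rows = sets, columns = colors).
n <- table(bricks$set, bricks$color)

# TF: a color's share of the bricks in its set.
tf <- n / rowSums(n)

# IDF: natural log of (number of sets / number of sets containing the color).
idf <- log(nrow(n) / colSums(n > 0))

# TF-IDF: multiply each color's column of tf by that color's idf.
tf_idf <- sweep(tf, 2, idf, `*`)

round(unclass(tf_idf), 3)
```

In this toy example Black and Red occur in all three sets, so their IDF is zero and every score for them is zero. Only Pink in set A gets a nonzero score (0.25 × ln 3 ≈ 0.275), because Pink occurs in no other set.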

We can apply this directly to the LEGO data to see which colors are distinctive of particular LEGO sets.

In the following code, we make a summary table of color counts (the analogue of word counts) for each set and for the whole corpus, and a tidytext function computes the TF, IDF, and TF-IDF scores.
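The steps just described might look roughly like the sketch below. This is not the package's own code: the `bricks` tibble and the column names `set_name` and `color` are hypothetical, and the sketch assumes the tidytext function in question is `bind_tf_idf()`, which takes a one-row-per-term-per-document table of counts.

```r
library(dplyr)
library(tidytext)

# Toy data (hypothetical): one row per brick, giving its set and color.
bricks <- tibble(
  set_name = c("A", "A", "A", "A", "B", "B", "B", "C", "C", "C"),
  color    = c("Black", "Black", "Red", "Pink",
               "Black", "Red", "Red",
               "Black", "Black", "Red")
)

# Color counts per set: one row per set-color pair, with the count in `n`.
set_colors <- bricks %>%
  count(set_name, color)

# Color counts over the whole corpus, most frequent first.
corpus_colors <- bricks %>%
  count(color, sort = TRUE)

# Add tf, idf, and tf_idf columns, treating colors as terms and sets as documents.
color_tf_idf <- set_colors %>%
  bind_tf_idf(term = color, document = set_name, n = n)

color_tf_idf
```

The result keeps one row per set-color pair, which makes it easy to sort or filter by score afterwards.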

devtools::load_all()
knitr::read_chunk(here::here("code", "tf-idf.R"))

This code follows the code in Text Mining with R, the tidytext book.


Top TF-IDF

The top TF-IDF scores for set-color pairs should show us the colors that are most distinctive of each set.
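One way to pull out the most distinctive color for each set is to sort by score and keep the first row per set. The sketch below uses base R on a small table of made-up scores. The set names, colors, and `tf_idf` values are hypothetical, chosen only for illustration, and are not results from the LEGO data.

```r
# Hypothetical scores, for illustration only (not values from the LEGO data).
scores <- data.frame(
  set_name = c("Castle", "Castle", "Castle", "Garden", "Garden", "Garden"),
  color    = c("Black", "Dark Gray", "Gold", "Black", "Green", "Pink"),
  tf_idf   = c(0.00, 0.05, 0.12, 0.00, 0.03, 0.20),
  stringsAsFactors = FALSE
)

# Sort from the highest score to the lowest.
ranked <- scores[order(-scores$tf_idf), ]

# Keep the first (highest-scoring) row for each set.
top_per_set <- ranked[!duplicated(ranked$set_name), ]

top_per_set
```

With these made-up numbers, the top color is Pink for the Garden set and Gold for the Castle set, while Black, which appears in both sets, scores zero.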



Low TF-IDF

Set-color combinations with low TF-IDF scores should correspond to colors that are common across sets and are also common within a set. Here we make a plot similar to the last two, but for the set-color combinations with the lowest TF-IDF scores. The lowest TF-IDF scores are associated with big sets whose colors show up in many sets.
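To see why colors that show up in many sets end up with low scores, it can help to look at how the IDF weight shrinks as a color spreads across the corpus. The sketch below assumes the natural-log IDF definition. The corpus size of 1000 sets and the within-set frequency of 0.3 are hypothetical values picked for illustration.

```r
# Hypothetical corpus size and numbers of sets containing a given color.
n_sets <- 1000
sets_with_color <- c(1, 10, 100, 500, 1000)

# IDF falls as the color appears in more sets and reaches 0 when it is in all of them.
idf <- log(n_sets / sets_with_color)

# Hypothetical within-set frequency: the color makes up 30% of the set's bricks.
tf <- 0.3

data.frame(
  sets_with_color = sets_with_color,
  idf             = round(idf, 2),
  tf_idf          = round(tf * idf, 2)
)
```

Even with the within-set frequency held fixed, the score drops steadily as the color appears in more sets, and it is exactly zero for a color found in every set.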






nateaff/legolda documentation built on May 18, 2019, 10:15 a.m.