A few primary colors along with black, gray, and white make up the majority of the brick colors in the LEGO dataset. In text mining, one might remove some set of stop words that are frequent in all texts but add no meaning to a statistical analysis of the text. In our case, there are just 125 unique colors and I did not use a color stop list.
TF-IDF (term frequency-inverse document frequency) is one way to find the words that are most meaningful to each document. A word's frequency within a document is weighted inversely by the number of documents the word occurs in, so a high TF-IDF score marks a term that is frequent in one document but rare across the others.
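To make the weighting concrete, here is a minimal base R sketch that computes the three quantities by hand. The documents, words, and counts are invented for illustration. The formulas assume the common definition in which TF is a word's share of its document and IDF is the natural log of the number of documents divided by the number of documents containing the word.

```r
# Toy word counts per document (all values invented for illustration).
counts <- data.frame(
  document = c("a", "a", "a", "b", "b", "c", "c"),
  word     = c("the", "castle", "knight", "the", "racer", "the", "garden"),
  n        = c(6, 3, 1, 2, 8, 5, 5),
  stringsAsFactors = FALSE
)

# Term frequency: a word's share of all words in its document.
counts$tf <- counts$n / ave(counts$n, counts$document, FUN = sum)

# Inverse document frequency: log(total documents / documents containing the word).
# Each document-word pair appears once, so counting rows per word counts documents.
n_docs <- length(unique(counts$document))
docs_with_word <- ave(rep(1, nrow(counts)), counts$word, FUN = sum)
counts$idf <- log(n_docs / docs_with_word)

# TF-IDF is the product of the two.
counts$tf_idf <- counts$tf * counts$idf

# Most distinctive document-word pairs first.
counts[order(-counts$tf_idf), ]
```

In this toy table "the" occurs in every document, so its IDF is log(1) = 0 and its TF-IDF is zero however often it appears. The weighting pushes stop-word-like terms to the bottom without an explicit stop list.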
We can adapt this directly to the LEGO data, treating each set as a document and each brick color as a word, to see which colors are distinctive of particular LEGO sets.
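A minimal sketch of what this step could look like with dplyr and tidytext follows. The `parts` table, its column names, and all values are hypothetical stand-ins rather than the actual LEGO dataset, and the use of `bind_tf_idf()` is an assumption about which tidytext function fits here.

```r
library(dplyr)
library(tidytext)

# Hypothetical inventory: one row per part, with the set it belongs to,
# its color, and how many of that part the set contains (values invented).
parts <- data.frame(
  set      = c("castle", "castle", "castle", "racer", "racer", "garden", "garden"),
  color    = c("Light Gray", "Black", "Light Gray", "Black", "Red", "Black", "Green"),
  quantity = c(40, 30, 20, 20, 80, 25, 75),
  stringsAsFactors = FALSE
)

# Summary table: number of bricks of each color in each set.
set_colors <- parts %>%
  count(set, color, wt = quantity, name = "n")

# Treat each set as a document and each color as a term;
# bind_tf_idf() adds tf, idf, and tf_idf columns.
color_tf_idf <- set_colors %>%
  bind_tf_idf(term = color, document = set, n = n) %>%
  arrange(desc(tf_idf))

color_tf_idf
```

In this toy data Black appears in every set, so its IDF and TF-IDF are zero, while a color confined to one set rises to the top in proportion to its share of that set.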
In the preceding code, we make a summary table of color counts for each set and for the corpus as a whole, and the tidytext function computes the TF, IDF, and TF-IDF scores.
This code follows the approach in the tidy text mining book, Text Mining with R.
The top TF-IDF scores associated with a set-color pair should show us the most distinctive color relative to a set.
Set-color combinations with low TF-IDF scores should correspond to colors that are common across sets and also common within a set. Here we make a plot similar to the last two, but for the set-color combinations with the lowest TF-IDF scores. The lowest scores are associated with big sets whose colors show up in many other sets.