library(flexdashboard)
library(tidyverse)
library(RColorBrewer)
library(shiny)

item_matrix <- read_csv("~/Desktop/latent-variables/item-matrix.csv")

item_matrix$cluster <- as.factor(item_matrix$cluster)

latent_features <- item_matrix %>% select(tidyselect::matches("f[0-9]")) %>% names()
n_clusters <- length(unique(item_matrix$cluster))

cluster_sizes <- item_matrix %>% count(cluster)

cols <- scales::hue_pal()(n_clusters)

clusters <- item_matrix$cluster %>% 
  as.numeric() %>% 
  unique() %>% 
  sort()

top_10 <- item_matrix %>%
  select(tleo, latent_features) %>% 
  gather(latent_feature, value, -tleo) %>%
  group_by(latent_feature) %>%
  top_n(10, value) %>%
  ungroup() %>%
  mutate(col = "positive")

bottom_10 <- item_matrix %>%
  select(tleo, latent_features) %>% 
  gather(latent_feature, value, -tleo) %>%
  group_by(latent_feature) %>%
  top_n(-10, value) %>%
  ungroup() %>%
  mutate(col = "negative")

total <- rbind(top_10, bottom_10)

th <- theme_minimal(base_size = 18) +
  theme(text = element_text(family = "BBC Reith Sans"))

theme_set(th)

Overview

Inputs {.sidebar}

Row

Overview

What is a latent variable?

Latent variables (also known as hidden variables) are those that are not directly observed, but rather inferred from other, observed variables.

Here's a medical analogy. The common cold (the 'latent variable') cannot be directly observed by a GP. However, the virus produces a number of symptoms (the 'observed variables') - a blocked nose, a sore throat and headaches - that enable us to make the diagnosis.

What are some other examples of latent variables?

In psychology, a latent variable might be something like extraversion or spatial ability - these qualities are not directly observed, but measured through a set of other variables (e.g. answers to a psychometric test).

Examples of latent variables from economics include quality of life, business confidence, morale, happiness and conservatism: these are all variables which cannot be measured directly. But linking these latent variables to other, observable variables, the values of the latent variables can be inferred from measurements of the observable variables.

How are latent variables extracted?

Latent variables are extracted from a set of observed variables using some kind of statistical procedure. There are various such methods, depending on the application. Some common methods include principal components analysis and factor analysis, which is common in psychology and the behavioural sciences.

For this project, we extract latent variables using a technique called matrix factorisation. The mathematics behind matrix factorisation is complex, but the method is useful because it allows us to explore latent variables in relation to both our programmes and our users.

renderImage({
  list(src = "../img/matrix-factorisation.png")
}, deleteFile = FALSE)

Item matrix: Features

Inputs {.sidebar}

When building recommendation models, we unearth information about 'hidden' dimensions of our shows. These hidden dimensions become avaialble to us when we look at how the audience co-view different groups of shows. The problem is that these dimensions don't have names - we need to work out what they represent. If we do, we can identify different areas where we might have less content than intended.

To help us get a better understanding of this 'hidden' landscape, we have coloured shows that have roughly similar values across each dimension.

selectInput("column1", "Select the first latent variable:",
            choices = latent_features)

selectInput("column2", "Select the second latent variable:",
            choices = latent_features, selected = latent_features[2])

Column {data-width=600}

Compare two hidden dimensions

renderPlot({
  item_matrix %>% 
    ggplot(aes_string(x = input$column1, y = input$column2,
                      label = "tleo", col = "cluster")) +
    geom_text(check_overlap = TRUE, fontface = "bold", key_glyph = draw_key_point) +
    geom_vline(xintercept = 0, linetype = "dashed") +
    geom_hline(yintercept = 0, linetype = "dashed") +
    guides(col = guide_legend(title = "Cluster")) +
    theme(legend.position = "bottom")
})

Column {data-width=400}

Top/bottom programmes of the x-dimension

renderPlot({
  total %>%
    mutate(tleo = tidytext::reorder_within(tleo, value, latent_feature)) %>%
    filter(latent_feature == input$column1) %>%
    ggplot(aes(tleo, value, fill = col)) +
    geom_col() +
    coord_flip() +
    tidytext::scale_x_reordered() +
    guides(fill = FALSE) +
    theme(axis.title = element_blank())
})

Top/bottom programmes of the y-dimension

renderPlot({
  total %>%
    mutate(tleo = tidytext::reorder_within(tleo, value, latent_feature)) %>%
    filter(latent_feature == input$column2) %>%
    ggplot(aes(tleo, value, fill = col)) +
    geom_col() +
    coord_flip() +
    tidytext::scale_x_reordered() +
    guides(fill = FALSE) +
    theme(axis.title = element_blank())
})

Item matrix: Clusters

Inputs {.sidebar}

We have clustered BBC shows according to their values for each of the hidden dimensions. For example, if two shows have a very high score on Dimensions 1-10 but a very low score on Dimensions 11-20, they are likely to appear in the same cluster. In theory, this should mean that shows within each cluster should be similar – whether in genre, tone or audience.

The number of clusters is optimised so that the clusters are distinct from one another, but also so that the shows within each of them are similar.

selectInput("feature1", "Select a latent feature:",
            choices = latent_features, selected = latent_features[1])

selectInput("feature2", "Select a latent feature:",
            choices = latent_features, selected = latent_features[2])

selectInput("cluster1", "Select a cluster:",
            choices = clusters, selected = clusters[1])

# selectInput("cluster2", "Select a cluster:",
#             choices = clusters, selected = clusters[2])

Column {data-width=600}

Compare clusters by latent features

renderPlot({
  col <- cols[as.numeric(input$cluster1)]

  item_matrix %>% 
    select(cluster, latent_features) %>% 
    group_by(cluster) %>% 
    summarise_all(mean) %>% 
    left_join(cluster_sizes) %>% 
    mutate(flag = ifelse(cluster == input$cluster1, "1", "0")) %>% 
    mutate(label = paste("Cluster ", cluster, ":\n", n, " shows", sep = "")) %>% 
    ggplot(aes_string(input$feature1, input$feature2, col = "flag", label = "label")) +
    geom_text(vjust = 1.5, family = "BBC Reith Sans", fontface = "bold", size = 5,
              check_overlap = TRUE) +
    geom_point(aes_string(size = "n"), alpha = 0.25) +
    geom_vline(xintercept = 0, linetype = "dashed") +
    geom_hline(yintercept = 0, linetype = "dashed") +
    scale_size(range = c(5, 30)) +
    scale_color_manual(values = c("grey", col)) +
    guides(col = FALSE, size = FALSE)
})

Column {data-width=400 .tabset}

Genre

renderPlot({
  item_matrix %>% 
    filter(!is.na(genre)) %>% 
    filter(genre != "null") %>% 
    filter(cluster == input$cluster1) %>% 
    count(genre) %>% 
    mutate(prop = n / sum(n)) %>% 
    mutate(genre = reorder(genre, prop)) %>% 
    ggplot(aes(genre, prop, fill = prop)) +
    geom_col() +
    coord_flip() +
    scale_y_continuous(labels = scales::percent) +
    guides(fill = FALSE) +
    labs(x = "Genre", y = "Percentage of cluster")
})

Master Brand

renderPlot({
  item_matrix %>% 
    filter(!is.na(genre)) %>% 
    mutate(master_brand_name = as.factor(master_brand_name)) %>% 
    filter(cluster == input$cluster1) %>% 
    count(master_brand_name) %>% 
    mutate(prop = n / sum(n)) %>% 
    mutate(master_brand_name = reorder(master_brand_name, prop)) %>% 
    ggplot(aes(master_brand_name, prop, fill = prop)) +
    geom_col() +
    coord_flip() +
    scale_y_continuous(labels = scales::percent) +
    guides(fill = FALSE) +
    labs(x = "Master brand", y = "Percentage of cluster")
})

Top latent features

renderPlot({
  item_matrix %>% 
    filter(cluster == input$cluster1) %>% 
    select(cluster, latent_features) %>% 
    group_by(cluster) %>% 
    summarise_all(mean) %>% 
    gather(latent_feature, value, -cluster) %>% 
    mutate(latent_feature = reorder(latent_feature, value)) %>% 
    mutate(col = ifelse(value > 0, "positive", "negative")) %>% 
    ggplot(aes(latent_feature, value, fill = col)) +
    geom_col() +
    coord_flip() +
    guides(fill = FALSE) +
    labs(x = "Latent feature", y = "Value")
})

Example programmes

# renderText({
#   paste(item_matrix$tleo[item_matrix$cluster==input$cluster1], collapse = " • ")
# })

renderPrint({
  item_matrix %>% 
    filter(cluster == input$cluster1) %>% 
    select(tleo, dist) %>% 
    arrange(dist) %>% 
    print(n=30)
})

User matrix: Features

# user_matrix <- read_csv("~/Desktop/latent-variables/user-matrix.csv")
# user_matrix <- user_matrix[2:ncol(user_matrix)]
# # 
# user_matrix[2:21] <- scale(user_matrix[2:21])
# # 
# features <- names(user_matrix[2:21])

Inputs {.sidebar}

Explore the users

# selectInput("userfeature1", "Select the first feature:",
#             choices = features, selected = features[1])
# 
# selectInput("userfeature2", "Select the second feature:",
#             choices = features, selected = features[2])

Row

Explore latent feature by gender

# gender <- reactive({
#   user_matrix %>%
#     select(-user, -bbc_hid3, -age_range, -nation, -barb_region, -acorn_cat_id, -acorn_cat,
#          -frequency_band, -stations, -affinities) %>%
#     group_by(gender) %>%
#     summarise_all(mean) %>%
#     gather(key, value, -gender) %>%
#     filter(key == input$userfeature1)
# })

# renderText({
#
#   paste("The value of ", input$userfeature1, " is highest when gender = ",
#         ".")
# })
#
# renderText({
#
#   paste("The value of ", input$userfeature1, " is lowest when gender = ",
#         ".")
# })

# renderPlot({
# 
#   gender() %>%
#     ggplot(aes(gender, value, fill = value)) +
#     geom_col() +
#     coord_flip() +
#     guides(fill = FALSE) +
#     scale_fill_gradient2()
# 
# })

Explore latent feature by age range

# age <- reactive({
#   user_matrix %>% 
#     select(-user, -bbc_hid3, -gender, -nation, -barb_region, -acorn_cat_id, -acorn_cat,
#          -frequency_band, -stations, -affinities) %>% 
#     group_by(age_range) %>% 
#     summarise_all(mean) %>% 
#     gather(key, value, -age_range) %>% 
#     filter(key == input$userfeature1)
# })
# 
# renderPlot({
#   
#   age() %>% 
#     ggplot(aes(age_range, value, fill = value)) +
#     geom_col() +
#     coord_flip() +
#     guides(fill = FALSE) +
#     scale_fill_gradient2()
#   
# })

User matrix: Clusters

Inputs {.sidebar}

We can cluster our users in the same way that we cluster our shows – that is, according to their values for each of the hidden dimensions.

Again, the number of clusters is optimised so that the clusters are distinct from one another, but also so that the users within each of them are similar in terms of the content they consume.




bbc/insights-latent-feature-explorer documentation built on Nov. 3, 2019, 2:08 p.m.