knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(recolumnize)
library(ggplot2)

Introduction

I will give a quick introduction to using the recolumnize package for data wrangling and cleaning. The intended use cases for this package occur when dealing with very messy data found "in the wild." Sometimes, one encounters data whose columns are transposed or in a meaningless order; that is, a value found in row 1, column 1 may have no connection to the value found in row 2, column 1.

This often occurs due to data entry errors, data that was never intended to be stored in table format, or user/survey data that has not been properly validated (e.g., a user puts their ZIP code in the state field and their state in the address 2 field).

Example

In this example, we will use data collected from high-level League of Legends games. League of Legends is a competitive video game that consists of two teams of 5 players each. Each player selects a character (called a "champion") from a roster of 146 different characters.

data("games_small")
head(games_small,10)

We see that values can appear in any column (for example, "teemo" appears in 3 different columns in just the first 10 rows!). In fact, the order is determined solely by the order in which they were selected, which is not very meaningful for most analysis purposes. Our first step will be to try one-hot-encoding the data using recolumnize::one_hot_encode.

one_hot_encode()

one_hot_encoded <- one_hot_encode(games_small, encode_cols = c(4:8), keep = "exists", min_occurences = 1)
head(one_hot_encoded,10)[,1:9]

The one_hot_encode function creates a new column for each value that exists in the columns to be encoded. It then populates each row with a 1 if that value appears anywhere in the row and a 0 otherwise. (Note that if we set keep = "sum", we could instead store the number of times the value appears in each row.)
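
For intuition, the "exists" encoding for a single value -- here the champion "teemo" -- can be reproduced in base R along the following lines (a sketch of the idea, not the package's internal implementation; has_teemo is just an illustrative name):

has_teemo <- apply(games_small[, 4:8], 1, function(x) as.integer("teemo" %in% x))
head(has_teemo, 10)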

In this case, the usual implementation of one-hot encoding (for example, in the vtreat package) would not work well for this data: each value would get encoded 5 separate times if it appears in all 5 of the original columns.

So this encoding requires substantially fewer variables. It also allows easier generalization (a pattern involving a particular champion applies regardless of which of the original columns that champion appeared in). Can we do better? For modelling/prediction purposes, we are probably best served using one-hot encoding (many algorithms require it -- plus, the reduced dimensionality is quite valuable). However, for data exploration, we may want to instead create some sort of meaningful ordering.

Generally, the 5 champions on each team have 5 separate roles (similar to positions you might see in sports): top, jungle, middle, bottom, and support.

data("champion_dictionary")
dict <- champion_dictionary  # work with the loaded dictionary under a shorter name
rownames(dict) <- dict[, 1]
dict <- dict[-1]  # drop the name column, which is now stored in the row names
head(dict, 10)

We need a dataframe with row names consisting of the values we want to match and with columns representing the probability that each value falls into each category. Every value should have a probability for every category, so this method is not well suited to datasets without much structure (in such a case, using the recategorize function will probably be more fruitful).
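
As a toy illustration of the expected structure (the champions and numbers below are made up purely for illustration and are not taken from champion_dictionary), such a dictionary might look like:

toy_dict <- data.frame(
  top = c(0.70, 0.05), jungle = c(0.10, 0.05), mid = c(0.10, 0.15),
  bot = c(0.05, 0.55), support = c(0.05, 0.20),
  row.names = c("teemo", "ashe")
)
toy_dict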

best_categories_brute_force()

categorized <- best_categories_brute_force(games_small, dict, encode_cols = c(4:8), ignore_warning = TRUE)

Note that calling best_categories_brute_force() will be very slow on large datasets, as it is O(n*k!) because it calculates probabilities for every permutation of the columns. However, it can be useful to run it on a subset of your data and compare it to the approximate version, as we will do below.
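
To make the O(n*k!) cost concrete, here is a rough sketch of the brute-force idea for a single row (an illustration of the approach, not the package's actual internals; all_perms and best_assignment_one_row are purely illustrative helpers): score every permutation of the k values across the k categories by the product of their dictionary probabilities, and keep the best-scoring one.

all_perms <- function(v) {
  # return every ordering of the elements of v (k! of them)
  if (length(v) <= 1) return(list(v))
  out <- list()
  for (i in seq_along(v)) {
    for (rest in all_perms(v[-i])) out <- c(out, list(c(v[i], rest)))
  }
  out
}

best_assignment_one_row <- function(row_values, probs) {
  perms <- all_perms(row_values)
  scores <- sapply(perms, function(p) {
    # product of dictionary probabilities when value p[j] is placed in category j
    prod(mapply(function(val, col) probs[val, col], p, seq_along(p)))
  })
  perms[[which.max(scores)]]
}

best_assignment_one_row(as.matrix(games_small[1, 4:8])[1, ], dict)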

head(categorized)

Now, instead of seeing "teemo" in 3 different columns, he mostly appears in the "top" column. Likewise, we can already see other examples of values recurring in a consistent column. This allows us to do better analysis. For one, we can now compare a player to his counterpart on the other team. It also lets us capture the interaction between champion and role, rather than just the presence of the champion.

bars <- c(
  sum(apply(games_small[, 4:8], 2, function(x) length(unique(x)))),
  sum(apply(categorized[, 4:8], 2, function(x) length(unique(x)))),
  sum(apply(one_hot_encoded[, 4:148], 2, function(x) length(unique(x))))
)
ggplot(data = NULL, aes(x = c("original data", "categorized", "one-hot-encoded"),
                        y = bars,
                        fill = c("#2b8cbe", "#e34a33", "#fdbb84"))) +
  geom_bar(stat = "identity") +
  scale_fill_identity() +
  coord_flip() +
  theme(legend.position = "none") +
  labs(y = "Total cardinality of variables needed", x = "")

We have still somewhat reduced the total cardinality of the predictors relative to the original dataset, although not by as much as one-hot encoding.

best_categories_approximate()

Now, let's use the faster best_categories_approximate() function. This function works greedily: it checks each category, finds the largest percentage difference between the most likely and second most likely value, fixes that assignment, and then proceeds on the problem with one fewer category and one fewer value. This is substantially faster than checking every possible permutation.
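
As a simplified sketch of that greedy idea for a single row (again an illustration, not the package's actual implementation, and using a plain probability margin rather than the percentage difference described above): at each step, take the remaining category whose best remaining value beats its runner-up by the largest margin, fix that assignment, and repeat on the smaller problem.

greedy_assign_one_row <- function(row_values, probs) {
  remaining_vals <- row_values
  remaining_cats <- colnames(probs)
  assignment <- setNames(rep(NA_character_, ncol(probs)), colnames(probs))
  while (length(remaining_cats) > 0) {
    sub <- as.matrix(probs[remaining_vals, remaining_cats, drop = FALSE])
    # margin between the best and second-best remaining value for each category
    margins <- apply(sub, 2, function(p) {
      s <- sort(p, decreasing = TRUE)
      if (length(s) > 1) s[1] - s[2] else s[1]
    })
    pick_cat <- remaining_cats[which.max(margins)]
    pick_val <- remaining_vals[which.max(sub[, pick_cat])]
    assignment[pick_cat] <- pick_val
    remaining_cats <- setdiff(remaining_cats, pick_cat)
    remaining_vals <- setdiff(remaining_vals, pick_val)
  }
  assignment
}

greedy_assign_one_row(as.matrix(games_small[1, 4:8])[1, ], dict)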

categorized_fast <- best_categories_approximate(games_small, dict, encode_cols = c(4:8))
head(categorized_fast)

At first glance, it looks quite similar to the brute force output from above. But we may want to check how often the two differ.

# Multiply together the per-column dictionary probabilities of a categorized row
get_probability_by_row <- function(row, probs) {
  likelihood <- 1
  for (i in seq_along(row)) {
    likelihood <- likelihood * probs[row[i], i]
  }
  return(likelihood)
}

probs_bf <- as.numeric(apply(categorized[, 4:8], 1, get_probability_by_row, dict))
probs_approx <- as.numeric(apply(categorized_fast[, 4:8], 1, get_probability_by_row, dict))
diffs <- probs_bf - probs_approx

Now we can check the mean difference in probabilities between the two methods, which is r mean(probs_bf-probs_approx). We may also want to check how often the two methods disagree, which happens r sum(rowSums(categorized[,4:8] != categorized_fast[,4:8]) > 0) times, with a mean difference of r mean(diffs[(rowSums(categorized[,4:8] != categorized_fast[,4:8]) > 0)]) in those cases. So when the two methods disagree (at least in this dataset), the two answers are usually close to equally likely.
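
For reference, the quantities used inline above can also be computed explicitly:

disagree <- rowSums(categorized[, 4:8] != categorized_fast[, 4:8]) > 0
mean(probs_bf - probs_approx)  # mean difference in probabilities overall
sum(disagree)                  # number of rows where the two methods differ
mean(diffs[disagree])          # mean difference on just those rows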

# Return the dictionary probability of each individual (value, column) placement
get_individual_probabilities <- function(row, probs) {
  prob_vec <- rep(NA, length(row))
  for (i in seq_along(row)) {
    prob_vec[i] <- probs[row[i], i]
  }
  return(prob_vec)
}

categorized <- as.data.frame(categorized)
ind_probs <- as.vector(apply(categorized[, 4:8], 1, get_individual_probabilities, dict))
# apply() returns one column per row of the data, so the flattened vector cycles
# through the five roles within each game rather than running in blocks of one role
ind_probs_df <- data.frame(probs = ind_probs,
                           role = rep(c("top", "jungle", "mid", "bot", "support"),
                                      times = nrow(categorized)))

ggplot(ind_probs_df, aes(probs, fill = role)) +
  geom_density() +
  facet_wrap(~role) +
  theme(legend.position = "none") +
  labs(x = "Individual probability of placing an item correctly")

We can see that, overall, most values seem to be placed in a category where they have a high probability of belonging (and there doesn't appear to be much of a difference between roles).

At this point, we could proceed with modelling or more data exploration with fairly strong confidence in our encoding method.


