This document covers the details of how the scraped UVList records are processed for merging. It starts by loading the raw scraped data (see the UVList Data Scraping document for details) and concludes with UVList records in the desired structure for merging.

The UVList data were retrieved on 2/1/2016 and consist of all the game records available on the UVList game search pages at that time.

Note: Carefully observe the globals (e.g., run_from_scratch) in the opening code below. The major processes underlying the identification of value misspellings/variations in the game records (especially the game titles) are very lengthy (up to several hours), and resolving identified associations is often best informed by some additional manual interaction with the data to assess whether new auto-resolve rules are appropriate. Unless you are very certain the data need to be re-processed, it is suggested that you keep the globals at their defaults and work with the intermediate data products loaded into the workspace.

Note: If run, this document assumes certain resources (e.g., the UVList raw data .Rds) are available in the local working directory.

Prepare the Workspace

We start by loading supporting packages and setting desired global options.

# Load supporting packages.
library(dplyr)          # Data manipulation.
library(stringr)        # String manipulation.
library(stringdist)     # Fuzzy string matching.

# Set document globals.
run_from_scratch <- FALSE

To keep the later portions of the document concise, we also define our custom helper functions here. Functions are presented in the order they are called.

source("./cleaning_functions.R")

Finally, we load the raw UVList data.

# Load the raw results of the UVList scraping script. This is a list of 
# game-by-system tables (i.e., games will have as many entries as they have
# systems recorded as associated with them).
load("./uvlist_raw.Rds")

# If not running from scratch, we also load all other intermediate products 
# produced by the code in this document.
if(!run_from_scratch) {
    load("./uvlist_intermediate_products.Rds")
}

Create the Raw Dataframe

The UVList data arrive as a list of dataframes, each scraped from one page of the UVList site. Each table is a collection of game-by-system records. In other words, they are structured such that games may have as many records as gaming systems they are associated with. Our first step is to gather these tables into a single R dataframe.

# Pull the dataframe from each list element and apply more appropriate (and
# R-friendly) column names.
df_collection <- lapply(table_collection, function(x) {
    df <- x[[1]]

    names(df) <- c("raw_title", 
                   "publisher_developer", 
                   "release_year", 
                   "platform")

    return(df)
})

# Test for empty dataframes.
row_counts <- sapply(df_collection, nrow)

# Drop dataframe(s) with 0 rows. This should only occur if extra pages were
# grabbed (my code intentionally grabs one page further than was initially
# estimated).
df_collection <- df_collection[!(row_counts == 0)]

# Ensure that all columns are character to simplify cleaning.
df_collection <- lapply(df_collection, function(x) {
    for(col_index in 1:ncol(x)) {
        x[col_index] <- sapply(x[col_index], as.character)
    }

    return(x)
})

# Collapse the results to a single dataframe.
uv_raw_df <- as.data.frame(bind_rows(df_collection), stringsAsFactors = FALSE)

# Clean intermediate products from our workspace.
rm(row_counts, table_collection, df_collection)

# Make a fresh dataframe copy to work with in case we need to appeal to the
# original.
uv_df <- uv_raw_df

Inspection

In this section we do some spot checking of the data to get a sense of the contents of each column and to look for obvious fixes.

# Examine the dataframe structure. 
str(uv_df)

# Observe the first few records. 
head(uv_df)

# Take a closer look at the print behavior of raw_titles. 
print(head(uv_df$raw_title))

# And for publisher_developer. 
print(head(uv_df$publisher_developer))

# Get a sense for the number of levels in our features. 
apply(uv_df, 2, function(x) length(unique(x))) 

We observe a few things immediately: some values display with encoding artifacts, many entries contain junk whitespace characters and stray leading/trailing spaces, and some titles open a paren they never close.

We do some quick global cleaning before we proceed to the dirty work.

# Column values (especially game titles) may look strange if the encoding is
# not specified. We ensure that they will be properly represented as UTF-8.
Encoding(uv_df$raw_title) <- "UTF-8" 
Encoding(uv_df$publisher_developer) <- "UTF-8" 
Encoding(uv_df$release_year) <- "UTF-8" 
Encoding(uv_df$platform) <- "UTF-8"

# Clean out junk space characters.
junk_regex <- "((?! )[[:space:]])|(\\((?! )[[:space:]]\\))"

uv_df$raw_title <- gsub(junk_regex, "", uv_df$raw_title, perl = TRUE)
uv_df$publisher_developer <- gsub(junk_regex, "", uv_df$publisher_developer, 
                                  perl = TRUE)
uv_df$platform <- gsub(junk_regex, "", uv_df$platform, perl = TRUE)

rm(junk_regex)

# Many (if not all) of the titles have extra spaces on the head/tail for some
# reason. We clean those up.
uv_df$raw_title <- str_trim(uv_df$raw_title) 

# Some titles have a lead paren that they fail to close. We ensure they close.
uv_df$raw_title <- sapply(uv_df$raw_title, function(x) {
    if(grepl(" \\([^\\)]*$", x)) {
        return(paste0(x, ")"))
    } else {
        return(x)
    }
}, USE.NAMES = FALSE)

Splitting the Complex Fields: raw_title and publisher_developer

Our first order of business is breaking up our complex fields. In the case of raw_title, this will simplify our hunt for duplicate records and will make for cleaner records when it comes time to merge the UVList records with our other sources. In the case of publisher_developer, this is simply an essential step to pulling apart distinct pieces of information.

We start with raw_title.

# As noted, there are a large number of titles with information added in parens
# (e.g., "title (stuff)") after the title. This information usually appears to
# be alternate spellings or title translations. We see how much of this we can
# separate out.

# The pattern for the added text is a body of parens preceded by a space and
# for which the closing paren is the final character. There appears to be some
# use of ";" within the added text as a separator - we'll also want to replace
# this with a more unique separator (we'll use "----").

# We test for situations where our space-lead-paren pattern might be
# problematic.

# Added text doesn't finish the line? This could occur for records which have 
# parens as part of the title or in cases where the extra text was for some 
# reason inserted mid-title.
head(uv_df$raw_title[grepl("^.*\\).+$", uv_df$raw_title)])

# Parens without a leading space? This could occur for records that have parens 
# as part of the title or where catalogers failed to put the appropriate space 
# between title and extra text. We have no good way to distinguish such
# instances automatically (e.g., tit(le) v. title(extra)).
head(uv_df$raw_title[grepl("[^ ]\\(", uv_df$raw_title)])

# Space, semi-colon, or end-of-line after a paren that is inside a paren set?
# These can't be distinguished from cases where the cataloger entered multiple
# bodies of extra text (e.g., (extra (text) extra) v. (extra (text); extra) v.
# (extra) (extra)).
head(uv_df$raw_title[grepl("\\)(?= |;|$)(?=.*\\))(?! \\()", uv_df$raw_title, 
                           perl = TRUE)])

# For the above cases, we're going to need to do some pre-processing of the
# identified records before we can apply a single regex solution to parse
# the title itself from the extra text.

# We make a copy of the field so that we can keep the original text for
# reference.
raw_title <- uv_df$raw_title

# We identify our problematic records (note: none of the records for the first
# test - the text-after-parens test - were problematic for our regex).
no_lead_space <- raw_title[grepl("[^ ]\\(", raw_title)]
bad_close_paren <- raw_title[grepl("\\)(?= |;|$)(?=.*\\))(?! \\()", 
                                    raw_title, perl = TRUE)]

# We manually inspect these collections to identify which raw_titles need
# pre-processing to correct their formatting. 

# no_lead_space: Other than two special cases, these turn out to be all the 
# cases that would fail for our leading-paren pattern. We can fix these by
# adding a space before the opening paren. You can use the regex patterns
# in the line below to see what elements I remove from no_lead_space.
no_lead_space <- no_lead_space[-grep("(^AMM)|(^hack)", no_lead_space)]
no_lead_space <- no_lead_space[!grepl(" \\(.*?\\)(?= |$|;)", 
                                      no_lead_space, perl = TRUE)]

# bad_close_paren: These are all cases where a closing paren is eventually 
# followed by another closing paren without another opening paren (e.g., (extra 
# (text) extra) v. (extra (text)) (extra)). We can "fix" these by simply
# replacing the problematic closing paren with a different symbol.

# We correct the text for all problematic entries.
raw_title <- sapply(raw_title, function(x) {
    if(x %in% no_lead_space) {
        x <- str_replace(x, "\\(", " (")
    }

    if(x %in% bad_close_paren) {
        x <- str_replace(x, 
                         "\\)(?= |;|$)(?=.*\\))(?! \\()",
                         "]")
    }

    return(x)
}, USE.NAMES = FALSE)

# Make a copy of our fixed-up raw_title that we can manipulate to extract the
# extra text from.
extra_text <- raw_title

# We extract the extra text by pulling out all paren groups that are followed by
# a space, end-of-line, or semi-colon. Because some titles can have multiple 
# extra text groups (e.g., (extra) (extra)), elements in the resulting list may
# be larger than 1.
extra_text <- str_extract_all(extra_text, " \\(.*?\\)(?= |$|;)")

# Our extraction solution captures the leading space. We clean these out.
extra_text <- lapply(extra_text, str_trim)

# Titles without any extra text will produce a zero-length result. We make sure
# these are flagged properly as NA.
extra_text <- lapply(extra_text, function(x) {
    if(length(x) == 0) {
        return(NA)
    } else {
        return(x)
    }
})

# At this point, we have something useful for getting to our trimmed titles: we 
# have a string (or strings) matching that which should be removed from the 
# raw_title string to leave just the trimmed title.
trim_title <- lapply(1:length(extra_text), function(index) {

    pd_string <- raw_title[index]
    rd_list <- extra_text[[index]]

    for(rd_index in 1:length(rd_list)) {
        if(is.na(rd_list[rd_index])) {
            return(pd_string)
        } else {
            pd_string <- gsub(rd_list[rd_index], "", pd_string, fixed = TRUE)
        }
    }

    return(pd_string)
})

# We finish cleaning extra_text (we'll return to trim_title). First, our 
# extraction solution captures the leading and trailing paren. We delete these 
# off to leave just the text.
extra_text <- lapply(extra_text, function(x) str_replace(x, "^.", ""))
extra_text <- lapply(extra_text, function(x) str_replace(x, ".$", ""))

# Next, some games have multiple extra text entries inside the same set of 
# parens but separated by semi-colons (e.g., (extra; extra)). We split these up.
extra_text <- lapply(extra_text, function(x) unlist(str_split(x, ";")))

# Clean up any leading/trailing spaces leftover from the splitting.
extra_text <- lapply(extra_text, str_trim)

# Finally, at this point we have a list of variable-length character vectors 
# but we want a single character vector. For each collection of extracted extra 
# text, we collapse the collection to a single string. Then we collapse the 
# entire list to a single character vector.
extra_text <- lapply(extra_text, function(x) paste(x, collapse = "----"))
extra_text <- unlist(extra_text)

# And back to trim_title. We simply need to clean up any leading/trailing spaces
# and convert the list to a character vector.
trim_title <- lapply(trim_title, str_trim)
trim_title <- unlist(trim_title)

# We still need to clean these for NAs, but we'll do that later when we do our
# general NA cleaning. For now, we add our split columns to our dataframe.
uv_df$trim_title <- trim_title
uv_df$extra_text <- extra_text

# Clean up our intermediate data products.
rm(no_lead_space, bad_close_paren, raw_title, extra_text, trim_title)

We perform a very similar process for splitting our publisher_developer field. Fortunately, this field is slightly more consistent, allowing us to skip some of the pre-processing work.

# Similar to splitting raw_title, because there is a common pattern for
# developer in the string ("publisher (developer)"), we can pull developer out
# into its own column.

# We test for situations where our space-lead-paren pattern might be 
# problematic.

# Added text doesn't finish the line?
head(uv_df$publisher_developer[grepl("^.*\\).+$", uv_df$publisher_developer)])

# Parens without a leading space?
head(uv_df$publisher_developer[grepl("[^ ]\\(", uv_df$publisher_developer)])

# Space, semi-colon, or end-of-line after a paren that is inside a paren set?
head(uv_df$publisher_developer[grepl("\\)(?= |;|$)(?=.*\\))(?! \\()", 
                                     uv_df$publisher_developer, 
                                     perl = TRUE)])

# A unique concern here is that we might have cases where only a developer is 
# listed without a publisher. If catalogers follow convention, this will produce
# records that start with a lead paren (i.e., records our lead-space test will
# not detect because there is no leading space).
head(uv_df$publisher_developer[grepl("^\\(", uv_df$publisher_developer)])

# Only our last test identified problematic records (those starting with a
# lead-paren because only a developer is cataloged).

# We make a copy of the field so that we can keep the original text for
# reference.
publisher_developer <- uv_df$publisher_developer

# We fix our developer-only records by adding a leading space.
publisher_developer <- sapply(publisher_developer, function(x) {
    str_replace(x, "^\\(", " (")
}, USE.NAMES = FALSE)

# Make a copy of our fixed up publisher_developer that we can iterate on to
# extract the developer.
developer <- publisher_developer

# We extract the extra text by pulling out all paren groups that are followed by
# a space, end-of-line, or semi-colon. Because some titles can have multiple 
# extra text groups (e.g., (extra) (extra)), elements in the resulting list may
# be larger than 1.
developer <- str_extract_all(developer, " \\(.*?\\)(?= |$|;)")

# Our extraction solution captures the leading space. We clean these out.
developer <- lapply(developer, str_trim)

# publisher_developer entries without any developer will produce a zero-length
# result. We make sure these are flagged properly as NA.
developer <- lapply(developer, function(x) {
    if(length(x) == 0) {
        return(NA)
    } else {
        return(x)
    }
})

# At this point, we have something useful for getting to our publishers: we 
# have a string (or strings) matching that which should be removed from the 
# publisher_developer string to leave just the publisher.
publisher <- lapply(1:length(developer), function(index) {

    pd_string <- publisher_developer[index]
    rd_list <- developer[[index]]

    for(rd_index in 1:length(rd_list)) {
        if(is.na(rd_list[rd_index])) {
            return(pd_string)
        } else {
            pd_string <- gsub(rd_list[rd_index], "", pd_string, fixed = TRUE)
        }
    }

    return(pd_string)
})

# We finish cleaning developer (we'll return to publisher). First, our 
# extraction solution captures the leading and trailing paren. We delete these 
# off to leave just the text.
developer <- lapply(developer, function(x) str_replace(x, "^.", ""))
developer <- lapply(developer, function(x) str_replace(x, ".$", ""))

# Next, some games have multiple extra text entries inside the same set of 
# parens but separated by semi-colons (e.g., (extra; extra)). We split these up.
developer <- lapply(developer, function(x) unlist(str_split(x, ";")))

# Clean up any leading/trailing spaces leftover from the splitting.
developer <- lapply(developer, str_trim)

# Finally, at this point we have a list of variable-length character vectors 
# but we want a single character vector. For each collection of extracted extra 
# text, we collapse the collection to a single string. Then we collapse the 
# entire list to a single character vector.
developer <- lapply(developer, function(x) paste(x, collapse = "----"))
developer <- unlist(developer)

# And back to publisher. Multiple publishers are possible, so we do the same
# split/clean/combine work.
publisher <- str_split(publisher, ";")
publisher <- lapply(publisher, str_trim)
publisher <- lapply(publisher, function(x) paste(x, collapse = "----"))
publisher <- unlist(publisher)

# We still need to clean these for NAs, but we'll do that later when we do our
# general NA cleaning. For now, we add our split columns to our dataframe.
uv_df$publisher <- publisher
uv_df$developer <- developer

# Clean up our intermediate data products.
rm(publisher_developer, developer, publisher)

Identify and Clean NAs, Set Types

At this point, we have columns that align with our target dataframe. We start giving those columns a more thorough cleaning by standardizing how NAs are expressed and ensuring that column types are appropriate.

## trim_title ##
# First we examine trim_title for NA patterns. Because there are so many
# unique values for trim_title, we can't examine all of them. 

# We simplify our problem by assuming that NA patterns are likely to be short.
# We find counts of unique values and then look at just the shorter entries.
table(uv_df$trim_title[nchar(uv_df$trim_title) < 4], useNA = "ifany")

# We also look for titles that start with unusual characters.
uv_df$trim_title[grepl("^[^a-zA-Z0-9]", uv_df$trim_title)]

# One of the unusual values we observe is "{TITLE}". We take a closer look at
# this entry.
uv_df[grepl("{TITLE}", uv_df$trim_title, fixed = TRUE), ]

# The "TITLE" entry looks pretty bogus. We look for any possible variations on
# "TITLE" that might suggest an entry with an unknown game.
uv_df$trim_title[grepl("[Tt][Ii][Tt][Ll][Ee]", uv_df$trim_title)]

# "Untitled RPG" also looks pretty sketchy.
uv_df[grepl("Untitled RPG", uv_df$trim_title), ]

# Based on the above inspections, our NA candidates are: 
na_candidates <- c("", "{TITLE}", "Untitled RPG")

# We replace any matching strings with NA.
uv_df$trim_title <- ifelse(uv_df$trim_title %in% na_candidates, 
                           NA, uv_df$trim_title)

# We don't really care about extra_text, but we do the courtesy of seeing if
# there are obvious NAs.
table(uv_df$extra_text[nchar(uv_df$extra_text) < 4])

# NA candidates are:
na_candidates <- c("", "?", "???", "----", "NA")

# Replace our obvious NAs.
uv_df$extra_text <- ifelse(uv_df$extra_text %in% na_candidates, 
                           NA, uv_df$extra_text)

## publisher ##
table(uv_df$publisher[nchar(uv_df$publisher) < 4])

# Candidate NAs: 
na_candidates <- c("", "-", "*", "?", "NA")

# Replace our obvious NAs.
uv_df$publisher <- ifelse(uv_df$publisher %in% na_candidates, 
                          NA, uv_df$publisher)

# Because multiple publishers are possible, it's possible that we might have
# entries with a mix of valid and NA publishers. We inspect publishers for these
# possible combinations.
uv_df$publisher[grepl("^-", uv_df$publisher)]
uv_df$publisher[grepl("\\*", uv_df$publisher)]
uv_df$publisher[grepl("\\?", uv_df$publisher)]
uv_df$publisher[grepl("NA", uv_df$publisher)]

# We make some spot corrections for observed cases (based on as needed
# research). Untouched names appear to be valid.
uv_df$publisher[grepl("Mandriva", uv_df$publisher)] <- "Mandriva"
uv_df$publisher[grepl("Jam Creation", uv_df$publisher)] <- "Ainos"
uv_df$publisher[grepl("\\?{2,}", uv_df$publisher)] <- NA
uv_df$publisher <- ifelse(uv_df$publisher %in% c("G.J?", "Triple Eh?"), 
                          uv_df$publisher, gsub("\\?$", "", uv_df$publisher))

## developer ##
table(uv_df$developer[nchar(uv_df$developer) < 4])

# Candidate NAs: 
na_candidates <- c("-", "*", "?", "NA")
uv_df$developer <- ifelse(uv_df$developer %in% na_candidates, 
                          NA, uv_df$developer)

# Inspect for cases that may contain a mix of NA symbols.
uv_df$developer[grepl("^-", uv_df$developer)]
uv_df$developer[grepl("\\*", uv_df$developer)]
uv_df$developer[grepl("\\?", uv_df$developer)]
uv_df$developer[grepl("NA", uv_df$developer)]

# Correct observed cases (based on as needed research). Untouched names
# appear to be valid.
uv_df$developer <- ifelse(uv_df$developer %in% c("G.J?", "Triple Eh?"), 
                          uv_df$developer, gsub("\\?$", "", uv_df$developer))

## release_year ##
table(uv_df$release_year)

# This is a pretty straightforward column to clean - any values that are not
# exactly four digits are problematic, and we set those to NA.
uv_df$release_year <- sapply(uv_df$release_year, function(x) {
    if(grepl("^[[:digit:]]{4}$", x)) {
        return(x)
    } else {
        return(NA)
    }
}, USE.NAMES = FALSE)

uv_df$release_year <- as.integer(uv_df$release_year)

## platform ##
table(uv_df$platform)

# The only obvious NA candidate is the empty character string "".
uv_df$platform <- ifelse(uv_df$platform == "", NA, uv_df$platform)

# We clean up our intermediate data products.
rm(na_candidates)

Inspecting for String Misspellings and Variations

We're getting close to a clean dataset. The next major step is dealing with the fact that our records are game-platform records. We want to collapse the table to game records and merge our platform column to allow multi-way entries in the same style as our publisher and developer fields.

Prior to this, however, we want to ask an important question: can we identify any entries in our key fields (trim_title, developer, publisher, release_year) that are misspellings or which have unintended variability? Where misspellings or variability occur, they will incorrectly give the appearance of a unique game when really the record is aligned with (or a duplicate of) other game records.

Doing this type of cleaning completely requires some manual labor - at minimum we would need to do enough manual cleaning to train a model for a more automated solution. However, for the current project, we're going to automate what we can and - due to the sheer number of records - accept that we'll be left with some duplications.

We adopt a collection of strategies to allow us to automate the removal of as many duplicates as possible.

  1. Creating a column version with standardized casing and common differences removed. Here common differences are those I observed during early passes for variations/misspellings in the data (e.g., "007: Nightfire" v. "007 - Nightfire").

  2. Examining unique values first, not entire records. In other words, don't worry about record duplication yet - just worry about similarities among all the unique values in a single field (e.g., resolve similar titles alone).

  3. Automating the pairing of values with potential matches based on a simple string similarity index. This allows us to identify smaller collections of term v. match terms that we can process with more precise rules (see the sketch after this list).

  4. Standardizing how matched candidates are evaluated. This allows us to specify up front what the criteria are for an automated "match" (i.e., a potential misspelling/variation) or "mismatch" (i.e., false positive based on similarity alone). If we ever choose to do the proper manual cleaning of values, this will also minimize the manual work required.
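
To make strategy 3 concrete, below is a minimal sketch of the pairing idea using the stringdist package loaded earlier. This is an illustration only: the actual pairing in this document is done by the fuzzy_match_all() helper sourced from cleaning_functions.R, which adds further controls (e.g., skip_pure_digit, min_test_length) on top of a basic distance cutoff.

# Minimal sketch of strategy 3 (NOT the fuzzy_match_all() helper): pair each
# unique value with the other values whose Jaro-Winkler distance falls under
# a cutoff.
sketch_fuzzy_pairs <- function(values, max_dist = 0.1) {
    # Pairwise distance matrix; "jw" is reasonably robust for short strings.
    dist_matrix <- stringdist::stringdistmatrix(values, values, method = "jw")

    lapply(seq_along(values), function(i) {
        hits <- which(dist_matrix[i, ] <= max_dist)
        hits <- hits[hits != i]    # Drop the trivial self-match.
        values[hits]
    })
}

# Example: the two "street fighter" spellings pair with each other, while
# "mario kart" pairs with nothing.
sketch_fuzzy_pairs(c("mario kart", "street figther 2", "street fighter 2"))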

Standardize Casing and Known Variations

Based on experience with both the UVList data and titling in general, there are a few common sources of variation we can attempt to account for up front.

# Before we get started, we make new versions of the variables we'll be
# tinkering with.
uv_df$clean_title <- uv_df$trim_title
uv_df$clean_publisher <- uv_df$publisher
uv_df$clean_developer <- uv_df$developer
uv_df$clean_platform <- uv_df$platform

## Subtitles ##
# Only really an issue for titles, so we ignore other fields during this
# inspection.

# Our ideal solution here is to convert either all ": " to " - " or vice versa.
# When picking between the two, it seems cleaner to convert all to ": " (less
# ambiguous) but we're more likely to succeed at converting all to " - "
# (because we don't need to worry about identifying cases where " - " is not a
# subtitle). We aim for the less ambiguous approach.

# Key assumptions to this approach:
# * ": " is always used to indicate a subtitle.
# * " - " is always used to indicate a subtitle.
# * There are no incorrect uses of ":" or "-" when indicating subtitles.

# Secondary assumption:
# * ": " and " - " don't have unique uses in the same string.

# ": " used to indicate something other than a subtitle?
head(grep(": ", unique(uv_df$clean_title), value = TRUE))
# Result: None observed in the first few hundred entries.

# " - " used to indicate something other than a subtitle?
head(grep(" - ", unique(uv_df$clean_title), value = TRUE))
# Result: Very few in titles and the only ones that were definitely problematic
# to change were those in which a range of digits was being specified.

# " - " used to indicate range of digits.
length(grep("[[:digit:]] - [[:digit:]]", uv_df$clean_title, value = TRUE))
grep("[[:digit:]] - [[:digit:]]", uv_df$clean_title, value = TRUE)
# Result: We'll remove the spaces to disambiguate the problematic strings.

titles_to_fix <- c("Historyline 1914 - 1918", 
                   "Premier Manager 2004 - 2005",
                   "Premier Manager 2005 - 2006", 
                   "Premier Manager 2006 - 2007",
                   "UEFA Champions League 2004 - 2005")

uv_df$clean_title <- sapply(uv_df$clean_title, function(x) {
    if(x %in% titles_to_fix) {
        return(gsub(" - ", "-", x))
    } else {
        return(x)
    }
}, USE.NAMES = FALSE)

rm(titles_to_fix)

# "-" used incorrectly or unusually?
head(grep("[^ ]- | -[^ ]", unique(uv_df$clean_title), value = TRUE))
# Result: Yes for titles. These fall into three cases - situations where "-" is
# used to surround a word/phrase as a subtitle (e.g., "animals -puppy-"), where
# "-" is doubled up, and where "-" is just used wrong. We handle each.

# Surround a word/phrase.
uv_df$clean_title <- sapply(uv_df$clean_title, function(x) {
    if(grepl("(^| )-[^ ].*-", x)) {
        x <- sub(" -", ": ", x)
        x <- gsub("(-(?= |$))|(^-)", "", x, perl = TRUE)
        return(x)
    } else {
        return(x)
    }
}, USE.NAMES = FALSE)

# Doubled up.
uv_df$clean_title <- gsub("--", " - ", uv_df$clean_title)

# Leading or trailing "-".
uv_df$clean_title <- gsub(" -(?=[^ ])", ": ", uv_df$clean_title, perl = TRUE)
uv_df$clean_title <- sapply(uv_df$clean_title, function(x) {
    if(grepl("[^ ]- ", x, perl = TRUE)) {
        return(gsub("- ", ": ", x))
    } else {
        return(x)
    }
}, USE.NAMES = FALSE)

# ":" used incorrectly or unusually?
grep(" :|[^ ]:[^ ]", unique(uv_df$clean_title), value = TRUE)
# Result: Yes for titles. The problematic strings fall into three groups:
# legitimate uses of ":" with digits, incorrect uses of ":" with characters, and
# extra leading space " : ".

uv_df$clean_title <- sapply(uv_df$clean_title, function(x) {
    if(grepl(paste0("([[:alnum:]]:[[:alpha:]])",
                    "|([[:alpha:]]:[[:alnum:]])",
                    "|( : )|(::)|( :[^ ])"), x)) {
        return(gsub("(:(?=[^ :]))|( : )|([ ]*::)|( :)", ": ", x, perl = TRUE))
    } else {
        return(x)
    }
}, USE.NAMES = FALSE)

# Converting " - " to ": ".
uv_df$clean_title <- gsub(" - ", ": ", uv_df$clean_title)

## Hyphenation ##
# This is also an issue we'll only really address in our title column. As noted
# above, there are simply too many titles for us to correct all possible sources
# of hyphenation. We simply scan for obvious variation issues and correct what
# we can find.
length(grep("[^ ]-[^ ]", uv_df$clean_title, value = TRUE))
head(grep("[^ ]-[^ ]", uv_df$clean_title, value = TRUE), 100)

# The most obvious problematic pattern is the shorthand for X-Dimensional (e.g.,
# 2-D v. 2D v. 2 D). We standardize these to remove the hyphenation.
uv_df$clean_title <- sapply(uv_df$clean_title, function(x) {
    if(grepl("([[:digit:]]-[Dd]( |$))", x)) {
        return(gsub("-(?=([Dd] )|([Dd]$))", "", x, perl = TRUE))
    } else {
        return(x)
    }
}, USE.NAMES = FALSE)

## Common Shorthands ##
# We also try to standardize some other common patterns.
# * vs, vs., versus
uv_df$clean_title <- gsub("( vs )|( vs\\. )|( versus )", " vs ", 
                          uv_df$clean_title, ignore.case = TRUE)

# * vol, vol.
uv_df$clean_title <- gsub("( vol )|( vol\\.)", " Volume ", uv_df$clean_title, 
                          ignore.case = TRUE)

# * no.
uv_df$clean_title <- gsub("( no\\. )|( no\\.)|( no (?=[[:digit:]](:|$)))", 
                          " Number ", 
                          uv_df$clean_title, ignore.case = TRUE, perl = TRUE)

# * ep
uv_df$clean_title <- gsub("( ep\\.)|( ep (?=[[:digit:]]))", " Episode ",
                          uv_df$clean_title, ignore.case = TRUE, perl = TRUE)

# * &
uv_df$clean_title <- gsub("(?<=[^[:punct:]])&(?=[^[:punct:]])", " and ",
                          uv_df$clean_title, perl = TRUE)

## Spacing ##
# This is one of our more straightforward cleaning steps. We simply don't allow
# our strings to have more than a single space in them to remove accidental
# multi-space sequences.
uv_df$clean_title <- gsub("[ ]{2,}", " ", uv_df$clean_title)
uv_df$clean_publisher <- gsub("[ ]{2,}", " ", uv_df$clean_publisher)
uv_df$clean_developer <- gsub("[ ]{2,}", " ", uv_df$clean_developer)
uv_df$clean_platform <- gsub("[ ]{2,}", " ", uv_df$clean_platform)

# And we ensure that we introduced no new leading/trailing spaces with all our
# fixing.
uv_df$clean_title <- str_trim(uv_df$clean_title)
uv_df$clean_publisher <- str_trim(uv_df$clean_publisher)
uv_df$clean_developer <- str_trim(uv_df$clean_developer)
uv_df$clean_platform <- str_trim(uv_df$clean_platform)

## Casing ##
# Finally, we make the decision to ignore casing for our observed values and we
# enforce standard casing behavior on all our target fields. This is a jump from
# the title cleaning we've been doing and it breaks potentially desirable
# formatting. We do this work in a new column with the understanding that we'll
# eventually need to choose a desired "natural" casing for each value.
uv_df$flat_title <- str_to_lower(uv_df$clean_title)
uv_df$flat_publisher <- str_to_lower(uv_df$clean_publisher)
uv_df$flat_developer <- str_to_lower(uv_df$clean_developer)
uv_df$flat_platform <- str_to_lower(uv_df$clean_platform)

## Roman and Written Numerals ##
# We convert all character chunks likely to be Roman numerals. The Roman
# numeral conversion will result in some false positives (e.g., puppy and i
# becomes puppy and 1), but we accept these for the purpose of string matching.
# We restrict this work to titles as the choice of a number format is more
# likely to be intentional for publishers, developers, and platforms.

# Roman numeral conversion. Roman numerals almost always occur at the end of a
# title/subtitle, so we minimize false positives by only looking for matches 
# indicating the end of a major title portion.
numeral_range <- c(1:50)
for(number in numeral_range) {
    roman <- as.roman(number)
    roman_regex <- paste0("(?<=[^[:alnum:]]|^)", 
                          str_to_lower(roman), 
                          "(?=[^[:alnum:]]|$)")
    uv_df$flat_title <- gsub(roman_regex, as.character(number), 
                            uv_df$flat_title, 
                            perl = TRUE)
}

rm(number, numeral_range, roman, roman_regex)

# Written numeral conversion. We create a vector of common words and a matching
# vector of their numeric form.
written_numerals <- c("zero", "one", "two", "three", "four", "five", "six",
                      "seven", "eight", "nine", "ten", "eleven", "twelve",
                      "thirteen", "fourteen", "fifteen", "sixteen", "seventeen",
                      "eighteen", "nineteen", "twenty", "thirty", "forty", 
                      "fourty", "fifty", "sixty", "seventy", "eighty", "ninety")
numeral_range <- c(0:20, 30, 40, 40, 50, 60, 70, 80, 90)
for(index in 1:length(numeral_range)) {
    written_regex <- paste0("(?<=[^[:alnum:]]|^)", 
                            written_numerals[index], 
                            "(?=[^[:alnum:]]|$)")
    uv_df$flat_title <- gsub(written_regex, numeral_range[index], 
                            uv_df$flat_title, 
                            perl = TRUE)
}

rm(written_numerals, numeral_range, index, written_regex)

Identify Variation/Misspelling Candidates

NOTE: The run_from_scratch global gets assessed in this section. If set to TRUE, this section will take up to several hours to identify candidate associations and then (in the next section) resolve the candidate associations.

# This section handles identifying candidates for variations and mispellings. It
# is recommended that you only trigger this section if you are certain it is 
# necessary and know what you are doing as it is time-consuming.
if(run_from_scratch) {
    # First we look for variations/mispellings in our titles. This is the most
    # time-consuming of the variation processing. We save some processing by
    # focusing only on unique values (removing obvious duplicates from
    # consideration).
    unique_titles <- unique(uv_df$flat_title)

    title_matches <- fuzzy_match_all(unique_titles, max_dist = 0.1,
                                     skip_pure_digit = TRUE, 
                                     min_test_length = 3,
                                     assume_unique = TRUE, 
                                     remove_matches = TRUE)

    # Now for publisher... This takes a little bit of prep, as publisher is a 
    # complex field. We don't really want to evaluate the combinations of 
    # publishers, but the unique publisher names.
    single_publishers <- unlist(str_split(uv_df$flat_publisher, "----"))
    unique_publishers <- unique(single_publishers)

    publisher_matches <- fuzzy_match_all(unique_publishers, max_dist = 0.1,
                                         skip_pure_digit = TRUE, 
                                         min_test_length = 2,
                                         assume_unique = TRUE, 
                                         remove_matches = TRUE)

    # And developer...
    single_developers <- unlist(str_split(uv_df$flat_developer, "----"))
    unique_developers <- unique(single_developers)

    developer_matches <- fuzzy_match_all(unique_developers, max_dist = 0.1,
                                         skip_pure_digit = TRUE, 
                                         min_test_length = 2,
                                         assume_unique = TRUE, 
                                         remove_matches = TRUE)

    # And platform...
    unique_platforms <- unique(uv_df$flat_platform)

    platform_matches <- fuzzy_match_all(unique_platforms, max_dist = 0.1,
                                        skip_pure_digit = TRUE, 
                                        min_test_length = 2,
                                        assume_unique = TRUE, 
                                        remove_matches = TRUE)

    # Remove intermediate data products.
    rm(single_publishers, single_developers,
       unique_titles, unique_publishers, unique_developers, unique_platforms)
}

Process Variation/Misspelling Candidates

We now have a set of lists that identify terms that appear to be reasonable candidates for having variations/misspellings, along with the associated candidate variations/misspellings. We use a helper function to walk through our target variable matches (trim_title, publisher, developer, platform) and apply some basic logic to auto-assess match validity. See the definition of resolve_match to observe the rules determining what counts as a match.
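
The actual rules live in the resolve_match() definition in cleaning_functions.R. As a hedged illustration of the kind of logic such an auto-resolver can apply (this sketch is not the real rule set):

# Illustrative sketch only - the real rules are in resolve_match() inside
# cleaning_functions.R. The general idea: auto-accept a candidate pair when
# the difference is cosmetic and auto-reject when it is substantive.
sketch_resolve_match <- function(term, candidate) {
    # Identical after stripping punctuation/spacing? Accept as a variation.
    strip <- function(x) gsub("[[:punct:][:space:]]", "", x)
    if(strip(term) == strip(candidate)) {
        return("match")
    }

    # Different digits? Likely different entries in a series - reject.
    digits <- function(x) gsub("[^[:digit:]]", "", x)
    if(digits(term) != digits(candidate)) {
        return("mismatch")
    }

    # Otherwise leave the pair for manual review.
    return("uncertain")
}

sketch_resolve_match("007: nightfire", "007 - nightfire")    # "match"
sketch_resolve_match("final fantasy 7", "final fantasy 8")   # "mismatch"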

NOTE: The run_from_scratch global gets assessed in this section. If set to TRUE, this section can take some time to complete.

# This section handles processing candidates for variations and mispellings. It
# is recommended that you only trigger this section if you are certain it is 
# necessary and know what you are doing as it is time-consuming.
if(run_from_scratch) {
    # We run our auto-resolver on each of the collections.
    auto_title_matches <- resolve_all_match_sets(title_matches)
    auto_publisher_matches <- resolve_all_match_sets(publisher_matches)
    auto_developer_matches <- resolve_all_match_sets(developer_matches)
    auto_platform_matches <- resolve_all_match_sets(platform_matches)
}

Clean Identified Variation/Misspelling Pairs

At this point we've produced some objects that could either be subjected to manual processing (to resolve uncertain candidate matches) or which can be used to guide standardization of variation/mispelling collections as-is. For this project, we simply use the current auto-matched results and acknowledge that we've likely missed a few matches.

It's worth noting that we've done our matching with a highly stripped-down version of our values (certain punctuation and casing removed). We'll use the results of matching these values to replace our richer-formatted titles as well, but we'll retain the original raw versions to allow us to spot check for errors if the need arises.
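
A sketch of what applying one accepted match set might look like follows. The body below is an assumption about the behavior of fix_all_match_sets() from cleaning_functions.R, not its actual definition: every record whose flat value belongs to the set is rewritten to a single canonical flat value, and its clean column is rewritten to the clean spelling associated with that canonical value.

# Assumed behavior for applying a single accepted match set (the real work is
# done by fix_all_match_sets() in cleaning_functions.R).
sketch_fix_match_set <- function(match_set, df, clean_col, flat_col) {
    # Treat the first value in the set as canonical.
    canonical_flat <- match_set[1]

    # Borrow the clean spelling from the first record already carrying the
    # canonical flat value so the clean column stays internally consistent.
    canonical_clean <- df[[clean_col]][which(df[[flat_col]] == canonical_flat)[1]]

    target_rows <- df[[flat_col]] %in% match_set
    df[[clean_col]][target_rows] <- canonical_clean
    df[[flat_col]][target_rows] <- canonical_flat

    return(df)
}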

if(run_from_scratch) {
    ## titles ##
    fix_titles <- auto_title_matches$match_set_list
    fix_titles <- fix_titles[auto_title_matches$auto_accept_index]

    clean_uv_df <- fix_all_match_sets(fix_titles, uv_df, 
                                      "clean_title", "flat_title")

    ## publishers ##
    fix_publishers <- auto_publisher_matches$match_set_list
    fix_publishers <- fix_publishers[auto_publisher_matches$auto_accept_index]

    clean_uv_df <- fix_all_match_sets(fix_publishers, clean_uv_df,
                                      "clean_publisher", "flat_publisher")

    ## developers ##
    fix_developers <- auto_developer_matches$match_set_list
    fix_developers <- fix_developers[auto_developer_matches$auto_accept_index]

    clean_uv_df <- fix_all_match_sets(fix_developers, clean_uv_df,
                                      "clean_developer", "flat_developer")

    ## platforms ##
    # NO AUTO MATCHES

    ## cleanup ##
    rm(fix_titles, fix_publishers, fix_developers)
}

Roll Up to One Record per Game

At last we're ready to head into the roll up. We want to produce a dataframe with just one record per game. However, there is a reasonable concern that some distinct games may share a title - either by happenstance or because we rolled them together. We'll mitigate this risk by initially rolling up by both title and year and investigating any duplicates that remain to check for games that shouldn't be rolled up together.

We'll allow our other fields to initially become more complex, concatenating them together. During this concatenation, we'll also sort the elements and remove duplicates.

While we roll up, we also assess whether we can distinguish re-releases from new games with the same name.
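
The sort_elements() helper used below is defined in cleaning_functions.R. A sketch of the behavior the roll-up relies on - split any existing multi-way strings, sort, optionally de-duplicate, and join the group into a single string - might look like the following (the body, including the cf_split separator argument, is an assumption about the helper, not its actual definition).

# Assumed behavior of the sort_elements() roll-up helper (the real definition
# is in cleaning_functions.R).
sketch_sort_elements <- function(x, dedup = TRUE, cf_split = "----") {
    elements <- unlist(strsplit(as.character(x), "----", fixed = TRUE))
    elements <- sort(elements, na.last = TRUE)

    if(dedup) {
        elements <- unique(elements)
    }

    paste(elements, collapse = cf_split)
}

# Two game-by-platform records collapse to one multi-way platform entry.
sketch_sort_elements(c("windows", "amiga", "windows"))
# Expected: "amiga----windows"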

if(run_from_scratch) {
    # We collapse the dataframe by flat title and year.
    gy_df <- clean_uv_df %>%
        group_by(release_year, flat_title) %>%
        summarise(
            title = sort_elements(clean_title),
            publisher = sort_elements(clean_publisher),
            developer = sort_elements(clean_developer),
            platform = sort_elements(clean_platform),
            pub_test = sort_elements(flat_publisher, dedup = FALSE, 
                                     cf_split = "~~~~"),
            dev_test = sort_elements(flat_developer, dedup = FALSE,
                                     cf_split = "~~~~"),
            plat_test = sort_elements(flat_platform, dedup = FALSE,
                                      cf_split = "~~~~"),
            flat_publisher = sort_elements(flat_publisher),
            flat_developer = sort_elements(flat_developer),
            flat_platform = sort_elements(flat_platform),
            raw_title = sort_elements(raw_title)
        )

    # And another version by title alone.
    g_df <- clean_uv_df %>%
        group_by(flat_title) %>%
        summarise(
            title = sort_elements(clean_title),
            publisher = sort_elements(clean_publisher),
            developer = sort_elements(clean_developer),
            platform = sort_elements(clean_platform),
            min_year = min(release_year, na.rm = TRUE),
            max_year = max(release_year, na.rm = TRUE),
            release_year = sort_elements(release_year),
            pub_test = sort_elements(flat_publisher, dedup = FALSE, 
                                     cf_split = "~~~~"),
            dev_test = sort_elements(flat_developer, dedup = FALSE,
                                     cf_split = "~~~~"),
            plat_test = sort_elements(flat_platform, dedup = FALSE,
                                      cf_split = "~~~~"),
            ry_test = sort_elements(release_year, dedup = FALSE,
                                    cf_split = "~~~~"),
            flat_publisher = sort_elements(flat_publisher),
            flat_developer = sort_elements(flat_developer),
            flat_platform = sort_elements(flat_platform),
            raw_title = sort_elements(raw_title)
        ) %>%
        ungroup() %>%
        mutate(max_min_gap = max_year - min_year,
               njoins = nchar(gsub("[^-]", "", release_year)) / 4)

    # We inspect the list for instances where the roll-up seems like it may be
    # problematic: roll-ups that pulled together records with very different
    # release years.
    table(g_df$max_min_gap, useNA = "ifany")
}

We see that very few titles have duplicates beyond about a decade. External research also supports the idea that games sharing a name are relatively rare and tend to occur some time apart (e.g., http://www.giantbomb.com/same-name-different-game/3015-5521/). However, re-releases are quite common and often occur well after the game release (e.g., many of the games on gog.com).

Our best solution would be to manually identify games that share a name and flag them to avoid roll-up. Since that is not reasonable (for this version), we simply acknowledge that some mistaken roll-ups definitely happened and move on.

Pending the quality of our other game sources, this is likely not too terrible an issue. Assuming we don't need to roll up on title alone for at least one other rich source, we should identify the later same-name games when we attempt to merge on title.

Finalize the Dataframe

At this point, we have all the pieces we want for our merge-ready dataframe. We pull out the key columns and save the result.

# Select our target columns and create the merge-ready dataframe.
uv_merge_df <- g_df %>%
    select(
        # Core features.
        title, publisher, developer, platform, first_release_year = min_year,
        # Merge friendly versions of character features.
        flat_title, flat_publisher, flat_developer, flat_platform,
        # Additional useful UVList features.
        raw_title, all_release_year = release_year)

# Add a prefix so that columns are not lost accidentally during the merge.
names(uv_merge_df) <- paste0("uv_", names(uv_merge_df))

# If run from scratch, we save the key intermediate and final data products.
if(run_from_scratch) {
    save(title_matches, publisher_matches, developer_matches,
         platform_matches, auto_title_matches, auto_publisher_matches,
         auto_developer_matches, auto_platform_matches,
         g_df, gy_df, clean_uv_df,
         file = "./uvlist_intermediate_products.Rds")

    save(uv_merge_df, file = "./uv_merge_df.Rds")
}

