knitr::opts_chunk$set(echo = FALSE, message = FALSE)
We are occasionally faced with a variable that contains multiple pieces of information. Consider the variable charges in the following dataset, test_data.
test_data <- data.frame(id = 1:4, charges = c("Murder,Kidnapping,Burglary", "burglary: assault and battery", "murder", ""))
knitr::kable(test_data, caption = "test_data")
Each value of charges can hold several charges, separated by a comma or a colon and written with inconsistent capitalization. In this form the variable cannot be used directly in an analysis.
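To make the separate pieces visible, the strings can be split on the delimiters. The following is only a sketch in base R, using `strsplit` with the regular expression `[,:]`; the object names `charges`, `pieces`, and `codes` are illustrative and are not part of the package.

```r
# Illustrative sketch: split each string on a comma or a colon (base R only).
charges <- c("Murder,Kidnapping,Burglary",
             "burglary: assault and battery",
             "murder",
             "")

# strsplit() returns one character vector of pieces per input string
pieces <- strsplit(tolower(charges), "[,:]")

# remove surrounding whitespace from each piece
pieces <- lapply(pieces, trimws)

# collect the unique, non-empty codes across all rows
codes <- unique(unlist(pieces))
codes <- codes[codes != ""]
print(codes)
```

Lower-casing first means that "Murder" and "murder" collapse into a single code.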
The untangle function can be used to create a set of dummy codes from this variable that will be more useful. This results in a new dataset with dummy codes:
untangle <- function(data, x, pattern, verbose = FALSE) {
  require(stringr)
  variable <- str_to_lower(data[[x]])

  # obtain list of unique codes
  code_matrix <- str_split(variable, pattern, simplify = TRUE)
  code_vector <- as.character(code_matrix)
  code_vector <- str_trim(code_vector)
  code_vector <- unique(code_vector[code_vector != ""])
  if (verbose) cat("[Unique Codes]", code_vector, sep = "\n")

  # create dummy codes matrix
  nobs <- nrow(data)
  ncodes <- length(code_vector)
  dummy_codes <- matrix(NA, nrow = nobs, ncol = ncodes)

  # add 1/0 codes (fixed() treats each code as literal text, not as a regex)
  for (i in seq_len(ncodes)) {
    dummy_codes[, i] <- ifelse(str_detect(variable, fixed(code_vector[i])), 1, 0)
  }

  # add dummy code names
  dummy_codes <- as.data.frame(dummy_codes)
  codenames <- str_replace_all(code_vector, " ", "_")
  names(dummy_codes) <- codenames

  # add to data frame and return the result visibly
  newdata <- cbind(data, dummy_codes)
  newdata
}
test_data <- untangle(data=test_data, x="charges", pattern="[,:]", verbose=TRUE)
knitr::kable(test_data, caption="New test_data")
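One point to keep in mind when reading the result: `str_detect` reports whether a code appears anywhere in the string, so a code such as "murder" would also be flagged for a value such as "attempted murder". If that matters for your data, exact matching on the split pieces is one alternative. The sketch below uses base R only; `has_code` is a hypothetical helper written for illustration, not part of the package.

```r
# Hypothetical helper (not part of untangle): TRUE when `code` is exactly
# one of the delimited pieces of a string, rather than a substring of it.
has_code <- function(x, code, pattern = "[,:]") {
  pieces <- strsplit(tolower(x), pattern)
  vapply(pieces, function(p) code %in% trimws(p), logical(1))
}

# made-up example values, chosen to show the difference
charges <- c("attempted murder,burglary", "murder", "")

# substring matching flags every value that contains the text "murder"
substring_match <- grepl("murder", charges, fixed = TRUE)

# exact matching flags only the value whose piece is exactly "murder"
exact_match <- has_code(charges, "murder")

print(data.frame(charges, substring_match, exact_match))
```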