knitr::opts_chunk$set(echo = TRUE)
devtools::load_all(".")
library(data.table)
rs = data.table(RS.data)

code = create.code(excerpts = as.character(rs$text), expressions = c("data","number","payload"))
code = autocode(code, simplify=F)
code$testSet = as.matrix(data.frame(
  ID = c(3476,1679,342,1719,651,359,179,784,728,3364),
  X1 = c(0,0,1,0,0,0,0,0,0,1)
))
Convert the excerpts attached to the code to a data.table, then split each excerpt into cleaned, lowercased words.
col = c("text")
exDT = data.table(text = code$excerpts)
excerptWords = exDT[, {
  # Split on spaces, strip punctuation, lowercase, and drop empty strings
  wds = strsplit(as.character(.SD[[col]]), " ")[[1]]
  wds = tolower(gsub('[[:punct:]]| ', '', wds))
  wds = wds[grep(x=wds, pattern="^$", invert=T)]
  wds
}, by=1:nrow(exDT), .SDcols = col]
head(excerptWords, 10)
Create a data.table that tracks the frequency of every word in the corpus, keeping a reference to the documents each word occurs in.
wordFrequency = excerptWords[, list(freq=.N, docs=list(.SD$nrow), seen=F), by=V1, .SDcols=c("nrow", "V1")]
setorder(wordFrequency, -freq)
head(wordFrequency, 10)
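The tokenizing and counting steps above can be sketched on a toy corpus. This is a minimal illustration in base R rather than data.table, and the excerpts, the `tokenize` helper, and the `wordTable` and `docsByWord` names are all hypothetical, not part of the package.

```r
# Toy corpus (hypothetical data, not taken from RS.data)
excerpts <- c("The data, the payload!", "Number of data points", "No match here")

# Split one excerpt on spaces, strip punctuation, lowercase, drop empties
tokenize <- function(x) {
  wds <- strsplit(x, " ", fixed = TRUE)[[1]]
  wds <- tolower(gsub("[[:punct:]]", "", wds))
  wds[wds != ""]
}
tokens <- lapply(excerpts, tokenize)

# Long table: one row per (excerpt index, word) pair
wordTable <- data.frame(
  doc  = rep(seq_along(tokens), lengths(tokens)),
  word = unlist(tokens),
  stringsAsFactors = FALSE
)

# Frequency of each word, and the excerpts each word occurs in
freq <- sort(table(wordTable$word), decreasing = TRUE)
docsByWord <- split(wordTable$doc, wordTable$word)
```

Here `docsByWord` plays the role of the docs column: a word that occurs twice in one excerpt lists that excerpt twice, just as in the table above.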
The excerpts attached to a code are stored as a single character vector, so we first create a range (excerptRange) to represent each excerpt ID. This range is used to pull out the indices of the excerpts that are not in the testSet (unseens), and from those the set of words that haven't been seen (unseenWords).
excerptRange = 1:length(code$excerpts)
unseens = (excerptRange)[-code$testSet[,1]]
unseenWords = unique(excerptWords[(nrow %in% unseens)]$V1)
Next, we find the excerpts that have already been coded as yes in the testSet: excerptsCodedYes. Using those indices, we pull all of the words contained in those excerpts from the document-word table created above, giving yesWords.
excerptsCodedYes = code$testSet[which(code$testSet[,2] == 1),1]
yesWords = unique(excerptWords[(nrow %in% excerptsCodedYes)]$V1)
Combine the two sets of words, unseenWords and yesWords, to get the set of words to search through.
newWords = unique(c(yesWords, unseenWords))
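To make the index bookkeeping concrete, here is a small sketch of how a test set partitions the excerpt indices. The matrix, the excerpt count, and the variable names are invented for illustration. It uses `setdiff` and logical subsetting as an alternative to negative indexing and `which`.

```r
# Hypothetical test set: excerpt IDs with their codes (1 = yes, 0 = no)
testSet <- cbind(ID = c(2, 4, 5), X1 = c(1, 0, 0))
nExcerpts <- 6

# Excerpts that have not been coded at all
unseen <- setdiff(seq_len(nExcerpts), testSet[, "ID"])

# Excerpts that were coded yes and no, respectively
codedYes <- testSet[testSet[, "X1"] == 1, "ID"]
codedNo  <- testSet[testSet[, "X1"] == 0, "ID"]

# Indices whose words feed the candidate set (yes + unseen)
candidateDocs <- sort(unique(c(codedYes, unseen)))
```

With these values, excerpts 1, 3, and 6 are unseen, excerpt 2 is coded yes, and excerpts 4 and 5 are coded no.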
Now that we have our set of words, we need to do two things to refine our list of newWords:
1. Remove the words that appear in excerpts already coded no
2. Remove the words that match the classifier's expressions (code$expressions)
Using the same strategy as with the yes words, except that the testSet is searched for excerpts coded as no (0), we get the set of excerpt IDs that were coded no: excerptsCodedNo. Again, that vector of excerpt IDs is used to subset the document-word table, resulting in the unique set of all words that have already been seen and coded as no: noWords. Dropping those words from newWords gives newWordsFiltered.
excerptsCodedNo = code$testSet[which(code$testSet[,2] == 0),1]
noWords = unique(excerptWords[(nrow %in% excerptsCodedNo)]$V1)
newWordsFiltered = newWords[!(newWords %in% noWords)]
length(newWordsFiltered)
TBD
# TBD
Now remove the words that match the classifier's expressions:
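# Logical indexing keeps every word when nothing matches;
# newWordsFiltered[-which(...)] would return an empty vector in that case.
classifierWords = expression.match(newWordsFiltered, code$expressions) == 1
cleanedWords = newWordsFiltered[!classifierWords]
length(cleanedWords)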
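One detail worth checking when dropping matches by position: negative indexing with an empty index vector returns an empty vector, not the full one. The sketch below shows the difference using `grepl` as a stand-in matcher (an assumption for illustration, not the package's `expression.match`) and made-up words.

```r
words <- c("alpha", "beta", "gamma")

# Stand-in matcher; this pattern matches none of the words
matched <- grepl("zzz", words)

# which(matched) is integer(0), so negative indexing drops everything
byPosition <- words[-which(matched)]

# Logical negation keeps all three words, as intended
byLogical <- words[!matched]
```

When at least one word matches, both forms give the same result. They only diverge in the no-match case.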
Now filter the word frequency table from above (wordFrequency) to only the rows whose word appears in cleanedWords, and return the top 20. Because the table is already sorted by frequency, these are the most frequent unseen words.
# head() avoids NA-filled rows when fewer than 20 words remain
topUnseen = wordFrequency[V1 %in% cleanedWords, head(.SD, 20)]
topUnseen
Each row of the frequency table contains the set of documents in which the word occurs. Using the docs column, find the two most frequent documents across the top 20 words.
freqUnseen = sort(table(unlist(topUnseen$docs)), decreasing = T)[1:2]
freqInds = as.numeric(names(freqUnseen))
freqInds
code$excerpts[freqInds]
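The counting step above can be illustrated on a hand-made docs list. The document IDs here are invented, and each list element stands for the documents of one top word.

```r
# Hypothetical docs column: one vector of document IDs per word
docs <- list(c(1, 4, 7), c(4, 9), c(4, 7), c(2, 4))

# Flatten, count how often each document appears, most frequent first
counts <- sort(table(unlist(docs)), decreasing = TRUE)

# table() stores the IDs as names, so convert them back to numbers
topTwo <- as.numeric(names(counts)[1:2])
```

Document 4 appears under all four words and document 7 under two, so those are the two selected. The names of a table are character strings, which is why the conversion with `as.numeric` is needed before indexing the excerpts.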