knitr::opts_chunk$set(echo = TRUE)

devtools::load_all(".")
library(data.table)

rs = data.table(RS.data)

Create a Test Code

  # Build a code from the excerpts, autocode it, and attach a small
  # hand-coded test set (excerpt ID and human rating)
  code = create.code(excerpts = as.character(rs$text), expressions = c("data", "number", "payload"))
  code = autocode(code, simplify = FALSE)
  code$testSet = as.matrix(data.frame(
    ID = c(3476, 1679, 342, 1719, 651, 359, 179, 784, 728, 3364),
    X1 = c(0, 0, 1, 0, 0, 0, 0, 0, 0, 1)
  ))

Create Data.Table of All Words

Convert the excerpts attached to the code to a data.table, then split each sentence into cleaned words.

col = c("text")
exDT = data.table(text = code$excerpts)
excerptWords = exDT[, {
  wds = strsplit(as.character(.SD[[col]]), " ")[[1]]
  wds = tolower(gsub('[[:punct:]]| ', '', wds))
  wds = wds[grep(x=wds, pattern="^$", invert=T)]
  wds
}, by=1:nrow(exDT), .SDcols = col]

head(excerptWords, 10)
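
The by = 1:nrow(exDT) idiom is what produces the column names used throughout the rest of this walkthrough: the grouping column is named nrow, and the unnamed result vector becomes V1. A minimal toy illustration (hypothetical data, not from the corpus):

toy = data.table(text = c("First one.", "Second"))
toy[, strsplit(tolower(text), " ")[[1]], by = 1:nrow(toy)]
# Result has columns `nrow` (source row) and `V1` (word):
# rows (1, "first"), (1, "one."), (2, "second")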

Create Word Frequency Table

Create a data.table that tracks the frequency of every word in the corpus, keeping a reference to the documents in which each word appears.

wordFrequency = excerptWords[, list(freq = .N, docs = list(.SD$nrow), seen = FALSE), by = V1, .SDcols = c("nrow", "V1")]
setorder(wordFrequency, -freq)

head(wordFrequency, 10)
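
Because docs is a list-column, it can be unlisted to see which documents contain a given word. An illustrative lookup (assuming the word "data" occurs in the corpus, which the expressions above suggest):

# Which documents contain the word "data"?
wordFrequency[V1 == "data", unlist(docs)]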

Find Unseen Words

The excerpts attached to a code are stored as a single character vector, so we first create a range (excerptRange) to represent each excerpt ID. This range is used to pull out the indices of excerpts that haven't been seen (unseens), and from those the set of unseen words (unseenWords).

excerptRange = seq_along(code$excerpts)
unseens = excerptRange[-code$testSet[, 1]]
unseenWords = unique(excerptWords[nrow %in% unseens]$V1)
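
A quick sanity check (illustrative, assuming the test-set IDs are unique and in range): the unseen excerpts and the test-set excerpts should partition the full range.

length(unseens) + nrow(code$testSet) == length(code$excerpts)  # should be TRUE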

Words In Matched Excerpts

Next, we find all of the words included in the excerpts that have already been coded as yes in the test set (excerptsCodedYes). Using those indices, we can pull all of the words contained in those excerpts from the document-word table created above (yesWords).

excerptsCodedYes = code$testSet[which(code$testSet[,2] == 1),1]
yesWords = unique(excerptWords[(nrow %in% excerptsCodedYes)]$V1)

Combine Word Lists

Combine the two sets of words, unseenWords and yesWords, to get the set of words to search through.

newWords = unique(c(yesWords, unseenWords))
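
Note that unique(c(x, y)) is how base R defines union(x, y), so the following cross-check (not part of the original workflow) should hold:

identical(newWords, union(yesWords, unseenWords))  # should be TRUE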

Refine New Words

Now that we have our set of words, we need to do three things to refine our list of newWords:

  1. Remove words appearing in Test Set excerpts coded as no
  2. Remove words appearing in Training Set excerpts coded as no
  3. Remove words that already match our classifier (code$expressions)

1. Removing Words In Unmatched TestSet Excerpts

Using the same strategy as with the yes words, except that the testSet is searched for excerpts coded as 0, we get the set of excerpt IDs that were coded no (excerptsCodedNo). Again, that vector of excerpt IDs is used to subset the document-word table, resulting in the unique set of all words that have already been seen and coded as no (noWords).

excerptsCodedNo = code$testSet[which(code$testSet[,2] == 0),1]
noWords = unique(excerptWords[(nrow %in% excerptsCodedNo)]$V1)

newWordsFiltered = newWords[!(newWords %in% noWords)]
length(newWordsFiltered)
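
Since newWords is already unique, the same filter can be written more compactly with base R's set difference; an equivalent alternative, not the code used above:

# newWordsFiltered = setdiff(newWords, noWords)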

2. Removing Words In Unmatched TrainingSet Excerpts

TBD

# TBD

3. Removing Words In Classifier

Now, remove the words that already match the classifier's expressions (code$expressions):

classifierWords = which(expression.match(newWordsFiltered, code$expressions) == 1)

# Guard against no matches: x[-integer(0)] would wrongly return an empty vector
cleanedWords = if (length(classifierWords) > 0) newWordsFiltered[-classifierWords] else newWordsFiltered
length(cleanedWords)

Highest Occurring New Words

Now filter the word-frequency table from above (wordFrequency) to only those rows whose word appears in the set of cleanedWords, and return the top 20, giving the most frequent unseen words.

# head() avoids NA-padded rows when fewer than 20 words match
topUnseen = wordFrequency[V1 %in% cleanedWords, head(.SD, 20)]

topUnseen

Most Frequent Documents

Each row of the frequency table contains the set of documents in which the word occurs. Using the docs column, find the two documents that appear most frequently across the top 20 words.

freqUnseen = sort(table(unlist(topUnseen$docs)), decreasing = TRUE)[1:2]
freqInds = as.numeric(names(freqUnseen))

freqInds
code$excerpts[freqInds]
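
For intuition, a toy illustration (made-up values, not from the corpus) of how unlist plus table counts document occurrences across the per-word lists:

docsToy = list(c(1, 2), c(2, 3), 2)
sort(table(unlist(docsToy)), decreasing = TRUE)
# document 2 appears in three of the word lists, so it ranks first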

