create_q_matrix: Elaborate a Document Frequency Matrix in either quanteda or...


Description

In text analysis, skip-grams are often used to increase coverage without unduly increasing the size of the training set. In multi-state and decision modelling, it is often useful to aggregate documents by state ID and to return the result as a time series. Such matrices are used with neural networks, decision models and ensemble packages such as RTextTools. This function is a wrapper for the quanteda package: it builds a Document Feature Matrix and, optionally, returns it in tm package format for use by other functions. Its arguments draw on functionality within quanteda, namely docvars, ngrams and skipgrams. The function can be used on its own but also acts as a feeder to classify_etea. It wraps, with permission, original quanteda functions by Ken Benoit, Paul Nulty et al.
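To make the skip-gram idea concrete, the sketch below builds bigrams in base R. It is an illustration only and not the code that create_q_matrix or quanteda runs: skip_bigrams is a hypothetical helper, and the token vector is invented. The arguments skip and concatenator are named after the arguments of the same name in Usage, assuming they carry the meaning given under Arguments.

```r
# Hypothetical helper (not part of etea or quanteda): form bigrams whose
# two tokens are separated by exactly k other tokens, for each k in `skip`.
skip_bigrams <- function(tokens, skip = 0L, concatenator = "_") {
  out <- character(0)
  for (k in skip) {
    gap <- k + 1L                       # distance between the paired tokens
    n <- length(tokens) - gap           # number of pairs available at this gap
    if (n < 1L) next                    # text too short for this skip size
    left <- tokens[seq_len(n)]
    right <- tokens[seq_len(n) + gap]
    out <- c(out, paste(left, right, sep = concatenator))
  }
  out
}

# Invented example tokens
tokens <- c("patient", "reports", "mild", "chest", "pain")

adjacent <- skip_bigrams(tokens, skip = 0L)    # ordinary bigrams
with_skips <- skip_bigrams(tokens, skip = 0:1) # bigrams plus 1-skip bigrams
print(adjacent)
print(with_skips)
```

With skip = 0:1 the result contains the four adjacent bigrams plus three more pairs such as patient_mild. This is the extra coverage referred to above: more features come from the same amount of text.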

Usage

create_q_matrix(textColumn, stateNum = NULL, timeNum = NULL,
  docvaragg = c("null", "state", "time", "statetime", "timestate"),
  use_stopwords = TRUE, stopwords_language = "english",
  add_stopwords = NULL, remove_stopwords = NULL, verbose = TRUE,
  toLower = FALSE, stem = FALSE, removeFeatures = TRUE,
  language = "english", valuetype = c("glob"), removeNumbers = TRUE,
  removePunct = TRUE, removeSeparators = TRUE, removeHyphens = TRUE,
  removeTwitter = FALSE, ngrams = 1L, skip = 0L, concatenator = "_",
  simplify = FALSE, useSentences = FALSE, convert_to_tm = TRUE, ...)

Arguments

textColumn

character vector containing the text to be analysed; mandatory

stateNum

numeric vector of identifiers for the condition or state in which each document or note was recorded, so that documents are allocated correctly when more than one note or record falls in the same state; default is NULL

timeNum

numeric vector containing an index that identifies when each document was recorded, giving a temporal record. This normalises progress and case-note recording as progress through a system. Typically days or minutes since the system commenced; default is NULL

docvaragg

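specifies how documents are aggregated on docvars: s, state ID only; t, timestamp only; st, state and timestamp; ts, timestamp and state. Note that the usage signature lists the values as "null", "state", "time", "statetime" and "timestate", with "null" first and therefore the default, whereas this description and the Examples use the short forms s, t, st and ts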

use_stopwords

specifies whether stopwords are removed from the corpus (TRUE) or kept (FALSE). The system (language-specific) stopword list may need additions or removals to suit a specific need; default is TRUE

add_stopwords

a character vector of words to be added to the stopwords vector (if any); default is NULL.

remove_stopwords

a character vector of words to be removed from the stopwords vector (if any); default is NULL.

verbose

logical; if TRUE, prints useful progress information; default is TRUE

toLower

convert all inbound text to lower case. Note that this degrades sentence splitting if applied; default is FALSE; see: tokenize

stem

reduce word length to root; default is FALSE; see: tokenize

removeFeatures

remove particular features from inbound text as specified in a list; default is TRUE; see: quanteda

language

to define local language; default is "english" see: quanteda

valuetype

type of pattern matching applied to feature patterns; default is "glob"; see: quanteda

removeNumbers

remove individual numbers from inbound text, (note: numbers already aggregated with characters such as 1st or 2nd are unaffected); default is TRUE; see: quanteda

removePunct

remove punctuation from inbound text; default is TRUE; see: quanteda

removeSeparators

remove separators from inbound text; default is TRUE; see: quanteda

removeHyphens

remove hyphen characters from inbound text; default is TRUE; see: quanteda

removeTwitter

remove Twitter API characters from inbound text; default is FALSE (per the usage signature); see: quanteda

ngrams

integer vector specifying the number of elements to be concatenated in each ngram; default is 1L; see: ngrams

skip

integer vector specifying the adjacency skip size for tokens forming the ngrams; default is 0L; see: ngrams

concatenator

character for combining words, default is _; see: ngrams

simplify

logical; if TRUE, returns a single character vector of tokens rather than a list with one element per text; default is FALSE; see: tokenize

convert_to_tm

logical specifying whether the matrix is returned in tm format (TRUE) or quanteda format (FALSE); default is TRUE

...

Extra arguments, not used

stopwords_language

choice of language to determine the content of the basic stopword list; default is "english". See quanteda for further information.

termNum

integer specifying the minimum frequency a word is to have been found in the matrix
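The interplay of use_stopwords, add_stopwords and remove_stopwords described above can be pictured with plain set operations. This is a minimal sketch of the idea and is not taken from the function's source. The vector base_stopwords is an invented stand-in for a language-specific list, and the order of operations (add first, then remove) is an assumption.

```r
# Invented stand-in for a language-specific stopword list (assumption)
base_stopwords <- c("the", "a", "of", "not")

add_stopwords <- c("patient")   # domain word to drop as well
remove_stopwords <- c("not")    # negation to keep in the text

# Assumed order: extend the base list, then take out the exceptions
final_stopwords <- setdiff(union(base_stopwords, add_stopwords), remove_stopwords)

# Invented example tokens
tokens <- c("the", "patient", "did", "not", "attend")
kept <- tokens[!tokens %in% final_stopwords]
print(final_stopwords)
print(kept)
```

Keeping "not" matters in notes where negation changes the meaning of an entry. That is the kind of tailoring the use_stopwords description refers to.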

Value

either a quanteda-type Document Feature Matrix or a tm-type Document Term Matrix object (depending on convert_to_tm) containing word frequencies with, optionally, a time-series index and a state ID as a data frame.
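As a rough picture of what aggregation by state produces, the sketch below sums the rows of a small count matrix by state ID using base R's rowsum(). It illustrates the shape of the result only and is not how create_q_matrix performs the aggregation. The counts, document names and feature names are invented, and stateNum mirrors the vector used in the first example.

```r
# Invented document-by-feature counts: five documents, three features
counts <- matrix(
  c(1, 0, 2,
    0, 1, 1,
    3, 0, 0,
    0, 2, 1,
    1, 1, 0),
  nrow = 5, byrow = TRUE,
  dimnames = list(paste0("doc", 1:5), c("pain", "sleep", "walk"))
)

# One state ID per document
stateNum <- c(1, 1, 2, 3, 3)

# Sum feature counts over all documents sharing a state ID;
# the row names of the result are the state IDs
by_state <- rowsum(counts, group = stateNum)
print(by_state)
```

The result has one row per state rather than one row per document. Aggregating on a time index instead would follow the same pattern, with the time values as row names.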

Author(s)

Chris Kirk

References

Guthrie, D., B. Allison, W. Liu, and L. Guthrie. 2006. "A Closer Look at Skip-Gram Modelling."

Examples

# Create a dfm from a character vector of case notes, aggregated by state, for use in a classifier
## LOAD ##
text_df <- read.csv("data/jtr_docs.csv", header = TRUE, sep = ";")
textColumn <- as.character(text_df$Notes)
## CREATE MATRIX ##
# stateNum needs one entry per document; five documents are assumed here
create_q_matrix(textColumn, stateNum = c(1, 1, 2, 3, 3), verbose = TRUE, use_stopwords = TRUE, docvaragg = "s")

# Create a dfm from a character vector of case notes, states and datetimestamps, for use as a time series with nnet or MARSS
## LOAD ##
text_df <- read.csv("data/militant_suffragette_extract.csv", header = TRUE, sep = ";")
textColumn <- as.character(text_df$Notes) # typically textual interview or clinical notes
stateNum <- as.numeric(text_df$stateID) # typically identification of parts of a journey/episode
timeNum <- as.character(text_df$datetimestamp) # typically days since start of journey/episode
## CREATE MATRIX ##
q_tm_dfm <- create_q_matrix(textColumn, stateNum, timeNum, verbose = TRUE, use_stopwords = TRUE, docvaragg = "t")
# for MARSS
q_matrix <- data.matrix(q_tm_dfm) # MARSS requires a standard data matrix; note that timeNum values are the rownames
## CONVERT FOR MARSS ##
dat <- t(q_matrix) # transpose to MARSS form
colnames(dat) <- rownames(q_matrix) # set column names to timeNum from docvars (rownames)
## dat is now available as MARSS DATA ##

chriskirkhub/etea documentation built on May 13, 2019, 6:55 p.m.