classify_etea: classifier-leverage function to classify text into groups of...
In chriskirkhub/etea: This package enables the classification of unstructured textual data into a structured, segmented, temporal, document frequency matrix for use as input into predictive modelling systems such as neural networks or state-space models.


classify_etea(textColumn, stateNum = NULL, timeNum = NULL,
  docvaragg = "null", use_stopwords = TRUE,
  stopwords_language = "english", add_stopwords = NULL,
  remove_stopwords = NULL, verbose = TRUE, toLower = FALSE,
  stem = FALSE, keptFeatures = NULL, removeFeatures = TRUE,
  language = "english", valuetype = c("glob"), thesaurus = NULL,
  dictionary = NULL, removeNumbers = TRUE, removePunct = TRUE,
  removeSeparators = TRUE, removeHyphens = TRUE, removeTwitter = TRUE,
  ngrams = 1L, skip = 0L, concatenator = "_", simplify = FALSE,
  useSentences = TRUE, convert_to_tm = TRUE, pstrong = 1, pweak = 0.5,
  termNum = 1, ...)

`textColumn`	character vector containing the text to be analysed; mandatory
`stateNum`	numeric vector identifying the condition or state in which each document or note was recorded, so that notes are allocated correctly when a state contains more than one note or record; default NULL
`timeNum`	numeric vector containing an index of when each document was recorded, giving a temporal record. This normalises progress and case-note recording as progress through a system, typically as days or minutes since the system commenced; default NULL
`docvaragg`	specifies how aggregation on docvars is to occur: 'state' (state ID only), 'time' (timestamp only), 'statetime' (state and timestamp) or 'timestate' (timestamp and state); default NULL (given as the string "null" in the usage above)
`use_stopwords`	specifies whether stopwords are removed from the corpus (TRUE) or kept (FALSE). Users are reminded that system (language-specific) stopwords may need additions or removals to suit a specific need; default TRUE
`add_stopwords`	a character vector of words to be added to the stopwords vector (if any); default is NULL.
`remove_stopwords`	a character vector of words to be removed from the stopwords vector (if any); default is NULL.
`verbose`	to see useful progress information; default is TRUE
`toLower`	to convert all inbound text into lower case. Notably this will degrade the sentence splitting function if applied; default is FALSE; see: `tokenize`
`stem`	reduce word length to root; default is FALSE; see: `tokenize`
`removeFeatures`	remove particular features from inbound text as specified in a list; default is TRUE; see: `quanteda`
`language`	to define local language; default is "english" see: `quanteda`
`valuetype`	to define patterning; default is `glob`; see: `quanteda`
`removeNumbers`	remove standalone numbers from inbound text (note: numbers already combined with characters, such as 1st or 2nd, are unaffected); default is TRUE; see: `quanteda`
`removePunct`	remove punctuation from inbound text; default is TRUE; see: `quanteda`
`removeSeparators`	remove separators from inbound text; default is TRUE; see: `quanteda`
`removeHyphens`	remove hyphen characters from inbound text; default is TRUE; see: `quanteda`
`removeTwitter`	remove twitter api characters from inbound text; default is TRUE; see: `quanteda`
`ngrams`	integer vector specifying the number of elements to be concatenated in each ngram; default is 1L; see: `ngrams`
`skip`	integer vector specifying the adjacency skip size for tokens forming the ngrams; default is 0L; see: `ngrams`
`concatenator`	character for combining words, default is `_`; see: `ngrams`
`simplify`	if TRUE, return a single character vector of tokens rather than a list with one element per text; default is FALSE; see: `tokenize`
`convert_to_tm`	logical specifying whether the matrix is returned in tm format (TRUE) or quanteda format (FALSE); default is TRUE
`termNum`	integer specifying the minimum frequency a word must have in the matrix; default is 1
`...`	Extra arguments, not used
`choice`	language used to determine the content of the basic stopword list; default `english`. See `quanteda` for further information.
`prior`	integer specifying the prior Bayesian weighting value
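
The interaction of `add_stopwords` and `remove_stopwords` can be pictured with plain set operations. The following is a minimal base-R sketch, not the package's actual code: the short `base_stopwords` list is a hypothetical stand-in for the language-specific list, and the add-then-remove ordering is an assumption.

```r
# Hypothetical base list; in practice this comes from the chosen language.
base_stopwords   <- c("the", "and", "of", "not")
add_stopwords    <- c("patient", "client")  # domain words to ignore
remove_stopwords <- c("not")                # words to keep despite being listed

# Assumed order: add first, then remove, so an explicit removal always wins.
effective_stopwords <- setdiff(union(base_stopwords, add_stopwords),
                               remove_stopwords)
print(effective_stopwords)
```

Keeping negations such as "not" out of the stopword list is a common reason to use `remove_stopwords` when notes are later scored.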

a scored, classified matrix of document/note words as categories to provide input into other analytical systems.
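
To illustrate what aggregating by state could look like, here is a small base-R sketch. It assumes that aggregation means summing term counts over notes sharing a state; the package may aggregate differently internally. The `counts` matrix and its terms are made up for illustration.

```r
# Toy note-by-term count matrix: five notes, three terms (invented values).
counts <- matrix(
  c(1, 0, 2,
    0, 1, 1,
    3, 0, 0,
    0, 2, 1,
    1, 1, 0),
  nrow = 5, byrow = TRUE,
  dimnames = list(paste0("note", 1:5), c("pain", "sleep", "walk"))
)

# One state identifier per note, as passed via stateNum.
stateNum <- c(1, 1, 2, 3, 3)

# Sum the rows that share a state: one row per state in the result.
by_state <- rowsum(counts, group = stateNum)
print(by_state)
```

The result has one row per distinct state, which is the shape a downstream model would consume when `docvaragg = "state"` is requested.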

Chris Kirk

# create a scored, classified matrix of CaseNotes and aggregated by state for use in further modelling
## LOAD ##
text_df <- read.csv("data/jtr_docs.csv",header=TRUE, sep=";")
textColumn<-as.character(text_df$Notes)
## CLASSIFY ##
classify_etea(textColumn, stateNum=c(1,1,2,3,3), verbose=TRUE, use_stopwords=TRUE, docvaragg="state")

# create dfm using character vector of CaseNotes, states and datetimestamps for use as a time series for a neural network or MARSS
## LOAD ##
text_df <- read.csv("data/militant_suffragette_extract.csv",header=TRUE, sep=";")
textColumn<-as.character(text_df$Notes) # typically textual interview or clinical notes
statecol<-as.numeric(text_df$stateNum) # typically identification of parts of journey/episode
timecol<-as.character(text_df$timeNum) # typically days since start of journey/episode
## CLASSIFY ##
etea_df_time <- classify_etea(textColumn, statecol, timecol, verbose=TRUE, use_stopwords=TRUE, docvaragg="time")
# for MARSS
## CONVERT FOR MARSS ##
etea_matrix <- data.matrix(etea_df_time) # MARSS requires a standard data matrix; note timeNum is held as rownames
dat <- t(etea_matrix) # transpose to MARSS form
colnames(dat) <- rownames(etea_matrix) # set column names to timeNum from docvars (rownames)
## dat is now available as MARSS DATA ##
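
The conversion steps above can be checked on a toy object without the CSV files or the MARSS package. This is a sketch only: `etea_df_toy` and its values are hypothetical stand-ins for the output of `classify_etea` with `docvaragg="time"`.

```r
# Hypothetical stand-in for the classified output: rows are time points,
# columns are terms, and the row names hold the timeNum values.
etea_df_toy <- data.frame(
  pain  = c(2, 0, 1),
  sleep = c(0, 1, 3),
  row.names = c("1", "5", "9")
)

etea_matrix_toy <- data.matrix(etea_df_toy)  # numeric matrix, row names kept
dat_toy <- t(etea_matrix_toy)                # one row per series, one column per time step

# After transposing, the time index sits in the column names.
stopifnot(identical(colnames(dat_toy), rownames(etea_matrix_toy)))
print(dim(dat_toy))
```

MARSS expects each row of the data matrix to be a time series and each column a time step, which is why the transpose is needed.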