classify_etea: classifier-leverage function to classify text into groups of...

Usage

classify_etea(textColumn, stateNum = NULL, timeNum = NULL,
  docvaragg = "null", use_stopwords = TRUE,
  stopwords_language = "english", add_stopwords = NULL,
  remove_stopwords = NULL, verbose = TRUE, toLower = FALSE,
  stem = FALSE, keptFeatures = NULL, removeFeatures = TRUE,
  language = "english", valuetype = c("glob"), thesaurus = NULL,
  dictionary = NULL, removeNumbers = TRUE, removePunct = TRUE,
  removeSeparators = TRUE, removeHyphens = TRUE, removeTwitter = TRUE,
  ngrams = 1L, skip = 0L, concatenator = "_", simplify = FALSE,
  useSentences = TRUE, convert_to_tm = TRUE, pstrong = 1, pweak = 0.5,
  termNum = 1, ...)

Arguments

textColumn

character vector containing the text to be analysed; mandatory

stateNum

numeric vector of identifiers for the condition or state in which each document or note was recorded/written, so that it is allocated correctly when a state contains more than one note or record; default NULL

timeNum

numeric vector containing an index that identifies when each document was recorded/noted, giving a temporal record. This normalises progress and case-note recording as progression through a system. Typically days or minutes after the system commenced; default NULL

docvaragg

specifies how aggregation on docvars is to occur: 'state' (state identifier only), 'time' (timestamp only), 'statetime' (state and timestamp) or 'timestate' (timestamp and state); default is "null"; Options: state, time, statetime, timestate
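
As a rough illustration of what aggregating by state means, the sketch below sums toy term counts over notes that share a state identifier. It uses base R only; the matrix `counts` and its terms are made-up data, and `rowsum()` is an assumed stand-in, not necessarily how classify_etea aggregates internally.

```r
# Toy document-term counts: one row per note, one column per term (made-up data).
counts <- matrix(
  c(1, 0, 2,
    0, 1, 1,
    3, 0, 0,
    0, 2, 1,
    1, 1, 0),
  nrow = 5, byrow = TRUE,
  dimnames = list(paste0("note", 1:5), c("pain", "sleep", "walk"))
)
stateNum <- c(1, 1, 2, 3, 3)  # state identifier for each note

# Sum the rows that share a state, giving one row per state.
by_state <- rowsum(counts, group = stateNum)
by_state
```

Aggregating on a combined key (as 'statetime' suggests) could be sketched the same way by grouping on something like `paste(stateNum, timeNum)`, again as an assumption.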

use_stopwords

specifies whether stopwords are to be removed from the corpus (TRUE) or kept (FALSE). Users are reminded that system (language-specific) stopwords may need additions or removals to suit a specific need; default TRUE

add_stopwords

a character vector of words to be added to the stopwords vector (if any); default is NULL.

remove_stopwords

a character vector of words to be removed from the stopwords vector (if any); default is NULL.
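
The interplay of add_stopwords and remove_stopwords can be pictured with plain set operations. This is a minimal sketch, assuming additions are applied before removals; `base_stopwords` is a hypothetical stand-in for the language-specific list and the tokens are invented.

```r
# Hypothetical starting list; in practice this would be the language-specific list.
base_stopwords   <- c("the", "a", "and", "not", "of")
add_stopwords    <- c("patient", "pt")  # domain words to drop as well
remove_stopwords <- c("not")            # negation is worth keeping in clinical notes

# Assumed order: add first, then remove.
stopword_list <- setdiff(union(base_stopwords, add_stopwords), remove_stopwords)

tokens <- c("the", "patient", "did", "not", "sleep")
kept   <- tokens[!tokens %in% stopword_list]
kept
```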

verbose

logical; whether to print useful progress information; default is TRUE

toLower

to convert all inbound text into lower case. Notably this will degrade the sentence splitting function if applied; default is FALSE; see: tokenize

stem

reduce words to their root form (stemming); default is FALSE; see: tokenize

removeFeatures

remove particular features from inbound text as specified in a list; default is TRUE; see: quanteda

language

to define local language; default is "english" see: quanteda

valuetype

to define patterning; default is glob; see: quanteda

removeNumbers

remove individual numbers from inbound text, (note: numbers already aggregated with characters such as 1st or 2nd are unaffected); default is TRUE; see: quanteda

removePunct

remove punctuation from inbound text; default is TRUE; see: quanteda

removeSeparators

remove separators from inbound text; default is TRUE; see: quanteda

removeHyphens

remove hyphen characters from inbound text; default is TRUE; see: quanteda

removeTwitter

remove twitter api characters from inbound text; default is TRUE; see: quanteda

ngrams

integer vector specifying the number of elements to be concatenated in each ngram; default is 1L; see: ngrams

skip

integer vector specifying the adjacency skip size for tokens forming the ngrams; default is 0L; see: ngrams

concatenator

character for combining words, default is _; see: ngrams
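
To show how ngrams, skip and concatenator interact, here is a small base R sketch. `make_ngrams` is a hypothetical helper written for illustration, not part of etea or quanteda, and it handles only a single n and a single skip value.

```r
# Hypothetical helper: build n-grams from a token vector.
# skip = 0 joins adjacent tokens; skip = k leaves exactly k tokens between members.
make_ngrams <- function(tokens, n = 1L, skip = 0L, concatenator = "_") {
  offsets <- (seq_len(n) - 1L) * (skip + 1L)  # positions relative to the start token
  span <- max(offsets) + 1L                   # tokens covered by one n-gram
  if (length(tokens) < span) return(character(0))
  starts <- seq_len(length(tokens) - span + 1L)
  vapply(
    starts,
    function(i) paste(tokens[i + offsets], collapse = concatenator),
    character(1)
  )
}

tokens <- c("unable", "to", "sleep", "well")
bigrams      <- make_ngrams(tokens, n = 2L)             # adjacent pairs
skip_bigrams <- make_ngrams(tokens, n = 2L, skip = 1L)  # pairs one token apart
```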

simplify

if TRUE, return a single character vector of tokens rather than a list with one element per text; default is FALSE; see: tokenize

convert_to_tm

logical specifying whether the matrix is returned in tm (TRUE) or quanteda (FALSE) format; default is TRUE

termNum

integer specifying the minimum frequency with which a word must occur in the matrix; default is 1
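
A minimum-frequency cut-off like termNum can be sketched on a toy count matrix as below. The source does not say whether the frequency is a total count or a document count; this sketch assumes total occurrences across all notes, and `counts` is made-up data.

```r
# Toy document-term counts (made-up data).
counts <- matrix(
  c(1, 0, 2,
    0, 0, 1,
    3, 0, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("note", 1:3), c("pain", "sleep", "walk"))
)
termNum <- 2  # keep terms seen at least twice overall (assumed interpretation)

term_totals <- colSums(counts)
filtered <- counts[, term_totals >= termNum, drop = FALSE]
colnames(filtered)
```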

...

Extra arguments, not used

choice

of language to determine the content of the basic stopword list; default english. See quanteda for further information.

prior

integer specifying the prior bayesian weighting value

Value

a scored, classified matrix of document/note words as categories to provide input into other analytical systems.

Author(s)

Chris Kirk

Examples

# create a scored, classified matrix of CaseNotes, aggregated by state, for use in further modelling
## LOAD ##
text_df <- read.csv("data/jtr_docs.csv",header=TRUE, sep=";")
textColumn<-as.character(text_df$Notes)
## CLASSIFY ##
classify_etea(textColumn, stateNum=c(1,1,2,3,3), verbose=TRUE, use_stopwords=TRUE, docvaragg="state")

# create dfm using character vector of CaseNotes, states and datetimestamps for use as a time series for a neural network or MARSS
## LOAD ##
text_df <- read.csv("data/militant_suffragette_extract.csv",header=TRUE, sep=";")
textColumn<-as.character(text_df$Notes) # typically textual interview or clinical notes
statecol<-as.numeric(text_df$stateNum) # typically identification of parts of journey/episode
timecol<-as.numeric(text_df$timeNum) # typically days since start of journey/episode
## CLASSIFY ##
etea_df_time <- classify_etea(textColumn, statecol, timecol, verbose=TRUE, use_stopwords=TRUE, docvaragg="time")
# for MARSS
## CONVERT FOR MARSS ##
etea_matrix <- data.matrix(etea_df_time) # MARSS requires standard data matrix Note timeNum as rownames
dat <- t(etea_matrix) # transpose to MARSS form
colnames(dat) <- rownames(etea_matrix) # set column names to timeNum from docvars (rownames)
## dat is now available as MARSS DATA ##
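
The reshaping step can be checked on toy data without the package. In this sketch `etea_df_time` is a made-up stand-in for the classifier output (rows indexed by timeNum, one column per term); the real return value may differ in structure.

```r
# Made-up stand-in for classifier output: rows are time points, columns are terms.
etea_df_time <- data.frame(
  pain  = c(1, 0, 2),
  sleep = c(0, 3, 1),
  row.names = c("1", "5", "9")  # timeNum values as row names
)

etea_matrix <- data.matrix(etea_df_time)  # plain numeric matrix
dat <- t(etea_matrix)                     # one row per series, one column per time point

dim(dat)
colnames(dat)
```

Since `t()` carries the dimnames across, the time index ends up as the column names of `dat`.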

chriskirkhub/etea documentation built on May 13, 2019, 6:55 p.m.