createTritonCovariateSettings: createTritonCovariateSettings


View source: R/CreateTritonSettings.R

Description

Create a covariateSettings object for constructing text representation (Triton) covariates from the note table in the OMOP CDM. Possible representations: text statistics (TextStats), Bag-of-Words (BoW; binary, frequency, or TF-IDF), topic models (TopicModel), and averaged embeddings (DocEmb) using trained models.

Usage

createTritonCovariateSettings(
  useNoteData = TRUE,
  startDay = -30,
  endDay = 0,
  idrange = NULL,
  parallel = FALSE,
  analysisId = 999,
  note_databaseschema = NULL,
  note_tablename = "note",
  note_customWhere = "",
  pipe_preprocess_function = NULL,
  pipe_tokenizer_function = "word",
  pipe_ngrams = 1,
  pipe_saveVocab = FALSE,
  pipe_outputFolder = NULL,
  filter_stopwords = NULL,
  filter_custom_regex = NULL,
  filter_vocab_term_max = NULL,
  filter_term_count_min = NULL,
  filter_term_count_max = NULL,
  filter_doc_count_min = NULL,
  filter_doc_count_max = NULL,
  filter_doc_proportion_max = NULL,
  filter_doc_proportion_min = NULL,
  representations = c("TextStats"),
  BoW_type = c("binary"),
  BoW_validationVarImpTable = NULL,
  DocEmb_word_embeddings = NULL,
  TopicModel_type = c("lsa"),
  TopicModel_model = NULL,
  covariateDataSave = "",
  covariateDataLoad = ""
)
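
As an illustration of how the arguments fit together, the following is a minimal sketch of a call that requests Bag-of-Words and text statistics covariates built from unigrams and bigrams. Every argument name comes from the signature above; the specific values (the stopword vector, the 0.9 proportion cut-off) are arbitrary choices made for this example, and the call is only executed when the Triton package is installed.

```r
# Arguments are collected in a list first so they can be inspected or reused.
# All names are taken from the Usage signature; the values are illustrative only.
tritonArgs <- list(
  startDay = -30,
  endDay = 0,
  note_tablename = "note",
  pipe_preprocess_function = tolower,
  pipe_tokenizer_function = "word",
  pipe_ngrams = c(1, 2),
  filter_stopwords = c("the", "and", "of"),  # hand-picked example stopwords
  filter_doc_proportion_max = 0.9,           # arbitrary example cut-off
  representations = c("BoW", "TextStats"),
  BoW_type = c("binary", "tfidf")
)

# Only call the function when the package is available.
if (requireNamespace("Triton", quietly = TRUE)) {
  covariateSettings <- do.call(Triton::createTritonCovariateSettings, tritonArgs)
}
```

The resulting covariateSettings object would then be passed to the OHDSI FeatureExtraction package, as described under Value.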

Arguments

startDay

integer; start day, relative to the index date, of the window for which the text representations are computed. Default is -30.

endDay

integer; end day, relative to the index date, of the window for which the text representations are computed. Default is 0.

idrange

(optional) integer vector; range of integers that can be used to generate the covariate IDs; the maximum is 2147482. Default is c(1,2147482).

parallel

logical; whether multi-threading should be used (not available on Windows). Default is FALSE.

note_databaseschema

character; database schema other than the one passed through FeatureExtraction. Default is NULL.

note_tablename

character; note table name, provide if different from the OMOP CDM default. Default is "note".

note_customWhere

(optional) character; a SQL WHERE clause to filter the notes that are imported. Example: "WHERE note_source_value='communication'". Default is "".

pipe_preprocess_function

function; to preprocess the strings before tokenization. Default is tolower.

pipe_tokenizer_function

character or function; to tokenize the strings. Default is the quanteda tokenizer (tokens) with argument "word". Other possible arguments are "fasterword", "fastestword", "sentence", and "character". A custom tokenizer function can also be provided; it should take the document strings as input and return a list of character vectors (tokens).
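
A custom tokenizer only needs to satisfy the contract described above: document strings in, a list of character vectors out. The sketch below is one possible implementation in base R; the name whitespace_tokenizer and the whitespace splitting rule are assumptions made for illustration, not part of Triton.

```r
# Hypothetical custom tokenizer: splits each document on runs of whitespace.
# Input:  character vector of documents.
# Output: list with one character vector of tokens per document.
whitespace_tokenizer <- function(docs) {
  tokens <- strsplit(docs, "[[:space:]]+")
  # strsplit() yields an empty first token when a document starts with whitespace
  lapply(tokens, function(tok) tok[nzchar(tok)])
}

example_tokens <- whitespace_tokenizer(c("Patient denies chest pain", "  No fever "))
```

Such a function could then be supplied as pipe_tokenizer_function = whitespace_tokenizer.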

pipe_ngrams

integer vector; specifying the number of elements to be concatenated in each n-gram. For example: c(1,2) creates all unigrams and bigrams; c(1:3) creates all unigrams, bigrams, and trigrams. Default is 1: unigrams only.
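
To make the effect of c(1,2) concrete, the following base R sketch builds unigrams and bigrams from a single tokenized document. The helper make_ngrams and the "_" separator are assumptions for this example; the n-gram construction and concatenator that Triton uses internally may differ.

```r
# Hypothetical helper: builds all n-grams of the requested sizes from one
# token vector, joining the tokens of each n-gram with `sep`.
make_ngrams <- function(tokens, n, sep = "_") {
  unlist(lapply(n, function(k) {
    if (length(tokens) < k) return(character(0))  # document too short for this size
    starts <- seq_len(length(tokens) - k + 1)
    vapply(starts,
           function(i) paste(tokens[i:(i + k - 1)], collapse = sep),
           character(1))
  }))
}

ngrams <- make_ngrams(c("chest", "pain", "resolved"), n = c(1, 2))
# ngrams holds the three unigrams followed by "chest_pain" and "pain_resolved"
```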

pipe_saveVocab

logical; option to save the generated vocabulary as an .rds file in the outputFolder. Default is FALSE.

pipe_outputFolder

(optional) character; file path and name for saving output files. Default is NULL.

filter_stopwords

character vector; list of stopwords that will be removed. Default is NULL. See stopwords for generating stopwords.

filter_custom_regex

(optional) character; regular expression (regex) that selects tokens that will be removed. Default is NULL.

filter_vocab_term_max

integer; maximum number of terms in the vocabulary; the most frequent terms are kept. Default is NULL.

filter_term_count_min

integer; minimum number of occurrences of a term over all documents. Default is NULL.

filter_term_count_max

integer; maximum number of occurrences of a term over all documents. Default is NULL.

filter_doc_count_min

integer; a term is kept when the number of documents that contain it is larger than this value. Default is NULL.

filter_doc_count_max

integer; a term is kept when the number of documents that contain it is lower than this value. Default is NULL.

filter_doc_proportion_max

numeric; maximum proportion (between 0 and 1) of documents that may contain the term. Default is NULL.

filter_doc_proportion_min

numeric; minimum proportion (between 0 and 1) of documents that must contain the term. Default is NULL.
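
The document-level filters can be pictured with a small base R sketch that computes, for each term, the proportion of documents containing it and keeps the terms inside a given range. The toy documents, the thresholds, and the use of inclusive bounds are all assumptions for illustration; how Triton treats values exactly on a boundary is not stated here.

```r
# Toy corpus: one character vector of tokens per document.
docs_tokens <- list(
  c("chest", "pain"),
  c("pain", "fever"),
  c("pain"),
  c("cough")
)

# Count each term at most once per document, then convert to a proportion.
doc_count <- table(unlist(lapply(docs_tokens, unique)))
doc_prop <- doc_count / length(docs_tokens)

# Example thresholds (assumed inclusive), analogous to
# filter_doc_proportion_min = 0.25 and filter_doc_proportion_max = 0.5.
kept <- names(doc_prop)[doc_prop >= 0.25 & doc_prop <= 0.5]
```

In this toy corpus "pain" appears in three of four documents, so it falls above the maximum and is dropped, while the remaining terms are kept.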

representations

character vector; text representations that should be constructed, choose from "TextStats" (default), "BoW", "TopicModel", and "DocEmb". Multiple representations can be constructed at once: c("BoW","TextStats").

BoW_type

character vector; BoW types to be constructed, choose from "binary" (default), "frequency", and "tfidf". Multiple types can be constructed at once: c("binary","frequency").
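
The three BoW types differ only in how a term's presence in a document is weighted. The base R sketch below computes all three for a toy corpus. The TF-IDF variant shown (raw term count multiplied by log(N / document frequency)) is one common definition and is an assumption here; Triton's exact weighting scheme may differ.

```r
# Toy corpus: one character vector of tokens per document.
docs_tokens <- list(c("pain", "pain", "fever"), c("cough"))
vocab <- sort(unique(unlist(docs_tokens)))

# "frequency": number of times each term occurs in each document (docs x terms).
frequency <- t(vapply(
  docs_tokens,
  function(tok) as.numeric(table(factor(tok, levels = vocab))),
  numeric(length(vocab))
))
colnames(frequency) <- vocab

# "binary": 1 if the term occurs in the document, 0 otherwise.
binary <- (frequency > 0) * 1

# "tfidf" (assumed variant): term count weighted by inverse document frequency.
idf <- log(nrow(frequency) / colSums(binary))
tfidf <- sweep(frequency, 2, idf, `*`)
```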

BoW_validationVarImpTable

(optional) data.frame; used for validation of a model with bag-of-word covariates. A varImp data.frame with the covariate names and covariate values of a trained model. The varImp data.frame can be found in plpResult$model$varImp or plpModel$varImp.

DocEmb_word_embeddings

(optional) character; name of a data.frame loaded in the R environment that contains the word embeddings. The first column must contain the word; the other n-1 columns contain the embedding values.
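
The expected layout of the embeddings data.frame, and what an averaged document embedding means, can be sketched in base R as follows. The toy values, the column names dim1 and dim2, the helper average_embedding, and the choice to ignore tokens without an embedding are assumptions for this example, not Triton's documented behaviour.

```r
# Toy embeddings: first column holds the word, remaining columns the vector.
word_embeddings <- data.frame(
  word = c("pain", "fever", "cough"),
  dim1 = c(1, 3, 5),
  dim2 = c(0, 2, 4),
  stringsAsFactors = FALSE
)

# Hypothetical helper: averages the embedding vectors of a document's tokens.
# Tokens that have no embedding produce NA rows, which na.rm = TRUE ignores.
average_embedding <- function(tokens, embeddings) {
  rows <- embeddings[match(tokens, embeddings[[1]]), -1, drop = FALSE]
  colMeans(rows, na.rm = TRUE)
}

doc_vector <- average_embedding(c("pain", "fever", "unknown"), word_embeddings)
```

Here doc_vector is the mean of the "pain" and "fever" vectors, since "unknown" has no embedding.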

TopicModel_type

character vector; todo.

TopicModel_model

character; name of a topic model object loaded in the R environment.

covariateDataSave

(optional) character; location and file name where the created covariateData should be stored.

covariateDataLoad

(optional) character; location and file name from which previously created covariateData should be loaded. When this is set, all other settings are ignored; the covariateData is simply loaded and returned.

useNoteData

logical; option to disable the creation of text representation covariates. Default is TRUE.

Value

A covariateSettings object that can be used by the OHDSI FeatureExtraction package.


mi-erasmusmc/Triton documentation built on Feb. 15, 2022, 10:37 a.m.