View source: R/CreateTritonSettings.R
Create a covariateSettings object for constructing text representation (Triton) covariates from the notes table in the OMOP CDM.
Possible representations: text statistics (TextStats), Bag-of-Words (BoW; binary, frequency, TFIDF), Topic Models (TopicModel), and averaged embeddings (DocEmb) using trained models.
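To make the three Bag-of-Words variants concrete, the following is a minimal base R sketch on a toy corpus. It is an illustration only: the whitespace tokenizer, the variable names, and the `log(N / document frequency)` weighting are assumptions, and the package's own implementation may differ.

```r
# Toy corpus: three short "notes", already lower-cased (hypothetical data).
notes <- c(
  d1 = "chest pain and cough",
  d2 = "cough cough fever",
  d3 = "no chest pain"
)

# Whitespace tokenization, a stand-in for the package's tokenizer.
tokens <- strsplit(notes, "\\s+")
vocab <- sort(unique(unlist(tokens)))

# Frequency BoW: one row per document, one column per term.
counts <- vapply(
  tokens,
  function(tok) tabulate(match(tok, vocab), nbins = length(vocab)),
  integer(length(vocab))
)
frequency <- t(counts)
colnames(frequency) <- vocab

# Binary BoW: 1 if the term occurs in the document at all.
binary <- (frequency > 0) * 1

# TF-IDF (assumed weighting): term frequency times log(N / document frequency).
docFreq <- colSums(binary)
idf <- log(nrow(frequency) / docFreq)
tfidf <- sweep(frequency, 2, idf, `*`)
```

Here `frequency["d2", "cough"]` counts both occurrences of the term, `binary` only records its presence, and `tfidf` down-weights terms that occur in many documents.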
createTritonCovariateSettings(
useNoteData = TRUE,
startDay = -30,
endDay = 0,
idrange = NULL,
parallel = FALSE,
analysisId = 999,
note_databaseschema = NULL,
note_tablename = "note",
note_customWhere = "",
pipe_preprocess_function = NULL,
pipe_tokenizer_function = "word",
pipe_ngrams = 1,
pipe_saveVocab = FALSE,
pipe_outputFolder = NULL,
filter_stopwords = NULL,
filter_custom_regex = NULL,
filter_vocab_term_max = NULL,
filter_term_count_min = NULL,
filter_term_count_max = NULL,
filter_doc_count_min = NULL,
filter_doc_count_max = NULL,
filter_doc_proportion_max = NULL,
filter_doc_proportion_min = NULL,
representations = c("TextStats"),
BoW_type = c("binary"),
BoW_validationVarImpTable = NULL,
DocEmb_word_embeddings = NULL,
TopicModel_type = c("lsa"),
TopicModel_model = NULL,
covariateDataSave = "",
covariateDataLoad = ""
)
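A call that overrides only a few of these arguments could look like the sketch below. The argument names are taken from the usage above; the package name `Triton` and the value `filter_doc_count_min = 5` are assumptions made for illustration. The call is guarded so that the snippet also runs when the package is not installed.

```r
# Arguments not listed here keep the defaults shown in the usage above.
tritonArgs <- list(
  startDay = -30,                  # window start relative to the index date
  endDay = 0,                      # window end relative to the index date
  note_tablename = "note",         # OMOP CDM default note table
  pipe_ngrams = 1,                 # unigrams only
  filter_doc_count_min = 5,        # hypothetical threshold, for illustration
  representations = c("TextStats")
)

# Only attempt the call when the package (name assumed) is available.
if (requireNamespace("Triton", quietly = TRUE)) {
  covariateSettings <- do.call(
    Triton::createTritonCovariateSettings,
    tritonArgs
  )
}
```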
startDay
integer; start day before the index date for which the text representations have to be computed. Default is -30.
endDay
integer; end day before the index date for which the text representations have to be computed. Default is 0.
idrange
(optional) integer vector; specifying the range of integers that can be used to generate the covariateids, max is 2147482. Default is NULL.
parallel
logical; indicates whether multi-threading should be used (not on Windows). Default is FALSE.
note_databaseschema
character; database schema other than the one passed through FeatureExtraction. Default is NULL.
note_tablename
character; note table name, provide if different from the OMOP CDM default. Default is "note".
note_customWhere
(optional) character; a SQL WHERE statement to filter the note import. Example: "WHERE note_source_value='communication'". Default is "".
pipe_preprocess_function
function; to preprocess the strings before tokenization. Default is NULL.
pipe_tokenizer_function
character or function; to tokenize the strings. Default is the quanteda tokenizer ("word").
pipe_ngrams
integer vector; specifying the number of elements to be concatenated in each ngram. Default is 1.
pipe_saveVocab
logical; option to save the generated vocabulary as an rds file in the outputFolder. Default is FALSE.
pipe_outputFolder
(optional) character; file path and name for saving output files. Default is NULL.
filter_stopwords
character vector; list of stopwords that will be removed. Default is NULL.
filter_custom_regex
(optional) character; regular expression (regex) that selects tokens that will be removed. Default is NULL.
filter_vocab_term_max
integer; maximum number of terms in the vocabulary, keeping the most frequent terms. Default is NULL.
filter_term_count_min
integer; minimum number of occurrences over all documents. Default is NULL.
filter_term_count_max
integer; maximum number of occurrences over all documents. Default is NULL.
filter_doc_count_min
integer; a term will be kept when the number of documents that contain it is larger than this value. Default is NULL.
filter_doc_count_max
integer; a term will be kept when the number of documents that contain it is lower than this value. Default is NULL.
filter_doc_proportion_max
numeric; maximum proportion (0 to 1) of documents which should contain the term. Default is NULL.
filter_doc_proportion_min
numeric; minimum proportion (0 to 1) of documents which should contain the term. Default is NULL.
representations
character vector; text representations that should be constructed, choose from TextStats, BoW, TopicModel, and DocEmb. Default is c("TextStats").
BoW_type
character vector; BoW types to be constructed, choose from binary, frequency, and TFIDF. Default is c("binary").
BoW_validationVarImpTable
(optional) data.frame; used for validation of a model with bag-of-words covariates. A varImp data.frame with the covariate names and covariate values of a trained model. The varImp data.frame can be found in plpResult$model$varImp or plpModel$varImp.
DocEmb_word_embeddings
(optional) character; name of a data.frame loaded in the R environment that contains the word embeddings. The first column must contain the word; the other n-1 columns contain the embedding values.
TopicModel_type
character vector; todo.
TopicModel_model
character; name of a topic model object loaded in the R environment.
covariateDataSave
(optional) character; location and file name where the created covariateData must be stored.
covariateDataLoad
(optional) character; location and file name from which the created covariateData must be loaded. All other settings are ignored; only the covariateData is loaded and returned.
useTextData
logical; option to disable the creation of text representation covariates.
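The document-count and document-proportion filters can be illustrated with a small base R sketch. The matrix, the thresholds, and the variable names are hypothetical. The comparison for `filter_doc_count_min` follows the wording above ("larger than this value"), and treating the proportion maximum as inclusive is an assumption, so the package may handle boundary cases differently.

```r
# Document-term counts for a toy corpus (rows: documents, columns: terms).
dtm <- matrix(
  c(1, 1, 0,
    1, 0, 0,
    1, 1, 1,
    1, 0, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("doc", 1:4), c("pain", "cough", "fever"))
)

# Number and proportion of documents that contain each term.
docCount <- colSums(dtm > 0)
docProportion <- docCount / nrow(dtm)

# Hypothetical thresholds mirroring the argument names.
filter_doc_count_min <- 1
filter_doc_proportion_max <- 0.9

# Keep terms found in more than one document but in at most 90% of them.
keep <- docCount > filter_doc_count_min &
  docProportion <= filter_doc_proportion_max
filtered <- dtm[, keep, drop = FALSE]
```

With these thresholds, "pain" is dropped because it occurs in every document, "fever" because it occurs in only one, and "cough" is kept.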
A covariateSettings object that can be used by the OHDSI FeatureExtraction package.
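As an illustration of the layout expected for DocEmb_word_embeddings (first column the word, remaining columns the embedding values), the sketch below averages word vectors into one document vector. The data, the helper `averageEmbedding`, and the choice to skip unknown words are assumptions for illustration, not the package's implementation.

```r
# Hypothetical embedding table: first column holds the word,
# the remaining columns hold the embedding values.
wordEmbeddings <- data.frame(
  word = c("chest", "pain", "cough"),
  V1 = c(1, 3, 0),
  V2 = c(0, 2, 4),
  stringsAsFactors = FALSE
)

# Hypothetical helper: average the embeddings of the tokens in a document.
averageEmbedding <- function(tokens, embeddings) {
  # Tokens without an embedding are skipped (an assumption).
  rows <- match(tokens, embeddings[[1]])
  rows <- rows[!is.na(rows)]
  if (length(rows) == 0) {
    return(rep(NA_real_, ncol(embeddings) - 1))
  }
  colMeans(as.matrix(embeddings[rows, -1, drop = FALSE]))
}

docVector <- averageEmbedding(c("chest", "pain", "today"), wordEmbeddings)
```

In this example "today" has no embedding, so `docVector` is the mean of the vectors for "chest" and "pain".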