```r
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```
The nlpembeds package provides efficient methods to compute co-occurrence matrices, pointwise mutual information (PMI) and singular value decomposition (SVD) embeddings. In the biomedical and clinical setting, one challenge is the huge size of databases of electronic health records (EHRs), e.g. when analyzing the data of millions of patients over tens of years. To address this, the package provides functions to efficiently compute monthly co-occurrence matrices, which is the computational bottleneck of the analysis, by using the RcppAlgos package and sparse matrices. Furthermore, the functions can be called on SQL databases, enabling the computation of co-occurrence matrices from tens of gigabytes of data, representing millions of patients over tens of years. PMI-SVD embeddings are extensively used, e.g. in Hong C. (2021).
Co-occurrence, PMI and SVD are methods to produce vector representations of concepts, i.e. embeddings, similar to the popular word2vec method [@mikolov2013efficient]. However, while word2vec is highly parallelized, it may lack stability, i.e. running the same computation twice can produce significantly different results. To improve the stability one can increase the number of iterations, but it may be preferable to compute co-occurrence, PMI (or more precisely, single-shift positive PMI), and SVD [@levy2014neural].
The nlpembeds package aims to provide efficient methods to compute co-occurrence matrices and PMI-SVD. In the biomedical and clinical settings, one challenge is the huge size of databases, e.g. when analyzing data of millions of patients over tens of years. To address this, the nlpembeds package provides functions to efficiently compute monthly co-occurrence matrices, which is the computational bottleneck of the analysis, by using the RcppAlgos package and sparse matrices. Furthermore, the functions can be called on SQL databases, enabling the computation of co-occurrence matrices from tens of gigabytes of data.
\newpage
Consider the following data object:
```r
library(nlpembeds)

df_ehr = data.frame(Patient = c(1, 1, 2, 1, 2, 1, 1, 3, 4),
                    Month = c(1, 1, 1, 2, 2, 3, 3, 4, 4),
                    Parent_Code = c('C1', 'C2', 'C2', 'C1', 'C1', 'C1', 'C2', 'C3', 'C4'),
                    Count = 1:9)
df_ehr
```
This represents our patient-level data: `Patient` is the patient identifier, `Parent_Code` is the diagnosis code that was recorded (or medication prescribed, etc.), `Month` is the year/month of the diagnosis, and `Count` is the number of times the diagnosis was recorded during that month.
We can compute the monthly co-occurrence matrix, which gives us:
```r
spm_cooc = build_df_cooc(df_ehr)
spm_cooc
```
What this matrix represents is the following: the monthly co-occurrence between two codes is defined as the minimum of their two counts within a given month. The co-occurrences are first computed for each patient and each month, and then sum-aggregated.
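As a quick sanity check (plain base R, not part of the package workflow), the C1-C2 entry of the matrix above can be reproduced directly from this definition:

```r
# Hand computation of the C1-C2 co-occurrence: for each patient-month where
# both codes appear, take the minimum of the two counts, then sum them up
c1 = subset(df_ehr, Parent_Code == 'C1')[c('Patient', 'Month', 'Count')]
c2 = subset(df_ehr, Parent_Code == 'C2')[c('Patient', 'Month', 'Count')]
both = merge(c1, c2, by = c('Patient', 'Month'))
sum(pmin(both$Count.x, both$Count.y))  # 7 = min(1, 2) + min(6, 7)
```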
\newpage
Here, let's consider the co-occurrence between codes C1 and C2, and decompose its computation to make it easy to follow. The two codes co-occurred in months 1 and 3 for patient 1, so we can decompose the co-occurrence matrix as follows:
```r
cooc_1 = build_df_cooc(subset(df_ehr, Patient == 1 & Month == 1), min_code_freq = 0)
cooc_1
```
```r
cooc_2 = build_df_cooc(subset(df_ehr, Patient == 1 & Month == 3))
cooc_2
```
When summed up, we get a co-occurrence of 7 between C1 and C2:
```r
cooc_1 + cooc_2
```
Note that although codes C3 and C4 were both present during month 4, they did not occur in the same patient, so their co-occurrence is 0.
\newpage
We can then call PMI on that object:
```r
m_pmi = get_pmi(spm_cooc)
m_pmi
```
Here, the PMI between codes C1 and C2 is `log(7 * 59 / ((16 + 7) * (12 + 7)))`: the co-occurrence of the two codes (7) multiplied by the total sum of the co-occurrence matrix (59), divided by the product of the marginal sums of the two codes (16 + 7 for C1, 12 + 7 for C2). Similarly, the PMI of code C1 with itself is `log(16 * 59 / (16 + 7) ^ 2)`.
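As a quick manual check (plain base R, with the marginal sums written out by hand from the co-occurrence matrix above), these two values can be recomputed as follows:

```r
# Manual check of the two PMI values above: marginal (row) sums of the
# symmetric co-occurrence matrix, its total sum, and the PMI formula
row_sums = c(C1 = 16 + 7, C2 = 12 + 7, C3 = 8, C4 = 9)
total = sum(row_sums)                                   # 59
log(7 * total / (row_sums[['C1']] * row_sums[['C2']]))  # PMI(C1, C2)
log(16 * total / row_sums[['C1']] ^ 2)                  # PMI(C1, C1)
```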
I acknowledge here Dr. Doudou Zhou and Dr. Ziming Gan for their contributions to the implementation of the PMI and SVD computations.
We then call SVD on the PMI matrix, and can feed the `m_svd` object (i.e. the embedding matrix) to downstream analyses, such as knowledge graphs.
```r
m_svd = get_svd(m_pmi, embedding_dim = 2)
m_svd
```
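As a hedged illustration of such downstream use (not part of the package workflow, and assuming the rows of `m_svd` correspond to codes), one can for instance compute cosine similarities between the code embeddings:

```r
# Cosine similarity between code embeddings: normalize each row to unit
# length, then take the matrix of dot products
m_norm = m_svd / sqrt(rowSums(m_svd ^ 2))
m_cos = m_norm %*% t(m_norm)
round(m_cos, 2)
```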
The SVD is computed with randomized SVD [@erichson2016randomized], an efficient approximation of truncated SVD in which only the first principal components are returned, using the rsvd package. Its author suggests that, for it to be efficient, the number of requested dimensions `k` should satisfy `k < n / 4`, where `n` is the number of features, and that otherwise one should rather use either full SVD or truncated SVD. Furthermore, we select only the positive principal vectors (further details in the documentation of the function).
For EHR data, I recommend an SVD rank between 1000 and 2000 when analyzing codified and natural language processing (NLP) data (15k - 25k features), and fewer dimensions if only analyzing codified data (300 - 800 dimensions for 5k - 10k codified features). Known pairs can be used to compute AUCs and select the best performing rank, e.g. with the `fit_embeds_kg` function from the kgraph package or with the kgraph Shiny app. The kgraph package also includes a dataset of known pairs of NLP concepts derived from PrimeKG [@chandak2022building] (further introduced in its vignette).
When selecting the number of dimensions, one should balance two objectives: a higher AUC and a lower number of dimensions, as too many dimensions can lead to overfitting while fewer dimensions act as a regularization. The aim is thus to identify the range beyond which increasing the number of dimensions no longer significantly increases the AUC. For generalization and reproducibility in other cohorts, it may be better to accept a slightly smaller AUC with a lower number of dimensions. As an example of an edge case, if the AUC is 0.732 at 1000 dimensions and 0.75 at 2000 dimensions, in my opinion either choice would be fine, and perhaps a more thorough investigation could be performed.
\newpage
To efficiently compute co-occurrence matrices from large databases of tens of gigabytes, two important optimizations are available: batching by patients, and parallelization over multiple cores. To perform batching by patients, we want to have our input as a SQL file, which we read by batches of patients, aggregating the results. This is performed with the `sql_cooc` function.
To demonstrate how it is used, we first write the example object above as a SQL database. It is important to name the columns and the SQL table as here (i.e. four columns `Patient`, `Month`, `Parent_Code`, `Count`, and table name `df_monthly`). Also, we index the table by patients, and write the unique codes as a second table `df_uniq_codes` (this is optional: it is done automatically by `sql_cooc` if the table `df_uniq_codes` is not found in the input SQL file and the `autoindex` parameter is set to TRUE). It is a good idea to also order your table by patients before writing it, as it will make read accesses more efficient; some of my experiments showed it can make the computation up to 4 times faster (a minimal sketch of this optional step follows the setup code below).
```r
library(RSQLite)

test_db_path = tempfile()
test_db = dbConnect(SQLite(), test_db_path)
dbWriteTable(test_db, 'df_monthly', df_ehr, overwrite = TRUE)

###
# optional, done automatically by sql_cooc if table 'df_uniq_codes' not found
# and parameter autoindex set to TRUE
dbExecute(test_db, "CREATE INDEX patient_idx ON df_monthly (Patient)")

df_uniq_codes = unique(df_ehr['Parent_Code'])
dbWriteTable(test_db, 'df_uniq_codes', df_uniq_codes, overwrite = TRUE)
###

dbDisconnect(test_db)
```
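As mentioned above, ordering by patient before writing the table can speed up the batched reads; a minimal sketch of that optional step:

```r
# Optional: order the data frame by patient before writing it, so that each
# batch is read from a contiguous range of rows
df_ehr_sorted = df_ehr[order(df_ehr$Patient), ]
# then pass df_ehr_sorted to the dbWriteTable() call above
```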
We can then call the `sql_cooc` function on that filename, and specify a new filename for the output. For each batch, information about the current memory usage is printed, which can be helpful for debugging memory issues and getting an idea of the computation progress from the log file.
```r
output_db_path = tempfile()
sql_cooc(input_path = test_db_path, output_path = output_db_path)
```
\newpage
Once the computation is complete, we read the co-occurrence sparse matrix from the output SQL file.
```r
test_db = dbConnect(SQLite(), output_db_path)
spm_cooc = dbGetQuery(test_db, 'select * from df_monthly;')
dbDisconnect(test_db)
spm_cooc
```
As previously, we can then feed it to `get_pmi`.
```r
m_pmi = get_pmi(spm_cooc)
m_pmi
```
Or transform it into a classic matrix.
```r
spm_cooc = build_spm_cooc_sym(spm_cooc)
m_cooc = as.matrix(spm_cooc)
m_cooc
```
That is the minimum call. Two important parameters come into play here:

- `n_batch`: the number of patients to include in each batch
- `n_cores`: the number of cores on which the computation will be parallelized

For both parameters, the higher the value, the faster the computation and the more RAM required. The user can fine-tune these parameters to fit the machine's specifications and optimize the computation (an illustrative call follows below). As an example, `n_batch = 300` and `n_cores = 6` should be able to run on 16 GB machines if the number of codes is not too large (e.g. if only considering rolled-up codified data).
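For illustration, such a call could look like the following sketch (reusing the example database above; adjust the values to your machine):

```r
# Example tuning: batches of 300 patients parallelized over 6 cores,
# which should fit on a ~16 GB machine for rolled-up codified data
sql_cooc(input_path = test_db_path, output_path = output_db_path,
         n_batch = 300, n_cores = 6, overwrite_output = TRUE)
```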
\newpage
When the number of unique codes is large, e.g. when analyzing NLP concepts, one may want to first subset the codes by providing a dictionary of codes to include.
Three parameters are available here:

- `exclude_code_pattern`: pattern of code prefixes to exclude (will be used in SQL, appended with '%'). For example, 'AB'.
- `exclude_dict_pattern`: used in combination with the parameter `codes_dict`. Pattern of code prefixes to exclude, except if they are found in `codes_dict` (will be used in grep, prefixed with '^'). For example, 'C[0-9]'.
- `codes_dict_fpaths`: used in combination with `exclude_dict_pattern`. Filepaths of the files defining the codes to avoid excluding with `exclude_dict_pattern`. The first column of each file must define the code identifiers.

To demonstrate the behavior, we rename the first two codes in the object above to two NLP concepts present in the dictionaries provided with the package.
```r
library(magrittr)

df_ehr$Parent_Code %<>% ifelse(. == 'C1', 'C0000545', .)
df_ehr$Parent_Code %<>% ifelse(. == 'C2', 'C0000578', .)
df_ehr
```
We write it as a SQL file.
```r
test_db_path = tempfile()
test_db = dbConnect(SQLite(), test_db_path)
dbWriteTable(test_db, 'df_monthly', df_ehr)
dbDisconnect(test_db)
```
The function is then called like this:
```r
codes_dict_fpaths = list.files(system.file('dictionaries', package = 'nlpembeds'),
                               full.names = TRUE)

sql_cooc(input_path = test_db_path, output_path = output_db_path,
         exclude_dict_pattern = 'C[0-9]', codes_dict_fpaths = codes_dict_fpaths,
         autoindex = TRUE, overwrite_output = TRUE)
```
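The `exclude_code_pattern` parameter, not used above, works similarly; for example (a hedged sketch, 'AB' being an arbitrary prefix as in the parameter description):

```r
# Additionally exclude all codes whose identifiers start with 'AB'
# (the prefix is appended with '%' in the underlying SQL query)
sql_cooc(input_path = test_db_path, output_path = output_db_path,
         exclude_code_pattern = 'AB',
         exclude_dict_pattern = 'C[0-9]', codes_dict_fpaths = codes_dict_fpaths,
         autoindex = TRUE, overwrite_output = TRUE)
```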
\newpage
Finally, we read the co-occurrence sparse matrix from that SQL file.
```r
test_db = dbConnect(SQLite(), output_db_path)
spm_cooc = dbGetQuery(test_db, 'select * from df_monthly;')
dbDisconnect(test_db)
spm_cooc
```
As previously, we can then feed it to `get_pmi`.
```r
m_pmi = get_pmi(spm_cooc)
m_pmi
```
Or transform it into a classic matrix.
```r
spm_cooc = build_spm_cooc_sym(spm_cooc)
m_cooc = as.matrix(spm_cooc)
m_cooc
```
\newpage
If using a CentOS server, nlpembeds may not install correctly, and you may need to use v2.4.0 of RcppAlgos. It can be installed this way, before installing nlpembeds:
```r
# remotes::install_git('https://github.com/jwood000/RcppAlgos.git', ref = 'v2.4.0')
```
The `n_batch` patients are parallelized over the `n_cores` available, so `n_batch` needs to be larger than `n_cores`, and it is best to have at least 2-3 patients per core. The aggregated sparse matrix will grow logarithmically in size, so if you are close to the RAM limit your job may fail only after several hours.

The number of codes has a quadratic impact on memory, so depending on the number of codes considered you can optimize `n_batch` and `n_cores`. Assuming 250 GB of RAM available and ~30k codes, I would recommend `n_batch = 75` and `n_cores = 25`, and estimate that computing the co-occurrence matrix for ~60k patients should take about 3 hours. You should be able to scale the parameters more or less linearly with the amount of RAM, meaning that with 500 GB available you could use `n_batch = 150` and `n_cores = 50`, and it should run in about 1h30. If only considering 10-15k codes, you can most likely use `n_batch = 600` and `n_cores = 20`. If considering more than 30k codes, you will need to reduce `n_batch` and/or `n_cores`.
\newpage