```r
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```
Word embeddings are a popular approach to unsupervised learning of word relationships and are widely used in natural language processing. `cui2vec` was created to learn embeddings for medical concepts from an extremely large collection of multimodal medical data: an insurance claims database covering 60 million members, a collection of 20 million clinical notes, and 1.7 million full-text biomedical journal articles. Combining these sources embeds concepts into a common space, resulting in the largest ever set of embeddings, covering 108,477 medical concepts. See our preprint [@Beam2018-vl] for more information.
In this vignette, we'll walk through the core steps of `cui2vec`. Start by loading the package:
```r
library(cui2vec)
```
For this vignette, we'll focus on a collection of 20 million clinical notes that have been preprocessed using NILE. `term_cooccurrence_matrix.rda` contains a term co-occurrence matrix (TCM) for all pairwise combinations of CUIs (concept unique identifiers) for a subsample of 100 CUIs out of 18,000+. `singleton_counts.rda` contains the raw count of each term in the vocabulary. Both are needed for `cui2vec` to work. For now, we'll assume you have a TCM and singleton counts for your corpus of interest.
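If you're unsure what these two objects should look like, here is a toy sketch of their general shape. The CUIs and counts are hypothetical, the exact column layout `cui2vec` expects may differ, and this is not a substitute for the NILE preprocessing:

```r
# Toy illustration of the expected input shapes (hypothetical CUIs and
# counts, not output of the real NILE pipeline):
cuis <- c("C0011849", "C0020538", "C0027051")

# A TCM is a symmetric matrix of pairwise co-occurrence counts,
# with CUIs as row and column names.
toy_tcm <- matrix(
  c( 0, 12,  5,
    12,  0,  8,
     5,  8,  0),
  nrow = 3, dimnames = list(cuis, cuis)
)

# Singleton counts record how often each CUI occurs on its own.
toy_singletons <- data.frame(cui = cuis, count = c(40, 55, 30))
```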
```r
# denominator in the PMI calculation
N <- 261397
load('term_cooccurrence_matrix.rda')
load('singleton_counts.rda')
```
The first step in the `cui2vec` algorithm is to construct the Pointwise Mutual Information (PMI) matrix:

```r
pmi <- construct_pmi(term_cooccurrence_matrix, singleton_counts, N)
pmi[1:5, 1:3]
```
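Under the hood, PMI compares how often a pair of concepts co-occurs against what their individual frequencies would predict under independence. Here is a hand computation for a single hypothetical pair using the standard PMI definition; `construct_pmi`'s internals may differ in details such as smoothing:

```r
# Standard PMI for one pair: pmi(x, y) = log( p(x, y) / (p(x) * p(y)) )
N    <- 261397  # denominator from the chunk above
c_xy <- 120     # hypothetical co-occurrence count of the pair
c_x  <- 900     # hypothetical singleton count of x
c_y  <- 1500    # hypothetical singleton count of y

pmi_xy <- log((c_xy / N) / ((c_x / N) * (c_y / N)))
pmi_xy  # > 0: the pair co-occurs more often than chance would predict
```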
Next, you need to construct the Shifted Positive Pointwise Mutual Information (SPPMI) matrix:
```r
sppmi <- construct_sppmi(pmi)
sppmi[1:5, 1:5]
```
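SPPMI is obtained from PMI by subtracting log(k), where k plays the role of the number of negative samples in the equivalent word2vec factorization, and clipping negative values to zero. A sketch on a few made-up PMI values; the k here is illustrative and `construct_sppmi`'s default may differ:

```r
# SPPMI: shift each PMI value down by log(k), then clip at zero.
k <- 1  # illustrative shift constant
pmi_vals   <- c(2.3, 0.4, -1.7)  # made-up PMI values
sppmi_vals <- pmax(pmi_vals - log(k), 0)
sppmi_vals  # negative entries become 0, keeping the matrix sparse
```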
Finally, you can fit `cui2vec` embeddings using `construct_word2vec_embedding`. We'll keep this example small and only fit 20-dimensional embeddings.

```r
w2v_embedding <- construct_word2vec_embedding(sppmi = sppmi, dim_size = 20, iters = 50)
w2v_embedding[1:5, 1:5]
```
We can also run PCA on the term co-occurrence matrix. We'll refer to these as PCA embeddings.

```r
pca_embedding <- construct_pca_embedding(term_cooccurrence_matrix, dim_size = 20)
pca_embedding[1:5, 1:5]
```
Another baseline we can consider is GloVe:

```r
glove_embedding <- construct_glove_embedding(term_cooccurrence_matrix, dim_size = 20, iters = 10)
glove_embedding[1:5, 1:5]
```
To run the benchmarks in our paper, we need some additional information about the vectors in our embedding space. Each vector has a CUI, but we also need the UMLS semantic type associated with each CUI. We also assume there is a string with the English description of the CUI. You can check that the first 3 columns of your embedding data frame are CUI, semantic type, and description by running `check_embedding_semantic_columns`:

```r
print(check_embedding_semantic_columns(w2v_embedding))
```
As expected, this fails, since we just created the embeddings. We provide a helper function to add this information to an embedding:

```r
glove_embedding <- bind_semantic_types(glove_embedding)
w2v_embedding <- bind_semantic_types(w2v_embedding)
```
Let's check that it worked:
```r
w2v_embedding[1:5, 1:5]
```
We are now ready to run the benchmarks we described in our paper. The benchmarking strategy leverages previously published ‘known’ relationships between medical concepts. We compare how similar the embeddings for a pair of concepts are by computing the cosine similarity of their corresponding vectors, and we use this similarity to assess whether or not the two concepts are related. There are five benchmarks:
```r
# No CUIs in our tiny embedding overlap with the comorbidity CUIs, so don't evaluate
comorbidity_results <- benchmark_comorbidities(w2v_embedding)
```
```r
# No CUIs in our tiny embedding overlap with the causative CUIs, so don't evaluate
causative_results <- benchmark_causative(w2v_embedding)
```
```r
# No CUIs in our tiny embedding overlap with the NDF-RT CUIs, so don't evaluate
ndf_rt_results <- benchmark_ndf_rt(w2v_embedding, bootstraps = 100)
```
```r
semantic_results <- benchmark_semantic_type(w2v_embedding, bootstraps = 100)
semantic_results[1:5, -1]
```
```r
# None of the similarity benchmark's concept pairs appear in our tiny embedding, so don't evaluate
similarity_results <- benchmark_similarity(w2v_embedding)
```
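All of these benchmarks are driven by the cosine similarity between embedding vectors. Computed directly, it looks like this (illustrative vectors, not real embeddings):

```r
# Cosine similarity: dot product normalized by the two vector lengths.
cosine_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

v1 <- c(0.2, -0.1, 0.7)  # made-up embedding vectors
v2 <- c(0.1, -0.2, 0.6)
cosine_sim(v1, v2)  # near 1 for similar directions, near 0 for unrelated
```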
We can also run all the benchmarks at once for an embedding:

```r
run_all_benchmarks(w2v_embedding)
```
Finally, you can also compare the performance of two embeddings on one or more benchmarks. `compare_embeddings` restricts the analysis to the set of CUIs shared by both embeddings.

```r
compare_embeddings(glove_embedding, w2v_embedding, "all")
```
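The restriction to shared CUIs is essentially an intersection of the two concept sets. A sketch of the idea with made-up data frames, assuming (as above) that the first column holds the CUI:

```r
# Two toy embeddings with a 'cui' first column (hypothetical data):
emb_a <- data.frame(cui = c("C1", "C2", "C3"), dim1 = c(0.1, 0.2, 0.3))
emb_b <- data.frame(cui = c("C2", "C3", "C4"), dim1 = c(0.5, 0.6, 0.7))

# Only CUIs present in both embeddings can be compared.
shared <- intersect(emb_a$cui, emb_b$cui)
emb_a[emb_a$cui %in% shared, ]
```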