| code_extend | R Documentation |
These functions use text embeddings and multinomial logistic regression
to suggest missing codes or flag potentially incorrect codes based on text data.
Two approaches are provided: one using GloVe embeddings trained on the input text,
and another using pre-trained BERT embeddings via the {text} package.
Both functions require a vector of text (e.g., titles or descriptions)
and a corresponding vector of categorical codes, with NA or empty strings
indicating missing codes to be inferred.
The functions train a multinomial logistic regression model
using glmnet on the text embeddings of the entries with known codes,
and then predict codes for the entries with missing codes.
The functions also validate the model's performance
on a holdout set and report per-class precision, recall, and F1-score.
If no missing codes are present, the functions instead
check existing codes for potential mismatches and report them.
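The validation step described above can be illustrated with a minimal base R sketch. It assumes plain character vectors of true and predicted codes for a holdout set. The helper `class_metrics()` and the toy labels are hypothetical and not part of the package; the functions' internal computation may differ.

```r
# Hypothetical helper (not part of the package): per-class precision,
# recall and F1 computed from true and predicted labels.
class_metrics <- function(truth, pred) {
  classes <- sort(unique(truth))
  rows <- lapply(classes, function(k) {
    tp <- sum(pred == k & truth == k)  # correctly predicted as k
    fp <- sum(pred == k & truth != k)  # wrongly predicted as k
    fn <- sum(pred != k & truth == k)  # true k that was missed
    precision <- if (tp + fp > 0) tp / (tp + fp) else 0
    recall    <- if (tp + fn > 0) tp / (tp + fn) else 0
    f1 <- if (precision + recall > 0) {
      2 * precision * recall / (precision + recall)
    } else {
      0
    }
    data.frame(class = k, precision = precision, recall = recall, f1 = f1)
  })
  do.call(rbind, rows)
}

# Toy holdout labels, invented for illustration only
truth <- c("Enemies", "Enemies", "Friends", "Friends", "God", "God")
pred  <- c("Enemies", "Friends", "Friends", "Friends", "God", "Enemies")

m <- class_metrics(truth, pred)
macro_f1 <- mean(m$f1)          # macro-F1: unweighted mean of per-class F1

req_f1 <- 0.8                   # same default as the req_f1 argument
proceed <- macro_f1 >= req_f1   # inference would only go ahead if TRUE
```

Because macro-F1 weights every class equally, a single poorly predicted code can pull the score below `req_f1` even when overall accuracy looks acceptable.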
code_extend_glove(titles, var, req_f1 = 0.8, rarity_threshold = 8)
code_extend_bert(titles, var, req_f1 = 0.8, rarity_threshold = 8, emb_texts)
titles |
A character vector of text entries (e.g., titles or descriptions). |
var |
A character vector of categorical codes that might be inferred
from the titles or texts.
Entries with missing codes should be NA or empty strings. |
req_f1 |
The required macro-F1 score on the validation set before proceeding with inference. Default is 0.80. |
rarity_threshold |
Minimum number of occurrences for a code to be included in training. Codes with fewer occurrences are excluded from training to ensure sufficient data for learning. Default is 8. |
emb_texts |
For |
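How `var` and `rarity_threshold` interact can be sketched in base R. This is an illustration under assumptions, not the package's internal code: the variable names `missing`, `keep_codes`, `train_idx`, and `infer_idx` are hypothetical, the toy data are invented, and how the functions actually treat entries with rare codes is not stated in this documentation.

```r
# Toy codes, invented for illustration; NA and "" both mark missing codes
var <- c("Enemies", "Enemies", "Friends", NA, "", "God", "Friends")

# The package default is 8; lowered here so the toy data are usable
rarity_threshold <- 2

# Entries whose code is missing and would need to be inferred
missing <- is.na(var) | var == ""

# Count each known code and keep only those that occur often enough
counts <- table(var[!missing])
keep_codes <- names(counts)[counts >= rarity_threshold]

# Assumed split: train on known, sufficiently frequent codes;
# predict for entries with missing codes
train_idx <- which(!missing & var %in% keep_codes)
infer_idx <- which(missing)
```

In this toy case "God" occurs only once, so it falls below the threshold and is left out of training.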
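# Build one text string per emperor from several descriptive fields
titles <- paste(emperors$Wikipedia$CityBirth,
                emperors$Wikipedia$ProvinceBirth,
                emperors$Wikipedia$Rise,
                emperors$Wikipedia$Dynasty,
                emperors$Wikipedia$Cause)
# Use the killer as the code; treat "Unknown" as missing so it is inferred
var <- emperors$Wikipedia$Killer
var[var == "Unknown"] <- NA
# Collapse detailed killer categories into three broad codes
var[var %in% c("Senate", "Court Officials", "Opposing Army")] <- "Enemies"
var[var %in% c("Fire", "Lightning", "Aneurism", "Heart Failure")] <- "God"
var[var %in% c("Wife", "Usurper", "Praetorian Guard", "Own Army")] <- "Friends"
# Fit on the known codes with the default req_f1 and rarity_threshold
glo <- code_extend_glove(titles, var)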