FMAT_run                R Documentation

Run the fill-mask pipeline on multiple models with CPU or GPU (GPU is faster but requires an NVIDIA GPU device).
Usage:

FMAT_run(
  models,
  data,
  gpu,
  add.tokens = FALSE,
  add.method = c("sum", "mean"),
  file = NULL,
  progress = TRUE,
  warning = TRUE,
  na.out = TRUE
)
Arguments:

models
  Model names at HuggingFace. Options:
  (1) a character vector of model names;
  (2) an object returned from FMAT_load (deprecated).
data
  A data.table returned from FMAT_query or FMAT_query_bind.
gpu
  Use GPU (3x faster than CPU) to run the fill-mask pipeline? Defaults to missing, which automatically uses an available GPU (if none is available, CPU is used instead). An NVIDIA GPU device (e.g., GeForce RTX Series) is required to use GPU. See the Guidance for GPU Acceleration. The value is passed on to the device parameter of the underlying fill-mask pipeline.
add.tokens
  Add new tokens (for out-of-vocabulary words or even phrases) to the model vocabulary? Defaults to FALSE.
add.method
  Method used to produce the token embeddings of newly added tokens. Can be "sum" (default) or "mean" of the sub-token embeddings.
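As an illustration of these two pooling options, here is a toy sketch with made-up numbers (not the package's internal code):

```r
# Hypothetical embeddings of three sub-tokens of one new word (4 dimensions each).
sub_emb <- rbind(c(0.1, 0.2, 0.3, 0.4),
                 c(0.0, 0.1, 0.0, 0.1),
                 c(0.2, 0.1, 0.1, 0.0))

# add.method = "sum": element-wise sum of the sub-token embeddings.
new_emb_sum <- colSums(sub_emb)

# add.method = "mean": element-wise average of the sub-token embeddings.
new_emb_mean <- colMeans(sub_emb)
```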
file
  File name of an .RData file in which to save the returned data. Defaults to NULL.
progress
  Show a progress bar? Defaults to TRUE.
warning
  Warn about out-of-vocabulary word(s)? Defaults to TRUE.
na.out
  Replace the probabilities of out-of-vocabulary word(s) with NA? Defaults to TRUE.
Details:

The function automatically adjusts for the compatibility of tokens used in certain models: (1) for uncased models (e.g., ALBERT), it turns tokens to lowercase; (2) for models that use <mask> rather than [MASK], it automatically uses the corrected mask token; (3) for models that require a prefix to estimate whole words rather than subwords (e.g., ALBERT, RoBERTa), it adds a certain prefix (usually a white space; \u2581 for ALBERT and XLM-RoBERTa, \u0120 for RoBERTa and DistilRoBERTa).

Note that these changes only affect the token variable in the returned data, but do not affect the M_word variable. Thus, users may analyze the data based on the unchanged M_word rather than the token.

Note also that there may be extremely trivial differences (beyond the 5th~6th significant digit) in the raw probability estimates between CPU and GPU runs, but these differences have little impact on the main results.
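The whole-word prefixes mentioned in (3) can be inspected directly in base R (a standalone illustration; no model is needed):

```r
# "\u2581" (ALBERT / XLM-RoBERTa) and "\u0120" (RoBERTa / DistilRoBERTa)
# mark a token as beginning a new word (i.e., preceded by a white space).
albert_prefix  <- "\u2581"  # displayed as "▁"
roberta_prefix <- "\u0120"  # displayed as "Ġ"

# Whole-word tokens for "doctor" in each vocabulary would look like:
paste0(albert_prefix, "doctor")
paste0(roberta_prefix, "doctor")
```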
Value:

A data.table (of new class fmat) appending data with these new variables:

model
  Model name.
output
  Complete sentence output with the unmasked token.
token
  Actual token to be filled in the blank mask (a note "out-of-vocabulary" will be added if the original word is not found in the model vocabulary).
prob
  (Raw) conditional probability of the unmasked token given the provided context, estimated by the masked language model.

It is NOT SUGGESTED to directly interpret the raw probabilities because the contrast between a pair of probabilities is more interpretable. See summary.fmat.
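To see why a pairwise contrast is more interpretable than a raw probability, consider this minimal sketch with hypothetical (made-up) numbers, assuming a log probability ratio as the contrast:

```r
# Hypothetical raw probabilities (made-up numbers, not real model output)
# for "He" vs. "She" filling "[MASK] is a doctor."
prob_male   <- 0.042
prob_female <- 0.013

# Log probability ratio: positive values indicate the model favors
# the male token in this context; zero indicates no preference.
lpr <- log(prob_male / prob_female)
round(lpr, 2)  # about 1.17
```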
See Also:

BERT_download
BERT_vocab
FMAT_load (deprecated)
FMAT_query
FMAT_query_bind
summary.fmat
Examples:

## Running the examples requires the models downloaded
## Not run:
models = c("bert-base-uncased", "bert-base-cased")

query1 = FMAT_query(
  c("[MASK] is {TARGET}.", "[MASK] works as {TARGET}."),
  MASK = .(Male="He", Female="She"),
  TARGET = .(Occupation=c("a doctor", "a nurse", "an artist"))
)
data1 = FMAT_run(models, query1)
summary(data1, target.pair=FALSE)

query2 = FMAT_query(
  "The [MASK] {ATTRIB}.",
  MASK = .(Male=c("man", "boy"),
           Female=c("woman", "girl")),
  ATTRIB = .(Masc=c("is masculine", "has a masculine personality"),
             Femi=c("is feminine", "has a feminine personality"))
)
data2 = FMAT_run(models, query2)
summary(data2, mask.pair=FALSE)
summary(data2)

## End(Not run)