causal_pred_mats | R Documentation |
This function computes a list of matrices, where each matrix corresponds to a
unique group specified by the by argument. Each matrix represents the
predictability of every token in the input text (x) based on preceding
context, as evaluated by a causal transformer model.
causal_pred_mats(
x,
by = rep(1, length(x)),
sep = " ",
log.p = getOption("pangoling.log.p"),
sorted = FALSE,
model = getOption("pangoling.causal.default"),
checkpoint = NULL,
add_special_tokens = NULL,
decode = FALSE,
config_model = NULL,
config_tokenizer = NULL,
batch_size = 1,
...
)
x |
A character vector of words, phrases, or texts to evaluate. |
by |
A grouping variable indicating how texts are split into groups. |
sep |
A string specifying how words are separated within contexts or
groups. Default is " " (a single space). |
log.p |
Base of the logarithm used for the output predictability values.
If |
sorted |
If FALSE (the default), the order of the groups we are splitting by is
retained. When TRUE then sorted (according to |
model |
Name of a pre-trained model or folder. One should be able to use models based on "gpt2". See the Hugging Face website. |
checkpoint |
Folder of a checkpoint. |
add_special_tokens |
Whether to include special tokens. It has the same default as the AutoTokenizer method in Python. |
decode |
Logical. If |
config_model |
List with other arguments that control how the model from Hugging Face is accessed. |
config_tokenizer |
List with other arguments that control how the tokenizer from Hugging Face is accessed. |
batch_size |
Maximum number of sentences/texts processed in parallel. Larger batches increase speed but use more memory. Since all texts in a batch must have the same length, shorter ones are padded with placeholder tokens. |
... |
Currently not in use. |
The function splits the input x into groups specified by the by argument
and processes each group independently. For each group, the model computes
the predictability of each token in its vocabulary based on preceding
context.
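The grouping step described above can be pictured with base R alone. This is a minimal sketch of the idea, not the package's implementation: the toy vectors are made up, and treating sorted = TRUE as "order groups by the value of by" is an assumption made for illustration.

```r
# Toy input (hypothetical): one word per element, with a group id for each word.
x   <- c("The", "apple", "is", "red.", "A", "dog", "barks.")
by  <- c(2, 2, 2, 2, 1, 1, 1)
sep <- " "

# Keep groups in order of first appearance (analogous to sorted = FALSE).
groups_unsorted <- split(x, factor(by, levels = unique(by)))

# Order groups by the value of `by` (assumed analogue of sorted = TRUE).
groups_sorted <- split(x, by)

# Join the words of each group into a single text using `sep`.
texts_unsorted <- vapply(groups_unsorted, paste, character(1), collapse = sep)
texts_sorted   <- vapply(groups_sorted, paste, character(1), collapse = sep)

texts_unsorted
texts_sorted
```

Each resulting text corresponds to one group, and therefore to one matrix in the returned list.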
Each matrix contains:
Rows representing the model's vocabulary.
Columns corresponding to tokens in the group (e.g., a sentence or paragraph).
By default, values in the matrices are the natural logarithm of word probabilities.
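To make the matrix layout concrete, the following sketch builds a small stand-in matrix by hand and shows two common operations on it. The vocabulary, tokens, and probabilities are invented toy values, not model output, and a real matrix would have one row per entry in the model's vocabulary.

```r
# Hypothetical 4-entry vocabulary and a 3-token text (toy values only).
vocab  <- c("the", "apple", "is", "red")
tokens <- c("the", "apple", "is")

# matrix() fills column by column, so each group of four values is the
# distribution over the vocabulary at one token position.
probs <- matrix(
  c(0.70, 0.10, 0.10, 0.10,   # position 1
    0.05, 0.60, 0.05, 0.30,   # position 2
    0.10, 0.10, 0.70, 0.10),  # position 3
  nrow = length(vocab),
  dimnames = list(vocab, tokens)
)
mat <- log(probs)  # natural logarithm, the default scale described above

# Each column is a probability distribution, so it sums to 1 after exp().
col_totals <- colSums(exp(mat))

# Log-probability of the token that actually occurred at each position,
# picked out with (row, column) index pairs.
observed <- mat[cbind(match(tokens, vocab), seq_along(tokens))]
names(observed) <- tokens

# Surprisal in bits: change of base from natural log to log2, sign flipped.
surprisal_bits <- -observed / log(2)

col_totals
observed
surprisal_bits
```

The same indexing pattern should carry over to a matrix returned by the function, provided its row names are vocabulary entries and its column names are the tokens of the group.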
A list of matrices with tokens in their columns and the vocabulary of the model in their rows.
A causal language model (also called GPT-like, auto-regressive, or decoder model) is a type of large language model, usually used for text generation, that can predict the next word (or, more accurately, the next token) based on a preceding context.
If not specified, the causal model used will be the one set in the global
option pangoling.causal.default, which can be accessed via
getOption("pangoling.causal.default") (by default "gpt2"). To change the
default option, use options(pangoling.causal.default = "newcausalmodel").
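A minimal sketch of changing the default temporarily and then restoring it, using only base R's options() and getOption(). The name "newcausalmodel" is the placeholder from the text above, not a real model.

```r
# options() returns the previous values, so they can be restored later.
old <- options(pangoling.causal.default = "newcausalmodel")

# The option now holds the placeholder name.
current <- getOption("pangoling.causal.default")
current

# Restore whatever was set before (or unset it if nothing was set).
options(old)
```

Restoring the previous value keeps a script from silently changing the model used by later calls in the same session.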
A list of possible causal models can be found on the Hugging Face website.
Using the config_model and config_tokenizer arguments, it's possible to
control how the model and tokenizer from Hugging Face are accessed; see the
Python method from_pretrained for details.
In case of errors when a new model is run, check the status of https://status.huggingface.co/
Other causal model functions:
causal_next_tokens_pred_tbl(), causal_words_pred()
data("df_sent")
df_sent
list_of_mats <- causal_pred_mats(
x = df_sent$word,
by = df_sent$sent_n,
model = "gpt2"
)
# View the structure of the resulting list
list_of_mats |> str()
# Inspect the last rows of the first matrix
list_of_mats[[1]] |> tail()
# Inspect the last rows of the second matrix
list_of_mats[[2]] |> tail()