predict.Transformer: Predict alongside a Transformer model
In bnosac/golgotha: Contextualised Embeddings and Language Modelling using BERT


View source: R/embed.R

Extract features from the Transformer model, namely:

the embedding of a sentence
the embedding of the tokens of the sentence
the tokens of a sentence

## S3 method for class 'Transformer'
predict(
  object,
  newdata,
  type = c("embed-sentence", "embed-token", "tokenise"),
  trace = 10,
  ...
)

`object`	an object of class Transformer as returned by `transformer`
`newdata`	a data.frame with columns doc_id and text indicating the text to embed
`type`	a character string, one of 'embed-sentence', 'embed-token' or 'tokenise', to get respectively sentence-level embeddings, token-level embeddings or the wordpiece tokens
`trace`	indicates whether to show a trace of the progress. Defaults to 10, which shows a trace after every 10 embedded texts
`...`	other arguments passed on to the methods

Depending on the argument type, the function returns:

embed-sentence: A matrix with the embedding of the text, where the doc_id's are in the rownames
embed-token: A list of matrices with token-level embeddings, one for each doc_id. The names of the list are the doc_id's. Note that, depending on the model, CLS / SEP tokens are included at the start/end, and the number of rows of each matrix is limited by the model's maximum sequence length
tokenise: A list of subword (wordpiece) tokens. The names of the list are identified by the doc_id.
generate: generate tokens following the provided text sequence
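
As an illustration of how the embed-sentence result might be used, the following is a minimal sketch that computes cosine similarities between documents. The matrix here is a hand-made stand-in with 4 columns, not real model output, and `cosine_similarity` is a hypothetical helper that is not part of golgotha; the only assumption taken from the description above is that the result is a numeric matrix with the doc_id's in the rownames.

```r
# Stand-in for predict(model, x, type = "embed-sentence"):
# one row per doc_id, with 4 made-up columns instead of the model's
# real embedding dimension.
embedding <- matrix(
  c(1, 0, 1, 0,
    1, 0, 1, 0,
    0, 1, 0, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("doc_1", "doc_2", "doc_3"), NULL)
)

# Hypothetical helper (not part of golgotha): cosine similarity
# between all pairs of rows of a numeric matrix.
cosine_similarity <- function(m) {
  norms <- sqrt(rowSums(m^2))          # Euclidean length of each row
  tcrossprod(m) / tcrossprod(norms)    # dot products / products of lengths
}

similarity <- cosine_similarity(embedding)
similarity["doc_1", "doc_2"]  # rows are identical
similarity["doc_1", "doc_3"]  # rows are orthogonal
```

Because the rownames carry the doc_id's, the similarity matrix can be indexed by doc_id directly.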

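# Download the model and load it
transformer_download_model("bert-base-multilingual-uncased")
model <- transformer("bert-base-multilingual-uncased")

# newdata needs the columns doc_id and text
x <- data.frame(doc_id = c("doc_1", "doc_2"),
                text = c("provide some words to embed", "another sentence of text"),
                stringsAsFactors = FALSE)
# Wordpiece tokens per doc_id
predict(model, x, type = "tokenise")
# Sentence-level embeddings: a matrix with one row per doc_id
embedding <- predict(model, x, type = "embed-sentence")
dim(embedding)
# Token-level embeddings: a list of matrices, one per doc_id
embedding <- predict(model, x, type = "embed-token")
str(embedding)

# Clean up: remove the downloaded model from the package folder
unlink(file.path(system.file(package = "golgotha", "models"),
       "bert-base-multilingual-uncased"), recursive = TRUE)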
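
The embed-token result can also be reduced to one vector per document. The sketch below averages the token rows of each matrix (mean pooling, a common technique that this package's documentation does not prescribe). It runs on a hand-made stand-in list rather than real model output, `pool_tokens` is a hypothetical helper, and dropping the first and last row assumes that the model adds a CLS token at the start and a SEP token at the end.

```r
# Stand-in for predict(model, x, type = "embed-token"):
# a list of matrices named by doc_id, one row per token,
# with 2 made-up columns instead of the real embedding dimension.
token_embeddings <- list(
  doc_1 = matrix(c(9, 9,    # assumed CLS row
                   1, 3,
                   3, 5,
                   9, 9),   # assumed SEP row
                 ncol = 2, byrow = TRUE),
  doc_2 = matrix(c(9, 9,    # assumed CLS row
                   2, 4,
                   9, 9),   # assumed SEP row
                 ncol = 2, byrow = TRUE)
)

# Hypothetical helper (not part of golgotha): average the token rows,
# optionally dropping the first and last row (assumed CLS / SEP).
pool_tokens <- function(m, drop_special = TRUE) {
  if (drop_special && nrow(m) > 2) {
    m <- m[-c(1, nrow(m)), , drop = FALSE]
  }
  colMeans(m)
}

# One pooled vector per document, stacked into a matrix
# with the doc_id's in the rownames.
n_dim <- ncol(token_embeddings[[1]])
pooled <- t(vapply(token_embeddings, pool_tokens, numeric(n_dim)))
pooled
```

Set `drop_special = FALSE` when the model in use does not add special tokens.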