{docformer} is an R implementation of *DocFormer: End-to-End Transformer for Document Understanding*, a multi-modal transformer-based architecture for the task of Visual Document Understanding (VDU) 📄📄📄. It relies on {torch} for R and is a port of the shabie/docformer code.
DocFormer uses text, vision and spatial features and combines them using
a novel multi-modal self-attention layer. DocFormer can be pre-trained
in an unsupervised fashion using carefully designed tasks which
encourage multi-modal interaction. DocFormer also shares learned spatial
embeddings across modalities which makes it easy for the model to
correlate text to visual tokens and vice versa. DocFormer is evaluated
on 4 different datasets each with strong baselines. DocFormer achieves
state-of-the-art results on all of them, sometimes beating models 4x
larger in no. of parameters.
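The multi-modal self-attention with shared spatial embeddings described above can be illustrated in a deliberately simplified form with base R matrices: attention scores are the usual scaled dot product plus an additive bias derived from the layout, so the text and visual branches see the same spatial signal. All names and sizes below are illustrative, not the docformer implementation.

```r
# Simplified sketch (not the docformer code): scaled dot-product
# attention where a spatial bias term is added to the scores.
softmax_rows <- function(m) {
  e <- exp(m - apply(m, 1, max))  # subtract row max for numerical stability
  e / rowSums(e)
}

attention_with_spatial_bias <- function(q, k, v, spatial_bias) {
  d <- ncol(k)
  scores <- (q %*% t(k)) / sqrt(d) + spatial_bias
  softmax_rows(scores) %*% v
}

set.seed(1)
n <- 4L; d <- 8L                    # 4 tokens, 8-dim heads (toy sizes)
q <- matrix(rnorm(n * d), n, d)
k <- matrix(rnorm(n * d), n, d)
v <- matrix(rnorm(n * d), n, d)
bias <- matrix(rnorm(n * n), n, n)  # would come from spatial embeddings

out <- attention_with_spatial_bias(q, k, v, bias)
dim(out)  # 4 x 8: one contextualised vector per token
```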
You can install the development version of docformer like so:
```r
# install.packages("remotes")
remotes::install_github("cregouby/docformer")
```
docformer currently supports the {sentencepiece} package for tokenization prerequisites, and relies on {pdftools} for digitally-born PDFs and on {tesseract} with {magick} for documents requiring OCR.
```r
if (!("sentencepiece" %in% rownames(installed.packages()))) {
  install.packages("sentencepiece")
}
```
This is a basic workflow to train a docformer model:
```r
library(sentencepiece)
library(docformer)
library(magrittr)  # provides the %>% pipe used below

# get the corpus
doc <- pins::pin("https://arxiv.org/pdf/2106.11539.pdf")

# load a sentencepiece tokenizer and add the missing <mask> and <pad> tokens
tok_model <- sentencepiece_load_model(system.file(package = "sentencepiece", "models/nl-fr-dekamer.model"))
```
```r
# prepend the tokenizer with the mandatory tokens
tok_model$vocab_size <- tok_model$vocab_size + 2L
# add <mask> and <pad>; here <mask> gets id = 0
tok_model$vocabulary <- rbind(data.frame(subword = c("<mask>", "<pad>")),
                              tok_model$vocabulary["subword"]) %>%
  tibble::rowid_to_column("id") %>%
  dplyr::mutate(id = id - 1)
```
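The id bookkeeping above is easy to get off by one; this base R sketch mimics the prepend-and-renumber step with a toy three-word vocabulary standing in for the real model, so the result can be inspected:

```r
# Toy stand-in for the sentencepiece vocabulary (the real one is larger)
vocab <- data.frame(subword = c("le", "la", "s"))

# prepend <mask> and <pad>, then renumber ids from 0,
# as the pipeline above does
vocab <- rbind(data.frame(subword = c("<mask>", "<pad>")), vocab)
vocab$id <- seq_len(nrow(vocab)) - 1L
vocab <- vocab[, c("id", "subword")]

vocab$subword[vocab$id == 0]  # "<mask>"
vocab$subword[vocab$id == 1]  # "<pad>"
```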
```r
# turn the document into a docformer input tensor
doc_tensor <- create_features_from_doc(doc = doc, tokenizer = tok_model)
```
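Among the features extracted from the document are token bounding boxes; LayoutLM-family models conventionally normalise box coordinates to a 0–1000 grid regardless of page size. The helper below sketches that convention (it is illustrative, not part of the docformer API):

```r
# Illustrative helper (not the docformer API): scale a pixel bounding
# box (x0, y0, x1, y1) to the 0-1000 grid used by LayoutLM-family models.
normalize_box <- function(box, page_width, page_height) {
  as.integer(c(1000 * box[1] / page_width,
               1000 * box[2] / page_height,
               1000 * box[3] / page_width,
               1000 * box[4] / page_height))
}

normalize_box(c(120, 240, 360, 260), page_width = 1200, page_height = 1600)
# 100 150 300 162
```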
```r
# instantiate the model, either from a pretrained configuration...
config <- docformer_config(pretrained_model_name = "microsoft/layoutlm-base-uncased")
docformer_model <- docformer(config)

# ...or from scratch with a custom, deliberately small, configuration
config <- docformer_config(hidden_size = 76L, max_position_embeddings = 52L,
                           num_attention_heads = 4L, num_hidden_layers = 3L,
                           vocab_size = 5000L, device = "cpu")
docformer_model <- docformer(config)
```
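When picking a custom configuration, transformer architectures typically require `hidden_size` to be divisible by `num_attention_heads`, since each head works on an equal slice of the hidden dimension. A quick plain-R sanity check of the values above (this is an assumption about the architecture, not a docformer function):

```r
# each head gets hidden_size / num_attention_heads dimensions,
# so the division must be exact
hidden_size <- 76L
num_attention_heads <- 4L
stopifnot(hidden_size %% num_attention_heads == 0)
hidden_size %/% num_attention_heads  # 19 dimensions per head
```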
A self-supervised training task can be run with:

```r
# train a model from that tensor
# docformer_ssl <- docformer_pretrain(doc_tensor, epochs = 30)
```
…followed by a supervised training task on some annotated documents:

```r
# docformer_model <- docformer_fit(doc_tensor, from_model = docformer_ssl, epochs = 30)
```
Predicting with the headless model gives a document-layout embedding tensor of shape [ , , ]:

```r
doc_embedding <- docformer_model(doc_tensor)
```
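A common way to turn such a per-token embedding tensor into a single vector per document is mean pooling over the sequence dimension. Sketched here with a plain R array standing in for the torch tensor (the sizes are illustrative, not the model's actual dimensions):

```r
# Toy stand-in for a [batch, seq_len, hidden] embedding tensor
batch <- 2L; seq_len <- 5L; hidden <- 8L
emb <- array(rnorm(batch * seq_len * hidden), dim = c(batch, seq_len, hidden))

# mean-pool over the token (2nd) dimension -> one vector per document
doc_vec <- apply(emb, c(1, 3), mean)
dim(doc_vec)  # 2 x 8
```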