{docformer} is an R implementation of DocFormer: End-to-End Transformer for Document Understanding, built on {torch} for R. It provides a multi-modal transformer-based architecture for the task of Visual Document Understanding (VDU) 📄📄📄, as a port of the shabie/docformer code.
DocFormer uses text, vision and spatial features and combines them using a novel multi-modal self-attention layer. DocFormer can be pre-trained in an unsupervised fashion using carefully designed tasks which encourage multi-modal interaction. DocFormer also shares learned spatial embeddings across modalities, which makes it easy for the model to correlate text with visual tokens and vice versa. DocFormer is evaluated on 4 different datasets, each with strong baselines, and achieves state-of-the-art results on all of them, sometimes beating models 4x larger in number of parameters.
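The fusion described above can be illustrated with a toy numeric sketch in base R. This is not the package's actual implementation, only an illustration of the idea: each modality produces its own attention logits, and the same shared spatial logits are added to both the text and the visual branch before the softmax.

```r
# Toy sketch of DocFormer-style multi-modal attention fusion (illustrative only).
# The spatial term is SHARED between the text and visual branches, mirroring the
# shared spatial embeddings described in the paper.
softmax <- function(x) exp(x - max(x)) / sum(exp(x - max(x)))

set.seed(42)
n_tokens <- 4
text_scores    <- matrix(rnorm(n_tokens^2), n_tokens)  # text-text attention logits
visual_scores  <- matrix(rnorm(n_tokens^2), n_tokens)  # vision-vision attention logits
spatial_scores <- matrix(rnorm(n_tokens^2), n_tokens)  # shared spatial logits

# Same spatial scores enter both branches before row-wise softmax
text_attn   <- t(apply(text_scores   + spatial_scores, 1, softmax))
visual_attn <- t(apply(visual_scores + spatial_scores, 1, softmax))

# Each row is now a probability distribution over tokens
rowSums(text_attn)
```

The additive combination keeps each modality's attention map separate while letting the spatial signal influence both.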
You can install the development version of docformer like so:
```r
# install.packages("remotes")
remotes::install_github("cregouby/docformer")
```
docformer currently supports the {sentencepiece} package for tokenization prerequisites, and relies on {pdftools} for digitally-born PDFs, and on {tesseract} with {magick} for OCR documents.
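Whatever the extraction backend, DocFormer needs each token's bounding box expressed in a resolution-independent way. The helper below is an illustrative sketch, not `create_features_from_doc()` internals: it normalizes page-pixel coordinates to the 0-1000 grid commonly used by LayoutLM-family models for spatial embeddings.

```r
# Illustrative sketch (assumed preprocessing, not the package's actual code):
# normalize a word bounding box from page coordinates to a 0-1000 grid.
normalize_bbox <- function(bbox, page_width, page_height) {
  # bbox: c(x0, y0, x1, y1) in page units (e.g. pixels or points)
  as.integer(c(
    1000 * bbox[1] / page_width,
    1000 * bbox[2] / page_height,
    1000 * bbox[3] / page_width,
    1000 * bbox[4] / page_height
  ))
}

# a word box on a 612 x 792 pt page (US Letter)
normalize_bbox(c(72, 100, 200, 120), page_width = 612, page_height = 792)
# -> 117 126 326 151
```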
```r
if (!("sentencepiece" %in% rownames(installed.packages()))) {
  install.packages("sentencepiece")
}
```
This is a basic workflow to train a docformer model:
```r
library(sentencepiece)
library(docformer)
# get the corpus
doc <- pins::pin("https://arxiv.org/pdf/2106.11539.pdf")
# load a sentencepiece tokenizer missing the <mask> and <pad> tokens
tok_model <- sentencepiece_load_model(system.file(package = "sentencepiece", "models/nl-fr-dekamer.model"))
# prepend tokenizer with mandatory tokens
tok_model$vocab_size <- tok_model$vocab_size + 2L
# Add <mask> and <pad>. Here <mask> is at id = 0
tok_model$vocabulary <- rbind(data.frame(subword = c("<mask>", "<pad>")), tok_model$vocabulary["subword"]) %>%
  tibble::rowid_to_column("id") %>%
  dplyr::mutate(id = id - 1)
# turn the document into a docformer input tensor
doc_tensor <- create_features_from_doc(doc = doc, tokenizer = tok_model)
```
A docformer model can be instantiated from a pretrained configuration:

```r
config <- docformer_config(pretrained_model_name = "microsoft/layoutlm-base-uncased")
docformer_model <- docformer(config)
```
A model can also be built from a custom configuration:

```r
config <- docformer_config(
  hidden_size = 76L,
  max_position_embeddings = 52L,
  num_attention_heads = 4L,
  num_hidden_layers = 3L,
  vocab_size = 5000L,
  device = "cpu"
)
docformer_model <- docformer(config)
```
A self-supervised training task can be run with:

```r
# train a model from that tensor
# docformer_ssl <- docformer_pretrain(doc_tensor, epochs = 30)
```
...followed by a supervised training task on some annotated documents...
```r
# docformer_model <- docformer_fit(doc_tensor, from_model = docformer_ssl, epochs = 30)
```
Predicting with the headless model gives a document-layout embedding tensor:
```r
doc_embedding <- docformer_model(doc_tensor)
```
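Such an embedding can then feed a downstream task. The base-R sketch below is an illustration only, assuming the embedding behaves like an (n_tokens x hidden_size) matrix; it mean-pools the token embeddings into a single document vector, as one might do before a classifier head.

```r
# Illustrative sketch with random stand-in data, not real docformer output:
# mean-pool a (n_tokens x hidden_size) embedding matrix into one document vector.
set.seed(1)
n_tokens <- 6
hidden_size <- 8
doc_embedding <- matrix(rnorm(n_tokens * hidden_size), n_tokens, hidden_size)

doc_vector <- colMeans(doc_embedding)  # one value per hidden dimension
length(doc_vector)                     # equals hidden_size
```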