
docformer

Badges: Lifecycle: experimental · R-CMD-check

{docformer} is an R implementation of DocFormer: End-to-End Transformer for Document Understanding, relying on torch for R. It provides a multi-modal transformer-based architecture for the task of Visual Document Understanding (VDU) 📄📄📄, as a port of the shabie/docformer code.

DocFormer uses text, vision and spatial features and combines them using a novel multi-modal self-attention layer. DocFormer can be pre-trained in an unsupervised fashion using carefully designed tasks which encourage multi-modal interaction. DocFormer also shares learned spatial embeddings across modalities, which makes it easy for the model to correlate text to visual tokens and vice versa. DocFormer is evaluated on 4 different datasets, each with strong baselines, and achieves state-of-the-art results on all of them, sometimes beating models 4x larger in number of parameters.

Figure: High-level neural network design with building blocks around the DocFormer multi-modal transformer.
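To make the shared-spatial-embedding idea concrete, here is a minimal sketch in torch for R of one head of DocFormer-style multi-modal self-attention. This is illustrative only, not the package internals: the module name, projections, and the way the spatial bias is computed are simplifying assumptions; the key point is that text and image attention maps both receive the same layout-derived bias.

library(torch)

multi_modal_attention <- nn_module(
  initialize = function(hidden_size) {
    self$scale  <- sqrt(hidden_size)
    self$q_text <- nn_linear(hidden_size, hidden_size)
    self$k_text <- nn_linear(hidden_size, hidden_size)
    self$q_img  <- nn_linear(hidden_size, hidden_size)
    self$k_img  <- nn_linear(hidden_size, hidden_size)
  },
  forward = function(text, image, spatial) {
    # the shared spatial embedding yields one layout bias, added to both modalities
    bias   <- torch_matmul(spatial, spatial$transpose(2, 3)) / self$scale
    a_text <- torch_matmul(self$q_text(text), self$k_text(text)$transpose(2, 3)) / self$scale + bias
    a_img  <- torch_matmul(self$q_img(image), self$k_img(image)$transpose(2, 3)) / self$scale + bias
    list(text = nnf_softmax(a_text, dim = 3), image = nnf_softmax(a_img, dim = 3))
  }
)

# toy usage: batch of 1, 128 tokens, hidden size 64
attn <- multi_modal_attention(hidden_size = 64)
scores <- attn(torch_randn(1, 128, 64), torch_randn(1, 128, 64), torch_randn(1, 128, 64))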

Installation

You can install the development version of docformer like so:

# install.packages("remotes")
remotes::install_github("cregouby/docformer")

docformer currently supports the {sentencepiece} package for tokenization prerequisites, and relies on {pdftools} for digitally-born PDFs, and on {tesseract} with {magick} for scanned documents requiring OCR.

if (! ("sentencepiece" %in% rownames(installed.packages()))) { install.packages("sentencepiece") }

Usage Example

Figure: Side-by-side document ground truth and docformer prediction with superimposed colors: red for title, blue for question, green for answer.

This is a basic workflow to train a docformer model:

Turn a document into an input tensor

library(sentencepiece)
library(docformer)
library(magrittr)  # provides the %>% pipe used below
# get the corpus
doc <- pins::pin("https://arxiv.org/pdf/2106.11539.pdf")

# load a sentencepiece tokenizer and add the missing <mask> and <pad> tokens
tok_model <- sentencepiece_load_model(system.file(package = "sentencepiece", "models/nl-fr-dekamer.model"))
# grow the vocabulary size to account for the two added tokens
tok_model$vocab_size <- tok_model$vocab_size + 2L
# prepend <mask> and <pad>, so that <mask> gets id = 0, then renumber ids from 0
tok_model$vocabulary <- rbind(data.frame(subword = c("<mask>", "<pad>")),
                              tok_model$vocabulary["subword"]) %>%
  tibble::rowid_to_column("id") %>%
  dplyr::mutate(id = id - 1)

# turn the document into a docformer input tensor
doc_tensor <- create_features_from_doc(doc = doc, tokenizer = tok_model)
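The resulting doc_tensor bundles the text, layout and image features the model expects; a quick way to get an overview of what was produced is:

# inspect the top-level structure of the features
str(doc_tensor, max.level = 1)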

Import a pretrained model

config <- docformer_config(pretrained_model_name = "microsoft/layoutlm-base-uncased")
docformer_model <- docformer(config)

or shape your own model

config <- docformer_config(
  hidden_size = 76L,
  max_position_embeddings = 52L,
  num_attention_heads = 4L,
  num_hidden_layers = 3L,
  vocab_size = 5000L,
  device = "cpu"
)
docformer_model <- docformer(config)
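As a sanity check on the model you just shaped, its parameter count can be computed with the usual {torch} idiom (this is generic torch code, not a docformer helper):

# total number of parameters in the model
sum(sapply(docformer_model$parameters, function(p) p$numel()))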

Pretrain the model (work in progress)

A self-supervised training task can be run with

# train a model from that tensor
# docformer_ssl <- docformer_pretrain(doc_tensor, epochs=30)

Train the model (work in progress)

...followed by a supervised training task on some annotated documents...

# docformer_model <- docformer_fit(doc_tensor, from_model=docformer_ssl, epochs=30)

Predict with the model

Predicting with the headless model yields a document-layout embedding tensor of shape [ , , ]:

doc_embedding <- docformer_model(doc_tensor)
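A downstream task head can then be attached to this embedding. Here is a minimal sketch, assuming the embedding has shape [batch, sequence, hidden_size], that config exposes a hidden_size field, and a hypothetical 4-class document classification task:

library(torch)
# hypothetical classification head: mean-pool the sequence dimension,
# then map the pooled embedding to 4 document classes
cls_head <- nn_linear(config$hidden_size, 4)
logits <- cls_head(doc_embedding$mean(dim = 2))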

