create_features: Turn content into docformer torch tensor input features


Turn content into docformer torch tensor input features

Description

Turn content (an image, a PDF document, or DocBank dataset files) into docformer torch tensor input features.

Usage

create_feature(filepath, config)

create_features_from_image(
  image,
  tokenizer,
  add_batch_dim = TRUE,
  target_geometry = "384x500",
  max_seq_len = 512,
  debugging = FALSE
)

create_features_from_doc(
  doc,
  tokenizer,
  add_batch_dim = TRUE,
  target_geometry = "384x500",
  max_seq_len = 512,
  debugging = FALSE
)

create_features_from_docbank(
  text_path,
  image_path,
  tokenizer,
  add_batch_dim = TRUE,
  target_geometry = "384x500",
  max_seq_len = 512,
  batch_size = 1000,
  debugging = FALSE
)

Arguments

filepath

file path, URL, or raw vector to the content to encode

config

configuration options for the feature creation

image

file path, URL, or raw vector to an image (png, tiff, jpeg, etc.)

tokenizer

tokenizer function to apply to the words extracted from the image. Currently, hftokenizers, tokenizers.bpe and sentencepiece tokenizers are supported.
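
For illustration, a tokenizer from another supported backend can be passed the same way. A minimal sketch, assuming the sample model shipped with the tokenizers.bpe package (adding <mask>/<pad> vocabulary entries, as done for sentencepiece in the Examples below, may still be required):

# image as in the Examples below
image <- system.file(package = "docformer", "2106.11539_1.png")
bpe_tok <- tokenizers.bpe::bpe_load_model(
  system.file(package = "tokenizers.bpe", "extdata", "youtokentome.bpe")
)
image_tt <- create_features_from_image(image, tokenizer = bpe_tok)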

add_batch_dim

(boolean) add an extra dimension to the tensors for batch encoding, in the case of single-page content

target_geometry

target magick geometry of the image expected as input by the image model

max_seq_len

maximum length of the tokenized text sequence, in tokens

debugging

(boolean) return additional features for debugging purposes

doc

file path, URL, or raw vector to a document (currently PDF only)

text_path

file path or filenames of the DocBank_500K_txt annotation files

image_path

file path or filenames of the matching DocBank_500K_ori_img image files

batch_size

number of images to process in each batch

Value

a docformer_tensor: a list of named tensors encoding the document features, as expected as input to the docformer network. The tensors are "x_features", "y_features", "text", "image" and "mask", the first dimension of each tensor being the page of the document.
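
A minimal sketch of inspecting the result, assuming image_tt from the first block of the Examples below ($shape being the standard torch tensor shape accessor):

names(image_tt)       # "x_features" "y_features" "text" "image" "mask"
image_tt$image$shape  # first dimension is the page of the document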

Examples

# load a tokenizer with <mask> encoding capability
sent_tok <- sentencepiece::sentencepiece_load_model(
  system.file(package = "sentencepiece", "models/nl-fr-dekamer.model")
)
sent_tok$vocab_size <- sent_tok$vocab_size + 1L
sent_tok$vocabulary <- rbind(
  sent_tok$vocabulary,
  data.frame(id = sent_tok$vocab_size, subword = "<mask>")
)
# turn image into features
image <- system.file(package = "docformer", "2106.11539_1.png")
image_tt <- create_features_from_image(image, tokenizer = sent_tok)

# load a tokenizer with <mask> and <pad> encoding capability
sent_tok <- sentencepiece::sentencepiece_load_model(
  system.file(package = "sentencepiece", "models/nl-fr-dekamer.model")
)
sent_tok$vocab_size <- sent_tok$vocab_size + 2L
sent_tok$vocabulary <- rbind(
  sent_tok$vocabulary,
  data.frame(
    id = c(sent_tok$vocab_size - 1, sent_tok$vocab_size),
    subword = c("<mask>", "<pad>")
  )
)
# turn PDF into features
doc <- system.file(package = "docformer", "2106.11539_1_2.pdf")
doc_tt <- create_features_from_doc(doc, tokenizer = sent_tok)

# load a tokenizer with <mask> encoding capability
sent_tok <- sentencepiece::sentencepiece_load_model(
  system.file(package = "sentencepiece", "models/nl-fr-dekamer.model")
)
sent_tok$vocab_size <- sent_tok$vocab_size + 1L
sent_tok$vocabulary <- rbind(
  sent_tok$vocabulary,
  data.frame(id = sent_tok$vocab_size, subword = "<mask>")
)
# turn DocBank annotations into features
text_path <- system.file(package = "docformer", "DocBank_500K_txt")
image_path <- system.file(package = "docformer", "DocBank_500K_ori_img")
docbanks_tt <- create_features_from_docbank(text_path, image_path, tokenizer = sent_tok)
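
# illustration only: batch_size (default 1000) limits how many images are
# processed at a time; a smaller value reduces the memory footprint
docbanks_tt <- create_features_from_docbank(
  text_path, image_path, tokenizer = sent_tok, batch_size = 100
)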

