create_features: Turn content into docformer torch tensor input features


Turn content into docformer torch tensor input features

Description

Turn content (an image, a PDF document, or DocBank dataset files) into docformer torch tensor input features.

Usage

create_feature(filepath, config)

create_features_from_image(
  image,
  tokenizer,
  add_batch_dim = TRUE,
  target_geometry = "384x500",
  max_seq_len = 512,
  debugging = FALSE
)

create_features_from_doc(
  doc,
  tokenizer,
  add_batch_dim = TRUE,
  target_geometry = "384x500",
  max_seq_len = 512,
  debugging = FALSE
)

create_features_from_docbank(
  text_path,
  image_path,
  tokenizer,
  add_batch_dim = TRUE,
  target_geometry = "384x500",
  max_seq_len = 512,
  batch_size = 1000,
  debugging = FALSE
)

Arguments

filepath

file path, URL, or raw vector to the content to encode

config

configuration options for the feature creation

image

file path, URL, or raw vector to an image (png, tiff, jpeg, etc.)

tokenizer

tokenizer function to apply to the words extracted from the image. Currently, hftokenizers, tokenizers.bpe and sentencepiece tokenizers are supported.
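
For illustration, a tokenizer from another supported backend can be passed the same way. A minimal sketch, assuming the sample model shipped with the tokenizers.bpe package (adding <mask>/<pad> vocabulary entries, as done for sentencepiece in the Examples below, may still be required):

# image as in the Examples below
image <- system.file(package = "docformer", "2106.11539_1.png")
bpe_tok <- tokenizers.bpe::bpe_load_model(
  system.file(package = "tokenizers.bpe", "extdata", "youtokentome.bpe")
)
image_tt <- create_features_from_image(image, tokenizer = bpe_tok)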

add_batch_dim

(boolean) add an extra dimension to the tensors for batch encoding, in the case of single-page content

target_geometry

target magick geometry of the image expected as input by the image model

max_seq_len

maximum length of the tokenized text sequence, in tokens

debugging

(boolean) return additional features for debugging purposes

doc

file path, URL, or raw vector to a document (currently PDF only)

text_path

file path or filenames of the DocBank_500K_txt annotation files

image_path

file path or filenames of the matching DocBank_500K_ori_img image files

batch_size

number of images to process in each batch

Value

a docformer_tensor: a list of named tensors encoding the document features, as expected as input to the docformer network. The tensors are "x_features", "y_features", "text", "image" and "mask", the first dimension of each tensor being the page of the document.
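
A minimal sketch of inspecting the result, assuming image_tt from the first block of the Examples below ($shape being the standard torch tensor shape accessor):

names(image_tt)       # "x_features" "y_features" "text" "image" "mask"
image_tt$image$shape  # first dimension is the page of the document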

Examples

# load a tokenizer with <mask> encoding capability
sent_tok <- sentencepiece::sentencepiece_load_model(
  system.file(package = "sentencepiece", "models/nl-fr-dekamer.model")
)
sent_tok$vocab_size <- sent_tok$vocab_size + 1L
sent_tok$vocabulary <- rbind(
  sent_tok$vocabulary,
  data.frame(id = sent_tok$vocab_size, subword = "<mask>")
)
# turn image into features
image <- system.file(package = "docformer", "2106.11539_1.png")
image_tt <- create_features_from_image(image, tokenizer = sent_tok)

# load a tokenizer with <mask> and <pad> encoding capability
sent_tok <- sentencepiece::sentencepiece_load_model(
  system.file(package = "sentencepiece", "models/nl-fr-dekamer.model")
)
sent_tok$vocab_size <- sent_tok$vocab_size + 2L
sent_tok$vocabulary <- rbind(
  sent_tok$vocabulary,
  data.frame(
    id = c(sent_tok$vocab_size - 1, sent_tok$vocab_size),
    subword = c("<mask>", "<pad>")
  )
)
# turn PDF into features
doc <- system.file(package = "docformer", "2106.11539_1_2.pdf")
doc_tt <- create_features_from_doc(doc, tokenizer = sent_tok)

# load a tokenizer with <mask> encoding capability
sent_tok <- sentencepiece::sentencepiece_load_model(
  system.file(package = "sentencepiece", "models/nl-fr-dekamer.model")
)
sent_tok$vocab_size <- sent_tok$vocab_size + 1L
sent_tok$vocabulary <- rbind(
  sent_tok$vocabulary,
  data.frame(id = sent_tok$vocab_size, subword = "<mask>")
)
# turn DocBank annotations into features
text_path <- system.file(package = "docformer", "DocBank_500K_txt")
image_path <- system.file(package = "docformer", "DocBank_500K_ori_img")
docbanks_tt <- create_features_from_docbank(text_path, image_path, tokenizer = sent_tok)
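
# illustration only: batch_size (default 1000) limits how many images are
# processed at a time; a smaller value reduces the memory footprint
docbanks_tt <- create_features_from_docbank(
  text_path, image_path, tokenizer = sent_tok, batch_size = 100
)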

