create_feature | R Documentation
Turn content into docformer torch tensor input features

Usage
create_feature(filepath, config)
create_features_from_image(
image,
tokenizer,
add_batch_dim = TRUE,
target_geometry = "384x500",
max_seq_len = 512,
debugging = FALSE
)
create_features_from_doc(
doc,
tokenizer,
add_batch_dim = TRUE,
target_geometry = "384x500",
max_seq_len = 512,
debugging = FALSE
)
create_features_from_docbank(
text_path,
image_path,
tokenizer,
add_batch_dim = TRUE,
target_geometry = "384x500",
max_seq_len = 512,
batch_size = 1000,
debugging = FALSE
)
Arguments
filepath

config

image
file path, url, or raw vector to an image (png, tiff, jpeg, etc.)

tokenizer
tokenizer function to apply to the words extracted from the image. Currently, hftokenizers, tokenizers.bpe and sentencepiece tokenizers are supported.

add_batch_dim
(boolean) add an extra dimension to the tensors for batch encoding, in the case of single-page content

target_geometry
target magick geometry expected by the image model input

max_seq_len
maximum length of the token sequence, in tokens

debugging
(boolean) return additional features for debugging purposes

doc
file path, url, or raw vector to a document (currently pdf only)

text_path
file path or filenames of the DocBank_500K_txt files

image_path
file path or filenames of the matching DocBank_500K_ori_img files

batch_size
number of images to process in each batch
Value

a docformer_tensor, a list of named tensors encoding the document features, as expected as input to the docformer network. The tensors are "x_features", "y_features", "text", "image" and "mask", the first dimension of each tensor being the page of the document.
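As a quick orientation, the following sketch (not run) shows how the returned docformer_tensor could be inspected; it assumes `doc_tt` was produced with `create_features_from_doc()` as in the examples below, and that the docformer and torch packages are installed.

```r
# inspect a docformer_tensor (assumes `doc_tt` was created as in the Examples)
names(doc_tt)
# expected names per the Value section:
# "x_features" "y_features" "text" "image" "mask"

# each element is a torch tensor whose first dimension is the page index,
# so all tensors share the same page count
sapply(doc_tt, function(t) dim(t)[1])
```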
Examples

# load a tokenizer with <mask> encoding capability
sent_tok <- sentencepiece::sentencepiece_load_model(
system.file(package = "sentencepiece", "models/nl-fr-dekamer.model")
)
sent_tok$vocab_size <- sent_tok$vocab_size+1L
sent_tok$vocabulary <- rbind(
sent_tok$vocabulary,
data.frame(id = sent_tok$vocab_size, subword = "<mask>")
)
# turn an image into features
image <- system.file(package = "docformer", "2106.11539_1.png")
image_tt <- create_features_from_image(image, tokenizer = sent_tok)
# load a tokenizer with <mask> encoding capability
sent_tok <- sentencepiece::sentencepiece_load_model(
system.file(package = "sentencepiece", "models/nl-fr-dekamer.model")
)
sent_tok$vocab_size <- sent_tok$vocab_size+2L
sent_tok$vocabulary <- rbind(
sent_tok$vocabulary,
data.frame(id = c(sent_tok$vocab_size - 1, sent_tok$vocab_size), subword = c("<mask>", "<pad>"))
)
# turn a pdf into features
doc <- system.file(package = "docformer", "2106.11539_1_2.pdf")
doc_tt <- create_features_from_doc(doc, tokenizer = sent_tok)
# load a tokenizer with <mask> encoding capability
sent_tok <- sentencepiece::sentencepiece_load_model(
system.file(package = "sentencepiece", "models/nl-fr-dekamer.model")
)
sent_tok$vocab_size <- sent_tok$vocab_size+1L
sent_tok$vocabulary <- rbind(
sent_tok$vocabulary,
data.frame(id = sent_tok$vocab_size, subword = "<mask>")
)
# turn DocBank text and image files into features
text_path <- system.file(package = "docformer", "DocBank_500K_txt")
image_path <- system.file(package = "docformer", "DocBank_500K_ori_img")
docbanks_tt <- create_features_from_docbank(text_path, image_path, tokenizer = sent_tok)