dai_async_tab: OCR asynchronously and get table data

View source: R/send_to_dai.R

dai_async_tab    R Documentation

OCR asynchronously and get table data

Description

Sends files from a Google Cloud Storage (GCS) bucket to the Google Document AI v1beta2 API for asynchronous (offline) processing. The output is delivered to the same bucket as JSON files containing the OCRed text and additional information, including table-related data.

Usage

dai_async_tab(
  files,
  filetype = "pdf",
  dest_folder = NULL,
  bucket = Sys.getenv("GCS_DEFAULT_BUCKET"),
  proj_id = get_project_id(),
  loc = "eu",
  token = dai_token(),
  pps = 100
)

Arguments

files

A vector or list of PDF filepaths in a GCS bucket. Filepaths must include all parent bucket folder(s) except the bucket name.

filetype

Either "pdf", "gif", or "tiff". If files is a vector, all elements must be of the same type.

dest_folder

The name of the bucket subfolder where you want the JSON output.

bucket

The name of the GCS bucket. Not necessary if you have set a default bucket as a .Renviron variable named GCS_DEFAULT_BUCKET, as described in the package vignette; a minimal configuration sketch follows this argument list.

proj_id

A GCS project id.

loc

A two-letter region code ("eu" or "us").

token

An access token generated by dai_auth() or another auth function.

pps

An integer from 1 to 100 giving the desired number of pages per shard in the JSON output.
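
The bucket default above is read from an environment variable. A minimal configuration sketch, assuming you want GCS_DEFAULT_BUCKET available in every session (the bucket name "my-dai-bucket" is a placeholder):

# For the current session only:
Sys.setenv(GCS_DEFAULT_BUCKET = "my-dai-bucket")

# For a persistent default, add the following line to your .Renviron
# (e.g. via usethis::edit_r_environ()) and restart R:
# GCS_DEFAULT_BUCKET=my-dai-bucket

# Verify that the variable is set:
Sys.getenv("GCS_DEFAULT_BUCKET")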

Details

This function accesses a different API endpoint than the main dai_async() function, one with less language support but which returns table data in addition to parsed text (which dai_async() currently does not). It may be deprecated if/when the v1 API endpoint incorporates table extraction. Use of the service requires a GCS access token and some configuration of the .Renviron file; see the package vignettes for details. Note that this API endpoint does not require a Document AI processor id.

The maximum PDF document length is 2,000 pages, and the maximum number of pages in active processing is 10,000. Note also that this function does not provide 'true' batch processing; instead, it submits individual requests successively at 10-second intervals.
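
Because processing is asynchronous, the JSON output only appears in the bucket once the individual jobs have completed. A minimal sketch for checking whether output files have arrived, assuming the googleCloudStorageR package is installed and authenticated (the bucket name is a placeholder):

library(googleCloudStorageR)

# List the bucket contents and keep only the JSON output files
contents <- gcs_list_objects(bucket = "my-dai-bucket")
grep("\\.json$", contents$name, value = TRUE)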

Value

A list of HTTP responses
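
Each response corresponds to one submitted request. A quick way to confirm that the requests were accepted, assuming the list elements are httr response objects (an assumption; inspect the return value of your installed version):

library(httr)

resp <- dai_async_tab("for_processing/pdfs/my_document.pdf")

# Status codes in the 200 range indicate the request was accepted
sapply(resp, status_code)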

Examples

## Not run: 
# with daiR configured on your system, several parameters are automatically provided,
# and you can pass simple calls, such as:
dai_async_tab("my_document.pdf")

# NB: Include all parent bucket folders (but not the bucket name) in the filepath:
dai_async_tab("for_processing/pdfs/my_document.pdf")

# Bulk process by passing a vector of filepaths in the files argument:
dai_async_tab(my_files)

# Specify a bucket subfolder for the json output:
dai_async_tab(my_files, dest_folder = "processed")

## End(Not run)
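
Once processing has finished, the JSON output remains in the bucket. A sketch of retrieving a single output file for local inspection, again assuming googleCloudStorageR is authenticated (object and bucket names are placeholders):

library(googleCloudStorageR)

# Download one output file to the working directory
gcs_get_object("processed/my_document.json",
               bucket = "my-dai-bucket",
               saveToDisk = "my_document.json")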
