dai_async_tab: OCR asynchronously and get table data

View source: R/send_to_dai.R

dai_async_tab    R Documentation

OCR asynchronously and get table data

Description

Sends files from a Google Cloud Storage (GCS) bucket to the Google Document AI v1beta2 API for asynchronous (offline) processing. The output is delivered to the same bucket as JSON files containing the OCRed text and additional information, including table-related data.

Usage

dai_async_tab(
  files,
  filetype = "pdf",
  dest_folder = NULL,
  bucket = Sys.getenv("GCS_DEFAULT_BUCKET"),
  proj_id = get_project_id(),
  loc = "eu",
  token = dai_token(),
  pps = 100
)

Arguments

files

A vector or list of PDF filepaths in a GCS bucket. Filepaths must include all parent bucket folder(s) except the bucket name.

filetype

Either "pdf", "gif", or "tiff". If files is a vector, all elements must be of the same type.

dest_folder

The name of the bucket subfolder where you want the JSON output.

bucket

The name of the GCS bucket. Not necessary if you have set a default bucket as a .Renviron variable named GCS_DEFAULT_BUCKET, as described in the package vignette; a minimal configuration sketch follows this argument list.

proj_id

A GCS project id.

loc

A two-letter region code ("eu" or "us").

token

An access token generated by dai_auth() or another auth function.

pps

An integer from 1 to 100 giving the desired number of pages per shard in the JSON output.
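
The bucket default above is read from an environment variable. A minimal configuration sketch, assuming you want GCS_DEFAULT_BUCKET available in every session (the bucket name "my-dai-bucket" is a placeholder):

# For the current session only:
Sys.setenv(GCS_DEFAULT_BUCKET = "my-dai-bucket")

# For a persistent default, add the following line to your .Renviron
# (e.g. via usethis::edit_r_environ()) and restart R:
# GCS_DEFAULT_BUCKET=my-dai-bucket

# Verify that the variable is set:
Sys.getenv("GCS_DEFAULT_BUCKET")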

Details

This function accesses a different API endpoint than the main dai_async() function, one with less language support but which returns table data in addition to parsed text (which dai_async() currently does not). It may be deprecated if/when the v1 API endpoint incorporates table extraction. Use of the service requires a GCS access token and some configuration of the .Renviron file; see the package vignettes for details. Note that this API endpoint does not require a Document AI processor id.

The maximum PDF document length is 2,000 pages, and the maximum number of pages in active processing is 10,000. Note also that this function does not provide 'true' batch processing; instead, it submits individual requests successively at 10-second intervals.
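
Because processing is asynchronous, the JSON output only appears in the bucket once the individual jobs have completed. A minimal sketch for checking whether output files have arrived, assuming the googleCloudStorageR package is installed and authenticated (the bucket name is a placeholder):

library(googleCloudStorageR)

# List the bucket contents and keep only the JSON output files
contents <- gcs_list_objects(bucket = "my-dai-bucket")
grep("\\.json$", contents$name, value = TRUE)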

Value

A list of HTTP responses
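
Each response corresponds to one submitted request. A quick way to confirm that the requests were accepted, assuming the list elements are httr response objects (an assumption; inspect the return value of your installed version):

library(httr)

resp <- dai_async_tab("for_processing/pdfs/my_document.pdf")

# Status codes in the 200 range indicate the request was accepted
sapply(resp, status_code)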

Examples

## Not run: 
# with daiR configured on your system, several parameters are automatically provided,
# and you can pass simple calls, such as:
dai_async_tab("my_document.pdf")

# NB: Include all parent bucket folders (but not the bucket name) in the filepath:
dai_async_tab("for_processing/pdfs/my_document.pdf")

# Bulk process by passing a vector of filepaths in the files argument:
dai_async_tab(my_files)

# Specify a bucket subfolder for the json output:
dai_async_tab(my_files, dest_folder = "processed")

## End(Not run)
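
Once processing has finished, the JSON output remains in the bucket. A sketch of retrieving a single output file for local inspection, again assuming googleCloudStorageR is authenticated (object and bucket names are placeholders):

library(googleCloudStorageR)

# Download one output file to the working directory
gcs_get_object("processed/my_document.json",
               bucket = "my-dai-bucket",
               saveToDisk = "my_document.json")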
