dai_async_tab (R Documentation)

Description

Sends files from a Google Cloud Services (GCS) Storage bucket to the GCS Document AI v1beta2 API for asynchronous (offline) processing. The output is delivered to the same bucket as JSON files containing the OCRed text and additional information, including table-related data.
Usage

dai_async_tab(
  files,
  filetype = "pdf",
  dest_folder = NULL,
  bucket = Sys.getenv("GCS_DEFAULT_BUCKET"),
  proj_id = get_project_id(),
  loc = "eu",
  token = dai_token(),
  pps = 100
)
Arguments

files
    A vector or list of PDF filepaths in a GCS Storage bucket. Filepaths must include all parent bucket folder(s) except the bucket name.

filetype
    Either "pdf", "gif", or "tiff". Defaults to "pdf".

dest_folder
    The name of the bucket subfolder where you want the JSON output.

bucket
    The name of the GCS Storage bucket. Not necessary if you have set a default bucket as a .Renviron variable named GCS_DEFAULT_BUCKET.

proj_id
    A GCS project id.

loc
    A two-letter region code ("eu" or "us").

token
    An access token generated by dai_token().

pps
    An integer from 1 to 100 specifying the desired number of pages per shard in the JSON output.
Details

This function accesses a different API endpoint than the main dai_async() function, one that has less language support but returns table data in addition to parsed text (which dai_async() currently does not). This function may be deprecated if and when the v1 API endpoint incorporates table extraction. Use of this service requires a GCS access token and some configuration of the .Renviron file; see the package vignettes for details. Note that this API endpoint does not require a Document AI processor id. The maximum PDF document length is 2,000 pages, and the maximum number of pages in active processing is 10,000. Also note that this function does not provide 'true' batch processing; instead it successively submits single requests at 10-second intervals.
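Because requests are submitted one at a time at 10-second intervals, a call over n files takes roughly 10 * (n - 1) seconds to submit. A minimal sketch of a fully explicit call follows; the bucket name, project id, and filepaths are hypothetical placeholders, not values from this documentation:

```r
library(daiR)

# Hypothetical bucket, project, and filepaths; substitute your own.
resp <- dai_async_tab(
  files = c("for_processing/report1.pdf", "for_processing/report2.pdf"),
  filetype = "pdf",
  dest_folder = "processed",       # JSON output lands in this subfolder
  bucket = "my-dai-bucket",
  proj_id = "my-gcp-project",
  loc = "eu",
  pps = 100                        # pages per shard in the JSON output
)
```

With a default bucket set in .Renviron and daiR otherwise configured, the `bucket`, `proj_id`, and `token` arguments can all be left at their defaults.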
Value

A list of HTTP responses.
Examples

## Not run: 
# With daiR configured on your system, several parameters are provided
# automatically, and you can pass simple calls such as:
dai_async_tab("my_document.pdf")

# NB: Include all parent bucket folders (but not the bucket name) in the filepath:
dai_async_tab("for_processing/pdfs/my_document.pdf")

# Bulk process by passing a vector of filepaths in the files argument:
dai_async_tab(my_files)

# Specify a bucket subfolder for the JSON output:
dai_async_tab(my_files, dest_folder = "processed")

## End(Not run)
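Since processing is asynchronous, the returned responses only confirm submission; the parsed output must be fetched from the bucket once processing finishes. The sketch below assumes daiR's dai_status() and tables_from_dai_file() helpers and the googleCloudStorageR package (a daiR dependency); the output filename shown is a hypothetical example:

```r
## Not run: 
# resp is the list of HTTP responses returned by dai_async_tab();
# check whether the submitted job has finished:
dai_status(resp)

# Once done, download a JSON output file from the bucket and
# extract its table data ("processed/my_document-0.json" is a
# hypothetical output filename):
googleCloudStorageR::gcs_get_object(
  "processed/my_document-0.json",
  saveToDisk = "my_document-0.json"
)
tables <- tables_from_dai_file("my_document-0.json")

## End(Not run)
```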