jst_import: Wrapper for file import
In tklebel/jstor: Read Data from JSTOR/DfR

jst_import

R Documentation

Wrapper for file import

Description

This function applies an import function to a list of xml-files (or, in the case of jst_import_zip, to the contents of a .zip-archive) and saves the output to disk in batches of .csv-files.

Usage

jst_import(
  in_paths,
  out_file,
  out_path = NULL,
  .f,
  col_names = TRUE,
  n_batches = NULL,
  files_per_batch = NULL,
  show_progress = TRUE
)

jst_import_zip(
  zip_archive,
  import_spec,
  out_file,
  out_path = NULL,
  col_names = TRUE,
  n_batches = NULL,
  files_per_batch = NULL,
  show_progress = TRUE,
  rows = NULL
)

Arguments

`in_paths`	A character vector of paths to the `xml`-files which should be imported.
`out_file`	Name of files to export to. An increasing number is appended to the name for each batch.
`out_path`	Path to export files to (combined with filename).
`.f`	Function to use for import. Can be one of `jst_get_article`, `jst_get_authors`, `jst_get_references`, `jst_get_footnotes`, `jst_get_book` or `jst_get_chapter`.
`col_names`	Should column names be written to file? Defaults to `TRUE`.
`n_batches`	Number of batches, defaults to 1.
`files_per_batch`	Number of files for each batch. Can be used instead of `n_batches`, but not in conjunction with it.
`show_progress`	Displays a progress bar for each batch, if the session is interactive.
`zip_archive`	A path to a .zip-archive from DfR
`import_spec`	A specification from jst_define_import for which parts of a .zip-archive should be imported via which functions.
`rows`	Mainly used for testing, to decrease the number of files which are imported (e.g. `1:100`).

Details

Along the way, we wrap three functions, which make the process of converting many files easier:

purrr::safely()
furrr::future_map()
readr::write_csv()

When using one of the jst_get_* functions, there should usually be no errors. To prevent the whole computation from failing in the unlikely event that an error occurs, we use safely(), which lets us continue the process and catch the error along the way.
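
The following is a minimal sketch of how purrr::safely() behaves, not the package's internal code. The function parse_file and the file names are hypothetical stand-ins for an import function and its inputs.

```r
library(purrr)

# Hypothetical stand-in for an import function: it fails on one input
parse_file <- function(path) {
  if (grepl("broken", path)) stop("could not parse ", path)
  data.frame(file = path, n_chars = nchar(path))
}

# safely() returns a wrapped function that never throws; it returns
# list(result = ..., error = ...) with exactly one of the two set to NULL
safe_parse <- safely(parse_file)

paths <- c("a.xml", "broken.xml", "c.xml")
res <- map(paths, safe_parse)

# Separate successful results from captured errors
results <- compact(map(res, "result"))
errors <- compact(map(res, "error"))

imported <- do.call(rbind, results)          # rows for a.xml and c.xml
failed <- map_chr(errors, conditionMessage)  # message for broken.xml
```

The failing file does not stop the loop; its error is kept alongside the successful results and can be inspected afterwards.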

If you have many files to import, you might benefit from executing the function in parallel. We use futures for this to give you maximum flexibility. By default the code is executed sequentially. If you want to run it in parallel, simply call future::plan() with future::multisession() as an argument before running jst_import or jst_import_zip.

After importing all files, they are written to disk with readr::write_csv().

Since you might run out of memory when importing a large quantity of files, you can split the files to import into batches. Each batch is treated separately, so if you set the future::multisession() plan, its worker processes are spawned anew for each batch. For this reason, very small batches are not recommended, as starting and ending the processes carries an overhead. On the other hand, batches should not be so large that they exceed memory limitations. A value of 10000 to 20000 for files_per_batch should work fine on most machines. If the session is interactive and show_progress is TRUE, a progress bar is displayed for each batch.
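
To illustrate what files_per_batch means in practice, the sketch below splits a vector of paths into consecutive batches using base R. The helper split_into_batches is hypothetical and only shows the idea; the package's actual batching code may differ.

```r
# Hypothetical helper: assign each path to a batch holding at most
# `files_per_batch` files, keeping the original order
split_into_batches <- function(paths, files_per_batch) {
  batch_id <- ceiling(seq_along(paths) / files_per_batch)
  split(paths, batch_id)
}

# 25 made-up file names, 10 files per batch
paths <- sprintf("file_%03d.xml", 1:25)
batches <- split_into_batches(paths, files_per_batch = 10)

length(batches)   # 3 batches
lengths(batches)  # batch sizes: 10, 10, 5
```

With 25 files and files_per_batch = 10, the last batch is smaller than the others; each batch would then be imported and written to its own .csv-file.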

Value

Writes .csv-files to disk.

Examples

## Not run: 
# read from file list --------
# find all files
meta_files <- list.files(pattern = "xml", full.names = TRUE)

# import them via `jst_get_article`
jst_import(meta_files, out_file = "imported_metadata", .f = jst_get_article,
           files_per_batch = 25000)
           
# do the same, but in parallel
library(future)
plan(multisession)
jst_import(meta_files, out_file = "imported_metadata", .f = jst_get_article,
           files_per_batch = 25000)

# read from zip archive ------ 
# define imports
imports <- jst_define_import(article = c(jst_get_article, jst_get_authors))

# convert the files to .csv
jst_import_zip("my_archive.zip", out_file = "my_out_file",
               import_spec = imports)

## End(Not run)

tklebel/jstor documentation built on July 20, 2024, 11:07 p.m.