cas_get_files_to_extract: Get path to (locally available) files to be extracted
In giocomai/castarter: Content Analysis Starter Toolkit

View source: R/cas_get_files_to_extract.R

cas_get_files_to_extract

R Documentation

Get path to (locally available) files to be extracted

Description

Mostly used internally by `cas_extract()` or for troubleshooting.

Usage

cas_get_files_to_extract(
  id = NULL,
  ignore_id = TRUE,
  custom_path = NULL,
  index = FALSE,
  store_as_character = TRUE,
  check_previous = TRUE,
  db_connection = NULL,
  file_format = "html",
  sample = FALSE,
  keep_if_status = 200,
  only_available = TRUE,
  ...
)

Arguments

`id`	Defaults to NULL. Identifiers to process when extracting. If given, must be a numeric vector corresponding to the identifiers in the `id` column, e.g. as returned by `cas_read_db_contents_id()`.
`ignore_id`	Defaults to TRUE. If TRUE, it checks if identifiers have been added to the local ignore list, typically with `cas_ignore_id()`, and as retrieved with `cas_read_db_ignore_id()`. It can also be a numeric vector of identifiers: the given identifiers will not be processed. If FALSE, items will be processed normally.
`index`	Logical, defaults to FALSE. If TRUE, downloaded files will be considered `index` files. If not, they will be considered `contents` files. See Readme for a more extensive explanation.
`store_as_character`	Logical, defaults to TRUE. If TRUE, it converts to character all extracted contents before writing them to database. This reduces issues of type conversions with the default database backend (for example, SQLite automatically converts dates to numeric) or using different backends. This implies you will need to set data types when you read the database, but it also means that you can consistently expect all columns to be character vectors, which in one form or another are consistently implemented across database backends. Set to FALSE if you want to remain in control of column types.
`check_previous`	Logical, defaults to TRUE. If FALSE, no check will be conducted to verify if the same content had been previously extracted. If FALSE, `write_to_db` must be set (or will be set) to FALSE, to prevent duplication of data.
`file_format`	Defaults to `html`. Used for storing files in dedicated folders, but also for determining processing options. For example, if a sitemap is downloaded as an index with `file_format` set to xml, it will be processed accordingly. If it is stored as xml.gz, it will be automatically decompressed for correct processing.
`sample`	Defaults to FALSE. If TRUE, the download order is randomised. If a numeric is given, the download order is randomised and at most the given number of items is downloaded.
`keep_if_status`	Defaults to 200. Keep only files whose recorded download status matches the given status.
`only_available`	Defaults to TRUE. If TRUE, returns only files available locally. If FALSE, also returns paths to files that, according to logging data, have already been downloaded but are not available where expected.
`...`	Passed to `cas_get_db_file()`.
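
The way `id`, `ignore_id`, `keep_if_status`, `only_available` and `sample` narrow down the set of files can be illustrated with a small base R sketch. This is not castarter's implementation: the `select_files()` helper, the `log_df` data frame and its column names are hypothetical, and `ignore_id` is only handled as a numeric vector here (the real function can also read the ignore list from the database when set to TRUE).

```r
# Hypothetical download log; column names are illustrative, not castarter's schema.
tmp_dir <- tempfile("cas_sketch_")
dir.create(tmp_dir)

log_df <- data.frame(
  id = 1:5,
  status = c(200L, 200L, 404L, 200L, 200L),
  path = file.path(tmp_dir, paste0(1:5, ".html"))
)

# Pretend only files 1, 2 and 4 are still on disk.
invisible(file.create(log_df$path[c(1, 2, 4)]))

# Hypothetical helper mirroring the documented argument semantics.
select_files <- function(log_df,
                         id = NULL,
                         ignore_id = numeric(0),
                         keep_if_status = 200,
                         only_available = TRUE,
                         sample = FALSE) {
  out <- log_df
  # Restrict to the requested identifiers, if any were given
  if (!is.null(id)) {
    out <- out[out$id %in% id, , drop = FALSE]
  }
  # Drop ignored identifiers, then keep only the accepted download status
  out <- out[!(out$id %in% ignore_id), , drop = FALSE]
  out <- out[out$status %in% keep_if_status, , drop = FALSE]
  # Optionally keep only files that actually exist locally
  if (isTRUE(only_available)) {
    out <- out[file.exists(out$path), , drop = FALSE]
  }
  if (is.numeric(sample)) {
    # Random order, capped at the given number of rows
    n <- min(sample, nrow(out))
    out <- out[sample.int(nrow(out), size = n), , drop = FALSE]
  } else if (isTRUE(sample)) {
    # Random order, all rows
    out <- out[sample.int(nrow(out)), , drop = FALSE]
  }
  out
}

available <- select_files(log_df, ignore_id = 2)
missing_too <- select_files(log_df, only_available = FALSE)
capped <- select_files(log_df, sample = 1)
```

In this sketch, `available` keeps only files with status 200 that exist on disk and are not ignored, while `missing_too` also includes the file that was logged as downloaded but is no longer where expected.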

Examples

## Not run: 
if (interactive()) {
  cas_get_files_to_extract()
}

## End(Not run)
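
A few further calls, using only the arguments documented above, may help when troubleshooting. This is a sketch that assumes castarter is installed and a project has already been configured and downloaded; the variable names are illustrative, and the exact structure of the returned object is not described on this page.

```r
# Sketch only: requires castarter and an existing, already downloaded project
if (requireNamespace("castarter", quietly = TRUE) && interactive()) {
  library("castarter")

  # Default call: contents files, available locally, with download status 200
  contents_files <- cas_get_files_to_extract()

  # Index files (e.g. sitemaps) stored as xml
  index_files <- cas_get_files_to_extract(index = TRUE, file_format = "xml")

  # Also include files logged as downloaded but missing from the expected location
  all_logged <- cas_get_files_to_extract(only_available = FALSE)
}
```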

giocomai/castarter documentation built on June 12, 2025, 8:49 p.m.
