cas_extract | R Documentation |
Extract fields and contents from downloaded files
cas_extract(
extractors,
post_processing = NULL,
id = NULL,
ignore_id = TRUE,
custom_path = NULL,
index = FALSE,
store_as_character = TRUE,
check_previous = TRUE,
db_connection = NULL,
file_format = "html",
sample = FALSE,
write_to_db = FALSE,
keep_if_status = 200,
encoding = "UTF-8",
readability = FALSE,
...
)
extractors |
A named list of functions. See examples for details. |
post_processing |
Defaults to NULL. If given, it must be a function that takes a data frame as input (logically, a row of the dataset) and returns it with additional or modified columns. |
id |
Defaults to NULL, identifiers to process when extracting. If given,
must be a numeric vector, logically corresponding to the identifiers in the
|
ignore_id |
Defaults to TRUE. If TRUE, it checks if identifiers have
been added to the local ignore list, typically with |
index |
Logical, defaults to FALSE. If TRUE, downloaded files will be
considered |
store_as_character |
Logical, defaults to TRUE. If TRUE, it converts to character all extracted contents before writing them to database. This reduces issues of type conversions with the default database backend (for example, SQLite automatically converts dates to numeric) or using different backends. This implies you will need to set data types when you read the database, but it also means that you can consistently expect all columns to be character vectors, which in one form or another are consistently implemented across database backends. Set to FALSE if you want to remain in control of column types. |
check_previous |
Logical, defaults to TRUE. If FALSE, no check will be
conducted to verify if the same content had been previously extracted. If
FALSE, |
sample |
Defaults to FALSE. If TRUE, the order in which files are processed is randomised. If a numeric is given, the order is randomised and at most the given number of items is processed. |
keep_if_status |
Defaults to 200. Keep only if recorded download status matches the given status. |
... |
Passed to |
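With `store_as_character = TRUE`, every extracted column is written as text, so types have to be set again after reading the data back. The following is a minimal sketch in base R. The data frame `extracted_df` and its columns `date` and `word_count` are hypothetical stand-ins for rows read from the database, not names used by the package.

```r
# Hypothetical stand-in for rows read back from the database:
# every column is character, as expected with store_as_character = TRUE.
extracted_df <- data.frame(
  id = c("1", "2"),
  date = c("2023-01-15", "2023-02-01"),
  word_count = c("350", "1200"),
  stringsAsFactors = FALSE
)

# Restore the intended column types explicitly after reading.
typed_df <- transform(
  extracted_df,
  id = as.numeric(id),
  date = as.Date(date),
  word_count = as.integer(word_count)
)

# Inspect the resulting column types.
str(typed_df)
```

Doing the conversion at read time keeps the stored data identical across database backends, at the cost of one explicit typing step per analysis.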
## Not run:
if (interactive()) {
  ### Post-processing example ####
  # For example, in order to add a column called `internal_id`
  # that takes the ending digits of the url (assuming the url ends with digits)
  # a function such as the following would be passed to cas_extract
  pp <- function(df) {
    df |>
      dplyr::mutate(internal_id = stringr::str_extract(url, "[[:digit:]]+$"))
  }

  cas_extract(
    extractors = extractors_l, # assuming it has already been set
    post_processing = pp
  )
}
## End(Not run)
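The example above assumes that `extractors_l` has already been set. Below is a minimal sketch of what such a named list of functions could look like. It assumes, without confirmation from this page, that each function receives a parsed HTML document and returns a single value. The element names `title` and `date`, the sample page, the use of `xml2`, and the helper `pp_base` are illustrative choices only.

```r
library(xml2)

# Hypothetical extractors: one function per output column.
# Assumption: each function takes a parsed HTML document as input.
extractors_l <- list(
  title = function(x) xml_text(xml_find_first(x, "//h1")),
  date = function(x) xml_attr(xml_find_first(x, "//time"), "datetime")
)

# Illustrative page, standing in for a downloaded file.
page <- read_html(
  "<html><body><h1>Example title</h1><time datetime='2023-01-15'>15 Jan</time></body></html>"
)

# Apply each extractor to the page and collect one row of values.
row_df <- as.data.frame(
  lapply(extractors_l, function(f) f(page)),
  stringsAsFactors = FALSE
)
row_df$url <- "https://example.com/news/12345"

# Base R counterpart of a post-processing function: it takes a data frame
# and returns it with an added column (assumes the url ends with digits).
pp_base <- function(df) {
  df$internal_id <- regmatches(df$url, regexpr("[[:digit:]]+$", df$url))
  df
}

pp_base(row_df)
```

A list built this way would be passed as `extractors = extractors_l`, with the names of the list elements serving as column names in this sketch.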