spod_convert
Converts data for faster analysis into either a DuckDB database file or parquet files in a hive-style directory structure. Running analysis on these files is sometimes 100 times faster than working with raw CSV files, especially when these are in gzip archives. To connect to the converted data, please use 'mydata <- spod_connect(data_path = path_returned_by_spod_convert)', passing the path to where the data was saved. The connected mydata can be analysed using dplyr functions such as select, filter, mutate, group_by, summarise, etc. At the end of any sequence of commands you will need to add collect to execute the whole chain of data manipulations and load the results into memory in an R data.frame/tibble. For more in-depth usage of such data, please refer to the DuckDB documentation and examples at https://duckdb.org/docs/api/r#dbplyr . Some more useful examples can be found at https://arrow-user2022.netlify.app/data-wrangling#combining-arrow-with-duckdb . You may also use the arrow package to work with parquet files: https://arrow.apache.org/docs/r/ .
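As a minimal sketch of this connect / analyse / collect workflow (the column names date and n_trips below are assumptions for the number of trips data and may differ in your data, so check the available columns first, for example with colnames()):

# A minimal sketch of the connect / analyse / collect workflow.
# NOTE: the column names `date` and `n_trips` are assumed for illustration only;
# inspect your connected data with colnames(mydata) before adapting this.
library(dplyr)

mydata <- spod_connect(data_path = path_returned_by_spod_convert)

daily_trips <- mydata |>
  group_by(date) |>
  summarise(total_trips = sum(n_trips, na.rm = TRUE)) |>
  collect() # executes the whole chain and returns an in-memory tibble

spod_disconnect(mydata) # close the connection when done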
spod_convert(
type = c("od", "origin-destination", "os", "overnight_stays", "nt", "number_of_trips"),
zones = c("districts", "dist", "distr", "distritos", "municipalities", "muni",
"municip", "municipios"),
dates = NULL,
save_format = "duckdb",
save_path = NULL,
overwrite = FALSE,
data_dir = spod_get_data_dir(),
quiet = FALSE,
max_mem_gb = max(4, spod_available_ram() - 4),
max_n_cpu = max(1, parallelly::availableCores() - 1),
max_download_size_gb = 1,
ignore_missing_dates = FALSE
)
type: The type of data to download. Can be "od" or "origin-destination" for origin-destination data, "os" or "overnight_stays" for overnight stays data, or "nt" or "number_of_trips" for the number of trips data.

zones: The zones for which to download the data. Can be "districts" (or "dist", "distr", or the original Spanish "distritos") or "municipalities" (or "muni", "municip", or the original Spanish "municipios").

dates: The dates to process. These can be supplied in several ways, for example as a named character vector with start and end dates, as in the examples below.

save_format: The format in which to save the converted data. Can be "duckdb" (the default) or "parquet".

save_path: The path at which to save the converted data. If NULL (the default), the data is saved under the data directory (see data_dir).

overwrite: Whether to overwrite previously converted data at the save location. Defaults to FALSE.

data_dir: The directory where the data is stored. Defaults to the value returned by spod_get_data_dir().

quiet: Logical. If TRUE, suppresses progress and status messages. Defaults to FALSE.

max_mem_gb: The maximum memory to use in GB. Defaults to the available RAM minus 4 GB, but no less than 4 GB, which should be enough for resaving the data to DuckDB or parquet format.

max_n_cpu: The maximum number of threads to use. Defaults to the number of available cores minus 1.

max_download_size_gb: The maximum download size in gigabytes. Defaults to 1.

ignore_missing_dates: Logical. If TRUE, missing dates among the requested dates do not cause an error. Defaults to FALSE.
Returns the path to the saved DuckDB database file or to the folder with parquet files in a hive-style directory structure.
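If the data was saved with save_format = "parquet", the returned folder can also be opened directly with the arrow package as an alternative to spod_connect. A minimal sketch, assuming parquet_path holds the folder path returned by spod_convert():

# Open the hive-partitioned parquet files lazily with arrow.
# Assumption: parquet_path is the folder returned by
# spod_convert(..., save_format = "parquet").
library(arrow)
library(dplyr)

trips <- open_dataset(parquet_path)

trips |>
  head(10) |>
  collect() # load only the first few rows into memory as a tibble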
# Set data dir for file downloads
spod_set_data_dir(tempdir())
# download and convert data
dates_1 <- c(start = "2020-02-17", end = "2020-02-18")
db_2 <- spod_convert(
type = "number_of_trips",
zones = "distr",
dates = dates_1,
overwrite = TRUE
)
# now connect to the converted data
my_od_data_2 <- spod_connect(db_2)
# disconnect from the database
spod_disconnect(my_od_data_2)
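The example above saves to the default DuckDB format. A similar sketch for saving to hive-style parquet files instead and connecting to them in the same way:

# convert the same dates to parquet files instead of a DuckDB database
parquet_path <- spod_convert(
  type = "number_of_trips",
  zones = "distr",
  dates = dates_1,
  save_format = "parquet",
  overwrite = TRUE
)

# connect to the parquet files just like to a DuckDB database file
my_nt_data <- spod_connect(parquet_path)

# disconnect when done
spod_disconnect(my_nt_data)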