knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
A data commons project is a means of handing data that is distributed across multiple repositories. This article covers components of these projects, and describes how to build and maintain them.
See the Community Wiki for more technical detail.
The Social Data Commons project is a working example, where its build.R script lists the steps to build/update it.
The init_datacommons function can be used to start a data commons project.
library(community) dir <- tempdir() init_datacommons(dir)
init_datacommons
also builds a monitor site, so you can rerun it after updating a project for an
updated monitor site (like the Social Data Commons Monitor).
After initialization, creating or updating a data commons project involves 3 primary steps:
The most basic component of a data commons project is the repository list, which points to the data repositories that make up the data commons.
You can specify these repositories through init_datacommons
, or add them to either the
commons.json
or scripts/repos.txt
list.
The listed repositories are then kept in the created repos
subdirectory, as managed by the
datacommons_refresh function.
For this example, we can add a single repository:
init_datacommons(dir, repos = "uva-bi-sdad/sdc.education") datacommons_refresh(dir, verbose = FALSE)
The most basic requirement of a data repository is that it contain a data file in a tall format, as
initially handled by data_reformat_sdad --
these should at least have columns containing IDs (default is geoid
) and values (default is value
).
The datacommons_map_files function searches for files in each repository to make an index, which is used to create the views.
datacommons_map_files
also looks for measure info files (such as those created by data_measure_info),
which are collected and saved to cache/measure_info.json
for use by the monitor site.
Once files are indexed, you can use the datacommons_find_variables function to search for variables within them:
datacommons_find_variables("2year", dir)[[1]][, c(1, 6)]
So far, the data commons project has collected and indexed existing repositories, but the goal of these projects is to build unified datasets from these repositories. This is done with views.
Views are essentially lists of variables and IDs, which form a subset of the broader data commons.
Views can be specified and run with the datacommons_view function.
The product of a view is a set of unified data files (containing the requested variables and IDs, if found),
and a collected measures info file containing information about any of the included variables found.
These need to be directed to an output
directory:
output <- paste0(dir, "/view1") datacommons_view( commons = dir, name = "view1", output = output, variables = "schools_2year_all", ids = "51059", verbose = FALSE )
Now, we can see what was added to the output directory:
list.files(output) cat(readLines(paste0(output, "/manifest.json")), sep = "\n") read.csv(paste0(output, "/dataset.csv.xz"))
Usually the output would be a data site project, with a build script that documents
the new datasets and rebuilds the site (such as community_example/build.R).
Such a script can be set as the view's run_after
, so after the datasets get rebuilt,
the site is also rebuilt.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.