library(manydata)
The first thing users of the package will want to do is to identify
datasets that might contribute to their research goals.
Since some of these data packages are too big for CRAN,
we expect that their developers will instead choose
to make their packages available on GitHub.
To make it easier to identify all packages in the
many packages universe, we have developed the get_packages()
function.
The function lists the many packages available and allows users to download them.
get_packages()
Packages in the many packages universe facilitate the comparison and analysis of multiple datasets in a specific domain of global governance. This is possible because the different packages follow the same coding principles.
In {manystates}, for example, all datasets from the states database contain
variables named Beg and End, which represent the beginning and ending dates
of an episode of state sovereignty.
In {manyenviron}, the agreements database also has Beg and End variables,
but these refer to treaties (signature and term dates).
For the memberships database, Beg and End represent when a relationship
between a state and an agreement starts (signature, ratification, or entry into force)
and ends (withdrawal or term).
These shared variable names allow comparison across datasets that come from different sources but contain the same information. They make it possible to identify recurring, differing, or absent observations across datasets and to extract more robust data when researching a particular governance domain.
Let us say that we wish to download the {manystates} package,
which offers a set of datasets related to state actors in global governance.
We can download and install the latest release version of
the {manystates} package using the same function as before, only
specifying which package we want to 'get': get_packages("manystates").
For now, let's work with the Roman Emperors database included in manydata. We can get a quick summary of the datasets included in this package with the following command:
data(package = "manydata")
data(emperors, package = "manydata")
emperors
We can see that there are three named datasets relating to emperors here:
wikipedia (dataset assembled from Wikipedia pages),
UNRV (United Nations of Roma Victrix),
and britannica (Britannica Encyclopedia List of Roman Emperors).
Each of these datasets has its advantages, and so we may wish
to understand their differences, summarise
variables across them, and perhaps also rerun models across them.
To retrieve an individual dataset from this database,
we can use the pluck() function.
wikipedia <- pluck(emperors, "wikipedia")
However, the real value of the various 'many packages' is that multiple datasets relating to the same phenomenon are presented together.
First of all, we want to understand what the differences are between the datasets in a database. One important way to understand the relationship between these datasets is to understand their relative advantages and disadvantages. For example, one dataset may be long (has many observations) while another is shorter but wider (has more variables). One might include details further back in history while another is more recent, but include more missing data or less precise data (i.e. coded at a less granular level) than a more restrictive one. Or one might appear complete yet offer less information on where the original data points were sourced or how certain variables were coded, while another provides an extensive and transparent codebook that facilitates replication.
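As a rough illustration of the "long versus wide" and missingness comparison described above, the sketch below profiles a toy database in base R. The object toy_db and all its values are hypothetical stand-ins, assuming only that a database behaves like a named list of data frames; it is not how manydata itself summarises data.

```r
# Toy stand-in for a database: a named list of data frames.
# All names and values here are hypothetical.
toy_db <- list(
  long = data.frame(ID = c("a", "b", "c", "d"),
                    Beg = c(10, 20, NA, 40)),
  wide = data.frame(ID = c("a", "b"),
                    Beg = c(10, 21),
                    End = c(15, 30),
                    Cause = c("x", NA))
)

# One row per dataset: observations, variables, and share of missing cells.
profile <- data.frame(
  dataset = names(toy_db),
  n_obs = vapply(toy_db, nrow, integer(1)),
  n_vars = vapply(toy_db, ncol, integer(1)),
  share_missing = vapply(toy_db, function(x) mean(is.na(x)), numeric(1)),
  row.names = NULL
)
profile
```

The same three vapply() calls should work on any named list of data frames, so they could in principle be pointed at a real database once it is loaded.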
data_source() and data_contrast()
We can bring up the database-level documentation using ?emperors.
This informs users about the datasets present in the database
as well as the variables in the various datasets.
However, if we want a more detailed summary of the various levels of data
and sources, we can use data_source() and data_contrast().
The data_source() function displays bibliographic references for
the datasets within a database.
data_source(pkg = "manydata", database = NULL, dataset = NULL)
The data_contrast() function returns a data frame with the key metadata
of each level of data objects (many package, database, and dataset).
This metadata includes the following elements:
data_contrast(pkg = "manydata", database = NULL, dataset = NULL)
Next we may be interested in whether any relationships we are interested in or inferences we want to draw are sensitive to which data we use. That is, we are interested in the robustness of any results to different data specifications.
We can start by exploring whether our conclusion about when emperors began their reign
would differ depending on which dataset we use.
We can use the purrr::map() function used above, but this time pass it the mean() function
and tell it to operate on just the "Beg" variable, which represents
when emperors began their reign (removing any NAs).
Since manydata datasets are always ordered by "Beg" (and then "ID"),
we can remove any subsequent (duplicated) entries by ID to concentrate on first appearances.
library(dplyr)
emperors %>%
  purrr::map(function(x) {
    x %>%
      dplyr::filter(!duplicated(ID)) %>%
      dplyr::summarise(mean(Beg, na.rm = TRUE))
  })
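To make the logic of this step easier to follow, here is a minimal base R sketch of the same idea on toy data: keep the first row per ID, then average Beg while ignoring missing values. The list toy_db and its numeric Beg values are hypothetical; real manydata datasets store dates rather than plain numbers, so this only mirrors the shape of the computation.

```r
# Toy stand-in for a database, already ordered by Beg within each dataset.
# Values are hypothetical and numeric for simplicity.
toy_db <- list(
  first = data.frame(ID = c("a", "a", "b"), Beg = c(10, 12, 30)),
  second = data.frame(ID = c("a", "b", "c"), Beg = c(10, NA, 50))
)

# For each dataset: drop repeated IDs (keeping the first appearance),
# then take the mean of Beg with missing values removed.
mean_first_beg <- vapply(toy_db, function(x) {
  first_rows <- x[!duplicated(x$ID), ]
  mean(first_rows$Beg, na.rm = TRUE)
}, numeric(1))
mean_first_beg
```

Comparing the resulting values side by side gives a quick sense of how sensitive a simple summary is to the choice of dataset.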
Now that we have compared the data and looked at some of the different inferences drawn, let us examine how to select and consolidate databases.
The consolidate() function facilitates consolidating a set of datasets, or a database,
from a 'many' package into a single dataset with some combination of the rows and columns.
The function includes separate arguments for rows and columns,
as well as for how to resolve conflicts in observations across datasets.
The key argument indicates the column to collapse datasets by.
This provides users with considerable flexibility in how they combine data.
For example, users may wish to see units and variables coded in "any" dataset (i.e. units or variables present in at least one of the datasets in the database) or units and variables coded in "every" dataset (i.e. units or variables present in all of the datasets in the database).
consolidate(database = emperors, rows = "any", cols = "any", resolve = "coalesce", key = "ID")
consolidate(database = emperors, rows = "every", cols = "every", resolve = "coalesce", key = "ID")
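The distinction between "any" and "every" can be read as a union versus an intersection. The sketch below shows that reading on two toy data frames in base R; the objects a and b are hypothetical, and this illustrates the semantics described above rather than how consolidate() is implemented.

```r
# Two toy datasets sharing a key column (hypothetical values).
a <- data.frame(ID = c("x", "y"), Beg = c(1, 2))
b <- data.frame(ID = c("y", "z"), Beg = c(2, 3), End = c(5, 6))

# "any": units or variables present in at least one dataset (union).
ids_any <- union(a$ID, b$ID)
cols_any <- union(names(a), names(b))

# "every": units or variables present in all datasets (intersection).
ids_every <- intersect(a$ID, b$ID)
cols_every <- intersect(names(a), names(b))

list(ids_any = ids_any, ids_every = ids_every,
     cols_any = cols_any, cols_every = cols_every)
```

With more than two datasets, the same idea extends by applying Reduce(union, ...) or Reduce(intersect, ...) over the list of ID vectors or column names.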
Users can also choose how they want to resolve conflicts between observations in
consolidate() with several 'resolve' methods:
consolidate(database = emperors, rows = "any", cols = "every", resolve = "max", key = "ID")
consolidate(database = emperors, rows = "every", cols = "any", resolve = "min", key = "ID")
consolidate(database = emperors, rows = "every", cols = "every", resolve = "mean", key = "ID")
consolidate(database = emperors, rows = "any", cols = "any", resolve = "median", key = "ID")
consolidate(database = emperors, rows = "every", cols = "every", resolve = "random", key = "ID")
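To give an intuition for what resolving a conflict means, the following base R sketch applies several of these strategies to a single conflicting value. The helper resolve_value() and the numbers are hypothetical, and treating "coalesce" as "first non-missing value" is an assumption borrowed from the usual meaning of the term; consolidate() may behave differently in its details.

```r
# Hypothetical values reported for one unit's Beg by three datasets.
beg_values <- c(wikipedia = NA, UNRV = 98, britannica = 97)

# Illustrative helper (not part of manydata): reduce conflicting
# values to a single one according to the chosen method.
resolve_value <- function(x, method) {
  x <- x[!is.na(x)]                 # ignore datasets with no value
  if (length(x) == 0) return(NA_real_)
  switch(method,
    coalesce = x[[1]],              # assumed: first non-missing value
    min = min(x),
    max = max(x),
    mean = mean(x),
    median = stats::median(x),
    stop("unknown method: ", method)
  )
}

methods <- c("coalesce", "min", "max", "mean", "median")
resolved <- vapply(methods, function(m) resolve_value(beg_values, m), numeric(1))
resolved
```

Under this reading, the order of datasets matters only for "coalesce", which is one way to understand why favouring a dataset can change the consolidated result.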
Users can even specify how conflicts for different variables should be 'resolved':
consolidate(database = emperors, rows = "any", cols = "every", resolve = c(Beg = "min", End = "max"), key = "ID")
Alternatively, users can "favour" a dataset in a database over others:
consolidate(database = favour(emperors, "UNRV"), rows = "every", cols = "any", resolve = "coalesce", key = "ID")
Users can even declare multiple key ID columns to consolidate a database or multiple datasets:
consolidate(database = emperors, rows = "any", cols = "any", resolve = c(Death = "max", Cause = "coalesce"), key = c("ID", "Beg"))