View source: R/check_repository.R
check_repository | R Documentation |
Performs a series of checks to see if data in a given repository can be ingested by a datacommons project.
check_repository(dir = ".", search_pattern = "\\.csv(?:\\.[gbx]z2?)?$",
exclude = NULL, value = "value", value_name = "measure",
id = "geoid", time = "year", dataset = "region_type",
entity_info = c("region_type", "region_name"), check_values = TRUE,
attempt_repair = FALSE, write_infos = FALSE, verbose = TRUE)
dir | Root directory of the data repository. |
search_pattern | Regular expression used to search for data files. |
exclude | Subdirectories to exclude from the file search. |
value | Name of the column containing variable values. |
value_name | Name of the column containing variable names. |
id | Column name of IDs that uniquely identify entities. |
time | Column name of the variable representing time. |
dataset | Column name used to separate data into sets (such as by region), or a vector of datasets, with |
entity_info | A vector of variable names to go into making |
check_values | Logical; if |
attempt_repair | Logical; if |
write_infos | Logical; if |
verbose | Logical; If |
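As a quick illustration of the default search_pattern, the sketch below applies it to a few hypothetical file names with base R's grepl. The file names are made up, and perl = TRUE is used only to make the regular expression flavor explicit; this page does not show how the function itself applies the pattern.

```r
# Default pattern from the usage above: ".csv", optionally followed by a
# compression extension such as ".gz", ".bz2", or ".xz".
search_pattern <- "\\.csv(?:\\.[gbx]z2?)?$"

# Hypothetical file names, for illustration only.
files <- c(
  "set_geo_2015_category.csv.xz",
  "data.csv",
  "data.csv.gz",
  "data.csv.bz2",
  "data.csv.zip",
  "measure_info.json"
)

# TRUE where the name would be picked up as a data file by this pattern.
matched <- grepl(search_pattern, files, perl = TRUE)
print(data.frame(file = files, matched = matched))
```

Under this pattern, the first four names match, while the .zip and .json names do not.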
An invisible list of check results, in the form of paths to files and/or measure names. These may include general entries:
info (always): All measurement information (measure_info.json) files found.
data (always): All data files found.
not_considered: Subset of data files that do not contain the minimal columns (id and value), and so are not checked further.
summary (always): Summary of results.
or those relating to issues with measure information (see data_measure_info) files:
info_malformed: Files that are not in the expected format (a single object with named entries for each measure), but can be converted automatically.
info_incomplete: Measure entries that are missing some of the required fields.
info_invalid: Files that could not be read in (probably because they do not contain valid JSON).
info_refs_names: Files with a _references entry with no names (where it should be a named list).
info_refs_missing: Files with an entry in their _references entry that is missing one or more required entries (author, year, and/or title).
info_refs_*: Files with an entry in their _references entry that has an entry (*) that is a list (where they should all be strings).
info_refs_author_entry: Files with an entry in their _references entry that has an author entry that is missing a family entry.
info_source_missing: Measures with an entry in their source entry that is missing a required entry (name and/or date_accessed).
info_source_*: Measures with an entry (*) in their source entry that is a list (where they should all be strings).
info_citation: Measures with a citation entry that cannot be found in any _references entries across measure info files within the repository.
info_layer_source: Measures with an entry in their layer entry that is missing a source entry.
info_layer_source_url: Measures with an entry in their layer entry that has a list source entry that is missing a url entry. source entries can either be a string containing a URL, or a list with a url entry.
info_layer_filter: Measures with an entry in their layer entry that has a filter entry that is missing required entries (feature, operator, and/or value).
or relating to data files with warnings:
warn_compressed: Files that do not have compression extensions (.gz, .bz2, or .xz).
warn_blank_colnames: Files with blank column names (often due to saving files with row names).
warn_value_nas: Files that have NAs in their value columns; NAs here are redundant with the tall format, and so should be removed.
warn_double_ints: Variable names that have an int type, but with values that have remainders.
warn_small_percents: Variable names that have a percent type, but with values that are all under 1 (which are assumed to be raw proportions).
warn_small_values: Variable names with many values (over 40%) that are under .00001, and no values under 0 or over 1. These values should be scaled in some way to be displayed reliably.
warn_value_name_nas: Files that have NAs in their name column.
warn_entity_info_nas: Files that have NAs in any of their entity_info columns.
warn_dataset_nas: Files that have NAs in their dataset column.
warn_time_nas: Files that have NAs in their time column.
warn_id_nas: Files that have NAs in their id column.
warn_scientific: Files with IDs that appear to have scientific notation (e.g., 1e+5), likely introduced when the ID column was converted from numbers to characters. IDs should always be saved as characters.
warn_bg_agg: Files with IDs that appear to be census block group GEOIDs, that do not include their tract parents (i.e., IDs consisting of 12 digits, and there are no IDs consisting of their first 11 digits). These are automatically aggregated by site_build, but they should be manually aggregated.
warn_tr_agg: Files with IDs that appear to be census tract GEOIDs, that do not include their county parents (i.e., IDs consisting of 11 digits, and there are no IDs consisting of their first 5 digits). These are automatically aggregated by site_build, but they should be manually aggregated.
warn_missing_info: Measures in files that do not have a corresponding measure_info.json entry. Sometimes this happens because the entry includes a prefix that cannot be derived from the file name (which is the part after a year, such as category from set_geo_2015_category.csv.xz). It is recommended that entries not include prefixes, and that measure names be specific (e.g., category_count rather than count with a category:count entry).
or relating to data files with failures:
fail_read: Files that could not be read in.
fail_rows: Files containing no rows.
fail_time: Files with no time column.
fail_idlen_county: Files with "county" datasets with corresponding IDs that are not consistently 5 characters long.
fail_idlen_tract: Files with "tract" datasets with corresponding IDs that are not consistently 11 characters long.
fail_idlen_block_group: Files with "block group" datasets with corresponding IDs that are not consistently 12 characters long.
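The GEOID-based entries above (warn_bg_agg, warn_tr_agg, and the fail_idlen_* checks) can be pictured with a small base R sketch. This is only an illustration of the rules as described here, applied per ID to made-up IDs; it is not the function's actual implementation, and all variable names are hypothetical.

```r
# Hypothetical IDs: a county, a tract within it, a block group within that
# tract, and a block group whose tract parent is absent.
ids <- c("51059", "51059480100", "510594801001", "510594802002")
dataset <- c("county", "tract", "block group", "block group")

# Block group IDs have 12 digits; the tract parent is the first 11 digits.
is_bg <- grepl("^[0-9]{12}$", ids)
bg_missing_parent <- ids[is_bg & !(substr(ids, 1, 11) %in% ids)]

# Tract IDs have 11 digits; the county parent is the first 5 digits.
is_tr <- grepl("^[0-9]{11}$", ids)
tr_missing_parent <- ids[is_tr & !(substr(ids, 1, 5) %in% ids)]

# Expected ID lengths by dataset, as listed in the fail_idlen_* entries.
expected_nchar <- c(county = 5, tract = 11, "block group" = 12)
bad_length <- ids[nchar(ids) != expected_nchar[dataset]]

print(bg_missing_parent)
print(tr_missing_parent)
print(bad_length)
```

Here only the last block group lacks its tract parent, the tract has its county parent, and every ID has the length expected for its dataset.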
## Not run:
# from a data repository
check_repository()
# to automatically fix most warnings
check_repository(attempt_repair = TRUE)
## End(Not run)
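To try the checks on something small, a throwaway repository can be set up first. The sketch below writes one tall-format file using the default column names from the usage above; the directory, file name, and values are hypothetical, and the final call is left commented out because it assumes the package providing check_repository is attached.

```r
# Create a temporary directory to act as the data repository.
repo <- file.path(tempdir(), "example_repo")
dir.create(repo, showWarnings = FALSE)

# Tall format: one row per entity, time, and measure, with no NA values.
data <- data.frame(
  geoid = c("51059", "51059", "51013", "51013"), # IDs stored as characters
  region_type = "county",
  region_name = rep(c("Fairfax County", "Arlington County"), each = 2),
  year = c(2019, 2020, 2019, 2020),
  measure = "category_count",
  value = c(10, 12, 7, 9)
)

# Write compressed, and without row names (which would otherwise add a
# blank column name).
path <- file.path(repo, "set_geo_2020_category.csv.gz")
write.csv(data, gzfile(path), row.names = FALSE)

# Read back, keeping IDs as characters, and confirm the minimal columns.
read_back <- read.csv(path, colClasses = c(geoid = "character"))
has_minimal <- all(c("geoid", "value") %in% colnames(read_back))
print(has_minimal)

# With the package attached, the repository could then be checked:
# results <- check_repository(repo)
```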