knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(megametadata)

# here::here("R") %>% {glue::glue("{.}/{list.files(.)}")} %>% purrr::map(source)
library(yaml)
library(dplyr)
library(purrr)

Generating the initial Typing

When generating the specification of a new dataset, there are a few first steps.

The dataset needs a name and a description. This allows users to indentify a dataset consistently even as it requires different names on different platforms. It also allows users to understand what the underlying data is recording and its purpose, armed with which they will be able to more effectively make use of it.

The default generator also captures the column names and two measures of typing: - The first is of the type of data in the real world sense, what role does it play. Example being either a Category, or a value - The second is the internal R type of that variable, this can help to capture the state of the data import, and give a picture of what kinds of data are in your set.

# system.file("extdata", "template.yml", package = "megametadata") %>%
#        read_yaml() -> tester

tester <- iris %>%
  metaDictionary(list(), levelgen = "AccountLevel")
tester[1]
tester[[2]][1]

Once you have the initial specification, its worth looking into the assumptions made by the system. While the metrics used to determine the data category and class are made as robust as we can, there are of course limitiations to automated choices.

If you are happy with the categorisation, it is likely that the next thing you want to have available are some metrics for each variable. These metrics are likely to change depending on the data category and type, indeed some metrics are not sensible on some variable types.

You can use a specification file (This is easiest recorded as a YAML file but can be any list of the appropriate structure), to define the metrics for each R type in your data. megametadata will then update your dictionary with the all the metrics in your specification file.

system.file("extdata", "dict_spec.yml", package = "megametadata") %>%
        read_yaml() -> spec


test_data <- iris %>% as_tibble()


new_dict <- metaUpdateDictWithSpec(tester, spec, test_data, "AccountLevel")
new_dict[1]
new_dict[[2]][1]

The new dictionary will now have been expanded to include the metrics from your specification file.

An extra feature of the dictionary structure is a function to determine the Nominal and Continuous varaibles in the data you are working with. This may be valuable in designing pre-processing steps or just in understanding how you may be able to break down and visualise or predict your data.

metaSplitNominal(new_dict, "AccountLevel")


JDOsborne1/megametadata documentation built on April 9, 2020, 10:47 a.m.