r-package: Support for Rejustify API


View source: R/fill.R

The function submits a request to the API fill endpoint to retrieve the desired extra data points. At the current stage the data set must be rectangular, and the structure should have the shape proposed by the analyze function. The minimum required by the endpoint is the data set and the corresponding structure. You can browse the available resources at https://rejustify.com/repos. Other features, including private resources and models, are taken as defined for the account.

The API defines the submitted data set as x and any server-side data set as y. The corresponding structures follow the same naming convention, for instance structure.x and structure.y. The principal rule of any data manipulation is never to change data x (except for missing values), but only to adjust y.
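The rule above can be illustrated locally with a minimal sketch. The helper `fill_missing_only` is hypothetical and not part of the package; it only assumes that `y` has already been aligned to `x` element by element, and it shows that observed values in `x` stay untouched while missing ones are taken from `y`.

```r
# Hypothetical helper (not part of the rejustify package): fill only the
# missing values of x from y, leaving every observed value of x untouched.
fill_missing_only <- function(x, y) {
  stopifnot(length(x) == length(y))
  out <- x
  missing_idx <- is.na(x)          # positions where x has no value
  out[missing_idx] <- y[missing_idx]
  out
}

x <- c(10.5, NA, 12.1, NA)         # submitted values (data x), illustrative
y <- c(99.0, 11.3, 99.0, 13.4)     # server-side values (data y), assumed aligned to x
filled <- fill_missing_only(x, y)
print(filled)
```

Only the second and fourth positions are taken from `y`; the values already present in `x` are never overwritten, even where `y` disagrees.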

fill(
  df,
  structure,
  keys = NULL,
  default = NULL,
  shape = "vertical",
  inits = 1,
  sep = ",",
  learn = getOption("rejustify.learn"),
  accu = 0.75,
  form = "full",
  token = getOption("rejustify.token"),
  email = getOption("rejustify.email"),
  url = getOption("rejustify.mainUrl")
)

`df`	The data set to be analyzed. Must be matrix-convertible. If a data frame is supplied, the dimension names will be taken as the row/column names. If a matrix is supplied, the row/column names will be ignored, and the header will be set from the matrix values in line with the `inits` and `sep` specification.
`structure`	Structure of the `x` data set, characterizing classes, features, cleaners and formats of the columns/rows, and data provider/tables for empty columns. Ideally, it should come from the `analyze` endpoint.
`keys`	The matching keys and matching methods between dimensions in the `x` and `y` data sets. The elements in `keys` are determined from the information provided in data `x` and `y`, for each empty column. The details behind both data structures can be inspected in `structure.x` and `structure.y`. Matching keys are given consecutively, i.e. the first element in `id.x` and `name.x` corresponds to the first element in `id.y` and `name.y`, and so on. Dimension names are given for better readability of the results; they are not necessary for API recognition. `keys` also returns the data classification in element `class` and the proposed matching method for each part of `id.x` and `id.y`. Currently, the API supports 6 matching methods: `synonym-proximity-matching`, `synonym-matching`, `proximity-matching`, `time-matching`, `exact-matching` and `value-selection`, which are given in decreasing order of complexity. `synonym-proximity-matching` uses the proximity between the values in data `x` and `y` and the corresponding values in the rejustify dictionary. If the proximity is above the threshold `accu` and there are values in `x` and `y` pointing to the same element in the dictionary, the values will be matched. `synonym-matching` and `proximity-matching` apply similar logic, each using one of the steps described for `synonym-proximity-matching`. `time-matching` aims at standardizing the time values to the same format before matching. To function properly it requires an accurate characterization of the date format in `structure.x` (`structure.y` is already classified by rejustify). `exact-matching` will match two values only if they are identical. `value-selection` is a quasi-matching method which, for a single-valued dimension in `x`, returns a single value from `y`, as suggested by the `default` specification. It is the most efficient matching method for dimensions which do not show any variability.
`default`	Default values used to lock dimensions in data `y` which will not be used for matching against data `x`. Each empty column to be filled, characterized by `default$column.id.x`, must contain a description of the default values. If missing, the API will propose default values in line with how the resource was used in the past.
`shape`	It informs the API whether the data set should be read by columns (`vertical`) or by rows (`horizontal`). The default is `vertical`.
`inits`	It informs the API how many initial rows (or columns in horizontal data) correspond to the header description. The default is `inits=1`.
`sep`	The header can also be described by single field values, separated by a given character separator, for instance 'GDP, Austria, 1999'. The option informs the API which separator should be used to split the initial header string into corresponding dimensions. The default is `sep=','`.
`learn`	It is `TRUE` if the user allows rejustify to track their activity to enhance the performance of the AI algorithms (it is not enabled by default). To change this option for all API calls run `setCurl(learn=TRUE)`.
`accu`	Acceptable accuracy level on a scale from 0 to 1. It is used in the matching algorithms to determine string similarity. The default is `accu=0.75`.
`form`	Requests the data to be returned in either `full` or `partial` shape. The former returns the original data with the empty columns filled. The latter returns only the filled columns.
`token`	API token. By default read from global variables.
`email`	E-mail address for the account. By default read from global variables.
`url`	API url. By default read from global variables.
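To make the role of `accu` in the proximity-based methods more concrete, the following is a minimal local sketch of threshold-based string matching. It is an assumption-laden illustration: the similarity measure (1 minus the normalized Levenshtein distance from base R's `adist`) and the helpers `string_similarity` and `proximity_match` are hypothetical, and the measure actually used by the API is not documented here.

```r
# Hypothetical similarity: 1 - normalized Levenshtein distance (base R adist).
# This is an assumed stand-in, not the measure used by the rejustify API.
string_similarity <- function(a, b) {
  d <- utils::adist(a, b, ignore.case = TRUE)   # length(a) x length(b) matrix
  max_len <- outer(nchar(a), nchar(b), pmax)    # longer string length per pair
  1 - d / pmax(max_len, 1)                      # guard against division by zero
}

# Hypothetical matcher: for each value in x pick the closest value in y,
# and accept it only if the similarity reaches the threshold accu.
proximity_match <- function(x, y, accu = 0.75) {
  sim <- string_similarity(x, y)
  best <- apply(sim, 1, which.max)              # index of closest y per x
  best_sim <- sim[cbind(seq_along(x), best)]
  ifelse(best_sim >= accu, y[best], NA_character_)
}

x_values <- c("Polnd", "Austria", "Atlantis")   # illustrative values in data x
y_values <- c("Poland", "Austria", "Portugal")  # illustrative values in data y
matched <- proximity_match(x_values, y_values, accu = 0.75)
print(matched)
```

In this sketch a misspelled value is still matched when its similarity reaches the threshold, while a value with no sufficiently close counterpart is left unmatched (`NA`). Raising `accu` towards 1 makes the behaviour approach exact matching.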

A list consisting of 5 elements: data, structure.x, structure.y, keys and default.

#API setup
setCurl()

#register token/email
register(token = "YOUR_TOKEN", email = "YOUR_EMAIL")

#sample data set
df <- data.frame(year = c("2009", "2010", "2011"),
                 country = c("Poland", "Poland", "Poland"),
                 `gross domestic product` = NA,
                 check.names = FALSE, stringsAsFactors = FALSE)

#endpoint analyze
st <- analyze(df)

#endpoint fill
df1 <- fill(df, st)
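The `inits` and `sep` arguments describe how a header stored in the data values is split into dimensions. The sketch below illustrates that idea locally for matrix input; `split_header` is a hypothetical helper, not a function of the package, and the API may process headers differently. The second header string is an illustrative extension of the 'GDP, Austria, 1999' example.

```r
# Hypothetical helper (not part of the rejustify package): split the first
# `inits` rows of a character matrix into header dimensions using `sep`.
split_header <- function(m, inits = 1, sep = ",") {
  header <- m[seq_len(inits), , drop = FALSE]
  lapply(seq_len(ncol(header)), function(j) {
    parts <- unlist(strsplit(header[, j], sep, fixed = TRUE))
    trimws(parts)                               # drop spaces around each part
  })
}

# matrix whose first row holds single-field headers; data cells are empty
m <- matrix(c("GDP, Austria, 1999", "GDP, Austria, 2000",
              NA,                   NA),
            nrow = 2, byrow = TRUE)

dims <- split_header(m, inits = 1, sep = ",")
print(dims[[1]])
```

Each column header is returned as a character vector of its dimensions, which is the kind of decomposition that `sep` is meant to control.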