fill: communicates with rejustify/fill API endpoint

Description

The function submits a request to the API fill endpoint to retrieve the desired extra data points. At the current stage, the data set must be rectangular, and the structure should have the shape proposed by the analyze function. The minimum required by the endpoint is the data set and the corresponding structure. You can browse the available resources at https://rejustify.com/repos. Other features, including private resources and models, are taken as defined for the account.

The API defines the submitted data set as x and any server-side data set as y. The corresponding structures follow the same principle and are named structure.x and structure.y. The principal rule of any data manipulation is never to change data x (except for missing values) and to adjust only y.

Usage

fill(
  df,
  structure,
  keys = NULL,
  default = NULL,
  shape = "vertical",
  inits = 1,
  sep = ",",
  learn = getOption("rejustify.learn"),
  accu = 0.75,
  form = "full",
  token = getOption("rejustify.token"),
  email = getOption("rejustify.email"),
  url = getOption("rejustify.mainUrl")
)

Arguments

df

The data set to be analyzed. Must be matrix-convertible. If it is a data frame, the dimension names will be taken as the row/column names. If it is a matrix, the row/column names will be ignored, and the header will be set from the matrix values in line with the inits and sep specification.
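As a minimal sketch of the matrix case described above, the snippet below builds a character matrix whose first row carries the header, so that inits = 1 would point at it. The commented-out fill call and the object st (a structure returned by analyze) are assumptions shown for orientation only; the call is not executed here.

```r
# Sample data frame, as used in the Examples section.
df <- data.frame(year = c("2009", "2010", "2011"),
                 country = c("Poland", "Poland", "Poland"),
                 `gross domestic product` = NA,
                 check.names = FALSE, stringsAsFactors = FALSE)

# Put the column names into the first row of a plain character matrix.
m <- rbind(names(df), as.matrix(df))

# Drop dimension names: for matrices they are ignored and the header
# is read from the matrix values instead.
dimnames(m) <- NULL

# Hypothetical call (requires a registered account and a structure `st`):
# res <- fill(m, st, inits = 1)
```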

structure

Structure of the x data set, characterizing classes, features, cleaners and formats of the columns/rows, and data providers/tables for empty columns. Ideally, it should come from the analyze endpoint.

keys

The matching keys and matching methods between dimensions in x and y data sets. The elements in keys are determined based on information provided in data x and y, for each empty column. The details behind both data structures can be visualized by structure.x and structure.y.

Matching keys are given consecutively, i.e. the first element in id.x and name.x corresponds to the first element in id.y and name.y, and so on. Dimension names are given for better readability of the results; they are not necessary for API recognition. keys also returns the data classification in element class and the proposed matching method for each part of id.x and id.y.

Currently, the API supports 6 matching methods: synonym-proximity-matching, synonym-matching, proximity-matching, time-matching, exact-matching and value-selection, which are listed in decreasing order of complexity. synonym-proximity-matching uses the proximity between the values in data x and y and the corresponding values in the rejustify dictionary. If the proximity is above the threshold accu and there are values in x and y pointing to the same element in the dictionary, the values will be matched. synonym-matching and proximity-matching each apply one of the two steps described for synonym-proximity-matching. time-matching aims at standardizing the time values to the same format before matching. To function properly, it requires an accurate characterization of the date format in structure.x (structure.y is already classified by rejustify). exact-matching will match two values only if they are identical. value-selection is a quasi-matching method which, for a single-valued dimension in x, returns a single value from y, as suggested by the default specification. It is the most efficient matching method for dimensions which do not show any variability.
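To make the role of the accu threshold concrete, here is a rough local sketch of a proximity-style match. It uses a normalized edit distance from base R (utils::adist) as the similarity measure; this measure, the helper name similarity and the sample values are assumptions for illustration and are not the algorithm rejustify runs on the server.

```r
# Illustrative similarity: 1 minus the edit distance scaled by the
# length of the longer string. NOT rejustify's actual measure.
similarity <- function(a, b) {
  d <- utils::adist(a, b, ignore.case = TRUE)   # edit distance, length(a) x length(b)
  longest <- outer(nchar(a), nchar(b), pmax)    # longer string length for each pair
  1 - d / longest
}

x_values <- c("Polnd", "Austria")            # hypothetical values in data x
y_values <- c("Poland", "Austria", "Peru")   # hypothetical values in data y
accu <- 0.75                                 # same default as in fill()

sim <- similarity(x_values, y_values)

# For each x value, take the closest y value and keep it only if the
# similarity reaches the accu threshold.
best <- apply(sim, 1, which.max)
best_sim <- sim[cbind(seq_along(x_values), best)]
matched <- ifelse(best_sim >= accu, y_values[best], NA_character_)
```

Raising accu makes the match stricter: with this measure, the misspelled "Polnd" would no longer be paired with "Poland" at accu = 0.9, while an exact-matching rule would reject it at any threshold.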

default

Default values used to lock dimensions in data y which will not be used for matching against data x. Each empty column to be filled, characterized by default$column.id.x, must contain a description of the default values. If missing, the API will propose the default values based on how the resource was used in the past.

shape

It informs the API whether the data set should be read by columns (vertical) or by rows (horizontal). The default is vertical.

inits

It informs the API how many initial rows (or columns in horizontal data) correspond to the header description. The default is inits=1.

sep

The header can also be described by single field values, separated by a given character separator, for instance 'GDP, Austria, 1999'. The option informs the API which separator should be used to split the initial header string into corresponding dimensions. The default is sep=','.
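The splitting described above can be pictured locally with base R. This is only a sketch of the idea: the actual parsing happens on the server, and the trimming of surrounding whitespace shown here is an assumption.

```r
header <- "GDP, Austria, 1999"   # single-field header from the text above
sep <- ","                       # same default as in fill()

# Split on the literal separator and trim surrounding whitespace
# (trimming is an assumption about how the pieces are cleaned).
dims <- trimws(strsplit(header, sep, fixed = TRUE)[[1]])
```

Each element of dims then corresponds to one header dimension of the column.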

learn

Set to TRUE if the user allows rejustify to track their activity to enhance the performance of the AI algorithms (not enabled by default). To change this option for all API calls, run setCurl(learn=TRUE).

accu

Acceptable accuracy level on a scale from 0 to 1. It is used in the matching algorithms to determine string similarity. The default is accu=0.75.

form

Requests the data to be returned in either full or partial form. The former returns the original data with the empty columns filled. The latter returns only the filled columns. The default is form="full".
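As a rough local illustration of what counts as an empty column, the sketch below lists the columns of a sample data frame that contain only missing values. Under the description above, these are the columns a partial result would contain, while a full result would keep all columns. The helper logic is an assumption for illustration and does not call the API.

```r
# Sample data frame, as used in the Examples section.
df <- data.frame(year = c("2009", "2010", "2011"),
                 country = c("Poland", "Poland", "Poland"),
                 `gross domestic product` = NA,
                 check.names = FALSE, stringsAsFactors = FALSE)

# A column is treated as empty here if all of its values are missing.
is_empty <- vapply(df, function(col) all(is.na(col)), logical(1))
empty_cols <- names(df)[is_empty]

# Columns expected under each form, following the description above.
full_cols <- names(df)
partial_cols <- empty_cols
```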

token

API token. By default, read from the global option rejustify.token.

email

E-mail address for the account. By default, read from the global option rejustify.email.

url

API URL. By default, read from the global option rejustify.mainUrl.

Value

A list consisting of 5 elements: data, structure.x, structure.y, keys and default.

Examples

#API setup
setCurl()

#register token/email
register(token = "YOUR_TOKEN", email = "YOUR_EMAIL")

#sample data set
df <- data.frame(year = c("2009", "2010", "2011"),
                 country = c("Poland", "Poland", "Poland"),
                 `gross domestic product` = NA,
                 check.names = FALSE, stringsAsFactors = FALSE)

#endpoint analyze
st <- analyze(df)

#endpoint fill
df1 <- fill(df, st)

rejustify/r-package documentation built on Nov. 7, 2021, 2:10 p.m.