r-package: Support for Rejustify API


View source: R/analyze.R

The function submits the data set to the analyze API endpoint and returns the proposed structure of the data. At the current stage, the data set must be rectangular, either vertical or horizontal.

The API recognizes multi-dimension and multi-line headers. The first inits rows/columns are collapsed using the sep character. Make sure that the separator doesn't appear in the header values. It is also possible to separate dimensions in single-line headers (see examples below).
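As a rough local illustration of what collapsing a multi-line header means, the first inits rows of a matrix can be pasted together with sep, and a single-line header can be split back into its dimensions with the same separator. This is a sketch in base R, not the API's actual implementation, and the helper name collapse_header is hypothetical.

```r
# Hypothetical helper (not part of the package): collapse the first `inits`
# rows of a character matrix into one header string per column, joined by
# `sep`. NA cells are skipped.
collapse_header <- function(m, inits = 1, sep = ",") {
  head_rows <- m[seq_len(inits), , drop = FALSE]
  apply(head_rows, 2, function(x) paste(x[!is.na(x)], collapse = sep))
}

# Matrix with a two-line header, similar to the multi-line example below
m <- cbind(c(NA, "country", "Poland"),
           c("gross domestic product", "2009", NA),
           c("gross domestic product", "2010", NA))

headers <- collapse_header(m, inits = 2, sep = ";")
print(headers)

# Splitting a single-line header back into its dimensions
dims <- strsplit(headers[2], ";", fixed = TRUE)[[1]]
print(dims)
```

The sketch also shows why the separator must not appear inside header values: a header containing ";" would be split into too many dimensions.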

The classification algorithms are applied to the values in the rows/columns if they are not empty, and to the headers if the rows/columns are empty. For efficiency reasons, only a sample of values in each column is analyzed. To improve the classification accuracy, you can ask the API to draw a larger sample by setting fast=FALSE. For empty columns, the API returns the proposed resources that appear to fit well in the empty spaces given the header information and the overall structure of df.

The basic properties are characterized by classes. Currently, the API distinguishes between 6 classes: general, geography, unit, time, sector, number. They describe the basic characteristics of the values, and they are further used to propose the best transformations and matching methods for data reconciliation. Classes are further supported by features, which determine the characteristics in greater detail; for example, class geography may be further supported by feature country.

Cleaner contains the basic set of transformations applied to each value in a dimension to retrieve a machine-readable representation. For instance, the values y1999, y2000, ... clearly correspond to years; however, they will be processed much faster if the initial y character is stripped, for example with the cleaner ^y. Cleaner allows basic regular expressions.
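The effect of a cleaner such as ^y can be reproduced locally with base R. This is only an illustration of the regular expression itself. How the API applies the cleaner internally is not described here.

```r
# Raw values as they might appear in a time dimension
values <- c("y1999", "y2000", "y2001")

# Strip the leading "y" using the regular expression ^y
cleaned <- sub("^y", "", values)

# The cleaned values can now be read as plain years
years <- as.integer(cleaned)
print(years)
```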

Finally, format corresponds to the format of the values, and it is particularly useful for time-series operations. Format allows the standard date formats (see ?as.Date).
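As a small sketch of what a date format does, the base R call below parses day/month/year strings. It assumes that the format strings follow the conventions documented in ?as.Date. The sample values are made up for illustration.

```r
# Illustrative values written as day/month/year
raw <- c("31/01/2009", "28/02/2009")

# Parse them with an explicit format string
dates <- as.Date(raw, format = "%d/%m/%Y")

# Print in ISO form to confirm the parse
iso <- format(dates, "%Y-%m-%d")
print(iso)
```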

The classification algorithm can be substantially improved by allowing it to recall how it was used in the past and how well it performed. The parameter learn controls this feature; it is disabled by default. The information stored by rejustify is tailored to each user individually, and it can substantially increase the usability of the API. For instance, suppose the proposed provider for an empty row/column with header 'gross domestic product' is IMF. Selecting another provider, for instance AMECO, will teach the algorithm that for this combination of headers and rows/columns AMECO is the preferred provider, so that the next time the API is called there is a higher chance that AMECO is picked by default. To enable the learning option in all API calls by default, run setCurl(learn=TRUE).

If learn=TRUE, the information stored by rejustify includes (i) the information changed by the user with respect to the assigned class, feature, cleaner and format, (ii) resources determined by provider, table and headers of df, and (iii) hand-picked matching values for value-selection. The information is stored only upon a change of values within groups (i)-(iii).

analyze(
  df,
  shape = "vertical",
  inits = 1,
  fast = TRUE,
  sep = ",",
  learn = getOption("rejustify.learn"),
  token = getOption("rejustify.token"),
  email = getOption("rejustify.email"),
  url = getOption("rejustify.mainUrl")
)

`df`	The data set to be analyzed. Must be matrix-convertible. If data frame, the dimension names will be taken as the row/column names. If matrix, the row/column names will be ignored, and the header will be set from matrix values in line with `inits` and `sep` specification.
`shape`	It informs the API whether the data set should be read by columns (`vertical`) or by rows (`horizontal`). The default is `vertical`.
`inits`	It informs the API how many initial rows (or columns in horizontal data) correspond to the header description. The default is `inits=1`.
`fast`	Informs the API how large a sample of the original data to draw. The larger the sample, the more precise but slower the algorithm. Under `fast = TRUE` the API samples 5% of the data; under `fast = FALSE` it samples 25%. The default is `fast=TRUE`.
`sep`	The header can also be described by single field values, separated by a given character separator, for instance 'GDP, Austria, 1999'. The option informs the API which separator should be used to split the initial header string into corresponding dimensions. The default is `sep=','`.
`learn`	It should be set to `TRUE` if the user allows rejustify to track their activity to enhance the performance of the AI algorithms (it is not enabled by default). To change this option for all API calls, run `setCurl(learn=TRUE)`.
`token`	API token. By default read from the global option `rejustify.token`.
`email`	E-mail address for the account. By default read from the global option `rejustify.email`.
`url`	API url. By default read from the global option `rejustify.mainUrl`.
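
The option-backed defaults listed above can be illustrated with a small stand-alone function. The function show_settings is hypothetical and only mirrors the getOption() defaults from the usage. The options are set by hand here for the illustration, rather than through setCurl() or register().

```r
# Hypothetical function mirroring the option-backed defaults of analyze()
show_settings <- function(learn = getOption("rejustify.learn"),
                          email = getOption("rejustify.email")) {
  list(learn = learn, email = email)
}

# Set the options by hand for this illustration only
old <- options(rejustify.learn = FALSE, rejustify.email = "YOUR_EMAIL")

defaults <- show_settings()              # values come from the options
override <- show_settings(learn = TRUE)  # an explicit argument wins

# Restore the previous option values
options(old)

print(defaults)
print(override)
```

Passing an argument explicitly affects a single call only, while changing the option affects every later call that relies on the default.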

The proposed structure of the `df` data set.

#API setup
setCurl()

#register token/email
register(token = "YOUR_TOKEN", email = "YOUR_EMAIL")

#sample data set
df <- data.frame(year = c("2009", "2010", "2011"),
                 country = c("Poland", "Poland", "Poland"),
                 `gross domestic product` = NA,
                 check.names = FALSE, stringsAsFactors = FALSE)
analyze(df)

#data set with one-line multi-dimension header (semi-colon separated)
df <- data.frame(country = c("Poland", "Poland", "Poland"),
                 `gross domestic product;2009` = NA,
                 `gross domestic product;2010` = NA,
                 check.names = FALSE, stringsAsFactors = FALSE)
analyze(df, sep = ";")

#data set with multi-line header
df <- cbind(c(NA, "country", "Poland", "Poland", "Poland"),
            c("gross domestic product", "2009", NA, NA, NA),
            c("gross domestic product", "2010", NA, NA, NA))
analyze(df, inits = 2)