The function submits the data set to the analyze API endpoint and returns the proposed structure of the data. At the current stage the data set must be rectangular, either vertical or horizontal. The API recognizes multi-dimension and multi-line headers. The first inits rows/columns are collapsed using the sep character. Make sure that the separator does not appear in the header values. It is also possible to separate dimensions in single-line headers (see the examples below).
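The header collapsing described above can be pictured with a short base R sketch. It is a local illustration only: the actual collapsing is done by the API, and its exact rules (for instance how empty header cells are treated) are assumptions here. The variable names m, header_rows and collapsed are hypothetical.

```r
# Local illustration only; the API performs its own collapsing.
# Two header rows followed by one data row, as in a multi-line header layout.
m <- cbind(c(NA, "country", "Poland"),
           c("gross domestic product", "2009", NA),
           c("gross domestic product", "2010", NA))

inits <- 2    # number of header rows (value chosen for this sketch)
sep   <- ";"  # separator; must not occur in any header value

header_rows <- m[seq_len(inits), , drop = FALSE]

# Guard: the separator must not appear inside the header values themselves.
stopifnot(!any(grepl(sep, header_rows, fixed = TRUE)))

# Collapse each column's header cells into one string,
# skipping empty cells (an assumption of this sketch).
collapsed <- apply(header_rows, 2,
                   function(x) paste(x[!is.na(x)], collapse = sep))
print(collapsed)
```

With this layout each column ends up with a single header string, which is the same shape as the single-line, separator-delimited headers mentioned above.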
The classification algorithms are applied to the values in the rows/columns if they are not empty, and to the headers if the rows/columns are empty. For efficiency reasons, only a sample of values in each column is analyzed. To improve the classification accuracy, you can ask the API to draw a larger sample by setting fast=FALSE. For empty columns, the API returns the proposed resources that appear to fit well in the empty spaces, given the header information and the overall structure of df.
The basic properties are characterized by classes. Currently, the API distinguishes between 6 classes: general, geography, unit, time, sector, number. They describe the basic characteristics of the values, and they are further used to propose the best transformations and matching methods for data reconciliation. Classes are further supported by features, which determine the characteristics in greater detail; for example, class geography may be further supported by feature country.
Cleaner contains the basic set of transformations applied to each value in a dimension to retrieve a machine-readable representation. For instance, the values y1999, y2000, ... clearly correspond to years; however, they will be processed much faster if stripped of the initial y character, for instance with the cleaner ^y. Cleaner allows basic regular expressions. Finally, format corresponds to the format of the values, and it is particularly useful for time-series operations. Format allows the standard date formats (see ?as.Date).
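The effect of a cleaner and a format can be approximated locally in base R. This is a sketch, not a call to the API: it assumes that a cleaner removes the text matched by the regular expression, and it uses sub() and as.Date() only to illustrate the idea. The sample date strings are made up for the illustration.

```r
# Local illustration in base R; not a call to the API.
raw <- c("y1999", "y2000", "y2001")

# A cleaner such as ^y is assumed to strip the leading "y" from each value.
cleaned <- sub("^y", "", raw)

# After cleaning, the values convert directly to numeric years.
years <- as.integer(cleaned)

# A format string in the as.Date convention, here for day/month/year values.
dates <- as.Date(c("31/12/1999", "31/12/2000"), format = "%d/%m/%Y")
print(format(dates, "%Y"))
```

Without the cleaner, as.integer("y1999") would return NA with a warning, which is why stripping the prefix first makes the values machine-readable.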
The classification algorithm can be substantially improved by allowing it to recall how it was used in the past and how well it performed. The parameter learn controls this feature; by default it is disabled. The information stored by rejustify is tailored to each user individually, and it can substantially increase the usability of the API. For instance, suppose the proposed provider for an empty row/column with the header 'gross domestic product' is IMF. Selecting another provider, for instance AMECO, will teach the algorithm that for this combination of headers and rows/columns AMECO is the preferred provider, so that the next time the API is called there will be a higher chance of AMECO being picked by default. To enable the learning option in all API calls by default, run setCurl(learn=TRUE).
If learn=TRUE, the information stored by rejustify includes (i) the information changed by the user with respect to the assigned class, feature, cleaner and format, (ii) resources determined by provider, table and headers of df, (iii) hand-picked matching values for value-selection. The information will be stored only upon a change of values within groups (i-iii).
df |
The data set to be analyzed. Must be matrix-convertible. If data frame,
the dimension names will be taken as the row/column names. If matrix, the row/column
names will be ignored, and the header will be set from matrix values in line with |
shape |
It informs the API whether the data set should be read by
columns ( |
inits |
It informs the API how many initial rows (or columns in
horizontal data) correspond to the header description. The default
is |
fast |
It informs the API how large a sample of the original data should be drawn.
The larger the sample, the more precise but overall slower the algorithm.
Under the |
sep |
The header can also be described by single field values,
separated by a given character separator, for instance 'GDP, Austria, 1999'.
The option informs the API which separator should be used to split the
initial header string into corresponding dimensions. The default is |
learn |
It should be set as |
token |
API token. By default read from global variables. |
email |
E-mail address for the account. By default read from global variables. |
url |
API url. By default read from global variables. |
The proposed structure of the df data set.
#API setup
setCurl()

#register token/email
register(token = "YOUR_TOKEN", email = "YOUR_EMAIL")

#sample data set
df <- data.frame(year = c("2009", "2010", "2011"),
                 country = c("Poland", "Poland", "Poland"),
                 `gross domestic product` = NA,
                 check.names = FALSE, stringsAsFactors = FALSE)
analyze(df)

#data set with one-line multi-dimension header (semi-colon separated)
df <- data.frame(country = c("Poland", "Poland", "Poland"),
                 `gross domestic product;2009` = NA,
                 `gross domestic product;2010` = NA,
                 check.names = FALSE, stringsAsFactors = FALSE)
analyze(df, sep = ";")

#data set with multi-line header
df <- cbind(c(NA, "country", "Poland", "Poland", "Poland"),
            c("gross domestic product", "2009", NA, NA, NA),
            c("gross domestic product", "2010", NA, NA, NA))
analyze(df, inits = 2)