data_matching: Matching products

View source: R/f_data_processing.R

data_matching    R Documentation

Matching products

Description

This function returns a data set defined in the first parameter (data) with an additional column (prodID). Two products are treated as being matched if they have the same prodID value.

Usage

data_matching(
  data,
  start,
  end,
  interval = FALSE,
  variables = c(),
  codeIN = TRUE,
  codeOUT = TRUE,
  description = TRUE,
  onlydescription = FALSE,
  precision = 0.95
)

Arguments

data

The user's data frame with information about the products to be matched. It must contain a time column (as Date in year-month-day format, e.g. '2020-12-01') and at least one of the following columns: codeIN (as numeric, factor or character), codeOUT (as numeric, factor or character) and description (as character).
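As a rough illustration of the input layout described above, the following sketch builds a small data frame with a time column and all three optional matching columns. The object name toy and every value in it are invented for illustration; they do not come from the package.

```r
# Hypothetical toy data; all values are invented for illustration only.
toy <- data.frame(
  time = as.Date(c("2020-03-01", "2020-03-01", "2020-04-01", "2020-04-01")),
  codeIN = c("A100", "B200", "A100", "B201"),            # retailer (internal) codes
  codeOUT = c("5901234000011", "5901234000028",
              "5901234000011", "5901234000028"),          # external codes
  description = c("whole milk 1l", "butter 200g",
                  "whole milk 1 l", "butter 200 g"),      # product labels
  stringsAsFactors = FALSE
)

# Check the column types before passing the data on.
str(toy)
```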

start

The base period (as character) limited to the year and month, e.g. "2020-03".

end

The research period (as character) limited to the year and month, e.g. "2020-04".

interval

A logical value indicating whether matching concerns only the two periods defined by the start and end parameters (interval = FALSE) or all products sold during the whole time interval <start, end> (interval = TRUE).
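The difference between the two settings can be sketched in base R. The helper select_periods below is hypothetical and not part of the package; it only illustrates which observations each setting would consider, assuming periods are compared at year-month resolution.

```r
# Hypothetical helper (not part of the package): selects rows by year-month.
select_periods <- function(data, start, end, interval = FALSE) {
  ym <- format(data$time, "%Y-%m")        # e.g. "2020-03"
  if (interval) {
    keep <- ym >= start & ym <= end       # "YYYY-MM" strings sort chronologically
  } else {
    keep <- ym %in% c(start, end)         # only the base and research periods
  }
  data[keep, , drop = FALSE]
}

sales <- data.frame(
  time = as.Date(c("2020-02-15", "2020-03-10", "2020-04-05",
                   "2020-05-20", "2020-06-01"))
)

two_periods <- select_periods(sales, "2020-03", "2020-05")
whole_range <- select_periods(sales, "2020-03", "2020-05", interval = TRUE)
```

With this toy input, two_periods keeps the March and May rows only, while whole_range also keeps the April row.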

variables

An optional character vector of additional column names. Matched products must have identical values in these columns.

codeIN

A logical value. Set it to TRUE if the codeIN column holds retailer (internal) product codes (as numeric or character) and that column should be used for matching; otherwise set it to FALSE.

codeOUT

A logical value. Set it to TRUE if the codeOUT column holds external product codes, such as GTIN or SKU (as numeric or character), and that column should be used for matching; otherwise set it to FALSE.

description

A logical value. Set it to TRUE if the description column holds product labels (as character) and that column should be used for matching; otherwise set it to FALSE.

onlydescription

A logical value indicating whether products with identical labels (given in the description column) are to be matched.

precision

A threshold value for the Jaro-Winkler similarity measure when comparing labels (its value must belong to the interval [0,1]). Two labels are treated as similar enough if their Jaro-Winkler similarity exceeds the precision value.
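To make the role of the threshold concrete, here is a minimal base R sketch of the textbook Jaro-Winkler similarity. The function jaro_winkler is hypothetical and written for illustration; the prefix scaling factor p = 0.1 and the four-character prefix cap are the conventional defaults and are assumptions here, so the package's own computation may differ in detail.

```r
# Hypothetical helper: textbook Jaro-Winkler similarity for two strings.
jaro_winkler <- function(a, b, p = 0.1) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  n <- length(x)
  m <- length(y)
  if (n == 0 && m == 0) return(1)
  if (n == 0 || m == 0) return(0)

  # Characters count as matching only within this distance of each other.
  window <- max(floor(max(n, m) / 2) - 1, 0)
  x_hit <- logical(n)
  y_hit <- logical(m)
  for (i in seq_len(n)) {
    lo <- max(1, i - window)
    hi <- min(m, i + window)
    if (lo > hi) next
    for (j in lo:hi) {
      if (!y_hit[j] && x[i] == y[j]) {
        x_hit[i] <- TRUE
        y_hit[j] <- TRUE
        break
      }
    }
  }
  matches <- sum(x_hit)
  if (matches == 0) return(0)

  # Half the number of matched characters that appear in a different order.
  transpositions <- sum(x[x_hit] != y[y_hit]) / 2
  jaro <- (matches / n + matches / m +
             (matches - transpositions) / matches) / 3

  # Winkler adjustment: reward a common prefix of up to 4 characters.
  prefix <- 0
  for (k in seq_len(min(4, n, m))) {
    if (x[k] != y[k]) break
    prefix <- prefix + 1
  }
  jaro + prefix * p * (1 - jaro)
}

jaro_winkler("MARTHA", "MARHTA")   # about 0.961

# Two labels would count as similar enough when the similarity exceeds precision.
precision <- 0.95
jaro_winkler("whole milk 1l", "whole milk 1 l") > precision
```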

Value

This function returns a data set defined in the first parameter (data) with an additional column (prodID). Two products are treated as being matched if they have the same prodID value. The procedure for generating this additional column depends on the set of columns chosen for matching. In the most extreme case, when the onlydescription parameter is TRUE, two products are also matched if they have identical descriptions. The other cases are as follows:

Case 1: Parameters codeIN, codeOUT and description are set to TRUE. Products with two identical codes, or with one identical code and an identical description, are automatically matched. Products are also matched if they have one identical code and the Jaro-Winkler similarity of their descriptions is greater than the precision value.

Case 2: Only one of the parameters codeIN or codeOUT is set to TRUE, and the description parameter is also set to TRUE. Products with an identical chosen code and an identical description are automatically matched. In the second stage, products are also matched if they have an identical chosen code and the Jaro-Winkler similarity of their descriptions is greater than the precision value.

Case 3: Parameters codeIN and codeOUT are set to TRUE and the description parameter is set to FALSE. In this case, products are matched if both of their codes are identical.

Case 4: Only the description parameter is set to TRUE. This case requires the onlydescription parameter to be TRUE, and the matching process is then based only on product labels (two products are matched if they have identical descriptions).

Case 5: Only one of the parameters codeIN or codeOUT is set to TRUE and the description parameter is set to FALSE. In this case, the only reasonable option is to return a prodID column that is identical to the chosen code column.

Please note that if the set of column names defined in the variables parameter is not empty, then the values of these additional columns must also be identical for products to be matched.
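The exact-key part of this logic (Case 3 combined with a non-empty variables vector) can be sketched in base R as follows. The helper assign_prod_id and the brand column are hypothetical, and the integer identifiers it produces are arbitrary; this is not the package's implementation and it does not cover the Jaro-Winkler stage of Cases 1 and 2.

```r
# Hypothetical helper (not the package's code): exact-key matching only.
assign_prod_id <- function(data, keys) {
  # Combine the key columns into one string per row.
  key <- do.call(paste, c(unname(as.list(data[keys])), sep = "\r"))
  # Rows with the same combined key receive the same identifier.
  data$prodID <- match(key, unique(key))
  data
}

items <- data.frame(
  codeIN = c("A1", "A1", "B2", "B2"),
  codeOUT = c("111", "111", "222", "333"),
  brand = c("x", "x", "y", "y"),          # plays the role of a 'variables' column
  stringsAsFactors = FALSE
)

matched <- assign_prod_id(items, c("codeIN", "codeOUT", "brand"))
matched$prodID
```

Here the first two rows share both codes and the brand, so they receive the same prodID; the last two rows differ in codeOUT and therefore stay unmatched.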

Examples

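library(PriceIndices)
# Match products with identical labels over the whole interval
data_matching(dataMATCH, start = "2018-12", end = "2019-02", onlydescription = TRUE, interval = TRUE)
# Match by codes and labels, using a stricter Jaro-Winkler threshold
data_matching(dataMATCH, start = "2018-12", end = "2019-02", precision = 0.98, interval = TRUE)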


PriceIndices documentation built on July 9, 2023, 6:20 p.m.