In nitesh-1507/WindPlus: Covariates matching and performance quantification

knitr::opts_chunk$set(
  fig.width=8, 
  fig.height=4,
  collapse = TRUE,
  comment = "#>"
)

Introduction

The package aims at unveiling a technique called covariate matching, which takes in different data sets and returns data sets whose feature values exhibit similar characteristics or similar distribution. In other words, if a wind speed of 3.2 m/s exists in one data set, the algorithm tries to find wind speed in the vicinity of 3.2 m/s in the other data set.

Theory

Covariate matching methods are rooted in the statistical literature. In stabilizing the non-experimental discrepancy between non-treated and treated subjects of observational data, covariate distributions are adjusted by selecting non-treated subjects that have a similar covariate condition as that of treated ones. Through the process of matching, non-treated and treated groups become only randomly different on all background covariates, as if these covariates were designed by experimenters. As a result, the outcomes of the matched non-treated and treated groups, which keep the originally observed values, are comparable under the matched covariate conditions.

1. Definition

Covariate matching is a technique to adjust covariates distribution to make treated and non treated group to have the same covariates condition.

2. Algorithm

Covariate matching is broadly divided into steps : Hierarchial subgrouping and one to one matching.

Hierarchial subrouping - Filters data record in non-treated group which satifies the threshold condition for each data record in treated group.

Locate a data record in the treated group, Q_aft, and label it by the index j.
Select one of the covariates, for instance, wind speed, V , and designate it as the variable of which the similarity between two data records is computed.
Go through the data records in the non-treated group, Q_bef, by selecting the subset of data records such that the difference, in terms of the designated covariate, between the data record j in Q_aft and any one of the records in Q_bef is smaller than a pre-specified threshold. When V is in fact the one designated in Step 2, the resulting subset is then labeled by placing V as the subscript to Q, namely Q_V.
Next, designate another covariate and use it to prune Q_V in the same way as one prunes Q_bef into Q_V in Step 3. Doing so produces a smaller subset nested within Q_V . Then continue with another covariate until all covariates are used.

One to One matching - returns the most similar data record after hierarchial sub-grouping

Select the most similar record from Q_bef for each Q_aft using mahalanobis distance as the similarity measure.
Remove the record from Q_aft, if a record in Q_bef satisfies the above mentioned criteria.
Repeat the process for each record in Q_aft

Getting Started with WindPlus

Load the library

library(WindPlus)

Functions available

Matching two data sets

Function : covmatch.binary(dname, cov, weight, cov_circ)

dname - List of data sets to match
cov - vector defining position or column number of non circular covariates such as wind speed, temperature etc.
weight - vector defining the threshold value for each covariate to skim out most similar observation
cov_circ - vector defining position or column number of circular covariates such as wind direction, nacelle position etc.
Matching multiple data sets

Function : covmatch.mult(dname, cov, weight, cov_circ)

dname - List of data sets to match
cov - vector defining position or column number of non circular covariates such as wind speed, temperature etc.
weight - vector defining the threshold value for each covariate to skim out most similar observation
cov_circ - vector defining position or column number of circular covariates such as wind direction, nacelle position etc.

Functions distinction

Even though, the arguments seem pretty similar for both the functions, there is a subtle difference in execution. For matching multiple data sets, data set is matched once, whereas for macthing two data sets, the matching is done twice keeping each of the data set as a reference. Since the method is entirely subjective, an exact explanation of such is not possible.

Data set available

Matching two data sets

The two data sets attached to make use of function covmatch.binary for users are data1 and data2. The data can be made available after installing the package and loading the library. The data sets correspond to wind energy data set, and both of these are two different turbines data aggregated by 10 minutes over a year.

head(data1)

head(data2)

Matching multiple data sets

The three data sets attached to make use of function covmatch.mult for users are Season1, Season2 and Season3. The data sets correspond to wind energy data set of a single turbine, splitted in 3 seasons over a year.

head(Season1)

head(Season2)

head(Season3)

Usage

Matching two data sets

# Prepare data set for matching
dname = rep(list(), 2)
dname[[1]] = data1
dname[[2]] = data2


# Non circular covariates column
cov = c(6, 1, 13)

# Weight 
weight = c(0.1, 0.1, 0.05)

# Matching
matched = covmatch.binary(dname = dname, weight = weight, cov = cov, cov_circ = NULL)

# Commpare result of one covariate with original data
par(mfrow =  c(1, 2))
par(bg = 'grey')

plot(density(dname[[1]][, 13]), col = 'red', main = 'Before Matching', xlab = 'Turbulence Intensity', lwd = 2, xlim = c(0, 0.8), ylim = c(0, 8))
lines(density(dname[[2]][, 13]), col = 'blue', lwd = 2, ylim = c(0, 8), xlim = c(0, 0.8), ylim = c(0, 8))
legend('topright', legend = c('data1', 'data2'), col=c("red", "blue"), lty=1, lwd = 2)

plot(density(matched[[1]][, 13]), col = 'red', main = 'After Matching', xlab = 'Turbulence Intensity', lwd = 2, xlim = c(0, 0.8), ylim = c(0, 8))
lines(density(matched[[2]][,13]), col = 'blue', lwd = 2, xlim = c(0, 0.8), ylim = c(0, 8))
legend('topright', legend = c('data1', 'data2'), col=c("red", "blue"), lty=1, lwd = 2)

Matching multiple data sets

# Prepare data set for matching
dname = rep(list(), 3)
dname[[1]] = Season1
dname[[2]] = Season2
dname[[3]] = Season3

# Non circular covariates column
cov = c(1, 6, 14)

# Weight 
weight = c(0.2, 0.2, 0.2)

# Matching
matched = covmatch.mult(dname = dname, weight = weight, cov = cov, cov_circ = NULL)

# Commpare result of one covariate with original data
par(mfrow =  c(1, 2))
par(bg = 'grey')

plot(density(dname[[1]][, 1]), col = 'red', main = 'Before Matching', xlab = 'Wind Speed (m/s)', lwd = 2, xlim = c(0, 20), ylim = c(0, 0.18))
lines(density(dname[[2]][, 1]), col = 'blue', lwd = 2, xlim = c(0, 20), ylim = c(0, 0.18))
lines(density(dname[[3]][, 1]), col = 'green', lwd = 2, xlim = c(0, 20), ylim = c(0, 0.18))
legend('topright',legend = c('Season1', 'Season2', 'Season3'), col=c("red", "blue", "green"), lty=1, lwd = 2)

plot(density(matched[[1]][, 1]), col = 'red', main = 'After Matching', xlab = 'Wind Speed (m/s)', lwd = 2, xlim = c(0, 20), ylim = c(0, 0.18))
lines(density(matched[[2]][, 1]), col = 'blue', lwd = 2, xlim = c(0, 20), ylim = c(0, 0.18))
lines(density(matched[[3]][, 1]), col = 'green', lwd = 2, xlim = c(0, 20), ylim = c(0, 0.18))
legend('topright',legend = c('Season1', 'Season2', 'Season3'), col=c("red", "blue", "green"), lty=1, lwd = 2)

Result

The task at hand is to find the observations from two or more different set, which exhibit similar characteristics. The functions return a list containing the data sets only with matched observations. It should be kept in mind that, only the supplied covariates are matched. Even though data set containing all the columns are returned back, the columns of interest should be looked into.

nitesh-1507/WindPlus documentation built on Dec. 8, 2019, 1:57 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

nitesh-1507/WindPlus
Covariates matching and performance quantification

In nitesh-1507/WindPlus: Covariates matching and performance quantification

Introduction

Theory

1. Definition

2. Algorithm

Hierarchial subrouping - Filters data record in non-treated group which satifies the threshold condition for each data record in treated group.

One to One matching - returns the most similar data record after hierarchial sub-grouping

Getting Started with WindPlus

Functions available

Functions distinction

Data set available

Usage

Result

R Package Documentation

Browse R Packages

We want your feedback!

nitesh-1507/WindPlus Covariates matching and performance quantification

In nitesh-1507/WindPlus: Covariates matching and performance quantification

Introduction

Theory

1. Definition

2. Algorithm

Hierarchial subrouping - Filters data record in non-treated group which satifies the threshold condition for each data record in treated group.

One to One matching - returns the most similar data record after hierarchial sub-grouping

Getting Started with WindPlus

Functions available

Functions distinction

Data set available

Usage

Result

R Package Documentation

Browse R Packages

We want your feedback!

nitesh-1507/WindPlus
Covariates matching and performance quantification