arrange.data: Split, Merge, and Filter Given Datasets for the Subsequent...


Description

Generates datasets that consist of the measurements from the REF, CTR-b, and CTR-n turbines only. Filters the datasets by removing data points with a missing measurement and, optionally, those with negative power output. Generates training and test datasets for k-fold CV and splits the entire data into period 1 data and period 2 data.

Usage

arrange.data(df1, df2, df3, p1.beg, p1.end, p2.beg, p2.end,
  time.format = "%Y-%m-%d %H:%M:%S", k.fold = 5, col.time = 1,
  col.turb = 2, bootstrap = NULL, free.sec = NULL,
  neg.power = FALSE)

Arguments

df1

A dataframe for reference turbine data. This dataframe must include five columns: timestamp, turbine id, wind direction, power output, and air density.

df2

A dataframe for baseline control turbine data. This dataframe must include four columns: timestamp, turbine id, wind speed, and power output.

df3

A dataframe for neutral control turbine data. This dataframe must include four columns and have the same structure as df2.

p1.beg

A string specifying the beginning date of period 1. By default, the value needs to be specified in %Y-%m-%d format, for example, '2014-10-24'. A user can use a different format as long as it is consistent with the format defined in time.format below.

p1.end

A string specifying the end date of period 1. For example, if the value is '2015-10-24', data observed until '2015-10-23 23:50:00' would be considered for period 1.
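The end-date rule above can be read as a half-open interval: the beginning date is included and the end date itself is excluded. The following is a minimal sketch of that reading in base R, using made-up 10-minute time stamps; it is an illustration of the rule, not the package's internal code.

```r
# Hypothetical toy time stamps at 10-minute resolution (not from the package).
time.format <- "%Y-%m-%d %H:%M:%S"
stamps <- c("2015-10-23 23:40:00", "2015-10-23 23:50:00", "2015-10-24 00:00:00")
t <- as.POSIXct(stamps, format = time.format, tz = "UTC")

# Period boundaries given as dates, as in the p1.beg and p1.end examples.
p1.beg <- as.POSIXct("2014-10-24", format = "%Y-%m-%d", tz = "UTC")
p1.end <- as.POSIXct("2015-10-24", format = "%Y-%m-%d", tz = "UTC")

# Assumed rule: keep observations with p1.beg <= time < p1.end.
in.per1 <- t >= p1.beg & t < p1.end
in.per1
```

Under this assumption, the observation at '2015-10-23 23:50:00' is kept for period 1 and the one at '2015-10-24 00:00:00' is not.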

p2.beg

A string specifying the beginning date of period 2.

p2.end

A string specifying the end date of period 2. Defined similarly as p1.end.

time.format

A string describing the format of time stamps used in the data to be analyzed. The default value is '%Y-%m-%d %H:%M:%S'.

k.fold

An integer defining the number of data folds for the period 1 analysis and prediction. In the period 1 analysis, k-fold cross validation (CV) will be applied to choose the optimal set of covariates that results in the least prediction error. The value of k.fold corresponds to the k of the k-fold CV. The default value is 5.
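To illustrate what k training and test datasets look like, here is a minimal base R sketch that assigns rows to k folds at random. The random assignment and the names toy and fold are assumptions made for illustration; the package may partition the period 1 data differently.

```r
set.seed(1)  # for a reproducible illustration only

k.fold <- 5
n <- 23
toy <- data.frame(id = seq_len(n))  # hypothetical stand-in for period 1 data

# Assign each row a fold label 1..k.fold, then shuffle the labels.
fold <- sample(rep(seq_len(k.fold), length.out = n))

# Fold i is held out as the test set; the remaining rows form the training set.
test  <- lapply(seq_len(k.fold), function(i) toy[fold == i, , drop = FALSE])
train <- lapply(seq_len(k.fold), function(i) toy[fold != i, , drop = FALSE])

length(train)  # one training set per fold
length(test)   # one test set per fold
```

Each row appears in exactly one test set and in the other k.fold - 1 training sets.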

col.time

An integer specifying the column number of time stamps in wind turbine datasets. The default value is 1.

col.turb

An integer specifying the column number of the turbine IDs in wind turbine datasets. The default value is 2.

bootstrap

An integer indicating the current replication (run) number of bootstrap. If set to NULL, bootstrap is not applied. The default is NULL. Setting this value directly to run bootstrap is not recommended; use bootstrap.gain instead.

free.sec

A list of vectors defining free sectors. Each vector in the list has two scalars: a starting direction and an ending direction, ordered clockwise. For example, c(310, 50) is a valid component of the list. By default, this is set to NULL.
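Because directions are ordered clockwise, a sector such as c(310, 50) crosses north and wraps around 0 degrees. The sketch below shows one way to test whether a wind direction falls in any free sector. The helper in.sector is hypothetical and not part of gainML, and treating both boundaries as inclusive is an assumption.

```r
# Hypothetical helper (not part of gainML): TRUE where `dir` lies in the
# clockwise sector from sector[1] to sector[2], boundaries assumed inclusive.
in.sector <- function(dir, sector) {
  start <- sector[1]
  end <- sector[2]
  if (start <= end) {
    dir >= start & dir <= end
  } else {
    # The sector crosses north, e.g. 310 -> 50.
    dir >= start | dir <= end
  }
}

free.sec <- list(c(310, 50), c(150, 260))
wind.dir <- c(0, 45, 100, 200, 300, 320)  # made-up directions in degrees

# A direction is "free" if it falls in at least one listed sector.
in.free <- Reduce(`|`, lapply(free.sec, function(s) in.sector(wind.dir, s)))
in.free
```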

neg.power

Either TRUE or FALSE. If TRUE, data points with a negative power output are used in the analysis; if FALSE, they are eliminated. The default value is FALSE.
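The filtering described for neg.power, together with the removal of missing measurements, can be sketched in base R as follows. The data frame toy and its values are made up, and keeping zero power output is an assumption based on the word "negative".

```r
# Hypothetical toy data with one missing power value, one missing wind speed,
# and one negative power value.
toy <- data.frame(
  power    = c(120, NA, -5, 0, 300),
  wind.spd = c(6, 7, 3, NA, 9)
)

neg.power <- FALSE  # same meaning as the argument described above

# Drop rows with any missing measurement.
keep <- complete.cases(toy)

# Unless negative power is allowed, also drop rows with power below zero.
if (!neg.power) {
  keep <- keep & toy$power >= 0
}

filtered <- toy[keep, ]
filtered
```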

Value

The function returns a list of several datasets including the following.

train

A list containing k datasets that will be used to train the machine learning model.

test

A list containing k datasets that will be used to test the machine learning model.

per1

A dataframe containing the period 1 data.

per2

A dataframe containing the period 2 data.

Examples

library(gainML)

# Build the three input data frames from the wtg dataset.
df.ref <- with(wtg, data.frame(time = time, turb.id = 1, wind.dir = D,
                               power = y, air.dens = rho))
df.ctrb <- with(wtg, data.frame(time = time, turb.id = 2, wind.spd = V,
                                power = y))
df.ctrn <- df.ctrb
df.ctrn$turb.id <- 3

# For Full Sector Analysis
data <- arrange.data(df.ref, df.ctrb, df.ctrn,
                     p1.beg = '2014-10-24', p1.end = '2014-10-27',
                     p2.beg = '2014-10-27', p2.end = '2014-10-30')

# For Free Sector Analysis
free.sec <- list(c(310, 50), c(150, 260))
data <- arrange.data(df.ref, df.ctrb, df.ctrn,
                     p1.beg = '2014-10-24', p1.end = '2014-10-27',
                     p2.beg = '2014-10-27', p2.end = '2014-10-30',
                     free.sec = free.sec)

length(data$train) # This equals k.fold.
length(data$test)  # This equals k.fold.
head(data$per1)    # This shows the beginning of the period 1 dataset.
head(data$per2)    # This shows the beginning of the period 2 dataset.

gainML documentation built on June 28, 2019, 5:05 p.m.