OrderByR2: Create numerical variable ranking using R2 between date to...

View source: R/plots_order.R


Calculates the R2 of a linear model of the formula var ~ dateNm for each variable var of class nmrcl, and returns a vector of variable names ordered from highest to lowest R2. The linear model can be fitted over a subset of dates; see the details of the buildTm parameter. Non-numerical variables are returned in alphabetical order after the sorted numerical variables.
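The ranking described above can be sketched in base R. This is only an illustration under stated assumptions: the toy data frame and the helper r2_against_date are hypothetical, the Date column is converted to numeric days before fitting, and none of this is the package's internal code (which relies on CalcR2 and on data prepared by PrepData).

```r
# Illustrative only: toy data and a hypothetical helper, not otvPlots internals.
set.seed(1)
n <- 200
toy <- data.frame(
  date   = seq(as.Date("2014-01-01"), by = "day", length.out = n),
  trend  = seq_len(n) / 10 + rnorm(n, sd = 0.5),  # strong drift over time
  noise  = rnorm(n),                              # no relationship with date
  region = sample(c("east", "west"), n, replace = TRUE),
  branch = sample(letters[1:3], n, replace = TRUE)
)

# R2 of lm(var ~ date); the Date column is converted to numeric days.
r2_against_date <- function(x, date) {
  summary(lm(x ~ as.numeric(date)))$r.squared
}

vars   <- setdiff(names(toy), "date")
is_num <- vapply(toy[vars], is.numeric, logical(1))
r2     <- vapply(toy[vars[is_num]], r2_against_date, numeric(1), date = toy$date)

# Numerical variables by decreasing R2, then the rest alphabetically.
ranking <- c(names(sort(r2, decreasing = TRUE)), sort(vars[!is_num]))
ranking
```

In this toy data the drifting variable ranks ahead of the pure-noise variable, and the two non-numerical columns follow in alphabetical order.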


OrderByR2(dataFl, dateNm, buildTm = NULL, weightNm = NULL,
  kSample = 50000)



dataFl: A data.table; must be the output of the PrepData function.


dateNm: Name of the column containing the date variable.


buildTm: Vector identifying the time period used for ranking/anomaly detection (most likely the model build period). Allows a subset of the plotting time period to be used for anomaly detection.

  • Must be a vector of dates. The endpoints are inclusive, i.e., buildTm[1] <= date <= buildTm[2] defines the time period.

  • Must be either NULL, a vector of length 2, or a vector of length 3.

  • If NULL, the entire dataset will be used for ranking/anomaly detection.

  • If a vector of length 2, it must be a character vector of dates in the default R date format (e.g., "2017-01-30").

  • If a vector of length 3, the first two elements must contain dates in any strptime format, and the third element must contain that strptime format (see strptime).

  • The following are equivalent ways of selecting all of 2014:

    • c("2014-01-01","2014-12-31")

    • c("01JAN2014","31DEC2014", "%d%h%Y")
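The equivalence of the two forms, and the inclusive endpoints, can be checked with base R date parsing. The helper parse_build_tm below is hypothetical and only mirrors the rules listed above; it is not the parser used inside the package. Month abbreviations are locale-dependent, so the sketch pins the time locale to "C".

```r
# Month abbreviations are locale-dependent; fix the locale for reproducibility.
Sys.setlocale("LC_TIME", "C")

# Hypothetical helper mirroring the documented buildTm rules.
parse_build_tm <- function(buildTm) {
  if (is.null(buildTm)) return(NULL)          # NULL: use the entire dataset
  stopifnot(length(buildTm) %in% c(2, 3))
  # Length 3: the third element is the strptime format of the first two.
  fmt <- if (length(buildTm) == 3) buildTm[3] else "%Y-%m-%d"
  as.Date(buildTm[1:2], format = fmt)
}

a <- parse_build_tm(c("2014-01-01", "2014-12-31"))
b <- parse_build_tm(c("01JAN2014", "31DEC2014", "%d%h%Y"))
identical(a, b)  # both forms should select the same period

# Inclusive filter: buildTm[1] <= date <= buildTm[2]
dates  <- as.Date(c("2013-12-31", "2014-01-01", "2014-12-31", "2015-01-01"))
in_win <- dates[dates >= a[1] & dates <= a[2]]
in_win
```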


weightNm: Name of the variable containing row weights, or NULL for no weights (all rows receive weight 1).
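As a rough illustration of what row weights do to the ranking statistic, the sketch below fits the same toy regression with and without weights using the weights argument of lm. Whether CalcR2 passes weights to lm in exactly this way is an assumption; the data and column names are made up.

```r
# Illustrative only: toy data; column names are hypothetical.
set.seed(2)
toy <- data.frame(day = 1:100, w = runif(100, 0.5, 2))
toy$x <- 0.05 * toy$day + rnorm(100)

# R2 without weights, with row weights, and with all weights equal to 1.
r2_unweighted <- summary(lm(x ~ day, data = toy))$r.squared
r2_weighted   <- summary(lm(x ~ day, data = toy, weights = w))$r.squared
r2_unit       <- summary(lm(x ~ day, data = toy, weights = rep(1, 100)))$r.squared

# Unit weights reproduce the unweighted fit, matching the NULL case above.
all.equal(r2_unweighted, r2_unit)
```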


kSample: Either NULL or a positive integer. If an integer, it sets the sample size used both for drawing boxplots and for ordering numerical graphs by R2. When the data is large, setting kSample to a reasonable value (the default is 50,000) dramatically improves processing speed. For larger datasets (e.g., more than 10 percent of system memory), this parameter should therefore not be set to NULL, or boxplots may take a very long time to render. This setting has no impact on the accuracy of the time series plots of quantiles, mean, SD, and missing and zero rates.
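The effect of sampling on the R2 estimate can be illustrated with a simple random subsample. The helper sample_rows is hypothetical, and simple random sampling without replacement is an assumption about how kSample is applied; the sketch only shows that an R2 computed on 50,000 rows stays close to the full-data value for well-behaved data.

```r
# Illustrative only: toy data and a hypothetical sampling helper.
set.seed(3)
kSample <- 50000
n <- 200000
big <- data.frame(day = sample(1:365, n, replace = TRUE))
big$x <- 0.01 * big$day + rnorm(n)

# Return the data unchanged when kSample is NULL or the data is already small.
sample_rows <- function(df, kSample) {
  if (is.null(kSample) || nrow(df) <= kSample) return(df)
  df[sample.int(nrow(df), kSample), , drop = FALSE]
}

sub     <- sample_rows(big, kSample)
r2_full <- summary(lm(x ~ day, data = big))$r.squared
r2_sub  <- summary(lm(x ~ day, data = sub))$r.squared
c(full = r2_full, sampled = r2_sub)
```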


A vector of variable names sorted by the R2 of a linear model (lm) of the formula var ~ dateNm, from highest R2 to lowest.


Functions that depend on this function: vlm.

This function depends on: CalcR2, PrepData.


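library(otvPlots)
data(bankData)  # example dataset, assumed to ship with the package
bankData <- PrepData(bankData, dateNm = "date", dateGp = "months",
                     dateGpBp = "quarters")
OrderByR2(bankData, dateNm = "date")  # variable names ranked by R2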

