OrderByR2: Create numerical variable ranking using R2 between date to...

View source: R/plots_order.R

OrderByR2    R Documentation

Create numerical variable ranking using R2 between date and variable

Description

Calculates the R2 of a linear model of the formula var ~ dateNm for each var of class nmrcl and returns a vector of variable names ordered from highest to lowest R2. The linear model can be fitted over a subset of dates; see the details of parameter buildTm. Non-numerical variables are returned in alphabetical order after the sorted numerical variables.
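The ranking described above can be sketched in base R. This is an illustration only, not the package's implementation: the toy data frame, its column names, and the helper rank_by_r2 are all hypothetical, and the sketch assumes the date is converted to a number before fitting.

```r
# Toy data: column names and values are made up for illustration.
set.seed(1)
toy <- data.frame(
  date  = seq(as.Date("2014-01-01"), by = "month", length.out = 24),
  trend = 1:24 + rnorm(24, sd = 0.5),  # strongly related to date
  noise = rnorm(24),                   # unrelated to date
  group = rep(c("a", "b"), 12)         # non-numerical
)

# Hypothetical helper: R2 of var ~ date for each numeric column, sorted
# from highest to lowest, followed by the other columns alphabetically.
rank_by_r2 <- function(df, date_nm) {
  vars   <- setdiff(names(df), date_nm)
  is_num <- vapply(df[vars], is.numeric, logical(1))
  x      <- as.numeric(df[[date_nm]])  # assumption: date treated as numeric
  r2 <- vapply(vars[is_num], function(v) {
    summary(lm(df[[v]] ~ x))$r.squared
  }, numeric(1))
  c(names(sort(r2, decreasing = TRUE)), sort(vars[!is_num]))
}

ranked <- rank_by_r2(toy, "date")
print(ranked)
```

In this toy data the trend column should rank ahead of noise, with the non-numerical group column last.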

Usage

OrderByR2(dataFl, dateNm, buildTm = NULL, weightNm = NULL,
  kSample = 50000)

Arguments

dataFl

A data.table of data; must be the output of the PrepData function.

dateNm

Name of column containing the date variable.

buildTm

Vector identifying the time period for ranking/anomaly detection (most likely the model build period). Allows a subset of the plotting time period to be used for anomaly detection.

  • Must be a vector of dates. The bounds are inclusive, i.e. buildTm[1] <= date <= buildTm[2] defines the time period.

  • Must be either NULL, a vector of length 2, or a vector of length 3.

  • If NULL, the entire dataset will be used for ranking/anomaly detection.

  • If a vector of length 2, the format of the dates must be a character vector in default R date format (e.g. "2017-01-30").

  • If a vector of length 3, the first two elements must contain dates in any strptime format, while the third element contains the strptime format (see strptime).

  • The following are equivalent ways of selecting all of 2014:

    • c("2014-01-01","2014-12-31")

    • c("01JAN2014","31DEC2014", "%d%h%Y")
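A small base-R sketch of how the two buildTm forms above could be turned into dates and applied as an inclusive filter. The helper parse_build_tm and the sample dates are hypothetical, and the package may parse buildTm differently. Parsing month abbreviations such as "JAN" depends on the current locale.

```r
# Hypothetical helper: turn a buildTm-style vector into two Date values.
parse_build_tm <- function(build_tm) {
  if (length(build_tm) == 3) {
    # Third element is the strptime format of the first two.
    as.Date(build_tm[1:2], format = build_tm[3])
  } else {
    as.Date(build_tm)  # default R date format, e.g. "2017-01-30"
  }
}

a <- parse_build_tm(c("2014-01-01", "2014-12-31"))
b <- parse_build_tm(c("01JAN2014", "31DEC2014", "%d%h%Y"))

# Inclusive filter: buildTm[1] <= date <= buildTm[2]
dates  <- as.Date(c("2013-12-31", "2014-01-01", "2014-12-31", "2015-01-01"))
in_win <- dates >= a[1] & dates <= a[2]
print(data.frame(dates, in_win))
```

Both boundary dates fall inside the window because the comparison is inclusive on each side.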

weightNm

Name of the variable containing row weights, or NULL for no weights (all rows receiving weight 1).
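As a rough illustration of the "no weights means weight 1" convention, the base-R sketch below fits the same linear model with and without unit weights and compares the R2 values. The data are simulated, and this uses lm directly; how CalcR2 applies weights internally is not shown in this page, so treat it as an assumption.

```r
# Simulated data for illustration only.
set.seed(2)
x <- 1:30
y <- 2 * x + rnorm(30)
w <- rep(1, 30)  # every row receives weight 1

r2_unweighted <- summary(lm(y ~ x))$r.squared
r2_weighted   <- summary(lm(y ~ x, weights = w))$r.squared

# With unit weights the two fits agree up to floating-point error.
print(all.equal(r2_unweighted, r2_weighted))
```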

kSample

Either NULL or a positive integer. If an integer, indicates the sample size used both for drawing boxplots and for ordering numerical graphs by R2. When the data is large, setting kSample to a reasonable value (the default is 50,000) dramatically improves processing speed. For larger datasets (e.g. more than 10 percent of system memory), this parameter should therefore not be set to NULL, or boxplots may take a very long time to render. This setting has no impact on the accuracy of time series plots of quantiles, mean, SD, and missing and zero rates.
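The idea of capping the number of rows before fitting can be sketched in base R as follows. The data frame and variable names are hypothetical, and the sketch assumes simple random sampling without replacement; the page does not state how the package draws its sample.

```r
# Simulated large table for illustration only.
set.seed(3)
big <- data.frame(x = seq_len(200000), y = rnorm(200000))
k_sample <- 50000  # mirrors the documented default of kSample

# Sample only when a size is given and the data has more rows than that.
if (!is.null(k_sample) && nrow(big) > k_sample) {
  idx     <- sample.int(nrow(big), k_sample)  # without replacement
  sampled <- big[idx, , drop = FALSE]
} else {
  sampled <- big  # NULL or small data: use every row
}

print(nrow(sampled))
```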

Value

A vector of variable names sorted by the R2 of lm fitted with the formula var ~ dateNm (highest R2 to lowest).

License

Copyright 2017 Capital One Services, LLC Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

See Also

Functions that depend on this function: vlm.

This function depends on: CalcR2, PrepData.

Examples

data(bankData)
bankData <- PrepData(bankData, dateNm = "date", dateGp = "months", 
                     dateGpBp = "quarters")
OrderByR2(bankData, dateNm = "date")

capitalone/otvPlots documentation built on March 15, 2024, 8:25 a.m.