
View source: R/vlm.R

vlm    R Documentation

Create over time variable plots and summary statistics for variable level monitoring

Description

Sorts variables according to either user input or correlation with time (among numerical variables only), and creates output files including:

  • A PDF file of plots saved as outFl.pdf, with one variable per page. Variables are plotted in the order indicated by the argument sortVars or sortFn. For each numerical variable, the output plots include

    • side-by-side boxplots grouped by dateGpBp (left),

    • a trace plot of p1, p50, and p99 percentiles, grouped by dateGp (top right),

    • a trace plot of mean and +-1 SD control limits, grouped by dateGp (middle right), and

    • a trace plot of missing and zero rates, grouped by dateGp (bottom right).

    For each categorical variable (including a numerical variable with no more than 2 unique levels not including NA), the output plots include

    • a frequency bar plot (left), and

    • a grid of trace plots of categories' proportions over time (right). If the variable contains more than kCategories categories, trace plots of only the kCategories largest categories are plotted. If the variable contains only two categories, only the trace plot of the less prevalent category is plotted.

  • CSV file(s) of summary statistics of the variables, both globally and over time aggregated by dateGp. The order of variables in the CSV files is the same as in the PDF file.

    • For numerical variables, the number of observations (counts), the p1, p25, p50, p75, and p99 quantiles, mean, SD, and missing and zero rates are saved as outFl_numerical_summary.csv.

    • For categorical variables, the number of observations (counts) and categories' proportions are saved as outFl_categorical_summary.csv. Each row is a category of a categorical (or binary) variable. The row whose category == 'NA' corresponds to missing. Categories within the same variable are ordered by global prevalence in descending order.
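The numerical summary described above can be approximated by hand on a toy table. This is a minimal sketch, assuming data.table is available; the column names (balance, monthGp), the toy data, and the choice of denominator for the zero rate are illustrative assumptions and not the package's own implementation or CSV layout.

```r
library(data.table)

set.seed(1)
# Toy data: one numerical variable observed daily from 2014-01-01 to 2014-02-28,
# with one missing value and one zero in February (hypothetical values)
dt <- data.table(
  date    = as.IDate("2014-01-01") + 0:58,
  balance = c(rnorm(57, mean = 100, sd = 10), NA, 0)
)

# Group by calendar month using a "YYYY-MM" key
dt[, monthGp := format(date, "%Y-%m")]

numSummary <- dt[, .(
  count        = .N,
  p1           = quantile(balance, 0.01, na.rm = TRUE),
  p50          = quantile(balance, 0.50, na.rm = TRUE),
  p99          = quantile(balance, 0.99, na.rm = TRUE),
  mean         = mean(balance, na.rm = TRUE),
  sd           = sd(balance, na.rm = TRUE),
  missing_rate = mean(is.na(balance)),
  # Assumption: zero rate is computed among non-missing values
  zero_rate    = mean(balance == 0, na.rm = TRUE)
), by = monthGp]

print(numSummary)
```

Each row of numSummary corresponds to one point on the trace plots for that variable.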

Usage

vlm(dataFl, dateNm, labelFl = NULL, outFl = "otvplots", genCSV = TRUE,
  dataNeedPrep = FALSE, dateGp = NULL, dateGpBp = NULL, weightNm = NULL,
  varNms = NULL, sortVars = NULL, sortFn = NULL, selectCols = NULL,
  dropCols = NULL, dateFt = "%d%h%Y", buildTm = NULL,
  highlightNms = NULL, skewOpt = NULL, kSample = 50000,
  fuzzyLabelFn = NULL, dropConstants = FALSE, kCategories = 9, ...)

Arguments

dataFl

Either the name of an object that can be converted using as.data.table (e.g., a data frame), or a character string containing the name of a dataset that can be loaded using fread (e.g., a csv file). If the dataset is not in your working directory, then dataFl must include the (relative or absolute) path to the file.

dateNm

Name of column containing the date variable.

labelFl

Either the path of a dataset (a csv file) containing labels, an R object convertible to data.table (e.g., data frame) or NULL. If NULL, no labels will be used. The label dataset must contain at least 2 columns: varCol (variable names) and labelCol (variable labels).

outFl

Name of the output file, without a file extension (e.g., "bank"). A pdf file of plots ("bank.pdf") and two csv files of summary statistics ("bank_categorical_summary.csv" and "bank_numerical_summary.csv") will be saved to your working directory, unless a path is included in outFl (e.g., "../plots/bank").

genCSV

Logical, whether to generate the two csv files of summary statistics for numerical and categorical variables.

dataNeedPrep

Logical, indicates if data should be run through the PrepData function. This should be set to TRUE unless the PrepData function has been applied to the input data dataFl.

dateGp

Name of the variable that the time series plots should be grouped by. Options are NULL, "weeks", "months", "quarters", "years". See IDate for details. If NULL, then dateNm will be used as dateGp.
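As an illustration of what these grouping options mean, the sketch below rounds dates with the round method for IDate from data.table. That the grouping is produced exactly this way is an assumption based on the pointer to IDate above; the example dates are made up.

```r
library(data.table)

d <- as.IDate(c("2014-05-17", "2014-11-03"))

# Each option maps a date to the first day of its period
byMonth   <- round(d, "months")
byQuarter <- round(d, "quarters")
byYear    <- round(d, "years")

print(data.table(date = d, byMonth, byQuarter, byYear))
```

Coarser groups give smoother trace plots with fewer points; finer groups reveal short-lived shifts.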

dateGpBp

Name of variable the boxplots should be grouped by. Same options as dateGp. If NULL, then dateGp will be used.

weightNm

Name of the variable containing row weights, or NULL for no weights (all rows receiving weight 1).
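To see what row weights change, here is a small base R sketch of a weighted mean and a weighted missing rate. The vectors and the specific formulas are assumptions for illustration; how the package applies weights internally is not described here.

```r
x <- c(10, 20, NA, 40)   # hypothetical variable values
w <- c(1, 1, 2, 0.5)     # hypothetical row weights

ok <- !is.na(x)

# Weighted mean over the non-missing rows
wMean <- weighted.mean(x[ok], w[ok])

# Share of total weight carried by missing rows
wMissingRate <- sum(w[!ok]) / sum(w)

print(c(wMean = wMean, wMissingRate = wMissingRate))
```

With all weights equal to 1, both quantities reduce to the ordinary mean and the plain proportion of missing rows.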

varNms

Either NULL or a vector of names or indices of variables to be plotted. If NULL, defaults to all columns which are not dateNm or weightNm. If indices are given, they refer to column positions after dropCols or selectCols have been applied (if applicable), not counting dateGp and dateGpBp (which are added to dataFl by the function PrepData).

sortVars

Determines which variables are plotted and their order. Either a character vector of variable names (variables are plotted in the same order as in the sortVars argument), or NULL to keep the original ordering, with numerical variables being plotted before categorical and binary ones. sortVars should be NULL when the sortFn argument is used.

sortFn

A sorting function which returns sortVars as an output. The function may take the following variables as input: dataFl, dateNm, buildTm, weightNm, kSample. Currently, the only built-in sorting function is OrderByR2, which sorts numerical variables in the order of strength of linear association with date, and adds categorical (and binary) variables sorted in alphabetical order after the numerical ones.
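The idea of ordering by strength of linear association with date can be sketched as follows. The function orderByAbsCor is hypothetical and is not the package's OrderByR2; it ignores buildTm, weightNm, and kSample, and relies on the fact that the R^2 of a simple linear regression equals the squared Pearson correlation.

```r
library(data.table)

# Hypothetical sorting function: returns variable names, numerical ones first
# (strongest linear association with date first), then the rest alphabetically
orderByAbsCor <- function(dataFl, dateNm, ...) {
  numNms <- setdiff(names(dataFl)[sapply(dataFl, is.numeric)], dateNm)
  catNms <- setdiff(names(dataFl), c(numNms, dateNm))
  tm <- as.numeric(dataFl[[dateNm]])
  r2 <- sapply(numNms, function(nm) {
    cor(tm, dataFl[[nm]], use = "complete.obs")^2
  })
  c(numNms[order(r2, decreasing = TRUE)], sort(catNms))
}

set.seed(42)
toy <- data.table(
  date  = as.IDate("2014-01-01") + 0:99,
  noise = rnorm(100),                       # no time trend
  trend = 0:99 + rnorm(100, sd = 0.1),      # strong time trend
  job   = sample(c("admin", "tech"), 100, replace = TRUE)
)

sortVars <- orderByAbsCor(toy, dateNm = "date")
print(sortVars)
```

Variables that drift over time surface at the front of the ordering, which is usually where monitoring attention is needed first.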

selectCols

Either NULL, or a vector of names or indices of variables to read into memory – must include dateNm, weightNm (if not NULL) and all variables to be plotted. If both selectCols and dropCols are NULL, then all variables will be read in.

dropCols

Either NULL, or a vector of variable names or indices of variables not to read into memory. If both selectCols and dropCols are NULL, then all variables will be read in.
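Column selection at read time can be illustrated directly with the select and drop arguments of fread. That selectCols and dropCols map onto these two arguments is an assumption; the temporary file and its columns are made up for the example.

```r
library(data.table)

# Write a small hypothetical csv file to a temporary location
tmp <- tempfile(fileext = ".csv")
fwrite(data.table(date    = c("2014-01-01", "2014-01-02"),
                  age     = c(31, 45),
                  balance = c(100.5, 0),
                  id      = 1:2), tmp)

# Read only the listed columns
sel <- fread(tmp, select = c("date", "age"))

# Read everything except the listed columns
drp <- fread(tmp, drop = "id")

print(names(sel))
print(names(drp))
unlink(tmp)
```

Restricting columns at read time keeps memory use down on wide datasets.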

dateFt

strptime format of the date variable. The default is the SAS format "%d%h%Y". Input data in the R date format "%Y-%m-%d" is also detected, so both formats are parsed automatically.
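A minimal sketch of parsing with a primary format and falling back to the R date format is shown below. The helper parseDates is hypothetical and only illustrates the two formats; it is not the package's detection logic. Month abbreviations such as "JAN" are matched according to the current locale, so an English locale is assumed.

```r
# Hypothetical helper: try dateFt first, then fall back to "%Y-%m-%d"
parseDates <- function(x, dateFt = "%d%h%Y") {
  out <- as.Date(x, format = dateFt)
  miss <- is.na(out)                       # entries the first format could not parse
  out[miss] <- as.Date(x[miss], format = "%Y-%m-%d")
  out
}

parsed <- parseDates(c("30JAN2017", "2017-01-30"))
print(parsed)
```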

buildTm

Vector identifying the time period used for ranking/anomaly detection (most likely the model build period). Allows a subset of the plotting time period to be used for anomaly detection.

  • Must be a vector of dates and must be inclusive i.e. buildTm[1] <= date <= buildTm[2] will define the time period.

  • Must be either NULL, a vector of length 2, or a vector of length 3.

  • If NULL, the entire dataset will be used for ranking/anomaly detection.

  • If a vector of length 2, the format of the dates must be a character vector in default R date format (e.g. "2017-01-30").

  • If a vector of length 3, the first two elements must contain dates in any strptime format, while the third element contains the strptime format (see strptime).

  • The following are equivalent ways of selecting all of 2014:

    • c("2014-01-01","2014-12-31")

    • c("01JAN2014","31DEC2014", "%d%h%Y")
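The equivalence of the two forms, and the inclusive bounds, can be checked with a short sketch. The helper parseBuildTm is hypothetical and written only for this illustration; an English locale is assumed for the month abbreviations.

```r
# Hypothetical helper: turn a length-2 or length-3 buildTm into two Dates
parseBuildTm <- function(buildTm) {
  if (is.null(buildTm)) return(NULL)
  stopifnot(length(buildTm) %in% c(2, 3))
  fmt <- if (length(buildTm) == 3) buildTm[3] else "%Y-%m-%d"
  as.Date(buildTm[1:2], format = fmt)
}

a <- parseBuildTm(c("2014-01-01", "2014-12-31"))
b <- parseBuildTm(c("01JAN2014", "31DEC2014", "%d%h%Y"))

# Inclusive window: buildTm[1] <= date <= buildTm[2]
dates   <- as.Date(c("2013-12-31", "2014-01-01", "2014-12-31", "2015-01-01"))
inBuild <- dates >= a[1] & dates <= a[2]

print(identical(a, b))
print(data.frame(dates, inBuild))
```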

highlightNms

Either NULL or a character vector of variables to receive a red label. Currently, NULL means all variables will get a black legend. This argument is ignored if labelFl == NULL.

skewOpt

Either a numeric constant or NULL. Default is NULL (no transformation). If numeric, say 5, then all box plots of a variable whose skewness exceeds 5 will be on a log10 scale if possible. Negative input of skewOpt will be converted to 3.
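The decision rule can be sketched as below. The moment-based skewness estimator, the reading of "if possible" as "all values strictly positive", and both helper names are assumptions made for illustration, not the package's implementation.

```r
# Moment-based sample skewness (one common estimator; an assumption here)
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Hypothetical rule: use a log10 scale when skewness exceeds skewOpt
# and every value is strictly positive
useLog10 <- function(x, skewOpt) {
  if (is.null(skewOpt)) return(FALSE)      # NULL means no transformation
  if (skewOpt < 0) skewOpt <- 3            # negative input is converted to 3
  all(x > 0, na.rm = TRUE) && skewness(x) > skewOpt
}

set.seed(7)
symmetric <- rnorm(1000, mean = 50, sd = 5)
heavyTail <- c(rep(1, 999), 1e6)

print(useLog10(symmetric, 5))
print(useLog10(heavyTail, 5))
print(useLog10(heavyTail, NULL))
```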

kSample

Either NULL or a positive integer. If an integer, indicates the sample size for both drawing boxplots and ordering numerical graphs by R^2. When the data is large, setting kSample to a reasonable value (default is 50K) dramatically improves processing speed. Therefore, for larger datasets (e.g. > 10 percent system memory), this parameter should not be set to NULL, or boxplots may take a very long time to render. This setting has no impact on the accuracy of time series plots on quantiles, mean, SD, and missing and zero rates.
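Row subsampling of this kind can be sketched with data.table as follows. The table big and the simple uniform sampling are assumptions for illustration; the package may sample differently (for example, taking weights into account).

```r
library(data.table)

set.seed(123)
big <- data.table(id = 1:200000, x = rexp(200000))   # hypothetical large table
kSample <- 50000

# Sample kSample rows without replacement, or keep everything if the
# table is already small enough or kSample is NULL
smp <- if (is.null(kSample) || nrow(big) <= kSample) {
  big
} else {
  big[sample(.N, kSample)]
}

print(nrow(smp))
print(rbind(full   = quantile(big$x, c(0.25, 0.5, 0.75)),
            sample = quantile(smp$x, c(0.25, 0.5, 0.75))))
```

The quartiles of the sample stay close to those of the full data, which is why boxplots drawn from a sample remain informative.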

fuzzyLabelFn

Either NULL or a function of 2 parameters: a label file in the format of an output by PrepLabels and a string giving a variable name. The function should return the label corresponding to the variable given by the second parameter. This function should describe how fuzzy matching should be performed to find labels (see example below). If NULL, only exact matches will be returned.
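One possible shape for such a function is sketched here. It assumes the prepared label table has the columns varCol and labelCol described under labelFl; the suffix-stripping rule, the empty-string return for unmatched names, and the toy labels are all hypothetical choices.

```r
library(data.table)

# Stand-in for a prepared label table (hypothetical contents)
labels <- data.table(varCol   = c("age", "balance"),
                     labelCol = c("Age of customer", "Account balance"))

# Hypothetical fuzzy matcher: exact match first, then retry after
# stripping one trailing "_suffix" from the variable name
fuzzyLabel <- function(labelFl, varNm) {
  exact <- labelFl[varCol == varNm, labelCol]
  if (length(exact) > 0) return(exact[1])
  stem <- sub("_[A-Za-z0-9]+$", "", varNm)
  hit <- labelFl[varCol == stem, labelCol]
  if (length(hit) > 0) hit[1] else ""
}

print(fuzzyLabel(labels, "age"))
print(fuzzyLabel(labels, "balance_log"))
print(fuzzyLabel(labels, "unknown"))
```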

dropConstants

Logical, indicates whether or not constant (all duplicated or NA) variables should be dropped from dataFl prior to plotting.
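Detecting constant columns in the sense described (a single repeated value, or only NA) can be sketched as follows. The toy table and the exact test used are assumptions; this is not the package's own code.

```r
library(data.table)

dt <- data.table(
  a = c(1, 1, NA),        # one distinct non-missing value: constant
  b = c("x", "y", "x"),   # two distinct values: kept
  c = c(NA, NA, NA),      # all missing: constant
  d = 1:3                 # three distinct values: kept
)

# A column is treated as constant if it has at most one distinct non-NA value
isConstant <- sapply(dt, function(x) length(unique(x[!is.na(x)])) <= 1)

kept <- dt[, names(dt)[!isConstant], with = FALSE]
print(names(kept))
```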

kCategories

If a categorical variable has more than kCategories categories, trace plots of only the kCategories most prevalent categories are plotted.
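Selecting the most prevalent categories can be sketched in base R as below. The toy vector is made up, and excluding NA from the ranking is an assumption about how missing values are treated.

```r
job <- c(rep("admin", 5), rep("tech", 3), rep("retired", 2), "student", NA)
kCategories <- 2

# Rank categories by frequency, ignoring missing values
freq <- sort(table(job, useNA = "no"), decreasing = TRUE)
topK <- names(freq)[seq_len(min(kCategories, length(freq)))]

print(freq)
print(topK)
```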

...

Additional parameters to be passed to fread.

Details

If the argument dataNeedPrep is set to FALSE, then

  • dataFl must be a data.table containing variables weightNm, dateNm, dateGp, and dateGpBp, and names of these variables must be the same as the corresponding arguments of the vlm function.

  • the arguments selectCols, dropCols, dateFt, dropConstants will be ignored by the vlm function.

  • When analyzing a dataset for the first time, it is recommended to first run the PrepData function on it, and then apply the vlm function with the argument dataNeedPrep = FALSE. Please see the examples for details.

License

Copyright 2017 Capital One Services, LLC Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

See Also

This function depends on: PrintPlots, OrderByR2, PrepData, PrepLabels.

Examples

## Load the data and its label
data(bankData)
data(bankLabels)

## The PrepData function should only need to be run once on a dataset, 
## after that vlm can be run with the argument dataNeedPrep = FALSE
bankData <- PrepData(bankData, dateNm = "date", dateGp = "months", 
                    dateGpBp = "quarters")
bankLabels <- PrepLabels(bankLabels)

## Not run:  
vlm(dataFl = bankData, dateNm = "date", labelFl = bankLabels, 
    sortFn = "OrderByR2", dateGp = "months", dateGpBp = "quarters", 
    outFl = "bank")
    
## If csv files of summary statistics are not needed, set genCSV = FALSE
vlm(dataFl = bankData, dateNm = "date", labelFl = bankLabels, genCSV = FALSE,
    sortFn = "OrderByR2", dateGp = "months", dateGpBp = "quarters", 
    outFl = "bank")
    
## If weights are provided, they will be used in all statistical calculations
bankData[, weight := rnorm(.N, 1, .1)]
vlm(dataFl = bankData, dateNm = "date", labelFl = bankLabels,
    dateGp = "months", dateGpBp = "quarters", weightNm = "weight", 
    outFl = "bank")

## Customize plotting order by passing a vector of variable names to 
## sortVars, but the "date" column must be excluded from sortVars
sortVars <- sort(bankLabels[varCol!="date", varCol])
vlm(dataFl = bankData, dateNm = "date", labelFl = bankLabels, 
    dateGp = "months", dateGpBp = "quarters", outFl = "bank", 
    sortVars = sortVars)
            
## Create plots for a specific variable using the varNms parameter
vlm(dataFl = bankData, dateNm = "date", labelFl = bankLabels, 
    dateGp = "months", dateGpBp = "quarters", outFl = "bank", 
    varNms = "age", sortVars = NULL)

## End(Not run)

capitalone/otvPlots documentation built on March 15, 2024, 8:25 a.m.