vlm: Create over time variable plots and summary statistics for...
In capitalone/otvPlots: Over Time Variable Plots

vlm	R Documentation

Create over time variable plots and summary statistics for variable level monitoring

Description

Sorts variables according to either user input or correlation with time (among numerical variables only), and creates output files including:

A PDF file of plots saved as outFl.pdf, with one variable per page. Variables are plotted in the order indicated in the argument sortVars or sortFn. For each numerical variable, the output plots include
- side-by-side boxplots grouped by dateGpBp (left),
- a trace plot of p1, p50, and p99 percentiles, grouped by dateGp (top right),
- a trace plot of mean and ±1 SD control limits, grouped by dateGp (middle right), and
- a trace plot of missing and zero rates, grouped by dateGp (bottom right).
For each categorical variable (including a numerical variable with no more than 2 unique levels not including NA), the output plots include
- a frequency bar plot (left), and
- a grid of trace plots on categories' proportions over time (right). If the variable contains more than kCategories categories, trace plots of only the largest kCategories will be plotted. If the variable contains only two categories, then only the trace plot of the less prevalent category will be plotted.
CSV file(s) on summary statistics of variables, both globally and over time aggregated by dateGp. The order of variables in the CSV files is the same as in the PDF file.
- For numerical variables, number of observations (counts), p1, p25, p50, p75, and p99 quantiles, mean, SD, missing and zero rates are saved as outFl_numerical_summary.csv.
- For categorical variables, number of observations (counts) and categories' proportions are saved as outFl_categorical_summary.csv. Each row is a category of a categorical (or binary) variable. The row whose category == 'NA' corresponds to missing. Categories among the same variable are ordered by global prevalence in a descending order.
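
To make the numerical summary columns concrete, the following is a minimal base R sketch of the statistics listed above for a single variable. It is not the package's implementation: `SummarizeNumeric` is a hypothetical helper, weights are ignored, and the zero rate is assumed to be taken over all rows (missing ones included), which may differ from what `vlm` does.

```r
# Hypothetical helper, not part of otvPlots: unweighted global summary
# of one numerical variable.
SummarizeNumeric <- function(x) {
  q <- quantile(x, probs = c(0.01, 0.25, 0.50, 0.75, 0.99),
                na.rm = TRUE, names = FALSE)
  names(q) <- c("p1", "p25", "p50", "p75", "p99")
  c(count       = length(x),
    q,
    mean        = mean(x, na.rm = TRUE),
    sd          = sd(x, na.rm = TRUE),
    missingrate = mean(is.na(x)),
    # Assumption: zero rate is computed over all rows, missing ones included
    zerorate    = sum(x == 0, na.rm = TRUE) / length(x))
}

# Toy variable with one missing value and two zeros
x <- c(0, 0, 1, 2, 3, 4, NA, 10)
stats <- SummarizeNumeric(x)
print(round(stats, 3))
```

Applying the same helper within each `dateGp` group would give the over-time rows of such a summary.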

Usage

vlm(dataFl, dateNm, labelFl = NULL, outFl = "otvplots", genCSV = TRUE,
  dataNeedPrep = FALSE, dateGp = NULL, dateGpBp = NULL, weightNm = NULL,
  varNms = NULL, sortVars = NULL, sortFn = NULL, selectCols = NULL,
  dropCols = NULL, dateFt = "%d%h%Y", buildTm = NULL,
  highlightNms = NULL, skewOpt = NULL, kSample = 50000,
  fuzzyLabelFn = NULL, dropConstants = FALSE, kCategories = 9, ...)

Arguments

`dataFl`	Either the name of an object that can be converted using `as.data.table` (e.g., a data frame), or a character string containing the name of a dataset that can be loaded using `fread` (e.g., a csv file). If the dataset is not in your working directory, then `dataFl` must include the (relative or absolute) path to the file.
`dateNm`	Name of column containing the date variable.
`labelFl`	Either the path of a dataset (a csv file) containing labels, an R object convertible to `data.table` (e.g., data frame) or `NULL`. If `NULL`, no labels will be used. The label dataset must contain at least 2 columns: `varCol` (variable names) and `labelCol` (variable labels).
`outFl`	Name of the output file, with no extension names (e.g., "bank"). A pdf file of plots ("bank.pdf"), and two csv files of summary statistics ("bank_categorical_summary.csv" and "bank_numerical_summary.csv") will be saved to your working directory, unless a path is included in `outFl` (e.g. "../plots/bank").
`genCSV`	Logical, whether to generate the two csv files of summary statistics for numerical and categorical variables.
`dataNeedPrep`	Logical, indicates if data should be run through the `PrepData` function. This should be set to `TRUE` unless the `PrepData` function has been applied to the input data `dataFl`.
`dateGp`	Name of the variable that the time series plots should be grouped by. Options are `NULL`, `"weeks"`, `"months"`, `"quarters"`, `"years"`. See `IDate` for details. If `NULL`, then `dateNm` will be used as `dateGp`.
`dateGpBp`	Name of variable the boxplots should be grouped by. Same options as `dateGp`. If `NULL`, then `dateGp` will be used.
`weightNm`	Name of the variable containing row weights, or `NULL` for no weights (all rows receiving weight 1).
`varNms`	Either `NULL` or a vector of names or indices of variables to be plotted. If `NULL`, defaults to all columns other than `dateNm` and `weightNm`. If indices are given, they refer to column positions after `dropCols` or `selectCols` have been applied, if applicable, and do not count `dateGp` and `dateGpBp` (which are added to `dataFl` by the function `PrepData`).
`sortVars`	Determines which variables are plotted and in what order. Either a character vector of variable names (variables are plotted in the same order as in `sortVars`), or `NULL` to keep the original ordering, with numerical variables being plotted before categorical and binary ones. `sortVars` should be `NULL` when the `sortFn` argument is used.
`sortFn`	A sorting function which returns `sortVars` as an output. The function may take the following variables as input: `dataFl`, `dateNm`, `buildTm`, `weightNm`, `kSample`. Currently, the only built-in sorting function is `OrderByR2`, which sorts numerical variables in the order of strength of linear association with date, and adds categorical (and binary) variables sorted in alphabetical order after the numerical ones.
`selectCols`	Either `NULL`, or a vector of names or indices of variables to read into memory – must include `dateNm`, `weightNm` (if not `NULL`) and all variables to be plotted. If both `selectCols` and `dropCols` are `NULL`, then all variables will be read in.
`dropCols`	Either `NULL`, or a vector of variable names or indices of variables not to read into memory. If both `selectCols` and `dropCols` are `NULL`, then all variables will be read in.
`dateFt`	`strptime` format of the date variable. The default is the SAS format `"%d%h%Y"`, but input data in the R date format `"%Y-%m-%d"` will also be detected; both formats are parsed automatically.
`buildTm`	Vector identifying the time period used for ranking/anomaly detection (most likely the model build period). Allows a subset of the plotting time period to be used for anomaly detection. Must be a vector of dates and is inclusive, i.e., `buildTm[1] <= date <= buildTm[2]` defines the time period. Must be either `NULL`, a vector of length 2, or a vector of length 3. If `NULL`, the entire dataset will be used for ranking/anomaly detection. If a vector of length 2, it must be a character vector of dates in the default R date format (e.g. "2017-01-30"). If a vector of length 3, the first two elements must contain dates in any `strptime` format, while the third element contains that `strptime` format (see `strptime`). The following are equivalent ways of selecting all of 2014: `c("2014-01-01","2014-12-31")` and `c("01JAN2014","31DEC2014", "%d%h%Y")`.
`highlightNms`	Either `NULL` or a character vector of variables to receive a red label. Currently, `NULL` means all variables will get a black legend. This argument is ignored if `labelFl` is `NULL`.
`skewOpt`	Either a numeric constant or `NULL`. Default is `NULL` (no transformation). If numeric, say 5, then all box plots of a variable whose skewness exceeds 5 will be on a log10 scale if possible. Negative input of `skewOpt` will be converted to 3.
`kSample`	Either `NULL` or a positive integer. If an integer, indicates the sample size for both drawing boxplots and ordering numerical graphs by `R^2`. When the data is large, setting `kSample` to a reasonable value (default is 50K) dramatically improves processing speed. Therefore, for larger datasets (e.g. > 10 percent system memory), this parameter should not be set to `NULL`, or boxplots may take a very long time to render. This setting has no impact on the accuracy of time series plots on quantiles, mean, SD, and missing and zero rates.
`fuzzyLabelFn`	Either `NULL` or a function of 2 parameters: a label file in the format of an output by `PrepLabels` and a string giving a variable name. The function should return the label corresponding to the variable given by the second parameter. This function should describe how fuzzy matching should be performed to find labels (see example below). If `NULL`, only exact matches will be returned.
`dropConstants`	Logical, indicates whether or not constant (all duplicated or NA) variables should be dropped from `dataFl` prior to plotting.
`kCategories`	If a categorical variable has more than `kCategories` categories, trace plots of only the `kCategories` most prevalent categories are plotted.
`...`	Additional parameters to be passed to `fread`.
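
The two-parameter contract described for `fuzzyLabelFn` can be sketched as follows. This is an illustrative assumption rather than code from the package: `FuzzyLabel` is a hypothetical name, the label table is mocked as a plain data frame with the `varCol` and `labelCol` columns mentioned under `labelFl`, and returning an empty string when nothing matches is a guess at reasonable behavior, not documented behavior.

```r
# Hypothetical fuzzy matcher: returns the label of the first variable whose
# name contains `varName` as a case-insensitive substring.
FuzzyLabel <- function(labels, varName) {
  hits <- grepl(tolower(varName), tolower(labels[["varCol"]]), fixed = TRUE)
  if (!any(hits)) {
    return("")  # Assumption: an empty string stands for "no label found"
  }
  labels[["labelCol"]][which(hits)[1]]
}

# Mock label table standing in for the output of PrepLabels
labels <- data.frame(
  varCol   = c("age_years", "balance_eur"),
  labelCol = c("Age of customer", "Average yearly balance"),
  stringsAsFactors = FALSE
)

age_label  <- FuzzyLabel(labels, "AGE")      # matches "age_years"
none_label <- FuzzyLabel(labels, "job")      # no match
print(c(age_label, none_label))
```

A function of this shape would then be passed as `fuzzyLabelFn = FuzzyLabel`; whether `vlm` accepts an empty string for unmatched variables should be checked against the package source.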

Details

If the argument dataNeedPrep is set to FALSE, then

- dataFl must be a data.table containing variables weightNm, dateNm, dateGp, and dateGpBp, and the names of these variables must be the same as the corresponding arguments of the vlm function;
- the arguments selectCols, dropCols, dateFt, and dropConstants will be ignored by the vlm function.
When analyzing a dataset for the first time, it is recommended to first run the PrepData function on it, and then apply the vlm function with the argument dataNeedPrep = FALSE. Please see the examples for details.
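
Before calling vlm with dataNeedPrep = FALSE, the column requirement above can be checked with a few lines of base R. The helper `MissingPrepCols` is hypothetical (it is not part of otvPlots) and only tests column names; it assumes that no weight column is required when weightNm is `NULL`.

```r
# Hypothetical pre-flight check: which of the required columns are absent?
MissingPrepCols <- function(dat, dateNm, dateGp, dateGpBp, weightNm = NULL) {
  needed <- c(dateNm, dateGp, dateGpBp, weightNm)  # a NULL weightNm drops out
  setdiff(needed, names(dat))
}

# Toy data that has "date" and "months" but no "quarters" column
dat <- data.frame(
  date   = as.Date(c("2014-01-15", "2014-02-15")),
  months = as.Date(c("2014-01-01", "2014-02-01")),
  age    = c(30, 41)
)

missing_cols <- MissingPrepCols(dat, dateNm = "date",
                                dateGp = "months", dateGpBp = "quarters")
print(missing_cols)
```

An empty result suggests the grouping columns are in place; a non-empty one suggests running PrepData first or calling vlm with dataNeedPrep = TRUE.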

License

Copyright 2017 Capital One Services, LLC Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Examples

## Load the data and its label
data(bankData)
data(bankLabels)

## The PrepData function should only need to be run once on a dataset, 
## after that vlm can be run with the argument dataNeedPrep = FALSE
bankData <- PrepData(bankData, dateNm = "date", dateGp = "months", 
                    dateGpBp = "quarters")
bankLabels <- PrepLabels(bankLabels)

## Not run:  
vlm(dataFl = bankData, dateNm = "date", labelFl = bankLabels, 
    sortFn = "OrderByR2", dateGp = "months", dateGpBp = "quarters", 
    outFl = "bank")
    
## If csv files of summary statistics are not needed, set genCSV = FALSE
vlm(dataFl = bankData, dateNm = "date", labelFl = bankLabels, genCSV = FALSE,
    sortFn = "OrderByR2", dateGp = "months", dateGpBp = "quarters", 
    outFl = "bank")
    
## If weights are provided, they will be used in all statistical calculations
bankData[, weight := rnorm(.N, 1, .1)]
vlm(dataFl = bankData, dateNm = "date", labelFl = bankLabels,
    dateGp = "months", dateGpBp = "quarters", weightNm = "weight", 
    outFl = "bank")

## Customize plotting order by passing a vector of variable names to 
## sortVars, but the "date" column must be excluded from sortVars
sortVars <- sort(bankLabels[varCol!="date", varCol])
vlm(dataFl = bankData, dateNm = "date", labelFl = bankLabels, 
    dateGp = "months", dateGpBp = "quarters", outFl = "bank", 
    sortVars = sortVars)
            
## Create plots for a specific variable using the varNms parameter
vlm(dataFl = bankData, dateNm = "date", labelFl = bankLabels, 
    dateGp = "months", dateGpBp = "quarters", outFl = "bank", 
    varNms = "age", sortVars = NULL)

## End(Not run)

capitalone/otvPlots documentation built on March 15, 2024, 8:25 a.m.