vlm | R Documentation |
Sorts variables according to either user input or correlation with time (among numerical variables only), and creates output files including:
A PDF file of plots saved as outFl.pdf, with each individual page on one variable. Variables are plotted in the order indicated in the argument sortVars or sortFn.
For each numerical variable, the output plots include
side-by-side boxplots grouped by dateGpBp (left),
a trace plot of p1, p50, and p99 percentiles, grouped by dateGp (top right),
a trace plot of mean and +-1 SD control limits, grouped by dateGp (middle right), and
a trace plot of missing and zero rates, grouped by dateGp (bottom right).
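The quantities behind the three trace plots can be approximated in base R. This is a minimal, unweighted sketch on made-up data, not the package's own code: the data frame toy, the column balance, and the helper summarise_group are hypothetical, and the zero rate is assumed here to be computed among non-missing values.

```r
# Toy data (hypothetical): 90 daily observations with one NA and one zero.
set.seed(1)
toy <- data.frame(
  date    = seq(as.Date("2017-01-01"), by = "day", length.out = 90),
  balance = c(NA, 0, rnorm(88, mean = 100, sd = 15))
)

# Stand-in for dateGp = "months": label each row with its year-month.
toy$months <- format(toy$date, "%Y-%m")

# Per-group statistics shown in the trace plots.
summarise_group <- function(x) {
  q <- quantile(x, probs = c(0.01, 0.50, 0.99), na.rm = TRUE, names = FALSE)
  m <- mean(x, na.rm = TRUE)
  s <- sd(x, na.rm = TRUE)
  c(p1 = q[1], p50 = q[2], p99 = q[3],
    mean = m, lower = m - s, upper = m + s,   # mean and +-1 SD limits
    missingrate = mean(is.na(x)),             # share of NA values
    zerorate = mean(x == 0, na.rm = TRUE))    # assumption: among non-missing
}

# One row per month, one column per statistic.
stats <- t(sapply(split(toy$balance, toy$months), summarise_group))
print(round(stats, 3))
```

When weightNm is supplied, vlm uses the weights in its calculations, so the unweighted figures above would differ from its output.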
For each categorical variable (including a numerical variable with no more than 2 unique levels not including NA), the output plots include
a frequency bar plot (left), and
a grid of trace plots of categories' proportions over time (right). If the variable contains more than kCategories categories, trace plots of only the largest kCategories will be plotted. If the variable contains only two categories, then only the trace plot of the less prevalent category will be plotted.
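The category selection rule described above can be sketched in base R as follows. The helper select_categories is hypothetical and only illustrates the rule; how the package treats NA values and ties in prevalence is not stated here, so dropping NA and relying on the default sort order are assumptions.

```r
# Hypothetical helper: which categories would get a trace plot?
select_categories <- function(x, kCategories = 9) {
  x <- x[!is.na(x)]                                   # assumption: NA is ignored
  props <- sort(prop.table(table(x)), decreasing = TRUE)
  if (length(props) == 2) {
    # Binary variable: keep only the less prevalent category.
    return(names(props)[2])
  }
  # Otherwise keep at most the kCategories most prevalent categories.
  names(props)[seq_len(min(kCategories, length(props)))]
}

job <- c(rep("admin", 5), rep("technician", 3), rep("retired", 1))
select_categories(job, kCategories = 2)

default <- c("no", "no", "no", "yes")
select_categories(default)
```

The first call keeps the two most frequent categories, and the second keeps only the rarer level of the binary variable.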
CSV file(s) of summary statistics of the variables, both globally and over time aggregated by dateGp. The order of variables in the CSV files is the same as in the PDF file.
For numerical variables, number of observations (counts), p1, p25, p50, p75, and p99 quantiles, mean, SD, missing and zero rates are saved as outFl_numerical_summary.csv.
For categorical variables, number of observations (counts) and categories' proportions are saved as outFl_categorical_summary.csv. Each row is a category of a categorical (or binary) variable. The row whose category == 'NA' corresponds to missing. Categories within the same variable are ordered by global prevalence in descending order.
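As a small illustration of the naming scheme, the file names implied by outFl and genCSV can be built as below. The helper expected_outputs is hypothetical and is not part of the package; it only mirrors the names given in this description.

```r
# Hypothetical helper: file names implied by outFl and genCSV.
expected_outputs <- function(outFl = "otvplots", genCSV = TRUE) {
  files <- paste0(outFl, ".pdf")
  if (genCSV) {
    files <- c(files,
               paste0(outFl, "_numerical_summary.csv"),
               paste0(outFl, "_categorical_summary.csv"))
  }
  files
}

expected_outputs("bank")

# After a run, this shows which of the files exist in the working directory.
file.exists(expected_outputs("bank"))
```

A dataset with only numerical or only categorical variables may well produce a single CSV file, so a FALSE for one of the CSV names is not necessarily an error.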
vlm(dataFl, dateNm, labelFl = NULL, outFl = "otvplots", genCSV = TRUE,
dataNeedPrep = FALSE, dateGp = NULL, dateGpBp = NULL, weightNm = NULL,
varNms = NULL, sortVars = NULL, sortFn = NULL, selectCols = NULL,
dropCols = NULL, dateFt = "%d%h%Y", buildTm = NULL,
highlightNms = NULL, skewOpt = NULL, kSample = 50000,
fuzzyLabelFn = NULL, dropConstants = FALSE, kCategories = 9, ...)
dataFl |
Either the name of an object that can be converted using
|
dateNm |
Name of column containing the date variable. |
labelFl |
Either the path of a dataset (a csv file) containing
labels, an R object convertible to |
outFl |
Name of the output file, with no extension names (e.g., "bank").
A pdf file of plots ("bank.pdf"), and two csv files of summary statistics
("bank_categorical_summary.csv" and "bank_numerical_summary.csv") will be
saved to your working directory, unless a path is included in |
genCSV |
Logical, whether to generate the two csv files of summary statistics for numerical and categorical variables. |
dataNeedPrep |
Logical, indicates if data should be run through the
|
dateGp |
Name of the variable that the time series plots should be
grouped by. Options are |
dateGpBp |
Name of variable the boxplots should be grouped by. Same
options as |
weightNm |
Name of the variable containing row weights, or |
varNms |
Either |
sortVars |
Determines which variables are plotted and their order.
Either a character vector of variable names to plot variables in the same
order as in the |
sortFn |
A sorting function which returns |
selectCols |
Either |
dropCols |
Either |
dateFt |
|
buildTm |
Vector identifying the time period used for ranking/anomaly detection (most likely the model build period). Allows a subset of the plotting time period to be used for anomaly detection.
|
highlightNms |
Either |
skewOpt |
Either a numeric constant or |
kSample |
Either |
fuzzyLabelFn |
Either |
dropConstants |
Logical, indicates whether or not constant (all
duplicated or NA) variables should be dropped from |
kCategories |
If a categorical variable has more than |
... |
Additional parameters to be passed to
|
If the argument dataNeedPrep is set to FALSE, then
dataFl must be a data.table containing variables weightNm, dateNm, dateGp, and dateGpBp, and names of these variables must be the same as the corresponding arguments of the vlm function.
The arguments selectCols, dropCols, dateFt, and dropConstants will be ignored by the vlm function.
When analyzing a dataset for the first time, it is recommended to first run the PrepData function on it, and then apply the vlm function with the argument dataNeedPrep = FALSE. Please see the examples for details.
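Before calling vlm with dataNeedPrep = FALSE, it can help to confirm that the prepared data contains the columns named in the arguments. The check below is a rough sketch in base R; missing_prep_cols is a hypothetical helper, not a function of the package, and it tests only for column presence, not for column types.

```r
# Hypothetical helper: which required columns are absent from the data?
missing_prep_cols <- function(dt, dateNm, dateGp, dateGpBp, weightNm = NULL) {
  required <- c(dateNm, dateGp, dateGpBp, weightNm)  # a NULL weightNm drops out
  setdiff(required, names(dt))
}

# Toy stand-in for prepared data that lacks the "quarters" column.
prepped <- data.frame(
  date   = as.Date(c("2017-01-15", "2017-04-15")),
  months = as.Date(c("2017-01-01", "2017-04-01"))
)

missing_prep_cols(prepped, dateNm = "date", dateGp = "months",
                  dateGpBp = "quarters")
```

An empty result means every required column is present. Any name returned suggests running PrepData first, or calling vlm with dataNeedPrep = TRUE.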
Copyright 2017 Capital One Services, LLC
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
This function depends on: PrintPlots, OrderByR2, PrepData, PrepLabels.
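Sorting by correlation with time can be pictured as ranking numerical variables by the R-squared of a linear fit against the date. The sketch below is an approximation of that idea on made-up data and should not be read as the actual implementation of OrderByR2; the data frame toy and the helper r2_with_time are hypothetical.

```r
# Toy data (hypothetical): one variable drifts over time, one does not.
set.seed(42)
n <- 200
toy <- data.frame(
  date  = seq(as.Date("2017-01-01"), by = "day", length.out = n),
  trend = seq_len(n) + rnorm(n, sd = 5),
  noise = rnorm(n)
)

# R-squared of a simple linear regression of a variable on time.
r2_with_time <- function(x, date) {
  summary(lm(x ~ as.numeric(date)))$r.squared
}

# Rank the numerical columns from strongest to weakest association with time.
num_vars <- setdiff(names(toy)[sapply(toy, is.numeric)], "date")
r2 <- sapply(toy[num_vars], r2_with_time, date = toy$date)
ranked <- names(sort(r2, decreasing = TRUE))
ranked
```

A character vector produced this way could be passed as sortVars, which plots variables in the given order.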
## Load the data and its labels
data(bankData)
data(bankLabels)

## The PrepData function should only need to be run once on a dataset;
## after that, vlm can be run with the argument dataNeedPrep = FALSE
bankData <- PrepData(bankData, dateNm = "date", dateGp = "months",
                     dateGpBp = "quarters")
bankLabels <- PrepLabels(bankLabels)

## Not run:
vlm(dataFl = bankData, dateNm = "date", labelFl = bankLabels,
    sortFn = "OrderByR2", dateGp = "months", dateGpBp = "quarters",
    outFl = "bank")

## If csv files of summary statistics are not needed, set genCSV = FALSE
vlm(dataFl = bankData, dateNm = "date", labelFl = bankLabels, genCSV = FALSE,
    sortFn = "OrderByR2", dateGp = "months", dateGpBp = "quarters",
    outFl = "bank")

## If weights are provided, they will be used in all statistical calculations
bankData[, weight := rnorm(.N, 1, .1)]
vlm(dataFl = bankData, dateNm = "date", labelFl = bankLabels,
    dateGp = "months", dateGpBp = "quarters", weightNm = "weight",
    outFl = "bank")

## Customize plotting order by passing a vector of variable names to
## sortVars, but the "date" column must be excluded from sortVars
sortVars <- sort(bankLabels[varCol != "date", varCol])
vlm(dataFl = bankData, dateNm = "date", labelFl = bankLabels,
    dateGp = "months", dateGpBp = "quarters", outFl = "bank",
    sortVars = sortVars)

## Create plots for a specific variable using the varNms parameter
vlm(dataFl = bankData, dateNm = "date", labelFl = bankLabels,
    dateGp = "months", dateGpBp = "quarters", outFl = "bank",
    varNms = "age", sortVars = NULL)

## End(Not run)