getenumCI2022: Summarizes veris enumerations from verisr objects

View source: R/getenumCI2022.R

getenumCI2022    R Documentation

Summarizes veris enumerations from verisr objects

Description

This is the primary analysis function for veris. It calculates the point estimate and credible intervals for enumerations (for example, 'Malware', 'Hacking', etc. within 'action').

The 'by' parameter allows enumerating one feature by another (for example, to count the frequency of each action by year).
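As a rough illustration of what enumerating one feature 'by' another means, the sketch below uses base R on a small made-up data frame rather than a verisr object. The data and the column names `year` and `action` are hypothetical, and this is not how getenumCI2022() is implemented; it only shows the shape of the result (a count x, a sample size n and a frequency per group).

```r
# Hypothetical toy data: one row per incident (not a real verisr object).
incidents <- data.frame(
  year   = c(2020, 2020, 2021, 2021, 2021),
  action = c("Hacking", "Malware", "Hacking", "Hacking", "Social"),
  stringsAsFactors = FALSE
)

# Count each action within each year.
counts <- as.data.frame(table(by = incidents$year, enum = incidents$action),
                        stringsAsFactors = FALSE)
names(counts)[names(counts) == "Freq"] <- "x"

# Sample size per year, and the frequency of each action within its year.
counts$n    <- as.vector(table(incidents$year)[counts$by])
counts$freq <- round(counts$x / counts$n, 5)

# Show only the combinations that actually occur.
counts[counts$x > 0, ]
```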

Usage

getenumCI2022(
  veris,
  enum,
  by = NULL,
  na.rm = NULL,
  unk = FALSE,
  short.names = TRUE,
  ci.method = c(),
  cred.mass = 0.95,
  ci.level = NULL,
  ci.params = FALSE,
  round.freq = 5,
  na = NULL,
  top = NULL,
  force = FALSE,
  quietly = FALSE,
  ...
)

Arguments

veris

A verisr object

enum

A veris feature or enumeration to summarize

by

A veris feature or enumeration to group by

na.rm

A boolean controlling whether 'not applicable' (NA) values are kept in or removed from the sample set. This is REQUIRED if enum has a potential value of NA, as there is no 'default' method for handling NAs. Instead, the right choice depends on the hypothesis being tested.

unk

A boolean indicating whether to include 'unknown' in the sample. The default is 'FALSE' and should rarely be overridden.

short.names

A boolean identifying whether to use the full enumeration name or just the last section (e.g. action.hacking.variety.SQLi vs. just SQLi).

ci.method

A confidence interval method to use. Options are "mcmc" or "bootstrap". "bootstrap" uses the bayes process from the binom package. "mcmc" uses a binomial model based on rstan, rstanarm, brms.

cred.mass

The amount of probability mass that will be contained in reported credible intervals. This argument fills a similar role to conf.level in binom.test.

ci.level

DEPRECATED! Same as cred.mass.

ci.params

Set to TRUE to receive a list column in the output holding the parameters used to recreate the model that determined the credible interval.

round.freq

An integer indicating how many places to round the frequency value to. (default = 5)

na

DEPRECATED! Use the 'na.rm' parameter.

top

An integer limiting the output to that many of the top enumerations.

force

getenumCI() will attempt to enforce sane confidence-based practices (such as hiding x and freq in low sample sizes). Setting force to 'TRUE' will override these best practices.

quietly

When TRUE, suppress all warnings and messages. This is helpful when getenumCI is used in a larger script or markdown document.

...

A catch-all for functions using arguments from previous versions of getenum.

Details

Unknowns are generally excluded as 'not tested'. If 'NA' is an enumeration in the feature being enumerated, the 'na.rm' parameter must be specified, because whether NA should be included depends heavily on the hypothesis being tested.

This function accurately enumerates single logical columns, character feature columns, and features spanning multiple logical columns (such as action.*). It cannot enumerate free-form text columns. It accurately calculates the sample size 'n' as the number of rows (independent of the number of enumerations present in the feature).
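The distinction between the sample size and the number of enumerations can be seen on a toy example. The sketch below uses a made-up data frame with one logical column per enumeration; the data, the column names and the rule for dropping 'Unknown' rows are assumptions for illustration, not the package's actual code.

```r
# Hypothetical toy data: one row per incident, one logical column per enumeration.
toy <- data.frame(
  action.Hacking = c(TRUE,  TRUE,  FALSE, FALSE),
  action.Malware = c(TRUE,  FALSE, TRUE,  FALSE),
  action.Unknown = c(FALSE, FALSE, FALSE, TRUE)
)

# Assumption for this sketch: rows marked Unknown are dropped as 'not tested'.
tested <- toy[!toy$action.Unknown, c("action.Hacking", "action.Malware")]

# n is the number of rows, even though the first row carries two enumerations.
n <- nrow(tested)
x <- colSums(tested)

data.frame(enum = sub("^action\\.", "", names(x)),
           x    = as.vector(x),
           n    = n,
           freq = round(as.vector(x) / n, 5))
```

Because a single row can have several enumerations set, the x values here sum to more than n, and the frequencies do not need to sum to 1.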

getenumCI() can also provide binomial confidence intervals for the enumerations tested within the features. See the parameters for details.
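To make the role of cred.mass concrete, here is a minimal base R sketch of an equal-tailed credible interval for a proportion under a Beta prior. The helper name `beta_interval`, the Jeffreys prior and the equal-tailed interval are all assumptions of this sketch; the methods selected by ci.method may compute their intervals differently.

```r
# Equal-tailed credible interval for a proportion under a Beta prior.
# The Jeffreys prior (shapes 0.5, 0.5) is an assumption of this sketch.
beta_interval <- function(x, n, cred.mass = 0.95, prior = c(0.5, 0.5)) {
  tail   <- (1 - cred.mass) / 2      # probability mass left in each tail
  shape1 <- x + prior[1]             # posterior shape for "successes"
  shape2 <- n - x + prior[2]         # posterior shape for "failures"
  c(lower = qbeta(tail, shape1, shape2),
    freq  = x / n,
    upper = qbeta(1 - tail, shape1, shape2))
}

# 12 of 40 incidents carry the enumeration of interest.
ci95 <- beta_interval(x = 12, n = 40)
ci80 <- beta_interval(x = 12, n = 40, cred.mass = 0.80)
ci95
ci80
```

Lowering cred.mass narrows the interval, since less posterior probability has to fall inside it.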

While getenumCI() may work on other types of dataframes, it was designed for verisr dataframes and data.tables. It is neither tested on nor recommended for any other type.

Value

A data frame summarizing the enumeration

Examples

# Download the VCDB verisr object to a temporary file and load it
tmp <- tempfile(fileext = ".dat")
download.file("https://github.com/vz-risk/VCDB/raw/master/data/verisr/vcdb.dat", tmp, quiet=TRUE)
load(tmp, verbose=TRUE)
library(magrittr)  # provides the %>% pipe used below
# Summarize a single feature
chunk <- getenumCI(vcdb, "action.hacking.variety")
chunk
# Limit the output with 'top'
chunk <- getenumCI(vcdb, "action.hacking.variety", top=10)
# Enumerate one feature by another
chunk <- getenumCI(vcdb, "action.hacking.variety", by="timeline.incident.year")
chunk
chunk <- getenumCI(vcdb,
                   "action.hacking.variety",
                   by="timeline.incident.year")
# Reshape to one row per 'by' value and one column per enumeration
chunk %>%
    dplyr::select(by, enum, freq) %>%
    tidyr::pivot_wider(names_from=enum, values_from=freq, values_fill = list(freq=0))
# Other features and options
getenumCI(vcdb, "action")
getenumCI(vcdb, "asset.variety")
getenumCI(vcdb, "asset.assets.variety")
getenumCI(vcdb, "asset.assets.variety", ci.method="wilson")
# 'na.rm' is specified explicitly in the calls for asset.cloud and actor.*.motive
getenumCI(vcdb, "asset.cloud", na.rm=FALSE)
getenumCI(vcdb, "action.social.variety.Phishing")
getenumCI(vcdb, "actor.*.motive", ci.method="wilson", na.rm=FALSE)
# Clean up
rm(vcdb)

vz-risk/verisr documentation built on Aug. 5, 2023, 4:34 a.m.