generate_mild_df: Generate mild_df using multivariate t and normal...
In mildsvm: Multiple-Instance Learning with Support Vector Machines

generate_mild_df

R Documentation

Generate mild_df using multivariate t and normal distributions.

Description

This function samples multiple instance distributional data (a mild_df object) where each row corresponds to a sample from a given instance distribution. Instance distributions can be multivariate t and normal, with mean and variance parameters that can be fixed or sampled based on prior parameters. These instances are grouped into bags and the bag labels follow the standard MI assumption.

Usage

generate_mild_df(
  nbag = 50,
  ninst = 4,
  nsample = 50,
  ncov = 10,
  nimp_pos = 1:ncov,
  nimp_neg = 1:ncov,
  positive_prob = 0.2,
  dist = c("mvt", "mvnormal", "mvnormal"),
  mean = list(rep(0, length(nimp_pos)), rep(0, length(nimp_neg)), 0),
  sd_of_mean = c(0.5, 0.5, 0.5),
  cov = list(diag(1, nrow = length(nimp_pos)), diag(1, nrow = length(nimp_neg)), 1),
  sample_cov = FALSE,
  df_wishart_cov = c(length(nimp_pos), length(nimp_neg), ncov - length(nimp_pos)),
  degree = c(3, NA, NA),
  positive_bag_prob = NULL,
  n_noise_inst = NULL,
  ...
)

Arguments

`nbag`	The number of bags (default 50).
`ninst`	The number of instances for each bag (default 4).
`nsample`	The number of samples for each instance (default 50).
`ncov`	The number of total covariates (default 10).
`nimp_pos`	An index of important covariates for positive instances (default `1:ncov`).
`nimp_neg`	An index of important covariates for negative instances (default `1:ncov`).
`positive_prob`	A numeric value between 0 and 1 indicating the probability of an instance being positive (default 0.2).
`dist`	A vector (length 3) of distributions for the positive, negative, and remaining instances, respectively. Each distribution can be either `'mvnormal'` for multivariate normal or `'mvt'` for multivariate Student's t.
`mean`	A list (length 3) of mean vectors for the positive, negative, and remaining distributions. `mean[[1]]` should match `nimp_pos` in length; `mean[[2]]` should match `nimp_neg` in length.
`sd_of_mean`	A vector (length 3) of standard deviations in sampling the mean for positive, negative, and remaining distributions, where the prior is given by `mean`. Use `sd_of_mean = c(0, 0, 0)` to keep the mean consistent across all instances.
`cov`	A list (length 3) of covariance matrices for the positive, negative, and remaining distributions. `cov[[3]]` should be an integer, since the dimension of the remaining features varies depending on whether the important distribution is positive or negative.
`sample_cov`	A logical value for whether to sample the covariance for each distribution. If `FALSE` (the default), each covariance is fixed at `cov`. If `TRUE`, the prior is given by `cov` and sampled from a Wishart distribution with `df_wishart_cov` degrees of freedom to have an expectation of `cov`.
`df_wishart_cov`	A vector (length 3) of degrees-of-freedom to use in the Wishart covariance matrix sampling.
`degree`	A vector (length 3) of degrees-of-freedom used when any of `dist` is `'mvt'`. This parameter is ignored when `dist[i] == 'mvnormal'`, in which case `NA` can be specified.
`positive_bag_prob`	A numeric value between 0 and 1 indicating the probability of a bag being positive. Must be specified jointly with `n_noise_inst`, in which case `positive_prob` is ignored. If `NULL` (the default), instance labels are sampled first according to `positive_prob`.
`n_noise_inst`	An integer indicating the number of negative instances in a positive bag. Must be specified jointly with `positive_bag_prob`. `n_noise_inst` should be less than `ninst`.
`...`	Arguments passed to or from other methods.

Details

To use this function, first decide on the number of bags, instances per bag, and samples per instance with the nbag, ninst, and nsample arguments. Next, choose the number of covariates ncov and how those covariates differ between instances with positive and negative labels. Some covariates can follow a distribution common to the positive and negative instances, which we call the remainder distribution. Use nimp_pos and nimp_neg to specify the indices of the important (non-remainder) covariates in the distributions with positive and negative instance labels.
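As a rough illustration of the split between important and remainder covariates, the base R sketch below computes which covariate indices fall into the remainder group for each instance label. The values of ncov, nimp_pos, and nimp_neg are illustrative, and deriving the remainder with setdiff() is an assumption about the behavior described here, not the package's internal code.

```r
# Illustrative values (not the function defaults)
ncov <- 5
nimp_pos <- 1:2   # important covariates for positive instances
nimp_neg <- 2:4   # important covariates for negative instances

# Assumption: remainder covariates are those not listed as important
rem_pos <- setdiff(seq_len(ncov), nimp_pos)  # indices 3, 4, 5
rem_neg <- setdiff(seq_len(ncov), nimp_neg)  # indices 1, 5

length(rem_pos)  # 3 remainder covariates for positive instances
length(rem_neg)  # 2 remainder covariates for negative instances
```

Because the remainder dimension differs between the two labels (3 versus 2 here), a single covariance matrix cannot fit both cases, which is consistent with cov[[3]] being given as a single number.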

The number of positive and negative instances and bags is determined by positive_prob or by the joint specification of positive_bag_prob and n_noise_inst. In the first case, instance labels are independent Bernoulli draws based on positive_prob, and bag labels are determined by the standard MI assumption (i.e. a bag is positive if any instance in the bag is positive). In the second case, bag labels are drawn independently as Bernoulli with success probability positive_bag_prob. Each positive bag is given n_noise_inst instances with an instance label of 0, and the remaining instances have an instance label of 1.
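The two labeling schemes can be sketched in base R as follows. This is a minimal illustration of the logic described above, not the package's implementation; the object names (inst_label_1, bag_label_2, and so on) are hypothetical, and the sketch assumes that negative bags in the second scheme contain only negative instances, as the standard MI assumption requires.

```r
set.seed(1)
nbag <- 6
ninst <- 4

# Scheme 1: independent Bernoulli instance labels, then the MI assumption
positive_prob <- 0.2
inst_label_1 <- matrix(rbinom(nbag * ninst, 1, positive_prob), nrow = nbag)
bag_label_1 <- as.integer(rowSums(inst_label_1) > 0)  # positive if any instance is

# Scheme 2: Bernoulli bag labels, then a fixed count of negative
# instances inside each positive bag
positive_bag_prob <- 0.5
n_noise_inst <- 1
bag_label_2 <- rbinom(nbag, 1, positive_bag_prob)
inst_label_2 <- t(sapply(bag_label_2, function(y) {
  if (y == 1) {
    c(rep(0, n_noise_inst), rep(1, ninst - n_noise_inst))
  } else {
    rep(0, ninst)  # assumption: negative bags hold only negative instances
  }
}))
```

Under the first scheme the bag-level positive rate is implied rather than set directly: with positive_prob = 0.2 and ninst = 4 it is 1 - 0.8^4 = 0.5904. The second scheme is the one to use when the share of positive bags needs to be controlled directly.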

The remaining arguments determine the distributions used for the positive, negative, and remaining features. Each argument is a vector or list of length 3 corresponding to these three groups. To create different instance distributions, the strategy is to first draw the mean parameter from Normal(mean, sd_of_mean * I) and the covariance parameter from Wishart(df_wishart_cov, cov), scaled to have expectation equal to cov. Then i.i.d. draws are sampled from the specified distribution (either multivariate normal or Student's t). To ensure that each instance distribution has the same mean, set sd_of_mean to 0. To ensure that each instance distribution has the same covariance, set sample_cov = FALSE.
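The hierarchical sampling for a single instance can be sketched in base R as below, for the multivariate normal case only. This is an assumption-laden illustration rather than the package's code: the names p, prior_mean, prior_cov, and df_wishart are hypothetical, sd_of_mean is treated as a standard deviation, and the Wishart scale matrix is divided by the degrees of freedom because a Wishart(df, S) draw has expectation df * S.

```r
set.seed(2)
p <- 3                          # number of covariates in this group
prior_mean <- rep(0, p)
sd_of_mean <- 0.5
prior_cov <- diag(1, nrow = p)
df_wishart <- 5                 # needs to be at least p for a valid draw
nsample <- 1000

# Step 1: draw the instance mean around the prior mean
mu <- rnorm(p, mean = prior_mean, sd = sd_of_mean)

# Step 2: draw the instance covariance; scaling by 1 / df gives
# an expectation of prior_cov
Sigma <- stats::rWishart(1, df = df_wishart,
                         Sigma = prior_cov / df_wishart)[, , 1]

# Step 3: i.i.d. multivariate normal samples via the Cholesky factor
# (chol() returns upper-triangular R with t(R) %*% R == Sigma)
z <- matrix(rnorm(nsample * p), nrow = nsample)
x <- sweep(z %*% chol(Sigma), 2, mu, "+")

dim(x)  # nsample rows, p columns
```

Setting sd_of_mean to 0 in step 1 makes mu equal to prior_mean for every instance, and skipping step 2 in favor of Sigma <- prior_cov corresponds to sample_cov = FALSE.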

The final data.frame will have nsample * nbag * ninst rows and ncov + 3 columns: bag_label, bag_name, instance_name, and the ncov sampled covariates.
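As a quick check of the expected dimensions, the arithmetic can be worked through in base R. The argument values below are illustrative and happen to match those used in the Examples section.

```r
nbag <- 7
ninst <- 3
nsample <- 20
ncov <- 2

n_rows <- nsample * nbag * ninst  # 20 * 7 * 3 = 420 rows
n_cols <- ncov + 3                # 2 covariates + 3 identifier columns = 5

c(rows = n_rows, cols = n_cols)
```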

Value

A mild_df object.

Author(s)

Yifei Liu, Sean Kent

Examples

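set.seed(8)
# 7 bags, 3 instances per bag, 20 samples per instance, 2 covariates.
# Covariate 1 is important for positive instances; nimp_neg keeps its
# default of 1:ncov, so mean[[2]] has length 2.
mild_data <- generate_mild_df(nbag = 7, ninst = 3, nsample = 20,
                              ncov = 2,
                              nimp_pos = 1,
                              dist = rep("mvnormal", 3),
                              mean = list(
                                rep(5, 1),   # positive instances
                                rep(15, 2),  # negative instances
                                0            # remainder covariates
                              ))

library(dplyr)
# One row per instance, showing the bag/instance structure
distinct(mild_data, bag_label, bag_name, instance_name)
# Per-instance means of the two covariates (columns 4 and 5)
split(mild_data[, 4:5], mild_data$instance_name) %>%
  sapply(colMeans) %>%
  round(2) %>%
  t()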

mildsvm documentation built on July 14, 2022, 9:08 a.m.