synthetic_new_attribute: Add a new attribute to a synthetic_micro dataset

View source: R/synth_new_attr.R

synthetic_new_attribute    R Documentation

Add a new attribute to a synthetic_micro dataset

Description

Add a new attribute to a synthetic_micro dataset using conditional relationships between the new attribute and existing attributes (e.g., wage rate conditioned on age and education level).

Usage

synthetic_new_attribute(
  df,
  prob_name = "p",
  attr_name = "variable",
  conditional_vars = NULL,
  sym_tbl = NULL
)

Arguments

df

An R object of class "synthetic_micro".

prob_name

A string specifying the column name of the df containing the probabilities for each synthetic observation.

attr_name

A string specifying the desired name of the new attribute to be added to the data.

conditional_vars

A character vector specifying the existing variables, if any, on which the new attribute (variable) is to be conditioned. Variables must be specified in order. Defaults to NULL, i.e., an unconditional new attribute.

sym_tbl

A data.frame symbol table with N + 2 columns. The last two columns must be: (1) a vector containing the new attribute counts or percentages; (2) a vector of the new attribute levels. The first N columns must match the conditioning scheme imposed by the variables in conditional_vars. See Details and Examples.
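The N + 2 column layout can be checked before calling the function. The following is a minimal sketch in base R; check_sym_tbl is a hypothetical helper, not part of the package, and it assumes that "matching the conditioning scheme" means the first N column names equal conditional_vars in order. The rescaling step at the end is one way counts could be turned into probabilities, not necessarily how the package does it.

```r
# Hypothetical helper (not part of the package): check the N + 2 column layout.
check_sym_tbl <- function(sym_tbl, conditional_vars = NULL) {
  n_key <- length(conditional_vars)
  stopifnot(is.data.frame(sym_tbl), ncol(sym_tbl) == n_key + 2L)
  # Assumption: the first N column names follow the order of conditional_vars.
  stopifnot(identical(names(sym_tbl)[seq_len(n_key)],
                      as.character(conditional_vars)))
  # Column N + 1 holds counts or percentages: numeric and non-negative.
  wt <- sym_tbl[[n_key + 1L]]
  stopifnot(is.numeric(wt), all(wt >= 0))
  invisible(TRUE)
}

# A symbol table with N = 1 conditioning column, expressed as counts.
ST <- data.frame(
  gender   = rep(c("male", "female"), each = 3),
  attr_cnt = c(10, 80, 10, 5, 70, 25),
  levels   = rep(c("low", "middle", "high"), times = 2),
  stringsAsFactors = FALSE
)
check_sym_tbl(ST, conditional_vars = "gender")

# Rescale counts to probabilities within each key (here, each gender).
ST$prob <- ST$attr_cnt / ave(ST$attr_cnt, ST$gender, FUN = sum)
```

After rescaling, the values in prob sum to 1 within each gender, which is the property a conditional distribution needs.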

Value

A new synthetic_micro dataset with class "synthetic_micro".

Details

New synthetic variables are introduced to the existing data via conditional probability. Similar to derive_synth_datasets, the goal of this function is to generate a joint probability distribution for an attribute vector and to create synthetic individuals from this distribution. Although no limit is placed on the number of conditioning variables, in practice data rarely exist that allow more than two or three. All other variables are assumed to be independent of the new attribute.
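The idea of extending a joint distribution through a conditional one can be illustrated in base R. This is a sketch of the underlying arithmetic only, not the package's implementation; the data frames synth and cond and the column name prob are made up for illustration.

```r
# Toy synthetic dataset: one row per attribute combination, with probability p.
synth <- data.frame(
  gender = c("male", "female"),
  p      = c(0.4, 0.6),
  stringsAsFactors = FALSE
)

# Hypothetical conditional distribution P(SES | gender).
cond <- data.frame(
  gender = rep(c("male", "female"), each = 3),
  SES    = rep(c("low", "middle", "high"), times = 2),
  prob   = c(0.10, 0.80, 0.10, 0.05, 0.70, 0.25),
  stringsAsFactors = FALSE
)

# Joint probability: P(gender, SES) = P(gender) * P(SES | gender).
joint <- merge(synth, cond, by = "gender")
joint$p <- joint$p * joint$prob
joint$prob <- NULL

# The expanded dataset is still a valid probability distribution.
stopifnot(isTRUE(all.equal(sum(joint$p), 1)))
```

Each original row is split into one row per level of the new attribute, and the row probabilities still sum to 1.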

There are four different types of conditional/marginal probability models which may be considered for a given new attribute: (1) Independence: each of the variables is assumed to be independent of the others. (2) Pairwise conditional independence: each attribute is assumed to be related to only one other attribute and independent of all others. (3) Conditional independence: attributes can be dependent on some subset of other attributes and independent of the rest. (4) In the most general case, all attributes are jointly interrelated.
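One way to read the four models is in terms of the conditional distribution of the new attribute. The notation here is an assumption introduced for illustration: y denotes the new attribute, x_1, ..., x_n the existing attributes, and S a subset of their indices.

```latex
\begin{align*}
\text{(1)}\quad & P(y \mid x_1, \dots, x_n) = P(y) \\
\text{(2)}\quad & P(y \mid x_1, \dots, x_n) = P(y \mid x_j) \quad \text{for a single } j \\
\text{(3)}\quad & P(y \mid x_1, \dots, x_n) = P(y \mid x_S), \quad S \subset \{1, \dots, n\} \\
\text{(4)}\quad & P(y \mid x_1, \dots, x_n) \quad \text{with no simplification}
\end{align*}
```

Under this reading, leaving conditional_vars as NULL corresponds roughly to case (1), a single conditioning variable to case (2), and several conditioning variables to case (3).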

Conditioning is implemented via symbol tables (sym_tbl) to ensure accurate matching between conditioning variables, new attribute levels, and new attribute probabilities. The symbol table is constructed such that the key in each key-value pair is the specific combination of values of the conditioning variables. This key occupies the first N columns of sym_tbl. A recursive approach is employed to conditionally partition sym_tbl. For this reason, the *order* in which the conditional variables are supplied matters.

The value is the final two columns of sym_tbl, which form a pair of (A) either counts or percentages used to specify the probability for the new attribute and (B) the level that the new attribute takes on.
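The key-value structure and the role of variable order can be sketched with a small recursive function in base R. partition_tbl is a hypothetical helper written for illustration; it is not the package's internal routine, and it assumes every key column is fully specified (no NA entries).

```r
# Hypothetical helper: recursively partition a symbol table by its key columns.
# At each step the table is split on the first remaining variable, so the
# nesting of the result follows the order of `vars`.
partition_tbl <- function(tbl, vars) {
  if (length(vars) == 0L) return(tbl)  # leaf: only the value columns remain
  parts <- split(tbl, tbl[[vars[1L]]], drop = TRUE)
  lapply(parts, function(part) {
    part[[vars[1L]]] <- NULL           # drop the key column just consumed
    partition_tbl(part, vars[-1L])
  })
}

# Symbol table with N = 2 key columns and the two value columns.
ST <- data.frame(
  gender = rep(c("male", "female"), each = 4),
  edu    = rep(rep(c("LT_college", "BA_degree"), each = 2), times = 2),
  pct    = c(0.3, 0.7, 0.6, 0.4, 0.2, 0.8, 0.5, 0.5),
  levels = rep(c("low", "high"), times = 4),
  stringsAsFactors = FALSE
)

by_gender_first <- partition_tbl(ST, c("gender", "edu"))
by_edu_first    <- partition_tbl(ST, c("edu", "gender"))

# Look up the value pair for the key (male, BA_degree).
by_gender_first$male$BA_degree
```

The two calls hold the same leaves but nest them differently: the first is keyed by gender and then edu, the second by edu and then gender.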

Examples

{
# Build a toy dataset with two attributes and a probability column p
set.seed(567L)
df <- data.frame(gender= factor(sample(c("male", "female"), size= 100, replace= TRUE)),
                edu= factor(sample(c("LT_college", "BA_degree"), size= 100, replace= TRUE)),
                p= runif(100))
df$p <- df$p / sum(df$p)  # rescale so the probabilities sum to 1
class(df) <- c("data.frame", "micro_synthetic")

# Example 1: condition the new attribute SES on gender only (N = 1 key column)
ST <- data.frame(gender= c(rep("male", 3), rep("female", 3)),
                 attr_pct= c(0.1, 0.8, 0.1, 0.05, 0.7, 0.25),
                 levels= rep(c("low", "middle", "high"), 2))
df2 <- synthetic_new_attribute(df, prob_name= "p", attr_name= "SES", conditional_vars= "gender",
         sym_tbl= ST)

# Example 2: condition SES on gender, then edu (N = 2 key columns, in that order).
# edu is NA for the male rows; attr_pct mixes proportions and counts.
ST2 <- data.frame(gender= c(rep("male", 3), rep("female", 6)),
                  edu= c(rep(NA, 3), rep(c("LT_college", "BA_degree"), each= 3)),
                  attr_pct= c(0.1, 0.8, 0.1, 10, 80, 10, 5, 70, 25),
                  levels= rep(c("low", "middle", "high"), 3))
df2 <- synthetic_new_attribute(df, prob_name= "p", attr_name= "SES",
         conditional_vars= c("gender", "edu"),
         sym_tbl= ST2)
}

synthACS documentation built on Oct. 26, 2022, 5:09 p.m.