add_variable: Add a synthetic but realistic variable to a dataset following...
In sdglinkage: Synthetic Data Generation for Linkage Methods Development


add_variable adds a column containing a new variable to a dataset. The new variable is generated according to realistic rules. Several types of variable are supported:

nhsid: each row is assigned a randomly generated 10-digit identifier that follows the Modulus 11 Algorithm;
dob: if age_dependency is TRUE and the dataset contains a variable called 'age', the dob is generated from the value of age and end_date. If age_dependency is FALSE, the dob is randomly generated between start_date and end_date;
address: a random UK address sampled from 30,000 UK addresses, see gen_address;
firstname: a firstname randomly sampled from the selected database:
- If country is 'uk' and gender_dependency and age_dependency are both TRUE, the firstname is sampled according to the gender and age of each individual in the dataset. The uk firstname database was extracted from ONS and contains firstnames and their frequencies in England and Wales from 1996 to 2018.
- If country is 'us' and gender_dependency and race_dependency are both TRUE, the firstname is sampled according to the gender and ethnicity of each individual in the dataset. The us firstname database was extracted from randomNamesData. Current ethnicity codes are: 1 American Indian or Native Alaskan, 2 Asian or Pacific Islander, 3 Black (not Hispanic), 4 Hispanic, 5 White (not Hispanic) and 6 Middle-Eastern, Arabic.
lastname: a lastname randomly sampled from the selected database:
- If country is 'uk', the lastname is sampled from a lastname database extracted from ONS.
- If country is 'us' and race_dependency is TRUE, the lastname is sampled according to the individual's ethnicity. The us lastname database was extracted from randomNamesData.
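
The Modulus 11 rule mentioned for nhsid can be sketched as follows. This is a minimal illustration that assumes the standard NHS number check-digit scheme (weights 10 down to 2 on the first nine digits); the helper names are hypothetical and this is not necessarily how the package generates its ids.

```r
# Check digit for the first nine digits of an id (Modulus 11).
# Returns NA when the rule yields 10, which marks the prefix as unusable.
mod11_check_digit <- function(first_nine) {
  stopifnot(length(first_nine) == 9, all(first_nine %in% 0:9))
  weights <- 10:2
  check <- 11 - (sum(first_nine * weights) %% 11)
  if (check == 11) return(0L)
  if (check == 10) return(NA_integer_)
  as.integer(check)
}

# Draw random nine-digit prefixes until one has a valid check digit.
# Hypothetical helper, not the package's own generator.
random_mod11_id <- function() {
  repeat {
    first_nine <- sample(0:9, 9, replace = TRUE)
    check <- mod11_check_digit(first_nine)
    if (!is.na(check)) {
      return(paste0(c(first_nine, check), collapse = ""))
    }
  }
}

# Validate a 10-character id string against the same rule.
is_valid_mod11_id <- function(id) {
  if (!grepl("^[0-9]{10}$", id)) return(FALSE)
  digits <- as.integer(strsplit(id, "")[[1]])
  check <- mod11_check_digit(digits[1:9])
  !is.na(check) && check == digits[10]
}
```

Under these assumptions, is_valid_mod11_id(random_mod11_id()) should always be TRUE, and changing any single digit of a valid id makes the check fail.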

add_variable(
  dataset,
  type,
  country = "uk",
  start_date = "1900-01-01",
  end_date = "2020-01-01",
  age_dependency = FALSE,
  gender_dependency = FALSE,
  race_dependency = FALSE
)

`dataset`	A data frame of the dataset.
`type`	A string of the type of variable we want to add: 'nhsid', 'dob', 'address', 'firstname' or 'lastname'.
`country`	A string variable with a default of 'uk'. It can be either 'uk' or 'us'.
`start_date`	A Date variable with a default of '1900-01-01'.
`end_date`	A Date variable with a default of '2020-01-01'.
`age_dependency`	A logical variable with a default of FALSE.
`gender_dependency`	A logical variable with a default of FALSE.
`race_dependency`	A logical variable with a default of FALSE.
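
To illustrate how start_date, end_date and age_dependency might interact for type 'dob', here is a rough sketch in base R. It assumes that 'age' means age in completed years on end_date, which the documentation does not state explicitly; dob_uniform, dob_from_age and shift_years are hypothetical names, not functions of the package.

```r
# Move a date back by a whole number of years (vectorised over `years`).
# A 29 February that lands in a non-leap year rolls forward to 1 March.
shift_years <- function(date, years) {
  lt <- as.POSIXlt(rep(as.Date(date), length(years)))
  lt$year <- lt$year - years
  as.Date(lt)
}

# age_dependency = FALSE: draw uniformly over the days in [start_date, end_date].
dob_uniform <- function(n, start_date = "1900-01-01", end_date = "2020-01-01") {
  start <- as.Date(start_date)
  span <- as.integer(as.Date(end_date) - start)
  start + (sample.int(span + 1L, n, replace = TRUE) - 1L)
}

# age_dependency = TRUE: pick a birthday so that each person is exactly
# `age` completed years old on end_date (an assumed interpretation).
dob_from_age <- function(age, end_date = "2020-01-01") {
  latest <- shift_years(end_date, age)            # turns `age` on end_date
  earliest <- shift_years(end_date, age + 1) + 1  # turns `age + 1` the day after
  span <- as.integer(latest - earliest)
  offsets <- vapply(span, function(s) sample.int(s + 1L, 1L) - 1L, integer(1))
  earliest + offsets
}
```

For instance, dob_from_age(c(25, 40), end_date = "2015-03-02") returns two dates whose implied ages on 2 March 2015 are 25 and 40, while dob_uniform(5) ignores age altogether.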

A data frame of the dataset with a new generated variable.

tmp1 <- add_variable(adult[1:100,], "nhsid")
tmp2 <- add_variable(adult[1:100,], "dob", end_date = "2015-03-02", age_dependency = TRUE)
tmp3 <- add_variable(adult[1:100,], "address")
tmp4 <- add_variable(adult[1:100,], "firstname", country = "uk", age_dependency = TRUE,
                     gender_dependency = TRUE)
tmp5 <- add_variable(adult[1:100,], "lastname", country = "uk")
tmp6 <- add_variable(adult[1:100,], "firstname", country = "us", gender_dependency = TRUE,
                     race_dependency = TRUE)
tmp7 <- add_variable(adult[1:100,], "lastname", country = "us", race_dependency = TRUE)