rpin: Generate random (or anonymise existing pins to) non-personal...


Description

rpin is a generic function to generate non-personal pins for testing and educational purposes. pin_anonymise is a wrapper to anonymise/de-personalise existing pins.

Usage

rpin(x, ...)

## S3 method for class 'integer'
rpin(x, l_birth = "1900-01-01", u_birth = Sys.Date(),
  unique = TRUE, male_prob = 0.5, ...)

## S3 method for class 'pin'
rpin(x, l_birth, u_birth, unique = TRUE,
  male_prob = mean(pin_sex(x) == "Male"), keep_rel = TRUE, ...)

Arguments

x

is either an integer (numeric vector of length one) specifying the length of the generated pin vector, or a pin vector to be used for generating similar but anonymised pins (see section "Anonymise").

...

additional arguments to be passed to or from methods.

l_birth,u_birth

are dates (or objects that can be coerced to dates) defining the time interval from which birth dates are drawn. If x is an integer, the defaults are "1900-01-01" and Sys.Date(). If x is a pin vector, the defaults are matched to the birth years in x.

unique

Should all generated pins be unique, i.e. should the sampling be done without replacement? TRUE by default. Repeated pins in x (if x is of class pin) are nevertheless kept as repeats if keep_rel = TRUE.

male_prob

probability that a generated pin refers to a man (female_prob = 1 - male_prob). If x is an integer, male_prob is 0.5 by default. If x is a pin vector, male_prob is estimated as the observed probability from x.

keep_rel

Should a possible relationship between pins in x be kept in the output, i.e. if the same pin is repeated in x, should the pins at the same positions in the output also be repeated? This is TRUE by default and works independently of unique.
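The behaviour described for keep_rel can be illustrated with a minimal base R sketch. This is not the package's implementation: keep_relations and toy_generator are hypothetical names introduced only for this illustration, and the values are plain strings rather than pins.

```r
# Hypothetical helper (not part of sweidnumbr): replace each distinct value
# in x with one newly generated value, reusing it wherever x repeats.
keep_relations <- function(x, generate) {
  ux <- unique(x)                      # distinct inputs, in order of first appearance
  replacement <- generate(length(ux))  # one new value per distinct input
  replacement[match(x, ux)]            # map every position back to its replacement
}

# Hypothetical stand-in for a pin generator
toy_generator <- function(n) sprintf("fake%02d", seq_len(n))

x <- c("a", "b", "a", "c", "b")
out <- keep_relations(x, toy_generator)
out
```

Positions 1 and 3 of out hold the same value, as do positions 2 and 5, mirroring the repeats in x.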

Details

A pin whose birth number (digits 9-11 of a 12-digit pin) falls in the interval [880, 999] is a valid personal identification number but is never assigned to an actual person. Numbers of this form can therefore be used for testing and educational purposes without the risk of interfering with personal (and possibly sensitive) data.
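As a rough illustration of the rule above, the birth number can be extracted from a 12-digit string with base R. The helper is_non_personal is hypothetical (it is not a sweidnumbr function), the example strings are made up, and the check digit is not validated here.

```r
# Hypothetical helper (not part of sweidnumbr): TRUE when digits 9-11 of a
# 12-digit pin string fall in the non-personal interval stated above.
is_non_personal <- function(pin12, lower = 880L, upper = 999L) {
  pin12 <- as.character(pin12)
  stopifnot(all(grepl("^[0-9]{12}$", pin12)))  # exactly 12 digits expected
  birth_no <- as.integer(substr(pin12, 9, 11)) # digits 9-11
  birth_no >= lower & birth_no <= upper
}

# Made-up strings; the check digit is not verified
res <- is_non_personal(c("199001019802", "199001012384"))
res
```

The first string has birth number 980 and is flagged; the second has 238 and is not.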

Value

rpin returns a vector of class pin with length x if x is an integer or with length length(x) if x is itself a pin object. The object will also have an extra attribute "non_personal" set to TRUE to indicate that the generated pins are non-personal ("fake").

Simulation

The simulation is done by the following steps:

Anonymise

Given that x is an object of class pin, the output of rpin is a pin vector that tries to mimic x in all aspects except identifying real persons. The empirical age (birthday) distribution from x is estimated by logspline, and a random sample of length(x) is drawn from that distribution. The last four digits are generated as in section Simulation but with the sex distribution estimated from x. The internal relationships between elements in x are maintained as described for argument keep_rel.
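The density estimation step can be sketched with the logspline package directly. This only illustrates the idea of fitting a logspline density to birth dates and sampling from it; how rpin itself calls logspline (bounds, rounding, etc.) is not shown in this page, and the input dates below are simulated for the purpose of the example.

```r
library(logspline)

set.seed(1)
# Simulated "observed" birth dates between 1974-01-01 and 1994-01-01
birth <- as.Date("1974-01-01") + sample(0:7304, 200, replace = TRUE)

# Fit a logspline density to the dates expressed as days since 1970-01-01
fit <- logspline(as.numeric(birth))

# Draw as many new values as there were inputs and convert back to dates
sim <- rlogspline(length(birth), fit)
sim_birth <- as.Date(round(sim), origin = "1970-01-01")

summary(sim_birth)
```

Since no bounds are passed to logspline in this sketch, some simulated dates may fall outside the observed range, which corresponds to the situation that l_birth and u_birth are meant to control.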

Examples

library(sweidnumbr)
set.seed(12345)
## Generate some fake pins
p <- rpin(100)

## Most pin-functions can be applied to p
is.pin(p) # TRUE
pin_sex(p) # With mean(pin_sex(p) == "Male") -> male_prob when x -> Inf
table(pin_birthplace(p)) # non-informative
pin_age(p)
pin_to_date(p)

## If we want to simulate university students in a med course in Sweden,
## we might try
p_ms <- rpin(100, l_birth = "1974-01-01", u_birth = "1994-01-01", male_prob = .25)
table(pin_sex(p_ms))
summary(pin_age(p_ms))

## Now, assume for a moment that p_ms is actually real data that we want to anonymise.
## The easy way:
p_ms2 <- rpin(p_ms)
## We then have new (fake) numbers but with the same age and sex distribution.
table(pin_sex(p_ms2))
summary(pin_age(p_ms2))

## The age distribution estimated from p_ms could of course also generate
## birth dates outside the observed birth date interval of p_ms. By default,
## no pins are generated with a birth year before that of the oldest pin in the input
## (and vice versa for the upper limit). But we could also choose not to tolerate any
## pins "older" than the "oldest" pin from the input
p_ms3 <- rpin(p_ms, l_birth = min(y <- pin_to_date(p_ms)), u_birth = max(y))
min(pin_to_date(p_ms3)) >= min(pin_to_date(p_ms))
max(pin_to_date(p_ms3)) <= max(pin_to_date(p_ms))

## We can modify the sex distribution while keeping the age distribution
x <- rpin(p_ms, male_prob = .01)
x <- pin_sex(x)
table(x)

eribul/rinca documentation built on May 16, 2019, 8:26 a.m.