Introduction

This is an example of how to go from an INDIVIDUAL file to a chronicle file. We use the example data base created by Luciana Quaranta. It is included in this R package.

Quaranta's Demo Data Base

To make it really simple, with no messy data, we use Quaranta's demo data base, which is included in this package as external data (an excel file). We convert the first sheet in the file to the R data frame individual.

x <- system.file("extdata", "DemoDatabase.xlsx", mustWork = TRUE, package = "idsr")
library(readxl)
library(dplyr)
individual <- read_excel(x, sheet = 1)
names(individual)

It turns out that the variables Start_day, Start_month, Start_year, End_day, End_month, and End_year all contain only missing values (NA), so they can safely be removed.

with(individual, table(Id_D))

Obviously, we can get rid of the Id_D column as well:

individual$Start_day <- NULL
individual$Start_month <- NULL
individual$Start_year <- NULL
individual$End_day <- NULL
individual$End_month <- NULL
individual$End_year <- NULL
individual$Id_D <- NULL
##individual <- individual %>%
  ##  select(Id_I, Type, Value, Value_Id_C, Day, Month, Year, Date_type, Source) %>%
    ##arrange(Id_I, Year, Month, Day)
knitr::kable(head(individual, 8))

This is the full information available for individual No. r as.integer(individual$Id_I[1]).

Properties of a chronicle frame

The chronicle frame should only contain variables corresponding to Date_type = Event. However, this is too restrictive, because for some reason events of certain types are not reported as Event. From a strict database-management perspective that is reasonable, but not in the afterworld of data (survival) analysis. Therefore I let all rows with a (correct) date be part of the chronicle frame.

The other variables are time-constant and we keep them separate from the rest: They are very easy to add on at the end. There are two exceptions: The Declared Types Start_observation and End_observation need to be in the chronicle frame

##to_chron <- (individual$Date_type == "Event") | 
  ##          (individual$Type %in% c("Start_observation", "End_observation"))
to_chron <- !is.na(individual$Year)
personal <- individual[!to_chron, ]
chronicle <- individual[to_chron, ]
chronicle$Date_Type <- NULL # Not needed

So, the chronicle frame looks like this at the start:

knitr::kable(chronicle)

We obvously don't need the variables Value_Id_C and Source in chronicle, so we remove them. The variable Date_type can also be removed, since it is constant (Event) in this frame. We also convert the Year-Month-Year triple into a Date variable, after wich we can remove the triple. After that the frame is sorted by Id_I,date, and in case of tied dates, Type == Start_observation is first and Type == End_observation is last.

##chronicle$Value_Id_C <- NULL
##chronicle$Source <- NULL
##chronicle$Date_type <- NULL
chronicle$date <- as.Date(paste(chronicle$Year, 
                                chronicle$Month, 
                                chronicle$Day, sep = "-"))
##chronicle$Year <- NULL
##chronicle$Month <- NULL
##chronicle$Day <- NULL
chronicle <- chronicle[, c("Id_I", "Type", "Value", "date")]
chronicle <- chronicle[order(chronicle$Id_I,
                             chronicle$date,
                             chronicle$Type != "Start_observation",
                             chronicle$Type == "End_observation"), ]
knitr::kable(chronicle)

Very neat! However, for a specific study we need a well-defined start event and a likewise well-defined end event. We exemplify by thinking of mortality: Following individuals from birth to death. So our start event is Birth, and it must be defined (including a date) for all individuals. The end event is Death, but its date is not necessarily known to us ("right censoring"). Note the difference: We may not observe a birth in our data, but its date must be known.

For our purpose, Birth and Birth_date carry the same information in the presence of the event Start_observation, which is mandatory. So we replace all Type = Birth_date by Type = Birth and then remove duplicates:

chronicle$Type[chronicle$Type == "Birth_date"] <- "Birth"
chronicle <- chronicle[!duplicated(chronicle), ]

Next we must define two new logical variables, start_event (TRUE exactly once for all individuals) and end_event (TRUE at most once for all). However, it s more practical to defer that exercise to the episodes file creation.

Now, for each distinct Type (event) we need to define a variable: The events Birth and Death corresponds to the logical variable alive (becomes TRUE at a birth and FALSE at a death). The event Marriage corresponds to the variable civil_status (becomes married at a marriage, and is unmarried at start).

So,

chronicle$Variable <- NA
take <- chronicle$Type %in% c("Birth", "Birth_date")
chronicle$Variable[take] <- "alive"
chronicle$Value[take] <- "yes"
##
take <- chronicle$Type %in% c("Death", "Death_date")
chronicle$Variable[take] <- "alive"
chronicle$Value[take] <- "no"
##

take <- chronicle$Type == "Start_observation"
chronicle$Variable[take] <- "present"
chronicle$Value[take] <- "yes"
take <- chronicle$Type == "End_observation"
chronicle$Variable[take] <- "present"
chronicle$Value[take] <- "no"
##
chronicle$Variable[chronicle$Type == "Occupation"] <- "occupation"
##
take <- chronicle$Type == "Marriage"
chronicle$Variable[take] <- "civil_status"
chronicle$Value[take] <- "married"
## Reorder:
chronicle <- chronicle[, c("Id_I", "Variable", "Value", "date", "Type")]
knitr::kable(chronicle)

There is one problem remaining: Duplicates. Let us look at individual Id_I = 1548468:

knitr::kable(chronicle[chronicle$Id_I == 1548468, ])

Two problems:

  1. Two birth notifications on the same date (thank heaven). Easy to fix: Remove one of them.

  2. Death on the same day as birth. Must keep both (of course), but we must decide which comes first. That is also easy, birth before death. Could be solved by introcucing a DayFrac, meaning that we add a small number (0.02, say)to the death date, but we avoid that, at least for the moment.

chronicle <- chronicle[, c("Id_I", "Variable", "Value", "date", "Type")]
chronicle <- chronicle[!duplicated(chronicle), ]
##rows <- duplicated(chronicle[, c("Id_I", "Variable", "date")])   # NOTE:
##chronicle$date[rows] <- chronicle$date[rows] + 0.02     # Outcommented!!

The personal frame

The personal frame is made tidy in a similar manner as the chronicle frame.

personal <- personal[, c("Id_I", "Type", "Value", "Value_Id_C")]
knitr::kable(personal)

The column Value_Id_C and the rows with Type == Birth_location are without useful information, so they are removed:

personal$Value_Id_C <- NULL
personal <- personal[personal$Type != "Birth_location", ]
knitr::kable(personal)

Conclusion

So, the input to an "EpisodesFileCreator" should be the frames chronicle and personal. In addition, we need a description frame linking events to variables in the chronicle frame. See the episodes vignette for the continuation.

References

Alter, George & Kees Mandemakers, 'The Intermediate Data Structure (IDS) for Longitudinal Historical Microdata, version 4', Historical Life Course Studies 1 (2014), 1--26. http

Quaranta, Luciana, 'Using the Intermediate Data Structure (IDS) to Construct Files for Statistical Analysis', Historical Life Course Studies 2 (2015), 86--107. http

Quaranta, Luciana, 'Stata Programs for Using the Intermediate Data Structure (IDS) to Construct files for Statistical Analysis', Historical Life Course Studies 3 (2015), 1--19. http



goranbrostrom/idsr documentation built on May 17, 2019, 7:59 a.m.