This is an example of how to go from an INDIVIDUAL file to a chronicle file. We use the demo database created by Luciana Quaranta, which is included in this R package as external data (an Excel file).
To keep things really simple, with no messy data, we convert the first sheet of that file to the R data frame individual.
library(readxl)
library(dplyr)
x <- system.file("extdata", "DemoDatabase.xlsx", mustWork = TRUE, package = "idsr")
individual <- read_excel(x, sheet = 1)
names(individual)
It turns out that the variables Start_day, Start_month, Start_year, End_day, End_month, and End_year all contain only missing values (NA), so they can safely be removed.
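A claim like this can be checked programmatically. The following is a minimal sketch on a toy frame, not on the demo database itself: the frame toy and its values are invented for illustration, and only the column names echo the text.

```r
# Toy stand-in for `individual`; the values are invented for illustration.
toy <- data.frame(
  Id_I      = c(1, 1, 2),
  Type      = c("Birth", "Death", "Birth"),
  Year      = c(1900, 1950, 1910),
  Start_day = NA,
  End_year  = NA,
  stringsAsFactors = FALSE
)

# TRUE for every column that holds nothing but NA.
all_missing <- vapply(toy, function(col) all(is.na(col)), logical(1))
names(toy)[all_missing]

# Drop those columns in one step.
toy_clean <- toy[, !all_missing, drop = FALSE]
names(toy_clean)
```

Applied to individual, the same vapply() call would list the columns that are safe to drop.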
with(individual, table(Id_D))
Obviously, we can get rid of the Id_D column as well:
individual$Start_day <- NULL
individual$Start_month <- NULL
individual$Start_year <- NULL
individual$End_day <- NULL
individual$End_month <- NULL
individual$End_year <- NULL
individual$Id_D <- NULL
##individual <- individual %>%
##    select(Id_I, Type, Value, Value_Id_C, Day, Month, Year, Date_type, Source) %>%
##    arrange(Id_I, Year, Month, Day)
knitr::kable(head(individual, 8))
This is the full information available for individual No. `r as.integer(individual$Id_I[1])`.
In principle, the chronicle frame should contain only rows with Date_type = Event. However, this is too restrictive, because for some reason events of certain types are not reported as Event. From a strict database-management perspective that is reasonable, but not in the afterworld of data (survival) analysis. Therefore I let all rows with a (correct) date be part of the chronicle frame.
The remaining rows hold time-constant information, and we keep them separate from the rest, in the frame personal: they are very easy to add on at the end. There are two exceptions: the declared Types Start_observation and End_observation need to be in the chronicle frame.
##to_chron <- (individual$Date_type == "Event") |
##    (individual$Type %in% c("Start_observation", "End_observation"))
to_chron <- !is.na(individual$Year)  # keep every row that carries a date
personal <- individual[!to_chron, ]
chronicle <- individual[to_chron, ]
## Date_type is kept for now; it is dropped later with the other unneeded columns.
So, the chronicle frame looks like this at the start:
knitr::kable(chronicle)
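To see how the strict rule (Date_type = Event, plus the two observation Types) differs from the lenient rule (any row with a date), here is a small sketch on invented data. The frame toy, its rows, and the Date_type label "Declared" used for them are assumptions made for illustration only.

```r
# Invented rows: one dated event, one dated non-event, one undated
# attribute, and one observation marker.
toy <- data.frame(
  Id_I      = c(1, 1, 1, 1),
  Type      = c("Birth", "Birth_date", "Sex", "Start_observation"),
  Date_type = c("Event", "Declared", NA, "Declared"),
  Year      = c(1900, 1900, NA, 1900),
  stringsAsFactors = FALSE
)

# Strict rule: events plus the two observation markers (%in% is NA-safe).
strict <- toy$Date_type %in% "Event" |
  toy$Type %in% c("Start_observation", "End_observation")

# Lenient rule: every row that carries a date.
lenient <- !is.na(toy$Year)

# Rows that only the lenient rule sends to the chronicle frame.
toy$Type[lenient & !strict]

# Rows that end up in the personal frame under the lenient rule.
toy$Type[!lenient]
```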
We obviously don't need the variables Value_Id_C and Source in chronicle, so we remove them. The variable Date_type can also be removed, since it plays no further role in this frame. We also convert the Year-Month-Day triple into a Date variable, after which we can remove the triple. After that the frame is sorted by Id_I and date, and in case of tied dates, Type == Start_observation comes first and Type == End_observation comes last.
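The tie-breaking part of the sort relies on the fact that order() sorts logical keys with FALSE before TRUE. A minimal sketch on invented data (the frame toy and its dates are assumptions, not taken from the demo database):

```r
# Four invented rows for one individual, deliberately out of order.
toy <- data.frame(
  Id_I  = c(1, 1, 1, 1),
  Type  = c("End_observation", "Birth", "Death", "Start_observation"),
  Year  = c(1950, 1900, 1950, 1900),
  Month = c(3, 5, 3, 5),
  Day   = c(1, 17, 1, 17),
  stringsAsFactors = FALSE
)

# Build a Date from the Year-Month-Day triple.
toy$date <- as.Date(paste(toy$Year, toy$Month, toy$Day, sep = "-"))

# FALSE sorts before TRUE, so "is not Start_observation" puts
# Start_observation first and "is End_observation" puts it last
# among rows that share Id_I and date.
ord <- order(toy$Id_I, toy$date,
             toy$Type != "Start_observation",
             toy$Type == "End_observation")
toy$Type[ord]
```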
##chronicle$Value_Id_C <- NULL
##chronicle$Source <- NULL
##chronicle$Date_type <- NULL
chronicle$date <- as.Date(paste(chronicle$Year, chronicle$Month, chronicle$Day, sep = "-"))
##chronicle$Year <- NULL
##chronicle$Month <- NULL
##chronicle$Day <- NULL
chronicle <- chronicle[, c("Id_I", "Type", "Value", "date")]
chronicle <- chronicle[order(chronicle$Id_I, chronicle$date,
                             chronicle$Type != "Start_observation",
                             chronicle$Type == "End_observation"), ]
knitr::kable(chronicle)
Very neat! However, for a specific study we need a well-defined start event and a likewise well-defined end event. We exemplify by thinking of mortality: Following individuals from birth to death. So our start event is Birth, and it must be defined (including a date) for all individuals. The end event is Death, but its date is not necessarily known to us ("right censoring"). Note the difference: We may not observe a birth in our data, but its date must be known.
For our purpose, Birth and Birth_date carry the same information in the presence of the event Start_observation, which is mandatory. So we replace all Type = Birth_date by Type = Birth and then remove duplicates:
chronicle$Type[chronicle$Type == "Birth_date"] <- "Birth"
chronicle <- chronicle[!duplicated(chronicle), ]
Next we must define two new logical variables, start_event (TRUE exactly once for each individual) and end_event (TRUE at most once for each). However, it is more practical to defer that exercise to the episodes file creation.
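Although the definition is deferred, the two conditions are easy to state in code. The following is only a sketch under the mortality example's assumptions (start event Birth, end event Death) and on invented data; it is not the way the episodes file creation does it.

```r
# Invented event rows for three individuals.
toy <- data.frame(
  Id_I = c(1, 1, 2, 3, 3, 3),
  Type = c("Birth", "Death", "Birth", "Birth", "Marriage", "Death"),
  stringsAsFactors = FALSE
)

# Flag the start and end events of the (assumed) mortality study.
toy$start_event <- toy$Type == "Birth"
toy$end_event   <- toy$Type == "Death"

# Count the flags per individual.
n_start <- tapply(toy$start_event, toy$Id_I, sum)
n_end   <- tapply(toy$end_event, toy$Id_I, sum)

# start_event must occur exactly once, end_event at most once.
all(n_start == 1)
all(n_end <= 1)
```

Individual 2 has no Death row here, which corresponds to a right-censored observation.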
Now, for each distinct Type (event) we need to define a variable: The events Birth and Death correspond to the variable alive (becomes yes at a birth and no at a death). The events Start_observation and End_observation correspond in the same way to the variable present. The event Marriage corresponds to the variable civil_status (becomes married at a marriage, and is unmarried at start).
So,
chronicle$Variable <- NA
take <- chronicle$Type %in% c("Birth", "Birth_date")
chronicle$Variable[take] <- "alive"
chronicle$Value[take] <- "yes"
##
take <- chronicle$Type %in% c("Death", "Death_date")
chronicle$Variable[take] <- "alive"
chronicle$Value[take] <- "no"
##
take <- chronicle$Type == "Start_observation"
chronicle$Variable[take] <- "present"
chronicle$Value[take] <- "yes"
take <- chronicle$Type == "End_observation"
chronicle$Variable[take] <- "present"
chronicle$Value[take] <- "no"
##
chronicle$Variable[chronicle$Type == "Occupation"] <- "occupation"
##
take <- chronicle$Type == "Marriage"
chronicle$Variable[take] <- "civil_status"
chronicle$Value[take] <- "married"
## Reorder:
chronicle <- chronicle[, c("Id_I", "Variable", "Value", "date", "Type")]
knitr::kable(chronicle)
There is one problem remaining: Duplicates. Let us look at individual Id_I = 1548468:
knitr::kable(chronicle[chronicle$Id_I == 1548468, ])
Two problems:
Two birth notifications on the same date (thank heaven). Easy to fix: Remove one of them.
Death on the same day as birth. We must keep both (of course), but we must decide which comes first. That is also easy: birth before death. It could be solved by introducing a DayFrac, meaning that we add a small number (0.02, say) to the death date, but we avoid that, at least for the moment.
chronicle <- chronicle[, c("Id_I", "Variable", "Value", "date", "Type")]
chronicle <- chronicle[!duplicated(chronicle), ]
##rows <- duplicated(chronicle[, c("Id_I", "Variable", "date")]) # NOTE:
##chronicle$date[rows] <- chronicle$date[rows] + 0.02 # Commented out!
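As an alternative to shifting the death date, the order of same-day events can be fixed with an extra sort key. This is a sketch on invented data (the individual with Id_I = 7 does not exist in the demo database), and it is not what the package does.

```r
# Invented individual: a duplicated birth row and a death on the same day.
toy <- data.frame(
  Id_I     = c(7, 7, 7),
  Variable = "alive",
  Value    = c("no", "yes", "yes"),
  date     = as.Date("1901-02-03"),
  Type     = c("Death", "Birth", "Birth"),
  stringsAsFactors = FALSE
)

# Remove exact duplicates (the second birth row).
toy <- toy[!duplicated(toy), ]

# Within tied dates, FALSE sorts before TRUE, so Death goes last
# and the date itself stays untouched.
toy <- toy[order(toy$Id_I, toy$date, toy$Type == "Death"), ]
toy$Type
```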
The personal frame is made tidy in a similar manner as the chronicle frame.
personal <- personal[, c("Id_I", "Type", "Value", "Value_Id_C")]
knitr::kable(personal)
The column Value_Id_C and the rows with Type == Birth_location are without useful information, so they are removed:
personal$Value_Id_C <- NULL
personal <- personal[personal$Type != "Birth_location", ]
knitr::kable(personal)
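Since the personal frame holds time-constant information in long form (one row per individual and Type), adding it on at the end amounts to reshaping it to one row per individual and merging on Id_I. A minimal sketch in base R follows; the frames toy_personal and toy_chron, the Type "Religion", and all values are invented for illustration.

```r
# Invented time-constant attributes in long form.
toy_personal <- data.frame(
  Id_I  = c(1, 1, 2, 2),
  Type  = c("Sex", "Religion", "Sex", "Religion"),
  Value = c("Female", "Lutheran", "Male", "Lutheran"),
  stringsAsFactors = FALSE
)

# One row per individual, one column per Type.
wide <- reshape(toy_personal, idvar = "Id_I", timevar = "Type",
                direction = "wide")
names(wide) <- sub("^Value\\.", "", names(wide))

# Invented chronicle rows to attach the attributes to.
toy_chron <- data.frame(
  Id_I     = c(1, 1, 2),
  Variable = "alive",
  Value    = c("yes", "no", "yes"),
  stringsAsFactors = FALSE
)

# Every chronicle row receives the attributes of its individual.
merged <- merge(toy_chron, wide, by = "Id_I")
merged
```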
So, the input to an "EpisodesFileCreator" should be the frames chronicle and personal. In addition, we need a description frame linking events to variables in the chronicle frame. See the episodes vignette for the continuation.
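What such a description frame might look like can be sketched from the event-to-variable mapping used above. The layout (columns Type, Variable, Value) is an assumption made here for illustration; the actual format expected by the package is described in the episodes vignette.

```r
# Hypothetical description frame: one row per event Type, giving the
# variable it changes and the value that variable takes.
description <- data.frame(
  Type     = c("Birth", "Death", "Start_observation",
               "End_observation", "Marriage"),
  Variable = c("alive", "alive", "present", "present", "civil_status"),
  Value    = c("yes", "no", "yes", "no", "married"),
  stringsAsFactors = FALSE
)

# Look up the variable and value for a sequence of events.
events <- c("Birth", "Marriage", "Death")
hit <- match(events, description$Type)
data.frame(Type = events,
           Variable = description$Variable[hit],
           Value = description$Value[hit],
           stringsAsFactors = FALSE)
```

With a table like this, the repeated take <- ... assignments could in principle be replaced by a single lookup.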
Alter, George & Kees Mandemakers, 'The Intermediate Data Structure (IDS) for Longitudinal Historical Microdata, version 4', Historical Life Course Studies 1 (2014), 1--26. http
Quaranta, Luciana, 'Using the Intermediate Data Structure (IDS) to Construct Files for Statistical Analysis', Historical Life Course Studies 2 (2015), 86--107. http
Quaranta, Luciana, 'Stata Programs for Using the Intermediate Data Structure (IDS) to Construct files for Statistical Analysis', Historical Life Course Studies 3 (2015), 1--19. http