rmarkdown::render(input = "/home/aurelius/btsync/mk/workspace/paketit/r.eusilc/vignettes/r.eusilc_tutorial.Rmd", output_file = "/home/aurelius/btsync/mk/workspace/paketit/r.eusilc/vignettes/r.eusilc_tutorial.html")
Perhaps the most compherensive documentation of EU-SILC data is by GESIS
| Cross-sectional data | Panel data | | --------------------------------------------------------------------------------- | -------------------------------------------------------------------------- | | 2013 | 2013 | | 2012 | 2012 | | 2011 | 2011 | | 2010 | 2010 | | 2009 | 2009 | | 2008 | 2008 | | 2007 | 2007 | | 2006 | 2006 | | 2005 | 2005 |
This tutorial will go through a process for merging EU-SILC raw .csv data files using r.eusilc-package and creating an analysis ready-dataset. Tutorial will explain five different scenarios for creating datasets:
All the analysis are implemented in R, but the output data can be formatted also for proprietary software, like SPSS, SAS or Stata.
This demo will use data from 2010, but in the 5th example will pool it with data from 2009.
Requirements:
=8 GB of RAM
Eu-silc datasets are delivered on a laser disc, where you have on each disc a four .csv files representing one either cross-sectional or longitudinal version of data. Raw .csv-files are named as UDB_L06D_ver 2006-2 from 01-03-2009
, that in this case would mean:
You have to specify path to each of the four original .csv files for the following parameters:
path.personal.register path.personal.data path.household.register path.household.data
Consider structuring your raw data archive into something like (this is how my data is organised for this demo):
demo_data |\ | \-eusilc_raw | |-----2010 | | |---longi_rev3 | | | |-------- UDB_l10R_ver 2010-4 from 01-03-2014.csv | | | |-------- UDB_l10P_ver 2010-4 from 01-03-2014.csv | | | |-------- UDB_l10D_ver 2010-4 from 01-03-2014.csv | | | |-------- UDB_l10H_ver 2010-4 from 01-03-2014.csv | | |---cross_rev4 | | | |-------- UDB_c10R_ver 2010-5 from 01-03-14.csv | | | |-------- UDB_c10P_ver 2010-5 from 01-03-14.csv | | | |-------- UDB_c10D_ver 2010-5 from 01-03-14.csv |____|______|____|-------- UDB_c10H_ver 2010-5 from 01-03-14.csv | |-----2011 | | |---longi_rev1 | | | |-------- UDB_l11R_ver 2011-2 from 01-03-2014.csv | | | |-------- UDB_l11P_ver 2011-2 from 01-03-2014.csv | | | |-------- UDB_l11D_ver 2011-2 from 01-03-2014.csv |____|______|____|-------- UDB_l11H_ver 2011-2 from 01-03-2014.csv | | |---cross_rev2 | | | |-------- UDB_c11R_ver 2012-3 from 01-03-14.csv | | | |-------- UDB_c11P_ver 2012-3 from 01-03-14.csv | | | |-------- UDB_c11D_ver 2012-3 from 01-03-14.csv |____|______|____|-------- UDB_c11H_ver 2012-3 from 01-03-14.csv | |-----2012 | | |---cross_rev0 | | | |-------- UDB_c12R_ver 2012-1 from 01-03-14.csv | | | |-------- UDB_c12P_ver 2012-1 from 01-03-14.csv | | | |-------- UDB_c12D_ver 2012-1 from 01-03-14.csv |____|______|____|-------- UDB_c12H_ver 2012-1 from 01-03-14.csv |\ | \-eusilc_merged | |---2010 | | |----- | | |----- | |---2011 | | |----- |____|_____|-----
For installation you need to have devtools-package installed with install.packages("devtools")
. In Windows you will need RTools to be installed before installing devtools-package.
library(devtools) install_github("muuankarski/r.eusilc") library(r.eusilc)
There is only one function merge_eusilc()
in the package. Typing ?merge_eusilc
will print you the help file.
Here we are merging the household register file with household data file. Function reads the raw -csv files in ~/data_demo/eusilc_raw/2010/cross_rev4
and writes the merged output in csv
format in ~/data_demo/eusilc_merged/2010
with filename 2010hh_merge_cross.csv
. In addition, function returns data.frame
object hh_cross_2010
in the R session.
library(r.eusilc) hh_cross_2011 <- merge_eusilc(path.household.register = "~/demo_data/eusilc_raw/2011/cross_rev2/UDB_c11D_ver 2011-3 from 01-03-14.csv", path.household.data = "~/demo_data/eusilc_raw/2011/cross_rev2/UDB_c11H_ver 2011-3 from 01-03-14.csv", output.path="~/demo_data/eusilc_merged/2011", level="household", format="csv", output.filename = "hh_cross_2011", subset.vars.per.reg="all", subset.vars.per.data="all", subset.vars.hh.reg="all", subset.vars.hh.data="all", subset.countries="all")
You can create a boxplot on total disposable household income HY020
by country DB020
with following code:
library(ggplot2) ggplot(hh_cross_2010, aes(x=factor(reorder(DB020, HY020, median, na.rm=TRUE)), y=HY020)) + geom_boxplot() + coord_cartesian(ylim=c(0,80000))
Here we are merging the personal register file with personal data file. Function reads the raw -csv files in ~/data_demo/eusilc_raw/2010/cross_rev4
and writes the merged output in csv
format in ~/data_demo/eusilc_merged/2010
with filename 2010per_merge_cross.csv
. In addition, function returns data.frame
object per_cross_2010
in the R session.
library(r.eusilc) per_cross_2011 <- merge_eusilc(path.personal.register = "~/demo_data/eusilc_raw/2011/cross_rev2/UDB_c11R_ver 2011-3 from 01-03-14.csv", path.personal.data = "~/demo_data/eusilc_raw/2011/cross_rev2/UDB_c11P_ver 2011-3 from 01-03-14.csv", output.path="~/demo_data/eusilc_merged/2011", output.filename = "per_cross_2011", format="RDS", level="personal", subset.vars.per.reg="all", subset.vars.per.data="all", subset.vars.hh.reg="all", subset.vars.hh.data="all", subset.countries="all")
You can create a boxplot on Employee cash or near cash income(net) PY010N
by sex RB090
and country RB020
with following code:
# removing NA values from sex-variable plot_data <- per_cross_2010[!is.na(per_cross_2010$RB090),] # relevelling the sex-variable plot_data$RB090[plot_data$RB090 == 1] <- "male" plot_data$RB090[plot_data$RB090 == 2] <- "female" library(ggplot2) ggplot(plot_data, aes(x=factor(RB090), y=PY010N)) + geom_boxplot() + coord_cartesian(ylim=c(0,50000)) + facet_wrap(~RB020)
Here we are merging the household register with household data files and then merging the merged household register and data files into it. Function reads the raw -csv files in ~/data_demo/eusilc_raw/2010/cross_rev4
and writes the merged output in csv
format in ~/data_demo/eusilc_merged/2010
with filename 2010both_merge_cross.csv
. In addition, function returns data.frame
object both_cross_2010
in the R session.
library(r.eusilc) both_cross_2011 <- merge_eusilc(path.personal.register = "~/demo_data/eusilc_raw/2011/cross_rev2/UDB_c11R_ver 2011-3 from 01-03-14.csv", path.personal.data = "~/demo_data/eusilc_raw/2011/cross_rev2/UDB_c11P_ver 2011-3 from 01-03-14.csv", path.household.register = "~/demo_data/eusilc_raw/2011/cross_rev2/UDB_c11D_ver 2011-3 from 01-03-14.csv", path.household.data = "~/demo_data/eusilc_raw/2011/cross_rev2/UDB_c11H_ver 2011-3 from 01-03-14.csv", output.path="~/demo_data/eusilc_merged/2011", output.filename = "both_cross_2011", format="RDS", level="both", subset.vars.per.reg="all", subset.vars.per.data="all", subset.vars.hh.reg="all", subset.vars.hh.data="all", subset.countries="all")
For this data we shall create a scatterplot where we have household level total disposable household income HY020
on the x-axis and personal level Employee cash or near cash income(net) PY010N
on the y-axis.
library(ggplot2) ggplot(both_cross_2010, aes(x=HY020, y=PY010N)) + geom_point(alpha=.1, shape=1) + geom_smooth(method=lm) + coord_cartesian(xlim=c(-5000,100000), ylim=c(-5000,50000)) + facet_wrap(~RB020)
Here we are merging the personal register with personal data from longitudinal data. Function reads the raw -csv files in ~/data_demo/eusilc_raw/2010/longi_rev2
and writes the merged output in csv
format in ~/data_demo/eusilc_merged/2010
with filename 2010per_merge_longi.csv
. In addition, function returns data.frame
object per_longi_2010
in the R session.
library(r.eusilc) both_longi_2011 <- merge_eusilc(path.personal.register = "~/demo_data/eusilc_raw/2011/longi_rev1/UDB_l11R_ver 2011-2 from 01-03-2014.csv", path.personal.data = "~/demo_data/eusilc_raw/2011/longi_rev1/UDB_l11P_ver 2011-2 from 01-03-2014.csv", path.household.register = "~/demo_data/eusilc_raw/2011/longi_rev1/UDB_l11D_ver 2011-2 from 01-03-2014.csv", path.household.data = "~/demo_data/eusilc_raw/2011/longi_rev1/UDB_l11H_ver 2011-2 from 01-03-2014.csv", output.path="~/demo_data/eusilc_merged/2011", level="both", output.filename = "both_longi_2011", format="RDS", subset.vars.per.reg="all", subset.vars.per.data="all", subset.vars.hh.reg="all", subset.vars.hh.data="all", subset.countries="all")
# lets take sample of 500 000 plot_data <- per_longi_2010[sample(1:nrow(per_longi_2010), 500000, replace=FALSE),] library(ggplot2) ggplot(plot_data, aes(x=RB010, y=PY010N, group=PER_ID)) + geom_point(alpha=.01, shape=1) + geom_line(alpha=.1) + coord_cartesian(ylim=c(-5000,80000)) + facet_wrap(~RB020)
Loading the datasets
library(r.eusilc) both_longi_2010 <- merge_eusilc(path.personal.register = "~/demo_data/eusilc_raw/2010/longi_rev3/UDB_l10R_ver 2010-4 from 01-03-2014.csv", path.personal.data = "~/demo_data/eusilc_raw/2010/longi_rev3/UDB_l10P_ver 2010-4 from 01-03-2014.csv", path.household.register = "~/demo_data/eusilc_raw/2010/longi_rev3/UDB_l10D_ver 2010-4 from 01-03-2014.csv", path.household.data = "~/demo_data/eusilc_raw/2010/longi_rev3/UDB_l10H_ver 2010-4 from 01-03-2014.csv", output.path="~/demo_data/eusilc_merged/2010", output.filename = "both_longi_2010", level="both", format="RDS", subset.vars.per.reg="all", subset.vars.per.data="all", subset.vars.hh.reg="all", subset.vars.hh.data="all", subset.countries="all") both_longi_2011 <- merge_eusilc(path.personal.register = "~/demo_data/eusilc_raw/2011/longi_rev1/UDB_l11R_ver 2011-2 from 01-03-2014.csv", path.personal.data = "~/demo_data/eusilc_raw/2011/longi_rev1/UDB_l11P_ver 2011-2 from 01-03-2014.csv", path.household.register = "~/demo_data/eusilc_raw/2011/longi_rev1/UDB_l11D_ver 2011-2 from 01-03-2014.csv", path.household.data = "~/demo_data/eusilc_raw/2011/longi_rev1/UDB_l11H_ver 2011-2 from 01-03-2014.csv", output.path="~/demo_data/eusilc_merged/2011", level="both", output.filename = "both_longi_2011", format="RDS", subset.vars.per.reg="all", subset.vars.per.data="all", subset.vars.hh.reg="all", subset.vars.hh.data="all", subset.countries="all")
vars.per.reg <- c("RB010", # Year "RB020", # Country "RB030", # Personal ID "RB080", # Year of birth "RB090", # Sex "RX010") # Age at the date of the interview ## Personal register data - 2008 oma lista (PL030 case) vars.per.data <- c("PB190", # Marital status "PB200", # Consensual union "PL031", # Self-defined current economic status "PL040") # Status in employment ## Household register vars.hh.reg <- c("DB110") # Household status ## Household data vars.hh.data <- c("HY010", # Total Household Gross Income "HY020", # total disposable household income "HY022", # Total Transposable Household Income before social transfers other than old-age and survivors benefits "HY023", # Total disposable Household income before social transfers including old-age and survivors benefits "HX040", # Household size "HX050", # Equivalent household size "HX090") # Equivalized disposable household income subset.vars <- c(vars.per.reg,vars.per.data, vars.hh.reg,vars.hh.data) nordic.countries <- c("SE","NO","FI","DK","IS") both_longi_2010 <- readRDS("~/demo_data/eusilc_merged/2010/both_longi_2010.RDS") both_longi_2011 <- readRDS("~/demo_data/eusilc_merged/2011/both_longi_2011.RDS") # subsetting suse_2010 <- both_longi_2010[both_longi_2010$RB020 %in% nordic.countries, subset.vars] suse_2011 <- both_longi_2011[both_longi_2011$RB020 %in% nordic.countries, subset.vars]
Now we have identical variables for both datasets and it is easy to combine them using rbind
.
As we have duplicate information from certain individual from certain year from different datas (person x information from year 2009 is in BOTH 2010 and 2011 londitudinal data), we have to remove them and only leave the first occasion of that information.
dat <- rbind(suse_2010, suse_2011) # dat <- dat[!duplicated(dat[c("RB010", # Year "RB020", # country "RB030")]),] # personal ID
By removing the duplicates we are left with 268522 individuals and we can plor their household income across the years by with following code
library(ggplot2) ggplot(data=dat, aes(x=RB010, y=HX090, color=RB020, group=RB030)) + geom_path(alpha=.1) + geom_point(alpha=.1, size=1)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.