Modern data users often face the need to combine very different data sources, which can easily become confusing. The R package blueprint offers the opportunity to select, combine and transform data in an easy and intuitive way by using a metadata file (called a "blueprint") that can be edited with a spreadsheet application or text editor of your choice, in formats such as OpenOffice Calc, Excel, CSV or HTML. To merge variables that are split over dozens of different files, you enter only the absolutely necessary information into a text-based meta-file and let the package do the rest. These files define which variables are imported and how they are to be (left-)joined with other files. Blueprint is also well suited to joining longitudinal data and to unifying variable names that changed across different waves into one new variable. By relying on the versatile import function for file import, it can handle very different kinds of data formats (for the imported data as well as for the blueprint files). Every variable can be transformed by a function or a chain of functions (sequentially applied functions) that transform the variable from the corresponding file before it is included in the final data frame. Repetitive recoding of similar variables becomes very easy. By relying on meta-files, data processing becomes faster, clearer and more easily expandable. The functions that merge and transform data in the background rely mainly on functions from the brilliant and comparably fast packages dplyr and data.table.
In blueprint the merge process is specified by a so-called blueprint file. This file specifies rules that control from where and how data is imported. It can be edited with a standard spreadsheet application like Excel or OpenOffice Calc, or with other editors. A convenient way to create a new template for such a blueprint file is to enter:
open_blue('my.blueprint.name.csv')
Alternatively, we could change the suffix of the filename to xlsx to create an Excel file:
open_blue('my.blueprint.name.xlsx')
The package will try to create a template file in the corresponding format at the given path and will open it afterwards. If the file already exists, it will just be opened. Afterwards this file can be edited. Note that you will always have to save the file to bring changes into effect.
So what should be entered into the template? Table 0 shows the basic structure of a blueprint file.
| newvar | var1 | file1 | link1 | fun1 | var2 | file2 | link2 | fun2 | ... |
|---|---|---|---|---|---|---|---|---|---|
| new name of variable 1 | {old name of var1 in first data file} | | | | {old name of var1 in second data file} | | | | |
| new name of variable 2 | {old name of var2 in first data file} | | | | {old name of var2 in second data file} | | | | |
Table: Table 0: The basic structure of a blueprint file
Every row represents a unique variable of the final data.frame. The name of this variable can be chosen according to the needs of the analysis and is specified in the first column (by entering this name into the column newvar). A minimal raw-CSV sketch follows below.
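In its raw CSV form, a minimal blueprint could therefore look like this (a sketch following the pattern of Table 0; the file name my.data.sav is hypothetical):

newvar,var1,file1,link1,fun1
age,AGE,my.data.sav,,
gender,ST04Q01,my.data.sav,,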
Let's test this structure by using synthesized data based on OECD's PISA (Programme for International Student Assessment).
First of all we have to generate a directory of example files.
blue_example()
The file blueprint_example/INT_STU12_DEC03_synth.sav is a synthesized version of original PISA data stored in an SPSS file (as indicated by the suffix .sav). It contains the variables StIDStd (a unique id of the student), CNT (the country where the student was tested), ST04Q01 (gender), AGE, W_FSTUWT (weight) and PV1MATH (a test score in math).
In the first example we will just load, select and rename these variables from the SPSS file by using example_blueprint1 (which exists in two versions).
To open the example in an editor you can use
open_blue('blueprint_example/example_blueprint1.csv')
or
open_blue('blueprint_example/example_blueprint1.xlsx')
The content of this blueprint file is shown in Table 1. It states that 6 variables are to be imported from the file blueprint_example/INT_STU12_DEC03_synth.sav. The variable ST04Q01, e.g., will be renamed to gender.
| newvar | var1 | file1 | link1 | fun1 |
|---|---|---|---|---|
| studid | StIDStd | blueprint_example/INT_STU12_DEC03_synth.sav | | |
| country | CNT | blueprint_example/INT_STU12_DEC03_synth.sav | | |
| gender | ST04Q01 | blueprint_example/INT_STU12_DEC03_synth.sav | | |
| age | AGE | blueprint_example/INT_STU12_DEC03_synth.sav | | |
| final.weight | W_FSTUWT | blueprint_example/INT_STU12_DEC03_synth.sav | | |
| math.score1 | PV1MATH | blueprint_example/INT_STU12_DEC03_synth.sav | | |
Table: Table 1: example_blueprint1
The following code executes the rules in this blueprint and stores the data in a new object (example1).
example1 <- blue('blueprint_example/example_blueprint1.xlsx')
example1
# A tibble: 2,000 × 7
   studid country gender   age final.weight math.score1  wave
   <chr>  <chr>    <dbl> <dbl>        <dbl>       <dbl> <dbl>
 1 06880  GBR          2 15.67     142.6006    544.7188     1
 2 02864  DEU          1 15.42     125.2389    522.1296     1
 3 01916  DEU          2 15.33       8.8975    607.0338     1
 4 04419  GBR          2 15.25      18.5177    553.2871     1
 5 01823  DEU          1 15.92     158.2145    466.5135     1
 6 01650  DEU          1 16.17     172.7704    316.4901     1
 7 06297  GBR          1 15.33     179.7536    373.8199     1
 8 04957  GBR          1 16.00     207.3544    455.2189     1
 9 00393  GBR          2 15.50       9.1159    612.4864     1
10 04696  DEU          1 15.92       4.1366    576.1100     1
# ... with 1,990 more rows
Note again that the structure of blueprint files might be confusing when used for the first time. For ease of editing, it is transposed compared to the resulting data.frame: rows in blueprints represent columns (variables) in the data, and, as will be seen, column blocks in blueprints represent chunks (which are merged as additional rows).
In this example we want to add school characteristics to the data of students. The file blueprint_example/INT_SCQ12_DEC03_synth.sav contains these variables.
# open_blue('blueprint_example/example_blueprint2.csv')
open_blue('blueprint_example/example_blueprint2.xlsx')
| newvar | var1 | file1 | link1 | fun1 |
|---|---|---|---|---|
| studid | StIDStd | blueprint_example/INT_STU12_DEC03_synth.sav | | |
| schid | SCHOOLID | blueprint_example/INT_STU12_DEC03_synth.sav | | |
| country | CNT | blueprint_example/INT_STU12_DEC03_synth.sav | | |
| gender | ST04Q01 | blueprint_example/INT_STU12_DEC03_synth.sav | | |
| age | AGE | blueprint_example/INT_STU12_DEC03_synth.sav | | |
| final.weight | W_FSTUWT | blueprint_example/INT_STU12_DEC03_synth.sav | | |
| math.score | PV1MATH | blueprint_example/INT_STU12_DEC03_synth.sav | | |
| | | | | |
| school.size | SCHSIZE | blueprint_example/INT_SCQ12_DEC03_synth.sav | country=CNT,schid=SCHOOLID | |
| class.size | CLSIZE | blueprint_example/INT_SCQ12_DEC03_synth.sav | country=CNT,schid=SCHOOLID | |
Table: Table 2: example_blueprint2
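Executing this blueprint works analogously to the first example (the object name example2 below is chosen just for illustration):

example2 <- blue('blueprint_example/example_blueprint2.xlsx')
example2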
# A tibble: 2,000 × 10
   studid schid   country gender   age final.weight math.score school.size class.size  wave
   <chr>  <chr>   <chr>    <dbl> <dbl>        <dbl>      <dbl>       <dbl>      <dbl> <dbl>
 1 06880  0000275 GBR          2 15.67     142.6006   544.7188        1644         28     1
 2 02864  0000131 DEU          1 15.42     125.2389   522.1296        1381         18     1
 3 01916  0000087 DEU          2 15.33       8.8975   607.0338         400         18     1
 4 04419  0000179 GBR          2 15.25      18.5177   553.2871         346         18     1
 5 01823  0000083 DEU          1 15.92     158.2145   466.5135        1083         28     1
 6 01650  0000075 DEU          1 16.17     172.7704   316.4901         597         28     1
 7 06297  0000253 GBR          1 15.33     179.7536   373.8199        1270         28     1
 8 04957  0000200 GBR          1 16.00     207.3544   455.2189        1272         28     1
 9 00393  0000016 GBR          2 15.50       9.1159   612.4864        1242         23     1
10 04696  0000217 DEU          1 15.92       4.1366   576.1100         648         28     1
# ... with 1,990 more rows
Adding the additional variables *household income* and *property* from the separate file household.csv typically means left-joining them, i.e. adding columns for additional variables and keeping only the values of household.csv for those units that are also in individual.csv. For this we have to define conditions specifying which rows in the files individual.csv and household.csv relate to the same unit. This is achieved by filling in the link column for the files to add from. Relying on the syntax of dplyr::left_join (while leaving out the quotation marks), the condition expresses which variables must be equal: you give the new variable name (specified in newvar) and the name of the matching old variable in the original data file that is to be added: name.in.newvar=old.variable.name.in.data.file.to.merge

Assuming e.g. that in the file household.csv every household has an id number called idno, we would specify the link condition household.id=idno. Several link conditions can be combined by comma. The first part refers to the new variable name specified in newvar, the second to the original name in the data file to merge. If only single variable names are given, it is assumed that the variable names are the same (the name in newvar and the name of the var in the file to add are identical).

If the variables used inside the link condition are not in the var1... columns, they will be used for merging but dropped afterwards.
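As a minimal sketch of the underlying semantics (the data frames individual and household below are hypothetical miniatures), the link condition household.id=idno corresponds to the following dplyr call:

library(dplyr)

# Hypothetical miniature data, for illustration only
individual <- data.frame(household.id = c(1, 2, 3), age = c(34, 51, 29))
household  <- data.frame(idno = c(1, 2, 4), income = c(42000, 58000, 39000))

# Equivalent of the link condition household.id=idno:
left_join(individual, household, by = c("household.id" = "idno"))
# All units of 'individual' are kept; 'income' is NA where no household matches.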
When a blueprint has been modified and saved in an external application, the merged data can be imported into R by simply stating:
my.df <- blue('my.blueprint.name.csv')
my.df is a new data.frame with the selected variables from individual.csv plus the household variables that match the link condition.
Repeated surveys are typically delivered as so-called waves. Blueprint makes it easy to merge different waves into a long format. Independent of whether the same units are measured repeatedly or the same variables are measured repeatedly for different units, the purpose is to combine data *rowwise*. Data providers (like e.g. OECD PISA) sometimes ship wave files with changed variable names for the same items. Harmonizing these variables is very easy with blueprint. The newvar column contains the new variable names. Additional variable columns starting with var (e.g. var1, var2, ...) contain the variable names from the files specific to each wave. We can join two waves by entering additional columns with the identifiers var, file, link and fun. Or we can initialize a new blueprint with the chunks argument using:
open_blue('my.blueprint.name2.xls', chunks = 2)
Figure XXX shows a more advanced blueprint file reflecting this structure.
Entering the appropriate data will rename, transform and join the data automatically. In short: chunks are specified *columnwise*, with a block of 4 columns (called a "chunk") per data source, containing the original variable name, file path, link conditions and transformation functions. Columns that relate to units in the same chunk are specified *rowwise* by giving different names and setting the link condition. A sketch of such a blueprint follows below.
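As a rough sketch of a two-chunk blueprint (the file names w1.sav and w2.sav are hypothetical; ST03Q01/ST04Q01 stand for an item whose name changed between waves):

| newvar | var1 | file1 | link1 | fun1 | var2 | file2 | link2 | fun2 |
|---|---|---|---|---|---|---|---|---|
| studid | StIDStd | w1.sav | | | StIDStd | w2.sav | | |
| gender | ST03Q01 | w1.sav | | | ST04Q01 | w2.sav | | |

Here both waves are stacked rowwise, and the renamed item is harmonized into the single variable gender.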
So far the column fun1 has been empty. This column offers a convenient way to transform variables (e.g. for recoding categories). The specified functions are executed with the original variable, which will automatically be replaced by the result of the function. Let's say we want to recode gender from 0 to 'Male' and from 1 to 'Female'. You could do this by using recode from the dplyr package as shown in figure XXX (leaving out the first argument x (the variable name), into which the original variable will be inserted automatically). The result of this blueprint will be a data.frame with gender containing the values 'Male' and 'Female'. Note that a fun entry can be used to execute a whole chain of transformations (separated by '%>%'). Have a look at the documentation on pipes provided by magrittr (?magrittr::`%>%`) on how to use pipes. Also note that e.g. recoding different variables with the same coding scheme can be done by copying and pasting fields of fun1 to other rows.
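As a minimal sketch of such a fun1 entry (assuming gender is coded 0/1 as described), the cell would contain recode(`0` = 'Male', `1` = 'Female'). You can check the behaviour on a plain vector:

library(dplyr)

ST04Q01 <- c(0, 1, 1, 0)  # hypothetical raw gender codes
ST04Q01 %>% recode(`0` = 'Male', `1` = 'Female')
#> [1] "Male"   "Female" "Female" "Male"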
If you want to temporarily deactivate the execution of transformations, use the argument blue(..., fun=FALSE).
Note that from a programming point of view, the fun is executed by sending it to mutate. If you selected the var v12 and the fun ifelse(.>0,1,0), blueprint will compute it as:
mutate(newvarname = v12 %>% ifelse(. > 0, 1, 0))
To apply more complex operations that mutate different variables at once, you can specify (.) in the var column, which will send the entire data frame as input to the function in the fun column.
If your pipe chain includes several complex functions one after another, you can always make use of enclosing {...} brackets, e.g. {. -> a ; dosomething(a) -> b ; dosomethingelse(a) -> c ; combine(b,c) -> d ; d}, to keep everything working. A runnable sketch follows below.
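Here is a runnable sketch of that brace pattern, with sqrt, log and + as placeholders for dosomething, dosomethingelse and combine:

library(magrittr)

c(1, 4, 9) %>% {
  . -> a            # bind the piped input to a name
  sqrt(a) -> b      # first intermediate result
  log(a) -> d       # second intermediate result (from the same input)
  b + d -> result   # combine both
  result            # last expression is returned by the block
}
#> [1] 1.000000 3.386294 5.197225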
Blueprint by default creates a corresponding .R code file containing all operations necessary to merge the data, written by default to filepath_without_extension.blueprint.code.R. If you choose to do so, you can use, edit and share this standalone file without needing to have the blueprint package installed.
Blueprint is constructed not to be very verbose when called. Nonetheless it has a logging feature that can be activated by setting blue(..., logfile=TRUE). In this case an extended logfile is created that also contains information on the transformation process (recode table, descriptives, distribution, information about automatic type conversions, and statistics on the dimensions of the data). The parameter logfile is set to FALSE by default, in which case only standard information will be written to this file. If you set logfile to a character string (e.g. blue(..., logfile='a.file.txt')), an extended logfile will be written to this path. If you don't specify a path, the name will be derived from the name of the blueprint file. The computation of the statistics takes some time, so there is a tradeoff between time and the comfort of additional information.
----Transformation. Variable `ST03Q01` (chunk 1): recode(`2`=0L,`1`=1L,.default=NA_integer_)
==============================
old        1       2     7    8    9
           |       |     |    |    |
           v       v     v    v    v
new        1       0    NA   NA   NA
n     115030  112128  1055   15  556
------------------------------
!!! Type conversion from numeric to integer. Was this intended?
>>> Distribution after recoding
variable       n  missing  unique  Info     Sum    Mean
          227158     1626       2  0.75  115030  0.5064
It might be convenient to create new variables that are constant for every unit of the same chunk. This can be done by entering names for var1, var2, ... that are not in the original data file. To assign a fixed value to a new variable, you either use the integer specification (e.g. 2000L) or enclose character values in apostrophes ('). Note that, since Excel treats leading apostrophes specially, you will probably have to use two opening apostrophes and one closing apostrophe (''character value').

| newvar | var1 | file1 | link1 |
|---|---|---|---|
| i.id | idno | i.w1.dta | |
| survey.year | 2000L | i.w1.dta | |
| wavec | 'PISA2000' | i.w1.dta | |

Note the difference between e.g. i.id, which stems from a column in the file i.w1.dta, and survey.year or wavec, which will be the same (2000 and "PISA2000", respectively) for all units in file i.w1.dta.
Assume you have 80 weight variables named rep.weight1 to rep.weight80. You can specify these in var1 as rep.weight[1:80]. Rows containing brackets will be expanded to 80 additional rows with the specific names, resulting in the import (and, if specified, also the individual transformation) of all of these variables. Conceptually, the expansion corresponds to the sketch below.
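The bracket notation conceptually expands to the full set of variable names, as in this one-liner:

paste0('rep.weight', 1:80)
# "rep.weight1" "rep.weight2" ... "rep.weight80"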
When importing files, the functions blue() and open_blue() rely on the function import from the package rio. This package recognises the most frequently used file formats based on their corresponding suffixes. This applies to the file format of blueprint files as well as to data files. The most frequent formats are listed in Table 6. For a full list of file formats that can be loaded with blueprint, have a look at [LINK RIO].
| Suffix | Assumed file format |
|---|---|
| .Rdata | R binary file |
| .dta | Stata file |
| .sav | SPSS file |
| .csv | Comma-separated text file |
| .html | HTML file |
| .xlsx | Excel file |
| .ods | Open Spreadsheet file |
Table: Table 6: Common file formats used with blueprint.
Note that additional arguments of blue are passed on to the function import(), which can be used for additional specifications of the import process. If you e.g. want to import the specific sheet "MyData" in an Excel file (the default is to use the first sheet), it can be selected by specifying the which argument.
blue('/path/to/blueprint.file.xlsx', which='MyData')
By giving the argument export_file, e.g. blue(..., export_file='/path/to/file.csv'), the merged data.frame will be written directly to the specified file. In this case the data.frame will be returned *invisibly*: it can be used in other functions or pipes, but will not be printed and will automatically be removed from memory if not assigned to a name.
Research often involves building a data.frame that is expanded and refined as the research questions become more complicated. This means that you will likely rely on one or a few blueprint files that you build up from scratch into more complex specifications.
It might be convenient to define a blueprint file that is used as the default. Most functions of blueprint that expect a path to a blueprint file check whether a default option is set. This option can easily be set with the convenience function set_blue:
set_blue('/path/to/blueprint.file.xlsx')
info_blue(searchstr='REGEX') searches the blueprint; the search is case-insensitive by default. Complex searches can be implemented with regular expressions:

| Search | Pattern |
|---|---|
| Logical OR | "dog \| cat" |
| Logical AND | "(?=.*trust)(?=.*neighbor)" |
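For instance (a usage sketch; the patterns are taken from the table above, the spacing around the OR is layout only):

info_blue(searchstr = 'dog|cat')                    # matches entries containing dog OR cat
info_blue(searchstr = '(?=.*trust)(?=.*neighbor)')  # lookaheads: trust AND neighbor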
When the blueprint has not changed, a cache file is used. It is saved in the directory blueprint_cache inside the current directory.
Entries in the newvar column that start with a ! are extracted and executed *after* the whole data.frame is merged. They can be used to easily select subsets or aggregated versions of the data. In general, ! <...ANYFUNCTION...> will be evaluated as:
final.df %>% ...ANY FUNCTION... -> final.df
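For example, a (hypothetical) newvar entry ! filter(age < 16) would be evaluated as:

final.df %>% filter(age < 16) -> final.df

keeping only students younger than 16 in the final data.frame.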
There is one exception that allows for easily combining blueprints: if a !-statement such as ! blue('/path.to/blue.printfile.xlsx') is followed by an entry in the link column, it is treated as a left_join, meaning that the data.frame produced by this second blueprint will be left-joined to the final data.frame according to the given link condition.