Modern data users often face the need to combine very different data sources, which can easily become confusing. The R package blueprint offers the opportunity to select, combine and transform data in an easy and intuitive way by using a metadata file (called a "blueprint") that can be edited with a spreadsheet application or text editor of your choice, in formats such as OpenOffice Calc, Excel, CSV or HTML. To merge variables that are split over dozens of different files, you enter only the absolutely necessary information into a text-based meta-file and let the package do the rest. These files define which variables are imported and how they are to be (left-)joined with other files. Blueprint is also well suited to joining longitudinal data and to unifying variable names that changed across different waves into one new variable. By relying on the versatile import function for file import, it can handle very different kinds of data formats (for the imported data as well as for the blueprint files). Every variable can be transformed by a function or a chain of functions (sequentially applied functions) that transform the variable from the corresponding file before it is included in the final data frame. Repetitive recoding of similar variables becomes very easy. By relying on meta-files, data processing becomes faster, clearer and more easily expandable. The functions that merge and transform data in the background rely mainly on functions from the brilliant and comparably fast packages dplyr and data.table.
In blueprint the merge process is specified by a so-called blueprint file. This file specifies rules that control from where and how data is imported. It can be edited with a standard spreadsheet application like Excel or OpenOffice Calc, or with other editors. A convenient way to create a new template for such a blueprint file is to enter:
open_blue('my.blueprint.name.csv')
Alternatively, we could change the suffix of the filename to xlsx to create an Excel file:
open_blue('my.blueprint.name.xlsx')
The package will try to create a template file in the corresponding format at the given path and will open it afterwards. If the file already exists, it will just be opened. Afterwards this file can be edited. Note that you will always have to save the file to bring changes into effect.
So what should be entered into the template? Table 0 shows the basic structure of a blueprint file.
| newvar | var1 | file1 | link1 | fun1 | var2 | file2 | link2 | fun2 | ... |
|---|---|---|---|---|---|---|---|---|---|
| new name of variable 1 | {old name of var1 in first data file} | | | | {old name of var1 in second data file} | | | | |
| new name of variable 2 | {old name of var2 in first data file} | | | | {old name of var2 in second data file} | | | | |
Table: Table 0: The basic structure of a blueprint file
Every row represents a unique variable of the final data.frame. The name of this variable can be chosen according to the needs of the analysis and is specified in the first column (by entering this name into the column newvar). A minimal raw-CSV sketch follows below.
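In its raw CSV form, a minimal blueprint could therefore look like this (a sketch following the pattern of Table 0; the file name my.data.sav is hypothetical):

newvar,var1,file1,link1,fun1
age,AGE,my.data.sav,,
gender,ST04Q01,my.data.sav,,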
Let's test this structure by using synthesized data based on OECD's PISA (Programme for International Student Assessment).
First of all we have to generate a directory of example files.
blue_example()
The file blueprint_example/INT_STU12_DEC03_synth.sav is a synthesized version of original PISA data stored in an SPSS file (as indicated by the suffix .sav). It contains the variables StIDStd (a unique id of the student), CNT (the country where the student was tested), ST04Q01 (gender), AGE, W_FSTUWT (weight) and PV1MATH (a test score in math).
In the first example we will just load, select and rename these variables from the SPSS file by using example_blueprint1 (which exists in two versions).
To open the example in an editor you can use
open_blue('blueprint_example/example_blueprint1.csv')
or
open_blue('blueprint_example/example_blueprint1.xlsx')
The content of this blueprint file is shown in Table 1. It states that 6 variables are to be imported from the file blueprint_example/INT_STU12_DEC03_synth.sav. The variable ST04Q01, e.g., will be renamed to gender.
| newvar | var1 | file1 | link1 | fun1 |
|---|---|---|---|---|
| studid | StIDStd | blueprint_example/INT_STU12_DEC03_synth.sav | | |
| country | CNT | blueprint_example/INT_STU12_DEC03_synth.sav | | |
| gender | ST04Q01 | blueprint_example/INT_STU12_DEC03_synth.sav | | |
| age | AGE | blueprint_example/INT_STU12_DEC03_synth.sav | | |
| final.weight | W_FSTUWT | blueprint_example/INT_STU12_DEC03_synth.sav | | |
| math.score1 | PV1MATH | blueprint_example/INT_STU12_DEC03_synth.sav | | |
Table: Table 1: example_blueprint1
The following code executes the rules in this blueprint and stores the data in a new object (example1).
example1 <- blue('blueprint_example/example_blueprint1.xlsx')
example1
# A tibble: 2,000 × 7
   studid country gender   age final.weight math.score1  wave
   <chr>  <chr>    <dbl> <dbl>        <dbl>       <dbl> <dbl>
 1 06880  GBR          2 15.67     142.6006    544.7188     1
 2 02864  DEU          1 15.42     125.2389    522.1296     1
 3 01916  DEU          2 15.33       8.8975    607.0338     1
 4 04419  GBR          2 15.25      18.5177    553.2871     1
 5 01823  DEU          1 15.92     158.2145    466.5135     1
 6 01650  DEU          1 16.17     172.7704    316.4901     1
 7 06297  GBR          1 15.33     179.7536    373.8199     1
 8 04957  GBR          1 16.00     207.3544    455.2189     1
 9 00393  GBR          2 15.50       9.1159    612.4864     1
10 04696  DEU          1 15.92       4.1366    576.1100     1
# ... with 1,990 more rows
Note again that the structure of blueprint files might be confusing when used for the first time. For ease of editing, it is transposed compared to the resulting data.frame: rows in blueprints represent columns (variables) in the data, and, as will be seen, column blocks in blueprints represent chunks (which are merged as additional rows).
In this example we want to add school characteristics to the data of students. The file blueprint_example/INT_SCQ12_DEC03_synth.sav contains these variables.
# open_blue('blueprint_example/example_blueprint2.csv')
open_blue('blueprint_example/example_blueprint2.xlsx')
| newvar | var1 | file1 | link1 | fun1 |
|---|---|---|---|---|
| studid | StIDStd | blueprint_example/INT_STU12_DEC03_synth.sav | | |
| schid | SCHOOLID | blueprint_example/INT_STU12_DEC03_synth.sav | | |
| country | CNT | blueprint_example/INT_STU12_DEC03_synth.sav | | |
| gender | ST04Q01 | blueprint_example/INT_STU12_DEC03_synth.sav | | |
| age | AGE | blueprint_example/INT_STU12_DEC03_synth.sav | | |
| final.weight | W_FSTUWT | blueprint_example/INT_STU12_DEC03_synth.sav | | |
| math.score | PV1MATH | blueprint_example/INT_STU12_DEC03_synth.sav | | |
| | | | | |
| school.size | SCHSIZE | blueprint_example/INT_SCQ12_DEC03_synth.sav | country=CNT,schid=SCHOOLID | |
| class.size | CLSIZE | blueprint_example/INT_SCQ12_DEC03_synth.sav | country=CNT,schid=SCHOOLID | |
Table: Table 2: example_blueprint2
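Executing this blueprint works analogously to the first example (the object name example2 below is chosen just for illustration):

example2 <- blue('blueprint_example/example_blueprint2.xlsx')
example2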
# A tibble: 2,000 × 10
   studid schid   country gender   age final.weight math.score school.size class.size  wave
   <chr>  <chr>   <chr>    <dbl> <dbl>        <dbl>      <dbl>       <dbl>      <dbl> <dbl>
 1 06880  0000275 GBR          2 15.67     142.6006   544.7188        1644         28     1
 2 02864  0000131 DEU          1 15.42     125.2389   522.1296        1381         18     1
 3 01916  0000087 DEU          2 15.33       8.8975   607.0338         400         18     1
 4 04419  0000179 GBR          2 15.25      18.5177   553.2871         346         18     1
 5 01823  0000083 DEU          1 15.92     158.2145   466.5135        1083         28     1
 6 01650  0000075 DEU          1 16.17     172.7704   316.4901         597         28     1
 7 06297  0000253 GBR          1 15.33     179.7536   373.8199        1270         28     1
 8 04957  0000200 GBR          1 16.00     207.3544   455.2189        1272         28     1
 9 00393  0000016 GBR          2 15.50       9.1159   612.4864        1242         23     1
10 04696  0000217 DEU          1 15.92       4.1366   576.1100         648         28     1
# ... with 1,990 more rows
Adding the additional variables *household income* and *property* from the separate file household.csv typically means left-joining them, i.e. adding columns for additional variables and keeping only the values of household.csv for those units that are also in individual.csv. For this we have to define conditions specifying which rows in the files individual.csv and household.csv relate to the same unit. This is achieved by filling in the link column for the files to add from. Relying on the syntax of dplyr::left_join (while leaving out the quotation marks), the condition expresses which variables must be equal: you give the new variable name (specified in newvar) and the name of the matching old variable in the original data file that is to be added: name.in.newvar=old.variable.name.in.data.file.to.merge

Assuming e.g. that in the file household.csv every household has an id number called idno, we would specify the link condition household.id=idno. Several link conditions can be combined by comma. The first part refers to the new variable name specified in newvar, the second to the original name in the data file to merge. If only single variable names are given, it is assumed that the variable names are the same (the name in newvar and the name of the var in the file to add are identical).

If the variables used inside the link condition are not in the var1... columns, they will be used for merging but dropped afterwards.
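As a minimal sketch of the underlying semantics (the data frames individual and household below are hypothetical miniatures), the link condition household.id=idno corresponds to the following dplyr call:

library(dplyr)

# Hypothetical miniature data, for illustration only
individual <- data.frame(household.id = c(1, 2, 3), age = c(34, 51, 29))
household  <- data.frame(idno = c(1, 2, 4), income = c(42000, 58000, 39000))

# Equivalent of the link condition household.id=idno:
left_join(individual, household, by = c("household.id" = "idno"))
# All units of 'individual' are kept; 'income' is NA where no household matches.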
When a blueprint has been modified and saved in an external application, the merged data can be imported into R by simply stating:
my.df <- blue('my.blueprint.name.csv')
my.df is a new data.frame with the selected variables from individual.csv plus the household variables that match the link condition.
Repeated surveys are typically delivered as so-called waves. Blueprint makes it easy to merge different waves into a long format. Independent of whether the same units are measured repeatedly or the same variables are measured repeatedly for different units, the purpose is to combine data *rowwise*. Data providers (like e.g. OECD PISA) sometimes ship wave files with changed variable names for the same items. Harmonizing these variables is very easy with blueprint. The newvar column contains the new variable names. Additional variable columns starting with var (e.g. var1, var2, ...) contain the variable names from the files specific to each wave. We can join two waves by entering additional columns with the identifiers var, file, link and fun. Or we can initialize a new blueprint with the chunks argument using:
open_blue('my.blueprint.name2.xls', chunks = 2)
Figure XXX shows a more advanced blueprint file reflecting this structure.
Entering the appropriate data will rename, transform and join the data automatically. In short: chunks are specified *columnwise*, with a block of 4 columns (called a "chunk") per data source, containing the original variable name, file path, link conditions and transformation functions. Columns that relate to units in the same chunk are specified *rowwise* by giving different names and setting the link condition. A sketch of such a blueprint follows below.
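As a rough sketch of a two-chunk blueprint (the file names w1.sav and w2.sav are hypothetical; ST03Q01/ST04Q01 stand for an item whose name changed between waves):

| newvar | var1 | file1 | link1 | fun1 | var2 | file2 | link2 | fun2 |
|---|---|---|---|---|---|---|---|---|
| studid | StIDStd | w1.sav | | | StIDStd | w2.sav | | |
| gender | ST03Q01 | w1.sav | | | ST04Q01 | w2.sav | | |

Here both waves are stacked rowwise, and the renamed item is harmonized into the single variable gender.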
So far the column fun1 has been empty. This column offers a convenient way to transform variables (e.g. for recoding categories). The specified functions are executed with the original variable, which will automatically be replaced by the result of the function. Let's say we want to recode gender from 0 to 'Male' and from 1 to 'Female'. You could do this by using recode from the dplyr package as shown in figure XXX (leaving out the first argument x (the variable name), into which the original variable will be inserted automatically). The result of this blueprint will be a data.frame with gender containing the values 'Male' and 'Female'. Note that a fun entry can be used to execute a whole chain of transformations (separated by '%>%'). Have a look at the documentation on pipes provided by magrittr (?magrittr::`%>%`) on how to use pipes. Also note that e.g. recoding different variables with the same coding scheme can be done by copying and pasting fields of fun1 to other rows.
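As a minimal sketch of such a fun1 entry (assuming gender is coded 0/1 as described), the cell would contain recode(`0` = 'Male', `1` = 'Female'). You can check the behaviour on a plain vector:

library(dplyr)

ST04Q01 <- c(0, 1, 1, 0)  # hypothetical raw gender codes
ST04Q01 %>% recode(`0` = 'Male', `1` = 'Female')
#> [1] "Male"   "Female" "Female" "Male"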
If you want to temporarily deactivate the execution of transformations, use the argument blue(..., fun=FALSE).
Note that from a programming point of view, the fun is executed by sending it to mutate. If you selected the var v12 and the fun ifelse(.>0,1,0), blueprint will compute it as:
mutate(newvarname = v12 %>% ifelse(. > 0, 1, 0))
To apply more complex operations that mutate different variables at once, you can specify (.) in the var column, which will send the entire data frame as input to the function in the fun column.
If your pipe chain includes several complex functions one after another, you can always make use of enclosing {...} brackets, e.g. {. -> a ; dosomething(a) -> b ; dosomethingelse(a) -> c ; combine(b,c) -> d ; d}, to keep everything working. A runnable sketch follows below.
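Here is a runnable sketch of that brace pattern, with sqrt, log and + as placeholders for dosomething, dosomethingelse and combine:

library(magrittr)

c(1, 4, 9) %>% {
  . -> a            # bind the piped input to a name
  sqrt(a) -> b      # first intermediate result
  log(a) -> d       # second intermediate result (from the same input)
  b + d -> result   # combine both
  result            # last expression is returned by the block
}
#> [1] 1.000000 3.386294 5.197225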
Blueprint by default creates a corresponding .R code file containing all operations necessary to merge the data, written by default to filepath_without_extension.blueprint.code.R. If you choose to do so, you can use, edit and share this standalone file without needing to have the blueprint package installed.
Blueprint is constructed not to be very verbose when called. Nonetheless it has a logging feature that can be activated by setting blue(..., logfile=TRUE). In this case an extended logfile is created that also contains information on the transformation process (recode table, descriptives, distribution, information about automatic type conversions, and statistics on the dimensions of the data). The parameter logfile is set to FALSE by default, in which case only standard information will be written to this file. If you set logfile to a character string (e.g. blue(..., logfile='a.file.txt')), an extended logfile will be written to this path. If you don't specify a path, the name will be derived from the name of the blueprint file. The computation of the statistics takes some time, so there is a tradeoff between time and the comfort of additional information.
----Transformation. Variable `ST03Q01` (chunk 1): recode(`2`=0L,`1`=1L,.default=NA_integer_)
==============================
old        1       2     7    8    9
           |       |     |    |    |
           v       v     v    v    v
new        1       0    NA   NA   NA
n     115030  112128  1055   15  556
------------------------------
!!! Type conversion from numeric to integer. Was this intended?
>>> Distribution after recoding
variable       n  missing  unique  Info     Sum    Mean
          227158     1626       2  0.75  115030  0.5064
It might be convenient to create new variables that are constant for every unit of the same chunk. This can be done by entering names for var1, var2, ... that are not in the original data file. To assign a fixed value to a new variable, you either use the integer specification (e.g. 2000L) or enclose character values in apostrophes ('). Note that, since Excel treats leading apostrophes specially, you will probably have to use two opening apostrophes and one closing apostrophe (''character value').

| newvar | var1 | file1 | link1 |
|---|---|---|---|
| i.id | idno | i.w1.dta | |
| survey.year | 2000L | i.w1.dta | |
| wavec | 'PISA2000' | i.w1.dta | |

Note the difference between e.g. i.id, which stems from a column in the file i.w1.dta, and survey.year or wavec, which will be the same (2000 and "PISA2000", respectively) for all units in file i.w1.dta.
Assume you have 80 weight variables named rep.weight1 to rep.weight80. You can specify these in var1 as rep.weight[1:80]. Rows containing brackets will be expanded to 80 additional rows with the specific names, resulting in the import (and, if specified, also the individual transformation) of all of these variables. Conceptually, the expansion corresponds to the sketch below.
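The bracket notation conceptually expands to the full set of variable names, as in this one-liner:

paste0('rep.weight', 1:80)
# "rep.weight1" "rep.weight2" ... "rep.weight80"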
When importing files, the functions blue() and open_blue() rely on the function import from the package rio. This package recognises the most frequently used file formats based on their corresponding suffixes. This applies to the file format of blueprint files as well as to data files. The most frequent formats are listed in Table 6. For a full list of file formats that can be loaded with blueprint, have a look at [LINK RIO].
| Suffix | Assumed file format |
|---|---|
| .Rdata | R binary file |
| .dta | Stata file |
| .sav | SPSS file |
| .csv | Comma-separated text file |
| .html | HTML file |
| .xlsx | Excel file |
| .ods | Open Spreadsheet file |
Table: Table 6: Common file formats used with blueprint.
Note that additional arguments of blue are passed on to the function import(), which can be used for additional specifications of the import process. If you e.g. want to import the specific sheet "MyData" in an Excel file (the default is to use the first sheet), it can be selected by specifying the which argument.
blue('/path/to/blueprint.file.xlsx', which='MyData')
By giving the argument export_file, e.g. blue(..., export_file='/path/to/file.csv'), the merged data.frame will be written directly to the specified file. In this case the data.frame will be returned *invisibly*: it can be used in other functions or pipes, but will not be printed and will automatically be removed from memory if not assigned to a name.
Research often involves building a data.frame that is expanded and refined as the research questions become more complicated. This means that you will likely rely on one or a few blueprint files that you build up from scratch into more complex specifications.
It might be convenient to define a blueprint file that is used as the default. Most functions of blueprint that expect a path to a blueprint file check whether a default option is set. This option can easily be set with the convenience function set_blue:
set_blue('/path/to/blueprint.file.xlsx')
info_blue(searchstr='REGEX') searches the blueprint; the search is case-insensitive by default. Complex searches can be implemented with regular expressions:

| Search | Pattern |
|---|---|
| Logical OR | "dog \| cat" |
| Logical AND | "(?=.*trust)(?=.*neighbor)" |
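For instance (a usage sketch; the patterns are taken from the table above, the spacing around the OR is layout only):

info_blue(searchstr = 'dog|cat')                    # matches entries containing dog OR cat
info_blue(searchstr = '(?=.*trust)(?=.*neighbor)')  # lookaheads: trust AND neighbor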
When the blueprint has not changed, a cache file is used. It is saved in the directory blueprint_cache inside the current directory.
Entries in the newvar column that start with a ! are extracted and executed *after* the whole data.frame is merged. They can be used to easily select subsets or aggregated versions of the data. In general, ! <...ANYFUNCTION...> will be evaluated as:
final.df %>% ...ANY FUNCTION... -> final.df
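For example, a (hypothetical) newvar entry ! filter(age < 16) would be evaluated as:

final.df %>% filter(age < 16) -> final.df

keeping only students younger than 16 in the final data.frame.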
There is one exception that allows for easily combining blueprints: if a !-statement such as ! blue('/path.to/blue.printfile.xlsx') is followed by an entry in the link column, it is treated as a left_join, meaning that the data.frame produced by this second blueprint will be left-joined to the final data.frame according to the given link condition.