Tired of struggling to convert open data formats into R dataframes? Look no further and welcome the opendataformat package.
With just a few lines of code, you can convert a data package specified as opendataformat into an R dataframe (read_odf()) or convert an R dataframe into a data package specified in the opendataformat (write_odf()).
But wait, there's more! Our package goes beyond parsing. Accessing metadata has never been easier. Dive into the treasure trove of information stored in your R dataframes. Explore dataset labels, descriptions, and valuable details about variables such as labels and value labels with the docu_odf() function and the getmetadata_odf() function.
This vignette will guide you through a series of examples that demonstrate the possibilities of these functions. Let's get started!
You can install the package by downloading the latest version from Zenodo. Alternatively, you can install the development version from GitHub using the install_git() function from the devtools package.
# At this point you can download and install the latest version of the
# opendataformat package from CRAN:
install.packages("opendataformat")
read_odf()
The opendataformat package provides example data that is specified as 'Open Data Format'. Data in the Open Data Format is a ZIP file containing a CSV file and an XML file. The example data contains the data CSV (data.csv) with 20 rows and 7 columns and the metadata XML (metadata.xml). The metadata file describes the dataset and its variables. If you are interested in what these files look like, you can find an example in the Open Data Format git repository: https://git.soep.de/opendata/specification/-/tree/main/external/example
To load the data, we need to specify the path to the zip-file.
Here the example data is loaded using read_odf(). Alternatively, you can set file to a zip-file in the working directory.
library(opendataformat)
path <- system.file("extdata", "data.odf.zip", package = "opendataformat")
df <- read_odf(file = path)
The output of the read_odf() function is an R tibble object that has the additional class odf. It has additional metadata stored in the attributes of the tibble and the variables/columns. These include the languages of the metadata, labels, descriptions, URLs, variable types, and value labels.
df
If you load the haven package, you see the variable labels in the active language.
library(haven)
View(df)
If you want to import a dataset with metadata in only one or several languages, you can use the languages argument. To load the example data with only English labels and descriptions, set languages = "en":
df_en <- read_odf(file = path, languages = "en")
By default, languages = "all":
df_en <- read_odf(file = path, languages = "all")
You can also pass a vector of languages:
df_en <- read_odf(file = path, languages = c("en", "de"))
You can set further arguments for the read_odf() function. With the nrows argument you define how many rows to read, excluding the header. With the skip argument you set how many rows to skip (excluding the header). With the select argument you determine which columns/variables to load, using a vector of indices or variable/column names.
df <- read_odf(file = path, languages = "all", nrows = Inf, skip = 0, select = NULL)
docu_odf()
You can explore dataset information using two methods. Firstly, you can browse metadata at the record level, providing an overview of the dataset. Alternatively, you have the option to examine specific variable details, allowing you to gain insights into selected data attributes.
By default, when using the docu_odf() function, dataset-level information is presented through the console and an HTML page. If you're using RStudio, this HTML page will be displayed within the RStudio viewer.
docu_odf(df)
To display the metadata only in the console, set the style argument to "print" (or "console"). We use this setting throughout this vignette so that the output appears on the R console. To display the metadata only in the viewer, set style = "viewer" or style = "html". The default is style = "both".
docu_odf(df, style = "print")
To obtain a comprehensive overview of all variables within the dataset, simply set the argument variables = "yes".
docu_odf(df, variables = "yes", style = "print")
If you are interested in just one specific variable, you can do this:
docu_odf(df$bap9001, style = "print")
Certain datasets offer metadata such as labels, descriptions, or value labels in multiple languages. To display the metadata in all languages supported by your dataset, set the languages argument to "all". This also shows you which languages are available for the metadata in your dataset.
docu_odf(df$bap9001, style = "print", variables = "yes", languages = "all")
If you are interested in a specific language, pass the corresponding language code to retrieve the variable labels and value labels in that language. In this example, we display the German version:
docu_odf(df$bap9001, style = "print", languages = "de")
You can apply this function to the entire dataset, allowing you to access the desired information across all variables.
docu_odf(df, style = "print", variables = "yes", languages = "de")
If you prefer another display style, you can use the dataset's metadata directly from the attributes and write your own code:
for (i in names(df)) {
  cat(
    paste0(attributes(df[[i]])$name, ": ", attributes(df[[i]])$label_de, "\n")
  )
}
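As a variation on the loop above, the same attributes can be collected into a small overview table. The following is only a sketch on a toy data frame, not on the example dataset: the column names and German labels are made up for illustration, and it relies only on the name and label_de attribute convention described in this vignette.

```r
# Toy data frame with ODF-style attributes on its columns (illustrative values)
toy <- data.frame(a = 1:3, b = c(2, 5, 9))
attr(toy$a, "name") <- "a"
attr(toy$a, "label_de") <- "Erste Variable"
attr(toy$b, "name") <- "b"
attr(toy$b, "label_de") <- "Zweite Variable"

# Collect the name and German label of every column in one table;
# columns without a German label get NA instead of raising an error
codebook <- data.frame(
  variable = names(toy),
  label_de = unname(vapply(toy, function(col) {
    lab <- attr(col, "label_de", exact = TRUE)
    if (is.null(lab)) NA_character_ else lab
  }, character(1))),
  stringsAsFactors = FALSE
)
print(codebook)
```

A data frame like this is easier to filter or export than console output, which may be convenient for larger datasets.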
You can also use the getmetadata_odf() function to retrieve labels and other metadata for the variables:
getmetadata_odf(df, type = "label")
or the value labels:
getmetadata_odf(df$bap87, type = "valuelabels")
setlanguage_odf()
Alternatively, you can set the current (active) language for a dataset object. (This function tries to copy the label language function from Stata.)
df <- setlanguage_odf(df, language = "de")
docu_odf(df$bap9001, style = "print")
To display which languages are available for the dataset metadata, display the languages attribute:
attributes(df)$languages
attr(df, "languages")
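If you want to cross-check this for a single variable, one option is to derive the language codes from the attribute names. The sketch below uses a toy vector rather than the example dataset and assumes only the label_<code> naming convention described in this vignette; label_languages() is a hypothetical helper and not part of the opendataformat package.

```r
# Toy variable carrying ODF-style label attributes (illustrative values)
x <- c(1, 2, 1)
attr(x, "label_en") <- "Current Health"
attr(x, "label_de") <- "Gesundheitszustand"

# Hypothetical helper: collect the language codes used in label_<code> attributes
label_languages <- function(v) {
  nms <- names(attributes(v))
  sub("^label_", "", grep("^label_", nms, value = TRUE))
}

langs <- label_languages(x)
print(langs)
```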
getmetadata_odf() and attributes()
Browsing through datasets' metadata provides a valuable initial overview. However, when it comes time to dive into the analysis work, questions arise regarding the storage location of the metadata and the process of accessing and utilizing it. Let's explore how and where the metadata is stored, and how we can effectively access and leverage it for analysis purposes.
An easy way to retrieve metadata is to use the getmetadata_odf() function.
attributes() and attr()
Another way is to retrieve metadata directly from the attributes. The metadata imported from the Open Data Format file into an R tibble (dataframe) is stored as R attributes. By using the base R functions attributes() and attr(), you can easily access this metadata. When you pass the entire dataset to the function, R prints all the metadata describing the dataset as a whole to your console.
attributes(df)
If you provide a specific variable to the function, only the corresponding metadata for that variable will be printed.
attributes(df$bap87)
If you're interested in a particular attribute, you can access it using the dollar sign followed by the attribute name. For instance, let's consider accessing a variable label in German (language code: de) as an example.
attributes(df$bap87)$label_de
Alternatively, you can use the attr() function to get the same result:
attr(df$bap87, "label_de")
Moreover, you have the flexibility to copy, remove, and modify these attributes to suit your needs.
# Remove the German description, then check that it is gone
attributes(df$bap87)$description_de <- NULL
attributes(df$bap87)$description_de
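Copying and modifying work the same way. The following base R sketch uses two toy vectors with made-up labels instead of the example dataset, so it illustrates only the general attribute mechanics and nothing specific to the opendataformat package.

```r
# Toy variables; values and labels are illustrative only
src <- c(1, 2, 3)
attr(src, "label_en") <- "Current Health"
attr(src, "description_en") <- "Self-rated health"

dst <- c(3, 2, 1)

# Copy all attributes from one variable to another
attributes(dst) <- attributes(src)

# Modify one attribute and remove another on the copy
attr(dst, "label_en") <- "Current Health (recoded)"
attr(dst, "description_en") <- NULL

print(names(attributes(dst)))
```

The source variable is left untouched, because R copies the attributes rather than sharing them between the two objects.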
getmetadata_odf()
You can also use the getmetadata_odf() function to retrieve labels and other metadata for the variables.
By default, the function will return the variable labels for a dataset:
getmetadata_odf(df, type = "labels")
or for a specific variable:
getmetadata_odf(df$bap96, type = "labels")
To retrieve metadata in a specific language, use the language parameter:
getmetadata_odf(df, type = "labels", language = "en")
Or set the active language of the dataset using the setlanguage_odf() function:
df <- setlanguage_odf(df, language = "en")
getmetadata_odf(df, type = "labels")
You can also use the getmetadata_odf() function to retrieve value labels for a specific variable by setting the argument type = "valuelabels":
getmetadata_odf(df$bap9001, type = "valuelabels")
The label for each value is stored in the names of the returned vector:
names(getmetadata_odf(df$bap9001, type = "valuelabels"))
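Because the labels are the names of a vector of values, looking up the label for an observed value is a matter of matching. The sketch below builds such a named vector by hand; the three entries are an illustrative subset and are not read from the example file.

```r
# Value labels in the form described above: values with the labels as names
valuelabels <- c("Does not apply" = -2, "No Answer" = -1, "Very good" = 1)

# Look up the label that belongs to each observed value
observed <- c(1, -1, 1, -2)
observed_labels <- names(valuelabels)[match(observed, valuelabels)]
print(observed_labels)
```

Values without a matching entry would come back as NA, which makes unlabelled codes easy to spot.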
You can use the getmetadata_odf() function to return descriptions, URLs, variable types and metadata languages as well:
To retrieve variable description(s), set the argument type = "description":
getmetadata_odf(df, type = "description")
To retrieve variable URL(s), set the argument type = "url":
getmetadata_odf(df, type = "url")
To retrieve variable type(s), set the argument type = "type":
getmetadata_odf(df, type = "type")
write_odf()
To save a dataset as an ODF file, we can use the write_odf() function.
Let's assume we want to save the first four columns of our dataset as a new odf-file.
We use the write_odf() function and pass the R dataframe and the file name (including its location, if needed).
write_odf(
  x = df[, 1:4],
  file = "../df_1_4.odf.zip"
)
# or:
df_14 <- df[, 1:4]
write_odf(
  x = df_14,
  file = "df_1_4.odf.zip"
)
The XML file metadata.xml and the CSV file data.csv are saved within the directory 'data_rec', as well as within the ZIP file 'data_rec.zip'. The dataset looks the same as before, just with fewer variables:
<?xml version='1.0' encoding='utf-8'?>
<codeBook xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="ddi:codebook:2_5 http://www.ddialliance.org/Specification/DDI-Codebook/2.5/XMLSchema/codebook.xsd" xmlns="ddi:codebook:2_5" version="2.5">
  <fileDscr>
    <fileTxt>
      <fileName>bap</fileName>
      <fileCont xml:lang="en">The data were collected as part of the SOEP-Core study using the questionnaire "Living in Germany - Survey 2010 on the social situation - Personal questionnaire for all. This questionnaire is addressed to the individual persons in the household. A view of the survey instrument can be found here: https://www.diw.de/documents/dokumentenarchiv/17/diw_01.c.369781.de/soepfrabo_personen_2010.pdf</fileCont>
      <fileCont xml:lang="de">Die Daten wurden im Rahmen der Studie SOEP-Core mittels des Fragebogens „Leben in Deutschland – Befragung 2010 zur sozialen Lage - Personenfragebogen für alle“ erhoben. Dieser Fragebogen richtet sich an die einzelnen Personen im Haushalt. Eine Ansicht des Erhebungsinstrumentes finden Sie hier: https://www.diw.de/documents/dokumentenarchiv/17/diw_01.c.369781.de/soepfrabo_personen_2010.pdf</fileCont>
      <fileCitation>
        <titlStmt>
          <titl xml:lang="en">Data from individual questionnaires 2010</titl>
          <titl xml:lang="de">Daten vom Personenfragebogen 2010</titl>
        </titlStmt>
      </fileCitation>
    </fileTxt>
    <notes>
      <ExtLink URI="https://paneldata.org/soep-core/data/bap" />
    </notes>
  </fileDscr>
  <dataDscr>
    <var name="bap87">
      <labl xml:lang="en">Current Health</labl>
      <labl xml:lang="de">Gesundheitszustand gegenwärtig</labl>
      <txt xml:lang="en">Question: How would you describe your current health?</txt>
      <txt xml:lang="de">Frage: Wie würden Sie Ihren gegenwärtigen Gesundheitszustand beschreiben?</txt>
      <notes>
        <ExtLink URI="https://paneldata.org/soep-core/data/bap/bap87" />
      </notes>
      <varFormat type="numeric" />
      <catgry>
        <catValu>-2</catValu>
        <labl xml:lang="en">Does not apply</labl>
        <labl xml:lang="de">trifft nicht zu</labl>
      </catgry>
      <catgry>
        <catValu>-1</catValu>
        <labl xml:lang="en">No Answer</labl>
        <labl xml:lang="de">keine Angabe</labl>
      </catgry>
      <catgry>
        <catValu>1</catValu>
        <labl xml:lang="en">Very good</labl>
        <labl xml:lang="de">Sehr gut</labl>
        ...
The data.csv file now includes just four columns:
"bap87","bap9201","bap9001","bap9002"
4,-2,1,-1
3,5,-2,1
,-1,-1,2
1,9,-2,2
-1,4,2,3
3,4,-1,4
1,9,2,-1
...
If you wish to export only the metadata for documentation or archiving purposes, set the argument export_data = FALSE. The resulting directory or zip file then contains only the metadata XML file and not the data CSV file.
write_odf( x = df, file = "../df_metadata.odf.zip", export_data = FALSE )
If you wish to export the dataset with the metadata in only one or some languages, set the languages argument, for example languages = "en":
write_odf(x = df, file = "../df_en.odf.zip", languages = "en")
By default, languages is set to languages = "all". You can also pass a vector of languages to be exported:
write_odf(x = df, file = "../df_en_de.odf.zip", languages = c("en", "de"))
as_odf_tbl()
To convert a data frame to an ODF tibble in R, the opendataformat package provides the as_odf_tbl() function. It transforms a dataframe (or any subclass) object into an Open Data Format tibble (an object of the odf_tbl class). The metadata in the data frame has to be stored in the attributes according to the odf_tbl framework:
Regarding the dataset metadata, the dataset name is stored in the name attribute, and a URL can be stored in the url attribute. The multilingual labels and descriptions are stored in the label_tag and description_tag attributes, where tag is the respective language tag. For example, the dataset label in English is stored in the label_en attribute and the English description is stored in the description_en attribute.
Regarding the variable metadata, the variable name is stored in the name attribute, a URL can be stored in the url attribute of each column, and the variable type (numeric or character) can be stored in the type attribute. The multilingual labels, descriptions and value labels are stored in the label_tag, description_tag, and labels_tag attributes, where tag is the respective language tag. For example, the variable label in English is stored in the label_en attribute and the English description is stored in the description_en attribute. The value labels in English (a numeric vector of the labelled values, with the value labels as its names) are stored in the labels_en attribute of the column.
# Create a data frame with four variables and 5 rows
exampledata <- data.frame(
  id = 1:5,
  name = c("Klaus", "Anna", "Rebecca", "Kevin", "Janina"),
  age = c(55, 40, 19, 25, 60),
  diagnosis = c(1, 3, 3, 2, 1)
)

# Add metadata for the dataset according to the ODF tibble framework
attr(exampledata, "name") <- "patientdata"
attr(exampledata, "label_en") <- "Patient Data"
attr(exampledata, "description_en") <- "Patient database of the practice Dr. Sommer"
attr(exampledata, "url") <- "www.example.url.en"

# Add metadata for the id variable with label and description
attr(exampledata$id, "name") <- "id"
attr(exampledata$id, "label_en") <- "Patient ID"
attr(exampledata$id, "description_en") <- "Practice Patient ID"

# Add metadata for the diagnosis variable with label, description and value labels
attr(exampledata$diagnosis, "name") <- "diagnosis"
attr(exampledata$diagnosis, "label_en") <- "Diagnosis"
attr(exampledata$diagnosis, "description_en") <- "Diagnosis patient last visit"
valuelabels_diagnosis <- 1:4
names(valuelabels_diagnosis) <- c("Covid", "Influenza", "Common cold", "Tonsillitis")
attr(exampledata$diagnosis, "labels_en") <- valuelabels_diagnosis

# Use as_odf_tbl() to transform the data frame into an ODF tibble ('odf_tbl'-class object)
example_odf <- as_odf_tbl(exampledata)

# Display metadata of the diagnosis variable
docu_odf(example_odf$diagnosis, style = "print")
Now let's see how we can use the metadata to better understand the data and make more informative plots.
table(df$bap87, useNA = "ifany")
As expected, the frequency table displays the count of each variable value. Now let's make the frequency table more readable by using the value labels associated with the variable. As explained in the preceding section, you can access the value labels with the base R function attributes(). Let's examine them:
attributes(df$bap87)$labels_en
attributes(df$bap87)$labels_de
table(factor(df$bap87, labels = names(attributes(df$bap87)$labels_en)))
Alternatively, you can use the getmetadata_odf() function to get the value labels:
table(factor(df$bap87, labels = names(getmetadata_odf(df$bap87, type = "valuelabels"))))
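Note that factor(x, labels = ...) on its own expects exactly one label per distinct value that occurs in the data. When a labelled value never occurs in a variable, a more robust variant is to pass the labelled values as levels explicitly. The sketch below shows this on a toy vector; the values and most of the labels are made up for illustration and are not taken from the example dataset.

```r
# Toy variable in which the labelled value 3 never occurs
x <- c(1, 2, 2, -1, 1)
valuelabels <- c("No Answer" = -1, "Very good" = 1, "Good" = 2, "Satisfactory" = 3)

# Pass the labelled values as levels and their names as labels,
# so that the mapping does not depend on which values happen to occur
f <- factor(x, levels = unname(valuelabels), labels = names(valuelabels))
tab <- table(f)
print(tab)
```

With explicit levels, categories that do not occur still appear in the table with a count of zero.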
To display the data in a language other than the default one, let's try German by appending the respective language code to the attribute name. For example, you can use $labels_de to access the German language labels and present the information accordingly.
table( factor( df$bap87, labels = names(attributes(df$bap87)$labels_de) ) )
Or using the getmetadata_odf() function:
table( factor( df$bap87, labels = names(getmetadata_odf(df$bap87, type = "valuelabels", language = "de")) ) )
To merge ODF datasets, use left_join(), right_join(), full_join(), and inner_join() from the dplyr package instead of the merge() function, so that the attributes holding the metadata of the merged datasets are kept.
library(dplyr)
# similar to merge(df[, c(1:3, 6)], df[, c(4:6)], by = "name", all.x = TRUE, all.y = FALSE)
merged_df <- left_join(df[, c(1:3, 6)], df[, c(4:6)], by = "name")
# or
merged_df <- left_join(df[, c(1:3, 6)], df[, c(4:6)])
We want to display the table with only valid answers. Therefore, we set the values -2 and -1 to NA. Because we do not want to overwrite the original variable, we generate a new one:
bap87_rec <- df$bap87
We check the attributes of the metadata and notice they are also copied from the original variable to the new one:
attributes(bap87_rec)
Now we can set the negative values to NA:
for (row in seq_along(bap87_rec)) {
  if (!is.na(bap87_rec[row]) && bap87_rec[row] <= -1) {
    bap87_rec[row] <- NA
  }
}
table(bap87_rec, useNA = "ifany")
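The same recoding can also be written without a loop by using logical indexing, which likewise keeps the attributes of the vector. This is a base R sketch on a toy vector with an illustrative label, not on the example dataset.

```r
# Toy variable with negative codes for invalid answers and one missing value
x <- c(4, -2, NA, 1, -1, 3)
attr(x, "label_en") <- "Current Health"

# Work on a copy so that the original variable stays unchanged
x_rec <- x

# Set all negative codes to NA in one vectorized step
x_rec[!is.na(x_rec) & x_rec <= -1] <- NA

print(as.vector(x_rec))
print(attr(x_rec, "label_en"))
```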
We notice that the copied values and value labels do not fit anymore:
attributes(bap87_rec)$labels_en
To change that, we'll copy positions 3 to 7, retaining the desired range of values and their respective value labels.
attributes(bap87_rec)$labels_en <- unname(attributes(df$bap87)$labels_en)[3:7]        # values
names(attributes(bap87_rec)$labels_en) <- names(attributes(df$bap87)$labels_en)[3:7]  # labels
attributes(bap87_rec)$labels_en
Do the same for the other language versions of the new recoded variable:
attributes(bap87_rec)$labels_de <- unname(attributes(df$bap87)$labels_de)[3:7]        # values
names(attributes(bap87_rec)$labels_de) <- names(attributes(df$bap87)$labels_de)[3:7]  # labels
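Instead of selecting the value labels by position and repeating the step for every language, you could also filter them by value across all labels_<code> attributes at once. The sketch below is one way to do that on a toy vector; drop_negative_labels() is a hypothetical helper and not part of the opendataformat package, and some of the labels are made up for illustration.

```r
# Toy variable with value labels in two languages (illustrative values)
x <- c(1, 2, NA)
attr(x, "labels_en") <- c("No Answer" = -1, "Very good" = 1, "Good" = 2)
attr(x, "labels_de") <- c("keine Angabe" = -1, "Sehr gut" = 1, "Gut" = 2)

# Hypothetical helper: in every labels_<code> attribute, keep only the
# value labels that belong to non-negative values
drop_negative_labels <- function(v) {
  for (a in grep("^labels_", names(attributes(v)), value = TRUE)) {
    labs <- attr(v, a, exact = TRUE)
    attr(v, a) <- labs[labs >= 0]
  }
  v
}

x_clean <- drop_negative_labels(x)
print(attr(x_clean, "labels_en"))
print(attr(x_clean, "labels_de"))
```

Filtering by value avoids having to know at which positions the valid categories start, at the cost of assuming that all invalid codes are negative.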
We also notice that the variable name is no longer adequate. We replace the name copied from the original variable with the new name bap87_rec.
attributes(bap87_rec)$name <- "bap87_rec"
attributes(bap87_rec)$name
Now we generate the frequency table by using the variable as a factor variable.
table( factor( bap87_rec, labels = names(attributes(bap87_rec)$labels_en) ) )
To create a barplot, we will utilize the recoded variable from the previous section. This example will demonstrate how to leverage metadata to create a more convenient and informative graph. By incorporating the metadata into the visualization, we can enhance the graph's interpretability and provide a clearer understanding of the data.
barplot(
  table(
    factor(
      bap87_rec,
      labels = names(attributes(bap87_rec)$labels_en)
    )
  ),
  main = attributes(bap87_rec)$description_en,  # title
  xlab = paste0(
    attributes(bap87_rec)$name, ": ",
    attributes(bap87_rec)$label
  ),                                            # label
  sub = attributes(bap87_rec)$url,              # subtitle
  cex.main = 0.9, cex.names = 0.7, cex.sub = 0.8,
  cex.axis = 0.6, cex.lab = 0.7                 # font sizes
)
Drawing a barplot with the German description is just as easy when the data has multiple language versions of labels and descriptions. Simply append the language code to the end of the label and description attributes, and you'll get the desired barplot with the German description:
barplot(
  table(
    factor(
      bap87_rec,
      labels = names(attributes(bap87_rec)$labels_de)
    )
  ),
  main = attributes(bap87_rec)$description_de,  # title
  xlab = paste0(
    attributes(bap87_rec)$name, ": ",
    attributes(bap87_rec)$label_de
  ),                                            # label
  sub = attributes(bap87_rec)$url,              # subtitle
  cex.main = 0.7, cex.names = 0.5, cex.sub = 0.8,
  cex.axis = 0.7, cex.lab = 0.7                 # font sizes
)
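Since the two barplot calls differ only in the language suffix, the language-dependent parts could be assembled by a small function. The following is a sketch on a toy vector under the attribute naming convention used in this vignette; plot_parts() is a hypothetical helper, not a function of the opendataformat package, and the values and some labels are made up.

```r
# Toy variable with ODF-style attributes in two languages (illustrative values)
x <- c(1, 2, 2, 1, 1)
attr(x, "name") <- "health"
attr(x, "label_en") <- "Current Health"
attr(x, "label_de") <- "Gesundheitszustand"
attr(x, "labels_en") <- c("Very good" = 1, "Good" = 2)
attr(x, "labels_de") <- c("Sehr gut" = 1, "Gut" = 2)

# Hypothetical helper: build the labelled counts and the axis label
# for the requested language code
plot_parts <- function(v, lang) {
  labs <- attr(v, paste0("labels_", lang), exact = TRUE)
  list(
    counts = table(factor(as.vector(v), levels = unname(labs), labels = names(labs))),
    xlab = paste0(
      attr(v, "name", exact = TRUE), ": ",
      attr(v, paste0("label_", lang), exact = TRUE)
    )
  )
}

p_de <- plot_parts(x, "de")
print(p_de$counts)
print(p_de$xlab)
```

The returned pieces can then be handed to barplot(), for example barplot(p_de$counts, xlab = p_de$xlab), and switching the language only requires changing the lang argument.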