opendataformat




Introducing the opendataformat package

Tired of struggling to convert data in the Open Data Format into R data frames? The opendataformat package is here to help.

With just a few lines of code, you can convert a data package specified as opendataformat into an R dataframe (read_odf()) or convert an R dataframe into a data package specified in the opendataformat (write_odf()).

But wait, there's more! Our package goes beyond parsing. Accessing metadata has never been easier. Dive into the treasure trove of information stored in your R dataframes. Explore dataset labels, descriptions, and valuable details about variables such as labels and value labels with the docu_odf() function and the getmetadata_odf() function.

This vignette will guide you through a series of examples that demonstrate the possibilities of these functions. Let's get started!

Installation

You can install the latest release of the package from CRAN. The package is also available from Zenodo, and you can install the development version from GitHub using the install_git() function from the devtools package.

# Download and install the latest version of the
# opendataformat package from CRAN:
install.packages("opendataformat")


Functions in the opendataformat Package

Read the Open Data Format: read_odf()

The opendataformat package provides example data in the Open Data Format. A dataset in the Open Data Format is a ZIP file containing a CSV file and an XML file. The example contains the data file (data.csv) with 20 rows and 7 columns and the metadata file (metadata.xml), which describes the dataset and its variables. If you are interested in what these files look like, you can find an example in the Open Data Format git repository: https://git.soep.de/opendata/specification/-/tree/main/external/example

To load the data, we need to specify the path to the ZIP file. Here the example data shipped with the package is loaded using read_odf(). Alternatively, you can set file to the name of a ZIP file in the working directory.

library(opendataformat)

path <- system.file("extdata", "data.odf.zip", package = "opendataformat")
df <- read_odf(file = path)

The output of the read_odf() function is an R tibble that has the additional class odf. Its metadata is stored in the attributes of the tibble and of its variables (columns). These include the languages of the metadata, labels, descriptions, URLs, variable types, and value labels.

df

If you load the haven package, the data viewer shows the variable labels in the active language.

library(haven)
View(df)

If you want to import a dataset with metadata in only one or several languages, you can use the languages argument. To load the example data with only English labels and descriptions, set languages = "en":

df_en <- read_odf(file = path, languages = "en")

By default, languages = "all":

df_all <- read_odf(file = path, languages = "all")

You can also pass a vector of languages:

df_en_de <- read_odf(file = path, languages = c("en", "de"))

You can set further arguments for the read_odf() function. With the nrows argument you define how many rows to read (excluding the header). With the skip argument you set how many rows to skip (excluding the header). With the select argument you determine which columns/variables to load, given as a vector of indices or of variable/column names. The call below spells out the default values:

df <- read_odf(file = path, languages = "all", nrows = Inf, skip = 0, select = NULL)

Explore Dataset Information: docu_odf()

Display Metadata

You can explore dataset information in two ways. First, you can browse the metadata at the dataset level to get an overview of the dataset. Second, you can examine the metadata of individual variables to gain insight into selected data attributes.

By default, when using the docu_odf() function, dataset-level information is presented through the console and an HTML page. If you're utilizing RStudio, this html-page will be displayed within the RStudio viewer.

docu_odf(df)

To display the metadata only in the console, set the style argument to "print" (or "console"). This is the setting used for the examples in this vignette. To display the metadata only in the viewer, set style = "viewer" or style = "html". By default, style = "both".

docu_odf(df, style = "print")

To obtain a comprehensive overview of all variables within the dataset, simply set the argument variables="yes".

docu_odf(df, variables = "yes", style = "print")

If you are interested in just one specific variable, you can do this:

docu_odf(df$bap9001, style = "print")

Display Metadata in a Specific Language

Certain datasets offer metadata such as labels, descriptions, or value labels in multiple languages. To display the metadata in all languages supported by your dataset, you can simply set the languages argument to all. This setting enables you to identify the range of languages available for accessing the relevant metadata within your dataset.

docu_odf(df$bap9001, style = "print", variables = "yes", languages = "all")

If you are interested in a specific language, specify the corresponding language code to retrieve the metadata in that language. This gives you the language-specific versions of the variable labels and value labels. In this example, we display the German version:

docu_odf(df$bap9001, style = "print", languages = "de")

You can apply this function to the entire dataset, allowing you to access the desired information across all variables.

docu_odf(df, style = "print", variables = "yes", languages = "de")

If you prefer another display style, you can read the dataset's metadata directly from the attributes and write your own code:

for (i in names(df)) {
  cat(
    paste0(attributes(df[[i]])$name, ": ", attributes(df[[i]])$label_de, "\n")
  )
}
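The same idea can be written without an explicit loop, returning the labels as a named character vector. The following is a minimal base R sketch, not part of the package: the toy data frame and its label are hypothetical stand-ins for a dataset read with read_odf(), and the helper get_label() is an illustrative name. It assumes the label_<language> attribute naming used above.

```r
# Hypothetical toy data standing in for an ODF dataset.
toy <- data.frame(bap87 = c(1, 2, -1), bap96 = c(3, 4, 5))
attr(toy$bap87, "label_en") <- "Current Health"
# bap96 deliberately has no label, to show the fallback.

# Return the label of one column in the given language, or NA if missing.
get_label <- function(column, language) {
  label <- attr(column, paste0("label_", language), exact = TRUE)
  if (is.null(label)) NA_character_ else label
}

# One label per column, named by column.
labels_en <- vapply(toy, get_label, character(1), language = "en")

cat(paste0(names(labels_en), ": ", labels_en, "\n"), sep = "")
```

Columns without a label in the requested language show up as NA instead of silently disappearing from the output.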

You can also use the getmetadata_odf() function to retrieve labels and other metadata for the variables:

getmetadata_odf(df, type = "label")

or the value labels:

getmetadata_odf(df$bap87, type = "valuelabels")

Set the Active Language of a Data Frame: setlanguage_odf()

Alternatively, you can set the current (active) language of a dataset object. (This function is modelled on the label language command in Stata.)

df <- setlanguage_odf(df, language = "de")

docu_odf(df$bap9001, style = "print")

To display which languages are available for the dataset metadata, display the languages attribute:

attributes(df)$languages
attr(df, "languages")
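When working with the attributes of a single variable, the languages for which a label exists can also be read off the attribute names. This is a small base R sketch under the assumption that labels follow the label_<language> naming described in this vignette; the vector x is a hypothetical stand-in for a column such as df$bap87.

```r
# Hypothetical variable with labels in two languages.
x <- c(1, 2, 3)
attr(x, "label_en") <- "Current Health"
attr(x, "label_de") <- "Gesundheitszustand gegenwärtig"

# Keep only attribute names of the form label_<language> ...
attribute_names <- names(attributes(x))
label_names <- grep("^label_", attribute_names, value = TRUE)

# ... and strip the prefix to obtain the language codes.
available_languages <- sub("^label_", "", label_names)
available_languages
```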

Access Metadata: getmetadata_odf() and attributes()

Browsing through datasets' metadata provides a valuable initial overview. However, when it comes time to dive into the analysis work, questions arise regarding the storage location of the metadata and the process of accessing and utilizing it. Let's explore how and where the metadata is stored, and how we can effectively access and leverage it for analysis purposes.

An easy way to retrieve metadata is the getmetadata_odf() function.

Retrieve the Attributes with attributes() and attr()

Another way is to retrieve metadata directly from the attributes. The metadata imported from the Open Data Format file into an R tibble (dataframe) is stored as R attributes. By using the base R functions attributes() and attr(), you can easily access this metadata. When providing the entire dataset to the function, R will display all the metadata describing the dataset as a whole in your console.

attributes(df)

If you provide a specific variable to the function, only the corresponding metadata for that variable will be printed.

attributes(df$bap87)

If you're interested in a particular attribute, you can access it using the dollar sign followed by the attribute name. For instance, let's consider accessing a variable label in German (language code: de) as an example.

attributes(df$bap87)$label_de

Alternatively, you can use the attr() function to get the same result:

attr(df$bap87, "label_de")
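One detail worth knowing when accessing attributes by name: in base R, both the dollar sign on a list and attr() accept unambiguous partial names. With attribute names that share a prefix, such as label_en and label_de, asking for an exact match avoids surprises. A minimal base R sketch with a hypothetical vector:

```r
# Hypothetical variable with a single English label.
x <- c(1, 2, 3)
attr(x, "label_en") <- "Current Health"

# A unique partial name is silently completed ...
partial <- attr(x, "label")
# ... unless an exact match is requested, which returns NULL here.
strict <- attr(x, "label", exact = TRUE)

# With a second attribute sharing the prefix, the partial name is ambiguous.
attr(x, "label_de") <- "Gesundheitszustand gegenwärtig"
ambiguous <- attr(x, "label")

partial
is.null(strict)
is.null(ambiguous)
```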

Moreover, you have the flexibility to copy, remove, and modify these attributes to suit your needs.

attributes(df$bap87)$description_de <- NULL
attributes(df$bap87)$description_de
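Copying and modifying work along the same lines. The sketch below uses plain base R on hypothetical vectors; the attribute names follow the naming used in this vignette, and the new variable name is made up for illustration.

```r
# Hypothetical source variable with two metadata attributes.
x <- c(1, 2, 3)
attr(x, "name") <- "bap87"
attr(x, "label_en") <- "Current Health"

# Copy: transfer all attributes to another vector of the same length.
y <- c(3, 2, 1)
attributes(y) <- attributes(x)

# Modify: overwrite a single attribute on the copy.
attr(y, "name") <- "bap87_copy"

# Remove: assigning NULL deletes the attribute.
attr(y, "label_en") <- NULL

names(attributes(y))
```

The original vector x keeps its attributes; only the copy y is changed.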

Retrieve Metadata with getmetadata_odf()

You can also use the getmetadata_odf() function to retrieve labels and other metadata for the variables. By default, the function will return the variable labels for a dataset:

getmetadata_odf(df, type = "labels")

or for a specific variable:

getmetadata_odf(df$bap96, type = "labels")

To retrieve metadata in a specific language, use the language parameter:

getmetadata_odf(df, type = "labels", language = "en")

Or set the active language of the dataset using the setlanguage_odf() function:

df <- setlanguage_odf(df, language = "en")
getmetadata_odf(df, type = "labels")

You can also use the getmetadata_odf() function to retrieve value labels for a specific variable by setting the argument type="valuelabels":

getmetadata_odf(df$bap9001, type = "valuelabels")

The returned vector contains the values; the value label of each value is stored as its name:

names(getmetadata_odf(df$bap9001, type = "valuelabels"))
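Because the labels are the names of the value vector, translating observed codes into labels is a simple lookup. The following base R sketch assumes that structure; the three value labels are taken from the bap87 metadata shown in this vignette, while the observed codes are hypothetical.

```r
# Value labels: the values, with the labels as names.
value_labels <- c("Does not apply" = -2, "No Answer" = -1, "Very good" = 1)

# Hypothetical observed codes, including a missing value.
codes <- c(1, -1, 1, -2, NA)

# Look up the position of each code and return the corresponding label.
labelled <- names(value_labels)[match(codes, value_labels)]
labelled
```

Codes without a matching value label, including NA, come back as NA.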

You can use the getmetadata_odf() function to return descriptions, urls, variable types and metadata languages as well:

To retrieve variable description(s), set the argument type="description":

getmetadata_odf(df, type = "description")

To retrieve variable url(s), set the argument type="url":

getmetadata_odf(df, type = "url")

To retrieve variable type(s), set the argument type="type":

getmetadata_odf(df, type = "type")
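For documentation purposes it can be handy to collect several kinds of metadata in one table. The sketch below builds a small codebook with base R only. It assumes the attribute names described in this vignette (label_en, type); the toy data frame, its values and the helper attr_or_na() are hypothetical.

```r
# Hypothetical toy data standing in for an ODF dataset.
toy <- data.frame(bap87 = c(1, 2, -1), bap96 = c(3, 4, 5))
attr(toy$bap87, "label_en") <- "Current Health"
attr(toy$bap87, "type") <- "numeric"
attr(toy$bap96, "type") <- "numeric"

# Return the requested attribute as a string, or NA if it is missing.
attr_or_na <- function(column, attribute) {
  value <- attr(column, attribute, exact = TRUE)
  if (is.null(value)) NA_character_ else as.character(value)
}

# One row per variable.
codebook <- data.frame(
  variable = names(toy),
  label = vapply(toy, attr_or_na, character(1), attribute = "label_en"),
  type = vapply(toy, attr_or_na, character(1), attribute = "type"),
  row.names = NULL,
  stringsAsFactors = FALSE
)
codebook
```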

Export to the Open Data Format: write_odf()

To save a dataset as an ODF file, we can use the write_odf() function. Let's assume we want to save the first four columns of our dataset as a new ODF file. We call write_odf() with the R data frame and the file name (including its location, if needed).

write_odf(
  x = df[, 1:4],
  file = "../df_1_4.odf.zip"
)

# or:
df_14 <- df[, 1:4]
write_odf(
  x = df_14,
  file = "df_1_4.odf.zip"
)

The XML file metadata.xml and the CSV file data.csv are saved within the ZIP file df_1_4.odf.zip. The metadata looks the same as before, just with fewer variables:

<?xml version='1.0' encoding='utf-8'?>
<codeBook xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="ddi:codebook:2_5 http://www.ddialliance.org/Specification/DDI-Codebook/2.5/XMLSchema/codebook.xsd" xmlns="ddi:codebook:2_5" version="2.5">
    <fileDscr>
        <fileTxt>
            <fileName>bap</fileName>
            <fileCont xml:lang="en">The data were collected as part of the SOEP-Core study using the questionnaire "Living in Germany - Survey 2010 on the social situation - Personal questionnaire for all. This questionnaire is addressed to the individual persons in the household. A view of the survey instrument can be found here: https://www.diw.de/documents/dokumentenarchiv/17/diw_01.c.369781.de/soepfrabo_personen_2010.pdf</fileCont>
            <fileCont xml:lang="de">Die Daten wurden im Rahmen der Studie SOEP-Core mittels des Fragebogens „Leben in Deutschland – Befragung 2010 zur sozialen Lage - Personenfragebogen für alle“ erhoben. Dieser Fragebogen richtet sich an die einzelnen Personen im Haushalt. Eine Ansicht des Erhebungsinstrumentes finden Sie hier: https://www.diw.de/documents/dokumentenarchiv/17/diw_01.c.369781.de/soepfrabo_personen_2010.pdf</fileCont>
            <fileCitation>
                <titlStmt>
                    <titl xml:lang="en">Data from individual questionnaires 2010</titl>
                    <titl xml:lang="de">Daten vom Personenfragebogen 2010</titl>
                </titlStmt>
            </fileCitation>
        </fileTxt>
        <notes>
            <ExtLink URI="https://paneldata.org/soep-core/data/bap" />
        </notes>
    </fileDscr>
    <dataDscr>
        <var name="bap87">
            <labl xml:lang="en">Current Health</labl>
            <labl xml:lang="de">Gesundheitszustand gegenwärtig</labl>
            <txt xml:lang="en">Question: How would you describe your current health?</txt>
            <txt xml:lang="de">Frage: Wie würden Sie Ihren gegenwärtigen Gesundheitszustand beschreiben?</txt>
            <notes>
                <ExtLink URI="https://paneldata.org/soep-core/data/bap/bap87" />
            </notes>
            <varFormat type="numeric" />
            <catgry>
                <catValu>-2</catValu>
                <labl xml:lang="en">Does not apply</labl>
                <labl xml:lang="de">trifft nicht zu</labl>
            </catgry>
            <catgry>
                <catValu>-1</catValu>
                <labl xml:lang="en">No Answer</labl>
                <labl xml:lang="de">keine Angabe</labl>
            </catgry>
            <catgry>
                <catValu>1</catValu>
                <labl xml:lang="en">Very good</labl>
                <labl xml:lang="de">Sehr gut</labl>
...

The data.csv file now includes just four columns:

"bap87","bap9201","bap9001","bap9002"
4,-2,1,-1
3,5,-2,1
,-1,-1,2
1,9,-2,2
-1,4,2,3
3,4,-1,4
1,9,2,-1
...

If you wish to export only the metadata for documentation or archiving purposes, you can achieve this by setting the argument export_data=FALSE. By doing so, the resulting directory or zip file will solely contain the metadata XML file, excluding the data CSV file. This allows you to specifically capture and preserve the metadata without including the actual data, providing a solution for documentation or archiving needs.

write_odf(
  x = df,
  file = "../df_metadata.odf.zip",
  export_data = FALSE
)

If you wish to export the dataset with metadata in only one or some languages, use the languages argument, for example languages = "en":

write_odf(
  x = df,
  file = "../df_en.odf.zip",
  languages = "en"
)

By default, languages is set to languages="all". You can also define a list of languages to be exported:

write_odf(
  x = df,
  file = "../df_en_de.odf.zip",
  languages = c("en", "de")
)

Create ODF tibbles and convert data frames to odf_tbl: as_odf_tbl()

To convert a data frame to an ODF tibble in R, the opendataformat package provides the as_odf_tbl() function. It transforms a data frame (or an object of any subclass) into an Open Data Format tibble (an object of the odf_tbl class). The metadata in the data frame has to be stored in the attributes according to the odf_tbl framework:

Regarding the dataset metadata, the dataset name is stored in the name attribute, and a URL can be stored in the url attribute. The multilingual labels and descriptions are stored in the label_tag and description_tag attributes, where tag is the respective language tag. For example, the English dataset label is stored in the label_en attribute and the English description in the description_en attribute.

Regarding the variable metadata, the variable name is stored in the name attribute, a URL can be stored in the url attribute of each column, and the variable type (numeric or character) can be stored in the type attribute. The multilingual labels, descriptions and value labels are stored in the label_tag, description_tag, and labels_tag attributes, where tag is the respective language tag. For example, the English variable label is stored in the label_en attribute and the English description in the description_en attribute. The English value labels (a numeric vector of the labelled values, with the value labels as its names) are stored in the labels_en attribute of the column.

# Create a data frame with four variables and 5 rows
exampledata <- data.frame(id = 1:5,
                          name = c("Klaus", "Anna", "Rebecca",
                                   "Kevin", "Janina"),
                          age = c(55, 40, 19, 25, 60),
                          diagnosis = c(1,3,3,2,1))
# Add metadata for dataset according to ODF tibble framework.
attr(exampledata, "name") <- "patientdata"
attr(exampledata, "label_en") <- "Patient Data"
attr(exampledata, "description_en") <- "Patient database of the practice Dr. Sommer"
attr(exampledata, "url") <- "www.example.url.en"

# Add metadata for the id and diagnosis variables (label, description and value labels).
attr(exampledata$id, "name") <- "id"
attr(exampledata$id, "label_en") <- "Patient ID"
attr(exampledata$id, "description_en") <- "Practice Patient ID"
attr(exampledata$diagnosis, "name") <- "diagnosis"
attr(exampledata$diagnosis, "label_en") <- "Diagnosis"
attr(exampledata$diagnosis, "description_en") <- "Diagnosis patient last visit"
valuelabels_diagnosis <- 1:4
names(valuelabels_diagnosis) <- c("Covid", "Influenza", "Common cold", "Tonsillitis")
attr(exampledata$diagnosis, "labels_en") <- valuelabels_diagnosis
# Use as_odf_tbl() to transform the data frame into an ODF tibble ('odf_tbl' class object)
example_odf <- as_odf_tbl(exampledata)

# Display metadata of diagnosis Variable
docu_odf(example_odf$diagnosis, style = "print")

Let's Analyse and Make Use of the Metadata

Now let's see how we can use the metadata to better understand the data and make more informative plots.

Frequency Table

table(df$bap87, useNA = "ifany")

As expected, the frequency table displays the occurrence count of each variable value. Now, let's enhance the convenience of the frequency table by utilizing the value labels associated with the variables. To access the value labels, as explained in the preceding section, you can utilize the base R function attributes(). Let's proceed to examine them now:

attributes(df$bap87)$labels_en
attributes(df$bap87)$labels_de
table(factor(df$bap87, labels = names(attributes(df$bap87)$labels_en)))

Alternatively you can use the getmetadata_odf()-function to get the value labels:

table(factor(df$bap87,
             labels = names(getmetadata_odf(df$bap87, type = "valuelabels"))))
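Passing only labels to factor() works as long as every labelled value actually occurs in the data, because factor() then derives the levels from the observed values. A slightly more defensive variant, sketched here in base R, also passes the labelled values as levels. The vector x and its values are hypothetical; the three value labels are taken from the bap87 metadata shown earlier.

```r
# Value labels: the values, with the labels as names.
value_labels <- c("Does not apply" = -2, "No Answer" = -1, "Very good" = 1)

# Hypothetical variable in which the value -2 never occurs.
x <- c(1, -1, 1, NA)
attr(x, "labels_en") <- value_labels

# Use the labelled values as levels and their names as labels.
labels_en <- attr(x, "labels_en", exact = TRUE)
x_factor <- factor(x, levels = unname(labels_en), labels = names(labels_en))

tab <- table(x_factor)
tab
```

Categories that do not occur are listed with a count of zero, and the call does not fail when the number of labels differs from the number of distinct observed values.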

To display the data in a language other than the default one, let's try German by appending the respective language code to the attribute name. For example, you can use $labels_de to access the German language labels and present the information accordingly.

table(
  factor(
    df$bap87,
      labels = names(attributes(df$bap87)$labels_de)
    )
  )

Or using getmetadata_odf()-function:

table(
  factor(
    df$bap87,
      labels = names(getmetadata_odf(df$bap87, type = "valuelabels",
                                     language = "de"))
    )
  )

Merge ODF Datasets

To merge ODF datasets, use left_join(), right_join(), full_join(), or inner_join() from the dplyr package instead of the merge() function, so that the attributes holding the metadata of the merged datasets are kept.

library(dplyr)
#similar to merge(df[,c(1:3,6)], df[,c(4:6)], by="name", all.x=T, all.y=F)
merged_df <- left_join(df[, c(1:3, 6)], df[, c(4:6)], by = "name")
#or
merged_df <- left_join(df[, c(1:3, 6)], df[, c(4:6)])

Recoding a Variable

We want to display the table with only valid answers. Therefore, we set the values -2 and -1 to NA. Because we do not want to overwrite the original variable, we generate a new one:

bap87_rec <- df$bap87

We check the attributes of the metadata and notice they are also copied from the original variable to the new one:

attributes(bap87_rec)

Now we can set the negative values to NA:

for (row in seq(1, length(bap87_rec))) {
  if (!is.na(bap87_rec[row]) && bap87_rec[row] <= -1) {
    bap87_rec[row] <- NA
  }
}

table(bap87_rec, useNA = "ifany")
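The loop can also be replaced by a single vectorised assignment. The following base R sketch uses a hypothetical vector in place of bap87_rec and shows that the metadata attributes survive the recoding:

```r
# Hypothetical variable with a label attribute.
x <- c(4, -2, 3, NA, 1, -1)
attr(x, "label_en") <- "Current Health"

# Work on a copy and set all negative codes to NA in one step.
x_rec <- x
x_rec[!is.na(x_rec) & x_rec <= -1] <- NA

as.vector(x_rec)
attr(x_rec, "label_en")
```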

We notice that the copied values and value labels do not fit anymore:

attributes(bap87_rec)$labels_en

To change that, we'll copy positions 3 to 7, retaining the desired range of values and their respective value labels.

attributes(bap87_rec)$labels_en <-
  unname(attributes(df$bap87)$labels_en)[3:7] # values
names(attributes(bap87_rec)$labels_en) <-
  names(attributes(df$bap87)$labels_en)[3:7] # labels

attributes(bap87_rec)$labels_en
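Selecting by position relies on the order of the value labels. An alternative is to select by value, which keeps values and names together in one step. A minimal base R sketch, using the three value labels of bap87 shown in the metadata excerpt earlier:

```r
# Value labels: the values, with the labels as names.
labels_en <- c("Does not apply" = -2, "No Answer" = -1, "Very good" = 1)

# Keep only the labels of valid (positive) values; names are carried along.
valid_labels <- labels_en[labels_en > 0]
valid_labels
```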

Do the same for the other language versions of the new recoded variable:

attributes(bap87_rec)$labels_de <-
  unname(attributes(df$bap87)$labels_de)[3:7] # values
names(attributes(bap87_rec)$labels_de) <-
  names(attributes(df$bap87)$labels_de)[3:7] # labels

We also notice that the variable name is no longer adequate. We replace the name copied from the original variable with the new name bap87_rec.

attributes(bap87_rec)$name <- "bap87_rec"
attributes(bap87_rec)$name

Now we generate the frequency table by using the variable as a factor variable.

table(
  factor(
    bap87_rec,
      labels = names(attributes(bap87_rec)$labels_en)
    )
  )

Barplot

To create a barplot, we will utilize the recoded variable from the previous section. This example will demonstrate how to leverage metadata to create a more convenient and informative graph. By incorporating the metadata into the visualization, we can enhance the graph's interpretability and provide a clearer understanding of the data.

barplot(
  table(
  factor(
    bap87_rec,
      labels = names(attributes(bap87_rec)$labels_en)
    )
  ),
  main = attributes(bap87_rec)$description_en, # title
  xlab = paste0(
    attributes(bap87_rec)$name, ": ", attributes(bap87_rec)$label), # label
  sub = attributes(bap87_rec)$url, # subtitle
  cex.main = 0.9, cex.names = 0.7, cex.sub = 0.8, cex.axis = 0.6,
  cex.lab = 0.7 # font sizes
)

Drawing a barplot with the German description becomes effortless when dealing with data that have multiple language versions of labels and descriptions. Simply append the language code to the end of the label attributes, and you'll be able to generate the desired barplot with the German description:

barplot(
  table(
  factor(
    bap87_rec,
      labels = names(attributes(bap87_rec)$labels_de)
    )
  ),
  main = attributes(bap87_rec)$description_de, # title
  xlab = paste0(
    attributes(bap87_rec)$name, ": ", attributes(bap87_rec)$label_de), # label
  sub = attributes(bap87_rec)$url, # subtitle
  cex.main = 0.7, cex.names = 0.5, cex.sub = 0.8, cex.axis = 0.7,
  cex.lab = 0.7 # font sizes
)
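Since the two plots differ only in the language code, the attribute names can be built programmatically. The helper below is a hypothetical illustration in base R, not a function of the package; it assumes the name and label_<language> attributes used throughout this vignette.

```r
# Build an axis label of the form "<name>: <label>" for a given language.
axis_label <- function(x, language) {
  label <- attr(x, paste0("label_", language), exact = TRUE)
  paste0(attr(x, "name", exact = TRUE), ": ", label)
}

# Hypothetical variable standing in for bap87_rec.
x <- c(1, 2, 3)
attr(x, "name") <- "bap87_rec"
attr(x, "label_en") <- "Current Health"
attr(x, "label_de") <- "Gesundheitszustand gegenwärtig"

axis_label(x, "en")
axis_label(x, "de")
```

The result can be passed to the xlab argument of barplot(), so switching the language of a plot only requires changing one argument.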



opendataformat documentation built on April 3, 2025, 11:22 p.m.