In antaldaniel/eurobarometer: Tidy Eurobarometer Survey Microdata For Retrospective Harmonization

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

If you are analysing data in R, you should arrive to two basic classes: numeric (double or integer) for continuous variables, and the factor (a categorical variable with integer values.) These classes are like SPSS’s classes, but they differ in the way they handle missing values.

The SPSS classes have two attributes that are modelled into the R class labelled. They add a variable label in addition to the name of each variable (or object, in R), and they add optional value labels to each value of the that object. These are like levels of a factor variable in R, and they are different from the names of vector elements.

library(labelled)
v <- labelled(c(3,4,4,3,8, 9),
             c(YES = 3, NO = 4, `WRONG LABEL` = 8, REFUSED = 9)
             )
str(v)

# The labels are stored in the 'labels' attribute.

attr(v, 'labels')

The labelled::val_label() will retrieve or modify certain labels. You can retrieve all labels with labelled::val_labels(). (See further manipulation of the labelled class in this intro.)

labelled::val_label(v, 8) <- 'WRONG'
labelled::val_labels(v)

library(eurobarometer)
data("ZA7576_sample")
eb <- ZA7576_sample
nuts <- eb$nuts
str(nuts)
trust_media <- eb$qa6a_1

str(trust_media)

The nuts variable in the ZA7579_sample has the following metadata:

label, which is the variable label;
labels, which are the value labels;
format.spss, which is the character class in SPSS;
display_width, which is the user-set display width in SPSS;

trust_media <- eb$qa6a_1
str(trust_media)

The nuts variable in the ZA7579_sample has one more attribute, the the user defined missing value. This means that apart from missing observations, the SPSS user may set ‘do not know’ as a missing value label to tell SPSS that whenever this variable is summarized or averaged, these observations should be omitted.

label, which is the variable label;
labels, which are the value labels;
format.spss, which is the character class in SPSS;
display_width, which is the user-set display width in SPSS;
na_values, which are the numeric values which should be translated to NA_real, given that the R class is dbl+lbl, i.e. double numeric with labels.

Let's have a look at the factor representation of this variable:

#The first 8 observations
print(head(labelled::to_factor(trust_media),8))

Or in summary:

#The summary of the first 8 observations
summary(head(labelled::to_factor(trust_media)),8)

The SPSS compatible extension of labelled, defined in the package haven, works with the following constructor, adding an SPSS-style missing mark to two values, 8 (WRONG) and 9 (REFUSED):

s <-labelled_spss(
  x = unclass(v),         # remove labels from earlier defined
  labels = val_labels(v), # use the labels from earlier defined
  na_values = NULL,
  na_range = 8:9,
  label = "Variable Example"
)

str(s)

print(s)

summary(s)

The labelled_spss are S3 classes see a comprehensive overview here, and a more hands-on here], which means that important generic functions, such as print or summary are already defined.

THe labelled_spss stores all the contents of the SPSS file without information loss, but these values are not yet harmonized, and not yet ready for analysis in R. For example, without further conversion, the previous expression gives a misleading mean:

mean(s)

It is always the analysts responsibility to work with a correct variable type. In R, most statistical functions accept only numeric (continous) or factor (categorical) arguments.

The 'eurobarometer_labelled' class

In our package we are defining a new class that inherits properties from the haven_labelled_spss class (we can call this haven_labelled_eurobarometer or simply eurobarometer). We aim to slightly modify various labelled classes to maintain compatibility with the haven package for exporting and importing to/from SPSS and Stata, with the package labelled to change labelling, and its popular extension, sjlabelled.

The numeric values of our variables are harmonized, for example, yes is always coded as 1, and no is always coded as 0. labels, which are harmonized value labels, whenever a harmonization function already exists in the package, i.e. their values are consistent regardless of the original SPSS file and they can be safely joined as factors (you do not have to worry if ‘not_trust’ or ‘distrust’ mean the same.) label, which contains the variable label (long description of the variable in SPSS); * na_range and/or na_values for variables labelled as 'missing'.

These attributes can be converted to labelled_spss with the haven::labelled_spss constructor. However, we are adding several further attributes, which do not interfere with the functionality of the labelled class, but document all the history of the harmonization.

labels_orig, which contains the original value labels with the original values as class labelled, and can help to restore both the original numericl values and the original value labels.
label_orig, which contains the original variable name, or var_name_orig.
var_name_orig, which contains the original SPSS variable name
conversion, which contains the name of the eurobarometer function that harmonized the values and their labels.
qb, a recognized question block with pre-defined metadata, such as id, socio-demography, protocol.

The value_orig is an integer in the range -999-999 where a natural ordering of a category is available, or there is a natural conversion to a number, such yes 1, no 0. It is in the range of 1001..9999 when there is no natural ordering, and the categories should not be ordered. We apply special values such as 99999 or 99998 for missing values or other special cases. This is helpful for both numeric and factor variables, because they allow an efficient use of various statistical functions only on the valid range.

e <-labelled_spss(
  x = c(1,2,2,1,99998,99999),
  labels =  c(yes = 1, no = 2, wrong_label = 99998, refused = 99999),
  na_values = NULL,
  na_range = 99998:99999,
  label = "Variable Example"
)

attr(e, "labels_orig") <- c(YES = 3, NO = 4, `WRONG LABEL` = 8, REFUSED = 9)
attr(e, "labels_orig") <- NULL
attr(e, "var_name_orig") <- "v"
attr(e, "qb") <- "examples"
attr(e, "conversion") <- "labelled_spss"
str(e)

And now let's see the eurobarometer_labelled() constructor in work, even though normally it should not be directly called, instead the various conversion functions, such as the wrappers harmonize_to_numeric() or convert_class() should call it:

my_ebl <- eurobarometer_labelled(
   x = c(6,7,8,9,6,7,8),
  labels =c("HI" = 6, "HELLO" = 7,
          "HOWDY" = 8, "BYE" = 9),
           na_values = 9,
   qb = "Example Block",
   conversion = "eurobarometer_labelled"
)

str(my_ebl)

Conversion from a haven_labelled class:

from_v <- eurobarometer_labelled(
   x = v,
   qb = "Example Block",
   conversion = "eurobarometer_labelled"
)

str(from_v)

Conversion from a haven_labelled_spss class:

from_trust <- eurobarometer_labelled(
   x = trust_media,
   qb = "trust",
   conversion = "custom_conversion"
)

str(from_trust)

The methods of eurobarometer_labelled are designed to fall back to the classes of haven and labelled, and eventually vctrs which uses more consistent coercion rules than base R. In the very last case, they fall back to double, which is a basic type.

class(from_trust)

See the method dispatch with mean(), which falls back to the nearest to eurobarometer_labelled. In this case, it does not give a correct mean, because we do not have yet a to_numeric method that would make all user-defined missing values NA_real_:

mean (from_trust)

Not yet written: Furthermore, the use of consistent numbering helps further harmonization, for example, with other surveys where the labels are in a different language, because no is always 0, and yes is always 1, whereas in the original GESIS files, they may take the value 1,2 instead of 0,1.

These labelled classes are useful for harmonization and prosessing, but not for the actual data analysis. They serve as an importing, documenting, harmonization and re-exporting intermediate class for compatibility with SPSS or other software packages, and they can be easily and unambiguously converted to base types of R, i.e. to numeric and factor. They help to prepare the next step, data analysis, by either exporting harmonized data back to SPSS, or by converting them to base R types of numeric and factor for analysis in R.

Methods

Our labelled_eurobarometer class has further attributes which do not interfere with the haven_labelled_spss, only serve documentation and reversibility functions. In short, we will create haven_labelled_spss vectors from all harmonized variables, with the additional metadata that shows the history of the variable before and during the harmonization. These attributes can be omitted when saved back to and SPSS file, or simplified into a csv file.

We are going to implement simple methods for this new class: * summary, which will show in a simple summary display the values with their original and new labels, and the conversion applied.

to_numeric, which will convert the values to their numeric value, converting all missing values to NA_real_, i.e. resulting in a valid numeric base type R object.
to_factor, which will convert all categorical values to factors with harmonized factor levels, i.e. a well-defined, validated, factor object with consistent levels.
to_character, which is inherited from the haven_labelled_spsss class, and convert the variable to a character vector. This may be useful for data visualization. (The character base type is often used when factor levels are not harmonized, however, when factor levels are consistent, their use is desirable.)

See that from_trust works with the to_factor function (method) of the package labelled, and already has a meaningful summary() method:

class(from_trust)
summary(labelled::to_factor(from_trust))

Furthermore, we will create a helper function which will allow the user the decide which missing values to treat as missing before conversion to factor: i.e. inap only, inap+declined, or all, with other potential missing values. (This is not needed for numeric conversion, because only numbers in the valid range should be used. )

knitr::kable(
  eurobarometer_special_values()
  )

It is always the researchers', analysts' responsibility to choose between a numeric (continuous) or categorical (factor representation.) We will store all necessary information in our class to make this decision. The researcher will have to choose which variables to be converted to numeric or to factor when the analysis begins (we will create default cases for weights, protocol metadata variables, descriptive constants, and repeating, standard socio-demography, which are unlikely to be overruled by the user.)

Once the harmonization took place, and the user used the to_numeric or to_factor method, the conversion takes place, and any base R or R packages can be used for the analysis. Before this step, however, our labelled_eurobarometer can be saved to csv or SPSS files, either with the original, imported values, or with the harmonized ones. (We have to see if SPSS allows to save the additional information, i.e. allow to save in one .sav file the harmonized and original labelling.) In csv, we will create a simple save function that will save each variable in two forms, with an orig_ prefix with the original values, and with the harmonized values.

Generally, haven_labelled_eurobarometer is designed to deviate as little from the classes of the labelled package as possible, and only help the users with a memory-efficient way to keep the data and metadata consistently stored both in memory and in file.

antaldaniel/eurobarometer documentation built on Aug. 31, 2020, 10:57 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

antaldaniel/eurobarometer
Tidy Eurobarometer Survey Microdata For Retrospective Harmonization

In antaldaniel/eurobarometer: Tidy Eurobarometer Survey Microdata For Retrospective Harmonization

The 'eurobarometer_labelled' class

Methods

R Package Documentation

Browse R Packages

We want your feedback!

antaldaniel/eurobarometer Tidy Eurobarometer Survey Microdata For Retrospective Harmonization

In antaldaniel/eurobarometer: Tidy Eurobarometer Survey Microdata For Retrospective Harmonization

The 'eurobarometer_labelled' class

Methods

R Package Documentation

Browse R Packages

We want your feedback!

antaldaniel/eurobarometer
Tidy Eurobarometer Survey Microdata For Retrospective Harmonization