knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
If you are analysing data in R, you should arrive to two basic classes: numeric (double or integer) for continuous variables, and the factor (a categorical variable with integer values.) These classes are like SPSS’s classes, but they differ in the way they handle missing values.
The SPSS classes have two attributes that are modelled into the R class labelled. They add a variable label in addition to the name of each variable (or object, in R), and they add optional value labels to each value of the that object. These are like levels of a factor variable in R, and they are different from the names of vector elements.
library(labelled) v <- labelled(c(3,4,4,3,8, 9), c(YES = 3, NO = 4, `WRONG LABEL` = 8, REFUSED = 9) ) str(v) # The labels are stored in the 'labels' attribute. attr(v, 'labels')
The labelled::val_label()
will retrieve or modify certain labels. You can retrieve all labels with labelled::val_labels()
. (See further manipulation of the labelled class in this intro.)
labelled::val_label(v, 8) <- 'WRONG' labelled::val_labels(v)
library(eurobarometer) data("ZA7576_sample") eb <- ZA7576_sample nuts <- eb$nuts str(nuts) trust_media <- eb$qa6a_1 str(trust_media)
The nuts
variable in the ZA7579_sample
has the following metadata:
label
, which is the variable label;labels
, which are the value labels;format.spss
, which is the character class in SPSS;display_width
, which is the user-set display width in SPSS;trust_media <- eb$qa6a_1 str(trust_media)
The nuts
variable in the ZA7579_sample
has one more attribute, the the user defined missing value. This means that apart from missing observations, the SPSS user may set ‘do not know’ as a missing value label to tell SPSS that whenever this variable is summarized or averaged, these observations should be omitted.
label
, which is the variable label;labels
, which are the value labels;format.spss
, which is the character class in SPSS;display_width
, which is the user-set display width in SPSS;na_values
, which are the numeric values which should be translated to NA_real
, given that the R class is dbl+lbl
, i.e. double numeric with labels.Let's have a look at the factor representation of this variable:
#The first 8 observations print(head(labelled::to_factor(trust_media),8))
Or in summary:
#The summary of the first 8 observations summary(head(labelled::to_factor(trust_media)),8)
The SPSS compatible extension of labelled, defined in the package haven, works with the following constructor, adding an SPSS-style missing mark to two values, 8 (WRONG)
and 9 (REFUSED)
:
s <-labelled_spss( x = unclass(v), # remove labels from earlier defined labels = val_labels(v), # use the labels from earlier defined na_values = NULL, na_range = 8:9, label = "Variable Example" ) str(s)
print(s)
summary(s)
The labelled_spss
are S3 classes see a comprehensive overview here, and a more hands-on here], which means that important generic functions, such as print or summary
are already defined.
THe labelled_spss
stores all the contents of the SPSS file without information loss, but these values are not yet harmonized, and not yet ready for analysis in R. For example, without further conversion, the previous expression gives a misleading mean:
mean(s)
It is always the analysts responsibility to work with a correct variable type. In R, most statistical functions accept only numeric (continous) or factor (categorical) arguments.
In our package we are defining a new class that inherits properties from the haven_labelled_spss class (we can call this haven_labelled_eurobarometer or simply eurobarometer). We aim to slightly modify various labelled classes to maintain compatibility with the haven package for exporting and importing to/from SPSS and Stata, with the package labelled to change labelling, and its popular extension, sjlabelled.
The numeric values of our variables are harmonized, for example, yes is always coded as 1, and no is always coded as 0.
labels
, which are harmonized value labels, whenever a harmonization function already exists in the package, i.e. their values are consistent regardless of the original SPSS file and they can be safely joined as factors (you do not have to worry if ‘not_trust’ or ‘distrust’ mean the same.)
label
, which contains the variable label (long description of the variable in SPSS);
* na_range
and/or na_values
for variables labelled as 'missing'.
These attributes can be converted to labelled_spss
with the haven::labelled_spss
constructor. However, we are adding several further attributes, which do not interfere with the functionality of the labelled
class, but document all the history of the harmonization.
labels_orig
, which contains the original value labels with the original values as class labelled, and can help to restore both the original numericl values and the original value labels.label_orig
, which contains the original variable name, or var_name_orig.var_name_orig
, which contains the original SPSS variable nameconversion
, which contains the name of the eurobarometer function that harmonized the values and their labels.qb
, a recognized question block with pre-defined metadata, such as id
, socio-demography
, protocol
. The value_orig
is an integer in the range -999-999 where a natural ordering of a category is available, or there is a natural conversion to a number, such yes
1, no
0. It is in the range of 1001..9999 when there is no natural ordering, and the categories should not be ordered. We apply special values such as 99999 or 99998 for missing values or other special cases. This is helpful for both numeric and factor variables, because they allow an efficient use of various statistical functions only on the valid range.
e <-labelled_spss( x = c(1,2,2,1,99998,99999), labels = c(yes = 1, no = 2, wrong_label = 99998, refused = 99999), na_values = NULL, na_range = 99998:99999, label = "Variable Example" ) attr(e, "labels_orig") <- c(YES = 3, NO = 4, `WRONG LABEL` = 8, REFUSED = 9) attr(e, "labels_orig") <- NULL attr(e, "var_name_orig") <- "v" attr(e, "qb") <- "examples" attr(e, "conversion") <- "labelled_spss" str(e)
And now let's see the eurobarometer_labelled()
constructor in work, even though normally it should not be directly called, instead the various conversion functions, such as the wrappers harmonize_to_numeric()
or convert_class()
should call it:
my_ebl <- eurobarometer_labelled( x = c(6,7,8,9,6,7,8), labels =c("HI" = 6, "HELLO" = 7, "HOWDY" = 8, "BYE" = 9), na_values = 9, qb = "Example Block", conversion = "eurobarometer_labelled" ) str(my_ebl)
Conversion from a haven_labelled
class:
from_v <- eurobarometer_labelled( x = v, qb = "Example Block", conversion = "eurobarometer_labelled" ) str(from_v)
Conversion from a haven_labelled_spss
class:
from_trust <- eurobarometer_labelled( x = trust_media, qb = "trust", conversion = "custom_conversion" ) str(from_trust)
The methods of eurobarometer_labelled
are designed to fall back to the classes of haven
and labelled
, and eventually vctrs
which uses more consistent coercion rules than base R. In the very last case, they fall back to double
, which is a basic type.
class(from_trust)
See the method dispatch with mean()
, which falls back to the nearest to eurobarometer_labelled
. In this case, it does not give a correct mean, because we do not have yet a to_numeric
method that would make all user-defined missing values NA_real_
:
mean (from_trust)
Not yet written: Furthermore, the use of consistent numbering helps further harmonization, for example, with other surveys where the labels are in a different language, because no is always 0, and yes is always 1, whereas in the original GESIS files, they may take the value 1,2 instead of 0,1.
These labelled classes are useful for harmonization and prosessing, but not for the actual data analysis. They serve as an importing, documenting, harmonization and re-exporting intermediate class for compatibility with SPSS or other software packages, and they can be easily and unambiguously converted to base types of R, i.e. to numeric and factor. They help to prepare the next step, data analysis, by either exporting harmonized data back to SPSS, or by converting them to base R types of numeric and factor for analysis in R.
Our labelled_eurobarometer
class has further attributes which do not interfere with the haven_labelled_spss
, only serve documentation and reversibility functions. In short, we will create haven_labelled_spss
vectors from all harmonized variables, with the additional metadata that shows the history of the variable before and during the harmonization. These attributes can be omitted when saved back to and SPSS file, or simplified into a csv file.
We are going to implement simple methods for this new class:
* summary
, which will show in a simple summary display the values with their original and new labels, and the conversion applied.
to_numeric
, which will convert the values to their numeric value, converting all missing values to NA_real_, i.e. resulting in a valid numeric base type R object.
to_factor
, which will convert all categorical values to factors with harmonized factor levels, i.e. a well-defined, validated, factor object with consistent levels.
to_character
, which is inherited from the haven_labelled_spsss class, and convert the variable to a character vector. This may be useful for data visualization. (The character
base type is often used when factor levels are not harmonized, however, when factor levels are consistent, their use is desirable.)
See that from_trust
works with the to_factor
function (method) of the package labelled
, and already has a meaningful summary()
method:
class(from_trust) summary(labelled::to_factor(from_trust))
Furthermore, we will create a helper function which will allow the user the decide which missing values to treat as missing before conversion to factor: i.e. inap
only, inap
+declined
, or all, with other potential missing values. (This is not needed for numeric conversion, because only numbers in the valid range should be used. )
knitr::kable( eurobarometer_special_values() )
It is always the researchers', analysts' responsibility to choose between a numeric (continuous) or categorical (factor representation.) We will store all necessary information in our class to make this decision. The researcher will have to choose which variables to be converted to numeric or to factor when the analysis begins (we will create default cases for weights, protocol metadata variables, descriptive constants, and repeating, standard socio-demography, which are unlikely to be overruled by the user.)
Once the harmonization took place, and the user used the to_numeric or to_factor method, the conversion takes place, and any base R or R packages can be used for the analysis. Before this step, however, our labelled_eurobarometer
can be saved to csv or SPSS files, either with the original, imported values, or with the harmonized ones. (We have to see if SPSS allows to save the additional information, i.e. allow to save in one .sav
file the harmonized and original labelling.) In csv, we will create a simple save function that will save each variable in two forms, with an orig_
prefix with the original values, and with the harmonized values.
Generally, haven_labelled_eurobarometer
is designed to deviate as little from the classes of the labelled
package as possible, and only help the users with a memory-efficient way to keep the data and metadata consistently stored both in memory and in file.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.