library(ifnBase)

platform_define_survey("intake", survey_id = 9, table ="intake", template="eu:intake", mapping=list())
platform_define_survey("weekly", survey_id = 8, table ="weekly", template="eu:weekly", mapping=list())

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Introduction

ifnBase provides two categories of functions to work with surveys.

  1. Functions to load data from the platform database (survey definition or collected data)
  2. Functions to work with survey data (variable aliases, recoding and labelling)

To work these functions need the surveys to be registered to the package by providing a survey definition.

Survey definition

Survey definition is provided by calling the platform_define_survey() function, either in the platform file or in a script before using it. To do it in the platform file will guarantee the survey definition is available everywhere once the package is loaded and only defined once.

Two examples :

# A Simple survey about vaccination
platform_define_survey(
  name="mysurvey",
  table="pollster_results_vaccination",
  survey_id = 12,
  mapping= list(
    'age'='Q1'  # Q1 = Name in the DB -> 'age' name in the R data
    'vaccination'='Q2',
    'vacc.at_risk'='Q3_1',
    'vacc.health_prof'='Q3_2'
    'vacc.work'='Q3_3'
    ),
  recodes=list(
    'vaccination'=c('yes'='1','no'='2', 'unknown'='3') # Describe recoding to meaningful levels
  ),
  labels = list(
    'vacc.reason'='vacc.*' # Define a pattern to identify all variables about vaccination reason
  )
)

# Define "intake" survey to be exactly as the european one.
platform_define_survey(name = "intake", template = "eu:intake") # That's all.

the platform_define_survey function returns a value but you dont need to care about it. The package manage a registry of all surveys. The surveys is automatically registered when you call this function !

Survey definition elements

each element is a parameter to pass to the platform_define_survey function.

name

A unique internal name has to be given to each survey. This is the name to be used when calling the packages functions. It can be different than the survey name registered in the Influenzanet platform database (but it's a good idea to have the same name ;)).

survey_id : what is the survey definition in the Influenzanet database

id of the survey in the Influenzanet platform database (if not provided some functions like survey_load_question(), survey_load_all() will not be useable).

This parameter is only useable if you work with a full featured Influenzanet platform's database (with all tables of the pollster app). You can omit it if you work offline or with a restricted database.

table, single.table : defining how data are stored

table parameter has to be set to the name of the table containing the survey data. If several table are available it must be the common table ("pollster_results_intake" for example)

This package handles 2 data models : - single table model: all data of all seasons are stored in the same table (e.g. all in "pollster_results_intake"), to use this model set parameter single.table to TRUE - multiple table: one table for each season, then each season should be described individually (see season's handling section). single.table should be FALSE (default).

mapping : define mapping between database column names and variable names

This parameters is a list of key value pairs, the key is the variable name, the value the name of the database column for this variable.

 mapping= list(
    'age'='Q1'  # Q1 = Name in the DB -> 'age' name in the R data
    'vaccination'='Q2',
    'vacc.at_risk'='Q3_1',
    'vacc.health_prof'='Q3_2'
    'vacc.work'='Q3_3'
 )

recodes : define the recoding for each variables

This parameter should be a list of list. The first key, should be the variable name, the value a list of key,value pairs, with recoded label as key and database value as value.

  recodes = list(
     smoker = c(
        'smoker.no'="0",
        'smoker.occas'="1",
        'smoker.dailyfew'="2",
        'smoker.daily'="3",
        'smoker.dkn'="4",
        'smoker.stopped'="5",
        'smoker.juststop'="6"
      ),
      pregnant = c(
        'Yes' = '0',
        'No' = '1',
        'DNK'='2'
      ),
      another_variable = recode_alias("pregnant")
  )

It is recommended to use prefixed labels ('smoker.no' instead of 'no') because this is really identify the modality of this question (by making the label unique across the surveys) and enable specific translation for this label.

It is possible to reuse an already define recoding by using the function recode_alias() with the name of the variable to copy the recoding from.

labels

Labels are a way to define group of variable names (see survey_labels()). They can be defined either with the full list of names

  labels = list(
    'vacc.reason'='vacc.*' # Define a pattern to identify all variables about vaccination reason. Will select all variable starting with 'vacc.'
  )

template

It's possible to define a survey by inheriting an existing template (with mapping, recodes & labels defined).

2 templates are built-in the package : - 'eu:weekly' : weekly Influenzanet survey - 'eu:intake' : intake Influenzanet survey

Loading functions

Survey description from database

Several functions are available to load information about a survey defined in an Influenzanet platform database.

These functions are only needed if you want to work with the definitions stored in the database (how data are collected using the web forms), they are not necessary to work with surveys responses (therefore it's possible to work only with the survey data, without the database). The survey definition provided by this package allow to work offline or with a data only database (with only "_results" tables).

Load Survey data (responses) from the database

survey_load_results()

load survey data, all options are described in R documentation ?survey_load_results. Basically you can : - load data of a season with the season parameter - for a country (in european database) - for a list of participants (survey.users parameter) - get geographical levels matching with the participant location (for intake survey) with the geo parameter - get a subset of columns (using cols) parameter, the name to use are the either the database column name or the internal aliases

# Load weekly data for the season 2017-2018, with all columns
weekly = survey_load_results("weekly", cols="*", season=2017)
# Load intake for 2019-2020 seasons, and match with the corresponding nuts2 level for each fetched survey
intake = survey_load_results("intake", cols=c('timestamp','date-birth','code_zip'), season=2019, geo="nuts2")

Results will have several mandatory variables (regardless the parameter cols) - id: unique number identifying the survey response - person_id : number of the participant (see note below)

In the resulting data.frame variable names names will be the internal aliases defined in the survey definition

Note about participants identification The Influenzanet platforms identify participants by a string usually called "global_id". This package doesn't load use it directly for memory saving reasons. We use the integer id in the survey_surveyuser table, in this package it is named "person_id" in the data and parameter of function who uses it.

Other utility functions

Working with survey data

Variable names

To analyse the data we don't work with the names of the database tables (Qxxx) because they are not meaningful and error prone but with internal variable names (called aliases) The mapping between the database column names and variable names is defined in the survey survey definition.

For "intake" and "weekly", the set of variables is available from a template for each survey, using them allow to use the same names in all analysis whatever the platform. it's possible to override a template to add platform's specific questions and coding.

In principle you would not have to use database column name (the goal of the package is to do the job for you).

Variable recoding

Survey data are encoded using integer codes to represent each variable modality. These codes are defined for each question (options in the survey data model of Influenzanet platform). Unfortunately those codes are not meaningful (e.g. '0' can represent either the response 'Yes', or 'No') and error prone. To reduce the error by manipulating data we propose a recoding of each categorical variable with a unique and meaningful set of labels (they are not perfect but better than a number).

Predefined labels are provided for Yes/No/I dont know questions with labels 'Yes', 'No', and 'DNK'. This package provides 3 constants for each corresponding level : YES, NO, DONTKNOW they are useful to test a recoded value with this label (more error proof than using the label itself)

intake = survey_load_results("intake", c('vacc.curseason','pregnant')) # Load intake data with vaccination and pregnancy responses
intake = recode_intake(intake) # Recode to labels
if(intake$vacc.curseason == YES) {
  # DO something
}
#  Also useable to select subset
not_pregnant = intake %>% filter(pregnant == NO)

recode_weekly(), recode_intake()

For weekly and intake the package provides 2 dedicated function. They will recode categorical variables to factor using labels mapping in the corresponding survey definition. They also recode some other variables (like date) and check for some inconsistencies. They are more specialized than survey_recode_all().

intake = survey_load_results("intake", "*") # Load all intake data for the current season
intake = recode_intake(intake) # Recode all variables

weekly = survey_load_results("weekly", "*") # Load all intake data for the current season
weekly = recode_weekly(weekly, all.variables = FALSE) # Recode only some specific variables
weekly = recode_weekly(weekly, all.variables = TRUE) # Recode all known variables available in weekly data.

These two functions are useable with data loaded from database or stored in a Rdata file. If some variables are already recoded (in factor) they will not be recoded. Of course before recoding, the variables should be encoded with the database code (if you change these values before recoding it will lead to unpredictable results).

survey_recode()

This function recode a vector of values using the mapping of a known variable.

x = c(0, 1, 0, 0, 1)
x = survey_recode(x, variable="pregnant", survey="intake") # Recode using the mapping of the "pregnant" variable of intake survey
print(x)

survey_recode_all(data, survey)

This function recode all known variable in data using recoding from the survey name survey Caution, they are

weekly = survey_recode_all(weekly, survey="weekly") # Recode all variables in weekly data

# If another survey is registred for example, you have defined a survey called "my-study"
data = survey_recode_all(data, survey="my-study")

survey_variable_recoding()

Get the mapping to recode a given variable of a survey

survey_variable_recoding("intake", "pregnant")
survey_variable_recoding("intake", "smoker")

The mention '(inherited)' means the recoding comes from the template the survey definition inherits.

survey_recodings()

Return the recoding mapping of all known variable of a survey

survey_recodings("intake")

Variable groups and survey_labels() [survey_labels]

Several survey variables can store the responses for a single survey question, for example the multiple choices questions (each modality response is stored in a boolean column a).

A survey definition can contains "labels", a label in a named list of values, useable to define "groups". They can be used to get the list of recoding labels but also to get list of variables corresponding to a multiple choice question.

The function survey_labels() get the group with the name of labels.

  condition.vars = survey_labels("intake", "condition") # List of variables for the "condition" question of intake

This mechanism allows to write programs without knowing the exact list of modalities of the variable (if it changes, only the definition has to be updated)

For example:

  condition.vars = survey_labels("intake", "condition") # List of variables for the "condition" question of intake
  allergy.vars = survey_labels("intake", "allergy") # List of variables for the "condition" question of intake

  intake = survey_load_results("intake", cols=c(condition.vars, allergy.vars), season=2016) # Get the response of the 2 questions

  # Frequency of condition
  condition.freq = multiple_freq(intake[, condition.vars])
  gg_barplot_percent(condition.freq) # plot it

  # Frequency of allergies
  allergy.freq = multiple_freq(intake[, allergy.vars])
  gg_barplot_percent(allergy.freq) # plot it

survey_definition()

You can get all the configuration of a survey by calling survey_definition(). It will show all variable aliases, label recoding, and other parameters of a survey.

survey_definition("intake") # Show intake survey definition

survey_aliases()

The function survey_aliases() is responsible to convert from database column name to internal variable names (called aliases). The survey definition should be passed directly

survey_aliases('code_zip', survey_definition("intake")) 
# 'Q3'


cturbelin/ifnBase documentation built on Nov. 5, 2023, 12:54 p.m.