knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
resquin
This short tutorial describe the functions in resquin
and how you can use them
on a technical level. For a more substantive introduction see the (forthcoming)
article Using resquin in practice.
Functions in resquin
calculate response quality indicators for survey
data stored in a data frame or tibble. The functions assume that the
input data frame is structured in the following way:
NA
.resp_styles()
) Reverse keyed variables are in their original form. No items were
recoded.Consider the following (fake) data set of survey responses.
# A test data set with three items and ten respondents testdata <- data.frame( var_a = c(1,4,3,5,3,2,3,1,3,NA), var_b = c(2,5,2,3,4,1,NA,2,NA,NA), var_c = c(1,2,3,NA,3,4,4,5,NA,NA)) testdata
The data set contains responses to three survey questions (var_a,var_b
and var_c) from ten respondents. All three survey question allow
responses on a scale from 1 to 5. Some respondents have missing values,
which are set to NA
.
Lets use this data set to calculate response quality indicators.
resp_styles()
: Response style indicatorsResponse styles capture systematic shifts in respondents response behavior. For example, respondents with an extreme response style may only choose the lowest and highest categories (in our example 1 and 5) while mid-point respondents only choose the midpoint of a scale (in our example 3).
To calculate response styles we can use the resp_styles()
function.
First, we need to specify our data argument x
. Then, we need to
specify the minimum and maximum of the scales used in our questionnaire
(scale_min
and scale_max
respectively). Remember that all questions
included must have the same number of response options. We will discuss
the arguments min_valid_responses
and normalize
later.
library(resquin) # Calculating response style indicators for all respondents with no missing values results_response_styles <- resp_styles( x = testdata, scale_min = 1, scale_max = 5, min_valid_responses = 1, # Excludes respondents with less than 100% valid responses normalize = T) # Presents results in percent of all responses round(results_response_styles,2)
The resulting data frame contains five columns corresponding to the
middle response style (MRS), acquiescence response style (ARS),
disaquiescence response style (DRS), extreme response style (ERS), and
non-extreme response style (NERS) - you can learn more about the
response styles in the help file of the function using ?resp_styles
.
Each respondent receives one value for each indicator, given that they
can be calculated. Because normalize
is set to TRUE
the values are
expressed as the share of responses of a respondent that can be
attributed to a response style. For example, respondent one has an ERS
value of 0.67 meaning that two out of three responses can be identified
as extreme responses. On the other hand, respondent one does not have
any mid-point response, leading to a value of 0 in the MRS column.
Instead of calculating proportions, we can extract the counts of
responses that can be attributed to a response option by setting
normalize
to FALSE
.
Finally, we can decide to include or exclude respondents from receiving
response style values by setting min_valid_responses
, which can take
values from 0 to 1. min_valid_responses
sets the share of valid
responses (i.e. non-missing responses) a respondent must have to receive
response style values. A value of 0 indicates that response style values
should be calculated for all respondents, regardless of whether or not
they have missing values. A value of 1 indicates that response styles
should only be calculated for respondents who have valid responses on
all variables. Values between 0 and 1 indicate the share of responses
that need to be valid to be included in the response style calculations.
Results from resp_*()
functions can be fed to summary()
and print()
functions
to quickly create overviews for the calculated response quality indicators.
results_response_styles |> summary()
Calculates the averages and a five point summary of the response quality indicators.
results_response_styles |> print()
Prints a boxplot of the indicators.
resp_distributions()
: Intra-individual response distribution indicatorsresp_distributions()
calculates indicators which reflect the location
and variability of responses within a respondent. resp_distributions()
works similar to resp_styles()
: We need to specify the data argument
and we can include or exclude respondents from the calculations based on
amount of missing data they exhibit (for an explanation see paragraph
above).
# Calulating response distribution indicators for all respondents with no missing values results_resp_distributions <- resp_distributions( x = testdata, min_valid_responses = 1) # Excludes respondents with less than 100% valid responses round(results_resp_distributions,2)
The resulting data frame contains eight columns:
n_na: number of intra-individual missing answers
prop_na: proportion of intra-individual missing responses
ii_mean: intra-individual mean
ii_median: intra-individual median
ii_sd: intra-individual standard deviation
mahal: Mahalanobis distance per respondent.
You can learn more about the response distribution indicators using ?resp_distributions
resp_nondifferentiation()
: Response nondifferentiation indicatorsresp_nondifferentiation
calculates indicators primarily measuring what is known
as straightlining. However, different indicators measure different aspects of straightlining.
While Simple Nondifferentiation just measures whether respondents used the same response
option for all questions, other indicators such as the Mean Root of Pairs Method or
the Scale Point Variation Method provide continuous values of nondifferentiation,
measuring how varied the choice of response options for each respondent is.
resp_nondifferentiation()
works just like resp_distributions()
and resp_styles()
.
results_resp_nondifferentiation <- resp_nondifferentiation( x = testdata, min_valid_responses = 1) # Excludes respondents with less than 100% valid responses round(results_resp_nondifferentiation,2)
The resulting dataframe contains four columns containing indicators described by [@kim_straightlining_2019]:
You can learn more about the the response nondifferentiation indicators using
?resp_nondifferentiation
.
resp_patterns()
: Response pattern indicatorsResponse patterns describe suspicious response patterns arising from careless or insufficient effort responding. Careless or insufficient effort respondents might choose the same answer option repeatedly (i.e. creating a long string of equal responses) or use pattern, such as zig-zaging, to complete a survey.
resp_patterns()
provides three columns which contain longstring analysis indicators for each respondent. These three columns are always returned. There are two more optional columns () for the analysis of other repeating response patterns: defined_patterns and arbitrary_patterns:
To obtain the three longstring analysis columns simply call resp_patterns()
:
results_resp_patterns <- resp_patterns(testdata) round(results_resp_patterns,2)
To obtain counts of defined response patterns per respondents use the defined_patterns
argument. You can either supply a single vector representing a response pattern,
or a list of vectors representing multiple response patterns.
# Single defined pattern defined_patterns_single <- resp_patterns( x = testdata, defined_patterns = c(1,2,3) ) defined_patterns_single # Multiple defined patterns defined_patterns_multiple <- resp_patterns( x = testdata, defined_patterns = list( # wrap multiple pattern vectors into a list c(1,2,3), c(2,3,4), c(4,3,2), c(3,2,1) ) ) defined_patterns_multiple # defined patterns are returned as a single list column. # One way to make the data accessible is by using tidyr::unnest_wider() # This way, each column represents the count of one defined pattern defined_patterns_multiple |> tidyr::unnest_wider(defined_patterns)
Finally, you can request the detection of arbitrary response patterns with counts
larger than one and pattern lengths longer than one using the arbitrary_patterns
argument
arbitrary_patterns_length_2 <- resp_patterns( x = testdata, arbitrary_patterns = 2 ) arbitrary_patterns_length_2 # You can also request the detection of patterns of multiple # lengths, in this case 2 and 3 arbitrary_patterns_length_2_3 <- resp_patterns( x = testdata, arbitrary_patterns = c(2,3) ) arbitrary_patterns_length_2_3
id
argument: Uniquely identify respondentsIn its default setting, all resquin
functions provide an integer id
column
running from 1 to the number of respondents in the data set supplied in x
. You can turn of the
creation of the id
column by setting id = False
or supply a vector of unique
ids (character or numeric). The latter option is useful if a data set provides its
own vector of ids.
Here are examples of the three options:
# Default. Integer ids. resp_distributions(testdata) # No id column resp_distributions(testdata,id = F) # Custom id vectors custom_ids <- letters[1:nrow(testdata)] resp_distributions(testdata,id = custom_ids)
The argument min_valid_responses
can be used to control how many
responses a respondent must have for a response quality indicator to be computed.
Per default, resquin
requires respondents to have valid responses for all questions
asked (min_valid_responses = 1)
. In other words, per default resquin
requires
respondents to have no missing values.
Of course, missing values are a common occurrence in survey data, thus it can make
sense to reduce the required share of valid responses in min_valid_responses
from 1 to a
lower one. This leads to the calculation of response quality indicator values
for more respondents. For example, reducing min_valid_responses
to 0.5 would mean
that respondents only have to have valid (i.e. non-missing) responses on 50% of
the questions asked.
However, allowing NA
values (i.e. missing values) can change the interpretation
and validity of response quality indicators.
For example, a respondent with ten identical responses on a ten item scale
(i.e. ten time response option "3") has more evidence for straightlining then a respondent with two identical responses
which are separated by eight NA
values. Both would receive a 1 on the
simple_nondifferentiation
indicator of resp_nondifferentiation()
or 1 (meaning
absolute presence of) a Middle Response Style if the response scale had five response options.
But should the respondents be considered to be identical in their extent of
straightlining or Middle Response Style? This is up to the interpretation of the
analyst.
Although an increasing number of missing values per respondent can pose validity
threats to all response quality indicators, some indicators are more affected
by NA
values than others. Indicators calculated with resp_distributions()
are
not affected as much by missing data, because the resp_distributions()
indicators
average over all response values.
For the following response indicators missing values are more of a threat to validity
because they conceptually rely more on the position of responses on the response
scale:
resp_patterns()
:
mean_string_length
- Response strings with identical values can be broken by NA
.longest_string_length
- Response strings with identical values can be broken by NA
.resp_nondifferentiation()
:
simple_nondifferentiation
- Straightlining with identical values can be broken by NA
.resp_styles()
:
NA
.In any case, it helps to report how missing values were handled in an analysis to increase confidence in and replicability of the results.
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.