cj_tidy: Tidy a conjoint dataset
In cregg: Simple Conjoint Tidying, Analysis, and Visualization


Coerce a “wide” conjoint dataset into a “long”/“tidy” one for use with cregg

cj_tidy(data, profile_variables, task_variables, id)

`data`	A data frame containing a conjoint dataset in “wide” format (see Details).
`profile_variables`	A named list of two-element lists capturing profile-specific variables (either features, or profile-specific outcomes, like rating scales). For each element in the list, the first element contains vectors of feature variable names for the first profile in each decision task (hereafter, profile “A”) and the second element contains vectors of feature variable names for the second profile in each decision task (hereafter, profile “B”). Variables can be specified as character strings or an RHS formula. The names at the highest level are used to name variables in the long/tidy output.
`task_variables`	A named list of vectors of variables constituting task-level variables (i.e., variables that differ by task but not across profiles within a task). Variables can be specified as character strings or an RHS formula. These could be outcome variables, response times, etc.
`id`	An RHS formula specifying a variable holding respondent identifiers.
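As a small illustration of how the two ways of naming variables line up, the sketch below uses base R's all.vars() to extract the names from an RHS formula and compares them with a hand-built character vector. The column names are borrowed from the Examples; whether cj_tidy resolves formulae this way internally is an assumption, not something this page states.

```r
# An RHS formula naming four wide columns (names taken from the Examples)
f <- ~ feature1a1 + feature1b1 + feature1c1 + feature1d1

# all.vars() returns the variable names in order of appearance
vars_from_formula <- all.vars(f)

# The same four names built as a character vector
vars_as_strings <- paste0("feature1", letters[1:4], "1")

# Both routes name the same columns
stopifnot(identical(vars_from_formula, vars_as_strings))
```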

A conjoint survey typically comes to the analyst in a “wide” form, where the number of rows is equal to the number of survey respondents and columns represent choices and features for each choice task and task profile. For example, a design with 1000 respondents and five forced-choice decision tasks, each comparing two profiles with six features apiece, will have 1000 rows and 5x2x6 = 60 feature columns, plus five forced-choice outcome columns recording which alternative was selected in each task. To analyse these data, the data frame needs to be reshaped to “long” or “tidy” format, with 1000x5x2 = 10,000 rows, six feature columns, and one outcome column. Multiple outcomes or other task-specific variables increase the number of columns in the result, as do respondent-varying characteristics, which need to be replicated across each decision task and profile.
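The arithmetic in this example can be checked with a few lines of R. This is only a sketch of the counting; the variable names are hypothetical and nothing here calls cregg.

```r
# Design from the example: 1000 respondents, 5 tasks,
# 2 profiles per task, 6 features per profile
n_respondents <- 1000
n_tasks <- 5
n_profiles <- 2
n_features <- 6

# Wide format: one row per respondent
wide_rows <- n_respondents
wide_feature_cols <- n_tasks * n_profiles * n_features
wide_outcome_cols <- n_tasks

# Long format: one row per respondent-task-profile combination
long_rows <- n_respondents * n_tasks * n_profiles
long_feature_cols <- n_features

c(wide_rows = wide_rows,
  wide_feature_cols = wide_feature_cols,
  wide_outcome_cols = wide_outcome_cols,
  long_rows = long_rows,
  long_feature_cols = long_feature_cols)
```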

This is a complex operation because variables vary at three levels: respondent, task, and profile. Thus the reshape is not a simple wide-to-long transformation. It instead requires two reshaping steps, one to create a task-level dataset and a further one to create a profile-level dataset. cj_tidy performs this tidying in two steps, through a single function with an easy-to-use API. Users can specify variable names in the wide format using either character vectors or right-hand-side (RHS) formulae. They are equivalent, but depending on the naming of variables, character vectors can be easier to specify (e.g., using regular expressions for pattern matching).
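To make the two-step idea concrete, here is a minimal base R sketch on a toy dataset with two respondents, two tasks, and one feature. It is an illustration of the general technique only, not cj_tidy's implementation; the data frame and its column names (choice_1, feat_1_A, and so on) are hypothetical.

```r
# Toy wide data: one row per respondent
wide <- data.frame(
  id       = 1:2,
  choice_1 = c(1, 2),       # task 1 outcome (task level)
  choice_2 = c(2, 2),       # task 2 outcome (task level)
  feat_1_A = c("x", "y"),   # task 1, profile A
  feat_1_B = c("y", "x"),   # task 1, profile B
  feat_2_A = c("x", "x"),   # task 2, profile A
  feat_2_B = c("y", "y"),   # task 2, profile B
  stringsAsFactors = FALSE
)

# Step 1: stack tasks, giving one row per respondent-task
task_level <- do.call(rbind, lapply(1:2, function(t) {
  data.frame(
    id     = wide$id,
    task   = t,
    choice = wide[[paste0("choice_", t)]],
    feat_A = wide[[paste0("feat_", t, "_A")]],
    feat_B = wide[[paste0("feat_", t, "_B")]],
    stringsAsFactors = FALSE
  )
}))

# Step 2: stack profiles, giving one row per respondent-task-profile
long <- do.call(rbind, lapply(c("A", "B"), function(p) {
  data.frame(
    task_level[c("id", "task", "choice")],
    profile = p,
    feat    = task_level[[paste0("feat_", p)]],
    stringsAsFactors = FALSE
  )
}))

# 2 respondents x 2 tasks x 2 profiles
stopifnot(nrow(long) == nrow(wide) * 2 * 2)
```

Note that the task-level variable (choice) is simply repeated across the two profile rows of each task, while the profile-level variable (feat) takes a different source column for each profile.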

Particular care is needed to decide whether a particular set of “wide” columns belongs in profile_variables or task_variables. This especially applies to outcome variables. If a variable in the original format records which of the two profiles was chosen (e.g., “left” and “right”), it should go in task_variables. If it records whether a profile was chosen (e.g., for each task there is a “left_chosen” and “right_chosen” variable), then both variables should go in profile_variables as they vary at the profile level. Similarly, one needs to be careful with the output of cj_tidy to ensure that a task-level variable is further recoded to encode which alternative was selected (see examples).
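For the “left”/“right” case, the recoding step might look like the sketch below. The toy data frame stands in for cj_tidy output, and the mapping of “left” to profile “A” and “right” to profile “B” is an assumption that must be checked against the actual survey layout.

```r
# Toy stand-in for tidy output: one respondent, two tasks, two profiles each.
# `choice` is a task-level variable, so it repeats across both profile rows.
long <- data.frame(
  respondent = c(1, 1, 1, 1),
  task       = c(1, 1, 2, 2),
  profile    = factor(c("A", "B", "A", "B")),
  choice     = c("left", "left", "right", "right"),
  stringsAsFactors = FALSE
)

# Assumed mapping: "left" is profile A, "right" is profile B
long$chosen <- as.integer(
  (long$profile == "A" & long$choice == "left") |
  (long$profile == "B" & long$choice == "right")
)

# In a forced-choice design, exactly one profile per task is chosen
chosen_per_task <- tapply(long$chosen, list(long$respondent, long$task), sum)
stopifnot(all(chosen_per_task == 1))
```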

Users may find that it is easier to recode features after using cj_tidy rather than before, as it requires recoding only a number of variables equal to the number of features in the design, rather than recoding all “wide” feature columns before reshaping. Again, however, care should be taken that these variables encode information in the same way so that stacking does not produce a loss of information.
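A rough sketch of this workflow follows: first confirm that the wide columns for one feature are stored the same way, then recode once after stacking. The toy data, the column names, and the “Low”/“Medium”/“High” labels are all hypothetical, and unlist() is used here only as a stand-in for the stacking that cj_tidy would perform.

```r
# Toy wide columns for a single feature across tasks and profiles
wide <- data.frame(
  feature1a1 = c(1, 2, 3),
  feature1a2 = c(3, 1, 2),
  feature1b1 = c(2, 2, 1),
  feature1b2 = c(1, 3, 3)
)

# Check that every column for this feature has the same storage class,
# so that stacking them does not silently coerce or drop information
feature1_cols <- grep("^feature1", names(wide), value = TRUE)
col_classes <- vapply(wide[feature1_cols], function(x) class(x)[1], character(1))
stopifnot(length(unique(col_classes)) == 1)

# Stand-in for the stacked feature column in the tidy data
stacked <- unlist(wide[feature1_cols], use.names = FALSE)

# One recode after stacking, instead of one per wide column
feature1 <- factor(stacked, levels = 1:3, labels = c("Low", "Medium", "High"))

# No value should fall outside the declared levels
stopifnot(!anyNA(feature1))
table(feature1)
```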

Finally, data should not use the variable names “task”, “pair”, or “profile”, which are the names of metadata columns created by reshaping.
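A quick pre-flight check for such clashes could look like this. The toy data frame and the “_orig” suffix are hypothetical choices made for illustration.

```r
# Names created by the reshape, per the note above
reserved <- c("task", "pair", "profile")

# Toy wide data with one clashing column name
wide <- data.frame(
  respondent = 1:3,
  task       = c("x", "y", "z"),
  feature1a1 = c(1, 2, 3),
  stringsAsFactors = FALSE
)

# Find and rename any clashing columns before reshaping
clashes <- intersect(names(wide), reserved)
if (length(clashes) > 0) {
  names(wide)[match(clashes, names(wide))] <- paste0(clashes, "_orig")
}

names(wide)
```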

A data frame with rows equal to the number of respondents times the number of tasks times the number of profiles (fixed at 2), to be fed into any other function in the package. The columns will include the names of elements in profile_variables and task_variables, and id, along with an indicator task (from 1 to the number of tasks), pair (an indicator for each task pair from 1 to the number of pairs), profile (a factor indicator for profile, either “A” or “B”), and any other respondent-varying covariates not specified. As such, respondent-varying variables do not need to be specified to cj_tidy at all.

The returned data frame carries an additional S3 class (“cj_df”) with methods that preserve column attributes. See cj_df.

cj, cj_df

## Not run: 
data("wide_conjoint")

# character string interface
## profile_variables
list1 <- list(
 feature1 = list(
     names(wide_conjoint)[grep("^feature1.{1}1", names(wide_conjoint))],
     names(wide_conjoint)[grep("^feature1.{1}2", names(wide_conjoint))]
 ),
 feature2 = list(
     names(wide_conjoint)[grep("^feature2.{1}1", names(wide_conjoint))],
     names(wide_conjoint)[grep("^feature2.{1}2", names(wide_conjoint))]
 ),
 feature3 = list(
     names(wide_conjoint)[grep("^feature3.{1}1", names(wide_conjoint))],
     names(wide_conjoint)[grep("^feature3.{1}2", names(wide_conjoint))]
 ),
 rating = list(
     names(wide_conjoint)[grep("^rating.+1", names(wide_conjoint))],
     names(wide_conjoint)[grep("^rating.+2", names(wide_conjoint))]
 )
)
## task variables
list2 <- list(choice = paste0("choice_", letters[1:4]),
              timing = paste0("timing_", letters[1:4]))

# formula interface
## profile_variables
list1 <- list(
   feature1 = list(
       ~ feature1a1 + feature1b1 + feature1c1 + feature1d1,
       ~ feature1a2 + feature1b2 + feature1c2 + feature1d2
   ),
   feature2 = list(
       ~ feature2a1 + feature2b1 + feature2c1 + feature2d1,
       ~ feature2a2 + feature2b2 + feature2c2 + feature2d2
   ),
   feature3 = list(
       ~ feature3a1 + feature3b1 + feature3c1 + feature3d1,
       ~ feature3a2 + feature3b2 + feature3c2 + feature3d2
   ),
   rating = list(
       ~ rating_a1 + rating_b1 + rating_c1 + rating_d1,
       ~ rating_a2 + rating_b2 + rating_c2 + rating_d2
   )
)
## task variables
list2 <- list(choice = ~ choice_a + choice_b + choice_c + choice_d,
              timing = ~ timing_a + timing_b + timing_c + timing_d)


# perform reshape
str(long <- cj_tidy(wide_conjoint,
                    profile_variables = list1,
                    task_variables = list2,
                    id = ~ respondent))
stopifnot(nrow(long) == nrow(wide_conjoint)*4*2)

# recode outcome so it is coded sensibly
long$chosen <- ifelse((long$profile == "A" & long$choice == 1) | 
                       (long$profile == "B" & long$choice == 2), 1, 0)
# use for analysis
cj(long, chosen ~ feature1 + feature2 + feature3, id = ~ respondent)

## End(Not run)

Classes ‘cj_df’ and 'data.frame':	800 obs. of  12 variables:
 $ respondent: int  1 2 3 4 5 6 7 8 9 10 ...
 $ covariate1: num  0.59 0.731 0.735 0.656 -0.302 ...
 $ covariate2: int  2 1 2 2 1 2 2 1 1 1 ...
 $ task      : int  1 1 1 1 1 1 1 1 1 1 ...
 $ profile   : Factor w/ 2 levels "A","B": 1 1 1 1 1 1 1 1 1 1 ...
 $ feature1  : Factor w/ 4 levels "Feature1_levela",..: 3 1 4 2 1 3 2 4 4 3 ...
 $ feature2  : Factor w/ 6 levels "Feature2_levele",..: 1 1 4 1 3 2 4 5 3 3 ...
 $ feature3  : Factor w/ 4 levels "Feature3_levelk",..: 1 2 2 1 2 3 3 2 3 4 ...
 $ rating    : int  1 4 2 1 7 7 3 2 3 3 ...
 $ choice    : num  2 1 2 2 2 1 1 2 1 2 ...
 $ timing    : num  5.28 4.1 3.81 2.82 2.99 ...
 $ pair      : int  1 2 3 4 5 6 7 8 9 10 ...
   outcome statistic  feature           level     estimate  std.error
1   chosen      amce feature1 Feature1_levela  0.000000000         NA
2   chosen      amce feature1 Feature1_levelb -0.045705703 0.04943489
3   chosen      amce feature1 Feature1_levelc -0.010061419 0.04548105
4   chosen      amce feature1 Feature1_leveld -0.004395694 0.05541028
5   chosen      amce feature2 Feature2_levele  0.000000000         NA
6   chosen      amce feature2 Feature2_levelf  0.059991162 0.06068171
7   chosen      amce feature2 Feature2_levelg  0.027369323 0.05681540
8   chosen      amce feature2 Feature2_levelh  0.017917407 0.05949787
9   chosen      amce feature2 Feature2_leveli  0.041574226 0.05826477
10  chosen      amce feature2 Feature2_levelj -0.024090727 0.06040213
11  chosen      amce feature3 Feature3_levelk  0.000000000         NA
12  chosen      amce feature3 Feature3_levell  0.069166978 0.05433517
13  chosen      amce feature3 Feature3_levelm  0.084225986 0.04726189
14  chosen      amce feature3 Feature3_leveln  0.046708276 0.04729943
             z          p        lower      upper
1           NA         NA           NA         NA
2  -0.92456376 0.35519287 -0.142596298 0.05118489
3  -0.22122223 0.82491940 -0.099202635 0.07907980
4  -0.07932995 0.93677019 -0.112997840 0.10420645
5           NA         NA           NA         NA
6   0.98862018 0.32284901 -0.058942802 0.17892513
7   0.48172368 0.63000225 -0.083986811 0.13872546
8   0.30114367 0.76330494 -0.098696276 0.13453109
9   0.71353968 0.47551187 -0.072622627 0.15577108
10 -0.39883902 0.69001182 -0.142476728 0.09429527
11          NA         NA           NA         NA
12  1.27296876 0.20302913 -0.037328004 0.17566196
13  1.78211206 0.07473096 -0.008405618 0.17685759
14  0.98750193 0.32339664 -0.045996898 0.13941345
Warning message:
In logLik.svyglm(x) : svyglm not fitted by maximum likelihood.