In maxheld83/pensieveR: Tools for the Scientific Study of Operant Subjectivity

pensieve is an open-source extension to the R project for statistical computing, distributed via the Comprehensive R Archive Network (CRAN) as one of thousands of packages.

It is accompanied by accio, a web-based, closed-source front-end, for which pensieve serves as the backend. If you do not want to use hosted or closed-source software, not to worry: accio is meant as a convenient way of accessing pensieve, for users who may not want to install and learn R. Almost all features accessible in accio are also available in pensieve, with the exception (for now) of deploying a web-based Q-Sort.

Though formally a package like many others, pensieve is in some ways less and more than mainstream packages that you may be familiar with and on which it is largely based. For now, at least, pensieve offers little advanced algorithms by itself, but merely wrangles data and interface other, powerful packages. This makes it somewhat less than a "real" package, and more of a wrapping layer. At the same time, pensieve is vastly more specialized than many other packages, tailored to the needs of one domain: the scientific study of human subjectivity. This makes it akin to a Domain Specific Language or a mini-language.

pensieve, is, in other words, a very nich enterprise, much like Q Methodology itself: endowed with great hopes by its creator, and perhaps some yet-unrealised potential.

Software for Open Science and Reproducible Research

Good software for the following Q methodological analyses should fulfill the following criteria:

Reproducibility. It should be easy to document and reproduce all the steps undertaken from raw data to the final factor interpretation. This will be especially important during the factor extraction and (indeterminate!) rotation phases, where consequential and sometimes controversial methodological decisions must be made.
Open Access. A lot of Q research is funded mostly by tax money, all of this scientific work should be conveniently available to the public. Research on deliberation and citizen participation, especially, should be Open Science.
Specialized for Q Methodology. At its heart, Q methodology involves factor analyses and similar data reduction techniques, procedures that are widely used and available in all general-purpose statistics programs. However, Q methodology also requires some specialized operations, not easily accomplished in general-purpose programs. The transposed correlation matrix, flagging participants and compositing weighted factor scores in particular, are hard or counter-intuitive to do in mainstream software.
Programmatic Extensibility. It should be easy to programmatically extend this software.

Do We Need another program?

The following free / open source and closed / proprietary programs are available:

| General-Purpose | Specialized -|----|---- Closed Source | SPSS, STATA | PCQ Open Source | R | R: qmethod package, PQMethod

Table: Overview of Q Methodology Software

The closed-source, general-purpose programs offer some programmatic extensibility, but are poorly suited for Q methodology. These commercial, and often expensive offerings also make it hard to reproduce and open up research. The special-purpose PQMethod was originally written for mainframe in 1992 by John Atkinson at Kent State University in Kent, OH, and has since been ported for PCs and generously maintained by Peter Schmolck at the University of the Armed Forces in Munich, Germany. While nominally free and open-source, PQMethod consists largely of legacy FORTRAN code, which few people today can read or write. PQMethod also offers little programmatic extensibility, because it is a standalone program running within layers of emulators.

The qmethod package [@Zabala-2014], implementing Q methodology in the free and open source R programming language and software environment for statistical computing [@R-2014] is the best fit for the above criteria. R is freely available for all platforms, and supported by a vibrant community of developers. Qmethod, while released only recently, has been thoroughly validated [@Zabala-2014-a], leverages existing packages for extraction and rotation and conveniently wraps specialized Q functions. Running inside the R environment, qmethod can be easily extended, now including several functions developed for this dissertation and contributed back to the open source project. Most importantly, using R, every step from raw data to final factor interpretation can be traced back, using publicly available and vetted code, down to base functions.

Literate Programming

Machine-readable code, its results, and human-readable explanation of what that code does are often written and stored separately. Not only does such separation invite mistakes when code, result and explanation diverge, but it also makes research harder to reproduce and is conceptually flawed. At least in the context of statistical analysis, what we want a machine to do, why we want it, and to what effect are one intellectual operation, and should be presented as such. Programs must not be considered black boxes that "do" things of their own accord, to be explained ex-post, but code should be a near equivalent of the same thought expressed in prose.

Donald Knuth has formulated this central tenet of his Literate Programming approach thus:

Let us change our traditional attitude to the construction of programs: Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to humans what we want the computer to do. --- @Knuth-1984

pensieve supports this approach by offering plot and print methods suitable for inclusion in prose via the knitr R package [@knitr].

The OO System

One of the dirty secrets of quantitative work is that a good 80\% of the time is often spent on "cleaning and preparing" messy data [@dasu_exploratory_2003 as cited in @Wickham_2014_Tidy: 1]. This tedium affects the quantitative stages of studying human subjectivity as much as other analyses, and is best addressed head-on and as early as possible, to avoid downstream complications.

To that end, pensieve prescribes a large number of standardized data formats, in which otherwise messy data can be tidily stored.[^not-tidy] These standardized data formats come in the form of S3 classes, which is the simplest (and oldest) system for object-oriented (OO) programming in R. There is no need to know anything about S3 or OO in general to use and benefit from this system in pensieve: Suffice it to say that S3 is a clever way for data to know what kind of data class they are, and what kind of methods can be applied to them. For example, psLoadings "know" that they are loadings, that they must be between -1 and 1, and that they can be rotate()d.

It is easy to recognize theses classes in pensieve: they are all in CamelCase and start with a ps, as, for instance psItems. They all come with a constructor function, always by the same name (psItems()), which allows you to create such an object from appropriate inputs. These constructor functions do not do very much; they simply validate the data and assign the class.

Each class (such as psLoadings) is also accompanied by a method for the check() generic, which means that you can, well, check() any object to make sure that it conforms to the standardized data format. For example, check()ing a psLoadings object will assert that all values are between -1 and 1, as loadings must be by definition, as well as a number of other criteria. To find out about these criteria for valid data, consult the documentation for each of the classes, by entering, say help("psLoadings"). The arguments section will list all of the criteria that are also tested. If your data is partly invalid, the check()s will give you (hopefully) precise and instructive error messages for you to fix.

These check()s are run on whenever pensieve touches any of its known classes, which ensures that data are always in their proper form. This is a simple way to shoehorn type validation into S3, which ordinarily does not have such a facility.

In addition to the check() functions (which return TRUE or the error message), you can also use assert() (which only returns a message in case of an error), test() (which only returns TRUE or FALSE) or expect() (for internal usage in testing via @wickham_testthat's testthat). This follows the framework set by @lang_checkmate's checkmate, which pensieve extends.

If you only use the functions provided by pensieve, you will probably never need check() or any of its siblings: pensieve will do all necessary checking for you. If, however, you add some custom code of your own, you may want to throw in an occasional assert()ion into your script to ensure that everything is still kosher.

Aside from check()s, there are some other methods defined for common generics, defined for some of the objects, such as customized print()ing or plot()ing functions. You can find out about all these special "abilities" of the classes by reading their help.

A word of warning: you can always assign or remove classes by hand, rather than using the provided constructor functions and their included check()s, but I strongly recommend against this. Because S3 is a relatively simple and ad-hoc system, it is easy to "shoot yourself in the foot" [@wickham_advanced_2014]. Downstream pensieve functions may fail, or worse, produce wrong results if data is not provided in the precise format specified in the included classes.

This design may strike users as involved, or overly restrictive, but there is little need to worry. Most average users of pensieve will never be exposed to the OO system "behind the scenes". More advanced users with custom extensions should also not feel stifled; rather than restricting the extensibility of R, this validation infrastructure merely provides a well-defined interface for extending pensieve. With such a system in place, all users can rest assured that all reasonable precautions are taken to ensure the reliability and reproducibility of their results, as would be expected from any scientific study of human subjectivity.

[^not-tidy]: @Wickham_2014_Tidy has suggested a formal definition of tidy datasets where each row is an observation, each column a variable and each table an observational unit. While much of pensieve follows this philosophy, not all Q objects are best represented as tidy data.frames or tibbles. Where rows and columns can be meaningfully transposed or where linear algebra operations are readily applicable, as is the case for Q-sorts and downstream results, data is instead stored in matrices and higher dimensional arrays. This is in line with recent recommendations for such non-tidy data by [@leek_non-tidy_2016].

The `QStudy` List

In addition to atomic (or vector) data formats, such as psLoadings which cannot be divided up further, pensieve also provides a lot of composite (or list) data formats, which comprise of atomic classes, or other list classes. This is easy to illustrate with psItems, which must always consist of a psItemConcourse with the full text, but may also contain a psItemSample, selecting a few items from the concourse.

Higher-level classes such as psItems also come with their constructor (psItems()) and check() functions, but these expect the lower level, atomic classes as inputs. Instead of checking their validity, they check the consistency between them. For example, we know that psItemSample may only contain items for which there is also an entry in psItemConcourse; it would not make sense to have an item that is part of the sample, but not the concourse.

This structure of nested classes goes all the way up to a complete psStudy object, which includes all data relevant for, and produced in any given study.

The top-level list psStudy looks like this:

psStudy: The entire study object.
- psItems: Additional information on the items, including their full text.
- psDesign: Parameters of a study, such as the shape of the sorting distribution.
- psPeople: Additional information about the participants in the study.
- psSorts: The sorts.
- psCors: Correlations.
- ...

This is a fairly long list, although most users will not use all classes, and leave many entries blank. In fact, only very few of the elements are necessary to use the central analytical functions of the package (typeset in bold).

In addition to the hierarchical structure of the nested classes, you will also find some subclasses (typeset in italics). Subclasses are used, when there is a variant of a parent class, such as when the psConcourse consists of images (psConcourseImage).

The order of this list approximately follows the progress of most Q studies, and is also reflected in the organisation of this book. Each of the high-level list objects, including their elements, functions and substantive considerations is covered in the following chapters.

However, there is no technical reason to follow this order or structure of classes; users can organise their data in whichever form they prefer. Some functions accept higher-level, nested lists, but they always also accept the atomic form.

In addition, not all Q studies will have use for all of these objects, or their elements: the nested data structures offered in pensieve are, for the most part, strictly optional. It is, in fact, possible to run pensieve with only a minimum of these objects, emphasized in the above.