Crunch Internals
In crunch: Crunch.io Data Tools

Previous: subtotals

The Crunch web app and the crunch package are both built on a REST API. Users can interact with very large datasets because most of the heavy computation is done on the Crunch servers and not on the users computer. In most cases, you don't need to know how the API works---the R package handles the HTTP requests and responses and presents meaningful objects and methods to you. To go beyond the basics, though, it can be useful to understand how the API works so that you can make more complex or more efficient (read: faster) operations.

Catalogs

When you open a dataset in the Crunch web app, what happens is that the app sends an HTTP request to the Crunch server for information about that dataset. It gets the dataset description and a list of variables contained within that dataset, but the actual data is stored on the server and not sent to the app. The app only loads information about variables that are actually going to be displayed on the screen, which is why very large datasets load so quickly. You might have millions of rows in your dataset, but the web app only asks the server for the summary statistics it needs to display the variable card.

This kind of minimally necessary information are stored in objects called catalogs. There is a VariableCatalog, which has information about a dataset's variables; a DatasetCatalog, which has information about your datasets, and many others. You can think of catalogs just like the Sears Catalog: they are lists of things and descriptions about those things, but they are not the things themselves. You might flip through a catalog looking for the couch, but you need to make a special order if you actually want the couch delivered.

The R package relates to the Crunch API in more or less the same manner as the web app. When you load a Crunch dataset from the server you are not typically loading the actual data, but instead you are loading a catalog representation of that dataset which is stored in a list. This object includes things like the dataset identifier, the variable names and types, and the dataset dimensions, but the actual data stays on the server. When calculations are performed on that data, an HTTP request is sent to the server, the server calculates the answer, and the results are sent back to the R session. This is what allows you to calculate statistics on objects which don't fit in memory: the calculations are done remotely and only the result of that calculation is sent back to you.

Getters and Setters

Many Crunch functions both retrieve information from the server and allow the user to set information. For instance if you want to retrieve a list of datasets associated with a project you could call the [datasets()] function like this:

proj <- projects()[["Project name"]]
datasets(proj)

What happens under the hood when you run this code is that R sends an HTTP request to the server asking for the datasets associated with a particular project. This is the "getter" side of the datasets() function. However, the function also allows the user to change the datasets associated with that function using the assignment operator.

ds <- loadDataset(mydatasets[["Dataset name"]])
datasets(proj) <- ds

Internally there is actually a second method datasets<- that takes the value on the right hand side of the <- operator and posts that value to the datasets attribute of the project catalog. The projects catalog will then update to reflect that a dataset belongs to a particular catalog, and that will be reflected in the web app. Similar patterns happen when you get and set attributes on objects, like "names".

Cube Subsetting

Crunch cubes are generally used to cross tabulate different variables in a crunch dataset. For instance if you cross tabulate two variables with crtabs(~cyl + gear, data = ds), you get a 2 dimensional cube which looks just like a matrix. Cross tabbing three variables leads to a 3d cube, etc. This simple case gets quite a bit more complicated when you add array variables to the cube:

Multiple response

Multiple response variables are themselves 2d arrays with the items or responses on one dimension and the selection status on the second dimension. The result is that a categorical-by-MR cube ends up with three dimensions: the categorical variable categories, the MR items/responses, and the MR selection status.

Categorical array Categorical arrays are also represented cubes as 2d arrays. With the items/responses (the subvariable labels) on one dimension, and their categories on another dimension.

Cube to Array

The main feature of representing these counts as high dimensional arrays is that it makes a number of computational tasks a lot easier, but the problem is that you can quickly end up with cubes which are difficult for humans to conceptualize. For example if you created a cube with one MR variable which had 5 categories, one MR variable with 6 categories, and a categorical variable with 10 categories it would have 5 dimensions with 2700 (5*3 * 6*3 * 10) entries. This is hard to understand and to print, and as a result we try to figure out the parts of the cube which the user is interested in and print that sub-cube. For the purposes of this document we need three terms for cubes:

Real cube: The cube with all of the underlying data User cube: The cube the user thinks they’re interacting with. This is the same as the real cube without MR selection dimensions Printed cube The cube which prints to the console. The same as the the user cube but with missing categories suppressed.

The user cube differs from the real cube in that it doesn’t include MR selection dimensions. We assume that when the user is including an MR in a cube, they really only care about the parts of the cube where the MR is selected. So the 5d real cube listed above would become a 3d user cube with 300 entries (5 * 6 * 10) because we are selecting the slices of the cube where each MR selection dimension is equal to “Selected” and are able to drop the selections dimensions

The printed cube differs from the user cube in that it doesn’t include missing categories. We assume that the user mostly wants to look at categories which are not-missing, and that they don’t care that much about the parts of the cube which represent missing categories. If the categorical variable above had two missing categories: “No Data” and “Not Answered” it would reduce the number of entries to 240 (5 * 6 * 8) . This behavior can be changed by setting the cube@useNA slot to “always”. The function setCubeNa() handles setting this slot on the cube.

Cube subsetting

This structure gets a little tricky when we subset the cube. For instance the cube we’ve been working with has 5 dimensions, but to the user appears to only have 3, so should they subset it as a 5 dimensional array like cube[1, 1, 2:3, 1, 5] or a 3 dimensional array cube[1, 2:3, 5]. Similarly, if a dimension is missing, should that be included in the subset. For example if a cube has three entries A, B, C, along one dimension, and B is hidden, should cube[ , 1:2] return A, A & B, or A &C? Similarly should dim(cube) show the dimensions of the real cube or the printed cube?

We resolve this problem by allowing the user to select whether they want the cube to represent the user cube or the printed cube, and make all of the behavior of those cubes consistent. The cube@useNA controls whether the user wanted to interact with the printed cube or the user cube and they can showMissing() and hideMissing() functions to control which cube they are interacting with

> dim(cube)
[1] 5 6 8
> dim(showMissing(cube))
[1] 5 6 10

This makes subsetting a bit easier because we can enforce the dimensions of the printed cube. For instance

cube[5, 6, 10]

Errors because the user supplied the wrong number of dimensions while

showMissing(cube)[cube[5, 6, 10]

Returns the correct data they were looking for.

Subsetting the printed cube

There are two cases for dropping dimensions from the printed and user cubes which need special handling.

Dropping MR variable

MR dimensions are always dropped together. Consider a cube composed of a multiple response variable and a categorical variable. The real cube has three dimensions, and the printed cube has two, the first of which represents the multiple response variable. If the user asks for cube[1, ] both MR dimensions are dropped and a one dimensional cube is returned.

Subsetting dimensions with missing values

When the user is interacting with the the printed cube, they are relating to a cube without missing categories. As a result when they subset this cube, missing categories should not be carried forward. For instance if a cube has two categorical dimensions with one missing category each, dim(cube) will return c(2, 2) dim(showMissing(cube)), will return c(3, 3). If the user subsets either of these dimensions, for instance with cube[ , 2] the missing categories of the second dimension will be dropped (because the user is explicitly not selecting that category), and that dimension will be dropped. If the user wants to subset a cube to preserve missing categories, they need to subset the user cube with showMissing(cube)[, c(2, 4)]

Next: Category objects