Fork and Merge a Dataset
In crunch: Crunch.io Data Tools

One of the main benefits of Crunch is that it lets analysts and clients work with the same datasets. Instead of emailing datasets to clients, you can update the live dataset and ensure that they will see see the most up-to-date information. The potential problem with this setup is that it can become difficult to make provisional changes to the dataset without publishing it to the client. Sometimes an analyst wants to investigate or experiment with a dataset without the risk of sending incorrect or confusing information to the end user This is why we implemented a fork-edit-merge workflow for Crunch datasets.

"Fork" originates in computer version control systems and just means to take a copy of something with the intention of making some changes to the copy, and then incorporating those changes back into the original. A helpful mnemonic is to think of a path which forks off from the main road, but then rejoins it later on. To see how this works lets first upload a new dataset to Crunch.

library(crunch)

library(crunch)
set_crunch_opts("crunch.api" = "https://team.crunch.io/api/")
library(httptest)
run_cleanup <- !dir.exists("fork-and-merge")
ds_to_delete <- c()
httpcache::clearCache()
start_vignette("fork-and-merge")


# The usual redactor does not work, because the end of the UUIDs for variables
# are stable between datasets (the first variable ends with "000000")
# and so we have a clash when using variables from merged dataests
# So this redactor should be the same as the other except for the variable UUID handling
# (For context, we have to redact so we don't get paths that are too long to subtmit to CRAN)
#
# Eventually this should probably be the default, but it only really affects
# this vignette and I don't feel like fixing everything (other vignettes, as well as crplyr?)
set_redactor(function(response) {
    ## Remove multipart form fields because POST sources/ sends a tmpfile path
    ## that's different every time, so the request will never match.
    response$request$fields <- NULL
    response %>%
        redact_auth() %>%
        gsub_response("([0-9a-f]{6})[0-9a-f]{26}", "\\1") %>% ## Prune UUIDs
        httptest::gsub_response("([0-9A-Za-z]{4})[0-9A-Za-z]{18}[0-9]{4}([0-9]{2})", "\\1\\2") %>% # UUIDs in variables now too
        gsub_response(
            "https.//(app|team).crunch.io/api/progress/[^\"].*?/",
            "https://\\1.crunch.io/api/progress/"
        ) %>% ## Progress is meaningless in mocks
        gsub_response("https.//(app|team).crunch.io", "") %>% ## Shorten URL
        gsub_response("https%3A%2F%2F(app|team).crunch.io", "") ## Shorten encoded URL
})
set_requester(function(request) {
    request$fields <- NULL
    request %>%
        gsub_request("https.//(app|team).crunch.io", "") ## Shorten URL
})

my_project <- newProject("examples/rcrunch vignette data")
ds <- newDataset(SO_survey, "stackoverflow_survey", project = my_project)

# Force catalog so that catalog is loaded at a consistent time
ds <- forceVariableCatalog(ds)
ds_to_delete[length(ds_to_delete) + 1] <- self(ds)

Imagine that this dataset is shared with several users, and you want to update it without affecting their usage. You might also want to consult with other analysts or decision makers to make sure that the data is accurate before sharing it with clients. To do this you call forkDataset() to create a copy of the dataset. The fork is placed in a folder following the same logic as new datasets, requiring you to specify a project folder unless you have option R_CRUNCH_DEFAULT_PROJECT set.

forked_ds <- forkDataset(ds, project = "examples/rcrunch vignette data")

# Force catalog so that catalog is loaded at a consistent time
forked_ds <- forceVariableCatalog(forked_ds)
ds_to_delete[length(ds_to_delete) + 1] <- self(forked_ds)

You now have a copied dataset which is identical to the original, and are free to make changes without fear of disrupting the client's experience. You can add or remove variables, delete records, or change the dataset's organization. These changes will be isolated to your fork and won't be visible to the end user until you decide to merge the fork back into the original dataset. This lets you edit the dataset with confidence because your work is isolated.

In this case, let's create a new categorical array variable.

change_state()

forked_ds$ImportantHiringCA <- makeArray(forked_ds[, c("ImportantHiringTechExp", "ImportantHiringPMExp")],
    name = "importantCatArray")

# Force catalog so that catalog is loaded at a consistent time
forked_ds <- forceVariableCatalog(forked_ds)

Our forked dataset has diverged from the original dataset. Which we can see by comparing their names.

all.equal(names(forked_ds), names(ds))

You can work with the forked dataset as long as you like, if you want to see it in the web App or share it with other analysts by you can do so by calling webApp(forked_ds). You might create many forks and discard most of them without merging them into the original dataset.

If you do end up with changes to the forked dataset that you want to include in the original dataset you can do so with the mergeFork() function. This function figures out what changes you made the fork, and then applies those changes to the original dataset.

change_state()

ds <- mergeFork(ds, forked_ds)

After merging the original dataset includes the categorical array variable which we created on the fork.

ds$ImportantHiringCA

It's possible to to make changes to a fork which can't be easily merged into the original dataset. For instance if, while we were working on this fork someone added another variable called ImportantHiringCA to the original dataset the merge might fail because there's no safe way to reconcile the two forks. This is called a "merge conflict" and there are a couple best practices that you can follow to avoid this problem:

1) Make minimal changes to dataset forks. Instead of making lots of changes to a fork, make a couple of small change to the fork, merge it back into the original dataset, and a create a new fork for the next set of changes 2) Have other analysts work on their own forks. It's easier to avoid conflicts if each member of the team makes changes to their own fork and then periodically merges those changes back into the original dataset. This lets you coordinate the order that you want to apply changes to the original dataset, and so avoid some merge conflicts.

Appending data

Another good use of the fork-edit-merge workflow is when you want to append data to an existing dataset. When appending data you usually want to check that the append operation completed successfully before publishing the data to users. This might come up if you are adding a second wave of a survey, or including some additional data which came in after the dataset was initially sent to clients. The first step is to upload the second survey wave as its own dataset.

change_state()

wave2 <- newDataset(SO_survey, "SO_survey_wave2", project = my_project)

We then fork the original dataset and append the new wave onto the forked dataset.

change_state()
ds_to_delete[length(ds_to_delete) + 1] <- self(wave2)

ds_fork <- forkDataset(ds, project = "examples/rcrunch vignette data")
ds_fork <- appendDataset(ds_fork, wave2)

ds_fork now has twice as many rows as ds which we can verify with nrow:

nrow(ds)
nrow(ds_fork)

Once we've confirmed that the append completed successfully we can merge the forked dataset back into the original one.

change_state()
ds_to_delete[length(ds_to_delete) + 1] <- self(ds_fork)

ds <- mergeFork(ds, ds_fork)

ds now has the additional rows.

nrow(ds)

Merging datasets

Merging two datasets together can often be the source of unexpected behavior like misaligning or overwriting variables, and so it's a good candidate for this workflow. Let's create a fake dataset with household size to merge onto the original one.

change_state()

house_table <- data.frame(Respondent = unique(as.vector(ds$Respondent)))
house_table$HouseholdSize <- sample(
    1:5,
    nrow(house_table),
    TRUE
)
house_ds <- newDataset(house_table, "House Size", project = my_project)

There are a few reasons why we might not want to merge this new table onto our user facing data. For instance we might make a mistake in constructing the table, or have some category names which don't quite match up. Merging the data onto a forked dataset again gives us the safety to make changes and verify accuracy without affecting client-facing data.

ds_fork <- forkDataset(ds, project = "examples/rcrunch vignette data")
ds_fork <- merge(ds_fork, house_ds, by = "Respondent")

Before merging the fork back into the original dataset, we can check that everything went well with the join.

change_state()
ds_to_delete[length(ds_to_delete) + 1] <- self(house_ds)
ds_to_delete[length(ds_to_delete) + 1] <- self(ds_fork)

crtabs(~ TabsSpaces + HouseholdSize, ds_fork)

And finally once we're comfortable that everything went as expected we can send the data to the client by merging the fork back to the original dataset.

change_state()

ds <- mergeFork(ds, ds_fork)
ds$HouseholdSize

Conclusion

Forking and merging datasets is a great way to make changes to the data. It allows you to verify your work and get approval before putting the data in front of clients, and gives you the freedom to make changes and mistakes without worrying about disrupting production data.

end_vignette()
if (run_cleanup) {
    for (ds_id in ds_to_delete) with_consent(deleteDataset(ds_id))
    with_consent(delete(my_project))
}

Any scripts or data that you put into this service are public.

crunch documentation built on Aug. 21, 2025, 5:45 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

crunch
Crunch.io Data Tools

Fork and Merge a Dataset
In crunch: Crunch.io Data Tools

Appending data

Merging datasets

Conclusion

Try the crunch package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

crunch Crunch.io Data Tools

Fork and Merge a Dataset In crunch: Crunch.io Data Tools

Appending data

Merging datasets

Conclusion

Try the crunch package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

crunch
Crunch.io Data Tools

Fork and Merge a Dataset
In crunch: Crunch.io Data Tools