In VectorPosse/datafinder: Find And Explore Currently Installed Data Sets

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Thanks for choosing to use datafinder as an exploratory tool for your next project! As with any package, there are many functions and the best way to acquaint yourself with them is through our documentation and this vignette. Here we will discuss the entire datafinder pipeline from project definition to hypothesis testing.

Let's make sure that the package is loaded and attached with:

library(datafinder)
library(mosaic)     # Provides the formula interface to prop.test used later on

Also, we expect that you have a project in mind. For this example, we are going to look for data that gives us a two-proportion test with a significant p-value. This might be useful for statistics educators, but if your project is more research-based, make sure that the data you're using satisfies all necessary conditions and that you're not just cherry-picking data that fits your assumption.

Discovering Loaded Packages

The first thing we need to do is see which of our installed packages contain data. We do this by running the discover_packages() command from the R console.

tail(discover_packages(), 10)

Running this command on my machine yields a long list of packages with the number of data sets in each. The command above shows just the last 10 of them. For this example I'm going to explore the MASS package, but for your project you may choose another. You may also have to explore a few packages before you find the right data.
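For readers curious about what such a listing involves, a similar table can be approximated with base R alone. The sketch below is not datafinder's implementation. It assumes that counting the entries returned by utils::data() for each package is a reasonable stand-in, and the variable names are illustrative.

```r
# Base R sketch: count the data sets shipped with each installed package.
# Not datafinder's implementation; the names below are illustrative.
installed <- .packages(all.available = TRUE)

# data(package = ...) returns a listing whose $results matrix has the
# columns "Package", "LibPath", "Item", and "Title".
listing <- suppressWarnings(utils::data(package = installed)$results)

# One row per data set, so tabulating the Package column gives the counts.
dataset_counts <- sort(table(listing[, "Package"]), decreasing = TRUE)
head(dataset_counts, 10)
```

The counts may differ slightly from those reported by discover_packages(), depending on how each tool treats aliases and multi-object data files.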

Exploring A Single Package

Now let's explore the MASS package. It's really easy to look at the data in the package with the explore_package() function.

explore_package(MASS)[7:13,]

Again, this function gives a large list of data sets so we're just showing some of them. For this project we're going to choose the Melanoma data set.
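If you want to confirm a data set's presence without datafinder, base R can produce a comparable listing. This is only a sketch under the assumption that MASS is installed; mass_data is an illustrative name, and the output will not match the columns that explore_package() returns.

```r
# Base R sketch: list the data sets in MASS with their titles.
mass_data <- utils::data(package = "MASS")$results[, c("Item", "Title")]
head(mass_data)

# Check that the data set chosen for this example appears in the listing.
"Melanoma" %in% mass_data[, "Item"]
```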

Visualizing The Data Frame

Now it's time to look at the whole data frame and get a sense of the data contained within. To do this, we can use the visualize_dataframe() command. This command can sometimes take a while to run, but rest assured it's running.

visualize_dataframe(Melanoma)

This plot gives us a 7 by 7 grid in which each variable is plotted against each of the others. If you have more than 10 variables in a data set, it might be best to use the override to plot more variables. The syntax for the override would be visualize_dataframe(Melanoma, override = TRUE). In this case overriding does nothing, but it would for a wider data frame (one with more variables).

Additionally, we can see that some of these variables are actually factor variables, not numerical ones. As such, let's make them into factors and visualize the data frame again. The status variable is not binary, but we'd prefer it to be for our later two-proportion test. Let's combine "alive" and "died from other causes" into a category called "other".

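# Convert the integer-coded categorical columns to factors
cols <- c("status", "sex", "ulcer")
Melanoma[cols] <- lapply(Melanoma[cols], factor)
rm(cols)
# Merge "alive" (2) and "died from other causes" (3) into a single "other" level
levels(Melanoma$status) <- c("died from melanoma", "other", "other")
visualize_dataframe(Melanoma)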

Now we get plots that are better suited to categorical data. Since our plan is to run a two-proportion test, I think it would be best to choose the status and sex variables. The question we'd like to ask is: Is there a difference between the rates at which women and men in Denmark die from malignant melanoma?

Running The Test

Now that we have our variables, we first need to check the conditions of the model we plan to use. For a two-proportion test the conditions are:

Random
- We have no information about how these samples were obtained. We hope the 126 female patients and 79 male patients are representative of other Danish patients with malignant melanoma.
10%
- We don't know exactly how many people in Denmark suffer from malignant melanoma, but we could imagine over time it's more than 1260 females and 790 males.
Success/Failure
- Checking the contingency table of counts for status by sex, we see the numbers 28 and 98 (the successes and failures among females), and 29 and 50 (the successes and failures among males). These are all larger than 10.
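The counts used in the success/failure check can be reproduced with a base R contingency table. The sketch below works on a separate copy of MASS::Melanoma (mel is an illustrative name) so that it does not depend on the recoding done earlier, and it assumes the coding documented for that data set: sex 0 = female and 1 = male, status 1 = died from melanoma.

```r
# Base R sketch of the counts behind the success/failure check.
# Works on a copy (`mel`, an illustrative name) of MASS::Melanoma.
mel <- MASS::Melanoma

# status: 1 = died from melanoma, 2 = alive, 3 = died from other causes
mel$status <- factor(mel$status)
levels(mel$status) <- c("died from melanoma", "other", "other")

# sex: 0 = female, 1 = male
mel$sex <- factor(mel$sex, levels = c(0, 1), labels = c("female", "male"))

# Rows are status, columns are sex.
counts <- table(status = mel$status, sex = mel$sex)
counts

# Every cell should hold at least 10 observations.
all(counts >= 10)
```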

Now that the conditions are suitably checked, we can see if our data gives us what we set out to find: a statistically significant two-proportion test.

status_sex_test <- prop.test(status ~ sex, data = Melanoma)
status_sex_test$statistic
status_sex_test$p.value

The test yields a chi-squared statistic of 4.38 and a p-value of 0.036. We therefore found exactly what we were looking for: a statistically significant p-value from a two-proportion test.
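As a cross-check, the same result can be recovered from the raw counts quoted in the success/failure condition, without the formula interface. This sketch calls stats::prop.test explicitly because mosaic masks prop.test; the vectors deaths and patients are illustrative names, and the agreement with the reported statistic assumes the default continuity correction.

```r
# Cross-check using base R's prop.test on the counts quoted earlier.
deaths   <- c(female = 28, male = 29)    # died from melanoma
patients <- c(female = 126, male = 79)   # total patients of each sex

# stats::prop.test applies Yates' continuity correction by default.
check <- stats::prop.test(x = deaths, n = patients)
round(unname(check$statistic), 2)
round(check$p.value, 3)
```

Passing correct = FALSE would give a somewhat larger statistic, so the correction setting matters when comparing results across tools.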

Conclusion

This vignette shows how a user of the datafinder package might start with the seed of an idea and work all the way through to finding a suitable example. In reality, this process involves a lot of trial and error, but our functions help speed it up by allowing users to discover packages, explore a package, and visualize a data frame. We hope that this was a useful resource, and we're always open to feedback at our GitHub project page.

VectorPosse/datafinder documentation built on May 9, 2019, 9:43 p.m.
