In k-doering-NOAA/pkgsearchsandbox: What the Package Does (One Line, Title Case)

devtools::install_local(".", force = TRUE, upgrade = "never")
library(pkgsearchsandbox)
library(pkgsearch)
library(ggplot2)
devtools::install_github("rsquaredacademy/pkginfo")
library(pkginfo)

What do we think is going on with popularity, number of authors, and number of tests?

In our discussions, we thought that more popular packages likely have more authors and also more tests. It would be interesting to see if this holds in the R universe.

Number of Authors and popularity

The package {pkgsearch} allows you to obtain metadata about packages on CRAN. We can use this to look at how the number of authors and popularity relate.

To test this out, we'll start by querying for up to 200 packages that relate to ecology:

df <- pkgsearch::advanced_search("ecology", size = 200)
df

Number of Authors can be counted from Author tags within the metadata.

df <- pkgsearchsandbox::count_authors(df)
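
The internals of count_authors() are not shown here. Purely as an illustration, one way to count names in a free-text Author string might look like the sketch below. The helper name count_authors_sketch and its splitting rules are assumptions made for this example, not the package's implementation.

```r
# Hypothetical helper: NOT the implementation of pkgsearchsandbox::count_authors().
# Counts the names in DESCRIPTION-style Author strings.
count_authors_sketch <- function(author) {
  vapply(author, function(x) {
    # Missing or empty Author fields give NA rather than zero
    if (is.na(x) || !nzchar(trimws(x))) return(NA_integer_)
    # Drop annotations such as "[aut, cre]" or "(<https://orcid.org/...>)"
    x <- gsub("\\[[^]]*\\]|\\([^)]*\\)", "", x)
    # Split on commas, newlines, or the standalone word "and"
    parts <- trimws(strsplit(x, ",|\\n|\\band\\b", perl = TRUE)[[1]])
    # Count the non-empty pieces that remain
    sum(nzchar(parts))
  }, integer(1), USE.NAMES = FALSE)
}

count_authors_sketch(c(
  "Jane Doe [aut, cre], John Roe [aut]",
  "A. Person and B. Person",
  NA
))
```

Free-text Author fields are inconsistent, so a rule-based count like this is only approximate (for example, institutional authors whose names contain commas would be over-counted).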

Popularity could be indexed in many ways. The simplest, given the {pkgsearch} results, is to look at last month's downloads.

Now, we will plot number of downloads last month vs number of authors, with a smoother.

ggplot(df, aes(x = author_count, y = downloads_last_month))+
  geom_point()+
  geom_smooth()+
  theme_classic()

TODO: learn more about tests

It would be great to learn how many tests the packages have. There may be some information in the metadata. If not, we could look for it on GitHub, either in the CRAN mirror or, perhaps more appropriately, in each package's own GitHub repository. This would allow us to check whether testthat is being used, how many tests there are, whether codecov is being used and what the code coverage is, and whether any GitHub Actions are being used. Perhaps this would give us some idea of testing practices for ecology packages.
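
As a starting point, the checks listed above could be run against a package source tree that has already been cloned or downloaded locally. The sketch below is only an assumption about how that might look: the helper name summarise_tests_sketch is hypothetical, counting lines that begin with test_that( is a crude proxy for the number of tests, and the presence of a codecov.yml file is just one possible sign that codecov is in use.

```r
# Hypothetical helper: summarise testing infrastructure found in a local
# package source directory `pkg_dir` (already cloned or downloaded).
summarise_tests_sketch <- function(pkg_dir) {
  test_dir <- file.path(pkg_dir, "tests", "testthat")
  test_files <- list.files(test_dir, pattern = "\\.[Rr]$", full.names = TRUE)
  # Count lines that open a test_that() call (a rough proxy for number of tests)
  n_tests <- sum(vapply(test_files, function(f) {
    sum(grepl("^[[:space:]]*test_that\\(", readLines(f, warn = FALSE)))
  }, integer(1)))
  # GitHub Actions workflows live in .github/workflows as YAML files
  workflows <- list.files(file.path(pkg_dir, ".github", "workflows"),
                          pattern = "\\.ya?ml$")
  list(
    uses_testthat = dir.exists(test_dir),
    n_test_files = length(test_files),
    n_tests = n_tests,
    has_codecov_config = file.exists(file.path(pkg_dir, "codecov.yml")) ||
      file.exists(file.path(pkg_dir, ".codecov.yml")),
    n_workflows = length(workflows)
  )
}

# Build a tiny mock package in a temporary directory so the sketch runs offline
pkg_dir <- file.path(tempdir(), "examplepkg")
dir.create(file.path(pkg_dir, "tests", "testthat"),
           recursive = TRUE, showWarnings = FALSE)
writeLines(
  c('test_that("a works", { expect_true(TRUE) })',
    'test_that("b works", { expect_equal(1, 1) })'),
  file.path(pkg_dir, "tests", "testthat", "test-example.R")
)
str(summarise_tests_sketch(pkg_dir))
```

Actual coverage percentages cannot be read from the source tree this way; they would still have to come from a coverage service or from running the tests.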

Hypotheses

We expect more tests, more automation, and higher coverage for packages that have more downloads.

Code coverage

Here, we use get_code_coverage() from the {pkginfo} package to get code coverage values for individual packages. We were able to get code coverage values for 32 packages. Packages without a code coverage value may still have tests in their repositories. Next, we plot code coverage as a function of number of authors, number of downloads last month, and date, respectively.

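# Start with missing coverage for every package
df$code_cov <- NA_real_
for (i in seq_len(nrow(df))) {
  # A lookup that fails leaves code_cov as NA for that package
  tryCatch(
    {
      df$code_cov[i] <- as.numeric(pkginfo::get_code_coverage(df$package[i]))
    },
    error = function(e) NULL
  )
}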

subdf <- subset(df, !is.na(code_cov))
ggplot(subdf, aes(x = author_count, y = code_cov)) +
  geom_point() +
  geom_smooth(method = "lm") +
  theme_classic()

ggplot(subdf, aes(x = downloads_last_month, y = code_cov)) +
  geom_point() +
  geom_smooth(method = "lm") +
  theme_classic()

ggplot(subdf, aes(x = date, y = code_cov)) +
  geom_point() +
  geom_smooth(method = "lm") +
  theme_classic()

Histogram of code coverage:

ggplot(subdf, aes(x = code_cov)) +
  geom_histogram() +
  theme_classic()

It looks like code coverage varies widely, at least for this subset of packages.

TODO: Is there a better search term besides ecology?

Other topics: statistics (maybe too broad), marine, population dynamics (too narrow?)?

We could also take a random subset of R packages from CRAN (we are not sure how to do this with {pkgsearch}, though) and look at their statistics to get a general idea of the trend across R packages. There may be a way to do this through METACRAN. Basically, we need a list of all CRAN package names, from which we take a random subset of a certain size or percentage and then query for the author count and number of downloads.

The function tools::CRAN_package_db() may be helpful for selecting a random sample without needing to specify a search term. It lists the authors, but we may need to come up with a creative way to measure things like popularity. In addition, we may need to filter by GitHub URL first (perhaps using either the BugReports or the URL field).
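
A minimal sketch of that sampling step is shown below. It assumes a data frame with the Package, Author, URL, and BugReports columns that tools::CRAN_package_db() returns. The helper name sample_cran_packages and the small mock table are hypothetical and exist only so the example runs without internet access.

```r
# Hypothetical helper: draw a random sample of packages from a data frame
# shaped like the result of tools::CRAN_package_db().
sample_cran_packages <- function(db, n = 100, github_only = TRUE) {
  if (github_only) {
    # Keep packages whose URL or BugReports field mentions github.com
    fields <- paste(db$URL, db$BugReports)
    db <- db[grepl("github\\.com", fields, ignore.case = TRUE), , drop = FALSE]
  }
  # Sample without replacement, capped at the number of rows available
  idx <- sample.int(nrow(db), size = min(n, nrow(db)))
  db[idx, c("Package", "Author", "URL", "BugReports")]
}

# Mock table for illustration; for real use, pass
# db <- tools::CRAN_package_db() instead (requires internet access).
mock_db <- data.frame(
  Package = c("pkgA", "pkgB", "pkgC"),
  Author = c("A One", "B Two, C Three", "D Four"),
  URL = c("https://github.com/a/pkgA", NA, "https://example.org"),
  BugReports = c(NA, "https://github.com/b/pkgB/issues", NA),
  stringsAsFactors = FALSE
)

set.seed(42)  # make the draw reproducible
sample_cran_packages(mock_db, n = 2)
```

For popularity, one option might be to pass the sampled names to cranlogs::cran_downloads() with when = "last-month", although that has not been tried here.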

k-doering-NOAA/pkgsearchsandbox documentation built on May 28, 2022, 8:27 p.m.
