devtools::install_local(".", force = TRUE, upgrade = "never")
library(pkgsearchsandbox)
library(pkgsearch)
library(ggplot2)
devtools::install_github("rsquaredacademy/pkginfo")
library(pkginfo)
In our discussions, we thought that more popular packages likely have more authors and also more tests. It would be interesting to see if this holds in the R universe.
The package {pkgsearch} allows you to obtain metadata about packages on CRAN. We can use this to look at how the number of authors relates to popularity.
To test this out, we'll start by querying for up to 200 packages that relate to ecology:
df <- pkgsearch::advanced_search("ecology", size = 200)
df
The number of authors can be counted from the Author tags within the metadata.
df <- pkgsearchsandbox::count_authors(df)
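The internals of count_authors() are not shown here, so the following is only a rough sketch of how authors might be counted from a free-text Author string. The helper name count_authors_naive() is hypothetical and is not part of pkgsearchsandbox. It assumes roles appear in square brackets, comments such as ORCID links appear in parentheses, and names are separated by commas, semicolons, ampersands, or the word "and".

```r
# Hypothetical helper (an assumption, not the pkgsearchsandbox implementation):
# count the names in a DESCRIPTION-style Author string.
count_authors_naive <- function(author) {
  vapply(author, function(x) {
    if (is.na(x) || !nzchar(trimws(x))) return(NA_integer_)
    # Drop role tags like [aut, cre], comments like (<https://...>), and <emails>
    x <- gsub("\\[[^]]*\\]", "", x)
    x <- gsub("\\([^)]*\\)", "", x)
    x <- gsub("<[^>]*>", "", x)
    # Split on commas, semicolons, ampersands, or the standalone word "and"
    parts <- strsplit(x, ",|;|&|\\band\\b", perl = TRUE)[[1]]
    # Count the non-empty pieces that remain
    sum(nzchar(trimws(parts)))
  }, integer(1), USE.NAMES = FALSE)
}

count_authors_naive(c(
  "Hadley Wickham [aut, cre], Winston Chang [aut], RStudio [cph, fnd]",
  "Jari Oksanen, Gavin Simpson and Helene Wagner"
))
```

A count like this treats organizations and funders as authors and will miscount unusual formatting, so it is best read as an approximation. Where the Author string sits in the {pkgsearch} result is not assumed here; the helper only needs a character vector.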
Popularity could be indexed in many ways. The simplest, given the {pkgsearch} results, is to look at last month's downloads.
Now we plot the number of downloads last month against the number of authors, with a smoother.
ggplot(df, aes(x = author_count, y = downloads_last_month)) +
  geom_point() +
  geom_smooth() +
  theme_classic()
It would be great to learn how much testing is done for these packages. Maybe there is some information in the metadata. If not, we could look for it on GitHub, either in the CRAN mirror or, perhaps more appropriately, in each package's own GitHub repository. This would allow us to check whether testthat is being used, how many tests there are, whether codecov is being used and what the coverage is, and whether any GitHub Actions are being used. Perhaps this would give us some idea of testing practices for ecology packages.
We expect more tests, more automation, and higher coverage for packages that have more downloads.
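As a minimal sketch of the repository checks described above, the helper below inspects a local copy of a package repository. The name repo_test_signals() is hypothetical, and the paths it looks for (tests/testthat, codecov.yml or .codecov.yml, .github/workflows) are common conventions rather than guarantees, so a package could have tests or CI that this misses. It assumes the repository has already been cloned or downloaded to `path`.

```r
# Hypothetical helper: summarise testing signals in a local package repository.
repo_test_signals <- function(path) {
  test_dir <- file.path(path, "tests", "testthat")
  test_files <- if (dir.exists(test_dir)) {
    list.files(test_dir, pattern = "^test.*\\.[rR]$", full.names = TRUE)
  } else {
    character(0)
  }
  # Count lines containing a test_that() call as a rough proxy for number of tests
  n_tests <- sum(vapply(test_files, function(f) {
    sum(grepl("test_that\\(", readLines(f, warn = FALSE)))
  }, integer(1)))
  # GitHub Actions workflows conventionally live in .github/workflows
  workflows <- list.files(file.path(path, ".github", "workflows"),
                          pattern = "\\.ya?ml$")
  data.frame(
    uses_testthat = length(test_files) > 0,
    n_test_files = length(test_files),
    n_test_that_calls = n_tests,
    has_codecov_config = any(file.exists(
      file.path(path, c("codecov.yml", ".codecov.yml"))
    )),
    n_github_workflows = length(workflows)
  )
}

# Usage, assuming a repository was cloned to "path/to/repo" (placeholder path):
# repo_test_signals("path/to/repo")
```

Returning a one-row data frame makes it easy to bind the results for many packages and join them back to the search results by package name.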
Here, we use get_code_coverage() from the pkginfo package to get code coverage values for individual packages. We were able to get code coverage values for 32 packages. However, packages without a code coverage value may still have tests in their repositories. Now, we will plot code coverage as a function of number of authors, number of downloads last month, and date, respectively.
# Look up code coverage for each package; leave NA where the lookup fails
df$code_cov <- NA_real_
for (i in seq_len(nrow(df))) {
  tryCatch(
    {
      df$code_cov[i] <- as.numeric(pkginfo::get_code_coverage(df$package[i]))
    },
    error = function(e) {}
  )
}

# Keep only the packages with a coverage value
subdf <- subset(df, !is.na(code_cov))

# Code coverage vs. number of authors
ggplot(subdf, aes(x = author_count, y = code_cov)) +
  geom_point() +
  geom_smooth(method = "lm") +
  theme_classic()

# Code coverage vs. downloads last month
ggplot(subdf, aes(x = downloads_last_month, y = code_cov)) +
  geom_point() +
  geom_smooth(method = "lm") +
  theme_classic()

# Code coverage vs. date
ggplot(subdf, aes(x = date, y = code_cov)) +
  geom_point() +
  geom_smooth(method = "lm") +
  theme_classic()
Histogram of code coverage:
ggplot(subdf, aes(x = code_cov)) + geom_histogram() + theme_classic()
It looks like code coverage is all over the place, at least for this subset of packages.
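One way to put a number on that visual impression would be a rank correlation, which does not assume a linear relationship and is less sensitive to the heavy skew typical of download counts. The sketch below uses a hypothetical helper, rank_association(), built on stats::cor.test() with the Spearman method. It is an optional follow-up, not part of the analysis above.

```r
# Hypothetical helper: Spearman rank correlation between two columns of a data frame.
rank_association <- function(data, x, y) {
  # Use only rows where both variables are present
  ok <- stats::complete.cases(data[, c(x, y)])
  test <- stats::cor.test(data[[x]][ok], data[[y]][ok],
                          method = "spearman", exact = FALSE)
  data.frame(
    x = x,
    y = y,
    n = sum(ok),
    rho = unname(test$estimate),
    p_value = test$p.value
  )
}

# Usage with the columns plotted above (not run here):
# rank_association(subdf, "downloads_last_month", "code_cov")
# rank_association(subdf, "author_count", "code_cov")
```

With only a few dozen packages that have a coverage value, any estimate from this would be quite uncertain.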
Other topics: statistics (maybe too broad), marine, population dynamics (too narrow?)?
We could also consider taking a random subset of R packages from CRAN (not sure how to do this with pkgsearch, though) and looking at the same statistics to get a general idea of the trend across R packages more broadly. There may be a way to do this through METACRAN. Basically, we just need a list of all CRAN package names, a random subset of a certain size or percentage, and a query for the author count and number of downloads of each sampled package.
The function tools::CRAN_package_db() may be helpful for selecting a random sample without needing to specify a search term. The authors are listed, but we may need to come up with a creative way to look at things like popularity. In addition, we may need to filter by GitHub URL first (maybe using either the BugReports or the URL field).
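A minimal sketch of that sampling step follows. The helper name sample_github_packages() is hypothetical. It assumes the data frame has the Package, URL, and BugReports columns that tools::CRAN_package_db() returns, and it treats any mention of github.com in either field as a GitHub link.

```r
# Hypothetical helper: draw a random sample of packages that list a GitHub link.
sample_github_packages <- function(db, n, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  # grepl() returns FALSE for missing values, so NA fields simply do not match
  on_github <- grepl("github\\.com", db$URL, ignore.case = TRUE) |
    grepl("github\\.com", db$BugReports, ignore.case = TRUE)
  # De-duplicate by name in case a package appears more than once
  candidates <- db[on_github & !duplicated(db$Package), , drop = FALSE]
  # Sample without replacement, capped at the number of candidates
  idx <- sample(nrow(candidates), min(n, nrow(candidates)))
  candidates[idx, , drop = FALSE]
}

# Usage (needs internet access, so not run here):
# db <- tools::CRAN_package_db()
# pkgs <- sample_github_packages(db, n = 200, seed = 1)
# pkgs$Package
```

For popularity, one option might be the cranlogs package, whose cran_downloads() function reports download counts from the RStudio CRAN mirror. Whether that is a fair index of popularity is a separate question.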