‘packageRank’ is an R package that helps put package download counts
into context. It does so via two functions, `cranDownloads()` and
`packageRank()`. `cranDownloads()` extends `cranlogs::cran_downloads()`
by adding a `plot()` method and a more user-friendly interface.
`packageRank()` uses rank percentiles, a nonparametric statistic that
tells you the percentage of packages with fewer downloads, to help you
see how your package is doing compared to all other CRAN packages.

NOTE: ‘packageRank’ requires an active internet connection, and relies on the ‘cranlogs’ package and on RStudio’s logs. The latter record traffic to what was previously called RStudio’s CRAN mirror and which is now called the “0-Cloud” mirror “sponsored by RStudio”. The logs for the previous day are generally posted the following day at 18:00 (GMT+1) or 17:00 UTC (GMT+2) (daylight saving time). Results for functions that rely on ‘cranlogs’ are generally available soon thereafter.

To install ‘packageRank’ from CRAN:

```
install.packages("packageRank")
```

To install the development version from GitHub:

```
# You may need to first install 'remotes' via install.packages("remotes").
remotes::install_github("lindbrook/packageRank", build_vignettes = TRUE)
```

`cranDownloads()` uses all the same arguments as
`cranlogs::cran_downloads()`:

```
cranlogs::cran_downloads(packages = "HistData")
```

```
> date count package
> 1 2020-05-01 338 HistData
```

```
cranDownloads(packages = "HistData")
```

```
> date count cumulative package
> 1 2020-05-01 338 338 HistData
```

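The only difference is that `cranDownloads()` adds four features: a
check of package names, two additional date formats, a check of date
validity, and a cumulative count column. The first catches misspelled
or unknown package names: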

```
cranDownloads(packages = "GGplot2")
```

```
## Error in cranDownloads(packages = "GGplot2") :
## GGplot2: misspelled or not on CRAN.
```

```
cranDownloads(packages = "ggplot2")
```

```
> date count cumulative package
> 1 2020-05-01 56357 56357 ggplot2
```

This also works for inactive or “retired” packages in the Archive:

```
cranDownloads(packages = "vr")
```

```
## Error in cranDownloads(packages = "vr") :
## vr: misspelled or not on CRAN/Archive.
```

```
cranDownloads(packages = "VR")
```

```
> date count cumulative package
> 1 2020-05-01 11 11 VR
```
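The idea behind the name check can be sketched in a few lines of base R. This is only an illustration: `known` is a made-up stand-in for a listing of CRAN and Archive package names, `checkNames()` is a hypothetical helper, and how ‘packageRank’ actually performs the check is not shown here.

```r
# Made-up stand-in for a listing of package names on CRAN and in the Archive.
known <- c("ggplot2", "HistData", "VR")

# Hypothetical helper: stop with an informative message if any requested
# name is not in the listing. Matching is case sensitive, so "GGplot2" fails.
checkNames <- function(pkgs, known) {
  bad <- setdiff(pkgs, known)
  if (length(bad) > 0) {
    stop(paste(bad, collapse = ", "), ": misspelled or not on CRAN/Archive.",
      call. = FALSE)
  }
  invisible(pkgs)
}

checkNames("ggplot2", known)    # passes silently
try(checkNames("GGplot2", known))  # reports the unknown name
```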

With `cranlogs::cran_downloads()`, you specify a time frame using the
`from` and `to` arguments. The downside of this is that you *must* use
the “yyyy-mm-dd” date format. For convenience’s sake and to reduce
typing, `cranDownloads()` also allows you to use “yyyy-mm” or “yyyy”
(an unquoted, numeric yyyy also works).

Let’s say you want the download counts for ‘HistData’ for the month of
February 2020. With `cranlogs::cran_downloads()`, you’d have to type
out the whole date and remember that 2020 was a leap year:

```
cranlogs::cran_downloads(packages = "HistData", from = "2020-02-01",
to = "2020-02-29")
```

With `cranDownloads()`, you can just specify the year and month:

```
cranDownloads(packages = "HistData", from = "2020-02", to = "2020-02")
```

Let’s say you want the year-to-date download counts for ‘rstan’. With
`cranlogs::cran_downloads()`, you’d type something like:

```
cranlogs::cran_downloads(packages = "rstan", from = "2020-01-01",
to = Sys.Date() - 1)
```

With `cranDownloads()`, you can just type:

```
cranDownloads(packages = "rstan", from = "2020")
```
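To make the shorthand concrete, here is a minimal base R sketch of how a partial date could be expanded into a full one. `expandDate()` is a hypothetical helper written for this illustration, not a function in ‘packageRank’, and it ignores details the package has to handle (for example, capping `to` at the most recent available log).

```r
# Hypothetical helper: expand "yyyy" or "yyyy-mm" into a full Date.
# type = "from" returns the first day of the period;
# type = "to" returns the last day of the period.
expandDate <- function(x, type = c("from", "to")) {
  type <- match.arg(type)
  x <- as.character(x)  # allows a numeric year such as 2020
  if (grepl("^[0-9]{4}$", x)) {
    if (type == "from") {
      as.Date(paste0(x, "-01-01"))
    } else {
      as.Date(paste0(x, "-12-31"))
    }
  } else if (grepl("^[0-9]{4}-[0-9]{2}$", x)) {
    first <- as.Date(paste0(x, "-01"))
    if (type == "from") {
      first
    } else {
      # First day of the next month, minus one day; handles leap years.
      seq(first, by = "month", length.out = 2)[2] - 1
    }
  } else {
    as.Date(x)
  }
}

expandDate("2020-02", type = "from")  # 2020-02-01
expandDate("2020-02", type = "to")    # 2020-02-29
expandDate(2020, type = "from")       # 2020-01-01
```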

`cranDownloads()` checks for valid dates:

```
cranDownloads(packages = "HistData", from = "2019-01-15",
to = "2019-01-35")
```

```
## Error in resolveDate(to, type = "to") : Not a valid date.
```
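A date check of this kind can be approximated in base R by parsing with an explicit format and testing for `NA`. The helper name `isValidDate()` is made up for this sketch; it is not the package’s `resolveDate()`.

```r
# Hypothetical helper: TRUE if x parses as a real "yyyy-mm-dd" calendar date.
# With an explicit format, as.Date() returns NA instead of raising an error.
isValidDate <- function(x) {
  !is.na(as.Date(x, format = "%Y-%m-%d"))
}

isValidDate("2019-01-15")  # TRUE
isValidDate("2019-01-35")  # FALSE: January has no 35th day
```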

```
cranDownloads(packages = "HistData", when = "last-week")
```

```
> date count cumulative package
> 1 2020-05-01 338 338 HistData
> 2 2020-05-02 259 597 HistData
> 3 2020-05-03 321 918 HistData
> 4 2020-05-04 344 1262 HistData
> 5 2020-05-05 324 1586 HistData
> 6 2020-05-06 356 1942 HistData
> 7 2020-05-07 324 2266 HistData
```
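The `cumulative` column in this output is the running total of `count`. A quick way to confirm that, using the seven daily counts printed above:

```r
# Daily counts for 'HistData' copied from the output above.
count <- c(338, 259, 321, 344, 324, 356, 324)

# The running total reproduces the 'cumulative' column.
cumulative <- cumsum(count)
cumulative
# 338  597  918 1262 1586 1942 2266
```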

`cranDownloads()` makes visualizing package downloads easy. Just use
`plot()`:

```
plot(cranDownloads(packages = "HistData", from = "2019", to = "2019"))
```

If you pass a vector of package names for a single day, `plot()` will
return a dotchart:

```
plot(cranDownloads(packages = c("ggplot2", "data.table", "Rcpp"),
from = "2020-03-01", to = "2020-03-01"))
```

If you pass a vector of package names for multiple days, `plot()`
defaults to using `ggplot2` facets:

```
plot(cranDownloads(packages = c("ggplot2", "data.table", "Rcpp"),
from = "2020", to = "2020-03-20"))
```

If you want to plot those data in a single frame, set
`multi.plot = TRUE`:

```
plot(cranDownloads(packages = c("ggplot2", "data.table", "Rcpp"),
from = "2020", to = "2020-03-20"), multi.plot = TRUE)
```

If you want separate plots, use `graphics = "base"` and you’ll be
prompted for each plot:

```
plot(cranDownloads(packages = c("ggplot2", "data.table", "Rcpp"),
from = "2020", to = "2020-03-20"), graphics = "base")
```

If you want those separate plots drawn on independent scales, set
`same.xy = FALSE`:

```
plot(cranDownloads(packages = c("ggplot2", "data.table", "Rcpp"),
from = "2020", to = "2020-03-20"), graphics = "base", same.xy = FALSE)
```

`packages = NULL`

`cranlogs::cran_downloads(packages = NULL)` computes the total number of
package downloads from CRAN. You can plot these data by using:

```
plot(cranDownloads(from = 2019, to = 2019))
```

`packages = "R"`

`cranlogs::cran_downloads(packages = "R")` computes the total number of
downloads of the R application (note that you can only use “R” or a
vector of package names, not both!). You can plot these data by using:

```
plot(cranDownloads(packages = "R", from = 2019, to = 2019))
```

To add a lowess smoother to your plot, use `smooth = TRUE`:

```
plot(cranDownloads(packages = "rstan", from = "2019", to = "2019"),
smooth = TRUE)
```

With graphs that use ‘ggplot2’, `se = TRUE` will add confidence
intervals:

```
plot(cranDownloads(packages = c("HistData", "rnaturalearth", "Zelig"),
from = "2020", to = "2020-03-20"), smooth = TRUE, se = TRUE)
```

To annotate a graph with a package’s release dates:

```
plot(cranDownloads(packages = "rstan", from = "2019", to = "2019"),
package.version = TRUE)
```

To annotate a graph with R release dates:

```
plot(cranDownloads(packages = "rstan", from = "2019", to = "2019"),
r.version = TRUE)
```

To plot growth curves, set `statistic = "cumulative"`:

```
plot(cranDownloads(packages = c("ggplot2", "data.table", "Rcpp"),
from = "2020", to = "2020-03-20"), statistic = "cumulative",
multi.plot = TRUE, points = FALSE)
```

To visualize a package’s downloads relative to “all” other packages over time:

```
plot(cranDownloads(packages = "HistData", from = "2020", to = "2020-03-20"),
population.plot = TRUE)
```

This longitudinal view of package downloads plots the date (x-axis) against the logarithm of a package’s downloads (y-axis). In the background, the same variables are plotted (in gray) for a stratified random sample of packages: within each 5% interval of rank percentiles (e.g., 0 to 5, 5 to 10, ..., 95 to 100), a random sample of 5% of packages is selected and tracked. This graphically approximates the “typical” pattern of downloads on CRAN for the selected time period.
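The stratified sampling described above can be sketched in base R. Everything here is an assumption made for illustration: the download counts are simulated, the package names are invented, and the code is not taken from ‘packageRank’.

```r
set.seed(1)

# Simulated, heavily skewed download counts for 1,000 made-up packages.
counts <- rgeom(1000, prob = 0.01) + 1
names(counts) <- paste0("pkg", seq_along(counts))

# Rank percentile: the percentage of packages with fewer downloads.
pct <- vapply(counts, function(x) 100 * mean(counts < x), numeric(1))

# Assign each package to one of twenty strata: [0,5), [5,10), ..., [95,100].
stratum <- cut(pct, breaks = seq(0, 100, by = 5), right = FALSE,
  include.lowest = TRUE)

# Within each stratum, draw a random 5% of its packages (rounded up).
sampled <- unlist(lapply(split(names(counts), stratum), function(pkgs) {
  if (length(pkgs) == 0) return(character(0))
  n <- ceiling(0.05 * length(pkgs))
  pkgs[sample.int(length(pkgs), n)]
}), use.names = FALSE)

length(sampled)  # number of packages that would be tracked in the background
```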

Looking at nominal download count data leads one to the “compared to what?” question. For instance, consider the data for the first week of March 2020:

```
plot(cranDownloads(packages = "cholera", from = "2020-03-01",
to = "2020-03-07"))
```

Do Wednesday and Saturday reflect surges of interest in the package or surges of traffic to CRAN? To put it differently, how can we know if a given download count is typical or unusual?

One way to answer these questions is to locate your package in the frequency distribution of download counts. Below are the distributions for Wednesday and Saturday with the location of ‘cholera’ highlighted:

As you can see, the frequency distribution of package downloads
typically has a heavily skewed, exponential shape. On the Wednesday, the
most “popular” package had 177,745 downloads while the least “popular”
package(s) had just one. This is why the left side of the distribution,
where packages with fewer downloads are located, *looks* like a vertical
line.

To see what’s going on, I take the log of download counts (x-axis) and redraw the graph. In these plots, the location of a vertical segment along the x-axis represents a download count and its height represents a download count’s frequency:

```
plot(packageDistribution(package = "cholera", date = "2020-03-04"))
```

```
plot(packageDistribution(package = "cholera", date = "2020-03-07"))
```

While these plots give us a better picture of where ‘cholera’ is located, comparisons between Wednesday and Saturday are impressionistic at best: all we can confidently say is that the download counts for both days were greater than the mode.

To facilitate interpretation and comparison, I use the *rank percentile*
of a download count in place of the nominal download count. This
nonparametric statistic tells you the percentage of packages with fewer
downloads. In other words, it gives you the location of your package
relative to the locations of all other packages. More importantly, by
rescaling download counts to lie on the bounded interval between 0 and
100, rank percentiles make it easier to compare packages within and
across distributions.

For example, we can compare Wednesday (“2020-03-04”) to Saturday (“2020-03-07”):

```
packageRank(package = "cholera", date = "2020-03-04", size.filter = FALSE)
> date packages downloads rank percentile
> 1 2020-03-04 cholera 38 5,556 of 18,038 67.9
```

On Wednesday, we can see that ‘cholera’ had 38 downloads, came in 5,556th place out of 18,038 unique packages downloaded, and earned a spot in the 68th percentile.

```
packageRank(package = "cholera", date = "2020-03-07", size.filter = FALSE)
> date packages downloads rank percentile
> 1 2020-03-07 cholera 29 3,061 of 15,950 80
```

On Saturday, we can see that ‘cholera’ had 29 downloads, came in 3,061st place out of 15,950 unique packages downloaded, and earned a spot in the 80th percentile.

So contrary to what the nominal counts tell us, one could say that the interest in ‘cholera’ was actually greater on Saturday than on Wednesday.

To compute rank percentiles, I do the following. For each package, I tabulate the number of downloads and then compute the percentage of packages with fewer downloads. Here are the details using ‘cholera’ from Wednesday as an example:

```
pkg.rank <- packageRank(packages = "cholera", date = "2020-03-04",
size.filter = FALSE)
downloads <- pkg.rank$crosstab
round(100 * mean(downloads < downloads["cholera"]), 1)
> [1] 67.9
```

To put it differently:

```
(pkgs.with.fewer.downloads <- sum(downloads < downloads["cholera"]))
> [1] 12250
(tot.pkgs <- length(downloads))
> [1] 18038
round(100 * pkgs.with.fewer.downloads / tot.pkgs, 1)
> [1] 67.9
```
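The same computation on a toy example may make the definition easier to see. The five packages and their counts below are invented; `rankPercentile()` is a small helper defined only for this illustration.

```r
# Invented download counts for five made-up packages.
downloads <- c(a = 1, b = 1, c = 5, d = 20, e = 300)

# Percentage of packages with strictly fewer downloads than 'pkg'.
rankPercentile <- function(pkg, x) {
  round(100 * mean(x < x[pkg]), 1)
}

rankPercentile("d", downloads)  # 60: three of five packages have fewer
rankPercentile("a", downloads)  # 0: no package has fewer
rankPercentile("b", downloads)  # 0: tied packages share a percentile
rankPercentile("e", downloads)  # 80: the top package never reaches 100
```

Because the comparison is strict, tied packages always receive the same rank percentile.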

In the example above, 38 downloads puts ‘cholera’ in 5,556th place among the 18,038 packages downloaded. This rank is “nominal” because it’s possible that multiple packages can have the same number of downloads. As a result, a package’s nominal rank (but not its rank percentile) can be affected by its name: packages with the same number of downloads are sorted in alphabetical order. Thus, ‘cholera’ benefits from the fact that it is 31st in the list of 263 packages with 38 downloads:

```
pkg.rank <- packageRank(packages = "cholera", date = "2020-03-04",
size.filter = FALSE)
downloads <- pkg.rank$crosstab
which(names(downloads[downloads == 38]) == "cholera")
> [1] 31
length(downloads[downloads == 38])
> [1] 263
```
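A small sketch shows how alphabetical tie-breaking moves the nominal rank but leaves the rank percentile alone. The counts and the extra package names are invented, and ordering by descending count and then by name is an assumption based on the description above, not code from the package.

```r
# Invented counts: three packages tie at 38 downloads.
downloads <- c(zeta = 38, cholera = 38, alpha = 38, big = 500, tiny = 2)

# Nominal rank: sort by descending downloads, then alphabetically by name.
ord <- order(-downloads, names(downloads))
nominal.rank <- setNames(seq_along(ord), names(downloads)[ord])
nominal.rank
# big = 1, alpha = 2, cholera = 3, zeta = 4, tiny = 5

# Rank percentile depends only on the count, so all three ties get 20.
tied <- c("alpha", "cholera", "zeta")
vapply(tied, function(p) 100 * mean(downloads < downloads[p]), numeric(1))
```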

To visualize `packageRank()`, use `plot()`.

```
plot(packageRank(packages = "cholera", date = "2020-03-04"))
```

```
plot(packageRank(packages = "cholera", date = "2020-03-07"))
```

These graphs, customized to be on the same scale, plot the *rank order*
of packages’ download counts (x-axis) against the logarithm of those
counts (y-axis). They also highlight a package’s position in the
distribution along with its rank percentile and download count (in red).
In the background, the 75th, 50th and 25th percentiles are plotted as
dotted vertical lines. The package with the most downloads, ‘magrittr’
in both cases, is at top left (in blue). The total number of downloads
is at the top right (in blue).

`packageDistribution()`, `packageRank()` and `packageLog()` have a
‘size.filter’ argument that removes downloads smaller than 1000 bytes.
This can provide a more accurate count of package downloads. For
example, here is a raw download count:

```
packageRank(packages = "HistData", date = "2019-10-30", size.filter = FALSE)
> date packages downloads rank percentile
> 1 2019-10-30 HistData 403 794 of 17,396 95.4
```

Below is a filtered count.

```
packageRank(packages = "HistData", date = "2019-10-30", size.filter = TRUE)
> date packages downloads rank percentile
> 1 2019-10-30 HistData 382 796 of 15,330 94.8
```

Besides a difference of 21 downloads, notice that the number of unique packages downloaded falls from 17,396 to 15,330.
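The effect of such a filter can be sketched on a tiny, invented log. The data frame below only mimics the `package` and `size` columns of a download log; the rows, the package ‘tinyonly’, and the filtering code are assumptions for illustration rather than the package’s own implementation.

```r
# Invented log entries: one row per download, size in bytes.
log <- data.frame(
  package = c("HistData", "HistData", "HistData", "cholera", "cholera",
    "tinyonly"),
  size = c(524, 951234, 950001, 310, 4100000, 400)
)

# Raw counts: every row counts as a download.
raw <- table(log$package)

# Filtered counts: drop downloads smaller than 1000 bytes.
filtered <- table(log$package[log$size >= 1000])

raw       # cholera 2, HistData 3, tinyonly 1
filtered  # cholera 1, HistData 2; 'tinyonly' drops out entirely
```

A package whose only downloads are “small” disappears from the filtered tally, which is one way the number of unique packages can fall.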

By default, `size.filter = TRUE` for `packageRank()` while
`size.filter = FALSE` for `packageDistribution()` and `packageLog()`.
For details about “small” downloads see the “Inflationary Bias of
Download Counts” section of this post on the R-hub blog.

To avoid the bottleneck of downloading multiple log files,
`packageRank()` is currently limited to individual days. However, to
reduce the need to re-download logs, ‘packageRank’ makes use of
memoization via the ‘memoise’ package.

Here’s the relevant code:

```
# Note that data.table::fread() relies on R.utils::decompressFile().
fetchLog <- function(url) data.table::fread(url)
mfetchLog <- memoise::memoise(fetchLog)

if (RCurl::url.exists(url)) {
  cran_log <- mfetchLog(url)
}
```

If you use `fetchLog()`, the log file, which can be upwards of 50 MB,
will be downloaded each time you call the function. But if you use
`mfetchLog()`, the logs are intelligently cached; those that have
already been downloaded, in your current R session, will not be
downloaded again.
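To see what memoization buys you without touching the network, here is a simplified base R stand-in. `memoize()` and `slowFetch()` are hypothetical names invented for this sketch, and the URL is a placeholder; the ‘memoise’ package generalizes the same idea by caching on all of a function’s arguments.

```r
# Simplified stand-in for memoization: cache results by a single string key.
memoize <- function(f) {
  cache <- new.env(parent = emptyenv())
  function(key) {
    if (!exists(key, envir = cache, inherits = FALSE)) {
      assign(key, f(key), envir = cache)  # first call: compute and store
    }
    get(key, envir = cache, inherits = FALSE)  # later calls: reuse
  }
}

# Pretend download that counts how often it actually runs.
calls <- 0
slowFetch <- function(url) {
  calls <<- calls + 1
  paste("contents of", url)
}
mslowFetch <- memoize(slowFetch)

url <- "http://example.com/2020-03-04.csv.gz"  # placeholder URL
first <- mslowFetch(url)
second <- mslowFetch(url)

identical(first, second)  # TRUE
calls                     # 1: the second call was served from the cache
```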

lindbrook/packageRank documentation built on Sept. 22, 2020, 3:39 p.m.
