library(knitr)
opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(scales)
library(ggplot2)
library(data.table)
library(hutils)
library(hutilscpp)
library(fst)
library(magrittr)
library(PSMA)
psma_env <- getOption("PSMA_env", new.env())
dt <- "ADDRESS_DETAIL_ID__by__LATLON"
if (exists(dt, envir = psma_env)) {
  x <- get(dt, envir = psma_env, inherits = FALSE)
} else {
  x <- fst::read_fst(system.file("extdata", "address2.fst",
                                 package = "PSMA",
                                 mustWork = TRUE),
                     as.data.table = TRUE)
  x[, "LATITUDE" := lat_int + lat_rem / 10^7]
  x[, "LONGITUDE" := lon_int + lon_rem / 10^7]
  assign(dt,
         value = x,
         envir = psma_env)
}
ADDRESS_DETAIL_ID__by__LATLON <- x

Naively reverse geocoding is a bad idea

Consider the problem of geocoding the City of Melbourne's parking spaces.

latlon_by_bayid <- fread(system.file("extdata", "MEL-latlon_by_bayid.tsv", package = "PSMA"))

There are r comma(nrow(ADDRESS_DETAIL_ID__by__LATLON)) addresses in the PSMA data base. With r nrow(latlon_by_bayid) parking spaces to geocode, a naive cross join requires r round(nrow(ADDRESS_DETAIL_ID__by__LATLON) * as.double(nrow(latlon_by_bayid)) / 1e9, 1) billion distance calculations.

The package hutils provides a lightweight haversine_distance function that only uses primitive functions. It's essentially as fast as it can be. Yet, if we extrapolate from the cost of a naive way to reverse geocode the first parking bay,

bench_naive_distance_cj <- 
   bench::system_time({
     haversine_distance(latlon_by_bayid$lat[1], 
                        latlon_by_bayid$lon[1], 
                        ADDRESS_DETAIL_ID__by__LATLON$LATITUDE,
                        ADDRESS_DETAIL_ID__by__LATLON$LONGITUDE) %>%
       which.min
   })
bench_naive_distance_cj * nrow(latlon_by_bayid)

we see that this is not very satisfactory.

Trim and trim

We can do better with little work.

A priori, we know that these parking spots are in Melbourne. In particular, we know that all the possible addresses are going to be in a small, compact area. So let's restrict the search to addresses within 6 minutes of a degree around the parking locations.

bench::system_time({
  min_lat <- latlon_by_bayid[, min(lat)] - 0.1
  max_lat <- latlon_by_bayid[, max(lat)] + 0.1
  min_lon <- latlon_by_bayid[, min(lon)] - 0.1
  max_lon <- latlon_by_bayid[, max(lon)] + 0.1

  addresses_near_MEL <- 
    ADDRESS_DETAIL_ID__by__LATLON %>%
    .[LATITUDE %between% c(min_lat, max_lat)] %>%
    .[LONGITUDE %between% c(min_lon, max_lon)]
})
bench_nearby_distance_cj <- 
   bench::system_time({
     haversine_distance(latlon_by_bayid$lat[1], 
                        latlon_by_bayid$lon[1], 
                        addresses_near_MEL$LATITUDE,
                        addresses_near_MEL$LONGITUDE) %>%
       which.min
   })
bench_nearby_distance_cj * nrow(latlon_by_bayid)

There is another potential for performance improvement that may be peculiar to the PSMA data: Many of the addresses have identical latitude and longitude:

addresses_near_MEL %>%
  .[, lapply(.SD, round, 5)] %>%
  .[, lapply(.SD, uniqueN), .SDcols = -1] %>%
  .[, lapply(.SD, "/", nrow(addresses_near_MEL))] %>%
  .[, lapply(.SD, percent)] %>%
  kable

Given a point in a compact set of addresses, what point in the rectangle has the largest supremum distance?

find the emptiest portion of the rectangle; the smallest distance to this point is the R to use in match_min_Haversine

Idea: calculate midpoint of all pairs of points, and the radius associated with each midpoint

for each midpoint, find the largest radius that contains none of the original points (nope think of a square with 3 points)

x <- sort(runif(10))
y <- sort(runif(10))
N <- 200
DT <- data.table(x = runif(N, -1, 1) + cumsum(rt(N, 2) / 5) + cumsum(rnorm(N)),
                 y = runif(N, -1, 1) + cumsum(rt(N, 2) / 5) + cumsum(rnorm(N)))
setkey(DT, x)
identify_ball <- function(DT) {
  res <- DT[, hutilscpp:::inacessibleBall(x, y, min(y), max(y))]
  res <- as.data.table(res)

  ggplot(NULL) +
    geom_point(data = DT, mapping = aes(x, y)) + 
    geom_point(data = res,
               aes(x = x_centre, y = y_centre),
               color = "red") +
    geom_rect(data = res,
              aes(xmin = xmin,
                  xmax = xmax,
                  ymin = ymin,
                  ymax = ymax),
              fill = NA,
              color = "red")
}
identify_ball(DT)

Radix sorting on an L1 metric?

R* tree?

Construct partition:

author_rstar_pages <- function(DT, lat, lon, shallow = FALSE, verbose = FALSE) {
  LATITUDE <- as.character(substitute(lat))
  LONGITUDE <- as.character(substitute(lon))

  .DT <- hutils::selector(DT,
                          cols = c(LATITUDE,
                                   LONGITUDE),
                          shallow = shallow)

  as.character(Sys.time())
  DT0 <- unique(DT0, by = c("LATITUDE", "LONGITUDE"))
  npoints <- DT0[, .N]
  xrange <- DT0[, range_rcpp(LONGITUDE)]
  yrange <- DT0[, range_rcpp(LATITUDE)]
  for (i in 1:31) {  # 2^31 maximum integer
    if (npoints < 2^i) {
      break
    }
    hutilscpp:::cut_DT(DT0,
                       depth = i,
                       x_range = xrange,
                       y_range = yrange)
  }
  as.character(Sys.time())

  # About a minute

  Ns <- integer(length(DT0))
  DT1 <- copy(DT0)
  L1_20 <- lapply(1:20, function(x) data.table())
  for (P in 1:20) {
    cat(P, "\n")
    cat(as.character(Sys.time()), "\n")
    DTp <- DT1[, N := .N, keyby = c(paste0("xbreaks", P),
                                    paste0("ybreaks", P))]
    if (DTp[, min(N)] < 4096L) {
      out <- DTp[N < 4096L][, .(minLATITUDE = min(LATITUDE),
                                maxLATITUDE = max(LATITUDE),
                                minLONGITUDE = min(LONGITUDE),
                                maxLONGITUDE = max(LONGITUDE),
                                theP = P,
                                xbreaks13_min = min(xbreaks13),
                                xbreaks13_max = max(xbreaks13),
                                ybreaks13_min = min(ybreaks13),
                                ybreaks13_max = max(ybreaks13)),
                            keyby = c(paste0("xbreaks", P),
                                      paste0("ybreaks", P))]
      L1_20[[P]] <- out
      DT1 <- DT1[!out, on = c(key(DTp))]
      cat(as.character(Sys.time()), "\t", nrow(DT1), "\n")
    } else {
      data.table()
    }
    if (nrow(DT1) == 0L) {
      break
    }

    cat(as.character(Sys.time()), "\n\n")
  }
}

Parallelize

GPU?

Vignettes are long form documentation commonly included in packages. Because they are part of the distribution of the package, they need to be as compact as possible. The html_vignette output type provides a custom style sheet (and tweaks some options) to ensure that the resulting html is as small as possible. The html_vignette format:

Vignette Info

Note the various macros within the vignette section of the metadata block above. These are required in order to instruct R how to build the vignette. Note that you should change the title field and the \VignetteIndexEntry to match the title of your vignette.

Styles

The html_vignette template includes a basic CSS theme. To override this theme you can specify your own CSS in the document metadata as follows:

output: 
  rmarkdown::html_vignette:
    css: mystyles.css

Figures

The figure sizes have been customised so that you can easily put two images side-by-side.

plot(1:10)
plot(10:1)

You can enable figure captions by fig_caption: yes in YAML:

output:
  rmarkdown::html_vignette:
    fig_caption: yes

Then you can use the chunk option fig.cap = "Your figure caption." in knitr.

More Examples

You can write math expressions, e.g. $Y = X\beta + \epsilon$, footnotes^[A footnote here.], and tables, e.g. using knitr::kable().

knitr::kable(head(mtcars, 10))

Also a quote using >:

"He who gives up [code] safety for [code] speed deserves neither." (via)



HughParsonage/PSMA documentation built on May 21, 2022, 10:16 p.m.