knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
library(canny.tudor)
In R, you cannot easily or conveniently unzip a zipped file from a raw object and access their raw zipped members. Yet this is a useful requirement. Zip files comprise member pairs where each member has a name and a sequence of bytes. The name itself is just a string defining the file name and any subdirectory paths using forward-slash delimiters. Note, the zip files does not directly model a sub-directory tree.
Take any zip file, a.zip
for example. It has one text file called a.txt
. The
following statement creates the zip file.
zip("a.zip", "a.txt")
You can read it entirely using xfun::read_bin
, short for:
readBin(x, what = "raw", n = file.info(x)$size)
where x
is the name of the file. The result is an object of raw bytes. We want
to transform this single raw sequence of bytes to a list of raw bytes with one
list item for each member.
Unzipping disregards the member modification times.
You can derive a connection from raw bytes using base::rawConnection
then
you can access a zipped member connection using base:unz
. This only works if
you can access the member file names. The excerpt below requires that we know
that the zip file has a member called a.txt
.
xfun::read_bin("a.zip") |> rawConnection() |> unz(filename = "a.txt")
Trouble is, there is no obvious way to determine the members, except from the original file. The following lists the zip file contents by name, length and date.
unzip("a.zip", list = TRUE)
Naturally, this implies that you have to save raw bytes before you can list the zip.
If we accept the unzip from file limitation, then we only need to first write the raw bytes to a temporary file then apply the unzip-list to read back the resulting file information. This is not ideal but beats unzipping all the members from raw to disk. An actual unzip maps the member names to read file names which might not always prove reliable. File systems treat names case-sensitively or not, for example, typically for macOS and Windows. This would create a problem if two members share the same case-insensitive path name.
Hence we define a simple function that lists members from a raw zip. It creates a temporary file, unlinking it before exit. At some future point, our implementation will replace the temporary file with some other, more in-memory clean, method of listing zipper file information.
unzip_list <- function(object) { zipfile <- tempfile() writeBin(object, zipfile, useBytes = TRUE) list <- unzip(zipfile, list = TRUE) unlink(zipfile) list }
Should the unzip_list
implementation apply useBytes = TRUE
? The
documentation explains that this option avoids re-coding.
This only leaves us with another problem. The base:unz
function creates a text
connection from which you cannot read binary. The following fails.
readBin(unz("a.zip", "a.txt"), what = "raw")
Not only that, neither can you size the connection in order to determine how many raw bytes to read. This approach leads the research at present to a dead end.
Since we have to unzip to disk, why not write the raw bytes then unzip and read members one by one?
unzip_raw <- function(object) { zipfile <- tempfile() writeBin(object, zipfile, useBytes = TRUE) list <- unzip(zipfile, list = TRUE) bins <- lapply(list$Name, function(name) { junk <- unzip(zipfile, name, junkpaths = TRUE, exdir = tempdir()) bin <- xfun::read_bin(junk) unlink(junk) bin }) unlink(zipfile) names(bins) <- list$Name bins }
Successful. It throws away the date. Our implementation makes some simplifying assumptions:
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.