The zlib package for R aims to offer an R-based equivalent of Python's built-in zlib module for data compression and decompression. The package provides a suite of functions for working with zlib compression, including utilities for compressing and decompressing data streams, manipulating compressed files, and working with the gzip, zlib, and deflate formats.
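Before the detailed walkthrough, here is a minimal sketch of how those three formats map onto the wbits argument. It assumes wbits follows standard zlib semantics: the gzip case (MAX_WBITS + 16) is confirmed by the examples below, while the zlib and raw-deflate values are standard zlib conventions assumed to be forwarded to deflateInit2.

```R
library(zlib)

data <- charToRaw("hello, zlib")

# gzip container: MAX_WBITS + 16
gz <- zlib$compressobj(zlib$Z_DEFAULT_COMPRESSION, zlib$DEFLATED, zlib$MAX_WBITS + 16)
gzip_bytes <- c(gz$compress(data), gz$flush())

# zlib container: MAX_WBITS
zl <- zlib$compressobj(zlib$Z_DEFAULT_COMPRESSION, zlib$DEFLATED, zlib$MAX_WBITS)
zlib_bytes <- c(zl$compress(data), zl$flush())

# raw deflate: -MAX_WBITS (assumed, per standard zlib semantics)
rd <- zlib$compressobj(zlib$Z_DEFAULT_COMPRESSION, zlib$DEFLATED, -zlib$MAX_WBITS)
deflate_bytes <- c(rd$compress(data), rd$flush())
```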
This example demonstrates how to use the zlib Rcpp module for chunked compression and decompression. We will take a string, write it to a temporary file, and then read it back into a raw vector. Then we will compress and decompress the data using the zlib Rcpp module.
To install the zlib package, you can use the following command:

```R
# install.packages("zlib") # Uncomment this line if zlib is hosted on CRAN or a similar repo
```
First, make sure to load the zlib package:

```R
library(zlib)
# Create a temporary file
temp_file <- tempfile(fileext = ".txt")
# Generate example data and write to the temp file
example_data <- "This is an example string. It contains more than just 'hello, world!'"
writeBin(charToRaw(example_data), temp_file)
# Read data from the temp file into a raw vector
file_con <- file(temp_file, "rb")
raw_data <- readBin(file_con, "raw", file.info(temp_file)$size)
close(file_con)
# Create a Compressor object
compressor <- zlib$compressobj(zlib$Z_DEFAULT_COMPRESSION, zlib$DEFLATED, zlib$MAX_WBITS + 16)
# Initialize variables for chunked compression
chunk_size <- 1024
compressed_data <- raw(0)
# Compress the data in chunks
for (i in seq(1, length(raw_data), by = chunk_size)) {
  chunk <- raw_data[i:min(i + chunk_size - 1, length(raw_data))]
  compressed_chunk <- compressor$compress(chunk)
  compressed_data <- c(compressed_data, compressed_chunk)
}
# Flush the compressor buffer
compressed_data <- c(compressed_data, compressor$flush())
# Create a Decompressor object
decompressor <- zlib$decompressobj(zlib$MAX_WBITS + 16)
# Initialize variable for decompressed data
decompressed_data <- raw(0)
# Decompress the data in chunks
for (i in seq(1, length(compressed_data), by = chunk_size)) {
  chunk <- compressed_data[i:min(i + chunk_size - 1, length(compressed_data))]
  decompressed_chunk <- decompressor$decompress(chunk)
  decompressed_data <- c(decompressed_data, decompressed_chunk)
}
# Flush the decompressor buffer
decompressed_data <- c(decompressed_data, decompressor$flush())
# Convert decompressed raw vector back to string
decompressed_str <- rawToChar(decompressed_data)
# Should print TRUE
print(decompressed_str == example_data)
```
By following these steps, you can successfully compress and decompress data in chunks using the zlib Rcpp module.
You might wonder, "Why do we need another zlib package when R already has built-in methods for compression and decompression?" Let me clarify why I decided to develop this package.
R's built-in functions like memDecompress and memCompress are good for simple tasks, but they lack robustness and flexibility for advanced use cases:
Handling Corrupt Data: Functions like memDecompress are unstable when dealing with gzip bytes that may be corrupt. If the data has multiple header blocks or a truncated chunk, the function doesn't just fail; it can even crash your R session.
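By contrast, with the package's streaming objects a failure on corrupt bytes can be trapped like any other R error. This is a minimal, hypothetical sketch; it assumes the zlib package signals decompression failures through R's standard condition system:

```R
# Hypothetical corrupt input: a plausible gzip magic followed by random garbage
corrupt <- as.raw(c(0x1f, 0x8b, 0x08, 0x00, sample(0:255, 64, replace = TRUE)))
decompressor <- zlib$decompressobj(zlib$MAX_WBITS + 16)
result <- tryCatch(
  decompressor$decompress(corrupt),
  error = function(e) {
    # The session survives; we can log and fall back instead of crashing
    message("Decompression failed cleanly: ", conditionMessage(e))
    raw(0)
  }
)
```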
Incomplete Data Stream: A too-small chunk can hang your R session when using memDecompress:

```R
compressed_data <- memCompress(charToRaw(paste0(rep("This is an example string. It contains more than just 'hello, world!'", 1000), collapse = ", ")))
rawToChar(memDecompress(compressed_data[1:300], type = "gzip")) # Caused a hang-up
readLines(gzcon(rawConnection(compressed_data[1:300])))
# Warning in readLines(gzcon(rawConnection(compressed_data[1:300]))):
#   incomplete final line found on 'gzcon(compressed_data[1:300])'
# [1] "x\x9c\...."
```
Multiple Header Blocks: R's memDecompress doesn't handle gzip data with multiple headers well:

```R
multi_header_compressed_data_15wbits <- c(memCompress(charToRaw("Hello World"), type = "gzip"), memCompress(charToRaw("Hello World"), type = "gzip"))
rawToChar(memDecompress(multi_header_compressed_data_15wbits, type = "gzip")) # Returns the data from only the first header block
```
Whereas using pigz or gzip with pipes returns the expected concatenated strings:

```R
tmp_file <- tempfile(fileext = ".gzip")
writeBin(multi_header_compressed_data_15wbits, tmp_file)
readLines(pipe(sprintf("pigz -c -d %s --verbose 2>/dev/null", tmp_file), open = "rb")) # Works correctly, but with a warning
# Warning in readLines(pipe(sprintf("pigz -c -d %s --verbose 2>/dev/null",
#   incomplete final line found on 'pigz -c -d /tmp/Rtmp5IRHIZ/file44665a0e2eb2.gzip --verbose 2>/dev/null'
# [1] "Hello WorldHello World"
readLines(pipe(sprintf("gzip -d %s --verbose --stdout", tmp_file), open = "rb")) # Not working because of the wrong wbits (see next point)
# character(0)
```
GZIP File Format Specification: R's memCompress doesn't adhere strictly to the GZIP File Format Specification, particularly regarding the usage of window bits:

```R
memCompress("Hello World", type = "gzip") # Only 15 wbits -> a zlib stream (78 9c) rather than a gzip stream, without the gzip header and checksum
# [1] 78 9c f3 48 cd c9 c9 57 08 cf 2f ca 49 01 00 18 0b 04 1d
```

See the official GZIP File Format Specification (RFC 1952).
Incorrect Behavior with Different wbits: The behavior of memCompress and related functions is inconsistent when different wbits are used for compression and decompression:

```R
compressor <- zlib$compressobj(zlib$Z_DEFAULT_COMPRESSION, zlib$DEFLATED, zlib$MAX_WBITS + 16)
multi_header_compressed_data_31wbits <- c(c(compressor$compress(charToRaw("Hello World")), compressor$flush()), c(compressor$compress(charToRaw("Hello World")), compressor$flush()))
readLines(gzcon(rawConnection(multi_header_compressed_data_31wbits))) # Returns a single line if the gzip wbits are correct!
# Warning in readLines(gzcon(rawConnection(multi_header_compressed_data_31wbits))):
#   incomplete final line found on 'gzcon(multi_header_compressed_data_31wbits)'
# [1] "Hello World"
readLines(gzcon(rawConnection(multi_header_compressed_data_15wbits)))
# Warning in readLines(gzcon(rawConnection(multi_header_compressed_data_15wbits))):
#   line 1 appears to contain an embedded nul
# Warning in readLines(gzcon(rawConnection(multi_header_compressed_data_15wbits))):
#   incomplete final line found on 'gzcon(multi_header_compressed_data_15wbits)'
# [1] "x\x9c\xf3H\xcd\xc9\xc9W\b\xcf/\xcaI\001"
tmp_file <- tempfile(fileext = ".gzip")
writeBin(multi_header_compressed_data_31wbits, tmp_file)
readLines(pipe(sprintf("gzip -d %s --verbose --stdout", tmp_file), open = "rb")) # The gzip pipe works with the correct wbits
# Warning in readLines(pipe(sprintf("gzip -d %s --verbose --stdout", tmp_file),
#   incomplete final line found on 'gzip -d /tmp/RtmpPIZPMP/file6eed29844dc4.gzip --verbose --stdout'
# [1] "Hello WorldHello World"
```
No Streaming Support: There's no native way to handle gzip streams from REST APIs or other data streams without creating temporary files or implementing cumbersome workarounds (e.g. with pipes and temp files). In contrast, the zlib package's streaming objects decompress even a truncated stream without temporary files:

```R
compressed_data <- memCompress(charToRaw(paste0(rep("This is an example string. It contains more than just 'hello, world!'", 1000), collapse = ", ")))
decompressor <- zlib$decompressobj(zlib$MAX_WBITS)
rawToChar(c(decompressor$decompress(compressed_data[1:300]), decompressor$flush())) # Still working
```
Likewise, compressobj emits a spec-compliant gzip header (note the leading 1f 8b magic bytes):

```R
compressor <- zlib$compressobj(zlib$Z_DEFAULT_COMPRESSION, zlib$DEFLATED, zlib$MAX_WBITS + 16)
c(compressor$compress(charToRaw("Hello World")), compressor$flush()) # Correct 31 wbits (or custom wbits you provide)
# [1] 1f 8b 08 00 00 00 00 00 00 03 f3 48 cd c9 c9 57 08 cf 2f ca 49 01 00 56 b1 17 4a 0b 00 00 00
```
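As a quick round-trip check (a small sketch using only calls already shown in this README), the same gzip bytes decompress back with a matching wbits value:

```R
# Compress with gzip wbits, then decompress with the same wbits (MAX_WBITS + 16)
compressor <- zlib$compressobj(zlib$Z_DEFAULT_COMPRESSION, zlib$DEFLATED, zlib$MAX_WBITS + 16)
gz_bytes <- c(compressor$compress(charToRaw("Hello World")), compressor$flush())
decompressor <- zlib$decompressobj(zlib$MAX_WBITS + 16)
rawToChar(c(decompressor$decompress(gz_bytes), decompressor$flush()))
# [1] "Hello World"
```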
```R
# Byte-Range Request and decompression in chunks

# Initialize the decompressor
decompressor <- zlib$decompressobj(zlib$MAX_WBITS + 16)

# Define the URL and initial byte ranges
url <- "https://example.com/api/data.gz"
range_start <- 0
range_increment <- 5000 # Adjust based on desired chunk size

# Placeholder for the decompressed content
decompressed_content <- character(0)

# Loop to make multiple requests and decompress chunk by chunk
for (i in 1:5) { # Adjust the loop count based on the number of chunks you want to retrieve
  range_end <- range_start + range_increment

  # Make a byte-range request
  response <- httr::GET(url, httr::add_headers(Range = paste0("bytes=", range_start, "-", range_end)))

  # Check if the request was successful
  if (httr::http_type(response) != "application/octet-stream" || httr::http_status(response)$category != "Success") {
    stop("Failed to retrieve data.")
  }

  # Decompress the received chunk
  compressed_data <- httr::content(response, "raw")
  decompressed_chunk <- decompressor$decompress(compressed_data)
  decompressed_content <- c(decompressed_content, rawToChar(decompressed_chunk))

  # Update the byte range for the next request
  range_start <- range_end + 1
}

# Flush the decompressor after all chunks have been processed
final_data <- decompressor$flush()
decompressed_content <- c(decompressed_content, rawToChar(final_data))
```
In summary, while R’s built-in methods could someday catch up in functionality, my zlib package for now fills an important gap by providing a more robust and flexible way to handle compression and decompression tasks.
The following benchmark compares the performance of the zlib package with the built-in memCompress and memDecompress functions. It was run on a Latitude 7430 with a 12th Gen Intel(R) Core(TM) i5-1245U (12) @ 4.4 GHz processor and 32 GB of RAM.
```R
library(zlib)
example_data <- charToRaw(paste0(rep("This is an example string. It contains more than just 'hello, world!'", 1000), collapse = "\n"))
microbenchmark::microbenchmark({
compressed_data <- memCompress(example_data, type="gzip")
decompressed_data <- memDecompress(compressed_data, type="gzip")
}, {
compressor <- zlib$compressobj()
compressed_data <- c(compressor$compress(example_data), compressor$flush())
decompressor <- zlib$decompressobj()
decompressed_data <- c(decompressor$decompress(compressed_data), decompressor$flush())
}, times = 5000)
#     min       lq     mean   median       uq      max neval
# 277.041 323.6640 408.4731 363.7165 395.5025 7931.280  5000  # memCompress/memDecompress
# 203.626 255.7815 308.8654 297.2095 337.2320 6864.512  5000  # zlib compressobj/decompressobj
```

On this workload the zlib streaming objects are consistently faster than the built-in memCompress/memDecompress pair.
We've identified some exciting opportunities for extending the capabilities of this library. While these features are not currently planned for immediate development, we're open to collaboration or feature requests to bring these ideas to life.
Gztool specializes in indexing, compressing, and data retrieval for GZIP files. With Gztool integration, you could create lightweight indexes for your gzipped files, enabling fast random-access extraction. This would eliminate the need to decompress large gzip files entirely just to reach specific data at the end of the file.
Pugz offers parallel decompression of gzipped text files. It employs a truly parallel algorithm that works in two passes, significantly accelerating the decompression process. This could be a valuable addition for users dealing with large datasets and seeking more efficient data processing.
If any of these feature enhancements interest you, or if you have other suggestions for improving the library, feel free to reach out for collaboration.
To build the package from source, first install the system dependencies.

On Debian/Ubuntu:

```bash
sudo apt-get update
sudo apt-get install cmake ninja-build r-base libblas-dev liblapack-dev build-essential
```

On RHEL/CentOS/Fedora:

```bash
sudo yum update
sudo yum install cmake ninja-build R libblas-devel liblapack-devel gcc-c++
```
Then clone the repository:

```bash
git clone https://github.com/yourusername/zlib.git
```

To install the package:

```bash
make install
```

To build it:

```bash
make build
```
This project is licensed under the MIT License - see the LICENSE file for details.