tarExtract: Extract the contents of entries in a gzipped tar file


View source: R/Rtar.R

Description

The initial version of this function provides a mechanism to extract entries in a gzipped tar file directly into R. By default, it returns the contents of each specified entry as a raw vector. However, the caller can specify a function that processes each entry once its entire contents are available, for example to convert the raw vector to a character string, or even to read data from the file's contents. This allows one to keep only the processed result and discard the raw contents.

The function also supports reading from raw data rather than a file. For example, one can read the contents of a bzip2 or gzip archive into a raw vector, either from a file or from a stream such as the response to an HTTP query made via RCurl. One can then extract the contents of the “files” directly from the in-memory representation of the archive, with no need to touch the file system. This avoids cleanup and simplifies “security” concerns.
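A minimal sketch of the in-memory route, using only base R. It assumes the gzipped bytes are already held in a raw vector (here the hypothetical gzBytes, produced from a throwaway file purely for demonstration; in practice they might come from an HTTP response). inflateGz is a hypothetical helper, not part of the package. The inflated raw vector is the kind of uncompressed input that the filename argument is described as accepting.

```r
# Stand-in payload: in real use this would be the bytes of a tar archive.
payload <- charToRaw(paste(rep("some archive bytes\n", 50), collapse = ""))

# Demonstration only: produce gzip-format bytes by writing through gzfile().
tmp <- tempfile(fileext = ".gz")
con <- gzfile(tmp, "wb")
writeBin(payload, con)
close(con)
gzBytes <- readBin(tmp, "raw", file.info(tmp)$size)
unlink(tmp)

# Hypothetical helper: inflate gzip bytes held in memory, chunk by chunk.
inflateGz <- function(bytes, chunk = 65536L) {
  con <- gzcon(rawConnection(bytes))
  on.exit(close(con))
  out <- list()
  repeat {
    piece <- readBin(con, "raw", chunk)
    if (length(piece) == 0L) break
    out[[length(out) + 1L]] <- piece
  }
  do.call(c, out)
}

data <- inflateGz(gzBytes)
# 'data' could then be handed to tarInfo(data) or tarExtract(data, ...)
# (not run here, since that needs the package and a real archive).
```

After the demonstration bytes are built, nothing is written to disk: decompression happens entirely on the raw vector.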

Usage

tarExtract(filename, entries = character(),
           op = collectContents(entries),
           convert = NULL, data = NULL,
           workBuf = raw(10000), ...)

Arguments

filename

the name of the gzipped tar file or alternatively a raw vector containing the uncompressed archive contents, e.g. when read from a gz or bzip2 stream.

entries

a character vector giving the precise names of the files to extract (see tarInfo to find the names). If this is empty (the default), all entries are extracted and returned.

In the future, this may also be a function that takes a single entry name and returns TRUE or FALSE, indicating whether to extract the contents of that entry. This dynamic matching is not yet implemented, and it is not strictly necessary: the names of the desired files can be determined via a two-pass procedure of getting the table of contents for the archive and then applying the function to the names. The two approaches have different performance trade-offs. A matching function incurs the overhead of a function call from C for each entry, whereas making two passes over a very large archive may itself be expensive.

op

an R function that is invoked when the entire contents of a particular entry are available. It is called with the contents, given as a raw vector, and the name of the entry, in that order.

data

a user-defined data value that is passed to the call to the native routine specified in op, if that is not an R function.

convert

a function or list of functions which, if provided, is used to convert the raw vectors after they have been collected. This is done when the result is fetched.

workBuf

a raw vector or NULL, or a number which is used to create a raw vector of that length. This is used as a buffer into which the contents of an entry are copied as each chunk is delivered by the extraction. Making this a long raw vector reduces the number of times the vector must be enlarged to hold an entire entry's contents. Of course, the larger it is, the more memory is needed. To optimize the speed of extraction, one can create a raw vector with length equal to the size of the largest file to be extracted. One can use tarInfo to find this information.

...

additional arguments passed on to the call to fetch the result and to the convert function if specified. (More details needed.)
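The two-pass selection described under entries and the buffer sizing suggested under workBuf can be sketched with base R alone. The data frame below is a made-up stand-in for a tarInfo() result; only the file, type and size columns used in the examples on this page are assumed, the "REGTYPE" label is illustrative, and selectEntries is a hypothetical helper rather than part of the package.

```r
# Made-up stand-in for the result of tarInfo(filename).
info <- data.frame(
  file = c("pkg/XSL/", "pkg/XSL/env.xsl", "pkg/XSL/Todo.xsl", "pkg/R/code.R"),
  type = c("DIRTYPE", "REGTYPE", "REGTYPE", "REGTYPE"),
  size = c(0, 1200, 5400, 800),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep non-directory entries whose name satisfies 'pred'.
selectEntries <- function(info, pred) {
  keep <- vapply(info$file, function(f) isTRUE(pred(f)), logical(1),
                 USE.NAMES = FALSE)
  info[keep & info$type != "DIRTYPE", , drop = FALSE]
}

# First pass: decide which entries we want from the table of contents.
isXSL <- function(name) grepl("\\.xsl$", name)
chosen <- selectEntries(info, isXSL)
wanted <- chosen$file

# Size the work buffer to the largest entry we intend to extract.
bufSize <- max(chosen$size)

# Second pass (not run here, since it needs the package and a real archive):
#   tarExtract(filename, wanted, workBuf = bufSize)
```

Sizing the buffer from the selected entries rather than from the whole archive keeps memory use down when only small files are wanted.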

Value

By default, a list with an element for each entry specified. The content of each element is a raw vector. If it is NULL, then the entry was not found in the archive.
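Because a NULL element signals an entry that was not found, it may be worth checking for such elements before converting. The following is a sketch only: the list below is a made-up stand-in for a tarExtract() result, and the entry names are invented.

```r
# Made-up stand-in for a tarExtract() result: one entry found, one not.
res <- list("pkg/XSL/env.xsl" = charToRaw("<xsl/>"),
            "pkg/XSL/missing.xsl" = NULL)

# Identify the entries that were not found in the archive.
isMissing <- vapply(res, is.null, logical(1))
notFound <- names(res)[isMissing]

# Convert only the entries that were actually extracted.
texts <- vapply(res[!isMissing], rawToChar, character(1))

if (length(notFound) > 0)
  warning("entries not found: ", paste(notFound, collapse = ", "))
```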

Note

The details may change a little in future versions.

Author(s)

Duncan Temple Lang

References

zlib/contrib/untgz

See Also

tarInfo

Examples

  filename = system.file("sampleData", "OmegahatXSL_0.2-0.tar.gz", package = "Rcompression")

     # Get the contents of two files.
  raws = tarExtract(filename, c("OmegahatXSL/XSL/env.xsl", "OmegahatXSL/XSL/Todo.xsl"))
     # Now convert the raw vectors to text since we know what we are
     # dealing with.
  sapply(raws, rawToChar)

    # or in one step
  raws = tarExtract(filename, c("OmegahatXSL/XSL/env.xsl", "OmegahatXSL/XSL/Todo.xsl"), convert = rawToChar)


     # Extract files in a directory.
  filename = system.file("sampleData", "OmegahatXSL_0.2-0.tar.gz", package = "Rcompression")
  i = tarInfo(filename)

     # Check there is such a directory
  any(i$type == "DIRTYPE" & i$file == "OmegahatXSL/XSL/")

  files = i$file[dirname(i$file) ==  "OmegahatXSL/XSL"]
  z = tarExtract(filename, files, convert = rawToChar)
  nchar(z)

    # This example illustrates how we can process the contents of each
    # file as it is extracted.
    # The particular computation is uninteresting but the approach is intended
    # to illustrate that we can extract some information from the
    # contents and put it somewhere and move on to the next file. This
    # is useful if the archive has data across multiple files that can
    # be dynamically merged into a single R data structure.
 
  filename = system.file("sampleData", "OmegahatXSL_0.2-0.tar.gz", package = "Rcompression")
  lineCounts = numeric()
  countLines =
     function(contents, fileName = "", verbose = TRUE) {

        if (verbose) cat(fileName, "\n")
        numLines = length(strsplit(rawToChar(contents), "\\n")[[1]])
        lineCounts[fileName] <<- numLines
        numLines
     }
  i = tarInfo(filename)
  files = i$file[!( i$type %in% "DIRTYPE")]

    # Now we are ready to run the code.
  tarExtract(filename, files,  countLines)


    # Alternatively, collect all the information and then
    # convert each one in turn at the end.
    # This is only marginally faster, if at all and consumes
    # a lot more memory as when we perform the conversion
    # we have all of the contents in memory.
    # One measurement of speed was 38 seconds to 39.

    # With the changes to avoid the accordion growth of the raw
    # vector for each chunk of file, the comparison
    # is .969 versus .537.  So much faster overall, and this
    # version becomes relatively quicker.  But consumes more memory.

  tarExtract(filename, files,  convert = countLines, verbose = FALSE)

  max(i$size)

    # Dealing with raw data rather than a file.
  filename = system.file("sampleData", "OmegahatXSL_0.2-0.tar.bz2", package = "Rcompression")
  f = bzfile(filename, "rb")
  data = readBin(f, "raw", 1000000)
  close(f)

  tarInfo(data)

  targetFiles = c("OmegahatXSL/XSL/env.xsl", "OmegahatXSL/XSL/Todo.xsl")
  raws = tarExtract(data, targetFiles, convert = rawToChar)


  filename = system.file("sampleData", "OmegahatXSL_0.2-0.tar.gz", package = "Rcompression")
  f = gzfile(filename, "rb")
  data = readBin(f, "raw", 1000000)
  close(f)

  tarInfo(data)

statwonk/Rcompression documentation built on May 30, 2019, 10:43 a.m.