read_ndjson: JSON Data Input
In corpus: Text Corpus Analysis

Read data from a file in newline-delimited JavaScript Object Notation (NDJSON) format.

read_ndjson(file, mmap = FALSE, simplify = TRUE, text = NULL)

`file`	the name of the file which the data are to be read from, or a connection (unless `mmap` is `TRUE`, see below). The data should be encoded as UTF-8, and each line should be a valid JSON value.
`mmap`	whether to memory-map the file instead of reading all of its data into memory simultaneously. See the ‘Memory mapping’ section.
`simplify`	whether to attempt to simplify the type of the return value. For example, if each line of the file stores an integer and `simplify` is `TRUE`, then the return value will be an integer vector rather than a `corpus_json` object.
`text`	a character vector of string fields to interpret as `text` instead of `character`, or `NULL` to interpret all strings as `character`.

This function is the recommended means of reading data for processing by the corpus package.

When the text argument is non-NULL, string data fields with names indicated by this argument are decoded as text values, not as character values.
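
As a minimal sketch of the text argument, the snippet below reads two small records and decodes only one field as text. The field names "title" and "text" and the sample sentences are invented for illustration; the check uses is_corpus_text from the corpus package.

```r
library(corpus)

# Two illustrative records; the field names are assumptions for this sketch
lines <- c('{ "title": "First", "text": "A rose is a rose." }',
           '{ "title": "Second", "text": "To be or not to be." }')
path <- tempfile()
writeLines(lines, path)

# Decode only the "text" field as text; "title" is left as character
data <- read_ndjson(path, text = "text")

class(data$title)          # inspect the type of the plain string field
is_corpus_text(data$text)  # check whether the field was decoded as text

file.remove(path)
```

Passing several names, for example text = c("title", "text"), should decode each of the listed fields as text.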

In the default usage, with argument simplify = TRUE, when the lines of the file are records (JSON object literals), the return value from read_ndjson is a data frame with class c("corpus_frame", "data.frame"). With simplify = FALSE, the result is a corpus_json object.
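
The difference between the two settings of simplify can be checked by reading the same file twice. This is a sketch on made-up records, not output taken from the package documentation.

```r
library(corpus)

# Each line is a JSON object literal (a record)
lines <- c('{ "a": 1, "b": true }',
           '{ "a": 2, "b": false }')
path <- tempfile()
writeLines(lines, path)

frame <- read_ndjson(path)                    # simplify = TRUE is the default
raw   <- read_ndjson(path, simplify = FALSE)  # keep the corpus_json object

class(frame)  # per the description above, a corpus_frame data frame
class(raw)    # per the description above, a corpus_json object

file.remove(path)
```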

When you specify mmap = TRUE, the function memory-maps the file instead of reading it into memory directly. In this case, the file argument must be a character string giving the path to the file, not a connection object. When you memory-map the file, the operating system reads data into memory only when it is needed, enabling you to transparently process large data sets that do not fit into memory.

In terms of memory usage, enabling mmap = TRUE reduces the footprint for corpus_json and corpus_text objects; native R objects (character, integer, list, logical, and numeric) get fully deserialized to memory and produce identical results regardless of whether mmap is TRUE or FALSE. To process a large text corpus with a text field named "text", you should set text = "text" and mmap = TRUE. Or, to reduce the memory footprint even further, set simplify = FALSE and mmap = TRUE.
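
The two configurations recommended above can be sketched as follows. The records and the "id" field are invented for illustration, and a real use case would involve a file too large to read comfortably into memory.

```r
library(corpus)

# Illustrative records with a text field named "text"
lines <- c('{ "id": 1, "text": "The first document." }',
           '{ "id": 2, "text": "The second document." }')
path <- tempfile()
writeLines(lines, path)

# Option 1: a data frame whose "text" column is backed by the memory-map
frame <- read_ndjson(path, text = "text", mmap = TRUE)

# Option 2: skip simplification and keep the whole result as corpus_json
json <- read_ndjson(path, simplify = FALSE, mmap = TRUE)

n   <- nrow(frame)   # number of records read
cls <- class(json)   # class of the unsimplified result

# Release the memory-maps before deleting the file
rm(frame, json)
invisible(gc())
file.remove(path)
```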

One danger in memory-mapping is that if you delete the file after calling read_ndjson but before processing the data, then the results will be undefined, and your computer may crash. (On POSIX-compliant systems like Mac OS and Linux, there should be no ill effects to deleting the file. On recent versions of Windows, the system will not allow you to delete the file as long as the data is active.)

Another danger in memory-mapping is that if you serialize a corpus_json object or derived corpus_text object using saveRDS or another similar function, and then you deserialize the object, R will attempt to create a new memory-map using the file argument passed to the original read_ndjson call. If file is a relative path, then your working directory at the time of deserialization must agree with your working directory at the time of the read_ndjson call. You can avoid this situation by specifying an absolute path as the file argument (the normalizePath function will convert a relative path to an absolute path).
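
The absolute-path workaround can be sketched as below. The file name example.ndjson is hypothetical, and the working directory is switched to tempdir() only so that a relative path exists to convert.

```r
library(corpus)

# Create a small file and refer to it by a relative path
old_wd <- setwd(tempdir())
writeLines('{ "text": "hello" }', "example.ndjson")  # hypothetical file name

# Convert the relative path to an absolute one before memory-mapping
abs_path <- normalizePath("example.ndjson")
data <- read_ndjson(abs_path, mmap = TRUE, simplify = FALSE)

# Serialize the memory-mapped object
rds <- tempfile(fileext = ".rds")
saveRDS(data, rds)

# Change the working directory back, then deserialize; the stored absolute
# path does not depend on the current working directory
setwd(old_wd)
restored <- readRDS(rds)
class(restored)
```

Had the relative name "example.ndjson" been passed to read_ndjson instead, the deserialized object would only be usable from the directory in which the file was originally read.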

as_corpus_text, as_utf8.

# Memory mapping
lines <- c('{ "a": 1, "b": true }',
           '{ "b": false, "nested": { "c": 100, "d": false }}',
           '{ "a": 3.14, "nested": { "d": true }}')
file <- tempfile()
writeLines(lines, file)
(data <- read_ndjson(file, mmap = TRUE))

data$a
data$b
data$nested.c
data$nested.d

rm("data")
invisible(gc()) # force the garbage collector to release the memory-map
file.remove(file)

     a     b nested.c nested.d
1 1.00  TRUE       NA       NA
2   NA FALSE      100    FALSE
3 3.14    NA       NA     TRUE
[1] 1.00   NA 3.14
[1]  TRUE FALSE    NA
[1]  NA 100  NA
[1]    NA FALSE  TRUE
[1] TRUE