read_wordcounts: Convert DfR wordcount files to a long-format data frame

read_wordcounts    R Documentation

Convert DfR wordcount files to a long-format data frame

Description

Reads a set of wordcounts*.CSV files and stacks them into a single long-format data frame. The counts can optionally be manipulated, then passed on to wordcounts_texts and from there to make_instances. If the readr package is available, it is used to speed up file loading.

Usage

read_wordcounts(files, ids = dfr_filename_id(files), reader = NULL)

Arguments

files

individual filenames to read.

ids

a vector of document IDs corresponding to files. By default, dfr_filename_id is applied to files. Setting ids=files also works, in which case whole filenames are used as IDs.

reader

a function, or the name of one, that takes a filename and returns a two-column data frame with words (terms) in the first column and counts in the second. If NULL (the default), a CSV-reading function is used: read_csv from readr if available, read.csv otherwise. For TSV files, setting reader=readr::read_tsv should just work.
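As an illustration of the reader contract described above, here is a minimal sketch of a custom reader written in base R. The name read_tsv_counts and the sample file contents are made up for this example; only the requirement (filename in, two-column data frame out) comes from the argument description.

```r
# Hypothetical reader for tab-separated wordcount files, base R only.
# It takes a filename and returns a two-column data frame:
# terms in the first column, counts in the second.
read_tsv_counts <- function(f) {
    read.delim(f, header = TRUE, quote = "", comment.char = "",
               stringsAsFactors = FALSE,
               colClasses = c("character", "integer"))
}

# A small made-up file to exercise the reader
tmp <- tempfile(fileext = ".tsv")
writeLines(c("WORDCOUNTS\tWEIGHT", "the\t12", "whale\t3"), tmp)
x <- read_tsv_counts(tmp)
str(x)

# Passing it on would then look like this (not run here):
# counts <- read_wordcounts(files, reader = read_tsv_counts)
```

Any function with this shape should be usable in the same way, which is why readr::read_tsv is suggested for tab-separated input.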

Details

Empty documents are skipped: DfR supplies wordcounts files even for documents that have no wordcount data. Such documents appear in DfR's metadata but not in the output data frame.
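The stacking and skipping behavior can be pictured with a rough base-R sketch. This is not the package's implementation: stack_wordcounts is a hypothetical stand-in, the files are made up, and an "empty" document is assumed here to be a file containing only a header row.

```r
# Hypothetical stand-in that reads each file, tags its rows with a
# document ID, and row-binds the results, skipping empty documents.
stack_wordcounts <- function(files, ids = files) {
    frames <- vector("list", length(files))
    for (i in seq_along(files)) {
        x <- read.csv(files[i], stringsAsFactors = FALSE)
        if (nrow(x) == 0) next   # no wordcount data: skip this document
        frames[[i]] <- data.frame(id = ids[i], word = x[[1]],
                                  weight = x[[2]],
                                  stringsAsFactors = FALSE)
    }
    # Drop the slots left NULL by skipped documents, then stack
    do.call(rbind, Filter(Negate(is.null), frames))
}

# Three made-up files; the second has a header but no rows
d <- tempfile("wc")
dir.create(d)
fs <- file.path(d, c("wordcounts_a.CSV", "wordcounts_b.CSV",
                     "wordcounts_c.CSV"))
writeLines(c("WORDCOUNTS,WEIGHT", "the,12", "whale,3"), fs[1])
writeLines("WORDCOUNTS,WEIGHT", fs[2])
writeLines(c("WORDCOUNTS,WEIGHT", "ship,2"), fs[3])

stacked <- stack_wordcounts(fs, ids = c("a", "b", "c"))
print(stacked)
```

In the result, document "b" contributes no rows, which mirrors the way empty documents drop out of the output while remaining in the metadata.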

Even with readr and dplyr's fast row-binding, this is not especially fast. An external script in Python or Perl is faster, but this approach keeps us in R and does everything in memory.

Memory usage: for N typical journal articles, the resulting data frame seems to need about 20N KB of memory. So R on a laptop will hit its limits somewhere below 100,000 articles of typical length.
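The rule of thumb above is easy to turn into a back-of-the-envelope calculation. The helper below is hypothetical, it reads the per-article figure as 20 kilobytes, and it uses decimal units (1 GB = 1e6 KB); it is an estimate, not a measurement.

```r
# Hypothetical helper: rough memory estimate in kilobytes, using the
# approximate figure of 20 KB per typical article as the default.
estimate_wordcounts_kb <- function(n_articles, kb_per_article = 20) {
    n_articles * kb_per_article
}

kb <- estimate_wordcounts_kb(100000)
gb <- kb / 1e6   # decimal gigabytes
gb               # 2
```

At 100,000 articles the estimate is about 2 GB for the data frame alone, before any copies made during later manipulation.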

Value

A data frame with three columns: id, the document ID; word, a word type or term (called WORDCOUNTS in DfR source data files); weight, the count.
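To show the shape of this value and the kind of manipulation it allows, here is a small sketch using a hand-built data frame with the same three columns. The IDs, words, counts, and stoplist are all invented for illustration.

```r
# A made-up data frame in the documented long format
counts <- data.frame(
    id = c("10.2307/1", "10.2307/1", "10.2307/2", "10.2307/2"),
    word = c("the", "whale", "the", "ship"),
    weight = c(12L, 3L, 7L, 2L),
    stringsAsFactors = FALSE
)

# Example manipulation: drop rows whose word is on a (made-up) stoplist
stoplist <- c("the", "of", "and")
kept <- counts[!(counts$word %in% stoplist), ]

# Remaining token count per document
doc_lengths <- tapply(kept$weight, kept$id, sum)
print(doc_lengths)
```

Because each row is one (document, word) pair, filters and per-document summaries like these need only ordinary data frame operations.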

See Also

wordcounts_texts; instances_Matrix for word counts after stopword removal (etc.).


agoldst/dfrtopics documentation built on July 15, 2022, 4:13 p.m.