read_wordcounts: Convert DfR wordcount files to a long-format data frame
In agoldst/dfrtopics: Tools for exploring topic models of text

read_wordcounts

R Documentation

Convert DfR wordcount files to a long-format data frame

Description

Reads a set of wordcounts*.CSV files and stacks them into a single long-format data frame. The counts can optionally be manipulated, then passed on to wordcounts_texts and from there to make_instances. If the readr package is available, it is used to speed up file loading.

Usage

read_wordcounts(files, ids = dfr_filename_id(files), reader = NULL)
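A minimal usage sketch, assuming dfrtopics is installed. The two temporary files, their names, and the WEIGHT header for the count column are made up for illustration; only the `read_wordcounts(files)` call itself comes from the usage line above.

```r
# Hypothetical sample data: two tiny files named in the wordcounts*.CSV pattern.
# The WEIGHT column header is an assumption about the DfR file layout.
dir <- tempfile("dfr_sample")
dir.create(dir)
writeLines(c("WORDCOUNTS,WEIGHT", "the,10", "whale,3"),
           file.path(dir, "wordcounts_10.2307_111111.CSV"))
writeLines(c("WORDCOUNTS,WEIGHT", "the,7", "sea,2"),
           file.path(dir, "wordcounts_10.2307_222222.CSV"))

files <- list.files(dir, pattern = "^wordcounts.*\\.CSV$", full.names = TRUE)

# Only call the package if it is actually installed.
if (requireNamespace("dfrtopics", quietly = TRUE)) {
    counts <- dfrtopics::read_wordcounts(files)
    print(head(counts))
}
```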

Arguments

`files`	individual filenames to read.
`ids`	a vector of document IDs corresponding to `files`. By default, `dfr_filename_id` is applied to `files`. Passing `ids=files` also works, using whole filenames as IDs.
`reader`	a function, or the name of one, that takes a filename and returns a two-column data frame with words (terms) in the first column and counts in the second. If NULL (the default), a CSV-reading function is used: `read_csv` from readr if available, `read.csv` otherwise. For TSV files, setting `reader=readr::read_tsv` should work.
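As an illustration of the `reader` contract (filename in, two-column data frame out), here is a sketch of a custom reader written in base R. The function name `read_tsv_counts` and the sample file are hypothetical, not part of the package.

```r
# Hypothetical reader for tab-separated files with a header row.
# It follows the contract described for `reader`: it takes a filename and
# returns a two-column data frame (terms first, counts second).
read_tsv_counts <- function(filename) {
    read.delim(filename, header = TRUE, sep = "\t", quote = "",
               stringsAsFactors = FALSE,
               colClasses = c("character", "integer"))
}

# Try it on a small made-up file.
f <- tempfile(fileext = ".tsv")
writeLines(c("WORDCOUNTS\tWEIGHT", "whale\t3", "sea\t2"), f)
tsv_counts <- read_tsv_counts(f)
str(tsv_counts)
```

A function like this could then be supplied as `reader=read_tsv_counts`.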

Details

DfR supplies wordcounts files even for documents that have no wordcount data. Such empty documents are skipped: they will be in DfR's metadata but not in the output data frame here.
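The stacking and skipping behaviour can be pictured with a small base-R sketch. This is not the package's implementation; `stack_counts` is a hypothetical helper, and it assumes an empty document is represented by a header-only file.

```r
# Illustrative only: stack per-file counts into long format, skipping
# files that yield no rows. Not the package's actual code.
stack_counts <- function(files, ids = files,
                         reader = function(f) read.csv(f, stringsAsFactors = FALSE)) {
    frames <- lapply(seq_along(files), function(i) {
        counts <- reader(files[i])
        if (nrow(counts) == 0) {
            return(NULL)  # empty document: contributes no rows
        }
        data.frame(id = ids[i],
                   word = counts[[1]],
                   weight = counts[[2]],
                   stringsAsFactors = FALSE)
    })
    # Drop the skipped documents before row-binding.
    do.call(rbind, Filter(Negate(is.null), frames))
}

# Made-up input: one file with counts, one header-only file.
d <- tempfile("stack_demo")
dir.create(d)
full <- file.path(d, "a.csv")
empty <- file.path(d, "b.csv")
writeLines(c("WORDCOUNTS,WEIGHT", "whale,3", "sea,2"), full)
writeLines("WORDCOUNTS,WEIGHT", empty)

stacked <- stack_counts(c(full, empty), ids = c("doc_a", "doc_b"))
print(stacked)
unique(stacked$id)  # "doc_b" contributes no rows
```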

Even with readr and dplyr's fast row-binding, this is not especially fast. An external script in Python or Perl is faster, but this approach keeps us in R and does everything in memory.

Memory usage: for N typical journal articles, the resulting data frame seems to need about 20N KB of memory (roughly 20 KB per article). So R on a laptop will hit its limits somewhere below 100,000 articles of typical length.
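To make the rule of thumb concrete, the arithmetic can be sketched as follows. The 20 KB per article figure is the rough estimate given above, not a measurement, and the corpus sizes are arbitrary examples.

```r
# Rough estimate only, using the ~20 KB per article rule of thumb.
kb_per_article <- 20
n_articles <- c(1000, 10000, 100000)

# Convert KB to GB (binary units: 1 GB = 1024^2 KB).
est_gb <- n_articles * kb_per_article / 1024^2

data.frame(articles = n_articles, approx_gb = round(est_gb, 2))
```

At 100,000 articles the estimate comes to about 1.9 GB for the data frame alone, before any copies R makes while manipulating it. For an actual data frame, `object.size()` reports the real footprint.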

Value

A data frame with three columns: id, the document ID; word, a word type or term (called WORDCOUNTS in DfR source data files); weight, the count.
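Because the result is an ordinary long-format data frame, the optional manipulation step can be done with standard tools. The sketch below uses made-up counts in the documented three-column shape; the stoplist is hypothetical.

```r
# Made-up counts in the documented shape: id, word, weight.
counts <- data.frame(
    id = c("doc1", "doc1", "doc1", "doc2", "doc2"),
    word = c("the", "whale", "sea", "the", "ship"),
    weight = c(10L, 3L, 2L, 7L, 4L),
    stringsAsFactors = FALSE
)

# Drop words on a hypothetical stoplist.
stoplist <- c("the", "of", "and")
filtered <- counts[!(counts$word %in% stoplist), ]

# Total remaining word count per document.
doc_lengths <- tapply(filtered$weight, filtered$id, sum)
doc_lengths
```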
