read_wordcounts | R Documentation
Description

Reads a set of wordcounts*.CSV files and stacks them in a single long-format data frame. These counts can optionally be manipulated, then passed on to wordcounts_texts and thence to make_instances. If the readr package is available, it will be used to speed up file loading.
Usage

    read_wordcounts(files, ids = dfr_filename_id(files), reader = NULL)
Arguments

files
    individual filenames to read.

ids
    a vector of document IDs corresponding to files.

reader
    a function (or the name of one) that takes a filename and returns a two-column data frame with words (terms) in the first column and counts in the second. If NULL (the default), a CSV-reading function is used.
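A sketch of a typical call, assuming the package providing read_wordcounts is installed and that a DfR download has been unpacked into a hypothetical dfr-data/wordcounts/ directory (both the path and the custom reader below are illustrative, not part of this documentation):

```r
# Sketch only: the dfr-data/ path is a placeholder for wherever the
# DfR wordcounts*.CSV files actually live.
files <- Sys.glob("dfr-data/wordcounts/wordcounts*.CSV")
counts <- read_wordcounts(files)

# The reader argument accepts any function mapping a filename to a
# two-column (word, count) data frame; a base-R alternative might be:
my_reader <- function(f) read.csv(f, strip.white = TRUE)
counts <- read_wordcounts(files, reader = my_reader)
```

By default, ids are derived from the filenames via dfr_filename_id(files), so the id column of the result lines up with DfR's metadata.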
Details

Empty documents are skipped: DfR supplies wordcounts files for documents that have no wordcount data, so such documents appear in DfR's metadata but not in the data frame returned here.

Even with readr and dplyr's fast row-binding, this is not altogether fast. An outboard script in Python or Perl is faster, but this function keeps the work in R and does everything in memory.

Memory usage: for N typical journal articles, the resulting data frame seems to need about 20N KB of memory, so R on a laptop will hit its limits somewhere below 100,000 articles of typical length.
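As a back-of-the-envelope check of that rule of thumb (the 20 KB/article figure comes from the estimate above; mem_gb is a hypothetical helper, not part of the package):

```r
# Rough memory estimate: ~20 KB per typical journal article,
# converted from KB to GB (1 GB = 1024^2 KB).
mem_gb <- function(n_articles) n_articles * 20 / 1024^2

mem_gb(10000)    # ~0.19 GB for 10,000 articles
mem_gb(100000)   # ~1.9 GB: near the practical limit on a laptop
```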
Value

A data frame with three columns: id, the document ID; word, a word type or term (called WORDCOUNTS in DfR source data files); and weight, the count.
See Also

wordcounts_texts, instances_Matrix for word counts after stopword removal (etc.).
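The pipeline named in the description above can be sketched end to end as follows; the file path is a placeholder, and the exact arguments accepted by the downstream functions should be checked against their own documentation:

```r
# Sketch only: read counts, optionally filter them, then build
# the text and instance objects for topic modeling.
counts <- read_wordcounts(Sys.glob("dfr-data/wordcounts/wordcounts*.CSV"))

# ...manipulate counts here (e.g. drop stopwords or rare words)...

texts <- wordcounts_texts(counts)
insts <- make_instances(texts)
```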