export_browser_data: Output data files for dfr-browser

export_browser_data    R Documentation

Output data files for dfr-browser

Description

Transform and save modeling results in a format suitable for use by dfr-browser, the web-browser based model browser. For a quick export and immediate viewing, see also dfr_browser.

Usage

export_browser_data(
  m,
  out_dir,
  zipped = TRUE,
  n_top_words = 50,
  n_scaled_words = 1000,
  supporting_files = FALSE,
  overwrite = FALSE,
  internalize = FALSE,
  info = NULL,
  proper = FALSE,
  digits = getOption("digits"),
  permute = NULL,
  metadata_header = FALSE
)

Arguments

m

mallet_model object from train_model or load_mallet_model

out_dir

directory for output. If supporting_files is TRUE, the exported data files will go in a "data" directory under out_dir.

zipped

should the larger data files be zipped? (If TRUE, uses zip, which requires the zip utility be available.)

n_top_words

how many top words per topic to save?

n_scaled_words

how many word types to use in scaled coordinates calculation?

supporting_files

if TRUE (FALSE is default), all the files needed to run the browser are copied to out_dir, with the exported data placed appropriately. From a shell in out_dir, run bin/server to launch a local web server.

overwrite

if TRUE, this will clobber existing files

internalize

always leave this set to FALSE. If TRUE, model data is embedded in the browser home page rather than written to separate files, but this behavior is deprecated. See Details.

info

a list of dfr-browser parameters. Converted to JSON with toJSON and stored in info.json. If omitted, default values (getOption("dfrtopics.browser_info")) are used. No file is written if info=FALSE.

proper

if TRUE, the document-topic and topic-word matrices will be smoothed by the hyperparameters alpha and beta (respectively) and normalized before export, instead of the "raw" sampling weights (which is the default). For MALLET models, smoothed and normalized weights then give the maximum a posteriori estimates of the corresponding probabilities, which is "properly" what the modeling process yields (but may disguise the effects of variations in document length, and will increase the storage space required).

digits

if proper is TRUE, probabilities are rounded to this many decimal places, yielding a somewhat sparser doc-topics matrix (the topic-word matrix is more aggressively truncated anyway). Set to NULL for no rounding. Rounded weights are renormalized within dfr-browser itself.

permute

if non-NULL, specifies a renumbering of the topics: the new topic k is old topic permute[k]. (If what you have is the inverse mapping, from old topic numbers to new ones, apply order() to it to obtain the permutation expected here.)

metadata_header

if TRUE (FALSE is default), the exported metadata CSV will have a header row (not expected by dfr-browser by default)
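
The effect of proper, digits, and permute can be illustrated on toy data. The following is a minimal sketch in base R, not the package's own code: the matrix, the alpha values, and all variable names are hypothetical, and rounding is assumed to behave like round().

```r
# Toy document-topic sampling weights: 2 documents (rows), 3 topics (columns).
doc_topics <- matrix(c(3, 1, 0,
                       0, 4, 2), nrow = 2, byrow = TRUE)
alpha <- c(0.5, 0.5, 0.5)  # hypothetical hyperparameters, one per topic

# proper = TRUE: add alpha to each row, then normalize rows to sum to 1.
smoothed <- sweep(doc_topics, 2, alpha, "+")
proper_dt <- smoothed / rowSums(smoothed)

# digits: round the probabilities (assumed here to work like round()).
rounded_dt <- round(proper_dt, digits = 2)

# permute: new topic k is old topic permute[k].
permute <- c(3, 1, 2)
permuted_dt <- proper_dt[, permute]

# order() of a permutation gives its inverse, which undoes the renumbering.
inverse <- order(permute)
restored_dt <- permuted_dt[, inverse]
```

Here restored_dt has its columns back in the original topic order, which is why order() converts between a permutation and its inverse.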

Details

This routine reports on its progress. By default, it saves zipped versions of the document-topics matrix and metadata files; dfr-browser supports client-side unzipping. This function compresses files using R's zip function. If that fails, set zipped=FALSE (and, if you wish, zip the files using another program).

A detailed description of the output files can be found in the dfr-browser technical notes at http://github.com/agoldst/dfr-browser.

This package includes a copy of the dfr-browser files necessary to run the browser. By default, this routine only exports data files. To also copy over the dfr-browser source (JavaScript, HTML, and CSS), pass supporting_files=TRUE.

Metadata format

If you are working with non-JSTOR documents, the one file that will reflect this is the exported metadata. dfr-browser expects eight metadata columns by default: id,title,author,journaltitle,volume,issue,pubdate,pagerange. This function looks for these eight columns and, if it finds them, writes the metadata with these columns in this order. Any remaining columns are pushed all the way to the right of the output. (dfr-browser ignores them unless you customize it.) If any of these columns is not present in metadata(m), then export_browser_data will simply save all the metadata as is, adjusting only the CSV format to match the baseline expectation of dfr-browser (namely, a headerless CSV conforming to RFC 4180).
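
The column handling described above can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's implementation: the data frame md and the file name are hypothetical, and write.table is used only to approximate a headerless, quoted CSV.

```r
# Default columns expected by dfr-browser, in order (taken from the text above).
expected <- c("id", "title", "author", "journaltitle",
              "volume", "issue", "pubdate", "pagerange")

# Hypothetical metadata with columns out of order and one extra column.
md <- data.frame(
  publisher = "Example Press", title = "A Title", id = "doc1",
  author = "A. Author", journaltitle = "A Journal", volume = "1",
  issue = "2", pubdate = "1990-01-01", pagerange = "1-10",
  stringsAsFactors = FALSE
)

# If every expected column is present, put them first and in order;
# any remaining columns follow on the right. Otherwise leave md as is.
if (all(expected %in% names(md))) {
  md <- md[, c(expected, setdiff(names(md), expected)), drop = FALSE]
}

# Write a headerless CSV with embedded quotes doubled.
out_file <- tempfile(fileext = ".csv")
write.table(md, out_file, sep = ",", quote = TRUE, qmethod = "double",
            row.names = FALSE, col.names = FALSE)
```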

If your metadata does not match these expectations, an alternative is to set dfr-browser's configuration parameters VIS.metadata.type and VIS.bib.type to "base" (using the info parameter) and to write out a metadata file with a header by passing metadata_header=TRUE to this function or dfr_browser. For polished results, more customization of dfr-browser might be necessary.
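
One way such an info list might be built is sketched below. The nesting is an assumption: it supposes that VIS.metadata.type and VIS.bib.type correspond to nested list elements under a VIS entry, which should be checked against the dfr-browser technical notes. The export call is left commented out because it needs a fitted model m.

```r
# Assumed structure: nested lists mirroring VIS.metadata.type and VIS.bib.type.
info <- list(
  VIS = list(
    metadata = list(type = "base"),
    bib = list(type = "base")
  )
)

# With a model m in hand, the list would be passed along with a header request:
# export_browser_data(m, out_dir = "browser", info = info,
#                     metadata_header = TRUE)
```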

Note that you can adjust the metadata held on the model object by assigning to metadata(m) before exporting the browser data. In particular, if you have many documents, you may wish to conserve space by eliminating metadata columns that are not used by the visualization: for example, metadata(m)$publisher <- NULL. Earlier versions of dfrtopics tried to eliminate such columns automatically, but this more conservative approach aims to allow you more flexibility about what gets exported.

Deprecated option

To insert the data directly into the main index.html file, pass internalize=TRUE. This behavior is now deprecated and will be removed in a future version.

See Also

dfr_browser, model_dfr_documents, train_model, topic_scaled_2d, and the functions for outputting individual custom files: export_browser_topic_words, export_browser_doc_topics, export_browser_metadata, export_browser_topic_scaled, export_browser_info.

Examples


## Not run: 
library(dfrtopics)

# train a 40-topic model from JSTOR DfR citations and word counts
m <- model_dfr_documents("citations.CSV", "wordcounts",
    "stoplist.txt", n_topics=40)

# export all files needed for browser program
export_browser_data(m, out_dir="browser", supporting_files=TRUE)

# or: overwrite model data only for an already-existing browser
export_browser_data(m, out_dir="browser/data",
    supporting_files=FALSE, overwrite=TRUE)

## End(Not run)


agoldst/dfrtopics documentation built on July 15, 2022, 4:13 p.m.