create_queries: Automatically infer queries from combinations of terms in a...

View source: R/feature_preparation.r

create_queries    R Documentation

Automatically infer queries from combinations of terms in a dtm

Description

This function was designed for the task of matching short event descriptions to news articles, but can more generally be used for document matching tasks. Note, however, that memory use grows rapidly with the number of unique terms in the dtm (quadratically if all terms are combined), which makes the function less suitable for matching larger documents. This only applies to the dtm, not the ref_dtm. Thus, if your goal is to match smaller documents, such as event descriptions, to news, this function might be useful.

Usage

create_queries(
  dtm,
  ref_dtm = NULL,
  min_docfreq = 2,
  max_docprob = 0.01,
  weight = c("tfidf", "binary"),
  norm_weight = c("max", "doc_max", "dtm_max", "none"),
  min_obs_exp = NA,
  union_sim_thres = NA,
  combine_all = T,
  only_dtm_combs = T,
  use_dtm_and_ref = F,
  verbose = F
)

Arguments

dtm

A quanteda dfm

ref_dtm

Optionally, another quanteda dfm. If given, the ref_dtm will be used to calculate the docfreq/docprob scores.

min_docfreq

The minimum document frequency (the number of documents in which it occurs) for terms or combinations of terms

max_docprob

The maximum probability (document frequency / N) for terms or combinations of terms

weight

Determines how to weight the queries. If ref_dtm is used, the idf of the ref_dtm is used, or of both the dtm and ref_dtm if use_dtm_and_ref is TRUE. "binary" only records whether a query term does or does not occur. "tfidf" uses common tf-idf weighting (in effect just idf, since the scores are binary). "tfidf" is listed first in Usage and is therefore the default, in line with the Details section.

norm_weight

Normalize the weight score so that the highest value is 1. "max" divides by the highest possible value, "doc_max" by the highest value within each document, and "dtm_max" by the highest observed value in the dtm.

min_obs_exp

The minimum ratio of the observed and expected frequency of a term combination

union_sim_thres

If given, a number between 0 and 1, used as the cosine similarity threshold for combining clusters of terms

combine_all

If TRUE (the default according to Usage), combine all terms. If FALSE, terms that are included as unigrams (i.e. that are within the min_docfreq and max_docprob) are not combined with other terms.

only_dtm_combs

Only include term combinations that occur in dtm. This makes sense (and saves a lot of memory) if you are only interested in asymmetric similarity measures based on the query.

use_dtm_and_ref

If a ref_dtm is used, the weight is by default computed based only on the document frequencies in the ref_dtm. If use_dtm_and_ref is set to TRUE, both the dtm and ref_dtm are used.

verbose

If TRUE, report progress

Details

The main purpose of the function is to intersect the terms in a dtm in order to increase sparsity. This can improve certain document matching tasks, but at the cost of creating a bigger dtm. If all terms are combined, the number of columns increases quadratically. However, only term combinations that occur in dtm (not ref_dtm) will be used. This is not a problem as long as the similarity of the documents in dtm to documents in dtm_y is calculated with an asymmetric similarity measure (i.e. one in which the sum of terms in dtm_y is not used).
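As a rough illustration of what intersecting terms means, the following base R sketch builds pairwise combination columns for a toy binary matrix. The data and the names m, pairs and comb are made up for this example, and the code is not the package's implementation; it only shows why the number of candidate columns grows quadratically and why the resulting matrix is sparser.

```r
# Toy binary document-term matrix: 4 documents, 4 terms (illustrative data).
m <- matrix(c(1, 1, 0, 0,
              1, 1, 1, 0,
              0, 1, 1, 0,
              1, 0, 0, 1),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("doc", 1:4), c("a", "b", "c", "d")))

# All unordered pairs of terms: choose(n, 2) candidates, i.e. quadratic in n.
pairs <- combn(colnames(m), 2)

# A combination column is 1 only where both terms occur in the document.
comb <- apply(pairs, 2, function(p) m[, p[1]] * m[, p[2]])
colnames(comb) <- apply(pairs, 2, paste, collapse = " & ")
rownames(comb) <- rownames(m)

# Keep only the combinations that actually occur in this matrix.
comb <- comb[, colSums(comb) > 0, drop = FALSE]

ncol(pairs)      # number of candidate pairs
colnames(comb)   # pairs that co-occur in at least one document
mean(m > 0)      # share of non-zero cells before intersecting
mean(comb > 0)   # share of non-zero cells after intersecting
```

Dropping combinations that never occur is what keeps the column count manageable for short documents, since a document with k terms can contribute at most choose(k, 2) pairs.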

To emphasize that this feature preparation step is geared towards the task of 'looking up' documents, we use the terminology of a 'query'. The output of the function is a list of two dtms: query_dtm and ref_dtm. Both dtms have the exact same columns, which contain the query terms. The values in query_dtm are by default tfidf weighted, and the values in ref_dtm are binary.
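The weighting and normalization can be sketched in base R as follows. This is only an approximation under stated assumptions: the idf formula (natural log of N / document frequency), the document frequencies, and all variable names are hypothetical and not taken from the package, which may compute the weights differently.

```r
# Hypothetical corpus size and document frequencies of four query terms.
n_docs <- 1000
docfreq <- c(q1 = 2, q2 = 10, q3 = 100, q4 = 5)

# Assumed idf variant; the package may use another formula.
idf <- log(n_docs / docfreq)

# Binary occurrence of the query terms in two short documents.
occ <- rbind(docA = c(1, 1, 0, 0),
             docB = c(0, 1, 1, 1))
colnames(occ) <- names(docfreq)

# idf weighting: every occurrence receives the idf of its column.
w <- sweep(occ, 2, idf, `*`)

# In the spirit of "dtm_max": divide by the highest weight observed anywhere.
w_dtm_max <- w / max(w)

# In the spirit of "doc_max": divide each row by its own highest weight.
w_doc_max <- w / apply(w, 1, max)

round(w_dtm_max, 2)
round(w_doc_max, 2)
```

With the "doc_max" style, every document has at least one query term with weight 1, whereas with the "dtm_max" style only the rarest observed query term reaches 1.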

Several options are given to only create term combinations that are informative. Firstly, a minimum and maximum document frequency of term combinations can be defined. Secondly, a minimum observed/expected ratio can be given. The expected probability of a combination of term A and term B is their joint probability if the terms were independent, i.e. P(A) * P(B). If the observed probability is not higher, the combination is not more informative than chance. Thirdly, before intersecting terms, one can first cluster very similar terms together as single columns to reduce the number of possible combinations.
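The first two filters can be illustrated with a small base R sketch. The toy matrix, the threshold values, and the exact comparison operators are assumptions made for this example; the package may apply the thresholds differently.

```r
# Toy binary dtm: 8 documents, 3 terms (illustrative data).
m <- matrix(c(1, 1, 0,
              1, 1, 0,
              1, 1, 1,
              0, 0, 1,
              0, 0, 1,
              0, 0, 0,
              0, 0, 1,
              0, 0, 0),
            ncol = 3, byrow = TRUE,
            dimnames = list(NULL, c("a", "b", "c")))

n <- nrow(m)
p_term <- colMeans(m)              # document probability of each term
pairs <- combn(colnames(m), 2)     # all unordered term pairs

stats <- data.frame(
  combination = apply(pairs, 2, paste, collapse = " & "),
  # observed document frequency: documents containing both terms
  docfreq = apply(pairs, 2, function(p) sum(m[, p[1]] * m[, p[2]])),
  # expected probability if the two terms were independent: P(A) * P(B)
  expected = apply(pairs, 2, function(p) p_term[[p[1]]] * p_term[[p[2]]]),
  stringsAsFactors = FALSE
)
stats$docprob <- stats$docfreq / n
stats$obs_exp <- stats$docprob / stats$expected

# Illustrative thresholds (not the function defaults).
min_docfreq <- 2
max_docprob <- 0.5
min_obs_exp <- 1

keep <- stats$docfreq >= min_docfreq &
  stats$docprob <= max_docprob &
  stats$obs_exp > min_obs_exp

stats
stats$combination[keep]
```

Here "a & b" co-occurs in 3 of 8 documents against an expected probability of 9/64, so it occurs more often than chance and is kept; the other two pairs occur only once and less often than expected.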

Value

A list with a query_dtm and a ref_dtm. Designed for use in compare_documents with the special 'query_lookup' measure.

Examples

 q = create_queries(rnewsflow_dfm, min_docfreq = 2, union_sim_thres = 0.9, 
                    max_docprob = 0.05, verbose = FALSE)
 head(colnames(q$query_dtm),100)

RNewsflow documentation built on May 31, 2023, 6:53 p.m.