extract_tfidf: Extracting TF-IDF Values for Ngrams

View source: R/extract_tfidf.R

extract_tfidf    R Documentation

Extracting TF-IDF Values for Ngrams

Description

[Experimental]

This function takes as input a tibble graph (from tidygraph), a list of tibble graphs, or a data frame. It extracts the ngrams from the text column(s) of your choice and calculates the Term-Frequency Inverse-Document-Frequency (TF-IDF) value of each ngram for each value of the grouping variable(s) you have chosen.

Usage

extract_tfidf(
  data,
  text_columns,
  grouping_columns,
  grouping_across_list = FALSE,
  n_gram = 2L,
  stopwords_type = "smart",
  stopwords_vector = NULL,
  clean_word_method = c("lemmatize", "stemming", "none"),
  ngrams_filter = 5L,
  nb_terms = 5L
)

Arguments

data

A tibble graph from tidygraph, a list of tibble graphs or a data frame.

text_columns

The columns with the text you want to analyze. If you give multiple columns, they will be united to extract the terms.

grouping_columns

The column(s) you want to use to calculate the tf-idf. These columns will become your "document" unit in the tidytext::bind_tf_idf() function. For instance, if you run the function on a single tibble graph, you may want to compute the tf-idf by the cluster each node belongs to. Make sure that the identifier you use to compute the tf-idf is unique for each group (see the details for more information).

grouping_across_list

Set to TRUE if you want to compute the tf-idf on the whole list of tibble graphs and you have no identifier that is unique across them (see the details for more information).

n_gram

The maximum n you want for tokenizing your ngrams (see tidytext::unnest_tokens() for more information). 2 by default, i.e. only unigrams and bigrams will be extracted.

stopwords_type

The type of stopwords list you want to use to remove stopwords from your ngrams. The "smart" list is chosen by default; see the other possibilities with stopwords::stopwords_getsources().

stopwords_vector

Your own list of stopwords, as a character vector.

clean_word_method

The method used to clean and standardize your ngrams. You can lemmatize or stem words through the textstem package. Choose "none" if you don't want to apply any cleaning method.

ngrams_filter

The minimum number of times an ngram must appear in the whole corpus (5 by default). Ngrams that do not reach this number are excluded from the tf-idf computation.

nb_terms

The function extracts the nb_terms (5 by default) terms with the highest TF-IDF value for each value of the grouping variables.

Details

This function extracts TF-IDF values for various types of input, from one or several text columns, and with grouping on one or several columns. The simplest case is to use this function with a data frame or a single tibble graph with an easily identifiable grouping variable (like a cluster). But it also allows more complex uses in the case of a list of tibble graphs.

If you give a list of tibble graphs as input, the function extracts TF-IDF values on the bound graphs, and not graph by graph. If you want to extract TF-IDF values for each graph separately, use lapply() and apply extract_tfidf() to each graph: the input will then be a single tibble graph, and the operation will be repeated for each tibble graph of your list.
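The per-graph extraction described above can be sketched as follows. This is a minimal sketch, not part of the package: the helper name tfidf_per_graph is hypothetical, and it assumes that the networkflow package is installed and that graph_list is a list of tibble graphs such as the temporal_networks object built in the Examples section.

```r
# Hypothetical helper (not part of networkflow): apply extract_tfidf()
# to each tibble graph of a list separately, instead of on the bound list.
tfidf_per_graph <- function(graph_list, ...) {
  lapply(graph_list, function(graph) {
    # Each call receives a single tibble graph, so TF-IDF values are
    # computed within that graph only.
    networkflow::extract_tfidf(data = graph, ...)
  })
}

# Possible usage, assuming `temporal_networks` exists (see Examples):
# tfidf_list <- tfidf_per_graph(temporal_networks,
#                               text_columns = "Title",
#                               grouping_columns = "cluster_leiden")
```

The result is a list with one element per graph, in the same order (and with the same names) as the input list.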

As the extraction of TF-IDF values is made on the whole aggregated list, you have to choose your grouping_columns carefully: the grouping columns must identify groups uniquely. For instance, if you have used add_clusters(), each node in each of your graphs is associated with a cluster. But the identifiers of the clusters ("01", "02", "03", etc.) are the same across tibble graphs. This means that all the "01" clusters will be grouped together, which is not what you want. In this case, set grouping_across_list to TRUE: the identifier of the cluster will be merged with the name of the corresponding tibble graph in the list. However, you don't need this option if you have an identifier that is unique across your tibble graphs. That is the case, for instance, if you have used merge_dynamic_clusters(): you then have a column of clusters merged across your different tibble graphs, and these new inter-network clusters constitute a unique identifier.
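The identifier collision described above can be illustrated with a small base R sketch. The data are hypothetical, plain data frames stand in for the node tables of two tibble graphs, and the way the graph name is pasted to the cluster label (column names, separator) is an assumption for illustration, not necessarily what the function does internally.

```r
# Two toy node tables standing in for two tibble graphs of a list
# (hypothetical data; the list names play the role of the graph names).
graphs <- list(
  "1970-1979" = data.frame(node = c("a", "b"), cluster = c("01", "02")),
  "1971-1980" = data.frame(node = c("c", "d"), cluster = c("01", "02"))
)

# Bind the tables, keeping the name of each graph in a `window` column.
bound <- do.call(
  rbind,
  Map(function(df, name) data.frame(window = name, df), graphs, names(graphs))
)
rownames(bound) <- NULL

# Cluster labels alone collide across graphs: "01" and "02" appear in both.
n_groups_label_only <- length(unique(bound$cluster))

# Pasting the graph name to the label makes each group unique.
bound$cluster_id <- paste(bound$window, bound$cluster, sep = "_")
n_groups_unique_id <- length(unique(bound$cluster_id))
```

With the label alone, the four clusters collapse into two groups; with the combined identifier, the four clusters stay distinct.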

TF-IDF values are calculated from the number of occurrences of a term in each document. Terms that occur only once are removed, so that very rare terms do not end up at the top of your grouping variables.
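As an illustration of the calculation, here is a minimal base R sketch on hypothetical counts. It follows the definitions used by tidytext::bind_tf_idf(): the term frequency is the share of a document's term occurrences, and the inverse document frequency is the natural logarithm of the number of documents divided by the number of documents containing the term. The data and column names are assumptions for illustration.

```r
# Hypothetical counts of terms per document (here, per cluster);
# each (document, term) pair appears on one row only.
counts <- data.frame(
  document = c("c1", "c1", "c2", "c2"),
  term     = c("inflation", "unemployment", "inflation", "oil shock"),
  n        = c(6, 2, 3, 3)
)

# Term frequency: occurrences of the term divided by all term
# occurrences in the same document.
counts$tf <- counts$n / ave(counts$n, counts$document, FUN = sum)

# Inverse document frequency: log(number of documents /
# number of documents containing the term).
n_docs <- length(unique(counts$document))
docs_with_term <- ave(rep(1, nrow(counts)), counts$term, FUN = sum)
counts$idf <- log(n_docs / docs_with_term)

# TF-IDF is the product of the two.
counts$tf_idf <- counts$tf * counts$idf
counts
```

A term present in every document, like "inflation" here, gets an idf of 0 and therefore a tf_idf of 0, whatever its frequency.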

Value

A data.table with the terms (i.e. ngrams) appearing in each "document" (that is, each value of your grouping_columns) with the number of times they appear per document (n), their term frequency (tf), their inverse document frequency (idf), and their term-frequency inverse-document-frequency (tf_idf). The terms are those with the highest tf_idf value for each value of the grouping columns, depending on the nb_terms value you set. For instance, if nb_terms is set to 5 (the default value) and you compute the TF-IDF on the cluster variable, the function extracts the 5 terms with the highest TF-IDF value for each cluster.
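The selection of the top terms per group can be sketched in base R as follows. The values are hypothetical, and this is only an illustration of the idea: the function's actual implementation, in particular how it handles ties, may differ.

```r
# Hypothetical TF-IDF values for two clusters.
tfidf_toy <- data.frame(
  cluster = c("01", "01", "01", "02", "02", "02"),
  term    = c("phillips curve", "wage", "monetary policy",
              "oil price", "supply shock", "stagflation"),
  tf_idf  = c(0.30, 0.10, 0.20, 0.05, 0.40, 0.25)
)

# Number of terms to keep per cluster (plays the role of nb_terms).
nb_terms <- 2

# Sort by cluster, then by decreasing tf_idf within each cluster.
ordered <- tfidf_toy[order(tfidf_toy$cluster, -tfidf_toy$tf_idf), ]

# Rank the terms within each cluster and keep the first nb_terms.
rank_in_cluster <- ave(ordered$tf_idf, ordered$cluster, FUN = seq_along)
top_terms <- ordered[rank_in_cluster <= nb_terms, ]
top_terms
```

Each cluster contributes its nb_terms best-ranked terms, so the result has at most nb_terms rows per cluster.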

Examples

nodes <- Nodes_stagflation |>
  dplyr::rename(ID_Art = ItemID_Ref) |>
  dplyr::filter(Type == "Stagflation")

references <- Ref_stagflation |>
  dplyr::rename(ID_Art = Citing_ItemID_Ref)

temporal_networks <- build_dynamic_networks(nodes = nodes,
                                            directed_edges = references,
                                            source_id = "ID_Art",
                                            target_id = "ItemID_Ref",
                                            time_variable = "Year",
                                            cooccurrence_method = "coupling_similarity",
                                            time_window = 10,
                                            edges_threshold = 1,
                                            overlapping_window = TRUE,
                                            filter_components = TRUE)

temporal_networks <- add_clusters(temporal_networks,
                                  objective_function = "modularity",
                                  clustering_method = "leiden")

tfidf <- extract_tfidf(temporal_networks,
                       n_gram = 4,
                       text_columns = "Title",
                       grouping_columns = "cluster_leiden",
                       grouping_across_list = TRUE,
                       clean_word_method = "lemmatize")

tfidf[[1]]


agoutsmedt/networkflow documentation built on March 15, 2023, 11:51 p.m.