Nothing
Please check the latest news (change log) and keep this package updated.
⚠️ All users should update the package to version ≥ 0.3.2. Old versions may have slow processing speed and other problems.
- Improved `text_*()` functions.
- Used `\donttest{}` in more examples to avoid unnecessary errors.
- Fixed bugs in `text_unmask()`, though it has been deprecated.
- Deprecated `text_unmask()` since I have developed a new package FMAT as an integrative toolbox of the Fill-Mask Association Test (FMAT).
- Used `packageStartupMessage()` so that startup messages can be suppressed.
- Deprecated `text_unmask()`, but a new package (currently not publicly available) has been developed for the more general purpose of using masked language models to measure conceptual associations. Please wait for the release of this new package and the publication of a related methodological article.
- New `normalized` attribute when using `data_wordvec_load()`.
- New `[` method for `embed`; see new examples in `as_embed()`.
- New `unique()` method to delete duplicate words.
- New `str()` method to print the data structure and attributes.
- New `pattern()` function designed for the S3 `[` method of `embed`: Users can directly use a regular expression like `embed[pattern("^for")]` to extract a subset of the embedding matrix.
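A minimal sketch of this subsetting style. The file path is a placeholder, and the exact workflow (loading a `wordvec` and converting it to `embed` via `as_embed()`) is assumed from this change log rather than copied from the package documentation:

```r
library(PsychWordVec)

# Sketch only: the file path is a placeholder.
wv  <- load_wordvec("path/to/pretrained_wordvec.RData")
emb <- as_embed(wv)

# Regular-expression subsetting with the new pattern() helper:
# keep all words starting with "for" (e.g., "for", "form", "former").
emb_for <- emb[pattern("^for")]

# Other new methods mentioned above (assumed usage):
emb_unique <- unique(emb)  # drop duplicate words
str(emb)                   # inspect the data structure and attributes
```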
- New `plot_network()` function: Visualize a (partial correlation) network graph of words. Very useful for identifying potential semantic clusters in a list of words, and even for disentangling antonyms from synonyms.
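A hedged sketch of how this might be called; the argument names `data` and `words` are guesses modeled on the rest of the package, not the documented interface:

```r
library(PsychWordVec)

# Assumed call pattern: pass the embeddings data (`wv`, loaded earlier) and
# the words whose (partial correlation) network should be drawn.
words_of_interest <- c("happy", "glad", "joyful", "sad", "unhappy", "gloomy")
plot_network(data = wv, words = words_of_interest)
```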
- New `targets` argument of `text_unmask()`: Return specific fill-mask results for certain target words (rather than the top n results).
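A rough sketch of a fill-mask call with `targets`. Only the `targets` argument is documented here; the `query` argument name and the example sentence are illustrative, and the call requires the Python environment described in the `text_*` entries further down:

```r
library(PsychWordVec)

# Requires a Python environment with the `transformers` module
# (see text_init() and text_model_download() below).
text_init()

# Argument name `query` is an assumption; `targets` is from this change log.
text_unmask(
  query   = "Beijing is the [MASK] of China.",
  targets = c("capital", "center")
)
```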
- Improved `tab_similarity()`, `most_similar()`, `dict_expand()`, `dict_reliability()`, `test_WEAT()`, and `test_RND()`.
- Improved the `print()` method for `embed` and `wordvec`.
- `pair_similarity()` has been improved by using the matrix operation `tcrossprod(embed, embed)` to compute cosine similarity, with `embed` normalized.
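For intuition, the underlying cosine-similarity trick can be reproduced in base R: once each row of the embedding matrix has unit length, `tcrossprod()` returns the full cosine-similarity matrix in one step. The toy matrix below is illustrative, not package data:

```r
# Toy embedding matrix: 3 "words" x 4 dimensions.
emb <- rbind(
  good = c( 0.2,  0.7, -0.1,  0.4),
  nice = c( 0.3,  0.6,  0.0,  0.5),
  bad  = c(-0.4, -0.5,  0.2, -0.3)
)

# Normalize each row to unit length (L2 norm).
emb_norm <- emb / sqrt(rowSums(emb^2))

# tcrossprod(X) = X %*% t(X); on unit-length rows this is the
# cosine similarity between every pair of words.
cos_sim <- tcrossprod(emb_norm)
round(cos_sim, 3)
```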
- `data_wordvec_load()` now has two wrapper functions, `load_wordvec()` and `load_embed()`, for faster use.
- `data_wordvec_normalize()` (deprecated) has been renamed to `normalize()`.
- `get_wordvecs()` (deprecated) has been integrated into `get_wordvec()`.
- `tab_similarity_cross()` (deprecated) has been integrated into `tab_similarity()`.
- `test_WEAT()` and `test_RND()`: Warn if `T1` and `T2`, or `A1` and `A2`, have duplicate values.
- Fixed slow processing when working with a large `embed` or `wordvec` and too many words to be printed to the console. All related functions have been substantially improved so that they no longer take an unnecessarily long time.
- Functions now use `embed` (an extended class of matrix) rather than `wordvec` in order to enhance the speed!
- New `text_*` functions for contextualized word embeddings! Based on the R package `text` (and using the R package `reticulate` to call functions from the Python module `transformers`), a series of new functions have been developed to (1) download HuggingFace Transformers pre-trained language models (PLMs; thousands of options such as GPT, BERT, RoBERTa, DeBERTa, DistilBERT, etc.), (2) extract contextualized token (roughly, word) embeddings and text embeddings, and (3) fill in the blank mask(s) in a query (e.g., "Beijing is the [MASK] of China."). A typical workflow is sketched after this list.
  - `text_init()`: set up a Python environment for PLMs
  - `text_model_download()`: download PLMs from HuggingFace to the local ".cache" folder
  - `text_model_remove()`: remove PLMs from the local ".cache" folder
  - `text_to_vec()`: extract contextualized token and text embeddings
  - `text_unmask()`: fill in the blank mask(s) in a query
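A rough end-to-end sketch of that workflow. Only the function names and their purposes come from the change log; the model name and argument names below are illustrative assumptions, and everything requires a working Python environment with `transformers`:

```r
library(PsychWordVec)

# (1) Set up the Python environment for pre-trained language models.
text_init()

# (2) Download a model from HuggingFace to the local ".cache" folder.
#     "bert-base-uncased" is just an example model name.
text_model_download("bert-base-uncased")

# (3) Extract contextualized token/text embeddings
#     (argument names are illustrative guesses).
text_to_vec("Beijing is the capital of China.", model = "bert-base-uncased")

# (4) Fill in the blank mask(s) in a query.
text_unmask("Beijing is the [MASK] of China.", model = "bert-base-uncased")

# (Optional) Remove a downloaded model from the ".cache" folder.
text_model_remove("bert-base-uncased")
```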
- New `orth_procrustes()` function: Orthogonal Procrustes matrix alignment. Users can input either two matrices of word embeddings or two `wordvec` objects as loaded by `data_wordvec_load()` or transformed from matrices by `as_wordvec()`.
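A minimal sketch, assuming the function takes the two embedding matrices (or `wordvec` objects) as its first two arguments and returns the second aligned to the first; the argument order is not confirmed by this change log:

```r
library(PsychWordVec)

# Two toy embedding matrices with matching row names (words) but different
# coordinate systems; values are made up for illustration.
set.seed(1)
words    <- c("king", "queen", "man", "woman")
M_ref    <- matrix(rnorm(4 * 5), nrow = 4, dimnames = list(words, NULL))
M_target <- matrix(rnorm(4 * 5), nrow = 4, dimnames = list(words, NULL))

# Align M_target to M_ref via Orthogonal Procrustes rotation
# (assumed call pattern; see ?orth_procrustes for the actual interface).
M_aligned <- orth_procrustes(M_ref, M_target)
```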
- New `dict_expand()` function: Expand a dictionary from the most similar words, based on `most_similar()`.
- New `dict_reliability()` function: Reliability analysis (Cronbach's α) and Principal Component Analysis (PCA) of a dictionary. Note that Cronbach's α may be misleading when the number of items/words is large.
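A hedged sketch of chaining the two dictionary tools; the argument names (`data`, `words`) and the idea that `dict_expand()` returns an expanded word list are assumptions modeled on the rest of the package:

```r
library(PsychWordVec)

# Seed words for a small "positive affect" dictionary (illustrative).
seed <- c("happy", "joyful", "cheerful")

# Expand the dictionary from the most similar words in the embedding space
# (assumed arguments; `wv` is a previously loaded wordvec/embed object).
dict <- dict_expand(data = wv, words = seed)

# Check internal consistency of the expanded dictionary
# (Cronbach's alpha and PCA; alpha can be misleading with many words).
rel <- dict_reliability(data = wv, words = dict)
```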
- New `sum_wordvec()` function: Calculate the sum vector of multiple words.
- New `plot_similarity()` function: Visualize cosine similarities between word pairs in the style of a correlation matrix plot.
- New `tab_similarity_cross()` function: A wrapper of `tab_similarity()` to tabulate cosine similarities for only the n1 * n2 word pairs from two sets of words (arguments: `words1`, `words2`).
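A short sketch of the cross-set usage; only `words1` and `words2` are named in this change log, and the first (data) argument is assumed:

```r
library(PsychWordVec)

# Cosine similarities for only the 2 x 3 = 6 cross-set word pairs,
# rather than all pairwise combinations (data argument assumed).
# Note: this function was later integrated into tab_similarity()
# (see the notes above).
tab_similarity_cross(
  wv,
  words1 = c("doctor", "nurse"),
  words2 = c("man", "woman", "person")
)
```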
- New S3 methods: `print.wordvec()`, `print.embed()`, `rbind.wordvec()`, `rbind.embed()`, `subset.wordvec()`, `subset.embed()`.
- `as_matrix()` has been renamed to `as_embed()`: PsychWordVec now supports two classes of data objects: `wordvec` (data.table) and `embed` (matrix). Most functions now use `embed` (or transform `wordvec` to `embed`) internally so as to enhance the speed. Matrix is much faster!
- Deprecated `data_wordvec_reshape()`: Now use `as_wordvec()` and `as_embed()`.
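A small sketch of converting between the two classes; whether `as_embed()` accepts a bare matrix with words as row names, as shown here, is an assumption:

```r
library(PsychWordVec)

# A plain matrix of toy word embeddings, with words as row names.
m <- matrix(rnorm(3 * 4), nrow = 3,
            dimnames = list(c("apple", "banana", "cherry"), NULL))

emb_obj <- as_embed(m)          # matrix -> embed   (extended matrix class)
wv_obj  <- as_wordvec(emb_obj)  # embed  -> wordvec (data.table of word + vec)

# Both directions replace the deprecated data_wordvec_reshape().
```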
- `data_wordvec_subset()`, `get_wordvecs()`, `tab_similarity()`, and `plot_similarity()`: If neither `words` nor `pattern` is specified (`NULL`), then all words in `data` will be extracted.
- Improved `print.weat()` and `print.rnd()`.
- `test_WEAT()` and `test_RND()`: Users can specify the number of permutation samples and choose to calculate either a one-sided or a two-sided p value. This can closely reproduce the results in Caliskan et al.'s (2017) article.
- New `pooled.sd` argument for `test_WEAT()`: Users can choose the method used to calculate the pooled SD for the effect-size estimate in WEAT. However, the original approach proposed by Caliskan et al. (2017) is the default and is strongly recommended.
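A hedged sketch of a WEAT call. The argument names `T1`, `T2`, `A1`, `A2`, and `pooled.sd` appear in this change log; the data argument and word lists are illustrative, and the arguments controlling permutation samples and p-value sidedness are not named here, so they are left at their defaults:

```r
library(PsychWordVec)

# Word Embedding Association Test (WEAT) on a loaded wordvec/embed object `wv`.
# Target and attribute word sets are illustrative (flowers/insects vs.
# pleasant/unpleasant, in the spirit of Caliskan et al., 2017).
# `pooled.sd` could also be set, but the Caliskan et al. default is recommended.
weat <- test_WEAT(
  wv,
  T1 = c("rose", "tulip", "daisy"),
  T2 = c("ant", "spider", "moth"),
  A1 = c("pleasure", "love", "peace"),
  A2 = c("pain", "hatred", "war")
)

weat       # tidy report via the S3 print method print.weat()
weat$eff   # data.table with raw/standardized effects and permutation p value
```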
- New wrapper functions `as_matrix()` and `as_wordvec()` for `data_wordvec_reshape()`, which make it easier to reshape word embeddings data from a `matrix` to a "wordvec" `data.table` or vice versa.
- `test_WEAT()` and `test_RND()` have changed the element names and S3 print method of their returned objects (of new classes `weat` and `rnd`, respectively): The elements `$eff.raw`, `$eff.size`, and `$eff.sum` are now deprecated and replaced by `$eff`, which is a `data.table` containing the overall raw/standardized effects and the permutation p value. The new S3 print methods `print.weat()` and `print.rnd()` produce a tidy report of the test results when you directly type and print the returned object (see code examples).
- Console messages now use the `cli` package.
- Added a startup message shown when running `library(PsychWordVec)`.
- New class `wordvec` as the primary class of word vectors data: The data classes now contain `wordvec`, `data.table`, and `data.frame`, so such objects actually perform as a `data.table`.
- New `train_wordvec()` function: Train word vectors using the Word2Vec, GloVe, or FastText algorithm with multi-threading.
- New `tokenize()` function: Tokenize raw texts for training word vectors.
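A hedged sketch of training word vectors from raw text; the argument names (`method`, `dims`) and whether the raw texts are passed directly are illustrative guesses rather than the documented interface:

```r
library(PsychWordVec)

# Toy corpus (in practice a large character vector or text file).
corpus <- c(
  "the quick brown fox jumps over the lazy dog",
  "word embeddings map words to dense numeric vectors"
)

# Tokenize raw texts for training (assumed to take the texts directly).
tokens <- tokenize(corpus)

# Train word vectors with one of the supported algorithms
# (Word2Vec / GloVe / FastText); argument names are assumptions.
wv_trained <- train_wordvec(corpus, method = "word2vec", dims = 50)
```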
- New `data_wordvec_reshape()` function: Reshape word vectors data from dense (a `data.table` of the new class `wordvec`, with two variables `word` and `vec`) to plain (a `matrix` of word vectors) or vice versa.
- New `test_RND()` function, and `tab_WEAT()` is renamed to `test_WEAT()`: These two functions serve as convenient tools for word semantic similarity analysis and conceptual association tests.
- New `plot_wordvec_tSNE()` function: Visualize 2-D or 3-D word vectors with dimensionality reduced using the t-Distributed Stochastic Neighbor Embedding (t-SNE) method.
- New `data_wordvec_subset()` function.
- New `unique` argument for `tab_similarity()`.
- Fixed bugs in `test_WEAT()`.