Getting started

The text-package uses natural language processing and machine learning methods to examine text and numerical variables.

Central text functions are described below. The data and methods come from Kjell et al. (2019, pre-print), who show how individuals' open-ended text answers can be used to measure, describe and differentiate psychological constructs.

In short, the workflow first transforms text variables into word embeddings. These word embeddings are then used to, for example, predict numerical variables, compute semantic similarity scores, statistically test differences in meaning between two sets of texts, and plot words in the word embedding space.
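
To give a feel for what a semantic similarity score between two word embeddings is, here is a minimal base-R sketch. It assumes cosine similarity as the measure (a common choice, not necessarily the only one the package offers), and the three short vectors and the helper cosine_similarity() are made up for illustration; real embeddings have hundreds of dimensions.

```r
# Toy 4-dimensional "embeddings" (made-up numbers, not package output).
emb_a <- c(0.20, 0.10, -0.40, 0.70)
emb_b <- c(0.25, 0.05, -0.35, 0.60)
emb_c <- c(-0.60, 0.30, 0.50, -0.10)

# Hypothetical helper: cosine of the angle between two numeric vectors.
cosine_similarity <- function(x, y) {
  sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
}

# emb_a and emb_b point in nearly the same direction; emb_c does not.
sim_ab <- cosine_similarity(emb_a, emb_b)
sim_ac <- cosine_similarity(emb_a, emb_c)
```

Vectors that point in similar directions score close to 1, while vectors pointing in opposing directions score below 0.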

textEmbed(): mapping text to numbers

The textEmbed() function automatically transforms the character variables in a given tibble into word embeddings. The example data used in this tutorial come from participants who described their harmony in life and satisfaction with life using a text response, 10 descriptive words, or rating scales. For a more detailed description, please see the word embedding tutorial.


# Load the text package
library(text)

# Get example data including both text and numerical variables
sq_data <- Language_based_assessment_data_8

# Transform the text data to BERT word embeddings
wordembeddings <- textEmbed(sq_data)

# See how word embeddings are structured
wordembeddings

# Save the word embeddings to avoid having to import the text every time
saveRDS(wordembeddings, "_YOURPATH_/wordembeddings.rds")

# Get the word embeddings again
wordembeddings <- readRDS("_YOURPATH_/wordembeddings.rds")

textTrain(): Examine the relationship between text and numeric variables

The textTrain() function is used to examine how well the word embeddings from a text can predict a numeric variable. This is done by training a ridge regression model on the word embeddings with 10-fold cross-validation (where the word embeddings are pre-processed using PCA). In the example below, we examine how well the harmony text responses can predict the rating scale scores from the Harmony in Life Scale.

# Load data that has already gone through textEmbed
# The previous example only imported 10 participants; 
# whereas below we load data from 100 participants
wordembeddings <- rio::import("")
# Load corresponding numeric variables
numeric_data <-   rio::import("")

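# Examine the relationship between harmonytext and the corresponding rating scale
# (assumes numeric_data holds the Harmony in life scale total in a column named hilstotal)
model_htext_hils <- textTrain(wordembeddings$harmonytexts, 
                              numeric_data$hilstotal,
                              penalty = 1)

# Examine the correlation between predicted and observed Harmony in life scale scores
# (assumes the returned list stores it as "correlation"; inspect names(model_htext_hils) if not)
model_htext_hils$correlation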
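
To make the training procedure described above more concrete, the following base-R sketch runs ridge regression on PCA-reduced predictors with 10-fold cross-validation. It is not the implementation used by textTrain(): the data are simulated, and the helper fit_ridge(), the number of retained components (n_comp) and the penalty value are assumptions chosen for illustration.

```r
set.seed(1)

# Simulated stand-ins (not package data): 100 "participants" whose 20-dimensional
# "embeddings" are driven by 3 latent factors that also drive the rating.
n <- 100
p <- 20
latent <- matrix(rnorm(n * 3), nrow = n)
loadings <- matrix(rnorm(3 * p), nrow = 3)
embeddings <- latent %*% loadings + matrix(rnorm(n * p, sd = 0.3), nrow = n)
rating <- as.numeric(latent %*% c(1, -0.5, 0.8) + rnorm(n, sd = 0.5))

# Hypothetical helper: closed-form ridge regression with an unpenalised intercept.
fit_ridge <- function(x, y, penalty) {
  x_mean <- colMeans(x)
  y_mean <- mean(y)
  xc <- sweep(x, 2, x_mean)
  beta <- solve(crossprod(xc) + penalty * diag(ncol(xc)), crossprod(xc, y - y_mean))
  list(beta = as.numeric(beta), intercept = y_mean - sum(x_mean * beta))
}

n_folds <- 10
n_comp <- 5  # number of principal components kept (arbitrary for this sketch)
folds <- sample(rep(seq_len(n_folds), length.out = n))
predicted <- numeric(n)

for (k in seq_len(n_folds)) {
  test_rows <- folds == k
  train_x <- embeddings[!test_rows, , drop = FALSE]
  test_x <- embeddings[test_rows, , drop = FALSE]

  # Fit the PCA on the training fold only, then project both folds onto it.
  pca <- prcomp(train_x, center = TRUE, scale. = FALSE)
  rotation <- pca$rotation[, seq_len(n_comp), drop = FALSE]
  train_scores <- sweep(train_x, 2, pca$center) %*% rotation
  test_scores <- sweep(test_x, 2, pca$center) %*% rotation

  # Train on the training fold, predict the held-out fold.
  model <- fit_ridge(train_scores, rating[!test_rows], penalty = 1)
  predicted[test_rows] <- as.numeric(test_scores %*% model$beta) + model$intercept
}

# Correlation between out-of-fold predictions and observed ratings.
cv_correlation <- cor(predicted, rating)
```

Because every prediction is made for a participant who was held out when the PCA and the ridge model were fitted, cv_correlation estimates how well the model generalises rather than how well it fits the training data.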

textSimilarityTest(): Test the difference in meaning between two sets of texts

The textSimilarityTest() function provides a permutation-based test to examine whether two sets of texts differ significantly in meaning. It produces a p-value and an estimate that serves as an effect size. Below we examine whether the harmony text and satisfaction text responses differ in meaning.


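# Compare the meaning between individuals' harmony in life and satisfaction with life answers
# (assumes the satisfaction text embeddings are stored as satisfactiontexts)
textSimilarityTest(wordembeddings$harmonytexts, 
         wordembeddings$satisfactiontexts, 
         Npermutations = 100, 
         output.permutations = FALSE)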
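
The logic of a permutation test can be sketched in a few lines of base R. This is an illustration under stated assumptions, not the computation performed by textSimilarityTest(): the two embedding sets are simulated, and the test statistic (the Euclidean distance between the two group centroids, via the made-up helper centroid_distance()) is a stand-in chosen for simplicity.

```r
set.seed(2)

# Simulated stand-ins for two sets of text embeddings (rows = participants).
set_x <- matrix(rnorm(40 * 8, mean = 0.0), nrow = 40)
set_y <- matrix(rnorm(40 * 8, mean = 0.6), nrow = 40)

# Hypothetical test statistic: Euclidean distance between the group centroids.
centroid_distance <- function(a, b) {
  sqrt(sum((colMeans(a) - colMeans(b))^2))
}

observed <- centroid_distance(set_x, set_y)

# Build the null distribution by shuffling group membership and recomputing.
pooled <- rbind(set_x, set_y)
n_x <- nrow(set_x)
n_permutations <- 1000
null_stats <- replicate(n_permutations, {
  idx <- sample(nrow(pooled))
  centroid_distance(pooled[idx[1:n_x], , drop = FALSE],
                    pooled[idx[-(1:n_x)], , drop = FALSE])
})

# Share of permuted statistics at least as large as the observed one
# (the +1 terms keep the p-value above zero).
p_value <- (sum(null_stats >= observed) + 1) / (n_permutations + 1)
```

More permutations give a finer-grained p-value at the cost of computation time, which is why a small value such as 100 is convenient for a quick example but a larger number is preferable for reported results.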

Plot statistically significant words

Plotting is done in two steps. First, the textProjection() function pre-processes the data, including computing statistics for each word to be plotted. Second, textProjectionPlot() visualizes the words and offers many options for setting the color, font, etc. of the figure. Dividing the procedure into two steps makes the process more transparent, since the user gets to see the output that the words are plotted according to. It also makes it quicker: the heavier computations are made in the first step, so the second step runs fast and one can try different design settings.

textProjection(): Pre-process data for plotting


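# Pre-process word data to be plotted with the textProjectionPlot function
# wordembeddings4 and Language_based_assessment_data_8 contain example data provided with the package.

# Pre-process data
# (the two wordembeddings4 arguments are an assumption based on the comment above:
# embeddings of the harmony words and of the single words)
df_for_plotting <- textProjection(Language_based_assessment_data_8$harmonywords, 
  wordembeddings4$harmonywords,
  wordembeddings4$singlewords_we,
  Language_based_assessment_data_8$hilstotal, Language_based_assessment_data_8$swlstotal
)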

textProjectionPlot(): A two-dimensional word plot

# Used data (DP_projections_HILS_SWLS_100) has
# been pre-processed with the textProjection function
plot_projection <- textProjectionPlot(
  word_data = DP_projections_HILS_SWLS_100,
  k_n_words_to_test = FALSE,
  plot_n_words_square = 5,
  plot_n_words_p = 5,
  plot_n_word_extreme = 1,
  plot_n_word_frequency = 1,
  plot_n_words_middle = 1,
  y_axes = TRUE,
  p_alpha = 0.05,
  title_top = " Supervised Bicentroid Projection of Harmony in life words",
  x_axes_label = "Low vs. High HILS score",
  y_axes_label = "Low vs. High SWLS score",
  p_adjust_method = "bonferroni",
  points_without_words_size = 0.4,
  points_without_words_alpha = 0.4
)

# Show the plot
plot_projection

Relevant References

The text package is new and has not yet been used in a publication. Therefore, the list below consists of papers that analyze human language in ways similar to what is possible with text.

Methods Articles
Gaining insights from social media language: Methodologies and challenges.
Kern et al., (2016). Psychological Methods.

Semantic measures: Using natural language processing to measure, differentiate, and describe psychological constructs. Pre-print
Kjell et al., (2019). Psychological Methods.

Clinical Psychology
Facebook language predicts depression in medical records
Eichstaedt, J. C., ... & Schwartz, H. A. (2018). PNAS.

Social and Personality Psychology
Personality, gender, and age in the language of social media: The open-vocabulary approach
Schwartz, H. A., ... & Seligman, M. E. (2013). PLoS ONE.

Automatic Personality Assessment Through Social Media Language
Park, G., Schwartz, H. A., ... & Seligman, M. E. P. (2014). Journal of Personality and Social Psychology.

Health Psychology
Psychological language on Twitter predicts county-level heart disease mortality
Eichstaedt, J. C., Schwartz, et al. (2015). Psychological Science.

Positive Psychology
The Harmony in Life Scale Complements the Satisfaction with Life Scale: Expanding the Conceptualization of the Cognitive Component of Subjective Well-Being
Kjell, et al., (2016). Social Indicators Research

Computer Science: Python Software
DLATK: Differential language analysis toolkit
Schwartz, H. A., Giorgi, et al., (2017). In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing: System Demonstrations


text documentation built on Dec. 17, 2020, 9:08 a.m.