knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

LimpiaR is an R package built to expedite pre-processing of text variables, with some handy functions aimed at Spanish text. To get started, we'll load a few helpful libraries.

Loading Everything Into R

library(magrittr)
library(dplyr)
library(stringr)
library(LimpiaR)

LimpiaR's functions begin with limpiar_. Once the library has been loaded, typing limpiar_ in an RStudio script or R Markdown code block will produce a drop-down menu of all LimpiaR functions, which should help you find the name of the function you're looking for.

data <- data.frame(
  `Mention Url` = c("www.twitter.com/post1",
                  "www.twitter.com/post2",
                  "www.facebook.com/post1",
                  "www.facebook.com/post2",
                  "www.twitter.com/post3",
                  "www.youtube.com/post1",
                  "www.instagram.com/post1",
                  "www.youtube.com/post2",
                  "www.youtube.com/post3",
                  "www.youtube.com/post4"),
  `Mention Content` = c(  "holaaaaaa! cóómo    estás @magdalena   ?!",
                      "  han visto este articulo!? Que horror! https://guardian.com/emojisbanned NO SE PUEDE!!",
                      "ayyyyyy a mi me   gustaria ir a londres yaaa #llevame #porfavor",
                      "jajajajaja eres un wn!",
                      "RT dale un click a ver una mujer baila con su perro",
                      "grax ntonces q?",
                      "yo soy el mejor 😂😂😂, no eres nada!! 🤣🤣 ",
                      "grax ntonces q?",
                      "grax ntonces q?",
                      "grax ntonces q?"))%>%
  tibble()%>%
  relocate(Mention.Content)
data

Column Names

We created a data frame of posts and URLs. After loading libraries and data, the first step of any workflow should be to clean the column names. Clean names make tab completion and column access much faster, which adds up to big productivity gains in the long run. For this, we'll use the janitor package. You can uncomment the code below to install janitor if it is not already installed on your machine.

# if (!requireNamespace("janitor", quietly = TRUE)) {
#   install.packages("janitor")
# }

(data <- data %>% 
   janitor::clean_names())

Lower Case Text Variable

For most workflows, the next step will be to make the text variable lower case. We do this so that tokens like 'AMAZING' and 'Amazing' both become 'amazing'. You do not need a LimpiaR function for this, as the base R function tolower() works just fine.

(data <- data %>%
  mutate(mention_content = tolower(mention_content)))

Limpiar Functions

limpiar_accents

Now we're going to look at LimpiaR's functions individually. The first function is limpiar_accents, which replaces the accented characters most common in Spanish words with their unaccented equivalents, e.g. 'é' -> 'e'. We assign the result back to data to make sure these changes are saved.

Tip: you can type ?limpiar_accents to access the documentation, and see which arguments you need to fill in.

(data <- data %>%
  limpiar_accents(text_var = mention_content))
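To make the idea concrete, here is a minimal base R sketch of accent replacement. It is not LimpiaR's implementation: strip_accents_sketch is a hypothetical helper, and it only covers the five lower-case accented vowels.

```r
# Hypothetical helper (not part of LimpiaR): map accented vowels to their
# plain counterparts with base R's chartr(). Unicode escapes keep the
# example independent of the file encoding.
strip_accents_sketch <- function(x) {
  chartr("\u00e1\u00e9\u00ed\u00f3\u00fa", "aeiou", x)
}

posts <- c("c\u00f3mo est\u00e1s", "qu\u00e9 horror")
strip_accents_sketch(posts)
```

chartr() swaps characters one-for-one, so the two strings passed to it must be the same length.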

limpiar_duplicates

Now we'll remove duplicate posts. Notice that we don't actually need to type text_var = mention_content, because the default value of text_var is already mention_content.

(data <- data %>%
  limpiar_duplicates())

Note: If the text column in our data frame were called 'text' we would have to specify text_var = text, or call:

data %>% rename(mention_content = text) 
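For intuition, de-duplicating on a text column can be sketched in base R as follows. This is an illustration under the assumption that only the first occurrence of each post is kept; it is not necessarily how limpiar_duplicates works internally, and the toy posts data frame is made up for the example.

```r
# Toy data: two identical posts from different URLs, plus one unique post.
posts <- data.frame(
  mention_url = c("www.youtube.com/post1",
                  "www.youtube.com/post2",
                  "www.twitter.com/post1"),
  mention_content = c("grax ntonces q?",
                      "grax ntonces q?",
                      "jajajajaja eres un wn!"),
  stringsAsFactors = FALSE
)

# duplicated() flags every repeat after the first occurrence,
# so negating it keeps one row per distinct text.
deduped <- posts[!duplicated(posts$mention_content), ]
deduped
```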

limpiar_retweets

If you need to remove retweets, for example to create a bigram network, LimpiaR has a function just for that.

(data <- data %>% 
   limpiar_retweets())
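As a rough illustration, retweets could be filtered in base R by looking for a leading "RT" marker. The detection rule here is an assumption made for the sketch (the rule limpiar_retweets actually uses is not described above), and drop_retweets_sketch is a hypothetical helper.

```r
# Hypothetical helper: drop rows whose text starts with the token "rt"
# (case-insensitive). The \\b stops words like "rtve" from matching.
drop_retweets_sketch <- function(df, text_col = "mention_content") {
  is_rt <- grepl("^\\s*rt\\b", df[[text_col]], ignore.case = TRUE, perl = TRUE)
  df[!is_rt, , drop = FALSE]
}

posts <- data.frame(
  mention_content = c("RT dale un click a ver una mujer baila con su perro",
                      "jajajajaja eres un wn!",
                      "rtve es un canal"),
  stringsAsFactors = FALSE
)
no_retweets <- drop_retweets_sketch(posts)
no_retweets
```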

limpiar_url

We generally don't want URLs appearing in our charts or analyses, so we can remove them with the limpiar_url function.

(data <- data %>%
   limpiar_url())
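A simplified picture of URL removal, using a base R regular expression, is sketched below. The pattern is an assumption chosen for the example and will not match every URL form; remove_urls_sketch is a hypothetical helper rather than LimpiaR code.

```r
# Hypothetical helper: delete anything that starts with http://, https://
# or www. and runs up to the next whitespace character.
remove_urls_sketch <- function(x) {
  gsub("(https?://|www\\.)\\S+", "", x, perl = TRUE)
}

cleaned <- remove_urls_sketch(
  "que horror! https://guardian.com/emojisbanned no se puede!!"
)
cleaned
```

Removing the URL leaves a double space behind, which is one reason a white space clean-up step usually follows.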

limpiar_spaces

Next we'll look at how to use LimpiaR to remove annoying white space, such as spaces at the beginning of a sentence, around punctuation, or repeated for no reason, all of which are common in the messy data we often encounter.

(data <- data %>%
  limpiar_spaces())
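The kinds of clean-up described above can be approximated in base R like this. It is a sketch under the assumption that we trim the ends, collapse runs of white space, and drop spaces directly before punctuation; squish_spaces_sketch is a hypothetical name.

```r
# Hypothetical helper combining three white space fixes.
squish_spaces_sketch <- function(x) {
  x <- trimws(x)                               # leading/trailing spaces
  x <- gsub("\\s+", " ", x, perl = TRUE)       # runs of spaces -> one space
  gsub("\\s+([?!.,;:])", "\\1", x, perl = TRUE) # space before punctuation
}

squished <- squish_spaces_sketch("  hola   que tal   ?!  ")
squished
```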

limpiar_tags

We can also remove user handles (e.g. @magdalena) and hashtags with the limpiar_tags function. Remember, you can type ?limpiar_tags to access documentation.

Replace only hashtags:

data %>%
  limpiar_tags(user = FALSE, hashtag = TRUE)

Replace only user tags:

data %>%
  limpiar_tags(user = TRUE, hashtag = FALSE)

Replace both hashtags and user handles:

data %>%
  limpiar_tags()
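To show how the user and hashtag switches might interact, here is a base R sketch. The placeholder tokens "@user" and "hashtag" are assumptions made for illustration, as is the helper name replace_tags_sketch; check ?limpiar_tags for what the real function produces.

```r
# Hypothetical helper: swap handles and/or hashtags for placeholder tokens.
replace_tags_sketch <- function(x, user = TRUE, hashtag = TRUE) {
  if (user) x <- gsub("@\\w+", "@user", x, perl = TRUE)
  if (hashtag) x <- gsub("#\\w+", "hashtag", x, perl = TRUE)
  x
}

post <- "hola @magdalena #llevame"
both <- replace_tags_sketch(post)
hashtags_only <- replace_tags_sketch(post, user = FALSE, hashtag = TRUE)
both
hashtags_only
```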

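Quick recap - we've looked at: limpiar_accents, limpiar_duplicates, limpiar_retweets, limpiar_url, limpiar_spaces and limpiar_tags.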

limpiar_shorthands

One of the biggest problems with the messy data we encounter is shorthand. Generally, algorithms have been trained on clean, standard language, so they have not encountered shorthands and abbreviations. Shorthands also change all the time, making it impractical to continuously retrain algorithms as new ones arise. This function attempts to bridge that gap by normalising the most common shorthands.

(data <- data %>%
   limpiar_shorthands())
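Conceptually, shorthand normalisation is a lookup from short form to full form. The sketch below assumes a tiny hand-written lookup with three entries; LimpiaR's own list is much larger and is not reproduced here, and expand_shorthands_sketch is a hypothetical helper.

```r
# Illustrative lookup only (assumed mappings, not LimpiaR's list).
shorthand_lookup <- c(grax = "gracias", ntonces = "entonces", q = "que")

# Replace each shorthand only when it appears as a whole word.
expand_shorthands_sketch <- function(x, lookup = shorthand_lookup) {
  for (short in names(lookup)) {
    pattern <- paste0("\\b", short, "\\b")
    x <- gsub(pattern, lookup[[short]], x, perl = TRUE)
  }
  x
}

expanded <- expand_shorthands_sketch("grax ntonces q?")
expanded
```

The word boundaries matter: without them, the "q" rule would also rewrite the q inside words such as "queso".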

limpiar_repeat_chars

We don't want our algorithm to have to learn the difference between 'ajajaj' and 'jaja' or 'ay' and 'ayyyy' as practically speaking, there is none. We also don't want to introduce unnecessary tokens, so we normalise the most common occurrences of repeated or additional characters.

(data <- data %>%
   limpiar_repeat_chars())
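One simple way to picture this normalisation is with two regular expressions, sketched below in base R. The rules are assumptions for the example (collapse any character repeated three or more times, and shorten long runs of "ja"); they are not taken from LimpiaR's source, and normalise_repeats_sketch is a hypothetical helper.

```r
# Hypothetical helper with two illustrative rules.
normalise_repeats_sketch <- function(x) {
  # A character repeated 3+ times in a row collapses to a single copy.
  x <- gsub("(.)\\1{2,}", "\\1", x, perl = TRUE)
  # Two or more consecutive "ja" collapse to "jaja".
  gsub("(ja){2,}", "jaja", x, perl = TRUE)
}

normalised <- normalise_repeats_sketch(c("ayyyyyy", "jajajajaja", "llevame"))
normalised
```

Requiring three or more repeats leaves legitimate double letters, such as the "ll" in "llevame", untouched.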

Generally, the steps we've taken so far will be used in each and every analysis/project to help clean the data. We will now look at some of the more circumstantial functions, i.e. they will not be used in every analysis.

limpiar_emojis

We don't need to use limpiar_emojis for every analysis, as many ParseR & SegmentR functions ignore emojis implicitly. However, if we know that we need it for a special purpose, we can use limpiar_emojis(). One problem with this, and the reason it is for special cases only, is that the emojis' encodings are in English. We may, at some point, translate them to Spanish, but it seems unlikely.

data %>%
  limpiar_emojis(text_var = mention_content, with_emoji_tag = FALSE)

or with the default settings:

data %>%
  limpiar_emojis()

limpiar_stopwords

Stop words are common words that do not provide us with much information as to an utterance's meaning. For example, in the sentence: 'the man is in prison for theft', if we knew only one word from this sentence, and that word was 'is', 'in', 'the', or 'for' then we wouldn't have much idea what the sentence is about. However, 'prison' or 'theft', would give us a lot more information.

For many analyses, we remove stop words to help us see the 'highest information' words and get a high-level understanding of large bodies of text (such as in topic modelling and bigram networks). For virtually all scenarios, you will want to use limpiar_stopwords() with the argument stop_words = "topics", like so:

data %>%
  limpiar_stopwords(stop_words = "topics") %>%
  limpiar_spaces() # clear the extra spaces left by the removed words

Sentences can look quite strange without stopwords, and a lot of social posts are virtually meaningless altogether!

It's also worth pointing out that a lot of information can be lost when removing stop words. Many phrases in English and Spanish have very different meanings when a stop word is removed, and some stop word lists contain negations, which can drastically change the meaning of a sentence!
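Both points, the gain in signal and the risk of losing negation, can be seen in a small base R sketch. The stop word list here is a toy one invented for the example (it is not the "topics" list), and remove_stopwords_sketch is a hypothetical helper.

```r
# Hypothetical helper: split on spaces, drop listed words, re-join.
remove_stopwords_sketch <- function(x, stop_words) {
  vapply(strsplit(x, " ", fixed = TRUE), function(words) {
    paste(words[!words %in% stop_words], collapse = " ")
  }, character(1))
}

# Toy list for illustration only; note that it contains the negation "no".
toy_stop_words <- c("the", "is", "in", "for", "no")

stripped <- remove_stopwords_sketch(
  c("the man is in prison for theft", "no me gusta"),
  toy_stop_words
)
stripped
```

The first sentence keeps its high-information words, while the second flips from "I don't like it" to "I like it" once "no" is removed.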

Utility Functions

We are nearly at the end of this introduction to LimpiaR, but before we finish, let's look at two utility functions which may be useful. We've conjured up a new data frame called df, which we will use to show the last two functions and how to chain everything together.

df <- data.frame(
  `Mention Url` = c("www.twitter.com/post1",
                  "www.twitter.com/post2",
                  "www.facebook.com/post1",
                  "www.facebook.com/post2",
                  "www.twitter.com/post3",
                  "www.youtube.com/post1",
                  "www.instagram.com/post1",
                  "www.youtube.com/post2",
                  "www.youtube.com/post3",
                  "www.youtube.com/post4"),
  `Mention Content` = c(  "holaaaaaa! cóómo    estás @magdalena   ?!",
                      "  han visto este articulo!? Que horror! https://guardian.com/emojisbanned NO SE PUEDE!!",
                      "ayyyyyy a mi me   gustaria ir a londres yaaa #llevame #porfavor",
                      "jajajajaja eres un wn!",
                      "RT dale un click a ver una mujer baila con su perro",
                      "grax ntonces q?",
                      "yo soy el mejor 😂😂😂, no eres nada!! 🤣🤣 ",
                      "grax ntonces q?",
                      "grax ntonces q?",
                      "grax ntonces q?"),
  na_col = c(NA, NA, NA, NA, NA, NA, NA, NA, "tadaa", NA))%>%
  tibble()%>%
  relocate(Mention.Content)%>%
  janitor::clean_names()
df

limpiar_inspect

So, imagine that we see a strange pattern, and we want to check what's going on with that specific pattern. We can use limpiar_inspect to view all posts which contain that pattern in an interactive frame!

limpiar_inspect(df, pattern = "ntonces", text_var = mention_content,
                url_var = mention_url, title = "ntonces")
limpiar_inspect(df, pattern = "ntonces", text_var = mention_content,
                url_var = mention_url, title = "ntonces")%>%
  knitr::kable() 

df

Whilst it's pretty obvious that all of the 'grax ntonces q?' posts are exactly the same, in the real world we're going to have 10,000 times as many posts, and searching for suspicious patterns will take up a lot of our time.
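The filtering step behind this kind of inspection can be sketched in base R as below. This only covers the row selection; the interactive view is not reproduced, and inspect_sketch and the toy posts data frame are hypothetical.

```r
# Hypothetical helper: keep the rows whose text contains a literal pattern.
inspect_sketch <- function(df, pattern, text_col = "mention_content") {
  df[grepl(pattern, df[[text_col]], fixed = TRUE), , drop = FALSE]
}

posts <- data.frame(
  mention_url = c("www.youtube.com/post1",
                  "www.facebook.com/post2",
                  "www.youtube.com/post2"),
  mention_content = c("grax ntonces q?",
                      "jajajajaja eres un wn!",
                      "grax ntonces q?"),
  stringsAsFactors = FALSE
)
matches <- inspect_sketch(posts, pattern = "ntonces")
matches
```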

limpiar_na_cols

This final function is useful when we want to remove 'mostly NA' columns of a data frame. We may want to do this to save memory, for example if we have 400,000 posts and 80 columns. In this case we'll get rid of all columns for which 25% or more of their values are NA.

limpiar_na_cols(df, threshold = 0.25)
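The thresholding logic can be written in a couple of lines of base R. The sketch below follows the description above (drop a column when its share of NA values is at or above the threshold); drop_na_cols_sketch and the toy data frame are hypothetical and not LimpiaR code.

```r
# Hypothetical helper: keep only columns whose NA share is below threshold.
drop_na_cols_sketch <- function(df, threshold = 0.25) {
  na_share <- colMeans(is.na(df)) # proportion of NA values per column
  df[, na_share < threshold, drop = FALSE]
}

toy <- data.frame(
  text = c("a", "b", "c", "d"),
  url = c("u1", "u2", "u3", "u4"),
  na_col = c(NA, NA, NA, "tadaa"), # 75% NA, so it should be dropped
  stringsAsFactors = FALSE
)
kept <- drop_na_cols_sketch(toy, threshold = 0.25)
names(kept)
```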

Putting It All Together

To speed things up, we can call the functions together in one big long pipe.

df %>%
  limpiar_na_cols(threshold = 0.25)%>%
  limpiar_accents()%>%
  limpiar_retweets()%>%
  limpiar_shorthands()%>%
  limpiar_repeat_chars()%>%
  limpiar_url()%>%
  limpiar_emojis()%>%
  limpiar_shorthands()%>%
  limpiar_spaces()%>%
  limpiar_duplicates()

However, on very large data frames, it will often be better to run one or a few functions at a time, assigning the result back to your data frame as you go.



jpcompartir/LimpiaR documentation built on May 22, 2024, 4:19 a.m.