knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
LimpiaR is a package built to expedite the pre-processing and cleaning of text data, with handy functions aimed at Spanish-language text. To get started, we'll load a few helpful libraries.
library(magrittr)
library(dplyr)
library(stringr)
library(LimpiaR)
LimpiaR's functions begin with limpiar_. Once the library has been loaded, typing limpiar_ in an RStudio script or R Markdown code block will produce a drop-down menu of all LimpiaR functions, which should help you find the name of the function you're looking for - you can then use tab to autocomplete it. Once inside the function, RStudio should show a popover listing the arguments the function expects. You can also press 'control + space' when your cursor is inside the function's brackets to force extra help.
data <- data.frame(
  `Mention Url` = c("www.twitter.com/post1", "www.twitter.com/post2",
                    "www.facebook.com/post1", "www.facebook.com/post2",
                    "www.twitter.com/post3", "www.youtube.com/post1",
                    "www.instagram.com/post1", "www.youtube.com/post2",
                    "www.youtube.com/post3", "www.youtube.com/post4"),
  `Mention Content` = c(
    "holaaaaaa! cóómo estás @magdalena ?!",
    " han visto este articulo!? Que horror! https://guardian.com/emojisbanned NO SE PUEDE!!",
    "ayyyyyy a mi me gustaria ir a londres yaaa #llevame #porfavor",
    "jajajajaja eres un wn!",
    "RT dale un click a ver una mujer baila con su perro",
    "grax ntonces q?",
    "yo soy el mejor 😂😂😂, no eres nada!! 🤣🤣 ",
    "grax ntonces q?",
    "grax ntonces q?",
    "grax ntonces q?")) %>%
  tibble() %>%
  relocate(Mention.Content)
data
We created a data frame of posts and URLs. After loading libraries and data, the first part of any workflow should be to clean the column names; this makes tab completion and accessing column names much faster (which, in the long run, means big productivity gains). For this, we'll use the janitor package. You can uncomment the code to install janitor if it is not already installed on your machine.
# ifelse(!"janitor" %in% installed.packages(),
#        install.packages("janitor"),
#        library(janitor))

(data <- data %>% janitor::clean_names())
For most workflows, the next step will be to make the text variable lower case. We do this to make tokens like 'AMAZING' or 'Amazing' -> 'amazing'. You do not need a LimpiaR function for this, as the base R function tolower() works just fine.
(data <- data %>% mutate(mention_content = tolower(mention_content)))
Now we're going to look at LimpiaR's functions individually. The first function is limpiar_accents; this replaces the accented characters most common in Spanish text with their unaccented Latin-alphabet equivalents, e.g. 'é' -> 'e'. We will use the assignment operator to make sure these changes are saved.
Tip: you can type ?limpiar_accents to access the documentation, and see which arguments you need to fill in.
(data <- data %>% limpiar_accents(text_var = mention_content))
Now we'll remove duplicate posts. Notice that we don't actually need to type text_var = mention_content, because the default argument for text_var is already mention_content.
(data <- data %>% limpiar_duplicates())
Note: if the text column in our data frame were called 'text', we would have to specify text_var = text, or rename the column so the default applies:
data %>% rename(mention_content = text)
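Or, keeping the differently-named column, we can point the function at it explicitly. A quick sketch of that alternative (here we fake the situation by renaming the column first; passing the column unquoted mirrors the limpiar_accents call above):

# Simulate a data frame whose text column is called 'text',
# then tell limpiar_duplicates which column to use
data %>%
  rename(text = mention_content) %>%
  limpiar_duplicates(text_var = text)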
If you need to remove retweets, for example to create a bigram network, LimpiaR has a function just for that.
(data <- data %>% limpiar_retweets())
We generally don't want URLs appearing in our charts or analyses, so we can remove them with the limpiar_url function.
(data <- data %>% limpiar_url())
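For intuition, URL removal usually boils down to a pattern along these lines. This is a rough sketch of the idea with stringr, not LimpiaR's exact regex:

# Strip anything that looks like an http(s) or www link
stringr::str_remove_all(
  "han visto este articulo!? https://guardian.com/emojisbanned no se puede!!",
  "https?://\\S+|www\\.\\S+"
)
#> [1] "han visto este articulo!?  no se puede!!"

Note the double space left behind where the link used to be; that's exactly the kind of thing the next function tidies up.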
Next we'll look at how to use LimpiaR to remove annoying white spaces: those at the beginning of a sentence, between punctuation, or repeated for no reason, as is common in the messy data we often encounter.
(data <- data %>% limpiar_spaces())
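For a feel of what this kind of cleaning does on a single string, stringr's str_squish() behaves similarly: it trims leading and trailing spaces and collapses internal runs of whitespace. This is only an illustration of the idea, not LimpiaR's implementation:

# Trim the ends and collapse internal runs of whitespace
stringr::str_squish("  han visto este articulo !?   que horror!  ")
#> [1] "han visto este articulo !? que horror!"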
We can also remove user handles (e.g. @magdalena) and hashtags with the limpiar_tags function. Remember, you can type ?limpiar_tags to access documentation.
Replace only hashtags:
data %>% limpiar_tags(user = FALSE, hashtag = TRUE)
Replace only user tags:
data %>% limpiar_tags(user = TRUE, hashtag = FALSE)
Replace both hashtags and user handles:
data %>% limpiar_tags()
One of the biggest problems with the messy data we encounter is shorthand. Algorithms are generally trained on clean, standard language, so they don't cope well with shorthands and abbreviations. Shorthands also change all the time, making it impractical to continuously re-train algorithms as new ones arise. This function attempts to bridge that gap by normalising the most common shorthands.
(data <- data %>% limpiar_shorthands())
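Conceptually, this kind of normalisation is a lookup from shorthand to standard form. Here's a minimal sketch with a few made-up entries; the mapping is illustrative only and is not LimpiaR's actual dictionary:

# Illustrative shorthand -> standard form mapping (names are regex patterns)
shorthands <- c("\\bgrax\\b" = "gracias",
                "\\bq\\b" = "que",
                "\\bntonces\\b" = "entonces")
stringr::str_replace_all("grax ntonces q?", shorthands)
#> [1] "gracias entonces que?"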
We don't want our algorithm to have to learn the difference between 'ajajaj' and 'jaja', or 'ay' and 'ayyyy', as practically speaking there is none. We also don't want to introduce unnecessary tokens, so we normalise the most common occurrences of repeated or additional characters.
(data <- data %>% limpiar_repeat_chars())
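As a rough illustration of the idea (not LimpiaR's actual rule set), collapsing any character repeated three or more times covers cases like 'ayyyy', though patterns such as 'jajajaja' need additional handling:

# Collapse runs of 3+ identical characters down to a single occurrence
stringr::str_replace_all("holaaaaaa! ayyyyy", "(.)\\1{2,}", "\\1")
#> [1] "hola! ay"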
Generally, the steps we've taken so far will be used in each and every analysis/project to help clean the data. We will now look at some of the more circumstantial functions, i.e. they will not be used in every analysis.
Emojis are a type of non-ASCII unicode character, which means that removing all non-ASCII characters will necessarily remove emojis. It also means that functions designed to target certain unicode characters by pattern may inadvertently remove other special characters as well as emojis, or instead of them!
Why not just use a regular expression?
We scraped some lists of emojis so that the emojis themselves can be matched directly for replacement. This approach is more computationally expensive than filtering via regex, but it is often more precise because it targets the emojis directly.
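To see the trade-off in action, here's a base-R sketch of the blunt alternative (not a LimpiaR function): forcing text to ASCII does remove the emoji, but it strips the accented characters along with it.

# Dropping every non-ASCII character loses the accents as well as the emoji
iconv("cómo estás 😂", from = "UTF-8", to = "ASCII", sub = "")
#> [1] "cmo ests "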
We don't need to use limpiar_recode_emojis for every analysis, as many ParseR & SegmentR functions ignore emojis implicitly. However, if we know that we need to replace them with their text descriptions, we can use limpiar_recode_emojis(). One problem with this, and the reason it is for special cases only, is that the emoji descriptions are in English. We may, at some point, translate them to Spanish, but it seems unlikely.
data %>% limpiar_recode_emojis(text_var = mention_content, with_emoji_tag = FALSE)
Or if we set with_emoji_tag to TRUE, our emojis are pasted together with '_' and given an '_emoji' label.
data %>% limpiar_recode_emojis(mention_content, with_emoji_tag = TRUE)
Warning: the limpiar_recode_emojis() function is quite slow and scales poorly with the size of its inputs, so if you use it on a large dataset with many long documents, expect it to take a while to run.
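If you're unsure whether it's workable for your dataset, a quick timing check on the data (or a sample of it) costs little. This is plain base R, nothing LimpiaR-specific:

# Gauge how long the recode takes before running it at scale
system.time(
  data %>% limpiar_recode_emojis(mention_content)
)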
What about situations where we don't want to replace emojis with their text descriptions, where we don't mind risking the loss of some other non-ASCII characters, or where we just want something that runs fast?
Instead of limpiar_recode_emojis we can use limpiar_remove_emojis! This function operates with a fairly simple regex pattern, meaning it runs a lot more efficiently than its recode counterpart.
data %>% limpiar_remove_emojis(mention_content)
ASCII (American Standard Code for Information Interchange) is a character-encoding standard for representing numbers and text. There are 128 ASCII characters, including the letters from a-z in upper and lowercase, numbers 0-9, common punctuation marks, and a few additional characters with specific uses for computers. Everything else is non-ASCII.
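If you ever want to check where a character sits, code points below 128 are ASCII and everything above is not. A small base-R illustration:

utf8ToInt("a")   #> 97      (ASCII)
utf8ToInt("é")   #> 233     (non-ASCII, but a Latin accent we want to keep)
utf8ToInt("😂")  #> 128514  (non-ASCII emoji)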
All 128 ASCII characters, by decimal/hex code point:

000-031 (00-1F): control characters - NUL, SOH, STX, ETX, EOT, ENQ, ACK, BEL, BS, HT, LF, VT, FF, CR, SO, SI, DLE, DC1 (XON), DC2, DC3 (XOFF), DC4, NAK, SYN, ETB, CAN, EM, SUB, ESC, FS, GS, RS, US
032 (20): space
033-047 (21-2F): punctuation - ! " # $ % & ' ( ) * + , - . /
048-057 (30-39): digits 0-9
058-064 (3A-40): punctuation - : ; < = > ? @
065-090 (41-5A): uppercase A-Z
091-096 (5B-60): punctuation - [ \ ] ^ _ `
097-122 (61-7A): lowercase a-z
123-126 (7B-7E): punctuation - { | } ~
127 (7F): DEL (delete)
For our purposes we have extended the ASCII set to include accented Latin characters (é, í, etc.). Let's get our original data frame back to demonstrate: we want to remove things like emojis, but keep our punctuation and accented characters:
data <- data.frame(
  `Mention Url` = c("www.twitter.com/post1", "www.twitter.com/post2",
                    "www.facebook.com/post1", "www.facebook.com/post2",
                    "www.twitter.com/post3", "www.youtube.com/post1",
                    "www.instagram.com/post1", "www.youtube.com/post2",
                    "www.youtube.com/post3", "www.youtube.com/post4"),
  `Mention Content` = c(
    "holaaaaaa! cóómo estás @magdalena ?!",
    " han visto este articulo!? Que horror! https://guardian.com/emojisbanned NO SE PUEDE!!",
    "ayyyyyy a mi me gustaria ir a londres yaaa #llevame #porfavor",
    "jajajajaja eres un wn!",
    "RT dale un click a ver una mujer baila con su perro",
    "grax ntonces q?",
    "yo soy el mejor 😂😂😂, no eres nada!! 🤣🤣 ",
    "grax ntonces q?",
    "grax ntonces q?",
    "grax ntonces q?")) %>%
  tibble() %>%
  relocate(Mention.Content) %>%
  janitor::clean_names()

data
data %>% limpiar_non_ascii(mention_content)
Similar to non-ASCII characters, we can retain only the alphanumeric characters (a-zA-Z0-9 + spaces). This is a heavy-duty option which will remove all accented characters.
data %>% limpiar_alphanumeric(mention_content)
If we want to use limpiar_alphanumeric and retain our accented characters, then we should recode the accents first with limpiar_accents:
data %>% limpiar_accents(mention_content) %>% limpiar_alphanumeric(mention_content)
You'll need to make an informed choice between limpiar_alphanumeric, limpiar_non_ascii, and other functions like limpiar_accents and limpiar_*_emojis.
Stop words are common words that do not provide us with much information as to an utterance's meaning. For example, in the sentence: 'the man is in prison for theft', if we knew only one word from this sentence, and that word was 'is', 'in', 'the', or 'for' then we wouldn't have much idea what the sentence is about. However, 'prison' or 'theft', would give us a lot more information.
For many analyses, we remove stop words to help us see the 'highest information' words and get a high-level understanding of large bodies of text (such as in topic modelling and bigram networks). For virtually all scenarios, you will want to use limpiar_stopwords() with the argument stop_words = "topics", like so:
data %>% limpiar_stopwords(stop_words = "topics") %>% limpiar_spaces() # to clear the spaces left by removed words
However, sometimes we want to keep words that would usually be treated as stop words, for a specific purpose. For example, when we're analysing sentiment, negations can invert the sentiment of a text - 'no me gusta' vs 'me gusta'. If we remove all instances of 'no' from our data, we will do a worse job of analysing sentiment. For Spanish, the sentiment stop word list is therefore slightly shorter than the topics list, with a few choice terms removed.
data %>% limpiar_stopwords(stop_words = "sentiment") %>% limpiar_spaces()
Warning - sentences can look quite strange without stopwords, and a lot of social posts are virtually meaningless altogether!
It's also worth pointing out that a lot of information can be lost when removing stop words. Many phrases in English and Spanish have very different meanings when a stop word is removed, and some stop word lists contain negatives, which can drastically change the meaning of a sentence! So use with care.
We are nearly at the end of this introduction to LimpiaR, but before we finish, let's look at two utility functions which may be useful. We've conjured up a new data frame called df, which we will use to show the last two functions and how to chain everything together.
df <- data.frame(
  `Mention Url` = c("www.twitter.com/post1", "www.twitter.com/post2",
                    "www.facebook.com/post1", "www.facebook.com/post2",
                    "www.twitter.com/post3", "www.youtube.com/post1",
                    "www.instagram.com/post1", "www.youtube.com/post2",
                    "www.youtube.com/post3", "www.youtube.com/post4"),
  `Mention Content` = c(
    "holaaaaaa! cóómo estás @magdalena ?!",
    " han visto este articulo!? Que horror! https://guardian.com/emojisbanned NO SE PUEDE!!",
    "ayyyyyy a mi me gustaria ir a londres yaaa #llevame #porfavor",
    "jajajajaja eres un wn!",
    "RT dale un click a ver una mujer baila con su perro",
    "grax ntonces q?",
    "yo soy el mejor 😂😂😂, no eres nada!! 🤣🤣 ",
    "grax ntonces q?",
    "grax ntonces q?",
    "grax ntonces q?"),
  na_col = c(NA, NA, NA, NA, NA, NA, NA, NA, "tadaa", NA)) %>%
  tibble() %>%
  relocate(Mention.Content) %>%
  janitor::clean_names()
df
So, imagine that we see a strange pattern, and we want to check what's going on with that specific pattern. We can use limpiar_inspect to view all posts which contain that pattern in an interactive frame!
limpiar_inspect(df, pattern = "ntonces", text_var = mention_content, url_var = mention_url, title = "ntonces")
limpiar_inspect(df, pattern = "ntonces", text_var = mention_content, url_var = mention_url, title = "ntonces") %>% knitr::kable()
Whilst it's pretty obvious that all of the 'grax ntonces q?' posts are exactly the same, in the real world we're going to have 10,000 times as many posts, and searching for suspicious patterns may take up a lot of our time.
This final function is useful when we want to remove 'mostly NA' columns of a data frame. We may want to do this to save memory, for example if we have 400,000 posts and 80 columns. In this case we'll get rid of all columns for which 25% or more of their values are NA.
limpiar_na_cols(df, threshold = 0.25)
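To make the threshold concrete, here's roughly what a 25% cut-off means; a sketch of the idea rather than LimpiaR's internals:

# Proportion of NA values per column; na_col is 90% NA, so a 0.25 threshold drops it
colMeans(is.na(df))
df[, colMeans(is.na(df)) < 0.25]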
To speed things up, we could call the functions together in one big long pipe.
df %>%
  limpiar_na_cols(threshold = 0.25) %>%
  limpiar_accents() %>%
  limpiar_retweets() %>%
  limpiar_shorthands() %>%
  limpiar_repeat_chars() %>%
  limpiar_url() %>%
  limpiar_remove_emojis() %>%
  limpiar_spaces() %>%
  limpiar_duplicates()
However, we generally want to check what the effects of our transformations are on our data, so doing a lot of operations like this without any intermediate checks is risky.
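One lightweight habit is to break the pipe at a few checkpoints and compare before and after. A sketch of that workflow, not a LimpiaR feature:

# Run a few steps, then inspect before carrying on
cleaned <- df %>%
  limpiar_na_cols(threshold = 0.25) %>%
  limpiar_retweets() %>%
  limpiar_duplicates()

nrow(df) - nrow(cleaned)                        # how many posts did we drop?
cleaned %>% count(mention_content, sort = TRUE) # anything suspicious left over?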