YouTube content frequently contains emojis, special Unicode characters, and text in various languages. The tuber package provides built-in functions for detecting, extracting, and manipulating emojis without external dependencies.
```r
library(tuber)

# Get comments from a video
comments <- get_all_comments(video_id = "your_video_id")

# Check which comments contain emojis
comments$has_emoji <- has_emoji(comments$textDisplay)

# Count emojis per comment
comments$emoji_count <- count_emojis(comments$textDisplay)

# Filter to emoji-rich comments
emoji_comments <- comments[comments$emoji_count > 0, ]
```
The package provides five main functions for working with emojis:
### has_emoji() - Check for emoji presence

```r
has_emoji("Hello world")
# FALSE

has_emoji("Great video! \U0001F44D")
# TRUE

has_emoji(c("No emoji", "Has emoji \U0001F600", "Also none"))
# c(FALSE, TRUE, FALSE)
```
### count_emojis() - Count emojis in text

```r
count_emojis("Hello world")
# 0

count_emojis("Rating: \U0001F600\U0001F600\U0001F600")
# 3

count_emojis(c("None", "\U0001F44D", "\U0001F600\U0001F601"))
# c(0, 1, 2)
```
### extract_emojis() - Get emojis from text

```r
extract_emojis("Hello \U0001F44B World \U0001F30D!")
# list(c("\U0001F44B", "\U0001F30D"))

extract_emojis(c("No emoji", "\U0001F600\U0001F601"))
# list(character(0), c("\U0001F600", "\U0001F601"))
```
### remove_emojis() - Strip emojis from text

```r
remove_emojis("Hello \U0001F44B World!")
# "Hello World!"

remove_emojis(c("No emoji", "Has \U0001F600 emoji"))
# c("No emoji", "Has emoji")
```
### replace_emojis() - Substitute emojis

```r
replace_emojis("Hello \U0001F44B World!", replacement = "[emoji]")
# "Hello [emoji] World!"

replace_emojis("Rate: \U0001F600\U0001F600\U0001F600", replacement = "*")
# "Rate: ***"
```
```r
comments <- get_all_comments(video_id = "your_video_id")
comments$emoji_count <- count_emojis(comments$textDisplay)

# Top 10 most emoji-heavy comments
top_emoji <- comments[order(-comments$emoji_count), ][1:10, ]
```
```r
# Remove emojis for text analysis
comments$clean_text <- remove_emojis(comments$textDisplay)

# Now use clean_text for sentiment analysis or word clouds
```
```r
# Extract all emojis from comments
all_emojis <- unlist(extract_emojis(comments$textDisplay))

# Count frequency
emoji_freq <- table(all_emojis)
sort(emoji_freq, decreasing = TRUE)[1:10]
```
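The `table()`/`sort()` step works on any character vector, so you can try it without API access. A standalone sketch with hard-coded example emojis:

```r
# Frequency count on a small, hard-coded emoji vector
all_emojis <- c("\U0001F600", "\U0001F44D", "\U0001F600", "\U0001F600")
emoji_freq <- sort(table(all_emojis), decreasing = TRUE)
emoji_freq

# The single most frequent emoji
names(emoji_freq)[1]
# "\U0001F600"
```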
Beyond emojis, tuber handles Unicode text consistently:
### safe_utf8() - Ensure UTF-8 encoding

```r
problematic_text <- c("caf\xe9", "na\xefve")
safe_text <- safe_utf8(problematic_text)
```
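For comparison, base R can perform a similar repair when you know the source encoding. This sketch uses only base functions and is not part of tuber:

```r
# Base R approach: declare the real encoding, then convert to UTF-8
problematic <- "caf\xe9"            # byte 0xe9 is e-acute in latin-1
Encoding(problematic) <- "latin1"   # tell R what the bytes mean
fixed <- enc2utf8(problematic)

Encoding(fixed)   # "UTF-8"
nchar(fixed)      # 4 characters: c, a, f, e-acute
```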
### clean_youtube_text() - Clean HTML and normalize text

```r
raw_text <- "Great video! &lt;3 &amp; more..."
clean_text <- clean_youtube_text(raw_text)
# "Great video! <3 & more..."
```
Your R environment may not support UTF-8 display. The data is still correct; only the display is affected. Try:
```r
# Check the current locale
Sys.getlocale("LC_CTYPE")

# Set a UTF-8 locale on macOS/Linux
Sys.setlocale("LC_CTYPE", "en_US.UTF-8")
```
Compound emojis (like family emojis or skin tone modifiers) may be counted as multiple characters. This is due to how Unicode encodes these as sequences of code points.
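You can see why with base R alone: a "family" emoji is several code points joined by zero-width joiners (U+200D), even though it renders as one glyph:

```r
# Family emoji = man + ZWJ + woman + ZWJ + boy (one glyph, five code points)
family <- "\U0001F468\u200D\U0001F469\u200D\U0001F466"

nchar(family)      # 5 -- counts code points, not perceived characters
utf8ToInt(family)  # the five underlying code points
```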
The emoji pattern covers most common Unicode emoji blocks. Very new emojis added in recent Unicode versions may not be detected until the pattern is updated.
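As an illustration of how range-based detection works (a simplified stand-in, not tuber's actual pattern), a regex character class over a few major emoji blocks can be built like this:

```r
# Simplified, illustrative pattern covering three common emoji blocks only
emoji_blocks <- paste0(
  "[",
  "\U0001F300-\U0001F5FF",  # Miscellaneous Symbols and Pictographs
  "\U0001F600-\U0001F64F",  # Emoticons
  "\U0001F680-\U0001F6FF",  # Transport and Map Symbols
  "]"
)

grepl(emoji_blocks, c("plain text", "nice \U0001F680"), perl = TRUE)
# FALSE TRUE
```

Characters from blocks outside these ranges would be missed, which is exactly the limitation described above.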