knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  message = FALSE,
  warning = FALSE
)

options(scipen = 999)


require(dplyr)
require(ggplot2)
require(lubridate)
require(stringr)
require(janitor)
require(forcats)
require(kableExtra)

First, let's load tiktokr and some tidyverse libraries.

library(tiktokr)
library(dplyr)
library(ggplot2)
library(lubridate)
library(stringr)

Make sure to use your preferred Python installation:

library(reticulate)

use_python(py_config()$python)

The next two steps you only need to do once:

  1. Install necessary Python libraries
tk_install()
  2. Authentication

In November 2020, TikTok tightened its security measures. It now frequently shows a captcha, which is easily triggered after a few requests. This can be worked around by specifying the cookie parameter. To get a session cookie:

  1. Open a browser and go to "http://tiktok.com"
  2. Scroll down a bit to make sure that you don't get a captcha
  3. Open the JavaScript console (in Chrome: View > Developer > JavaScript Console)
  4. Run document.cookie in the console. Copy the entire output (your cookie).
  5. Run tk_auth() in R and paste the cookie.

Click on the image below for a screen recording of how to get your TikTok cookie:

The tk_auth function saves the cookie (and user agent) as environment variables in your .Renviron file. You only need to run it once before using tiktokr, and again whenever you want to update your cookie or user agent.

tk_auth(cookie = "<paste here the output from document.cookie>")

Getting #statstiktok posts

Once per script, you need to run tk_init to initialize tiktokr:

tk_init()

Let's now get data on #statstiktok with tk_posts!

stats_tiktok <- tk_posts(scope = "hashtag", query = "statstiktok", n = 2000)
load('../data/stats_tiktok.rda')

Great! Now we have a dataset with metadata on tiktoks mentioning the hashtag 'statstiktok'!

glimpse(stats_tiktok)

First, the data needs to be cleaned. Many variables that TikTok returns are not relevant, so we focus on the most important ones and clean the variable names using janitor::clean_names().

stats_tk <- stats_tiktok %>% 
  select(id, createTime, 
         author_id:author_nickname, 
         author_signature, author_avatarLarger, 
         desc, music_id:authorStats_heart) %>% 
  janitor::clean_names() %>% 
  distinct(id, .keep_all = TRUE)

glimpse(stats_tk)
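
To see what janitor::clean_names() does to the column names, here is a small sketch on a toy data frame. The toy columns only imitate the naming style of the raw data; they are made up for illustration and are not actual TikTok output.

```r
library(janitor)

# Toy data frame (hypothetical values) with mixed camelCase / underscore names
raw <- data.frame(
  id = c("v1", "v2"),
  createTime = c(1600000000, 1600003600),
  authorStats_heart = c(10, 20)
)

# clean_names() converts all column names to snake_case
cleaned <- clean_names(raw)
names(cleaned)
#> [1] "id"                 "create_time"        "author_stats_heart"
```

This is why the later code refers to columns such as create_time and author_stats_heart rather than the original camelCase names.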

Now we have a cleaned sample of #statstiktok!

Stats about tiktokers

Let's first find out who the top posters are in the data. Chelsea Parlett-Pelleriti is the most prolific poster when it comes to #statstiktok. This makes sense, considering that she is one of the pioneers of #statstiktok.

stats_tk %>% 
  count(author_unique_id, sort = TRUE) %>% 
  filter(n >= 2) %>% 
  mutate(author_url = paste0("https://www.tiktok.com/@", author_unique_id)) %>% 
  knitr::kable() %>%
  kableExtra::kable_styling()

The metadata includes author data, such as the unique handle of the author (author_unique_id), the text of their bio (author_signature) and some author stats (like the follower count: author_stats_follower_count). In addition to classical social media metrics (number of accounts followed, posts, likes), TikTok also includes 'diggs'. Unfortunately, there is no documentation on how this metric is computed. The author_stats_heart variable shows how many likes a user has received in total.

tiktokers <- stats_tk %>% 
  select(contains("author")) %>% 
  add_count(author_unique_id, name = "vids_in_sample") %>% 
  distinct(author_id, .keep_all = TRUE)

glimpse(tiktokers)
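
The step from one row per post to one row per author can be sketched on toy data. The author IDs and handles below are hypothetical; the pipeline mirrors the one above.

```r
library(dplyr)

# Toy post-level data (hypothetical): two posts by one author, one by another
posts <- data.frame(
  id = c("v1", "v2", "v3"),
  author_id = c("a1", "a1", "b2"),
  author_unique_id = c("alice", "alice", "bob")
)

authors <- posts %>%
  select(contains("author")) %>%
  # add_count() keeps every row and appends the per-author post count
  add_count(author_unique_id, name = "vids_in_sample") %>%
  # distinct() then keeps the first row for each author
  distinct(author_id, .keep_all = TRUE)

authors
```

Because add_count() runs before distinct(), the count reflects all posts in the sample and not just the single row that is kept per author.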

Are any famous tiktokers using #statstiktok?

tiktokers %>% 
  select(author_unique_id, contains("count")) %>% 
  # strip the "author_stats_" prefix for more readable column names
  rename_with(~ stringr::str_remove_all(.x, "author_stats_")) %>%
  # keep the ten accounts with the most followers
  arrange(desc(follower_count)) %>%
  slice(1:10) %>%
  mutate(author_url = paste0("https://www.tiktok.com/@", author_unique_id)) %>% 
  knitr::kable()  %>%
  kableExtra::kable_styling()

Some of the tiktokers who used #statstiktok at least once seem to be pretty successful, with more than 13700 followers. Maybe their one post mentioning #statstiktok is responsible for their success (...but most likely not :P). You can also see the strangeness of the digg count: it seems to be uncorrelated with the other metrics.

Posts over time

Looking at the frequency of posts over time, we see a continuous increase in the number of tiktoks mentioning #statstiktok since April 2020 (coinciding with the start of lockdowns in many countries!).

The function from_unix converts the timestamp in create_time to an actual datetime.

stats_tk %>% 
  mutate(create_date = from_unix(create_time) %>% lubridate::floor_date("day")) %>% 
  count(create_date) %>% 
  mutate(cumsum_n = cumsum(n)) %>% 
  ggplot(aes(create_date, cumsum_n)) + 
  geom_line() +
  theme_minimal() +
  scale_x_datetime(date_labels = "%B %Y")  +
  labs(y = "Number of Posts", x = "")
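
For readers curious about the timestamp handling, here is a minimal sketch of the same idea in base R. It assumes that create_time holds seconds since 1970-01-01 UTC; it is not the implementation of from_unix, only an illustration of the conversion and of the daily rounding.

```r
# Two example Unix timestamps in seconds (assumed to be UTC)
ts <- c(1577836800, 1577923200 + 3600)

# Convert seconds since the epoch to a datetime
dt <- as.POSIXct(ts, origin = "1970-01-01", tz = "UTC")
format(dt, "%Y-%m-%d %H:%M:%S")
#> [1] "2020-01-01 00:00:00" "2020-01-02 01:00:00"

# Round down to the start of the day, as in the plot above
days <- lubridate::floor_date(dt, "day")
format(days, "%Y-%m-%d")
#> [1] "2020-01-01" "2020-01-02"
```

Flooring to the day means all posts from the same calendar day are counted together before the cumulative sum is taken.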

Check out the most played tiktoks:

stats_tk %>% 
  arrange(desc(stats_play_count)) %>% 
  select(id, author_unique_id, stats_play_count) %>% 
  slice(1:10) %>%
  mutate(video_url = paste0("https://www.tiktok.com/@", author_unique_id, "/video/", id)) %>% 
  knitr::kable()  %>%
  kableExtra::kable_styling()

Chelsea makes a lot of appearances again!

TikTok descriptions

Each tiktok is usually accompanied by a brief text description. In this description, users typically use hashtags to increase the visibility of their posts.

Unsurprisingly, stats- and academia-related hashtags are used quite often in combination with #statstiktok. We can now use these new hashtags to explore further stats tiktoks.

stats_tiktok %>%
  select(desc) %>%
  mutate(hashtags = stringr::str_extract_all(desc, "#\\w+")) %>%
  tidyr::unnest(hashtags) %>%
  mutate(hashtags = str_to_lower(hashtags)) %>% 
  count(hashtags, sort = TRUE) %>% 
  slice(1:10) %>%
  mutate(hashtag_url = paste0("https://www.tiktok.com/tag/", str_remove(hashtags, "#"))) %>% 
  knitr::kable()  %>%
  kableExtra::kable_styling()
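
The hashtag extraction above can be tried out on a few made-up descriptions. The toy strings are hypothetical; the pattern and the pipeline steps are the same as in the code above.

```r
library(dplyr)
library(stringr)
library(tidyr)

# Toy descriptions (hypothetical), one without any hashtag
toy <- data.frame(
  desc = c("Bayes rules #StatsTikTok #rstats",
           "no tags here",
           "p-values explained #statstiktok"),
  stringsAsFactors = FALSE
)

hashtag_counts <- toy %>%
  # "#\\w+" matches a '#' followed by letters, digits or underscores
  mutate(hashtags = str_extract_all(desc, "#\\w+")) %>%
  # one row per hashtag; descriptions without hashtags are dropped
  unnest(hashtags) %>%
  # lower-case so that #StatsTikTok and #statstiktok are counted together
  mutate(hashtags = str_to_lower(hashtags)) %>%
  count(hashtags, sort = TRUE)

hashtag_counts
```

Here #statstiktok is counted twice and #rstats once, while the description without hashtags contributes nothing.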

Expanding the data using hashtags

larger_stats_tiktok <- c("statsTikTok", "statistics", "rstats", "datascience") %>%
  purrr::map_dfr(~tk_posts("hashtag", .x, n = 2000)) %>%
  bind_rows(stats_tiktok) %>% # bind_rows the #statstiktok data
  distinct(id, .keep_all = TRUE)
load('../data/larger_stats_tiktok.rda')

Hashtags

Using the hashtags we discovered earlier, we can now get data for other hashtags (statsTikTok, statistics, rstats, datascience). We obtain metadata on 3474 tiktoks, which can be further analyzed or used for further expansion.

larger_stats_tk <- larger_stats_tiktok %>% 
  select(id, createTime, 
         author_id:author_nickname, 
         author_signature, author_avatarLarger, 
         desc, music_id:authorStats_heart) %>% 
  janitor::clean_names() %>% 
  distinct(id, .keep_all = TRUE)

glimpse(larger_stats_tk)

Before closing up, we take a look at the popularity of the considered hashtags. To do so, we filter for the queried hashtags and look at the distribution of plays for each one. We find that, in our small and not-at-all random sample, #statistics is associated with the most-viewed tiktoks, while #rstats is (for now) the least popular hashtag.

larger_stats_tk %>%
  select(desc, matches("^stats")) %>%
  mutate(hashtags = stringr::str_extract_all(desc, "#\\w+")) %>%
  tidyr::unnest(hashtags) %>%
  filter(hashtags %in% c("#statsTikTok", "#statstiktok", "#statistics", "#rstats", "#datascience")) %>%
  mutate(hashtags = forcats::fct_reorder(hashtags, stats_play_count)) %>%
  ggplot(aes(x = hashtags, y = stats_play_count)) + 
  geom_boxplot() + 
  ylim(c(0, 12e3)) +
  theme_minimal() +
  labs(x = "Hashtag", y = "Number of Plays")
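
One caveat when reading boxplots with a restricted y-axis: in ggplot2, ylim() drops observations outside the limits before the boxplot statistics are computed, whereas coord_cartesian() only zooms the view. The following sketch on made-up numbers shows the difference; whether it matters for the plot above depends on how many tiktoks exceed the limit.

```r
library(ggplot2)

# Toy data (hypothetical): four small values and one large outlier
df <- data.frame(g = "a", y = c(1, 2, 3, 4, 100))

# ylim() removes the value 100 before the median is computed (with a warning)
p_scale <- ggplot(df, aes(g, y)) + geom_boxplot() + ylim(c(0, 10))

# coord_cartesian() keeps all values and only restricts the visible range
p_coord <- ggplot(df, aes(g, y)) + geom_boxplot() +
  coord_cartesian(ylim = c(0, 10))

# 'middle' is the median drawn in each boxplot
layer_data(p_scale)$middle
#> [1] 2.5
layer_data(p_coord)$middle
#> [1] 3
```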

Check out the most common music titles used:

larger_stats_tk %>% 
  count(music_title, music_play_url, sort = TRUE) %>% 
  filter(!str_detect(music_title, "original")) %>% 
  slice(1:10) %>%
  knitr::kable()  %>%
  kableExtra::kable_styling()

Looks like classical music is quite popular with stats tiktok!



benjaminguinaudeau/tiktokr documentation built on Jan. 17, 2021, 8:52 a.m.