Datasets creation

To build the datasets shipped with this package, we downloaded the source files from the Kaggle Anime dataset. You will have to download them too: they are not in the GitHub repository, because GitHub's guidelines recommend keeping repositories under 1 GB where possible. We need both anime.csv and animelist.csv, and the second one alone is 1.9 GB, so instead of hosting the files we guide you through reproducing our work.

Anime

To create our anime data, we first load the required packages: readr, dplyr and tidyverse. We then read the data into R, skipping the columns we will not use.

#import dataset

library(readr)
library(dplyr)
library(tidyverse)
anime <- read_csv(
  # the download sits outside the project, so the path is given relative to the home directory
  path.expand("~/Downloads/anime.csv"),
  col_types = cols(
    `Japanese name` = col_skip(),
    `Score-10` = col_skip(),
    `Score-9` = col_skip(),
    `Score-8` = col_skip(),
    `Score-7` = col_skip(),
    `Score-6` = col_skip(),
    `Score-5` = col_skip(),
    `Score-4` = col_skip(),
    `Score-3` = col_skip(),
    `Score-2` = col_skip(),
    `Score-1` = col_skip()
  )
)
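The column-skipping step can be tried without the real download. The sketch below is only an illustration: the three-column CSV text and its values are made up, and passing literal data through `I()` assumes readr 2.0 or later.

```r
library(readr)

# Made-up extract standing in for anime.csv (values are not real data)
toy_csv <- I("MAL_ID,Name,Score-10\n1,Title A,100\n2,Title B,200\n")

toy <- read_csv(
  toy_csv,
  col_types = cols(
    MAL_ID = col_double(),
    Name = col_character(),
    `Score-10` = col_skip()  # dropped while parsing, so it never reaches the tibble
  )
)

names(toy)
```

Skipping columns at read time keeps the unused score counts out of memory, instead of loading them and removing them afterwards.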

Once this is done, we rename the column English name to avoid trouble when using it. We use the English name as the display name when it is available and keep the original name when there is no English name. We then add "ID" in front of the anime IDs to avoid problems when creating matrices. Next we filter out rows with Unknown values, as well as the entries whose name contains "Season": some anime appear several times, once for the complete version and once per season, and we do not need the per-season entries. Finally we arrange by popularity, since it was a good indicator of how well an anime was liked at the time the table was created.

#cleaning data: drop Unknown values and per-season entries, then order by popularity
anime <- anime %>%
  rename(english_name = `English name`) %>%
  # use the English name when available, otherwise keep the original name
  mutate(Name = case_when(english_name == 'Unknown' ~ Name,
                          TRUE ~ english_name)) %>%
  rename(item_id = MAL_ID) %>%
  # prefix the ids with "ID" (avoids problems when building matrices)
  mutate(item_id = sprintf("ID%s", item_id)) %>%
  filter(Score != 'Unknown') %>%
  filter(Ranked != 'Unknown') %>%
  filter(Rating != 'Unknown') %>%
  # note: Duration is still text at this point, so this comparison is lexicographic
  filter(Duration > 10) %>%
  filter(Producers != 'Unknown' |
           Studios != 'Unknown' | Licensors != 'Unknown') %>%
  filter(!grepl("Season", Name)) %>%
  filter(!is.na(Duration)) %>%
  arrange(Popularity)

# separate genders
#anime <- anime %>% separate(Genders, c("gender1", "gender2", "gender3", "gender4","gender5", "gender6"), sep=",")
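To see the name fallback, the "ID" prefix and the season filter in isolation, here is a minimal sketch on three made-up rows. The titles are placeholders, and `if_else()` is used in place of `case_when()` purely for brevity.

```r
library(dplyr)

# Made-up rows with the columns used by the cleaning step
toy <- tibble(
  MAL_ID = c(1, 2, 3),
  Name = c("Original A", "Original B", "Original C Season 2"),
  english_name = c("English A", "Unknown", "Unknown")
)

toy_clean <- toy %>%
  # English name when known, original name otherwise
  mutate(Name = if_else(english_name == "Unknown", Name, english_name)) %>%
  rename(item_id = MAL_ID) %>%
  mutate(item_id = sprintf("ID%s", item_id)) %>%
  # drop per-season entries
  filter(!grepl("Season", Name))

toy_clean$Name     # "English A"  "Original B"
toy_clean$item_id  # "ID1" "ID2"
```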

We then created duration categories. We no longer really use them, but since they are present in the data provided, we show how they were created.

#Convert 2hr into 2hr.00min
anime <- anime %>%
  mutate(
    Duration = ifelse(Duration == "2 hr.", "2 hr. 00 min",
                      ifelse(Duration == "1 hr.", "1 hr. 00 min",
                             paste0(Duration))
  ))

#Convert time in min to have numerical feature
library(stringr)
anime$Duration  <-
  sapply(str_extract_all(anime$Duration , "\\d+"), function(x) {
    x1 <- as.numeric(x)
    if (length(x1) > 1)
      x1[1] * 60 + x1[2]
    else
      x1[1]
  })

anime$Duration <- as.integer(anime$Duration)
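The conversion to minutes can be checked on a few strings. This is a sketch under assumptions: the duration strings are made up to resemble the formats handled above, and `to_minutes()` is a hypothetical helper name, not a function of the package.

```r
library(stringr)

# Made-up duration strings
durations <- c("24 min. per ep.", "1 hr. 30 min.", "2 hr. 00 min", "Unknown")

# Hypothetical helper: two numbers are read as hours and minutes, one number as minutes
to_minutes <- function(x) {
  parts <- as.numeric(str_extract_all(x, "\\d+")[[1]])
  if (length(parts) > 1) parts[1] * 60 + parts[2] else parts[1]
}

minutes <- vapply(durations, to_minutes, numeric(1), USE.NAMES = FALSE)
minutes  # 24 90 120 NA

# On the raw strings a comparison with a number is textual, on parsed values it is numeric
"2 min." > 10              # TRUE: 10 is coerced to "10" and "2" sorts after "1"
to_minutes("2 min.") > 10  # FALSE
```

Strings without any digits, such as "Unknown", come out as NA, which is why rows with a missing Duration are filtered out afterwards.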

#mutate a new column to categorize duration
anime <-
  anime %>% mutate(Duration_C = ifelse(
    Duration < 30,
    "Less than 30min",
    ifelse(
      Duration >= 30 &
        Duration < 60,
      "Less than 1h",
      ifelse(Duration >= 60 &
               Duration < 120, "more than 1h", "more than 2h")
    )
  )) %>%
  relocate(Duration_C, .after = Duration)%>%
  filter(!is.na(Duration))

anime$Duration_C <-
  factor(
    anime$Duration_C,
    levels = c("Less than 30min",
               "Less than 1h",
               "more than 1h",
               "more than 2h")
  )
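As a possible alternative to the nested `ifelse()` calls, base R's `cut()` can produce the same ordered categories in one step. The sketch below uses made-up minute values and is not the code used to build the package data.

```r
# Made-up durations in minutes, chosen to sit on both sides of each boundary
duration <- c(5, 29, 30, 59, 60, 119, 120, 150)

duration_c <- cut(
  duration,
  breaks = c(-Inf, 30, 60, 120, Inf),
  labels = c("Less than 30min", "Less than 1h", "more than 1h", "more than 2h"),
  right = FALSE  # intervals are [lower, upper), matching the < and >= tests above
)

table(duration_c)
```

`cut()` returns a factor whose levels follow the order of `labels`, so the separate `factor()` call is not needed with this approach.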

Finally, we save the data as a CSV file, which is not provided, and in a format that can easily be exported with the package.

write.csv(anime, here::here("data/anime_tidy.csv"))

usethis::use_data(anime, overwrite = TRUE)

Anime with ratings

First, we load the animelist.csv dataset, also from Kaggle.

#import dataset

#This file was kept locally and not added to the package: it contains 107 million lines and is not used in the application after this step
anime_list <- read.csv(path.expand("~/Downloads/animelist.csv"))

We then group by user ID and keep only the users with the most ratings, in order to drop as many lines as possible. This reduces the data from more than 100 million rows to 573'588 rows, which makes it usable.

#getting list of people with more than 4000 ratings
ratings_per_user <- anime_list %>%
  group_by(user_id) %>% 
  tally()%>%
  filter(n > 4000)

ggplot(ratings_per_user)+
  geom_bar(aes(n))


users_kept <- ratings_per_user$user_id
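The user selection can be illustrated on a toy table. Everything here is made up: the ratings are invented and the threshold of 2 only stands in for the 4000 used on the real data.

```r
library(dplyr)

# Made-up ratings table with the same column names as animelist.csv
toy_list <- tibble(
  user_id = c(1, 1, 1, 2, 3, 3, 3, 3),
  anime_id = c(10, 11, 12, 10, 10, 11, 12, 13),
  rating = c(8, 0, 7, 9, 6, 5, 0, 10)
)

toy_users <- toy_list %>%
  count(user_id) %>%  # same result as group_by(user_id) %>% tally()
  filter(n > 2) %>%   # keep only the heaviest raters
  pull(user_id)

toy_users  # 1 3
```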

Finally, we keep only the ratings from the users we selected and join them to the anime table to recover some information about each anime. To keep the data from growing too large, we select only the columns that interest us. As before, we save the data both as CSV and in the exportable package format.

#using the list to create a table containing the ratings and the anime infos
ratings_kept <- anime_list %>%
  filter(user_id %in% users_kept & rating > 0)%>%
  rename(item_id = anime_id)%>%
  mutate(item_id = sprintf("ID%s", item_id))

anime_with_ratings <- left_join(anime, ratings_kept, by= "item_id") %>%
  select(item_id, user_id, Name, english_name, Episodes, Duration, rating)

write.csv(anime_with_ratings, here::here("inst/extdata/anime_with_ratings.csv"))

usethis::use_data(anime_with_ratings, overwrite = TRUE)

We also checked whether any anime without ratings were kept in one way or another, which does not seem to be the case here. The data is therefore usable for the rest of what you will find in the vignette!

#check for null ratings since we want to avoid them
anime_with_ratings_null <- anime_with_ratings %>%
  filter(is.na(rating))

anime_without_ratings_vc <- anime_with_ratings_null$item_id
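The reason this check is needed is that a left join keeps every anime, including those that none of the selected users rated. The sketch below shows this on made-up tables, with `anti_join()` as one possible shortcut for the same check. It is an illustration, not the code used for the package.

```r
library(dplyr)

# Made-up tables: ID2 has no rating
toy_anime <- tibble(item_id = c("ID1", "ID2", "ID3"), Name = c("A", "B", "C"))
toy_ratings <- tibble(
  item_id = c("ID1", "ID1", "ID3"),
  user_id = c(7, 8, 7),
  rating = c(9, 6, 10)
)

joined <- left_join(toy_anime, toy_ratings, by = "item_id")

# Anime without ratings appear once, with NA in the rating columns
unrated <- joined %>%
  filter(is.na(rating)) %>%
  pull(item_id)

# anti_join() lists the same anime without building the joined table first
unrated_alt <- anti_join(toy_anime, toy_ratings, by = "item_id")$item_id

unrated      # "ID2"
unrated_alt  # "ID2"
```

If unrated anime did show up in the real data, filtering on `!is.na(rating)` or switching to `inner_join()` would be two ways to drop them.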


ptds2021/project--G5 documentation built on Dec. 22, 2021, 9:59 a.m.