To create the data included in the package, we downloaded the Kaggle Anime dataset, and you will have to do the same: the raw files are not on GitHub, since GitHub's guidelines ask to avoid, where possible, repositories larger than 1 GB. Because we need to download both anime.csv and animelist.csv, and the latter alone is 1.9 GB, we decided not to put them on GitHub but simply to guide you through reproducing our work.
To create our anime data, we first load the required packages: readr, dplyr and tidyverse. We then load the data into R, skipping the columns we will not use.
```r
# import dataset
library(readr)
library(dplyr)
library(tidyverse)

# note: the file sits outside the project, so we pass the path directly
# rather than through here::here()
anime <- read_csv(
  "~/Downloads/anime.csv",
  col_types = cols(
    `Japanese name` = col_skip(),
    `Score-10` = col_skip(),
    `Score-9`  = col_skip(),
    `Score-8`  = col_skip(),
    `Score-7`  = col_skip(),
    `Score-6`  = col_skip(),
    `Score-5`  = col_skip(),
    `Score-4`  = col_skip(),
    `Score-3`  = col_skip(),
    `Score-2`  = col_skip(),
    `Score-1`  = col_skip()
  )
)
```
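Before going further, a quick sanity check on the import is cheap insurance. This is only a sketch of our own; the exact columns depend on the version of the Kaggle file you downloaded.

```r
# dimensions and column types of what we just imported
dim(anime)
glimpse(anime)

# the per-score columns should have been skipped
any(grepl("^Score-", names(anime)))  # expected: FALSE
```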
Once this is done, we rename the column English name to english_name so we avoid trouble while using it. We chose to use the English name as the main name when available, and to keep the original one when there is no English name. We then add "ID" in front of the anime IDs to avoid problems when creating matrices. Next we filter out Unknown values and the per-season entries, since some anime appear several times: once as the complete version and then once for each season, which we do not need. We finally arrange by Popularity, since it is a good indicator of how well an anime was liked at the time the table was created.
```r
# cleaning data: no NA nor Unknown, and order by ranking
anime <- anime %>%
  rename(english_name = `English name`) %>%
  mutate(Name = case_when(
    english_name == "Unknown" ~ Name,
    TRUE ~ english_name
  )) %>%
  rename(item_id = MAL_ID) %>%
  mutate(item_id = sprintf("ID%s", item_id)) %>%
  filter(Score != "Unknown") %>%
  filter(Ranked != "Unknown") %>%
  filter(Rating != "Unknown") %>%
  # the Duration > 10 filter is applied further below, once Duration
  # has been converted from text to minutes
  filter(Producers != "Unknown" | Studios != "Unknown" | Licensors != "Unknown") %>%
  filter(!grepl("Season", Name)) %>%
  filter(!is.na(Duration)) %>%
  arrange(Popularity)

# separate genders (kept for reference, no longer used)
# anime <- anime %>%
#   separate(Genders, c("gender1", "gender2", "gender3",
#                       "gender4", "gender5", "gender6"), sep = ",")
```
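As a quick check that the cleaning behaved as intended, the following sketch (ours, not part of the original pipeline) verifies a few invariants:

```r
# no per-season duplicates should remain
sum(grepl("Season", anime$Name))       # expected: 0

# no Unknown values should remain in the filtered columns
sum(anime$Score == "Unknown")          # expected: 0
sum(anime$Rating == "Unknown")         # expected: 0

# every ID should carry the "ID" prefix
all(startsWith(anime$item_id, "ID"))   # expected: TRUE
```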
We then created some categories for duration. We no longer really use them, but since they are present in the data provided, we show how they were created.
```r
# convert "2 hr." into "2 hr. 00 min" so every duration has a minute part
anime <- anime %>%
  mutate(Duration = ifelse(
    Duration == "2 hr.", "2 hr. 00 min",
    ifelse(Duration == "1 hr.", "1 hr. 00 min", Duration)
  ))

# convert the duration to minutes to get a numerical feature
library(stringr)
anime$Duration <- sapply(
  str_extract_all(anime$Duration, "\\d+"),
  function(x) {
    x1 <- as.numeric(x)
    if (length(x1) > 1) x1[1] * 60 + x1[2] else x1[1]
  }
)
anime$Duration <- as.integer(anime$Duration)

# keep only anime longer than 10 minutes; Duration is numeric only from
# this point on, so the filter has to happen here
anime <- anime %>% filter(!is.na(Duration), Duration > 10)

# mutate a new column to categorize duration
anime <- anime %>%
  mutate(Duration_C = ifelse(
    Duration < 30, "Less than 30min",
    ifelse(
      Duration >= 30 & Duration < 60, "Less than 1h",
      ifelse(Duration >= 60 & Duration < 120, "more than 1h", "more than 2h")
    )
  )) %>%
  relocate(Duration_C, .after = Duration)

anime$Duration_C <- factor(
  anime$Duration_C,
  levels = c("Less than 30min", "Less than 1h", "more than 1h", "more than 2h")
)
```
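To see what the conversion does, here is a small self-contained example on a few duration strings of the kind found in the Kaggle file (the sample values are ours, for illustration):

```r
library(stringr)

samples <- c("24 min. per ep.", "1 hr. 51 min.", "2 hr. 00 min")
sapply(str_extract_all(samples, "\\d+"), function(x) {
  x1 <- as.numeric(x)
  if (length(x1) > 1) x1[1] * 60 + x1[2] else x1[1]
})
# 24 111 120 : everything ends up expressed in minutes
```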
Finally, we save the data as a csv (which is not provided) and in a format that can easily be exported in a package.
```r
# save a csv copy and export the dataset in the package format
write.csv(anime, here::here("data/anime_tidy.csv"))
usethis::use_data(anime, overwrite = TRUE)
```
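Once use_data() has run, the cleaned table lives in data/anime.rda and can be loaded like any packaged dataset. A minimal sketch, where "yourpackage" is a placeholder for the actual package name:

```r
# after installing the package, the dataset is available by name
# ("yourpackage" is a placeholder, not the real package name)
data(anime, package = "yourpackage")
head(anime)
```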
First, we load the animelist.csv dataset, also from Kaggle.
```r
# import dataset
# this file was kept locally and not added to the package: it contains
# 107 million lines and is no longer used in the application after this step
anime_list <- read.csv("~/Downloads/animelist.csv")
```
Then we group by user ID and keep only the heaviest raters, in order to remove as many rows as possible. This reduces the table from about 100 million rows to 573'588 rows, which makes it usable.
```r
# getting the list of users with more than 4000 ratings
ratings_per_user <- anime_list %>%
  group_by(user_id) %>%
  tally() %>%
  filter(n > 4000)

# quick look at the distribution of ratings per kept user
ggplot(ratings_per_user) +
  geom_bar(aes(n))

users_kept <- ratings_per_user$user_id
```
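Before running the expensive filter on the full table, the per-user counts already give an idea of the final size. This check is our own addition; the sum is an upper bound on the final row count, since ratings equal to 0 are dropped in the next step:

```r
# how many users are kept, and how many rows they account for
length(users_kept)
sum(ratings_per_user$n)  # upper bound on the size of the filtered table
```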
Finally, we filter the table down to the users we keep and join it to the anime table to recover some information about each title. To avoid having too large a dataset, we select only the columns that interest us. As before, we save the result both as a csv and in the exportable package format.
```r
# using the list to create a table containing the ratings and the anime infos
ratings_kept <- anime_list %>%
  filter(user_id %in% users_kept & rating > 0) %>%
  rename(item_id = anime_id) %>%
  mutate(item_id = sprintf("ID%s", item_id))

anime_with_ratings <- left_join(anime, ratings_kept, by = "item_id") %>%
  select(item_id, user_id, Name, english_name, Episodes, Duration, rating)

write.csv(anime_with_ratings, here::here("inst/extdata/anime_with_ratings.csv"))
usethis::use_data(anime_with_ratings, overwrite = TRUE)
```
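A quick look at the joined table is a useful sanity check here (our own addition, not part of the original pipeline):

```r
# dimensions and a peek at the final table
dim(anime_with_ratings)
head(anime_with_ratings)

# distinct users and anime that made it through
n_distinct(anime_with_ratings$user_id)
n_distinct(anime_with_ratings$item_id)
```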
We also checked that no anime without ratings was kept one way or another, which does not seem to be the case here. The data is therefore usable for the rest of the analysis, which you will find in the vignette!
```r
# check for null ratings since we want to avoid them
anime_with_ratings_null <- anime_with_ratings %>%
  filter(is.na(rating))
anime_without_ratings_vc <- anime_with_ratings_null$item_id
```
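Had any unrated anime slipped through, a sketch like the following would drop them; in our case the vector above is empty, so this is a no-op:

```r
# defensively remove any anime that ended up without a rating
# (a no-op here, since the check above found none)
anime_with_ratings <- anime_with_ratings %>%
  filter(!item_id %in% anime_without_ratings_vc)
```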