ARGOS - YouTube

This R package provides data collection, cleaning, and visualization for YouTube data, accessed from the YouTube API.

Install the Package

install.packages(c("tuber","plyr","gtools","ggplot2","dplyr","httr","networkD3","scales"))
token <- "xxxxxx"
devtools::install_github("maelezo/ytcol", auth_token = token)

Authentication

The first step is to get authorization credentials for a Google App account, such as Gmail or YouTube, at https://console.developers.google.com/projectselector/apis/credentials. You need to enable the YouTube APIs and the Freebase API. When creating credentials for your project, select Oauth Client ID and application type "Other". For more details, see https://developers.google.com/youtube/v3/getting-started.

Then you need to authenticate using the OAuth 2.0 credentials (client ID and client secret) with the YouTube API using the yt.oauth function. The function looks for a \code{.httr-oauth} file in the working directory. If that file is not in the working directory, it will try to authenticate with the client ID and secret by launching a browser to allow you to authorize the application. This part of the process will not work from an AWS instance - you must have the correct file in the workind directory on AWS. If you are working locally, you should select YES when asked if you want to save the httr-oauth file for later reference and click "Allow" on the browser to authorize the connection.

client <- "xxxxxx"
secret <- "yyyy"
yt.oauth(client, secret)

Outputs

Most of the functions provide an output in the form of a dataframe that the user can then save to a CSV file if desired.

Functions

The first thing to understand about YouTube is that it is a collection of videos and channels. A channel is equivalent to a user or account, and generally includes multiple videos. Each video has statistics including views, likes, dislikes, favorites, and comments. Each channel has a channel ID and some have vanity names that replace the channel ID in the URL for that channel. Every video has a video ID.

yt.AllChannelVidoes()

This function is to collect a list of all the videos from a channel, including the stats for each video. This function allows you to set a date range for the collection using the "published_before" and "published_after" parameters. Currently this function will only accept the channel ID, not channel vanity names. This example collects all the videos from The Late Show with Steven Colbert channel during the first week of March 2017.

colbert <- yt.AllChannelVideos(channel_id = "UCMtFAi84ehTSYSE9XoHefig",
                               published_after = "2017-03-01T00:00:00Z",
                               published_before = "2017-03-07T00:00:00Z")

The results include 56 videos and the following information about each video: video ID, date-time posted, channel ID, title, description, channel title, date collected, view count, like count, dislike count, favorite count, comment count, tags, and language.

yt.Search()

This function collect a list of videos related to a search term, regardless of channel, and can be limited by a date range. The user can also set the number of results that is returned, from 1 to 50, with the default being 50. The result is a dataframe with the following information about each video: video ID, date-time posted, channel ID, title, description, and channel title. The output is an dataframe in the RStudio environment. This example collects videos related to "golf" in the first week of March 2017.

golf <- yt.Search(term="golf", 
                  published_after = "2017-03-01T00:00:00Z",
                  published_before = "2017-03-07T00:00:00Z")

yt.SingleVideoInfo()

This function will provide statistics for a video. The results include video ID, view count, like count, dislike count, favorite count, comment count, date collected, date-time posted, channel ID, title, description, tags, and category ID. This example collects information on a video featuring Bubba the Bichon Frise. The output is an dataframe in the RStudio environment.

bubba <- yt.SingleVideoInfo(video_id = "tgchTz8XjrI")

yt.Related()

This function collects a list of videos that are related to another video. The output is an dataframe in the RStudio environment.

bubba_rel <- yt.Related(video_id = "tgchTz8XjrI")

yt.GetChannelID()

This function converts the vanity name of a channel to the actual channel ID, which is needed to collect data from a channel. If all you have is a vanity name of a channel, run this function first to get the channel ID, and then use that as the required parameter for the collection functions. The output is an object in the RStudio environment.

channel_id <- yt.GetChannelID(term = "TheBushCenter")

yt.VideoComments()

This function collects all the comments on a video on YouTube, including replies to comments. If a comment is a reply, information about the parent comment is provided including parent comment ID, parent author display name, and parent author channel ID. The function requires a video ID. The output is a dataframe in the RStudio environment.

bubba_comments <- yt.VideoComments(video_id = "tgchTz8XjrI")

yt.ChannelComments()

This function collects all the comments, including replies, on all the videos on a channel on YouTube. This function allows you to set a date range for the collection using the "published_before" and "published_after" parameters. Currently this function will only accept the channel ID, not channel vanity names. The output is an dataframe in the RStudio environment.

colbert_channel_comments <- yt.ChannelComments(channel_id = "UCMtFAi84ehTSYSE9XoHefig",
                                                    published_after = "2017-03-01T00:00:00Z",
                                                    published_before = "2017-03-02T00:00:00Z")

yt.RelatedVideoComments()

This function collects all the comments, including replies, on a video and the videos that result from the yt.Related() function. The output is an dataframe in the RStudio environment.

bubba_rel_comments <- yt.RelatedVideoComments(video_id = "tgchTz8XjrI")

yt.Network()

This function generates a network diagram using a dataframe of that includes unique comment IDs, comment authors, and video IDs. The dataframe must have columns named "comment_ID", "author_display_name", and "video_ID". The dataframes created with the yt.ChannelComments() and yt.RelatedVideoComments() functions work well in this function. The nodes are videos and comment authors. The links represent the fact that an author has commented at least once on the video. The minVideos parameter allows the user to reduce the size of the network by filtering for authors who commented on more than one video by setting that parameter to a number greater than 1. The output is a network diagram in the RStudio environment.

yt.Network(relatedComments = bubba_rel_comments, minVideos = 7)

yt.CommentsOverTime()

This function generates two charts to visualize the distribution of comments on a video over time using the dataframe produced by the yt.VideoComments() function. The videoComments parameter identifies the dataframe to use for the chart. The note parameter allows the user to make an annotation in the upper left corner of each chart. The breakBy parameter sets the breaks in the date sequence on the x-axis and can take the values day, week, month, quarter, or year. The first chart, called "Comments Over Time", shows the number of comments per day, and the second chart, called "Cumulative Comments Over Time", shows the sum of the number of comments on a video as time progresses.

yt.CommentsOverTime(videoComments = bubba_comments, note = "video ID: tgchTz8XjrI", breakBy = "quarter")

yt.Wordcloud()

This function generates a wordcloud from comments collected using one of the other functions in ytcol. The input is a dataframe that must have a column named "text_original" that contains that comments. The comments parameter is the dataframe in the environment that contains the comments. The max-words paramenter sets the number of words to display in the wordcloud. The remove parameter allows the users to eliminate some words for the wordcloud. The stopwords parameter also eliminates common words, such as me or you, based on the language of the comments. The default language is English.

yt.Wordcloud(comments = bubba_rel_comments, max_words = 50, remove = c('hes','thats','can','cant'))



Vintonm49/ytcol documentation built on May 27, 2019, 7:42 a.m.