First we load the package and the data
library(socnet) library(magrittr) data("subjects")
The following function creates a ranking and prints it as a nice table. We'll do this a couple of times so it makes sense to write it as a function:
rankfun <- function(x, colnames, maxn = 50) { x <- as.data.frame(table(x)) x <- x[order(-x$Freq),] dimnames(x) <- list(1:nrow(x), colnames) knitr::kable(x[1:maxn,], row.names = TRUE) }
Now the actual ranking
# Getting the from column and removing weird characters from <- subjects$from from <- iconv(from, to="ASCII//TRANSLIT") # Removing <[log in to unmask]> message from <- gsub("[<].+", "", from) # Creating the table rankfun(from, colnames=c("User", "Count"))
This is a bit meaningless, but it is just a way to show how the package can be used
# Removing symbols subject <- subjects$subject subject <- stringr::str_replace_all(subject, "[:punct:]", "") # Splitting words subject <- stringr::str_split(subject, pattern = "\\s+") # Removing stopwords subject <- lapply(subject, function(x) x[!(x %in% tm::stopwords())]) subject <- lapply(subject, tolower) rankfun(unlist(subject, TRUE), colnames = c("Word", "Count"))
We'll be using a bit of regular expressions here. The idea is that, instead of capturing plain words, we can actually try to retrieve concepts that are associated with, in this case, data managemnt, which is what motivated this R pacakge in the first place. Let's start by defining some concepts
regex <- "diffusion|contagion" dat <- iconv(subjects$subject, to = "ASCII//TRANSLIT") %>% tolower (ids <- stringr::str_which(dat, regex)) head(dat[ids])
Downloading the data as listed in the subjects
data frame.
dat <- subjects[1:50,"url"] pages <- socnet_parse_subject(dat)
This will return a list of length 50, each one of this with the retrieved contents from SOCNET. We can analyze now their contents
# Extracting contents dat <- lapply(pages, "[[", "contents") # First three lines are about SOCNET dat <- lapply(dat, "[", i=-c(1:4))
Now we put all together, remove stopwords and later look at the most popular words
dat <- dat %>% unlist(TRUE) %>% stringr::str_replace_all(pattern="https?[:graph:]+", replacement = "") %>% stringr::str_replace_all(pattern="[:punct:]+", replacement = "") %>% stringr::str_split("\\s+") %>% unlist(TRUE) %>% tolower %>% stringr::str_subset("^[a-z0-9]+$") dat <- dat[!(dat %in% tm::stopwords())]
rankfun(dat, colnames = c("Word", "Count"))
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.