In USCCANA/socnet: Web Scraping The Social Networks (SOCNET) Listserv

Most active users

First we load the package and the data

library(socnet)
library(magrittr)
data("subjects")

The following function creates a ranking and prints it as a nice table. We'll do this a couple of times so it makes sense to write it as a function:

rankfun <- function(x, colnames, maxn = 50) {
  x <- as.data.frame(table(x))
  x <- x[order(-x$Freq),]
  dimnames(x) <- list(1:nrow(x), colnames)
  knitr::kable(x[1:maxn,], row.names = TRUE)  
}

Now the actual ranking

# Getting the from column and removing weird characters
from <- subjects$from
from <- iconv(from, to="ASCII//TRANSLIT")

# Removing <[log in to unmask]> message
from <- gsub("[<].+", "", from)

# Creating the table
rankfun(from, colnames=c("User", "Count"))

Most active words of all time

This is a bit meaningless, but it is just a way to show how the package can be used

# Removing symbols
subject <- subjects$subject
subject <- stringr::str_replace_all(subject, "[:punct:]", "")

# Splitting words
subject <- stringr::str_split(subject, pattern = "\\s+")

# Removing stopwords
subject <- lapply(subject, function(x) x[!(x %in% tm::stopwords())])
subject <- lapply(subject, tolower)

rankfun(unlist(subject, TRUE), colnames = c("Word", "Count"))

A more refined example: Data management

We'll be using a bit of regular expressions here. The idea is that, instead of capturing plain words, we can actually try to retrieve concepts that are associated with, in this case, data managemnt, which is what motivated this R pacakge in the first place. Let's start by defining some concepts

regex <- "diffusion|contagion"

dat <- iconv(subjects$subject, to = "ASCII//TRANSLIT") %>% tolower

(ids <- stringr::str_which(dat, regex))
head(dat[ids])

Latest 50 emails

Downloading the data as listed in the subjects data frame.

dat <- subjects[1:50,"url"]
pages <- socnet_parse_subject(dat)

This will return a list of length 50, each one of this with the retrieved contents from SOCNET. We can analyze now their contents

# Extracting contents
dat <- lapply(pages, "[[", "contents")

# First three lines are about SOCNET
dat <- lapply(dat, "[", i=-c(1:4))

Now we put all together, remove stopwords and later look at the most popular words

dat <- dat %>%
  unlist(TRUE) %>%
  stringr::str_replace_all(pattern="https?[:graph:]+", replacement = "") %>%
  stringr::str_replace_all(pattern="[:punct:]+", replacement = "") %>%
  stringr::str_split("\\s+") %>%
  unlist(TRUE) %>%
  tolower %>%
  stringr::str_subset("^[a-z0-9]+$")
dat <- dat[!(dat %in% tm::stopwords())]

rankfun(dat, colnames = c("Word", "Count"))

USCCANA/socnet documentation built on Aug. 17, 2022, 8:42 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

USCCANA/socnet
Web Scraping The Social Networks (SOCNET) Listserv

In USCCANA/socnet: Web Scraping The Social Networks (SOCNET) Listserv

Most active users

Most active words of all time

A more refined example: Data management

Latest 50 emails

R Package Documentation

Browse R Packages

We want your feedback!

USCCANA/socnet Web Scraping The Social Networks (SOCNET) Listserv

In USCCANA/socnet: Web Scraping The Social Networks (SOCNET) Listserv

Most active users

Most active words of all time

A more refined example: Data management

Latest 50 emails

R Package Documentation

Browse R Packages

We want your feedback!

USCCANA/socnet
Web Scraping The Social Networks (SOCNET) Listserv