In this example, we'll train a logistic regression to classify tweets using only their natural language text. We'll only need about 800 tweets per user account.
To run this example, you will need the following packages.
install.packages(c("jsonlite", "dplyr", "ROCR"))
We'll use a collection of about 800 tweets each from Bill Gates and Kanye West and train a logistic regression to predict, given a tweet, which account it came from. To do that, we'll first load the tweets that ship with the basilica package.
library(jsonlite)
bill <- fromJSON(system.file("extdata/twitter/billgates.json", package="basilica"))
kanye <- fromJSON(system.file("extdata/twitter/kanyewest.json", package="basilica"))
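As a quick sanity check, you can count the rows of each data frame; each should hold roughly 800 tweets if the bundled files match the description above.
nrow(bill)  # expect roughly 800 tweets
nrow(kanye) # expect roughly 800 tweets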
Now that we've loaded the JSON files, we can embed the text of these tweets using Basilica.
library(basilica)
conn <- connect("05e19f1c-39de-ed9c-ae42-feab42f5f84d")
# Column 7 of each data frame holds the tweet text
embeddings <- rbind(embed_sentences(bill[, 7], conn = conn),
                    embed_sentences(kanye[, 7], conn = conn))
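Each row of embeddings now holds the embedding vector for one tweet. A quick dimension check confirms the shape (the number of columns depends on the Basilica model used):
dim(embeddings) # rows = total number of tweets, columns = embedding dimension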
Now that we have these embeddings, we'll want to run PCA and get the 100 features that explain the most variance. We'll also add a column to the matrix with the corresponding category each tweet belongs to.
# Principal component scores: one row per tweet, one column per component
pca <- prcomp(embeddings, center = TRUE, scale. = TRUE)
features <- pca$x[, 1:100]
# Label each tweet: 1 = Bill Gates, 0 = Kanye West
type <- c(rep(1, nrow(bill)), rep(0, nrow(kanye)))
features <- cbind(type, features)
# Shuffle the rows so the two accounts are interleaved
features <- data.frame(features[sample.int(nrow(features)), ])
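To see how much of the total variance those first 100 components actually capture, you can inspect the PCA summary (the exact figure depends on the embeddings):
summary(pca)$importance["Cumulative Proportion", 100] # variance explained by the first 100 PCs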
Finally, we can train our model. To do that, we'll first split the data into training and test sets.
library(dplyr)
train_data <- sample_frac(features, 0.8) # 80% of the rows for training
train_index <- as.numeric(rownames(train_data))
test_data <- features[-train_index, ]   # the remaining 20% for testing
model <- glm(type ~ ., data = train_data, family = "binomial")
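If you're curious which components carry the most signal, the usual glm tools apply; summary(model) lists the fitted coefficients and their significance.
summary(model)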
After training the model, we can verify how well it performs by taking a look at the confusion matrix.
pred <- predict(model, newdata = test_data, type = "response")
table(test_data$type, pred > 0.5) # rows: actual account, columns: predicted as Bill Gates (TRUE)
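From the same predictions you can also compute a simple accuracy figure on the held-out set:
mean((pred > 0.5) == test_data$type) # fraction of test tweets classified correctly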
library(ROCR)
ROCRpred <- prediction(pred, test_data$type)
ROCRperf <- performance(ROCRpred, 'tpr','fpr')
plot(ROCRperf, colorize = TRUE, text.adj = c(-0.2,1.7))
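ROCR can also report the area under this curve directly, which gives a single-number summary of the classifier:
auc <- performance(ROCRpred, measure = "auc")
auc@y.values[[1]] # area under the ROC curve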
You have now trained a logistic regression using only the natural language text of the tweets, with about 800 data points per category, achieving an R squared of about 0.80.