In this example, we'll train a logistic regression to classify tweets using only their natural language text. We'll only need about 800 tweets per user account.
To run this example, you will need the following packages.
install.packages(c("jsonlite", "dplyr", "ROCR"))
We'll use a collection of about 800 tweets each from Bill Gates and Kanye West and train a logistic regression to predict, given a tweet, which account it came from. To do that, we'll first load the tweets that ship with the basilica package.
library(jsonlite)
bill <- fromJSON(system.file("extdata/twitter/billgates.json", package="basilica"))
kanye <- fromJSON(system.file("extdata/twitter/kanyewest.json", package="basilica"))
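As a quick sanity check, you can count the rows of each data frame; each should hold roughly 800 tweets if the bundled files match the description above.
nrow(bill)  # expect roughly 800 tweets
nrow(kanye) # expect roughly 800 tweets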
Now that we've loaded the JSON files, we can embed the text of these tweets using Basilica.
library(basilica)
conn <- connect("05e19f1c-39de-ed9c-ae42-feab42f5f84d")
# Column 7 of each data frame holds the tweet text
embeddings <- rbind(embed_sentences(bill[, 7], conn = conn),
                    embed_sentences(kanye[, 7], conn = conn))
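Each row of embeddings now holds the embedding vector for one tweet. A quick dimension check confirms the shape (the number of columns depends on the Basilica model used):
dim(embeddings) # rows = total number of tweets, columns = embedding dimension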
Now that we have these embeddings, we'll want to run PCA and get the 100 features that explain the most variance. We'll also add a column to the matrix with the corresponding category each tweet belongs to.
# Principal component scores: one row per tweet, one column per component
pca <- prcomp(embeddings, center = TRUE, scale. = TRUE)
features <- pca$x[, 1:100]
# Label each tweet: 1 = Bill Gates, 0 = Kanye West
type <- c(rep(1, nrow(bill)), rep(0, nrow(kanye)))
features <- cbind(type, features)
# Shuffle the rows so the two accounts are interleaved
features <- data.frame(features[sample.int(nrow(features)), ])
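To see how much of the total variance those first 100 components actually capture, you can inspect the PCA summary (the exact figure depends on the embeddings):
summary(pca)$importance["Cumulative Proportion", 100] # variance explained by the first 100 PCs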
Finally, we can train our model. To do that, we'll first split the data into training and test sets.
library(dplyr)
train_data <- sample_frac(features, 0.8) # 80% of the rows for training
train_index <- as.numeric(rownames(train_data))
test_data <- features[-train_index, ]   # the remaining 20% for testing
model <- glm(type ~ ., data = train_data, family = "binomial")
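If you're curious which components carry the most signal, the usual glm tools apply; summary(model) lists the fitted coefficients and their significance.
summary(model)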
After training the model, we can verify how well it performs by taking a look at the confusion matrix.
pred <- predict(model, newdata = test_data, type = "response")
table(test_data$type, pred > 0.5) # rows: actual account, columns: predicted as Bill Gates (TRUE)
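From the same predictions you can also compute a simple accuracy figure on the held-out set:
mean((pred > 0.5) == test_data$type) # fraction of test tweets classified correctly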
library(ROCR)
ROCRpred <- prediction(pred, test_data$type)
ROCRperf <- performance(ROCRpred, 'tpr','fpr')
plot(ROCRperf, colorize = TRUE, text.adj = c(-0.2,1.7))
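ROCR can also report the area under this curve directly, which gives a single-number summary of the classifier:
auc <- performance(ROCRpred, measure = "auc")
auc@y.values[[1]] # area under the ROC curve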
You have now trained a logistic regression using only the natural language text of the tweets, with about 800 data points per category, achieving an R squared of about 0.80.