A user-friendly factor-like interface for converting strings of text into numeric vectors and rectangular data structures.
You can install the development version from GitHub with:
    ## install {remotes} if not already installed
    if (!requireNamespace("remotes", quietly = TRUE)) {
      install.packages("remotes")
    }

    ## install wactor from GitHub
    remotes::install_github("mkearney/wactor")
Here's some basic text (e.g., natural language) data:
    ## load wactor package
    library(wactor)

    ## text data (sentences)
    x <- c(
      "This test is a test",
      "This one will be a test",
      "This this was a test",
      "And this is the fourth test",
      "Fifth: the test!",
      "And the sixth test",
      "This is the seventh test",
      "This test is going to be a test",
      "This one will have been a test",
      "This this has been a test"
    )

    ## for demonstration purposes, store as a data frame as well
    data <- tibble::tibble(
      text = x,
      value = rnorm(length(x)),
      z = c(rep(TRUE, 7), rep(FALSE, 3))
    )
split_test_train()
A convenience function for splitting an input object into test and train data frames. This is often useful for splitting a single data frame:
    ## split into test/train data sets
    split_test_train(data)
By default, split_test_train() returns 80% of the input data in the train data set and 20% of the input data in the test data set. The proportion of data placed in the returned training data can be adjusted via the .p argument:
    ## split into test/train data sets, with 70% of data in training set
    split_test_train(data, .p = 0.70)
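To make the role of the proportion concrete, here is a minimal base R sketch of a random train/test split. The helper `split_sketch()` and the `toy` data frame are hypothetical names used only for illustration; this is not wactor's implementation, and split_test_train() may round or sample differently.

```r
## hypothetical helper (not part of wactor): random train/test split by row
split_sketch <- function(df, p = 0.80) {
  n <- nrow(df)
  ## number of rows to place in the training set
  n_train <- round(n * p)
  ## sample training row indices without replacement
  train_idx <- sample.int(n, n_train)
  ## every row that was not sampled goes to the test set
  test_idx <- setdiff(seq_len(n), train_idx)
  list(
    train = df[train_idx, , drop = FALSE],
    test = df[test_idx, , drop = FALSE]
  )
}

set.seed(1)
toy <- data.frame(id = 1:10, value = rnorm(10))
s <- split_sketch(toy, p = 0.70)

## 7 rows in train, 3 rows in test
nrow(s$train)
nrow(s$test)
```

The two pieces never share a row, because the test indices are computed as the complement of the sampled training indices.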
When predicting categorical variables, it's often desirable to ensure the training data set has an equal number of observations for each level of the response variable. This can be achieved by passing the column name of the categorical response variable, unquoted, using tidy evaluation. Evenly balanced observations then take priority over the specified proportion of training data:
    ## ensure evenly balanced groups in `train` data set
    split_test_train(data, .p = 0.70, z)
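One way such balancing could work is to draw the same number of rows from every level and cap that number at the size of the smallest level. The sketch below does this in base R. The helper `balanced_split_sketch()`, the `toy` data frame, and the per-level size rule are all assumptions made for illustration; wactor's actual rule is not documented here and may differ.

```r
## hypothetical helper (not part of wactor): equal-sized groups in the training set
balanced_split_sketch <- function(df, group, p = 0.80) {
  ## row indices for each level of the grouping column
  idx_by_level <- split(seq_len(nrow(df)), df[[group]])
  ## assumed rule: aim for an equal share of the p * n training rows per level,
  ## but never more than the smallest level can supply
  n_per <- min(floor(nrow(df) * p / length(idx_by_level)), lengths(idx_by_level))
  ## draw the same number of rows from every level
  train_idx <- unlist(
    lapply(idx_by_level, function(i) i[sample.int(length(i), n_per)]),
    use.names = FALSE
  )
  test_idx <- setdiff(seq_len(nrow(df)), train_idx)
  list(
    train = df[train_idx, , drop = FALSE],
    test = df[test_idx, , drop = FALSE]
  )
}

set.seed(1)
toy <- data.frame(id = 1:10, z = c(rep(TRUE, 7), rep(FALSE, 3)))
s <- balanced_split_sketch(toy, "z", p = 0.70)

## three rows per level in train, so 6 training rows instead of the requested 7
table(s$train$z)
nrow(s$test)
```

With 7 `TRUE` and 3 `FALSE` rows, the balanced training set holds 6 rows rather than 7, which illustrates how balance can take priority over the requested proportion.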
split_test_train() doesn't only work on data frames. It's also possible to split atomic vectors (i.e., character, numeric, logical):
    ## OR split character vector into test/train data sets
    (d <- split_test_train(x))
wactor()
Use wactor() to convert a character vector into a wactor object. The code below uses the text data d, which was split into test and train sets above.
    ## create wactor
    w <- wactor(d$train$x)
dtm()
Get the document-term frequency matrix:
    ## document-term frequency matrix
    dtm(w)

    ## same thing as dtm
    predict(w)
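For readers new to the idea, a document-term matrix has one row per document and one column per term, with each cell holding the number of times the term occurs in that document. The base R sketch below builds one by hand. The helper `dtm_sketch()` and its simple lowercase-and-split tokenizer are assumptions for illustration only; wactor's own tokenization and output format may differ.

```r
## hypothetical helper (not part of wactor): count terms per document
dtm_sketch <- function(docs) {
  ## lowercase, then split on anything that is not a letter or digit
  tokens <- strsplit(tolower(docs), "[^a-z0-9]+")
  ## drop empty strings produced by leading separators
  tokens <- lapply(tokens, function(tk) tk[nzchar(tk)])
  ## the vocabulary is every distinct term seen in the documents
  vocab <- sort(unique(unlist(tokens)))
  ## count each vocabulary term within each document
  counts <- lapply(tokens, function(tk) as.integer(table(factor(tk, levels = vocab))))
  matrix(
    unlist(counts),
    nrow = length(docs),
    byrow = TRUE,
    dimnames = list(NULL, vocab)
  )
}

docs <- c("This test is a test", "And the sixth test")
m <- dtm_sketch(docs)

## "test" occurs twice in the first document and once in the second
m[, "test"]
```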
tfidf()
Get the term frequency–inverse document frequency (tf-idf) matrix:
    ## create tf-idf matrix
    tfidf(w)
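As a rough illustration of what a tf-idf weighting does, the sketch below applies one common textbook variant to a small count matrix: term frequency is the count divided by document length, and inverse document frequency is `log(N / df)`, where `N` is the number of documents and `df` is the number of documents containing the term. The helper `tfidf_sketch()` is hypothetical, and wactor's tfidf() may use a different weighting or normalization.

```r
## hypothetical helper (not part of wactor): one common tf-idf variant
tfidf_sketch <- function(m) {
  ## term frequency: counts scaled by document length (guard against empty documents)
  tf <- m / pmax(rowSums(m), 1)
  ## inverse document frequency: log(N / number of documents containing the term)
  idf <- log(nrow(m) / pmax(colSums(m > 0), 1))
  ## multiply every column by its idf weight
  sweep(tf, 2, idf, `*`)
}

## toy count matrix: 2 documents, 3 terms
m <- matrix(
  c(2, 1, 0,
    1, 0, 1),
  nrow = 2,
  byrow = TRUE,
  dimnames = list(NULL, c("test", "this", "one"))
)
weights <- tfidf_sketch(m)

## "test" appears in every document, so its weight is zero everywhere
weights
```

Terms that occur in every document get an idf of `log(1) = 0`, which is why the ubiquitous word "test" carries no weight under this variant.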
A wactor can also be applied to new data:
    ## document term frequency of new data
    dtm(w, d$test$x)

    ## same thing as dtm
    predict(w, d$test$x)

    ## term frequency–inverse document frequency of new data
    tfidf(w, d$test$x)
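The point of applying a fitted object to new data is that the columns stay fixed, so matrices built from training and test text line up. The sketch below shows that idea in base R: the vocabulary is learned from the training documents only, and words that appear only in the new documents are dropped. The helper `count_with_vocab()` and the example sentences are hypothetical, and the sketch assumes, rather than confirms, that wactor handles unseen words this way.

```r
## hypothetical helper (not part of wactor): count terms against a fixed vocabulary
count_with_vocab <- function(docs, vocab) {
  tokens <- strsplit(tolower(docs), "[^a-z0-9]+")
  ## terms outside the vocabulary become NA in the factor and are not counted
  counts <- lapply(tokens, function(tk) as.integer(table(factor(tk, levels = vocab))))
  matrix(
    unlist(counts),
    nrow = length(docs),
    byrow = TRUE,
    dimnames = list(NULL, vocab)
  )
}

train_docs <- c("this is a test", "and the sixth test")
new_docs <- c("this test is brand new")

## the vocabulary is learned once, from the training documents only
vocab <- sort(unique(unlist(strsplit(train_docs, " "))))

## new documents are mapped onto the same columns; "brand" and "new" are dropped
new_counts <- count_with_vocab(new_docs, vocab)
new_counts
```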
xgb_mat()
The wactor package also makes it easy to work with the {xgboost} package:
    ## convert tfidf matrix into xgb.DMatrix
    xgb_mat(tfidf(w, d$test$x))
The xgb_mat() function also allows users to specify a response/label/outcome vector, e.g.:
    ## include a response variable
    xgb_mat(tfidf(w, d$train$x), y = c(rep(0, 4), rep(1, 4)))
To return the data split into test and train sets, specify a value between 0 and 1 to set the proportion of observations that should appear in the training data set:
    ## split into test/train
    xgb_data <- xgb_mat(
      tfidf(w, d$train$x),
      y = c(rep(0, 4), rep(1, 4)),
      split = 0.8
    )
The object returned by xgb_mat() can then be passed directly to {xgboost} functions for fast machine learning:
    ## specify hyper params
    params <- list(
      max_depth = 2,
      eta = 0.25,
      objective = "binary:logistic"
    )

    ## init training
    xgboost::xgb.train(
      params,
      xgb_data$train,
      nrounds = 4,
      watchlist = xgb_data
    )