dtm_align: Reorder a Document-Term-Matrix alongside a vector or...
In udpipe: Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing with the 'UDPipe' 'NLP' Toolkit

dtm_align

R Documentation

Reorder a Document-Term-Matrix alongside a vector or data.frame

Description

This utility function aligns a Document-Term-Matrix with information in a data.frame or a vector to predict, so that the predictors and the target are available in the same order.
Matching is based on the identifiers in the rownames of x and either the names of y, if y is a vector, or the first column of y, if y is a data.frame.
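
The matching described above can be pictured with a few lines of base R. This is only a sketch of the idea, assuming y is a named vector: `align_sketch` is a hypothetical helper written for illustration, not the code udpipe uses, and the real `dtm_align` may differ in details.

```r
# Hypothetical helper (not part of udpipe): align a matrix with a named vector
align_sketch <- function(x, y) {
  # keep only the entries of y whose name occurs in the rownames of x
  y <- y[names(y) %in% rownames(x)]
  # reorder (and, for duplicated names, repeat) the rows of x to follow y
  list(x = x[names(y), , drop = FALSE], y = y)
}

x   <- matrix(1:9, nrow = 3, dimnames = list(c("a", "b", "c"), NULL))
out <- align_sketch(x, c(b = 1, a = 2, c = 6, d = 6))
rownames(out$x)
names(out$y)
```

The identifier "d" has no row in x, so it is dropped from the result, and the rows of x come back in the order in which their identifiers appear in y.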

Usage

dtm_align(x, y, FUN, ...)

Arguments

`x`	a Document-Term-Matrix of class dgCMatrix (which can be an object returned by `document_term_matrix`)
`y`	either a vector or data.frame containing something to align with `x` (e.g. for predictive purposes). In case `y` is a vector, it should have names which are available in the rownames of `x`. In case `y` is a data.frame, its first column should contain identifiers which are available in the rownames of `x`.
`FUN`	a function to be applied on `x` before aligning it to `y`. See the examples.
`...`	further arguments passed on to FUN

Value

A list with elements x and y, where x is the Document-Term-Matrix reordered to match the order of y.

If y was passed as a vector, the returned y element is a vector.
If y was passed as a data.frame with more than 2 columns, the returned y element is a data.frame.
If y was passed as a data.frame with exactly 2 columns, the returned y element is a vector.

Only the data of x whose identifiers also occur in y is returned.
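
The rule for the shape of the returned y can be sketched in base R as follows. `as_target_sketch` is a hypothetical helper, not udpipe code. Naming the resulting vector by the identifier column is an assumption made for this illustration, and a data.frame with fewer than 2 columns is not covered by the rules above, so the sketch leaves it unchanged.

```r
# Hypothetical helper (not part of udpipe): shape of the y element by input type
as_target_sketch <- function(y) {
  if (is.data.frame(y) && ncol(y) == 2) {
    # identifier column + one target column: collapse to a vector
    # (using the identifiers as names is an assumption of this sketch)
    return(setNames(y[[2]], y[[1]]))
  }
  # vectors and wider data.frames keep their type
  y
}

df2 <- data.frame(id = c("a", "b"), price = c(10, 20))
df3 <- data.frame(id = c("a", "b"), price = c(10, 20), type = c("x", "y"))
v   <- as_target_sketch(df2)
is.vector(v)
is.data.frame(as_target_sketch(df3))
```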

Examples

library(udpipe)

## toy example: align a small matrix with a named vector
x <- matrix(1:9, nrow = 3, dimnames = list(c("a", "b", "c")))
x
dtm_align(x = x, 
          y = c(b = 1, a = 2, c = 6, d = 6))
dtm_align(x = x, 
          y = c(b = 1, a = 2, c = 6, d = 6, d = 7, a = -1))
          
data(brussels_reviews)
data(brussels_listings)
x <- brussels_reviews
x <- strsplit.data.frame(x, term = "feedback", group = "listing_id")
x <- document_term_frequencies(x)
x <- document_term_matrix(x)
y <- brussels_listings$price
names(y) <- brussels_listings$listing_id

## align a matrix of predictors with a vector to predict
trainset <- dtm_align(x = x, y = y)
trainset <- dtm_align(x = x, y = y, FUN = function(dtm){
  dtm <- dtm_remove_lowfreq(dtm, minfreq = 5)
  dtm <- dtm_sample(dtm)
  dtm
})
head(names(y))
head(rownames(x))
head(names(trainset$y))
head(rownames(trainset$x))

## align a matrix of predictors with a data.frame
trainset <- dtm_align(x = x, y = brussels_listings[, c("listing_id", "price")])
trainset <- dtm_align(x = x, 
                y = brussels_listings[, c("listing_id", "price", "room_type")])
head(trainset$y$listing_id)
head(rownames(trainset$x))

## example with duplicate data in case of data balancing
dtm_align(x = matrix(1:30, nrow = 3, dimnames = list(c("a", "b", "c"))), 
          y = c(a = 1, a = 2, b = 3, d = 6, b = 6))
target   <- subset(brussels_listings, listing_id %in% brussels_reviews$listing_id)
target   <- rbind(target[1:3, ], target[c(2, 3), ], target[c(1, 4), ])
trainset <- dtm_align(x = x, y = target[, c("listing_id", "price")])
trainset <- dtm_align(x = x, y = setNames(target$price, target$listing_id))
names(trainset$y)
rownames(trainset$x)

udpipe documentation built on Jan. 6, 2023, 5:06 p.m.
