sentenceR

R package for (almost) language-agnostic sentence tokenization

sentenceR is a language-agnostic utility for sentence tokenization of raw text. Built on the UDPipe POS-tagging pipeline, the package automatically extracts sentences together with their document, paragraph, and sentence indexes (hence the “crowbar” logo as a reference to extraction). It works with any of the 100+ language models natively provided by the udpipe package (see https://github.com/bnosac/udpipe).

Overview

The package is intended especially for non-English languages that are under-resourced in terms of standardized tools. The approach is not particularly fast, as the pipeline relies on full POS tagging done by UDPipe, but it offers a reliable option for programmers working on small corpora of various origins (typically social scientists) who need a simple pre-processing tool for extracting sentence-level tokens and their higher n-grams. For convenience, the main function provides several control arguments for cleaning the raw text as well as for extracting its lemmatized form for further processing.

Installation

Install the package from the GitHub repository:

devtools::install_github('mmochtak/sentenceR', dependencies = TRUE)
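The install_github() call assumes the devtools package is available; if it is not, install it from CRAN first:

```r
# devtools is only needed for installing packages from GitHub
install.packages("devtools")
```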

Version

0.0.2

Usage

The package contains three main functions: get_sentences, sent_ngrams, and sent_ngrams_lem.

library(sentenceR)
sample_text <- c("This is sentence number one. This is sentence number two. This is sentence number three.", "This is sentence number four. This is sentence number five. This is sentence number six.")

get_sentences(text = sample_text, language = "english", lem = TRUE, verbose = TRUE)

  doc_id paragraph_id sentence_id                       sentence                   sentence_lem
1      1            1           1   This is sentence number one.   this be sentence number one.
2      1            1           2   This is sentence number two.   this be sentence number two.
3      1            1           3 This is sentence number three. this be sentence number three.
4      2            1           1  This is sentence number four.  this be sentence number four.
5      2            1           2  This is sentence number five.  this be sentence number five.
6      2            1           3   This is sentence number six.   this be sentence number six.

result <- get_sentences(text = sample_text, language = "english", lem = TRUE, verbose = TRUE)
sent_ngrams(sentences = result, n = 2)

  doc_id ngram_id                                                       ngram
1      1        1   This is sentence number one. This is sentence number two.
2      1        2 This is sentence number two. This is sentence number three.
3      2        1 This is sentence number four. This is sentence number five.
4      2        2  This is sentence number five. This is sentence number six.
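The output above reflects a simple sliding window: within each document, n consecutive sentences are pasted together, and the window never crosses document boundaries. A minimal base-R sketch of this idea (the function name sentence_ngrams and column names are taken from the example output, not from the package internals):

```r
# Illustrative re-implementation of the sliding-window logic:
# for each doc_id, paste every run of n consecutive sentences.
sentence_ngrams <- function(df, n = 2) {
  do.call(rbind, lapply(split(df, df$doc_id), function(d) {
    k <- nrow(d) - n + 1  # number of n-grams in this document
    data.frame(
      doc_id   = d$doc_id[1],
      ngram_id = seq_len(k),
      ngram    = vapply(seq_len(k), function(i)
        paste(d$sentence[i:(i + n - 1)], collapse = " "), character(1))
    )
  }))
}

sents <- data.frame(
  doc_id   = c(1, 1, 1, 2, 2, 2),
  sentence = c("This is sentence number one.", "This is sentence number two.",
               "This is sentence number three.", "This is sentence number four.",
               "This is sentence number five.", "This is sentence number six.")
)
bigrams <- sentence_ngrams(sents, n = 2)
# 4 rows: two bigrams per document, matching the sent_ngrams() output above
```

The same window applied to the sentence_lem column instead of sentence yields the lemmatized n-grams produced by sent_ngrams_lem.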

sent_ngrams_lem(sentences = result, n = 2)

  doc_id ngram_id                                                       ngram
1      1        1   this be sentence number one. this be sentence number two.
2      1        2 this be sentence number two. this be sentence number three.
3      2        1 this be sentence number four. this be sentence number five.
4      2        2  this be sentence number five. this be sentence number six.

Final Remarks

I developed the package as a learning exercise in package building and as a way to better organize my code. It might, however, be useful for anybody working on NLP tasks that require sentence tokenization and lemmatization in languages other than English. The approach is not the most efficient in terms of speed and accuracy, but it is very versatile when it comes to languages that can be processed out of the box (for a slightly extended tutorial, see my blog). Regarding accuracy, it is important to stress that sentence tokenization depends on the accuracy of the underlying language model and may therefore differ among models. Although I have never planned to submit the package to CRAN, I plan to maintain it on GitHub for as long as it is useful. If the standard GitHub channels are not suitable for you, feel free to contact me via my personal website. If used, please cite it as:

Mochtak, Michal (2021): sentenceR: Language-Agnostic Sentence Tokenization for Low-Resourced Languages. URL: https://github.com/mmochtak/sentenceR/. DOI: 10.13140/RG.2.2.12134.65608



mmochtak/sentenceR documentation built on Aug. 25, 2022, 9:31 a.m.