nlpsummarizer: an R package that generates summaries of textual data
One of the most relevant applications of machine learning for corporations worldwide is natural language processing (NLP). Whether it is parsing business documents to extract key word segments or detecting Twitter sentiment about a product, NLP has use cases in virtually every business environment.
Unfortunately, few tools today provide summary statistics on textual data that a user may want to analyze. Our goal with this package is to give users a simple, flexible tool for gathering key insights during the exploratory data analysis phase of the data science workflow.
Our library makes extensive use of pre-existing packages in the R ecosystem. We use the textcat and openNLP libraries to build most of the sentiment analysis functions, while also leveraging well-known packages such as tidyverse to aid in the presentation of the final output.
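For illustration only, here is a minimal sketch of how textcat and openNLP can serve as building blocks for language detection and part-of-speech tagging. This is not the package's internal implementation, and the variable names are hypothetical.

library(NLP)        # provides as.String() and annotate(), required by openNLP
library(openNLP)    # Apache OpenNLP annotators (sentence, word, POS)
library(textcat)    # n-gram based language identification

txt <- "I love travelling to Japan and eating Mexican food but I can only speak English!"

# Language identification with textcat
textcat(txt)

# Part-of-speech tagging with openNLP annotators
s <- as.String(txt)
sent_ann <- Maxent_Sent_Token_Annotator()
word_ann <- Maxent_Word_Token_Annotator()
pos_ann  <- Maxent_POS_Tag_Annotator()

ann     <- annotate(s, list(sent_ann, word_ann))
ann_pos <- annotate(s, pos_ann, ann)

# Keep only the word-level annotations and tabulate their POS tags
word_anns <- subset(ann_pos, type == "word")
tags      <- sapply(word_anns$features, `[[`, "POS")
table(tags)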
To the best of our knowledge, no other library combines all of the functionality described below in one package.
You can install the development version from GitHub with:
install.packages("devtools")
devtools::install_github("UBC-MDS/nlpsummarizer")
detect_language: This function will parse through the textual data and determine the language of the text. This can be useful in an international company's customer service process for detecting which regions are requesting the most help.
part_of_speech: This function will generate key statistics on the proportions of parts of speech in the textual data, including verbs, prepositions, adjectives, nouns and articles.
polarity: This function will check the overall sentiment of the data by counting the number of negative, positive and neutral words in the textual data.
sentence_stopwords_freq: This function will report the number of sentences and stop words, and output the high-frequency words in the textual data.
This is a basic example which shows you how to generate a summary:
library(nlpsummarizer)
df <- data.frame(text_col = c('I love travelling to Japan and eating Mexican food but I can only speak English!'))
get_language(df$text_col)
[1] "English"
get_part_of_speech(df$text_col)
| verbs | prepositions | adjectives | nouns | articles |
|-------|--------------|------------|-------|----------|
| 0.2   | 0.11         | 0.3        | 0.06  | 0.18     |
get_polarity(df$text_col)
| positive words | negative words | neutral words |
|----------------|----------------|---------------|
| 3              | 0              | 15            |
stopword_sentence(df$text_col)
| number of sentences | number of stop words |
|---------------------|----------------------|
| 1                   | 4                    |
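As a rough illustration of the kind of computation behind the sentence and stop-word statistics, the sketch below counts sentences, stop words and high-frequency words using the tokenizers and tidytext packages. These packages and the variable names are assumptions for illustration, not the package's actual implementation, and the counts they produce may differ from the output shown above.

library(tokenizers)  # tokenize_sentences(), tokenize_words()
library(tidytext)    # stop_words lexicon (a data frame with a "word" column)

txt <- "I love travelling to Japan and eating Mexican food but I can only speak English!"

# Number of sentences
n_sentences <- length(tokenize_sentences(txt)[[1]])

# Lower-cased word tokens
words <- tokenize_words(txt)[[1]]

# Number of stop words, using the stop-word lexicon shipped with tidytext
n_stopwords <- sum(words %in% tidytext::stop_words$word)

# High-frequency words, with stop words removed
freq <- sort(table(words[!words %in% tidytext::stop_words$word]), decreasing = TRUE)

n_sentences
n_stopwords
head(freq)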