knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
Text mining, also known as text data mining, is the process of extracting information from unstructured or semi-structured text sources such as Word documents, PDF files, and XML files.
Data mining is the process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data stored in structured databases, where the data are organized in records structured by categorical, ordinal, or continuous variables.
Search and Information Retrieval (IR): Storage and retrieval of text documents, including search engines and keyword search.
Document Clustering: Grouping and categorizing terms, paragraphs, or documents using data mining clustering methods.
Document Classification: Grouping and categorizing snippets, paragraphs, or documents using data mining classification methods, based on models trained on labeled examples.
Web Mining: Data and text mining on the Internet, with a specific focus on the scale and interconnectedness of the web.
Information Extraction (IE): Identification and extraction of relevant facts and relationships from unstructured text; the process of making structured data from unstructured and semi-structured text.
Natural Language Processing (NLP): Low-level language processing and understanding tasks (e.g., tagging part of speech); often used synonymously with computational linguistics.
Concept Extraction: Grouping of words and phrases into semantically similar groups.
• Information extraction: Identification of key phrases and relationships within text by looking for predefined sequences in text via pattern matching.
• Topic tracking: Based on a user profile and documents that a user views, text mining can predict other documents of interest to the user.
• Summarization: Summarizing a document to save time on the part of the reader.
• Categorization: Identifying the main themes of a document and then placing the document into a predefined set of categories based on those themes.
• Clustering: Grouping similar documents without having a predefined set of categories.
• Concept linking: Connects related documents by identifying their shared concepts and, by doing so, helps users find information that they perhaps would not have found using traditional search methods.
• Question answering: Finding the best answer to a given question through knowledge-driven pattern matching.
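To make the information extraction item more concrete, here is a minimal sketch of pattern matching in base R. It pulls hashtags out of a few made-up texts with a predefined regular expression. The example texts, the pattern, and all variable names are assumptions chosen for illustration; they are not part of the package.

```r
# Hypothetical example texts (not real tweets)
texts <- c(
  "Learning #rstats with @some_user today",
  "No tags here",
  "#textmining and #rstats are fun"
)

# Predefined pattern: a '#' followed by one or more letters, digits, or underscores
hashtag_pattern <- "#[A-Za-z0-9_]+"

# gregexpr() locates every match in each text;
# regmatches() returns the matched substrings as a list, one element per text
hashtags <- regmatches(texts, gregexpr(hashtag_pattern, texts))

# Count how often each hashtag occurs across all texts
hashtag_counts <- table(tolower(unlist(hashtags)))
print(hashtag_counts)
```

Texts without a match yield an empty character vector, so the result always has one list element per input text.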
Data Collection → Text Parsing → Text Filtering → Transformation → Text Mining
The searchTwitter function of the twitteR package loads tweets that match a specific search term.
tweets <- searchTwitter(word, lang = "en", n = max_number)
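searchTwitter returns a list of status objects rather than plain text, while the corpus step works on a character vector. The sketch below shows one way the text might be extracted, using the getText() method that twitteR status objects provide. So that it runs without Twitter credentials, it uses stand-in objects; make_status, example_tweets, and example_text are hypothetical names introduced only for this illustration.

```r
# Stand-in for a twitteR status object: a list exposing a getText() method.
# (Hypothetical helper, used only so the sketch runs without Twitter access.)
make_status <- function(txt) list(getText = function() txt)

example_tweets <- list(
  make_status("Text mining in R is fun!"),
  make_status("Building a word cloud from 100 tweets")
)

# Same call shape one would use on the list returned by searchTwitter():
# call getText() on every element and collect the results in a character vector
example_text <- sapply(example_tweets, function(x) x$getText())
print(example_text)
```

With real results, twitteR's twListToDF function is an alternative that converts the whole list into a data frame with a text column.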
The text mining pipeline starts by creating a corpus. The tm package provides the Corpus function for this.
tweets_corpus <- Corpus(VectorSource(tweets_text))
tweets_clean <- tm_map(tweets_corpus, removePunctuation)
tweets_clean <- tm_map(tweets_clean, content_transformer(tolower))
tweets_clean <- tm_map(tweets_clean, removeWords, stopwords("english"))
tweets_clean <- tm_map(tweets_clean, removeNumbers)
tweets_clean <- tm_map(tweets_clean, stripWhitespace)
tweets_clean <- tm_map(tweets_clean, stemDocument)
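To see what these cleaning steps do, the following sketch applies the same sequence to two made-up sentences. The clean_corpus wrapper and the sample text are assumptions for illustration and not part of the package. Note that stemDocument relies on the SnowballC package being installed.

```r
library(tm)

# Hypothetical sample text standing in for tweet text
sample_text <- c("The 3 cats are RUNNING, quickly!", "Dogs run; cats sleep.")
sample_corpus <- Corpus(VectorSource(sample_text))

# Hypothetical wrapper applying the cleaning steps in the same order as above
clean_corpus <- function(corpus) {
  corpus <- tm_map(corpus, removePunctuation)                  # drop punctuation
  corpus <- tm_map(corpus, content_transformer(tolower))       # lower-case everything
  corpus <- tm_map(corpus, removeWords, stopwords("english"))  # drop common stop words
  corpus <- tm_map(corpus, removeNumbers)                      # drop digits
  corpus <- tm_map(corpus, stripWhitespace)                    # collapse repeated spaces
  corpus <- tm_map(corpus, stemDocument)                       # reduce words to their stems
  corpus
}

cleaned <- clean_corpus(sample_corpus)

# Pull the cleaned text back out of the corpus, one string per document
cleaned_text <- trimws(vapply(
  seq_along(cleaned),
  function(i) as.character(cleaned[[i]]),
  character(1)
))
print(cleaned_text)
```

Wrapping the steps in a function keeps the order explicit, which matters: stop words are matched in lower case, so the tolower step has to run before removeWords.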
After cleaning, the next step is to transform the data. One way is to create a term-document matrix.
````
tdm_tweets <- TermDocumentMatrix(tweets_clean)
````

### How to run the Shiny app:

````
library(AdvRLab5)
shiny::runGitHub("AdvRLab5", "Philhoels", subdir = "TwitterWordCould/")
````
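As a follow-up to the term-document matrix step, this sketch shows how such a matrix could be turned into the word/frequency table that a word cloud typically needs. The source does not show how the Shiny app computes its frequencies, so treat this as an assumption; the toy documents and the names docs, word_freq, and freq_table are hypothetical.

```r
library(tm)

# Hypothetical toy documents standing in for cleaned tweets
docs <- c("cats chase mice", "dogs chase cats", "mice sleep")
tdm <- TermDocumentMatrix(Corpus(VectorSource(docs)))

# In a term-document matrix, rows are terms and columns are documents
tdm_matrix <- as.matrix(tdm)

# Total frequency of each term across all documents, most frequent first
word_freq <- sort(rowSums(tdm_matrix), decreasing = TRUE)

# Word/frequency table (assumed shape of the input for a word cloud)
freq_table <- data.frame(
  word = names(word_freq),
  freq = as.integer(word_freq),
  row.names = NULL
)
print(freq_table)
```

For large tweet collections, converting the sparse matrix with as.matrix can use a lot of memory; tm's findFreqTerms function is an option when only the terms above a frequency threshold are needed.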