
Introduction to Text Mining

An overview of text mining tools and techniques.

Bag of Words and Tokenization

Bag of Words

The bag of words approach to text mining is the most common method for performing computations on string data. In bag of words, the data are broken down into tokens. A token is a unit of text whose size depends on how the string data are tokenized: a token can be as small as a single character or as large as an entire document. Common tokens include: 

  • characters
  • words
  • sentences
  • documents 

The bag of words approach does not assign any kind of hierarchical significance to each token in a 'bag'. The idea here is that you can pull any token at random from your corpus and the computer will treat it with the same significance as every other token. 

The function unnest_tokens(output, input, token) from the tidytext R package can be used to tokenize string data. 
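
For example, the short sketch below tokenizes a small, made-up two-document corpus into single words; the data frame, column names, and sample sentences are illustrative assumptions rather than part of this guide.

    library(dplyr)
    library(tidytext)

    # A small, hypothetical corpus: one row per document
    docs <- tibble(
      document = c("doc1", "doc2"),
      text = c("Text mining turns raw text into data.",
               "The bag of words model ignores word order.")
    )

    # Break each document into word tokens: one token per output row
    tokens <- docs %>%
      unnest_tokens(output = word, input = text, token = "words")

    head(tokens)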

The Term-Document Matrix 

The term-document matrix (TDM) is a useful data object that can be created from unnested tokens. In a TDM, each individual term is represented as a row and each document in the corpus is represented as a column. The matrix is populated with counts of the number of times each term appears in each document. This is a quick way to see and compare term frequencies across a large number of documents. 

The TermDocumentMatrix(corpus_object) function from the tm package can be used to create a term-document matrix from a corpus object.  
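
As a minimal sketch, the example below builds a term-document matrix from a toy corpus; the sample documents and cleanup steps are illustrative assumptions.

    library(tm)

    # A hypothetical mini-corpus
    docs <- c("Text mining turns raw text into data.",
              "The bag of words model ignores word order.")
    corpus <- VCorpus(VectorSource(docs))

    # Light cleanup so counts are not split across case and punctuation variants
    corpus <- tm_map(corpus, content_transformer(tolower))
    corpus <- tm_map(corpus, removePunctuation)

    # Terms as rows, documents as columns, counts in the cells
    tdm <- TermDocumentMatrix(corpus)
    inspect(tdm)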

Topic Modeling

Topic modeling is an approach to mining text that allows users to quickly get an idea of the main contents of a corpus. Topic models are especially useful when working with documents the user has not previously read.

Topic models typically take a document-term matrix built from the tokenized corpus as input and compute the probability that each word appears in each topic, as well as the probability that each topic appears in each document. These probabilities can then be used to make predictions about the prevalence of topics. 
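
As a rough sketch of this workflow, the example below fits a two-topic latent Dirichlet allocation (LDA) model with the topicmodels package, which this guide does not otherwise cover; the corpus, number of topics, and random seed are illustrative assumptions, and a real topic model needs far more text.

    library(tm)
    library(topicmodels)

    # Hypothetical mini-corpus, far too small for a meaningful topic model
    docs <- c("dogs and cats are popular pets",
              "cats chase mice and birds",
              "stocks and bonds are common investments",
              "investors watch the stock market daily")
    dtm <- DocumentTermMatrix(VCorpus(VectorSource(docs)))

    # Fit a two-topic LDA model
    lda <- LDA(dtm, k = 2, control = list(seed = 1234))

    # Top words per topic and per-document topic probabilities
    terms(lda, 5)
    posterior(lda)$topics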

Named Entity Recognition

Named entity recognition is another common text mining method that allows users to locate and classify names within text. 

Here it is important to note that the term 'entity' traditionally refers to discrete individuals, places, and organizations. For example, we may think of "Florida Institute of Technology" and the "School of Arts and Communications" as entities, whereas terms like "university" and "liberal arts" would not normally be thought of as discrete entities. While this principle holds true in many cases, the definition can be expanded to anything that can be reliably recognized, including dates, times, specific descriptive words, quantities, and even ideas. 

Common named entities include: 

  • individuals 
  • locations
  • dates 
  • times
  • organizations

Named entity recognition is also fundamental to sentiment analysis, where the named entity is the object to which the sentiment is attributed. 

Example: 

  • Language-based software will often see dates or times and suggest adding them as an event in a calendar application. 
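
As a sketch of what named entity recognition looks like in practice, the example below uses the spacyr R package, which is not covered in this guide and assumes a working spaCy installation in Python behind the scenes; the sample sentence is an illustrative assumption.

    library(spacyr)
    spacy_initialize()   # assumes spaCy and its English model are already installed

    text <- "Florida Institute of Technology hosted a workshop in Melbourne on March 3, 2023."
    spacy_extract_entity(text)   # returns entities labeled, for example, ORG, GPE, and DATE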

Sentiment Analysis

Sentiment analysis refers to text mining methods that are used to gauge the emotional attitude of text. Sentiment analysis relies on dictionaries that map words typically indicating emotion to a corresponding emotional designation or score. 

For example, words like happy, excited, and energized might be designated as positive sentiments, while words such as bad, oppressed, and foolish might be designated as negative sentiments. The tidytext R package provides several premade sentiment dictionaries commonly used for sentiment analysis. 

These dictionaries include: 

  • bing
  • afinn
  • loughran

Note: You can explore these dictionaries in the RStudio Examples section of this guide.
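
As a minimal sketch, the example below scores a tiny, made-up pair of documents against the bing dictionary with tidytext; the documents and column names are illustrative assumptions.

    library(dplyr)
    library(tidytext)

    # Hypothetical example documents
    docs <- tibble(
      document = c("doc1", "doc2"),
      text = c("I am happy and excited about this project.",
               "The results were bad and the process felt foolish.")
    )

    # Tokenize, match tokens against the bing dictionary, and count sentiments
    docs %>%
      unnest_tokens(word, text) %>%
      inner_join(get_sentiments("bing"), by = "word") %>%
      count(document, sentiment)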