Introduction to Text Mining

An overview of text mining tools and techniques.

Importing Multiple Documents

The readtext() function from the readtext package is an easy way to import multiple plain-text files into a single dataframe at once.

readtext() takes a file path as input (a single file, a directory, or a wildcard pattern such as *.txt) and imports the matching files into R as a single dataframe object. In the dataframe, each filename is listed in the 'doc_id' column and the corresponding file contents are listed in the 'text' column.
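As a minimal sketch of the import step described above, the following creates two small text files in a temporary folder and reads them in with one call. The folder name and file contents are invented for illustration; on your own system you would point readtext() at the folder that holds your documents.

```r
library(readtext)

# Create a temporary folder with two small plain-text files (illustrative data).
demo_dir <- file.path(tempdir(), "readtext_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines("Text mining turns documents into data.",
           file.path(demo_dir, "first.txt"))
writeLines("Each file becomes one row of the dataframe.",
           file.path(demo_dir, "second.txt"))

# Import every .txt file in the folder with a single call.
docs <- readtext(file.path(demo_dir, "*.txt"))

names(docs)    # column names, including doc_id and text
docs$doc_id    # one entry per imported file
docs$text      # the contents of each file
```

Each imported file becomes one row, so the number of rows in docs should match the number of files that the wildcard pattern matched.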

readtext() extracts plain text from several different file types, including:

  • .txt
  • .csv
  • .tab
  • .tsv
  • .json
  • .doc
  • .docx
  • .pdf

readtext() example: 

The .R file is preloaded with an example of how readtext() works. You can download the .zip file and follow the instructions in the .R file to navigate to the folder on your system to execute the function. 

Tokenization

Tokenizing Text

Below is a .R script file with an example of the bag-of-words approach to text mining and tokenizing text.

  • The unnest_tokens(tbl, output, input) function takes a dataframe (tbl) and returns a new tokenized dataframe in which the text in the input column has been split into tokens, one token per row, stored in the output column.
  • The unit of tokenization (for example words, sentences, or n-grams) is specified by the user with the 'token' argument. If the argument is left unspecified, the default is word-level tokenization.
  • The input column passed to unnest_tokens() must be a character vector.
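The points above can be sketched with a small, made-up dataframe. The column names doc_id and text mirror the readtext() output described earlier; the sentences themselves are invented for illustration.

```r
library(dplyr)
library(tidytext)

# A tiny dataframe standing in for imported documents (illustrative data).
docs <- tibble(
  doc_id = c("a.txt", "b.txt"),
  text   = c("Text mining is fun.", "Mining text takes practice.")
)

# Default behaviour: word-level tokens, lowercased, punctuation stripped.
tokens <- docs %>%
  unnest_tokens(output = word, input = text)

# The 'token' argument changes the unit; here, two-word sequences (bigrams).
bigrams <- docs %>%
  unnest_tokens(output = bigram, input = text, token = "ngrams", n = 2)

tokens
bigrams
```

Because the other columns are carried along, each token row still records which document it came from.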

Below is an example you can run in RStudio to see how unnest_tokens() works in R. 

Wordclouds and Stop Words

Word Clouds

Word clouds are a fun and useful way to visualize textual data. A word cloud displays the n most frequently occurring tokens in a corpus, with each token sized in proportion to its share of the total token count.

Stop Words

Even with a bag-of-words approach, not all tokens in a corpus are useful. Loading the tidytext package into R makes the stop_words dataframe available. stop_words combines three lexicons of words commonly considered to be of little value to knowledge extraction, such as "the", "of", and "is".
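One common way to drop stop words from tokenized text is an anti-join against stop_words, sketched below. The sample sentence is invented, and the anti-join is one reasonable approach rather than the only one.

```r
library(dplyr)
library(tidytext)

# The lexicons bundled in the stop_words dataframe.
unique(stop_words$lexicon)

# Tokenize a short, made-up sentence.
tokens <- tibble(text = "The library is a great place to study text mining") %>%
  unnest_tokens(word, text)

# Keep only the tokens that do not appear in any stop-word lexicon.
content_words <- tokens %>%
  anti_join(stop_words, by = "word")

content_words
```

To filter against a single lexicon instead, subset stop_words on its lexicon column before joining.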

The example below shows how to visualize unnested data using the wordcloud package.
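A minimal sketch of that workflow, assuming the wordcloud package is installed: tokenize, remove stop words, count each token, and pass the counts to wordcloud(). The three sentences are invented for illustration, and min.freq is lowered to 1 only because the sample is so small.

```r
library(dplyr)
library(tidytext)
library(wordcloud)

# A tiny made-up corpus.
docs <- tibble(text = c(
  "Text mining finds patterns in text.",
  "Mining text data reveals patterns in documents.",
  "Documents become data through text mining."
))

# Tokenize, drop stop words, and count how often each token occurs.
word_counts <- docs %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word") %>%
  count(word, sort = TRUE)

# Draw the cloud; more frequent tokens are drawn larger.
set.seed(42)  # makes the layout repeatable
wordcloud(words = word_counts$word,
          freq = word_counts$n,
          min.freq = 1,
          max.words = 50,
          random.order = FALSE)
```

The max.words argument plays the role of n in the description above: it caps how many of the most frequent tokens are plotted.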

Making a Corpus and a Document-Term Matrix

While readtext() makes a corpus-like object, its output remains a dataframe. The example below illustrates the steps a user must take to turn a readtext dataframe or a collection of text objects into a corpus object. Corpus objects are required as input to several text-mining functions, chief among them the function that builds a document-term matrix.

The .R script below shows how the VCorpus() function can be used to make a corpus object in R.
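The following is a sketch of those steps using the tm package, which provides VCorpus(). It assumes a dataframe whose first two columns are named doc_id and text, the same layout readtext() produces; the two documents here are invented stand-ins.

```r
library(tm)

# A small dataframe standing in for readtext() output (illustrative data).
docs <- data.frame(
  doc_id = c("first.txt", "second.txt"),
  text   = c("Text mining is fun.", "Mining text takes practice."),
  stringsAsFactors = FALSE
)

# DataframeSource() expects the doc_id and text columns in that order.
corpus <- VCorpus(DataframeSource(docs))

# Build a document-term matrix: one row per document, one column per term.
dtm <- DocumentTermMatrix(corpus, control = list(removePunctuation = TRUE))

inspect(dtm)
```

For a plain character vector rather than a dataframe, VectorSource() can be used in place of DataframeSource().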

Sentiment Analysis

You can use tokenization with a bag-of-words approach to perform sentiment analysis on text. The tidytext package contains a 'sentiments' library with three different sentiment dictionaries that can be used to begin sentiment analysis.

The .R script below illustrates how a sentiment dictionary can be used to mine text for sentiment. 
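As a minimal sketch of dictionary-based sentiment analysis, the example below tokenizes two invented comments and joins the tokens to the Bing lexicon through get_sentiments(). The Bing lexicon is used here because it ships with tidytext; depending on your tidytext version, other dictionaries may require the separate textdata package.

```r
library(dplyr)
library(tidytext)

# Two made-up comments to score.
reviews <- tibble(
  doc_id = c("r1", "r2"),
  text   = c("I love this wonderful library.",
             "The terrible printer is broken again.")
)

# The Bing lexicon labels each word as "positive" or "negative".
bing <- get_sentiments("bing")

# Tokenize, keep only words found in the lexicon, and count by sentiment.
scores <- reviews %>%
  unnest_tokens(word, text) %>%
  inner_join(bing, by = "word") %>%
  count(doc_id, sentiment)

scores
```

Words that do not appear in the dictionary are dropped by the inner join, so the counts reflect only the tokens the lexicon recognizes.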

Topic Modeling