The readtext() function from the readtext package is an easy way to import multiple separate plain-text files into a single dataframe at once.
The readtext() function takes a file directory as input and will import multiple files into R as a single dataframe object. In the dataframe, the filename will be listed in the 'doc_id' column and the file contents will be listed in the 'text' column.
Readtext will extract plain-text data from several different file types including:
readtext() example:
The .R file is preloaded with an example of how readtext() works. You can download the .zip file and follow the instructions in the .R file to navigate to the folder on your system to execute the function.
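As a rough illustration of the behavior described above, the sketch below writes two small sample files to a temporary folder and reads them back with readtext(). The folder name, file names, and sample sentences are made up for this example, and a glob pattern ("*.txt") is used to select the files; it assumes the readtext package is installed.

```r
# Minimal sketch: read every .txt file in a folder into one dataframe.
library(readtext)

# Hypothetical sample data: two small files in a temporary folder.
demo_dir <- file.path(tempdir(), "readtext_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines("The cat sat on the mat.", file.path(demo_dir, "first.txt"))
writeLines("Dogs chase the cat.", file.path(demo_dir, "second.txt"))

# A glob pattern selects all matching files in the folder.
docs <- readtext(file.path(demo_dir, "*.txt"))

# One row per file: 'doc_id' holds the filename, 'text' holds the contents.
print(docs$doc_id)
print(docs$text)
```

To try this on your own files, replace demo_dir with the path to the folder on your system.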
Below is a .R script file with an example of the bag-of-words approach to text mining and tokenizing text.
Below is an example you can run in RStudio to see how unnest_tokens() works in R.
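A minimal sketch of tokenizing with unnest_tokens() is shown here, assuming the tidytext package is installed. The small dataframe is hypothetical sample data shaped like readtext() output (a 'doc_id' column and a 'text' column). By default, unnest_tokens() splits the text into one word per row, converts it to lowercase, and strips punctuation.

```r
# Minimal sketch: split text into one-word-per-row tokens.
library(tidytext)

# Hypothetical sample data shaped like readtext() output.
docs <- data.frame(
  doc_id = c("first.txt", "second.txt"),
  text = c("The cat sat on the mat.", "Dogs chase the cat!"),
  stringsAsFactors = FALSE
)

# 'word' is the name of the new token column; 'text' is the column to split.
tokens <- unnest_tokens(docs, output = word, input = text)

# Each row now holds a single lowercase word and the document it came from.
print(tokens)
```

Because word order is discarded once the text is split this way, the result is a bag of words: only which words occur, and how often, is kept.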
Word clouds are a fun and useful way to visualize textual data. A word cloud displays the n most frequently occurring tokens in a corpus, with each token sized in proportion to its share of the total token count.
Even with a bag-of-words approach, not all tokens in a corpus are useful. Loading the tidytext package into R makes the stop_words dataframe available. It contains three lexicons of words commonly considered to be of little value to knowledge extraction.
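One common way to drop stop words from tokenized text is an anti-join against stop_words. The sketch below assumes the tidytext and dplyr packages are installed; the sample sentences are hypothetical, and using all three lexicons at once is a choice made for this example rather than a requirement.

```r
# Minimal sketch: remove stop words from tokenized text.
library(tidytext)
library(dplyr)

# Hypothetical sample data.
docs <- data.frame(
  doc_id = c("first.txt", "second.txt"),
  text = c("The cat sat on the mat.", "Dogs chase the cat!"),
  stringsAsFactors = FALSE
)
tokens <- unnest_tokens(docs, word, text)

# stop_words has a 'word' column and a 'lexicon' column naming its source.
print(unique(stop_words$lexicon))

# Keep only the tokens that do NOT appear in stop_words.
cleaned <- anti_join(tokens, stop_words, by = "word")
print(cleaned)
```

To use a single lexicon instead, filter stop_words on its 'lexicon' column before the anti-join.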
The example below shows how to visualize unnested data using the wordcloud package.
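A minimal sketch of that workflow follows, assuming the tidytext, dplyr, and wordcloud packages are installed. The sample text is made up, and min.freq is lowered to 1 only because the sample is so small; on a real corpus a higher threshold is usually more readable.

```r
# Minimal sketch: count tokens and draw a word cloud.
library(tidytext)
library(dplyr)
library(wordcloud)

# Hypothetical sample data.
docs <- data.frame(
  doc_id = c("first.txt", "second.txt"),
  text = c(
    "Text mining turns raw text into data. Mining text starts with counting tokens.",
    "A cloud shows frequent tokens. Frequent tokens appear larger in the cloud."
  ),
  stringsAsFactors = FALSE
)

# Tokenize, drop stop words, and count how often each token occurs.
word_counts <- docs %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word") %>%
  count(word, sort = TRUE)

# Fix the random layout so the plot is reproducible.
set.seed(42)

# Word size is scaled by frequency; max.words caps how many tokens are drawn.
wordcloud(
  words = word_counts$word,
  freq = word_counts$n,
  min.freq = 1,
  max.words = 50
)
```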
While readtext() makes a corpus-like object, the output of readtext() remains a dataframe. The example below illustrates the steps a user must take to turn a readtext dataframe or a collection of text objects into a corpus object. Corpus objects are necessary as inputs to several text-mining functions, chief among which are those that build a document-term matrix.
The .R script below shows how the VCorpus() function can be used to make a corpus object in R.
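As a sketch of the conversion, the code below builds a corpus with the tm package from a plain dataframe that mimics readtext() output, then creates a document-term matrix from it. It assumes tm is installed. The sample data is hypothetical, and the example relies on tm's DataframeSource(), which expects the first column to be named 'doc_id' and the second 'text'; whether a readtext object can be passed in directly without conversion is not verified here.

```r
# Minimal sketch: dataframe -> corpus -> document-term matrix.
library(tm)

# Hypothetical sample data with the 'doc_id' and 'text' columns
# that DataframeSource() expects.
docs <- data.frame(
  doc_id = c("first.txt", "second.txt"),
  text = c("The cat sat on the mat.", "Dogs chase the cat!"),
  stringsAsFactors = FALSE
)

# Each row of the dataframe becomes one document in the corpus.
corpus <- VCorpus(DataframeSource(docs))

# Rows of the matrix are documents, columns are terms, cells are counts.
dtm <- DocumentTermMatrix(corpus)
inspect(dtm)
```

For a plain character vector of texts rather than a dataframe, VectorSource() can be used in place of DataframeSource().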
You can use tokenization with a bag-of-words approach to perform sentiment analysis on text. The tidytext package contains a 'sentiments' dataset with 3 different sentiment dictionaries that can be used to begin sentiment analysis.
The .R script below illustrates how a sentiment dictionary can be used to mine text for sentiment.
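The sketch below shows one way this can look, assuming the tidytext and dplyr packages are installed. The two short reviews are hypothetical sample data, and the Bing dictionary (which labels words as positive or negative) is chosen here only for illustration; other dictionaries may score words differently and may need to be downloaded separately.

```r
# Minimal sketch: count positive and negative words per document.
library(tidytext)
library(dplyr)

# Hypothetical sample data.
reviews <- data.frame(
  doc_id = c("review1", "review2"),
  text = c("I love this wonderful book.", "What a terrible, awful movie."),
  stringsAsFactors = FALSE
)

# The Bing dictionary: one row per word, labeled positive or negative.
bing <- get_sentiments("bing")

# Tokenize, keep only words found in the dictionary, then count by label.
sentiment_counts <- reviews %>%
  unnest_tokens(word, text) %>%
  inner_join(bing, by = "word") %>%
  count(doc_id, sentiment)

print(sentiment_counts)
```

Words that do not appear in the dictionary are dropped by the inner join, so the counts reflect only the words the dictionary recognizes.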