Introduction to Text Mining

An overview of text mining tools and techniques.

The stringr Package

What is a String? 

'String' is the common term for textual data, which R stores as the character data type. In R, strings are enclosed in either single quotes ('string') or double quotes ("string"). Remember that digits enclosed in quotes are treated as string data, not as numbers.
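The short sketch below illustrates the quoting rules described above using only base R. The variable names are hypothetical and chosen purely for illustration.

```r
# Both quoting styles create the same character (string) value
single_quoted <- 'string'
double_quoted <- "string"
identical(single_quoted, double_quoted)   # TRUE

# Digits inside quotes are text, not numbers
quoted_digits <- "42"
class(quoted_digits)                      # "character"
class(42)                                 # "numeric"

# Convert explicitly before doing arithmetic on quoted digits
as.numeric(quoted_digits) + 1             # 43
```

Attempting arithmetic directly on quoted digits, such as "42" + 1, stops with an error, so an explicit conversion like as.numeric() is needed first.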

 

The stringr Package

The following pages include interactive examples for manipulating string data in R using the stringr package. stringr is a collection of ready-made functions for working with string data. Using these functions typically takes less code than writing the equivalent operations by hand, which saves time and effort. For further details on the functions used in the exercises, see the stringr Cheat Sheet.

Note: To install the stringr package on your machine, run install.packages('stringr') in the R console.
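Once the package is installed, a session might begin as in the minimal sketch below. The sample sentence is an assumption made for illustration and is not taken from the exercise data. The base R calls at the end are shown only for comparison.

```r
# Load stringr (assumes it has already been installed)
library(stringr)

# Hypothetical sample string, not part of the exercise data
line <- "All the world's a stage"

str_length(line)            # number of characters in the string
str_to_upper(line)          # "ALL THE WORLD'S A STAGE"
str_detect(line, "stage")   # TRUE, because the pattern occurs in the string

# Rough base R equivalents, for comparison
nchar(line)
toupper(line)
grepl("stage", line)
```

stringr functions share the str_ prefix and take the string as their first argument, which makes them easy to find and to combine.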

The Data in the Exercises

The data you will work with throughout most of these exercises have been preloaded into the DataCamp Interactive Development Environment (IDE). The data are stored in a data frame named "shakespeare" with two columns: the 'play' column, which lists the titles of several of William Shakespeare's plays as a string vector, and the 'quote' column, which lists a famous line from the corresponding play. You can download this simple dataset as a .csv file below.

Note: You can always view the data in a column within the IDE by entering the variable name, the dollar sign operator ($), and the column name in the console. In this case, you would enter:

shakespeare$play

or

shakespeare$quote
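To try the dollar sign operator outside the IDE, the sketch below builds a small stand-in for the data. The three rows are assumptions chosen for illustration. The actual dataset may contain different plays and lines.

```r
# Hypothetical stand-in for the preloaded data; the real rows may differ
shakespeare <- data.frame(
  play  = c("Hamlet", "Macbeth", "Romeo and Juliet"),
  quote = c("To be, or not to be: that is the question.",
            "Double, double toil and trouble.",
            "What's in a name?"),
  stringsAsFactors = FALSE   # keep both columns as character vectors
)

shakespeare$play           # all play titles
shakespeare$quote          # all quoted lines
class(shakespeare$play)    # "character"
shakespeare$quote[1]       # first quote only
```

Each column returned by the dollar sign operator is an ordinary character vector, so it can be passed directly to string functions.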

For a more complete explanation of column selection, see this interactive tutorial on Data Frame Manipulation.