
Introduction to Text Mining

An overview of text mining tools and techniques.

Subsetting with Stringr

Data scientists rarely need to work with an entire dataset at once. Subsetting is the act of pulling smaller parts of data from a larger dataset. The stringr package for R contains several functions that let users subset strings quickly, either by character position or by matching a pattern.

str_sub(string, start, end)

str_sub(string, start, end) will extract a substring from within a string.

Arguments: 

  1. string: the character object you want to extract a substring from. 
  2. start: the starting position within the string of the substring to be extracted.
  3. end: the ending position within the string of the substring to be extracted.
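
The arguments above can be tried out with a short sketch. It assumes the stringr package is installed; the example strings and variable names are made up for illustration.

```r
library(stringr)

fruit <- "pineapple"

# Characters 1 through 4 of the string
first_part <- str_sub(fruit, 1, 4)
print(first_part)   # "pine"

# Negative positions count backwards from the end of the string
last_part <- str_sub(fruit, -5, -1)
print(last_part)    # "apple"

# str_sub() is vectorised, so it works on every element of a character vector
middles <- str_sub(c("apple", "banana"), 2, 3)
print(middles)      # "pp" "an"
```

Both start and end are inclusive, so str_sub(fruit, 1, 4) returns four characters.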

str_subset(string, pattern)

str_subset(string, pattern) will return only the strings containing a matched pattern specified by the user. 
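
As a small illustration, the sketch below filters a made-up character vector. It assumes stringr is installed and that the pattern is interpreted as a regular expression, which is the stringr default.

```r
library(stringr)

words <- c("apple", "banana", "cherry", "grape")

# Keep only the strings that contain "ap" anywhere
with_ap <- str_subset(words, "ap")
print(with_ap)        # "apple" "grape"

# A regular expression works too: strings that start with a, b, or c
starts_abc <- str_subset(words, "^[a-c]")
print(starts_abc)     # "apple" "banana" "cherry"
```

Note that the whole string is returned, not just the part that matched.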

str_extract(string, pattern)

str_extract(string, pattern) returns the first match of the pattern found in each string, or NA for a string that contains no match. Note: to extract all matches within a string, use str_extract_all().
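
The difference between the two functions can be seen in a short sketch. The example strings are invented for illustration, and stringr is assumed to be installed.

```r
library(stringr)

notes <- c("order 66 shipped", "no digits here", "items 12 and 345")

# First run of digits in each string; NA where nothing matches
first_num <- str_extract(notes, "[0-9]+")
print(first_num)      # "66" NA "12"

# Every run of digits in each string, returned as a list with one
# character vector per input string
all_nums <- str_extract_all(notes, "[0-9]+")
print(all_nums[[3]])  # "12" "345"
```

Unlike str_subset(), these functions return only the matched text rather than the whole string.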

str_match(string, pattern)

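str_match(string, pattern) will return the first matched pattern within a string, together with the text captured by each parenthesized group in the pattern.

When applied to a string vector, the matches will be returned in a character matrix with one row per string. The first column holds the complete match and each following column holds one capture group.

To return all matches in a string instead of the first match only, use str_match_all(), which returns a list containing one such matrix per string.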
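
The sketch below shows the shape of the results on a made-up vector of dates. It assumes stringr is installed, and it relies on str_match() returning a character matrix and str_match_all() returning a list of matrices, which is the documented stringr behaviour as far as I know.

```r
library(stringr)

dates <- c("2021-03-15", "no date", "1999-12-31 and 2000-01-01")
pattern <- "([0-9]{4})-([0-9]{2})"   # group 1: year, group 2: month

# One row per input string; column 1 is the complete match,
# columns 2 and 3 are the two capture groups
m <- str_match(dates, pattern)
print(m[1, ])            # "2021-03" "2021" "03"
print(m[2, ])            # NA NA NA, because the string has no match

# One matrix per input string, with one row per match found
all_m <- str_match_all(dates, pattern)
print(nrow(all_m[[3]]))  # 2
```

Use str_extract() when only the matched text is needed, and str_match() when the individual capture groups matter.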