Florida Tech Evans Library

Introduction to Text Mining

An overview of text mining tools and techniques.

Subsetting with Stringr

Data scientists rarely need an entire dataset at once. Subsetting is the act of pulling a smaller part out of a larger dataset, and it is a core skill for any data scientist. The stringr package contains several functions for subsetting character data quickly: str_sub() selects by position, while str_subset(), str_extract(), and str_match() select by pattern.

str_sub(string, start, end)

str_sub(string, start, end) extracts the substring that runs from position start to position end of a string. Positions are counted from 1, and both ends are included.

Arguments: 

string: the character object you want to extract a substring from. 

start: the starting position within the string of the substring to be extracted.

end: the ending position within the string of the substring to be extracted.
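
To make the three arguments concrete, here is a minimal sketch on standalone strings. The variable names (title, first_three, last_six, prefixes) are illustrative and not part of the guide's exercise, and the negative positions are an extra feature of str_sub() that the text above does not cover.

```r
library(stringr)

# Illustrative input; any character string works here.
title <- "The Merchant of Venice"

# Positions are 1-based and inclusive at both ends.
first_three <- str_sub(title, 1, 3)

# Negative positions count backwards from the end of the string.
last_six <- str_sub(title, -6, -1)

# str_sub() is vectorised: each element is subset independently.
prefixes <- str_sub(c("Hamlet", "Macbeth", "King Lear"), 1, 4)

print(first_three)  # "The"
print(last_six)     # "Venice"
print(prefixes)     # "Haml" "Macb" "King"
```

Because the function is vectorised, the same call works unchanged on a data frame column such as shakespeare$quote.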

    # Setup: this runs each time the exercise is initialized
    library(stringr)
    play = c("Hamlet",
             "Romeo & Juliet",
             "Romeo & Juliet",
             "The Merchant of Venice",
             "King Henry IV",
             "Julius Caesar",
             "Macbeth",
             "King Lear")
    quote = c("What a piece of work is man! how noble in reason! how infinite in faculty! in form and moving how express and admirable! in action how like an angel! in apprehension how like a god! the beauty of the world, the paragon of animals",
              "What's in a name? That which we call a rose by any other name would smell as sweet",
              "Tempt not a desperate man",
              "The devil can cite Scripture for his purpose",
              "A man can die but once",
              "But, for my own part, it was Greek to me",
              "Double, double toil and trouble; Fire burn, and cauldron bubble",
              "Nothing will come of nothing")
    shakespeare = data.frame(play, quote)

    # This exercise uses the shakespeare dataframe.
    # Use str_sub(string, start, end) to subset a string. start and end
    # represent the positions within the string. Use 1 for start and 10 for end.
    # Pass the quote column, not the whole data frame.
    str_sub(shakespeare$quote, 1, 10)

As you can see, str_sub() subsets each string in the quote vector and returns the subsetted result, here the first 10 characters of each quote.

str_subset(string, pattern)

str_subset(string, pattern) will return only the strings containing a matched pattern specified by the user. 

    # Setup: this runs each time the exercise is initialized
    library(stringr)
    play = c("Hamlet",
             "Romeo & Juliet",
             "Romeo & Juliet",
             "The Merchant of Venice",
             "King Henry IV",
             "Julius Caesar",
             "Macbeth",
             "King Lear")
    quote = c("What a piece of work is man! how noble in reason! how infinite in faculty!",
              "What's in a name? That which we call a rose by any other name would smell as sweet",
              "Tempt not a desperate man",
              "The devil can cite Scripture for his purpose",
              "A man can die but once",
              "But, for my own part, it was Greek to me",
              "Double, double toil and trouble; Fire burn, and cauldron bubble",
              "Nothing will come of nothing")
    shakespeare = data.frame(play, quote)

    # This exercise uses the shakespeare dataframe.
    # Use str_subset(string, pattern) to return only the strings that match
    # a pattern. Use "man" on the quote vector in the shakespeare dataframe.
    str_subset(shakespeare$quote, "man")

As you can see, str_subset() returns only the strings that contain a pattern match.
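
One way to understand str_subset() is as a shortcut for logical indexing with str_detect(). The sketch below uses a small, hypothetical three-row data frame (df) rather than the full exercise data, and shows how the same logical vector can also keep whole rows, which str_subset() alone does not do.

```r
library(stringr)

# Hypothetical three-row sample, built only for this illustration.
df <- data.frame(
  play  = c("Romeo & Juliet", "King Henry IV", "King Lear"),
  quote = c("Tempt not a desperate man",
            "A man can die but once",
            "Nothing will come of nothing"),
  stringsAsFactors = FALSE
)

# str_detect() returns one TRUE/FALSE per string.
has_man <- str_detect(df$quote, "man")

# str_subset() keeps only the strings where the pattern is found ...
matched_quotes <- str_subset(df$quote, "man")

# ... which gives the same result as indexing with the logical vector.
same_result <- identical(matched_quotes, df$quote[has_man])

# The logical vector can also subset rows, keeping the play column.
matched_rows <- df[has_man, ]

print(matched_quotes)
print(same_result)
print(matched_rows)
```

Use str_subset() when only the matching strings are needed, and str_detect() when the match should filter other columns as well.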

str_extract(string, pattern)

str_extract(string, pattern) returns the first matched pattern found in a string. Note: to extract all matches within a string, use str_extract_all. 

    # Setup: this runs each time the exercise is initialized
    library(stringr)
    play = c("Hamlet",
             "Romeo & Juliet",
             "Romeo & Juliet",
             "The Merchant of Venice",
             "King Henry IV",
             "Julius Caesar",
             "Macbeth",
             "King Lear")
    quote = c("What a piece of work is man! how noble in reason! how infinite in faculty!",
              "What's in a name? That which we call a rose by any other name would smell as sweet",
              "Tempt not a desperate man",
              "The devil can cite Scripture for his purpose",
              "A man can die but once",
              "But, for my own part, it was Greek to me",
              "Double, double toil and trouble; Fire burn, and cauldron bubble",
              "Nothing will come of nothing")
    shakespeare = data.frame(play, quote)

    # This exercise uses the shakespeare dataframe.
    # Use str_extract(string, pattern) to return the first pattern match in
    # each string. Use "a" on the quote vector in the shakespeare dataframe.
    str_extract(shakespeare$quote, "a")

As you can see, str_extract() returns the first pattern match in each string, and NA for a string with no match. Notice that the match is not necessarily the standalone word 'a': the pattern matches the character 'a' anywhere in a string.
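
The difference between matching the character 'a' and the standalone word 'a', and between str_extract() and str_extract_all(), can be sketched as follows. The sample vector and the word-boundary pattern "\\ba\\b" are illustrative additions, not part of the original exercise.

```r
library(stringr)

# Three quotes from the exercise data, reused as a small sample.
quotes <- c("Tempt not a desperate man",
            "Double, double toil and trouble; Fire burn, and cauldron bubble",
            "Nothing will come of nothing")

# First occurrence of the character "a" anywhere in each string;
# a string without a match gives NA.
first_a <- str_extract(quotes, "a")

# \\b marks a word boundary, so this only matches "a" as a whole word.
word_a <- str_extract(quotes, "\\ba\\b")

# str_extract_all() returns a list with every match for each string.
all_a <- str_extract_all(quotes, "a")
n_matches <- lengths(all_a)

print(first_a)    # "a" "a" NA
print(word_a)     # "a" NA  NA
print(n_matches)  # 3 3 0
```

The second quote contains 'a' only inside other words, so it matches the plain pattern but not the word-boundary version.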

str_match(string, pattern)

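str_match(string, pattern) will return the first matched pattern within a string, together with any capture groups in the pattern.

When applied to a string vector, the matches are returned in a character matrix with one row per string. A string with no match produces a row of NA.

To return all matches in a string instead of the first match only, use str_match_all(), which returns a list of matrices.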

    # Setup: this runs each time the exercise is initialized
    library(stringr)
    play = c("Hamlet",
             "Romeo & Juliet",
             "Romeo & Juliet",
             "The Merchant of Venice",
             "King Henry IV",
             "Julius Caesar",
             "Macbeth",
             "King Lear")
    quote = c("What a piece of work is man! how noble in reason! how infinite in faculty!",
              "What's in a name? That which we call a rose by any other name would smell as sweet",
              "Tempt not a desperate man",
              "The devil can cite Scripture for his purpose",
              "A man can die but once",
              "But, for my own part, it was Greek to me",
              "Double, double toil and trouble; Fire burn, and cauldron bubble",
              "Nothing will come of nothing")
    shakespeare = data.frame(play, quote)

    # This exercise uses the shakespeare dataframe.
    # Use str_match(string, pattern) to return the first pattern match in
    # each string. Use "but" on the quote vector in the shakespeare dataframe.
    str_match(shakespeare$quote, "but")

As you can see, str_match() functions similarly to str_extract(), with the output formatted in a matrix. Please note that the pattern is case-sensitive: 'But' in the 6th quote is not matched, so that row is NA.
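
The matrix output becomes more useful once the pattern contains a capture group. The following sketch uses a hypothetical vector of play titles and the pattern "King (\\w+)", both chosen only for illustration, to show the column layout and the list returned by str_match_all().

```r
library(stringr)

# Illustrative input: two titles match the pattern, one does not.
plays <- c("King Henry IV", "King Lear", "Hamlet")

# Column 1 holds the full match, column 2 the text captured by (\\w+).
m <- str_match(plays, "King (\\w+)")

# str_match_all() returns a list with one matrix per input string.
all_m <- str_match_all(plays, "King (\\w+)")

print(m)
print(is.matrix(m))    # TRUE
print(m[, 2])          # "Henry" "Lear" NA
print(length(all_m))   # 3
```

Use str_extract() when only the matched text is needed, and str_match() when parts of the match should be pulled out separately.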