
Data Cleaning

An introductory guide to data cleaning concepts, tools, and methods.

Common Functions Used in R

R has many packages useful for data cleaning. Most of the packages covered here are part of the Tidyverse, a collection of packages whose functions help with many data cleaning tasks. Some functions have a lower barrier to entry, and some are used much more frequently than others. Below is a list of a few of the most popular data cleaning functions used in R, with a short description of each.

Note: If the Tidyverse is loaded into R all at once with library(tidyverse), all of its core packages will be loaded and you will not have to remember which function belongs to which package. Package names in R are case-sensitive, so the name must be written in lowercase.

Exploratory Functions: 

  • head()
    • Returns the first part of a vector, matrix, table, data frame, or function
  • tail()
    • Returns the last part of a vector, matrix, table, data frame, or function 
  • class()
    • Returns the class of the object passed to the function (for example, "data.frame" or "numeric")
  • dim()
    • Retrieves or sets the dimensions of an object
  • names()
    • Gets or sets the names of an object
  • str()
    • Compactly display the internal structure of any R object. 
  • glimpse()
    • This is like a transposed version of print: columns run down the page, and data runs across. This makes it possible to see every column in a data frame. It's a little like str() applied to a data frame, but it tries to show you as much data as possible.
  • summary() 
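    • A generic function that produces a summary of an object. For a data frame, it reports summary statistics for each column (for example, the minimum, median, mean, and maximum of a numeric column). It can also summarize the results of various model fitting functions.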
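
As a rough illustration, the exploratory functions above could be tried on mtcars, a small example data set that ships with R. This is only a sketch; the variable name df is an arbitrary choice for this example, and glimpse() is called only if the dplyr package is installed.

```r
# mtcars is a built-in example data set (32 cars, 11 measurements each).
df <- mtcars

first_rows <- head(df, 3)   # first 3 rows
last_rows  <- tail(df, 3)   # last 3 rows

class(df)                   # "data.frame"
dim(df)                     # 32 11  (rows, columns)
names(df)                   # column names, starting with "mpg"
str(df)                     # compact structure, one line per column
summary(df$mpg)             # min, quartiles, mean, and max of one column

# glimpse() comes from dplyr, so only call it when dplyr is available.
if (requireNamespace("dplyr", quietly = TRUE)) {
  dplyr::glimpse(df)
}
```

Running these one at a time in the console is usually the quickest way to spot columns with an unexpected type or range before any cleaning starts.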

Functions for Changing the Format of a Data Set: 

  • gather()
    • Takes multiple columns and collapses them into key-value pairs, duplicating all other columns as needed. Use gather() when there are columns that are not variables.
  • spread()
    • Spreads a key-value pair across multiple columns. Use spread() when variables are stored in both rows and columns.
  • separate()
    • Turns a single character column into multiple columns. Use separate() when multiple variables are stored in one column.
  • unite()
    • Pastes together multiple columns into one. Use unite() when a single variable is split across multiple columns.
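
The four reshaping functions above can be sketched on two tiny, made-up tables. The data frames and column names (wide, long, people, and so on) are hypothetical and exist only for this example; the functions themselves come from the tidyr package.

```r
library(tidyr)

# Hypothetical wide table: the year columns hold values, not variables.
wide <- data.frame(
  country = c("A", "B"),
  `2019`  = c(10, 20),
  `2020`  = c(11, 21),
  check.names = FALSE   # keep "2019" and "2020" as column names
)

# gather(): collapse the year columns into key-value pairs.
long <- gather(wide, key = "year", value = "cases", `2019`, `2020`)

# spread(): the reverse step, one column per year again.
wide_again <- spread(long, key = year, value = cases)

# separate(): split one character column into two.
people <- data.frame(name = c("Ada Lovelace", "Alan Turing"))
split_names <- separate(people, name, into = c("first", "last"), sep = " ")

# unite(): paste the two columns back into one.
joined <- unite(split_names, "name", first, last, sep = " ")
```

In current tidyr releases, gather() and spread() still work but are marked as superseded by pivot_longer() and pivot_wider().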

Functions for Coercing Data Types: 

Note: The prefix "is." can be used in place of "as." (for example, is.numeric() instead of as.numeric()) to check whether the object passed to the function is of that specific data type.

***There are other options for coercing data types. See the full documentation for more information. 
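
As a minimal sketch of the "as." and "is." prefixes, the example below uses a few base R coercion functions chosen purely for illustration; the vector names x and y are arbitrary.

```r
x <- c("1", "2", "3")        # numbers stored as text

is.numeric(x)                # FALSE: x is a character vector
is.character(x)              # TRUE

y <- as.numeric(x)           # coerce text to numbers
is.numeric(y)                # TRUE

as.character(42)             # "42"
as.logical(c("TRUE", "no"))  # TRUE NA: values R cannot interpret become NA
```

Values that cannot be converted come back as NA, so it is worth counting missing values before and after a coercion step.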

Functions for Coercing Date-Time data: 

Note: The letters in ymd() and hms() name the order in which the parts of the value are parsed (y for year, m for month, d for day; h for hours, m for minutes, s for seconds). The date letters may appear in other orders, so mdy() and dmy() are also available, and a letter can be left out in some cases, as in ym() and my(). Likewise, hm() and ms() are shortened forms of hms(). Only the combinations provided by the lubridate package are available, and the string passed as an argument must follow the order named by the function.

  • ymd()
    • Transforms dates stored in character and numeric vectors to date objects
  • hms()
    • Transforms times stored in character and numeric vectors to period objects
  • ymd_hms()
    • Combines ymd() and hms(): transforms date-times stored in character and numeric vectors to date-time objects
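
A short sketch of how the function name has to match the layout of the input, assuming the lubridate package is installed. The input strings and variable names are made up for this example.

```r
library(lubridate)

d1 <- ymd("2021-03-15")    # year, month, day
d2 <- mdy("03/15/2021")    # same date written month first
d3 <- dmy("15-03-2021")    # same date written day first

t1 <- hms("14:30:05")      # a period of 14 hours, 30 minutes, 5 seconds
t2 <- hm("14:30")          # hours and minutes only

dt <- ymd_hms("2021-03-15 14:30:05")  # date-time, UTC unless tz is given
```

All three date calls return the same Date value; only the expected input order differs.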

Functions for Working with Strings: 

  • str_trim()
    • Deletes leading and trailing white space from a string.
  • str_pad()
    • Pads a string to a given width with a specified character.
  • str_detect()
    • Detect the presence or absence of a pattern in a string. 
  • str_replace()
    • Replaces the first match of a pattern in a string with a replacement string. Use str_replace_all() to replace every match.
  • toupper()
    • Converts all characters in a string to capital letters. 
  • tolower()
    • Converts all characters in a string to lowercase letters.
  • n_gram_merge()
    • Takes a character vector, makes edits, and merges values that are approximately equivalent yet not identical. This function comes from the refinr package, which is not part of the Tidyverse and must be loaded separately.
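
The string functions above might be used as in the sketch below. The example strings and variable names are made up; str_trim(), str_pad(), str_detect(), and str_replace() come from the stringr package, toupper() and tolower() are base R, and n_gram_merge() is called only if the refinr package is installed.

```r
library(stringr)

trimmed  <- str_trim("  Florida Tech  ")                       # "Florida Tech"
padded   <- str_pad("42", width = 5, side = "left", pad = "0") # "00042"
found    <- str_detect(c("apple", "banana"), "an")             # FALSE TRUE
first    <- str_replace("2021/03/15", "/", "-")                # "2021-03/15"
every    <- str_replace_all("2021/03/15", "/", "-")            # "2021-03-15"
upper    <- toupper("melbourne")                               # "MELBOURNE"
lower    <- tolower("MELBOURNE")                               # "melbourne"

# n_gram_merge() lives in refinr; the result depends on its default settings,
# so no expected output is shown here.
if (requireNamespace("refinr", quietly = TRUE)) {
  merged <- refinr::n_gram_merge(c("Acme Pizza, Inc.", "ACME PIZA COMPANY"))
}
```

The str_replace() line shows why the first-match behavior matters: only the first slash is changed, while str_replace_all() changes both.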