Skip to Main Content
It looks like you're using Internet Explorer 11 or older. This website works best with modern browsers such as the latest versions of Chrome, Firefox, Safari, and Edge. If you continue with this browser, you may see unexpected results.
Florida Tech Evans Library Logo

Data Cleaning

An introductory guide to data cleaning concepts, tools, and methods.

Basic Concepts of Data Cleaning

In 2014, Hadley Wickham, a software developer at RStudio published the paper, "Tidy Data" in the Journal of Statistical Software, which outlines the basic elements of clean data. Wickham is also the creator of the Tidyverse package for R; a widely-used package containing many functions that make it much easier explore and clean data. You can view the "Tidy Data" article here

Below is a summary of the concepts outlined in the article: 

3 Principles of Tidy Data: 

  • Columns represent different variables
  • Rows represent individual observations
  • Each observational unit forms a table

Common Elements of Messy Data: 

  • Column headers store values not variable names
  • Multiple variables are stored in a single column
  • Variables are stored in both rows and columns
  • A single observational unit is stored in multiple tables
  • Multiple types of observational unites are stored in the same table

Column Headers Store Values Not Variable Names

The structure of data may need to be adjusted to prepare data for analysis. Improper formatting can make data more difficult to interpret and process.Proper formatting will make it easier to process data and spot inaccuracies that may need to be corrected. 

Here is an example of a data set that has been formatted in two different ways. This data comes from the bmi.csv training dataset.

The first example shows the body mass index values per country per year with the individual years occupying separate columns. This format may obscure clarity, especially when interpreting values for years further down the list. 

 

Here is an example of an alternative format for the same data set, with the years gathered together in a single column.

 

No data has been lost in this example. Instead, the years are repeated for each country, resulting in a data set that is much longer than it is wide. It is certainly easier to connect each BMI Value with the corresponding year and country. Some may argue that this obscures clarity, but some data formats are better for analysis while others are better for reporting. 

This example focuses on a a specific formatting issue in which values are stored as variables. Other formatting issues may exist in a data set. For examples of other possible formatting issues, see the "Elements of Messy Data" section above. 

Multiple Variables Stored in a Single Column

Sometimes multiple values can be stored in a single variable. This is common with date-time data, as can be seen in the example below: 

 

While keeping the YYYY-MM format may be ok in some situations, data analysts may be able to draw more useful insights from the data if the year and month are separated into their own individual columns. In the example below, the start_date column has been separated into a year column and a month column. Both Python and R have functions which users to separate multiple values stored in a single column into multiple columns.  

Merging Similar Values

Repeated data values often suffer from human entry error when added to a spreadsheet or ledger. It is common for humans to enter similar, but different values representing the same object. These errors might be introduced by simple mistakes such as typos, or due to a lack of quality control over data entry, such as when multiple individuals are entering data or there are no well-defined rules for data entry.  Such differences can make it difficult for programming languages and data analysis applications to perform useful computations on data. Below are examples scenarios where similar but different data are entered into a data set. 

Below is an excerpt from the Menu.csv training dataset. Here you can see similar, but different values for the "event" column. The following values essentially represent the same information, and could be merged to enable more effective analysis of the data. These values include:

  • Lunch
  • [Lunch]
  • Luncheon

And: 

  • Dinner
  • Semi-Annual Dinner
  • Supper
  • Souper

                                                                                          

‚Äč

Here it might be best to simplify all of the values to "Lunch" and "Dinner". There are various methods for merging similar values in the R and Python programming languages, as well as in the OpenRefine software application. For specific examples, see the "Cleaning with..." tabs of this guide.