It is commonly said that data scientists and other researchers who work with data spend most of their time preparing it for analysis. The figures vary from source to source, but the most frequently cited claim is that researchers spend about 80% of their time getting data ready so that it can be analyzed productively and accurately.
So much effort goes into preprocessing because unclean data can distort the results of an analysis and the conclusions drawn from it. Even results produced by expert analysts can be useless if the data was not formatted and checked for errors before the analysis began.
This guide introduces the concept of clean data, along with the tools, learning resources, and methods commonly used to clean data.