Python is a powerful open-source programming language often used by data scientists. Python has several packages geared toward cleaning data. These packages include:
R is another powerful open-source language, developed specifically to handle data and perform statistical operations. The Tidyverse library developed for R is among the most robust data cleaning tools available. You can learn about the Tidyverse library, as well as several of its key functions, from Joey Stanley's detailed An Introduction to Tidyverse.
SQL is the most widely adopted database querying language. SQL allows users to explore data quickly and effectively, with a much lower barrier to entry than comes with programming languages. However, SQL is best used to explore and make sense of data as they are, rather than to manipulate data and correct inaccuracies.
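To illustrate the kind of exploration SQL is suited to, here is a minimal sketch that runs a few queries through Python's built-in sqlite3 module. The `patrons` table, its columns, its rows, and the 0 to 120 age bounds are all hypothetical, invented only for this example. The queries report problems in the data without changing anything.

```python
import sqlite3

# Hypothetical in-memory table; the table name, columns, and rows are
# invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patrons (id INTEGER, state TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO patrons VALUES (?, ?, ?)",
    [
        (1, "FL", 34),
        (2, "fl", 29),
        (3, "Florida", None),
        (4, "FL", 41),
        (5, "GA", 230),
    ],
)

# Count each distinct spelling to reveal inconsistent category labels.
state_counts = conn.execute(
    "SELECT state, COUNT(*) FROM patrons GROUP BY state ORDER BY state"
).fetchall()

# Count missing values in a column.
missing_ages = conn.execute(
    "SELECT COUNT(*) FROM patrons WHERE age IS NULL"
).fetchone()[0]

# Flag values outside a plausible range (the 0-120 bounds are an assumption).
suspect_rows = conn.execute(
    "SELECT id, age FROM patrons WHERE age NOT BETWEEN 0 AND 120"
).fetchall()

print(state_counts)
print(missing_ages)
print(suspect_rows)
conn.close()
```

In this made-up data, the grouped count surfaces three different spellings of the same state, the second query finds one missing age, and the third flags one implausible age. Deciding how to correct those records is typically left to a tool better suited to changing data.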
Microsoft Excel has tools for quickly filtering and cleaning data. Most of these tools can be found in the "Data" tab. Excel is well suited to low-level cleaning tasks like adding and removing filters, as well as calculating summary statistics. Users should be aware, however, that a filtered Excel spreadsheet may not import into a programming language as it appears on screen. Another drawback is that Excel is best suited to cleaning small data sets: large sets of hundreds of thousands or millions of records will load and process much more slowly in Excel than if the same tasks are performed directly in a programming language.
Microsoft Power BI (Business Intelligence) is an analytics and visualization tool available to Florida Tech affiliates. Similar to Tableau, Power BI provides a point-and-click graphical user interface to help expedite cleaning and visualization tasks.
OpenRefine is an open-source software application that runs in a web browser and allows users to perform complex data cleaning operations through a point-and-click interface. However, OpenRefine is best suited to cleaning smaller data sets and will often encounter issues when handling large amounts of data.