Data Cleaning

An introductory guide to data cleaning concepts, tools, and methods.

Common Functions Used in Python

Pandas Functions: 

  • pd.read_csv()
    • Read a comma-separated value file (.csv) into Python as a DataFrame. 
  • pd.melt()
    • Unpivot a DataFrame from wide to long format, so that values spread across several columns are gathered into rows under a single column. 
  • pd.pivot_table()
    • Create a spreadsheet-style pivot table as a DataFrame. The levels in the pivot table are stored in MultiIndex objects (hierarchical indexes) on the index and columns of the resulting DataFrame.
  • pd.concat()
    • Concatenate pandas objects along a particular axis. 
  • pd.merge()
    • Merge DataFrame objects with a database-style join on columns or indexes.
  • pd.notnull()
    • Detect non-missing values in a Pandas object. Returns True where a value is present and False where it is missing. 
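
The functions above can be sketched together on a small example. This is only an illustrative sketch: the CSV text, the column names (name, math, art, major), and the variable names are all made up for the example, and the CSV is read from an in-memory string rather than a real file.

```python
import io

import pandas as pd

# Hypothetical CSV content, read from an in-memory buffer instead of a file on disk.
csv_text = "name,math,art\nAna,90,85\nBen,70,\n"
scores = pd.read_csv(io.StringIO(csv_text))

# pd.melt(): unpivot the subject columns into (subject, score) rows.
long_scores = pd.melt(scores, id_vars="name", var_name="subject", value_name="score")

# pd.notnull(): True where a value is present, False where it is missing.
has_score = pd.notnull(long_scores["score"])

# pd.pivot_table(): reshape the long data back to one column per subject.
wide_scores = pd.pivot_table(long_scores, index="name", columns="subject", values="score")

# pd.concat(): stack two frames with the same columns along the row axis.
extra = pd.DataFrame({"name": ["Cy"], "math": [60.0], "art": [75.0]})
all_scores = pd.concat([scores, extra], ignore_index=True)

# pd.merge(): database-style left join on the shared "name" column.
majors = pd.DataFrame({"name": ["Ana", "Cy"], "major": ["Physics", "Biology"]})
merged = pd.merge(all_scores, majors, on="name", how="left")
```

With a real file, pd.read_csv() would take the file path instead of the in-memory buffer. The left join keeps every row of all_scores and leaves major missing for names that do not appear in majors.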

Regex Functions: 

  • re.compile()
    • Compile a regular expression pattern into a reusable pattern object. 
  • re.findall()
    • Return all non-overlapping matches of a pattern in a string, as a list of strings.
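
As a minimal sketch of the two regex functions above, the following pulls dollar amounts out of a sentence. The sample text and the pattern are invented for illustration and would need adjusting for real data.

```python
import re

# Hypothetical messy text containing dollar amounts.
text = "Paid $12.50 on Monday and $7 on Tuesday."

# re.compile(): build a reusable pattern object for dollar amounts
# (a dollar sign, digits, and an optional decimal part).
amount_pattern = re.compile(r"\$\d+(?:\.\d+)?")

# re.findall(): collect every non-overlapping match as a list of strings.
amounts = re.findall(amount_pattern, text)
print(amounts)
```

Compiling the pattern first is mainly useful when the same pattern is applied to many strings, for example to every value in a column.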

Commonly used Python Methods: 

  • .head()
    • Return the first n rows in an object. n defaults to 5. 
  • .tail()
    • Return the last n rows in an object. As with .head(), n defaults to 5. 
  • .info()
    • Print a summary of a data frame, including the index and column data types, non-null counts, and memory usage. 
  • .value_counts()
    • Return an object containing counts of unique values for chosen data. 
  • .describe()
    • Return summary statistics (such as count, mean, and quartiles) for the chosen data.
  • .split()
    • Split each string in the chosen values around a delimiter or pattern. On a Pandas Series, this is called through the .str accessor, as .str.split(). 
  • .astype()
    • Coerce a Pandas object to a specific data type. 
  • .apply()
    • Apply a function to each row or column in a data frame. 
  • .replace()
    • Replace the values passed to the to_replace argument with the specified values. 
  • .drop_duplicates()
    • Return a data frame with duplicate rows removed, optionally considering only specified columns when identifying duplicates. 
  • .fillna()
    • Fill in NA / NaN values with a specified value or fill method.
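
The methods above might be combined in a short cleaning pass like the one below. It is a sketch under assumptions, not a recommended recipe: the data frame, its column names, and the choice to fill a missing age with "0" are all hypothetical and chosen only to keep the example small.

```python
import pandas as pd

# Hypothetical messy data: a duplicate row, a missing age, and numbers stored as text.
df = pd.DataFrame({
    "name": ["Ana Diaz", "Ben Lee", "Ben Lee", "Cy Wu"],
    "age": ["34", "29", "29", None],
    "city": ["NYC", "Miami", "Miami", "NYC"],
})

first_rows = df.head(2)                  # first 2 rows (the default is 5)
last_rows = df.tail(1)                   # last row only
df.info()                                # prints dtypes, non-null counts, memory usage
city_counts = df["city"].value_counts()  # how often each city appears

# Drop the repeated "Ben Lee" row and renumber the index.
clean = df.drop_duplicates().reset_index(drop=True)

# Fill the missing age with a placeholder (an arbitrary choice), then convert text to int.
clean["age"] = clean["age"].fillna("0").astype(int)

# Replace an abbreviation with its full form.
clean["city"] = clean["city"].replace("NYC", "New York")

# .split() through the .str accessor, keeping the first piece of each name.
clean["first_name"] = clean["name"].str.split(" ").str[0]

# .apply() on a single column calls the function on each value;
# on a whole data frame it works on each row or column instead.
clean["name_length"] = clean["name"].apply(len)

age_summary = clean["age"].describe()    # count, mean, std, min, quartiles, max
```

Note that a placeholder such as 0 affects the statistics reported by .describe(), so in practice the fill value should be chosen with the later analysis in mind.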