Skip to Main Content
Florida Tech Evans Library Logo

Data Cleaning

An introductory guide to data cleaning concepts, tools, and methods.

Common Functions Used in Python

Pandas Functions: 

  • pd.read_csv()
    • Read a comma-separated value file (.csv) into Python as a DataFrame. 
  • pd.melt()
    • Spread a column so that values stored in a single column can be made into columns as well. 
  • pd.pivot_table()
    • Create a spreadsheet-style pivot table as a DataFrame. The levels in the pivot table will be stored in MultiIndex objects (hierarchical indexes) on the index and columns of the result DataFrame
  • pd.concat()
    • Concatenate pandas objects along a particular axis. 
  • pd.merge()
    • Merge DataFrame objects by performing a column-column join similar to database-style join commands.
  • pd.notnull()
    • Check a Pandas object for missing values. 

Regex Functions: 

  • re.compile()
    • Compile a regular expression pattern into a Python object. 
  • re.findall()
    • Return all non-overlapping matches of a pattern in a string, as a list of strings.

Commonly used Python Methods: 

  • .head()
    • Return the first rows in an object. The defaults to 5. 
  • .tail()
    • Return the last n rows in an object. As with .head(), the defaults to 5. 
  • .info()
    • Return information about a data frame, including the index and column data types, non-null values, and memory usage. 
  • .value_counts()
    • Return an object containing counts of unique values for chosen data. 
  • .describe()
    • Provides summary statistical information about chosen data.
  • .split()
    • Split each string in the chosen values based on a pattern. 
  • .astype()
    • Coerce a Pandas object to a specific data type. 
  • .apply()
    • Apply a function to each row or column in a data frame. 
  • .replace()
    • Replace values passed to to_replace argument with specified values. 
  • .drop_duplicates()
    • Return a data frame where duplicate rows have been removed from specified columns. 
  • .fillna()
    • Fill in NA / NaN values using a specified method.