Research Guides: Data Cleaning: Cleaning with Python

Common Functions used in Python

Pandas Functions:

pd.read_csv()
Read a comma-separated value file (.csv) into Python as a DataFrame.
pd.melt()
Unpivot a DataFrame from wide to long format: the chosen columns are collapsed into two columns, one holding the former column names and one holding their values.
pd.pivot_table()
Create a spreadsheet-style pivot table as a DataFrame. The levels in the pivot table are stored in MultiIndex objects (hierarchical indexes) on the index and columns of the resulting DataFrame.
pd.concat()
Concatenate pandas objects along a particular axis.
pd.DataFrame.merge()
Merge DataFrame objects with a database-style join on columns or indexes.
pd.notnull()
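Check a Pandas object for non-missing values: the result is True where a value is present and False where it is missing (pd.isnull() is the inverse).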
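The Pandas functions above can be sketched together on a small example. This is only an illustration, not a recommended workflow: the exam data, the column names, and the variable names (wide, tidy, years, and so on) are all made up for the example, and the CSV is read from an in-memory string rather than a real file.

```python
import io
import pandas as pd

# Hypothetical wide-format data: one row per student, one column per exam.
csv_text = "name,math,art\nAna,90,85\nBen,70,\n"
wide = pd.read_csv(io.StringIO(csv_text))  # Ben's art score is missing (NaN)

# melt: wide -> long, giving one row per (name, subject) pair.
tidy = pd.melt(wide, id_vars="name", var_name="subject", value_name="score")

# notnull: True where a value is present, False where it is missing.
present = pd.notnull(tidy["score"])

# pivot_table: long -> wide again, averaging scores per name and subject.
back = pd.pivot_table(tidy, index="name", columns="subject",
                      values="score", aggfunc="mean")

# concat: stack two frames with the same columns on top of each other.
extra = pd.DataFrame({"name": ["Cy"], "math": [60.0], "art": [75.0]})
stacked = pd.concat([wide, extra], ignore_index=True)

# merge: database-style left join on the shared "name" column.
years = pd.DataFrame({"name": ["Ana", "Ben"], "year": [1, 2]})
merged = wide.merge(years, on="name", how="left")
```

Note that pd.melt() and pd.pivot_table() work in opposite directions here: the first lengthens the table, the second widens it again.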

Regex Functions:

re.compile()
Compile a regular expression pattern into a Python object.
re.findall()
Return all non-overlapping matches of a pattern in a string, as a list of strings.
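As a minimal sketch of the two regex functions, the example below pulls phone-like numbers out of a string. The sample text and the pattern (three digits, a hyphen, four digits) are assumptions chosen for illustration only.

```python
import re

# Hypothetical free-text field containing phone-like numbers.
text = "Call 555-1234 or 555-9876, not 55-12."

# compile: build a reusable pattern object (three digits, hyphen, four digits).
phone = re.compile(r"\b\d{3}-\d{4}\b")

# findall: every non-overlapping match, left to right, as a list of strings.
matches = phone.findall(text)

# The module-level function takes the pattern string directly.
same = re.findall(r"\b\d{3}-\d{4}\b", text)
```

Compiling first is mostly useful when the same pattern is applied many times, for example once per row of a column.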

Commonly used Pandas Methods:

.head()
Return the first n rows of an object; n defaults to 5.
.tail()
Return the last n rows of an object. As with .head(), n defaults to 5.
.info()
Print a concise summary of a data frame, including the index and column data types, non-null counts, and memory usage.
.value_counts()
Return a Series containing the count of each unique value in the chosen data, sorted from most to least frequent.
.describe()
Return summary statistics (such as count, mean, standard deviation, minimum, quartiles, and maximum) for the chosen data.
.split()
Split each string in the chosen values on a separator or pattern. On a Pandas Series this is called through the .str accessor, as .str.split().
.astype()
Coerce a Pandas object to a specific data type.
.apply()
Apply a function to each row or column in a data frame.
.replace()
Replace the values given in the to_replace argument with the specified replacement values.
.drop_duplicates()
Return a data frame with duplicate rows removed. The subset argument limits which columns are compared when identifying duplicates.
.fillna()
Fill in NA / NaN values with a specified value. Forward and backward filling are available through .ffill() and .bfill().
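The methods above might be combined into a small cleaning pass along the following lines. This is a sketch under assumptions: the city and temperature data, the column names, and the choice to fill missing temperatures with "0" are invented for illustration and are not a recommendation for real data.

```python
import pandas as pd

# Hypothetical messy data: a repeated row, text-typed numbers, a missing value.
df = pd.DataFrame({
    "city": ["Oslo, NO", "Lima, PE", "Oslo, NO", "Lima, PE"],
    "temp": ["4", "19", "4", None],
})

first_two = df.head(2)   # first 2 rows
last_one = df.tail(1)    # last row
df.info()                # prints a summary; the return value is None

# drop_duplicates: rows 0 and 2 are identical, so one of them is removed.
deduped = df.drop_duplicates()

clean = deduped.copy()

# fillna + astype: fill the missing temperature, then convert text to int.
clean["temp"] = clean["temp"].fillna("0").astype(int)

# .str.split: separate "city, country" into two columns.
parts = clean["city"].str.split(", ", expand=True)
clean["city"] = parts[0]
clean["country"] = parts[1]

# replace: map country codes to full names.
clean["country"] = clean["country"].replace({"NO": "Norway", "PE": "Peru"})

counts = clean["city"].value_counts()   # rows per city
stats = clean["temp"].describe()        # count, mean, std, min, quartiles, max

# apply: run a function on each column of a data frame (here, max minus min).
spread = clean[["temp"]].apply(lambda col: col.max() - col.min())
```

Working on a copy (clean) keeps the original df unchanged, which makes it easier to compare the data before and after cleaning.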