Functions
Text
CLEAN
How to Use Excel's CLEAN Function in Pandas
Pandas' data cleaning capabilities are vital for preparing datasets for analysis. By cleaning data, analysts ensure that their datasets are accurate and that their analyses are reliable.
A helpful, but sometimes forgotten Excel function is CLEAN. This function removes all non-printable characters from text, including line breaks, tabs, and other non-printable characters.
This page guides you through the process repllicating Excel's CLEAN function in pandas.
Use Mito's CLEAN function
Mito is an open source library that lets you write Excel formulas in Python. Either write the formula directly in Python or use the CLEAN formula in the Mito Spreadsheet and generate the equivalent Python code automatically.
Mito's CLEAN function works exactly like it does in Excel. That means you don't need worry about managing data types, handling errors, or the edge case differences between Excel and Python formulas.
Install Mito to start using Excel formulas in Python.
# Import the mitosheet Excel functions
from mitosheet.public.v3 import *;
# Use Mito's CLEAN function
df['A'] = CLEAN(df['A'])
Implementing the Data Cleaning function in Pandas#
To remove non-printable characters in pandas, you often need to employ regular expressions alongside pandas' string methods. This ensures that your DataFrame only contains printable and useful data.
Remove Non-Printable Characters#
The Excel CLEAN function removes removes the first 32 non-printable characters from the provided text. To implement the same functionality in pandas, you need to use a regular expression to match non-printable characters and replace them with an empty string.
# Remove non-printable characters
df['A'] = df['A'].str.replace(r'[\x00-\x1F]', '', regex=True)
Common mistakes when using CLEAN in Python#
While cleaning data with pandas is a powerful process, there are pitfalls that can lead to incorrect data manipulation or loss. Here's how to avoid some of the most common mistakes.
Data Loss During Cleaning#
It's possible to accidentally remove valuable data when cleaning. For instance, using a broad regex pattern might catch and remove characters that are actually part of the data you need.
Ensure your regex patterns in the str.replace method are specific and test them on a subset of data to prevent unintended data loss
For example, the following code removes all non-printable characters and all extra white spaces from the column
# Test the pattern on a small subset first
df['test_column'] = df['Column'].str.replace('[\x00-\x1F\x7F]', '', regex=True)
Understanding the Data Cleaning Formula in Excel#
The CLEAN function in Excel removes all non-printable characters from text. That is, it removes the first 32 non-printable characters in the 7-bit ASCII code (values 0 through 31), if any are found, and returns the result. Text without these characters is returned unchanged.
=CLEAN(text)
CLEAN Excel Syntax
Parameter | Description | Data Type |
---|---|---|
text | The text from which you want to remove non-printable characters. | string |
Examples
Formula | Description | Result |
---|---|---|
=CLEAN("This is a test\n") | Removes the line break from the text. | "This is a test" |
=CLEAN(A1) | Removes non-printable characters from the text in cell A1. | Cleaned text |
Don't re-invent the wheel. Use Excel formulas in Python.
Install MitoDon't want to re-implement Excel's functionality in Python?
Edit a spreadsheet.
Generate Python.
Mito is the easiest way to write Excel formulas in Python. Every edit you make in the Mito spreadsheet is automatically converted to Python code.
View all 100+ transformations →