Text Cleaning in Practice: Real-World Examples and Best Practices
Why did the data scientist’s text data always smell so fresh?
Because he used a powerful text cleaning tool that left it feeling brand new!
In all seriousness, text cleaning is an essential step in the natural language processing (NLP) pipeline: it removes noise and ensures the data is ready for analysis or modeling. By taking the time to properly clean your text data, you can improve the accuracy and reliability of your results (and make your data smell fresh and clean!). In this article, we will explore some common text cleaning techniques and how to apply them to real-world data.
Aspects of text cleaning
1. One of the most basic text cleaning techniques is removing punctuation and non-alphabetic characters. This can be done using regular expressions or built-in string methods such as translate. It is also common to convert all text to lowercase, as this can help to reduce the number of unique words in the dataset and make it easier to analyze.
Text Cleaning using Python's built-in translate method.
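The original snippet is not reproduced here, so below is a minimal sketch matching the description that follows. The exact slice of string.printable is an assumption: it picks up the whitespace characters other than the plain space (tab, newline, carriage return, vertical tab, form feed) as extra characters to strip.

import string

def clean_text(text):
    # Characters to remove: all punctuation, plus the whitespace characters
    # at the tail of string.printable other than the plain space.
    # The exact slice is an assumption; adjust it to your needs.
    chars_to_remove = string.punctuation + string.printable[-5:]
    # Build a translation table that maps each of those characters to None
    table = str.maketrans('', '', chars_to_remove)
    # Apply the table, then strip leading and trailing whitespace
    return text.translate(table).strip()

print(clean_text('  Hello, world!\n'))  # -> 'Hello world'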
Code Description:
This code defines a clean_text function that takes a string as input and returns the cleaned version of the string. It builds a translation table with the str.maketrans function from the string.punctuation constant plus a slice of the string.printable constant (in the sketch above, the whitespace characters other than the plain space). It then uses the translate method to remove the specified characters and removes leading and trailing whitespace with the strip method.
You can customize this function to suit your specific needs. For example, you can modify the list of punctuation characters to include or exclude specific characters, or you can add additional cleaning steps such as converting the text to lowercase or removing digits.
Text Cleaning using the re module
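Again, the snippet itself is missing, so here is a minimal sketch matching the description that follows.

import re

def clean_text(text):
    # Remove any character that is not alphanumeric or whitespace
    # (note that \w also keeps underscores)
    text = re.sub(r'[^\w\s]', '', text)
    # Strip leading and trailing whitespace
    return text.strip()

print(clean_text('  Hello, world!  '))  # -> 'Hello world'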
Code Description
This code defines a clean_text function that takes a string as input and returns the cleaned version of the string. It uses the sub function from the re module to remove any characters that are not alphanumeric or whitespace, then removes leading and trailing whitespace using the strip method.
2. Another important aspect of text cleaning is removing stop words. These are words that are commonly used in language but carry little meaning on their own, such as “a,” “the,” and “but.” Removing stop words can help to reduce the size of the dataset and improve the performance of machine learning models.
Code snippet to remove stop words using the Python standard library.
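The snippet is not reproduced here; a minimal sketch matching the description might look like this. The STOP_WORDS set is an illustrative assumption, since the standard library ships no stop-word list.

# The standard library has no stop-word list, so this small set
# is an illustrative assumption.
STOP_WORDS = {'a', 'an', 'the', 'and', 'or', 'but'}

def remove_stop_words(text):
    # Tokenize on whitespace
    tokens = text.split()
    # Keep only tokens that are not stop words (case-insensitive)
    clean_tokens = [token for token in tokens if token.lower() not in STOP_WORDS]
    # Join the clean tokens back into a single string
    return ' '.join(clean_tokens)

print(remove_stop_words('This is a test of the cleaner'))  # -> 'This is test of cleaner'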
Code Description
This code defines a remove_stop_words function that takes a string as input and returns the version of the string with stop words removed. It tokenizes the text using the split method, removes stop words with a list comprehension, and then joins the clean tokens into a single string.
Code Snippet to remove stop words using the NLTK library
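A minimal sketch using NLTK's word_tokenize function and its built-in English stop-word list; the one-time nltk.download calls are needed on a fresh install.

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# One-time downloads on a fresh NLTK install
nltk.download('punkt')
nltk.download('stopwords')

STOP_WORDS = set(stopwords.words('english'))

def remove_stop_words(text):
    # Tokenize with NLTK's word tokenizer
    tokens = word_tokenize(text)
    # Drop stop words, comparing case-insensitively
    clean_tokens = [token for token in tokens if token.lower() not in STOP_WORDS]
    # Join the clean tokens back into a single string
    return ' '.join(clean_tokens)

print(remove_stop_words('This is a sample sentence with stop words.'))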
This code defines a remove_stop_words function that takes a string as input and returns the version of the string with stop words removed. It tokenizes the text using the word_tokenize function from nltk, removes stop words with a list comprehension, and then joins the clean tokens into a single string.
3. In addition to removing stop words, it is often useful to lemmatize or stem words. Lemmatization reduces words to their dictionary base form (the lemma), while stemming heuristically strips suffixes from words. Both techniques can help to improve the accuracy of NLP models by reducing the number of unique words in the dataset.
Code Snippet using NLTK
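A minimal sketch contrasting the two techniques with NLTK's PorterStemmer and WordNetLemmatizer; the word list and the pos='v' choice are illustrative.

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

# One-time download of WordNet data for the lemmatizer
nltk.download('wordnet')

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

words = ['running', 'buys', 'studies']
# Stemming chops suffixes heuristically
print([stemmer.stem(w) for w in words])                  # ['run', 'buy', 'studi']
# Lemmatization maps to dictionary base forms (treating the words as verbs here)
print([lemmatizer.lemmatize(w, pos='v') for w in words]) # ['run', 'buy', 'study']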
Here is a sentence with words and their lemmas:
“I am running to the store to buy milk. (I — I, am — be, running — run, to — to, the — the, store — store, to — to, buy — buy, milk — milk)”
In this sentence, the lemma of each word is shown in parentheses. The lemma of a word is its base form: for example, the lemma of “running” is “run,” and the lemma of “buys” is “buy.” By lemmatizing the text, you reduce the number of unique words in the dataset, which makes the data easier to analyze or model and can improve the performance of machine learning models.
Code Snippet to lemmatize text using spaCy.
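A minimal sketch using spaCy; it assumes the small English model en_core_web_sm has been installed with python -m spacy download en_core_web_sm.

import spacy

# Assumes the small English model is installed:
#   python -m spacy download en_core_web_sm
nlp = spacy.load('en_core_web_sm')

def lemmatize_text(text):
    # Run the pipeline and replace each token with its lemma
    doc = nlp(text)
    return ' '.join(token.lemma_ for token in doc)

print(lemmatize_text('I am running to the store to buy milk.'))
# e.g. 'I be run to the store to buy milk .'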
4. When cleaning text data, it is important to be mindful of the specific characteristics of the data. For example, if you are working with social media data, you may need to account for hashtags, emojis, and other non-standard characters. You may also need to consider how to handle different languages or dialects.
Code Example using the re module
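A minimal sketch using re; the clean_social_text name and the specific patterns are illustrative assumptions, not a canonical recipe.

import re

def clean_social_text(text):
    # The patterns below are illustrative assumptions; tune them to your data.
    text = re.sub(r'https?://\S+', '', text)   # remove URLs
    text = re.sub(r'[@#]\w+', '', text)        # remove @mentions and #hashtags
    text = re.sub(r'[^\x00-\x7F]+', '', text)  # drop non-ASCII characters (emojis, etc.)
    # Collapse leftover runs of whitespace
    return ' '.join(text.split())

print(clean_social_text('Loving #NLP @friend check https://example.com'))
# -> 'Loving check'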
Code Snippet to clean social media data using spaCy
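A minimal sketch using spaCy's token attributes like_url, is_punct, and is_space; the overall recipe is an assumption.

import spacy

nlp = spacy.load('en_core_web_sm')  # assumes the model is installed

def clean_social_text(text):
    # A sketch of one possible recipe: keep tokens that are not URLs,
    # punctuation, or whitespace, lowercased
    doc = nlp(text)
    tokens = [token.lower_ for token in doc
              if not (token.like_url or token.is_punct or token.is_space)]
    return ' '.join(tokens)

print(clean_social_text('Check this out: https://example.com !!'))
# -> 'check this out'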
Pro tips on text cleaning.
- Define your cleaning goals: Before you start cleaning your text, it is important to define your goals for the cleaning process. This will help you identify which steps are most important and prioritize your efforts.
- Use regular expressions: Regular expressions are a powerful tool for cleaning text data. They allow you to specify patterns of characters to search for and replace, which can save you a lot of time and effort compared to manually cleaning the data.
- Test your cleaning steps: As you develop your cleaning pipeline, it is important to test your cleaning steps to make sure they are working as intended. This will help you identify any issues or errors in your code and allow you to make adjustments as needed.
- Use existing libraries: There are many libraries and tools available that can help you clean your text data. These libraries often include a wide range of functions and features that can save you time and effort, and they are usually well-documented and easy to use.
- Consider the context of your data: The context of your data can play a big role in how you approach text cleaning. For example, social media data may require different cleaning steps than academic papers or legal documents. Understanding the context of your data can help you tailor your cleaning efforts to the specific needs of your project.
In summary, text cleaning is an essential step in the NLP process that helps to remove noise and prepare the data for analysis or modeling. By understanding common text cleaning techniques and how to apply them to real-world data, you can improve the accuracy and reliability of your results.
Test your knowledge with the following questions.
1. Given the following text data:
"This is some sample text with extra whitespace and
weird formatting."
Write a Python function that removes the extra whitespace and weird formatting from the text data. The function should take a string as input and return the cleaned version of the string.
2. Given the following text data:
"This is some sample text with special characters and digits: #$%^&*123"
Write a Python function that removes the special characters and digits from the text data. The function should take a string as input and return the cleaned version of the string.
3. Given the following text data:
"This is some sample text with stop words: a, an, the, and, or, but"
Write a Python function that removes the stop words from the text data. The function should take a string and a list of stop words as inputs and return the version of the string with the stop words removed.
Solutions
1.
import re

def clean_text(text):
    # Collapse any run of whitespace (spaces, tabs, newlines) into one space,
    # which also fixes the weird formatting
    text = re.sub(r'\s+', ' ', text)
    # Strip leading and trailing whitespace
    return text.strip()
2.
import re

def clean_text(text):
    # Remove special characters (anything not alphanumeric/underscore or whitespace)
    text = re.sub(r'[^\w\s]', '', text)
    # Remove digits
    text = re.sub(r'\d+', '', text)
    return text
3.
import nltk

nltk.download('punkt')  # one-time download for the tokenizer

def remove_stop_words(text, stop_words):
    # Tokenize the text
    tokens = nltk.word_tokenize(text)
    # Remove stop words (case-insensitive comparison)
    tokens = [token for token in tokens if token.lower() not in stop_words]
    # Join the tokens into a single string
    return ' '.join(tokens)