Creating a Tailored Stopwords List for NLP with Python
Chapter 1: Understanding Stopwords
In this section, we will explore the concept of stopwords and their significance in natural language processing (NLP). A well-maintained stopword list helps machine learning algorithms focus on the words that actually carry information. Stopwords are words that carry little meaning on their own and contribute little to the overall information a text conveys.
According to the Oxford Dictionary, a stopword is defined as a word that is automatically excluded from a computer-generated index or concordance. For instance, when “to” is recognized as a stopword, the phrase “to eat apples” simplifies to “eat apples” by omitting “to.”
Many NLP applications rely on stopwords, making it essential to keep an updated list for optimal algorithm performance.
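As a quick illustration, here is a minimal sketch of this kind of filtering using NLTK's built-in English stopword list. The variable names are just for the example, and the "stopwords" corpus must first be downloaded via nltk.download.

```python
# Minimal sketch: removing stopwords from the "to eat apples" example with
# NLTK's default English list (requires the "stopwords" corpus download).
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)

english_stopwords = set(stopwords.words("english"))
tokens = "to eat apples".split()
filtered = [t for t in tokens if t.lower() not in english_stopwords]
print(filtered)  # ['eat', 'apples']
```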
Section 1.1: The Need for Customized Stopwords
In specialized domains, the frequency distribution of words often differs from that of general data. Consider an example where you are developing a document classification system for an eCommerce website. Each page may include repeated phrases such as:
- Review the product.
- Rate the product.
- Contact us.
- Available in stock.
If you utilize the standard NLTK stopwords list, you might encounter terms like:
- I
- me
- my
- myself
- we
- our
- you
- your
However, common terms relevant to your data, such as “review,” “rate,” “contact,” and “available,” are absent from this list. Given their frequent occurrence across documents, these words do not provide valuable insights regarding the document class. Therefore, it is advisable to exclude them to streamline the feature set and simplify the model.
I recommend combining the default NLTK stopwords list with additional terms derived from your dataset.
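As a hedged sketch of this recommendation, the snippet below loads NLTK's default English list and extends it with a few domain terms taken from the eCommerce example; the exact terms to add should come from your own data (Section 1.2 shows how to derive them).

```python
# Minimal sketch: extend NLTK's default English stopwords with domain terms.
# The added terms here are illustrative, taken from the eCommerce example.
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)

domain_stopwords = {"review", "rate", "contact", "available"}
combined_stopwords = set(stopwords.words("english")) | domain_stopwords
print(len(combined_stopwords))
```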
Subsection 1.1.1: Building a Stopwords List with NLTK and sklearn
To build the custom portion of the stopwords list in Python, we will use the sklearn (scikit-learn) library with the following pipeline:
CountVectorizer:
This module processes a text list (or a dataframe column) and transforms it into a matrix of token counts. Essentially, it generates a word list for each item, along with the frequency of each word.
After fitting this module to your data, its output is the matrix of word counts, commonly referred to as term frequency (TF).
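For illustration, here is a small sketch of CountVectorizer applied to the eCommerce phrases above; the variable names and the printed output are expectations for this toy input, and get_feature_names_out requires scikit-learn 1.0+.

```python
# Small illustration of CountVectorizer on the eCommerce phrases from Section 1.1.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "Review the product",
    "Rate the product",
    "Contact us",
    "Available in stock",
]
vectorizer = CountVectorizer()
word_counts = vectorizer.fit_transform(docs)  # sparse document-term count matrix

print(vectorizer.get_feature_names_out())
# e.g. ['available' 'contact' 'in' 'product' 'rate' 'review' 'stock' 'the' 'us']
print(word_counts.toarray()[0])  # counts for "Review the product"
```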
TfidfTransformer:
According to the Python sklearn documentation, TF indicates term frequency, while TF-IDF represents term frequency multiplied by inverse document frequency. This term weighting scheme is widely used in information retrieval and various NLP applications, including document classification and text summarization.
The formula for calculating the TF-IDF of a term t in a document d across a document set is:
tf-idf(t, d) = tf(t, d) * idf(t)
Where idf is computed as:
idf(t) = log[n / df(t)] + 1
In this formula, n refers to the total number of documents, and df(t) indicates the number of documents containing term t. Notably, the addition of “1” prevents terms that appear in all documents from being completely disregarded.
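A quick worked example, assuming a hypothetical collection of 1,000 documents: a term appearing in 100 of them gets idf = log(1000/100) + 1 ≈ 3.30 (natural log, matching sklearn's behaviour when smooth_idf=False), while a term appearing in all 1,000 documents gets idf = log(1) + 1 = 1, so it is down-weighted but not discarded.

```python
# Numeric check of the idf formula above (natural log; sklearn's behaviour
# when TfidfTransformer(smooth_idf=False) is used). The counts are hypothetical.
import math

n = 1000                      # total number of documents
print(math.log(n / 100) + 1)  # term in 100 documents  -> ~3.30
print(math.log(n / n) + 1)    # term in every document -> 1.0 (kept, not ignored)
```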
Now we can put these pieces together and generate a list of words ranked by their TF-IDF scores, from the least to the most informative terms.
The code below assumes you have a pandas dataframe (df) with a column named text, where each row represents one document in your collection.
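The original code is not reproduced here, so the following is a minimal sketch consistent with the pipeline described above. It assumes the dataframe df and column text mentioned in the text; the names df_tfidf, term, and tfidf, and the use of smooth_idf=False to match the formula above, are choices made for this sketch (get_feature_names_out requires scikit-learn 1.0+).

```python
# Minimal sketch of the CountVectorizer -> TfidfTransformer pipeline described
# above. Assumes a pandas dataframe `df` with a `text` column (one document per row).
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# 1. Term frequencies: one row per document, one column per token.
count_vectorizer = CountVectorizer()
word_counts = count_vectorizer.fit_transform(df["text"])

# 2. TF-IDF weights from the raw counts (smooth_idf=False matches the formula above).
tfidf_transformer = TfidfTransformer(smooth_idf=False)
tfidf = tfidf_transformer.fit_transform(word_counts)

# 3. Average each term's TF-IDF across documents and sort ascending, so the
#    least informative terms (stopword candidates) appear first.
df_tfidf = pd.DataFrame({
    "term": count_vectorizer.get_feature_names_out(),
    "tfidf": tfidf.mean(axis=0).A1,  # .A1 flattens the 1 x n_terms matrix
}).sort_values("tfidf").reset_index(drop=True)

print(df_tfidf.head(20))  # the strongest stopword candidates
```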
Section 1.2: Manual Review of Generated Words
After generating the df_tfidf dataframe, it's important to manually review the top N words and confirm, based on your domain knowledge, that they make sense as stopwords. How many words you keep is up to you, but in general natural language text roughly 40-60% of unique words may end up classified as stopwords, whereas for short texts such as product titles the share can drop to around 5%.
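As a hedged continuation of the sketch above, you might take the top N candidates from df_tfidf after your manual review and merge them with NLTK's defaults; the cutoff N = 100 here is purely illustrative.

```python
# Illustrative follow-up: merge the reviewed top-N candidates from df_tfidf
# with NLTK's default English stopwords. N is a placeholder; choose it after
# manually reviewing the ranked terms.
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)

N = 100  # placeholder cutoff, set after manual review
custom_terms = set(df_tfidf["term"].head(N))
final_stopwords = set(stopwords.words("english")) | custom_terms
```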
I hope you find this information valuable! Please feel free to reach out with any questions you may have.
Chapter 2: Practical Applications
In this video titled "NLP with Python! Stop Words," you will discover the role of stopwords in natural language processing and how they can be effectively managed.
This second video, "Stop Words in NLP | Natural Language Processing with Python| #4," delves deeper into the applications of stopwords and their significance in enhancing NLP tasks.