Advanced Sentiment Analysis of COVID-19 Tweets Using ML
Chapter 1: Data Acquisition from Kaggle
To begin our analysis, we first need to gather data from Kaggle. For this, I use Google Colaboratory (Colab). Start by installing the Kaggle library with the following command in your notebook:
!pip install kaggle
Next, navigate to kaggle.com and open your account settings by clicking your profile icon in the top-right corner. In the API section of the settings page, click the option to create a new API token, which downloads a kaggle.json file containing your credentials.
Now, we need to upload this kaggle.json file into Colab. Use the following code:
from google.colab import files
uploaded = files.upload()
After executing this code, an upload button will appear. Select the kaggle.json file to upload it into the environment.
Following the upload, move the credentials into the directory the Kaggle CLI expects and restrict the file's permissions:
!mkdir -p ~/.kaggle
!mv kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json
Once this setup is complete, run the next line to list available datasets:
!kaggle datasets list
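The full listing is long; if you prefer, the CLI also accepts a search term via the -s flag (shown here with an illustrative query):
!kaggle datasets list -s "covid nlp"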
To access our specific dataset, use the following command:
!kaggle datasets download -d datatattle/covid-19-nlp-text-classification
After downloading, unzip the dataset:
!unzip covid-19-nlp-text-classification.zip
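As a quick sanity check, a listing confirms the CSV files landed in the default Colab working directory (the paths below assume /content):
!ls /content/*.csv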
Now, we can load the data using pandas:
import pandas as pd
# Specify the file paths
train_file_path = "/content/Corona_NLP_train.csv"
test_file_path = "/content/Corona_NLP_test.csv"
# Load the datasets into pandas DataFrames
# (the files are not UTF-8, so we specify Latin-1 encoding for both)
train_df = pd.read_csv(train_file_path, encoding='ISO-8859-1')
test_df = pd.read_csv(test_file_path, encoding='ISO-8859-1')
# Preview the first few entries of the training DataFrame
train_df.head()
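Before trimming columns, it can also help to confirm the size and layout of what we loaded; a minimal check:
# Inspect dimensions and column names of the training data
print(train_df.shape)
print(train_df.columns.tolist())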
Section 1.1: Word2Vec Model Development
Next, we need to refine our data to focus on two essential columns:
# Retain only the 'OriginalTweet' and 'Sentiment' columns
train_df = train_df[['OriginalTweet', 'Sentiment']]
test_df = test_df[['OriginalTweet', 'Sentiment']]
# Confirm the changes by displaying the first few rows
test_df.head()
Examining the unique sentiment classes yields:
unique_sentiments = train_df['Sentiment'].unique()
print(unique_sentiments)
The output will show classes such as: ['Neutral', 'Positive', 'Extremely Negative', 'Negative', 'Extremely Positive'].
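Before collapsing these five classes, it is worth seeing how they are distributed; a quick count makes any imbalance visible:
# Count tweets per sentiment class
print(train_df['Sentiment'].value_counts())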
To simplify our analysis, we recode the sentiment into two classes:
# Create a new column 'SentimentClass'
train_df['SentimentClass'] = train_df['Sentiment'].apply(lambda x: 'Positive' if 'positive' in x.lower() else 'Not Positive')
train_df = train_df.drop('Sentiment', axis=1)
test_df['SentimentClass'] = test_df['Sentiment'].apply(lambda x: 'Positive' if 'positive' in x.lower() else 'Not Positive')
test_df = test_df.drop('Sentiment', axis=1)
# Display the modified DataFrame
test_df.head()
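As a sanity check on the new binary labels, we can look at their relative frequencies (this also foreshadows the class imbalance discussed later):
# Proportion of each binary class in the training data
print(train_df['SentimentClass'].value_counts(normalize=True))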
Now we can install the necessary packages for our analysis:
!pip install gensim
!pip install nltk
To tokenize the text and create a Word2Vec model:
import gensim
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')  # newer NLTK releases also need this for word_tokenize
nltk.download('stopwords')
# Tokenizing and preprocessing the tweets
train_df['OriginalTweet_2'] = train_df['OriginalTweet'].apply(lambda x: word_tokenize(x.lower()))
# Removing stop words
stop_words = set(stopwords.words('english'))
train_df['OriginalTweet_2'] = train_df['OriginalTweet_2'].apply(lambda x: [word for word in x if word not in stop_words])
# Train the Word2Vec model
model = Word2Vec(train_df['OriginalTweet_2'], vector_size=100, window=5, min_count=1, sg=0)
# Optionally save the model
model.save("word2vec.model")
# Access vectors for specific words
word_vectors = model.wv
To check how the vector representation of a word looks (note that looking up a word absent from the vocabulary raises a KeyError):
vector_example = word_vectors['example']
print(vector_example)
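Beyond raw vectors, a quick way to sanity-check the embeddings is to ask for a word's nearest neighbours; here 'virus' is just an illustrative token that is likely, though not guaranteed, to appear in the tweet vocabulary:
# Five most similar words by cosine similarity in the embedding space
print(word_vectors.most_similar('virus', topn=5))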
To load the model later:
word2vec_model = gensim.models.Word2Vec.load('/content/word2vec.model')
Chapter 2: Building the Sentiment Analysis Model
To create a sentiment analysis model, we need to convert the tweets into vectors using our Word2Vec model. This involves splitting the tweets into words, converting them to vectors, and then creating a new column for these vectorized tweets. Afterward, we will partition the data into training and testing sets.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns
# Function to convert tweets into vectors
def tweet_to_vector(tweet):
    # Tokenize and lowercase so tokens match the preprocessing used to train Word2Vec
    words = word_tokenize(tweet.lower())
    vectorized_words = [word2vec_model.wv[word] for word in words if word in word2vec_model.wv]
    # Average the word vectors; fall back to a zero vector for tweets with no known words
    return np.mean(vectorized_words, axis=0) if vectorized_words else np.zeros(word2vec_model.vector_size)
# Applying the function to the training DataFrame
train_df['VectorizedTweet'] = train_df['OriginalTweet'].apply(tweet_to_vector)
# Splitting into features and labels
X = np.vstack(train_df['VectorizedTweet'])
y = train_df['SentimentClass']
# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
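Before training anything, a majority-class baseline gives us a floor to beat; this is an optional sanity check, not part of the original pipeline:
from sklearn.dummy import DummyClassifier
# Always predict the most frequent class
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_train, y_train)
print(f"Baseline accuracy: {baseline.score(X_test, y_test):.2f}")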
Now, we will utilize three different models to construct an ensemble classifier:
from lightgbm import LGBMClassifier
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
# Defining individual classifiers
logistic_classifier = LogisticRegression(max_iter=1000)  # raise the iteration cap to avoid convergence warnings
rf_classifier = RandomForestClassifier()
lgbm_classifier = LGBMClassifier()
# Creating a Voting Classifier
voting_classifier = VotingClassifier(estimators=[
    ('logistic', logistic_classifier),
    ('random_forest', rf_classifier),
    ('lgbm', lgbm_classifier)
], voting='soft')
# Training the Voting Classifier
voting_classifier.fit(X_train, y_train)
# Making predictions
y_pred = voting_classifier.predict(X_test)
# Calculating accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
Section 2.1: Model Application on Test Data
To apply the sentiment model on the test dataset from Kaggle, we first need to save the trained model:
import joblib
# Save the model
model_filename = "voting_classifier_model.pkl"
joblib.dump(voting_classifier, model_filename)
Next, we will apply the same vectorization process to the test data:
# Preprocess the test data
test_df['VectorizedTweet'] = test_df['OriginalTweet'].apply(tweet_to_vector)
# Extract features and labels
X_test_final = np.vstack(test_df['VectorizedTweet'])
y_test_final = test_df['SentimentClass']
# Load the saved model
loaded_model = joblib.load(model_filename)
# Make predictions on the new dataset
y_new_pred = loaded_model.predict(X_test_final)
Section 2.2: Evaluation Metrics
To assess the model's performance, we can calculate the accuracy and confusion matrix:
from sklearn.metrics import accuracy_score, confusion_matrix
# Calculate accuracy
accuracy = accuracy_score(y_test_final, y_new_pred)
print(f"Accuracy: {accuracy:.2f}")
# Confusion matrix
confusion = confusion_matrix(y_test_final, y_new_pred)
print("Confusion Matrix:")
print(confusion)
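A heatmap makes the confusion matrix easier to read at a glance; seaborn and matplotlib were imported earlier, and the classifier's classes_ attribute gives the label order used in the matrix:
# Plot the confusion matrix with counts annotated in each cell
plt.figure(figsize=(6, 5))
sns.heatmap(confusion, annot=True, fmt='d', cmap='Blues',
            xticklabels=loaded_model.classes_, yticklabels=loaded_model.classes_)
plt.xlabel('Predicted label')
plt.ylabel('True label')
plt.title('Confusion Matrix on the Kaggle Test Set')
plt.show()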
To further understand the model's accuracy, we can visualize the distribution of sentiments:
import matplotlib.pyplot as plt
# Unique labels and their counts
unique_labels, label_counts = np.unique(y_test_final, return_counts=True)
# Create a pie chart
plt.figure(figsize=(8, 8))
plt.pie(label_counts, labels=unique_labels, autopct='%1.1f%%', startangle=140)
plt.axis('equal') # Equal aspect ratio ensures circular pie chart
plt.title('Sentiment Distribution in Test Data')
plt.show()
This visualization shows the class balance in the test data and puts the model's roughly 68% accuracy in context: it sits comfortably above what always predicting the majority class would achieve, though there is clearly room for improvement.
The first video, "Twitter Sentiment Analysis Using Python | Machine Learning Project 8", provides a hands-on approach to sentiment analysis using tweets as a dataset.
The second video, "NLP for Beginners - Sentiment Analysis of Twitter Data Using Scikit-Learn in Python", offers an introductory guide to sentiment analysis techniques within the context of natural language processing.