Advanced Sentiment Analysis of COVID-19 Tweets Using ML
Chapter 1: Data Acquisition from Kaggle
To begin our analysis, we first need to gather data from Kaggle. For this, I use Google Colaboratory (Colab). Start by installing the Kaggle library with the following command in your notebook:
!pip install kaggle
Next, navigate to kaggle.com and open your account settings by clicking your profile icon in the top-right corner. In the API section of the settings page, click the option to create a new API token, which downloads a kaggle.json file containing your credentials.
Now, we need to upload this kaggle.json file into Colab. Use the following code:
from google.colab import files
uploaded = files.upload()
After executing this code, an upload button will appear. Select the kaggle.json file to upload it into the environment.
Following the upload, move the credentials into the directory the Kaggle CLI expects and restrict the file's permissions:
!mkdir -p ~/.kaggle
!mv kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json
Once this setup is complete, run the next line to list available datasets:
!kaggle datasets list
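The full listing is long; if you prefer, the CLI also accepts a search term via the -s flag (shown here with an illustrative query):
!kaggle datasets list -s "covid nlp"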
To access our specific dataset, use the following command:
!kaggle datasets download -d datatattle/covid-19-nlp-text-classification
After downloading, unzip the dataset:
!unzip covid-19-nlp-text-classification.zip
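As a quick sanity check, a listing confirms the CSV files landed in the default Colab working directory (the paths below assume /content):
!ls /content/*.csv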
Now, we can load the data using pandas:
import pandas as pd
# Specify the file paths
train_file_path = "/content/Corona_NLP_train.csv"
test_file_path = "/content/Corona_NLP_test.csv"
# Load the datasets into pandas DataFrames
# (the files are not UTF-8, so we specify Latin-1 encoding for both)
train_df = pd.read_csv(train_file_path, encoding='ISO-8859-1')
test_df = pd.read_csv(test_file_path, encoding='ISO-8859-1')
# Preview the first few entries of the training DataFrame
train_df.head()
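Before trimming columns, it can also help to confirm the size and layout of what we loaded; a minimal check:
# Inspect dimensions and column names of the training data
print(train_df.shape)
print(train_df.columns.tolist())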
Section 1.1: Word2Vec Model Development
Next, we need to refine our data to focus on two essential columns:
# Retain only the 'OriginalTweet' and 'Sentiment' columns
train_df = train_df[['OriginalTweet', 'Sentiment']]
test_df = test_df[['OriginalTweet', 'Sentiment']]
# Confirm the changes by displaying the first few rows
test_df.head()
Examining the unique sentiment classes yields:
unique_sentiments = train_df['Sentiment'].unique()
print(unique_sentiments)
The output will show classes such as: ['Neutral', 'Positive', 'Extremely Negative', 'Negative', 'Extremely Positive'].
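Before collapsing these five classes, it is worth seeing how they are distributed; a quick count makes any imbalance visible:
# Count tweets per sentiment class
print(train_df['Sentiment'].value_counts())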
To simplify our analysis, we recode the sentiment into two classes:
# Create a new column 'SentimentClass'
train_df['SentimentClass'] = train_df['Sentiment'].apply(lambda x: 'Positive' if 'positive' in x.lower() else 'Not Positive')
train_df = train_df.drop('Sentiment', axis=1)
test_df['SentimentClass'] = test_df['Sentiment'].apply(lambda x: 'Positive' if 'positive' in x.lower() else 'Not Positive')
test_df = test_df.drop('Sentiment', axis=1)
# Display the modified DataFrame
test_df.head()
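As a sanity check on the new binary labels, we can look at their relative frequencies (this also foreshadows the class imbalance discussed later):
# Proportion of each binary class in the training data
print(train_df['SentimentClass'].value_counts(normalize=True))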
Now we can install the necessary packages for our analysis:
!pip install gensim
!pip install nltk
To tokenize the text and create a Word2Vec model:
import gensim
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')  # newer NLTK releases also need this for word_tokenize
nltk.download('stopwords')
# Tokenizing and preprocessing the tweets
train_df['OriginalTweet_2'] = train_df['OriginalTweet'].apply(lambda x: word_tokenize(x.lower()))
# Removing stop words
stop_words = set(stopwords.words('english'))
train_df['OriginalTweet_2'] = train_df['OriginalTweet_2'].apply(lambda x: [word for word in x if word not in stop_words])
# Train the Word2Vec model
model = Word2Vec(train_df['OriginalTweet_2'], vector_size=100, window=5, min_count=1, sg=0)
# Optionally save the model
model.save("word2vec.model")
# Access vectors for specific words
word_vectors = model.wv
To check how the vector representation of a word looks (note that looking up a word absent from the vocabulary raises a KeyError):
vector_example = word_vectors['example']
print(vector_example)
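Beyond raw vectors, a quick way to sanity-check the embeddings is to ask for a word's nearest neighbours; here 'virus' is just an illustrative token that is likely, though not guaranteed, to appear in the tweet vocabulary:
# Five most similar words by cosine similarity in the embedding space
print(word_vectors.most_similar('virus', topn=5))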
To load the model later:
word2vec_model = gensim.models.Word2Vec.load('/content/word2vec.model')
Chapter 2: Building the Sentiment Analysis Model
To create a sentiment analysis model, we need to convert the tweets into vectors using our Word2Vec model. This involves splitting the tweets into words, converting them to vectors, and then creating a new column for these vectorized tweets. Afterward, we will partition the data into training and testing sets.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns
# Function to convert tweets into vectors
def tweet_to_vector(tweet):
    # Tokenize and lowercase so tokens match the preprocessing used to train Word2Vec
    words = word_tokenize(tweet.lower())
    vectorized_words = [word2vec_model.wv[word] for word in words if word in word2vec_model.wv]
    # Average the word vectors; fall back to a zero vector for tweets with no known words
    return np.mean(vectorized_words, axis=0) if vectorized_words else np.zeros(word2vec_model.vector_size)
# Applying the function to the training DataFrame
train_df['VectorizedTweet'] = train_df['OriginalTweet'].apply(tweet_to_vector)
# Splitting into features and labels
X = np.vstack(train_df['VectorizedTweet'])
y = train_df['SentimentClass']
# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
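Before training anything, a majority-class baseline gives us a floor to beat; this is an optional sanity check, not part of the original pipeline:
from sklearn.dummy import DummyClassifier
# Always predict the most frequent class
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_train, y_train)
print(f"Baseline accuracy: {baseline.score(X_test, y_test):.2f}")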
Now, we will utilize three different models to construct an ensemble classifier:
from lightgbm import LGBMClassifier
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
# Defining individual classifiers
logistic_classifier = LogisticRegression(max_iter=1000)  # raise the iteration cap to avoid convergence warnings
rf_classifier = RandomForestClassifier()
lgbm_classifier = LGBMClassifier()
# Creating a Voting Classifier
voting_classifier = VotingClassifier(estimators=[
    ('logistic', logistic_classifier),
    ('random_forest', rf_classifier),
    ('lgbm', lgbm_classifier)
], voting='soft')
# Training the Voting Classifier
voting_classifier.fit(X_train, y_train)
# Making predictions
y_pred = voting_classifier.predict(X_test)
# Calculating accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
Section 2.1: Model Application on Test Data
To apply the sentiment model on the test dataset from Kaggle, we first need to save the trained model:
import joblib
# Save the model
model_filename = "voting_classifier_model.pkl"
joblib.dump(voting_classifier, model_filename)
Next, we will apply the same vectorization process to the test data:
# Preprocess the test data
test_df['VectorizedTweet'] = test_df['OriginalTweet'].apply(tweet_to_vector)
# Extract features and labels
X_test_final = np.vstack(test_df['VectorizedTweet'])
y_test_final = test_df['SentimentClass']
# Load the saved model
loaded_model = joblib.load(model_filename)
# Make predictions on the new dataset
y_new_pred = loaded_model.predict(X_test_final)
Section 2.2: Evaluation Metrics
To assess the model's performance, we can calculate the accuracy and confusion matrix:
from sklearn.metrics import accuracy_score, confusion_matrix
# Calculate accuracy
accuracy = accuracy_score(y_test_final, y_new_pred)
print(f"Accuracy: {accuracy:.2f}")
# Confusion matrix
confusion = confusion_matrix(y_test_final, y_new_pred)
print("Confusion Matrix:")
print(confusion)
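A heatmap makes the confusion matrix easier to read at a glance; seaborn and matplotlib were imported earlier, and the classifier's classes_ attribute gives the label order used in the matrix:
# Plot the confusion matrix with counts annotated in each cell
plt.figure(figsize=(6, 5))
sns.heatmap(confusion, annot=True, fmt='d', cmap='Blues',
            xticklabels=loaded_model.classes_, yticklabels=loaded_model.classes_)
plt.xlabel('Predicted label')
plt.ylabel('True label')
plt.title('Confusion Matrix on the Kaggle Test Set')
plt.show()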
To further understand the model's accuracy, we can visualize the distribution of sentiments:
import matplotlib.pyplot as plt
# Unique labels and their counts
unique_labels, label_counts = np.unique(y_test_final, return_counts=True)
# Create a pie chart
plt.figure(figsize=(8, 8))
plt.pie(label_counts, labels=unique_labels, autopct='%1.1f%%', startangle=140)
plt.axis('equal') # Equal aspect ratio ensures circular pie chart
plt.title('Sentiment Distribution in Test Data')
plt.show()
This visualization shows the class balance in the test data and puts the model's roughly 68% accuracy in context: it sits comfortably above what always predicting the majority class would achieve, though there is clearly room for improvement.
The first video, "Twitter Sentiment Analysis Using Python | Machine Learning Project 8", provides a hands-on approach to sentiment analysis using tweets as a dataset.
The second video, "NLP for Beginners - Sentiment Analysis of Twitter Data Using Scikit-Learn in Python", offers an introductory guide to sentiment analysis techniques within the context of natural language processing.