Toxic Comment Classification using LSTM and LSTM-CNN.

My first attempt to solve a Natural Language Processing Use-Case with Deep Learning.

Shaunak Varudandi
Towards Data Science



Introduction

Online forums and social media platforms give individuals the means to put forward their thoughts and freely express their opinions on various issues and incidents. In some cases, these online comments contain explicit language that may hurt readers. Such comments can be classified into categories like Toxic, Severe Toxic, Obscene, Threat, Insult, and Identity Hate. The threat of abuse and harassment means that many people stop expressing themselves and give up on seeking different opinions.

To protect users from being exposed to offensive language on online forums or social media sites, companies have started flagging comments and blocking users who are found guilty of using unpleasant language. Several Machine Learning models have been developed and deployed to filter out the unruly language and protect internet users from becoming victims of online harassment and cyberbullying.

Problem Statement

“To build a multi-headed model that’s capable of detecting different types of toxicity like threats, obscenity, insults, and identity-based hate.”

Having worked on an NLP use-case before (a Fake News Classifier to tackle Covid-19 disinformation), my aim in this project was to focus on data pre-processing and feature engineering and ensure that the data consumed by my deep-learning models is as clean as possible. Additionally, I decided to use fastText’s pre-trained word embeddings to harness the power of Transfer Learning.

Work Flow

The Toxic Comment Classification Challenge is a competition organized by Jigsaw/Conversation AI and hosted on Kaggle. The data set for building the classification model was acquired from the competition site and included both the training set and the test set. The workflow elaborated below describes the entire process, from data pre-processing to model testing.

Data Exploration, Data Pre-processing, and Feature Engineering

Step 1: Checking for missing values.

First and foremost, after importing the training and test data into pandas DataFrames, I decided to check for missing values in the downloaded data. Using the “isnull” function on both the training and test data, I discovered that there were no missing records; I therefore moved on to the next step of my project.
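As a minimal sketch, assuming the competition files are named “train.csv” and “test.csv” (the DataFrame names introduced here carry through the later snippets):

    import pandas as pd

    # Load the Kaggle training and test files (file names assumed here).
    train_df = pd.read_csv("train.csv")
    test_df = pd.read_csv("test.csv")

    # Count missing values per column; all zeros means nothing needs imputing.
    print(train_df.isnull().sum())
    print(test_df.isnull().sum())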

Step 2: Text Normalization.

As I was now certain that there were no missing records in my data, I decided to start with data pre-processing. Firstly, I decided to normalize the text data, since comments from online forums usually contain inconsistent language, special characters in place of letters (e.g. @rgument), and numbers in place of letters (e.g. n0t). To tackle such inconsistencies in the data, I decided to use Regex. The text normalization steps that I performed are listed below:

  • Removing characters in between text.
  • Removing repeated characters.
  • Converting data to lower case.
  • Removing punctuation.
  • Removing unnecessary white spaces in between words.
  • Removing “\n”.
  • Removing non-English characters.

To accomplish the above-listed steps, I referenced the following Jupyter notebooks [1][2]. Firstly, I created a dictionary of common representations of cuss words that are frequently found on online forums and social media platforms. Secondly, I created a function that performs all the above-listed steps and outputs clean data. These steps were performed on the training as well as the test data.
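A rough sketch of such a cleaning function is shown below. The replacement dictionary holds only the two illustrative examples mentioned earlier; the actual dictionary is far more extensive. “comment_text” is the text column in the Kaggle files:

    import re

    # Illustrative replacement dictionary; the real one maps many more
    # obfuscated spellings and cuss-word variants to their plain forms.
    REPLACEMENTS = {"n0t": "not", "@rgument": "argument"}

    def clean_text(text):
        text = text.lower()                         # convert to lower case
        text = text.replace("\n", " ")              # remove "\n"
        for key, value in REPLACEMENTS.items():     # expand obfuscated words
            text = text.replace(key, value)
        text = re.sub(r"(.)\1{2,}", r"\1\1", text)  # squeeze repeated characters
        text = re.sub(r"[^a-z\s]", " ", text)       # drop punctuation, digits,
                                                    # and non-English characters
        text = re.sub(r"\s+", " ", text).strip()    # collapse extra white space
        return text

    train_df["comment_text"] = train_df["comment_text"].apply(clean_text)
    test_df["comment_text"] = test_df["comment_text"].apply(clean_text)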

Step 3: Lemmatization.

Since the data is now clean and consistent, it is the right time to perform Lemmatization. Lemmatization is the process of grouping together the different inflected forms of a word so they can be analyzed as a single item. For example, we do not want the Machine Learning algorithm to treat studying, studies, and study as three separate words because, in truth, they are not. Lemmatization helps reduce the words “studying” and “studies” to their root form, i.e. study. To implement Lemmatization, I imported “WordNetLemmatizer” from the “nltk” library, created a function “lemma” to perform Lemmatization, and applied it to the clean data that I procured from Step 2.
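A minimal version of the “lemma” function could look as follows (note that WordNet treats each word as a noun unless a part-of-speech tag is supplied, which already covers cases like “studies” becoming “study”):

    import nltk
    from nltk.stem import WordNetLemmatizer

    nltk.download("wordnet")  # needed once before WordNetLemmatizer can run
    lemmatizer = WordNetLemmatizer()

    def lemma(text):
        # Lemmatize every word individually and re-join the sentence.
        return " ".join(lemmatizer.lemmatize(word) for word in text.split())

    train_df["comment_text"] = train_df["comment_text"].apply(lemma)
    test_df["comment_text"] = test_df["comment_text"].apply(lemma)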

Step 4: Stopwords Removal.

Stopwords Removal, as we all know, is one of the most critical steps in text pre-processing for use-cases that involve text classification. Removing stopwords ensures that the focus stays on the words that actually define the meaning of the text.

i. To remove stopwords from my data, I took the help of the “spacy” library. Spacy provides a list of common stopwords, “STOP_WORDS”, that can be used to remove stopwords from any textual data.

ii. Although the list provided by spacy’s library is quite extensive, I decided to search for additional stopwords that might be unique to my dataset.

iii. Firstly, I decided to add single-letter and two-letter words to the list of stopwords. While reading through random comments in my dataset, I came across instances where single-letter or two-letter words existed without any context (e.g. “Wow such a lovely pillow w!!” or “He is such a happy guy bb.”). To make sure that such instances do not affect the performance of my deep learning models, I added them to the list of stopwords. However, I made sure that words like me, am, and as, or letters like I and a, were not added to the list of stopwords.

iv. Once the above task was completed, I decided to search for words in my dataset that could be possible stopwords (the criteria being that they appear in the dataset very frequently and that they do not contribute significantly to the classification task). To complete this step, I wrote a code snippet, shown after this list, that helped me find stopwords in my dataset satisfying the criteria highlighted above. Lastly, I added these newly acquired stopwords to spacy’s “STOP_WORDS” list, thereby creating my final list of stopwords.

v. Now that I had my desired list of stopwords, I used it to remove stopwords from my training data and my test data. Once this step was completed, I had a clean data set, free from all inconsistencies.
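A sketch of the frequency-based search and the final removal step follows; the frequency threshold of 10,000 is illustrative, not the exact cut-off I used:

    from collections import Counter
    from spacy.lang.en.stop_words import STOP_WORDS

    # Count how often every word appears across the training comments.
    word_counts = Counter()
    for comment in train_df["comment_text"]:
        word_counts.update(comment.split())

    # Candidate stopwords: words that occur very frequently (threshold is
    # illustrative); each candidate was inspected before being kept.
    frequent = {w for w, c in word_counts.items() if c > 10000}

    # Short, context-free tokens, explicitly keeping i, a, me, am, and as.
    keep = {"i", "a", "me", "am", "as"}
    short = {w for w in word_counts if len(w) <= 2} - keep

    my_stopwords = (set(STOP_WORDS) | frequent | short) - keep

    def remove_stopwords(text):
        return " ".join(w for w in text.split() if w not in my_stopwords)

    train_df["comment_text"] = train_df["comment_text"].apply(remove_stopwords)
    test_df["comment_text"] = test_df["comment_text"].apply(remove_stopwords)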

Step 5: Tokenization, Indexing, and Index Representation.

As we all know, machine learning and deep learning models work on numerical data, irrespective of the use-case. Therefore, to train a deep-learning model using the clean text data, the data must be converted into an equivalent machine-readable form. To achieve this, we need to perform the following steps [3]:

  1. Tokenization — We need to break down the sentence into unique words. e.g. “I love cats and love dogs” will become [“I”, “love”, “cats”, “and”, “dogs”].
  2. Indexing — We put the words in a dictionary-like structure and give them an index each e.g. {1: “I”,2: “love”,3: “cats”,4: “and”,5: “dogs”}.
  3. Index Representation — We represent the sequence of words in each comment as a sequence of indices, and feed this chain of indices into our deep-learning model, e.g. [1,2,3,4,2,5].

Using the “Tokenizer” class from the “Keras” library, the above-mentioned steps can be easily performed. This class allows vectorizing a text corpus by turning each text into either a sequence of integers (each integer being the index of a token in a dictionary) or into a vector where the coefficient for each token can be binary, based on word count, based on tf-idf, etc. [4]. The code snippet below demonstrates the conversion of text data into sequence vectors.
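A minimal version of this step, continuing from the earlier snippets (the vocabulary size of 20,000 is an assumed value, not the one I tuned):

    from keras.preprocessing.text import Tokenizer

    MAX_FEATURES = 20000  # vocabulary size; an assumed value for illustration

    # Build the word index from the training comments.
    tokenizer = Tokenizer(num_words=MAX_FEATURES)
    tokenizer.fit_on_texts(train_df["comment_text"])

    # Turn every comment into a sequence of word indices.
    X_train_seq = tokenizer.texts_to_sequences(train_df["comment_text"])
    X_test_seq = tokenizer.texts_to_sequences(test_df["comment_text"])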

Step 6: Padding.

Comments found on online forums or social media platforms are variable in length; some are one-word replies while others are vastly elaborated thoughts. Variable-length sentences are converted into variable-length sequence vectors, and we cannot pass vectors of inconsistent lengths to our deep-learning model. To circumvent this issue, we use padding. With the help of padding, we can make the shorter sentences as long as the others by filling the shortfall with zeros and, on the other hand, trim the longer ones to the same length as the short ones [3]. I used the “pad_sequences” function from the “Keras” library, fixed the sentence length at 200 words, and applied post-padding (i.e. for shorter sentences, 0’s are added at the end of the sequence vector). As soon as we are done with the padding of our sequence vectors, we can start creating our deep-learning models.
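The padding step itself is a one-liner per data set:

    from keras.preprocessing.sequence import pad_sequences

    MAX_LEN = 200  # fixed sentence length, as described above

    # Post-pad shorter sequences with zeros; longer ones are trimmed to 200.
    X_train = pad_sequences(X_train_seq, maxlen=MAX_LEN, padding="post")
    X_test = pad_sequences(X_test_seq, maxlen=MAX_LEN, padding="post")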

Model Creation & Model Assessment

Step 1: Split Training Data into Train-Set and Validation-Set.

Since we have completed the data pre-processing and feature engineering part of our project, we move on to the model creation and model assessment part of the project. Before trying to fit a deep learning model on the training data, I randomly split the data into train-set and validation-set. The validation set accounts for 20% of the training data.
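A sketch of this split; the six label columns are the ones provided in the Kaggle training file, and the random seed is arbitrary:

    from sklearn.model_selection import train_test_split

    # The six toxicity labels from the Kaggle training file.
    label_cols = ["toxic", "severe_toxic", "obscene", "threat",
                  "insult", "identity_hate"]
    y = train_df[label_cols].values

    # Hold out 20% of the training data as the validation set.
    X_tr, X_val, y_tr, y_val = train_test_split(
        X_train, y, test_size=0.2, random_state=42)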

Step 2: Import fastText’s pre-trained word embeddings.

As mentioned in the Problem Statement, I wanted to use pre-trained word embeddings from fastText to harness the power of Transfer Learning. To do so, I loaded the fastText word embeddings into my environment and then created an embedding matrix by mapping each word in my vocabulary to its pre-trained vector.
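A sketch of this step, assuming the common-crawl vectors file “crawl-300d-2M.vec” (the file name is an assumption; any fastText .vec file works the same way):

    import numpy as np

    EMBED_DIM = 300  # fastText's pre-trained vectors are 300-dimensional

    # Read the .vec file into a {word: vector} dictionary.
    embeddings_index = {}
    with open("crawl-300d-2M.vec", encoding="utf-8") as f:
        next(f)  # the first line is a header: vocabulary size and dimension
        for line in f:
            values = line.rstrip().split(" ")
            embeddings_index[values[0]] = np.asarray(values[1:], dtype="float32")

    # Row i of the matrix holds the pre-trained vector for word index i;
    # words missing from fastText keep an all-zero row.
    word_index = tokenizer.word_index
    num_words = min(MAX_FEATURES, len(word_index) + 1)
    embedding_matrix = np.zeros((num_words, EMBED_DIM))
    for word, i in word_index.items():
        if i < num_words and word in embeddings_index:
            embedding_matrix[i] = embeddings_index[word]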

Step 3: Model Creation (LSTM).

It is now time to choose a deep-learning model and train it using the train-set and the validation-set. Since we are working on a Natural Language Processing use-case, it is ideal that we use the Long Short-Term Memory (LSTM) model. LSTM networks are similar to RNNs, with one major difference: the hidden-layer updates are replaced by memory cells. This makes them better at finding and exposing long-range dependencies in data, which is imperative for sentence structures [5].

i. Firstly, I imported the “Talos” library, since it helps us perform hyperparameter tuning as well as model evaluation. Using its “Scan” function, I performed a grid search and found the parameters that gave me the highest accuracy.

ii. Next, using the best hyper-parameters, I defined the required number of layers for the LSTM model, compiled the model, and lastly trained the model using the train-set and validation-set.
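A minimal sketch of the resulting architecture, continuing from the earlier snippets; the layer sizes and the dropout rate below are illustrative placeholders, not the exact values Talos selected:

    from keras.models import Sequential
    from keras.layers import Embedding, LSTM, Dense, Dropout

    model = Sequential([
        # Frozen embedding layer initialized with the fastText matrix above.
        Embedding(num_words, EMBED_DIM, weights=[embedding_matrix],
                  input_length=MAX_LEN, trainable=False),
        LSTM(64),                        # memory cells capture long-range context
        Dropout(0.2),
        Dense(6, activation="sigmoid"),  # one sigmoid output per toxicity label
    ])

    # Sigmoid + binary cross-entropy, because a comment can carry several labels.
    model.compile(loss="binary_crossentropy", optimizer="adam",
                  metrics=["accuracy"])
    model.fit(X_tr, y_tr, validation_data=(X_val, y_val),
              epochs=2, batch_size=128)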

Step 4: Model Creation (LSTM-CNN).

During the research phase of my project, I came across papers that achieved Toxic Comment Classification using a hybrid model (i.e. an LSTM and a CNN working together). Such an architecture intrigued me. LSTM can effectively preserve the characteristics of historical information in long text sequences, whereas CNN can extract the local features of the text [6]. Combining the two traditional neural network architectures helps us harness their combined capabilities. Therefore, I decided to implement an LSTM-CNN hybrid model as a part of my project. The goal was to compare the performance of both deep-learning architectures and ascertain the best model for my project.

Similar to the process followed in Step 3, I discovered the best hyper-parameters for my hybrid model using “Talos”. Once the operation was completed, I evaluated the results and picked the hyperparameters which gave me the highest accuracy. Finally, I trained my hybrid model using the train-set and the validation-set.
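A sketch of the hybrid architecture, with the same caveat that the layer sizes are illustrative: the LSTM returns its full output sequence, and a 1-D convolution then extracts local features from those outputs, following the hybrid design described in [6].

    from keras.models import Sequential
    from keras.layers import (Embedding, LSTM, Conv1D,
                              GlobalMaxPooling1D, Dense, Dropout)

    hybrid = Sequential([
        Embedding(num_words, EMBED_DIM, weights=[embedding_matrix],
                  input_length=MAX_LEN, trainable=False),
        LSTM(64, return_sequences=True),               # keep the whole sequence
        Conv1D(64, kernel_size=3, activation="relu"),  # local n-gram-like features
        GlobalMaxPooling1D(),                          # strongest feature per filter
        Dropout(0.2),
        Dense(6, activation="sigmoid"),
    ])

    hybrid.compile(loss="binary_crossentropy", optimizer="adam",
                   metrics=["accuracy"])
    hybrid.fit(X_tr, y_tr, validation_data=(X_val, y_val),
               epochs=2, batch_size=128)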

Step 5: Evaluate the Model Accuracy and Model Loss during the training phase.

As we have completed the training of both our deep-learning models, we should now visualize their accuracy and loss values over the entire training process. Ideally, the loss value for any deep-learning model should decrease as the number of epochs increases; conversely, the accuracy should increase. This gives us a fairly decent idea of the quality of our deep-learning models and whether they have been appropriately trained. Trends in the accuracy and the loss values during every epoch can be seen in the images below.

Loss and Accuracy values for the LSTM model over a span of 2 epochs. (Image by author)
Loss and Accuracy values for the LSTM-CNN model over a span of 2 epochs. (Image by author)

Step 6: Calculating Model Accuracy using the Test-set.

Evaluating the models based on their accuracy and loss values gave me promising results. It gave me the confidence to assess the performance of my deep-learning models using the test-set. As mentioned earlier in this blog, the test-set was procured from Kaggle and was passed through the same data pre-processing and feature engineering steps as the training data. With the processed test data available, I used the “predict” function to generate outputs for the inputs present in the test data.
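A sketch of the prediction and submission step, continuing with the names from the earlier snippets; the output file name is arbitrary, while “id” and the six label columns follow the Kaggle submission format:

    # Predict label probabilities for the pre-processed test comments.
    preds = model.predict(X_test, batch_size=1024)

    # Assemble the Kaggle submission: an "id" column plus the six labels.
    submission = pd.DataFrame(preds, columns=label_cols)
    submission.insert(0, "id", test_df["id"].values)
    submission.to_csv("submission_lstm.csv", index=False)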

As soon as the above process was completed for both my deep learning models, I uploaded the respective “.csv” output files to the Kaggle Competition and submitted them to generate the final accuracy scores. The maximum accuracy scores for both my deep learning models can be seen in the image below.

Comparison of Accuracy scores for the traditional LSTM Model and the hybrid LSTM-CNN Model. (Image by author)

Conclusion

After evaluating the results procured during the training phase of my project and the results that I received from the competition website, I can claim that the traditional LSTM model performs better than the hybrid LSTM-CNN model. The hybrid model loses marginally to the traditional model, which suggests that the traditional LSTM model is the right choice for the Toxic Comment Classification use-case.

The next step is to deploy the LSTM model as the backend of a web application that determines the toxicity of a comment provided as input by the user. A detailed walkthrough of all the steps needed to deploy the LSTM model on AWS EC2 can be found in Part II of this blog. Be sure to check that one out as well.

Key Learnings from the Project

This project allowed me to work with two different deep learning models and implement them on a Natural Language Processing use-case. The various data pre-processing and feature engineering steps in the project made me cognizant of the efficient methods that can be used to clean textual data. I understood the workings of various deep-learning models such as CNN, LSTM, and the LSTM-CNN hybrid model. I was introduced to the concept of word embeddings and the advantages of using pre-trained word embeddings. Lastly, the discovery of the “Talos” library helped me perform seamless hyper-parameter tuning for both my deep-learning models and achieve optimum results.

References

[1] - https://www.kaggle.com/fizzbuzz/toxic-data-preprocessing
[2] - https://github.com/susanli2016/NLP-with-Python/blob/master/Toxic%20Comments%20LSTM%20GloVe.ipynb
[3] - https://www.kaggle.com/sbongo/for-beginners-tackling-toxic-using-keras
[4] - https://keras.io/api/preprocessing/text/
[5] - https://towardsdatascience.com/recurrent-neural-networks-deep-learning-for-nlp-37baa188aef5
[6] - J. Zhang, Y. Li, J. Tian, and T. Li, “LSTM-CNN Hybrid Model for Text Classification,” 2018 IEEE 3rd Advanced Information Technology, Electronic and Automation Control Conference (IAEAC), Chongqing, 2018, pp. 1675–1680, doi: 10.1109/IAEAC.2018.8577620.

The workflow I followed for this project can be found on my GitHub page. I hope you enjoyed reading my blog.
