CONSUMER COMPLAINT CLASSIFICATION

05 Jan

Consumer Complaint Classification means classifying the nature of a complaint reported by a consumer. It is helpful for consumer care departments because they receive thousands of complaints daily, and classifying them helps identify the complaints that need to be resolved first to reduce the harm to the consumer. If you want to learn how to use Machine Learning for consumer complaint classification, this article is for you. In this article, I will take you through the task of Consumer Complaint Classification with Machine Learning using Python.

Consumer Complaint Classification

The problem of consumer complaint classification is based on Natural Language Processing and Multiclass Classification. To solve this problem, we need a dataset containing complaints reported by consumers. I found an ideal dataset for this task that contains data about:

The nature of the complaint reported by the consumer
The Issue mentioned by the consumer
The complete description of the complaint of the consumer

We can use this data to build a Machine Learning model that can classify the nature of complaints reported by consumers in real time. You can download the dataset here. In the section below, I will take you through the task of Classifying Consumer Complaints with Machine Learning using Python.

Consumer Complaint Classification using Python

As the dataset we are using is large (more than 1 GB), I recommend you upload the data to your Google Drive and use a Google Colab notebook for this task. Now let’s start with the task of consumer complaint classification by importing the necessary Python libraries and the dataset:

import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import SGDClassifier
import nltk
import re
from nltk.corpus import stopwords
import string

# In Colab, Google Drive must be mounted before this path is readable
data = pd.read_csv("/content/drive/MyDrive/consumercomplaints.csv")
print(data.head())

   Unnamed: 0 Date received  \
0           0    2022-11-11
1           1    2022-11-23
2           2    2022-11-16
3           3    2022-11-15
4           4    2022-11-07

                                             Product  \
0                                           Mortgage
1  Credit reporting, credit repair services, or o...
2                                           Mortgage
3                        Checking or savings account
4                                           Mortgage

                  Sub-product                           Issue  \
0  Conventional home mortgage  Trouble during payment process
1            Credit reporting     Improper use of your report
2                 VA mortgage  Trouble during payment process
3            Checking account             Managing an account
4      Other type of mortgage  Trouble during payment process

                                       Sub-issue  \
0                                            NaN
1  Reporting company used your report improperly
2                                            NaN
3                                    Fee problem
4                                            NaN

                        Consumer complaint narrative
0                                                NaN
1                                                NaN
2                                                NaN
3  Hi, I have been banking with Wells Fargo for o...
4                                                NaN

If you have not uploaded your data to Google Drive, you can read the data using the command below:

data = pd.read_csv("consumercomplaints.csv")
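Because the file is large, it may help to load only the columns the model actually uses. Here is a minimal sketch of that idea using the `usecols` parameter of `pd.read_csv`. The tiny in-memory CSV is a made-up stand-in for the real file, and `needed_columns` is just an illustrative name.

```python
import io
import pandas as pd

# A tiny stand-in for consumercomplaints.csv (illustrative rows, not real data)
sample_csv = io.StringIO(
    "Unnamed: 0,Date received,Product,Issue,Consumer complaint narrative\n"
    "0,2022-11-11,Mortgage,Trouble during payment process,\n"
    "1,2022-11-15,Checking or savings account,Managing an account,Example narrative text\n"
)

# Read only the label column and the text column to reduce memory use
needed_columns = ["Product", "Consumer complaint narrative"]
small = pd.read_csv(sample_csv, usecols=needed_columns)

print(small.columns.tolist())
print(small.shape)
```

Keep in mind that loading fewer columns changes which rows a later `dropna()` removes, because only nulls in the loaded columns are counted.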

The dataset contains an unnamed index column (Unnamed: 0). I’ll remove this column and move on:

data = data.drop("Unnamed: 0",axis=1)

Now let’s check whether the dataset contains null values:

print(data.isnull().sum())

Date received                         0
Product                               0
Sub-product                      235294
Issue                                 0
Sub-issue                        683355
Consumer complaint narrative    1987977
dtype: int64

The dataset contains many null values. I’ll drop all the rows containing null values and move on:

data = data.dropna()
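Note that `dropna()` with no arguments removes a row if any column is null, including Sub-product and Sub-issue, which the model does not use. As an alternative sketch (not what this article does), the `subset` parameter keeps every row that has both a label and a narrative. The small DataFrame below is made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up rows: only the middle row has no nulls at all
toy = pd.DataFrame({
    "Product": ["Mortgage", "Debt collection", "Student loan"],
    "Sub-issue": [np.nan, "Fee problem", np.nan],
    "Consumer complaint narrative": ["late fee charged twice", "wrong amount", np.nan],
})

# Drop a row if ANY column is null (the behaviour used in this article)
dropped_any = toy.dropna()

# Drop a row only if the label or the narrative is null
dropped_subset = toy.dropna(subset=["Product", "Consumer complaint narrative"])

print(len(dropped_any))
print(len(dropped_subset))
```

Which variant is appropriate depends on whether you want the training set restricted to fully populated records.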

The Product column in the dataset contains the labels. Here the labels represent the nature of the complaints reported by the consumers. Let’s have a look at all the labels and their frequency:

print(data["Product"].value_counts())

Credit reporting, credit repair services, or other personal consumer reports    507582
Debt collection                                                                 192045
Credit card or prepaid card                                                      80410
Checking or savings account                                                      54192
Student loan                                                                     32697
Vehicle loan or lease                                                            19874
Payday loan, title loan, or personal loan                                         1008
Name: Product, dtype: int64
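The label counts are heavily imbalanced, which is easier to see as proportions. The sketch below rebuilds the counts printed above into a pandas Series instead of loading the dataset, then computes each label's share of the total.

```python
import pandas as pd

# Label counts copied from the value_counts() output in this article
counts = pd.Series({
    "Credit reporting, credit repair services, or other personal consumer reports": 507582,
    "Debt collection": 192045,
    "Credit card or prepaid card": 80410,
    "Checking or savings account": 54192,
    "Student loan": 32697,
    "Vehicle loan or lease": 19874,
    "Payday loan, title loan, or personal loan": 1008,
})

# Share of each label in the cleaned dataset, rounded to three decimals
shares = (counts / counts.sum()).round(3)
print(shares)
```

With more than half of the complaints in one class, accuracy alone can look good even if the rare classes are predicted poorly.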

Training Consumer Complaint Classification Model

The Consumer complaint narrative column contains the complete description of the complaints reported by the consumers. I will clean and prepare this column before using it in a Machine Learning model (you can learn more about this process here):

nltk.download('stopwords')
stemmer = nltk.SnowballStemmer("english")
stopword = set(stopwords.words('english'))

def clean(text):
    # Lowercase, then strip bracketed text, URLs, HTML tags, punctuation,
    # newlines, and any token that contains a digit
    text = str(text).lower()
    text = re.sub(r'\[.*?\]', '', text)
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    text = re.sub(r'<.*?>+', '', text)
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub(r'\n', '', text)
    text = re.sub(r'\w*\d\w*', '', text)
    # Remove English stopwords, then stem the remaining words
    text = [word for word in text.split(' ') if word not in stopword]
    text = " ".join(text)
    text = [stemmer.stem(word) for word in text.split(' ')]
    text = " ".join(text)
    return text

data["Consumer complaint narrative"] = data["Consumer complaint narrative"].apply(clean)
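To see what the regular-expression steps in `clean` do, here is a small sketch that applies only those steps to a made-up sentence. Stopword removal and stemming are left out so the example runs without downloading NLTK data. `strip_noise` is a hypothetical helper name, not part of the original code.

```python
import re
import string

def strip_noise(text):
    """Apply only the regex steps from clean(); no stopwords or stemming."""
    text = str(text).lower()
    text = re.sub(r'\[.*?\]', '', text)                # bracketed text
    text = re.sub(r'https?://\S+|www\.\S+', '', text)  # URLs
    text = re.sub(r'<.*?>+', '', text)                 # HTML tags
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)  # punctuation
    text = re.sub(r'\n', '', text)                     # newlines
    text = re.sub(r'\w*\d\w*', '', text)               # tokens containing digits
    return text

# Made-up input, for illustration only
sample = "I paid $99.00 on <b>Monday</b>, see https://example.com [ref]!"
print(strip_noise(sample).split())
```

Notice that the dollar amount disappears entirely: punctuation removal turns it into a digit-only token, and the last step removes every token that contains a digit.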

Now, let’s convert the text into numerical features with CountVectorizer and split the data into training and test sets:

data = data[["Consumer complaint narrative", "Product"]]
x = np.array(data["Consumer complaint narrative"])
y = np.array(data["Product"])

cv = CountVectorizer()
X = cv.fit_transform(x)
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.33,
                                                    random_state=42)
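As a rough illustration of what `CountVectorizer` produces, the sketch below fits it on three made-up cleaned complaints and prints the learned vocabulary and the count matrix. The texts are invented for this example.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up texts that look like the output of the cleaning step
docs = [
    "late fee charg account",
    "credit report error",
    "account fee error",
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)  # sparse matrix, one row per document

# vocabulary_ maps each token to its column index (columns follow sorted token order)
print(sorted(vectorizer.vocabulary_))
print(counts.toarray())
```

Each row is a complaint and each column counts how often one vocabulary word appears in it. On the real dataset the matrix stays sparse, which is why the article passes it to the model without converting it to a dense array.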

Now, let’s train the Machine Learning model using the Stochastic Gradient Descent classification algorithm:

sgdmodel = SGDClassifier()
sgdmodel.fit(X_train, y_train)
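The test split created earlier is never scored in this article. One way to check how well a model generalizes is to measure its accuracy on held-out data. The sketch below does this on a tiny made-up dataset, so the number it prints says nothing about the real dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Made-up cleaned texts and labels, for illustration only
texts = [
    "credit report show wrong account", "error credit report not fix",
    "credit bureau report inaccur", "report credit score mistak",
    "debt collector call everi day", "collector harass old debt",
    "debt collect letter not mine", "collector threaten debt lawsuit",
] * 5
labels = (["Credit reporting"] * 4 + ["Debt collection"] * 4) * 5

features = CountVectorizer().fit_transform(texts)
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.33, random_state=42, stratify=labels
)

model = SGDClassifier(random_state=42)
model.fit(X_train, y_train)

# Score the model on rows it did not see during training
predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print("Test accuracy:", accuracy)
```

With the variables from this article, the equivalent check would be `accuracy_score(y_test, sgdmodel.predict(X_test))`.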

Now, let’s use our trained model to make predictions:

user = input("Enter a Text: ")
# Clean the input the same way as the training text before vectorizing
features = cv.transform([clean(user)])
output = sgdmodel.predict(features)
print(output)

Enter a Text: On XXXX/XXXX/2022, I called Citi XXXX XXXX XXXX XXXX XXXX Customer Service at XXXX. I did not want to pay {$99.00} for the next year membership and wanted to cancel my card account. A customer service representative told me if I pay the {$99.00} membership fee and spending {$1000.00} in 3 months, I can get XXXX mileage reward points of XXXX XXXX. I believed what he said and paid {$99.00} membership fee on XXXX/XXXX/2022. I spent more than {$1000.00} in 3 months since XXXX/XXXX/2022. On XXXX/XXXX/2022, I called the card Customer Service about my reward mileage points. I was total the reward mileage points are NOT XXXX. I can only get XXXX mileage points instead. I believe that the Citi XXXX XXXX XXXX XXXX XXXX Customer Service cheated me. This is business fraud!
['Credit card or prepaid card']

user = input("Enter a Text: ")
# Clean the input the same way as the training text before vectorizing
features = cv.transform([clean(user)])
output = sgdmodel.predict(features)
print(output)

Enter a Text: Investigation took more than 30 days and nothing was changed when clearly there are misleading, incorrect, inaccurate items on my credit report..i have those two accounts attached showing those inaccuracies... I need them to follow the law because this is a violation of my rights!! The EVIDENCE IS IN BLACK AND WHITE ....
['Credit reporting, credit repair services, or other personal consumer reports']

So this is how you can use Machine Learning for the task of Classifying Consumer Complaints using Python.

Summary

Consumer Complaint Classification is helpful for consumer care departments because they receive thousands of complaints daily, and classifying them helps identify the complaints that need to be resolved first to reduce the harm to the consumer. I hope you liked this article on Classifying Consumer Complaints with Machine Learning using Python. Feel free to ask valuable questions in the comments section below.
