In [1]:
from IPython.display import HTML

HTML('''<script>
code_show=true; 
function code_toggle() {
 if (code_show){
 $('div.input').hide();
 } else {
 $('div.input').show();
 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle on/off the raw code."></form>''')
Out[1]:

CLASSIFICATION OF LEGAL QUESTIONS

INTRODUCTION

In this report I will discuss the steps I took to build a classification model for legal questions. My goal for this project was to build a classification model that could accurately classify legal questions into different categories, pointing to the type of lawyers required to handle a particular case. The potential users of this model would be able to post a question, decribed in thier own words, and the model would output a legal category.

To build the classification model, I used data obtained from a free source website. On the website users from around the country receive free legal advice by posting questions into 18 different legal forums.

The question title and details as well as the corresponding legal forum made up my dataset for constructing the classification model.

The dataset consists of 126567 questions, grouped into 18 forums. The different legal forums are: 1) Auto accident, 2) Bankruptcy, 3) Business, 4) Collections & Debt, 5) Consumer & Lemon, 6) Child Custody, 7) Criminal Defense, 8) Divorce, 9) DUI & DWI, 10) Employment & Labor, 11) Immigration, 12) Insurance, 13) Landlord & Tenant, 14) Medical Malpractice, 15) Personal Injury, 16) Real Estate, 17)Traffic, and 18) Wills,Trust & Probate.

There are 3 columns:

  • Category - This is the name of the legal forum
  • Titles - This is a brief description of the question
  • Questions - This is a detailed description of the question

LOADING DATA AND PYTHON PACKAGES

Loading the required packages

Several packages from different libraries were used in this project. I made use of a number of packages from sklearn library. Sklearn library contains many useful packages for features extraction, modelling, and model evaluation. I used scipy for statistics, matplotlib for making plots and sqlalchemy for working with postgres database.

In [2]:
import pandas as pd
import random
import numpy as np
import warnings
warnings.filterwarnings('ignore')
from sqlalchemy import create_engine
from collections import Counter
from nltk.corpus import stopwords
from nltk import ngrams
import matplotlib.pyplot as plt
from scipy import stats
import re
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_curve, auc
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
from sklearn.feature_extraction.text import CountVectorizer
##from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

##import string
import itertools
##from sklearn.metrics import f1_score
from IPython.display import Javascript
from IPython.core.display import Image, display
import base64

% matplotlib inline

Load data from postgres database

A postgres database was built to store the dataset obtained from the website. The database 'classificationdata' which contains the the dataset is loaded into juptyter for use in this project.

In [3]:
## Load the dataset from postgres database
engine = create_engine('postgresql://postgres:Edoamen1@localhost:5433/LegalQuestions')

legal = pd.read_sql_query('select * from legal."classificationdata"',con=engine)

DATA CLEANING

Based on my observation of the website, some users didn't provide a question title, while some didn't include question detail. So, in order to have as much description of the question as possible, it was best to merge both the titles and questions columns into one ('description') column.

So my current dataset contains 2 columns:

  • Category
  • Description

Since, the goal is to classify a question into a legal category, the category column is the target feature or dependent feature. While, the description column is the independent feature. The description column will be engineered to generate various independent features for use in the model.

In [4]:
## Replacing missing title and question details with space
legal = legal.replace(np.nan, '', regex=True)

## Merging the title and question text columns
legal['description'] = legal['titles'].str.cat(legal['questions'], sep=' ')

## Select required columns for the classification text
data = legal[['category', 'description']]

Splitting data into training, validation and testing set

The next thing I did was split the dataset into 3 parts: training, validation and testing set. The training set will be used for exploratory analysis and to train models during the development stage. The validation set will be used to tune the model hyperparameters in order to improve accuracy and avoid overfitting. The model will not run the testing set during this development stage. When I believe I have built the best model possible, I will test the accuracy using the testing set. This testing step will give me an estimate of the out-of-sample accuracy of my model.

It is important to split ones dataset into these 3 sets whenever performing machine learning tasks. This is done to prevent building a model that is overfit to the dataset and thus cannot be generalized. An idea model should accurately classify a new set of legal questions.

Since the dataset was already shuffled prior to storing in the database, I didn't reshuffle. I split the dataset into 60% training set, 20% validation set, and 20% testing set. This is a commonly used ratio. I made a development set which is a combination of the training and validation set. The development set is all the data that will be involved in building and tuning hyperparameters of the models.

In [5]:
## merge all the text (i.e. titles and question details) into one list
doc = data['description'].tolist()

## Split data into training set, validation set and testing set
train_doc = doc[:75958]
val_doc = doc[75958:101278]
test_doc = doc[101278:]

## Merge training and validation set 
dev_doc = doc[:101278]
In [6]:
## Split target 
##target = data['category_id'].tolist()
target = data['category']

## Split data into training set, validation set and testing set
train_target = target.iloc[:75958].reset_index(drop = True)
val_target = target.iloc[75958:101278].reset_index(drop = True)
test_target = target.iloc[101278:].reset_index(drop = True)

dev_target = target.iloc[:101278].reset_index(drop = True)

Features Engineering

After splitting the data, I will only make use of the training set for features engineering and exploratory analysis. Again this is to avoid overfitting.

The next step was to generate features for the model to train on. Recall that the model should be able to take as input a question description, which is text, and output a category. Texts are made up of words, phrases, numbers and punctuations. So, the model should be able to identify certain words or phrases and associate them with a legal category. Thus, the most useful features would be unigrams and bigrams.

To engineer these features, I began by using regular expressions to split each description into unigrams or tokens, at the same time removing punctuations except apsotrophe and hyphens. Regular expressions is a very useful tool for performing task such as this. The expression used would transform a sentence such as:

  "I've been in the office all day, and I need a long-break."

into a list of words shown below:

["I've", 'been', 'in', 'the', 'office', 'all', 'day', 'and', 'I', 'need', 'a', 'long-break']

After splitting the text into unigrams, I changed all the unigrams to lower case. Changing to lowercase will avoid the model counting words like 'Process' and 'process' as different words.

In [7]:
## Change text to tokens by splitting word, removing punctuations except apostrophe and hyphens
train_tokens = [[] for i in range(len(train_doc))]

for i in range(len(train_doc)):
    rgx = re.compile("(\w[\w'|-]*\w|\w)")
    train_tokens[i] = rgx.findall(train_doc[i])
    
## Change tokens to lower case
train_tokens_lower = [[] for i in range(len(train_tokens))]

for i in range(len(train_tokens)):
    for w in train_tokens[i]:
        train_tokens_lower[i].append(w.lower())
        
## create dataframe 
train_tokens_data = pd.DataFrame()

train_tokens_data['category'] = train_target
train_tokens_data['token'] = train_tokens_lower

Since the model would be using the tokens as features to determine the legal categories, we can reduce the number of features by removing words that are commonly found in any english statement, and most especially words that are common to all legal categories. These common words can be viewed as near-zero variance features.

The words commonly found in any english statement are in the english stopwords dictionary. To determine the words common to all legal categories, I started by making lists of all the unique words in each legal category. Then I made a count of how many legal category each word is found. Finally I made a high_freq_tokens list containing all the unique words that were present in 18 categories.

As expected some of the words in the high_freq_tokens list are also present in the english stopwords dictionary. However, some words such as legal, insurance and court, are not commonly found in any english statement, but are commonly used when describing any legal case.

A combination of the words from the english stopwords dictionary and the high_freq_tokens list made up my custom stop words list. All the words in the custom stop words list were removed from the unigrams as they would not make useful features for the model.

After removing the custom stop words, I proceeded to creating bigrams. At this stage my independent features are no longer long sentences, but rather unigrams and bigrams.

In [8]:
## defined function to make set of unique unigrams 
def tokens(data):
    tokens = data['token'].tolist()
    tokens = list(itertools.chain.from_iterable(tokens))
    unique_tokens = set(tokens)
    
    return unique_tokens
In [9]:
## subset the different legal categories
auto_words = train_tokens_data[train_tokens_data['category'] == 'auto']
bankruptcy_words = train_tokens_data[train_tokens_data['category'] == 'bankruptcy']
business_words = train_tokens_data[train_tokens_data['category'] == 'business']
consumer_words = train_tokens_data[train_tokens_data['category'] == 'consumer']
criminal_words = train_tokens_data[train_tokens_data['category'] == 'criminal']
custody_words = train_tokens_data[train_tokens_data['category'] == 'custody']
debt_words = train_tokens_data[train_tokens_data['category'] == 'debt']
divorce_words = train_tokens_data[train_tokens_data['category'] == 'divorce']
dui_words = train_tokens_data[train_tokens_data['category'] == 'dui']
estate_words = train_tokens_data[train_tokens_data['category'] == 'estate']
immigration_words = train_tokens_data[train_tokens_data['category'] == 'immigration']
injury_words = train_tokens_data[train_tokens_data['category'] == 'injury']
insurance_words = train_tokens_data[train_tokens_data['category'] == 'insurance']
labor_words = train_tokens_data[train_tokens_data['category'] == 'labor']
medical_words = train_tokens_data[train_tokens_data['category'] == 'medical']
tenant_words = train_tokens_data[train_tokens_data['category'] == 'tenant']
traffic_words = train_tokens_data[train_tokens_data['category'] == 'traffic']
wills_words = train_tokens_data[train_tokens_data['category'] == 'wills']

## make sets of unique words in each category
unique_auto_tokens = tokens(auto_words)
unique_bankruptcy_tokens = tokens(bankruptcy_words)
unique_business_tokens = tokens(business_words)
unique_consumer_tokens = tokens(consumer_words)
unique_criminal_tokens = tokens(criminal_words)
unique_custody_tokens = tokens(custody_words)
unique_debt_tokens = tokens(debt_words)
unique_divorce_tokens = tokens(divorce_words)
unique_dui_tokens = tokens(dui_words)
unique_estate_tokens = tokens(estate_words)
unique_immigration_tokens = tokens(immigration_words)
unique_injury_tokens = tokens(injury_words)
unique_insurance_tokens = tokens(insurance_words)
unique_labor_tokens = tokens(labor_words)
unique_medical_tokens = tokens(medical_words)
unique_tenant_tokens = tokens(tenant_words)
unique_traffic_tokens = tokens(traffic_words)
unique_wills_tokens = tokens(wills_words)

## create list of all the sets of unique words
all_unique_tokens = [unique_auto_tokens, unique_bankruptcy_tokens, unique_business_tokens, unique_consumer_tokens,
                    unique_criminal_tokens, unique_custody_tokens, unique_debt_tokens, unique_divorce_tokens, unique_dui_tokens,
                    unique_estate_tokens, unique_immigration_tokens, unique_injury_tokens, unique_insurance_tokens, unique_labor_tokens,
                    unique_medical_tokens, unique_tenant_tokens, unique_traffic_tokens, unique_wills_tokens ]
all_unique_tokens = list(itertools.chain.from_iterable(all_unique_tokens))


## make list of words that appear in all 18 legal categories
high_freq_tokens = [k for k, v in Counter(all_unique_tokens).items() if v > 17 ]

## add high frequency tokens to the stop words list to make custom stop words list
stop_words = set(stopwords.words('english'))
custom_stop_words = [stop_words, high_freq_tokens]
custom_stop_words = list(itertools.chain.from_iterable(custom_stop_words))
In [10]:
## remove custom stopwords
train_clean_words = [[] for i in range(len(train_tokens_lower))]

for i in range(len(train_tokens_lower)):
    for w in train_tokens_lower[i]:
        if w not in custom_stop_words:
            train_clean_words[i].append(w)
            
## Create bigrams   
train_bigrams = [[] for i in range(len(train_clean_words))]

for i in range(len(train_clean_words)):
    train_bigrams[i] = ngrams(train_clean_words[i], 2)
    

## create features dataframe 
features = pd.DataFrame()

features['category'] = train_target
features['unigrams'] = train_clean_words
features['bigrams'] = train_bigrams

EXPLORATORY ANALYSIS

Still using the training dataset, now that I have features for classification, I decided to do some exploratory analysis. The goal of this exploratory analysis, is to find out if there are any patterns or characteristics of my data that may affect my classification model.

Classification of a legal question to a specific legal category may sometimes be tricky. The challenge is because a particular legal case could span over multiple legal categories. For example, a hit and run, auto accident that resulted in injury. This case could be classified under any of the following categories: Auto accident, Criminal or Injury. Another example, a couple gets divorced, and have to make custody arrangement for their 2 yrs old son. This case could be classified under the categories custody or divorce.

Following from the first example, we would expect so see similar terms used to describe legal questions in the auto, criminal and injury categories. From the second example, we would expect high similarity in terms used to describe legal questions in the custody and divorce categories.

In order to know which legal categories are similar, I calculated a similarity ratio between pairs of legal categories. The similarity ratio was calculated by first making a list of all the unique unigrams and bigrams in each category. Then taking pairs of categories, I counted the number of similar unigrams and bigrams, and divided this by the number of unique unigrams and bigrams.

For example, the similarity ratio of auto to bankruptcy is: Number of unique similar unigrams and bigrams in auto and bankruptcy divided by the number of unique unigrams and bigrams in auto.

The table below shows legal categories with high similarity ratios. For example a legal question that a user classifies in the wills category could easily be classified by a model as estate, divorce or business.

This misclassification is more probable with higher similarity ratio. For example, questions under categories such as traffic, insurance, dui, consumer and bankruptcy with similarity ratios greater than 0.2 are more likely to be misclassified. While questions under categories such as labor and criminal with lower similarity ratios are less likely to be misclassified.

In [11]:
## defined function to create set of unique unigrams and bigrams
def unique_unigrams_bigrams(data):
    bigrams = data['bigrams']
    unigrams = data['unigrams']
    
    bigrams = list(itertools.chain.from_iterable(bigrams))    
    unigrams = list(itertools.chain.from_iterable(unigrams))
    unigrams_bigrams = [unigrams,bigrams]
    unigrams_bigrams = list(itertools.chain.from_iterable(unigrams_bigrams))
    unique_unigrams_bigrams = set(unigrams_bigrams)
    
    return unique_unigrams_bigrams
In [12]:
## subset the legal categories
auto = features[features['category'] == 'auto']
bankruptcy = features[features['category'] == 'bankruptcy']
business = features[features['category'] == 'business']
consumer = features[features['category'] == 'consumer']
criminal = features[features['category'] == 'criminal']
custody = features[features['category'] == 'custody']
debt = features[features['category'] == 'debt']
divorce = features[features['category'] == 'divorce']
dui = features[features['category'] == 'dui']
estate = features[features['category'] == 'estate']
immigration = features[features['category'] == 'immigration']
injury = features[features['category'] == 'injury']
insurance = features[features['category'] == 'insurance']
labor = features[features['category'] == 'labor']
medical = features[features['category'] == 'medical']
tenant = features[features['category'] == 'tenant']
traffic = features[features['category'] == 'traffic']
wills = features[features['category'] == 'wills']
  
    
## make sets of unique unigrams and bigrams in the different categories
unique_auto_unigrams_bigrams = unique_unigrams_bigrams(auto)
unique_bankruptcy_unigrams_bigrams = unique_unigrams_bigrams(bankruptcy)
unique_business_unigrams_bigrams = unique_unigrams_bigrams(business)
unique_consumer_unigrams_bigrams = unique_unigrams_bigrams(consumer)
unique_criminal_unigrams_bigrams = unique_unigrams_bigrams(criminal)
unique_custody_unigrams_bigrams = unique_unigrams_bigrams(custody)
unique_debt_unigrams_bigrams = unique_unigrams_bigrams(debt)
unique_divorce_unigrams_bigrams = unique_unigrams_bigrams(divorce)
unique_dui_unigrams_bigrams = unique_unigrams_bigrams(dui)
unique_estate_unigrams_bigrams = unique_unigrams_bigrams(estate)
unique_immigration_unigrams_bigrams = unique_unigrams_bigrams(immigration)
unique_injury_unigrams_bigrams = unique_unigrams_bigrams(injury)
unique_insurance_unigrams_bigrams = unique_unigrams_bigrams(insurance)
unique_labor_unigrams_bigrams = unique_unigrams_bigrams(labor)
unique_medical_unigrams_bigrams = unique_unigrams_bigrams(medical)
unique_tenant_unigrams_bigrams = unique_unigrams_bigrams(tenant)
unique_traffic_unigrams_bigrams = unique_unigrams_bigrams(traffic)
unique_wills_unigrams_bigrams = unique_unigrams_bigrams(wills)


## make a list of all the sets
unique_unigrams_bigrams = [unique_auto_unigrams_bigrams,unique_bankruptcy_unigrams_bigrams,unique_business_unigrams_bigrams,unique_consumer_unigrams_bigrams,
                         unique_criminal_unigrams_bigrams,unique_custody_unigrams_bigrams,unique_debt_unigrams_bigrams,unique_divorce_unigrams_bigrams,unique_dui_unigrams_bigrams,
                         unique_estate_unigrams_bigrams,unique_immigration_unigrams_bigrams,unique_injury_unigrams_bigrams,unique_insurance_unigrams_bigrams,unique_labor_unigrams_bigrams,
                         unique_medical_unigrams_bigrams,unique_tenant_unigrams_bigrams,unique_traffic_unigrams_bigrams,unique_wills_unigrams_bigrams]
In [13]:
## create dataframe with each row showing the similarity of a category to the others.

inter_ratio = pd.DataFrame()

for i in range(len(unique_unigrams_bigrams)):
    for j in range(len(unique_unigrams_bigrams)):
        inter = len(unique_unigrams_bigrams[i].intersection(unique_unigrams_bigrams[j]))/len(unique_unigrams_bigrams[i])
        inter_ratio.loc[i,j] = inter
        
inter_ratio.columns = sorted(list(set(data['category'])))
inter_ratio.index = sorted(list(set(data['category'])))

## Top 4 categories similar to index category
top_similar_category = pd.DataFrame(columns=['Category','Similar_1','Similar_2','Similar_3'])
for i in inter_ratio.columns:
    sim_category = inter_ratio.T.nlargest(4, i).index.tolist()
    top_similar_category.loc[i] = sim_category
    
top_similar_category = top_similar_category.reset_index(drop = True)
top_similar_category = top_similar_category.set_index('Category')
In [14]:
## Top inter_ratio values for each category
top_similarity_values = pd.DataFrame(columns = ["Category","Similar_1 ratio",'Similar_2 ratio','Similar_3 ratio'])
for i in range(len(inter_ratio.columns)):
    similarity_ratio = inter_ratio.iloc[i].T.nlargest(4, keep = "first").values.tolist()
    top_similarity_values.loc[i] = similarity_ratio
top_similarity_values = round(top_similarity_values.drop(['Category'], axis =1), 2)
In [15]:
## Dataframe of similar categories and their similarity ratio
category_simialrity = pd.DataFrame(columns = ['Category','Similar_1','Similar_1 ratio','Similar_2',"Similar_2 ratio",'Similar_3',"Similar_3 ratio"])
category_simialrity['Category'] = top_similar_category.index
category_simialrity['Similar_1'] = top_similar_category['Similar_1'].values
category_simialrity['Similar_1 ratio'] = top_similarity_values['Similar_1 ratio'].values
category_simialrity['Similar_2'] = top_similar_category['Similar_2'].values
category_simialrity['Similar_2 ratio'] = top_similarity_values['Similar_2 ratio'].values
category_simialrity['Similar_3'] = top_similar_category['Similar_3'].values
category_simialrity['Similar_3 ratio'] = top_similarity_values['Similar_3 ratio'].values

category_simialrity = category_simialrity.set_index('Category')

print(category_simialrity)
            Similar_1  Similar_1 ratio Similar_2  Similar_2 ratio Similar_3  \
Category                                                                      
auto           injury             0.14     labor             0.13  criminal   
bankruptcy     estate             0.23  business             0.20      debt   
business        labor             0.14    estate             0.13    tenant   
consumer     business             0.24    estate             0.20     labor   
criminal        labor             0.11  business             0.09    estate   
custody       divorce             0.25     labor             0.19  criminal   
debt         business             0.19    estate             0.18     labor   
divorce        estate             0.15     wills             0.14     labor   
dui          criminal             0.27     labor             0.20      auto   
estate         tenant             0.14  business             0.10     wills   
immigration   divorce             0.19     labor             0.19  criminal   
injury          labor             0.16  criminal             0.14      auto   
insurance       labor             0.23    estate             0.22  business   
labor        business             0.08  criminal             0.07    estate   
medical         labor             0.19    injury             0.19  criminal   
tenant         estate             0.18     labor             0.11  business   
traffic      criminal             0.37      auto             0.34     labor   
wills          estate             0.15   divorce             0.12  business   

             Similar_3 ratio  
Category                      
auto                    0.13  
bankruptcy              0.20  
business                0.11  
consumer                0.20  
criminal                0.09  
custody                 0.18  
debt                    0.16  
divorce                 0.13  
dui                     0.17  
estate                  0.10  
immigration             0.17  
injury                  0.12  
insurance               0.21  
labor                   0.07  
medical                 0.15  
tenant                  0.11  
traffic                 0.29  
wills                   0.11  

The table above suggests that a classification model may have a challenging time accurately classifying between the categories in each row.

BUILDING TEXT CLASSIFICATION MODELS

At this stage i'm ready to start building classification models. But just before I get started, since this is a multiclass problem (i.e. the target has multiple classes), it is best I perform one-hot-encoding on the target column.

Some machine learning algorithims are less efficient when working with strings, they require numbers. To avoid such problems down the road, I performed integer coding of the target feature containing the categories. The codes are in alphabetical order (i.e. 0 - auto, and 17 - wills). However, since there is no natural ordinal relationship between the categories, it is best to avoid any error from having the model attempting to learned from a false relationship.

One hot encoding changes the integer encoded data to binary form of length equal to the number of classes. Only the appropriate class is coded 1 while the others are 0.

I used the training set target to fit a label encoder and one-hot encoder, then I tranformed the targets of all the datasets(i.e. training, validation, and testing sets) using the fits.

In [16]:
## Integer encode
le = LabelEncoder()
train_category_id = le.fit_transform(train_target)
train_category_id = pd.DataFrame(train_category_id)

val_category_id = le.fit_transform(val_target)
val_category_id = pd.DataFrame(val_category_id)

test_category_id = le.fit_transform(test_target)
test_category_id = pd.DataFrame(test_category_id)

dev_category_id = le.fit_transform(dev_target)
dev_category_id = pd.DataFrame(dev_category_id)



## OneHot encode
enc = OneHotEncoder(sparse=False)
train_target_2 = enc.fit_transform(train_category_id)  
val_target_2 = enc.transform(val_category_id)
test_target_2 = enc.transform(test_category_id)

dev_target_2 = enc.transform(dev_category_id)

LOGISTIC REGRESSION

With the targets in the right format, I could then proceed to building classification models. Each of the classification models will be built with the OneVsRestClassifier. The OneVsRestClassifier fits one classifier per class, so the classification model is a set of n binary classifiers, where n is the number or classes.

As a starting point, I considered building a simple classification model such as a logistic regression model. For the features, I made use of unigrams and bigrams as detailed earlier, then I generated a term frequency matrix. A term frequency matrix is a count of the number of appearance of each term (or feature) in each document. I trained the model using the training set. While training the model, it was determined that including bigrams in the features added nothing to the performance of the model, so I only used unigrams. Then I performed cross_validation using the development set(i.e. a combination of the training and validation set). By comparing the accuracy from cross-validation to the accuracy on the training set, I was able to do regularization to avoid overfitting. When I felt comfortable with the model, I plotted a learning curve to check if my model had high bias or high variance.

The learning curve shown below, shows that the model has no problem with variance or bias. After getting the logistic model as best as possible, I proceeded to building models with more complex algorithms.

Model Accuracy

In [21]:
## logistic regression model with OneVsRestClassifier
text_clf1 = Pipeline([('vect', CountVectorizer(token_pattern="(\w[\w'|-]*\w|\w)",stop_words=custom_stop_words,ngram_range=(1, 1))),
                     ('tfidf', TfidfTransformer(use_idf = False)),
                     ('clf', OneVsRestClassifier(LogisticRegression(C = 0.5, random_state = 123))),
                    ])

## fit the model to the training set
text_clf1 = text_clf1.fit(train_doc, train_target_2)
## retrieve the probability values of y = 1, from each binary classifier
prob_train = text_clf1.predict_proba(train_doc)
## determine predicted class based on classifier with max. probability value
pred_train = np.argmax(prob_train, axis = 1)
## calculate classification accuracy 
print("Accuracy on training set: %0.2f" % accuracy_score(train_category_id, pred_train))


## Perform cross-validation
## retrieve the probability values of y=1, from each binary classifier, for when each data point was in the testing set
cross_val_prob1 = cross_val_predict(text_clf1, dev_doc, dev_target_2, cv=3, method= 'predict_proba')
## determine predicted class based on classifier with max. probability value
cross_val_pred1 = np.argmax(cross_val_prob1, axis = 1)
## calculate the cross-validation accuracy.
print("Cross_validation accuracy: %0.2f" % accuracy_score(dev_category_id, cross_val_pred1))
Accuracy on training set: 0.77
Cross_validation accuracy: 0.74
In [18]:
## Learning curve for logistic regression model
def learning_curve_LogR():
    text_clf1 = Pipeline([('vect', CountVectorizer(token_pattern="(\w[\w'|-]*\w|\w)",stop_words=custom_stop_words,ngram_range=(1, 1))),
                     ('tfidf', TfidfTransformer(use_idf = False)),
                     ('clf', OneVsRestClassifier(LogisticRegression(C = 0.5, random_state = 123))),
                    ])
    
    ## use development set to train
    X_train = dev_doc
    y_train = dev_target_2
    y_train_id = dev_category_id
    X_test = test_doc
    y_test_id = test_category_id
    
    
    index = [2, 5000, 25000, 50000, 80000, 101278]
   
 
    curve_data = pd.DataFrame(columns = ['number of training examples', 'training error', 'testing error'])
    
    no_of_training_examples = [] 
    training_error = []
    testing_error = []
    
    for i in range(len(index)):
        ## randomly pick questions for training
        random.seed(12)
        ind = random.sample(range(0,len(X_train)), index[i])
        
        random.seed(12)
        new_X_train = []
        new_X_train += random.sample(X_train, index[i])
        
        
        new_y_train = y_train[ind]
       
        new_y_train_id = y_train_id.iloc[ind].reset_index(drop=True)
        
   
        ## fit to training examples 
        no_of_training_examples.append(index[i])
        text_clf1 = text_clf1.fit(new_X_train, new_y_train)
        prob_train = text_clf1.predict_proba(new_X_train)
        pred_train = np.argmax(prob_train, axis = 1)
        ## classification error on training examples
        train_error = 1 - accuracy_score(new_y_train_id, pred_train)
        training_error.append(train_error)
        ## use fit to classify testing set
        prob_test = text_clf1.predict_proba(X_test)
        pred_test = np.argmax(prob_test, axis = 1)
        ## classification error on testing set
        test_error = 1 - accuracy_score(y_test_id, pred_test)
        testing_error.append(test_error)
        
    
    ## data for learning curve
    curve_data['no of training examples'] = index
    curve_data['training error'] = training_error
    curve_data['testing error'] = testing_error
    
    ## plot learning curve
    plt.figure(figsize = (10, 8))
    plt.title("Logistic Regression", fontsize = 20)
    
    # plot the average training and test score lines at each training set size
    plt.plot(curve_data['no of training examples'], curve_data['training error'], 'o-', color="r", label="Training")
    plt.plot(curve_data['no of training examples'], curve_data['testing error'], 'o-', color="g", label="Testing")
    plt.legend(loc='upper right', fontsize = 20)
    plt.xlabel("Training examples", fontsize = 14)
    plt.ylabel("Error", fontsize = 14)
    
    
    # box-like grid
    plt.grid()
    
    return plt.show()

learning_curve_LogR()
    

LINEAR CLASSIFICATION MODEL WITH STOCHASTIC GRADIENT DESCENT TRAINING

The second model I built was a linear classification model using stochastic gradient descent to optimize. Stochastic gradient descent algorithm is used to train the model to minimize a cost function by updating the gradient of the cost function after each training sample. I used the log cost function (i.e. loss = "log") in this model. The log cost function is the same cost function used in the logistic regression model.

I built this model using the same features as in the logistic regression model. Following the same steps as I took when building the previous model, I built this linear model classifier with SGD training, and performed regularization. The learning curve below, shows no problem with high bias or high variance.

Model Accuracy

In [22]:
## Linear Classifier with Stochastic Gradient Descent Training
text_clf2 = Pipeline([('vect', CountVectorizer(token_pattern="(\w[\w'|-]*\w|\w)",stop_words=custom_stop_words,ngram_range=(1, 1))),
                     ('tfidf', TfidfTransformer(use_idf = False)),
                     ('clf', OneVsRestClassifier(SGDClassifier(loss='log',alpha=0.00003, random_state = 123))),
                    ])

## fit the model to the training set
text_clf2 = text_clf2.fit(train_doc, train_target_2)
## retrieve the probability values of y = 1, from each binary classifier
prob_train2 = text_clf2.predict_proba(train_doc)
## determine predicted class based on classifier with max. probability value
pred_train2 = np.argmax(prob_train2, axis = 1)
## calculate classification accuracy 
print("Accuracy on training set: %0.2f" % accuracy_score(train_category_id, pred_train2))


## Perform cross-validation
## retrieve the probability values of y=1, from each binary classifier, for when each data point was in the testing set
cross_val_prob2 = cross_val_predict(text_clf2, dev_doc, dev_target_2, cv=3, method= 'predict_proba')
## determine predicted class based on classifier with max. probability value
cross_val_pred2 = np.argmax(cross_val_prob2, axis = 1)
## calculate the cross-validation accuracy.
print("Cross_validation accuracy: %0.2f" % accuracy_score(dev_category_id, cross_val_pred2))
Accuracy on training set: 0.76
Cross_validation accuracy: 0.74
In [23]:
## Learning curve for linear classifier with stochastic gradient descent training
def learning_curve_SGD():
    text_clf2 = Pipeline([('vect', CountVectorizer(token_pattern="(\w[\w'|-]*\w|\w)",stop_words=custom_stop_words,ngram_range=(1, 1))),
                     ('tfidf', TfidfTransformer(use_idf = False)),
                     ('clf', OneVsRestClassifier(SGDClassifier(loss='log',alpha=0.00003, random_state = 123))),
                    ])
    
    
    ## use development set to train
    X_train = dev_doc
    y_train = dev_target_2
    y_train_id = dev_category_id
    X_test = test_doc
    y_test_id = test_category_id
    
    
    index = [2, 5000, 25000, 50000, 80000, 101278]
   
    curve_data = pd.DataFrame(columns = ['number of training examples', 'training error', 'testing error'])
    
    no_of_training_examples = [] 
    training_error = []
    testing_error = []
    
    for i in range(len(index)):
        ## randomly pick questions for training
        random.seed(12)
        ind = random.sample(range(0,len(X_train)), index[i])
        
        random.seed(12)
        new_X_train = []
        new_X_train += random.sample(X_train, index[i])
        
        
        new_y_train = y_train[ind]
       
        new_y_train_id = y_train_id.iloc[ind].reset_index(drop=True)
        
   
        ## fit to training examples
        no_of_training_examples.append(index[i])
        text_clf2 = text_clf2.fit(new_X_train, new_y_train)
        prob_train = text_clf2.predict_proba(new_X_train)
        pred_train = np.argmax(prob_train, axis = 1)
        ## classification error on training examples
        train_error = 1 - accuracy_score(new_y_train_id, pred_train)
        training_error.append(train_error)
        ## use fit to classify testing set
        prob_test = text_clf2.predict_proba(X_test)
        pred_test = np.argmax(prob_test, axis = 1)
        ## classification error on testing set
        test_error = 1 - accuracy_score(y_test_id, pred_test)
        testing_error.append(test_error)
        
    ## data for learning curve
    curve_data['no of training examples'] = index
    curve_data['training error'] = training_error
    curve_data['testing error'] = testing_error
    
    ## plot learning curve
    plt.figure(figsize = (10,8))
    plt.title("Linear Classifier with Stochastic Gradient Descent Training", fontsize = 20)
    
    # plot the average training and test score lines at each training set size
    plt.plot(curve_data['no of training examples'], curve_data['training error'], 'o-', color="r", label="Training")
    plt.plot(curve_data['no of training examples'], curve_data['testing error'], 'o-', color="g", label="Testing")
    plt.legend(loc='upper right', fontsize = 20)
    plt.xlabel("Training examples", fontsize = 14)
    plt.ylabel("Error", fontsize = 14)
    
    
    # box-like grid
    plt.grid()
    
    return plt.show()

learning_curve_SGD()
    

MULTINOMIAL NAIVE BAYES CLASSIFICATION MODEL

The third model was a Multinomial Naive Bayes Classification model. This model is commonly used for text classification. This classifaction model applies Bayes' thoerom, and assume the features are independent. This assumption is valid for this project since our features are term frequency counts from independent unigrams.

Just like with the previous models, I trained the model using the training set, and performed cross-validation using the development set. Next, I used the smoothing parameter, alpha as the regularization parameter. An alpha greater or equal to zero prevents the model from overfitting to only features present in the training set.

Finally I plot a learning curve shown below. The learning curve shows no problem with high variance, however, the bias is slightly higher than in both the logistic regression and linear classifier with stochastic gradient descent training.

Model Performance

In [24]:
## Multinomial Naive Bayes Classifier
text_clf3 = Pipeline([('vect', CountVectorizer(token_pattern="(\w[\w'|-]*\w|\w)",stop_words=custom_stop_words,ngram_range=(1, 1))),
                     ('tfidf', TfidfTransformer(use_idf = False)),
                     ('clf', OneVsRestClassifier(MultinomialNB(alpha = 0.5))),
                    ])


## fit the model to the training set
text_clf3 = text_clf3.fit(train_doc, train_target_2)
## retrieve the probability values of y = 1, from each binary classifier
prob_train3 = text_clf3.predict_proba(train_doc)
## determine predicted class based on classifier with max. probability value
pred_train3 = np.argmax(prob_train3, axis = 1)
## calculate classification accuracy 
print("Accuracy on training set: %0.2f" % accuracy_score(train_category_id, pred_train3))


## Perform cross-validation
## retrieve the probability values of y=1, from each binary classifier, for when each data point was in the testing set
cross_val_prob3 = cross_val_predict(text_clf3, dev_doc, dev_target_2, cv=3, method= 'predict_proba')
## determine predicted class based on classifier with max. probability value
cross_val_pred3 = np.argmax(cross_val_prob3, axis = 1)## calculate the cross-validation accuracy.
## calculate the cross-validation accuracy.
print("Cross_validation accuracy: %0.2f" % accuracy_score(dev_category_id, cross_val_pred3))
Accuracy on training set: 0.74
Cross_validation accuracy: 0.70
In [25]:
## Learning curve for multinomial naive bayes classification model
def learning_curve_MNB():
    text_clf3 = Pipeline([('vect', CountVectorizer(token_pattern="(\w[\w'|-]*\w|\w)",stop_words=custom_stop_words,ngram_range=(1, 1))),
                     ('tfidf', TfidfTransformer(use_idf = False)),
                     ('clf', OneVsRestClassifier(MultinomialNB(alpha = 0.5))),
                    ])
    
    ## use development set to train
    X_train = dev_doc
    y_train = dev_target_2
    y_train_id = dev_category_id
    X_test = test_doc
    y_test_id = test_category_id
    
    
    index = [2, 5000, 25000, 50000, 80000, 101278]
   
    curve_data = pd.DataFrame(columns = ['number of training examples', 'training error', 'testing error'])
    
    no_of_training_examples = [] 
    training_error = []
    testing_error = []
    
    for i in range(len(index)):
        ## randomly pick questions for training
        random.seed(12)
        ind = random.sample(range(0,len(X_train)), index[i])
        
        random.seed(12)
        new_X_train = []
        new_X_train += random.sample(X_train, index[i])
        
        
        new_y_train = y_train[ind]
       
        new_y_train_id = y_train_id.iloc[ind].reset_index(drop=True)
        
   
        ## fit to training examples
        no_of_training_examples.append(index[i])
        text_clf3 = text_clf3.fit(new_X_train, new_y_train)
        prob_train = text_clf3.predict_proba(new_X_train)
        pred_train = np.argmax(prob_train, axis = 1)
        ## classification error on training examples
        train_error = 1 - accuracy_score(new_y_train_id, pred_train)
        training_error.append(train_error)
        ## use fit to classify testing set
        prob_test = text_clf3.predict_proba(X_test)
        pred_test = np.argmax(prob_test, axis = 1)
        ## classification error on testing set
        test_error = 1 - accuracy_score(y_test_id, pred_test)
        testing_error.append(test_error)
        
    ## data for learning curve
    curve_data['no of training examples'] = index
    curve_data['training error'] = training_error
    curve_data['testing error'] = testing_error
    
    ## plot learning curve
    plt.figure(figsize = (10,8))
    plt.title("Multinomial Naive Bayes Classifier", fontsize = 20)
    
    # plot the average training and test score lines at each training set size
    plt.plot(curve_data['no of training examples'], curve_data['training error'], 'o-', color="r", label="Training")
    plt.plot(curve_data['no of training examples'], curve_data['testing error'], 'o-', color="g", label="Testing")
    plt.legend(loc='upper right', fontsize = 14)
    plt.xlabel("Training examples", fontsize = 14)
    plt.ylabel("Error", fontsize = 14)
    
    
    # box-like grid
    plt.grid()
    
    return plt.show()
    
learning_curve_MNB()   

LINEAR SUPPORT VECTOR CLASSIFIER

The fourth model was a linear support vector classification model. This model uses a linear kernel, and is known to efficiently handle large number of samples.

Following the same steps, I trained the model using the training set, performed cross-validation, and performed regularization. Then I made the learning curve shown below. The learning curve shows no problem with variance, and it also shows similar bias to the previous linear classification models(i.e. Logistic regression and linear classifier with SGD).

Model Performance

In [26]:
## Linear SVC 
text_clf4 = Pipeline([('vect', CountVectorizer(token_pattern="(\w[\w'|-]*\w|\w)",stop_words=custom_stop_words,ngram_range=(1, 1))),
                     ('tfidf', TfidfTransformer(use_idf = False)),
                     ('clf', OneVsRestClassifier(LinearSVC(random_state=0, C = 0.04))),
                    ])


## fit the model to the training set
text_clf4 = text_clf4.fit(train_doc, train_target_2)
## retrieve the confidence score from each binary classifier
conf_score_train4 = text_clf4.decision_function(train_doc)
## determine classification based on classifier with highest confidence score
class_train4 = np.argmax(conf_score_train4, axis = 1)
## calculate classification accuracy 
print("Accuracy on training set: %0.2f" % accuracy_score(train_category_id, class_train4))


## Perform cross-validation
## retrieve the confidence score from each binary classifier, for when each data point was in the testing set
cross_val_conf4 = cross_val_predict(text_clf4, dev_doc, dev_target_2, cv=3, method= 'decision_function')
## determine classification based on classifier with highest confidence score
cross_val_pred4 = np.argmax(cross_val_conf4, axis = 1)
## calculate cross-validation accuracy
print("Cross_validation accuracy: %0.2f" % accuracy_score(dev_category_id, cross_val_pred4))
Accuracy on training set: 0.77
Cross_validation accuracy: 0.74
In [27]:
## Learning curve for linear support vector classification model
def learning_curve_SVC():
    text_clf4 = Pipeline([('vect', CountVectorizer(token_pattern="(\w[\w'|-]*\w|\w)",stop_words=custom_stop_words,ngram_range=(1, 2))),
                     ('tfidf', TfidfTransformer(use_idf = False)),
                     ('clf', OneVsRestClassifier(LinearSVC(random_state=0, C = 0.04))),
                    ])
    
    ## use development set to train
    X_train = dev_doc
    y_train = dev_target_2
    y_train_id = dev_category_id
    X_test = test_doc
    y_test_id = test_category_id
    
    
    index = [2, 1000, 5000, 10000, 50000, 80000, 101278]
   
    curve_data = pd.DataFrame(columns = ['number of training examples', 'training error', 'testing error'])
    
    no_of_training_examples = [] 
    training_error = []
    testing_error = []
    
    for i in range(len(index)):
        ## randomly pick questions for training
        random.seed(12)
        ind = random.sample(range(0,len(X_train)), index[i])
        
        random.seed(12)
        new_X_train = []
        new_X_train += random.sample(X_train, index[i])
        
        
        new_y_train = y_train[ind]
       
        new_y_train_id = y_train_id.iloc[ind].reset_index(drop=True)
        
   
        ## fit to training examples
        no_of_training_examples.append(index[i])
        text_clf4 = text_clf4.fit(new_X_train, new_y_train)
        prob_train = text_clf4.decision_function(new_X_train)
        pred_train = np.argmax(prob_train, axis = 1)
        ## classification error on training examples
        train_error = 1 - accuracy_score(new_y_train_id, pred_train)
        training_error.append(train_error)
        ## use fit to classify testing set
        prob_test = text_clf4.decision_function(X_test)
        pred_test = np.argmax(prob_test, axis = 1)
        ## classification error on testing set
        test_error = 1 - accuracy_score(y_test_id, pred_test)
        testing_error.append(test_error)
        
    ## data for learning curve
    curve_data['no of training examples'] = index
    curve_data['training error'] = training_error
    curve_data['testing error'] = testing_error
    
    ## plot learning curve
    plt.figure(figsize = (10,8))
    plt.title("Linear Support Vector Machine Classifier", fontsize = 20)
    
    # plot the average training and test score lines at each training set size
    plt.plot(curve_data['no of training examples'], curve_data['training error'], 'o-', color="r", label="Training")
    plt.plot(curve_data['no of training examples'], curve_data['testing error'], 'o-', color="g", label="Testing")
    plt.legend(loc='upper right', fontsize = 20)
    plt.xlabel("Training examples", fontsize = 14)
    plt.ylabel("Error", fontsize = 14)
    
    
    # box-like grid
    plt.grid()
    
    return plt.show()
    
learning_curve_SVC()  

RANDOM FOREST CLASSIFIER

The last model I built was a random forest classification model. Like with previous models I trained the model using the training set, and used the development set for cross-validation. Then I tuned the hyperparameter n_estimator which is the number of trees in the forest, and max_depth to achieve the best classification accuracy while avoiding overfitting.

The learning curve of this model shown below, show the absence of overfitting, however, there is more bias in this model than the previous 4 models.

In [28]:
## RandomForest Classifier

text_clf5 = Pipeline([('vect', CountVectorizer(token_pattern="(\w[\w'|-]*\w|\w)",stop_words=custom_stop_words,ngram_range=(1, 1))),
                     ('tfidf', TfidfTransformer(use_idf = False)),
                     ('clf', OneVsRestClassifier(RandomForestClassifier(n_estimators=500, criterion='gini',max_depth=20,random_state=0))),
                    ])

## fit the model to the training set
text_clf5 = text_clf5.fit(train_doc, train_target_2)
## retrieve the probability values of y = 1, from each binary classifier
prob_train5 = text_clf5.predict_proba(train_doc)
## determine predicted class based on classifier with max. probability value
pred_train5 = np.argmax(prob_train5, axis = 1)
## calculate classification accuracy
print("Accuracy on training set: %0.2f" % accuracy_score(train_category_id, pred_train5))


## Perform cross-validation
## retrieve the probability values of y=1, from each binary classifier, for when each data point was in the testing set
cross_val_prob5 = cross_val_predict(text_clf5, dev_doc, dev_target_2, cv=3, method= 'predict_proba')
## determine predicted class based on classifier with max. probability value
cross_val_pred5 = np.argmax(cross_val_prob5, axis = 1)
print("Cross_validation accuracy: %0.2f" % accuracy_score(dev_category_id, cross_val_pred5))
Accuracy on training set: 0.67
Cross_validation accuracy: 0.61
In [29]:
## Learning curve for random forest classification model
def learning_curve_RF():
    text_clf5 = Pipeline([('vect', CountVectorizer(token_pattern="(\w[\w'|-]*\w|\w)",stop_words=custom_stop_words,ngram_range=(1, 1))),
                     ('tfidf', TfidfTransformer(use_idf = False)),
                     ('clf', OneVsRestClassifier(RandomForestClassifier(n_estimators=500, criterion='gini',max_depth=20,random_state=0))),
                    ])
    
    ## use development set to train
    X_train = dev_doc
    y_train = dev_target_2
    y_train_id = dev_category_id
    X_test = test_doc
    y_test_id = test_category_id
    
    
    index = [2, 5000, 25000, 50000, 80000, 101278]
   
    curve_data = pd.DataFrame(columns = ['number of training examples', 'training error', 'testing error'])
    
    no_of_training_examples = [] 
    training_error = []
    testing_error = []
    
    for i in range(len(index)):
        ## randomly pick questions for training
        random.seed(12)
        ind = random.sample(range(0,len(X_train)), index[i])
        
        random.seed(12)
        new_X_train = []
        new_X_train += random.sample(X_train, index[i])
        
        
        new_y_train = y_train[ind]
       
        new_y_train_id = y_train_id.iloc[ind].reset_index(drop=True)
        
   
        ## fit to training examples
        no_of_training_examples.append(index[i])
        text_clf5 = text_clf5.fit(new_X_train, new_y_train)
        prob_train = text_clf5.predict_proba(new_X_train)
        pred_train = np.argmax(prob_train, axis = 1)
        ## classification error on training examples
        train_error = 1 - accuracy_score(new_y_train_id, pred_train)
        training_error.append(train_error)
        ## use fit to classify testing set
        prob_test = text_clf5.predict_proba(X_test)
        pred_test = np.argmax(prob_test, axis = 1)
        ## classification error on testing set
        test_error = 1 - accuracy_score(y_test_id, pred_test)
        testing_error.append(test_error)
        
    ## data for learning curve
    curve_data['no of training examples'] = index
    curve_data['training error'] = training_error
    curve_data['testing error'] = testing_error
    
    ## plot learning curve
    plt.figure(figsize = (10,8))
    plt.title("Random Forest Classifier", fontsize = 20)
    
    # plot the average training and test score lines at each training set size
    plt.plot(curve_data['no of training examples'], curve_data['training error'], 'o-', color="r", label="Training")
    plt.plot(curve_data['no of training examples'], curve_data['testing error'], 'o-', color="g", label="Testing")
    plt.legend(loc='upper right', fontsize = 14)
    plt.xlabel("Training examples", fontsize = 14)
    plt.ylabel("Error", fontsize = 14)
    
    
    # box-like grid
    plt.grid()
    
    return plt.show()
    
learning_curve_RF()   

With these 5 models I was convinced I had explored a large enough number of algorithms, and that I would find a suitable model amongst these selection.

To recap, the 5 text classification model built for this project are:

  • Logistic Regression Model
  • Linear Classifier with Stochastic Gradient Descent Training
  • Multinomial Naive Bayes Classification Model
  • Linear Support Vector Classification Model
  • Random Forest Classification Model

MODEL SELECTION

After building several text classification models, the next step was to choose which of the model would be ideal for this classification problem. My choice was based on the classification accuracy of the model on the testing set, as well as the model that could better generalize to new data. I also based my decision on the AUC values of the individual binary classifiers in each classification model.

Accuracy Values

In order to compare the performance of the different classification models against each order, I used each model to classify the testing set, then calculated the classification accuracy.

The classification accuracy of all the models are shown below. The random forest and multinormial naive bayes models have the lowest classification accuracy.

In [32]:
## Calculate classification accuracy for the 5 classification models
logR_prob = text_clf1.predict_proba(test_doc)
logR_pred = np.argmax(logR_prob, axis = 1)

SGD_prob = text_clf2.predict_proba(test_doc)
SGD_pred = np.argmax(SGD_prob, axis = 1)

MNB_prob = text_clf3.predict_proba(test_doc)
MNB_pred = np.argmax(MNB_prob, axis = 1)

SVC_prob = text_clf4.decision_function(test_doc)
SVC_pred = np.argmax(SVC_prob, axis = 1)

RF_prob = text_clf5.predict_proba(test_doc)
RF_pred = np.argmax(RF_prob, axis = 1)

Accuracy_logR = accuracy_score(test_category_id, logR_pred)
Accuracy_SGD = accuracy_score(test_category_id, SGD_pred)
Accuracy_MNB = accuracy_score(test_category_id, MNB_pred)
Accuracy_SVC = accuracy_score(test_category_id, SVC_pred)
Accuracy_RF = accuracy_score(test_category_id, RF_pred)

print("Classification Accuracy of testing set using LogR model: %0.2f" % Accuracy_logR)
print("Classification Accuracy of testing set using Linear_SGD model: %0.2f" % Accuracy_SGD)
print("Classification Accuracy of testing set using MNB model: %0.2f" % Accuracy_MNB)
print("Classification Accuracy of testing set using SVC model: %0.2f" % Accuracy_SVC)
print("Classification Accuracy of testing set using RF model: %0.2f" % Accuracy_RF)
Classification Accuracy of testing set using LogR model: 0.75
Classification Accuracy of testing set using Linear_SGD model: 0.74
Classification Accuracy of testing set using MNB model: 0.71
Classification Accuracy of testing set using SVC model: 0.75
Classification Accuracy of testing set using RF model: 0.60

Model Generalization

In order to get a visual ideal of how well each of the models are expected to generalize to new data, I made a plot of the classification accuracy score on the testing set vs the score on the developmental set. To get these accuracy values, each model was fit to the developmental set, then used to classify both the developmental and testing sets.

A model that gives similar classification accuracy on the testing set as it does on the development set, is expected to generalize well to new data set. Based on the model comparison plot shown below, the stochastic gradient descent classification model generalizes best compared to the others.

In [31]:
## Accuracy of Logistic Regression Model
text_clf1 = text_clf1.fit(dev_doc, dev_target_2)
LogR_y_dev_pred = text_clf1.predict_proba(dev_doc)
LogR_y_dev_pred = np.argmax(LogR_y_dev_pred, axis = 1)
LogR_y_test_pred = text_clf1.predict_proba(test_doc)
LogR_y_test_pred = np.argmax(LogR_y_test_pred, axis = 1)

a = accuracy_score(dev_category_id, LogR_y_dev_pred)
b = accuracy_score(test_category_id, LogR_y_test_pred)


## Accuracy of Linear Classifier with SGD Training
text_clf2 = text_clf2.fit(dev_doc, dev_target_2)
SGD_y_dev_pred = text_clf2.predict_proba(dev_doc)
SGD_y_dev_pred = np.argmax(SGD_y_dev_pred, axis = 1)
SGD_y_test_pred = text_clf2.predict_proba(test_doc)
SGD_y_test_pred = np.argmax(SGD_y_test_pred, axis = 1)

c = accuracy_score(dev_category_id, SGD_y_dev_pred)
d = accuracy_score(test_category_id, SGD_y_test_pred)


## Accuracy of Multinomial Naive Bayes Classifier
text_clf3 = text_clf3.fit(dev_doc, dev_target_2)
MNB_y_dev_pred = text_clf3.predict_proba(dev_doc)
MNB_y_dev_pred = np.argmax(MNB_y_dev_pred, axis = 1)
MNB_y_test_pred = text_clf3.predict_proba(test_doc)
MNB_y_test_pred = np.argmax(MNB_y_test_pred, axis = 1)

e = accuracy_score(dev_category_id, MNB_y_dev_pred)
f = accuracy_score(test_category_id, MNB_y_test_pred)


## Accuracy of Linear SVC 
text_clf4 = text_clf4.fit(dev_doc, dev_target)
SVC_y_dev_pred = text_clf4.decision_function(dev_doc)
SVC_y_dev_pred = np.argmax(SVC_y_dev_pred, axis = 1)
SVC_y_test_pred = text_clf4.decision_function(test_doc)
SVC_y_test_pred = np.argmax(SVC_y_test_pred, axis = 1)

g = accuracy_score(dev_category_id, SVC_y_dev_pred)
h = accuracy_score(test_category_id, SVC_y_test_pred)


## Accuracy of Random Forest 
text_clf5 = text_clf5.fit(dev_doc, dev_target)
RF_y_dev_pred = text_clf5.predict_proba(dev_doc)
RF_y_dev_pred = np.argmax(RF_y_dev_pred, axis = 1)
RF_y_test_pred = text_clf5.predict_proba(test_doc)
RF_y_test_pred = np.argmax(RF_y_test_pred, axis = 1)

i = accuracy_score(dev_category_id, RF_y_dev_pred)
j = accuracy_score(test_category_id, RF_y_test_pred)

## dateframe of accuracy score summary
Accuracy_Score_Summary = pd.DataFrame(columns = ['Model', 'Accuracy_Score in Developmental Set','Accuracy_Score in Testing Set'])

Accuracy_Score_Summary.loc[0] = ['Logistic Regression Model', a, b]
Accuracy_Score_Summary.loc[1] = ['Linear Classifier with SGD', c, d]
Accuracy_Score_Summary.loc[2] = ['Multinomial Naive Bayes', e, f]
Accuracy_Score_Summary.loc[3] = ['Linear SVC Model', g, h]
Accuracy_Score_Summary.loc[4] = ['Random Forest', i, j]
In [39]:
## Scatterplot of accuracy score in both training and testing set for the top 3 models
fig, ax = plt.subplots(figsize = (10,8))
ax.scatter(Accuracy_Score_Summary['Accuracy_Score in Developmental Set'], Accuracy_Score_Summary['Accuracy_Score in Testing Set'])
ax.plot((0.60, 0.8), (0.60, 0.8), ls = "--", c="black")

ax.set_xlabel('Accuracy_Score in Developmental Set', fontsize = 14)
ax.set_ylabel('Accuracy_Score in Testing Set', fontsize = 14)
ax.set_title('MODEL COMPARISON PLOT', fontsize = 20)

plt.text(0.745,0.75, 'LogR model')
plt.text(0.76,0.735, 'Linear_SGD model')
plt.text(0.741,0.695, 'MNB model')
plt.text(0.77,0.757, 'SVC model')
plt.text(0.655,0.60, 'RF model')
plt.show()

ROC Curves

Finally I wanted to determine which of these models generated the best binary classifiers for each class. Recall that each model was built using the One-vs-Rest Classifier. So, each model contains 18 binary classifiers (i.e. one for each category). A binary classifier, classifies one category as 1 and the others as 0.

The closer the curve is to the top left border of the graph the better the classifier. It means that the classifier does a good job of separating one class from all the rest.

As an example, I have shown below the ROC curves for the auto category using the 5 classification models. All 5 ROC curves appear quite similar, and all indicate good classifiers. To determine which is best of the 5, the area under the curves, also known as the AUC values will be used for comparison.

In [34]:
## The classification probability for each class (legal category)
y_test_prob_clf1 = text_clf1.fit(test_doc, test_target_2).predict_proba(test_doc)
y_test_prob_clf2 = text_clf2.fit(test_doc, test_target_2).predict_proba(test_doc)
y_test_prob_clf3 = text_clf3.fit(test_doc, test_target_2).predict_proba(test_doc)
y_test_prob_clf4 = text_clf4.fit(test_doc, test_target_2).decision_function(test_doc)
y_test_prob_clf5 = text_clf5.fit(test_doc, test_target_2).predict_proba(test_doc)
In [35]:
## Compute ROC curve and ROC area for each class
n_classes = 18

fpr_clf1 = dict()
tpr_clf1 = dict()
roc_auc_clf1= dict()
for i in range(n_classes):
    fpr_clf1[i], tpr_clf1[i], _ = roc_curve(test_target_2[:, i], y_test_prob_clf1[:, i])
    roc_auc_clf1[i] = auc(fpr_clf1[i], tpr_clf1[i])
    
fpr_clf2 = dict()
tpr_clf2 = dict()
roc_auc_clf2= dict()
for i in range(n_classes):
    fpr_clf2[i], tpr_clf2[i], _ = roc_curve(test_target_2[:, i], y_test_prob_clf2[:, i])
    roc_auc_clf2[i] = auc(fpr_clf2[i], tpr_clf2[i])
    
fpr_clf3 = dict()
tpr_clf3 = dict()
roc_auc_clf3= dict()
for i in range(n_classes):
    fpr_clf3[i], tpr_clf3[i], _ = roc_curve(test_target_2[:, i], y_test_prob_clf3[:, i])
    roc_auc_clf3[i] = auc(fpr_clf3[i], tpr_clf3[i])
     

fpr_clf4 = dict()
tpr_clf4 = dict()
roc_auc_clf4= dict()
for i in range(n_classes):
    fpr_clf4[i], tpr_clf4[i], _ = roc_curve(test_target_2[:, i], y_test_prob_clf4[:, i])
    roc_auc_clf4[i] = auc(fpr_clf4[i], tpr_clf4[i])
    
fpr_clf5 = dict()
tpr_clf5 = dict()
roc_auc_clf5= dict()
for i in range(n_classes):
    fpr_clf5[i], tpr_clf5[i], _ = roc_curve(test_target_2[:, i], y_test_prob_clf5[:, i])
    roc_auc_clf5[i] = auc(fpr_clf5[i], tpr_clf5[i])    
In [36]:
## ROC curve for Auto category
fig = plt.figure(figsize = (35, 20))
lw = 2

ax1 = fig.add_subplot(231)
ax1.plot(fpr_clf1[0], tpr_clf1[0], color='darkorange',
         lw=lw, label='AUC value logR (area = %0.2f)' % roc_auc_clf1[0])
ax1.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
ax1.set_xlim([0.0, 1.0])
ax1.set_ylim([0.0, 1.05])
ax1.set_xlabel('False Positive Rate', fontsize = 20)
ax1.set_ylabel('True Positive Rate', fontsize = 20)
ax1.set_title('ROC - Logistic Regression', fontsize = 22)
ax1.legend(loc="lower right", fontsize = 20)


ax2 = fig.add_subplot(232)
ax2.plot(fpr_clf2[0], tpr_clf2[0], color='red',
         lw=lw, label='AUC value SGD (area = %0.2f)' % roc_auc_clf2[0])
ax2.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
ax2.set_xlim([0.0, 1.0])
ax2.set_ylim([0.0, 1.05])
ax2.set_xlabel('False Positive Rate', fontsize = 20)
ax2.set_ylabel('True Positive Rate', fontsize = 20)
ax2.set_title('ROC - Stochastic Gradient Descent Classification Model', fontsize = 22)
ax2.legend(loc="lower right", fontsize = 20)


ax3 = fig.add_subplot(233)
ax3.plot(fpr_clf3[0], tpr_clf3[0], color='red',
         lw=lw, label='AUC value MNB (area = %0.2f)' % roc_auc_clf3[0])
ax3.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
ax3.set_xlim([0.0, 1.0])
ax3.set_ylim([0.0, 1.05])
ax3.set_xlabel('False Positive Rate', fontsize = 20)
ax3.set_ylabel('True Positive Rate', fontsize = 20)
ax3.set_title('ROC - Multinomial Naive Bayes Classification Model', fontsize = 22)
ax3.legend(loc="lower right", fontsize = 20)


ax4 = fig.add_subplot(234)
ax4.plot(fpr_clf4[0], tpr_clf4[0], color='red',
         lw=lw, label='AUC value SVC (area = %0.2f)' % roc_auc_clf4[0])
ax4.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
ax4.set_xlim([0.0, 1.0])
ax4.set_ylim([0.0, 1.05])
ax4.set_xlabel('False Positive Rate', fontsize = 20)
ax4.set_ylabel('True Positive Rate', fontsize = 20)
ax4.set_title('ROC - Linear Support Vector Classifier', fontsize = 22)
ax4.legend(loc="lower right", fontsize = 20)


ax5 = fig.add_subplot(235)
ax5.plot(fpr_clf5[0], tpr_clf5[0], color='red',
         lw=lw, label='AUC value RF (area = %0.2f)' % roc_auc_clf5[0])
ax5.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
ax5.set_xlim([0.0, 1.0])
ax5.set_ylim([0.0, 1.05])
ax5.set_xlabel('False Positive Rate', fontsize = 20)
ax5.set_ylabel('True Positive Rate', fontsize = 20)
ax5.set_title('ROC - Random Forest Classifier', fontsize = 22)
ax5.legend(loc="lower right", fontsize = 20)


plt.show()

AUC values for the different classification models

The area under the ROC curve (AUC value) can be used to compare binary classifiers. The larger the area the better the classifier. As seen in the ROC curves above, the linear classifier with SGD training generated the best binary classifier for the auto category when compared with the other models.

Rather than show all 90 ROC curves, I have made a boxplot below that shows the distribution (or range for the 18 classes) of AUC values for each model.

In [40]:
## Dataframe of AUC values of each class for the top 3 classification models
columns = ['LogR', 'SGD', 'MNB', 'SVC', 'RF']
auc_values = pd.DataFrame(columns = columns)

auc_values['LogR'] = pd.Series(v for v in roc_auc_clf1.values())
auc_values['SGD'] = pd.Series(v for v in roc_auc_clf2.values())
auc_values['MNB'] = pd.Series(v for v in roc_auc_clf3.values())
auc_values['SVC'] = pd.Series(v for v in roc_auc_clf4.values())
auc_values['RF'] = pd.Series(v for v in roc_auc_clf5.values())

## Boxplot to compare the distribution of AUC values in the models
fig = plt.figure(figsize=(10, 6))
ax = fig.add_subplot(111)
auc_values.boxplot(ax = ax)
ax.set_ylabel('AUC Values', fontsize = 12 )
plt.show()

The boxplot shows that all 5 classification models generated really good binary classifiers for all 18 classses, with AUC values mostly greater than 0.92. So, to determine which model I considered the best based on AUC values, I identified the model with the best binary classifier for each category. Then I determined which model had most of the best binary classifiers.

As shown in the table below, most of the best binary classifiers were found in the random forest model.

In [41]:
## determine which model has the max AUC value for the different categories, and tally them.
count_max_auc = Counter(auc_values.idxmax(axis = 1).tolist())

no_max_auc_class = pd.DataFrame.from_dict(count_max_auc, orient='index').reset_index()
no_max_auc_class.columns = ['Classification Model', 'No. of Legal Categories']
print(no_max_auc_class)
  Classification Model  No. of Legal Categories
0                  SGD                        8
1                   RF                       10

Although the random forest model generated better binary classifiers for more classes than the linear classifier with SGD training, the overall classification accuracy of the random forest model is low, and it does not generalize as well to new data, when compared to the other models. For this reasons, the random forest is not the best for this project.

Comparing the other 4 models, the multinomial naive bayes model has the lowest classification accuracy, and also does not generalize as well as the others. So, the multinomial naive bayes model is not the best for this project.

This leaves me with the logistic regression model, linear classifier with SGD training, and linear support vector model. These 3 models give similar classification accuracy, and show similar generalization ability, however; the linear classifier with SGD training has better binary classifiers which makes it stand out from the other two. So, based on these analysis, I decided the linear classifier with SGD training is the best for this project.

METRICS

Classification Matrix

In [42]:
## Generate classification matrix on the test data, using the Linear_SGD model
class_names = sorted(list(set(data['category'])))

def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)
    
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

   
    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

# Compute confusion matrix
test_pred_classes = le.inverse_transform(SGD_y_test_pred) ## revert the label encoding to strings
cnf_matrix = confusion_matrix(np.array(test_target), np.array(test_pred_classes))


# Plot non-normalized confusion matrix
plt.figure(figsize=(10, 10))
plot_confusion_matrix(cnf_matrix, classes=class_names,
                      title='Confusion matrix, without normalization')

plt.show()

Classification Report

In [43]:
## Generate classification report on the test data, using the Linear_SGD model
y_true = test_target
test_pred_classes = le.inverse_transform(SGD_y_test_pred) ## revert the label encoding to strings
y_pred = test_pred_classes
target_names = sorted(list(set(data['category'])))
print(classification_report(y_true, y_pred, target_names=target_names))
             precision    recall  f1-score   support

       auto       0.73      0.72      0.72      1413
 bankruptcy       0.83      0.75      0.79       651
   business       0.59      0.54      0.57      1774
   consumer       0.65      0.37      0.47       383
   criminal       0.68      0.88      0.77      2990
    custody       0.67      0.63      0.65       852
       debt       0.59      0.54      0.56      1134
    divorce       0.78      0.75      0.77      2220
        dui       0.88      0.64      0.74       516
     estate       0.72      0.70      0.71      2645
immigration       0.91      0.72      0.81       442
     injury       0.65      0.51      0.57      1107
  insurance       0.64      0.16      0.25       441
      labor       0.82      0.90      0.86      3825
    medical       0.82      0.62      0.70       380
     tenant       0.75      0.87      0.81      2051
    traffic       0.59      0.24      0.34        79
      wills       0.85      0.86      0.86      2416

avg / total       0.74      0.74      0.73     25319

The classification report above shows poor classification accuracy ( < 0.5) for the insurance, traffic, and consumer categories.

From the classification matrix, we can see that most of the misclassified questions in these categories were classfied under the following categories:

  • insurance - labor, auto, wills, criminal, estate
  • traffic - criminal, auto
  • consumer - business

As discussed earlier, it is possible for a particular question to be classified under multiple legal categories due to the similarities or connections between the categories. For example, the poor classification accuracy observed for insurance category could be because the insurance claims may have been due to auto accident. Such a case could be misclassified under auto. Or, the insurance claim may be due to theft of property. Such a case could be misclassified under criminal.

RECOMMENDATION

Due to the challenge faced when attempting to classify questions that could easily fit into multiple categories, I recommend that rather than outputing one category, the model outputs the top two. For example, in a case involving a hit and run auto accident that resulted in death, rather than classify as just auto, or criminal, the model should output both auto and criminal.

By following this recommendation, the model will be more efficient in providing users with options on which legal specialist could effectively handle their cases.

CONCLUSION

To summarize, my primary goal working on this project was to build a high accuracy classification model that took as input texts of legal question description from people, and output the right legal category for their question.

In building this model, I made use of data collected from a free source legal advice website. The dataset consisted of question descriptions and their corresponding legal category. There were 18 categories in total. I generated classification features using term frequency counts of unigrams. Then using one-vs-rest classifier, I built 5 classification models. These were logistic regression model, stochastic gradient descent classification model, multinomial naive bayes classification model, linear support vector classification model and random forest classification model.

I concluded that the stochastic gradient descent classification model was the best for this project based on its relatively high classification accuracy, and its expected ability to generalize well to new data. This model has an estimated out-of-sample classification accuracy of ~74%.

The shortcomings of this model was discussed to be due to the fact that some questions can easily be classified into multiple categories, due to the nature of the questions, and due to similarities between certain categories. Based on this shortcoming, I recommended for better efficiency to users, the model could be built such that it outputs the top 2 classes for each question.

The benefit of this model is that users can describe their legal questions in their own words and be adviced on what legal specialist they need to consult.

APPENDIX

DATA VISUALIZATION

Using the data collected from the advice website, I performed exploratory analysis to answer some basic questions, such as:

  • Which states did most of the legal questions originate?
  • Which legal category had the most number of questions?
  • What are the most common legal categories in each state?

The visualization show the answers to these exploratory questions.