from IPython.display import HTML
HTML('''<script>
code_show=true;
function code_toggle() {
if (code_show){
$('div.input').hide();
} else {
$('div.input').show();
}
code_show = !code_show
}
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle on/off the raw code."></form>''')
This project makes use of data collected from a free legal advice website. The website contains questions asked by users from around the country, and answers are provided by attorneys. The entries are grouped into 18 legal forums.
The different legal forums are: 1) Auto Accident, 2) Bankruptcy, 3) Business, 4) Collections & Debt, 5) Consumer & Lemon, 6) Child Custody, 7) Criminal Defense, 8) Divorce, 9) DUI & DWI, 10) Employment & Labor, 11) Immigration, 12) Insurance, 13) Landlord & Tenant, 14) Medical Malpractice, 15) Personal Injury, 16) Real Estate, 17) Traffic, and 18) Wills, Trust & Probate.
The data collected from the website are: the question title, the question detail, the legal forum (category), the asker's state, the answering attorney, the relative date the question was answered, and the date the data were collected.
There are 126,597 questions in the dataset.
Using the available data, below are the questions I propose to address: 1) From which states do the most and the fewest questions originate, and which legal categories dominate in those states? 2) Which legal categories receive the most questions countrywide? 3) How has the major legal concern of each state changed over the years? 4) How are the questions distributed across years, months, and days of the week? 5) Which categories have the highest ratio of questions to contributing attorneys? 6) Which named attorneys answered the most questions in each category? 7) What is the most common topic in each category? 8) Do the unanswered questions have similar, already-answered counterparts?
Based on my findings, I will make recommendations to a hypothetical law firm.
In addition to answering the above questions, I plan to build a classification model that takes the question title and detail and determines which legal forum it belongs to. The classification model will be presented in a separate report; a minimal sketch of one possible pipeline follows.
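As an illustrative preview only (not the model in that report), here is a minimal sketch of such a pipeline, assuming TF-IDF features and a linear classifier over the 'title and question' and 'category' columns built later in this notebook:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
## Illustrative sketch: 'title and question' and 'category' are constructed
## further down in this notebook; the model in the separate report may differ
X_train, X_test, y_train, y_test = train_test_split(
    legal['title and question'], legal['category'], random_state = 12)
forum_clf = make_pipeline(TfidfVectorizer(stop_words = 'english'),
                          LogisticRegression())
forum_clf.fit(X_train, y_train)
print('Held-out accuracy: %.3f' % forum_clf.score(X_test, y_test))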
## Load required packages
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
from sqlalchemy import create_engine
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
import gensim
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from collections import Counter
import string
from nltk.util import ngrams
from IPython.display import Image, display
%matplotlib inline
The data from the website were stored in a postgres database, and read into jupyter for exploration and analysis.
## Load data from database
engine = create_engine('postgresql://postgres:Edoamen1@localhost:5433/LegalQuestions')
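## read the complete legaldata table into a pandas DataFrame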
legal = pd.read_sql_query('select * from legal."legaldata"',con=engine)
Since the website did not keep track of the date each question was asked, I estimate that date using when it was answered and the date I collected the data. The assumption in this estimate is that questions were answered around the same time they were asked.
Also, I merged the question title and detail columns in order to have a full description of the question. Finally, I filled empty state entries with 'Unspecified State'.
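For example, a question answered '5 years ago' and collected on 2018-07-20 (a hypothetical date, for illustration only) would get an estimated date asked of 2013-07-20:
## Hypothetical example of the offset arithmetic used below
print(pd.Timestamp('2018-07-20') - pd.DateOffset(years = 5))  ## 2013-07-20 00:00:00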
## Calculate the estimated date asked
import re
## The site reports answer dates as relative offsets. Build (pattern, offset)
## pairs covering every offset observed in the data, in the same order the
## per-offset subsets were originally enumerated.
year_order = [5, 7, 4, 6, 1, 8, 2, 3]
day_order = [1, 2] + list(range(4, 12)) + list(range(15, 30))
pairs = [('Answered %d year%s' % (n, '' if n == 1 else 's'), pd.DateOffset(years = n)) for n in year_order]
pairs += [('Answered %d month%s' % (n, '' if n == 1 else 's'), pd.DateOffset(months = n)) for n in range(1, 13)]
pairs += [('Answered %d day%s' % (n, '' if n == 1 else 's'), pd.DateOffset(days = n)) for n in day_order]
pairs += [('Answered today', pd.DateOffset(days = 0))]
## Subtract each offset from the collection date to estimate the date asked
frames = []
for pattern, offset in pairs:
    subset = legal[legal['date_answered'].str.contains(pattern)].copy()
    subset['est_date_asked'] = subset['date_collected'].apply(lambda x: x - offset)
    frames.append(subset)
## Unanswered questions get no estimated date; they are examined later
unans = legal[legal['date_answered'].str.contains('unanswered')]
frames.append(unans)
legal = pd.concat(frames, ignore_index = True)
## Shuffle the dataframe
legal = legal.sample(frac=1, random_state = 12).reset_index(drop=True)
legal['est_date_asked'] = pd.to_datetime(legal['est_date_asked']).dt.date
## Replace NaN entries with an empty string
legal = legal.replace(np.nan, '', regex=True)
## Merging the title and question text columns
legal['title and question'] = legal['titles'].str.cat(legal['questions'], sep=' ')
## Fill empty state entries with 'Unspecified State'
legal.loc[legal['state'] == '', 'state'] = 'Unspecified State'
As shown in the bar chart below, a majority of the questions on the free advice website originated from California, while the fewest were from the US Virgin Islands.
The pie charts show that most of the questions from California relate to labor law, while a majority of questions from the US Virgin Islands relate to labor, criminal, and divorce law.
## select top 5 states where most questions originated
top_5_states = legal['state'].value_counts()[0:5]
## select the bottom 5 states where the fewest questions originated
bottom_5_states = legal['state'].value_counts()[50:55]
## Plot bar chart of the 5 top and bottom states
fig = plt.figure(figsize=(20, 25))
ax = fig.add_subplot(211)
top_5_states.plot(kind = 'barh', ax = ax)
ax.set_xlabel('No. of Questions', fontsize = 25)
ax.set_ylabel('State', fontsize = 25)
ax.set_title('Top 5 states with most legal questions', size = 25)
ax.tick_params(axis='x', labelsize=30)
ax.tick_params(axis='y', labelsize=20)
plt.gca().invert_yaxis()
ax1 = fig.add_subplot(212)
bottom_5_states.plot(kind = 'barh', ax = ax1)
ax1.set_xlabel('No. of Questions', fontsize = 25)
ax1.set_ylabel('State', fontsize = 25)
ax1.set_title('Bottom 5 states with most legal questions', size = 25)
ax1.tick_params(axis='x', labelsize=30)
ax1.tick_params(axis='y', labelsize=20)
plt.gca().invert_yaxis()
plt.show()
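## Count the questions per legal category for California and the US Virgin Islands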
california = legal[legal['state'] == 'California']
cali_category = california.groupby('category').size().reset_index(drop = False)
cali_category.columns = ['category', 'size']
islands = legal[legal['state'] == 'US Virgin Islands']
isl_category = islands.groupby('category').size().reset_index(drop = False)
isl_category.columns = ['category', 'size']
## Pie chart showing distribution of legal categories in California
sizes = cali_category['size']
labels = cali_category['category']
explode = [0.1] * 18
## Pie chart showing distribution of legal categories in US Virgin Islands
sizes_2 = isl_category['size']
labels_2 = isl_category['category']
explode_2 = [0.1] * 11
fig = plt.figure(figsize=(18, 8))
ax = fig.add_subplot(121)
patches, texts, autotexts = ax.pie(sizes, explode = explode, labels=labels,
                                   autopct='%1.1f%%', shadow=False, startangle=80)
for t in texts:
    t.set_fontsize(14)
for t in autotexts:
    t.set_fontsize(13)
ax.set_title('Distribution of legal categories in California', fontsize = 18)
ax1 = fig.add_subplot(122)
patches, texts, autotexts = ax1.pie(sizes_2, explode = explode_2, labels=labels_2,
                                    autopct='%1.1f%%', shadow=False, startangle=80)
for t in texts:
    t.set_fontsize(14)
for t in autotexts:
    t.set_fontsize(15)
ax1.set_title('Distribution of legal categories in US Virgin Islands', fontsize = 18)
plt.show()
According to the plot below, countrywide, most of the legal questions pertain to labor law, followed by criminal defense and real estate.
## select the top 5 legal categories with the most questions
top_category = legal['category'].value_counts()[0:5]
## make bar chart of the top 5 legal categories
fig = plt.figure(figsize=(14, 4))
ax = fig.add_subplot(121)
top_category.plot(kind = 'barh', ax = ax)
ax.set_xlabel('No. of Questions', fontsize = 20)
ax.set_ylabel('Legal Category', fontsize = 20)
ax.set_title('Top 5 legal categories with most legal questions', fontsize = 20)
ax.tick_params(axis='x', labelsize=15)
ax.tick_params(axis='y', labelsize=15)
plt.gca().invert_yaxis()
Since most of the questions originated from California, and most are related to labor law, we can expect a higher demand for lawyers in California than in other states. We can also expect a high demand for lawyers specializing in labor law across all states.
## Remove entries with non-specific state values
state_category = legal[legal['state']!= 'All USA']
state_category = state_category[state_category['state'] != 'Unspecified State']
The first map below shows that, summing the number of questions in each category over all years, most states have more labor-related legal questions than questions of any other kind.
The next three maps show that the major legal concern of each state has changed over the years. This change could be random or the result of political and economic changes; however, there is insufficient information to give a reason for it.
Not shown on the maps are:
This information could be useful in deciding in which states to base various legal specialties.
## make dataframe of each state and its major legal category in the complete dataset
state_category_groupby = state_category.groupby(['category','state']).size().unstack()
major_category_per_state = state_category_groupby.idxmax(axis = 0).reset_index(drop = False)
major_category_per_state.columns = ['state','category']
## remove rows with null est_date_asked
yearly_state_category = state_category[state_category['est_date_asked'] != ''].copy()
yearly_state_category['est_date_asked'] = pd.to_datetime(yearly_state_category['est_date_asked'])
## extract year from estimated date asked
yearly_state_category['year'] = pd.DatetimeIndex(yearly_state_category['est_date_asked']).year.astype(int)
## For a given year, make a dataframe of each state and its major legal category
def major_category_by_state(year):
    subset = yearly_state_category[yearly_state_category['year'] == year]
    subset_groupby = subset.groupby(['category','state']).size().unstack()
    major = subset_groupby.idxmax(axis = 0).reset_index(drop = False)
    major.columns = ['state','category']
    return major

major_y2009_category_per_state = major_category_by_state(2009)
major_y2013_category_per_state = major_category_by_state(2013)
major_y2017_category_per_state = major_category_by_state(2017)
## display map picture from local disk
display(Image(filename="/Users/oofoegbu/us_map.png", width = 600, height = 600))
display(Image(filename="/Users/oofoegbu/us_map_2009.png", width = 600, height = 600))
display(Image(filename="/Users/oofoegbu/us_map_2013.png", width = 600, height = 600))
display(Image(filename="/Users/oofoegbu/us_map_2017.png", width = 600, height = 600))
The pie chart below shows the state where most questions from each legal category originated. As shown, a majority of the questions about custody and traffic were from Texas, while a majority of questions in the other legal categories were from California.
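## For each legal category, find the state that contributed the most questions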
major_state_per_category = state_category_groupby.idxmax(axis = 1).reset_index(drop = False)
major_state_per_category.columns = ['category','state']
major_state_per_category['no'] = [1]*18
major_state_per_category['colors'] = ['blue', 'blue', 'blue', 'blue', 'blue', 'red', 'blue', 'blue', 'blue', 'blue', 'blue','blue','blue','blue','blue','blue',
'red', 'blue']
## Pie chart showing the major state from which each legal category originated
sizes = major_state_per_category['no']
labels = major_state_per_category['category']
explode = [0] * 18
colors = major_state_per_category['colors']
fig = plt.figure(figsize=(6, 6))
ax = fig.add_subplot(111)
patches, texts = ax.pie(sizes, explode = explode, labels=labels,
                        shadow=False, startangle=80, colors = colors)
for t in texts:
    t.set_fontsize(14)
ax.set_title('Major state from which each legal category originated', fontsize = 18)
plt.show()
legend = pd.DataFrame(columns = ['col', 'state'])
legend.loc[0] = ['blue', 'California']
legend.loc[1] = ['red', 'Texas']
print(legend)
On the website, the specific dates the questions were answered were not recorded; rather, the date answered was recorded as a number of years, months, or days ago. Because of this non-specific recording, any analysis of how the category of legal questions changed across months, days of the week, or weekday versus weekend would be biased toward the month I collected the data. The data were retrieved in July, which explains the high number of questions recorded in July, as shown in the plot below. The most reliable period-level information is the year.
The months, days, and weekday/weekend flag are encoded as follows: months run from 1 (January) to 12 (December), days of the week from 0 (Monday) to 6 (Sunday), and the weekday/weekend flag is 0 for weekdays and 1 for weekends (see the example below).
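For example, a single hypothetical date encodes as follows:
## Illustrative example of the period encoding on one hypothetical date
sample = pd.Timestamp('2017-07-15')  ## a Saturday
print(sample.month)                      ## 7 -> July
print(sample.dayofweek)                  ## 5 -> Saturday (0 = Monday, ..., 6 = Sunday)
print(int(sample.dayofweek // 5 == 1))   ## 1 -> weekend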
## subset relevant columns
category_date = legal[['category','state', 'est_date_asked']]
## remove rows with null est_date_asked
periods = category_date[category_date['est_date_asked'] != ''].copy()
periods['est_date_asked'] = pd.to_datetime(periods['est_date_asked'])
periods['year'] = pd.DatetimeIndex(periods['est_date_asked']).year.astype(int)
periods['month'] = pd.DatetimeIndex(periods['est_date_asked']).month.astype(int)
periods['day_of_week'] = pd.DatetimeIndex(periods['est_date_asked']).dayofweek.astype(int)
periods['weekday'] = ((pd.DatetimeIndex(periods['est_date_asked']).dayofweek) // 5 == 1).astype(int)
## Plot bar charts showing no. of questions asked in each year, month, day of the week, and weekday/weekend
fig = plt.figure(figsize=(12, 14))
ax = fig.add_subplot(221)
periods['year'].value_counts().sort_index().plot(kind = 'bar', ax = ax)
ax.set_xlabel('Est. Year Asked')
ax.set_ylabel('No. of Questions')
ax.set_title('No. of Questions Asked Each Year')
ax1 = fig.add_subplot(222)
periods['month'].value_counts().sort_index().plot(kind = 'bar', ax = ax1)
ax1.set_xlabel('Est. Month Asked')
ax1.set_ylabel('No. of Questions')
ax1.set_title('No. of Questions Asked Each Month')
ax2 = fig.add_subplot(223)
periods['day_of_week'].value_counts().sort_index().plot(kind = 'bar', ax = ax2)
ax2.set_xlabel('Est. Day of the Week Asked')
ax2.set_ylabel('No. of Questions')
ax2.set_title('No. of Questions Asked Each Day of the Week')
ax3 = fig.add_subplot(224)
periods['weekday'].value_counts().plot(kind = 'bar', ax = ax3)
ax3.set_xlabel('Est. Weekday or Weekend')
ax3.set_ylabel('No. of Questions')
ax3.set_title('No. of Questions Asked on Weekday and Weekends')
plt.show()
## Subset by estimated year asked; for each year, group the questions by
## category and count the number in each group
yearly_category_counts = {}
for yr in range(2009, 2018):
    year_groupby = periods.loc[periods['year'] == yr].groupby(['category','year']).size().reset_index(drop = False)
    year_groupby.columns = ['category','year','size']
    yearly_category_counts[yr] = year_groupby
## percent of questions each year from the different legal categories
periods_groupby = periods.groupby(['category','year']).size().groupby(level = [1]).apply(lambda x: x/x.sum()).unstack()
year_fracs = {}
for i, yr in enumerate(range(2009, 2018)):
    year_frac = periods_groupby.iloc[:,i].reset_index(drop = False)
    year_frac.columns = ['category','percent']
    year_fracs[yr] = year_frac.fillna(0) ## fill NA with zero
## Plot the distribution of questions from each legal category in the different years
fig = plt.figure(figsize=(35, 80))
colors = ['r', 'b', 'g']
panels = [(311, [2009, 2010, 2011]), (312, [2012, 2013, 2014]), (313, [2015, 2016, 2017])]
for panel, years in panels:
    ax = fig.add_subplot(panel)
    for yr, color in zip(years, colors):
        frac = year_fracs[yr]
        ax.plot(frac.index, frac['percent'], marker = 'o', color = color, label = 'year %d' % yr)
    ax.legend(fontsize = 30)
    ax.set_ylim([-0.1,0.4])
    ax.set_xticks(frac.index) # choose which x locations to have ticks
    ax.set_xticklabels(frac['category'], rotation = 90) # set the labels to display at those ticks
    ax.tick_params(axis='x', labelsize=30)
    ax.tick_params(axis='y', labelsize=30)
    ax.set_xlabel('Legal Category', fontsize = 40)
    ax.set_ylabel('Fraction of total questions', fontsize = 40)
plt.show()
The plots above show the percent of questions in each year that came from each legal category. For example, in 2009 about 25% of the questions related to criminal defense law. We can see from the plots that the percent distribution of questions across legal categories varied from year to year with no clear trend.
As the table and chart below show, the labor and tenant law categories have a high ratio of questions to attorneys. It would be beneficial to increase the number of contributing free advice lawyers with specialties in labor and tenant law, in order to offer a broader range of opinions.
## For each category: the no. of unique attorneys that answered questions and
## the total no. of questions asked in that category (one row per category)
attorneys_per_category = legal.groupby('category')['attorney'].agg(['nunique', 'size']).reset_index()
attorneys_per_category.columns = ['category', 'no. of unique attorneys', 'no. of questions']
attorneys_per_category['questions per attorney'] = round(attorneys_per_category['no. of questions']/attorneys_per_category['no. of unique attorneys'], 1)
## bar chart showing the ratio of questions per attorney in each category
fig = plt.figure(figsize=(30, 14))
ax = fig.add_subplot(221)
attorneys_per_category['questions per attorney'].plot(kind = 'bar', ax = ax)
ax.set_xlabel('Legal Category', fontsize = 15)
ax.set_ylabel('Ratio of questions per attorney', fontsize = 15)
ax.set_xticklabels(attorneys_per_category['category'], rotation = 90)
plt.show()
The goal of this question is to identify the named attorneys that answered the most questions in each legal category. This information could help the firm identify experts in the various legal categories, which may come in handy if the firm is seeking to grow its staff.
## Select rows with named attorneys
rows_named_attorney = legal[legal['attorney'].str.contains('^FreeAdvice')].index.tolist()
named_attorney = legal.drop(rows_named_attorney)
## Which named attorney answered the maximum no. of questions in each category
attorney = named_attorney.groupby(['category', 'attorney']).size().unstack()
max_attorney_category = attorney.idxmax(axis = 1).reset_index(drop = False)
max_attorney_category.columns = ['category', 'attorney']
print(max_attorney_category)
We can conclude from the table that SJZ answers most of the questions in most of the legal categories.
The goal is to highlight one major concern of users in each of the legal categories; this could help the firm tailor its practice to meet people's needs.
To do this, I found the largest group of similar questions in each category (a toy illustration of the idea follows below), and used the most common words in those questions to determine the most common topic in each category.
The size of the largest similar group, a sample question from each group, and the top 7 most common words in each group are provided in the appendix.
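As a toy illustration of the grouping idea (made-up sentences, not from the dataset): questions count as similar when the cosine similarity of their TF-IDF vectors crosses a threshold (0.3 in the appendix code).
## Toy example: similarity of the first document to all three documents
toy_docs = [['rental', 'car', 'accident'],
            ['accident', 'rental', 'car', 'damage'],
            ['dog', 'bite', 'injury']]
toy_dictionary = gensim.corpora.Dictionary(toy_docs)
toy_corpus = [toy_dictionary.doc2bow(d) for d in toy_docs]
toy_tf_idf = gensim.models.TfidfModel(toy_corpus)
toy_index = gensim.similarities.MatrixSimilarity(toy_tf_idf[toy_corpus], num_features = len(toy_dictionary))
print(toy_index[toy_tf_idf[toy_corpus[0]]])  ## high for the second document, zero for the third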
The common topics table below shows the topic of the largest group of similar questions in each category.
## Dataframe of the common topic in each legal category
common_topics = pd.DataFrame(columns = ['Legal Category','Topic'])
common_topics.loc[0] = ['Auto', 'Car accident involving rental cars']
common_topics.loc[1] = ['Bankruptcy', 'Involves filing Chapter 7 or 13']
common_topics.loc[2] = ['Business', 'Business involving cars']
common_topics.loc[3] = ['Consumer', 'Involves titles']
common_topics.loc[4] = ['Criminal', 'Sex involving minors']
common_topics.loc[5] = ['Custody', "Involves the father's name on the birth certificate"]
common_topics.loc[6] = ['Debt', 'Debt involving car loan']
common_topics.loc[7] = ['Divorce', 'Cars in divorce']
common_topics.loc[8] = ['Dui', 'Miranda rights during arrest']
common_topics.loc[9] = ['Estate', 'Fence around property']
common_topics.loc[10] = ['Immigration', 'Involving name change and green card']
common_topics.loc[11] = ['Injury', 'Injury involving dogs']
common_topics.loc[12] = ['Insurance', 'Beneficiary of life insurance']
common_topics.loc[13] = ['Labor', 'Cases involving vacation time']
common_topics.loc[14] = ['Medical', 'Medical case involving knee replacement']
common_topics.loc[15] = ['Tenant', 'Late rental fees']
common_topics.loc[16] = ['Traffic', 'Suspended license']
common_topics.loc[17] = ['Wills', 'Beneficiaries of life insurance policy']
print(common_topics)
There are only 9 unanswered questions out of over 120,000 in the dataset. This shows that most of the questions on this website get answered. Due to the small number of unanswered questions, it is not possible to draw any correlations involving the types of questions that go unanswered.
I checked whether there were other questions similar to the unanswered ones, and found two questions similar to one of the unanswered questions in the estate category. So the individual who asked that unanswered question can get answers from the similar questions, which can be seen in the appendix.
## No. of unanswered questions in each category
unans_group = unans.groupby(['category']).size().reset_index()
unans_group.columns = ['category', 'no. of unanswered questions']
print(unans_group)
Based on the above analysis, these are my recommendations to the hypothetical law firm: concentrate capacity in California, where the largest share of questions originates; prioritize labor law, the most in-demand category countrywide, followed by criminal defense and real estate; add attorneys specializing in labor and tenant law, where the ratio of questions to contributing attorneys is highest; and consider recruiting prolific contributors such as SJZ when expanding staff.
## Subsets of the different legal categories
auto_data = legal[legal['category'] == 'auto'].reset_index()
bankruptcy_data = legal[legal['category'] == 'bankruptcy'].reset_index()
business_data = legal[legal['category'] == 'business'].reset_index()
consumer_data = legal[legal['category'] == 'consumer'].reset_index()
criminal_data = legal[legal['category'] == 'criminal'].reset_index()
custody_data = legal[legal['category'] == 'custody'].reset_index()
debt_data = legal[legal['category'] == 'debt'].reset_index()
divorce_data = legal[legal['category'] == 'divorce'].reset_index()
dui_data = legal[legal['category'] == 'dui'].reset_index()
estate_data = legal[legal['category'] == 'estate'].reset_index()
immigration_data = legal[legal['category'] == 'immigration'].reset_index()
injury_data = legal[legal['category'] == 'injury'].reset_index()
insurance_data = legal[legal['category'] == 'insurance'].reset_index()
labor_data = legal[legal['category'] == 'labor'].reset_index()
medical_data = legal[legal['category'] == 'medical'].reset_index()
tenant_data = legal[legal['category'] == 'tenant'].reset_index()
traffic_data = legal[legal['category'] == 'traffic'].reset_index()
wills_data = legal[legal['category'] == 'wills'].reset_index()
def similarity_function(data):
    ## For each question in a category, find the indices of all questions in
    ## that category whose TF-IDF cosine similarity to it is >= 0.3
    doc = data['title and question'].tolist()
    gen_docs = [[w.lower() for w in word_tokenize(text)]
                for text in doc]
    ## Remove stopwords
    stop_words = set(stopwords.words('english'))
    filtered_sentence = [[] for i in range(len(gen_docs))]
    for i in range(len(gen_docs)):
        for w in gen_docs[i]:
            if w not in stop_words:
                filtered_sentence[i].append(w)
    ## Build the TF-IDF model and disk-backed similarity index
    dictionary = gensim.corpora.Dictionary(filtered_sentence)
    corpus = [dictionary.doc2bow(sent) for sent in filtered_sentence]
    tf_idf = gensim.models.TfidfModel(corpus)
    sims = gensim.similarities.Similarity('/Users/oofoegbu/', tf_idf[corpus], num_features=len(dictionary))
    ## Query the index with every question; record each group's size and indices
    similarity = pd.DataFrame(columns = ['similar','index'])
    for i in range(0, len(doc)):
        query_doc = [w.lower() for w in word_tokenize(doc[i])]
        query_sentence = [w for w in query_doc if not w in stop_words]
        query_doc_bow = dictionary.doc2bow(query_sentence)
        query_doc_tf_idf = tf_idf[query_doc_bow]
        sim = sims[query_doc_tf_idf]
        similar = np.where((sim >= 0.3))[0]
        similarity.loc[i] = [len(similar), similar]
    return similarity
auto_sim_group = similarity_function(auto_data)
bankruptcy_sim_group = similarity_function(bankruptcy_data)
business_sim_group = similarity_function(business_data)
consumer_sim_group = similarity_function(consumer_data)
criminal_sim_group = similarity_function(criminal_data)
custody_sim_group = similarity_function(custody_data)
debt_sim_group = similarity_function(debt_data)
divorce_sim_group = similarity_function(divorce_data)
dui_sim_group = similarity_function(dui_data)
estate_sim_group = similarity_function(estate_data)
immigration_sim_group = similarity_function(immigration_data)
injury_sim_group = similarity_function(injury_data)
insurance_sim_group = similarity_function(insurance_data)
labor_sim_group = similarity_function(labor_data)
medical_sim_group = similarity_function(medical_data)
tenant_sim_group = similarity_function(tenant_data)
traffic_sim_group = similarity_function(traffic_data)
wills_sim_group = similarity_function(wills_data)
## For each category, the size of the largest group of similar questions and
## how many groups have that size
def top_group_function(sim_group):
    top_group = sim_group['similar'].value_counts().sort_index(ascending = False).iloc[0:1]
    top_group = top_group.reset_index()
    top_group.columns = ['no. of similar questions','no. of groups this size']
    return top_group

top_auto_group = top_group_function(auto_sim_group)
top_bankruptcy_group = top_group_function(bankruptcy_sim_group)
top_business_group = top_group_function(business_sim_group)
top_consumer_group = top_group_function(consumer_sim_group)
top_criminal_group = top_group_function(criminal_sim_group)
top_custody_group = top_group_function(custody_sim_group)
top_debt_group = top_group_function(debt_sim_group)
top_divorce_group = top_group_function(divorce_sim_group)
top_dui_group = top_group_function(dui_sim_group)
top_estate_group = top_group_function(estate_sim_group)
top_immigration_group = top_group_function(immigration_sim_group)
top_injury_group = top_group_function(injury_sim_group)
top_insurance_group = top_group_function(insurance_sim_group)
top_labor_group = top_group_function(labor_sim_group)
top_medical_group = top_group_function(medical_sim_group)
top_tenant_group = top_group_function(tenant_sim_group)
top_traffic_group = top_group_function(traffic_sim_group)
top_wills_group = top_group_function(wills_sim_group)
def sim_index_function(sim_group, top_group):
    ## Return the question indices belonging to the largest group of similar questions
    sim_questions_list = sim_group[sim_group['similar'] == top_group['no. of similar questions'].iloc[0]]
    sim_questions_list = sim_questions_list.reset_index()
    sim_questions_list.columns = ['top group index', 'no. of similar questions', 'doc index']
    top_index = sim_group['index'].iloc[sim_questions_list['top group index'].iloc[0]]
    return top_index
top_auto_index = sim_index_function(auto_sim_group, top_auto_group)
top_bankruptcy_index = sim_index_function(bankruptcy_sim_group, top_bankruptcy_group)
top_business_index = sim_index_function(business_sim_group, top_business_group)
top_consumer_index = sim_index_function(consumer_sim_group, top_consumer_group)
top_criminal_index = sim_index_function(criminal_sim_group, top_criminal_group)
top_custody_index = sim_index_function(custody_sim_group, top_custody_group)
top_debt_index = sim_index_function(debt_sim_group, top_debt_group)
top_divorce_index = sim_index_function(divorce_sim_group, top_divorce_group)
top_dui_index = sim_index_function(dui_sim_group, top_dui_group)
top_estate_index = sim_index_function(estate_sim_group, top_estate_group)
top_immigration_index = sim_index_function(immigration_sim_group, top_immigration_group)
top_injury_index = sim_index_function(injury_sim_group, top_injury_group)
top_insurance_index = sim_index_function(insurance_sim_group, top_insurance_group)
top_labor_index = sim_index_function(labor_sim_group, top_labor_group)
top_medical_index = sim_index_function(medical_sim_group, top_medical_group)
top_tenant_index = sim_index_function(tenant_sim_group, top_tenant_group)
top_traffic_index = sim_index_function(traffic_sim_group, top_traffic_group)
top_wills_index = sim_index_function(wills_sim_group, top_wills_group)
def topic_function(data, top_index):
    top_questions = data.iloc[top_index]
    top_doc = top_questions['title and question'].tolist()
    docs = [[w.lower() for w in word_tokenize(text)]
            for text in top_doc]
    ## Remove stopwords
    stop_words = set(stopwords.words('english'))
    filtered_sentence = [[] for i in range(len(docs))]
    for i in range(len(docs)):
        for w in docs[i]:
            if w not in stop_words:
                filtered_sentence[i].append(w)
    ## Remove punctuation
    punct = set(string.punctuation)
    filtered_punct = [[] for i in range(len(filtered_sentence))]
    for m in range(len(filtered_sentence)):
        for w in filtered_sentence[m]:
            if w not in punct:
                filtered_punct[m].append(w)
    ## Word frequency count
    words = sum(filtered_punct, [])
    counts = Counter(words)
    top_words = sorted(counts, key=counts.get, reverse=True)[0:7]
    return top_words
## For each category: the size of the largest group of similar questions, a
## sample question from that group (at a hand-picked representative position),
## and the most common words in the group
def print_group_summary(name, data, top_group, top_index, sample_pos):
    print('No. of similar questions in largest %s group = %s' % (name, top_group['no. of similar questions'].iloc[0]))
    print()
    print('Sample Question: %s' % data['title and question'].iloc[top_index[sample_pos]])
    print()
    print('Most common words: %s' % topic_function(data, top_index))

print_group_summary('auto', auto_data, top_auto_group, top_auto_index, 4)
print_group_summary('bankruptcy', bankruptcy_data, top_bankruptcy_group, top_bankruptcy_index, 9)
print_group_summary('business', business_data, top_business_group, top_business_index, 12)
print_group_summary('consumer', consumer_data, top_consumer_group, top_consumer_index, 12)
print_group_summary('criminal', criminal_data, top_criminal_group, top_criminal_index, 12)
print_group_summary('custody', custody_data, top_custody_group, top_custody_index, 12)
print_group_summary('debt', debt_data, top_debt_group, top_debt_index, 12)
print_group_summary('divorce', divorce_data, top_divorce_group, top_divorce_index, 2)
print_group_summary('dui', dui_data, top_dui_group, top_dui_index, 12)
print_group_summary('estate', estate_data, top_estate_group, top_estate_index, 2)
print_group_summary('immigration', immigration_data, top_immigration_group, top_immigration_index, 10)
print_group_summary('injury', injury_data, top_injury_group, top_injury_index, 70)
print_group_summary('insurance', insurance_data, top_insurance_group, top_insurance_index, 1)
print_group_summary('labor', labor_data, top_labor_group, top_labor_index, 12)
print_group_summary('medical', medical_data, top_medical_group, top_medical_index, 12)
print_group_summary('tenant', tenant_data, top_tenant_group, top_tenant_index, 10)
print_group_summary('traffic', traffic_data, top_traffic_group, top_traffic_index, 2)
print_group_summary('wills', wills_data, top_wills_group, top_wills_index, 4)
def similarity_unanswered_function(data, unanswered):
    ## Find questions similar to an unanswered question: TF-IDF cosine
    ## similarity >= 0.5, excluding the question's (near-)exact match with itself
    doc = data['title and question'].tolist()
    gen_docs = [[w.lower() for w in word_tokenize(text)]
                for text in doc]
    stop_words = set(stopwords.words('english'))
    filtered_sentence = [[] for i in range(len(gen_docs))]
    for i in range(len(gen_docs)):
        for w in gen_docs[i]:
            if w not in stop_words:
                filtered_sentence[i].append(w)
    dictionary = gensim.corpora.Dictionary(filtered_sentence)
    corpus = [dictionary.doc2bow(sent) for sent in filtered_sentence]
    tf_idf = gensim.models.TfidfModel(corpus)
    sims = gensim.similarities.Similarity('/Users/oofoegbu/', tf_idf[corpus], num_features=len(dictionary))
    similarity = pd.DataFrame(columns = ['similar','index'])
    query_doc = [w.lower() for w in word_tokenize(unanswered)]
    query_sentence = [w for w in query_doc if not w in stop_words]
    query_doc_bow = dictionary.doc2bow(query_sentence)
    query_doc_tf_idf = tf_idf[query_doc_bow]
    sim = sims[query_doc_tf_idf]
    similar = np.where((sim >= 0.5) & (sim < 0.9999))[0]
    similarity.loc[0] = [len(similar), similar]
    return similarity
## Find question with nearest similarity to unanswered question from auto category
unans_auto = auto_data[auto_data['date_answered'] == 'unanswered'].index.tolist()
unanswered_auto = auto_data['title and question'].iloc[unans_auto[0]]
sim_unans_auto = similarity_unanswered_function(auto_data, unanswered_auto)['index'][0]
## Find question with nearest similarity to unanswered question from business category
unans_business = business_data[business_data['date_answered'] == 'unanswered'].index.tolist()
unanswered_business = business_data['title and question'].iloc[unans_business[0]]
sim_unans_business = similarity_unanswered_function(business_data, unanswered_business)['index'][0]
## Find question with nearest similarity to unanswered question from criminal category
unans_criminal = criminal_data[criminal_data['date_answered'] == 'unanswered'].index.tolist()
unanswered_criminal = criminal_data['title and question'].iloc[unans_criminal[0]]
sim_unans_criminal = similarity_unanswered_function(criminal_data, unanswered_criminal)['index'][0]
## Find question with nearest similarity to unanswered question from divorce category
unans_divorce = divorce_data[divorce_data['date_answered'] == 'unanswered'].index.tolist()
unanswered_divorce1 = divorce_data['title and question'].iloc[unans_divorce[0]]
unanswered_divorce2 = divorce_data['title and question'].iloc[unans_divorce[1]]
sim_unans_divorce1 = similarity_unanswered_function(divorce_data, unanswered_divorce1)['index'][0]
sim_unans_divorce2 = similarity_unanswered_function(divorce_data, unanswered_divorce2)['index'][0]
## Find question with nearest similarity to unanswered question from dui category
unans_dui = dui_data[dui_data['date_answered'] == 'unanswered'].index.tolist()
unanswered_dui = dui_data['title and question'].iloc[unans_dui[0]]
sim_unans_dui = similarity_unanswered_function(dui_data, unanswered_dui)['index'][0]
## Find question with nearest similarity to unanswered question from estate category
unans_estate = estate_data[estate_data['date_answered'] == 'unanswered'].index.tolist()
unanswered_estate1 = estate_data['title and question'].iloc[unans_estate[0]]
unanswered_estate2 = estate_data['title and question'].iloc[unans_estate[1]]
sim_unans_estate1 = similarity_unanswered_function(estate_data, unanswered_estate1)['index'][0]
sim_unans_estate2 = similarity_unanswered_function(estate_data, unanswered_estate2)['index'][0]
## Find question with nearest similarity to unanswered question from wills category
unans_wills = wills_data[wills_data['date_answered'] == 'unanswered'].index.tolist()
unanswered_wills = wills_data['title and question'].iloc[unans_wills[0]]
sim_unans_wills = similarity_unanswered_function(wills_data, unanswered_wills)['index'][0]
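## Summary table: category of each unanswered question and the indices of its similar questions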
similarity = pd.DataFrame(columns = ['category of unanswered question','index of similar question'])
similarity.loc[0] = ['auto', sim_unans_auto]
similarity.loc[1] = ['business', sim_unans_business]
similarity.loc[2] = ['criminal', sim_unans_criminal]
similarity.loc[3] = ['divorce', sim_unans_divorce1]
similarity.loc[4] = ['divorce', sim_unans_divorce2]
similarity.loc[5] = ['dui', sim_unans_dui]
similarity.loc[6] = ['estate', sim_unans_estate1]
similarity.loc[7] = ['estate', sim_unans_estate2]
similarity.loc[8] = ['wills', sim_unans_wills]
print(similarity)
## Unanswered question from estate category
print('Unanswered Question: %s' % unanswered_estate1)
print()
## Similar question to unanswered question
print('Similar Question: %s' % estate_data['title and question'].iloc[1766])
print()
## Similar question to unanswered question
print('Similar Question: %s' % estate_data['title and question'].iloc[4791])
print()