Text Analytics: Easy Classification For Routing Service Requests

The first question we ask as data scientists when we approach a new project is: what data does the customer have available? While some of the time the answer will be a table or file full of nice numbers just waiting to be ingested by a machine learning classifier, most of the time a big chunk of the information will be stored in free-text columns or documents.

As a customer-facing organization, we store information describing EMC’s interactions with clients: some of it structured, such as time to close or problem codes, but also free-text fields such as the problem summary or comments from the customer satisfaction survey.  These free-text fields can be used to route service requests accurately to the right support team (improving resolution times and customer satisfaction), to identify burning issues in the customer satisfaction survey, and to spot emerging problems.

Similarly, Sales would like to use a potential customer’s website to categorize that company’s needs and identify products sold to similar companies.

Internet companies such as Google or Amazon have learned to turn textual information into big earnings.  Corporate data, in contrast, usually sits trapped in database columns and is rarely used for analytics, such as identifying trends, or for predictive purposes, such as request routing or providing leads for Sales.

One of the hurdles in the way of harnessing the power of text analytics is the need to address two prevalent misconceptions among its intended users.  The first group believes that text processing is a solved problem and expects a “magic box” to deliver perfect results for whatever business question they have in mind.

The second group holds the opposite belief: that a computer cannot possibly solve a problem they have been assigning to humans for decades.  Naturally, the truth lies somewhere in the middle.  Smart text analytics can provide accurate, if not perfect, results in a way that saves the business most of the manual labor currently attached to the task.

A common approach to text analytics in the corporate environment is the use of rule-based systems.  These usually cater to the second group (the non-believers) and require the end user to spend months refining and adding rules.  These rules are often not data-driven.  For example, a statistical analysis we ran on the rule list produced by one EMC business unit showed that most of the words chosen for the rules did not appear in later data, and the rules simply failed to cover 40 percent of it.

A second hurdle comes from the industry’s favorite buzzword of recent years: Big Data.  While our corporate databases may hold terabytes of textual data, these document collections are often unrelated, and combining service request text from many different products may not be beneficial.  So don’t forget to ask: do we really need a Big Data approach?

Let’s get back to our low-hanging fruit, which in our case is automating service request routing using machine learning classifiers and Python, an open-source language with powerful libraries for data manipulation and analysis.  Learning supervised classifiers requires labeled data: a list of service requests with the correct support-group assignment.  Luckily for us, modern learning methods can usually learn even from noisy data (some of the assignments may be wrong).  Our approach consists of a few steps:

  • Get the data from the database
  • Bootstrap the labels using the metadata of the closing group, or simply the output of the existing rule-based system
  • Learn and evaluate a classifier using Python
  • Route using the ML classifier, retraining regularly on the new data

We applied this framework to routing within EMC with an accuracy of more than 90 percent, a great improvement over the rule-based and manual systems currently employed.

Before we dive into the Python example code, remember that a “small data” approach using linear classifiers can easily be scaled out using the Hadoop Streaming Interface; a sketch of such a scoring mapper follows.
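As a rough sketch (not production code), a streaming mapper could load the fitted objects and score one request per line.  The pickle file names and the tab-separated input format here are assumptions for illustration, not part of the pipeline below:

[sourcecode language="python" wraplines="false" collapse="true"]
#!/usr/bin/env python
# mapper.py -- a minimal Hadoop Streaming scoring sketch (hypothetical file names).
# Assumes the fitted vectorizer, feature selector and classifier from the
# training script below were pickled to these files beforehand.
import sys
import pickle

vectorizer = pickle.load(open("vectorizer.pkl", "rb"))
ch2 = pickle.load(open("selector.pkl", "rb"))
clf = pickle.load(open("classifier.pkl", "rb"))

# one tab-separated "request_id<TAB>request text" record per input line
for line in sys.stdin:
    request_id, text = line.rstrip("\n").split("\t", 1)
    features = ch2.transform(vectorizer.transform([text]).toarray())
    print request_id + "\t" + clf.predict(features)[0]
[/sourcecode]

Each mapper instance scores its own slice of the requests in parallel, so a model trained on “small data” can serve an arbitrarily large stream.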

For our technical readers, here is how you build a text classification system in one hour using Python.
Here we will use some great EMC blogs as an example.  We will train a classifier to tell “EMC IT’s Journey to the Private Cloud” apart from “Virtual Geek”.  We collected 20 posts from each blog [http://www.cs.bgu.ac.il/~cohenrap/blogs.txt] to work with (the first word cloud describes these 40 posts).

[Figure: word cloud of the full 40-post collection]

Step 1 – Preprocessing

Here we assume that we extracted a data file from our database with a metadata label column, in our example the blog name, and a text field with the blog content. If we are routing service requests, the columns will instead contain the label of the support group that solved the problem and the request text field. We will extract the data, clean it and lemmatize it (extracting the lemma is preferable to stemming, especially when presenting the results: for example, transforming “pricing” to “price” instead of “pric” as a stemmer would). A deeper approach to lemmatization can use a part-of-speech tagger to determine the base form.
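For readers curious about that variant, here is a minimal sketch of POS-aware lemmatization with NLTK; it is not used in the pipeline below and assumes the NLTK tagger models are installed:

[sourcecode language="python" wraplines="false" collapse="true"]
# a minimal sketch of POS-aware lemmatization (not part of the pipeline below)
import nltk
from nltk.stem.wordnet import WordNetLemmatizer

lmtzr = WordNetLemmatizer()
# map the first letter of a Penn Treebank tag to the WordNet POS codes
# the lemmatizer expects; anything else is treated as a noun
tag_map = {"J": "a", "V": "v", "R": "r"}

def lemmatizeWithPos(text):
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    return [lmtzr.lemmatize(tok, tag_map.get(tag[0], "n")) for tok, tag in tagged]

print lemmatizeWithPos("the prices were reduced")  # e.g. ['the', 'price', 'be', 'reduce']
[/sourcecode]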

[sourcecode language="python" wraplines="false" collapse="true"]
from nltk.corpus import wordnet
from nltk.stem.wordnet import WordNetLemmatizer
import sys
import os
import pandas as pd
import re

lmtzr = WordNetLemmatizer()
lemmas = {}

# lemmatization using NLTK, cached so each token is only looked up once
def myLemmatize(token):
    if not token in lemmas:
        lemma = wordnet.morphy(token)
        if not lemma:
            lemma = token
        if lemma == token:
            lemma = lmtzr.lemmatize(token)
        lemmas[token] = lemma
    return lemmas[token]

# some tokenization and cleansing
def cleanStringAndLemmatize(mytext):
    cleantext = unicode(mytext, "ascii", errors="ignore")
    cleantext = cleantext.replace(",", " , ").replace("-", " ").replace("?", " ") \
                         .replace(")", " ) ").replace("(", " ( ").replace("\n", " ") \
                         .replace("'", " ").replace("%", " % ").replace("!", " ! ") \
                         .replace("\t", " ").replace("/", " ")
    # split off sentence-final periods as their own tokens (needs a regex, not replace)
    cleantext = re.sub(r"\.(\n| |$)", " . ", cleantext)
    words = map(myLemmatize, cleantext.lower().split(" "))
    return " ".join(words)

# read the data
header = ["class", "text"]
df = pd.read_csv("blogs.txt", names=header, sep="\t")

# call the lemmatization and cleaning function
texts = df["text"].map(cleanStringAndLemmatize)

# write out the intermediate data
df2 = pd.DataFrame(data={"class": df["class"], "texts": texts})
df2.to_csv("tokenized_data.csv", index=False, sep="\t")
[/sourcecode]

Step 2 – Feature collection, training and evaluation

In this stage we choose the best word and n-gram features (an n-gram is a word combination of length n) based on frequency.  General stop words are removed using a list, and domain-specific stop words are removed by a term-frequency cutoff.

Feature selection is then performed using the chi-square criterion, a statistical measure of the strength of association between two variables (here, a term and a class).
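To make the intuition concrete, here is a toy illustration with made-up counts: a term concentrated in one class gets a high chi-square score, while a term spread evenly across classes scores near zero.

[sourcecode language="python" wraplines="false" collapse="true"]
# toy chi-square illustration with made-up term counts
import numpy
from sklearn.feature_selection import chi2

# four documents, two classes; columns are counts of "cloud" and "the"
X = numpy.array([[3, 5],   # class A
                 [4, 6],   # class A
                 [0, 5],   # class B
                 [1, 6]])  # class B
y = ["A", "A", "B", "B"]
scores, pvalues = chi2(X, y)
print scores  # "cloud" (class-specific) scores high, "the" (uniform) scores 0
[/sourcecode]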

The classifier is trained, evaluated on documents left out of the training set, and a report is printed with the precision and recall scores.  The model is then written to a CSV file which can be reviewed and edited by the domain experts.

[sourcecode language="python" wraplines="false" collapse="true"]
import pandas as pd
import os
import numpy
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.metrics import f1_score, classification_report
from sklearn.naive_bayes import MultinomialNB
from sklearn import multiclass, svm, tree
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.linear_model import RidgeClassifier

# Constants
# Minimal frequency should be much higher for big data (50-100)
MINIMAL_FREQUENCY = 4
# a token appearing in more than 10-30% of documents may be an in-domain stop word ("EMC")
MAXIMAL_FREQUENCY_FRACTION = 0.3
# we have 40 samples, so 40 features should be safe
NUMBER_OF_FEATURES = 40
# leave out 8 posts (20%) for testing
TEST_SIZE = 8

# train the model on the training portion,
# then test it on the held-out portion
def trainTestModel(train_x, train_y, test_x, test_y, model):
    print "training"
    model.fit(train_x, train_y)
    print "test"
    pred = model.predict(test_x)
    try:
        print classification_report(test_y, pred)
    except:
        print "Error"
    return True

# write the feature weights, one row per (class, feature) pair
def writeResults(clf, ch2, vectorizer):
    fnames = list(numpy.asarray(vectorizer.get_feature_names())[ch2.get_support()])
    fout = open(r"model.csv", "w")
    for sclass, weights in zip(clf.classes_, clf.coef_):
        for feat, weight in zip(fnames, weights):
            fout.write(sclass + "," + feat + "," + str(weight) + "\n")
    fout.close()

# tokenization: strip tabs, drop trailing periods/colons,
# keep only tokens longer than one character
def mytokenize(st):
    ft = [y[:-1] + y[-1].replace(".", "").replace(":", "")
          for y in (x.replace("\t", "") for x in st.split(" ")) if len(y) > 1]
    return ft

print "reading data"
df = pd.read_csv("tokenized_data.csv", sep="\t", header=0)
# shuffle the rows so the train/test split is random
df2 = df.reindex(numpy.random.permutation(df.index))

print "vectorizing"
vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=MAXIMAL_FREQUENCY_FRACTION,
                             min_df=MINIMAL_FREQUENCY, stop_words='english',
                             ngram_range=(1, 3), tokenizer=mytokenize)
X = vectorizer.fit_transform(numpy.asarray(df2.texts)).toarray()
Y = numpy.asarray(df2["class"])
train_x = X[:-TEST_SIZE]
train_y = list(Y[:-TEST_SIZE])
test_x = X[-TEST_SIZE:]
test_y = list(Y[-TEST_SIZE:])

print "select k best features"
ch2 = SelectKBest(chi2, k=NUMBER_OF_FEATURES)
train_x2 = ch2.fit_transform(train_x, train_y)
test_x2 = ch2.transform(test_x)

clf = PassiveAggressiveClassifier(n_iter=100)
"""
We can use other classifiers:
clf = RidgeClassifier(tol=1e-2, solver="lsqr")
clf = tree.DecisionTreeClassifier(max_depth=5, criterion='entropy')
"""
trainTestModel(train_x2, train_y, test_x2, test_y, clf)
writeResults(clf, ch2, vectorizer)
[/sourcecode]
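Once the script above has run, routing a new, unseen request reuses the same fitted objects.  Here is a minimal sketch; the sample text is made up and is assumed to have already been cleaned with cleanStringAndLemmatize from Step 1:

[sourcecode language="python" wraplines="false" collapse="true"]
# route a new, unseen request with the fitted objects from the script above;
# the text below is a hypothetical, already-preprocessed request
new_text = "customer report a fail drive on the array"
features = ch2.transform(vectorizer.transform([new_text]).toarray())
print "routed to:", clf.predict(features)[0]
[/sourcecode]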

For our blogs example, an F-score of 87 percent was obtained (the F-score is the harmonic mean of precision and recall).

The model can be visualized as two word clouds, one for each type of blog.
[Figure: word cloud for “EMC IT’s Journey to the Private Cloud”]

In “EMC IT’s Journey to the Private Cloud” we can see important features such as “transformation”, “strategy” or “big data”.

[Figure: word cloud for “Virtual Geek”]

In “Virtual Geek” we see the word “awesome” as well as hardware-related terms.  The term “continue reading” is actually a leftover from cleaning the text; we would expect the subject-matter expert to point this out so we can scrub the data further and retrain the model.

About the Author: Raphael Cohen