You've probably heard by now about all of the advances machine learning is enabling, in areas like voice recognition, conversations, image processing, and self-driving cars. But how can you harness this amazing new power to do something as basic as improve your business's sales? What level of work is required, and what kind of results can you expect to achieve?

This blog article sets out to answer these questions by giving a concrete, real-world example: improving the close rate of an outbound sales team.

You don't need to be a machine learning expert to understand the example I will give. I'll take you through the whole process at a high level, and summarize the results. And for the programmers out there, I'll include the data and sample code.

The first thing you'll need in order to work on any machine learning problem is historical data - the more the better. Even basic machine learning algorithms require hundreds or thousands of data points to achieve reasonable accuracy; some algorithms like neural networks can require millions.

For most (supervised) learning algorithms, the historical data has to be tagged with the "correct" answer. For example, if you are trying to train your algorithm to recognize faces in a picture, you need to start with data where the people are already tagged. Similarly, in our case, where we are trying to predict whether or not a sales lead will purchase our product, we need historical data on prior leads, their attributes, and whether or not they purchased the product. The goal of the machine learning code is then to predict which ones will purchase the product in the future.

I googled for some sample data and found a data set of 3000 sales records, generously provided by a Portuguese bank and used as the basis of a Kaggle competition. The data covers 3000 of the bank's leads and, for each, includes about a dozen attributes (age, education, profession, etc.) as well as whether or not the lead purchased the product being telemarketed (a term deposit).

I randomly pulled out about 10% of these leads and set them aside as part of a "verification set". Once the algorithm is trained, we will test it by applying the predictions to this verification set, and, by comparing our predictions to what actually happened, we will be able to see how accurate our predictions were.

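The remaining data was used to train the algorithm. (In the code at the end of this article, part of that remaining data is also held out as a validation set that XGBoost uses during training.)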
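As a rough illustration of the hold-out idea, here is a minimal sketch in plain Python. The function name `holdout_split`, the default fraction, the seed, and the stand-in list of leads are all assumptions made for this example; this is not the code that produced the results in this article (the notebook code for that appears at the end).

```python
import random

def holdout_split(records, holdout_fraction=0.1, seed=1729):
    """Shuffle a copy of the records, then split off a verification set."""
    shuffled = list(records)                  # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)     # seeded shuffle so the split is repeatable
    n_holdout = round(len(shuffled) * holdout_fraction)
    verification = shuffled[:n_holdout]       # set aside, never shown to the model during training
    training = shuffled[n_holdout:]           # everything else is used for training
    return training, verification

leads = list(range(3000))                     # hypothetical stand-in for 3000 lead records
training, verification = holdout_split(leads)
print(len(training), len(verification))       # prints: 2700 300
```

The important property is that the two sets never overlap: accuracy measured on leads the model has already seen would look better than it really is.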

In practice, the first choice you make regarding the algorithm is which platform to use to develop it. You can write your algorithm from scratch, use a pre-built library like TensorFlow, or use an environment geared to developing and deploying machine learning models, like Amazon SageMaker. I chose the last option because it is the easiest, and because the Amazon algorithms are generally scalable and efficient. SageMaker uses convenient Jupyter notebooks (a popular development environment for Python and machine learning) and runs directly on the AWS cloud with ample computing resources available, which matters for the training part of the process, since training can consume a lot of computing resources.

The next choice you make is which machine learning algorithm to use. There are about a dozen popular machine learning algorithms (cheat sheet here), each one suited to a particular type of problem and data set. The one I chose is XGBoost (a form of gradient boosted trees), which works very well for classification problems (where the answer is one of a limited set of values, like "yes" or "no", as opposed to a number) and does not require a huge data set. In my case, I had ~2700 records to train with, and just needed to predict "yes" or "no": whether or not each lead would buy the product.

With SageMaker, once you have the data set up the way you want it, the machine learning training itself is just a few lines of code. You simply pass the data to an XGBoost training implementation and it trains a model for you. In my case, this took about 15 minutes of execution time. This "model" is essentially a predictor function that lets you predict future sales. AWS lets you set it up as an endpoint that is easily callable from your code.
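To give a sense of what "callable from your code" can look like outside a notebook, here is a hedged sketch that uses the `invoke_endpoint` call of the boto3 `sagemaker-runtime` client. The helper name `predict_scores`, the endpoint name, and the feature values in the usage comment are hypothetical, and the parsing assumes the endpoint returns scores as comma- or newline-separated text.

```python
def predict_scores(runtime_client, endpoint_name, rows):
    """Send feature rows to a deployed endpoint as CSV and return its scores.

    runtime_client is assumed to behave like boto3.client("sagemaker-runtime").
    """
    # One CSV line per lead, with features in the same column order used for training
    body = "\n".join(",".join(str(value) for value in row) for row in rows)
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="text/csv",
        Body=body,
    )
    text = response["Body"].read().decode("utf-8")
    # Accept either comma-separated or newline-separated scores
    return [float(s) for s in text.replace("\n", ",").split(",") if s.strip()]

# Hypothetical usage (requires AWS credentials and a live endpoint):
#   import boto3
#   client = boto3.client("sagemaker-runtime")
#   scores = predict_scores(client, "my-xgboost-endpoint", [[35, 1, 0], [52, 0, 1]])
```

Passing the client in as a parameter keeps the helper easy to exercise with a stand-in object before pointing it at a real endpoint.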

The whole process took me 3-4 hours, most of which was cleaning up the data beforehand. 

What were the results?

  • Without machine learning, calling every lead on the list, the close rate would have been 7.5%.
  • With machine learning, calling only the leads it predicts will close, the close rate would have been 85%.

In other words, even with this simple example, a relatively small data set, and no model tuning, the sales close rate with machine learning was over 11 times higher than without it (85% / 7.5% ≈ 11.3).
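The two close rates above come from simple arithmetic on the verification set, which can be sketched as follows. The function name `close_rates`, the 0.5 cutoff, and the toy numbers are assumptions for illustration only; they are not the bank's data or the results reported here.

```python
def close_rates(actual, scores, threshold=0.5):
    """Return (close rate if every lead is called, close rate if only predicted buyers are called).

    actual: 1 if the lead actually bought, 0 otherwise
    scores: the model's predicted probability of purchase for each lead
    """
    if len(actual) != len(scores):
        raise ValueError("actual and scores must have the same length")
    baseline = sum(actual) / len(actual)                      # buyers / all leads
    called = [a for a, s in zip(actual, scores) if s > threshold]
    targeted = sum(called) / len(called) if called else float("nan")
    return baseline, targeted

# Made-up toy data: 10 leads, 2 of whom actually bought
actual = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.1, 0.2, 0.4, 0.7, 0.05, 0.1, 0.3, 0.2, 0.1]
baseline, targeted = close_rates(actual, scores)
print(f"call everyone: {baseline:.0%}, call predicted buyers only: {targeted:.0%}")
# prints: call everyone: 20%, call predicted buyers only: 50%
```

In the toy data, one real buyer scores below the cutoff and would never be called. A higher close rate on a shorter call list therefore does not by itself say how many total sales are captured.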

With more work, it's possible to improve the close rate even further.

Hopefully this example gives you a sense of the power of machine learning, and how it can be applied to real-world problems that all businesses face.

Here is the code for those who are curious. You should be able to run this directly in a SageMaker Jupyter notebook.

The sample data used is here.

bucket = 'marketing-example-1'
prefix = 'sagemaker/xgboost'
 
# Define IAM role
import boto3
import re
from sagemaker import get_execution_role

role = get_execution_role()

#import libraries
import numpy as np                                # For matrix operations and numerical processing
import pandas as pd                               # For munging tabular data
import matplotlib.pyplot as plt                   # For charts and visualizations
from IPython.display import Image                 # For displaying images in the notebook
from IPython.display import display               # For displaying outputs in the notebook
from time import gmtime, strftime                 # For labeling SageMaker models, endpoints, etc.
import sys                                        # For writing outputs to notebook
import math                                       # For ceiling function
import json                                       # For parsing hosting outputs
import os                                         # For manipulating filepath names
import sagemaker                                  # Amazon SageMaker's Python SDK provides many helper functions
from sagemaker.predictor import csv_serializer    # Converts strings for HTTP POST requests on inference

#download data set
!wget https://fasttrackteam.com/Data/sites/1/media/data.csv

#read into data frame
data = pd.read_csv('./data.csv', sep=',')
pd.set_option('display.max_columns', 500)     # Make sure we can see all of the columns
pd.set_option('display.max_rows', 20)         # Keep the output on one page
data

#clean up data
data['no_previous_contact'] = np.where(data['pdays'] == 999, 1, 0)                                 # Indicator variable to capture when pdays takes a value of 999
data['not_working'] = np.where(data['job'].isin(['student', 'retired', 'unemployed']), 1, 0)       # Indicator for individuals not actively employed
model_data = pd.get_dummies(data, dtype=int)                                                       # Convert categorical variables to sets of 0/1 indicators (newer pandas defaults to True/False)
model_data = model_data.drop(['duration', 'emp.var.rate', 'cons.price.idx', 'cons.conf.idx', 'euribor3m', 'nr.employed'], axis=1)   # Drop call duration (unknown before a call is made) and the macroeconomic indicators

#split into train, test, validation sets
train_data, validation_data, test_data = np.split(model_data.sample(frac=1, random_state=1729), [int(0.7 * len(model_data)), int(0.9 * len(model_data))])   # Randomly sort the data then split out first 70%, second 20%, and last 10%

#prep for XGBoost
pd.concat([train_data['convert_yes'], train_data.drop(['convert_no', 'convert_yes'], axis=1)], axis=1).to_csv('train.csv', index=False, header=False)
pd.concat([validation_data['convert_yes'], validation_data.drop(['convert_no', 'convert_yes'], axis=1)], axis=1).to_csv('validation.csv', index=False, header=False)

#copy to S3
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'train/train.csv')).upload_file('train.csv')
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'validation/validation.csv')).upload_file('validation.csv')

#set up training instances
containers = {'us-west-2': '433757028032.dkr.ecr.us-west-2.amazonaws.com/xgboost:latest',
              'us-east-1': '811284229777.dkr.ecr.us-east-1.amazonaws.com/xgboost:latest',
              'us-east-2': '825641698319.dkr.ecr.us-east-2.amazonaws.com/xgboost:latest',
              'eu-west-1': '685385470294.dkr.ecr.eu-west-1.amazonaws.com/xgboost:latest',
              'ap-northeast-1': '501404015308.dkr.ecr.ap-northeast-1.amazonaws.com/xgboost:latest'}

s3_input_train = sagemaker.s3_input(s3_data='s3://{}/{}/train'.format(bucket, prefix), content_type='csv')
s3_input_validation = sagemaker.s3_input(s3_data='s3://{}/{}/validation/'.format(bucket, prefix), content_type='csv')

#create training job
sess = sagemaker.Session()

xgb = sagemaker.estimator.Estimator(containers[boto3.Session().region_name],
                                    role, 
                                    train_instance_count=1, 
                                    train_instance_type='ml.m4.xlarge',
                                    output_path='s3://{}/{}/output'.format(bucket, prefix),
                                    sagemaker_session=sess)
xgb.set_hyperparameters(max_depth=5,
                        eta=0.2,
                        gamma=4,
                        min_child_weight=6,
                        subsample=0.8,
                        silent=0,
                        objective='binary:logistic',
                        num_round=100)

xgb.fit({'train': s3_input_train, 'validation': s3_input_validation}) 

#create an endpoint based on trained model
xgb_predictor = xgb.deploy(initial_instance_count=1,
                           instance_type='ml.m4.xlarge')

#evaluate results
xgb_predictor.content_type = 'text/csv'
xgb_predictor.serializer = csv_serializer
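def predict(data, rows=500):
    # Send the data to the endpoint in mini-batches of roughly `rows` rows each
    split_array = np.array_split(data, int(data.shape[0] / float(rows) + 1))
    predictions = ''
    for array in split_array:
        # Each response is a comma-separated string of scores; append it to the running string
        predictions = ','.join([predictions, xgb_predictor.predict(array).decode('utf-8')])

    return np.fromstring(predictions[1:], sep=',')   # Drop the leading comma and parse into a NumPy array

# .values replaces DataFrame.as_matrix(), which was removed in pandas 1.0
predictions = predict(test_data.drop(['convert_no', 'convert_yes'], axis=1).values)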

pd.crosstab(index=test_data['convert_yes'], columns=np.round(predictions), rownames=['actuals'], colnames=['predictions'])

#clean up
sagemaker.Session().delete_endpoint(xgb_predictor.endpoint)
Posted by Brian Conte Monday, July 2, 2018 2:34:00 AM Categories: B2B big data technology