Model performance metrics: How well does my model perform? – Part 1

Tavish | Last Updated: 17 Apr, 2015
6 min read

If you are preparing for an analytics interview, you have hit the jackpot: this article will give you answers to at least 2-3 questions that are likely to be asked. Even if you already know some of the metrics discussed here, it is still worth reading to refresh your memory.

Once you have done all the hard work and built a model, it all boils down to a single number or a single curve that tells you "How well does my model perform?" That is exactly what we will discuss here.

In our industry, we consider different kinds of metrics to evaluate our models. The choice of metric depends entirely on the type of model and its implementation plan. In this article, I attempt to lay out the most commonly used metrics and plots for evaluating model performance and explain which metric is used in which scenario.

Warming up: Types of Predictive models

When we talk about predictive models, we mean either a regression model (continuous output) or a classification model (nominal or binary output). The evaluation metrics used for each are different. We will first take up the metrics used for classification problems.

In classification problems, we use two types of algorithms, depending on the kind of output they create:

  1. Class output: Algorithms like SVM and KNN produce a class output. For instance, in a binary classification problem, the output will be either 0 or 1. Today we have algorithms that can convert these class outputs to probabilities, but such conversions are not well accepted by the statistics community.
  2. Probability output: Algorithms like Logistic Regression, Random Forest, Gradient Boosting, AdaBoost etc. give probability outputs. Converting a probability output to a class output is just a matter of choosing a threshold probability.

We will address both types of models in the evaluation metrics discussed below.

 

Illustrative Example:

For the entire discussion of classification evaluation metrics, I have used my predictions for the BCI challenge on Kaggle (link). The solution to the problem is irrelevant to the discussion; however, the final predictions on the training set have been used for this article. The predictions made for this problem were probability outputs, which have been converted to class outputs assuming a threshold of 0.5.
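The thresholding step mentioned above is simple to sketch. Here is a minimal, hypothetical example; the probabilities below are made up for illustration, not the actual competition predictions:

```python
# Convert probability outputs to class outputs at a chosen threshold.
# The probabilities here are illustrative, not the BCI-challenge scores.
def to_class(probs, threshold=0.5):
    """Return 1 where the probability meets the threshold, else 0."""
    return [1 if p >= threshold else 0 for p in probs]

probs = [0.12, 0.55, 0.50, 0.91, 0.33]
print(to_class(probs))        # threshold 0.5, as used in this article
print(to_class(probs, 0.3))   # a lower threshold flags more positives
```

Lowering the threshold turns more observations into predicted positives, which is exactly the lever discussed in the confusion-matrix section below.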

Classification Problems Evaluation metrics

1. Confusion Matrix:

A confusion matrix is an N x N matrix, where N is the number of classes being predicted. For the problem at hand, we have N = 2, so we get a 2 x 2 matrix. Here are a few definitions you need to remember for a confusion matrix:

  • Accuracy : the proportion of the total number of predictions that were correct.
  • Positive Predictive Value or Precision : the proportion of predicted positive cases that were actually positive.
  • Negative Predictive Value : the proportion of predicted negative cases that were actually negative.
  • Sensitivity or Recall : the proportion of actual positive cases which are correctly identified.
  • Specificity : the proportion of actual negative cases which are correctly identified.

[Image: Confusion matrix]

The accuracy for the problem at hand comes out to be 88%. As you can see from the above two tables, the Positive Predictive Value is high, but the Negative Predictive Value is quite low. The same holds for Sensitivity and Specificity. This is primarily driven by the threshold value we have chosen: if we decrease the threshold, the two pairs of starkly different numbers will come closer.

In general, we are concerned with one of the metrics defined above. For instance, a pharmaceutical company will be most concerned with minimizing wrong positive diagnoses, and hence with high Specificity. An attrition model, on the other hand, will be more concerned with Sensitivity. Confusion matrices are generally used only with class output models.
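All five definitions follow directly from the four cells of a 2 x 2 confusion matrix. Here is a minimal sketch, using made-up cell counts rather than the article's table:

```python
def confusion_metrics(tp, fp, fn, tn):
    """Compute the five metrics defined above from the four cell counts."""
    return {
        "accuracy":    (tp + tn) / (tp + fp + fn + tn),
        "precision":   tp / (tp + fp),  # positive predictive value
        "npv":         tn / (tn + fn),  # negative predictive value
        "sensitivity": tp / (tp + fn),  # recall
        "specificity": tn / (tn + fp),
    }

# Hypothetical counts: 40 true positives, 10 false positives,
# 5 false negatives, 45 true negatives.
m = confusion_metrics(tp=40, fp=10, fn=5, tn=45)
print(m["accuracy"])   # 0.85
```

Note how a model can score well on one pair of metrics and poorly on the other, which is why the threshold choice matters.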

2. Gain and Lift charts:

Gain and Lift charts are mainly used to check the rank ordering of the probabilities. Here are the steps to build a Lift/Gain chart:

Step 1 : Calculate the probability for each observation.

Step 2 : Rank these probabilities in decreasing order.

Step 3 : Build deciles, with each group having approximately 10% of the observations.

Step 4 : Calculate the response rate in each decile for Good (responders), Bad (non-responders) and total.
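The four steps above can be sketched in a few lines of Python. The data here is a toy example (20 observations with made-up scores), not the competition table:

```python
# Toy implementation of Steps 1-4: rank observations by probability,
# cut into deciles, and compute the responder rate in each decile.
def decile_response_rates(scored, n_bins=10):
    """scored: list of (probability, is_responder) pairs. Step 1 (scoring)
    is assumed already done; returns the response rate for each decile."""
    ranked = sorted(scored, key=lambda x: x[0], reverse=True)   # Step 2
    size = len(ranked) // n_bins                                # Step 3
    rates = []
    for i in range(n_bins):
        # The last bin absorbs any leftover observations.
        chunk = ranked[i * size:] if i == n_bins - 1 else ranked[i * size:(i + 1) * size]
        rates.append(sum(label for _, label in chunk) / len(chunk))  # Step 4
    return rates

# 20 toy observations where the 10 highest-scored are all responders:
scored = [(1 - i / 20, 1 if i < 10 else 0) for i in range(20)]
print(decile_response_rates(scored))
```

A perfectly rank-ordered model like this toy one puts all responders in the top deciles; a real model's rates decay more gradually.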

You will get the following table, from which you need to plot the Gain/Lift charts:

[Image: Lift and Gain table]

This is a very informative table. The Cumulative Gain chart is the plot of Cumulative %Right against Cumulative %Population. For the case at hand, here is the graph:

[Image: Cumulative Gain chart]

This graph tells you how well your model segregates responders from non-responders. For example, the first decile, though it contains only 10% of the population, has 14% of the responders. This means we have a 140% lift in the first decile.

What is the maximum lift we could have reached in the first decile? From the first table of this article, we know that the total number of responders is 3850. The first decile contains 543 observations. Hence, the maximum capture in the first decile could have been 543/3850 ~ 14.1% of all responders. We are therefore quite close to perfection with this model.
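The arithmetic above is easy to verify. A quick sketch; the total population of 5430 is an assumption (10 deciles of roughly 543 observations each), not a figure from the article:

```python
# Verify the maximum-lift arithmetic from the text.
total_responders = 3850
decile_size = 543        # observations in the first decile
population = 5430        # assumed: 10 deciles of roughly 543 each

# Best case: every observation in the first decile is a responder.
max_capture = decile_size / total_responders
print(round(max_capture * 100, 1))           # share of responders, in %

# Lift = share of responders captured / share of population covered.
observed_capture = 0.14                      # 14%, as stated in the text
lift = observed_capture / (decile_size / population)
print(round(lift, 2))                        # 1.4 corresponds to a 140% lift
```

Since the observed 14% capture is close to the 14.1% ceiling, the model's first decile is nearly as good as it could possibly be.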

Let’s now plot the lift curve. The lift curve is the plot between total lift and %population. Note that for a random model, it always stays flat at 100%. Here is the plot for the case at hand:

[Image: Lift chart]

You can also plot decile-wise lift against decile number:

[Image: Decile-wise lift chart]

What does this graph tell you? It tells you that our model does well up to the 7th decile; after that, every decile is skewed towards non-responders. Any model whose lift per decile stays above 100% until at least the 3rd decile, and at most the 7th decile, is a good model. Otherwise, you might consider oversampling first.

Lift/Gain charts are widely used in campaign targeting problems. They tell us up to which decile we can target customers for a specific campaign, and how much response to expect from the new target base.

3. K-S chart:

The K-S or Kolmogorov-Smirnov chart measures the performance of classification models. More accurately, K-S is a measure of the degree of separation between the positive and negative distributions. The K-S is 100 if the scores partition the population into two separate groups, one containing all the positives and the other all the negatives. On the other hand, if the model cannot differentiate between positives and negatives, it is as if it selects cases randomly from the population, and the K-S would be 0. In most classification models the K-S falls between 0 and 100, and the higher the value, the better the model is at separating positive from negative cases.
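The K-S statistic described above can be computed directly from scored observations. A minimal sketch on toy data, reported on the 0-100 scale the text uses:

```python
def ks_statistic(scored):
    """scored: list of (probability, label) pairs, label 1 = positive.
    K-S is the maximum gap between the cumulative % of positives and the
    cumulative % of negatives as we move down the ranked list."""
    ranked = sorted(scored, key=lambda x: x[0], reverse=True)
    n_pos = sum(label for _, label in ranked)
    n_neg = len(ranked) - n_pos
    cum_pos = cum_neg = 0
    best = 0.0
    for _, label in ranked:
        if label:
            cum_pos += 1
        else:
            cum_neg += 1
        best = max(best, abs(cum_pos / n_pos - cum_neg / n_neg))
    return best * 100

# Perfect separation: all positives score above all negatives -> K-S = 100.
print(ks_statistic([(0.9, 1), (0.8, 1), (0.2, 0), (0.1, 0)]))
```

The decile at which this maximum gap occurs is the one highlighted in the K-S table for the case at hand.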

For the case at hand, here is the table:

[Image: K-S table]

We can also plot the %Cumulative Good and Bad to see the maximum separation. Following is a sample plot :

[Image: K-S plot]

End Notes:

The metrics covered in this article are some of the most commonly used evaluation metrics for classification problems. In the next article, we will take up a few more metrics, such as the concordant ratio, AUC-ROC and the Gini coefficient. All of these are related to the relative ranking of the predicted probabilities.

Did you find the article useful? Which metrics do you prefer, and why? Do let us know your thoughts in the comments below.

If you like what you just read and want to continue your analytics learning, subscribe to our emails, follow us on Twitter or like our Facebook page.


Responses From Readers


anup

good work,its really Useful.

Diptesh

Hi The article was quiet informative. Thanks for such well-informed article. I am in learning stage therefore, a) Can you share an excel file for lift chart, there are few steps where I require a hand-holding (Especially when we need to draw a lift chart with oversampled datasets). If its convenient to you, may I discuss with you through email. b) In certain datasets we are getting misclassification rate as high as 40-45%. Furthermore misclassification rate of testing set (data on which model is built) is less than misclassification rate of validation set (data on which the model is tested). What should be done in these cases? Are there any step by step process for dealing such high misclassification rates? Regards Diptesh

Preetha Rajan

Hi, many thanks for this useful article! As a fresher looking to enter the field, this is very useful! However, when I open the article both in Internet Explorer and Chrome, the images uploaded for the confusion matrix, all the images uploaded for the gain chart and all the images uploaded for the k-s chart, are just not loading and I am unable to view them in the blog post! Could you please fix this? Thanks, Preetha