Tuesday, August 7, 2018

The Confusion Matrix


This blog series discusses some of the most widely used concepts in the task of 'Classification' in Data Science.

Prerequisite: familiarity with basic Machine Learning terminology.

The Confusion Matrix (also known as a Contingency Table) plays an important role in assessing the strength of a Classification Model in Machine Learning.


The four numbers inside this Table:


1. True Positive

2. True Negative

3. False Positive 

4. False Negative 


can be helpful in telling your data story. One can easily build a confusion matrix using the R library called 'caret', as in the sketch below.
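A minimal sketch, assuming two hypothetical factor vectors of actual and predicted class labels:

library(caret)

# Hypothetical labels; "1" is the Positive class, "0" the Negative class.
actual    <- factor(c("1", "0", "1", "1", "0", "0", "1", "0", "1", "0"), levels = c("1", "0"))
predicted <- factor(c("1", "0", "0", "1", "0", "1", "1", "0", "1", "0"), levels = c("1", "0"))

# Tabulates TP, TN, FP and FN, and reports Accuracy, Sensitivity, PPV, etc.
confusionMatrix(data = predicted, reference = actual, positive = "1")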

Looking at the Confusion Matrix alone, one can calculate:


1. Precision

2. Recall

3. F1 Score

4. Accuracy




Figure 1: Confusion Matrix


'Precision' represents the 'Exactness' of the classifier. It is also known as the Positive Predictive Value (PPV). It tells us how likely the prediction is to be correct when the model predicts Positive.


'Recall' represents the 'Completeness' of the classifier. It is also known as Sensitivity or the True Positive Rate. It tells us what percentage of the Total Positives is caught by the model.


'F1 score' is the Harmonic Mean of Precision and Recall.
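In terms of the four cell counts, these metrics work out to:

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 Score = 2 × (Precision × Recall) / (Precision + Recall)
Accuracy = (TP + TN) / (TP + TN + FP + FN)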


The 'Accuracy Paradox' is well known in the Data Science world. A model may look highly accurate, but in the case of an imbalanced dataset that accuracy can be misleading. It is therefore imperative to check the distribution of the Positive Class and the Negative Class in the data set.
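A minimal sketch of the paradox with hypothetical data: on a 99%-Negative dataset, a classifier that always predicts Negative scores 99% Accuracy yet has zero Recall.

actual_imb    <- factor(c(rep("0", 99), "1"), levels = c("1", "0"))  # only 1% Positive
predicted_imb <- factor(rep("0", 100), levels = c("1", "0"))         # always predicts Negative

mean(predicted_imb == actual_imb)                 # Accuracy: 0.99
sum(predicted_imb == "1" & actual_imb == "1") /
  sum(actual_imb == "1")                          # Recall: 0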


Hence, exploring topics like the CAP (Cumulative Accuracy Profile) and the ROC (Receiver Operating Characteristic) curve is a vital step. Both visualizations are quite popular and are widely used to assess the Discriminatory Capabilities of the Model.

The ROC curve can be plotted, and the AUC (Area Under the Curve) calculated, using the open-source R function roc.curve, as sketched below.
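A minimal sketch, assuming the roc.curve implementation from the ROSE package and hypothetical true labels and model scores:

library(ROSE)

# Hypothetical true labels and model scores for the Positive class.
actual <- c(1, 0, 1, 1, 0, 0, 1, 0, 1, 0)
scores <- c(0.9, 0.2, 0.4, 0.8, 0.1, 0.7, 0.6, 0.3, 0.95, 0.05)

roc <- roc.curve(response = actual, predicted = scores)  # plots the ROC curve
roc$auc                                                  # Area Under the Curve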

The CAP can be visualized easily in an Excel sheet, or plotted directly in R, as sketched below.

The CAP is a very powerful technique for improving the Hit Ratio of your marketing efforts within the predetermined Budget, Time and Manpower of the campaign.
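A minimal sketch in base R, reusing the hypothetical actual labels and scores from the ROC example above: rank the records by model score and track the cumulative share of Positives captured.

# Sort records from highest to lowest model score.
ord <- order(scores, decreasing = TRUE)

frac_targeted <- seq_along(actual) / length(actual)  # share of population contacted
frac_captured <- cumsum(actual[ord]) / sum(actual)   # share of Positives captured

plot(frac_targeted, frac_captured, type = "l",
     xlab = "Fraction of population targeted",
     ylab = "Fraction of Positives captured",
     main = "CAP Curve")
abline(a = 0, b = 1, lty = 2)  # random-selection baseline

The further the curve rises above the diagonal baseline, the better the model concentrates Positives among the top-ranked records.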




Figure 2: Example of a CAP Curve

Also note that the relation between these two techniques is mathematically expressed as:

AR = 2 × AUC − 1, where AR is the Accuracy Ratio of the CAP.
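For example, a model with AUC = 0.85 has AR = 2 × 0.85 − 1 = 0.70, while a purely random model (AUC = 0.5) has AR = 0.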




