Tuesday, August 7, 2018

ROC


Before you begin any Analysis, it's best to consult the concerned Business Expert and find out which type of error in the Confusion Matrix is of more concern for the Business Problem.

For example, it may be the case that False Negatives are worse than False Positives for your Business.


Remember that the Confusion Matrix of an ML model is usually calculated for a Threshold value of 0.5, the default in most Predictive Analytics tools on the market. Nevertheless, you can always tune this threshold to arrive at better results. You can plot an ROC curve for the model to visualize its performance across all possible Threshold values. One point on an ROC curve corresponds to exactly one threshold used to calculate the values of the Confusion Matrix.
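
As a minimal sketch in R (the vectors actual and pred_prob below are invented purely for illustration), you can see how the same predicted probabilities yield a different Confusion Matrix at two different Threshold values:

# Hypothetical ground truth and predicted probabilities
actual    <- factor(c(1, 1, 0, 0, 1, 0, 0, 1, 0, 0))
pred_prob <- c(0.9, 0.7, 0.6, 0.4, 0.55, 0.3, 0.2, 0.45, 0.1, 0.65)

# Confusion Matrix at the default threshold of 0.5
table(Predicted = pred_prob >= 0.5, Actual = actual)

# Confusion Matrix at a stricter threshold of 0.7
table(Predicted = pred_prob >= 0.7, Actual = actual)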


Knowing the AUC of the ROC curve can help you compare different models. A higher AUC is desirable. When the curves do not cross each other, the judgement is easy: one model outperforms the other at every Threshold value, so you pick the model with the higher AUC.


But AUC alone doesn't tell you the entire story. The shape of the ROC curve matters too when you are comparing two models. When the ROC curves of two models cross each other, they need closer inspection than the AUC alone. At this point you have to make a choice: High Precision/Low Recall or Low Precision/High Recall. Notice that, at the end of the day, you are interested in picking one threshold for your classification. When the ROC curves cross, it may happen that a model with a relatively lower AUC value is of more significance to you.
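
One possible way to compare two models, sketched with the roc.curve function from the ROSE package (the vectors actual, score_a and score_b are made up for illustration; the package choice is an assumption, not necessarily the setup used here):

library(ROSE)   # provides roc.curve()

# Hypothetical ground truth and scores from two competing models
actual  <- c(1, 1, 0, 0, 1, 0, 0, 1, 0, 0)
score_a <- c(0.9, 0.8, 0.3, 0.4, 0.7, 0.2, 0.1, 0.6, 0.3, 0.5)
score_b <- c(0.7, 0.9, 0.4, 0.2, 0.5, 0.3, 0.4, 0.8, 0.1, 0.6)

roc_a <- roc.curve(actual, score_a)                           # plots model A's ROC curve
roc_b <- roc.curve(actual, score_b, add.roc = TRUE, col = 2)  # overlays model B's curve
c(auc_a = roc_a$auc, auc_b = roc_b$auc)                       # compare the two AUCs

If the two curves cross, looking at the plotted curves (and not just the two AUC numbers) tells you over which range of thresholds each model is the better choice.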




The Confusion Matrix


This blog series is intended to discuss some of the most widely used concepts in the task of 'Classification' in Data Science.

Prerequisite: familiarity with basic Machine Learning terminology.

The Confusion Matrix (also known as a Contingency Table) plays an important role in assessing the strength of a Classification Model in Machine Learning.


The four numbers inside this Table:


1. True Positive

2. True Negative

3. False Positive 

4. False Negative 


can be helpful in telling your data story. One can easily plot a confusion matrix using the R library called 'caret'.
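
As a small sketch (the vectors predicted and actual are invented for illustration), caret's confusionMatrix function builds the table and reports several of the metrics listed below:

library(caret)

# Hypothetical predicted classes and ground truth, both as factors with the same levels
predicted <- factor(c(1, 0, 1, 1, 0, 0, 1, 0), levels = c(0, 1))
actual    <- factor(c(1, 0, 0, 1, 0, 1, 1, 0), levels = c(0, 1))

# 'positive' tells caret which level counts as the Positive class
confusionMatrix(data = predicted, reference = actual, positive = "1")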

Looking at the Confusion Matrix alone, one can calculate:


1. Precision

2. Recall

3. F1 Score

4. Accuracy




Figure 1: Confusion Matrix


'Precision' represents the 'Exactness' of the classifier. It is also known as the Positive Predictive Value (PPV). It tells us how likely a Positive prediction is to be correct: Precision = TP / (TP + FP).


'Recall' represents the 'Completeness' of the classifier. It is also known as Sensitivity or the True Positive Rate. It tells us what percentage of all actual Positives the model manages to catch: Recall = TP / (TP + FN).


'F1 score' is the Harmonic Mean of Precision and Recall: F1 = 2 × Precision × Recall / (Precision + Recall).
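
To make the definitions concrete, here is a small sketch that computes these metrics directly from the cells of the Confusion Matrix (the counts are invented for illustration):

# Hypothetical cell counts from a Confusion Matrix
TP <- 40; FP <- 10; FN <- 20; TN <- 30

precision <- TP / (TP + FP)                                 # exactness of Positive predictions
recall    <- TP / (TP + FN)                                 # share of actual Positives caught
f1        <- 2 * precision * recall / (precision + recall)  # harmonic mean of the two
accuracy  <- (TP + TN) / (TP + FP + FN + TN)

c(precision = precision, recall = recall, f1 = f1, accuracy = accuracy)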


The 'Accuracy Paradox' is well known in the Data Science world. A model may look highly accurate, but on an imbalanced dataset that number can be misleading. Therefore it is imperative to check the distribution of the Positive Class and the Negative Class in the dataset.
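
A quick sketch of the paradox, using a made-up 99:1 class split: a model that always predicts the Negative class still looks very accurate, even though it catches none of the Positives.

# Hypothetical imbalanced dataset: 990 Negatives, 10 Positives
actual    <- factor(c(rep(0, 990), rep(1, 10)))
predicted <- factor(rep(0, 1000), levels = c(0, 1))  # a model that always predicts Negative

mean(predicted == actual)   # accuracy = 0.99, yet Recall for the Positive class is 0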


Hence exploring topics like the CAP (Cumulative Accuracy Profile) and the ROC (Receiver Operating Characteristic) curve is a vital step. Both visualizations are quite popular and are widely used when one wants to assess the Discriminatory Capability of the Model.

The ROC curve can be plotted and the AUC (area under the curve) can be calculated using the open-source R function roc.curve. Read more on ROC here.
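
Assuming roc.curve here refers to the function of that name in the ROSE package (the data below is invented), a minimal sketch looks like this:

library(ROSE)   # provides roc.curve()

# Hypothetical ground truth and predicted probabilities
actual    <- c(1, 1, 0, 0, 1, 0, 0, 1, 0, 0)
pred_prob <- c(0.9, 0.7, 0.6, 0.4, 0.55, 0.3, 0.2, 0.45, 0.1, 0.65)

res <- roc.curve(actual, pred_prob)   # plots the ROC curve by default
res$auc                               # area under the curve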

CAP can be visualized on an Excel sheet easily. 

CAP is a very powerful technique for improving the Hit Ratio of your marketing efforts within the predetermined Budget, Time and Manpower of the campaign.
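
As a rough sketch of the idea (the response and score vectors below are hypothetical): rank the prospects by model score, contact them from the top, and plot the cumulative share of responders captured against the share of the population contacted.

# Hypothetical campaign data: 1 = responder, 0 = non-responder, plus model scores
actual <- c(1, 0, 0, 1, 0, 1, 0, 0, 1, 0)
score  <- c(0.9, 0.2, 0.4, 0.8, 0.1, 0.7, 0.3, 0.5, 0.6, 0.35)

ord       <- order(score, decreasing = TRUE)      # contact the highest-scored prospects first
captured  <- cumsum(actual[ord]) / sum(actual)    # cumulative share of responders captured
contacted <- seq_along(actual) / length(actual)   # share of the population contacted

plot(c(0, contacted), c(0, captured), type = "l",
     xlab = "Fraction of population contacted",
     ylab = "Fraction of responders captured",
     main = "CAP curve")
abline(0, 1, lty = 2)                             # baseline: selecting prospects at random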




Figure 2: Example of a CAP curve

Also note that the relation between these two techniques is mathematically expressed as:

AR = 2 × AUC − 1, where AR is the Accuracy Ratio of the CAP.
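
For instance, plugging an assumed AUC of 0.85 into this relation:

auc <- 0.85         # assumed ROC AUC of a model
ar  <- 2 * auc - 1  # Accuracy Ratio of the corresponding CAP curve
ar                  # 0.7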



