🔮Try making your own predictions using the dashboard here!

One of the trade-offs that come with using black-box models includes losing model interpretability. Fortunately, with the help of the confusion matrix and Shapley’s summary plots, there are ways of interpreting these complicated models.

I created two models, one using XGBoosts XGBClassifier() and the other using Scikit-Learns RandomForestClassifier() to compare and try to interpret how both models came to the same answer differently. The data comes from google analytics and uses standard features associated with the google analytics platform. This target of the data was the “Revenue” columns, which showed a True or False for whether or not the customer bought something on the site.

The data for the test came from Kaggle and included some features standard with google analytics such as:

  • Revenue (Class Label)
  • Administrative
  • Administrative Duration
  • Informational
  • Informational Duration
  • Product-Related
  • Product-Related Duration
  • Bounce Rate
  • Exit Rate
  • Page Value
  • Special Day
  • Operating System
  • Browser
  • Region
  • Traffic Type
  • Month

Descriptions about each feature are found here.

Accuracy is the most commonly known metric when performing classification, and it’s what everyone is familiar with. Some limitations come when using accuracy, and one of those limitations is not knowing how many false positives exist in your model. A false positive is an incorrect answer that is being classified positive. Knowing how many false positives exist in the model will help us understand if the model is guessing or predicting the correct answer.

The mean baseline for the majority class was 84.5%. This means that the model could guess False for every prediction and still be right 84.5% of the time. Understanding the mean baseline made it easier to determine how the model was performing, and if it was inherently useful.

Confusion Matrix (Revenue=1, No Revenue=0)

Sometimes the model can classify a positive class negative, and this is called a false negative. This can be troublesome, especially if your positive class has to be identified correctly, like a rare disease, dangerous person, or fraud. If you wanted to understand how well your model identified all positive classes, then you would use Recall. This is calculated by dividing the true positive by the sum of all true positives plus false negatives.

If you wanted to know how well your model predicted the right class correctly, then you would want to find the precision. Precision is calculated by dividing the true positives by the total predicted positive. For example, sometimes, your credit card will get declined if your visiting a new destination. The bank’s fraud detection is making an incorrect prediction this time and with low precision would make the wrong decision many times. A model with higher precision will be able to correctly predict the positive class correctly.

Shapley Values

The following charts are made by plotting the Shapley values for every row found in the dataset. The most important features are represented on the left of the chart from top to bottom. The positive and negative values represent the respective correlations to the target calculated using log odds. The color represents how big of an impact that selected feature had during that individual prediction.

Shapleys are great for understanding what role each feature played in each prediction—plotting all Shapley values seen in the data shows how the model performed overall.


Random Forest


Classifier Accuracy Precision Recall ROC/AUC
XGBoost 90.1% 64.0% 77.0% 92.5%
Random Forest 90.7% 64.4% 88.2% 90.1%