Erez Katz, CEO and Cofounder of Lucena Research
Maximize Your Data's Potential By Qualifying the Best Model Before Risking Capital
One of the biggest challenges in machine learning research is to identify the best model from an array of trained models. Here I’ll review several metrics by which models are traditionally measured for quantitative investing. I will then home in on the AUC score, one method you should consider using to compare a wide set of models.
Machine Learning Model Training
TensorBoard, the open source visualization toolkit for Google’s TensorFlow machine learning platform, provides an excellent view of an artificial neural network as it trains. The image below, taken from a binary classifier that forecasts stocks, shows the precision and accuracy of the model during training.
A (1) label represents the stock reaching a meaningful target % change within 21 days.
You can clearly see how over time, the model improves. With more data, the model’s accuracy and precision improve. Hence, the model is learning.
The binary classifier is tasked with identifying one of two states:

(1)  Buy.

(0)  Sell if you were holding a position, or do nothing.
After 140 passes through the training data (called epochs in machine learning lingo) we’ve achieved 64% accuracy and 69% precision.
Two Popular Statistical Measures of Your Model's Output
Accuracy: Measures how often the classifier is correct (64.4%). In other words, how many times (1) and (0) signals were triggered correctly as a percentage of all possible signals.
Precision: Indicates how many of the model’s positive predictions are correct (69%). Simply stated, precision is how often a buy signal turns out to be right, as a percentage of all buy signals the model issued.
On the surface, the statistics look compelling. If we assume that 69% of all of our trades will generate profit, we can theoretically have a very successful fund. However, this is only true if buy/sell signals were split 50/50 and the profit/loss amount on every transaction were equal. We clearly know that in reality this is never the case. We need more information to determine if the model is good, as demonstrated by the following scenarios:
 Imagine we evaluate the model during a bullish period in which the constituents move higher 70% of the time anyway. If the model did nothing but bet a Buy or (1) every time, it would have achieved 70% accuracy by default.
 Imagine the model only makes one bet and was successful. Does it really mean the model can reliably achieve 100% success? Probably not.
The default assumption of a 50/50 signal distribution and a 50/50 chance of (1) or (0) across our signal universe (often referred to as labeled data, or data whose outcome is known) is clearly not realistic, and therefore evaluating trading model efficacy can be tricky.
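The first scenario above can be reproduced in a few lines. This is a minimal sketch on made-up data (100 observations, 70 of them buys); the helper names `accuracy` and `precision` are illustrative and not taken from any library.

```python
# Hypothetical labels for a bullish period: 70 of 100 observations are buys (1).
actual = [1] * 70 + [0] * 30

# A "model" that has learned nothing and simply predicts buy every time.
always_buy = [1] * len(actual)


def accuracy(actual, predicted):
    """Share of predictions that match the known label."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)


def precision(actual, predicted):
    """Share of predicted buys (1) that were actual buys.

    Assumes at least one buy was predicted, otherwise the ratio is undefined.
    """
    labels_of_predicted_buys = [a for a, p in zip(actual, predicted) if p == 1]
    return sum(labels_of_predicted_buys) / len(labels_of_predicted_buys)


baseline_accuracy = accuracy(actual, always_buy)
baseline_precision = precision(actual, always_buy)
print(f"accuracy={baseline_accuracy:.2f}, precision={baseline_precision:.2f}")
```

Both numbers come out at 0.70 even though the "model" never looks at the data, which is why accuracy and precision alone cannot certify a classifier.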
How To Grade and Compare Multiple Machine Learning Models
Grading Your Machine Learning Model: Confusion Matrix
It’s clear that measuring how often the model was correct in identifying (1) labels and (0) labels (true positives and true negatives) is not enough. The rate at which the model was wrong (false positives and false negatives) should be considered as well.
A confusion matrix is a tabular representation of the performance of a classification model. Given a set of validation data for which the true values are known, the confusion matrix identifies how effective the model was in predicting the known outcome. In our case we are looking at a binary classifier with two potential outputs: 1 (buy) or 0 (sell or do nothing).
A sample of a confusion matrix with 1000 labels. Predicted 1’s and 0’s vs actual 1’s and 0’s.
Blue cells: Represent the total predicted 1 (buys) and 0 (sells), 340 and 660 respectively, for a total of 1,000 predictions.
Red cells: Represent the total actual 1 (buys) and 0 (sells), 460 and 540 respectively, for a total of 1,000 observations.
Green cells: True positives (150 predictions) and true negatives (350 predictions). These measure how often the classifier correctly predicted the known outcome. We want the model to maximize the value of the green cells.
Yellow cells: False positives (190 predictions) and false negatives (310 predictions) signify the incorrect predictions by the model. We naturally would favor a model that minimizes the yellow cells’ values and maximizes the green cells’ values.
As you can see, even with a simple binary 0 and 1 classification the outcome can be fairly confusing, not to mention when you have three or more potential outcomes. For example, 1, 0, -1 (1 buy, 0 do nothing, -1 sell) would generate a three by three confusion matrix.
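To make the cell counts concrete, here is a small sketch that rebuilds the matrix described above. The (actual, predicted) pairs are synthesized to match the quoted counts; they are not the original validation data.

```python
from collections import Counter

# Synthetic (actual, predicted) pairs matching the counts quoted in the text.
pairs = (
    [(1, 1)] * 150    # true positives
    + [(1, 0)] * 310  # false negatives
    + [(0, 1)] * 190  # false positives
    + [(0, 0)] * 350  # true negatives
)

counts = Counter(pairs)  # keys are (actual, predicted)
tp, fn = counts[(1, 1)], counts[(1, 0)]
fp, tn = counts[(0, 1)], counts[(0, 0)]

# Rows are actual labels, columns are predicted labels.
print(f"{'':10}{'pred 1':>8}{'pred 0':>8}{'total':>8}")
print(f"{'actual 1':10}{tp:>8}{fn:>8}{tp + fn:>8}")
print(f"{'actual 0':10}{fp:>8}{tn:>8}{fp + tn:>8}")
print(f"{'total':10}{tp + fp:>8}{fn + tn:>8}{len(pairs):>8}")
```

The row totals reproduce the red cells (460 and 540) and the column totals reproduce the blue cells (340 and 660).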
Deriving Key Performance Metrics From a Confusion Matrix

It turns out that there are quite a few valuable statistical measures that can be derived from the confusion matrix, some of which we’ve seen earlier. To name a few:

 Accuracy – How often is the classifier correct? (TP + TN) / Total Observations = (150 + 350) / 1000 = 50%

 Precision – How often are the buy predictions correct? TP / Total Buy Predictions = 150 / 340 = 44.12%

 Sensitivity – Also called True Positive Rate or Recall. What percentage of the buy labels was the classifier able to identify? TP / Actual Buy = 150 / 460 = 32.61%

 Specificity – Also called True Negative Rate. What percentage of the sell (0) labels were identified correctly by the classifier? TN / Actual Sell = 350 / 540 = 64.81%

 Misclassification Rate – How often is the classifier wrong? (FP + FN) / Total = (190 + 310) / 1000 = 50%

 Null Error Rate – How often you would be wrong if you always predicted the majority class.
In our case, sell is the majority class (540 cases). If I forecasted 1,000 sells I would be right in 540 of the cases and wrong in 460 of the cases (1000 – 540 = 460). And so, 460 wrong sell forecasts out of 1,000 predictions is 460 / 1000 = 46%.
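The arithmetic above can be checked with a short script. This sketch hard-codes the four cell counts from the sample confusion matrix; the dictionary keys are just descriptive names.

```python
# Cell counts from the sample confusion matrix in the text.
tp, fp, fn, tn = 150, 190, 310, 350
total = tp + fp + fn + tn

actual_buys = tp + fn    # 460
actual_sells = tn + fp   # 540

metrics = {
    "accuracy": (tp + tn) / total,
    "precision": tp / (tp + fp),
    "sensitivity": tp / actual_buys,
    "specificity": tn / actual_sells,
    "misclassification_rate": (fp + fn) / total,
    # Error made by always predicting the majority class.
    "null_error_rate": 1 - max(actual_buys, actual_sells) / total,
}

for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
```

Note that the misclassification rate (50%) is higher than the null error rate (46%): always predicting sell would have made fewer mistakes than this classifier.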
Breaking Down The Results
Judging from the results above, the classifier didn’t truly learn anything meaningful: its 50% accuracy is lower than the 54% it would have scored by always predicting the majority class. However, there are so many parameters to sift through that it’s difficult to work out what each of them truly means. For example:

 Is the classifier making fewer bets in order to obtain a higher degree of precision?

 Is the classifier making better sell or do-nothing predictions than buy predictions? This could be meaningful for some investment paradigms, but typically we want to deploy our capital rather than sit on cash.
There must be a way to put all these parameters into a common baseline in order to compare models’ fitness for a predetermined investment objective. Introducing the ROC curve!
Receiver Operating Characteristic (ROC)
ROC is a graphical representation of the trade-off between a model’s sensitivity and specificity across multiple probability thresholds. In essence, we are looking to evaluate a model’s efficacy based on how well it separates the positive class from the negative class over a wide range of probability scores.
The image below shows how a model distinguishes between true positive and true negative for a 50% probability threshold (designated as “cut off value”).
A synthetic distribution of a hypothetical model which is able to distinguish very effectively (90% accuracy) between buy (positive) and sell (negative) signals. The shaded areas in which the distribution curves overlap indicate false negatives (the red region left of the vertical threshold line) or false positives (the blue region right of the vertical threshold line).
Blue curve: Represents the 0 (sell or do nothing) signals.
Red curve: Represents the 1 (buy) signals.
As you can see, we’ve introduced two additional concepts:

Probability – A confidence score (normally 0 to 1 or 0 to 100) generated by the classifier. An example is the SoftMax output for the 1 (buy) class.

Threshold – The probability cutoff point that separates 1 labels from 0 labels. In our example, it separates sell (blue) from buy (red) signals.
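The two concepts can be illustrated with a tiny sketch. The scores are invented, and treating a score equal to the threshold as a buy is an assumption; a real system might use a strict inequality instead.

```python
# Hypothetical confidence scores (probability of the 1 / buy class).
scores = [0.91, 0.62, 0.50, 0.48, 0.15]


def classify(scores, threshold=0.5):
    """Label a score 1 (buy) when it reaches the threshold, otherwise 0."""
    return [1 if s >= threshold else 0 for s in scores]


labels_at_half = classify(scores)         # default 50% cutoff
labels_at_high = classify(scores, 0.70)   # stricter cutoff, fewer buys
print(labels_at_half, labels_at_high)
```

The same scores yield three buys at the 0.5 cutoff and only one at 0.7, so the threshold is a decision made on top of the model rather than a property of the model itself.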
It’s important to note that in the real world the data is rarely distributed as perfectly as depicted above.
The two pairs of true positive vs. true negative distributions below do a much better job of illustrating the difference between a bad model and a more effective model.
The top model is not very effective, as there are large overlap areas under the curves, which indicate a high degree of false positives and false negatives. However, in the lower image we see a much more effective model. There is a clear distinction between the true positives and true negatives, leading to 85% accuracy for a 0.5 threshold (50% threshold).
Separating Good From Bad Trading Models
As we move the threshold up or down, the balance between true positives and false positives changes accordingly. The right probability threshold is mainly driven by business needs, or more specifically by the business’s ability to tolerate errors in the form of false positives or false negatives.
For example, if we were to build a model that predicts the likelihood of a deadly disease such as cancer, we would probably use a low threshold since we want to capture ALL cancer patients even with the likelihood of a few false positives. In trading models however, we can be more tolerant towards bad trades if we are able to capture more profitable trades with superior net gains.
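A quick way to see this trade-off is to count the four outcomes at a low and a high threshold. The scored observations below are invented for illustration, and `confusion_counts` is a hypothetical helper.

```python
# Hypothetical (score, actual_label) pairs; scores are probabilities of a buy.
scored = [(0.95, 1), (0.80, 1), (0.70, 0), (0.60, 1),
          (0.40, 0), (0.30, 1), (0.20, 0), (0.10, 0)]


def confusion_counts(scored, threshold):
    """Return (tp, fp, fn, tn) when scores >= threshold are labeled 1."""
    tp = sum(1 for s, y in scored if s >= threshold and y == 1)
    fp = sum(1 for s, y in scored if s >= threshold and y == 0)
    fn = sum(1 for s, y in scored if s < threshold and y == 1)
    tn = sum(1 for s, y in scored if s < threshold and y == 0)
    return tp, fp, fn, tn


low = confusion_counts(scored, 0.25)    # catches every buy, accepts false alarms
high = confusion_counts(scored, 0.75)   # no false alarms, misses half the buys
print("threshold 0.25:", low)
print("threshold 0.75:", high)
```

On this toy data the low threshold finds all four buys at the cost of two false positives, while the high threshold produces no false positives but misses two buys.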
Now that we are able to capture true positive rate and false positive rate for any probability threshold in a model, we can go ahead and capture it in a ROC graph.
We plot the ROC graph based on two measures for every possible probability threshold:
X axis – False positive rate
Y axis – True positive rate
In the top image we marked two probability thresholds (A and B) for a model that was able to separate effectively the true positives from the true negatives. This class separation can be further depicted along an ROC curve (bottom image), which plots the false positive rate on the X axis and true positive rate on the Y axis.
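The construction of the curve can be sketched as a sweep over thresholds. This is a simplified illustration on invented data, assuming both classes are present; `roc_points` is a hypothetical helper, and the false positive rate is computed as FP divided by all actual negatives (equivalently, 1 minus specificity).

```python
def roc_points(scores, labels):
    """Return (fpr, tpr) pairs, one per candidate threshold.

    Thresholds run from above the highest score (nothing labeled 1)
    down to the lowest score (everything labeled 1).
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    thresholds = [float("inf")] + sorted(set(scores), reverse=True)
    points = []
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


# Hypothetical scores and known labels.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1, 1, 0, 1, 0, 0]

curve = roc_points(scores, labels)
for fpr, tpr in curve:
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

Every curve built this way starts at (0, 0) and ends at (1, 1); what differs between models is how quickly the true positive rate climbs before the false positive rate does.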
Finding the AUC Score
The ROC graph in essence depicts the tradeoff between true positive rate and false positive rate as we vary the decision threshold. When comparing multiple ROC curves, a larger “area under the curve” indicates a model that does a better job separating classes, or simply put, is more predictive.
In order to place a single value that depicts the strength of the classification model, we can simply measure the area under the curve of the ROC graph. A larger value represents a better model.
The above graph represents:
Blue – Random guesses, in which the true positive and true negative distributions completely overlap. AUC is measured at 0.5.
Green – Poor model in which a large portion of the true positive curve overlaps the true negative. AUC is measured at 0.62.
Orange – Good model with a distinct separation between the true positive and the true negative. AUC is measured at 0.93.
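Measuring the area is straightforward once the curve is available as a list of points. The sketch below uses the trapezoidal rule, one common way to do it; the three example curves are made up and are not the blue, green, and orange curves from the graph.

```python
def auc_trapezoid(points):
    """Area under a curve given as (fpr, tpr) points, via the trapezoidal rule."""
    pts = sorted(points)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical ROC curves.
random_guess = [(0.0, 0.0), (1.0, 1.0)]             # the diagonal
perfect = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0)]      # hugs the top-left corner
in_between = [(0.0, 0.0), (0.2, 0.6), (0.5, 0.9), (1.0, 1.0)]

auc_random = auc_trapezoid(random_guess)
auc_perfect = auc_trapezoid(perfect)
auc_between = auc_trapezoid(in_between)
print(f"{auc_random:.2f} {auc_perfect:.2f} {auc_between:.2f}")
```

The diagonal scores 0.5 and the perfect curve scores 1.0, so a useful model should land somewhere between the two.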
Better AUC Scores Drive Superior Models for Algorithmic Investment
When conducting hyperparameter tuning, the system can generate hundreds if not thousands of models, making it difficult to ascertain which model is superior. Traditionally, deciding which model is most suitable for an investment objective has been a function of human intuition (some would call it art) and science.
By comparing the AUCs (areas under the curve) of multiple ROC curves, we now have a very clean and scientific way to quantify the strength of a model. Further, we can now modify or extend the cost function of the learner in order to train for a greater AUC score.
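Ranking candidate models by AUC can be sketched as follows. The model names and validation scores are hypothetical. Instead of integrating the ROC curve, this version uses an equivalent definition: AUC is the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as half.

```python
def auc_score(scores, labels):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Known validation labels shared by all candidates.
labels = [1, 1, 1, 0, 0, 0]

# Hypothetical candidates produced by a tuning run, with their scores.
candidates = {
    "model_a": [0.9, 0.8, 0.4, 0.7, 0.3, 0.2],
    "model_b": [0.6, 0.4, 0.5, 0.5, 0.6, 0.3],
}

aucs = {name: auc_score(s, labels) for name, s in candidates.items()}
best = max(aucs, key=aucs.get)
for name, value in aucs.items():
    print(f"{name}: AUC={value:.3f}")
print("best:", best)
```

The pairwise version is quadratic in the number of observations, so it suits small validation sets; for large ones a sort-based or library implementation would be the more practical choice.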
Another interesting approach that can be used as an overlay to the AUC score is Cohen’s Kappa. Cohen’s Kappa measures the inter-rater agreement of a classifier. It is a statistical measure of how meaningful a model’s confusion matrix is above a random guess. In our case, it would be nice to further ascertain that our winning AUC score did not occur by chance. We will try to cover Cohen’s Kappa in future posts.
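As a brief preview, the standard Kappa formula can be applied to the sample confusion matrix from earlier. This is only a sketch of the textbook definition, kappa = (observed agreement – chance agreement) / (1 – chance agreement), and not a description of how it would be used in production.

```python
# Cell counts from the sample confusion matrix earlier in the article.
tp, fp, fn, tn = 150, 190, 310, 350
total = tp + fp + fn + tn

# Observed agreement between predictions and labels (this is the accuracy).
observed = (tp + tn) / total

# Agreement expected by chance, from the row and column totals.
expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / total**2

kappa = (observed - expected) / (1 - expected)
print(f"observed={observed:.4f} expected={expected:.4f} kappa={kappa:.3f}")
```

A Kappa near zero (here slightly negative) indicates agreement no better than chance, which matches the earlier conclusion that this sample classifier did not learn anything meaningful.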