July 5, 2024

Model Evaluations

 

Model evaluations (or ‘evals’) are the processes that use different metrics to measure an AI/ML model’s performance.  Evals are important for assessing effectiveness, ensuring safety, and improving the overall reliability of the model.  Model evals can be done in two ways:


  • Offline:  Evaluated after model training or continuous retraining. 
  • Online:  Evaluated in production as part of model monitoring.
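
To make the distinction concrete, here is a minimal sketch in Python.  It assumes a scikit-learn-style classifier with a predict() method; names such as model, X_test, y_test, and production_log are hypothetical placeholders.

from sklearn.metrics import accuracy_score

def offline_eval(model, X_test, y_test):
    """Offline: score a held-out test set after training or retraining."""
    return accuracy_score(y_test, model.predict(X_test))

def online_eval(production_log):
    """Online: score predictions logged in production, once ground-truth
    labels arrive (e.g., from user feedback or delayed outcomes)."""
    y_pred = [record["prediction"] for record in production_log]
    y_true = [record["label"] for record in production_log]
    return accuracy_score(y_true, y_pred)

In practice, online evaluation is usually computed over a rolling time window as part of model monitoring, so that a drop in the metric can trigger alerts or retraining.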


Let’s ask GPT-4 to summarize the key aspects of evals:

[Image: GPT-4’s summary of the key aspects of evals]

For supervised learning, some popular metrics for classification model evaluation include:


  • Confusion Matrix provides a summary of prediction results on a classification problem, breaking them down into true positives, false positives, false negatives, and true negatives.  
  • Accuracy measures how often the classifier makes correct predictions.  It’s the ratio of the number of correct predictions to the total number of predictions.
  • Precision measures the accuracy of the positive predictions made by the model.  It’s the ratio of true positive predictions to the total number of positive predictions (both true positives and false positives).  A high precision indicates that the classifier has a low false positive rate, meaning it rarely classifies negative instances as positive.  Precision is an important metric when you want to be very sure of your prediction or when the cost of false positives is high.  For example, in medical diagnostics for a rare disease, a high precision ensures that healthy individuals are rarely misdiagnosed as having the disease.
  • Recall (aka sensitivity or true positive rate) measures the model’s ability to identify all positive instances.  It’s the ratio of true positive predictions to the total number of actual positives (the sum of true positives and false negatives).  A high recall indicates that the classifier successfully identifies most of the positive instances.  Recall is often traded off with precision. While recall focuses on identifying all positive instances, precision focuses on the accuracy of positive predictions.  Improving recall can sometimes lead to a decrease in precision and vice versa.
  • F1 Score provides a balanced measure of a model’s performance by combining precision and recall into a single value (their harmonic mean).  

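As a quick illustration of how these metrics are computed in practice, here is a minimal sketch using scikit-learn.  The label arrays are made up for illustration (1 = positive class).

from sklearn.metrics import (confusion_matrix, accuracy_score,
                             precision_score, recall_score, f1_score)

y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]   # actual labels (hypothetical)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 1]   # model predictions (hypothetical)

print(confusion_matrix(y_true, y_pred))    # rows = actual, columns = predicted
print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("f1       :", f1_score(y_true, y_pred))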

Let's use spam email classification as an example:


1)  True Positives (TP): 70 emails correctly identified as spam.

2)  False Positives (FP): 10 emails incorrectly identified as spam (they were actually not spam).

3)  False Negatives (FN): 5 emails incorrectly identified as not spam (they were actually spam).

4)  True Negatives (TN): 100 emails correctly identified as not spam.


Accuracy =  (TP + TN) / (TP + TN + FP + FN)  

=  (70 + 100) / (70 + 100 + 10 + 5) =  91.9%


Precision =  TP / (TP + FP)  =  70 / (70 + 10) =  87.5%


Recall =  TP / (TP + FN)  =  70 / (70 + 5) =  93.3%


F1 Score =  2 * (Precision * Recall) / (Precision + Recall)

=  2 * (0.875 * 0.933) / (0.875 + 0.933)  =  90.3%
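
The same numbers can be verified with a few lines of Python:

# Reproducing the spam-classification example above.
TP, FP, FN, TN = 70, 10, 5, 100

accuracy  = (TP + TN) / (TP + TN + FP + FN)          # 170 / 185 ≈ 0.919
precision = TP / (TP + FP)                           # 70 / 80   = 0.875
recall    = TP / (TP + FN)                           # 70 / 75   ≈ 0.933
f1 = 2 * precision * recall / (precision + recall)   # ≈ 0.903

print(f"accuracy={accuracy:.1%}  precision={precision:.1%}  "
      f"recall={recall:.1%}  f1={f1:.1%}")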


The choice between precision, recall, and F1 score depends on the specific context and objectives:

  • If false positives are more costly, prioritize precision.
  • If false negatives are more costly, prioritize recall.
  • If a balance is needed, the F1 score is a good metric to use.

The F1 score provides a single, comprehensive measure of a model's performance by balancing precision and recall, making it a valuable tool for evaluating and comparing classification models.
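
In practice, this trade-off is often managed through the decision threshold of a probabilistic classifier: raising the threshold tends to increase precision and decrease recall, and vice versa.  Below is a minimal, self-contained sketch with made-up scores to illustrate (the label and probability arrays are hypothetical).

def precision_recall_at(threshold, y_true, y_prob):
    # Convert scores to hard predictions at the given threshold,
    # then compute precision and recall from TP, FP, and FN counts.
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

y_true = [1, 1, 1, 0, 1, 0, 0, 1, 0, 0]                                # actual labels
y_prob = [0.95, 0.85, 0.75, 0.65, 0.55, 0.45, 0.40, 0.35, 0.20, 0.10]  # model scores

for threshold in (0.3, 0.5, 0.7):
    p, r = precision_recall_at(threshold, y_true, y_prob)
    print(f"threshold={threshold:.1f}  precision={p:.2f}  recall={r:.2f}")

Running this shows precision rising and recall falling as the threshold increases, which is the trade-off described above.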

