LLM Metrics Explained

Posted by Venkatesh Subramanian on March 08, 2024 · 5 mins read

Metrics are very important in any software development activity, and Generative AI/LLM development is no different. In this post we will look at some interesting metrics, each with its own strengths and weaknesses, that can be employed to evaluate the models in use for applications.

Language models work by predicting the next word based on the words that precede it. The probability of a sentence is the product of the probability of each token in that sentence, given the tokens that come before it.

So, if sentence text S is "Here we define LLM metrics" , then 

P(S) = P(Here) * P(we | Here) * P(define | Here we) * 
       P(LLM | Here we define) * P(metrics | Here we define LLM)
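This chain-rule decomposition is easy to sketch in code. The conditional probabilities below are made-up illustrative numbers, not outputs of a real model:

```python
import math

# Toy conditional probabilities for each token given its prefix:
# P(Here), P(we|Here), P(define|Here we), ... -- illustrative values only.
token_probs = [0.05, 0.40, 0.30, 0.20, 0.60]

# P(S) is the product of the conditional probabilities.
p_sentence = math.prod(token_probs)
print(p_sentence)  # ~0.00072

# In practice, log-probabilities are summed instead, to avoid
# floating-point underflow on long sequences.
log_p = sum(math.log(p) for p in token_probs)
print(math.isclose(math.exp(log_p), p_sentence))  # True
```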

The average amount of information produced per token is referred to as the “entropy” of the text.
Cross-entropy measures how close two text distributions are.
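Both quantities have short closed-form expressions over a discrete distribution. A minimal sketch, using made-up toy distributions:

```python
import math

def entropy(p):
    """Average information (in bits) per symbol of distribution p."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """Bits needed to encode samples from p using a code optimised for q.
    Always >= entropy(p), with equality only when q == p."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]  # "true" token distribution (illustrative)
q = [0.4, 0.4, 0.2]    # a model's predicted distribution (illustrative)

print(entropy(p))           # 1.5
print(cross_entropy(p, q))  # larger than entropy(p), since q != p
```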

Perplexity is a measure of the uncertainty of the probability distribution when predicting the next token. If a model’s average entropy is 3 bits per token, its perplexity is 2³ = 8: on average, the model is as uncertain as if it were choosing uniformly among 8 possible tokens. The lower the perplexity, the more accurate the outcome.
Perplexity depends heavily on the training dataset and tokenization, so it cannot be used to compare models trained on different datasets or vocabularies.
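Computationally, perplexity is the exponentiated average negative log-probability per token. A minimal sketch with toy probabilities:

```python
import math

def perplexity(token_probs):
    """Perplexity = 2 ** (average negative log2-probability per token)."""
    avg_nll = -sum(math.log2(p) for p in token_probs) / len(token_probs)
    return 2 ** avg_nll

# A model assigning probability 1/8 to every token has an entropy of
# 3 bits per token, hence a perplexity of 8.
print(perplexity([1/8, 1/8, 1/8]))  # 8.0
```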

BLEU (Bilingual Evaluation Understudy) is a metric used in machine translation to compare against reference translation data sets. It compares the n-grams of machine generated sentences to n-grams of reference comparison text. This formula also penalizes candidate texts that are shorter than reference texts using a factor called “Brevity Penalty”.
The disadvantage here is that a candidate can score highly on n-gram overlap even when the meaning of the full sentence differs significantly from the reference.
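The core of the computation can be sketched as follows. This is a simplified single-sentence version: real BLEU implementations work over a corpus, support multiple references, and apply smoothing to the n-gram precisions.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i+n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """Simplified BLEU: geometric mean of clipped n-gram precisions,
    multiplied by a brevity penalty for short candidates."""
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = Counter(ngrams(candidate, n)), Counter(ngrams(reference, n))
        overlap = sum(min(c, ref[g]) for g, c in cand.items())  # clipped counts
        total = max(sum(cand.values()), 1)
        precisions.append(max(overlap, 1e-9) / total)  # avoid log(0)
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty: penalize candidates shorter than the reference.
    bp = 1.0 if len(candidate) >= len(reference) else \
         math.exp(1 - len(reference) / len(candidate))
    return bp * geo_mean

ref = "the cat is on the mat".split()
print(bleu("the cat is on the mat".split(), ref))  # 1.0 for a perfect match
```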

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is used to compare automatically generated text summaries against reference summaries, which are typically written by a human for evaluation purposes. ROUGE has variants such as ROUGE-N, ROUGE-L, ROUGE-W, ROUGE-S, and ROUGE-SU.

ROUGE-N: This measures the overlap of n-grams between system and reference summaries.
ROUGE-L: This measures the longest common subsequence (LCS) between system and reference summaries.
ROUGE-W: Weighted LCS-based statistics that favor consecutive LCS occurrences.
ROUGE-S: Skip-bigram based co-occurrence statistics.
ROUGE-SU: A hybrid of skip-bigram and unigram statistics.

ROUGE favors lexical similarity, so generated summaries that express the same meaning in different words are penalized. This limits the diversity of generated summaries.
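ROUGE-N is the easiest variant to sketch. The version below computes only the recall component, which is ROUGE's emphasis; real implementations also report precision and an F1 score:

```python
from collections import Counter

def rouge_n(system, reference, n=1):
    """ROUGE-N recall: the fraction of reference n-grams that also
    appear in the system summary (minimal sketch)."""
    grams = lambda toks: Counter(
        tuple(toks[i:i+n]) for i in range(len(toks) - n + 1))
    sys_g, ref_g = grams(system), grams(reference)
    overlap = sum(min(c, sys_g[g]) for g, c in ref_g.items())
    return overlap / sum(ref_g.values())

ref = "the quick brown fox jumps".split()
# 3 of the 5 reference unigrams appear in the system summary.
print(rouge_n("the brown fox sleeps".split(), ref, n=1))  # 0.6
```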

METEOR stands for Metric for Evaluation of Translation with Explicit Ordering. From its full form, it’s evident that it measures the alignment between generated text and reference text. It is based on the harmonic mean of unigram precision and recall, with recall weighted higher than precision.

The BLEU score emphasizes precision, the ROUGE score emphasizes recall, and METEOR strikes a balance between the two, addressing their limitations by also rewarding structural alignment with the reference text.

METEOR is computationally intensive, making it slower to compute than simpler metrics. It may also be less sensitive to minor differences in translation than metrics like BLEU, so it is less useful for evaluating small improvements or errors.
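The recall-weighted harmonic mean at METEOR's core can be sketched directly. The 9-to-1 recall weighting below is the one used in the original METEOR formulation; the full metric additionally applies a fragmentation penalty for word-order differences, which this sketch omits:

```python
def meteor_fmean(precision, recall):
    """METEOR-style F-mean: harmonic mean of unigram precision and
    recall, with recall weighted 9x more heavily than precision."""
    if precision == 0 or recall == 0:
        return 0.0
    return 10 * precision * recall / (recall + 9 * precision)

# With equal precision and recall, the F-mean equals both...
print(meteor_fmean(0.8, 0.8))
# ...but when they differ, the recall-heavy case scores higher.
print(meteor_fmean(0.9, 0.5) < meteor_fmean(0.5, 0.9))  # True
```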

BertScore is another good alternative to traditional statistical NLP evaluation metrics. It computes cosine similarity between contextualized token embeddings, can weight rare words by importance using IDF (inverse document frequency), and, because the embeddings are contextual, captures long-range dependencies better than surface n-gram matching.
BertScore correlates well with human judgement in both sentence-level and system-level evaluation. It does involve downloading the BERT model, which takes significant storage, network bandwidth, and compute cycles. To address this concern, users can download a distilled version of the BERT model, which is significantly smaller in size.
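The matching scheme behind BertScore is greedy: each reference token is matched to its most similar candidate token by cosine similarity, and the scores are averaged. The sketch below shows the recall side of this with made-up 2-dimensional vectors standing in for real contextual BERT embeddings:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def bertscore_recall(cand_emb, ref_emb):
    """BertScore-style recall: each reference token embedding is greedily
    matched to the most similar candidate token embedding, then averaged.
    (Real BertScore uses contextual BERT embeddings and optional IDF
    weighting; the vectors here are toy stand-ins.)"""
    return sum(max(cosine(r, c) for c in cand_emb)
               for r in ref_emb) / len(ref_emb)

ref_emb  = [[1.0, 0.0], [0.0, 1.0]]   # toy "reference token" embeddings
cand_emb = [[1.0, 0.0], [0.7, 0.7]]   # toy "candidate token" embeddings
print(bertscore_recall(cand_emb, ref_emb))
```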

MoverScore builds on the BertScore approach and takes it up a notch by using EMD (Earth Mover’s Distance) to compute the minimal cost that must be paid to transform the distribution of the generated text into the distribution of the reference text.
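The intuition behind EMD is easiest to see in one dimension, where the minimal transport cost between two histograms over the same unit-spaced bins reduces to the sum of absolute differences between their cumulative distributions. MoverScore solves the general transport problem over embedding space; this toy 1-D case just illustrates the idea:

```python
from itertools import accumulate

def emd_1d(p, q):
    """Earth Mover's Distance between two histograms over the same
    unit-spaced 1-D bins: total mass moved times distance moved,
    computed as the sum of absolute CDF differences."""
    return sum(abs(cp - cq) for cp, cq in zip(accumulate(p), accumulate(q)))

p = [0.5, 0.5, 0.0]  # mass concentrated on the left
q = [0.0, 0.5, 0.5]  # the same shape, shifted one bin to the right
print(emd_1d(p, q))  # 1.0 -- all the mass moves exactly one bin
```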

Both BertScore and MoverScore rely on the underlying BERT model, so they inherit its contextual awareness and its biases.
For conventional software it is wise to choose a handful of metrics that give us an idea of the health of the application. In the case of LLMs, however, more may be better: given the inherently stochastic and emergent nature of these large models, each additional metric completes the picture of reality a little more.

In this post we covered some important statistical and model-based metrics for LLM evaluation. There are also LLM models that evaluate other LLMs. Additionally, with ethical AI becoming very relevant now, there are more metrics to measure how responsible these AI-based tasks are. These will be topics of future posts. So, stay tuned and happy learning!

