Ever since the UAE’s Technology Innovation Institute (TII) launched Falcon, Hugging Face’s Open LLM Leaderboard has been trending for both the right and wrong reasons. The model came out as the open-source champion on various evaluation metrics. Interestingly, no paper for the model has been published yet, so it is possible that the researchers used a different metric or dataset to evaluate it.
Hugging Face’s founders, including Thomas Wolf, who had made a lot of noise about Falcon reaching the top of the leaderboard, stumbled upon a problem with the evaluation of recent models. On the Open LLM Leaderboard, the Massive Multitask Language Understanding (MMLU) benchmark score for Meta AI’s LLaMA was significantly lower than the score published in the model’s paper.
This was questioned by many. First, Andrej Karpathy raised concerns about the leaderboard and the promotion of Falcon over LLaMA. Yao Fu of the Allen Institute later showed that, with no fancy prompting or decoding, LLaMA outperformed Falcon on the MMLU evaluation.
In the blog post ‘What’s going on with the Open LLM Leaderboard?’, the Hugging Face team, Wolf among them, dove into the issue to explain why there is a discrepancy between the paper’s benchmarks and the leaderboard’s.
Evaluating Evaluation Metrics
With the growing number of LLM papers, the question keeps arising of whether the evaluation metrics researchers use are trustworthy, and whether a single evaluation of a model is enough.
First, the MMLU score reported in the LLaMA paper was claimed to be irreproducible. This is because multiple code implementations exist for evaluating a model on the MMLU benchmark: one developed by the original UC Berkeley team behind MMLU (call it the original implementation), and another provided by Stanford CRFM’s evaluation benchmark, HELM.
The Open LLM Leaderboard, meanwhile, uses yet another implementation, the EleutherAI Eval Harness, which gathers multiple evaluations in a single codebase, giving a comprehensive view of a model’s performance.
To resolve the discrepancy, the researchers ran three implementations (the original UC Berkeley code as adapted by the LLaMA team, the Stanford HELM implementation, and the EleutherAI Eval Harness) on a set of models and ranked them based on the results. The surprising part is that these implementations produced significantly different numbers and even changed the ranking order of the models on the leaderboard.
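The mechanics behind such disagreements are easy to reproduce in miniature. The sketch below uses made-up per-token log-probabilities (standing in for a real language model) to show how two common multiple-choice scoring rules, comparing the answer letters versus comparing the full answer strings, can pick different options for the same question. The numbers, token splits, and option texts are purely illustrative, not taken from any of the codebases above.

```python
# Toy per-token log-probabilities standing in for a real LM's outputs.
# All numbers are invented for illustration.
token_logprob = {
    "A": -1.6, "B": -1.2,                         # the bare answer letters
    "supply": -2.0, "and": -0.5, "demand": -2.5,  # tokens of option A's text
    "price": -1.5, "controls": -4.0,              # tokens of option B's text
}

# Each option's full answer text, pre-split into "tokens".
choices = {"A": ["supply", "and", "demand"], "B": ["price", "controls"]}

def pick_by_letter(logprob):
    """Rule 1: compare the probability of the letters A/B only."""
    return max(choices, key=lambda c: logprob[c])

def pick_by_full_text(logprob):
    """Rule 2: compare the summed log-probability of each full answer text."""
    return max(choices, key=lambda c: sum(logprob[t] for t in choices[c]))

print(pick_by_letter(token_logprob))     # "B" under the letter rule
print(pick_by_full_text(token_logprob))  # "A" under the full-text rule
```

Because the two rules can disagree item by item, accuracy computed under one rule is not comparable to accuracy computed under the other, even on the identical dataset.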
The choice of evaluation method, then, has a significant impact on both the absolute scores and the rankings of models, even on the same dataset. Averaging the three scores, the researchers found that Falcon actually dropped below LLaMA.
Imagine you have trained a perfect replica of the LLaMA 65B model and evaluated it with the harness, getting a score of 0.488. Comparing it to the published score of 0.637 (obtained with the original MMLU implementation), a gap of roughly 30%, you might worry that your training went completely wrong. It did not: these numbers cannot be directly compared, even though both are labelled “MMLU scores” and were evaluated on the same MMLU dataset.
This clearly shows that LLM evaluation needs standardisation. So, is there an ideal way to evaluate models among all the methods discussed? It is a challenging question: different models may rank differently depending on the evaluation method, as the changes in rankings show.
The Hugging Face team notes that evaluations are tightly coupled to implementation details such as prompt formatting and tokenization. “The mere indication of ‘MMLU results’ gives you little to no information about how you can compare these numbers to others you evaluated on another library,” reads the blog.
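As a concrete illustration of that coupling, here is a minimal sketch of how something as small as whether the prompt ends with a space can change which token gets scored, and therefore the score itself. The whitespace-sensitive “tokenizer” and the log-probabilities are invented for the example; a real LM behaves analogously with its subword vocabulary.

```python
# Toy log-probabilities for possible continuation tokens.
# "Paris" and " Paris" (with a leading space) are distinct tokens,
# as they typically are in real subword vocabularies.
toy_logprob = {" Paris": -0.7, "Paris": -2.3}

def score(answer, *, prompt_ends_with_space):
    # If the prompt already ends in a space, the continuation is "Paris";
    # otherwise it is " Paris", a different token with a different probability.
    token = answer if prompt_ends_with_space else " " + answer
    return toy_logprob[token]

s1 = score("Paris", prompt_ends_with_space=True)   # scores the token "Paris"
s2 = score("Paris", prompt_ends_with_space=False)  # scores the token " Paris"
print(s1, s2)  # two different scores for the "same" answer
```

Two evaluation harnesses that format the prompt slightly differently are therefore scoring different token sequences, and their accuracy numbers drift apart accordingly.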
The way forward
It would be tempting for a researcher to report whichever evaluation method yields the highest score in order to rank at the top of the leaderboard. But simply calling a number ‘MMLU’ does not make it comparable to any other, since the underlying evaluation varies.
Standardisation of evaluation methods is still lacking in the field. Different research papers report results on different datasets, sometimes with overlapping content. Moreover, constrained by page limits, papers often do not provide detailed analysis beyond average scores. This is why open and reproducible benchmark suites such as the EleutherAI Eval Harness or Stanford HELM become very important for the open-source community.
It is also important to take a closer look at these evaluation datasets to better understand their characteristics and what makes them suitable for assessing LLMs. Until then, Hugging Face is working to fix the leaderboard by adopting standardised, open evaluation methods.