Pytest for LLM Apps is finally here!
DeepEval turns LLM evals into a two-line test suite to help you identify the best models, prompts, and architecture for AI workflows (including MCPs).
Learn the limitations of G-Eval and an alternative to it in the explainer below:

Most LLM-powered evals are BROKEN!
These evals can easily mislead you to believe that one model is better than the other, primarily due to the way they are set up.
G-Eval is one popular example.
Here's the core problem with LLM eval techniques and a better alternative to them:
Typical evals like G-Eval assume you’re scoring one output at a time in isolation, without understanding the alternative.
So when prompt A scores 0.72 and prompt B scores 0.74, you still don’t know which one’s actually better.
This is unlike scoring, say, classical ML models, where metrics like accuracy, F1, or RMSE give a clear and objective measure of performance.
There’s no room for subjectivity, and the results are grounded in hard numbers, not opinions.
LLM Arena-as-a-Judge is a new technique that addresses this issue with LLM evals.
In a gist, instead of assigning scores, you just run A vs. B comparisons and pick the better output.
Just like G-Eeval, you can define what “better” means (e.g., more helpful, more concise, more polite), and use any LLM to act as the judge.
LLM Arena-as-a-Judge is actually implemented in @deepeval (open-source with 12k stars), and you can use it in just three steps:
- Create an ArenaTestCase, with a list of “contestants” and their respective LLM interactions.
- Next, define your criteria for comparison using the Arena G-Eval metric, which incorporates the G-Eval algorithm for a comparison use case.
- Finally, run the evaluation and print the scores.
This gives you an accurate head-to-head comparison.
Note that LLM Arena-as-a-Judge can either be referenceless (like shown in the snippet below) or reference-based. If needed, you can specify an expected output as well for the given input test case and specify that in the evaluation parameters.
Why DeepEval?
It's 100% open-source with 12k+ stars and implements everything you need to define metrics, create test cases, and run evals like:
- component-level evals
- multi-turn evals
- LLM Arena-as-a-judge, etc.
Moreover, tracing LLM apps is as simple as adding one Python decorator.
And you can run everything 100% locally.
I have shared the repo in the replies.

11,47k
120
Innholdet på denne siden er levert av tredjeparter. Med mindre annet er oppgitt, er ikke OKX forfatteren av de siterte artikkelen(e) og krever ingen opphavsrett til materialet. Innholdet er kun gitt for informasjonsformål og representerer ikke synspunktene til OKX. Det er ikke ment å være en anbefaling av noe slag og bør ikke betraktes som investeringsråd eller en oppfordring om å kjøpe eller selge digitale aktiva. I den grad generativ AI brukes til å gi sammendrag eller annen informasjon, kan slikt AI-generert innhold være unøyaktig eller inkonsekvent. Vennligst les den koblede artikkelen for mer detaljer og informasjon. OKX er ikke ansvarlig for innhold som er vert på tredjeparts nettsteder. Beholdning av digitale aktiva, inkludert stablecoins og NFT-er, innebærer en høy grad av risiko og kan svinge mye. Du bør nøye vurdere om handel eller innehav av digitale aktiva passer for deg i lys av din økonomiske tilstand.