llm evals
--------------------------------
every step of ai development can have a downstream effect on performance or
accuracy of an application
if you change models, change prompts, or change tooling, behavior shifts. and
even if you change nothing, your model provider might change something without
you being aware.
evaluations are important so you know where you stand and you aren't building
blindly
but how can i possibly know which model or techniques to use without spending
the time to try them? evaluation systems.
--------------------------------
who cares?
building ai applications is really, really easy
building good applications is less easy
building great, reliable, and complex ai applications is really, really hard
know the different levels of evals and when to use each. what requirements or
risk does your application carry?
--------------------------------
evaluating systematically
requirements:
- upfront time and effort
- domain experts
- possibly building your own tooling
- covering both accuracy and breadth: answer some questions very, very well,
  but still answer 90% of questions pretty well
- more
sounds hard, but it really isn't. spend the time upfront and it will 100x your
dev process and iterations down the line.
--------------------------------
levels to evals
- no evals, build the app, hook up AI, push to prod
- vibe check, just chat with a model for a bit to get a feel for it
- vibe check dataset, a set of questions you always ask when testing out a new
  model to better see differences
- evals v1, a more thorough set of questions, spanning easy to complex, maybe
  covering different domains, where you have an idea of what good answers look
  like
- evals v2, a set of evaluation questions along with corresponding answers from
  domain experts (a sketch of what this can look like follows below)
- evaluation system, applying your eval set against your entire AI application
  while measuring performance as the difference between what was generated and
  what you expect or want
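
a minimal sketch of what an "evals v2" dataset could look like: questions
paired with ground-truth answers written by domain experts. the field names and
example content here are mine and purely illustrative, not from any particular
tool.

# each item: a question, the expert-written answer, and some metadata
eval_set = [
    {
        "question": "can i deduct student loan interest on my taxes?",
        "ground_truth": "yes, up to the annual limit, subject to income phase-outs.",
        "domain": "personal finance",
        "difficulty": "easy",
    },
    {
        "question": "should i invest spare cash or pay down a high-interest private loan first?",
        "ground_truth": "generally pay down the high-interest loan first; the guaranteed savings usually beat expected market returns.",
        "domain": "personal finance",
        "difficulty": "complex",
    },
]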
--------------------------------
the process
- what is your goal? a chat app? answering user questions as accurately as
  possible? ensuring fairness?
- come up with 100 questions you expect your users to ask. maybe a mix of
  questions you want to make sure are answered correctly and questions you
  don't want answered at all.
- come up with the correct or "ground truth" answers to these questions,
  ideally with the help of domain experts if applicable. you don't want
  software engineers coming up with the "ground truth" answers to personal
  finance questions aimed at students.
- run those 100 questions through the model you are currently using. how many
  of the generated answers are correct compared to the ground truth answers?
  let's say 70/100.
- now some new model comes out, and your favorite influencer swears it's the
  best model they've used. instead of blindly making changes, simply apply your
  eval set to this new model and measure how many of the new generated answers
  are correct compared to ground truth. only 60/100? good thing you didn't
  switch just based on some person on the internet. a rough sketch of this loop
  follows below.
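
a rough sketch of that scoring loop. generate_answer() and is_correct() are
placeholders i'm assuming you wire up to your own model provider and grading
method; they are not a real library api.

def generate_answer(model: str, question: str) -> str:
    # call your model provider's chat/completions endpoint here
    raise NotImplementedError

def is_correct(generated: str, ground_truth: str) -> bool:
    # exact match is rarely what you want; swap in one of the measures
    # from the "types of eval measures" section below
    return generated.strip().lower() == ground_truth.strip().lower()

def run_eval(model: str, eval_set: list[dict]) -> int:
    correct = 0
    for item in eval_set:
        answer = generate_answer(model, item["question"])
        if is_correct(answer, item["ground_truth"]):
            correct += 1
    return correct

# run_eval("current-model", eval_set)   -> e.g. 70 out of 100
# run_eval("hyped-new-model", eval_set) -> e.g. 60 out of 100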
--------------------------------
getting modular
i am a strong believer in building your own tooling for ai applications today.
1. building these things really isn't hard. 2. something you built and
intimately understand can be easily adapted to a quickly changing space.
having a modular eval system enables you to do so much more than just compare
new models.
that same system should enable you to compare models, compare prompts, compare
prompting techniques, measure retrieval performance, and measure the overall
performance of a given model + prompting + retrieval combination.
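
one way to read "modular", sketched below: the eval loop shouldn't care what it
is scoring. it just takes a pipeline (any callable from question to answer) and
a scorer. every name here is illustrative, not from an existing library.

from typing import Callable

Pipeline = Callable[[str], str]       # question -> generated answer
Scorer = Callable[[str, str], float]  # (generated, ground truth) -> score

def evaluate(pipeline: Pipeline, scorer: Scorer, eval_set: list[dict]) -> float:
    scores = [scorer(pipeline(item["question"]), item["ground_truth"]) for item in eval_set]
    return sum(scores) / len(scores)

# the same evaluate() then compares whatever you can express as a pipeline:
# evaluate(lambda q: model_a(prompt_v1, q), exact_match, eval_set)
# evaluate(lambda q: model_a(prompt_v2, q), exact_match, eval_set)
# evaluate(lambda q: model_b(prompt_v2 + retrieve(q)), cosine_score, eval_set)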
--------------------------------
types of eval measures or metrics
- apply your vibe check dataset, do you like the answers? yes or no
- word matching, how close are the words of the generated responses to your
  ground truths? do they mostly say the same thing? do they at least include
  some expected key words?
- levenshtein distance, character-level differences
- cosine similarity, is the generated response semantically similar to your
  ground truth?
- llm-as-a-judge, are you lazy?
- entailment measures for rag applications, does the generated answer actually
  come from your retrieved documents?
what's important to you or your use case? do specific words matter? or does the
exact wording not matter, just the semantics? a few rough sketches are below.
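
the keyword and character-level checks below are pure python; the cosine
similarity check assumes you have some embed() function that returns a vector
(an embedding model of your choice), which is my assumption, not part of any
specific tool.

import difflib
import math

def keyword_score(generated: str, expected_keywords: list[str]) -> float:
    # fraction of expected key words that appear in the generated answer
    hits = sum(1 for kw in expected_keywords if kw.lower() in generated.lower())
    return hits / len(expected_keywords)

def char_similarity(generated: str, ground_truth: str) -> float:
    # character-level similarity, in the same spirit as levenshtein distance
    return difflib.SequenceMatcher(None, generated, ground_truth).ratio()

def cosine_similarity(vec_a: list[float], vec_b: list[float]) -> float:
    # semantic check: cosine_similarity(embed(generated), embed(ground_truth))
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm = math.sqrt(sum(a * a for a in vec_a)) * math.sqrt(sum(b * b for b in vec_b))
    return dot / norm if norm else 0.0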
--------------------------------
tools for evals (if you must)
braintrust
mlflow evaluations
--------------------------------