llm evals

--------------------------------

every step of ai development can have a downstream effect on the performance or accuracy of an application

if you change models, change prompts, or change tooling. even if you change nothing, your model provider might change something without you being aware

evaluations are important so you know where you stand and you aren't building blindly

but how can i possibly know what model or techniques to use without spending the time to try them? evaluation systems.

--------------------------------

who cares?

building ai applications is really, really easy

building good applications is less easy

building great, reliable, and complex ai applications is really, really hard

know what level of evals to use and when. are there requirements or risks involved in what you're building?

--------------------------------

evaluating systematically

requirements: a set of questions, ground truth answers, and a way to score generated outputs against them (all covered below)

sounds hard but it's not. spend the time upfront and it will 100x your dev process and iterations down the line

--------------------------------

levels to evals

  1. no evals, build the app, hook up AI, push to prod
  2. vibe check, just test a model by chatting with it for a bit to get a feel
  3. vibe check dataset, some set of questions you always ask when testing out a new model to better see differences
  4. evals v1, a more thorough set of questions, spanning easy to complex, covering different domains maybe, and you have an idea of what good answers are
  5. evals v2, a set of evaluation questions, along with corresponding answers from domain experts
  6. evaluation system, applying your eval set against your entire AI application and measuring performance as the difference between what was generated and what you expect or want (a sketch of what an eval set can look like follows below)
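
a minimal sketch of what an eval set (levels 4-5) can look like on disk, assuming a jsonl file with one question and ground-truth answer per line. the file name, field names, and example line are all hypothetical; use whatever fits your stack.

```python
# eval_set.jsonl (hypothetical file), one example per line:
# {"question": "can i deduct student loan interest?", "answer": "yes, up to the irs limit ...", "tags": ["finance"]}

import json
from dataclasses import dataclass, field

@dataclass
class EvalExample:
    question: str                                   # what a user would ask
    answer: str                                     # ground-truth answer, ideally from a domain expert
    tags: list[str] = field(default_factory=list)   # optional: difficulty, domain, etc.

def load_eval_set(path: str) -> list[EvalExample]:
    """read a jsonl file where each non-empty line is one eval example."""
    examples = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                examples.append(EvalExample(**json.loads(line)))
    return examples
```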

--------------------------------

the process

  1. what is your goal? chat app? answer user questions as accurately as possible? ensure fairness?
  2. come up with 100 questions you expect your users to ask. maybe a mix of questions you want to make sure to answer correctly. maybe a mix of questions you don't want to be answered.
  3. come up with the correct or "ground truth" answers to these questions. ideally with the help of some domain experts if applicable. you don't want software engineers coming up with the "ground truth" answer to personal finance questions focused on students.
  4. run those 100 questions through the model you are currently using. how many of the generated answers are correct compared to the ground truth answers? let's say 70/100.
  5. now some new model comes out and your favorite influencer swears it's the best model they've used. instead of blindly making changes, simply apply your eval set to the new model and measure how many of the new generated answers are correct compared to ground truth (see the sketch after this list). only 60/100? good thing you didn't switch just because of some person on the internet.
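
a minimal sketch of steps 4 and 5, under some assumptions: `generate_answer` stands in for whatever model or pipeline call you actually make, `is_correct` is whatever grader you pick (a couple of options are sketched in the metrics section below), and `EvalExample` / `load_eval_set` come from the eval set sketch above.

```python
from typing import Callable

def run_eval(
    examples: list,                          # list of EvalExample from the loader sketched earlier
    generate_answer: Callable[[str], str],   # your model call: question -> generated answer
    is_correct: Callable[[str, str], bool],  # grader: (generated, ground_truth) -> pass/fail
) -> float:
    """run every eval question through the model and return the fraction judged correct."""
    passed = 0
    for ex in examples:
        generated = generate_answer(ex.question)
        if is_correct(generated, ex.answer):
            passed += 1
    return passed / len(examples)

# hypothetical usage:
# eval_set = load_eval_set("eval_set.jsonl")
# print(run_eval(eval_set, current_model_generate, is_correct))    # e.g. 0.70
# print(run_eval(eval_set, new_hyped_model_generate, is_correct))  # e.g. 0.60 -> don't switch
```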

--------------------------------

getting modular

i am a strong believer in building your own tooling for ai applications today. 1. building these things really isn't hard. 2. something you built and intimately understand can be easily adapted to a quickly changing space.

having a modular eval system enables you to do so much more than just compare new models.

that same system should enable you to compare models, compare prompts, compare prompting techniques, measure performance of retrieval, measure overall performance given some model+prompting+retrieval.
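
a sketch of one way to keep it modular, assuming nothing about your stack: small interfaces for the model and retriever, a prompt template, and a function that composes them, so the same `run_eval` from earlier can score any combination. all names here are made up for illustration.

```python
from typing import Protocol

class Model(Protocol):
    def complete(self, prompt: str) -> str: ...

class Retriever(Protocol):
    def retrieve(self, question: str) -> list[str]: ...

def build_pipeline(model: Model, prompt_template: str, retriever: Retriever | None = None):
    """compose model + prompt template (+ optional retrieval) into a question -> answer function."""
    def answer(question: str) -> str:
        context = "\n".join(retriever.retrieve(question)) if retriever else ""
        # prompt_template is assumed to contain {context} and {question} placeholders
        return model.complete(prompt_template.format(context=context, question=question))
    return answer

# hypothetical usage: same eval set, same run_eval, different pieces swapped in
# run_eval(eval_set, build_pipeline(model_a, prompt_v1), is_correct)
# run_eval(eval_set, build_pipeline(model_a, prompt_v2, retriever=my_retriever), is_correct)
```

swapping one piece while holding the others fixed is what lets you attribute a score change to the model, the prompt, or retrieval.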

--------------------------------

types of eval measures or metrics

what's important to you or your use case? do specific words matter? or does the exact wording not matter as long as the semantics are right?
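
two stdlib-only graders as a sketch of each end of that spectrum. the fuzzy version is just a character-level similarity ratio with a made-up 0.8 cutoff; if semantics are what you really care about, people usually reach for embedding similarity or an llm-as-judge instead. either function can be passed as the `is_correct` argument in the run_eval sketch above.

```python
import re
from difflib import SequenceMatcher

def normalize(text: str) -> str:
    """lowercase, drop punctuation, and collapse whitespace so formatting isn't penalized."""
    return re.sub(r"\s+", " ", re.sub(r"[^\w\s]", "", text.lower())).strip()

def exact_match(generated: str, ground_truth: str) -> bool:
    """specific words matter: the normalized strings must be identical."""
    return normalize(generated) == normalize(ground_truth)

def fuzzy_match(generated: str, ground_truth: str, threshold: float = 0.8) -> bool:
    """wording matters less: a rough similarity score over the normalized strings."""
    return SequenceMatcher(None, normalize(generated), normalize(ground_truth)).ratio() >= threshold
```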

--------------------------------

tools for evals (if you must)

braintrust

mlflow evaluations

--------------------------------

directory

--------------------------------