llm evals
--------------------------------
every step of ai development can have a downstream effect on performance or
accuracy of an application
if you change models, change prompts, or change tooling, behavior shifts. and
even if you change nothing, your model provider might change something without
you being aware.
evaluations are important so you know where you stand and you aren't building
blindly
but how can i possibly know which model or techniques to use without spending
the time to try them? evaluation systems.
--------------------------------
who cares?
building ai applications is really, really easy
building good applications is less easy
building great, reliable, and complex ai applications is really, really hard
know the different levels of evals and when to use each. what requirements or
risk does your application carry?
--------------------------------
evaluating systematically
requirements:
- upfront time and effort
- domain experts
- possibly building your own tooling
- covering both accuracy and breadth: answer some questions very, very well,
  but still answer 90% of questions pretty well
- more
sounds hard, but it really isn't. spend the time upfront and it will 100x your
dev process and iterations down the line.
--------------------------------
levels to evals
- no evals, build the app, hook up AI, push to prod
- vibe check, just chat with a model for a bit to get a feel for it
- vibe check dataset, a set of questions you always ask when testing out a new
  model to better see differences
- evals v1, a more thorough set of questions, spanning easy to complex, maybe
  covering different domains, where you have an idea of what good answers look
  like
- evals v2, a set of evaluation questions along with corresponding answers from
  domain experts (a sketch of what this can look like follows below)
- evaluation system, applying your eval set against your entire AI application
  while measuring performance as the difference between what was generated and
  what you expect or want
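
a minimal sketch of what an "evals v2" dataset could look like: questions
paired with ground-truth answers written by domain experts. the field names and
example content here are mine and purely illustrative, not from any particular
tool.

# each item: a question, the expert-written answer, and some metadata
eval_set = [
    {
        "question": "can i deduct student loan interest on my taxes?",
        "ground_truth": "yes, up to the annual limit, subject to income phase-outs.",
        "domain": "personal finance",
        "difficulty": "easy",
    },
    {
        "question": "should i invest spare cash or pay down a high-interest private loan first?",
        "ground_truth": "generally pay down the high-interest loan first; the guaranteed savings usually beat expected market returns.",
        "domain": "personal finance",
        "difficulty": "complex",
    },
]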
--------------------------------
the process
- what is your goal? a chat app? answering user questions as accurately as
  possible? ensuring fairness?
- come up with 100 questions you expect your users to ask. maybe a mix of
  questions you want to make sure are answered correctly and questions you
  don't want answered at all.
- come up with the correct or "ground truth" answers to these questions,
  ideally with the help of domain experts if applicable. you don't want
  software engineers coming up with the "ground truth" answers to personal
  finance questions aimed at students.
- run those 100 questions through the model you are currently using. how many
  of the generated answers are correct compared to the ground truth answers?
  let's say 70/100.
- now some new model comes out, and your favorite influencer swears it's the
  best model they've used. instead of blindly making changes, simply apply your
  eval set to this new model and measure how many of the new generated answers
  are correct compared to ground truth. only 60/100? good thing you didn't
  switch just based on some person on the internet. a rough sketch of this loop
  follows below.
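
a rough sketch of that scoring loop. generate_answer() and is_correct() are
placeholders i'm assuming you wire up to your own model provider and grading
method; they are not a real library api.

def generate_answer(model: str, question: str) -> str:
    # call your model provider's chat/completions endpoint here
    raise NotImplementedError

def is_correct(generated: str, ground_truth: str) -> bool:
    # exact match is rarely what you want; swap in one of the measures
    # from the "types of eval measures" section below
    return generated.strip().lower() == ground_truth.strip().lower()

def run_eval(model: str, eval_set: list[dict]) -> int:
    correct = 0
    for item in eval_set:
        answer = generate_answer(model, item["question"])
        if is_correct(answer, item["ground_truth"]):
            correct += 1
    return correct

# run_eval("current-model", eval_set)   -> e.g. 70 out of 100
# run_eval("hyped-new-model", eval_set) -> e.g. 60 out of 100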
--------------------------------
getting modular
i am a strong believer in building your own tooling for ai applications today.
1. building these things really isn't hard. 2. something you built and
intimately understand can be easily adapted to a quickly changing space.
having a modular eval system enables you to do so much more than just compare
new models.
that same system should enable you to compare models, compare prompts, compare
prompting techniques, measure retrieval performance, and measure the overall
performance of a given model + prompting + retrieval combination.
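
one way to read "modular", sketched below: the eval loop shouldn't care what it
is scoring. it just takes a pipeline (any callable from question to answer) and
a scorer. every name here is illustrative, not from an existing library.

from typing import Callable

Pipeline = Callable[[str], str]       # question -> generated answer
Scorer = Callable[[str, str], float]  # (generated, ground truth) -> score

def evaluate(pipeline: Pipeline, scorer: Scorer, eval_set: list[dict]) -> float:
    scores = [scorer(pipeline(item["question"]), item["ground_truth"]) for item in eval_set]
    return sum(scores) / len(scores)

# the same evaluate() then compares whatever you can express as a pipeline:
# evaluate(lambda q: model_a(prompt_v1, q), exact_match, eval_set)
# evaluate(lambda q: model_a(prompt_v2, q), exact_match, eval_set)
# evaluate(lambda q: model_b(prompt_v2 + retrieve(q)), cosine_score, eval_set)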
--------------------------------
types of eval measures or metrics
- apply your vibe check dataset, do you like the answers? yes or no
- word matching, how close are the words of the generated responses to your
  ground truths? do they mostly say the same thing? do they at least include
  some expected key words?
- levenshtein distance, character-level differences
- cosine similarity, is the generated response semantically similar to your
  ground truth?
- llm-as-a-judge, are you lazy?
- entailment measures for rag applications, does the generated answer actually
  come from your retrieved documents?
what's important to you or your use case? do specific words matter? or does the
exact wording not matter, just the semantics? a few rough sketches are below.
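
the keyword and character-level checks below are pure python; the cosine
similarity check assumes you have some embed() function that returns a vector
(an embedding model of your choice), which is my assumption, not part of any
specific tool.

import difflib
import math

def keyword_score(generated: str, expected_keywords: list[str]) -> float:
    # fraction of expected key words that appear in the generated answer
    hits = sum(1 for kw in expected_keywords if kw.lower() in generated.lower())
    return hits / len(expected_keywords)

def char_similarity(generated: str, ground_truth: str) -> float:
    # character-level similarity, in the same spirit as levenshtein distance
    return difflib.SequenceMatcher(None, generated, ground_truth).ratio()

def cosine_similarity(vec_a: list[float], vec_b: list[float]) -> float:
    # semantic check: cosine_similarity(embed(generated), embed(ground_truth))
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm = math.sqrt(sum(a * a for a in vec_a)) * math.sqrt(sum(b * b for b in vec_b))
    return dot / norm if norm else 0.0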
--------------------------------
tools for evals (if you must)
braintrust
mlflow evaluations
--------------------------------